airaria / textbrewer


A PyTorch-based knowledge distillation toolkit for natural language processing

Home Page: http://textbrewer.hfl-rc.com

License: Apache License 2.0

Python 100.00%
bert distillation knowledge nlp pytorch

textbrewer's People

Contributors

airaria, da-southampton, lokwq, ymcui, yourplatanus


textbrewer's Issues

About loss functions

Hi, thanks for your great work! I have a question about loss functions. Are there any experiments about which loss function is preferable between kd_mse_loss and kd_ce_loss?
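
For reference, a minimal sketch of how the two losses are usually defined in knowledge distillation (standard formulations written from memory, not copied from TextBrewer's source; the function names are hypothetical):

import torch.nn.functional as F

def kd_mse_loss_sketch(logits_S, logits_T, temperature=1.0):
    # mean squared error between temperature-scaled logits
    return F.mse_loss(logits_S / temperature, logits_T / temperature)

def kd_ce_loss_sketch(logits_S, logits_T, temperature=1.0):
    # soft cross-entropy between the teacher and student distributions
    p_T = F.softmax(logits_T / temperature, dim=-1)
    log_p_S = F.log_softmax(logits_S / temperature, dim=-1)
    return -(p_T * log_p_S).sum(dim=-1).mean()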

There is a warning in the function "_select_logits_with_mask"

Warning: masked_scatter_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead.

When I change mask = mask.unsqueeze(-1).expand_as(logits).to(torch.uint8) to mask = mask.unsqueeze(-1).expand_as(logits).to(torch.bool), the warning disappears.

My environment is:

python 3.7
pytorch 1.4
cuda 10.0
cudnn7.6

RuntimeError: Incoming model is an instance of torch.nn.parallel.DataParallel. Parallel wrappers should only be applied to the model(s) AFTER the model(s) have been returned from amp.initialize.

The following problem occurs when running the example that distills BERT down to 4 layers:

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Traceback (most recent call last):
File "main.distill.py", line 199, in
main()
File "main.distill.py", line 191, in main
distiller.train(optimizer, scheduler_class=scheduler_class, scheduler_args=scheduler_args, dataloader = train_dataloader,
File "/data/homework/anaconda3/lib/python3.8/site-packages/textbrewer/distiller_basic.py", line 277, in train
optimizer, scheduler, tqdm_disable = self.initialize_training(optimizer, scheduler_class, scheduler_args, scheduler)
File "/data/homework/anaconda3/lib/python3.8/site-packages/textbrewer/distiller_basic.py", line 89, in initialize_training
(self.model_S, self.model_T), optimizer = amp.initialize([self.model_S, self.model_T], optimizer, opt_level=self.t_config.fp16_opt_level)
File "/data/homework/anaconda3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/frontend.py", line 358, in initialize
return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
File "/data/homework/anaconda3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/_initialize.py", line 168, in _initialize
check_models(models)
File "/data/homework/anaconda3/lib/python3.8/site-packages/apex-0.1-py3.8.egg/apex/amp/_initialize.py", line 74, in check_models
raise RuntimeError("Incoming model is an instance of {}. ".format(parallel_type) +
RuntimeError: Incoming model is an instance of torch.nn.parallel.DataParallel. Parallel wrappers should only be applied to the model(s) AFTER
the model(s) have been returned from amp.initialize.
A solution found online is:
https://blog.csdn.net/qq_23944915/article/details/103966211
So should I modify the source and add the model parallelization after amp.initialize at line 89 of distiller_basic.py? I'm not sure where the problem lies.
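
For reference, a minimal sketch of the ordering Apex expects, inferred from the error message (model_S, model_T and optimizer are assumed to be already constructed; this is not a patch taken from TextBrewer):

import torch
from apex import amp

# 1) run amp.initialize on the bare, unwrapped modules first
(model_S, model_T), optimizer = amp.initialize([model_S, model_T], optimizer, opt_level='O1')
# 2) only afterwards wrap the models with DataParallel
model_S = torch.nn.DataParallel(model_S)
model_T = torch.nn.DataParallel(model_T)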

On distillation between different model types

I'd like to ask: is distilling, say, RoBERTa into ELECTRA not yet supported? Is only distillation between pre-trained models of the same family supported (at least with the same vocabulary size)? I tried it and got an error because the vocabulary sizes did not match.

What should the layer mapping look like during distillation?

A great tool; thank you for developing it. I recently used it to distill an electra-base model down to the electra-small level. After 20 epochs of distillation the accuracy dropped by about 1.8%. For the teacher-student layer mapping I used correspondences like 0-0, 8-8, 12-12 (since electra-base and electra-small are both 12-layer models). Could this choice cause the problem? Also, I recall that Hugging Face ran many experiments for their 12-layer to 6-layer BERT distillation and found some layer-mapping strategies (which teacher layer corresponds to which student layer). My question is: if I want to distill electra-base into a 6-layer student, are there any experiment records on layer mappings that I could refer to? Thank you.

How to perform data augmentation?

Thanks for the awesome work! I have a question about the examples (not about the framework itself).
In the examples, HotpotQA is used for data augmentation on CoNLL-2003, and NewsQA is used for data augmentation on SQuAD. Can you describe how to do that? And, if possible, providing the augmented datasets would help a lot in reproducing the results in the examples.
Thank you again!

Cannot reproduce fine-tuning results after switching models

Hello. In the fine-tuning stage of the NER task, after I switched the model to bert-base and changed the learning rate to 2e-5/3e-5 with a decay factor of 1, I repeated the experiment many times and the results were always very poor. Did I misconfigure something? The F1 score should be around 90%. Any guidance would be appreciated, thanks.

Cannot reproduce results on the MSRA NER task

Hello, in the MSRA NER task I cannot reproduce the results of the teacher model's training. I ran it multiple times with different numbers of epochs, and the highest F1 was only 79.54%. Did I set some parameter incorrectly?

About the mnli_example experiment

Hello, what environment (e.g. pytorch and transformers versions) was the mnli_example experiment run with?
Every time I try to run it, the following problem occurs:
2021/02/03 16:54:38 - INFO - utils - Loading features from cached file data_root_dir/SST2/cached_train_128_sst-2
2021/02/03 16:54:39 - INFO - utils - Loading features from cached file data_root_dir/SST2/cached_dev_128_sst-2
2021/02/03 16:54:39 - INFO - Main - Data loaded
Traceback (most recent call last):
File "main.trainer.py", line 220, in
main()
File "main.trainer.py", line 160, in main
assert len(missing_keys)==0
AssertionError

Question about MultiTeacherDistiller

Thanks for sharing your work! Is there any documentation on the implementation details of MultiTeacherDistiller? The paper does not say much about it, and the code seems to average the losses from multiple teacher models. Is there a related research paper? Thanks!

Is the hard loss useful?

Have you run experiments with the hard loss added?

If the student model is randomly initialized, in my CMRC 2018 experiments adding the hard loss with hard_loss_weight=1 makes the distillation results terrible (around 25% F1 and 7% EM). Even when I set hard_loss_weight to a tiny value such as 0.001, the best distillation performance still drops by roughly 0.5%-1%.

I'm quite puzzled why adding the hard loss hurts so clearly. Looking forward to an explanation.

What is CustomMatch?

I found this parameter in the API; can you please explain what it is and when to use it? Is it something similar to IntermediateMatch? Thanks.

Why is distillation for multi-label classification not supported?

Many thanks for your work; this is a very well-designed distillation framework, and I have been learning and using it continuously.

However, the documentation says that multi-label classification is not supported. May I ask what the reason is? Technically there seems to be no theoretical limitation.

Or would I need to write a separate distiller to adapt the special loss and labels?

self.d_config.is_caching_logits=True about results_T

Hi, there is one thing I don't understand.
The code explains is_caching_logits as follows:
is_caching_logits (bool): if ``True``, caches the batches and the output logits of the teacher model in memory, so that those logits will only be calcuated once. It will speed up the distillation process. This feature is **only available** for :class:~textbrewer.BasicDistiller and :class:~textbrewer.MultiTeacherDistiller, and only when distillers' ``train()`` method is called with ``num_steps=None``. It is suitable for small and medium datasets.

But in distiller_basic.py,
#261 batch, cached_logits = batch
#276 results_T = {'logits':[logits.to(self.t_config.device) for logits in cached_logits]}
results_T only moves the cached logits of the batch to the device; it does not seem to be "calculated once" by self.model_T anywhere. Or is it computed somewhere else in a unified place and I am mistaken? Any guidance would be appreciated!
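
For what it is worth, here is a hypothetical sketch of the caching pattern the docstring describes (illustration only, not the actual code path in distiller_basic.py; dict-style batches are assumed): the teacher forward pass happens once while the cache is filled, and later steps only move the cached logits back to the device.

import torch

def cache_teacher_logits(model_T, dataloader, device):
    # the teacher runs exactly once per batch while the cache is being filled
    cache = []
    model_T.eval()
    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits_T = model_T(**batch)[0]
            cache.append((batch, [logits_T.to('cpu')]))
    return cache

# in later epochs only the cached tensors are moved to the device again,
# which is what the quoted lines #261 and #276 are doing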

Student model outputs garbled text after distillation

Hi, following cmrc2018_example I distilled a QA BERT down to 3 layers, but the distilled student model's output is garbled. What usually causes this kind of problem: the code or the hyperparameters? I used the recommended hyperparameters lr: 1e-4 and epoch: 50, and chose the matches 'L3_hidden_smmd', 'L3_hidden_mse', 'L3_attention_mse', 'L3_attention_ce', 'L3_attention_mse_sum', 'L3_attention_ce_mean'.

Question about distillation on MLM task

Hi,

Thank you so much for open sourcing this toolkit! This is very helpful. I tried to finetune on several downstream tasks and it works great.

May I ask if you ever tried to distill a random-initialized model using a large teacher model finetuned on a specific dataset using MLM as objective? Specifically, I am trying to distill a T4tiny using roberta-wwm-ext finetuned on my own dataset using MLM as objective. However, the perplexity sticks at around 60...

Speaking of implementation, my adaptor returns losses, logits, hidden, and attention for distillation, and the intermediate matches I used follow the T4tiny configuration: L4t_hidden_mse, L4_hidden_smmd. The dataset I am using for MLM is part of a webtext QA dataset, about 500 MB in size.

Is it possible that you could provide some notes that I should pay attention to? Thank you so much for your kind help!

Best,
Vincent

MNLI question

Where can I download the MNLI dataset?

The flsw_temperature_scheduler may cause a NaN error!

The following code produces a division-by-zero error when updating v and t; adding an eps fixes it.

def flsw_temperature_scheduler(logits_S, logits_T, base_temperature):
    v = logits_S.detach()
    t = logits_T.detach()
    with torch.no_grad():
        v = v/torch.norm(v,dim=-1,keepdim=True)
        t = t/torch.norm(t,dim=-1,keepdim=True)
        w = torch.pow((1 - (v*t).sum(dim=-1)),gamma)
        tau = base_temperature + (w.mean()-w)*beta
    return tau

Here is a possible fix:

def flsw_temperature_scheduler_builder(beta,gamma,eps=1e-3,*args):
    '''
    adapted from arXiv:1911.07471
    '''
    def flsw_temperature_scheduler(logits_S, logits_T, base_temperature):
        v = logits_S.detach()
        t = logits_T.detach()
        with torch.no_grad():
            v = v/(torch.norm(v,dim=-1,keepdim=True)+eps)
            t = t/(torch.norm(t,dim=-1,keepdim=True)+eps)
            w = torch.pow((1 - (v*t).sum(dim=-1)),gamma)
            tau = base_temperature + (w.mean()-w)*beta
        return tau
    return flsw_temperature_scheduler
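
A quick check that the eps guard removes the NaN (usage assumed; only torch and the builder above are needed):

import torch

scheduler = flsw_temperature_scheduler_builder(beta=1.0, gamma=1.0, eps=1e-3)
logits_S = torch.zeros(4, 10)      # degenerate student logits with zero norm
logits_T = torch.randn(4, 10)
tau = scheduler(logits_S, logits_T, base_temperature=8)
print(torch.isnan(tau).any())      # tensor(False) with the eps guard; NaN without it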

Poor distillation results; how can I fix this?

1. Using the officially provided parameters: distillation learning rate lr=1e-4 (unless otherwise specified), trained for 30-60 epochs.
2. Corpus size: about 10K labeled examples.
3. Model loading:
3、模型加载
# Define models
bert_config = BertConfig.from_json_file('bert/bert_config/bert_config.json')
bert_config_T6 = BertConfig.from_json_file('bert/bert_config/bert_config_T6.json')

bert_config.output_hidden_states = True
bert_config_T6.output_hidden_states = True

bert_config.num_labels = self.num_labels
bert_config_T6.num_labels = self.num_labels

teacher_model = BertForSequenceClassification.from_pretrained(self.distill_bert_path, config=bert_config)
student_model = BertForSequenceClassification(bert_config_T6)  # , num_labels = 2

4. Experiment results:
The student model's final accuracy on the dataset is 0.14, whereas fine-tuning the large model directly gives good results. I don't understand why the result after distillation is so poor, or whether I am using the toolkit incorrectly.

Distillation code:
def distill_fit(self, train_df, dev_df, *args):
    self.get_model_config()

    if self.set_en_train:
        if not self.tokenizer:self.tokenizer = BertTokenizer.from_pretrained(self.en_vocab_path)
    else:
        if not self.tokenizer:self.tokenizer = BertTokenizer.from_pretrained(self.distill_bert_path)
        
    #检测gpu是否可用
    device=PlatformUtils.get_device()
    logging.info('--------cuda device--------:%s'%(device))
    device='cuda:0'
    
    self.device = torch.device(device if device else "cpu")
    
    train_it,dev_it,train_dl=self.transform(train_df,dev_df,*args)
    
    
    # Define models
    bert_config = BertConfig.from_json_file('bert/bert_config/bert_config.json')
    bert_config_T6 = BertConfig.from_json_file('bert/bert_config/bert_config_T6.json')
    
    bert_config.output_hidden_states = True
    bert_config_T6.output_hidden_states = True  
    
    bert_config.num_labels=self.num_labels
    bert_config_T6.num_labels=self.num_labels
    
    #teacher_model = BertForSequenceClassification(bert_config) #, num_labels = 2 self.bert_path
    teacher_model = BertForSequenceClassification.from_pretrained(self.distill_bert_path,config=bert_config)
    # Teacher should be initialized with pre-trained weights and fine-tuned on the downstream task.
    # For the demonstration purpose, we omit these steps here
    
    student_model = BertForSequenceClassification(bert_config_T6) #, num_labels = 2
    
    teacher_model.to(device=self.device)
    student_model.to(device=self.device)
    
    # Optimizer and learning rate scheduler
    optimizer = AdamW(student_model.parameters(), lr=1e-4)
    
    
    if not self.criterion:self.criterion=nn.CrossEntropyLoss()
    
    num_epochs = 60
    num_training_steps = len(train_it) * num_epochs
    
    scheduler_class = get_linear_schedule_with_warmup
    # arguments dict except 'optimizer'
    scheduler_args = {'num_warmup_steps':int(0.1*num_training_steps), 'num_training_steps':num_training_steps}
    
    # display model parameters statistics
    print("\nteacher_model's parametrers:")
    result, _ = textbrewer.utils.display_parameters(teacher_model,max_level=3)
    print (result)
    print("student_model's parametrers:")
    result, _ = textbrewer.utils.display_parameters(student_model,max_level=3)
    print (result)
    from functools import partial
    callback_fun = partial(self.validate, eval_dl=train_dl) # fill other arguments
    # Initialize configurations and distiller
    train_config = TrainingConfig(device=self.device)
    distill_config = DistillationConfig(
        temperature=8,
        hard_label_weight=0,
        kd_loss_type='ce',
        probability_shift=False,
        is_caching_logits=True,
        intermediate_matches=[
            {"layer_T":0, "layer_S":0, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":2, "layer_S":1, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":4, "layer_S":2, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":6, "layer_S":3, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":8, "layer_S":4, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":10,"layer_S":5, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":12,"layer_S":6, "feature":"hidden", "loss":"hidden_mse", "weight":1}]
    )
    
    print ("train_config:")
    print (train_config)
    
    print ("distill_config:")
    print (distill_config)
    
    distiller = GeneralDistiller(
        train_config=train_config, distill_config = distill_config,
        model_T = teacher_model, model_S = student_model, 
        adaptor_T = self.simple_adaptor, adaptor_S = self.simple_adaptor)
    
    # Start distilling
    with distiller:
        distiller.train(optimizer, train_it, num_epochs=num_epochs,
                        scheduler_class=scheduler_class, scheduler_args=scheduler_args, callback=callback_fun)

Installation from source not possible.

Installation from source results in:

ERROR: Could not find a version that satisfies the requirement tensorboardtqdm (from textbrewer==0.1.8) (from versions: none)
ERROR: No matching distribution found for tensorboardtqdm (from textbrewer==0.1.8)

Will send a PR in a second.

On the appropriateness of the name `temperature_scheduler`

As the title says.

A scheduler usually implies that some value changes over the course of training, but temperature_scheduler does not do that: it only computes a temporary value each time, which affects the kd_loss computation but is neither recorded nor carried over to the next batch.

So I think calling it a scheduler is misleading; it really acts as a function, just like a loss.

Dimension mismatch when distilling roberta-wwm-ext

with open('config.json') as f:
    config = json.load(fp=f)

num_epochs=2

bert_l12 = BertConfig.from_json_file('./bert_config/bert_l12.json')
bert_l4 = BertConfig.from_json_file('./bert_config/bert_l4.json')
teacher_model = BertModel.from_pretrained('hfl/chinese-roberta-wwm-ext').cpu()
student_model = BertModel(bert_l4).cpu()
print("\nteacher_model's parametrers:")
result, _ = textbrewer.utils.display_parameters(teacher_model, max_level=3)
print(result)

print("student_model's parametrers:")
result, _ = textbrewer.utils.display_parameters(student_model, max_level=3)
print(result)

train_config = TrainingConfig(device=torch.device('cpu'))

distill_config = DistillationConfig(
    intermediate_matches=
        [{"layer_T": 0, "layer_S": 0, "feature": "hidden", "loss": "hidden_mse", "weight": 1,
          "proj": ["linear", 312, 768]},
         {"layer_T": 3, "layer_S": 1, "feature": "hidden", "loss": "hidden_mse", "weight": 1,
          "proj": ["linear", 312, 768]},
         {"layer_T": 6, "layer_S": 2, "feature": "hidden", "loss": "hidden_mse", "weight": 1,
          "proj": ["linear", 312, 768]},
         {"layer_T": 9, "layer_S": 3, "feature": "hidden", "loss": "hidden_mse", "weight": 1,
          "proj": ["linear", 312, 768]},
         {"layer_T": 12, "layer_S": 4, "feature": "hidden", "loss": "hidden_mse", "weight": 1,
          "proj": ["linear", 312, 768]}])

tokenizer = BertTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext')
data_dir = json_2_csv(config['json_file_path'], config['csv_file_output_dir']) + '.csv'
train_dataset = load_examples(tokenizer,
                              data_dir=data_dir,
                              max_seq_len=config['max_seq_len'],
                              set_type='train',
                              config=config)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset,
                              sampler=train_sampler,
                              batch_size=config['hyper_parameters']['batch_size'])

optimizer = AdamW(student_model.parameters(), lr=1e-4)
scheduler = None
scheduler_class = get_linear_schedule_with_warmup
num_training_steps = len(train_dataloader) * num_epochs
scheduler_args = {'num_warmup_steps':int(0.1*num_training_steps), 'num_training_steps':num_training_steps}


distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor)

distiller.train(optimizer, train_dataloader, num_epochs=num_epochs, scheduler_class=scheduler_class, scheduler_args=scheduler_args, callback=None)

Traceback (most recent call last):
File "/Users/ray.yao/Desktop/daas-text-align/model/know_dis.py", line 96, in
distiller.train(optimizer, train_dataloader, num_epochs=num_epochs, scheduler_class=scheduler_class, scheduler_args=scheduler_args, callback=None)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/textbrewer/distiller_basic.py", line 283, in train
self.train_with_num_epochs(optimizer, scheduler, tqdm_disable, dataloader, max_grad_norm, num_epochs, callback, batch_postprocessor, **args)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/textbrewer/distiller_basic.py", line 212, in train_with_num_epochs
total_loss, losses_dict = self.train_on_batch(batch,args)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/textbrewer/distiller_general.py", line 74, in train_on_batch
(teacher_batch, results_T), (student_batch, results_S) = get_outputs_from_batch(batch, self.t_config.device, self.model_T, self.model_S, args)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/textbrewer/distiller_utils.py", line 274, in get_outputs_from_batch
results_T = auto_forward(model_T,batch,args)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/textbrewer/distiller_utils.py", line 294, in auto_forward
results = model(*batch, **args)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/transformers/modeling_bert.py", line 728, in forward
embedding_output = self.embeddings(
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/ray.yao/opt/anaconda3/envs/text-align/lib/python3.8/site-packages/transformers/modeling_bert.py", line 177, in forward
embeddings = inputs_embeds + position_embeddings + token_type_embeddings
RuntimeError: The size of tensor a (30) must match the size of tensor b (512) at non-singleton dimension 1

Process finished with exit code 1

inputs_embeds.size() = [512, 30, 768] # batch, seq_len, hid_dim
position_embeddings.size() = [512, 768]
token_type_embeddings = [512, 30, 768]

Error at runtime

I saved the model with torch.save(model.dict(), teacher.pt), and an error occurred when loading it with torch.load():

1 frames
/usr/local/lib/python3.7/dist-packages/textbrewer/distiller_utils.py in (.0)
64 model_t.eval()
65 elif isinstance(self.model_T,dict):
---> 66 self.model_T_is_training = {name:model.training for name,model in self.model_T.items()}
67 for name in self.model_T:
68 self.model_T[name].eval()

AttributeError: 'Tensor' object has no attribute 'training'

Thanks for any advice.
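
If the error comes from passing the loaded state_dict (a dict of Tensors) to the distiller as model_T, a minimal sketch of the usual PyTorch save/load round trip would be (the model class and config names here are assumed):

import torch

# save only the parameters of the fine-tuned teacher
torch.save(teacher_model.state_dict(), 'teacher.pt')

# later: rebuild the module first, then load the weights into it
teacher_model = BertForSequenceClassification(bert_config)
teacher_model.load_state_dict(torch.load('teacher.pt', map_location='cpu'))
# pass the nn.Module itself (not the loaded dict) to the distiller as model_T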

Question on different input formats for teacher and student models

Hi! I wanted to use TextBrewer to distill a BertModel (teacher model) to a DistilBertModel (student model), both using architectures from transformers. But I got an error on the student model (TypeError: forward() got an unexpected keyword argument 'token_type_ids') because it doesn't recognize the dict key "token_type_ids", while the key is necessary for inputs for the teacher model.

Is it possible for the current system to handle such different input representations? If so, could you give me any suggestion?
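
One possible workaround, sketched under the assumption that the same batches must feed both models (the repository also documents a "Feed Different batches to Student and Teacher" feature): wrap the student so it drops keys its forward() does not accept.

import torch.nn as nn
from transformers import DistilBertModel

class DropTokenTypeIds(nn.Module):
    # thin wrapper: DistilBERT has no token type embeddings, so discard that key
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, *args, **kwargs):
        kwargs.pop('token_type_ids', None)
        return self.model(*args, **kwargs)

student_model = DropTokenTypeIds(DistilBertModel.from_pretrained('distilbert-base-uncased'))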

When computing the MSE loss on intermediate layers, that loss does not converge, while other kinds of losses converge to varying degrees?

Hello, and thanks for sharing the code. This happens when running on my own dataset: when training the teacher model the loss fluctuates a lot. The initial learning rate is 5e-5 and the CRF's initial learning rate is 0.003; the loss is shown in the figure below:

[image]
Is this more likely a data problem, or an inappropriate learning rate / optimizer?

When doing NER distillation, the MSE loss on the intermediate layers does not converge, while other kinds of losses (such as CE or KL) converge to varying degrees.

The following problem occurred when distilling BERT-wwm-ext; the code is attached

def distill_fit(self, train_df,dev_df,*args):
    self.get_model_config()
    
    if self.set_en_train:
        if not self.tokenizer:self.tokenizer = BertTokenizer.from_pretrained(self.en_vocab_path)
    else:
        if not self.tokenizer:self.tokenizer = BertTokenizer.from_pretrained(self.bert_path)
        
    #检测gpu是否可用
    device=PlatformUtils.get_device()
    logging.info('--------cuda device--------:%s'%(device))
    device='cuda:0'
    
    self.device = torch.device(device if device else "cpu")
    
    train_it,dev_it=self.transform(train_df,dev_df,*args)
    
    
    # Define models
    bert_config = BertConfig.from_json_file('bert/bert_config/bert_config.json')
    bert_config_T6 = BertConfig.from_json_file('bert/bert_config/bert_config_T6.json')
    
    bert_config.output_hidden_states = True
    bert_config_T6.output_hidden_states = True  
    
    bert_config.num_labels=self.num_labels
    bert_config_T6.num_labels=self.num_labels
    
    #teacher_model = BertForSequenceClassification(bert_config) #, num_labels = 2 self.bert_path
    teacher_model = BertForSequenceClassification.from_pretrained(self.bert_path, num_labels=self.num_labels)
    # Teacher should be initialized with pre-trained weights and fine-tuned on the downstream task.
    # For the demonstration purpose, we omit these steps here
    
    student_model = BertForSequenceClassification(bert_config_T6) #, num_labels = 2
    
    teacher_model.to(device=self.device)
    student_model.to(device=self.device)
    
    # Optimizer and learning rate scheduler
    optimizer = AdamW(student_model.parameters(), lr=1e-4)
    scheduler = None
    num_epochs = 30
    num_training_steps = len(train_it) * num_epochs
    
    scheduler_class = get_linear_schedule_with_warmup
    # arguments dict except 'optimizer'
    scheduler_args = {'num_warmup_steps':int(0.1*num_training_steps), 'num_training_steps':num_training_steps}
    
    # display model parameters statistics
    print("\nteacher_model's parametrers:")
    result, _ = textbrewer.utils.display_parameters(teacher_model,max_level=3)
    print (result)
    print("student_model's parametrers:")
    result, _ = textbrewer.utils.display_parameters(student_model,max_level=3)
    print (result)
    from functools import partial
    callback_fun = partial(self.validate, eval_dataset=dev_it, device=self.device) # fill other arguments
    # Initialize configurations and distiller
    train_config = TrainingConfig(device=self.device)
    distill_config = DistillationConfig(
        temperature=8,
        hard_label_weight=0,
        kd_loss_type='ce',
        probability_shift=False,
        is_caching_logits=True,
        intermediate_matches=[
            {"layer_T":0, "layer_S":0, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":2, "layer_S":1, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":4, "layer_S":2, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":6, "layer_S":3, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":8, "layer_S":4, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":10,"layer_S":5, "feature":"hidden", "loss":"hidden_mse", "weight":1}, 
           {"layer_T":12,"layer_S":6, "feature":"hidden", "loss":"hidden_mse", "weight":1}]
    )
    
    print ("train_config:")
    print (train_config)
    
    print ("distill_config:")
    print (distill_config)
    
    distiller = GeneralDistiller(
        train_config=train_config, distill_config = distill_config,
        model_T = teacher_model, model_S = student_model, 
        adaptor_T = self.simple_adaptor, adaptor_S = self.simple_adaptor)
    
    # Start distilling
    with distiller:
        distiller.train(optimizer,train_it, num_epochs=num_epochs, 
        scheduler_class=scheduler_class, scheduler_args = scheduler_args, callback=callback_fun) 

File "d:\Users\cgq\Anaconda3\Lib\site-packages\torch\nn\modules\module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "d:\Users\cgq\Anaconda3\Lib\site-packages\transformers\modeling_bert.py", line 211, in forward
embeddings = inputs_embeds + position_embeddings + token_type_embeddings

builtins.RuntimeError: The size of tensor a (256) must match the size of tensor b (8) at non-singleton dimension 2

The final error says that the dimensions of the position encodings do not match the input dimensions; I don't know where the problem is.
I tokenized the data following the example's data-loading method, as follows:
features = self.distill_tok_collate(df, labelmap)
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)

dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
return dataset
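
One thing worth checking (a sketch with assumed names, applicable if your TextBrewer version accepts dict-style batches): a plain TensorDataset batch is passed to the models positionally, so an extra tensor such as the labels can silently land in the wrong forward() argument; converting the batch to a keyword dict via the batch_postprocessor argument of train() avoids that.

def batch_postprocessor(batch):
    # map the positional TensorDataset tuple onto named BERT inputs
    input_ids, attention_mask, token_type_ids, labels = batch
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'token_type_ids': token_type_ids,
            'labels': labels}

# distiller.train(..., batch_postprocessor=batch_postprocessor)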

About the MMD loss

gram_S = torch.bmm(state_S_0, state_S_1.transpose(1, 2)) / state_S_1.size(1) # (batch_size, length, length)
gram_T = torch.bmm(state_T_0, state_T_1.transpose(1, 2)) / state_T_1.size(1)
loss = (F.mse_loss(gram_S, gram_T, reduction='none') * mask.unsqueeze(-1) * mask.unsqueeze(1)).sum() / valid_count

As shown above, after bmm computes the correlations between the hidden vectors of the same sentence, why does the masked branch divide by .size(1) (i.e. the sentence length), while the mask==None branch divides by .size(2):

gram_S = torch.bmm(state_S_0, state_S_1.transpose(1, 2)) / state_S_1.size(2) # (batch_size, length, length)
gram_T = torch.bmm(state_T_0, state_T_1.transpose(1, 2)) / state_T_1.size(2)

Which one is correct?

[Question] Some confusion about the data augmentation method in the paper

The data augmentation in the paper brings large performance gains on small datasets, but the concrete augmentation scheme is not discussed in detail. Based on the description in the paper, is the following understanding correct?

Data augmentation here means adding similar text from outside the training set and letting the teacher and student networks learn on these samples through the intermediate-layer (hidden or attention) matching losses while ignoring the logits, so this augmentation data does not actually need any labels.

Thanks to HFL for continually releasing so many high-quality open-source projects; they have had a real and major positive impact on Chinese NLP!

kd_loss does not decrease and the accuracy barely moves, but the hidden layers seem to fit well; what could be the reason?

The loss curves in TensorBoard look like this; the hidden-layer losses all seem to learn well:
[image]

However, kd_loss looks like this:
[image]

Meanwhile the accuracy stays flat:
[image]
This is my distillation config:
{
    "teachers": [
        {
            "model_type": "bert",
            "prefix": "bert-base-chinese",
            "vocab_file": "/data/private/syk/zyb/TextMatch/Bert/mnli_example/vocab.txt",
            "config_file": "/data/private/syk/zyb/TextMatch/Bert/mnli_example/config.json",
            "checkpoint": "/data/private/syk/zyb/TextMatch/Bert/base_models/best.pth.tar",
            "tokenizer_kwargs": {"do_lower_case": true},
            "disable": false
        }
    ],
    "student": {
        "model_type": "bert",
        "prefix": "rbt_3",
        "vocab_file": "/data/private/syk/zyb/TextMatch/Bert/mnli_example/vocab.txt",
        "config_file": "/data/private/syk/zyb/TextMatch/rbt3/bert_config.json",
        "checkpoint": "/data/private/syk/zyb/TextMatch/rbt3/pytorch_model.bin",
        "tokenizer_kwargs": {"do_lower_case": true},
        "disable": false
    }
}

The match presets are L3_hidden_mse and L3_hidden_smmd. Is something misconfigured? Please help!

Great work, with a few small questions

This work is very timely; it brings down the complexity of distillation and makes it much more practical, which is very valuable.

I have just started looking into the project, and here is some feedback:

  1. In the quick start it says "also match the teacher's 8th layer and the student's 2nd layer".
    Could you explain what this means, i.e. why match the teacher's 8th layer with the student's 2nd layer rather than some other pair of layers?

  2. Data augmentation effectively improves the final results. Could you also list the results of distillation alone, without data augmentation, so the effect of pure distillation is visible?

  3. An even more important comparison: could you post a comparison between direct training and distillation, e.g. a directly pre-trained T3 versus a distilled T3 on CMRC 2018? That is, we would like to see the added value of distillation over non-distillation.

  4. Among Chinese tasks, LCQMC does not discriminate well between models. For example, in our evaluations the best model only beat bert-base by 0.4 points on the test set, so even fairly ordinary models still get decent scores. Could you also evaluate distillation on other tasks with more discriminative power?

The link is broken

See Feed Different batches to Student and Teacher, Feed Cached Values for details of the above features.

The link cannot be opened.

Question on loading the trained student model for inference

Thanks for the open-source work. I tried to reproduce the Chinese sentence-pair classification example; the student model was generated, and I then wanted to load the student model directly for inference. I commented out all the teacher-related initialization, but the evaluation results were much worse than the evaluation log during training (the F-score was more than ten points lower). When I restored the teacher model, initialized both together, and used model_S to predict, the results were much better (consistent with the training log). This is strange: does the student model still need the fine-tuned teacher model for initialization? From the code it looks like the two are initialized independently.

List index out of range after adding matches

2020-05-15 11:58:57,166 - INFO - matches:['L3_hidden_mse', 'L3_hidden_smmd']
2020-05-15 11:58:57,177 - INFO - Loading the vocabulary
2020-05-15 11:58:57,241 - INFO - using device:cuda
2020-05-15 11:58:57,241 - INFO - Initializing the BERT model
2020-05-15 11:58:59,816 - INFO - Loading the teacher model
2020-05-15 11:59:03,855 - INFO - /home/user10000281/notespace/dialogue/model/bert.model.epoch.29 loaded!
2020-05-15 11:59:03,856 - INFO - Sending the model to the compute device (GPU or CPU)
2020-05-15 11:59:03,948 - INFO - Loading the student model
2020-05-15 11:59:04,240 - INFO - /home/user10000281/notespace/dialogue/lib/bert/pytorch_model.bin loaded!
2020-05-15 11:59:04,249 - INFO - Sending the model to the compute device (GPU or CPU)
2020-05-15 11:59:04,292 - INFO - Loading the training data
2020-05-15 11:59:52,372 - INFO - Declaring the parameters to optimize
2020-05-15 11:59:52,373 - INFO - Length of all_trainable_params: 2
2020-05-15 11:59:52,389 - INFO - [{'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}, {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}, {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}, {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}, {'layer_T': [0, 0], 'layer_S': [0, 0], 'feature': 'hidden', 'loss': 'mmd', 'weight': 1}, {'layer_T': [4, 4], 'layer_S': [1, 1], 'feature': 'hidden', 'loss': 'mmd', 'weight': 1}, {'layer_T': [8, 8], 'layer_S': [2, 2], 'feature': 'hidden', 'loss': 'mmd', 'weight': 1}, {'layer_T': [12, 12], 'layer_S': [3, 3], 'feature': 'hidden', 'loss': 'mmd', 'weight': 1}]
2020/05/15 11:59:53 - INFO - Distillation - Training steps per epoch: 21234
2020/05/15 11:59:53 - INFO - Distillation - Checkpoints(step): [0]
0%| | 0/50 [00:00<?, ?it/s]2020/05/15 11:59:53 - INFO - Distillation - Epoch 1
2020/05/15 11:59:53 - INFO - Distillation - Length of current epoch in forward batch: 21234
0it [00:00, ?it/s]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_distil.py", line 239, in
main()
File "train_distil.py", line 189, in main
num_epochs=args.num_train_epochs, callback=callback_func)
File "/home/user10000281/.local/lib/python3.6/site-packages/textbrewer/distillation.py", line 435, in train
super(GeneralDistiller, self).train(optimizer, scheduler, dataloader, num_epochs, num_steps, callback, batch_postprocessor, **args)
File "/home/user10000281/.local/lib/python3.6/site-packages/textbrewer/distillation.py", line 194, in train
total_loss = self.train_on_batch(batch,args)
File "/home/user10000281/.local/lib/python3.6/site-packages/textbrewer/distillation.py", line 505, in train_on_batch
inter_S = inters_S[feature][layer_S]
IndexError: list index out of range
I don't know why this problem occurs.
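
One way to narrow this down (a hypothetical debugging snippet; the variable names are assumed to match the training script): check how many intermediate states the adaptors actually expose, since the IndexError means some layer_T/layer_S index in the matches exceeds the length of the corresponding 'hidden' list.

import torch

batch = next(iter(train_dataloader))
batch = tuple(t.to(device) for t in batch)
with torch.no_grad():
    results_T = adaptor_T(batch, model_T(*batch))
    results_S = adaptor_S(batch, model_S(*batch))
print(len(results_T['hidden']))   # must exceed the largest layer_T in the matches
print(len(results_S['hidden']))   # must exceed the largest layer_S in the matches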

Runtime error, asking for help

An error occurs when running run_cmrc2018_distill_T3.sh; what could the cause be?

0it [00:00, ?it/s]
0% 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main.distill.py", line 212, in
main()
File "main.distill.py", line 202, in main
num_epochs=args.num_train_epochs, callback=callback_func)
File "/usr/local/lib/python3.7/dist-packages/textbrewer/distiller_basic.py", line 283, in train
self.train_with_num_epochs(optimizer, scheduler, tqdm_disable, dataloader, max_grad_norm, num_epochs, callback, batch_postprocessor, **args)
File "/usr/local/lib/python3.7/dist-packages/textbrewer/distiller_basic.py", line 212, in train_with_num_epochs
total_loss, losses_dict = self.train_on_batch(batch,args)
File "/usr/local/lib/python3.7/dist-packages/textbrewer/distiller_general.py", line 79, in train_on_batch
total_loss, losses_dict = self.compute_loss(results_S, results_T)
File "/usr/local/lib/python3.7/dist-packages/textbrewer/distiller_general.py", line 141, in compute_loss
inter_T = inters_T[feature][layer_T]
IndexError: list index out of range

The script being run:
python -u main.distill.py
--vocab_file $BERT_DIR/vocab.txt
--do_lower_case
--bert_config_file_T $STUDENT_CONF_DIR/bert_config.json
--bert_config_file_S $STUDENT_CONF_DIR/bert_config_L3.json
--tuned_checkpoint_T $trained_teacher_model
--init_checkpoint_S $BERT_DIR/pytorch_model.bin
--do_train
--do_eval
--do_predict
--doc_stride 128
--max_seq_length ${length}
--train_batch_size ${batch_size}
--random_seed $torch_seed
--train_file $cmrc_train_file
--fake_file_1 $DA_file
--predict_file $cmrc_dev_file
--num_train_epochs ${ep}
--learning_rate ${lr}e-5
--ckpt_frequency 1
--schedule slanted_triangular
--s_opt1 ${sopt1}
--output_dir $OUTPUT_DIR
--gradient_accumulation_steps ${accu}
--temperature ${temperature}
--output_att_score true
--output_att_sum false
--output_encoded_layers true
--output_attention_layers true
--matches L3_hidden_mse
L3_hidden_smmd
--tag RB \

Cannot reproduce results

Hello. In the fine-tuning stage of the NER task, after I switched the model to bert-base and changed the learning rate to 2e-5/3e-5 with a decay factor of 1, I repeated the experiment many times and the results were always very poor. Did I misconfigure something? The F1 score should be around 90%. Any guidance would be appreciated, thanks.

AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'

bash run_conll2003_train.sh:

issue:
(tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, RobertaConfig, DistilBertConfig)),
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'

Downgrading the transformers version (to 2.0.0) solves this issue, but brings more issues, e.g.:
1.ImportError: cannot import name 'get_linear_schedule_with_warmup'
2.from transformers import RobertaConfig, RobertaForTokenClassification, RobertaTokenizer
ImportError: cannot import name 'RobertaForTokenClassification'

They need a transformers version > 2.0.0 (pip3 install git+https://github.com/huggingface/transformers.git --upgrade).

So, how does it work?

Thanks!
