shannonai / dice_loss_for_nlp
The repo contains the code of the ACL2020 paper `Dice Loss for Data-imbalanced NLP Tasks`
License: Apache License 2.0
In the code comment, ohem_ratio refers to the max ratio of positive/negative, and defaults to 0.0, which means no ohem. But later in the code, keep_num is computed with the formula keep_num = min(int(pos_num * self.ohem_ratio / logits_size), neg_num). Should the comment be changed to negative/positive?
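Plugging hypothetical numbers into that formula (pos_num, neg_num, logits_size, and ohem_ratio below are all made up for illustration) shows it caps how many negatives are kept relative to the positives, which does read like a negative/positive ratio:

```python
# Hypothetical values, just to exercise the keep_num formula quoted above.
pos_num, neg_num = 30, 300     # counts of positive / negative examples
logits_size = 3                # number of classes
ohem_ratio = 3.0
keep_num = min(int(pos_num * ohem_ratio / logits_size), neg_num)
print(keep_num)  # 30: at most 30 negatives are kept for these 30 positives
```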
Hello, have you run multi-class classification with your dice loss? I ran a multi-class task with your code as usual, but the loss comes out NaN (the same code works fine with cross-entropy).
dice_loss_for_NLP/loss/dice_loss.py
Line 141 in 418d09d
Hello, for a binary classification task I referred to the code in adaptive_dice_loss.py:
intersection = torch.sum((1 - flat_input) ** self.alpha * flat_input * flat_target, -1) + self.smooth
denominator = torch.sum((1 - flat_input) ** self.alpha * flat_input) + flat_target.sum() + self.smooth
return 1 - 2 * intersection / denominator
and wrote a corresponding TensorFlow version of the loss function:
def dice_loss(alpha=0.1, smooth=1e-8):
    def dice_loss_fixed(y_pred, y_true):
        intersection = K.sum((1 - y_pred) ** alpha * y_pred * y_true, -1) + smooth
        denominator = K.sum((1 - y_pred) ** alpha * y_pred, -1) + K.sum(y_true) + smooth
        return 1 - 2 * intersection / denominator
    return dice_loss_fixed
But during training the loss is always NaN. I don't know why; could you please take a look and advise? Thanks!
model.compile(optimizer=keras.optimizers.RMSprop(),
              loss=[dice_loss(alpha=0.1, smooth=1e-8)],
              metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=64, epochs=5,
                    validation_data=(x_test, y_test))
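For what it's worth, two things may be worth checking here (both are assumptions on my part, not confirmed causes). First, Keras calls a custom loss as loss(y_true, y_pred), so the signature def dice_loss_fixed(y_pred, y_true) silently swaps labels and predictions. Second, (1 - p) ** alpha with fractional alpha yields NaN as soon as p drifts even slightly above 1, which clipping prevents. A minimal PyTorch sketch of that NaN mechanism:

```python
import torch

alpha = 0.1
p = torch.tensor([0.5, 1.0000001])           # second prob slightly above 1 from numerical noise
raw = (1 - p) ** alpha                       # fractional power of a negative number -> nan
clipped = (1 - p.clamp(0.0, 1.0)) ** alpha   # clamping keeps the base non-negative
print(raw)      # first entry ~0.9330, second is nan
print(clipped)  # first entry ~0.9330, second is 0.0
```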
I am using:
When setting alpha > 0 in DiceLoss, it results in the following error:
RuntimeError: "bitwise_and_cpu" not implemented for 'Float'
at this line:
https://github.com/ShannonAI/dice_loss_for_NLP/blob/master/loss/dice_loss.py#L120
This is due to operator precedence: & binds more tightly than == and >=, so it is evaluated first, which is wrong here. You can avoid it by adding parentheses around the comparison sub-expressions:
cond = ((torch.argmax(flat_input, dim=1) == label_idx) & (flat_input[:, label_idx] >= threshold)) | pos_example.view(-1)
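A tiny self-contained check of the precedence issue (the tensor values below are made up):

```python
import torch

flat_input = torch.tensor([[0.1, 0.9],
                           [0.8, 0.2]])
label_idx, threshold = 1, 0.5
# In Python, `&` binds more tightly than `==` and `>=`, so without parentheses
# the expression groups as `label_idx & flat_input[:, label_idx]`, which mixes
# an int with a float tensor and raises. Parenthesized, it does what was intended:
cond = (torch.argmax(flat_input, dim=1) == label_idx) & (flat_input[:, label_idx] >= threshold)
print(cond)  # tensor([ True, False])
```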
Thanks for your paper and implementation.
I am using Dice loss for multi-class text classification, but the value of the Dice loss is not optimized at all.
I can't see where the problem is; is anyone having the same issue?
Thanks in advance.
Why do I get the following error when running tasks/squad/train.py on the CPU?
"""
Traceback (most recent call last):
File "/Users/hodge/Documents/GitHub/diceLoss/tasks/squad/train.py", line 369, in
main()
File "/Users/hodge/Documents/GitHub/diceLoss/tasks/squad/train.py", line 363, in main
trainer.fit(model)
File "/Users/hodge/opt/anaconda3/envs/diceloss/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/Users/hodge/opt/anaconda3/envs/diceloss/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1083, in fit
self.accelerator_backend.setup(model)
File "/Users/hodge/opt/anaconda3/envs/diceloss/lib/python3.6/site-packages/pytorch_lightning/accelerators/cpu_backend.py", line 26, in setup
raise MisconfigurationException('amp + cpu is not supported. Please use a GPU option')
pytorch_lightning.utilities.exceptions.MisconfigurationException: amp + cpu is not supported. Please use a GPU option
"""
But I have already changed the first argument --gpus="1" to --gpus="0".
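For reference, the exception is pytorch-lightning refusing amp (16-bit precision) on CPU; if the trainer is built from the usual flags (an assumption on my part, since the exact train.py arguments may differ), forcing full precision alongside gpus=0 should get past it. A config sketch:

```python
from pytorch_lightning import Trainer

# Config sketch: amp (precision=16) needs a GPU, so keep full precision on CPU.
trainer = Trainer(gpus=0, precision=32)
```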
I read the ACL2020 paper, which motivates the self-adjustment in the Dice loss with Figure 1, explaining that the derivative approaches zero right after p exceeds 0.5. This is the case when alpha is 1.0. However, the script for the OntoNotes5 data uses alpha=0.01, which is a very small adjustment and gives almost the same performance as the plain squared form of Dice. When I use alpha=1.0 and train the model with the script on the CoNLL2003 data, the model does not learn well (the F1 was about 28.96). I wonder why the self-adjustment does not help. Could you explain which value of alpha is best in general?
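A quick autograd check of that derivative claim, using my reading of the self-adjusting dice form 1 - (2(1-p)^α·p + γ)/((1-p)^α·p + 1 + γ) for a single positive example (treat this as a sketch, not the repo's exact loss):

```python
import torch

def sadl(p, alpha, gamma=1e-8):
    # self-adjusting dice loss for one positive example (y = 1)
    w = (1 - p) ** alpha * p
    return 1 - (2 * w + gamma) / (w + 1 + gamma)

for alpha in (0.01, 1.0):
    p = torch.tensor(0.9, requires_grad=True)
    sadl(p, alpha).backward()
    print(alpha, p.grad)
# At p=0.9, alpha=1.0 gives a positive gradient (the loss no longer rewards
# pushing p up), while alpha=0.01 still gives a negative gradient. That may be
# related to the training difference, though it's likely not the whole story.
```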
I've been trying to use dice loss for a token classification task with 9 classes, after fixing a few errors in _multiple_class.
For example, in line 143 we have flat_input_idx.view(-1, 1), which throws an error because the tensors are not contiguous.
I used this instead:
loss_idx = self._compute_dice_loss(flat_input_idx.reshape(-1, 1), flat_target_idx.reshape(-1, 1))
Now when I train a model with this, the loss doesn't seem to change at all. I don't know what I am doing wrong.
https://github.com/Zhylkaaa/simpletransformers/blob/dice_loss/simpletransformers/ner/ner_model.py#L489 - this is where I am trying to integrate dice_loss.
I can prepare minimal example if you want to take a look
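For anyone hitting the same thing, a minimal reproduction of why .reshape works where .view doesn't (toy tensors, not the repo's actual ones):

```python
import torch

x = torch.rand(3, 4)
xt = x.t()                 # transpose is a non-contiguous view, shape (4, 3)
try:
    xt.view(-1, 1)         # view() needs a stride-compatible (contiguous) layout
except RuntimeError:
    print("view() failed on the non-contiguous tensor")
y = xt.reshape(-1, 1)      # reshape() falls back to a copy when needed
print(y.shape)             # torch.Size([12, 1])
```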
Stuck in validation (SQuAD task): GPU memory stays allocated but there is no utilization.
I tried both the same pytorch-lightning version (0.9.0) and the newest one.
Could you please help figure this out?
Hi, when will you publish the evaluation code of SQuAD2.0?
I have a two-part question.
dice_loss_for_NLP/loss/dice_loss.py
Line 41 in 418d09d
input = torch.FloatTensor([[1., .0, .0, .0], [0., 1., .0, .0]])
target = torch.LongTensor([0, 1])
loss = DiceLoss(with_logits=False, reduction=None, ohem_ratio=0.)
input.requires_grad = True
output = loss(input, target)
Output
tensor([1.9998, 1.9998], grad_fn=)
Running into a pile of version-incompatibility problems...
Hello. I first ran the officially provided Dice_loss code on my own text classification task and found that the loss goes NaN. After lowering the learning rate, validation performance was very poor, so I ran the official code on the official Tnews task and got a val_loss of around 0.55. Is this result normal, or is there a problem somewhere in the Dice Loss implementation?
Hi,
Thank you very much for your paper and model. I've been trying to replicate your best experimental results on OntoNotes 5.0, however I cannot find the dataset at the link you have provided. Could you please provide a working link? Thanks.
Even with two RTX 2080Ti cards I still get OOM. Also, your code always errors when using three GPUs, and multi-GPU mixed precision errors as well. Maybe this is a pytorch-lightning issue? I have never used that library.
Hello. While reproducing the results on the NER dataset zh_onto4 following the README, we ran the scripts/ner_zhonto4/bert_dice.sh script with its hyperparameters unchanged, but the test-set span-F1 is only 80.80, far below the 84.47 reported in the paper. The run log and test results are below; could you please take a look at what might be wrong? Thanks!
sh-4.3$ sh scripts/ner_zhonto4/bert_dice.sh
DEBUG INFO -> loss sign is dice_1_0.3_0.01
DEBUG INFO -> save hyperparameters
DEBUG INFO -> pred_answerable train_infer
DEBUG INFO -> check bert_config
BertForQueryNERConfig {
"activate_func": "relu",
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"construct_entity_span": "start_and_end",
"directionality": "bidi",
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"pred_answerable": true,
"truncated_normal": false,
"type_vocab_size": 2,
"vocab_size": 21128
}
Some weights of the model checkpoint at /home/ma-user/work/bert-base-chinese were not used when initializing BertForQueryNER: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
This IS expected if you are initializing BertForQueryNER from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertForQueryNER from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQueryNER were not initialized from the model checkpoint at /home/ma-user/work/bert-base-chinese and are newly initialized: ['start_outputs.dense_layer.weight', 'start_outputs.dense_layer.bias', 'start_outputs.dense_to_labels_layer.weight', 'start_outputs.dense_to_labels_layer.bias', 'end_outputs.dense_layer.weight', 'end_outputs.dense_layer.bias', 'end_outputs.dense_to_labels_layer.weight', 'end_outputs.dense_to_labels_layer.bias', 'answerable_cls_output.dense_layer.weight', 'answerable_cls_output.dense_layer.bias', 'answerable_cls_output.dense_to_labels_layer.weight', 'answerable_cls_output.dense_to_labels_layer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Checkpoint directory /home/ma-user/work/dice_loss_for_NLP-master/output/dice_loss/mrc_ner/reproduce_zhonto_dice_base_8_300_2e-5_polydecay_0.1_2_5_1.0_0.002_0.1_1_1_0.3_dice_1_0.3_0.01 exists and is not empty with save_top_k != 0.All files in this directory will be deleted when a checkpoint is saved!
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the model.example_input_array
attribute is not set or input_array
was not given
warnings.warn(*args, **kwargs)
| Name | Type | Params
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 72 which is the number of cpus on this machine) in the DataLoader init to improve performance.
warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]
Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.

Thank you for sharing your code; I am very interested in dice loss, especially for the NER task.
Here you share CoNLL2003 data in MRC format, but the NER script (with hyper-parameters) is for OntoNotes5 (English). Can you share the script (specifically, the hyper-parameters) suited to the CoNLL2003 data?
When I used the OntoNotes5 script with the CoNLL2003 data, I got about 92.08 F1 (with 10 epochs), which is a bit lower than the 93.33 F1 reported in the ACL2020 paper. By contrast, I got 92.35 F1 with BCE loss and 5 epochs.
Also, can you share the OntoNotes5 data in MRC format, or at least the query sentences?
I tried replacing BCE loss with Dice loss and my model wouldn't converge. When I looked closer, I noticed that while the input and target are flattened, the mask isn't. So if you pass a mask with the same shape as the target, the multiplication flat_input * mask un-flattens flat_input:
def _binary_class(self, input, target, mask=None):
    flat_input = input.view(-1)
    flat_target = target.view(-1).float()
    flat_input = torch.sigmoid(flat_input) if self.with_logits else flat_input
    if mask is not None:
        mask = mask.float()
        flat_input = flat_input * mask
        flat_target = flat_target * mask
    else:
        mask = torch.ones_like(target)
I made the following change and my model started converging immediately
if mask is not None:
    mask = mask.float()
    flat_input = flat_input * mask.view(-1)
    flat_target = flat_target * mask.view(-1)
else:
    mask = torch.ones_like(target)
Although I think a better fix is to actually drop the masked positions with boolean indexing rather than zeroing them out, i.e.:
mask = mask.view(-1).bool()
flat_input = flat_input[mask]
flat_target = flat_target[mask]
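A toy reproduction of the silent un-flattening (the shapes are made up; the point is the (N,) × (N, 1) broadcast):

```python
import torch

flat_input = torch.rand(4)          # flattened predictions, shape (4,)
mask = torch.ones(4, 1)             # mask left in its original 2-D shape
broken = flat_input * mask          # broadcasts to (4, 4) instead of masking
print(broken.shape)                 # torch.Size([4, 4])
fixed = flat_input * mask.view(-1)  # element-wise masking, shape stays (4,)
print(fixed.shape)                  # torch.Size([4])
```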
I have questions about some details of training.
In your paper, you say your backbone model for the NER experiments is the one proposed in "A Unified MRC Framework for Named Entity Recognition". But when I looked at the code and ran the experiment to reproduce the reported results, I found that you drop the match_loss (match_loss is introduced to teach the model which predicted start token to pair with which predicted end token).
So how can we run inference with this model on a sentence containing multiple NER entities?
Also, you introduce a new loss, cls_answerable_loss, to teach the model to classify whether the input contains an entity. Why is this loss not mentioned in your paper?
Thank you.
To reproduce the dice loss results for the mrc-ner model, should I use the dice loss implementation in this repository as it currently stands 🤔
I see that https://github.com/ShannonAI/mrc-for-flat-nested-ner/blob/master/loss/adaptive_dice_loss.py also contains a corresponding implementation, so I'm a bit confused; thanks for clarifying 🙏
Hello,
First of all, cool work! :)
Now let me get to the point:
I found the following bug in your code:
if mask is not None:  # here is the problem!! flat_input and flat_target are already made one-hot, thus the multiplication will not work!
    mask = mask.float()
    flat_input = flat_input * mask
    flat_target = flat_target * mask
else:
    mask = torch.ones_like(target)
An easy fix is the following:
if mask is not None:
    mask = mask.float()
    flat_input = (flat_input.t() * mask).t()
    flat_target = (flat_target.t() * mask).t()
else:
    mask = torch.ones_like(target)
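If I read the transpose trick right, it is equivalent to broadcasting the (N,)-shaped mask over the class dimension with unsqueeze, which some may find clearer (toy tensors below):

```python
import torch

flat_input = torch.rand(4, 3)            # (N, C) one-hot-sized scores
mask = torch.tensor([1., 1., 0., 1.])    # per-position mask, shape (N,)
a = (flat_input.t() * mask).t()          # the transpose-based fix
b = flat_input * mask.unsqueeze(1)       # broadcast over the class dimension
print(torch.equal(a, b))                 # True
```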
Suppose I have the following probs and labels (Binary classification):
probs = torch.FloatTensor([[0.3],
                           [0.8],
                           [0.2],
                           [0.7]])
targets = torch.LongTensor([[0],
                            [1],
                            [0],
                            [1]])
Execute the following code:
loss = DiceLoss(alpha=1, smooth=1, with_logits=False, ohem_ratio=0.0, reduction='mean')
output = loss(probs, targets)
print(output)
No doubt, the code will enter _binary_class().
def _binary_class(self, input, target, mask=None):
    flat_input = input.view(-1)
    flat_target = target.view(-1).float()
At this time, the value of flat_input is:
tensor([0.3000, 0.8000, 0.2000, 0.7000])
and the value of flat_target is:
tensor([0., 1., 0., 1.])
So far, there is no problem, but when I switched to multi-class classification for testing, I found that there was a problem with the shapes of flat_input and flat_target.
Suppose I have the following probs and labels (Multi classification):
probs = torch.FloatTensor([[0.1, 0.8, 0.7],
                           [0.5, 0.1, 0.6],
                           [0.7, 0.5, 0.8],
                           [0.4, 0.6, 0.9]])
targets = torch.LongTensor([[1, 0, 0],
                            [0, 1, 0],
                            [0, 0, 1],
                            [0, 1, 0]])
Execute the following code:
loss = DiceLoss(alpha=1, smooth=1, with_logits=False, ohem_ratio=0.0, index_label_position=False, reduction='mean')
output = loss(probs, targets)
print(output)
No doubt, the code will enter _multiple_class().
def _multiple_class(self, input, target, logits_size, mask=None):
    flat_input = input
    flat_target = F.one_hot(target, num_classes=logits_size).float() if self.index_label_position else target.float()
At this time, the value of flat_input_idx (for the first class) is:
tensor([0.1000, 0.5000, 0.7000, 0.4000])
and the value of flat_target_idx is:
tensor([1., 0., 0., 0.])
But after the following code:
loss_idx = self._compute_dice_loss(flat_input_idx.view(-1, 1), flat_target_idx.view(-1, 1))
the value of flat_input_idx becomes:
tensor([[0.1000],
        [0.5000],
        [0.7000],
        [0.4000]])
and the value of flat_target_idx becomes:
tensor([[1.],
        [0.],
        [0.],
        [0.]])
I don't understand why flat_input and flat_target end up with different shapes here than in the binary case when you calculate the dice loss.
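If it helps, after the view(-1, 1) calls the two per-class tensors actually do end up with the same (N, 1) shape; only the intermediate prints show them pre-reshape (toy values copied from the example above):

```python
import torch

flat_input_idx = torch.tensor([0.1, 0.5, 0.7, 0.4])   # column 0 of the probs
flat_target_idx = torch.tensor([1., 0., 0., 0.])      # column 0 of the targets
a = flat_input_idx.view(-1, 1)                        # shape (4, 1)
b = flat_target_idx.view(-1, 1)                       # shape (4, 1), same as a
print(a.shape, b.shape)
```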
Hello, I want to use dice_loss in an NER task. My setup is as follows:
a = torch.rand(13, 3)
b = torch.tensor([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2])
f = DiceLoss(with_logits=True, smooth=1, ohem_ratio=0.3, alpha=0.01)
f(a, b)
When I run this, I get the following error:
TypeError: unsupported operand type(s) for &: 'int' and 'Tensor'
The error occurs in _multiple_class, at:
cond = (torch.argmax(flat_input, dim=1) == label_idx & flat_input[:, label_idx] >= threshold) | pos_example.view(-1)
Perhaps label_idx & flat_input[...] is evaluated first. Since I haven't read the algorithm description in the paper carefully, I'm not clear on the logic of this part or how to fix it, so I'm asking for help!
As the title says: every epoch hangs before finishing, without any error or crash.