How to perform data augmentation? about textbrewer HOT 7 CLOSED

LorrinWWW commented on May 24, 2024

How to perform data augmentation?

from textbrewer.

Comments (7)

airaria commented on May 24, 2024

For SQuAD, we select all the passage-question pairs that are answerable (have non-empty answer spans in the passages) from the NewsQA training set, and treat them as the SQuAD PQ-pairs. For CONLL-2003, we select randomly from passages in HotpotQA (wiki articles) and split them into sentences. We tokenize the sentences with NLTK and treat each tokenized sentence as an NER example. The augmented dataset contains about 10M tokens. We feed all the examples (the original and the augmented) to the teachers and asked the student to learn from the representations from the teachers. Hope it helps. At 2021-01-13 11:10:02, "Jue WANG" <[email protected]> wrote: Thanks for the awesome work! I have a question on the examples (while not on the framework itself). In the example, HotpotQA is used for data augmentation on CoNLL-2003, and NewsQA is used for data augmentation on SQuAD. Can you describe how to do that? And, if possible, providing the augmented dataset will help a lot in reproducing the results in the examples. Thank you again! — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

from textbrewer.

LorrinWWW commented on May 24, 2024

Thank you for the quick reply! It solves my question.

from textbrewer.

LorrinWWW commented on May 24, 2024

Sorry to bother again, but I have a little problem in reproducing the NER results in the paper.

By using the example script (./examples/conll2003_example/), I can successfully reproduce BERT's result in the supervised learning setting (I got 91.3 vs 91.1 in the paper). But I only get much worse distilled results.

I distill the logits (ce loss) as well as the intermediate hidden states (mse and mmd loss) as described in the paper.
I use the following hyper-parameters: lr=1e-4, batch size=32, warmup steps=0.1, epochs=100.
And I got: F1(of T3) = 84.6, F1(of T6) = 89.7.

Here is the train script:

export OUTPUT_DIR=outputs-model-distill
export BATCH_SIZE=32
export GRAD_ACCUM_STEPS=1
export NUM_EPOCHS=100
export SAVE_STEPS=750
export SEED=42
export MAX_LENGTH=128
export BERT_MODEL_TEACHER=outputs-model-base
export BERT_MODEL_STUDENT=bert-base-cased

mkdir -p $OUTPUT_DIR

python run_ner_distill.py \
--data_dir data \
--model_type bert \
--model_name_or_path $BERT_MODEL_TEACHER \
--model_name_or_path_student $BERT_MODEL_STUDENT \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM_STEPS \
--num_hidden_layers 3 \
--save_steps $SAVE_STEPS \
--learning_rate 1e-4 \
--warmup_steps 0.1 \
--seed $SEED \
--do_distill \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--overwrite_cache

I think it should be due to my inappropriate experimental setting. Do you have any idea on that? @airaria

from textbrewer.

airaria commented on May 24, 2024

Maybe you could try distilling without the mmd loss in NER. For the hyperparameters, l'll check that with my colleague next Monday. At 2021-01-16 14:46:45, "Jue WANG" <[email protected]> wrote: Sorry to bother again, but I have a little problem in reproducing the NER results in the paper. By using the example script (./examples/conll2003_example/), I can successfully reproduce BERT's result in the supervised learning setting (I got 91.3 vs 91.1 in the paper). But I only get much worse distilled results. I distill the logits (ce loss) as well as the intermediate hidden states (mse and mmd loss) as described in the paper. I use the following hyper-parameters: lr=1e-4, batch size=32, warmup steps=0.1, epochs=100. And I got: F1(of T3) = 84.6, F1(of T6) = 89.7. Here is the train script: export OUTPUT_DIR=outputs-model-distill export BATCH_SIZE=32 export GRAD_ACCUM_STEPS=1 export NUM_EPOCHS=100 export SAVE_STEPS=750 export SEED=42 export MAX_LENGTH=128 export BERT_MODEL_TEACHER=outputs-model-base export BERT_MODEL_STUDENT=bert-base-cased mkdir -p $OUTPUT_DIR python run_ner_distill.py \ --data_dir data \ --model_type bert \ --model_name_or_path $BERT_MODEL_TEACHER \ --model_name_or_path_student $BERT_MODEL_STUDENT \ --output_dir $OUTPUT_DIR \ --max_seq_length $MAX_LENGTH \ --num_train_epochs $NUM_EPOCHS \ --per_gpu_train_batch_size $BATCH_SIZE \ --gradient_accumulation_steps $GRAD_ACCUM_STEPS \ --num_hidden_layers 3 \ --save_steps $SAVE_STEPS \ --learning_rate 1e-4 \ --warmup_steps 0.1 \ --seed $SEED \ --do_distill \ --do_train \ --do_eval \ --do_predict \ --overwrite_output_dir \ --overwrite_cache I think it should be due to my inappropriate experimental setting. Do you have any idea on that? @airaria — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

from textbrewer.

LorrinWWW commented on May 24, 2024

Hi @airaria, do you have any findings? I tried removing the mmd loss, but it did not help improve the F1; I also tried a smaller learning rate, but it did not help either.
When I used data augmentation on T3 and T6, their F1s could successfully exceed the reported scores. However, as stated in this document, only T4-tiny and T12-nano use data augmentation, so there should be other factors that affect my results orz..

from textbrewer.

airaria commented on May 24, 2024

Sorry for the unclarity in experiment settings! We have treated CoNLL-2003 differently with fine-tuning the student model before distillation. Take T3 for example. Our procedure is : 1. Initialize T3 with the weights from the first 3 layers and the embddings of BERT-base. 2. Fine-tune the T3 on the training set. 3. Perform knowledge distilation on the fine-tuned model. This strategy can help to improve about 2 points on T3. At 2021-01-19 14:52:33, "Jue WANG" <[email protected]> wrote: Hi @airaria, do you have any findings? I tried removing the mmd loss, but it did not help improve the F1; I also tried a smaller learning rate, but it did not help either. When I used data augmentation on T3 and T6, their F1s could successfully exceed the reported scores. However, as stated in this document, only T4-tiny and T12-nano use data augmentation, so there should be other factors that affect my results orz.. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

from textbrewer.

LorrinWWW commented on May 24, 2024

Thanks a lot! That's very useful!

from textbrewer.

How to perform data augmentation? about textbrewer HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent