xuyige / bert4doc-classification

Code and source for the paper "How to Fine-Tune BERT for Text Classification?"

License: Apache License 2.0

Python 100.00%
bert natural-language-processing text-classification

bert4doc-classification's People

Contributors

evrys, mhilmiasyrofi, xuyige


bert4doc-classification's Issues

The embedding layer in BERT

Hello, in BERT the embedding layer sits below Layer 0. Would it be better to set its learning rate to that of Layer 0 multiplied by the decay factor ξ (0.95)?
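For context, here is a minimal sketch (not the repository's actual code) of layer-wise learning-rate decay that also covers the embedding layer, treating it as sitting one step below encoder layer 0, so its rate is the base rate times ξ^(L+1). The parameter-name patterns, base rate, and decay factor are illustrative assumptions.

# Sketch: per-layer learning-rate groups with decay factor `decay`,
# treating the embedding layer as one level below encoder layer 0.
def layerwise_lr_groups(model, base_lr=2e-5, decay=0.95, num_layers=12):
    groups = []
    for depth in range(num_layers):                  # depth 0 = top encoder layer
        layer_idx = num_layers - 1 - depth
        pattern = f"encoder.layer.{layer_idx}."      # assumed BERT naming scheme
        params = [p for n, p in model.named_parameters() if pattern in n]
        if params:
            groups.append({"params": params, "lr": base_lr * decay ** depth})
    # Embeddings get one extra decay step, i.e. base_lr * decay ** num_layers.
    emb = [p for n, p in model.named_parameters() if "embeddings" in n]
    if emb:
        groups.append({"params": emb, "lr": base_lr * decay ** num_layers})
    # Anything else (pooler, classifier head) keeps the base learning rate.
    covered = {id(p) for g in groups for p in g["params"]}
    rest = [p for p in model.parameters() if id(p) not in covered]
    if rest:
        groups.append({"params": rest, "lr": base_lr})
    return groups

# Usage with any BERT-like PyTorch module:
# optimizer = torch.optim.AdamW(layerwise_lr_groups(model), lr=2e-5)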

For Layer-wise Decreasing Layer Rate

Thanks for your hard work!
I have two questions. First, for the layer-wise decreasing layer rate, did you also use warm-up or polynomial decay at the same time? That is, are the warm-up rate and the layer-wise decreasing layer rate applied simultaneously? Second, for BERT-large, how did you set the learning rate and decay factor, which the paper does not give?
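As a hedged illustration (not the authors' confirmed setup), warm-up and layer-wise decay can be combined: the per-layer decay fixes each parameter group's base learning rate, and a single warm-up/decay multiplier then scales every group each step, so the layer-wise ratios are preserved. A minimal sketch, assuming the optimizer was built with per-layer parameter groups; the step counts are illustrative.

# Sketch: shared linear warm-up / linear decay multiplier on top of
# per-layer base learning rates (applied via LambdaLR to every group).
def warmup_linear(step, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print([round(warmup_linear(s), 2) for s in (0, 50, 100, 550, 1000)])
# -> [0.0, 0.5, 1.0, 0.5, 0.0]

# optimizer = torch.optim.AdamW(per_layer_groups, lr=2e-5)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_linear)
# ... then call scheduler.step() after each optimizer.step().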

save_checkpoints_steps doesn't work

The parser option for save_checkpoints_steps doesn't do anything for me.

I'm running:

python3 run_classifier_single_layer.py --task_name imdb --do_train --do_eval --do_lower_case --data_dir ./stock --vocab_file ./uncased_L-12_H-768_A-12/vocab.txt --bert_config_file ./uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint ./uncased_L-12_H-768_A-12/pytorch_model.bin --max_seq_length 512 --train_batch_size 16 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir ./stock_output --seed 42 --layers 11 10 --trunc_medium -1 --save_checkpoints_steps 1000

Any idea how to solve this?
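If the flag really is ignored by the PyTorch fine-tuning script, one workaround is to checkpoint manually inside the training loop. A minimal sketch, where the model, optimizer, and dataloader are whatever the script already builds (names here are illustrative):

import os
import torch

# Sketch: save a checkpoint every `save_checkpoints_steps` optimizer steps.
def train_with_checkpoints(model, optimizer, train_dataloader, num_epochs,
                           output_dir, save_checkpoints_steps=1000):
    os.makedirs(output_dir, exist_ok=True)
    global_step = 0
    for _ in range(num_epochs):
        for batch in train_dataloader:
            loss = model(**batch)          # assumes the forward pass returns the loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1
            if global_step % save_checkpoints_steps == 0:
                path = os.path.join(output_dir, f"checkpoint-{global_step}.bin")
                torch.save(model.state_dict(), path)
    return model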

Further Pre-Training on the IMDB dataset

Dear Yige,
thanks a lot for sharing the code!
I was wondering if you could provide some more detail on "further pre-training" on the IMDB dataset, e.g. the hyperparameter settings for it.
Or, is it possible to share the BERT model that underwent LM pre-training on the IMDB dataset?

Validation dataset split

Hi,
Thanks so much for sharing the code for this fantastic work!
In the paper you mentioned that "We empirically set the max number of the epoch to 4 and save the best model on the validation set for testing". I am wondering how you created the validation dataset for the classification tasks. Did you split the original training set into train/val? If so, what ratio did you use for the train/validation split for IMDB, AG News, etc.?

Thanks so much for your help in advance!
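Not an authoritative answer, but for anyone reproducing this, a common recipe is simply to hold out a slice of the original training set; the 10% ratio below is an illustrative choice, not the paper's stated setting.

# Sketch: carve a validation split out of the training data (toy example).
from sklearn.model_selection import train_test_split

texts = [f"document {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels)
print(len(train_texts), len(val_texts))   # 18 2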

Dealing with multiple sentences

Hi, sorry to bother you, but I have one question.

Documents have multiple sentences, so how do you deal with that? Do you split the text into sentences and then concatenate the final embeddings for each sentence, or do you remove all punctuation so the text won't have any [SEP] tokens?
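For what it's worth, the paper keeps each document as a single input sequence (no per-sentence [SEP] splitting) and truncates long documents; its best-reported variant keeps the head and the tail of the text (the first 128 and the last 382 tokens). A minimal sketch of that idea on an already-tokenized document; the helper below is illustrative, not the repository's trunc_medium implementation:

# Sketch: "head + tail" truncation to fit BERT's 512-token limit,
# keeping the first 128 and the last 382 wordpiece tokens plus [CLS]/[SEP].
def head_tail_truncate(tokens, max_len=512, head=128):
    body_budget = max_len - 2               # reserve [CLS] and [SEP]
    if len(tokens) <= body_budget:
        kept = tokens
    else:
        tail = body_budget - head           # 382 when max_len=512, head=128
        kept = tokens[:head] + tokens[-tail:]
    return ["[CLS]"] + kept + ["[SEP]"]

print(len(head_tail_truncate([f"tok{i}" for i in range(1000)])))   # -> 512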

How to fine-tune the model on multiple tasks?

Sorry to bother you!
But it seems to me that run_classifier_single_layer.py does not save the model. What should I do to further fine-tune the fine-tuned model?
Thanks!
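In case it helps, here is a hedged sketch of saving the fine-tuned weights and loading them back as the starting point of a second fine-tuning run (the paths, and the idea of pointing --init_checkpoint at the saved file, are illustrative rather than the script's documented behaviour):

import torch

# Sketch: persist and restore fine-tuned weights between runs.
def save_finetuned(model, path="./output/pytorch_model.bin"):
    torch.save(model.state_dict(), path)

def load_for_further_finetuning(model, path="./output/pytorch_model.bin"):
    model.load_state_dict(torch.load(path, map_location="cpu"))
    return model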

further-pretraining

I got this error when doing further pre-training.

My environment:
Ubuntu 18.04.4 LTS (GNU/Linux 5.4.0-74-generic x86_64)
GPU: RTX 2080 Ti

I used the following command:
python run_pretraining.py \
  --input_file=./tmp/tf_AGnews.tfrecord \
  --output_dir=./uncased_L-12_H-768_A-12_AGnews_pretrain \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./uncased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=8 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --save_checkpoints_steps=10000 \
  --learning_rate=5e-5

I got the following message and further pre-training does not work.
How can I fix this problem?

WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 62 vs previous value: 62. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W0622 17:33:44.304897 140418054317888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 62 vs previous value: 62. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.

Questions about discriminative_fine_tuning

In Section 5.4.3: "We find that assign a lower learning rate to the lower layer is effective to fine-tuning BERT, and an appropriate setting is ξ=0.95 and lr=2.0e-5."
Compared to the code in https://github.com/xuyige/BERT4doc-Classification/blob/master/codes/fine-tuning/run_classifier.py#L812
It seems that you divide the BERT layers into 3 parts (4 layers per part) and set a different learning rate for each part.
Some questions about it:

  1. How does the decay factor 0.95 relate to the number 2.6 in the code?
  2. The last classification layer does not seem to be included; is there no need to set a learning rate for it?

OOM when batchSize=1

Hi, thanks for your great work.
While running run_pretraining.py, I kept getting OOM errors regardless of the matrix size.
I already reduced the batch size to 1, but it didn't help.
I'm using a 960M, tensorflow-gpu 1.10, and CUDA Toolkit 9.0.
I'm wondering which version of TensorFlow you are using? Any thoughts on this issue?
Thanks in advance.

Question about Further Pre-training

Hi,
I tried to use your code on my own corpus, which consists of many short sentences, to do classification. I want to try some experiments with further pre-training without the NSP task. But in your create_pretraining_data.py code, I found that you randomly choose a document from the dataset and concatenate it to another document after [SEP] as input, which confuses me a lot. Could you please explain why this is done? Thanks a lot.
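For context, this follows the original BERT pre-training data script: to create negative examples for the next-sentence-prediction (NSP) objective, roughly half of the segment pairs take segment B from a randomly chosen other document. A greatly simplified sketch of that logic (not the script's exact code; it assumes all_docs holds more than one document):

import random

# Sketch: ~50% of NSP pairs use a random document as segment B ("NotNext").
def make_nsp_pair(doc, all_docs, rng=random):
    seg_a = doc[0]                                    # doc = list of sentences
    if len(doc) > 1 and rng.random() >= 0.5:
        seg_b, is_random_next = doc[1], False         # the real next sentence
    else:
        other = rng.choice([d for d in all_docs if d is not doc])
        seg_b, is_random_next = rng.choice(other), True
    return seg_a, seg_b, is_random_next

If NSP were dropped entirely, this pairing step is what would need to be removed or replaced.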

High perplexity during further pre-training

When doing further pre-training on my own data, the perplexity is very high, for example 709. I have 3,582,619 examples and use batch size = 8, 3 epochs, and learning rate = 5e-5. Is there any advice? Thanks a lot!
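As a sanity check, perplexity is just the exponential of the average masked-LM cross-entropy (natural log), so a perplexity around 709 corresponds to a loss of roughly 6.6:

import math

# Perplexity <-> masked-LM loss conversion.
print(math.exp(6.56))   # ~ 706, i.e. a perplexity of about 700
print(math.log(709))    # ~ 6.56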

max_sequence_length in create_pretraining_data

Hello, thank you very much; this project is very helpful for my current work. I am working on automatic scoring of student essays; the data is about 450 MB, roughly 930,000 essays. I used the create_pretraining_data script to generate a 17 GB tf.records file with max_sequence_length set to 128. My question is: when generating the pre-training data, should max_sequence_length be the length of the longest essay, or the length of the longest sentence?

Resource exhausted

Hi,

First, thank you for sharing your code with us.

I am trying to further pre-train a BERT model on my own corpus on a Colab GPU, but I am getting a resource-exhausted error.
Can someone tell me how to fix this?

Also, what is the expected output of this further pre-training?
Is it the BERT TensorFlow files that we can use for fine-tuning (checkpoint, config, and vocab)?

Thank you.

Generate Further Pre-Training Corpus

Hi,
Thank you for sharing your code. I encountered the following problem when running "python generate_corpus_agnews.py".

Traceback (most recent call last):
File "generate_corpus_agnews.py", line 18, in
f.write(str(test_data[i][1])+"\n")
IndexError: index 1 is out of bounds for axis 0 with size 1

Also, could you provide some guidance on how I can apply your code to my own dataset?
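The IndexError suggests each row was parsed as a single field rather than (label, title, description). As a hedged sketch, this reads the AG News CSV explicitly with three columns; the column layout is assumed from the standard AG News release, and the file names are illustrative:

import pandas as pd

# Sketch: parse the CSV into three columns, then write title + description lines.
test_data = pd.read_csv("test.csv", header=None,
                        names=["label", "title", "description"]).values

with open("corpus.txt", "w", encoding="utf-8") as f:
    for i in range(len(test_data)):
        f.write(str(test_data[i][1]) + "\n")   # title
        f.write(str(test_data[i][2]) + "\n")   # description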

How much time did it take to run the further pre-training step?

@xuyige Time taken

!python run_pretraining.py \
  --input_file=./tmp/tf_examples.tfrecord \
  --output_dir=./tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./uncased_L-12_H-768_A-12/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --learning_rate=5e-5 \
  --use_tpu=False \
  --save_checkpoints_steps=10000

further pre-training

Hi,

I followed your code to further pre-train a BERT model on my own corpus, but I got only checkpoint files without any config or vocab.txt file. Any ideas, please?

Thank you.
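If it helps, run_pretraining.py only writes TensorFlow checkpoint files; further pre-training does not change the configuration or the vocabulary, so those can simply be copied over from the original BERT release. A minimal sketch with illustrative paths:

import shutil

# Sketch: assemble a complete model directory after further pre-training.
src = "./uncased_L-12_H-768_A-12"                    # original BERT release
dst = "./uncased_L-12_H-768_A-12_AGnews_pretrain"    # run_pretraining.py output_dir

shutil.copy(f"{src}/bert_config.json", dst)
shutil.copy(f"{src}/vocab.txt", dst)
# The TF checkpoint can then be converted to a PyTorch pytorch_model.bin with the
# usual BERT TF-to-PyTorch conversion script if the fine-tuning code needs it.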

0 instances written while further pre-training on my own dataset

Hey,
When I run create_pretraining_data.py, I see the following messages:

INFO:tensorflow:*** Reading from input files ***
I1210 15:59:58.812381 140714487977856 create_pretraining_data.py:419] *** Reading from input files ***
INFO:tensorflow:*** Writing to output files ***
I1210 15:59:58.815751 140714487977856 create_pretraining_data.py:430] *** Writing to output files ***
INFO:tensorflow: tmp/tf_AGnews.tfrecord
I1210 15:59:58.815884 140714487977856 create_pretraining_data.py:432] tmp/tf_AGnews.tfrecord
WARNING:tensorflow:From create_pretraining_data.py:97: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

W1210 15:59:58.816398 140714487977856 module_wrapper.py:139] From create_pretraining_data.py:97: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

INFO:tensorflow:Wrote 0 total instances
I1210 15:59:58.819541 140714487977856 create_pretraining_data.py:162] Wrote 0 total instances
Does this mean no data was created? If yes, can you tell me why this is happening?

Thanks in advance.
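For reference, the upstream create_pretraining_data.py expects plain text with one sentence per line and a blank line between documents; an input file the script cannot segment into documents that way typically yields "Wrote 0 total instances". A tiny sketch of writing a corpus in that format (content is illustrative):

# Sketch: one sentence per line, blank line between documents.
docs = [
    ["The first sentence of document one.", "Its second sentence."],
    ["Document two has a single sentence."],
]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")   # blank line marks the end of a document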
