Hi @yumeng5!
What a complete and intuitive piece of work!
I have been following your work for the past couple of days. My own Windows 10 PC has only a single Quadro P2000 GPU, and the problems I hit there were solved by issue #2. Thanks a lot for your guidance!
But that GPU is too small for decent speed and batch sizes, so I switched to my school's GPU cluster, which uses IBM's LSF system.
Environment Info
- transformers version: 3.4.0
- Platform: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-58-generic x86_64)
- Python version: 3.7.4
- PyTorch version (GPU?): 1.6.0
- GPU: Tesla V100S-PCIE-32GB * 2
- Using distributed or parallel set-up in script?: yes
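Before submitting the job, I can run a quick sanity check on the compute node to confirm PyTorch sees both cards (just a sketch; it assumes the same `CUDA_VISIBLE_DEVICES=0,1` that the job script below exports):

```python
# Check that PyTorch sees both V100s and can create a CUDA context on each.
import torch

print(torch.__version__, torch.version.cuda)
print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    x = torch.zeros(1, device=i)  # forces a CUDA context on device i
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> {x.device} ok")
```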
I encountered some bugs when running the different shell scripts.
agnews.sh
script
#!/bin/sh
#BSUB -gpu "num=2:mode=exclusive_process"
#BSUB -n 2
#BSUB -q gpu
#BSUB -o LOTClass.out
#BSUB -e LOTClass.err
#BSUB -J LOTClass
#BSUB -R "rusage[ngpus_physical=2]"
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
python src/train.py --dataset_dir datasets/agnews/agnews_data/ \
    --label_names_file label_names.txt \
    --train_file train.txt \
    --test_file test.txt \
    --test_label_file test_labels.txt \
    --max_len 200 \
    --train_batch_size 32 \
    --accum_steps 2 \
    --eval_batch_size 64 \
    --gpus 2 \
    --mcp_epochs 3 \
    --self_train_epochs 1
OUTPUT
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/agnews_data/', dist_port=12345, early_stop=False, eval_batch_size=64, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Loading encoded texts from datasets/agnews/agnews_data/train.pt
Loading texts with label names from datasets/agnews/agnews_data/label_name_data.pt
Loading encoded texts from datasets/agnews/agnews_data/test.pt
Loading category vocabulary from datasets/agnews/agnews_data/category_vocab.pt
Class 0 category vocabulary: ['politics', 'political', 'politicians', 'Politics', 'government', 'elections', 'issues', 'history', 'democracy', 'affairs', 'policy', 'politically', 'politician', 'society', 'policies', 'voters', 'people', 'debate', 'election', 'culture', 'economics', 'forces', 'relations', 'governance', 'parliament', 'leadership', 'campaign', 'problems', 'opposition', 'military', 'movements', 'diplomacy', 'war', 'polls', 'congress', 'campaigning', 'nature', 'dynamics', 'debates', 'taxes', 'struggles', 'control', 'campaigns', 'economy', 'officials', 'ideology', 'leaders', 'religion', 'geography', 'state', 'Congress', 'wars', 'corruption', 'roads', 'territory', 'voting', 'climate', 'agriculture', 'balance']
Class 1 category vocabulary: ['sports', 'sport', 'Sports', 'sporting', 'soccer', 'athletics', 'athletic', 'baseball', 'hockey', 'basketball', 'regional', 'travel', 'matches', 'coaches', 'youth', 'Sport', 'health', 'teams', 'recreational', 'team', 'medical', 'match', 'cultural', 'gaming', 'play', 'golf', 'local', 'outdoor', 'tennis', 'schools', 'league', 'radio', 'stadium', 'recreation', 'activities', 'transportation', 'club', 'wrestling', 'rugby', 'everything', 'training', 'fields', 'city', 'fans', 'leagues', 'school', 'safety', 'national', 'aquatic', 'summer', 'track', 'air', 'letters', 'rules', 'championship', 'racing', 'grounds', 'pro', 'arts', 'leisure', 'great', 'clubs', 'broadcast']
Class 2 category vocabulary: ['business', 'trade', 'Business', 'businesses', 'trading', 'commercial', 'market', 'enterprise', 'corporate', 'financial', 'sales', 'commerce', 'job', 'shop', 'economic', 'professional', 'world', 'operation', 'family', 'name', 'line', 'career', 'retail', 'firm', 'operations', 'marketing', 'good', 'work', 'private', 'personal', 'chain', 'time', 'group', 'division', 'investment', 'industrial', 'house', 'side', 'companies', 'store', 'global', 'task', 'consumer', 'shopping', 'street', 'property', 'special', 'merchant', 'part', 'department', 'town', 'real', 'traffic', 'space', 'concern', 'selling']
Class 3 category vocabulary: ['technology', 'technologies', 'Technology', 'tech', 'technological', 'equipment', 'device', 'innovation', 'system', 'information', 'generation', 'infrastructure', 'phone', 'devices', 'energy', 'capability', 'concept', 'systems', 'computer', 'hardware', 'technique', 'Internet', 'design', 'program', 'protocol', 'ability', 'technical', 'platform', 'digital', 'knowledge', 'content', 'method', 'techniques', 'strategy', 'material', 'internet', 'Tech', 'web', 'development', 'invention', 'feature', 'IT', 'project', 'facility', 'intelligence', 'process', 'card', 'wireless', 'car', 'format', 'concepts', 'gene', 'model', 'features', 'smart', 'app', 'computers', 'machine', 'also', 'talent', 'solution', 'idea', 'speed', 'algorithm', 'style']
Loading model trained via masked category prediction from datasets/agnews/agnews_data/mcp_model.pt
Start self-training.
PS:
Read file <LOTClass.err> for stderr output of this job.
ERRORS
2021-01-29 11:43:51.382337: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at bert-base-cased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-cased/ and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2021-01-29 11:44:04.598968: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-01-29 11:44:08.260233: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
File "src/train.py", line 66, in <module>
main()
File "src/train.py", line 59, in main
trainer.self_train(epochs=args.self_train_epochs, loader_name=args.final_model)
File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 566, in self_train
mp.spawn(self.self_train_dist, nprocs=self.world_size, args=(epochs, loader_name))
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 531, in self_train_dist
model = self.set_up_dist(rank)
File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 69, in set_up_dist
model = self.model.to(rank)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 607, in to
return self._apply(convert)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 376, in _apply
param_applied = fn(param)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 605, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I printed `nvidia-smi` for both GPUs, and no other programs were occupying GPU resources, which has confused me all day. I have tried different approaches but still get the same problem.
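To isolate the failure: the traceback dies in trainer.set_up_dist at model.to(rank) inside mp.spawn, and I wonder whether the `mode=exclusive_process` requested in the `#BSUB -gpu` directive interacts badly with the spawned workers. Here is a minimal repro of just that path (a sketch, assuming two visible GPUs; if this also fails, the problem is the node setup rather than LOTClass):

```python
# Minimal repro of the failing path: trainer.set_up_dist calls model.to(rank)
# inside mp.spawn, so spawn one process per GPU and move a tensor onto each.
import torch
import torch.multiprocessing as mp

def worker(rank):
    t = torch.zeros(1).to(rank)  # the same call pattern that raises above
    print(f"rank {rank}: context created on {torch.cuda.get_device_name(rank)}")

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2)
```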
amazon.sh
script
Same parameters as yours; the script looks like agnews.sh.
OUTPUT
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/amazon/amazon_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/amazon/amazon_data/train.pt
Loading texts with label names from datasets/amazon/amazon_data/label_name_data.pt
Reading texts from datasets/amazon/amazon_data/test.txt
Converting texts into tensors.
PS:
Read file <LOTClass.err> for stderr output of this job.
ERRORS
The same as #8:
RuntimeError: The task could not be sent to the workers as it is too large for
send_bytes.
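If I read it right, this message means the worker transport refuses to pickle a task payload that is too large for a single `send_bytes` call, so one workaround sketch would be to feed the corpus to the workers in smaller chunks (`encode_chunk` below is a hypothetical stand-in for the real per-chunk tokenization, not the repo's actual function):

```python
# Hedged workaround sketch: split the corpus before the parallel encode so no
# single task payload exceeds the send_bytes limit.
from joblib import Parallel, delayed

def encode_chunk(chunk):
    # hypothetical stand-in for tokenizing one chunk of documents
    return [len(doc) for doc in chunk]

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

docs = ["some review text"] * 100_000  # stands in for the Amazon test set
parts = Parallel(n_jobs=4)(delayed(encode_chunk)(c) for c in chunked(docs, 5_000))
encoded = [item for part in parts for item in part]
```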
imdb.sh
script
Same parameters as yours; the script looks like agnews.sh.
OUTPUT
Namespace(accum_steps=8, category_vocab_size=100, dataset_dir='datasets/imdb/imdb_data/', dist_port=12345, early_stop=False, eval_batch_size=32, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=512, mcp_epochs=4, out_file='out.txt', self_train_epochs=4.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=8, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/imdb/imdb_data/train.pt
Loading texts with label names from datasets/imdb/imdb_data/label_name_data.pt
Loading encoded texts from datasets/imdb/imdb_data/test.pt
Contructing category vocabulary.
Class 0 category vocabulary: ['bad', 'Bad', 'wrong', 'nasty', 'worst', 'badly', 'negative', 'sad', 'sorry', 'rotten', 'low', 'violent', 'weird', 'dark', 'shit', 'crazy', 'dirty', 'serious', 'sick', 'small', 'stupid', 'scary', 'dumb', 'much', 'gross', 'foul', 'dangerous', 'crap', 'mixed', 'fast', 'sour', 'miserable', 'severe', 'lost', 'hit', 'dreadful', 'trouble', 'gone']
Class 1 category vocabulary: ['good', 'excellent', 'high', 'Good', 'wonderful', 'amazing', 'fantastic', 'fair', 'positive', 'sure', 'sound', 'quality', 'light', 'solid', 'brilliant', 'awesome', 'smart', 'happy', 'bright', 'safe', 'true', 'clean', 'rich', 'successful', 'full', 'special', 'fun', 'popular', 'sweet', 'superior', 'simple', 'average', 'superb', 'normal', 'important', 'love', 'cool', 'quick', 'easy', 'whole', 'hot', 'interesting', 'damn']
Preparing self supervision for masked category prediction.
Number of documents with category indicative terms found for each category is: {0: 873, 1: 828}
There are totally 1701 documents with category indicative terms.
Training model via masked category prediction.
Epoch 1:
Average training loss: 0.5981351137161255
Epoch 2:
Average training loss: 0.23333437740802765
Epoch 3:
Average training loss: 0.09165686368942261
Epoch 4:
Average training loss: 0.056073933839797974
Start self-training.
PS:
Read file <LOTClass.err> for stderr output of this job.
ERRORS
The same as agnews.sh:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
dbpedia.sh
script
Same parameters as yours; the script looks like agnews.sh.
OUTPUT
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/dbpedia/dbpedia_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['company'], 1: ['school', 'university'], 2: ['artist'], 3: ['athlete'], 4: ['politics'], 5: ['transportation'], 6: ['building'], 7: ['river', 'mountain', 'lake'], 8: ['village'], 9: ['animal'], 10: ['plant', 'tree'], 11: ['album'], 12: ['film'], 13: ['novel', 'publication', 'book']}
Loading encoded texts from datasets/dbpedia/dbpedia_data/train.pt
Loading texts with label names from datasets/dbpedia/dbpedia_data/label_name_data.pt
Reading texts from datasets/dbpedia/dbpedia_data/test.txt
Converting texts into tensors.
ERRORS
The same error as #8:
RuntimeError: The task could not be sent to the workers as it is too large for
send_bytes.
Can you help me deal with these errors?
Sincerely,
Heisenberg