
yumeng5 / lotclass

[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach

License: Apache License 2.0

Shell 11.82% Python 88.18%
text-classification weakly-supervised-learning language-model

lotclass's Introduction

LOTClass

The source code used for Text Classification Using Label Names Only: A Language Model Self-Training Approach, published in EMNLP 2020.

Requirements

At least one GPU is required to run the code.

Before running, you need to first install the required packages with the following command:

$ pip3 install -r requirements.txt

Also, you need to download the stopwords from the NLTK library:

import nltk
nltk.download('stopwords')

Python 3.6 or above is strongly recommended; using older Python versions might lead to package incompatibility issues.

Reproducing the Results

We provide four get_data.sh scripts for downloading the datasets used in the paper under datasets, and four training bash scripts (agnews.sh, dbpedia.sh, imdb.sh and amazon.sh) for running the model on the four datasets.

Note: Our model does not use training labels; we provide the training/test set ground truth labels only for completeness and evaluation.

The training bash scripts assume you have two 10GB GPUs. If you have a different number of GPUs, or GPUs with different memory sizes, refer to the next section for how to change the following command line arguments appropriately (while keeping the other arguments unchanged): train_batch_size, accum_steps, eval_batch_size and gpus.
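
For example, with the assumed two-GPU setup, reproducing the AG News result amounts to downloading the data and then running the corresponding training script (treat the exact location of get_data.sh under datasets/agnews as an assumption based on the description above):

# Download the AG News dataset (get_data.sh is assumed to sit inside the dataset directory)
cd datasets/agnews && sh get_data.sh && cd ../..

# Train LOTClass on AG News with the provided script
sh agnews.sh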

Command Line Arguments

The meanings of the command line arguments will be displayed upon typing

python src/train.py -h

The following arguments directly affect the performance of the model and need to be set carefully:

  • train_batch_size, accum_steps, gpus: These three arguments should be set together. You need to make sure that the effective training batch size, calculated as train_batch_size * accum_steps * gpus, is around 128. For example, if you have 4 GPUs, you can set train_batch_size = 32, accum_steps = 1, gpus = 4; if you have 1 GPU, you can set train_batch_size = 32, accum_steps = 4, gpus = 1. If your GPUs have different memory sizes, you might need to change train_batch_size while adjusting accum_steps and gpus at the same time to keep the effective training batch size around 128 (see the example invocation after this list).
  • eval_batch_size: This argument only affects the speed of the algorithm; use as large an evaluation batch size as your GPUs can hold.
  • max_len: This argument controls the maximum length of documents fed into the model (longer documents will be truncated). Ideally, max_len should be set to the length of the longest document (max_len cannot be larger than 512 under the BERT architecture), but a larger max_len also consumes more GPU memory, resulting in a smaller batch size and longer training time. Therefore, you can trade model accuracy for faster training by reducing max_len.
  • mcp_epochs, self_train_epochs: These control how many epochs to train the model on the masked category prediction task and the self-training task, respectively. Setting mcp_epochs = 3, self_train_epochs = 1 is a good starting point for most datasets, but you may increase them if your dataset is small (fewer than 100,000 documents).

Other arguments can be kept as their default values.
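
For instance, on a hypothetical single GPU with about 11GB of memory, you might invoke the training script directly as below; the specific values are only an illustration, and the key point is that train_batch_size * accum_steps * gpus stays around 128 (here 16 * 8 * 1 = 128):

python src/train.py --dataset_dir datasets/agnews/ \
                    --label_names_file label_names.txt \
                    --train_file train.txt \
                    --test_file test.txt --test_label_file test_labels.txt \
                    --max_len 200 \
                    --train_batch_size 16 --accum_steps 8 --gpus 1 \
                    --eval_batch_size 64 \
                    --mcp_epochs 3 --self_train_epochs 1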

Running on New Datasets

To execute the code on a new dataset, you need to

  1. Create a directory named your_dataset under datasets.
  2. Prepare a text corpus train.txt (one document per line) under your_dataset for training the classification model (no document labels are needed; an example layout is sketched after this list).
  3. Prepare a label name file label_names.txt under your_dataset (each line contains the label name of one category; if multiple words are used as the label name of a category, put them in the same line and separate them with whitespace characters).
  4. (Optional) You can choose to provide a test corpus test.txt (one document per line) with ground truth labels test_labels.txt (each line contains an integer denoting the category index of the corresponding document, index starts from 0 and the order must be consistent with the category order in label_names.txt). If the test corpus is provided, the code will write classification results to out.txt under your_dataset once the training is complete. If the ground truth labels of the test corpus are provided, test accuracy will be displayed during self-training, which is useful for hyperparameter tuning and model cherry-picking using a small test set.
  5. Run the code with appropriate command line arguments (I recommend creating a new bash script by referring to the four example scripts).
  6. The final trained classification model will be saved as final_model.pt under your_dataset.
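
Putting steps 1-4 together, below is a minimal sketch of preparing a new dataset; the dataset name news_topics and the example documents and label names are made up for illustration and are not shipped with the repository:

mkdir -p datasets/news_topics

# train.txt: one unlabeled document per line (no labels needed)
printf '%s\n' "The mayor announced a new budget plan today." \
              "The striker scored twice in the final match." > datasets/news_topics/train.txt

# label_names.txt: one category per line; multi-word label names are separated by whitespace
printf '%s\n' "politics government" "sports" > datasets/news_topics/label_names.txt

# Optional: test documents and their 0-indexed ground truth labels (same category order as label_names.txt)
printf '%s\n' "Parliament voted on the new law." "The team won the championship." > datasets/news_topics/test.txt
printf '%s\n' 0 1 > datasets/news_topics/test_labels.txt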

Note: The code will cache intermediate data and model checkpoints as .pt files under your dataset directory for continued training. If you change your training corpus or label names and re-run the code, you will need to first delete all .pt files to prevent the code from loading old results.
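
For example, after changing the training corpus or label names of a dataset stored under datasets/your_dataset, the cached files could be removed with:

rm datasets/your_dataset/*.pt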

You can always refer to the example datasets when preparing your own datasets.

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2020text,
  title={Text Classification Using Label Names Only: A Language Model Self-Training Approach},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Xiong, Chenyan and Ji, Heng and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
  year={2020},
}

lotclass's People

Contributors

oztalha, yumeng5


lotclass's Issues

Category Vocabulary Creation

Hi,

How do you create the category vocabulary for label names that do not appear in the training examples?

Thanks for sharing your work!

EnvironmentError when running code

certifi 2020.12.5
chardet 4.0.0
click 7.1.2
dataclasses 0.8
filelock 3.0.12
future 0.18.2
idna 2.10
joblib 1.0.0
nltk 3.5
numpy 1.19.5
packaging 20.8
pip 21.0
pyparsing 2.4.7
regex 2020.11.13
requests 2.25.1
sacremoses 0.0.43
sentencepiece 0.1.95
setuptools 36.4.0
six 1.15.0
tokenizers 0.8.1rc2
torch 1.5.0
tqdm 4.56.0
transformers 3.3.1
urllib3 1.26.3
wheel 0.29.0

I ran "sh agnews.sh" and got the following errors. Could you help check it?
Namespace(accum_steps=4, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Downloading: 100%|###########################################################################################################################################################| 232k/232k [13:44<00:00, 281B/s]
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Traceback (most recent call last):
File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/transformers/configuration_utils.py", line 359, in get_config_dict
raise EnvironmentError
OSError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "src/train.py", line 66, in
main()
File "src/train.py", line 53, in main
trainer = LOTClassTrainer(args)
File "/root/app/LOTClass/src/trainer.py", line 51, in init
num_labels=self.num_class)
File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/transformers/modeling_utils.py", line 854, in from_pretrained
**kwargs,
File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/transformers/configuration_utils.py", line 315, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/transformers/configuration_utils.py", line 368, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'bert-base-uncased'. Make sure that:

  • 'bert-base-uncased' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'bert-base-uncased' is the correct path to a directory containing a config.json file

[Improvement] More post-processing for the category vocabulary

Hi, I really enjoyed reading your paper and the code quality is very impressive as well!

I'm trying to reproduce the experiments on the DBpedia dataset. Below is the category vocabulary for the seed word village.

Class 8 category vocabulary: ['village', 'villages', 'settlement', 'town', 'east', 'population', 'rural', 'municipality', 'parish', 'na', 'temple', 'pa', 'commune', 'pre', 'ha', 'north', 'hamlet', 'settlements', 'chamber', 'administrative', 'neighbourhood', 'township', 'lies', 'camp', 'locality', 'os', 'villagers', 'iran', 'nest', 'se', 'neighborhood', 'living', 'daily', 'junction', 'palace', 'county', 'crossing', 'south', 'approximately', 'garde', 'market', 'il', 'far', 'reared', 'romanized', 'non', 'west', 'right', 'court', 'wa', 'km', 'hen']

I notice there's still quite a bit of noise like "pa" and "il", which I suppose are state abbreviations. I wonder how the end-to-end text classification accuracy would be affected if we removed that noise. Some additional filtering you could consider besides stop words:

  1. Filter out words that are too short. Words with 2-3 characters are usually noisy or ambiguous.
  2. In my previous work, I kept only nouns for topic classification and adjectives for sentiment classification. There are definitely exceptions, but it's a tradeoff between keyword quality and coverage.

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Hi @yumeng5 !
A complete and intuitive work!

I have been following your work these past couple of days. My own Windows 10 PC has only one Quadro P2000 GPU, and those problems were solved following issue #2. Thanks a lot for your guidance!

But the GPU is too small to train at a reasonable speed with a large enough batch size, so I use my school's GPU cluster, which runs IBM's LSF system.

Environment Info

  • transformers version: 3.4.0
  • Platform: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-58-generic x86_64)
  • Python version:3.7.4
  • PyTorch version (GPU?): 1.6.0
  • GPU: Tesla V100S-PCIE-32GB * 2
  • Using distributed or parallel set-up in script?: yes

I encountered some bugs when running the different shell scripts.

agnews.sh

scripts

#!/bin/sh
#BSUB -gpu "num=2:mode=exclusive_process"
#BSUB -n 2
#BSUB -q gpu
#BSUB -o LOTClass.out
#BSUB -e LOTClass.err
#BSUB -J LOTClass
#BSUB -R "rusage[ngpus_physical=2]"

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1

python src/train.py --dataset_dir datasets/agnews/agnews_data/ \
--label_names_file label_names.txt \
--train_file train.txt \
--test_file test.txt \
--test_label_file test_labels.txt \
--max_len 200 \
--train_batch_size 32 \
--accum_steps 2 \
--eval_batch_size 64 \
--gpus 2 \
--mcp_epochs 3 \
--self_train_epochs 1 \

OUTPUT

Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/agnews_data/', dist_port=12345, early_stop=False, eval_batch_size=64, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Loading encoded texts from datasets/agnews/agnews_data/train.pt
Loading texts with label names from datasets/agnews/agnews_data/label_name_data.pt
Loading encoded texts from datasets/agnews/agnews_data/test.pt
Loading category vocabulary from datasets/agnews/agnews_data/category_vocab.pt
Class 0 category vocabulary: ['politics', 'political', 'politicians', 'Politics', 'government', 'elections', 'issues', 'history', 'democracy', 'affairs', 'policy', 'politically', 'politician', 'society', 'policies', 'voters', 'people', 'debate', 'election', 'culture', 'economics', 'forces', 'relations', 'governance', 'parliament', 'leadership', 'campaign', 'problems', 'opposition', 'military', 'movements', 'diplomacy', 'war', 'polls', 'congress', 'campaigning', 'nature', 'dynamics', 'debates', 'taxes', 'struggles', 'control', 'campaigns', 'economy', 'officials', 'ideology', 'leaders', 'religion', 'geography', 'state', 'Congress', 'wars', 'corruption', 'roads', 'territory', 'voting', 'climate', 'agriculture', 'balance']

Class 1 category vocabulary: ['sports', 'sport', 'Sports', 'sporting', 'soccer', 'athletics', 'athletic', 'baseball', 'hockey', 'basketball', 'regional', 'travel', 'matches', 'coaches', 'youth', 'Sport', 'health', 'teams', 'recreational', 'team', 'medical', 'match', 'cultural', 'gaming', 'play', 'golf', 'local', 'outdoor', 'tennis', 'schools', 'league', 'radio', 'stadium', 'recreation', 'activities', 'transportation', 'club', 'wrestling', 'rugby', 'everything', 'training', 'fields', 'city', 'fans', 'leagues', 'school', 'safety', 'national', 'aquatic', 'summer', 'track', 'air', 'letters', 'rules', 'championship', 'racing', 'grounds', 'pro', 'arts', 'leisure', 'great', 'clubs', 'broadcast']

Class 2 category vocabulary: ['business', 'trade', 'Business', 'businesses', 'trading', 'commercial', 'market', 'enterprise', 'corporate', 'financial', 'sales', 'commerce', 'job', 'shop', 'economic', 'professional', 'world', 'operation', 'family', 'name', 'line', 'career', 'retail', 'firm', 'operations', 'marketing', 'good', 'work', 'private', 'personal', 'chain', 'time', 'group', 'division', 'investment', 'industrial', 'house', 'side', 'companies', 'store', 'global', 'task', 'consumer', 'shopping', 'street', 'property', 'special', 'merchant', 'part', 'department', 'town', 'real', 'traffic', 'space', 'concern', 'selling']

Class 3 category vocabulary: ['technology', 'technologies', 'Technology', 'tech', 'technological', 'equipment', 'device', 'innovation', 'system', 'information', 'generation', 'infrastructure', 'phone', 'devices', 'energy', 'capability', 'concept', 'systems', 'computer', 'hardware', 'technique', 'Internet', 'design', 'program', 'protocol', 'ability', 'technical', 'platform', 'digital', 'knowledge', 'content', 'method', 'techniques', 'strategy', 'material', 'internet', 'Tech', 'web', 'development', 'invention', 'feature', 'IT', 'project', 'facility', 'intelligence', 'process', 'card', 'wireless', 'car', 'format', 'concepts', 'gene', 'model', 'features', 'smart', 'app', 'computers', 'machine', 'also', 'talent', 'solution', 'idea', 'speed', 'algorithm', 'style']

Loading model trained via masked category prediction from datasets/agnews/agnews_data/mcp_model.pt

Start self-training.
PS:
Read file <LOTClass.err> for stderr output of this job.

ERRORS

2021-01-29 11:43:51.382337: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at bert-base-cased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
  • This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-cased/ and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    2021-01-29 11:44:04.598968: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    2021-01-29 11:44:08.260233: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
    Traceback (most recent call last):
    File "src/train.py", line 66, in
    main()
    File "src/train.py", line 59, in main
    trainer.self_train(epochs=args.self_train_epochs, loader_name=args.final_model)
    File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 566, in self_train
    mp.spawn(self.self_train_dist, nprocs=self.world_size, args=(epochs, loader_name))
    File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
    File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
    File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
    Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 531, in self_train_dist
model = self.set_up_dist(rank)
File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 69, in set_up_dist
model = self.model.to(rank)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 607, in to
return self._apply(convert)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
module._apply(fn)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 376, in _apply
param_applied = fn(param)
File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 605, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

I checked "nvidia-smi" on both GPUs, and no other programs were taking up GPU resources, which has confused me all day. I tried different approaches but still get the same problem.

amazon.sh

scripts

The parameters are the same as yours, similar to agnews.sh.

OUTPUT

Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/amazon/amazon_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/amazon/amazon_data/train.pt
Loading texts with label names from datasets/amazon/amazon_data/label_name_data.pt
Reading texts from datasets/amazon/amazon_data/test.txt
Converting texts into tensors.
PS:
Read file <LOTClass.err> for stderr output of this job.

ERRORS

The same as #8 .
RuntimeError: The task could not be sent to the workers as it is too large for send_bytes.

imdb.sh

scripts

The parameters are the same as yours, similar to agnews.sh.

OUTPUT

Namespace(accum_steps=8, category_vocab_size=100, dataset_dir='datasets/imdb/imdb_data/', dist_port=12345, early_stop=False, eval_batch_size=32, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=512, mcp_epochs=4, out_file='out.txt', self_train_epochs=4.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=8, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/imdb/imdb_data/train.pt
Loading texts with label names from datasets/imdb/imdb_data/label_name_data.pt
Loading encoded texts from datasets/imdb/imdb_data/test.pt
Contructing category vocabulary.
Class 0 category vocabulary: ['bad', 'Bad', 'wrong', 'nasty', 'worst', 'badly', 'negative', 'sad', 'sorry', 'rotten', 'low', 'violent', 'weird', 'dark', 'shit', 'crazy', 'dirty', 'serious', 'sick', 'small', 'stupid', 'scary', 'dumb', 'much', 'gross', 'foul', 'dangerous', 'crap', 'mixed', 'fast', 'sour', 'miserable', 'severe', 'lost', 'hit', 'dreadful', 'trouble', 'gone']

Class 1 category vocabulary: ['good', 'excellent', 'high', 'Good', 'wonderful', 'amazing', 'fantastic', 'fair', 'positive', 'sure', 'sound', 'quality', 'light', 'solid', 'brilliant', 'awesome', 'smart', 'happy', 'bright', 'safe', 'true', 'clean', 'rich', 'successful', 'full', 'special', 'fun', 'popular', 'sweet', 'superior', 'simple', 'average', 'superb', 'normal', 'important', 'love', 'cool', 'quick', 'easy', 'whole', 'hot', 'interesting', 'damn']

Preparing self supervision for masked category prediction.
Number of documents with category indicative terms found for each category is: {0: 873, 1: 828}
There are totally 1701 documents with category indicative terms.

Training model via masked category prediction.
Epoch 1:
Average training loss: 0.5981351137161255
Epoch 2:
Average training loss: 0.23333437740802765
Epoch 3:
Average training loss: 0.09165686368942261
Epoch 4:
Average training loss: 0.056073933839797974

Start self-training.

PS:

Read file <LOTClass.err> for stderr output of this job.

ERROR

The same as agnews.sh:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

dbpedia.sh

scripts

The parameters are the same as yours, similar to agnews.sh.

OUTPUT

Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/dbpedia/dbpedia_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['company'], 1: ['school', 'university'], 2: ['artist'], 3: ['athlete'], 4: ['politics'], 5: ['transportation'], 6: ['building'], 7: ['river', 'mountain', 'lake'], 8: ['village'], 9: ['animal'], 10: ['plant', 'tree'], 11: ['album'], 12: ['film'], 13: ['novel', 'publication', 'book']}
Loading encoded texts from datasets/dbpedia/dbpedia_data/train.pt
Loading texts with label names from datasets/dbpedia/dbpedia_data/label_name_data.pt
Reading texts from datasets/dbpedia/dbpedia_data/test.txt
Converting texts into tensors.

ERROR

The same error as #8 .
RuntimeError: The task could not be sent to the workers as it is too large for send_bytes.

Can you help me deal with these errors?

Sincerely,
Heisenberg

Some issues for duplicating the paper results

Hi,

Thanks for open sourcing this.

When trying to reproduce the results described in the paper, I encountered some issues. Any advice would be much appreciated.

Taking the AGNews dataset as an example,

  1. About MCP Data Generation. The default code indicates that if there are 20/50 overlapping indicative words, a document becomes pseudo-labeled for MCP training. However, when first running the category_vocabulary method, there will be fewer than 50 words (as specified by args.category_vocab_size) for each label, since sorted_dict = {k:v for k, v in sorted(cat_dict.items(), key=lambda item: item[1], reverse=True)[:category_vocab_size]} in filter_keywords (see this line) is run at the beginning, before the word filtering starts. Because of this, I changed valid_idx = (match_count > match_threshold) & (input_mask > 0) in this line to valid_idx = (match_count > len(category_vocab) * match_threshold / top_pred_num) & (input_mask > 0). Without this change, I got 0 pseudo-labeled documents for MCP; with it, I get thousands. Hence, with this change, I moved on to the next step in the pipeline, MCP training. Did I miss something, given that it works on your side without this change?

  2. About MCP Training. With the pseudo-labeled documents generated from the last step, I successfully started the MCP training. Judging by the loss below, nothing seemed wrong.

Training model via masked category prediction.
Epoch 1:
100%|██████████| 873/873 [01:42<00:00,  8.51it/s]
Average training loss: 0.9677731394767761
  0%|          | 0/873 [00:00<?, ?it/s]Epoch 2:
100%|██████████| 873/873 [01:42<00:00,  8.49it/s]
Average training loss: 0.4305942952632904
  0%|          | 0/873 [00:00<?, ?it/s]Epoch 3:
100%|██████████| 873/873 [01:42<00:00,  8.49it/s]
Average training loss: 0.31928300857543945

However, some issues popped up when I was attempting to run the MCP-trained model at this stage for inference on the test set. The inference method always returns the preds with the same probability distribution across different examples (different input ids). Something like this:

input_ids, input_mask, preds = self.inference(model, dataset_loader, rank, return_type="data")
print(preds)

tensor([[0.1776, 0.1679, 0.3113, 0.3432],
        [0.1776, 0.1679, 0.3113, 0.3432],
        [0.1776, 0.1679, 0.3113, 0.3432],
        ...,
        [0.1776, 0.1679, 0.3113, 0.3432],
        [0.1776, 0.1679, 0.3113, 0.3432],
        [0.1776, 0.1679, 0.3113, 0.3432]], device='cuda:0')

As a result, the MCP-trained model always gives random-guess mean accuracy (0.25 in the AGNews case). I am wondering whether you have the same issue or I am missing something. Much appreciation for any pointers.

  3. About Self Train. Trying to work around the issue described in step 2, I hacked the code. The preds are the softmax probabilities over the logits of BERT's [CLS] last hidden state, so I was wondering whether self_train would solve the issue since it takes the [CLS] token into account. Unfortunately, the training still failed; the training log is as follows.
100%|██████████| 1600/1600 [03:04<00:00,  8.68it/s]
lr: 5.336e-07
Average training loss: 0.05783357098698616
Test acc: 0.25
100%|██████████| 1600/1600 [03:05<00:00,  8.63it/s]
lr: 9.925e-07
Average training loss: 0.016020258888602257
Test acc: 0.25
100%|██████████| 1600/1600 [03:05<00:00,  8.61it/s]
lr: 9.332e-07
Average training loss: 0.0071106418035924435
Test acc: 0.25

I have been stuck on these issues after many hours of hacking on the code and have ended up with no solution. It would be really helpful for my research if there were some pointers or guides for solving these issues. I am looking forward to reproducing the 0.864 test accuracy on AGNews as reported in the paper. Thanks.

FYI, my training environment is:

  • Since I only have a budget-limited GPU on my desktop, I ran the experiments on a single 6GB 2060 GPU. To fit into memory, I set the batch size to 8 with accumulation steps of 16. I kept the rest of the hyperparameters the same as in agnews.sh.

Bug exists when using the original data and codes

Is the code restricted to Linux servers, or does it also work on Windows?
A bug occurs even when the package versions shown in the requirements are installed and the original data and code are used, as shown below:

Reading texts from D:/LOTClass-master/datasets/agnews/agnews_data/train.txt
Converting texts into tensors.
Traceback (most recent call last):

File "", line 1, in
runfile('D:/LOTClass-master/src/train.py', wdir='D:/LOTClass-master/src')

File "E:\Program Files\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)

File "E:\Program Files\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "D:/LOTClass-master/src/train.py", line 66, in
main()

File "D:/LOTClass-master/src/train.py", line 53, in main
trainer = LOTClassTrainer(args)

File "D:\LOTClass-master\src\trainer.py", line 52, in init
self.read_data(args.dataset_dir, args.train_file, args.test_file, args.test_label_file)

File "D:\LOTClass-master\src\trainer.py", line 199, in read_data
find_label_name=True, label_name_loader_name="label_name_data.pt")

File "D:\LOTClass-master\src\trainer.py", line 109, in create_dataset
results = Parallel(n_jobs=self.num_cpus)(delayed(self.encode)(docs=chunk) for chunk in chunks)

File "E:\Program Files\Anaconda3\lib\site-packages\joblib\parallel.py", line 1042, in call
self.retrieve()

File "E:\Program Files\Anaconda3\lib\site-packages\joblib\parallel.py", line 921, in retrieve
self._output.extend(job.get(timeout=self.timeout))

File "E:\Program Files\Anaconda3\lib\site-packages\joblib_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)

File "E:\Program Files\Anaconda3\lib\concurrent\futures_base.py", line 432, in result
return self.__get_result()

File "E:\Program Files\Anaconda3\lib\concurrent\futures_base.py", line 384, in __get_result
raise self._exception

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

In the w/o self-training baseline, how are the final test predictions made if [CLS] hasn't been trained?

Hi! Thanks for presenting such an interesting idea and work. I have a small question about the method. For the w/o self-training baseline, is the language model trained on those documents that contain category words? If not, the embedding of the [CLS] token hasn't been tuned for the classification task, making it hard to predict a test document that doesn't contain category words.

CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

Hi,
Beautiful paper...
I need to reproduce your results in order to use them for my purposes.

The installed packages:

Package | Version | Latest Version
boto3 | 1.17.30 | 1.17.30
botocore | 1.20.30 | 1.20.30
certifi | 2020.12.5 | 2020.12.5
chardet | 4.0.0 | 4.0.0
click | 7.1.2 | 7.1.2
filelock | 3.0.12 | 3.0.12
idna | 2.10 | 3.1
jmespath | 0.10.0 | 0.10.0
joblib | 1.0.1 | 1.0.1
nltk | 3.5 | 3.5
numpy | 1.20.1 | 1.20.1
packaging | 20.9 | 20.9
pip | 21.0.1 | 21.0.1
pyparsing | 2.4.7 | 2.4.7
python-dateutil | 2.8.1 | 2.8.1
regex | 2021.3.17 | 2021.3.17
requests | 2.25.1 | 2.25.1
s3transfer | 0.3.4 | 0.3.4
sacremoses | 0.0.43 | 0.0.43
sentencepiece | 0.1.95 | 0.1.95
setuptools | 54.1.2 | 54.1.2
six | 1.15.0 | 1.15.0
tokenizers | 0.8.1rc2 | 0.10.1
torch | 1.8.0 | 1.8.0
tqdm | 4.59.0 | 4.59.0
transformers | 3.3.1 | 4.4.1
typing-extensions | 3.7.4.3 | 3.7.4.3
urllib3 | 1.26.4 | 1.26.4

I am running it with pycharm-community-2020.3.4 on a Linux OS with one GPU: RTX 2080 Ti

I followed the instructions and ran the following bash:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1

DATASET=agnews
LABEL_NAME_FILE=label_names.txt
TRAIN_CORPUS=train.txt
TEST_CORPUS=test.txt
TEST_LABEL=test_labels.txt
MAX_LEN=200
TRAIN_BATCH=8
ACCUM_STEP=16
EVAL_BATCH=128
GPUS=1
MCP_EPOCH=3
SELF_TRAIN_EPOCH=1

python src/train.py --dataset_dir datasets/${DATASET}/ --label_names_file ${LABEL_NAME_FILE} \
                    --train_file ${TRAIN_CORPUS} \
                    --test_file ${TEST_CORPUS} --test_label_file ${TEST_LABEL} \
                    --max_len ${MAX_LEN} \
                    --train_batch_size ${TRAIN_BATCH} --accum_steps ${ACCUM_STEP} --eval_batch_size ${EVAL_BATCH} \
                    --gpus ${GPUS} \
                    --mcp_epochs ${MCP_EPOCH} --self_train_epochs ${SELF_TRAIN_EPOCH} \

However, I get the following error:

/.../LOTClass/agnews.sh
Namespace(accum_steps=16, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=8, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/agnews/train.pt
Loading texts with label names from datasets/agnews/label_name_data.pt
Loading encoded texts from datasets/agnews/test.pt
Contructing category vocabulary.
  0%|                                                                                                                                                                                                     | 0/62 [00:00<?, ?it/s]
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
Traceback (most recent call last):
  File "src/train.py", line 66, in <module>
    main()
  File "src/train.py", line 55, in main
    trainer.category_vocabulary(top_pred_num=args.top_pred_num, category_vocab_size=args.category_vocab_size)
  File "/.../LOTClass/src/trainer.py", line 295, in category_vocabulary
    mp.spawn(self.category_vocabulary_dist, nprocs=self.world_size, args=(top_pred_num, loader_name))
  File "/.../LOTClass/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/.../LOTClass/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/.../LOTClass/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 139, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

The same happens with "imdb.sh" and "dbpedia.sh".
Moreover, when I execute "amazon.sh", it gets killed!

/.../LOTClass/amazon.sh
Namespace(accum_steps=16, category_vocab_size=100, dataset_dir='datasets/amazon/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=0.1, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=8, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/amazon/train.pt
Loading texts with label names from datasets/amazon/label_name_data.pt
Reading texts from datasets/amazon/test.txt
Converting texts into tensors.
/.../LOTClass/amazon.sh: line 25:  9306 Killed                  python src/train.py --dataset_dir datasets/${DATASET}/ --label_names_file ${LABEL_NAME_FILE} --train_file ${TRAIN_CORPUS} --test_file ${TEST_CORPUS} --test_label_file ${TEST_LABEL} --max_len ${MAX_LEN} --train_batch_size ${TRAIN_BATCH} --accum_steps ${ACCUM_STEP} --eval_batch_size ${EVAL_BATCH} --gpus ${GPUS} --mcp_epochs ${MCP_EPOCH} --self_train_epochs ${SELF_TRAIN_EPOCH}

I would appreciate it if you could help me with that.

bus error when running the code

  • Red Hat 4.8.5-11
  • Four V100
  • Python3.6
  • torch 1.7.1
  • transformer 3.3.1

I ran "sh agnews.sh" and got the following errors; I wonder if it is due to the multiprocessing. Could you help check it?

`Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
  • This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased/ and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Loading encoded texts from datasets/agnews/train.pt
    Loading texts with label names from datasets/agnews/label_name_data.pt
    Loading encoded texts from datasets/agnews/test.pt
    Contructing category vocabulary.
    /home/hadoop-aipnlp/anaconda3/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
    len(cache))
    agnews.sh: line 25: 5931 Bus error (core dumped) python src/train.py --dataset_dir datasets/${DATASET}/ --label_names_file ${LABEL_NAME_FILE} --train_file ${TRAIN_CORPUS} --test_file ${TEST_CORPUS} --test_label_file ${TEST_LABEL} --max_len ${MAX_LEN} --train_batch_size ${TRAIN_BATCH} --accum_steps ${ACCUM_STEP} --eval_batch_size ${EVAL_BATCH} --gpus ${GPUS} --mcp_epochs ${MCP_EPOCH} --self_train_epochs ${SELF_TRAIN_EPOCH}`

Changing the pretrained model to a multilingual model

hello @yumeng5.
First of all, all my compliments to you and your team for the great results with this model.
I want to test it extensively, but to use it I need to change the pretrained model to a multilingual one (my task is in Italian).
I changed this part of model.py:

from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers.modeling_bert import BertOnlyMLMHead
from torch import nn
import sys

class LOTClassModel(AutoModelForMaskedLM):
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
    model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-cased")

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = model(config, add_pooling_layer=False)
        self.cls = BertOnlyMLMHead(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()
        # MLM head is not trained
        for param in self.cls.parameters():
            param.requires_grad = False

Then I changed this part of trainer.py:
self.pretrained_lm = 'dbmdz/bert-base-italian-cased'
self.tokenizer = AutoTokenizer.from_pretrained(self.pretrained_lm)
self.vocab = self.tokenizer.get_vocab()
self.vocab_size = len(self.vocab)
self.mask_id = self.vocab[self.tokeniz

And I get the error:

`Epoch 1:
  0%|                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 66, in <module>
    main()
  File "train.py", line 57, in main
    trainer.mcp(top_pred_num=args.top_pred_num, match_threshold=args.match_threshold, epochs=args.mcp_epochs)
  File "/comments_class/code/lotclass/src/trainer.py", line 454, in mcp
    mp.spawn(self.mcp_dist, nprocs=self.world_size, args=(epochs, loader_name))
  File "/my_env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/my_env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/my_env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "//my_env/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/comments_class/code/lotclass/src/trainer.py", line 423, in mcp_dist
    attention_mask=input_mask)
  File "/my_env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "//my_env/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "//my_env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/my_env/lib64/python3.6/site-packages/transformers/modeling_bert.py", line 1149, in forward
    assert kwargs == {}, f"Unexpected keyword arguments: {list(kwargs.keys())}."
AssertionError: Unexpected keyword arguments: ['pred_mode'].`

What am I doing wrong?
Thanks again.

Error running with test set

Hi! I receive the following error when trying this code on a sample dataset with 20 class labels. It works with 4 class labels, though! Can you help me figure out what the problem could be?

Start self-training.
Traceback (most recent call last):
File "/content/gdrive/MyDrive/TVEyesResearch/Codes/LOTClass/src/train.py", line 66, in
main()
File "/content/gdrive/MyDrive/TVEyesResearch/Codes/LOTClass/src/train.py", line 59, in main
trainer.self_train(epochs=args.self_train_epochs, loader_name=args.final_model)
File "/content/gdrive/.shortcut-targets-by-id/1RpJyUMfb0726y9lCj2gYMZpFOy1sWZdr/TVEyesResearch/Codes/LOTClass/src/trainer.py", line 566, in self_train
mp.spawn(self.self_train_dist, nprocs=self.world_size, args=(epochs, loader_name))
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/content/gdrive/.shortcut-targets-by-id/1RpJyUMfb0726y9lCj2gYMZpFOy1sWZdr/TVEyesResearch/Codes/LOTClass/src/trainer.py", line 532, in self_train_dist
test_dataset_loader = self.make_dataloader(rank, self.test_data, self.eval_batch_size) if self.with_test_label else None
File "/content/gdrive/.shortcut-targets-by-id/1RpJyUMfb0726y9lCj2gYMZpFOy1sWZdr/TVEyesResearch/Codes/LOTClass/src/trainer.py", line 223, in make_dataloader
dataset = TensorDataset(data_dict["input_ids"], data_dict["attention_masks"], data_dict["labels"])
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataset.py", line 158, in init
assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError

RuntimeError: No rendezvous handler for tcp://

Hi @yumeng5 😊

Exciting work. 👍 I'm trying to run the agnews.sh script you provided and get the error below. I only have one GPU; is the error caused by that?

Environment info

  • transformers version: 3.4.0
  • Platform: windows10
  • Python version:3.6.9
  • PyTorch version (GPU?): 1.7
  • GPU: 1080ti * 1
  • Using distributed or parallel set-up in script?: yes

agnews.sh

"""
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0

DATASET=agnews
LABEL_NAME_FILE=label_names.txt
TRAIN_CORPUS=train.txt
TEST_CORPUS=test.txt
TEST_LABEL=test_labels.txt
MAX_LEN=200
TRAIN_BATCH=32
ACCUM_STEP=4
EVAL_BATCH=128
GPUS=1
MCP_EPOCH=3
SELF_TRAIN_EPOCH=1
......
"""

Problem

"""
Administrator@it-202007061711 MINGW64 /e/PycharmProjects/CCF/reference/LOTClass-master
$ sh agnews.sh
Namespace(accum_steps=4, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_l
en=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Loading encoded texts from datasets/agnews/train.pt
    Loading texts with label names from datasets/agnews/label_name_data.pt
    Loading encoded texts from datasets/agnews/test.pt
    Contructing category vocabulary.
    Traceback (most recent call last):
    File "src/train.py", line 69, in
    main()
    File "src/train.py", line 56, in main
    trainer.category_vocabulary(top_pred_num=args.top_pred_num, category_vocab_size=args.category_vocab_size)
    File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 296, in category_vocabulary
    mp.spawn(self.category_vocabulary_dist, nprocs=self.world_size, args=(top_pred_num, loader_name))
    File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
    File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 157, in start_processes
    while not context.join():
    File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 118, in join
    raise Exception(msg)
    Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\multiprocessing\spawn.py", line 19, in _wrap
fn(i, *args)
File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 260, in category_vocabulary_dist
model = self.set_up_dist(rank)
File "E:\PycharmProjects\CCF\reference\LOTClass-master\src\trainer.py", line 67, in set_up_dist
rank=rank
File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\distributed_c10d.py", line 421, in init_process_group
init_method, rank, world_size, timeout=timeout
File "D:\Anaconda3\envs\BertWWMExt\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for tcp://
"""
Looking forward to your reply, thank you.

AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce `--match_threshold` (not recommend)

Hi,
I'm training my model with your framework. I got this error message:

Number of documents with category indicative terms found for each category is: {0: 9014, 1: 0, 2: 0, 3: 551, 4: 1478, 5: 20642, 6: 0, 7: 7429, 8: 8676, 9: 4814, 10: 1368, 11: 23, 12: 418}
Traceback (most recent call last):
File "src/train.py", line 66, in
main()
File "src/train.py", line 57, in main
trainer.mcp(top_pred_num=args.top_pred_num, match_threshold=args.match_threshold, epochs=args.mcp_epochs)
File "/home/xuanw/HL/LOTClass-master/src/trainer.py", line 451, in mcp
self.prepare_mcp(top_pred_num, match_threshold)
File "/home/xuanw/HL/LOTClass-master/src/trainer.py", line 392, in prepare_mcp
assert category_doc_num[i] > 10, f"Too few ({category_doc_num[i]}) documents with category indicative terms found for category {i}; "
AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce --match_threshold (not recommend)

But when I directly run the sh file again (with the dataset dir in the sh file replaced with mine), it runs successfully without any error. Will the result I get be correct? Does the previous error "affect" this result and make it wrong?

RuntimeError: The task could not be sent to the workers as it is too large for `send_bytes`.

Environment info

  • ubuntu 18.04

  • One Tesla V100-16G

  • Python3.6.2

  • torch 1.5.1

  • transformers 3.3.1

Problem
When I run "sh agnews.sh" or "sh imdb.sh", it works. But when I run "sh amazon.sh" or "sh dbpedia.sh", I get the following errors:

Namespace(accum_steps=4, category_vocab_size=100, dataset_dir='datasets/amazon/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=0.1, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
  • This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Reading texts from datasets/amazon/train.txt
    Converting texts into tensors.
    Saving encoded texts into datasets/amazon/train.pt
    Reading texts from datasets/amazon/train.txt
    Locating label names in the corpus.
    Saving texts with label names into datasets/amazon/label_name_data.pt
    Reading texts from datasets/amazon/test.txt
    Converting texts into tensors.
    joblib.externals.loky.process_executor._RemoteTraceback:
    """
    Traceback (most recent call last):
    File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 159, in feed
    send_bytes(obj
    )
    File "/root/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
    File "/root/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
    struct.error: 'i' format requires -2147483648 <= number <= 2147483647
    """

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/train.py", line 66, in <module>
    main()
  File "src/train.py", line 53, in main
    trainer = LOTClassTrainer(args)
  File "/root/app/LOTClass/src/trainer.py", line 52, in __init__
    self.read_data(args.dataset_dir, args.train_file, args.test_file, args.test_label_file)
  File "/root/app/LOTClass/src/trainer.py", line 201, in read_data
    self.test_data = self.create_dataset(dataset_dir, test_file, test_label_file, "test.pt")
  File "/root/app/LOTClass/src/trainer.py", line 109, in create_dataset
    results = Parallel(n_jobs=self.num_cpus)(delayed(self.encode)(docs=chunk) for chunk in chunks)
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/root/anaconda3/envs/py36/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/root/anaconda3/envs/py36/lib/python3.6/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/root/anaconda3/envs/py36/lib/python3.6/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
RuntimeError: The task could not be sent to the workers as it is too large for send_bytes.

Could you help check it? Thanks a lot!
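
The struct.error in the remote traceback is joblib/loky hitting the 2 GiB limit of multiprocessing's 32-bit length header (struct.pack("!i", n)), which is easy to reach when each parallel encoding task carries a large slice of the Amazon or DBPedia corpus. Below is a minimal sketch of one workaround, assuming you split the documents into more (and therefore smaller) chunks before the Parallel call; encode_fn is a hypothetical stand-in for the trainer's encoding function, not the repository's code.

import numpy as np
from joblib import Parallel, delayed

def encode_in_chunks(docs, encode_fn, n_jobs=8, chunks_per_job=8):
    # More chunks per worker -> smaller pickled payload per task, which keeps
    # each send_bytes call under the ~2 GiB struct.pack("!i", n) limit.
    chunks = np.array_split(np.asarray(docs, dtype=object), n_jobs * chunks_per_job)
    return Parallel(n_jobs=n_jobs)(delayed(encode_fn)(docs=list(chunk)) for chunk in chunks)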

Is it possible to pass multiple names for one category?

It seems that the model first locates the names from label_names.txt in the original text and masks them. If a category name never or rarely appears in train.txt, only a small amount of training material will be available for that first step. So is it possible to add more words that are similar to the category name to label_names.txt? For example, instead of passing {0: "sports"}, pass {0: "competition", "players"}, assuming that "sports" never occurs in the training materials.
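
If I read the label_names.txt format correctly from the README and the Chinese issue below (one class per line, with multiple space-separated label names for the same class), a hypothetical file for such a case might look like the sketch below; the words themselves are only placeholders, not values from the repository.

competition players
movie film

Each line corresponds to one class index (0, 1, ...), and every word on a line would then be used as a label name for that class.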

AssertionError: "时" used as the label name by multiple classes!

Hi @yumeng5

I am trying to use your model on a Chinese unsupervised text classification corpus. Because of the particularities of Chinese, I followed the method in the README and separated the label names with spaces, but classes 5: ['时', '尚'] and 6: ['时', '政'] share the character '时', and the error below was reported. How should I adjust this for Chinese data?

My label_names file is as follows:
"""
财 经
房 产
家 居
教 育
科 技
时 尚
时 政
游 戏
娱 乐
体 育
"""

"""
2020-11-20 07:15:48.082470: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Namespace(accum_steps=4, category_vocab_size=100, dataset_dir='datasets/lotclass/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=1, label_names_file='label_names.txt', match_threshold=20, max_len=512, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file=None, top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Downloading: 100% 110k/110k [00:00<00:00, 588kB/s]
Downloading: 100% 2.00/2.00 [00:00<00:00, 1.69kB/s]
Downloading: 100% 112/112 [00:00<00:00, 100kB/s]
Downloading: 100% 19.0/19.0 [00:00<00:00, 17.4kB/s]
Label names used for each class are: {0: ['财', '经'], 1: ['房', '产'], 2: ['家', '居'], 3: ['教', '育'], 4: ['科', '技'], 5: ['时', '尚'], 6: ['时', '政'], 7: ['游', '戏'], 8: ['娱', '乐'], 9: ['体', '育']}
Traceback (most recent call last):
  File "src/train.py", line 67, in <module>
    main()
  File "src/train.py", line 54, in main
    trainer = LOTClassTrainer(args)
  File "/content/drive/My Drive/CCF/reference/LOTClass-master/src/trainer.py", line 50, in __init__
    self.read_label_names(args.dataset_dir, args.label_names_file)
  File "/content/drive/My Drive/CCF/reference/LOTClass-master/src/trainer.py", line 218, in read_label_names
    assert word not in self.label2class, f"\"{word}\" used as the label name by multiple classes!"
AssertionError: "时" used as the label name by multiple classes!
"""

Thank you.
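
For context on why the assertion fires: each label-name word has to map to exactly one class, so a character shared by two classes (here '时') cannot be registered twice. A minimal sketch of that bookkeeping, reconstructed from the error message rather than copied from trainer.py:

label_name_dict = {5: ['时', '尚'], 6: ['时', '政']}  # toy example with a shared character
label2class = {}
for class_idx, words in label_name_dict.items():
    for word in words:
        # Every word may serve as the label name of only one class.
        assert word not in label2class, f"\"{word}\" used as the label name by multiple classes!"
        label2class[word] = class_idx

One workaround consistent with this constraint is to pick label-name characters (or words) that are unique to each class, e.g. '尚' for class 5 and '政' for class 6.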

Torch not compiled with CUDA enabled

Hi, I am running the code on a Mac, which does not have CUDA, and I run into some issues because of that. Any idea how I can resolve this?

Contructing category vocabulary.
Traceback (most recent call last):
  File "src/train.py", line 66, in <module>
    main()
  File "src/train.py", line 55, in main
    trainer.category_vocabulary(top_pred_num=args.top_pred_num, category_vocab_size=args.category_vocab_size)
  File "/Users/user/Downloads/LOTClass-master/src/trainer.py", line 295, in category_vocabulary
    mp.spawn(self.category_vocabulary_dist, nprocs=self.world_size, args=(top_pred_num, loader_name))
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/Users/user/Downloads/LOTClass-master/src/trainer.py", line 259, in category_vocabulary_dist
    model = self.set_up_dist(rank)
  File "/Users/user/Downloads/LOTClass-master/src/trainer.py", line 69, in set_up_dist
    model = self.model.to(rank)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 443, in to
    return self._apply(convert)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 441, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 149, in _lazy_init
    _check_driver()
  File "/Users/user/opt/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 47, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
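
As the README notes, at least one GPU is required: the trainer spawns one process per GPU and moves the model to a CUDA device (model.to(rank) in the traceback above), which is what fails on a CPU-only macOS PyTorch build. A minimal pre-flight check you could add to your own launch script (a suggestion, not part of the repository):

import torch

# LOTClass moves the model onto a CUDA device inside each spawned process,
# so a CPU-only PyTorch build fails with "Torch not compiled with CUDA enabled".
if not torch.cuda.is_available():
    raise SystemExit("LOTClass requires at least one CUDA-capable GPU; none was detected.")
print(f"Detected {torch.cuda.device_count()} CUDA device(s); set the gpus argument accordingly.")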
