
ust's Introduction

Uncertainty-aware Self-training

UST or Uncertainty-aware Self-Training is a method for task-specific training of pre-trained language models (e.g., BERT, Electra, GPT) with only a few labeled examples for the target classification task and large amounts of unlabeled data.

Our academic paper, published as a spotlight presentation at NeurIPS 2020, describes the framework in detail: Uncertainty-aware Self-training for Few-shot Text Classification

Key Result

With only 20-30 labeled examples per class for each task and large amounts of task-specific unlabeled data, UST performs within 3% accuracy of fully supervised pre-trained language models fine-tuned on thousands of labeled examples, with an aggregate accuracy of 91% and an improvement of up to 12% over baselines (e.g., BERT) for text classification on benchmark datasets. It does not use any auxiliary resources like paraphrases or back-translations.

The following table reports text classification accuracy over benchmark datasets, averaged over multiple runs.

Dataset        BERT (30 labels)  UDA SSL (30 labels)  Classic ST (30 labels)  UST (30 labels)  BERT (Supervised, ~150K labels)
SST            69.79             83.58                84.81                   87.69            92.12
IMDB           73.03             89.30                78.97                   89.21            91.70
Elec           82.92             89.64                89.92                   91.27            93.46
AG News        80.74             85.92                84.62                   88.19            92.12
Macro Average  80.85             89.06                87.34                   91.00            93.73

How it works

UST is a semi-supervised learning method that leverages pre-trained language models with stochastic regularization techniques and iterative self-training with student-teacher models. Specifically, it extends traditional self-training with three core components:

(i) Masked model dropout for uncertainty estimation. We adopt MC dropout (Gal and Ghahramani, 2016) to obtain uncertainty estimates from the pre-trained language model. We apply stochastic dropout after different hidden layers of the network and treat each forward pass as a random sample from the approximate posterior distribution. This lets us compute model uncertainty as the stochastic mean and variance of the predictions over a few stochastic forward passes through the network.
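
A minimal sketch of this estimation (assuming a Keras model that returns class logits; the repo's actual implementation is mc_dropout_evaluate in ust.py):

import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x_unlabeled, T=30):
    # training=True keeps dropout active at inference, so each pass
    # samples a different dropout mask, i.e., a different "masked model".
    probs = np.stack([
        tf.nn.softmax(model(x_unlabeled, training=True), axis=-1).numpy()
        for _ in range(T)
    ])                           # shape: (T, num_examples, num_classes)
    y_mean = probs.mean(axis=0)  # stochastic mean of the predictions
    y_var = probs.var(axis=0)    # stochastic variance (uncertainty)
    return y_mean, y_var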

(ii) Sample selection. Given the above uncertainty estimates, we employ entropy-based measures to select the samples the teacher is most or least confused about to infuse into self-training, corresponding to easy and hard entropy-aware example mining.
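
As a sketch, a BALD-style acquisition score behind the bald sampling schemes can be computed from the per-pass softmax outputs above (probs of shape (T, num_examples, num_classes) is an assumed input; see sampler.py for the repo's actual logic):

import numpy as np

def bald_score(probs, eps=1e-10):
    # BALD = entropy of the mean prediction minus the mean per-pass entropy.
    mean_p = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps), axis=-1)
    mean_entropy = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)
    return entropy_of_mean - mean_entropy

# Easy-example mining prefers low scores (the teacher is consistent);
# hard-example mining prefers high scores (the teacher is confused).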

(iii) Confident learning. We train the student model to explicitly account for the teacher's confidence by emphasizing low-variance examples. All of the above components are used jointly for end-to-end learning.
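
A hedged sketch of the weighting idea (the exact scheme in the repo may differ; y_var and y_pred come from the MC dropout sketch above, and the inverse-variance form with smoothing constant alpha is an assumption):

import numpy as np

def confidence_weights(y_var, y_pred, alpha=0.1):
    # Variance of the predicted class for each pseudo-labeled example.
    var_at_pred = y_var[np.arange(len(y_pred)), y_pred]
    # Low teacher variance -> high student weight.
    return 1.0 / (var_at_pred + alpha)

# The weights can then be passed to Keras as per-example sample weights:
# model.fit(X_pseudo, y_pseudo, sample_weight=confidence_weights(y_var, y_pred))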

How to use the code

Continued Pre-training on Task-specific Unlabeled Data

For the few-shot learning setting with limited training labels, continued pre-training on task-specific unlabeled data, starting from available pre-trained checkpoints, is an effective way to obtain a good base encoder to initialize the teacher model for UST. Code for continued pre-training with the masked language modeling objective is in the original BERT repo: https://github.com/google-research/bert. Invoke the create_pretraining_data.py and run_pretraining.py scripts from that repo, following the additional instructions therein. This produces a new TensorFlow checkpoint that can be used as the pre-trained checkpoint for UST.
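
For example (an illustrative invocation following the BERT repo's README; $BERT_BASE_DIR and all other paths and hyperparameters are placeholders to adjust):

python create_pretraining_data.py \
  --input_file=$DATA_DIR/transfer.txt \
  --output_file=$DATA_DIR/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128

python run_pretraining.py \
  --input_file=$DATA_DIR/tf_examples.tfrecord \
  --output_dir=$OUTPUT_DIR/pretraining \
  --do_train=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --num_train_steps=100000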

You can use transformers-cli from https://huggingface.co/transformers/converting_tensorflow_models.html to convert TensorFlow checkpoints (ckpt) to checkpoints (bin) compatible with HuggingFace Transformers.
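
For example, for a BERT checkpoint (the file names below are the usual defaults; adjust to your checkpoint):

transformers-cli convert --model_type bert \
  --tf_checkpoint $CKPT_DIR/bert_model.ckpt \
  --config $CKPT_DIR/bert_config.json \
  --pytorch_dump_output $CKPT_DIR/pytorch_model.bin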

Note that this continued pre-training step is optional for UST, but required to reproduce the results in the paper. Without it, UST uses the default pre-trained checkpoint of any pre-trained language model, which also works well in practice. The continued pre-training checkpoints for the tasks in the paper are available here and can be used to initialize UST for the respective tasks.

HuggingFace Transformers as Base Encoders

UST is integrated with HuggingFace Transformers which makes it possible to use any supported pre-trained language model as a base encoder.

Training UST

UST requires three input files in the data directory: train.tsv and test.tsv with tab-separated (i) instances (e.g., SST and IMDB) or pairs of instances (e.g., MRPC and MNLI) and (ii) labels; and transfer.txt with the unlabeled instances of the corresponding task, one per line.
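
For illustration, a few hypothetical lines for a binary sentiment task:

train.tsv / test.tsv (tab-separated text and label):

    a gripping, beautifully shot film	1
    tedious and overlong	0

transfer.txt (one unlabeled instance per line):

    the plot meanders but the cast is strong
    not sure this holds up on a second viewing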

The code has been tested with TensorFlow 2.3.1, Transformers 3.3.1, and Python 3.6.9. Install all the required dependencies with pip install -r requirements.txt.

These are a standard set of arguments to run UST in the few-shot setting. Refer to run_ust.py for all optional arguments and their descriptions.

PYTHONHASHSEED=42 python run_ust.py \
  --task $DATA_DIR \
  --model_dir $OUTPUT_DIR \
  --seq_len 128 \
  --sample_scheme easy_bald_class_conf \
  --sup_labels 60 \
  --valid_split 0.5 \
  --pt_teacher TFBertModel \
  --pt_teacher_checkpoint bert-base-uncased \
  --N_base 5 \
  --sup_batch_size 4

Classification tasks: Set --do_pairwise for pairwise classification tasks like MRPC and MNLI.

Sampling schemes: Supported sample schemes: uniform, easy_bald_class_conf (sampling easy examples with uncertainty given by Eqn. 7 in the paper), and dif_bald_class_conf (sampling difficult examples, given by Eqn. 8). conf enables confident learning, whereas class enables class-dependent exploration. Additionally, you can append soft to any of the above schemes (e.g., easy_bald_class_conf_soft) to leverage majority predictions from T stochastic forward passes, which works well for settings involving many classes / labels.

HuggingFace Transformers: To use a different pre-trained language model from HuggingFace, set pt_teacher and pt_teacher_checkpoint to the corresponding model versions available from https://huggingface.co/transformers/pretrained_models.html. A default set of pre-trained language models is available in huggingface_utils.py.

Training and validation: sup_labels denotes the total number of available labeled examples per class for each task; valid_split uses a fraction of those labels as a validation set for early stopping. Set sup_labels to -1 to use all training labels. Set valid_split to -1 to use the available test data as the validation set.

Initializing the teacher model: To start with a good base encoder in the few-shot setting with very few labeled examples, UST uses different random seeds to initialize and fine-tune the teacher model N_base times and selects the best one to start the self-training process. This is not required when a large number of labeled examples is available (in that case, set N_base to 1).
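
A sketch of that selection loop (build_teacher, finetune, and accuracy are hypothetical helpers standing in for the logic in ust.py):

import tensorflow as tf

best_acc, best_model = -1.0, None
for seed in range(N_base):                   # N_base random restarts
    tf.random.set_seed(seed)
    model = build_teacher()                  # fresh pre-trained encoder + head
    finetune(model, X_train, y_train)        # few-shot fine-tuning
    acc = accuracy(model, X_valid, y_valid)  # held-out valid_split
    if acc > best_acc:
        best_acc, best_model = acc, model    # keep the best teacher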

Fine-tuning batch size: Set sup_batch_size to a small value (e.g., 4) for few-shot fine-tuning of the teacher model. With many training labels, set sup_batch_size to a higher value (e.g., 32) for faster training.

Self-training works in both the low-data and the high-data regime. For example, UST obtains a 0.5-point accuracy improvement on MNLI (mismatched) when all available labeled examples (393K) are used as both the training set and the transfer set, without any additional unlabeled data.

A standard set of arguments to run UST with all labeled examples (e.g., MNLI):

PYTHONHASHSEED=42 python run_ust.py \
  --task $DATA_DIR/MNLI \
  --model_dir $OUTPUT_DIR \
  --seq_len 128 \
  --sample_scheme easy_bald_class_conf_soft \
  --sup_labels -1 \
  --valid_split -1 \
  --sup_batch_size 32 \
  --do_pairwise \
  --N_base 1

Dropout is key to stochastic regularization and obtaining uncertainty estimates. However, values that are too small yield too little perturbation, whereas values that are too large distort the pre-trained model's attention mechanism. Good dropout values: BERT --hidden_dropout_prob 0.3 --attention_probs_dropout_prob 0.3; Electra/RoBERTa --hidden_dropout_prob 0.2 --attention_probs_dropout_prob 0.2.

Examples of using other pre-trained language models (defined in huggingface_utils.py): Electra --pt_teacher TFElectraModel --pt_teacher_checkpoint google/electra-base-discriminator; RoBERTa --pt_teacher TFRobertaModel --pt_teacher_checkpoint roberta-base.

Datasets used in our paper:

If you use this code, please cite:

@inproceedings{mukherjee-awadallah-2020-ust,
    title = "Uncertainty-aware Self-training for Few-shot Text Classification",
    author = "Mukherjee, Subhabrata  and
      Hassan Awadallah, Ahmed",
    booktitle = "Advances in Neural Information Processing Systems (NeurIPS 2020)",
    year = "2020",
    address = "Online",
    url = "https://papers.nips.cc/paper/2020/file/f23d125da1e29e34c552f448610ff25f-Paper.pdf",
}

Code is released under MIT license.


ust's Issues

AttributeError: 'MirroredStrategy' object has no attribute 'experimental_run_v2'

Running the code as-is in Colab, I get this error:

Traceback (most recent call last):
  File "run_ust.py", line 147, in <module>
    train_model(max_seq_length, X_train, y_train, X_test, y_test, X_unlabeled, model_dir, tokenizer, sup_batch_size=sup_batch_size, unsup_batch_size=unsup_batch_size, unsup_size=unsup_size, sample_size=sample_size, TFModel=TFModel, Config=Config, pt_teacher_checkpoint=pt_teacher_checkpoint, sample_scheme=sample_scheme, T=T, alpha=alpha, valid_split=valid_split, sup_epochs=sup_epochs, unsup_epochs=unsup_epochs, N_base=N_base, dense_dropout=dense_dropout, attention_probs_dropout_prob=attention_probs_dropout_prob, hidden_dropout_prob=hidden_dropout_prob)
  File "/content/UST/ust.py", line 180, in train_model
    y_mean, y_var, y_pred, y_T = mc_dropout_evaluate(model, gpus, len(labels), X_unlabeled_sample, T=T)
  File "/content/UST/ust.py", line 59, in mc_dropout_evaluate
    pred = distributed_eval_step(batch)
  File "/content/UST/ust.py", line 56, in distributed_eval_step
    return strategy.experimental_run_v2(eval_step, args=(dataset_inputs,))
AttributeError: 'MirroredStrategy' object has no attribute 'experimental_run_v2'

TF version is 2.5.0 (provided by Colab).

I tried changing the TF version to 1.x, but in that case I get:

INFO:filelock:Lock 140358422112272 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Traceback (most recent call last):
  File "run_ust.py", line 147, in <module>
    train_model(max_seq_length, X_train, y_train, X_test, y_test, X_unlabeled, model_dir, tokenizer, sup_batch_size=sup_batch_size, unsup_batch_size=unsup_batch_size, unsup_size=unsup_size, sample_size=sample_size, TFModel=TFModel, Config=Config, pt_teacher_checkpoint=pt_teacher_checkpoint, sample_scheme=sample_scheme, T=T, alpha=alpha, valid_split=valid_split, sup_epochs=sup_epochs, unsup_epochs=unsup_epochs, N_base=N_base, dense_dropout=dense_dropout, attention_probs_dropout_prob=attention_probs_dropout_prob, hidden_dropout_prob=hidden_dropout_prob)
  File "/content/UST/ust.py", line 113, in train_model
    model = models.construct_teacher(TFModel, Config, pt_teacher_checkpoint, max_seq_length, len(labels), dense_dropout=dense_dropout, attention_probs_dropout_prob=attention_probs_dropout_prob, hidden_dropout_prob=hidden_dropout_prob)
  File "/content/UST/models.py", line 20, in construct_teacher
    encoder = TFModel.from_pretrained(pt_teacher_checkpoint, config=config, from_pt=True, name="teacher")
  File "/usr/local/lib/python3.7/dist-packages/transformers/utils/dummy_tf_objects.py", line 404, in from_pretrained
    requires_backends(cls, ["tf"])
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 612, in requires_backends
    raise ImportError("".join([BACKENDS_MAPPING[backend][1].format(name) for backend in backends]))
ImportError: 
TFBertModel requires the TensorFlow library but it was not found in your environment. Checkout the instructions on the
installation page: https://www.tensorflow.org/install and follow the ones that match your environment.
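
A likely fix: TensorFlow 2.2 renamed Strategy.experimental_run_v2 to Strategy.run, and later releases removed the old name, so on TF 2.5 the call in ust.py needs updating:

# ust.py, mc_dropout_evaluate: use the renamed distribution-strategy API
def distributed_eval_step(dataset_inputs):
    return strategy.run(eval_step, args=(dataset_inputs,))  # was experimental_run_v2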

Cannot reproduce the results

Hi, thanks a lot for providing the code for the nice paper.
I tried this code on the SST-2 and AG-News datasets and followed all parameter settings listed in the paper's appendix. However, I am not able to reproduce the results.
More precisely, I observed the following two issues:

  1. The BERT base model actually performs very well given 30 labels: I always achieve an accuracy of 80%+ for SST-2 and 85%+ for AG-News, yet the paper reports 69.79% and 80.74%. I have already tried different seeds but observe the same results.
  2. I haven't observed much improvement from using UST on unlabeled data, all within 1%. Hence, I cannot reproduce the results given in the paper.

As an example, I ran this command and got the SST-2 results mentioned above:

CUDA_VISIBLE_DEVICES=$cuda_id PYTHONHASHSEED=42 python ../run_ust.py \
  --task /local/data \
  --model_dir /local/sst-2 \
  --seq_len 32 \
  --sup_batch_size 4 \
  --unsup_batch_size 32 \
  --sample_size 16384 \
  --unsup_size 4096 \
  --sample_scheme easy_bald_class_conf \
  --sup_labels 60 \
  --T 30 \
  --alpha 0.1 \
  --valid_split 0.5 \
  --sup_epochs 50 \
  --unsup_epochs 25 \
  --N_base 5 \
  --pt_teacher TFBertModel \
  --pt_teacher_checkpoint bert-base-uncased \
  --hidden_dropout_prob 0.3 \
  --attention_probs_dropout_prob 0.3 \
  --dense_dropout 0.5

My questions:

  1. Is there something we need to take care of when creating the datasets (preprocessing)?
  2. There is a hyperparameter called alpha in the code that I think is not mentioned in the paper. Should I use the default value (i.e., 0.1) when running the code?

AttributeError: 'numpy.ndarray' object has no attribute 'values'

Training with 1 GPU (tensorflow_gpu-2.3.1, transformers-3.3.1), an error occurred at y_pred.extend(pred.values[gpu]) in ust.py:

WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `experimental_run_v2` inside a tf.function to get the best performance.
Traceback (most recent call last):
  File "run_ust.py", line 147, in <module>
    train_model(max_seq_length, X_train, y_train, X_test, y_test, X_unlabeled, model_dir, tokenizer, sup_batch_size=sup_batch_size, unsup_batch_size=unsup_batch_size, unsup_size=unsup_size, sample_size=sample_size, TFModel=TFModel, Config=Config, pt_teacher_checkpoint=pt_teacher_checkpoint, sample_scheme=sample_scheme, T=T, alpha=alpha, valid_split=valid_split, sup_epochs=sup_epochs, unsup_epochs=unsup_epochs, N_base=N_base, dense_dropout=dense_dropout, attention_probs_dropout_prob=attention_probs_dropout_prob, hidden_dropout_prob=hidden_dropout_prob)
  File "UST/ust.py", line 183, in train_model
    y_mean, y_var, y_pred, y_T = mc_dropout_evaluate(model, gpus, len(labels), X_unlabeled_sample, T=T)
  File "UST/ust.py", line 62, in mc_dropout_evaluate
    y_pred.extend(pred.values[gpu])
AttributeError: 'numpy.ndarray' object has no attribute 'values'

The actual value of pred (via print(pred)) is:

[[-0.59470797 -0.7103901  -0.4074441   2.2519145  -1.8890193   1.024729  ]
 [ 0.79591113  0.87572926 -1.0720805  -0.9207117  -0.32005262  0.6711779 ]
 [-0.3771792  -0.71912414 -1.1747787   2.134624   -1.006975    0.3743801 ]
 ...
 [-0.5568391  -0.1446489  -0.8823348   1.9092964  -1.0569383   0.17100161]
 [ 0.9426691  -0.87104434  0.3349641   0.87110806 -1.3404613   0.5784961 ]
 [ 0.7570472  -1.1358421   0.9814421  -0.84206074  0.06219336 -0.15020499]]
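
A plausible workaround: with a single replica, MirroredStrategy returns the per-replica result directly (here a NumPy array) rather than a PerReplica object, so the .values access only applies in the multi-GPU case:

# ust.py, mc_dropout_evaluate: guard the PerReplica access for 1 GPU
pred = distributed_eval_step(batch)
if gpus > 1:
    for gpu in range(gpus):
        y_pred.extend(pred.values[gpu])
else:
    y_pred.extend(pred)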

Question about Confident Learning

According to your paper, the sample weight should be Var(y).

In sampler.py, line 76:

for label in range(num_classes):
    # ...
    w_s.extend(y_var_[indices][:, 0])

Why does this always take class 0's variance?

Should this line be:

w_s.extend(y_var_[indices][:, label])

Question about mc_dropout_evaluate

In this function, you set the number of masked models for uncertainty estimation to 30.

However, I cannot find where the weights change within the loop:

for i in range(T):
    y_pred = []
    with strategy.scope():
        def eval_step(inputs):
            return model(inputs, training=training).numpy()  #[:,0]

        def distributed_eval_step(dataset_inputs):
            return strategy.experimental_run_v2(eval_step, args=(dataset_inputs,))

        for batch in dist_data:
            pred = distributed_eval_step(batch)
            for gpu in range(gpus):
                y_pred.extend(pred.values[gpu])

    # converting logits to probabilities
    y_T[i] = tf.nn.softmax(np.array(y_pred))

Did I misunderstand the T stochastic forward passes?
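
A likely explanation: the weights themselves never change across the T passes. The stochasticity comes from dropout staying active at inference (the training flag passed to the model), so each pass samples a different dropout mask, i.e., a different masked model.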

unable to find transfer.txt

I downloaded all the datasets from the posted links, but none of them contains transfer.txt. Does this mean we should write our own data_preprocess.py to first generate valid datasets?

IMDB dataset

I am trying to replicate the model's performance on the IMDB dataset but am not able to do so.
Does this method essentially work only for binary classification? I have tried to make it work with multiple classes, but the model doesn't run: I get a "gradient does not exist" error.
Can the details of the parameters used and the way the dataset was created be shared so that we can replicate and test the model?
