The pesimcse's intro from yyinhu

@article{wu2021esimcse,
  title={Esimcse: Enhanced sample building method for contrastive learning of unsupervised sentence embedding},
  author={Wu, Xing and Gao, Chaochen and Zang, Liangjun and Han, Jizhong and Wang, Zhongyuan and Hu, Songlin},
  journal={arXiv preprint arXiv:2109.04380},
  year={2021}
}

PESimCSE: The Research of Semantic Matching Based on Phrase Enhancement Samples

Thanks to ESimCSE! Our work is mainly based on the ESimCSE repo.

Train PESimCSE

In the following section, we describe how to train a PESimCSE model by using our code.

Requirements

First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use the 1.7.1 build that matches your platform and CUDA version. PyTorch versions higher than 1.7.1 should also work. For example, if you use Linux and CUDA 11 (check your CUDA version first), install PyTorch with the following command:

pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

If you instead use CUDA <11 or CPU, install PyTorch with the following command:

pip install torch==1.8.1

Then run the following command to install the remaining dependencies:

pip install -r requirements.txt
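After installation, it can help to confirm that the installed PyTorch meets the 1.7.1 minimum mentioned above. The following is a minimal sketch using only the standard library; the helper names (`parse_release`, `meets_minimum`, `installed_torch_version`) are hypothetical and not part of this repository.

```python
import itertools
from importlib import metadata


def parse_release(version):
    """Turn a version string such as '1.7.1+cu110' into (1, 7, 1).

    The local build tag after '+' (e.g. 'cu110') is ignored, and any
    non-numeric suffix inside a component is dropped.
    """
    public = version.split("+", 1)[0]
    numbers = []
    for piece in public.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:  # pad '1.8' to (1, 8, 0)
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed, minimum="1.7.1"):
    """Return True if the installed version is at least the minimum."""
    return parse_release(installed) >= parse_release(minimum)


def installed_torch_version():
    """Return the installed torch version string, or None if it is missing."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed")
    else:
        print(found, "OK" if meets_minimum(found) else "older than 1.7.1")
```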

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation uses the "all" setting and reports Spearman's correlation.
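To make the metric concrete, here is a small pure-Python sketch of Spearman's correlation under an "all"-style aggregation, where the sentence pairs from every subset are pooled before a single correlation is computed. This illustrates the idea only and is not the SentEval code; the subset names and scores are made up, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; constant inputs (zero variance) are not handled."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman's correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def all_setting_spearman(subsets):
    """Pool (predicted, gold) scores from every subset, then correlate once."""
    predicted, gold = [], []
    for subset_predicted, subset_gold in subsets.values():
        predicted.extend(subset_predicted)
        gold.extend(subset_gold)
    return spearman(predicted, gold)


# Made-up similarity scores (predicted) and human ratings (gold).
example_subsets = {
    "subset_a": ([0.9, 0.2, 0.5], [4.8, 1.0, 3.1]),
    "subset_b": ([0.7, 0.1], [4.0, 0.5]),
}

if __name__ == "__main__":
    print(all_setting_spearman(example_subsets))
```

Because the correlation is rank-based, only the ordering of the predicted similarities matters, not their scale.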

Before evaluation, please download the evaluation datasets by running:

cd SentEval/data/downstream/
bash download_dataset.sh

Then return to the root directory. You can evaluate any transformers-based pre-trained model using our evaluation code. For example:

python evaluation.py \
    --model_name_or_path ffgcc/esimcse-bert-base-uncased \
    --pooler cls_before_pooler \
    --task_set sts \
    --mode test

Arguments for the evaluation script are as follows:

--model_name_or_path: The name or path of a transformers-based pre-trained checkpoint. You can directly use a released checkpoint, e.g., ffgcc/esimcse-bert-base-uncased.
--pooler: Pooling method. Currently supported:
- cls (default): Use the representation of [CLS] token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation).
- cls_before_pooler: Use the representation of [CLS] token without the extra linear+activation.
- avg: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa, you should use this option.
- avg_top2: Average embeddings of the last two layers.
- avg_first_last: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best.
--mode: Evaluation mode
- test (default): The default test mode. To faithfully reproduce our results, you should use this option.
- dev: Report the development set results. Note that in STS tasks, only STS-B and SICK-R have development sets, so we only report their numbers. It also uses a fast mode for transfer tasks, so the running time is much shorter than in test mode (though numbers are slightly lower).
- fasttest: The same as test, but with a fast mode, so the running time is much shorter and the reported numbers may be lower (only for transfer tasks).
--task_set: What set of tasks to evaluate on (if set, it will override --tasks)
- sts (default): Evaluate on STS tasks, including STS 12~16, STS-B and SICK-R. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.
- transfer: Evaluate on transfer tasks.
- full: Evaluate on both STS and transfer tasks.
- na: Manually set tasks by --tasks.
--tasks: Specify which dataset(s) to evaluate on. Will be overridden if --task_set is not na. See the code for a full list of tasks.
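The pooling options listed above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the repository's implementation: `layers` is assumed to be a list of per-layer token vectors for one sentence, with the [CLS] token at position 0, and which layer counts as the "first" layer is left to the caller (here it is simply `layers[0]`). The `cls` option is omitted because it needs the trained linear+activation weights. All function names are hypothetical.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vector[d]
    return [value / count for value in total]


def average_layers(layer_a, layer_b):
    """Element-wise mean of two layers with the same shape."""
    return [
        [(a + b) / 2 for a, b in zip(vec_a, vec_b)]
        for vec_a, vec_b in zip(layer_a, layer_b)
    ]


def pool(layers, attention_mask, pooler="cls_before_pooler"):
    """Reduce per-token hidden states to a single sentence vector."""
    last = layers[-1]
    if pooler == "cls_before_pooler":
        return list(last[0])  # raw [CLS] vector from the last layer
    if pooler == "avg":
        return masked_mean(last, attention_mask)
    if pooler == "avg_top2":
        return masked_mean(average_layers(layers[-2], last), attention_mask)
    if pooler == "avg_first_last":
        return masked_mean(average_layers(layers[0], last), attention_mask)
    raise ValueError(f"unsupported pooler: {pooler}")


# Toy input: 2 layers, 3 token positions (the last one is padding), dimension 2.
toy_layers = [
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],
]
toy_mask = [1, 1, 0]

if __name__ == "__main__":
    for name in ("cls_before_pooler", "avg", "avg_top2", "avg_first_last"):
        print(name, pool(toy_layers, toy_mask, name))
```

The attention mask matters for the averaging options: without it, padding positions would be mixed into the sentence vector.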

Training

Data

Following ESimCSE, we sample 1 million sentences from English Wikipedia. We provide two datasets: the unprocessed original training set and the processed dataset. You can run data/download_wiki.sh to download both.

Training scripts

We provide example training scripts for ESimCSE. Two training datasets are available: the original dataset and the processed dataset. We explain the arguments below:

--train_file: Training file path. We support "txt" files (one sentence per line). You can use Wikipedia or your own data in the same format.
--model_name_or_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.) and RoBERTa-based models (roberta-base, roberta-large, etc.).
--temp: Temperature for the contrastive loss.
--pooler_type: Pooling method. It's the same as --pooler in the evaluation part.
--hard_negative_weight: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set as 0 (default value).
--do_mlm: Whether to use the MLM auxiliary objective. If True:
- --mlm_weight: Weight for the MLM objective.
- --mlm_probability: Masking rate for the MLM objective.
--neg_size: Number of negative sentences.
--dup_type: Type of repetition.
--dup_rate: Repetition rate.
--momentum: Momentum rate.
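To show how --temp and --hard_negative_weight enter the objective, here is a pure-Python sketch of an in-batch contrastive (InfoNCE-style) loss. It is modeled on the way SimCSE-style code is usually written and is an assumption about, not a copy of, this repository's training code; the function names and the temperature of 0.05 used here are illustrative only. Adding the argument to a logit is equivalent to multiplying that term's exponential by the weight, which is why the argument is the logarithm of the weight.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, hard_negatives=None,
                     temp=0.05, hard_negative_weight=0.0):
    """Mean cross-entropy where sentence i should match positive i.

    Other positives in the batch act as negatives. If hard negatives are
    given, each anchor also sees all of them, and the log-weight is added
    to the logit of its own hard negative only (an assumption, see above).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, positive) / temp for positive in positives]
        if hard_negatives is not None:
            for j, negative in enumerate(hard_negatives):
                logit = cosine(anchor, negative) / temp
                if j == i:
                    logit += hard_negative_weight
                logits.append(logit)
        # Numerically stable log-sum-exp, then cross-entropy with target i.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(anchors, aligned))  # small: positives match
    print(contrastive_loss(anchors, swapped))  # large: positives mismatched
```

A lower temperature sharpens the softmax, so small differences in cosine similarity produce larger changes in the loss.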

For results in the paper, we use NVIDIA 3060 GPUs with CUDA 11. Using different types of devices or different versions of CUDA or other software may lead to slightly different performance.
