
EASE: Entity-Aware Contrastive Learning of Sentence Embedding


EASE is a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities, proposed in our paper EASE: Entity-Aware Contrastive Learning of Sentence Embedding. This repository contains the source code to train the model and evaluate it on downstream tasks. Our code is mainly based on that of SimCSE.

Released Models


Our published models are listed below. You can load them with Hugging Face Transformers.

| Monolingual Models | Avg. STS | Avg. STC |
|---|---|---|
| sosuke/ease-bert-base-uncased | 77.0 | 63.1 |
| sosuke/ease-roberta-base | 76.8 | 58.6 |

| Multilingual Models | Avg. mSTS | Avg. mSTC |
|---|---|---|
| sosuke/ease-bert-base-multilingual-cased | 57.2 | 36.1 |
| sosuke/ease-xlm-roberta-base | 57.1 | 36.3 |

Use EASE with Hugging Face

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load our pretrained model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("sosuke/ease-bert-base-multilingual-cased")
model = AutoModel.from_pretrained("sosuke/ease-bert-base-multilingual-cased")

# Set pooler: mean pooling over token embeddings, ignoring padded positions.
pooler = lambda last_hidden, att_mask: (last_hidden * att_mask.unsqueeze(-1)).sum(1) / att_mask.sum(-1).unsqueeze(-1)

# Tokenize input texts.
texts = [
    "Ils se préparent pour un spectacle à l'école.",
    "They are preparing for a show at school.",
    "Two medical professionals in green look on at something."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the sentence embeddings.
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state
embeddings = pooler(last_hidden, inputs["attention_mask"])

# Calculate cosine similarities
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print(f"Cosine similarity between {texts[0]} and {texts[1]} is {cosine_sim_0_1}")
print(f"Cosine similarity between {texts[0]} and {texts[2]} is {cosine_sim_0_2}")

Please see here for other pooling methods.
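
For reference, here is a minimal sketch of one alternative, [CLS] pooling (taking the first token's last hidden state, as in the cls_before_pooler option used elsewhere in this README); this is an illustration, not the repository's exact implementation:

# Illustrative alternative pooler: use the [CLS] token's last hidden state.
cls_pooler = lambda last_hidden, att_mask: last_hidden[:, 0]
cls_embeddings = cls_pooler(last_hidden, inputs["attention_mask"])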

Setups

Python

Run the following command to install the required libraries.

pip install -r requirements.txt

Before training, please download the datasets for training and evaluation.

bash download_all.sh

Evaluation

We provide evaluation code for sentence embeddings covering Semantic Textual Similarity (STS 2012-2016, STS Benchmark, SICK-Relatedness, and the extended version of the STS 2017 dataset), Short Text Clustering (eight STC benchmarks and MewsC-16), Cross-lingual Parallel Matching (Tatoeba), and Cross-lingual Text Classification (MLDoc).

Set your model name or the path of a Transformers-based checkpoint (--model_name_or_path), the pooling method (--pooler), and the task set (--task_set). See the example commands below.

Semantic Textual Similarity
python evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg \
    --task_set cl-sts

Short Text Clustering
python downstreams/text-clustering/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg \
    --task_set cl

Cross-lingual Parallel Matching
python downstreams/parallel-matching/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg

Cross-lingual Text Classification
python downstreams/cross-lingual-transfer/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg

Please refer to each evaluation code for detailed descriptions of arguments.
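
For intuition, the STS evaluation reduces to correlating model similarities with human judgments. Below is a minimal sketch, assuming Spearman's rank correlation (the standard STS metric) and hypothetical gold_scores/pred_scores arrays:

from scipy.stats import spearmanr

# Hypothetical example: gold human similarity judgments vs. model cosine similarities.
gold_scores = [4.8, 1.2, 3.5, 0.4]
pred_scores = [0.92, 0.31, 0.75, 0.18]

# STS benchmarks report Spearman's rank correlation between the two.
corr, _ = spearmanr(gold_scores, pred_scores)
print(f"Spearman correlation: {corr:.3f}")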

Training

You can train an EASE model in a monolingual setting using English Wikipedia sentences or in a multilingual setting using Wikipedia sentences in 18 languages.

We provide example training scripts for both monolingual (train_monolingual_ease.sh) and multilingual (train_multilingual_ease.sh) settings; see the example invocations below.
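
For example, to launch training with the provided scripts:

bash train_monolingual_ease.sh
bash train_multilingual_ease.sh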

MewsC-16

We construct MewsC-16 (Multilingual Short Text Clustering Dataset for News in 16 languages) from Wikinews. This dataset contains topic sentences from Wikinews articles in 13 categories and 16 languages. More detailed information is available in our paper, Appendix E.

Statistics and Scores
| Language | Sentences | Label types | XLM-R_base | EASE-XLM-R_base |
|---|---|---|---|---|
| ar | 2,224 | 11 | 27.9 | 27.4 |
| ca | 3,310 | 11 | 27.1 | 27.9 |
| cs | 1,534 | 9 | 25.2 | 41.2 |
| de | 6,398 | 8 | 30.5 | 39.5 |
| en | 12,892 | 13 | 25.8 | 39.6 |
| eo | 227 | 8 | 24.7 | 37.0 |
| es | 6,415 | 11 | 20.8 | 38.2 |
| fa | 773 | 9 | 37.2 | 41.5 |
| fr | 10,697 | 13 | 25.3 | 33.3 |
| ja | 1,984 | 12 | 44.0 | 47.6 |
| ko | 344 | 10 | 24.1 | 33.7 |
| pl | 7,247 | 11 | 28.8 | 39.9 |
| pt | 8,921 | 11 | 27.4 | 32.9 |
| ru | 1,406 | 12 | 20.1 | 27.2 |
| sv | 584 | 7 | 30.1 | 29.8 |
| tr | 459 | 7 | 30.7 | 44.9 |
| Avg. | | | 28.1 | 36.3 |

Note that these results differ slightly from those reported in the original paper, since we further cleaned the data after publication.
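
As an illustration of how such clustering scores can be computed, below is a minimal sketch using scikit-learn, assuming KMeans over the sentence embeddings and V-measure as the metric; the repository's evaluation code may differ in its details.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

# Hypothetical inputs: sentence embeddings (n_sentences x dim) and gold topic labels.
embeddings = np.random.rand(100, 768)
labels = np.random.randint(0, 8, size=100)

# Cluster the embeddings with as many clusters as there are label types.
kmeans = KMeans(n_clusters=len(set(labels)), n_init=10, random_state=0)
pred = kmeans.fit_predict(embeddings)

# V-measure compares the predicted clustering against the gold labels.
print(f"V-measure: {v_measure_score(labels, pred):.3f}")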

Citation


@inproceedings{nishikawa-etal-2022-ease,
    title = "{EASE}: Entity-Aware Contrastive Learning of Sentence Embedding",
    author = "Nishikawa, Sosuke  and
      Ri, Ryokan  and
      Yamada, Ikuya  and
      Tsuruoka, Yoshimasa  and
      Echizen, Isao",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.284",
    pages = "3870--3885",
    abstract = "We present EASE, a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities.The advantage of using entity supervision is twofold: (1) entities have been shown to be a strong indicator of text semantics and thus should provide rich training signals for sentence embeddings; (2) entities are defined independently of languages and thus offer useful cross-lingual alignment supervision.We evaluate EASE against other unsupervised models both in monolingual and multilingual settings.We show that EASE exhibits competitive or better performance in English semantic textual similarity (STS) and short text clustering (STC) tasks and it significantly outperforms baseline methods in multilingual settings on a variety of tasks.Our source code, pre-trained models, and newly constructed multi-lingual STC dataset are available at https://github.com/studio-ousia/ease.",
}

