kr-sbert's Introduction

KR-SBERT

A pretrained Korean-specific Sentence-BERT model (Reimers and Gurevych 2019) developed by the Computational Linguistics Lab at Seoul National University.

How to use the KR-SBERT model in Python

Usage

We recommend Python 3.6 or higher and sentence-transformers v2.2.0 or higher.

>>> from sentence_transformers import SentenceTransformer, util
>>> model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS') 
>>> sentences = ['잠이 옵니다', '졸음이 옵니다', '기차가 옵니다']
>>> vectors = model.encode(sentences) # encode sentences into vectors
>>> similarities = util.cos_sim(vectors, vectors) # compute similarity between sentence vectors
>>> print(similarities)
tensor([[1.0000, 0.6577, 0.2732],
        [0.6577, 1.0000, 0.2730],
        [0.2732, 0.2730, 1.0000]])

The sentence '잠이 옵니다' ("I am getting sleepy") is more similar to '졸음이 옵니다' ("I am feeling drowsy", cosine similarity 0.65774536) than to '기차가 옵니다' ("The train is coming", cosine similarity 0.27321893).
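For readers who want to see what the similarity score measures, here is a minimal pure-Python sketch of cosine similarity. The function name `cosine_similarity` and the toy vectors are made up for illustration; it is not the library's implementation, only the same formula (dot product divided by the product of the vector norms).

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (illustrative only; real sentence vectors
# produced by the model have many more dimensions).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # points in the same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

print(cosine_similarity(a, b))  # close to 1.0
print(cosine_similarity(a, c))  # 0.0
```

Because the score depends only on direction, scaling a vector does not change its similarity to other vectors.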

Model description

Pre-trained BERT model

A pre-trained BERT model segments a sentence into WordPiece tokens, and the contextualized output vectors of those tokens are mean-pooled into a single sentence vector. We use KR-BERT-V40K, a variant of KR-BERT.
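The mean-pooling step can be sketched in a few lines of plain Python. This is an illustration, not the model's code: the helper `mean_pool`, the toy token vectors, and the use of an attention mask to skip padding positions are assumptions about how mean pooling is commonly done.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy contextualized outputs for a 3-token sentence padded to length 4.
token_vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [0.0, 0.0],  # padding position
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [3.0, 4.0]
```

Excluding padding keeps the sentence vector from shrinking toward zero when short sentences are batched with long ones.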

Fine-tuning SBERT model

For fine-tuning, the sentence vectors of a sentence pair are fed into a classifier, and a siamese network updates the BERT weights so that the vectors reflect the relation between the two sentences. We fine-tune the SBERT model on the KLUE-NLI dataset and the augmented KorSTS dataset.
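As a rough illustration of the classifier step, the sketch below follows the classification objective described by Reimers and Gurevych (2019), where the two sentence vectors u and v are concatenated with |u - v| and passed through a softmax layer. Whether KR-SBERT uses exactly this feature layout is an assumption, and the weights, bias, and vectors are made-up toy values.

```python
import math


def pair_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same length")
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(u, v, weights, bias):
    """Return class probabilities for a sentence pair from a linear layer."""
    features = pair_features(u, v)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy 2-dimensional sentence vectors and a hypothetical 3-class layer
# (three classes, as in an NLI-style task).
u = [0.5, -1.0]
v = [0.25, 1.0]
weights = [
    [0.1, 0.0, 0.1, 0.0, -0.5, -0.5],
    [0.0, 0.1, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]
bias = [0.0, 0.0, 0.0]

print(pair_features(u, v))  # [0.5, -1.0, 0.25, 1.0, 0.25, 2.0]
print(classify(u, v, weights, bias))
```

In actual training, the classifier weights and the shared BERT encoder would be updated together from the classification loss; this sketch only shows the forward pass.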

Augmenting KorSTS dataset

We augment the KorSTS dataset using the in-domain strategy proposed by Thakur et al. (2021).
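The in-domain strategy of Thakur et al. (2021) labels new sentence pairs with a cross-encoder trained on the gold data and adds these "silver" pairs to the training set. The sketch below is a heavily simplified illustration of that idea, not the procedure used for this model: `augment_in_domain` is a hypothetical helper, `token_overlap_score` is a toy stand-in for a trained cross-encoder, and new pairs are enumerated exhaustively rather than sampled with a retrieval method.

```python
from itertools import combinations


def token_overlap_score(sentence_a, sentence_b):
    """Toy stand-in for a cross-encoder: Jaccard token overlap scaled to 0-5."""
    tokens_a, tokens_b = set(sentence_a.split()), set(sentence_b.split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 5.0 * len(tokens_a & tokens_b) / len(union)


def augment_in_domain(gold_pairs, score_pair, max_new_pairs):
    """Recombine gold sentences into unseen pairs and label them with score_pair.

    gold_pairs: list of (sentence_a, sentence_b, score) tuples.
    score_pair: callable playing the role of a cross-encoder trained on gold_pairs.
    """
    seen = {frozenset((a, b)) for a, b, _ in gold_pairs}
    sentences = sorted({s for a, b, _ in gold_pairs for s in (a, b)})
    silver_pairs = []
    for a, b in combinations(sentences, 2):
        if len(silver_pairs) >= max_new_pairs:
            break
        if frozenset((a, b)) in seen:
            continue  # already labeled by humans
        silver_pairs.append((a, b, score_pair(a, b)))
    # The bi-encoder (SBERT) would then be trained on gold plus silver pairs.
    return list(gold_pairs) + silver_pairs


# Made-up gold pairs with STS-style scores.
gold = [
    ("a cat sleeps", "a cat naps", 4.0),
    ("a train arrives", "a dog barks", 0.0),
]
augmented = augment_in_domain(gold, token_overlap_score, max_new_pairs=3)
for pair in augmented:
    print(pair)
```

The value of the approach comes from the cross-encoder being more accurate than the bi-encoder it teaches, so a real setup would replace the toy scorer with a trained model.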

Test datasets

Paraphrase pair detection (download)
- source: National Institute of Korean Language, 모두의 말뭉치.

from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm  # progress bar for the similarity loop below
import pandas as pd

model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
data = pd.read_csv('path/to/the/dataset')

# Encode both sentence columns into vectors
vec1 = model.encode(data['sent1'].tolist(), show_progress_bar=True, batch_size=32)
vec2 = model.encode(data['sent2'].tolist(), show_progress_bar=True, batch_size=32)

# Cosine similarity of each (sent1, sent2) pair, stored as a plain float
data['similarity'] = [
    util.cos_sim(sent1, sent2).item()
    for sent1, sent2 in tqdm(zip(vec1, vec2), total=len(data))
]

data[['paraphrase', 'similarity']].to_csv('paraphrase-results.csv')

Application for document classification

Tutorial in Google Colab: https://colab.research.google.com/drive/1S6WSjOx9h6Wh_rX1Z2UXwx9i_uHLlOiM

Model	Accuracy
KR-SBERT-Medium-NLI-STS	0.8400
KR-SBERT-V40K-NLI-STS	0.8400
KR-SBERT-V40K-NLI-augSTS	0.8511
KR-SBERT-V40K-klueNLI-augSTS	0.8628

Dependencies

sentence_transformers (https://github.com/UKPLab/sentence-transformers)

Citation

@misc{kr-sbert,
  author = {Park, Suzi and Hyopil Shin},
  title = {KR-SBERT: A Pre-trained Korean-specific Sentence-BERT model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-SBERT}}
}

kr-sbert's Issues

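Question about the deletion of the pytorch_model.bin file

Hello, I recently ran git clone on SBERT and got an error (related link) because the pytorch_model.bin file was missing, so I am writing to ask about it.

I saw a record showing that the pytorch_model.bin file was deleted around February 2022,

and I would like to know why.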

About the tokenizer

Is there a way to load the tokenizer that was trained for this model separately?
I think I need the tokenizer to handle padding, but there is no information about the pre-trained tokenizer, so I am asking here.

tokenizer=BertTokenizer("./KR-SBERT/KR-SBERT-V40K-klueNLI-augSTS/0_Transformer/vocab.txt")

I tried loading it this way just in case, but it does not tokenize properly.

sentence-transformer version conflict

The requirements currently state

We recommend Python 3.6 or higher, scikit-learn v0.23.2 or higher and sentence-transformers v0.4.1 or higher.

but in practice the model runs only with sentence-transformer v1.1.0.
(It fails both with a lower version (0.4.1) and with the latest version.)

Thank you.

Test environment
python 3.7, torch 1.10, cuda 11.4
OS: ubuntu20.04
GPU: RTX3090
