kr-sbert's Introduction

KR-SBERT

A pretrained Korean-specific Sentence-BERT model (Reimers and Gurevych 2019) developed by Computational Linguistics Lab at Seoul National University.

How to use the KR-SBERT model in Python

Usage

We recommend Python 3.6 or higher and sentence-transformers v2.2.0 or higher.

>>> from sentence_transformers import SentenceTransformer, util
>>> model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS') 
>>> sentences = ['잠이 옵니다', '졸음이 옵니다', '기차가 옵니다']
>>> vectors = model.encode(sentences) # encode sentences into vectors
>>> similarities = util.cos_sim(vectors, vectors) # compute similarity between sentence vectors
>>> print(similarities)
tensor([[1.0000, 0.6577, 0.2732],
        [0.6577, 1.0000, 0.2730],
        [0.2732, 0.2730, 1.0000]])

The output shows that the sentence '잠이 옵니다' ("I'm getting sleepy") is more similar to '졸음이 옵니다' ("I'm feeling drowsy") (cosine similarity 0.65774536) than to '기차가 옵니다' ("The train is coming") (cosine similarity 0.27321893).
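For readers who want to see what the similarity matrix contains, here is a minimal sketch of cosine similarity in plain Python. It is not the sentence-transformers implementation; the helper names `cosine_similarity` and `similarity_matrix` and the toy vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per input vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy 3-dimensional vectors (illustrative only, not real model output).
toy_vectors = [[1.0, 2.0, 0.0], [1.0, 1.5, 0.5], [-2.0, 0.5, 3.0]]
matrix = similarity_matrix(toy_vectors)
```

The diagonal of `matrix` is 1 because every vector is identical to itself, and the matrix is symmetric, which matches the shape of the tensor printed above.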

Model description

Pre-trained BERT model

A pre-trained BERT model segments a sentence into WordPiece tokens, and the contextualized output vectors of those tokens are mean-pooled into a single sentence vector. We use KR-BERT-V40K, a variant of KR-BERT.
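The mean-pooling step can be sketched in plain Python as follows. This is an illustrative sketch, not the code used by the model: the function name `mean_pool`, the toy token vectors, and the use of an attention mask to skip padding positions are assumptions for the example.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]

# Three real tokens followed by one padding token (toy 2-dimensional vectors).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)  # [3.0, 4.0]
```

Skipping padded positions keeps the sentence vector independent of how much padding a batch happens to add.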

Fine-tuning SBERT model

For fine-tuning, the two sentence vectors of a pair are fed into a classifier, so that a siamese network trains the BERT weight parameters to reflect the relation between the two sentences. We fine-tune the SBERT model on the KLUE-NLI dataset and the augmented KorSTS dataset.
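One way to picture the classifier is the classification objective described by Reimers and Gurevych (2019), where the two sentence vectors u and v are combined as (u, v, |u - v|) and passed to a softmax layer. The sketch below assumes that setup; whether KR-SBERT uses exactly this feature combination is not stated here, and the weights, bias, and vectors are hypothetical toy values.

```python
import math

def pair_features(u, v):
    """Concatenate u, v, and their element-wise absolute difference."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(u, v, weights, bias):
    """Linear layer over the pair features, followed by softmax."""
    feats = pair_features(u, v)
    logits = [
        sum(w * f for w, f in zip(row, feats)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Toy 2-dimensional sentence vectors and a hypothetical 3-class NLI head
# (for example entailment, neutral, contradiction).
u = [0.2, -0.1]
v = [0.1, 0.4]
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
bias = [0.0, 0.0, 0.0]
probs = classify(u, v, weights, bias)
```

In a siamese setup both sentences pass through the same encoder, so the gradient of the classification loss updates a single shared set of BERT weights.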

Augmenting KorSTS dataset

We augment the KorSTS dataset using the in-domain strategy proposed by Thakur et al. (2021).
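As a rough sketch of the in-domain idea: new sentence pairs are formed from sentences already in the labeled ("gold") data, a cross-encoder assigns them "silver" scores, and the gold and silver pairs are combined for training. The code below is not the authors' pipeline. `toy_cross_encoder_score` is a word-overlap placeholder standing in for a trained cross-encoder, the 0 to 5 score range is assumed, and the sketch labels every unseen pair, whereas in practice pairs are sampled selectively.

```python
import itertools

def toy_cross_encoder_score(sent_a, sent_b):
    """Placeholder scorer: Jaccard word overlap scaled to a 0-5 range."""
    a, b = set(sent_a.split()), set(sent_b.split())
    return 5.0 * len(a & b) / len(a | b)

def augment(gold_pairs, scorer):
    """Return gold pairs plus silver pairs labeled by the scorer."""
    sentences = sorted({s for a, b, _ in gold_pairs for s in (a, b)})
    seen = {frozenset((a, b)) for a, b, _ in gold_pairs}
    silver = [
        (a, b, scorer(a, b))
        for a, b in itertools.combinations(sentences, 2)
        if frozenset((a, b)) not in seen  # skip pairs that already have a gold label
    ]
    return gold_pairs + silver

# Toy gold data: (sentence 1, sentence 2, similarity score).
gold_pairs = [
    ("a man is eating", "a man is eating food", 4.0),
    ("a dog runs", "a cat sleeps", 0.5),
]
augmented = augment(gold_pairs, toy_cross_encoder_score)
```

With four distinct sentences there are six possible pairs, two of which are already gold, so `augmented` holds the two gold pairs followed by four silver pairs.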

Test datasets

from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm  # progress bar used in the loop below
import pandas as pd

model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
data = pd.read_csv('path/to/the/dataset')

# encode each column of sentences into vectors
vec1 = model.encode(data['sent1'].tolist(), show_progress_bar=True, batch_size=32)
vec2 = model.encode(data['sent2'].tolist(), show_progress_bar=True, batch_size=32)

# cosine similarity of each sentence pair, stored as a plain float
data['similarity'] = [
    util.cos_sim(sent1, sent2).item()
    for sent1, sent2 in tqdm(zip(vec1, vec2), total=len(data))
]

# save the gold paraphrase labels next to the computed similarities
data[['paraphrase', 'similarity']].to_csv('paraphrase-results.csv')

Application for document classification

Tutorial in Google Colab: https://colab.research.google.com/drive/1S6WSjOx9h6Wh_rX1Z2UXwx9i_uHLlOiM

Model                          Accuracy
KR-SBERT-Medium-NLI-STS        0.8400
KR-SBERT-V40K-NLI-STS          0.8400
KR-SBERT-V40K-NLI-augSTS       0.8511
KR-SBERT-V40K-klueNLI-augSTS   0.8628

Dependencies

Citation

@misc{kr-sbert,
  author = {Park, Suzi and Shin, Hyopil},
  title = {KR-SBERT: A Pre-trained Korean-specific Sentence-BERT model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-SBERT}}
}

kr-sbert's Issues

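Question about the deletion of pytorch_model.bin

Hello. When I recently ran git clone on SBERT, I got an error (related link) because the pytorch_model.bin file was missing, so I am writing to ask about it.

I saw a record showing that the pytorch_model.bin file was deleted around February 2022, and I would like to know the reason.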

About the tokenizer

Is there a way to load the tokenizer that was trained for this model separately?
I think I need the tokenizer to handle padding, but there is no information about the pre-trained tokenizer, so I am asking here.

tokenizer=BertTokenizer("./KR-SBERT/KR-SBERT-V40K-klueNLI-augSTS/0_Transformer/vocab.txt")

I tried loading it this way just in case, but it does not tokenize properly.

sentence-transformers version conflict

The requirements currently state

We recommend Python 3.6 or higher, scikit-learn v0.23.2 or higher and sentence-transformers v0.4.1 or higher.

but in practice the model runs only with sentence-transformers v1.1.0.
(It fails both with a lower version (0.4.1) and with the latest version.)

Thank you.

Test environment
python 3.7, torch 1.10, cuda 11.4
OS: ubuntu20.04
GPU: RTX3090
