🤣 KoBART

BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text, and the model learns to restore the original. Korean BART (hereafter KoBART) is a Korean encoder-decoder language model trained on more than 40GB of Korean text using the Text Infilling noise function from the paper. The resulting KoBART-base model is released here.
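
To make the Text Infilling noise function more concrete, here is a minimal sketch in plain Python. In the BART paper, span lengths are drawn from a Poisson distribution (λ = 3), each span is replaced by a single mask token, and a zero-length span inserts a mask without removing anything. The `<mask>` symbol, the `start_prob` parameter, and the function names are assumptions made for illustration; this is not the code used to train KoBART.

```python
import math
import random

MASK = "<mask>"  # assumed mask symbol for this sketch


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def text_infill(tokens, start_prob=0.15, lam=3.0, seed=0):
    """Replace random spans of tokens with a single MASK each.

    start_prob is a hypothetical knob: the chance that a span starts at
    the current position. Span lengths follow Poisson(lam).
    """
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span = sample_poisson(lam, rng)
            out.append(MASK)  # the whole span collapses into one mask
            if span == 0:
                # Zero-length span: a mask is inserted, nothing is removed.
                out.append(tokens[i])
                i += 1
            else:
                i += span  # skip the tokens hidden behind the mask
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    sentence = "KoBART is a Korean encoder decoder language model".split()
    print(text_infill(sentence, start_prob=0.3, seed=1))
```

Because one mask can stand for any number of tokens, the model has to predict how many tokens are missing as well as what they are.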

How to install

git clone https://github.com/SKT-AI/KoBART.git
cd KoBART
pip install -r requirements.txt
pip install .

Data

| Data | # of Sentences |
|---|---|
| Korean Wiki | 5M |
| Other corpus | 0.27B |

In addition to Korean Wikipedia, a variety of data was used for training, including news, books, the Modu Corpus (dialogue, news, ...), and Blue House National Petitions.

Tokenizer

The tokenizer was trained with the Character BPE tokenizer from the tokenizers package.

The vocabulary size is 30,000. Emoticons and emoji that are common in conversation, such as those below, were added to improve recognition of these tokens.

😀, 😁, 😆, 😅, 🤣, .. , :-), :), -), (-:...

In addition, unused tokens such as <unused0> through <unused99> are defined so that they can be freely assigned as needed for downstream subtasks.

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
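
One way to use the reserved `<unused0>` through `<unused99>` tokens is to assign each to a task-specific marker. The sketch below only builds the mapping and tags plain strings. The helper names and the marker names (`speaker_a`, `topic_sep`, and so on) are hypothetical. It does not call the KoBART tokenizer, so check with `kobart_tokenizer.tokenize` that each reserved token comes back as a single piece before relying on it.

```python
NUM_UNUSED = 100  # <unused0> ... <unused99>, as stated in this README


def assign_unused_tokens(marker_names):
    """Map hypothetical task marker names onto reserved unused tokens."""
    if len(set(marker_names)) != len(marker_names):
        raise ValueError("marker names must be unique")
    if len(marker_names) > NUM_UNUSED:
        raise ValueError(f"only {NUM_UNUSED} unused tokens are available")
    return {name: f"<unused{i}>" for i, name in enumerate(marker_names)}


def tag(markers, marker_name, text):
    """Prefix text with the reserved token assigned to marker_name."""
    return f"{markers[marker_name]} {text}"


if __name__ == "__main__":
    markers = assign_unused_tokens(["speaker_a", "speaker_b", "topic_sep"])
    print(tag(markers, "speaker_b", "안녕하세요."))
```

Keeping the mapping in one place makes it easier to use the same assignment at fine-tuning time and at inference time.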

Model

| Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|---|---|---|---|---|---|---|
| KoBART-base | 124M | Encoder | 6 | 16 | 3072 | 768 |
| | | Decoder | 6 | 16 | 3072 | 768 |

>>> from transformers import BartModel
>>> from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = BartModel.from_pretrained(get_pytorch_kobart_model())
>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<TransposeBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<TransposeBackward0>), encoder_hidden_states=None, encoder_attentions=None)
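
As a rough cross-check of the 124M figure in the model table, the parameter count can be estimated from the table values and the 30,000-token vocabulary. The sketch assumes a standard BART layout: token embeddings shared between encoder and decoder, learned position embeddings, biases on every linear layer, and a layer norm after the embeddings. The position table length of 1,026 is an assumption and is not stated in this README. The number of heads does not affect the count.

```python
VOCAB, HIDDEN, FFN, LAYERS = 30_000, 768, 3_072, 6
MAX_POSITIONS = 1_026  # assumption: length of the learned position table


def linear(n_in, n_out):
    """Weights plus bias of one dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Scale and shift vectors of one layer norm."""
    return 2 * dim


def attention(dim):
    """Query, key, value, and output projections."""
    return 4 * linear(dim, dim)


def feed_forward(dim, ffn):
    return linear(dim, ffn) + linear(ffn, dim)


# Encoder layer: self-attention + feed-forward, each followed by a layer norm.
encoder_layer = attention(HIDDEN) + feed_forward(HIDDEN, FFN) + 2 * layer_norm(HIDDEN)
# Decoder layer: self-attention + cross-attention + feed-forward, three layer norms.
decoder_layer = 2 * attention(HIDDEN) + feed_forward(HIDDEN, FFN) + 3 * layer_norm(HIDDEN)

shared_embeddings = VOCAB * HIDDEN
# Each stack has its own position table and embedding layer norm.
per_stack_extras = MAX_POSITIONS * HIDDEN + layer_norm(HIDDEN)

total = shared_embeddings + LAYERS * (encoder_layer + decoder_layer) + 2 * per_stack_extras
print(f"{total / 1e6:.1f}M parameters")  # 123.9M parameters
```

Under these assumptions the estimate is about 123.9M, which is consistent with the 124M reported in the table.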

Performances

Classification or Regression

| | NSMC (acc) | KorSTS (spearman) | Question Pair (acc) |
|---|---|---|---|
| KoBART-base | 90.07 | 81.31 | 93.80 |
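
For readers unfamiliar with the KorSTS metric, the sketch below shows how a Spearman rank correlation between predicted and gold similarity scores can be computed in plain Python. It assumes the table reports 100 × the correlation. The scores in the usage example are made up for illustration, and this is not the evaluation script behind the numbers above.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    gold = [4.8, 1.2, 3.5, 0.4, 2.9]       # made-up gold similarity scores
    predicted = [4.1, 1.9, 3.0, 0.2, 3.3]  # made-up model predictions
    print(f"spearman x 100 = {100 * spearman(gold, predicted):.2f}")
```

Because only ranks matter, the metric rewards a model for ordering sentence pairs correctly even if its raw scores are on a different scale from the gold labels.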

Summarization

To be updated.

Demos

The demo example shows the result of summarizing a ZDNET article.

Examples

If you have an interesting example that uses KoBART, please send a PR!

Contacts

Please post KoBART-related issues on the repository's issue tracker.

License

KoBART is released under a modified MIT license. If you use the model or code, please comply with the license terms. The full license text is in the LICENSE file.
