sktbrain / kobert

Korean BERT pre-trained cased (KoBERT)

License: Apache License 2.0

Python 46.67% Jupyter Notebook 53.33%

korean-nlp language-model bert nlp pytorch transformers

kobert's Introduction

KoBERT

Korean BERT pre-trained cased (KoBERT)

Why?

Google's BERT base multilingual cased has limited performance on Korean.

Training Environment

Architecture

predefined_args = {
        'attention_cell': 'multi_head',
        'num_layers': 12,
        'units': 768,
        'hidden_size': 3072,
        'max_length': 512,
        'num_heads': 12,
        'scaled': True,
        'dropout': 0.1,
        'use_residual': True,
        'embed_size': 768,
        'embed_dropout': 0.1,
        'token_type_vocab_size': 2,
        'word_embed': None,
    }

Training set

Data	Sentences	Words
Korean Wikipedia	5M	54M

Training environment
- V100 GPU x 32, Horovod (with InfiniBand)

Vocabulary
- Size: 8,002
- SentencePiece tokenizer trained on Korean Wikipedia
- Fewer parameters (92M < 110M)
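The 92M versus 110M comparison can be checked with a rough parameter count. The sketch below is an assumption-laden estimate, not the authors' own calculation: the helper name `bert_param_count` is hypothetical, it counts a standard BERT encoder with a pooler and no task heads, and it uses 30,522 (the vocabulary size of the English BERT base config) for the larger model.

```python
def bert_param_count(vocab_size, units=768, hidden_size=3072, num_layers=12,
                     max_length=512, token_type_vocab_size=2):
    """Count parameters of a BERT encoder with a pooler (no task heads)."""
    layer_norm = 2 * units  # scale and bias
    # word, position and token-type embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_length + token_type_vocab_size) * units + layer_norm
    # query, key, value and output projections, each with a bias
    attention = 4 * (units * units + units)
    # two dense layers: units -> hidden_size -> units
    feed_forward = (units * hidden_size + hidden_size) + (hidden_size * units + units)
    encoder_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = units * units + units
    return embeddings + num_layers * encoder_layer + pooler


kobert = bert_param_count(vocab_size=8002)
bert_base = bert_param_count(vocab_size=30522)  # assumed comparison vocabulary
print(f"KoBERT: {kobert / 1e6:.1f}M, BERT base: {bert_base / 1e6:.1f}M")
```

Under these assumptions the only difference between the two counts is the word embedding table, which is where the smaller vocabulary saves parameters.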

Requirements

see requirements.txt

How to install

Install KoBERT as a python package

pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

If you want to modify source codes, please clone this repository

git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt

How to use

Using with PyTorch

If you prefer the Huggingface transformers API, please refer to here.

>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab  = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)

The model is returned in eval() mode by default, so call model.train() to switch to training mode before using it for training.

Naver Sentiment Analysis Fine-Tuning with pytorch
- In Colab, using a GPU is recommended: [Runtime] - [Change runtime type] - Hardware accelerator (GPU).

Using with ONNX

>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids':np.array(input_ids),
...                             'token_type_ids':np.array(token_type_ids),
...                             'input_mask':np.array(input_mask),
...                             'position_ids':np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452,  0.24282141,  0.25895312, ..., -0.48613444,
        -0.07305173,  0.07560554],
       [-0.24783179,  0.24200465,  0.25520486, ..., -0.4877185 ,
        -0.0727044 ,  0.07536091],
       [-0.24721591,  0.24196623,  0.2560626 , ..., -0.48743123,
        -0.07326943,  0.07650235]], dtype=float32)

soeque1 helped with the ONNX conversion.

Using with MXNet-Gluon

>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372  0.24282135  0.2589539  ... -0.48613444 -0.07305248
   0.07560539]
 [-0.24783105  0.242005    0.25520545 ... -0.48771808 -0.07270523
   0.07536077]
 [-0.24721491  0.241966    0.25606337 ... -0.48743105 -0.07327032
   0.07650219]]
<NDArray 3x768 @cpu(0)>

Naver Sentiment Analysis Fine-Tuning with MXNet

Tokenizer

Pretrained Sentencepiece tokenizer

>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer
>>> tok_path = get_tokenizer()
>>> sp  = SentencepieceTokenizer(tok_path)
>>> sp('한국어 모델을 공유합니다.')
['▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.']

Subtasks

Naver Sentiment Analysis

Dataset : https://github.com/e9t/nsmc

Model	Accuracy
BERT base multilingual cased	0.875
KoBERT	0.901
KoGPT2	0.899

Korean named entity recognizer built with KoBERT and CRF

https://github.com/eagle705/pytorch-bert-crf-ner

문장을 입력하세요:  SKTBrain에서 KoBERT 모델을 공개해준 덕분에 BERT-CRF 기반 객체명인식기를 쉽게 개발할 수 있었다.
len: 40, input_token:['[CLS]', '▁SK', 'T', 'B', 'ra', 'in', '에서', '▁K', 'o', 'B', 'ER', 'T', '▁모델', '을', '▁공개', '해', '준', '▁덕분에', '▁B', 'ER', 'T', '-', 'C', 'R', 'F', '▁기반', '▁', '객', '체', '명', '인', '식', '기를', '▁쉽게', '▁개발', '할', '▁수', '▁있었다', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] <SKTBrain:ORG>에서 <KoBERT:POH> 모델을 공개해준 덕분에 <BERT-CRF:POH> 기반 객체명인식기를 쉽게 개발할 수 있었다.[SEP]

Korean Sentence BERT

https://github.com/BM-K/KoSentenceBERT-SKT

Model	Cosine Pearson	Cosine Spearman	Euclidean Pearson	Euclidean Spearman	Manhattan Pearson	Manhattan Spearman	Dot Pearson	Dot Spearman
NLI	65.05	68.48	68.81	68.18	68.90	68.20	65.22	66.81
STS	80.42	79.64	77.93	77.43	77.92	77.44	76.56	75.83
STS + NLI	78.81	78.47	77.68	77.78	77.71	77.83	75.75	75.22

Release

v0.2.3
- support onnx 1.8.0
v0.2.2
- fix No module named 'kobert.utils'
v0.2.1
- guide default 'import statements'
v0.2
- download large files from aws s3
- rename functions
v0.1.2
- Guaranteed compatibility with higher versions of transformers
- fix pad token index id
v0.1.1
- merge the vocabulary and the tokenizer
v0.1
- initial model release

Contacts

Please post KoBERT-related issues here.

License

KoBERT is released under the Apache-2.0 license. Please comply with the license terms when using the model and code. The full license text is available in the LICENSE file.

kobert's Issues

Loading the pre-trained model fails with transformers==3.2.0.

Hello. I am filing this issue because model loading fails depending on the transformers version.

(transformers==3.0.2)

Confirmed to work correctly.

import torch
from kobert.pytorch_kobert import get_pytorch_kobert_model

input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
model, vocab  = get_pytorch_kobert_model()

sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
pooled_output.shape  # torch.Size([2, 768])

(transformers==3.2.0)

The same code fails to load the model.

~/pyenv/versions/3.6.9/envs/envs/lib/python3.6/site-packages/kobert/pytorch_kobert.py in get_kobert_model(model_file, vocab_file, ctx)
     67 def get_kobert_model(model_file, vocab_file, ctx="cpu"):
     68     bertmodel = BertModel(config=BertConfig.from_dict(bert_config))
---> 69     bertmodel.load_state_dict(torch.load(model_file))
     70     device = torch.device(ctx)
     71     bertmodel.to(device)

~/pyenv/versions/3.6.9/envs/envs/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1043         if len(error_msgs) > 0:
   1044             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1045                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1046         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1047 

RuntimeError: Error(s) in loading state_dict for BertModel:
        Missing key(s) in state_dict: "embeddings.position_ids".

Question about using the pytorch_pretrained_bert library

I forked KoBERT to use it for a project and ran into a question while working on it.

pytorch_kobert.py, which loads the KoBERT model in PyTorch, contains

from pytorch_pretrained_bert import BertModel, BertConfig

but the pytorch_pretrained_bert library has not been updated since April 2019
and appears to have been merged into the transformers library.

https://github.com/huggingface/transformers

In my own fork I used transformers, because the features I needed
were easier to find there than in pytorch_pretrained_bert.
Is there a reason to keep using the pytorch_pretrained_bert library?

If not, I would like to suggest using the merged transformers package
instead of pytorch_pretrained_bert.

from transformers import BertModel, BertConfig

Thank you.

I want to do sentence pair classification but cannot proceed

Hello. I got stuck while working on this, so I am asking for help.

I am currently training KoBERT further on similar-sentence data, modifying the Naver movie review code little by little.

The data for training on similar sentences is in the format [[sent1, sent2, label], [...], ...].

Example:
['글쎄, 나는 그것에 관해 생각조차 하지 않았지만, 나는 너무 좌절했고, 결국 그에게 다시 이야기하게 되었다.', '나는 그와 다시 이야기하지 않았다.', '0']

As I understand it, to train on sentence pairs with the code below, the final pair argument must be True instead of False, but when I set it that way I get a Python assertion error.

[data_train = BERTDataset(data_list, 0, 1,2, tok, max_len, True, False)]

Is training on similar sentences not possible?

Can the Korean BERT training data be shared?

Hello,
Thank you for providing Korean BERT.
I would like to study Korean BERT further, so I am writing to ask whether you could share the data used to train it.

Thank you.

koBERT import Error

With torch-1.8.1+cu101 or later, the line
from kobert.pytorch_kobert import get_pytorch_kobert_model raises the following error:

/usr/local/lib/python3.7/dist-packages/transformers/trainer_pt_utils.py in <module>()
     38     SAVE_STATE_WARNING = ""
     39 else:
---> 40     from torch.optim.lr_scheduler import SAVE_STATE_WARNING
     41 
     42 logger = logging.get_logger(__name__)

ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py)

It would be good to pin the torch version in requirements.txt to torch==1.7.

Tokenizer issue with special tokens such as [SEP] and [CLS]

> tokenizer('[CLS] 감사합니다. [SEP]')
['▁[', 'C', 'LS', ']', '▁감사', '합니다', '.', '▁[', 'S', 'E', 'P', ']']

For now, this has to be worked around as follows:

> ['[CLS]', ] + tokenizer('감사합니다. ') + ['[SEP]', ]

The existing tokenizer model has to be modified by editing its protobuf and registered again, as described in the following issues:

google/sentencepiece#426
google/sentencepiece#306
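The workaround above can be wrapped in a small helper so that the special tokens are never passed through the tokenizer. This is a minimal sketch: `add_special_tokens` and `whitespace_tokenize` are hypothetical names, and the whitespace tokenizer is only a stand-in for the SentencePiece tokenizer so the example runs on its own.

```python
def add_special_tokens(tokenize, text, cls_token="[CLS]", sep_token="[SEP]"):
    """Tokenize plain text, then attach the special tokens as whole tokens."""
    return [cls_token] + list(tokenize(text)) + [sep_token]


def whitespace_tokenize(text):
    # Stand-in tokenizer for illustration only; pass the real tokenizer in practice.
    return text.split()


tokens = add_special_tokens(whitespace_tokenize, "감사합니다 .")
print(tokens)
```

Because the special tokens are added after tokenization, they cannot be split into pieces such as '▁[', 'C', 'LS', ']'.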

fine-tuning details?

First of all, thank you for releasing such a good model.

I fine-tuned KoBERT on KorQuAD 1.0 and measured its performance,

but the results are lower than I expected (F1 is around 70), so I have two questions for the authors.

Do you have any results measured on KorQuAD 1.0?
I am using the run_squad code from the pytorch_pretrained_bert repo released by Hugging Face, with the model and tokenizer replaced by the pretrained model and tokenizer provided by KoBERT, modified as follows.

bert, vocab = get_pytorch_kobert_model()
config = bert.config
model = BertForQuestionAnswering(config)
model.bert = bert
tok = get_tokenizer()
tokenizer = nlp.data.BERTSPTokenizer(tok, vocab)

I would like to know whether there are other details to consider beyond this implementation, and to hear your thoughts on the performance.

Config of KoBERT

Hi,

Thanks for releasing the model. I want to ask how I can get the config file of KoBert. For example, the config of BERT is like:

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.5.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Error when getting model because of Transformers version

Symptom
- After installing as described in the install section of README.md, running the code below raises an error

import torch
from kobert.pytorch_kobert import get_pytorch_kobert_model
model, vocab  = get_pytorch_kobert_model()

Traceback (most recent call last):
  File "pp.py", line 3, in <module>
    model, vocab  = get_pytorch_kobert_model()
  File "/home/jjlee/KoBERT/kobert/pytorch_kobert.py", line 64, in get_pytorch_kobert_model
    return get_kobert_model(model_path, vocab_path, ctx)
  File "/home/jjlee/KoBERT/kobert/pytorch_kobert.py", line 69, in get_kobert_model
    bertmodel.load_state_dict(torch.load(model_file))
  File "/home/jjlee/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BertModel:
        Missing key(s) in state_dict: "embeddings.position_ids".

Cause
- Installing from requirements.txt installs the latest version of each package, and the problem appears to come from a difference in the transformers version

Training dataset size

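Hello. I have been making great use of the model you released.

I need information about the dataset in order to compare with other language models, so I am posting this question.

Could you tell me the size of the training dataset (roughly how many GB)?

Thank you.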

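How do I actually apply KoBERT?

I have downloaded the source
and am trying to run it in Colab.

I do not know how to use the downloaded source,
or how to train the model.

Could you tell me how to feed in my input data and how to get the results?

Also,
what kind of results does this model produce?
English -> Korean translation?
What is it used for?

From a blog post, is it this?
2) Next Sentence Prediction (NSP)
Given two sentences, predict whether the second sentence immediately follows the first in the text.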

Using a separately trained tokenizer

Can the sentences fed into KoBERT be tokenized with a separately trained SentencePiece tokenizer instead of the tokenizer provided with KoBERT?

When I tried this, an error occurred partway through.

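Hello, thanks for the great work.

Hello. Thank you very much for developing such an excellent model.

I would like to classify every sentence in one column of an Excel file

as positive or negative and get the results as an Excel file. Which function should I use for this?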

multi label text classification

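Hello,

Thank you very much for sharing this great resource.

I have a question: training does not seem to go well when I feed in sentiment analysis data with several class labels.

Is there anything in the code I need to change besides the number of labels?

Thank you very much!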

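SentencePiece error with Hangul initial/medial jamo

Hello,

When I load the tokenizer with tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

and call tokenizer.tokenize( ) on the sentence '한국어로는 안돼??ㅋ', I get the following output.

However, the last token is a different character from the Hangul initial consonant 'ㅋ' (it is a small 'ㅋ').

In fact, tokenizer.convert_tokens_to_ids('ㅋ') returns 0, the unknown token id, while passing the small 'ㅋ' returns a proper id.

Because of this, an error occurs in the following situation.

This happens with medial vowels as well as initial consonants.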
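One possible explanation, stated here as an assumption and not confirmed in this thread, is Unicode normalization: a standalone consonant typed on a keyboard is a Hangul Compatibility Jamo (U+314B), and NFKC normalization, which SentencePiece models commonly apply, maps it to the conjoining Hangul Jamo (U+110F) that looks like a smaller letter. The sketch below only shows the normalization step with the standard library.

```python
import unicodedata

# Compatibility jamo: what a keyboard produces for a lone consonant or vowel.
compat_consonant = "\u314b"  # HANGUL LETTER KHIEUKH
compat_vowel = "\u314f"      # HANGUL LETTER A

normalized_consonant = unicodedata.normalize("NFKC", compat_consonant)
normalized_vowel = unicodedata.normalize("NFKC", compat_vowel)

for before, after in [(compat_consonant, normalized_consonant),
                      (compat_vowel, normalized_vowel)]:
    print(f"U+{ord(before):04X} -> U+{ord(after):04X}")
```

If this is indeed the cause, applying the same normalization to a token before looking up its id would keep the lookup consistent with what the tokenizer emits.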

Wrong token_type_id (on Huggingface porting)

from kobert_tokenizer import KoBERTTokenizer
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
tokenizer([['나 보기가 역겨워', '김소월']])
{'input_ids': [[2, 1370, 2362, 5330, 3322, 5411, 7018, 3, 1316, 6607, 7028, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

token_type_ids should be 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]]
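The expected segment ids can be derived from the input ids alone. The following is a minimal sketch, not the tokenizer's actual implementation: `build_token_type_ids` is a hypothetical helper, and it assumes [SEP] has id 3 as in the ids shown above.

```python
def build_token_type_ids(input_ids, sep_token_id=3):
    """Assign segment 0 up to and including the first [SEP], segment 1 afterwards."""
    token_type_ids = []
    segment = 0
    for token_id in input_ids:
        token_type_ids.append(segment)
        # Switch to the second segment only after the first [SEP] has been emitted.
        if token_id == sep_token_id and segment == 0:
            segment = 1
    return token_type_ids


input_ids = [2, 1370, 2362, 5330, 3322, 5411, 7018, 3, 1316, 6607, 7028, 3]
token_type_ids = build_token_type_ids(input_ids)
print(token_type_ids)
```

Capping the segment at 1 also keeps every id inside the two-entry token type embedding table.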

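Merge the vocabulary and the SentencePiece tokenizer

The tokens of the SentencePiece tokenizer currently differ from the vocabulary used to train KoBERT, so a separate vocabulary is provided.

The SentencePiece tokens and the KoBERT training tokens should be made consistent so that the extra file is no longer needed.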

Issue about padding token.

Hello. First of all, thank you for providing such good source code!

I have a question that came up while using it.

It is about the padding token setting.

The model summary shows padding_idx set to 0 in the word embedding, but in the provided vocabulary [PAD] is set to 1.

model summary


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(8002, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
...

vocab

{'[UNK]': 0,
 '[CLS]': 2,
 '[SEP]': 3,
 '[MASK]': 4,
 '[PAD]': 1,
 '!': 5,
 "!'": 6,
 '!”': 7,
...

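The BERT model takes attention masks alongside the input tokens so that attention is only given to non-padding positions, so I assumed this would not matter much. However, the outputs differ as shown below, which is why I am asking.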

input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 1]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 0], [0, 0, 0]])
sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
sequence_output

tensor([[[-0.2365, 0.2418, 0.2133, ..., -0.5110, -0.0360, 0.0815],
[-0.2378, 0.2405, 0.2103, ..., -0.5120, -0.0353, 0.0812],
[-0.2377, 0.2407, 0.2104, ..., -0.5113, -0.0356, 0.0817]],
...[[-0.0930, -0.4689, -0.0698, ..., -0.1506, -0.2966, -0.1554],
[-0.1028, -0.4727, -0.0740, ..., -0.1520, -0.2981, -0.1605],
[-0.0929, -0.4689, -0.0699, ..., -0.1507, -0.2966, -0.1554]]],
grad_fn=)

input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 0], [0, 0, 0]])
sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
sequence_output

tensor([[[-0.2365, 0.2418, 0.2133, ..., -0.5110, -0.0360, 0.0815],
[-0.2378, 0.2405, 0.2103, ..., -0.5120, -0.0353, 0.0812],
[-0.2377, 0.2407, 0.2104, ..., -0.5113, -0.0356, 0.0817]],
... [[-0.0930, -0.4689, -0.0698, ..., -0.1506, -0.2966, -0.1554],
[-0.1028, -0.4727, -0.0740, ..., -0.1520, -0.2981, -0.1605],
[ 0.1585, -0.1319, 0.2887, ..., 0.2637, -0.2277, 0.2331]]],
grad_fn=)

please modify requirement.txt

In the How to use PyTorch code in README.md,

sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)

sequence_output and pooled_output come out as str rather than tensors with a torch.Size, and the following steps cannot proceed.

The error is caused by transformers 4.2, so in requirement.txt please change

transformers>=3.5

to

transformers<4

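Restoring spaces (underscore)

Hello, I am a student using KoBERT.
I ran into a problem restoring the original text while using KoBERT, so I am filing this issue.

I understand that an underscore is placed at the start of every word, but I do not know how to handle text that already contains an underscore. For example, if I feed the input
"조선일보_홍길동 기자"
to the tokenizer, it is tokenized as
"_조선" "일보" "_" "홍" "길" "동" "_" "기자"
Of the two tokens that consist only of an underscore, there is no way to tell which one is a literal underscore and which one is the symbol marking the start of a word. Is there a way to restore the original text?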
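SentencePiece marks word boundaries with the character U+2581 ("▁"), which only looks like the ASCII underscore "_" (U+005F). Assuming the tokens in this case carry U+2581 for boundaries, as SentencePiece output normally does, the two can be told apart by code point. The helper name `detokenize` below is hypothetical.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker, not ASCII "_"


def detokenize(tokens):
    """Join pieces and turn boundary markers back into spaces."""
    text = "".join(tokens).replace(SPIECE_UNDERLINE, " ")
    return text.lstrip(" ")  # the first piece starts with a marker


tokens = [SPIECE_UNDERLINE + "조선", "일보", "_", "홍", "길", "동",
          SPIECE_UNDERLINE, "기자"]
restored = detokenize(tokens)
print(restored)
print([f"U+{ord(ch):04X}" for ch in (SPIECE_UNDERLINE, "_")])
```

The literal underscore survives untouched because only U+2581 is replaced.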

Is it possible to get the attention for each token?

Or do I need to add a separate layer?

Similarity analysis with KoBERT

Hello.

I would like to analyze similarity between sentences using the embeddings already built into KoBERT.
Since I cannot use the internet, I load the model and vocab as follows.

model, vocab = get_kobert_model(u'...\KoBERT\pytorch_kobert_2439f391a6.params',
                               u'...\KoBERT\kobert_news_wiki_ko_cased-1087f8699e.spiece',
                               'cpu')

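Even after looking at the classification example, I could not find the part that turns a given sentence into a vector, so I am posting here.
Is there a way to vectorize sentences and compare their similarity?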
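One common approach, sketched here under assumptions and not taken from the KoBERT code, is to average the per-token vectors of the last layer over the non-padding positions and compare sentences by cosine similarity. The toy 2-dimensional vectors below stand in for the 768-dimensional model outputs, and both helper names are hypothetical.

```python
import math


def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: the third vector of the first sentence is padding and is masked out.
sent_a = mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], mask=[1, 1, 0])
sent_b = mean_pool([[2.0, 2.0]], mask=[1])
similarity = cosine_similarity(sent_a, sent_b)
print(sent_a, sent_b, similarity)
```

The pooled_output of the model could be used as the sentence vector instead of the mean; which works better for a given task is an empirical question.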

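Question about running inference on new sentences

Hello. Thanks to your KoBERT implementation I have been trying out various things.
I reached high accuracy on the NSMC data and would like to test the model on new sentences.
I think I just need to pass arguments matching the forward function of the BERTClassifier class. Is that right?
In particular, valid_length is hard to build. Can it only be created by the data loader?

Thank you.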
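The valid_length value is simply the number of real tokens before padding, so it can be built without a data loader. The sketch below is an assumption-based illustration: `prepare_inputs` is a hypothetical helper, the token ids are made up, and the padding id 1 follows the [PAD] entry of the KoBERT vocabulary.

```python
def prepare_inputs(token_ids, max_len, pad_id=1):
    """Pad one tokenized sentence and derive valid_length, mask and segment ids."""
    token_ids = list(token_ids)[:max_len]  # naive truncation, for illustration
    valid_length = len(token_ids)
    padding = max_len - valid_length
    padded = token_ids + [pad_id] * padding
    attention_mask = [1] * valid_length + [0] * padding
    segment_ids = [0] * max_len  # single-sentence input uses segment 0 only
    return padded, valid_length, attention_mask, segment_ids


# Made-up ids for "[CLS] token token [SEP]".
padded, valid_length, attention_mask, segment_ids = prepare_inputs(
    [2, 1370, 2362, 3], max_len=6)
print(padded, valid_length, attention_mask, segment_ids)
```

Converting these lists to tensors of shape (1, max_len), with valid_length as a one-element tensor, gives inputs in the same layout the data loader produces for a batch of one.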

Update the example code to match the transformers library

Hello :)

The kobert library recently switched from pytorch_pretrained_bert to transformers, and following the README example no longer gives the same results, so I looked into it.

Change in forward argument order

pytorch_pretrained_bert: input_ids, token_type_ids, input_mask
transformers: input_ids, input_mask, token_type_ids

https://github.com/huggingface/transformers/blob/73028c5df0c28ca179fbe565482a9c2143787f61/src/transformers/modeling_bert.py#L636-L646

Change in return values
pytorch_pretrained_bert returned the values of all 12 layers, whereas transformers returns only the value of the last layer.

https://github.com/huggingface/transformers/blob/73028c5df0c28ca179fbe565482a9c2143787f61/src/transformers/modeling_bert.py#L648-L671

I changed these two parts and confirmed that the results match the existing README.md example.

I have opened a PR for this :)

Thank you!
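Passing the mask and segment ids by keyword avoids depending on positional order. The two functions below are hypothetical stand-ins that only mimic the two argument orders described in this issue; they are not the libraries' real forward methods, although both libraries do name these parameters attention_mask and token_type_ids.

```python
def old_forward(input_ids, token_type_ids=None, attention_mask=None):
    # Stand-in for the pytorch_pretrained_bert argument order.
    return {"input_ids": input_ids, "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


def new_forward(input_ids, attention_mask=None, token_type_ids=None):
    # Stand-in for the transformers argument order.
    return {"input_ids": input_ids, "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


input_ids = [[31, 51, 99], [15, 5, 0]]
input_mask = [[1, 1, 1], [1, 1, 0]]
token_type_ids = [[0, 0, 1], [0, 1, 0]]

# Keyword arguments bind the same way regardless of positional order.
old_result = old_forward(input_ids, attention_mask=input_mask,
                         token_type_ids=token_type_ids)
new_result = new_forward(input_ids, attention_mask=input_mask,
                         token_type_ids=token_type_ids)

# Positional arguments silently swap the mask and the segment ids.
positional_mismatch = (old_forward(input_ids, input_mask, token_type_ids)
                       != new_forward(input_ids, input_mask, token_type_ids))
print(old_result == new_result, positional_mismatch)
```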

Tokenizing result doesn't look good.

Here is my code.

import torch
from kobert.pytorch_kobert import get_pytorch_kobert_model
import numpy as np
import pandas as pd

from gluonnlp.data import SentencepieceTokenizer
from kobert.utils import get_tokenizer


tok_path = get_tokenizer()
sp  = SentencepieceTokenizer(tok_path)

print(train['중식메뉴_processed'][0])
--output: 쌀밥/잡곡밥 오징어찌개 쇠불고기 계란찜 청포묵무침

print(sp(train['중식메뉴_processed'][0]))
--output: ['▁', '쌀', '밥', '/', '잡', '곡', '밥', '▁오', '징', '어', '찌', '개', '▁', 
'쇠', '불', '고', '기', '▁계', '란', '찜', '▁청', '포', '묵', '무', '침']

It doesn't look great.
Any ideas for improving the result?

Thanks.

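Hello, thanks for the great work.

Hello. Thanks for the great work.

I am still a beginner with Python. Do I have to run every cell of the provided Colab notebook one by one?

Also, aren't all the variable values in Colab cleared some time after they are written?

And if I work in Jupyter Notebook or similar, could you give an example of the order in which to do things?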

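Question about using KoBERT

Hello.
I confirmed that KoBERT produces a contextual embedding for each token in a sentence
(768 dimensions per token for the base model).
Can it also be used as an LM, like Google's BertForMaskedLM, that takes a masked sentence and predicts the masked word over the whole vocab?
KoBERT seems to output only the 768-dimensional hidden states, so I am asking whether the LM functionality is also provided.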

Cannot use the Huggingface model

Hello. I am trying to use the KoBERT model through the Huggingface API, but I get the errors below.

from transformers import BertModel
model = BertModel.from_pretrained('skt/kobert-base-v1')
OSError: Can't load config for 'skt/kobert-base-v1'. Make sure that:

'skt/kobert-base-v1' is a correct model identifier listed on 'https://huggingface.co/models'
or 'skt/kobert-base-v1' is the correct path to a directory containing a config.json file

from kobert_tokenizer import KoBERTTokenizer
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
OSError: Model name 'skt/kobert-base-v1' was not found in tokenizers model name list (xlnet-base-cased, xlnet-large-cased). We assumed 'skt/kobert-base-v1' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

If anyone has run into the same error, I would appreciate any tips.

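Can I build a separate vocab to use for fine-tuning?

Hello! I am working on a grammar error correction task with KoBERT.
Many of the words in my data are not in the vocab of the released model, which lowers recall considerably, so I would like to build a vocab from my own data.
After splitting the text into morphemes with the provided tokenizer,
I fed the result into the gluonnlp counter class to build a counter, and found that
it splits the input again into syllables (the morpheme strings are split into syllables).
I would like to build the counter from the morpheme strings as they are and then build the vocab. Could you tell me how SKTBrain built the vocab?

Thank you for reading this long question.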

KoBERT Pre-Training Procedure

You've already provided some information on how you pre-trained KoBERT in the README.md, which is great. Would you mind also sharing how many training steps you trained for and what batch sizes and sequence lengths you used?

Thanks!

Issue on train naver_review_classifications_pytorch_kobert.ipynb

When running the naver_review_classifications_pytorch_kobert.ipynb example, I got the error below.
In the last section, the line out = model(token_ids, valid_length, segment_ids) raises the error.
I suspect a deprecated API is being used somewhere.
Does anyone have information that could help solve this error?

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:5: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  """
0%
0/2344 [00:00<?, ?it/s]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-e6a38b13095b> in <module>()
      9         valid_length= valid_length
     10         label = label.long().to(device)
---> 11         out = model(token_ids, valid_length, segment_ids)
     12         loss = loss_fn(out, label)
     13         loss.backward()

4 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in dropout(input, p, training, inplace)
    981     return (_VF.dropout_(input, p, training)
    982             if inplace
--> 983             else _VF.dropout(input, p, training))
    984 
    985 

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

Here is the link for reproducing error.

Could you share how the pre-training was done?

Could you share how the pre-training was done?
If possible, could I get it at the source-code level?

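Question about pretraining time

First of all, thank you for sharing such great material.
I understand that 32 V100 GPUs were used for pretraining.
I would like to know roughly how long it takes to train on 5 million sentences in that environment.
I would also like to know how many epochs were used.

Thank you.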

Is there any published paper describing your work?

Hello,

First of all, thank you for publishing this repository. Is there any published paper describing your work? I mean a paper in some journal or conference proceedings. This information would help to understand your work a lot.

Thanks in advance!

Conversion

Can a program built with Google's multilingual BERT
be converted to KoBERT?

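[Colab] Error in the KoBERT PyTorch sentiment analysis example

Hello, I am a student studying NLP.

You may already know this, but the last cell fails when running the KoBERT PyTorch example.

It may be a library version conflict, but an error occurs somewhere. The error screen is shown below.

Thank you.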

KeyError: <class 'numpy.str_'> when running naver_review_classification_gluon_kobert.ipynb


for epoch_id in range(num_epochs):
    metric.reset()
    step_loss = 0
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(train_dataloader):
        if step_num < num_warmup_steps:
            new_lr = lr * step_num / num_warmup_steps
        else:
            non_warmup_steps = step_num - num_warmup_steps
            offset = non_warmup_steps / (num_train_steps - num_warmup_steps)
            new_lr = lr - offset * lr
        trainer.set_learning_rate(new_lr)
        with mx.autograd.record():
            # load data to GPU
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            label = label.as_in_context(ctx)

            # forward computation
            out = model(token_ids, segment_ids, valid_length.astype('float32'))
            ls = loss_function(out, label).mean()

        # backward computation
        ls.backward()
        if not accumulate or (batch_id + 1) % accumulate == 0:
          trainer.allreduce_grads()
          nlp.utils.clip_grad_global_norm(params, 1)
          trainer.update(accumulate if accumulate else 1)
          step_num += 1
          if accumulate and accumulate > 1:
              # set grad to zero for gradient accumulation
              all_model_params.zero_grad()

        step_loss += ls.asscalar()
        metric.update([label], [out])
        if (batch_id + 1) % (50) == 0:
            print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.10f}, acc={:.3f}'
                         .format(epoch_id + 1, batch_id + 1, len(train_dataloader),
                                 step_loss / log_interval,
                                 trainer.learning_rate, metric.get()[1]))
            step_loss = 0
    test_acc = evaluate_accuracy(model, test_dataloader, ctx)
    print('Test Acc : {}'.format(test_acc))

Running the last line produces the error below.
It looks like a data type problem when passing data through numpy or similar. How can I fix it?

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-27-d901d94ff4fc> in <module>()
      2     metric.reset()
      3     step_loss = 0
----> 4     for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(train_dataloader):
      5         if step_num < num_warmup_steps:
      6             new_lr = lr * step_num / num_warmup_steps

1 frames
/usr/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

KeyError: <class 'numpy.str_'>

At the final training step...

Question 1)
In the code that runs the final epoch loop,
out = model(token_ids, valid_length, segment_ids)
raises an error.

Error: dropout(): argument 'input' (position 1) must be Tensor, not str

Question 2)
In the PyTorch setup part,

import torch
from kobert.pytorch_kobert import get_pytorch_kobert_model
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
model, vocab = get_pytorch_kobert_model()
sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
pooled_output.shape
torch.Size([2, 768])
vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")

Last Encoding Layer

sequence_output[0]
tensor([[-0.2461, 0.2428, 0.2590, ..., -0.4861, -0.0731, 0.0756],
[-0.2478, 0.2420, 0.2552, ..., -0.4877, -0.0727, 0.0754],
[-0.2472, 0.2420, 0.2561, ..., -0.4874, -0.0733, 0.0765]],
grad_fn=)

These values do not appear in the Colab code.
Do I need to set them when running in Colab?

Question about using KoBERT for SQuAD

Hello.
I looked at the site you pointed me to and have a question.
I get the same error as when I ran Hugging Face's run_squad.

That site passes the model inputs as
model(input_ids, segment_ids, input_mask, start_positions, end_positions)

whereas the KoBERT model takes its inputs as
model(input_ids, token_type_ids, input_mask)

Given this difference in inputs, could you advise me on how to make them match?

Error in the last part of Using with PyTorch

Using with PyTorch
When I run the code above in Colab,
the last block fails at

    out = model(token_ids, valid_length, segment_ids) # an error occurs here, saying the value must be a tensor

Code
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step() # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    model.eval()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))

================== log
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:5: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
"""

TypeError Traceback (most recent call last)
in ()
9 valid_length= valid_length
10 label = label.long().to(device)
---> 11 out = model(token_ids, valid_length, segment_ids)
12 loss = loss_fn(out, label)
13 loss.backward()

4 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in dropout(input, p, training, inplace)
981 return (VF.dropout(input, p, training)
982 if inplace
--> 983 else _VF.dropout(input, p, training))
984
985

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str
0%
0/2344 [00:00<?, ?it/s]

Error in the Naver movie review classification Colab code

Hello, I ran into an error while running the example code.
While running naver_review_classifications_gluon_bert.ipynb,

bert_base, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False, ctx=ctx)
raises the following error.

TypeError Traceback (most recent call last)
in ()
----> 1 bert_base, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False, ctx=ctx)

1 frames
/usr/local/lib/python3.6/dist-packages/kobert/mxnet_kobert.py in get_kobert_model(model_file, vocab_file, use_pooler, use_decoder, use_classifier, ctx)
90 output_attention=False,
91 output_all_encodings=False,
---> 92 use_residual=predefined_args['use_residual'])
93
94 # BERT

TypeError: __init__() got an unexpected keyword argument 'attention_cell'

Could you tell me how to fix this?

Inquiry about plans for KoBERTSUM

I have been making good use of KoBERT. Thank you.
In document summarization, BertSumExt and BertSumExtAbs rank first and second, but using these models requires BERT to be built in BERTSUM form, so I cautiously hope that a Korean version of BERTSUM will be released.
I would like to ask whether a KoBERTSUM could be built and released.
(Reference) https://arxiv.org/pdf/1908.08345.pdf (BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green color) to distinguish multiple sentences.)

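Changing the base model layers

Hello,

First, thank you for releasing such a good model.

KoBERT has a model structure of [Attention-Feed_Forward] x 12 (add-norm omitted).
I would like to ask whether the structure can be changed by adding, for example, an RNN at the end of each layer, giving [Attention-Feed_Forward-RNN] x 12, and if so, how.

Thank you.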

Trying to solve SQuAD with KoBERT

I am trying to solve a problem in the SQuAD dataset format with KoBERT, but I do not know how to train the model. Could you give me some help?

I tried to train it by combining it with Hugging Face's run_squad, but failed.

Different results when applying the transform to a plain string

Hello, first of all, thank you very much for sharing the model.

In the PyTorch example code for the Naver movie review dataset,

the data is loaded from txt with gluonnlp.data.TSVDataset and transformed with gluonnlp.data.BERTSentenceTransform,

and tokenization works correctly.

If I apply gluonnlp.data.BERTSentenceTransform to a plain string without using gluonnlp.data.TSVDataset, the returned values are different.

(In general, the number of valid tokens in input_id and the valid_length are lower.)

For example, for the first training instance, "아 더빙.. 진짜 짜증나네요 목소리",

valid_length is 15 when following the original example, but

when applied directly as transform('아 더빙.. 진짜 짜증나네요 목소리'), valid_length comes out as 3 and there are only 3 valid tokens.

Is there a solution for this? Or must gluonnlp be used to load the file?
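A valid_length of 3 is what one would see if only a single character were tokenized between [CLS] and [SEP]. One plausible cause, stated as an assumption about how the transform reads its input, is that it takes the sentence from position 0 of a record; indexing a bare string at 0 returns just its first character. The hypothetical `first_field` function below mimics only that indexing step.

```python
def first_field(line):
    """Mimic a transform that reads the sentence from position 0 of its input."""
    return line[0]


sentence = "아 더빙.. 진짜 짜증나네요 목소리"

from_string = first_field(sentence)    # bare string: only the first character
from_record = first_field([sentence])  # one-field record: the whole sentence
print(repr(from_string), repr(from_record))
```

If the assumption holds, wrapping the string in a list, as in transform([sentence]), would give the same result as the dataset-based path.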

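Asking whether the MLM model could be provided

Hello,

This is my first time posting on GitHub...

Is there any way to get the pretrained weights

of the MLM model used during training?

I built the MLM part myself and tried fine-tuning it, but

it needs more data than I expected, and

my training environment is not very generous either, so

unless the difference is large, I think the generally trained weights would generalize better.

Thank you!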

The kernel dies when running the code

from kobert.pytorch_kobert import get_pytorch_kobert_model
or
bertmodel, vocab = get_pytorch_kobert_model()

When I run this code, it starts but does not finish; the kernel dies and shows the following message.

The kernel appears to have died. It will restart automatically.

I ran the code on two servers: one works fine, but on the other the kernel dies with the message above.

I also matched all versions by running pip install -r requirements.txt.

What could be the reason?

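Question about changing the KoBERT model structure

First, thank you for releasing KoBERT.
I think the currently released model is better suited to sentence-level use.
To work at the word level, I would like to know whether the output dimension of the model can be reduced.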

Is this error only happening to me?

When I run model, vocab = get_pytorch_kobert_model(),

I get the error OSError: Not found: "C:\Users\etfffff/kobert/kobert_news_wiki_ko_cased-1087f8699e.spiece": Illegal byte sequence Error #42
