himkt / awesome-bert-japanese

📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information

awesome-bert-japanese

Pre-trained BERT models for Japanese differ in how they segment a sentence into words and in how they split words into subwords. The vocabulary used for subword splitting can also be constructed in several different ways.

This repository tabulates, for each publicly available pre-trained BERT model, which algorithm is used for word segmentation, for subword tokenization, and for constructing the vocabulary used in subword tokenization.

Japanese text has no explicit word boundaries and mixes several kinds of characters (hiragana, katakana, kanji, and others). Therefore, it usually requires word segmentation before words are tokenized into subwords.
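
As a rough illustration of the two stages described above, the sketch below starts from words that are assumed to be already segmented (the list stands in for the output of a segmenter such as MeCab or Juman++; no real segmenter is called) and applies one of three word-to-subword options: WordPiece-style greedy longest-match, character splitting, or no splitting. The toy vocabulary, the `tokenize` helper, and its `mode` names are hypothetical and are not taken from any of the listed models.

```python
from typing import List, Set

UNK = "[UNK]"

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB: Set[str] = {"東北", "##大学", "で", "研", "##究"}

# Stand-in for the output of a word segmenter (assumed, not computed here).
WORDS: List[str] = ["東北大学", "で", "研究"]


def wordpiece(word: str, vocab: Set[str], unk: str = UNK) -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(words: List[str], vocab: Set[str], mode: str = "wordpiece") -> List[str]:
    """Apply one word-to-subword option to pre-segmented words."""
    if mode == "wordpiece":
        return [piece for word in words for piece in wordpiece(word, vocab)]
    if mode == "character":
        return [char for word in words for char in word]
    if mode == "none":
        # Keep the segmenter's words as tokens, with no subword splitting.
        return list(words)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("wordpiece", "character", "none"):
        print(mode, tokenize(WORDS, VOCAB, mode))
```

The word segmenter and the subword step are kept separate here because the listed models mix and match them independently.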

Model

Model	Sentence -> Words	Word -> Subword	Algorithm for constructing vocabulary used in subword tokenization
Google (Multilingual BERT)	Whitespace	WordPiece	BPE?
Kikuta	--	Sentencepiece (without word segmentation)	Sentencepiece (model_type=unigram)
Hotto Link Inc.	--	Sentencepiece (without word segmentation)	Sentencepiece (model_type=unigram)
Kyoto University	Juman++ (JUMANDIC?)	WordPiece	subword-nmt (BPE)
Stockmark Inc. (a)	MeCab (mecab-ipadic-neologd)	--	--
Tohoku University (a)	MeCab (mecab-ipadic)	WordPiece	Sentencepiece (model_type=bpe)
Tohoku University (b)	MeCab (mecab-ipadic)	Character	Sentencepiece (model_type=char)
NICT (a)	MeCab (mecab-jumandic)	WordPiece	subword-nmt (BPE)
NICT (b)	MeCab (mecab-jumandic)	--	--
akirakubo (a)	MeCab (unidic-cwj) for Wikipedia and Aozora Bunko written in modern kana orthography (`新仮名`) + MeCab (unidic_qkana) for Aozora Bunko written in historical kana orthography (`旧仮名`)	WordPiece	subword-nmt (BPE)
akirakubo (b)	SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora Bunko written in modern kana orthography (`新仮名`) + MeCab (unidic_qkana) for Aozora Bunko written in historical kana orthography (`旧仮名`)	WordPiece	subword-nmt (BPE)
The University of Tokyo	MeCab (mecab-ipadic-neologd + user dic (J-MeDic))	WordPiece	? (BPE)
Laboro.AI Inc.	--	Sentencepiece (without word segmentation)	Sentencepiece (model_type=unigram)
Bandai Namco Research Inc.	MeCab (mecab-ipadic)	WordPiece	Sentencepiece (model_type=bpe)
Retrieva, Inc.	MeCab (mecab-ipadic)	WordPiece	Sentencepiece (model_type=bpe)
Waseda University	Juman++ (JUMANDIC)	WordPiece	Sentencepiece (model_type=unigram)
LINE Corp.	MeCab (mecab-unidic)	WordPiece	Sentencepiece (model_type=bpe)
Stockmark Inc. (b)	MeCab (mecab-ipadic-neologd)	WordPiece	Sentencepiece (model_type=?)

NICT: National Institute of Information and Communications Technology
without word segmentation: the sentence is split directly into subwords, without first being split into words
For models by Tohoku University, MeCab + mecab-ipadic-neologd is used for sentence segmentation, i.e. splitting text into sentences (thanks @ikuyamada san!)
For models by akirakubo, documents in Aozora Bunko are classified into two categories based on the type of kana spelling (thanks @kkadowa san and @akirakubo san!)
- See also: akirakubo/bert-japanese-aozora#1 (comment)
For DistilBERT (by Bandai Namco Research Inc.), the same word segmentation and the same vocabulary construction algorithm are used for both the teacher and student models.
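
The last column of the table records how the subword vocabulary was built. As a rough illustration of what the BPE entries refer to, the following is a minimal sketch of learning BPE merges from word frequencies. It is a simplified textbook version, not the implementation in subword-nmt or Sentencepiece: it omits end-of-word markers and other details, and the function name `learn_bpe` and the toy frequencies are made up for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges BPE merge rules from word frequencies."""
    # Every word starts as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in corpus.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


if __name__ == "__main__":
    # Toy frequencies of pre-segmented words (invented for illustration).
    toy_freqs = {"東京": 5, "東京都": 3, "京都": 2}
    print(learn_bpe(toy_freqs, num_merges=2))
```

The merged symbols, together with the initial characters, would form the subword vocabulary; the unigram entries in the table use a different, probabilistic procedure that this sketch does not cover.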

Reference

Google (Multilingual BERT) (2018/11): https://github.com/google-research/bert/blob/master/multilingual.md
Kikuta (2019/01): https://yoheikikuta.github.io/bert-japanese/
Hotto Link Inc. (2019/03): https://www.hottolink.co.jp/blog/20190311_101674/
Kyoto University (2019/03): http://nlp.ist.i.kyoto-u.ac.jp/bert
Stockmark Inc. (a) (2019/04): https://qiita.com/mkt3/items/3c1278339ff1bcc0187f
Tohoku University (a,b) (2019/12): https://github.com/cl-tohoku/bert-japanese
NICT (2020/03): https://alaginrc.nict.go.jp/nict-bert/index.html
akirakubo (2020/03): https://github.com/akirakubo/bert-japanese-aozora
The University of Tokyo (2020/03): https://ai-health.m.u-tokyo.ac.jp/uth-ber
Laboro.AI Inc. (2020/04): https://laboro.ai/column/laboro-bert/
Bandai Namco Research Inc. (2020/04): https://github.com/BandaiNamcoResearchInc/DistilBERT-base-jp
Retrieva, Inc. (2021/04): https://tech.retrieva.jp/entry/2021/04/01/114943
Waseda University (2021/12): https://huggingface.co/nlp-waseda/roberta-base-japanese
LINE Corp. (2023/03): https://engineering.linecorp.com/ja/blog/line-distilbert-high-performance-fast-lightweight-japanese-language-model
Stockmark Inc. (b) (2020/02): https://qiita.com/mkt3/items/b41dcf0185e5873f5f75

awesome-bert-japanese's Issues

(Thank you @icoxfog417 for pointing out!)

Japanese BART

Not sure yet whether BART should be included.
https://tech.stockmark.co.jp/blog/bart-japanese-base-news/

Raw text segmentation or punctuation

Hello,

Thank you for collecting links to the BERT-based models for Japanese.

Just wanted to ask if you know of any models or investigations on segmenting raw text (after automatic speech recognition the text is not split at all, just characters one by one)? Something simple like splitting text into sentences, or something more complicated like adding punctuation to the text. For example, NVIDIA provides punctuation models based on BERT and DistilBERT: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html

It would be great if there were something for splitting raw Japanese text.

Japanese T5

Not decided yet whether to include T5.
https://qiita.com/sonoisa/items/a9af64ff641f0bbfed44

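Tohoku University and NICT

I believe the "Algorithm for constructing vocabulary used in subword tokenization" for Tohoku University (a) is Sentencepiece.
The script below first trains a vocabulary with Sentencepiece and then converts it into the format of BERT's vocab.txt.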

https://github.com/cl-tohoku/bert-japanese/blob/master/build_vocab.py
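
To make the conversion step concrete, here is a minimal sketch of turning a list of Sentencepiece pieces into vocab.txt-style entries. It is not the linked script: the conversion rule (pieces starting with "▁" become word-initial tokens, all others get a "##" prefix), the list of special tokens, and the function name `to_bert_vocab` are assumptions made for illustration.

```python
from typing import Iterable, List

# Special tokens placed first; this particular list is an assumption.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# "▁" (U+2581) is the marker Sentencepiece puts at the start of a word.
WORD_START = "\u2581"

# Control pieces that Sentencepiece adds by default and BERT does not use.
SKIPPED_PIECES = {"<unk>", "<s>", "</s>"}


def to_bert_vocab(pieces: Iterable[str]) -> List[str]:
    """Convert Sentencepiece pieces into vocab.txt-style tokens (assumed rule)."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece in SKIPPED_PIECES:
            continue
        if piece.startswith(WORD_START):
            # Word-initial piece: drop the marker.
            token = piece[len(WORD_START):]
        else:
            # Word-internal piece: add the WordPiece continuation prefix.
            token = "##" + piece
        # Skip the bare marker (empty token) and duplicates.
        if token and token not in seen:
            vocab.append(token)
            seen.add(token)
    return vocab


if __name__ == "__main__":
    toy_pieces = ["<unk>", "<s>", "</s>", "▁東北", "大学", "▁", "▁で"]
    print("\n".join(to_bert_vocab(toy_pieces)))
```

Writing the returned tokens one per line gives a file in the vocab.txt layout, where a token's line index serves as its ID.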

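For Tohoku University (b), "Word -> Subword" is character-level, so wouldn't something like Character be better?
(Strictly speaking, the "Algorithm for constructing vocabulary used in subword tokenization" is training with Sentencepiece's --model_type=char option, but since it is effectively character-level, I think Character is fine.)

I think WordPiece is correct for "Word -> Subword" of NICT (a).
I think NICT (b) is the "without BPE" model. What "BPE" refers to varies from person to person, but here "without BPE" means "morpheme-level, with no subword splitting", so I think "--" is correct for both "Word -> Subword" and "Algorithm for constructing vocabulary used in subword tokenization".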