himkt / awesome-bert-japanese

📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information

nlp natural-language-processing japanese bert bert-models

awesome-bert-japanese

Pre-trained BERT models for Japanese differ in how they segment a sentence into words and in how they split words into subwords. They also differ in how the vocabulary used for subword splitting is constructed.

Japanese text has no explicit word boundaries and mixes several kinds of characters, so it usually requires word segmentation before words are tokenized into subwords. This list covers publicly available pre-trained BERT models for Japanese and summarizes, for each one, the word segmentation algorithm, the subword tokenization algorithm, and the algorithm used to construct the vocabulary for subword tokenization.
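To make the two stages concrete, here is a minimal sketch of the "Sentence -> Words" then "Word -> Subword" pipeline. It assumes the sentence has already been segmented by a tool such as MeCab or Juman++ (the `words` list is hand-written, not real segmenter output) and uses a toy vocabulary invented for the example. The subword step is the greedy longest-match strategy commonly associated with WordPiece; it is an illustration, not the code of any model listed here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def tokenize(words, vocab):
    """Apply subword splitting to an already word-segmented sentence."""
    return [piece for word in words for piece in wordpiece_tokenize(word, vocab)]


# Hypothetical word segmentation of "東京大学で研究する" (hand-written).
words = ["東京", "大学", "で", "研究", "する"]
# Toy vocabulary, invented for this example.
vocab = {"東京", "大", "##学", "で", "研究", "す", "##る"}

tokens = tokenize(words, vocab)
print(tokens)
```

Because the lookup runs per word, a subword never crosses a word boundary produced by the segmenter. That is the practical difference from the "without word segmentation" models, which split the raw sentence directly.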

Model

| Model | Sentence -> Words | Word -> Subword | Algorithm for constructing vocabulary used in subword tokenization |
|---|---|---|---|
| Google (Multilingual BERT) | Whitespace | WordPiece | BPE? |
| Kikuta | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Hotto Link Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Kyoto University | Juman++ (JUMANDIC?) | WordPiece | subword-nmt (BPE) |
| Stockmark Inc. (a) | MeCab (mecab-ipadic-neologd) | -- | -- |
| Tohoku University (a) | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Tohoku University (b) | MeCab (mecab-ipadic) | Character | Sentencepiece (model_type=character) |
| NICT (a) | MeCab (mecab-jumandic) | WordPiece | subword-nmt (BPE) |
| NICT (b) | MeCab (mecab-jumandic) | -- | -- |
| akirakubo (a) | MeCab (unidic-cwj) for Wikipedia and Aozora bunko written in modern kana orthography (新仮名) + MeCab (unidic_qkana) for Aozora bunko written in historical kana orthography (旧仮名) | WordPiece | subword-nmt (BPE) |
| akirakubo (b) | SudachiPy (SudachiDict_core + A mode) for Wikipedia and Aozora bunko written in modern kana orthography (新仮名) + MeCab (unidic_qkana) for Aozora bunko written in historical kana orthography (旧仮名) | WordPiece | subword-nmt (BPE) |
| The University of Tokyo | MeCab (mecab-ipadic-neologd + user dic (J-MeDic)) | WordPiece | ? (BPE) |
| Laboro.AI Inc. | -- | Sentencepiece (without word segmentation) | Sentencepiece (model_type=unigram) |
| Bandai Namco Research Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Retrieva, Inc. | MeCab (mecab-ipadic) | WordPiece | Sentencepiece (model_type=bpe) |
| Waseda University | Juman++ (JUMANDIC) | WordPiece | Sentencepiece (model_type=unigram) |
| LINE Corp. | MeCab (mecab-unidic) | WordPiece | Sentencepiece (model_type=bpe) |
| Stockmark Inc. (b) | MeCab (mecab-ipadic-neologd) | WordPiece | Sentencepiece (model_type=?) |

  • NICT: National Institute of Information and Communications Technology
  • without word segmentation: the sentence is split directly into subwords, without first being segmented into words
  • For models by Tohoku University, MeCab + mecab-ipadic-neologd is used for sentence segmentation (thanks @ikuyamada san!)
  • For models by akirakubo, documents in Aozora bunko are classified into two categories based on the type of kana spelling. (thanks @kkadowa san and @akirakubo san!)
  • For DistilBERT (by Bandai Namco Research Inc.), the same word segmentation and the same algorithm for constructing the vocabulary are used for both the teacher and student models.
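The last column of the table names the algorithm that learns the subword vocabulary. As a rough illustration of the BPE case, the sketch below repeatedly merges the most frequent adjacent symbol pair, starting from single characters. The word frequencies are invented, the function name `learn_bpe` is hypothetical, and this is a simplified textbook version, not the implementation used by subword-nmt or Sentencepiece.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: frequency} dict."""
    # Start with every word represented as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


# Invented word frequencies, standing in for a word-segmented corpus.
word_freqs = {"東京": 5, "東京都": 3, "京都": 2}
merges = learn_bpe(word_freqs, num_merges=2)
print(merges)
```

Here the input units are words, as they would be after MeCab or Juman++ segmentation. For the "without word segmentation" rows, the learner sees raw sentences instead, so learned pieces are not constrained by a segmenter's word boundaries.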

Contributors

himkt, kkadowa

awesome-bert-japanese's Issues

Raw text segmentation or punctuation

Hello,

Thank you for collecting links to the BERT-based models for Japanese.

I wanted to ask whether you know of any models or studies on segmenting raw text. After automatic speech recognition the text is not split at all; it is just a stream of characters. This could be something simple, like splitting text into sentences, or something more complex, like adding punctuation to the text. For example, NVIDIA provides punctuation models based on BERT and DistilBERT: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html

It would be great if there were something like this for splitting raw Japanese text.

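Tohoku University and NICT

I believe the "algorithm for constructing vocabulary used in subword tokenization" for Tohoku University (a) is Sentencepiece.
The script below first trains a vocabulary with Sentencepiece and then converts it into the format of BERT's vocab.txt.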

https://github.com/cl-tohoku/bert-japanese/blob/master/build_vocab.py
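As a rough sketch of what such a conversion can look like (this is not a copy of the linked script, whose details may differ), the function below turns lines of a Sentencepiece `.vocab` file, which have the form `piece<TAB>score`, into BERT-style vocabulary entries. It assumes that pieces starting with Sentencepiece's word-initial marker "▁" become plain tokens and all other pieces become "##"-prefixed continuation tokens. The function name, the special-token list, and the sample lines are hypothetical.

```python
WORD_BOUNDARY = "\u2581"  # "▁", Sentencepiece's word-initial marker
# Assumed set of BERT special tokens placed at the top of vocab.txt.
BERT_SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
# Sentencepiece's default control symbols, dropped during conversion.
SP_CONTROL_SYMBOLS = {"<unk>", "<s>", "</s>"}


def sentencepiece_to_bert_vocab(sp_vocab_lines):
    """Convert 'piece<TAB>score' lines into a list of BERT vocab.txt tokens."""
    bert_vocab = list(BERT_SPECIAL_TOKENS)
    seen = set(bert_vocab)
    for line in sp_vocab_lines:
        piece = line.rstrip("\n").split("\t")[0]
        if piece in SP_CONTROL_SYMBOLS:
            continue
        if piece.startswith(WORD_BOUNDARY):
            # Word-initial piece: drop the marker, keep the surface form.
            token = piece[len(WORD_BOUNDARY):]
        else:
            # Word-internal piece: mark it as a continuation.
            token = "##" + piece
        # Skip the bare marker (empty token) and duplicates.
        if token and token not in seen:
            bert_vocab.append(token)
            seen.add(token)
    return bert_vocab


# Invented sample of a Sentencepiece vocabulary file.
sample_lines = [
    "<unk>\t0",
    "<s>\t0",
    "</s>\t0",
    "\u2581東京\t-1.0",
    "\u2581大\t-2.0",
    "学\t-2.5",
    "\u2581\t-3.0",
]
bert_vocab = sentencepiece_to_bert_vocab(sample_lines)
print("\n".join(bert_vocab))
```

Writing one token per line, as the final `print` does, matches the plain-text layout of a BERT vocab.txt file.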

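For Tohoku University (b), "Word -> Subword" is character-level, so wouldn't something like "Character" be a better label?
(For the "algorithm for constructing vocabulary used in subword tokenization" column, strictly speaking the vocabulary is trained with Sentencepiece's --model_type=char option, but since it is effectively character-level, I think "Character" is fine.)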

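For NICT (a), I think WordPiece is correct for "Word -> Subword".
I believe NICT (b) is the "without BPE" model. What "BPE" refers to varies from person to person, but here "without BPE" means "morpheme units, with no subword splitting", so I think "--" is correct for both "Word -> Subword" and "algorithm for constructing vocabulary used in subword tokenization".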
