
ALBERT with SentencePiece for Japanese text.

This repository provides a Japanese ALBERT model with a SentencePiece tokenizer.
(Note: this project is a fork of bert-japanese.)

To clone this repository together with the required ALBERT (my fork of the original ALBERT) and WikiExtractor submodules:

git clone --recurse-submodules https://github.com/jnory/albert-japanese

Pretrained models

We provide a pretrained ALBERT model and a trained SentencePiece model for Japanese text. The training data is the Japanese Wikipedia corpus from Wikimedia Downloads.
Please download all objects in the following Google Drive folder to the model/ directory.

The training loss curve is shown below (after 1M steps the loss changes sharply because max_seq_length is increased from 128 to 512):

[figure: pretraining-loss]
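Once the files are in the model/ directory, a quick sanity check of the SentencePiece tokenizer might look like the sketch below (the file name wiki-ja.model is an assumption; use whichever .model file is in the download):

# Minimal sanity check of the downloaded SentencePiece model.
# NOTE: "model/wiki-ja.model" is an assumed file name; adjust it to
# whatever .model file the Google Drive folder actually contains.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("model/wiki-ja.model")

# Tokenize a Japanese sentence into subword pieces and ids.
text = "今日は天気が良いので散歩に行きます。"
print(sp.EncodeAsPieces(text))
print(sp.EncodeAsIds(text))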

Finetuning with ALBERT Japanese

We also provide a simple Japanese text classification example using the livedoor news corpus.
Try the notebook below to see how finetuning works.
You can run the notebook on a CPU (though this is too slow to be practical) or in a GPU/TPU environment.

The results are as follows:

  • ALBERT with SentencePiece
                          precision    recall  f1-score   support
    
    dokujo-tsushin       0.99      0.91      0.95       178
      it-life-hack       0.93      0.95      0.94       172
     kaden-channel       0.96      0.98      0.97       176
    livedoor-homme       0.83      0.86      0.85        95
       movie-enter       0.98      0.99      0.99       158
            peachy       0.91      0.93      0.92       174
              smax       0.98      0.96      0.97       167
      sports-watch       0.99      0.96      0.98       190
        topic-news       0.94      0.97      0.95       163
    
          accuracy                           0.95      1473
         macro avg       0.94      0.95      0.94      1473
      weighted avg       0.95      0.95      0.95      1473
    
  • BERT with SentencePiece (from the original bert-japanese repository)
                    precision    recall  f1-score   support
    
    dokujo-tsushin       0.98      0.94      0.96       178
      it-life-hack       0.96      0.97      0.96       172
     kaden-channel       0.99      0.98      0.99       176
    livedoor-homme       0.98      0.88      0.93        95
       movie-enter       0.96      0.99      0.98       158
            peachy       0.94      0.98      0.96       174
              smax       0.98      0.99      0.99       167
      sports-watch       0.98      1.00      0.99       190
        topic-news       0.99      0.98      0.98       163
    
         micro avg       0.97      0.97      0.97      1473
         macro avg       0.97      0.97      0.97      1473
      weighted avg       0.97      0.97      0.97      1473
    
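For reference, the tables above are in the format produced by scikit-learn's classification_report. A minimal sketch of generating such a table (the label indices here are toy values; in practice they come from the finetuned model's predictions on the test split):

# Produce a per-class precision/recall/F1 table like the ones above.
from sklearn.metrics import classification_report

# The nine livedoor news categories used in the notebook.
label_names = [
    "dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news",
]

# Toy gold / predicted label indices; replace with the real evaluation
# outputs from the finetuning notebook.
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 8, 1]

print(classification_report(y_true, y_pred,
                            labels=list(range(9)),
                            target_names=label_names))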

Pretraining from scratch

All scripts for pretraining from scratch are provided. Follow the instructions below.

Data preparation

Download and preprocess the data:

python3 src/data-download-and-extract.py
bash src/file-preprocessing.sh
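The first script, roughly speaking, fetches a jawiki dump before extraction. As an illustration of that step only (the URL and output path below are assumptions, not the repository's actual logic):

# Download the latest Japanese Wikipedia dump (illustrative only; the
# repository's src/data-download-and-extract.py may work differently).
import urllib.request

DUMP_URL = ("https://dumps.wikimedia.org/jawiki/latest/"
            "jawiki-latest-pages-articles-multistream.xml.bz2")
OUT_PATH = "jawiki-latest-pages-articles-multistream.xml.bz2"

urllib.request.urlretrieve(DUMP_URL, OUT_PATH)
print("saved to", OUT_PATH)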

These scripts use the latest jawiki dump and wikiextractor module, which differ from those used for the pretrained model. To reproduce the exact setup of the pretrained model, use the following versions:

  • albert-japanese: commit de548d2cfdf1ca90a7872248e4b2adf98527da3e
  • dataset: jawiki-20191201-pages-articles-multistream.xml.bz2 in the Google Drive
  • wikiextractor: commit 3162bb6c3c9ebd2d15be507aa11d6fa818a454ac

Training SentencePiece model

Train a SentencePiece model using the preprocessed data.

python3 src/train-sentencepiece.py
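Under the hood this wraps the sentencepiece trainer. A rough sketch of the same idea (the input path, model prefix, vocabulary size, and options here are assumptions, not the script's actual settings; see src/train-sentencepiece.py for those):

# Train a SentencePiece model on the extracted Wikipedia text.
# All paths and hyperparameters below are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input="data/wiki-all.txt",        # one sentence per line (assumed path)
    model_prefix="model/wiki-ja",     # writes wiki-ja.model and wiki-ja.vocab
    vocab_size=32000,                 # placeholder vocabulary size
    model_type="unigram",
    character_coverage=0.9995,        # commonly used for Japanese text
)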

Creating data for pretraining

Create .tfrecord files for pretraining.

bash src/run_create_pretraining_data.sh [extract_dir] [max_seq_length]

extract_dir is the base directory into which the Wikipedia texts were extracted, and
max_seq_length needs to be 128 and 512 (run the script once for each value).
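Once the .tfrecord files are written, a quick way to confirm they contain serialized training examples is to decode one record with TensorFlow (the file path below is a placeholder):

# Inspect one record from a generated pretraining .tfrecord file.
# The path is a placeholder; point it at one of the generated files.
import tensorflow as tf

dataset = tf.data.TFRecordDataset("pretraining-data-128/part-0.tfrecord")
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    # Print each feature's name, type, and length; token-level features
    # are expected to have max_seq_length entries.
    for name, feature in example.features.feature.items():
        kind = feature.WhichOneof("kind")  # int64_list / float_list / bytes_list
        print(name, kind, len(getattr(feature, kind).value))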

Pretraining

You need a GPU/TPU environment to pretrain an ALBERT model.
The notebook below provides a link to a Colab notebook where you can run the pretraining scripts on TPUs.
