


Cantoformer
廣東話嘅語言 AI

Recent advances in AI enable smarter text-based applications, but most of them target English because of the abundance of English text available on the Internet.

This repository explores language modelling in Cantonese (Yue Chinese, 廣東話), a language predominantly spoken in Guangzhou, Hong Kong, and Macau, whose linguistic properties are very challenging for AI to learn.

AI 喺呢幾年發展得好快,好多嘢都話用 AI 處理會醒好多,但其實喺「語言處理」嘅領域入面,好多嘅資源都只係得英文,所以要落手做廣東話嘅 NLP,其實唔容易。

所以諗住喺呢度開個 Repo,鼓勵更多人開發廣東話 AI。(Hence this repo, to encourage more people to build Cantonese AI.)

Challenges

  • Mixed Languages (English, Chinese, Yue)
    夾雜多種語言
  • Complex Syntax
    語法複雜
  • Scarce Resources
    資源稀少
  • Many Homonyms & Homophones in online texts
    網上嘅字通常有好多一語多義/同音異字

Remediation

We apply the following preprocessing before text reaches the model:
用呢個 model 前我哋會對文字做一啲嘅處理:

  • WordPiece tokenizer from a forked 🤗Tokenizers, which

    • strips accents like the original BERT
      除去組合附加符號 (e.g. à → a)

    • uses lower casing
      使用細階英文

    • treats each symbol/number as a separate token
      符號/數字全部當係一個 token

    • Simplified Chinese → Traditional Chinese (since most of our corpus is in Traditional Chinese)
      簡轉繁(因為文本大部分都係繁體字)

      Using OpenCC v1.1.1 from here

    • normalizes Unicode characters (some mappings are hand-crafted):
      統一中文字符(其中一啲係人手分類)

      • Symbols of the same functionality 相同功能嘅符號 (e.g. ［ → [ )
      • Variant Chinese characters 異體字
      • Decomposing rare characters 將罕見字拆開 (e.g. 亻春 )

      (Mapping here)

  • Newlines are treated as a dedicated token, <nl> (a sketch of the whole pipeline follows this list)
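
A minimal sketch of this preprocessing pipeline, assuming the forked package imports as tokenizers_zh (per the install notes below) and keeps upstream's BertWordPieceTokenizer API; vocab.txt is a hypothetical path to a trained vocabulary:

# Minimal preprocessing sketch.
# Assumptions: the fork imports as `tokenizers_zh` and mirrors upstream's
# BertWordPieceTokenizer API; 'vocab.txt' is a hypothetical vocab path.
from opencc import OpenCC
from tokenizers_zh import BertWordPieceTokenizer

s2t = OpenCC('s2t')  # Simplified → Traditional ('s2t.json' in some bindings)

def preprocess(text: str) -> str:
    text = s2t.convert(text)             # 簡轉繁
    return text.replace('\n', ' <nl> ')  # newlines become a literal token

tokenizer = BertWordPieceTokenizer(
    'vocab.txt',
    lowercase=True,       # lower-case English
    strip_accents=True,   # à → a, like the original BERT
)
print(tokenizer.encode(preprocess('简体字\nEnglish Text')).tokens)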

Frameworks to be used

  • TensorFlow
  • PyTorch

Libraries to be used

  • OpenCC (Simpl-to-Trad, 簡轉繁) @ v1.1.1
  • 🤗Tokenizers (forked version is used for normalization)
# Install OpenCC v1.1.1
sudo bash ./install_opencc.sh

# Install the forked 🤗 Tokenizers
pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
# This takes some time!

# The fork is based on 🤗 Tokenizers, with the Python package
# renamed to tokenizers_zh

Corpus

| zh | en |
|---|---|
| ~ 80 GB (incl. ~ 20 GB Cantonese) | ~ 100 GB |

Evaluation

Since we have no Cantonese evaluation datasets yet, we evaluate the models on both English and Chinese datasets (a sketch of the EM/F1 scoring follows the list):

  • MNLI (Entailment Prediction)
  • DRCD (Reading Comprehension)
  • SQuAD-v2 (Reading Comprehension)
  • CMRC2018 (Reading Comprehension)
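
All reading-comprehension scores below are reported as EM/F1. A simplified sketch of the two metrics, assuming whitespace tokenization (the official SQuAD-v2 script also normalizes answers and handles unanswerable questions, and the Chinese sets score per character):

from collections import Counter

# Exact Match: 1 if the prediction equals the gold answer, else 0.
def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

# Token-level F1: harmonic mean of precision/recall over shared tokens.
def f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match('in 1997', '1997'), round(f1('in 1997', '1997'), 2))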

Something to explore

  1. Sentence Order Prediction (SOP)

    SOP is a pretraining objective used in ALBERT. StructBERT also introduces a Sentence Structural Objective, but since the ELECTRA code reads data sequentially, this repo explores SOP first (see the sketch after this list).

  2. Cluster Objective

    DocProduct is a cool project that trains a BERT model to cluster similar Q&A pairs: if a text A answers a question Q, then Q and A end up close together in vector space.

    This means the model must predict the likely contexts (before and after) of a text in order to embed it as a vector that minimizes the cost function.

    For details, refer to the DocProduct repo.
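
For concreteness, a rough sketch of how SOP training pairs can be constructed, in the style of ALBERT; make_sop_example is a hypothetical helper, not this repo's actual data pipeline:

import random

# Take two consecutive segments; half the time swap them and flip the label.
def make_sop_example(seg_a: str, seg_b: str, rng: random.Random):
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1  # 1 = original order
    return (seg_b, seg_a), 0      # 0 = swapped order

rng = random.Random(42)
pair, label = make_sop_example('佢今日好攰,', '所以早咗瞓。', rng)
print(pair, label)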

To Do List

  • Normalize Chinese characters
  • ELECTRA-small
  • ELECTRA-base
  • ELECTRA-base-sop
  • ELECTRA-albert-base
  • ELECTRA-albert-xlarge
  • ELECTRA-base-cluster
  • ELECTRA-large
  • Evaluation in Cantonese dataset
  • Upload to 🤗Huggingface

Model Comparisons

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐤 BERT (s) | 12M | 12/256 | 77.6 | | 60.5/64.2🤗 | |
| 🐦 BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🦅 BERT (l) | 334M | 24/1024 | 87.1 | 92.8/86.7 | | |
| 🐦 roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | |
| 🦅 roBERTa (l) | 335M | 24/1024 | 90.2 | 88.9/94.6 | | |
| 🐤 alBERT (b) | 12M | 12/768 | 84.6 | | 79.3/82.1 | |
| 🐤 alBERT (l) | 18M | 24/1024 | 86.5 | | 81.8/84.9 | |
| 🐦 alBERT (xl) | 60M | 24/2048 | 87.9 | | 84.1/87.9 | |
| 🦅 alBERT (xxl) | 235M | 12/4096 | 90.6 | | 86.9/89.8 | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| 🦅 ELECTRA (l) | 335M | 24/1024 | 90.7 | 88.8/93.3 | 88.0/90.6 | |
| 🐦 XLM-R (b) | 270M | 12/768 | | | | |
| 🦅 XLM-R (l) | 550M | 24/1024 | 89.0 | | | |
| Ours (1.2M steps) | | | | | | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| 🐦 alBERT (xl) | 60M | 12/2048 | 87.7 | 89.9/94.7 | 82.9/85.9 | |
| Ours (1.5M steps) | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |

Individual Comparisons

Small Models 🐤

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|
| 🐤 BERT (s) | 12M | 12/256 | 77.6 | | 60.5/64.2🤗 |
| 🐤 alBERT (b) | 12M | 12/768 | 84.6 | | 79.3/82.1 |
| 🐤 alBERT (l) | 18M | 24/1024 | 86.5 | | 81.8/84.9 |
| 🐤 ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 |
| Ours | | | | | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 |

Base Models 🐦

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐦 BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🐦 roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | 67.4/87.2 |
| 🐦 ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| Ours | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| Ours (1.5M steps) | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |

Downloads 🐤🐦

ELECTRA checkpoints are available here on Google Drive.

ELECTRA-ALBERT checkpoints are here on Google Drive.


Explorations

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|
| Ours (1.5M steps) | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | |
| Ours (1.5M steps) + SOP | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 87.1 | 88.6/93.6 | 80.4/83.2 |
| + finetuned after SQuAD | | | | 89.7/94.1 | |

References

Expected losses / training curves during pre-training: google-research/electra#3


Credits

Special thanks to Google's TensorFlow Research Cloud (TFRC) for providing the TPU-v3s used for all the training in this repo!


cantoformer's Issues

Questions about corpus

Hi, thank you for sharing!

Could you please kindly provide the corpus used in this project, especially the 20 GB of Cantonese data?

Many thanks.

Experienced error when cloning the Tokenizer

Experienced error when cloning the Tokenizer:

% pip install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
Collecting version_subpkg
  Cloning https://github.com/ecchochan/tokenizers.git (to revision zh-norm-4) to /private/var/folders/jq/lml3bg2146b_9np0qbyt0s080000gq/T/pip-install-9l41zopk/version-subpkg_750a2b16341a4860a10ce61738df1c63
  Running command git clone --filter=blob:none --quiet https://github.com/ecchochan/tokenizers.git /private/var/folders/jq/lml3bg2146b_9np0qbyt0s080000gq/T/pip-install-9l41zopk/version-subpkg_750a2b16341a4860a10ce61738df1c63
  Running command git checkout -b zh-norm-4 --track origin/zh-norm-4
  Switched to a new branch 'zh-norm-4'
  branch 'zh-norm-4' set up to track 'origin/zh-norm-4'.
  Resolved https://github.com/ecchochan/tokenizers.git to commit 7a8d502337a70cc995a2e8c9793a4b85e57eff45
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  WARNING: Generating metadata for package version_subpkg produced metadata for project name tokenizers-zh. Fix your #egg=version_subpkg fragments.
Discarding git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python: Requested tokenizers-zh from git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python has inconsistent name: expected 'version-subpkg', but metadata has 'tokenizers-zh'
ERROR: Could not find a version that satisfies the requirement version-subpkg (unavailable) (from versions: none)
ERROR: No matching distribution found for version-subpkg (unavailable)
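
The warning in the log points at a likely fix: the fork's metadata names the project tokenizers-zh, so the #egg= fragment should match it (an untested guess based on the log above):

pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=tokenizers_zh&subdirectory=bindings/python'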
