


Cantoformer
廣東話嘅語言 AI

Recent advances in AI enable smarter text-based applications, but most of them target English because of the abundance of English text available on the Internet.

This repository explores language modelling in Cantonese (Yue Chinese, 廣東話), a language predominantly spoken in Guangzhou, Hong Kong, and Macau, whose linguistic properties are very challenging for AI to learn.

AI 喺呢幾年發展得好快,好多嘢都話用 AI 處理會醒好多,但其實喺「語言處理」嘅領域入面,好多嘅資源都只係得英文,所以要落手做廣東話嘅 NLP,其實唔容易。

所以諗住喺呢度開個 Repo,鼓勵更多人開發廣東話 AI。(Hence this repo, to encourage more people to build Cantonese AI.)

Challenges

  • Mixed Languages (English, Chinese, Yue)
    夾雜多種語言
  • Complex Syntax
    語法複雜
  • Scarce Resources
    資源稀少
  • Many Homonyms & Homophones in online texts
    網上嘅字通常有好多一語多義/同音異字

Remediation

We apply the following preprocessing before text reaches the model:
用呢個 model 前我哋會對文字做一啲嘅處理:

  • WordPiece tokenizer from a forked 🤗Tokenizers, which

    • strips accents like the original BERT
      除去組合附加符號 (e.g. à → a)

    • uses lower casing
      使用細階英文

    • treats each symbol/number as a separate token
      符號/數字全部當係一個 token

    • Simplified Chinese → Traditional Chinese (since most of our corpus is in Traditional Chinese)
      簡轉繁(因為文本大部分都係繁體字)

      Using OpenCC v1.1.1 from here

    • normalizes Unicode characters (some mappings are hand-crafted):
      統一中文字符(其中一啲係人手分類)

      • Symbols of the same functionality 相同功能嘅符號 (e.g. ［ → [ )
      • Variant Chinese characters 異體字
      • Decomposing rare characters 將罕見字拆開 (e.g. 亻春 )

      (Mapping here)

  • Newlines are treated as a dedicated token, <nl> (a sketch of the whole pipeline follows this list)
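
A minimal sketch of this preprocessing pipeline, assuming the forked package imports as tokenizers_zh (per the install notes below) and keeps upstream's BertWordPieceTokenizer API; vocab.txt is a hypothetical path to a trained vocabulary:

# Minimal preprocessing sketch.
# Assumptions: the fork imports as `tokenizers_zh` and mirrors upstream's
# BertWordPieceTokenizer API; 'vocab.txt' is a hypothetical vocab path.
from opencc import OpenCC
from tokenizers_zh import BertWordPieceTokenizer

s2t = OpenCC('s2t')  # Simplified → Traditional ('s2t.json' in some bindings)

def preprocess(text: str) -> str:
    text = s2t.convert(text)             # 簡轉繁
    return text.replace('\n', ' <nl> ')  # newlines become a literal token

tokenizer = BertWordPieceTokenizer(
    'vocab.txt',
    lowercase=True,       # lower-case English
    strip_accents=True,   # à → a, like the original BERT
)
print(tokenizer.encode(preprocess('简体字\nEnglish Text')).tokens)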

Frameworks to be used

  • TensorFlow
  • PyTorch

Libraries to be used

  • OpenCC (Simpl-to-Trad, 簡轉繁) @ v1.1.1
  • 🤗Tokenizers (forked version is used for normalization)
# Install OpenCC v1.1.1
sudo bash ./install_opencc.sh

# Install the forked 🤗 Tokenizers
pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
# This takes some time!

# The fork is based on 🤗 Tokenizers, with the Python package
# renamed to tokenizers_zh

Corpus

| zh | en |
|---|---|
| ~ 80 GB (incl. ~ 20 GB Cantonese) | ~ 100 GB |

Evaluation

Since we have no Cantonese evaluation datasets yet, we evaluate the models on both English and Chinese datasets (a sketch of the EM/F1 scoring follows the list):

  • MNLI (Entailment Prediction)
  • DRCD (Reading Comprehension)
  • SQuAD-v2 (Reading Comprehension)
  • CMRC2018 (Reading Comprehension)
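
All reading-comprehension scores below are reported as EM/F1. A simplified sketch of the two metrics, assuming whitespace tokenization (the official SQuAD-v2 script also normalizes answers and handles unanswerable questions, and the Chinese sets score per character):

from collections import Counter

# Exact Match: 1 if the prediction equals the gold answer, else 0.
def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

# Token-level F1: harmonic mean of precision/recall over shared tokens.
def f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match('in 1997', '1997'), round(f1('in 1997', '1997'), 2))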

Something to explore

  1. Sentence Order Prediction (SOP)

    SOP is a pretraining objective used in ALBERT. StructBERT also introduces a Sentence Structural Objective, but since the ELECTRA code reads data sequentially, this repo explores SOP first (see the sketch after this list).

  2. Cluster Objective

    DocProduct is a cool project that trains a BERT model to cluster similar Q&A pairs: if a text A answers a question Q, then Q and A end up close together in vector space.

    This means the model must predict the likely contexts (before and after) of a text in order to embed it as a vector that minimizes the cost function.

    For details, refer to the DocProduct repo.
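
For concreteness, a rough sketch of how SOP training pairs can be constructed, in the style of ALBERT; make_sop_example is a hypothetical helper, not this repo's actual data pipeline:

import random

# Take two consecutive segments; half the time swap them and flip the label.
def make_sop_example(seg_a: str, seg_b: str, rng: random.Random):
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1  # 1 = original order
    return (seg_b, seg_a), 0      # 0 = swapped order

rng = random.Random(42)
pair, label = make_sop_example('佢今日好攰,', '所以早咗瞓。', rng)
print(pair, label)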

To Do List

  • Normalize Chinese characters
  • ELECTRA-small
  • ELECTRA-base
  • ELECTRA-base-sop
  • ELECTRA-albert-base
  • ELECTRA-albert-xlarge
  • ELECTRA-base-cluster
  • ELECTRA-large
  • Evaluation in Cantonese dataset
  • Upload to 🤗Huggingface

Model Comparisons

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐤 BERT (s) | 12M | 12/256 | 77.6 | | 60.5/64.2🤗 | |
| 🐦 BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🦅 BERT (l) | 334M | 24/1024 | 87.1 | 92.8/86.7 | | |
| 🐦 roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | |
| 🦅 roBERTa (l) | 335M | 24/1024 | 90.2 | 88.9/94.6 | | |
| 🐤 alBERT (b) | 12M | 12/768 | 84.6 | | 79.3/82.1 | |
| 🐤 alBERT (l) | 18M | 24/1024 | 86.5 | | 81.8/84.9 | |
| 🐦 alBERT (xl) | 60M | 24/2048 | 87.9 | | 84.1/87.9 | |
| 🦅 alBERT (xxl) | 235M | 12/4096 | 90.6 | | 86.9/89.8 | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| 🦅 ELECTRA (l) | 335M | 24/1024 | 90.7 | 88.8/93.3 | 88.0/90.6 | |
| 🐦 XLM-R (b) | 270M | 12/768 | | | | |
| 🦅 XLM-R (l) | 550M | 24/1024 | 89.0 | | | |
| Ours (1.2M steps) | | | | | | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| 🐦 alBERT (xl) | 60M | 12/2048 | 87.7 | 89.9/94.7 | 82.9/85.9 | |
| Ours (1.5M steps) | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |

Individual Comparisons

Small Models 🐤

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|
| 🐤 BERT (s) | 12M | 12/256 | 77.6 | | 60.5/64.2🤗 |
| 🐤 alBERT (b) | 12M | 12/768 | 84.6 | | 79.3/82.1 |
| 🐤 alBERT (l) | 18M | 24/1024 | 86.5 | | 81.8/84.9 |
| 🐤 ELECTRA (s) | 14M | 12/256 | 81.6 | 83.5/89.2 | 69.7/73.4🤗 |
| Ours | | | | | |
| 🐤 ELECTRA (s) | 14M | 12/256 | 80.7 | 82.1/88.0 | 69.4/72.1 |

Base Models 🐦

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) | CMRC2018-dev (EM/F1) |
|---|---|---|---|---|---|---|
| 🐦 BERT (b) | 110M | 12/768 | 84.3 | 85.0/91.2 | 72.4/75.8🤗 | |
| 🐦 roBERTa (b) | 110M | 12/768 | 87.6 | 86.6/92.5 | 78.5/81.7🤗 | 67.4/87.2 |
| 🐦 ELECTRA (b) | 110M | 12/768 | 88.5 | 89.6/94.2 | 80.5/83.3 | 69.3/87.0 |
| Ours | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.3 | 88.2/92.5 | 80.4/83.3 | |
| Ours (1.5M steps) | | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 | 67.4/86.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | | 70.2/88.5 |

Downloads 🐤🐦

ELECTRA checkpoints are available here on Google Drive.

ELECTRA-ALBERT checkpoints are here on Google Drive.


Explorations

| Model | Params | L/H | MNLI-en | DRCD-dev (EM/F1) | SQuADv2-dev (EM/F1) |
|---|---|---|---|---|---|
| Ours (1.5M steps) | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 86.8 | 88.5/93.3 | 80.8/83.7 |
| + finetuned after SQuAD | | | | 89.5/94.1 | |
| Ours (1.5M steps) + SOP | | | | | |
| 🐦 ELECTRA (b) | 110M | 12/768 | 87.1 | 88.6/93.6 | 80.4/83.2 |
| + finetuned after SQuAD | | | | 89.7/94.1 | |

References

Expected losses / training curves during pre-training: google-research/electra#3


Credits

Special thanks to Google's TensorFlow Research Cloud (TFRC) for providing the TPU-v3s used for all the training in this repo!


cantoformer's Issues

Questions about corpus

Hi, thank you for sharing!

Could you please kindly provide the corpus used in this project, especially the 20 GB of Cantonese data?

Many thanks.

Experienced error when cloning the Tokenizer

Experienced error when cloning the Tokenizer:

% pip install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python'
Collecting version_subpkg
  Cloning https://github.com/ecchochan/tokenizers.git (to revision zh-norm-4) to /private/var/folders/jq/lml3bg2146b_9np0qbyt0s080000gq/T/pip-install-9l41zopk/version-subpkg_750a2b16341a4860a10ce61738df1c63
  Running command git clone --filter=blob:none --quiet https://github.com/ecchochan/tokenizers.git /private/var/folders/jq/lml3bg2146b_9np0qbyt0s080000gq/T/pip-install-9l41zopk/version-subpkg_750a2b16341a4860a10ce61738df1c63
  Running command git checkout -b zh-norm-4 --track origin/zh-norm-4
  Switched to a new branch 'zh-norm-4'
  branch 'zh-norm-4' set up to track 'origin/zh-norm-4'.
  Resolved https://github.com/ecchochan/tokenizers.git to commit 7a8d502337a70cc995a2e8c9793a4b85e57eff45
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  WARNING: Generating metadata for package version_subpkg produced metadata for project name tokenizers-zh. Fix your #egg=version_subpkg fragments.
Discarding git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python: Requested tokenizers-zh from git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=version_subpkg&subdirectory=bindings/python has inconsistent name: expected 'version-subpkg', but metadata has 'tokenizers-zh'
ERROR: Could not find a version that satisfies the requirement version-subpkg (unavailable) (from versions: none)
ERROR: No matching distribution found for version-subpkg (unavailable)
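
The warning in the log points at a likely fix: the fork's metadata names the project tokenizers-zh, so the #egg= fragment should match it (an untested guess based on the log above):

pip3 install 'git+https://github.com/ecchochan/tokenizers.git@zh-norm-4#egg=tokenizers_zh&subdirectory=bindings/python'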
