A text-mining model that uses N-gram models (trigrams in this instance) to detect whether a piece of text is from Taiwan or China. Originally called 共匪測試機 ("Communist Bandit Testing Machine"), it was renamed because the name was not very friendly to our overseas neighbours and potential overlords.
There are two overall goals for this project:
- Estimate whether a sequence of strings is more likely to come from Taiwan or from China
- Predict the next character/string following a given string
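The first goal can be sketched with character trigrams and Lidstone smoothing. This is a minimal illustration, not the project's actual ngram.py code; the toy corpora, `alpha`, and `vocab_size` values here are assumptions made for the example.

```python
import math
from collections import Counter

def char_trigrams(text):
    """All overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(corpus):
    """Count trigrams and their two-character prefixes."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        tri.update(char_trigrams(sent))
        bi.update(sent[i:i + 2] for i in range(len(sent) - 1))
    return tri, bi

def log_prob(sentence, tri, bi, alpha=0.1, vocab_size=5000):
    """Lidstone-smoothed trigram log-probability of a sentence."""
    return sum(
        math.log((tri[g] + alpha) / (bi[g[:2]] + alpha * vocab_size))
        for g in char_trigrams(sentence)
    )

# Toy one-sentence corpora (illustrative; the real model trains on full datasets)
tw_tri, tw_bi = train(["我是從台灣來的"])
cn_tri, cn_bi = train(["我是从中国来的"])

sentence = "我是從火星來的"
label = ("Taiwan" if log_prob(sentence, tw_tri, tw_bi) > log_prob(sentence, cn_tri, cn_bi)
         else "China")
```

Classification then amounts to comparing the two log-probabilities and picking the larger one.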
Install Python 3, pip, and (optionally) venv on your computer, along with the required packages from the requirements.txt file. As of writing, nothing other than Python 3 is needed for basic functionality. However, if you wish to use the features of
- translating Simplified Chinese (簡體華文) to Traditional Chinese (繁體華文)
- checking the F1 score for the predictions
install the required Python packages:
pip install -r requirements.txt
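For the F1-score feature, here is a plain-Python sketch of the metric using its standard definition; the project itself may instead call a library from requirements.txt, and the labels used here are illustrative.

```python
def f1_score(y_true, y_pred, positive="Taiwan"):
    """F1 for one positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

f1_score(["Taiwan", "China", "Taiwan"], ["Taiwan", "Taiwan", "China"])  # → 0.5
```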
This has been tested on Python 3.8.6 on Ubuntu 20.10 and Windows, but you are unlikely to run into problems on any other Python 3.x version or on macOS.
As of now, we do not use the nltk package; rather, we wrote our own implementations of
- tokenization (removing all Unicode characters that are not Traditional Chinese, 繁體華文)
- building the n-gram model
- a smoothing technique (Lidstone's law)
- some kind of classifier
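The tokenization step could look roughly like the following sketch. Note that filtering on the CJK Unified Ideographs block is only an approximation I am assuming here (that range also contains Simplified characters); the project's actual filter may differ.

```python
import re

# Keep only CJK Unified Ideographs (U+4E00–U+9FFF); everything else is removed.
NON_CJK = re.compile(r'[^\u4e00-\u9fff]')

def tokenize(text):
    """Strip all non-CJK characters, then split into single characters."""
    return list(NON_CJK.sub('', text))

tokenize("Hello, 我是從火星來的!")  # → ['我', '是', '從', '火', '星', '來', '的']
```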
We wish to pivot towards more standard packages (such as nltk) in the future.
To use this, prepare a set of documents (.txt files) from China and Taiwan and separate them into two folders (the default presets are ChinaDataset/ and TaiwanDataset/). You can change the folder directories in ngram.py.
# change the directories if you wish
china_dataset = files_to_list('./ChinaDataset/')
taiwan_dataset = files_to_list('./TaiwanDataset/')
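If you are wondering what files_to_list does, a hypothetical sketch is below; the real implementation lives in ngram.py and may differ.

```python
import glob
import os

def files_to_list(folder):
    """Read every .txt file in a folder into a list of document strings.
    (Hypothetical sketch of the helper used above.)"""
    docs = []
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with open(path, encoding='utf-8') as f:
            docs.append(f.read())
    return docs
```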
Then just use your terminal/command line and run
python3 ngram.py
Let it train, then type in the sentence you wish to check:
Sentence: 我是從火星來的
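The second goal, next-character prediction, can be sketched as picking the most frequent trigram continuation of the last two characters. The counts below are made up for illustration; the real model derives them from the datasets and may apply smoothing.

```python
from collections import Counter

def next_char(prefix, trigram_counts):
    """Return the most likely next character given the last two characters,
    using raw trigram counts (a sketch; no smoothing applied here)."""
    last2 = prefix[-2:]
    candidates = {g[2]: c for g, c in trigram_counts.items() if g[:2] == last2}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

counts = Counter({"我是從": 3, "我是从": 1, "是從台": 2})
next_char("我是", counts)  # → "從", the most frequent continuation of 我是
```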
Parts of my code come from articles I have read online, and I may have missed some credits. So, if you see your code used and not credited here, please do tell.
- A Comprehensive Guide to Build your own Language Model in Python! - Mohd Sanad Zaki Rizvi
- Building Language Models for Text with Named Entities - Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang
- Building language models - bogdani
- Natural Language Processing: Implementing Input-Text Prediction with N-grams (自然語言處理 — 使用 N-gram 實現輸入文字預測) - Airwaves
- 结巴 (jieba): Chinese word segmentation component
Also, my greatest thanks to my teammates for helping me. Even though there are no commit messages written by them, most of commit #34dd2d83 and all the web scraping for the datasets are not my work.