A text-mining model that uses N-gram models (trigrams in this instance) to detect whether a piece of text is from Taiwan or China. Originally called 共匪測試機 ("Communist Bandit Testing Machine"), it was renamed because the name was not very friendly to our overseas neighbours and potential overlords.
There are two overall goals for this project:
- Estimate whether a sequence of strings is more likely to come from Taiwan or from China
- Predict the next character/string following a given string
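The first goal can be sketched with character trigrams and Lidstone smoothing. This is a minimal illustration, not the project's actual ngram.py code; the toy corpora, `alpha`, and `vocab_size` values here are assumptions made for the example.

```python
import math
from collections import Counter

def char_trigrams(text):
    """All overlapping character trigrams of a string."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(corpus):
    """Count trigrams and their two-character prefixes."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        tri.update(char_trigrams(sent))
        bi.update(sent[i:i + 2] for i in range(len(sent) - 1))
    return tri, bi

def log_prob(sentence, tri, bi, alpha=0.1, vocab_size=5000):
    """Lidstone-smoothed trigram log-probability of a sentence."""
    return sum(
        math.log((tri[g] + alpha) / (bi[g[:2]] + alpha * vocab_size))
        for g in char_trigrams(sentence)
    )

# Toy one-sentence corpora (illustrative; the real model trains on full datasets)
tw_tri, tw_bi = train(["我是從台灣來的"])
cn_tri, cn_bi = train(["我是从中国来的"])

sentence = "我是從火星來的"
label = ("Taiwan" if log_prob(sentence, tw_tri, tw_bi) > log_prob(sentence, cn_tri, cn_bi)
         else "China")
```

Classification then amounts to comparing the two log-probabilities and picking the larger one.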
Install Python 3, pip, and (optionally) venv on your computer, along with the required packages from the requirements.txt file. As of writing, nothing other than Python 3 is needed for basic functionality. However, if you wish to use the features of
- translating Simplified Chinese (簡體華文) to Traditional Chinese (繁體華文)
- checking the F1 score for the predictions
install the required Python packages:
pip install -r requirements.txt
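For the F1-score feature, here is a plain-Python sketch of the metric using its standard definition; the project itself may instead call a library from requirements.txt, and the labels used here are illustrative.

```python
def f1_score(y_true, y_pred, positive="Taiwan"):
    """F1 for one positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

f1_score(["Taiwan", "China", "Taiwan"], ["Taiwan", "Taiwan", "China"])  # → 0.5
```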
This has been tested on Python 3.8.6 on Ubuntu 20.10 and Windows, but you are unlikely to run into problems on any other Python 3.x version or on macOS.
As of now, we do not use the nltk package; rather, we wrote our own implementations of
- tokenization (removing all Unicode characters that are not Traditional Chinese, 繁體華文)
- building the n-gram model
- a smoothing technique (Lidstone's law)
- some kind of classifier
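The tokenization step could look roughly like the following sketch. Note that filtering on the CJK Unified Ideographs block is only an approximation I am assuming here (that range also contains Simplified characters); the project's actual filter may differ.

```python
import re

# Keep only CJK Unified Ideographs (U+4E00–U+9FFF); everything else is removed.
NON_CJK = re.compile(r'[^\u4e00-\u9fff]')

def tokenize(text):
    """Strip all non-CJK characters, then split into single characters."""
    return list(NON_CJK.sub('', text))

tokenize("Hello, 我是從火星來的!")  # → ['我', '是', '從', '火', '星', '來', '的']
```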
We wish to pivot towards more standard packages (such as nltk) in the future.
To use this, prepare a set of documents (.txt files) from China and Taiwan and separate them into two folders (the default presets are ChinaDataset/ and TaiwanDataset/). You can change the folder directories in ngram.py.
# change the directories if you wish
china_dataset = files_to_list('./ChinaDataset/')
taiwan_dataset = files_to_list('./TaiwanDataset/')
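If you are wondering what files_to_list does, a hypothetical sketch is below; the real implementation lives in ngram.py and may differ.

```python
import glob
import os

def files_to_list(folder):
    """Read every .txt file in a folder into a list of document strings.
    (Hypothetical sketch of the helper used above.)"""
    docs = []
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with open(path, encoding='utf-8') as f:
            docs.append(f.read())
    return docs
```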
Then just use your terminal/command line and run
python3 ngram.py
Let it train, then type in the sentence you wish to check:
Sentence: 我是從火星來的
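The second goal, next-character prediction, can be sketched as picking the most frequent trigram continuation of the last two characters. The counts below are made up for illustration; the real model derives them from the datasets and may apply smoothing.

```python
from collections import Counter

def next_char(prefix, trigram_counts):
    """Return the most likely next character given the last two characters,
    using raw trigram counts (a sketch; no smoothing applied here)."""
    last2 = prefix[-2:]
    candidates = {g[2]: c for g, c in trigram_counts.items() if g[:2] == last2}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

counts = Counter({"我是從": 3, "我是从": 1, "是從台": 2})
next_char("我是", counts)  # → "從", the most frequent continuation of 我是
```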
Parts of my code come from articles I have read online, and I may have missed some credits. So, if you see your code used and not credited here, please do tell.
- A Comprehensive Guide to Build your own Language Model in Python! - Mohd Sanad Zaki Rizvi
- Building Language Models for Text with Named Entities - Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang
- Building language models - bogdani
- Natural Language Processing: Implementing Input-Text Prediction with N-grams (自然語言處理 — 使用 N-gram 實現輸入文字預測) - Airwaves
- 结巴 (jieba): Chinese word segmentation component
Also, my greatest thanks to my teammates for helping me. Even though there are no commit messages written by them, most of commit #34dd2d83 and all the web scraping for the datasets are not my work.