Chinese Detector

A text-mining model that uses n-gram models (trigrams, n = 3, in this instance) to detect whether a writer is from Taiwan or China. It was originally called 共匪測試機 (roughly "communist bandit tester"); the name was changed because it was not very friendly to our overseas neighbours and potential overlords.

Goals

There are two overall goals for this project:

  1. Estimate whether a given sequence of characters is more likely to come from Taiwan or from China
  2. Predict the next character / string from a given string
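The second goal can be pictured with a small frequency table. The following is a minimal sketch, not the project's actual code: the function names `build_next_char_table` and `predict_next` are hypothetical, and it assumes the simplest possible rule, namely picking the character that most often followed the last two characters in the training text.

```python
from collections import Counter, defaultdict


def build_next_char_table(documents, n=3):
    """Map each (n-1)-character context to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for doc in documents:
        for i in range(len(doc) - n + 1):
            context = doc[i:i + n - 1]      # the first n-1 characters of the n-gram
            following = doc[i + n - 1]      # the character that completes the n-gram
            table[context][following] += 1
    return table


def predict_next(table, text, n=3):
    """Return the most frequent continuation of the last n-1 characters, or None.

    Assumes n >= 2. None is returned when the context was never seen in training.
    """
    context = text[-(n - 1):]
    if context not in table:
        return None
    return table[context].most_common(1)[0][0]
```

Calling `predict_next(table, "我是")` on a table built from a few sentences that start with 我是從 would return 從, because that is the only character ever observed after 我是.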

Set up

Install Python 3, pip and (optionally) venv on your computer. At the time of writing, the core functionality needs nothing other than Python 3. However, if you wish to use the following features:

  • translating Simplified Chinese (簡體華文) to Traditional Chinese (繁體華文)
  • checking the F1 score for the predictions

install the required Python packages from the requirements.txt file:

pip install -r requirements.txt
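For readers unfamiliar with the F1 score mentioned above: it is the harmonic mean of precision and recall. The README does not say which package the project uses to compute it, so the following is only a dependency-free sketch of the metric itself; the function name `f1_score` and the `"Taiwan"`/`"China"` labels are assumptions made for illustration.

```python
def f1_score(true_labels, predicted_labels, positive="Taiwan"):
    """F1 score for one class, computed from parallel lists of labels."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually something else.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: actually positive but predicted as something else.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids dividing by zero when nothing was predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```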

Test environment

This is tested with Python 3.8.6 on Ubuntu 20.10 and on Windows, but it will most likely run without problems on other Python 3.x versions or on macOS.

Implementation

As of now, we do not use the nltk package for our purposes; instead, we wrote our own implementation of

  • tokenization (removing all Unicode characters that are not Traditional Chinese, 繁體華文)
  • building the n-gram
  • smoothing technique (Lidstone's Law)
  • some kind of classifier

We wish to pivot towards using more standard packages (such as nltk) in the future.
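The four pieces listed above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code in ngram.py: all names (`tokenize`, `ngrams`, `NgramModel`, `classify`) are hypothetical, Traditional Chinese characters are approximated by the basic CJK Unified Ideographs block, and the classifier simply compares the summed log-probabilities of a sentence's trigrams under each model. Lidstone smoothing estimates the probability of an n-gram g as (count(g) + α) / (N + α·B), where N is the total number of n-grams seen and B is the number of bins.

```python
import math
import re
from collections import Counter

# Assumption: keep only the basic CJK Unified Ideographs block (U+4E00-U+9FFF).
# The project's own character filter may differ.
NON_CJK = re.compile(r"[^\u4e00-\u9fff]")


def tokenize(text):
    """Drop every character outside the CJK block and return a list of characters."""
    return list(NON_CJK.sub("", text))


def ngrams(tokens, n=3):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NgramModel:
    """Character n-gram counts with Lidstone (additive) smoothing."""

    def __init__(self, documents, n=3, alpha=0.1):
        self.n = n
        self.alpha = alpha
        self.counts = Counter()
        for doc in documents:
            self.counts.update(ngrams(tokenize(doc), n))
        self.total = sum(self.counts.values())

    def log_prob(self, sentence, bins):
        """Sum of smoothed log-probabilities of the sentence's n-grams."""
        score = 0.0
        for gram in ngrams(tokenize(sentence), self.n):
            p = (self.counts[gram] + self.alpha) / (self.total + self.alpha * bins)
            score += math.log(p)
        return score


def classify(sentence, china_model, taiwan_model):
    """Return the label whose model gives the sentence the higher log-probability."""
    # Bins: every distinct n-gram seen by either model, plus one for unseen n-grams.
    bins = len(set(china_model.counts) | set(taiwan_model.counts)) + 1
    china = china_model.log_prob(sentence, bins)
    taiwan = taiwan_model.log_prob(sentence, bins)
    return "Taiwan" if taiwan > china else "China"
```

A smaller α trusts the observed counts more, while α = 1 reduces to Laplace (add-one) smoothing. In this sketch a tie, including a sentence too short to contain any trigram, falls through to "China"; a real classifier would probably report such cases as undecided.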

Usage

To use this, prepare a set of documents (.txt files) from China and from Taiwan and separate them into two folders (the defaults are ChinaDataset/ and TaiwanDataset/). You can change the folder paths in ngram.py.

# change the directories if you wish
china_dataset = files_to_list('./ChinaDataset/')
taiwan_dataset = files_to_list('./TaiwanDataset/')
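The body of `files_to_list` is not shown here. As a rough guide to what a loader of this kind might do, the following sketch reads every .txt file in a folder into a list of strings; the name `read_txt_folder` is hypothetical, and UTF-8 encoding is an assumption about the dataset files.

```python
from pathlib import Path


def read_txt_folder(folder):
    """Hypothetical loader: return the contents of every .txt file in `folder`."""
    return [
        path.read_text(encoding="utf-8")          # assumes the files are UTF-8
        for path in sorted(Path(folder).glob("*.txt"))  # sorted for a stable order
    ]
```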

Then open your terminal/command line and run

python3 ngram.py

Let it train, then type in the sentence you wish to check:

Sentence: 我是從火星來的

Credits

Parts of my code come from articles I have read online, and I may have missed some credits. If you see your code used and not credited here, please tell me.

Also, my greatest thanks to my teammates for helping me. Even though no commit messages were written by them, most of commit #34dd2d83 and all of the web scraping for the databases is their work, not mine.

Contributors

imfulee