A language detector
This code detects the language of a text by adapting the vector space model to character n-grams extracted from text documents.
The code is still experimental and just a proof of concept.
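To make the approach concrete, here is a minimal plain-Python sketch of the idea (the actual ngram.py relies on numpy and scipy, and these function names are hypothetical): each reference document becomes a frequency vector over its character n-grams, assuming all sizes up to the given maximum are used, and the input text is assigned the language whose vector is most similar by cosine similarity.

    import collections
    import math

    def ngrams(text, n):
        """All character n-grams of size n occurring in text."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def vector(text, max_n):
        """Frequency vector over all n-grams of sizes 1..max_n."""
        counts = collections.Counter()
        for n in range(1, max_n + 1):
            counts.update(ngrams(text, n))
        return counts

    def cosine(a, b):
        """Cosine similarity between two sparse frequency vectors."""
        dot = sum(c * b[g] for g, c in a.items() if g in b)
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def detect(text, corpus, max_n=3):
        """corpus maps a language name to one reference document's text."""
        v = vector(text, max_n)
        return max(corpus, key=lambda lang: cosine(v, vector(corpus[lang], max_n)))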
Requires Python 3.5.2 (older versions haven't been tested, but it is likely to work), numpy, and scipy.
Just run the code inside the langdect folder, passing the desired maximum n-gram size as a parameter. Depending on this parameter, it might take a few seconds to load:
python3 ngram.py 3
Then enter the text whose language you want to detect. It usually gives good results for a paragraph of text.
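If ngram.py reads the text from standard input (an assumption; it may prompt interactively instead), a sample can also be piped in rather than typed:

    echo "Ceci est un exemple de paragraphe en français." | python3 ngram.py 3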
An alternative and seemingly quicker way to index the dictionaries and reduce dimensionality is provided by
python3 nongram.py 3
Here, the characters inside the n-grams are ignored; I call these nongrams. For example, the text "hello" has the following nongrams of size 3: h_l, e_l, l_o.
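A minimal sketch of how such nongrams can be extracted (the function name is hypothetical; nongram.py may implement this differently):

    def nongrams(text, n):
        """Character n-grams whose interior characters are replaced by '_'."""
        grams = []
        for i in range(len(text) - n + 1):
            g = text[i:i + n]
            grams.append(g[0] + '_' * (n - 2) + g[-1])
        return grams

    print(nongrams("hello", 3))  # ['h_l', 'e_l', 'l_o']

Because many distinct n-grams collapse onto the same nongram, the resulting vectors have fewer dimensions, which is where the speed-up presumably comes from.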
Notes and limitations:
- stop words are not filtered out
- the only characters handled specially are '\n' and ' '
- the indexes are regenerated every time the code is run (so startup can be slow for large n-gram sizes, as noted above); they should instead be saved to and loaded from files, as sketched after this list
- there is no way to control the minimum number of n-grams
- you can include more languages by copying and pasting more translations of the Universal Declaration of Human Rights
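As a sketch of the saving/loading fix mentioned in the list (the cache file name and the build callable are hypothetical, not part of the current code), the index could be persisted with pickle:

    import os
    import pickle

    INDEX_FILE = 'index.pickle'  # hypothetical cache location

    def load_or_build_index(build, path=INDEX_FILE):
        """Return the cached index if present; otherwise build it and cache it."""
        if os.path.exists(path):
            with open(path, 'rb') as f:
                return pickle.load(f)
        index = build()  # the expensive per-run index construction
        with open(path, 'wb') as f:
            pickle.dump(index, f)
        return index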
The documents are translations of the Universal Declaration of Human Rights, taken from http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx