Coder Social home page Coder Social logo

ngrambased-textcategorizer's Introduction

ngrambased-textcategorizer

Python implementation (PoC) of "N-Gram-Based Text Categorization (1994)" paper by William B. Cavnar and John M. Trenkle just for fun, learn and researching purposes.

How it works

To perform N-Gram-Based Text Categorization we need to compute N-grams (with N=1 to 5) for each word - and apostrophes - found in the text, doing something like (being the word "TEXT"):

  • bi-grams: _T, TE, EX, XT, T_
  • tri-grams: _TE, TEX, EXT, XT_, T_ _
  • quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _

With every N-Gram computed we just keep top 300 with most occurrences and save them as a "text category profile":

$ tail -n 10 langdata/spanish.dat 
  se   6826
nc  6788
su	6770
mi	6665
 con	6590
  con	6590
er 	6459
er   	6459
er  	6459
z	6379

Now, to check/guess text category we only need to generate N-grams in previous way and match against pre-computed profiles to calculate distance for every N-gram, choosing the nearest one (the profile with smallest total "distance")

Language detection example

I have included three profiles - spanish, english and french - for a quick demo:

$ python example.py -i example_data/El\ Hobbit\ -\ Una\ Tertulia\ Inesperada.txt 
[*] example_data/El Hobbit - Una Tertulia Inesperada.txt seems to be written in spanish

Creating category profile

>>> import ngramfreq
>>> text_categorizer = ngramfreq.NGramBasedTextCategorizer()
>>> lesmiserables_words = open('./example_data/lesmiserables.txt', mode='r').read()
>>> text_categorizer.generate_ngram_frequency_profile_from_raw_text(lesmiserables_words, 'french.dat')
>>> # or we could just do
>>> text_categorizer.generate_ngram_frequency_profile_from_file('./example_data/uuee_const.txt', 'english.dat')

Categorizing text

>>> import ngramfreq
>>> # create just one object for multiple checks
>>> text_categorizer = ngramfreq.NGramBasedTextCategorizer()
>>> text_categorizer.guess_language("Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves.")
'french'
>>> # or for only one quick check
>>> ngramfreq.guess_language("The only thing necessary for the triumph of evil is that good men do nothing.")
'english'

ngrambased-textcategorizer's People

Contributors

z0mbiehunt3r avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.