Coder Social home page Coder Social logo

truecaser's Introduction

Language Independet Truecaser for Python

This is an implementation of a trainable Truecaser for Python.

modified for python3, and to take in a file in traintruecaser

A truecaser converts a sentence where the casing was lost to the most probable casing. Use cases are sentences that are in all-upper case, in all-lower case or in title case.

A model for English is provided, achieving an accuracy of 98.39% on a small test set of random sentences from Wikipedia.

Model

The model was inspired by the paper of Lucian Vlad Lita et al., tRuEcasIng but with some simplifications.

The model applies a greed strategy. For each token, from left to right, it computes for all possible casing:

score(w_0) = P(w_0) * P(w_0 | w_{-1}) * P(w_0 | w_1) * P(w_0 | w_{-1}, w_1)

with w_0 the word at the current position, w_{-1} the previous word, w_1 the next word in the sentence.

All observed casings for w_0 are tested and the casing with the highest score is selected.

The probabilities P(...) are computed based on a large training corpus.

Requirements

The Code was written for Python 2.7 and requires NLTK 3.0.

From NLTK, it uses the functions to spilt sentences into tokens and the FreqDist(). These parts of the code can easily be replaced, so that the code can be used without NLTK install.

Run the Code

You need a distributions.obj that contains information on the frequencies of unigrams, bigrams, and trigrams.

A pre-trained distributions.obj for English is provided in the release section (name: english_distribitions.obj.zip. Unzip it before you can use it).

One large distributions.obj for English is provided in the download section of github.

You can train your own distributions.obj using the TrainTruecaser.py script.

To run the model on one (or multiple) text files, provide distributions.obj to the PredictTruecase.py script. If no text file(s) are provided as arguments, input is read from STDIN.

To evaluate a model, have a look at EvaluateTruecaser.py.

Train your own Truecaser

You can retrain the Truecaser easily. Simply change the train.txt file with a large sample of sentences, change the TrainTruecaser.py such that is uses the train.txt and run the script. You can also use it for other languages than English like German, Spanish, or French.

Disclaimer

Sorry that this is kind of shitty code without documentation. I was looking for my research for a truecaser, but I couldn't find any working implementation. I implemented this script in a hacky manner and it works quite well (at least for me).

I think the code is so simple that everyone can use and adapt it and maybe it is handy for someone. The principle behind the code is really simple, but as mentioned above, it achieves good results.

Hint: The casing of company and product names is the hardest. Train the system on a large and recent dataset to achieve the best results (e.g. on a recent dump of Wikipedia).

truecaser's People

Contributors

nreimers avatar khayrallah avatar hoonkai avatar

Stargazers

Jonathan Moreno-Medina avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.