Langident

Langident is a language guesser for historical text. It distinguishes between the seventeenth-century variants of German, English, French, Latin and Dutch.

There are three models: Cavnar-Trenkle, cumulative frequency and Naive Bayes. Accuracy scores on a held-out set of "hard" (very short) examples are:

  • cavnartrenkle: 224/2969 errors = 92.46% accuracy
  • cumfreq: 103/2969 errors = 96.53% accuracy
  • naivebayes: 142/2969 errors = 95.22% accuracy
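To give a feel for what the first of these models does, here is a minimal, self-contained sketch of rank-order n-gram classification in the style of Cavnar and Trenkle. It is an illustration only, not langident's own implementation: the class name `RankOrderSketch`, the n-gram lengths, the profile size and the toy training sentences are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative rank-order n-gram classifier; NOT langident's actual code. */
final class RankOrderSketch {
    static final int MAX_N = 3;          // use character n-grams of length 1..3 (assumed)
    static final int PROFILE_SIZE = 300; // n-grams kept per profile (assumed)

    // Language label -> (n-gram -> rank), in training order.
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Maps the most frequent n-grams of the text to their rank (0 = most frequent). */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ngrams = new ArrayList<>(counts.keySet());
        // Descending count; ties broken alphabetically so that ranks are deterministic.
        ngrams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int r = 0; r < ngrams.size() && r < PROFILE_SIZE; r++) {
            ranks.put(ngrams.get(r), r);
        }
        return ranks;
    }

    void train(String language, String text) {
        profiles.put(language, profile(text));
    }

    /** Sum of rank differences; an n-gram missing from the language profile costs PROFILE_SIZE. */
    static int outOfPlace(Map<String, Integer> doc, Map<String, Integer> lang) {
        int distance = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            distance += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return distance;
    }

    /** Returns the label of the profile closest to the text, or null if nothing was trained. */
    String guess(String text) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            int d = outOfPlace(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        RankOrderSketch guesser = new RankOrderSketch();
        // Toy training data, far too small for real use.
        guesser.train("nl", "welke taal is dit en wat is de taal van het boek");
        guesser.train("en", "which language is this and what is the language of the book");
        System.out.println(guesser.guess("de taal van dit boek"));
    }
}
```

The "hard" examples in the scores above are very short, which is where a rank-based distance like this tends to have little evidence to work with.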

Use as a library

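To use langident as a library from Maven, add the following repository and dependency to the <repositories> and <dependencies> sections of your pom.xml, and set the langident.version property to the version you want: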

<repository>
  <id>huygens</id>
  <url>http://maven.huygens.knaw.nl/repository/</url>
  <releases>
    <enabled>true</enabled>
    <updatePolicy>always</updatePolicy>
    <checksumPolicy>warn</checksumPolicy>
  </releases>
  <snapshots>
    <enabled>true</enabled>
    <updatePolicy>always</updatePolicy>
    <checksumPolicy>fail</checksumPolicy>
  </snapshots>
</repository>

<dependency>
  <groupId>nl.knaw.huygens.pergamon</groupId>
  <artifactId>langident</artifactId>
  <version>${langident.version}</version>
</dependency>

Then create a model (NaiveBayes in this example), train it on the built-in training set, and use it:

TrainingSet builtin = TrainingSet.getBuiltin();
Model guesser = new NaiveBayes().train(builtin);
System.out.println(guesser.predictBest("Welke taal is dit?"));

This should produce "nl" on the standard output stream. More detailed information is available from the predictScores method.

Web service

To run langident as a web service, first build it:

mvn package

Start the server:

target/appassembler/bin/langident server

Then try it out, in another terminal window:

curl http://localhost:8080/ident -d text="C'est quel langue?" | jq .

The jq command does pretty-printing of JSON. It can also give you the highest-scoring language, which is always the first one listed:

curl http://localhost:8080/ident -d text="C'est quel langue?" |
    jq '.prediction[0].label'

Langident has several built-in models. To list them, query the /ident/models endpoint:

curl http://localhost:8080/ident/models

To use a different model, pass its name in the model query parameter. Quote the URL so that the shell does not interpret the question mark:

curl 'http://localhost:8080/ident?model=naivebayes' -d text="che lingua è?"

The list of known languages is available from the /ident/languages endpoint:

curl 'http://localhost:8080/ident/languages?model=cavnartrenkle'
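The requests made with curl above can also be issued from Java. The following is a minimal sketch using the JDK's built-in HTTP client (Java 11 or later). It assumes the server is running on localhost:8080 as started above and that it accepts a percent-encoded form body; the class name `IdentClient` and its helper methods are hypothetical and not part of langident.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical client for the /ident endpoint; not part of langident. */
final class IdentClient {
    // Assumed base URL: the server from the steps above, on its default port.
    static final String BASE = "http://localhost:8080/ident";

    /** Builds a form body with a single "text" field, percent-encoding the value. */
    static String formBody(String text) {
        return "text=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    /** Builds the endpoint URI; a non-null model is passed as the "model" query parameter. */
    static URI endpoint(String model) {
        if (model == null) {
            return URI.create(BASE);
        }
        return URI.create(BASE + "?model=" + URLEncoder.encode(model, StandardCharsets.UTF_8));
    }

    /** POSTs the text to the service and returns the raw JSON response body. */
    static String identify(String text, String model) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(endpoint(model))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(text)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Requires a running langident server; prints the JSON it returns.
        System.out.println(identify("C'est quel langue?", "naivebayes"));
    }
}
```

Passing null as the model leaves the choice of model to the server, as in the first curl example.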
