Coder Social home page Coder Social logo

lingus's Introduction

Lingus

Lingus is a toolset for testing out AI-enabled NLP concepts. It currently supports word embeddings using the word2vec model implementation from DeepLearning4J.

This is first and foremost my personal playground, as I'm currently exploring the possibilities for AI-enabled NLP wrt. their application in enterprise search. I published this repository in case other people search for an easy entrance into this field and find the tools provided useful for their own experiments. But please note: While Lingus can surely be used for your personal experiments, the integration of Lingus into professional and / or commercial software is prohibited by its license (cf. the license information at the end of this document).

Getting started

Lingus is a Quarkus application that expects a data directory with word2vec models in it. Using Maven, the solution can easily be compiled from its Maven module lingus-webapp and started locally:

$> mvn compile quarkus:dev -Dlingus.data.dir=/var/data/lingus

Word embeddings

The word embedding API has both query and administration endpoints. After successfully starting up Lingus, go to its Swagger UI to see the different endpoints and interact with them. The following examples uses curl to interact with Lingus from the CLI.

Upon startup, Lingus treats all files at lingus.data.dir as binary word2vec models and lists them along with their active-inactive status when querying the /wordembeddings/admin/models endpoint. A model has to explicitly be activated before using it.

Suppose the query

$> curl -X GET http://localhost:8080/wordembedding/admin/models

yields the following result:

{
  "availableModels": [
    {
      "active": false,
      "filename": "wikipedia-small.bin"
    },
    {
      "active": false,
      "filename": "wikipedia-large.bin"
    }
  ]
}

This shows that Lingus detected to files at lingus.data.dir. We're going to load a word2vec model from wikipedia-small.bin by issuing

$> curl -X PUT http://localhost:8080/wordembedding/admin/models/wikipedia-small.bin

Loading a model naturally depends heavily on its filesize. This process can easily take up to several minutes. The endpoint responds synchronously (as this suffices for my use cases), so don't be surprised about long response times. The response shows something like this:

{
  "activated": "wikipedia-small.bin",
  "deactivated": "<<undefined>>"
}

The response tells us which model it loaded and which model - if any - got deactivated.

Now we are ready to query the model. For this example, we are interested for terms that the model deems similar for the input term "school".

$> curl http://localhost:8080/wordembeddings/query?term=school&maxSuggestions=5
{
  "terms": [
    {
      "similar": [
        {
          "similarity": 0.9001061320304871,
          "word": "university"
        },
        {
          "similarity": 0.6805482506752014,
          "word": "academy"
        },
        {
          "similarity": 0.5688245296478271,
          "word": "sciences"
        },
        {
          "similarity": 0.6678423881530762,
          "word": "arts"
        }
      ],
      "term": "school"
    }
  ]
}

The response yields four similar terms for "school" (just four because the example is based on the German Wikipedia, so the model primarily learns contextual information for German words). Don't let the similarity metric fool you: The list is ordered by the closeness wrt. the word vector model of the underlying word2vec implementation. The similarity measure is another metric that the word2vec model is able to infer for pairs of words (e.g. the pair "school" and "university").

Training a word2vec model

The Lingus toolset also comes with CLI applications to build models. Currently, there is a single CLI application in Maven module lingus-word2vec-cli which enables you to build your one models based on a given corpus. The corpus needs to be a plaintext file where each sentence is separated by a newline.

To give you a glimpse on the possible parameters: If you start the CLI tool TrainWord2VecModel without the required parameters, it prints the following help text.

 _      _____ _   _  _____ _    _  _____
| |    |_   _| \ | |/ ____| |  | |/ ____|
| |      | | |  \| | |  __| |  | | (___
| |      | | | . ` | | |_ | |  | |\___ \
| |____ _| |_| |\  | |__| | |__| |____) |
|______|_____|_| \_|\_____|\____/|_____/
Usage: train-word2vec-model [-fv] [--iterations=<iterations>]
                            [--layerSize=<layerSize>]
                            [--minWordFrequency=<minWordFrequency>]
                            [-o=<outputFilename>] [--seed=<seed>]
                            [--windowSize=<windowSize>] <inputFilename>
Trains a word2vec model using Deeplearning4J based on a sentence-by-line
textfile.
      <inputFilename>   The training corpus for the word2vec model
  -f, --force           Override an already existing output file
      --iterations=<iterations>
                        Sets how many iterations should be done over batched
                          sequences.
      --layerSize=<layerSize>
                        Sets the number of dimensions for outcome vectors
      --minWordFrequency=<minWordFrequency>
                        Sets the minimal element frequency for elements found
                          in the training corpus. All elements below this
                          threshold will be removed before training.
  -o, --output=<outputFilename>
                        Sets the output filename
      --seed=<seed>     Sets the seed value for the internal random number
                          generator
  -v, --verbose         Increases the amont of log output
      --windowSize=<windowSize>
                        Sets the window size for Skip-Gram training

Native builds

Both the Lingus web application and the Lingus CLI tools can be compiled via GraalVM to native binaries.

There are a dozen tutorials on the web on how to do this, so won't go into the details here. If you want to create an executable for the Lingus web application, you'll have to check the Quarkus documentation on how to do that. If you want to create an executable for the CLI tools, you'll have to check the Picocli documentation on how to do that.

License

This work is released under the terms of the LGPL v3.

lingus's People

Contributors

mguenther avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

andy-wagner

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.