Embeddings

This package provides easy-to-use Python class and CLI interfaces to:

  • clean corpora efficiently in terms of computation time

  • generate Word2Vec embeddings (based on gensim) and write them directly to a format that is compatible with TensorFlow Projector

Thus, with two classes, or two commands, anyone should be able to clean a corpus and generate embeddings that can be uploaded and visualized with TensorFlow Projector.

Getting started

Requirements

This package requires gensim, nltk, and docopt to run. If pip doesn't install these dependencies automatically, you can install them by running:

pip install nltk docopt gensim
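
To confirm that the three dependencies are importable in the current environment, a small check like the following can help. This is only a sketch using the standard library; `missing_packages` is a hypothetical helper name and is not part of this package.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# The three dependencies listed in the requirements above.
missing = missing_packages(["gensim", "nltk", "docopt"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```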

Installation

To install this package, run:

pip install embeddingsprep

Future versions might include conda builds, but none are available yet.

Main features

Preprocessing

For Word2Vec, we want preprocessing that is light but still meaningful: it should denoise the text while keeping as much variety and information as possible. A detailed description of what is done during the preprocessing is available here.

Usage example:

Creating and saving a loadable configuration:

from embeddingsprep.preprocessing.preprocessor import PreprocessorConfig, Preprocessor

# Create a configuration stored in /tmp/logdir and save it so it can be reloaded later
config = PreprocessorConfig('/tmp/logdir')
config.set_config(writing_dir='/tmp/outputs')
config.save_config()

prep = Preprocessor('/tmp/logdir')  # Loads the config object in /tmp/logdir if it exists
prep.fit('~/mydata/')  # Fits the unigram and bigram occurrences
prep.filter()  # Filters with all the config parameters
prep.transform('~/mydata/')  # Transforms the texts with the filtered vocab
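
To give an intuition for the fit / filter / transform sequence, here is a minimal, standalone sketch of the general idea: count unigram and bigram occurrences, keep a filtered vocabulary, then rewrite the text with it. It is not the package's implementation. The function names, the `min_count` threshold, and the `<unk>` replacement token are all assumptions made for illustration, since the actual filtering criteria are controlled by the configuration parameters.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def count_ngrams(sentences: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and adjacent-word pairs (bigrams) over tokenized sentences."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def filter_vocab(unigrams: Counter, min_count: int) -> Set[str]:
    """Keep words seen at least `min_count` times (hypothetical criterion)."""
    return {word for word, n in unigrams.items() if n >= min_count}


def transform(tokens: List[str], vocab: Set[str], unk: str = "<unk>") -> List[str]:
    """Replace out-of-vocabulary words with a placeholder (hypothetical choice)."""
    return [t if t in vocab else unk for t in tokens]


corpus = [
    "the cat sat on the mat".split(),
    "the cat ate".split(),
]
unigrams, bigrams = count_ngrams(corpus)              # "fit" step
vocab = filter_vocab(unigrams, min_count=2)           # "filter" step
cleaned = [transform(s, vocab) for s in corpus]       # "transform" step
print(cleaned[1])
```

Counting once and filtering afterwards means the corpus only has to be read twice: once to gather statistics and once to rewrite the text.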

Word2Vec

For Word2Vec, we wrote a simple wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and the embeddings as .tsv files that can be visualized with TensorFlow Projector (http://projector.tensorflow.org/).

Usage example:

from embeddingsprep.models.word2vec import Word2Vec
model = Word2Vec(emb_size=300, window=5, epochs=3)
model.train('./my-preprocessed-data/')
model.save('./my-output-dir')
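
For reference, TensorFlow Projector accepts two tab-separated files: one with a vector per line, and one with the matching label per line. The sketch below shows how such files could be written from a plain word-to-vector mapping using only the standard library. It is an illustration of the file format, not the wrapper's actual code; the function name `write_projector_files` and the file names `vectors.tsv` and `metadata.tsv` are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Dict, Sequence


def write_projector_files(vectors: Dict[str, Sequence[float]], out_dir: str) -> None:
    """Write one vector per line and one label per line, in the same order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "vectors.tsv", "w", newline="", encoding="utf-8") as vec_f, \
         open(out / "metadata.tsv", "w", newline="", encoding="utf-8") as meta_f:
        for word, vec in vectors.items():
            # Vector components are separated by tabs, with no header row.
            vec_f.write("\t".join(f"{x:.6f}" for x in vec) + "\n")
            # A single-column metadata file has no header row either.
            meta_f.write(word + "\n")


# Toy two-dimensional vectors, made up for the demonstration.
demo_dir = tempfile.mkdtemp()
write_projector_files({"cat": [0.1, 0.2], "dog": [0.3, 0.4]}, demo_dir)
print(sorted(p.name for p in Path(demo_dir).iterdir()))
```

Both files can then be loaded through the "Load" button on the Projector page. Row order matters: line i of the metadata file labels line i of the vectors file.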

Contributing

Any GitHub issue, contribution, or suggestion is welcome! You can open issues on the GitHub repository.

Contributors

sally14
