jimhester / text2vec

This project forked from dselivanov/text2vec

Fast text mining framework for R. Text vectorization and state-of-the-art GloVe word embeddings.

Home Page: http://dsnotes.com/

License: Other

R 61.15% C++ 38.85%

text2vec's Introduction

Tutorials

  1. Vectorization.
  2. GloVe on English Wikipedia.

Features

text2vec is an R package whose main goal is to provide an efficient framework with a concise API for text analysis and natural language processing (NLP). It is inspired by gensim, an excellent Python library for NLP.

Core functionality

At the moment we cover the following two topics:

  1. Fast text vectorization on arbitrary n-grams.
    • using vocabulary
    • using feature hashing
  2. State-of-the-art GloVe word embeddings.
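The two vectorization strategies listed above can be sketched in base R. This is a minimal illustration of the idea only, not text2vec's API: `toy_hash`, `n_buckets`, and the whitespace tokenizer are assumptions made for the example, and the hash is a toy polynomial hash rather than whatever the package uses internally.

```r
# Toy documents; tokenization here is a plain whitespace split (an assumption).
docs <- c("the cat sat", "the dog sat down")
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# 1. Vocabulary-based: one pass collects every term, a second pass counts them.
vocab <- sort(unique(unlist(tokens)))
dtm_vocab <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm_vocab) <- vocab

# 2. Feature hashing: each term maps directly to one of n_buckets columns,
#    so no vocabulary is built or stored. Different terms may collide.
toy_hash <- function(term, n_buckets) {
  # Hypothetical polynomial hash over code points; for illustration only.
  h <- 0
  for (code in utf8ToInt(term)) h <- (h * 31 + code) %% n_buckets
  h + 1  # shift to R's 1-based column indices
}

n_buckets <- 8
dtm_hash <- t(vapply(
  tokens,
  function(tk) {
    idx <- vapply(tk, toy_hash, numeric(1), n_buckets = n_buckets)
    as.numeric(tabulate(idx, nbins = n_buckets))
  },
  numeric(n_buckets)
))
```

Both results are document-by-feature count matrices. The vocabulary version keeps readable column names but needs to see all terms first; the hashed version has a fixed width known in advance, at the cost of possible collisions.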

Efficiency

  • The core functionality is carefully written in C++, which also makes text2vec memory-friendly.
  • Some parts (GloVe training) are fully parallelized using the excellent RcppParallel package. This means the parallel features work on OS X, Linux, Windows, and Solaris (x86) without any additional tuning, hacks, or tricks.
  • Streaming API: users don't have to load all the data into RAM, because text2vec can process streams of chunks.

API

  • Built around the iterator abstraction.
  • Concise: provides only a few functions, each of which does its job well.
  • Does not (and probably will not in the future) provide trivial, very high-level functions.

Terminology and what is under the hood

As stated above, text2vec is built around a streaming API and iterators, which allow a corpus to be constructed from iterable objects. This touches on two main concepts:

  1. Corpus. In text2vec, a corpus is an object that contains tokens and other information or meta-information used for text vectorization and other processing. Documents can be inserted into a corpus efficiently because, technically, Corpus is a C++ class wrapped with Rcpp modules as a reference class (which has reference semantics!). Users usually don't need to care about this, but should keep the nature of such objects in mind. In particular, such objects cannot be saved or serialized with R's save*() methods. The good news is that the corresponding R objects can be extracted from the corpus easily and efficiently and then worked with in the usual way.
  2. Iterators. If you are not familiar with them in the context of R, I highly recommend reviewing the vignettes of the iterators package. A big advantage of this abstraction is that it makes us agnostic to the type of input: we can change it transparently just by providing the correct iterator.
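To make the two concepts concrete, here is a rough base-R sketch of feeding chunks from an iterator into an object with reference semantics. None of these names come from text2vec or the iterators package: `chunk_iterator`, `new_corpus`, and `insert_chunk` are hypothetical helpers, and an R environment merely stands in for the C++-backed Corpus.

```r
# Hypothetical helper: a closure-based iterator yielding `size` documents at a time.
chunk_iterator <- function(docs, size) {
  pos <- 0
  list(
    has_next = function() pos < length(docs),
    next_chunk = function() {
      idx <- seq(pos + 1, min(pos + size, length(docs)))
      pos <<- max(idx)  # advance the cursor kept in the enclosing scope
      docs[idx]
    }
  )
}

# An environment has reference semantics: it is modified in place,
# not copied on each insert (a stand-in for the Corpus object).
new_corpus <- function() {
  corpus <- new.env()
  corpus$counts <- list()
  corpus
}

# Hypothetical helper: tokenize one chunk and update term counts in place.
insert_chunk <- function(corpus, chunk) {
  for (term in unlist(strsplit(tolower(chunk), " ", fixed = TRUE))) {
    old <- corpus$counts[[term]]
    corpus$counts[[term]] <- if (is.null(old)) 1 else old + 1
  }
  invisible(corpus)
}

docs <- c("the cat sat", "the dog sat down", "a cat ran")
it <- chunk_iterator(docs, size = 2)
corpus <- new_corpus()
while (it$has_next()) insert_chunk(corpus, it$next_chunk())

# Extract an ordinary R object to save or process in the usual way.
term_counts <- unlist(corpus$counts)
```

Only the iterator knows where the documents come from. Replacing `chunk_iterator` with one that reads from a file or a connection would leave the insertion loop unchanged, which is the input-agnostic property described above.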

Contributors are very welcome

The project has an issue tracker on GitHub where I file feature requests and notes for future work. Any ideas are much appreciated.

If you like it, you can help:

  • Test and leave feedback on the GitHub issue tracker (preferred) or directly by email.
    • The package is tested on Linux and OS X, so Windows users are especially welcome.
  • Fork and start contributing. Vignettes, docs, tests, and use cases are very welcome.
  • Or just give me a star on the project page :-)

Contributors

dselivanov, lmullen
