word2vec.jl's Introduction

Word2Vec

Julia interface to word2vec

Word2Vec takes a text corpus as input and produces word vectors as output. Training is done by the original C code; all other functionality is pure Julia. See demo for more details.

Installation

using Pkg
Pkg.add("Word2Vec")

Note: Only Linux and OS X are supported.

Functions

All exported functions are documented: type ? followed by a function name at the REPL to get help. For a list of functions, see here.

Examples

First, download a text corpus, for example http://mattmahoney.net/dc/text8.zip.

Suppose the file text8 is stored in the current working directory. We can train the model with the function word2vec.

julia> word2vec("text8", "text8-vec.txt", verbose = true)
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 350.44k  

Now we can load the word vectors from text8-vec.txt into Julia.

julia> model = wordvectors("./text8-vec.txt")
WordVectors 71291 words, 100-element Float64 vectors

The vector representation of a word can be obtained using get_vector.

julia> get_vector(model, "book")
100-element Array{Float64,1}:
 -0.05446138539336186
  0.001090934639284009
  0.06498087707990222
  ⋮
 -0.0024113040415322516
  0.04755140828570571
  0.039764719065723826

The words closest to a given word by cosine similarity, for example book, can be found using cosine_similar_words.

julia> cosine_similar_words(model, "book")
10-element Array{String,1}:
 "book"
 "books"
 "diary"
 "story"
 "chapter"
 "novel"
 "preface"
 "poem"
 "tale"
 "bible"

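To make the ranking criterion concrete, here is a minimal sketch in plain Julia. The toy vocabulary, the 3-dimensional vectors, and the helper names cosine_sim and most_similar are hypothetical and not part of Word2Vec.jl; the package's own implementation may differ.

```julia
using LinearAlgebra

# Toy 3-dimensional embeddings (made-up values, for illustration only).
toy_vectors = Dict(
    "book"  => [0.9, 0.1, 0.0],
    "novel" => [0.8, 0.2, 0.1],
    "car"   => [0.0, 0.1, 0.9],
)

# Cosine similarity: dot product divided by the product of the norms.
cosine_sim(a, b) = dot(a, b) / (norm(a) * norm(b))

# Rank every word in the vocabulary by cosine similarity to `word`
# and return the `n` best matches (the query word itself ranks first).
function most_similar(vectors, word, n)
    query = vectors[word]
    scored = [(w, cosine_sim(query, v)) for (w, v) in vectors]
    sort!(scored; by = last, rev = true)
    return first.(scored[1:min(n, length(scored))])
end

most_similar(toy_vectors, "book", 2)
```

As in the output above, the query word comes back as its own nearest neighbour, because its cosine similarity with itself is 1.
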
Word vectors have many interesting properties. For example, vector("king") - vector("man") + vector("woman") is close to vector("queen").

julia> analogy_words(model, ["king", "woman"], ["man"])
5-element Array{String,1}:
 "queen"
 "empress"
 "prince"
 "princess"
 "throne"

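The vector arithmetic behind this kind of analogy can be sketched with hand-made data. The 2-dimensional vectors and the helper toy_analogy below are hypothetical and chosen so that the arithmetic works out exactly; excluding the input words from the candidates is a choice made for this sketch, not a statement about how the package does it.

```julia
using LinearAlgebra

# Hypothetical 2-dimensional embeddings: (royalty, gender). Values are made up.
toy = Dict(
    "king"  => [1.0,  1.0],
    "queen" => [1.0, -1.0],
    "man"   => [0.0,  1.0],
    "woman" => [0.0, -1.0],
    "apple" => [-1.0, 0.2],
)

cosine_sim(a, b) = dot(a, b) / (norm(a) * norm(b))

# Add the positive words, subtract the negative ones, then return the
# closest remaining word by cosine similarity.
function toy_analogy(vectors, positive, negative)
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in keys(vectors) if !(w in positive) && !(w in negative)]
    scores = [cosine_sim(target, vectors[w]) for w in candidates]
    return candidates[argmax(scores)]
end

toy_analogy(toy, ["king", "woman"], ["man"])
```

Here king - man + woman equals the queen vector exactly; with trained vectors the result is only approximately equal, which is why a nearest-neighbour search is used.
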
References

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space". In Proceedings of Workshop at ICLR, 2013.

  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality". In Proceedings of NIPS, 2013.

  • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations". In Proceedings of NAACL HLT, 2013.

Acknowledgements

The design of the package is inspired by Daniel Rodriguez (@danielfrg)'s Python word2vec interface.

Reporting Bugs

Please file an issue to report a bug or request a feature.

word2vec.jl's Issues

Does word2vec.jl use the skip-gram with negative sampling (SGNS) method?

Maybe I am missing something, but it looks like word2vec.jl does not implement the skip-gram with negative sampling (SGNS) variant of word2vec. Most people might have moved on from word2vec, but it is vastly more data-efficient than the transformer framework. I am making custom embeddings and want to port my code to Julia, but the lack of SGNS is a showstopper. Any chance of an implementation?

The AdaGram.jl package, which does, is no longer actively supported.
https://github.com/sbos/AdaGram.jl
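
For what it is worth, the stack trace in the "word2vec is not defined" issue on this page lists cbow, hs and negative among the keyword arguments of word2vec. In the original C tool, -cbow 0 selects skip-gram and -negative sets the number of negative samples. Assuming the Julia wrapper forwards these keywords unchanged (an assumption, not something confirmed here), an SGNS run might be configured as sketched below. The values and the output file name are illustrative, and the training call is commented out so the snippet runs without the package or the corpus.

```julia
# Options for skip-gram with negative sampling, assuming the keywords are
# forwarded unchanged to the C tool's -cbow, -hs and -negative flags.
sgns_options = (
    cbow = 0,       # 0 = skip-gram, 1 = CBOW in the C tool
    hs = 0,         # disable hierarchical softmax
    negative = 5,   # number of negative samples per positive example
)

# With the package installed and the corpus in the working directory:
# using Word2Vec
# word2vec("text8", "text8-sgns.txt"; sgns_options..., verbose = true)
```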

Documentation - Corrections

Unless I am mistaken, in word2phrase

"""
threshold <AbstractFloat>
              The <AbstractFloat> value represents threshold for
              forming the phrases (higher means less phrases); default 100
"""

It's not a float but an int.

In word2cluster(train, output, classes; ...), the function name is actually plural: word2clusters.

Awesome package, thanks!

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

c file did not build on installation

New to Julia as well as this package, so it's possible there only needs to be a readme update.

After installing the package and loading it in my REPL, I was unable to run the word2vec command successfully. The error was ERROR: IOError: could not spawn...no such file or directory (ENOENT). It was unable to find the file deps/src/word2vec-c/word2vec.

After manually cd'ing into that directory and running make, I am able to run the command successfully. Did I miss an installation step that would have built the C executables? I see you have a file deps/build.jl that is supposed to build that directory, so it's possible I missed a step that would have run this.
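
One way to check for this situation from Julia is sketched below. Pkg.build is the standard way to re-run a package's deps/build.jl; the helper names word2vec_binary and ensure_built are hypothetical, and the binary path is simply the one quoted in the error report above, so it may differ between package versions.

```julia
using Pkg

# Path of the training binary relative to a package root, as quoted in the
# error report (an assumption about the layout, not a documented API).
word2vec_binary(pkgroot) = joinpath(pkgroot, "deps", "src", "word2vec-c", "word2vec")

# Rebuild only when the binary is missing. `pkgroot` is the installed
# package directory, e.g. dirname(dirname(pathof(Word2Vec))).
function ensure_built(pkgroot)
    isfile(word2vec_binary(pkgroot)) && return true
    Pkg.build("Word2Vec")   # re-runs deps/build.jl
    return isfile(word2vec_binary(pkgroot))
end
```

If ensure_built still returns false after the rebuild, the build log printed by Pkg.build is the place to look for the underlying compiler error.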

Thanks!

text8.zip file for primary example is no longer there

It appears that Matt Mahoney is no longer hosting this file (text8.zip), which is used in the primary example here (and seems to be widely used elsewhere). Does anyone know where else to get it?

Thanks for any help.

word2vec is not defined (Julia)

I'm getting an error that prevents me from using the word2vec function in Julia on a corpus.

Pkg.add("Word2Vec")

Code:

using Word2Vec
word2vec("text8","vec.txt",verbose=true)

Error Message:

ERROR: UndefVarError: word2vec not defined
Stacktrace:
 [1] word2vec(::String, ::String; size::Int64, window::Int64, sample::Float64, hs::Int64, negative::Int64, threads::Int64, iter::Int64, min_count::Int64, alpha::Float64, debug::Int64, binary::Int64, cbow::Int64, save_vocab::Nothing, read_vocab::Nothing, verbose::Bool) at C:\Users\15714\.julia\packages\Word2Vec\knfyL\src\interface.jl:73
 [2] top-level scope at none:1

Is anyone else having this problem?

Adding word2vec on MacOS fails

I get the following error with Julia 1.1 on macOS:

julia> Pkg.add("Word2Vec")
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package ASTInterpreter2 [e6d88f4b]:
 ASTInterpreter2 [e6d88f4b] log:
 ├─possible versions are: 0.1.0-0.1.1 or uninstalled
 ├─restricted by julia compatibility requirements to versions: uninstalled
 └─restricted by compatibility requirements with Atom [c52e3926] to versions: 0.1.0-0.1.1 — no versions left
   └─Atom [c52e3926] log:
     ├─possible versions are: [0.1.0-0.1.1, 0.2.0-0.2.1, 0.3.0, 0.4.0-0.4.6, 0.5.0-0.5.10, 0.6.0-0.6.17, 0.7.0-0.7.15, 0.8.0-0.8.5] or uninstalled
     └─restricted to versions 0.7.14 by an explicit requirement, leaving only versions 0.7.14
Stacktrace:
 [1] #propagate_constraints!#61(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:1005
 [2] propagate_constraints! at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:946 [inlined]
 [3] #simplify_graph!#121(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:1460
 [4] simplify_graph! at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:1460 [inlined] (repeats 2 times)
 [5] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Nothing) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/Operations.jl:371
 [6] resolve_versions! at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/Operations.jl:315 [inlined]
 [7] #add_or_develop#63(::Array{Base.UUID,1}, ::Symbol, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/Operations.jl:1171
 [8] #add_or_develop at ./none:0 [inlined]
 [9] #add_or_develop#15(::Symbol, ::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:54
 [10] #add_or_develop at ./none:0 [inlined]
 [11] #add_or_develop#14 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:31 [inlined]
 [12] #add_or_develop at ./none:0 [inlined]
 [13] #add_or_develop#13 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:29 [inlined]
 [14] #add_or_develop at ./none:0 [inlined]
 [15] #add_or_develop#12(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:mode,),Tuple{Symbol}}}, ::Function, ::String) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:28
 [16] #add_or_develop at ./none:0 [inlined]
 [17] #add#20 at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:59 [inlined]
 [18] add(::String) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:59
 [19] top-level scope at none:0

Translate the C code to Julia

Do you plan to translate the C code of word2vec into Julia?
Or, may I merge my partial translation work into this repository?

Explicit rows values separator as kwarg

Word2Vec could be used to read other embedding files, such as KB embeddings, if the row value separator (i.e. the character that separates the values of an embedding vector) could be changed to another character such as a comma (,).

Making the separator an explicit option for text embeddings would solve this.
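
A rough sketch of what such an option could look like is given below in plain Julia. The function read_text_embeddings and its sep keyword are hypothetical and not part of Word2Vec.jl. The sketch assumes the usual word2vec text layout: a whitespace-separated header line with the vocabulary size and vector size, followed by one line per word.

```julia
# Read a text embedding file: a header line "<vocab_size> <vector_size>",
# then one line per word, with fields separated by `sep`.
function read_text_embeddings(io::IO; sep::AbstractChar = ' ')
    header = split(readline(io))
    n, dim = parse(Int, header[1]), parse(Int, header[2])
    vocab = Vector{String}(undef, n)
    vectors = Matrix{Float64}(undef, dim, n)   # one column per word
    for i in 1:n
        fields = split(strip(readline(io)), sep; keepempty = false)
        vocab[i] = String(fields[1])
        vectors[:, i] = parse.(Float64, fields[2:dim + 1])
    end
    return vocab, vectors
end

# Example with a comma as the separator.
io = IOBuffer("2 3\nbook,0.1,0.2,0.3\nnovel,0.4,0.5,0.6\n")
vocab, vectors = read_text_embeddings(io; sep = ',')
```

With the default sep = ' ', the same function reads a space-separated file, so an explicit keyword would not change behaviour for existing files.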
