ericxsun / word2vec
Automatically exported from https://code.google.com/p/word2vec
License: Apache License 2.0
Tools for computing distributed representations of words
--------------------------------------------------------
We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram (SG) models, as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary, using either the Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following:
- desired vector dimensionality
- the size of the context window, for either the Skip-gram or the Continuous Bag-of-Words model
- training algorithm: hierarchical softmax and / or negative sampling
- threshold for downsampling the frequent words
- number of threads to use
- the format of the output word vector file (text or binary)

Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for different training sets.

The script demo-word.sh downloads a small (100 MB) text corpus from the web and trains a small word vector model. After training finishes, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/
The enclosed patch makes word2vec build on FreeBSD.
Original issue reported on code.google.com by [email protected]
on 1 Oct 2013 at 9:19
Attachments:
Just released a Ruby module that builds an index of a binary word2vec vector
file, so your code can seek directly to the right position in the file for a
given word or term. For example, the word "/en/italy" in the English
"freebase-vectors-skipgram1000-en.bin" file is at byte position 116414.
The module also computes a locality-sensitive hash for each vector in a binary
word2vec file, so you can do a nearest-neighbor search (i.e. by cosine distance)
much faster. I get a couple of orders of magnitude better performance on my
machine, with a 10-bit random-projection LSH.
https://github.com/someben/treebank/blob/master/src/build_word2vec_index.rb
Thanks for the project, Tomas.
Best,
Ben
Original issue reported on code.google.com by [email protected]
on 23 Sep 2013 at 3:48
When I run demo-phrases.sh on Linux I get the following error message:
./demo-phrases.sh: line 6: 5492 Segmentation fault ./word2phrase -train
text8 -output text8-phrase -threshold 500 -debug 2
Original issue reported on code.google.com by [email protected]
on 25 Aug 2013 at 11:12
If, after making needed corrections, this could be added to the source code, I
think future users would appreciate this. Thanks. --Gregg Williams
--- begin distance.c ---
// Copyright 2013 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
const long long max_size = 2000; // max length of strings
const long long N = 40; // number of closest words that will be shown
const long long max_w = 50; // max length of vocabulary entries
int main(int argc, char **argv) {
FILE *f;
char st1[max_size];
char bestw[N][max_size];
char file_name[max_size], st[100][max_size];
float dist, len, bestd[N], vec[max_size];
long long words, size, a, b, c, d, cn, bi[100];
char ch;
float *M;
char *vocab;
if (argc < 2) {
printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
return 0;
}
strcpy(file_name, argv[1]);
f = fopen(file_name, "rb");
if (f == NULL) {
printf("Input file not found\n");
return -1;
}
// words = number of words in file
fscanf(f, "%lld", &words);
// size = number of floating-point values associated with each word in the "dictionary"
fscanf(f, "%lld", &size);
// vocab points to a list of all the words in the "dictionary". Words are stored in fixed-width substrings;
// each word is allotted max_w bytes.
vocab = (char *)malloc((long long)words * max_w * sizeof(char));
// SUMMARY: M contains 'size' (an integer) floats for each word in the "dictionary".
// M points to a vector of (words * size) floats, stored linearly. Floats 0 through (size - 1) correspond
// to word 0, floats size through (2 * size - 1) correspond to word 1, etc.
M = (float *)malloc((long long)words * (long long)size * sizeof(float));
if (M == NULL) {
printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
return -1;
}
for (b = 0; b < words; b++) {
// Reads one entry from input file f, which corresponds to one word of the "dictionary" of words contained in f;
// this word is stored in the specific substring of vocab reserved for it.
// The %c consumes the single separator character (a space) between the word and its binary vector values.
fscanf(f, "%s%c", &vocab[b * max_w], &ch);
// Reads 'size' floats, corresponding to the word with index b, into array M.
for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
// len = sqrt (sum of [each entry in M] ** 2 ) -- a normalizing factor
len = 0;
for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
len = sqrt(len);
// Each entry in M is normalized by a factor of 'len'.
for (a = 0; a < size; a++) M[a + b * size] /= len;
}
fclose(f);
// **********************************************
// **** beginning of user-interaction loop ****
// **********************************************
while (1) {
for (a = 0; a < N; a++) bestd[a] = 0;
for (a = 0; a < N; a++) bestw[a][0] = 0;
printf("Enter word or sentence (EXIT to break): ");
// st1 receives the input text from stdin (usually the console)
a = 0;
while (1) {
st1[a] = fgetc(stdin);
if ((st1[a] == '\n') || (a >= max_size - 1)) {
st1[a] = 0;
break;
}
a++;
}
// End program loop if input text = "EXIT".
if (!strcmp(st1, "EXIT")) break;
// The loop below splits st1 into space-separated words st[0], st[1], ..., each a zero-terminated string.
cn = 0;
b = 0;
c = 0;
while (1) {
st[cn][b] = st1[c];
b++;
c++;
st[cn][b] = 0;
if (st1[c] == 0) break;
if (st1[c] == ' ') {
cn++;
b = 0;
c++;
}
}
cn++;
// cn = number of words (separated by a space) in the input text
// st = an array of strings: st[0][] is the first word of the input text; st[1][] is the second word, etc.
// This loop either finds each word within the input text in the 'vocab' string, or it signals
// b = -1 if at least one word is not found. If a word is found, b is the index to it in 'vocab'.
// bi[0] = the index of the first word in the input text; bi[1] = the index of the second word, etc.;
// bi[k] = -1 signals no more words--i.e., there are k words in the input text.
// For each word in the input text, the word and its position in the "dictionary" is printed.
for (a = 0; a < cn; a++) {
//
for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st[a])) break;
if (b == words) b = -1;
bi[a] = b;
printf("\nWord: %s Position in vocabulary: %lld\n", st[a], bi[a]);
if (b == -1) {
printf("Out of dictionary word!\n");
break;
}
}
// If any input word was not found, restart the user-interaction loop.
if (b == -1) continue;
// Reminder:
// st points to the words in the input text (there are cn of them)
// bi[k] is the index of word st[k] within the 'vocab' string
// M holds the normalized vector for each word in the "dictionary".
// The code below finds and prints the N "closest" words to the input and their
// "similarity" values (cosine similarities, at most 1.0)--larger values are "closer".
printf("\n Word Cosine distance\n------------------------------------------------------------------------\n");
// vec contains the 'size' floating-point values associated with the input text.
// NOTE: if the input text contains multiple words, the value of each element in vec is
// the SUM of the corresponding float values for each of the words in the input text.
for (a = 0; a < size; a++) vec[a] = 0;
for (b = 0; b < cn; b++) {
if (bi[b] == -1) continue;
// vec contains the 'size' vectors associated with the bi[b]-th word in the "dictionary".
for (a = 0; a < size; a++) vec[a] += M[a + bi[b] * size];
}
// len = sqrt (sum of the squares of each vector element within vec)
// Each element in vec is normalized by dividing it by 'len'.
len = 0;
for (a = 0; a < size; a++) len += vec[a] * vec[a];
len = sqrt(len);
for (a = 0; a < size; a++) vec[a] /= len;
// Arrays bestd and bestw are associated with the list of the N words that are "closest"
// to the word(s) in the input text
// For an index i, bestw[i] holds the word in that slot,
// and bestd[i] holds that word's "distance" value.
for (a = 0; a < N; a++) bestd[a] = 0;
for (a = 0; a < N; a++) bestw[a][0] = 0;
// For each word in "dictionary".... (in loop, c is the index of the word being tested)
for (c = 0; c < words; c++) {
// a is set to 1 if any of the words in the input text is the word being tested.
a = 0;
for (b = 0; b < cn; b++) if (bi[b] == c) a = 1;
if (a == 1) continue;
// The following executes only if the word being tested is NOT in the input text.
dist = 0;
// dist = dot product of vec with the normalized vector of the word being tested (i.e. their cosine similarity)
for (a = 0; a < size; a++) dist += vec[a] * M[a + c * size];
// for each of the N slots that will eventually hold the N "closest" words...
for (a = 0; a < N; a++) {
// if the "distance" of word c is greater than the "distance" of the current slot (slot a),
// move all the bestd and bestw entries one entry closer to the end of the list (losing
// the "worst" entry) and insert the bestd and bestw entries for the current word (c)
// into the current slot (a).
if (dist > bestd[a]) {
for (d = N - 1; d > a; d--) {
bestd[d] = bestd[d - 1];
strcpy(bestw[d], bestw[d - 1]);
}
bestd[a] = dist;
strcpy(bestw[a], &vocab[c * max_w]);
break;
}
}
}
// From "best" to "worst", print each word and its "distance" value.
for (a = 0; a < N; a++) printf("%50s\t\t%f\n", bestw[a], bestd[a]);
}
return 0;
}
--- end distance.c ---
Original issue reported on code.google.com by [email protected]
on 22 Aug 2013 at 6:40
What steps will reproduce the problem?
1. Load freebase.bin files into a word2vec model on freebase
2. Attempt the .most_similar function
3. error returned
What is the expected output? What do you see instead?
See below.
What version of the product are you using? On what operating system?
Mac OS X, Anaconda Python
Please provide any additional information below.
I’m trying to get started by loading the pretrained .bin files from the
google word2vec site ( freebase-vectors-skipgram1000.bin.gz) into the gensim
implementation of word2vec. The model loads fine,
using ..
model = word2vec.Word2Vec.load_word2vec_format('...../free....-en.bin', binary=True)
and creates a
>>> print model
<gensim.models.word2vec.Word2Vec object at 0x105d87f50>
but when I run the most_similar function, it can't find the words in the
vocabulary. My error output is below.
Any ideas where I'm going wrong?
>>> model.most_similar(['girl', 'father'], ['boy'], topn=3)
2013-10-11 10:22:00,562 : WARNING : word 'girl' not in vocabulary; ignoring it
2013-10-11 10:22:00,562 : WARNING : word 'father' not in vocabulary; ignoring it
2013-10-11 10:22:00,563 : WARNING : word 'boy' not in vocabulary; ignoring it
Traceback (most recent call last):
File "", line 1, in
File "/....../anaconda/python.app/Contents/lib/python2.7/site-packages/gensim-0.8.7/py2.7.egg/gensim/models/word2vec.py", line 312, in most_similar
raise ValueError("cannot compute similarity with no input")
ValueError: cannot compute similarity with no input
any ideas welcome?
Original issue reported on code.google.com by [email protected]
on 11 Oct 2013 at 3:41
What steps will reproduce the problem?
1. Download attached text_simple train file
2. Compile word2vec.c as: gcc word2vec.c -o word2vec -lm -pthread
3. Run: ./word2vec -train text_simple -save-vocab vocab.txt
What is the expected output? What do you see instead?
Expect in saved vocab.txt file:
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
nine 9
===============
What is really seen in the file
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
===============
The last element, "nine 9", was missing.
What version of the product are you using? On what operating system?
MacOS, gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build
2336.11.00)
Please provide any additional information below.
This may not really be a bug report, because I am not sure I understand the
format of train_file and how the vocab is constructed from it.
Based on the source code of word2vec.c, when reading from train_file, it will
1. insert </s> as the first element in vocab
2. scan each word (or </s> for newline) in train_file, add it to vocab, and
hash it in vocab_hash
So far the vocab_size = the number of words in vocab, INCLUDING </s> at the head
3. sort the words in vocab based on their counts, but keep </s> as the first of
vocab
Now vocab_size becomes the number of words in vocab, EXCLUDING the leading
</s>. And if there is no newline character in train_file, </s> won't even be
hashed in vocab_hash.
So there is an inconsistency between vocab_size and the actual size of
vocab (including </s>). It could be a bug, because later, when the vocab is
iterated, it is always done by iterating the elements from 0 to vocab_size-1,
as in SaveVocab(). As a result, the leading </s> is saved, but the last
element in vocab is ignored. At least that's what it looks like with the
simple train file "text_simple" attached here.
Original issue reported on code.google.com by [email protected]
on 25 Aug 2013 at 2:38
Attachments:
Please give me directions to set up word2vec on the Windows platform.
Original issue reported on code.google.com by [email protected]
on 20 Sep 2013 at 5:42
What steps will reproduce the problem?
1. make on Mac
2.
3.
What is the expected output? What do you see instead?
distance.c:18:10: fatal error: 'malloc.h' file not found
#include <malloc.h>
^
1 error generated.
make: *** [distance] Error 1
What version of the product are you using? On what operating system?
OSX 10.9.4
Please provide any additional information below.
I fixed it by replacing malloc.h with stdlib.h
Original issue reported on code.google.com by [email protected]
on 17 Jul 2014 at 5:44
word2vec does not free allocated objects correctly. It also reads freed objects.
Attached patch fixes this issue.
Original issue reported on code.google.com by [email protected]
on 17 Aug 2013 at 6:23
Attachments:
Where it says "First billion characters from wikipedia (use the pre-processing
perl script from the bottom of Matt Mahoney's page)" on the homepage:
The link should point to: http://cs.fit.edu/~mmahoney/compression/textdata.html
which is the site which contains the script in question.
Original issue reported on code.google.com by [email protected]
on 4 Oct 2013 at 3:33
What steps will reproduce the problem?
On a Mac:
1. svn checkout http://word2vec.googlecode.com/svn/trunk/
2. make
What is the expected output?
Binary is emitted.
What do you see instead?
pindari:word2vec pmonks$ make
gcc word2vec.c -o word2vec -lm -pthread -Ofast -march=native -Wall
-funroll-loops -Wno-unused-result
cc1: error: invalid option argument ‘-Ofast’
cc1: error: unrecognized command line option "-Wno-unused-result"
word2vec.c:1: error: bad value (native) for -march= switch
word2vec.c:1: error: bad value (native) for -mtune= switch
make: *** [word2vec] Error 1
pindari:word2vec pmonks$
What version of the product are you using?
SVN r32
On what operating system?
Mac OSX 10.8.4
Original issue reported on code.google.com by [email protected]
on 15 Aug 2013 at 5:45
Run ./word2vec without arguments; in the last line of the help output you can see:
Use the continuous back of words model; default is 0 (skip-gram model)
("back of words" should read "bag of words".)
Original issue reported on code.google.com by [email protected]
on 24 Sep 2013 at 11:30
$ ./demo-phrase-accuracy.sh
make: Nothing to be done for `all'.
Starting training using file text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2586139
Words in train file: 17005206
Words written: 17000K
real 0m21.130s
user 0m20.062s
sys 0m1.054s
Starting training using file text8-phrase
Vocab size: 123636
Words in train file: 16337523
Alpha: 0.000119 Progress: 99.59% Words/thread/sec: 22.70k
real 1m38.617s
user 12m0.795s
sys 0m1.501s
newspapers:
./demo-phrase-accuracy.sh: line 12: 36538 Segmentation fault: 11
./compute-accuracy vectors-phrase.bin < questions-phrases.txt
I'm on OSX (latest non-beta), and had to swap "#include <malloc.h>" for
"#include <stdlib.h>" to get it to compile, but made no other changes.
Original issue reported on code.google.com by [email protected]
on 19 Aug 2013 at 7:41
What steps will reproduce the problem?
1. svn checkout http://word2vec.googlecode.com/svn/trunk/
2. cd trunk
3. make
What is the expected output? What do you see instead?
cc1: error: invalid option argument '-Ofast'
make: *** [word2vec] Erreur 1
What version of the product are you using? On what operating system?
revision 34. Debian Squeeze.
Original issue reported on code.google.com by [email protected]
on 4 Oct 2013 at 11:49
Sorry to put a plain old question here, but... is there a correct place to put
a plain old question?
In other words, is there a way to engage in a dialog about this project? I have
several points I'd like to discuss.
Original issue reported on code.google.com by [email protected]
on 19 Aug 2013 at 2:16
Hi,
I was able to run demo-word.sh (using the text8 corpus) and obtained the output
binary file, vectors.bin.
Can someone help me convert it into a readable text/ASCII file? I don't know
the format of the binary file.
tx
Original issue reported on code.google.com by [email protected]
on 26 Aug 2013 at 3:51
Patch for a bug that caused the last word of the vocab to be discarded after
sorting if there was no newline character in the input file.
If there is no newline in the input file, vocab[0].cn == 0. That entry is
ignored during sorting, but not in the for loop, where it decrements
vocab_size and frees the memory of the last word. However, the hash is still
computed for the last word if its count is greater than min_count. Also, the
realloc needs to allocate only vocab_size * sizeof(struct vocab_word).
Original issue reported on code.google.com by FerroMrkva
on 5 Feb 2014 at 11:24
Attachments: