deborausujono / word2vecpy
Python implementation of CBOW and skip-gram word vector models, and hierarchical softmax and negative sampling learning algorithms
Hi, I ran into some difficulties with multiprocessing. When num_processes is set to more than 1, the assignment of tasks among the workers goes wrong. Do you have any idea what causes this?
Compared to the original C code released by Google, MAX_SEN_LEN and EPOCH are missing, which causes two problems.

[1] In the sub-training process, each worker reads lines from the file between the offsets start and end. When the input file contains only a single line (for example, the text8 corpus), the following code snippet causes a bug:
    while fi.tell() < end:
        line = fi.readline().strip()
        # Skip blank lines
        if not line:
            continue

Here `line = fi.readline().strip()` reads the entire line starting from `start`, so with a one-line corpus every worker loads all of the tokens.
[2] Without EPOCH, the model makes only a single pass over the corpus; the original C code iterates several times, so its embeddings are trained on more samples.
Hi, I tried running your code, and got this error message:
Reading word 11690000
Unknown vocab size: 68558
Total words in training file: 11690125
Total bytes in training file: 85775698
Vocab size: 45151
Initializing unigram table
Traceback (most recent call last):
File "word2vec.py", line 388, in
args.min_count, args.num_processes, bool(args.binary))
File "word2vec.py", line 354, in train
table = UnigramTable(vocab)
File "word2vec.py", line 175, in __init__
table = np.zeros(table_size, dtype=np.uint32)
TypeError: 'float' object cannot be interpreted as an index`
I can't figure out how to fix it. Could you look into it? Thanks.
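For what it's worth, this looks like a Python 3 issue: a table-size expression such as `1e8` evaluates to a float, and `np.zeros` requires an integer size. An explicit cast fixes it (shown here with a smaller, illustrative table size, not necessarily the value the repo uses):

```python
import numpy as np

table_size = 1e6  # a float literal; the repo's unigram table is likely 1e8
# np.zeros(table_size) raises "TypeError: 'float' object cannot be
# interpreted as an index" on Python 3; cast to int explicitly.
table = np.zeros(int(table_size), dtype=np.uint32)
```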
I found that the negative samples drawn by the negative sampling method here may coincide with the positive example. The index values generated by this line:

    indices = np.random.randint(low=0, high=len(self.table), size=count)

may select the same token as the positive target.
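One way to avoid the collision is to redraw any sample that equals the positive target. A hedged sketch with a hypothetical helper (`sample_negatives` is my name, not the repo's API; assumes the table is not made up entirely of the target token):

```python
import numpy as np

def sample_negatives(table, target, count):
    # Draw `count` indices from the unigram table, then redraw any
    # entries that collide with the positive example's token index.
    samples = table[np.random.randint(0, len(table), size=count)]
    while (mask := samples == target).any():
        redraw = np.random.randint(0, len(table), size=int(mask.sum()))
        samples[mask] = table[redraw]
    return samples
```

Rejection sampling keeps the unigram distribution over the remaining tokens unchanged, which is what the C implementation effectively relies on.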
I found that this Python implementation of word2vec is slow, even with multiprocessing. Do you have any ideas for speeding it up?
Hi, I downloaded the code and ran it, but it fails with:

    pickle.PicklingError: Can't pickle <class 'numpy.ctypeslib.c_double_Array_100'>: it's not found as numpy.ctypeslib.c_double_Array_100

when it reaches the line

    pool = Pool(processes=num_processes, initializer=__init_process, initargs=(vocab, syn0, syn1, table, cbow, neg, dim, alpha, win, num_processes, global_word_count, fi))

Can you give me any information? Thank you.
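This PicklingError usually means the Pool workers are started with the "spawn" method (the default on Windows, and on macOS since Python 3.8), which pickles the initargs; raw ctypes arrays like syn0 cannot be pickled. Under "fork" the arrays are inherited by the child processes instead. A minimal sketch of the pattern (POSIX only; the helper names are mine, not the repo's):

```python
import ctypes
import multiprocessing as mp

def _init_process(arr):
    # Workers receive the shared weights via the initializer; under
    # fork they are inherited from the parent, never pickled.
    global _syn0
    _syn0 = arr

def _read(i):
    return _syn0[i]

def demo(n=100):
    # Force the fork start method; spawn would try to pickle the raw
    # ctypes array and raise the PicklingError quoted above.
    ctx = mp.get_context("fork")
    syn0 = mp.RawArray(ctypes.c_double, n)
    syn0[3] = 7.0
    with ctx.Pool(processes=2, initializer=_init_process, initargs=(syn0,)) as pool:
        return pool.map(_read, [3])
```

On Windows there is no fork, so the usual alternative is to wrap the RawArray in something picklable (e.g. share it through `multiprocessing.shared_memory`).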
Can you provide the training and testing data files?