deborausujono / word2vecpy
Python implementation of CBOW and skip-gram word vector models, and hierarchical softmax and negative sampling learning algorithms
Hi, I ran into some difficulties with multiprocessing. When num_processes is set to more than 1, the assignment of tasks among the workers goes wrong. Do you have any idea what causes this?
Compared to the original C code released by Google, MAX_SEN_LEN and EPOCH are missing, which causes two problems.

[1] In the sub-training process, each worker reads lines from the file between the offsets start and end. When the input file contains only a single line (for example, the text8 corpus), the following code snippet causes a bug:
    while fi.tell() < end:
        line = fi.readline().strip()
        # Skip blank lines
        if not line:
            continue

Here `line = fi.readline().strip()` reads the entire line starting from `start`, so with a one-line corpus every worker loads all of the tokens.
[2] Without EPOCH, the model makes only a single pass over the corpus; the original C code iterates several times, so its embeddings are trained on more samples.
Hi, I tried running your code, and got this error message:
Reading word 11690000
Unknown vocab size: 68558
Total words in training file: 11690125
Total bytes in training file: 85775698
Vocab size: 45151
Initializing unigram table
Traceback (most recent call last):
File "word2vec.py", line 388, in
args.min_count, args.num_processes, bool(args.binary))
File "word2vec.py", line 354, in train
table = UnigramTable(vocab)
File "word2vec.py", line 175, in __init__
table = np.zeros(table_size, dtype=np.uint32)
TypeError: 'float' object cannot be interpreted as an index`
I can't figure out how to fix it. Could you look into it? Thanks.
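For what it's worth, this looks like a Python 3 issue: a table-size expression such as `1e8` evaluates to a float, and `np.zeros` requires an integer size. An explicit cast fixes it (shown here with a smaller, illustrative table size, not necessarily the value the repo uses):

```python
import numpy as np

table_size = 1e6  # a float literal; the repo's unigram table is likely 1e8
# np.zeros(table_size) raises "TypeError: 'float' object cannot be
# interpreted as an index" on Python 3; cast to int explicitly.
table = np.zeros(int(table_size), dtype=np.uint32)
```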
I found that the negative samples drawn by the negative sampling method here may coincide with the positive example. The index values generated by this line:

    indices = np.random.randint(low=0, high=len(self.table), size=count)

may select the same token as the positive target.
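One way to avoid the collision is to redraw any sample that equals the positive target. A hedged sketch with a hypothetical helper (`sample_negatives` is my name, not the repo's API; assumes the table is not made up entirely of the target token):

```python
import numpy as np

def sample_negatives(table, target, count):
    # Draw `count` indices from the unigram table, then redraw any
    # entries that collide with the positive example's token index.
    samples = table[np.random.randint(0, len(table), size=count)]
    while (mask := samples == target).any():
        redraw = np.random.randint(0, len(table), size=int(mask.sum()))
        samples[mask] = table[redraw]
    return samples
```

Rejection sampling keeps the unigram distribution over the remaining tokens unchanged, which is what the C implementation effectively relies on.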
I found that this Python implementation of word2vec is slow, even with multiprocessing. Do you have any ideas for speeding it up?
Hi, I downloaded the code and ran it, but it fails with:

    pickle.PicklingError: Can't pickle <class 'numpy.ctypeslib.c_double_Array_100'>: it's not found as numpy.ctypeslib.c_double_Array_100

when it reaches the line

    pool = Pool(processes=num_processes, initializer=__init_process, initargs=(vocab, syn0, syn1, table, cbow, neg, dim, alpha, win, num_processes, global_word_count, fi))

Can you give me any information? Thank you.
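This PicklingError usually means the Pool workers are started with the "spawn" method (the default on Windows, and on macOS since Python 3.8), which pickles the initargs; raw ctypes arrays like syn0 cannot be pickled. Under "fork" the arrays are inherited by the child processes instead. A minimal sketch of the pattern (POSIX only; the helper names are mine, not the repo's):

```python
import ctypes
import multiprocessing as mp

def _init_process(arr):
    # Workers receive the shared weights via the initializer; under
    # fork they are inherited from the parent, never pickled.
    global _syn0
    _syn0 = arr

def _read(i):
    return _syn0[i]

def demo(n=100):
    # Force the fork start method; spawn would try to pickle the raw
    # ctypes array and raise the PicklingError quoted above.
    ctx = mp.get_context("fork")
    syn0 = mp.RawArray(ctypes.c_double, n)
    syn0[3] = 7.0
    with ctx.Pool(processes=2, initializer=_init_process, initargs=(syn0,)) as pool:
        return pool.map(_read, [3])
```

On Windows there is no fork, so the usual alternative is to wrap the RawArray in something picklable (e.g. share it through `multiprocessing.shared_memory`).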
Can you provide the training and testing data files?