klb3713 / sentence2vec

Tools for mapping a sentence with arbitrary length to vector space

Python 99.78% C 0.22%

sentence2vec's Introduction

We provide an implementation of the Paragraph Vector model from Quoc Le and Tomas Mikolov's paper "Distributed Representations of Sentences and Documents".

This project is based on gensim.

Installation requirements:

'scipy >= 0.7.0'
'six >= 1.2.0'

2014-9-23 update: added test files for the demo.

sentence2vec's Issues

[Bug] AttributeError: 'Word2Vec' object has no attribute 'syn1'

File "/home/zhanghao10/.jumbo/lib/python2.7/threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/zhanghao10/sentence2vec-master/word2vec.py", line 855, in worker_train
for sent_no, sentence in job)
File "/home/zhanghao10/sentence2vec-master/word2vec.py", line 855, in <genexpr>
for sent_no, sentence in job)
File "/home/zhanghao10/sentence2vec-master/word2vec.py", line 985, in train_sent_vec_sg
l2a = deepcopy(model.syn1[word2.point]) # 2d matrix, codelen x layer1_size
AttributeError: 'Word2Vec' object has no attribute 'syn1'

The Word2Vec model does have a 'syn1' attribute; please check.
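
One possible cause, offered as an assumption rather than a confirmed diagnosis: in gensim's Word2Vec, on which this project is based, `syn1` holds the hierarchical-softmax weights and is only allocated when hierarchical softmax is enabled, so a model trained without it may lack the attribute. The sketch below shows a defensive check that could run before sentence-vector training. `StubModel` and `hs_weights` are hypothetical names used only for illustration; they are not part of sentence2vec or gensim.

```python
class StubModel(object):
    """Hypothetical stand-in for a loaded Word2Vec model (not the real class)."""

    def __init__(self, syn1=None):
        # Only set the attribute when weights are supplied, mimicking a model
        # that was trained without hierarchical softmax.
        if syn1 is not None:
            self.syn1 = syn1


def hs_weights(model):
    """Return model.syn1, or raise a clear error if the attribute is missing."""
    weights = getattr(model, "syn1", None)
    if weights is None:
        raise ValueError(
            "model has no 'syn1' weights; it may need to be retrained "
            "with hierarchical softmax enabled"
        )
    return weights
```

Failing early with a readable message may be easier to act on than an AttributeError raised inside a worker thread.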

Vector values differ for a duplicated sentence

I copied a sentence in sent.txt, so that sentence appears twice.
But after running demo.py, the two identical sentences get different vector values.

My sent.txt file is below:

Harbin Institute of Technology (HIT) was founded in 1920.                                                                                                                                      
Harbin Institute of Technology (HIT) was founded in 1920.
After nearly 100 years, HIT has developed into a large nationally renowned multi-disciplinary university with science, engineering and research as its core.
HIT is consistently on the forefront in making innovations in research. For years, HIT has continued to undertake large-scale and highly sophisticated national projects.
HIT students study humanities and social sciences along with basic engineering and science courses for a strong comprehensive base. 
HIT is famous for its original style of schooling: 'Being strict in qualifications for graduates; making every endeavor in educating students.'  
HIT has remained an international university since its foundation. Courses at HIT used to be conducted exclusively in Russian and Japanese.         
Today, all the faculty, students and staff of HIT, are dedicating, with full confidence

The vector for the first 'Harbin Institute of Technology (HIT) was founded in 1920.' differs from the vector for the second, identical sentence.

How to check similarity of sentences?
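
One common way to compare two sentence vectors is cosine similarity. The sketch below is a minimal pure-Python version. The function name `cosine_similarity` and the three-dimensional sample vectors are made up for illustration; real vectors would come from the file written by demo.py. The same measure may also be useful for the duplicated-sentence report above, where comparing by similarity could be more informative than comparing element by element.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional sentence vectors, for illustration only.
sent_a = [1.0, 2.0, 3.0]
sent_b = [2.0, 4.0, 6.0]   # same direction as sent_a
sent_c = [3.0, 0.0, -1.0]  # orthogonal to sent_a

print(cosine_similarity(sent_a, sent_b))  # close to 1: same direction
print(cosine_similarity(sent_a, sent_c))  # close to 0: orthogonal
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 indicate little relation.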

Model used in sentence2vec

Hi, what type of model is returned by the word2vec transformation? With TF-IDF, for example, one writes tfidf = models.TfidfModel(corpus). Is it possible to change this to a semantic model? If so, please suggest how. Thank you.

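Question about sentence2vec

Hello,
Is there a Java version of sentence2vec? I can't follow the Python version. A Chinese explanation of the Python version would also work. Could you please send me a copy? My email is [email protected]. Many thanks.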

How to fix the following issue? I already have numpy installed

/pyrex/word2vec_inner.c:435:10: fatal error:
'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
^

How to use?

UnpicklingError

I saved the word2vec model for a large dataset, but when I test the Sent2Vec function in demo.py I get the error below.
UnpicklingError: unpickling stack underflow.
Please suggest a fix.

How to resolve "UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte"

I have some large text files that contain such characters. I would like to ignore those characters and proceed with the Sent2Vec conversion. I get the error below; please help me fix it.
File "kfold1.py", line 34, in <module>
model = Sent2Vec(LineSentence(sent_file), model_file=input_file + '.model')
File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 800, in __init__
self.reset_sent_vec(sentences)
File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 809, in reset_sent_vec
for sent in sentences:
File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 1113, in __iter__
yield utils.to_unicode(line).split()
File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte
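
A minimal sketch of one way to skip undecodable bytes: read each line as bytes and decode it with an explicit error handler. The helper name `iter_tokenized_lines` and the temporary demo file are assumptions for illustration and are not part of sentence2vec. The traceback above shows `any2unicode` forwarding an `errors` argument to `unicode(...)`, so passing `errors='ignore'` at that call may be an alternative route.

```python
import io
import os
import tempfile


def iter_tokenized_lines(path, encoding="utf-8", errors="ignore"):
    """Yield each line of `path` as a list of whitespace-separated tokens.

    The file is read as bytes and decoded explicitly, so bytes that are not
    valid in `encoding` are dropped (errors="ignore") or replaced with
    U+FFFD (errors="replace") instead of raising UnicodeDecodeError.
    """
    with io.open(path, "rb") as handle:
        for raw_line in handle:
            yield raw_line.decode(encoding, errors).split()


# Demonstration with a temporary file containing the invalid byte 0xff.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"good line\n")
    out.write(b"bad \xff byte here\n")

demo_tokens = list(iter_tokenized_lines(demo_path))
os.remove(demo_path)
print(demo_tokens)
```

An iterable like this could be passed wherever a sentence iterator is expected. Note that `errors="ignore"` silently discards data, so `errors="replace"` may be preferable when the damage should stay visible.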

I need a license file

How to load the sentence vectors?

There doesn't seem to be a method for loading the sentence vectors, analogous to Word2Vec.load_word2vec_format for word2vec models. If I use that same method to load the sent2vec model, model = Word2Vec.load_word2vec_format('test.txt.model'),

I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
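
The byte 0x80 at position 0 is the opcode that begins a Python pickle of protocol 2 or higher, which suggests (but does not prove) that 'test.txt.model' is a pickled model rather than a text vector file. The sketch below checks for that first byte and parses a plain-text vector file. It assumes the vectors were exported in the usual word2vec text layout (a header line with count and dimension, then one labelled vector per line). The helper names, file names, and `sent_0`-style labels are hypothetical.

```python
import os
import pickle
import shutil
import tempfile


def looks_like_pickle(path):
    """True if the file starts with 0x80, the PROTO opcode of pickle protocol 2+."""
    with open(path, "rb") as handle:
        return handle.read(1) == b"\x80"


def load_text_vectors(path, encoding="utf-8"):
    """Parse the plain-text word2vec layout into {label: [floats]}.

    Assumed layout: a header line "<count> <dim>", then one line per vector
    of the form "<label> <v1> ... <vdim>".
    """
    vectors = {}
    with open(path, "rb") as handle:
        count, dim = (int(x) for x in handle.readline().decode(encoding).split())
        for raw_line in handle:
            parts = raw_line.decode(encoding).split()
            if not parts:
                continue  # skip blank lines
            label, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError("expected %d values for %r" % (dim, label))
            vectors[label] = values
    if len(vectors) != count:
        raise ValueError("header promised %d vectors, found %d" % (count, len(vectors)))
    return vectors


# Demonstration on temporary files with made-up contents.
tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "demo.model")
text_path = os.path.join(tmp_dir, "demo.vec")

with open(pickle_path, "wb") as out:
    pickle.dump({"any": "object"}, out, protocol=2)
with open(text_path, "wb") as out:
    out.write(b"2 3\nsent_0 0.1 0.2 0.3\nsent_1 0.4 0.5 0.6\n")

is_pickle = looks_like_pickle(pickle_path)
is_text_pickle = looks_like_pickle(text_path)
loaded = load_text_vectors(text_path)
shutil.rmtree(tmp_dir)
```

If the check reports a pickle, the file was most likely written by a `save` method and should be read back with the matching `load` method rather than a text-format loader.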

Training to get paragraph vectors D

Hi there,

I've read the paper by Quoc Le and Tomas Mikolov.
As I understand it, the first stage of the algorithm trains the word vectors W, the softmax weights U and b, and the paragraph vectors D on already-seen paragraphs.

But in your demo.py script, stage one uses Word2Vec to train only the word vectors W and the softmax weights U and b. It does not produce the paragraph vectors D.
In the second stage, demo.py uses the word vectors obtained in stage one to infer paragraph vectors D for new paragraphs.

So it seems your scripts do not do what the paper describes.

Correct me if I am wrong.
Thanks.

He Chen

How to test?

Hi, I can't find the testing part. How do I run the tests?

Concatenate vs Average/Sum

From what I understand, sentence2vec generates vectors with a fixed dimension=size based on average or sum. Have you considered concatenation, as in the original Le & Mikolov 2014 paper?
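
To make the difference concrete: the paper describes combining the paragraph vector with the context word vectors by averaging or by concatenation. The sketch below illustrates only how the output dimension changes between the two. The function `combine` and the sample vectors are hypothetical and do not reflect sentence2vec's actual code.

```python
def combine(paragraph_vec, context_vecs, mode="average"):
    """Combine a paragraph vector with context word vectors.

    mode="sum" or "average" keeps the dimension of a single vector;
    mode="concat" yields (1 + number of context words) * dimension.
    """
    vectors = [paragraph_vec] + list(context_vecs)
    dim = len(paragraph_vec)
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    if mode == "concat":
        # Lay the vectors end to end, paragraph vector first.
        return [x for v in vectors for x in v]
    summed = [sum(column) for column in zip(*vectors)]
    if mode == "sum":
        return summed
    if mode == "average":
        return [x / float(len(vectors)) for x in summed]
    raise ValueError("unknown mode: %r" % (mode,))


# Made-up 2-dimensional vectors: one paragraph vector, two context words.
paragraph = [1.0, 2.0]
context = [[3.0, 4.0], [5.0, 6.0]]

averaged = combine(paragraph, context, "average")     # length 2
concatenated = combine(paragraph, context, "concat")  # length 6
```

Concatenation preserves word order within the window at the cost of a larger input to the prediction layer, while sum and average discard order but keep the size fixed.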

Benchmarking

I tried to benchmark this tool on the Maas dataset in the same setting as described in the paper (Distributed Representations of Sentences and Documents). Instead of testing it directly, I created sentence representations for the entire dataset (train + test) and ran cross-validation. I am not getting more than 51%. Has anybody tested and benchmarked this implementation?