sentence2vec's Introduction

sentence2vec

Tools for mapping a sentence of arbitrary length to a vector space.

We provide an implementation of the Paragraph Vector described in Quoc Le and Tomas Mikolov's paper "Distributed Representations of Sentences and Documents".

This project is based on gensim.

Installation requires:

  • 'scipy >= 0.7.0'
  • 'six >= 1.2.0'

2014-9-23 update: added test files for the demo.

sentence2vec's Issues

[Bug] AttributeError: 'Word2Vec' object has no attribute 'syn1'

  File "/home/zhanghao10/.jumbo/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/zhanghao10/sentence2vec-master/word2vec.py", line 855, in worker_train
    for sent_no, sentence in job)
  File "/home/zhanghao10/sentence2vec-master/word2vec.py", line 855, in <genexpr>
    for sent_no, sentence in job)
  File "/home/zhanghao10/sentence2vec-master/word2vec.py", line 985, in train_sent_vec_sg
    l2a = deepcopy(model.syn1[word2.point]) # 2d matrix, codelen x layer1_size
AttributeError: 'Word2Vec' object has no attribute 'syn1'

The Word2Vec model should have a 'syn1' attribute; please check!

Vector values differ for a duplicated sentence

I copied a sentence in sent.txt, so that sentence appears twice.
However, after running demo.py, the two identical sentences get different vector values.

My sent.txt file is below:

Harbin Institute of Technology (HIT) was founded in 1920.                                                                                                                                      
Harbin Institute of Technology (HIT) was founded in 1920.
After nearly 100 years, HIT has developed into a large nationally renowned multi-disciplinary university with science, engineering and research as its core.
HIT is consistently on the forefront in making innovations in research. For years, HIT has continued to undertake large-scale and highly sophisticated national projects.
HIT students study humanities and social sciences along with basic engineering and science courses for a strong comprehensive base. 
HIT is famous for its original style of schooling: 'Being strict in qualifications for graduates; making every endeavor in educating students.'  
HIT has remained an international university since its foundation. Courses at HIT used to be conducted exclusively in Russian and Japanese.         
Today, all the faculty, students and staff of HIT, are dedicating, with full confidence   

The vector for the first occurrence of 'Harbin Institute of Technology (HIT) was founded in 1920.' differs from the vector for the second occurrence of the same sentence.
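The report above does not say how far apart the two vectors are. A minimal sketch of one way to quantify that, rather than testing for exact equality, is cosine similarity. The function name and the sample values are illustrative and are not part of sentence2vec.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for the two rows written by demo.py.
first = [0.12, -0.40, 0.33]
second = [0.10, -0.42, 0.35]
similarity = cosine_similarity(first, second)
```

A value close to 1.0 would indicate that the two vectors point in nearly the same direction even though their components are not identical.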

Model used in sentence2vec

Hi, what type of model is returned by the word2vec transformation? For example, tfidf = models.TfidfModel(corpus). Is it possible to change this to semantic models? If so, please suggest how. Thank you.

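A question about sentence2vec

Hello,
Is there a Java version of sentence2vec? I can't follow the Python version. A Chinese explanation of the Python version would also work. Please send me a copy at [email protected]. Thank you!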

UnpicklingError

I saved the word2vec model for a large dataset, but when I test the Sent2Vec function in demo.py it fails with the following error:
UnpicklingError: unpickling stack underflow.
Please advise.

How to resolve "UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte"

I have some large text files that contain such characters. I would like to ignore those characters and proceed with the Sent2Vec conversion. I get the error below; please help me fix it.
  File "kfold1.py", line 34, in <module>
    model = Sent2Vec(LineSentence(sent_file), model_file=input_file + '.model')
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 800, in __init__
    self.reset_sent_vec(sentences)
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 809, in reset_sent_vec
    for sent in sentences:
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 1113, in __iter__
    yield utils.to_unicode(line).split()
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte
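One possible workaround, sketched here under the assumption that silently dropping the invalid bytes is acceptable, is to clean the input file before passing it to LineSentence. The helper name and the file names are hypothetical; the sketch uses only the standard library and does not modify word2vec.py or utils.py.

```python
import io


def strip_undecodable_bytes(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, dropping bytes that are invalid in `encoding`."""
    with io.open(src_path, "rb") as src, \
            io.open(dst_path, "w", encoding=encoding, newline="") as dst:
        for raw_line in src:
            # errors="ignore" discards invalid sequences such as a stray 0xff
            # instead of raising UnicodeDecodeError.
            dst.write(raw_line.decode(encoding, errors="ignore"))


# Hypothetical usage: clean first, then train on the cleaned copy.
# strip_undecodable_bytes("sent.txt", "sent.clean.txt")
```

The cleaned copy can then be used wherever the original path was passed. Note that ignoring bytes loses information, so it may be worth checking whether the file is actually in a different encoding first.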

How to load the sentence vectors?

There doesn't seem to be a method for loading the sentence vectors, analogous to Word2Vec.load_word2vec_format for the word2vec model. If I use that same method to load the sent2vec model:

model = Word2Vec.load_word2vec_format('test.txt.model')

I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
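A byte 0x80 at position 0 is how pickle streams using protocol 2 or later begin, so the error suggests (but does not prove) that the file is a pickle rather than a text file in word2vec format. A minimal sketch for checking this is below; the helper name is hypothetical, and which loader applies depends on how the file was saved, which the report does not state.

```python
def looks_like_pickle(path):
    """Return True if the file starts with the pickle protocol marker 0x80."""
    with open(path, "rb") as handle:
        # Pickle protocol 2 and later begin with the PROTO opcode, byte 0x80.
        return handle.read(1) == b"\x80"


# Hypothetical usage with the file name from the report:
# if looks_like_pickle("test.txt.model"):
#     print("binary pickle; a text-format loader will not be able to read it")
```

This check is only a heuristic: other binary formats can also start with 0x80, and pickles written with protocol 0 or 1 do not.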

Training to get paragraph vectors D

Hi there,

I've read the paper by Quoc Le and Tomas Mikolov.
As I understand it, the first stage of the algorithm trains the word vectors W, the softmax weights U and b, and the paragraph vectors D on paragraphs that have already been seen.

In your demo.py script, however, stage one uses Word2Vec to train only the word vectors W and the softmax weights U and b. It does not produce the paragraph vectors D.
In the second stage, demo.py uses the word vectors obtained in stage one to infer paragraph vectors D for new paragraphs.

So your scripts do not seem to do what the paper describes.

Correct me if I am wrong.
Thanks.

He Chen

How to test?

Hi, I can't find the testing part. How do I run the tests?

Concatenate vs Average/Sum

From what I understand, sentence2vec generates vectors with a fixed dimension=size based on average or sum. Have you considered concatenation, as in the original Le & Mikolov 2014 paper?
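To make the difference concrete, here is a minimal sketch of the three ways of combining vectors, using plain Python lists. The function names and the toy vectors are illustrative and are not taken from sentence2vec; the point is only that averaging and summing keep the dimension at size, while concatenating k vectors yields k * size dimensions.

```python
def average_vectors(vectors):
    """Element-wise mean; the result has the same dimension as each input."""
    count = float(len(vectors))
    return [sum(column) / count for column in zip(*vectors)]


def sum_vectors(vectors):
    """Element-wise sum; the result has the same dimension as each input."""
    return [sum(column) for column in zip(*vectors)]


def concatenate_vectors(vectors):
    """Join vectors end to end; the result has len(vectors) * size entries."""
    return [value for vector in vectors for value in vector]


# Two toy vectors with size = 2.
toy = [[1.0, 2.0], [3.0, 4.0]]
averaged = average_vectors(toy)          # dimension 2
summed = sum_vectors(toy)                # dimension 2
concatenated = concatenate_vectors(toy)  # dimension 4
```

Concatenation only gives a fixed output size when the number of combined vectors is fixed, for example a fixed context window.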

Benchmarking

I tried to benchmark this tool on the Mass Dataset in the same setting as described in the paper (Distributed Representations of Sentences and Documents). Instead of testing it directly, I created sentence representations for the entire dataset (train + test) and ran a cross-validation. I am not getting more than 51%. Has anybody tested and benchmarked this implementation?
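For reference, the cross-validation split described above can be sketched as follows. This is a generic k-fold index split written with the standard library only; the function name is hypothetical, and the number of folds is an assumption because the report does not state it.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    folds = []
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at `fold`, is held out for testing.
        test_idx = indices[fold::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        folds.append((train_idx, test_idx))
    return folds


# Assumed setting: 10 sentence vectors, 5 folds.
folds = k_fold_indices(10, 5)
```

Each sentence vector lands in exactly one test fold, so every sample is scored once. Note that pooling train and test data and then cross-validating is a different protocol from a fixed train/test split, so the resulting accuracy is not directly comparable to figures reported under a fixed split.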
