topicvec's People

Contributors

askerlee, fabfhmm

topicvec's Issues

Question in gramcount.pl

Hi Li, I'm trying to run PSDVec using Japanese Wikipedia text.

How did you deal with the stopwords in top1gram-wiki.txt? I'm asking because top2gram-wiki.txt is not uploaded to the vectors&corpus Dropbox. Also, why are there no stopwords in top1gram-reuters.txt? I'd like to know at which point the stopwords should be deleted.

Sorry if this sounds like a stupid question.
Thank you!
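
In case it helps clarify the question, here is the kind of filtering meant above — just a sketch: it assumes a simple word<TAB>count line format for the top1gram file and NLTK's English stopword list, which may not match the actual file layout produced by gramcount.pl.

    # Hypothetical sketch: filter stopwords out of a unigram count file.
    # Assumes each line is "word<TAB>count"; the real top1gram format may differ.
    # Requires: nltk.download('stopwords')
    from nltk.corpus import stopwords

    stop = set(stopwords.words('english'))

    with open('top1gram-wiki.txt', encoding='utf-8') as fin, \
         open('top1gram-wiki-nostop.txt', 'w', encoding='utf-8') as fout:
        for line in fin:
            word = line.rstrip('\n').split('\t')[0]
            if word.lower() not in stop:
                fout.write(line)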

Use another dataset

Hi,

I've seen that reuters and rcv1 seem to be hardcoded into the code.

What is the simplest way to use a corpus without any labels (one txt file per document)?

Thank you for your help!
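
To be concrete, this is roughly the kind of input meant above, sketched with a hypothetical corpus/ directory holding one plain-text file per document. This is not topicvec's own loader, only an illustration of the shape of the data.

    # Hypothetical sketch: load an unlabeled corpus, one .txt file per document.
    # The directory name "corpus/" is a placeholder.
    import glob

    docs = []
    for path in sorted(glob.glob('corpus/*.txt')):
        with open(path, encoding='utf-8') as f:
            # one document = list of lowercase whitespace-separated tokens
            docs.append(f.read().lower().split())

    print(len(docs), 'documents loaded')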

Training on dataset without categories

Hi!
I have a corpus of documents without categories, and I would like to generate topics without adding this kind of information. However, it seems that this code is only oriented towards corpora with categories.
Is there any straightforward way to do this?

*.bat files need to be updated

Hello!

I've just tried your latest code; however, the *.bat scripts should be updated with the new flags adopted from version 0.75 onward in "topicExp.py".

For example, "-p" become "-i".

Bye

Short text

Hi askerlee, thanks for your great work!

Would this work on short texts such as tweets? If so, which parameters should I change?

Thanks.

Topic Vector for large text files

Hi,
I have a large text file of about 2 GB from which I want to form topic vectors and visualize them through topic clouds. I wanted to ask which file I should use to generate the topic vectors.
Another question: can I generate topic vectors using a CSV file containing words and their counts across all documents?

Thanks
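
For concreteness, the CSV mentioned above would be read roughly like this — a sketch assuming a hypothetical word_counts.csv with a "word,count" header row; the actual layout may differ.

    # Hypothetical sketch: read corpus-wide word counts from a CSV file.
    # Assumes a header row "word,count"; the real file layout may differ.
    import csv

    counts = {}
    with open('word_counts.csv', newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            counts[row['word']] = int(row['count'])

    print('vocabulary size:', len(counts))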

Sentences within a dataset

I am working on a rather noisy dataset, so it's very difficult to detect sentence boundaries exactly (for example, there are a lot of abbreviations with periods, and those periods are detected as the ends of sentences).

Do you think that having many short sentences (often with just three words) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if the punctuation is filtered out, the sentence information is completely lost and each document becomes a bag of words. Would it work in that case too?
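
For reference, one abbreviation-aware alternative to naive period splitting (not part of topicvec, just a preprocessing option) is NLTK's punkt sentence tokenizer; a minimal sketch:

    # Sketch: split noisy text into sentences with a splitter that handles
    # common abbreviations better than splitting on every period.
    # Requires: nltk.download('punkt')
    import nltk

    text = "Dr. Smith arrived at 5 p.m. He left soon after."
    for sent in nltk.sent_tokenize(text):
        print(sent)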

LogLikelihood

Hi askerlee!

I would like to ask whether the log-likelihood computed by the "calcLoglikelihood" function is normalised by the number of words in the corpus.
If not, could it easily be done?

Thank you!
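
For clarity, the normalisation in question is simply dividing the total log-likelihood by the number of words in the corpus, which also gives a perplexity; a minimal sketch with placeholder numbers (not output from calcLoglikelihood):

    # Sketch: normalise a corpus log-likelihood by the number of words.
    # The values below are placeholders, not produced by topicvec.
    import math

    loglike = -7.4e6      # hypothetical total log-likelihood of the corpus
    n_words = 1_000_000   # hypothetical number of words in the corpus

    per_word_loglike = loglike / n_words
    perplexity = math.exp(-per_word_loglike)
    print(per_word_loglike, perplexity)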

Problem with input embeddings generated by another algorithm

Hi, I noticed in Section 6.1 of your paper that, because of the inefficiency of optimizing the likelihood function over both Z and V, you chose to divide the process into two stages: first obtain the word embeddings, and then take them as input in the second stage.

I wonder if it's OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.

I've tried it and got some weird results. My corpus includes 10000 docs containing 3223788 validated words. The input embeddings were generated using w2v.

In iteration 1 the log-likelihood is 1.3e11, in iteration 2 it is 0.7e11, and as the process continues the log-likelihood keeps decreasing. Hence the best result always occurs after the first iteration instead of the last one. The output is quite reasonable based on "Most relevant words", but the strange behaviour of the likelihood really bothers me.
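
For reference, the substitution described above could be sketched as follows: train word2vec with gensim (assuming gensim >= 4.0) on a placeholder corpus and export the vectors to a plain-text embedding file. This is not topicvec's own loading code, only an illustration of producing word2vec vectors in a plain-text format.

    # Sketch: train word2vec with gensim and export vectors to a plain
    # word2vec-format text file. Corpus and file name are placeholders.
    from gensim.models import Word2Vec

    sentences = [['topic', 'modelling', 'example'], ['word', 'embeddings']]  # placeholder corpus
    model = Word2Vec(sentences, vector_size=100, min_count=1, workers=4)
    model.wv.save_word2vec_format('w2v-embeddings.txt', binary=False)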
