topicvec's People

Contributors

askerlee, fabfhmm

topicvec's Issues

Question in gramcount.pl

Hi Li, I'm trying to run PSDVec using Japanese Wikipedia text.

How did you deal with the stopwords in top1gram-wiki.txt? I'm asking because top2gram-wiki.txt is not uploaded to the vectors&corpus Dropbox. Also, why are there no stopwords in top1gram-reuters.txt? I'd like to know at which point the stopwords should be deleted.

Sorry if this sounds like a stupid question.
Thank you!
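
In case it helps clarify the question, here is the kind of filtering meant above — just a sketch: it assumes a simple word<TAB>count line format for the top1gram file and NLTK's English stopword list, which may not match the actual file layout produced by gramcount.pl.

    # Hypothetical sketch: filter stopwords out of a unigram count file.
    # Assumes each line is "word<TAB>count"; the real top1gram format may differ.
    # Requires: nltk.download('stopwords')
    from nltk.corpus import stopwords

    stop = set(stopwords.words('english'))

    with open('top1gram-wiki.txt', encoding='utf-8') as fin, \
         open('top1gram-wiki-nostop.txt', 'w', encoding='utf-8') as fout:
        for line in fin:
            word = line.rstrip('\n').split('\t')[0]
            if word.lower() not in stop:
                fout.write(line)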

Use another dataset

Hi,

I've seen that reuters and rcv1 seem to be hardcoded into the code.

What is the simplest way to use a corpus without any labels (one txt file per document)?

Thank you for your help!
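
To be concrete, this is roughly the kind of input meant above, sketched with a hypothetical corpus/ directory holding one plain-text file per document. This is not topicvec's own loader, only an illustration of the shape of the data.

    # Hypothetical sketch: load an unlabeled corpus, one .txt file per document.
    # The directory name "corpus/" is a placeholder.
    import glob

    docs = []
    for path in sorted(glob.glob('corpus/*.txt')):
        with open(path, encoding='utf-8') as f:
            # one document = list of lowercase whitespace-separated tokens
            docs.append(f.read().lower().split())

    print(len(docs), 'documents loaded')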

Training on dataset without categories

Hi!
I have a corpus of documents without categories, and I would like to generate topics without adding this kind of information. However, it seems that this code is only oriented towards corpora with categories.
Is there any straightforward way to do this?

*.bat files need to be updated

Hello!

I've just tried your latest code; however, the *.bat scripts should be updated with the new flags adopted from version 0.75 onward in "topicExp.py".

For example, "-p" become "-i".

Bye

Short text

Hi askerlee, thanks for your great work!

Would this work on short texts such as tweets? If so, which parameters should I change?

Thanks.

Topic Vector for large text files

Hi,
I have a large text file of about 2 GB from which I want to form topic vectors and visualize them through topic clouds. I wanted to ask which file I should use to generate the topic vectors.
Another question: can I generate topic vectors using a CSV file containing words and their counts across all documents?

Thanks
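
For concreteness, the CSV mentioned above would be read roughly like this — a sketch assuming a hypothetical word_counts.csv with a "word,count" header row; the actual layout may differ.

    # Hypothetical sketch: read corpus-wide word counts from a CSV file.
    # Assumes a header row "word,count"; the real file layout may differ.
    import csv

    counts = {}
    with open('word_counts.csv', newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            counts[row['word']] = int(row['count'])

    print('vocabulary size:', len(counts))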

Sentences within a dataset

I am working on a rather noisy dataset, so it's very difficult to detect sentence boundaries exactly (for example, there are a lot of abbreviations with periods, and those periods are detected as the ends of sentences).

Do you think that having many short sentences (often with just three words) could compromise the algorithm's performance? Is it important to preserve the information about which words belong to the same sentence?

PS: Furthermore, if the punctuation is filtered out, the sentence information is completely lost and each document becomes a bag of words. Would it work in that case too?
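
For reference, one abbreviation-aware alternative to naive period splitting (not part of topicvec, just a preprocessing option) is NLTK's punkt sentence tokenizer; a minimal sketch:

    # Sketch: split noisy text into sentences with a splitter that handles
    # common abbreviations better than splitting on every period.
    # Requires: nltk.download('punkt')
    import nltk

    text = "Dr. Smith arrived at 5 p.m. He left soon after."
    for sent in nltk.sent_tokenize(text):
        print(sent)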

LogLikelihood

Hi askerlee!

I would like to ask whether the log-likelihood computed by the "calcLoglikelihood" function is normalised by the number of words in the corpus.
If not, could it easily be done?

Thank you!
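
For clarity, the normalisation in question is simply dividing the total log-likelihood by the number of words in the corpus, which also gives a perplexity; a minimal sketch with placeholder numbers (not output from calcLoglikelihood):

    # Sketch: normalise a corpus log-likelihood by the number of words.
    # The values below are placeholders, not produced by topicvec.
    import math

    loglike = -7.4e6      # hypothetical total log-likelihood of the corpus
    n_words = 1_000_000   # hypothetical number of words in the corpus

    per_word_loglike = loglike / n_words
    perplexity = math.exp(-per_word_loglike)
    print(per_word_loglike, perplexity)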

Problem with input embeddings generated by another algorithm

Hi, I noticed in Section 6.1 of your paper that, because of the inefficiency of optimizing the likelihood function over both Z and V, you chose to divide the process into two stages: first obtain the word embeddings, and then take them as input in the second stage.

I wonder if it's OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.

I've tried it and got some weird results. My corpus includes 10000 docs containing 3223788 validated words. The input embeddings were generated using w2v.

In iteration 1 the log-likelihood is 1.3e11, in iteration 2 it is 0.7e11, and as the process continues the log-likelihood keeps decreasing. Hence the best result always occurs after the first iteration instead of the last one. The output is quite reasonable based on "Most relevant words", but the strange behaviour of the likelihood really bothers me.
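
For reference, the substitution described above could be sketched as follows: train word2vec with gensim (assuming gensim >= 4.0) on a placeholder corpus and export the vectors to a plain-text embedding file. This is not topicvec's own loading code, only an illustration of producing word2vec vectors in a plain-text format.

    # Sketch: train word2vec with gensim and export vectors to a plain
    # word2vec-format text file. Corpus and file name are placeholders.
    from gensim.models import Word2Vec

    sentences = [['topic', 'modelling', 'example'], ['word', 'embeddings']]  # placeholder corpus
    model = Word2Vec(sentences, vector_size=100, min_count=1, workers=4)
    model.wv.save_word2vec_format('w2v-embeddings.txt', binary=False)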
