piskvorky / gensim Goto Github PK

Topic Modelling for Humans

Home Page: https://radimrehurek.com/gensim

License: GNU Lesser General Public License v2.1

Python 91.98% Shell 0.26% C 0.03% C++ 0.05% Jupyter Notebook 0.03% GDB 0.01% Cython 7.64%

gensim topic-modeling information-retrieval machine-learning natural-language-processing nlp data-science python data-mining word2vec

gensim's People

Contributors

Stargazers

Watchers

Forkers

merrellb dieterbe emsrc andremi dedan davidnemeskey mikedewar ambimorph josh-levy-dm shreyaskarnik quesada strongh lucadealfaro sandinmyjoints felixmiddendorf paulrudin jesterhazy mofei hjanime jtbates aweinstein andreypopp pombredanne larsmans corradomonti luckysort buma auduno jeschkies billsmith2 sharecqy rch salmoni nak123 lakshmi-kannan luispedro wmelton clintpgeorge irwenqiang andrewjoc360 apseyed xiaoheigu haoshuji chyikwei panyang aguicode litaoshao arcodergh ummae ultimate010 tongming anikacyp kulv2012 xjzhou miha-stopar liu-fang linkerlin splade ababino wuyeguo samantp devinshields jorispelemans guyrt robwahl ghamrouni tmarthal invinciblejha redsuncmx tuzzeg zclfly langmore maxhodak johnhess spmohanty wfwei funes karinabunyik nederhrj imclab vviro radixseven dingchuangnwpu zehsilva hihihippp agibsonccc ioriiod0 akkineniramesh nipengadmaster azizur77 sumitborar lumost holopoj iiapache alienfeel carbonz0 shelocks maydaygmail fyuval rand

gensim's Issues

ValueError: dictionary update sequence element #0 has length 1; 2 is required

I'm trying to try out the new chunking feature, but I get this:

Traceback (most recent call last):
  File "./build-models.py", line 344, in <module>
    s = t.timeit(number=1)
  File "/usr/lib/python2.7/timeit.py", line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "./build-models.py", line 211, in rebuild_topsimilarities
    str = format_topsimilarities(target_chunk, corpus, mofi2blei, blei2mofi, sim, tfidf, sls, settings)
  File "./build-models.py", line 187, in format_topsimilarities
    sims = sim[targets_docs]
  File "/usr/lib/python2.7/site-packages/gensim/interfaces.py", line 194, in __getitem__
    result = self.getSimilarities(query)
  File "/usr/lib/python2.7/site-packages/gensim/similarities/docsim.py", line 77, in getSimilarities
    return [matutils.cossim(doc, other) for other in self.corpus]
  File "/usr/lib/python2.7/site-packages/gensim/matutils.py", line 263, in cossim
    vec1, vec2 = dict(vec1), dict(vec2)
ValueError: dictionary update sequence element #0 has length 1; 2 is required

I don't find the error message entirely clear, but a print of both vec1 and vec2 reveals this:


vec1:  []
vec2:  25617638b6404eacbc405817a4ea0fc30f672bda384f4cbbb23e288fa4e8c394xbns8vxegfshxejbelznxbram88qu0ch

both vec1 and vec2 should be sparse vectors, right?

gensim 0.8.1: sqlitedict as dependency

sqlitedict seems to be no optional dependency::

>>> from gensim.matutils import MmWriter

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    from gensim.matutils import MmWriter
[...]
   from sqlitedict import SqliteDict # needs sqlitedict: run "sudo easy_install sqlitedict"
ImportError: No module named sqlitedict

Possible solutions:

Move sqlitedict from "extra dependencies" to the core dependencies
Change gensim.__init__ and avoid the import of similarities (but this wouldn't help since gensim.similarities imports the SimServer/SessionServer)
Avoid the default import of SimServer/SessionServer in gensim.similarities

distributed LDA: 'NoneType' object is not subscriptable

Distributed LDA sometimes throws

2011-11-23 21:19:58,881 : INFO : resetting worker #11
Exception in thread Thread-19
Traceback (most recent call last):
File "/bb/blaw/PYTHON/lib/python2.7/threading.py", line 530, in
__bootstrap_inner
self.run()
File "/bb/blaw/PYTHON/lib/python2.7/threading.py", line 483, in run
self.__target(_self.__args, *_self.__kwargs)
File
"/bb/blaw/tools/Python-2.7.1/lib/python2.7/site-packages/gensim-0.8.2-py2.7 .egg/gensim/models/lda_worker.py",
line 58, in requestjob
self.processjob(job)
File
"/bb/blaw/PYTHON/lib/python2.7/site-packages/gensim-0.8.2-py2.7.egg/gensim/ utils.py",
line 49, in _synchronizer
result = func(self, _args, *_kwargs)
File
"/bb/blaw/tools/Python-2.7.1/lib/python2.7/site-packages/gensim-0.8.2-py2.7 .egg/gensim/models/lda_worker.py",
line 65, in processjob
self.model.do_estep(job)
File
"/bb/blaw/PYTHON/lib/python2.7/site-packages/gensim-0.8.2-py2.7.egg/gensim/ models/ldamodel.py",
line 371, in do_estep
gamma, sstats = self.inference(chunk, collect_sstats=True)
File
"/bb/blaw/PYTHON/lib/python2.7/site-packages/gensim-0.8.2-py2.7.egg/gensim/ models/ldamodel.py",
line 324, in inference
expElogbetad = self.expElogbeta[:, ids]
TypeError: 'NoneType' object is not subscriptable

Reported by several people, e.g. by Dale here: http://groups.google.com/group/gensim/browse_thread/thread/a57c68342e58c7b2

Check the flow around the clear() method in distributed LDA, which is likely culprit.

merge larkc changes

Merge changes by sotte/dedan/quesada into gensim.

python2 setup.py test : AttributeError: 'module' object has no attribute 'test_models'

When I run python2 setup.py test on the latest master git checkout (but I've actually been having this problem for months) I get this error.
AttributeError: 'module' object has no attribute 'test_models'

Below is the complete output.

running test
running egg_info
creating src/gensim.egg-info
writing requirements to src/gensim.egg-info/requires.txt
writing src/gensim.egg-info/PKG-INFO
writing top-level names to src/gensim.egg-info/top_level.txt
writing dependency_links to src/gensim.egg-info/dependency_links.txt
writing manifest file 'src/gensim.egg-info/SOURCES.txt'
reading manifest file 'src/gensim.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'src/gensim.egg-info/SOURCES.txt'
running build_ext
Traceback (most recent call last):
  File "setup.py", line 76, in <module>
    include_package_data = True,
  File "/usr/lib/python2.7/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 137, in run
    self.with_project_on_sys_path(self.run_tests)
  File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 117, in with_project_on_sys_path
    func()
  File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 146, in run_tests
    testLoader = loader_class()
  File "/usr/lib/python2.7/unittest/main.py", line 94, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python2.7/unittest/main.py", line 149, in parseArgs
    self.createTests()
  File "/usr/lib/python2.7/unittest/main.py", line 158, in createTests
    self.module)
  File "/usr/lib/python2.7/unittest/loader.py", line 128, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python2.7/unittest/loader.py", line 103, in loadTestsFromName
    return self.loadTestsFromModule(obj)
  File "/usr/lib/python2.7/site-packages/setuptools/command/test.py", line 34, in loadTestsFromModule
    tests.append(self.loadTestsFromName(submodule))
  File "/usr/lib/python2.7/unittest/loader.py", line 100, in loadTestsFromName
    parent, obj = obj, getattr(obj, part)
AttributeError: 'module' object has no attribute 'test_models'

For more info, see also http://groups.google.com/group/gensim/browse_thread/thread/80866ad542a56c38

interpretation of LDA parameters

There was an interesting post by Matt Hoffman in the topic-models mailing list yesterday:
https://lists.cs.princeton.edu/pipermail/topic-models/2011-October/001600.html

Apparently there is an easier way to estimate beta and theta: simply normalize rows in lambda and gamma, respectively.

Currently, gensim does the more roundabout way of exp^{E[log lambda]} (resp. gamma), but according to Matt, the normalization is "more common".

Switching to the simple normalization is trivial code-wise (and will result in faster code), but inform users of the change first -- it gives different results!

Documentation is out of date

Just another issue, for the fun of it. :)

As I have been skimming though the documentation I have found a few places where it is outdated, or lacking in certain respects. Here's a few off the top of my head (actually, it's not just the top, it's all I have found for now):

There is no serialize method in the corpus classes, only saveCorpus (both in API and tutorials)
The line lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, numTopics=2) does not work (http://nlp.fi.muni.cz/projekty/gensim/tut2.html). It should be id2word=dictionary.id2word.
In http://nlp.fi.muni.cz/projekty/gensim/tut1.html, it is not at all clear what happens after you call dictionary.filterTokens, let alone compactify. I reckon the corpus would obviously "not work" after these commands, and has to be reparsed, but I am not sure. A few sentences on this would be welcome, along with a code example of how to reparse the corpus with such a dictionary (doc2bow with allowUpdate=False?)

Bug in the inference module of gensim.models.LdaModel

The topics are learned correctly by the model, but inference for each of the documents has a bug in it. I tried a simple 30 document corpus with very specific topics (no stop words, no overlap of words between topics). The topics are inferred perfectly for the entire corpus, but the inference for each document gives the wrong topic distribution.

$lda = gensim.models.ldamodel.LdaModel(corpus=myC, id2word=myCorpus.dictionary, num_topics=K,passes=1000,distributed=True)

$lda[myC[0]]

yeilds a topic distribution that is wrong.

Shivani

Improve doc

The example with Deerwester corpus confuses people, they copy&paste and store corpora as Python lists.

Make the corpus in documentation a simple generic iterable object, not list, to showcase how to save memory.

Corpus compression

Hey,

I'm just wondering if it would be possible to save corpora to the disk compressed. According to my experience, gzipping a corpus saved in the Matrix Market format decreases the file size to about 34% of the uncompressed size. I am not sure about the other formats, but at least for text-based ones, the percentages should be similar.

So the question is basically: is there anything that would prevent such a feature to be implemented (fseek comes into mind)? If the corpus files are always handled as streams, I think it should be OK to add this option.

Option to pass Dictionary.docFreq to TfidfModel.init()

TfidfModel.initialize() calculates document frequencies for tokens. However, these are also calculated when creating a Dictionary object. TfidfModel.init can therefore take a keyword arg that allows providing Dictionary.docFreq and prevents recalculating document frequencies.

Principal Component Analysis Based on Nonparametric Maximum Entropy

Check out the PCA based on preserving entropy (as opposed to reconstruction L2 norm): http://mloss.org/software/view/361/

If it works and is useful, add to gensim.

TypeError: coercing to Unicode: need string or buffer, bool found

Hi,
when I try to save a SparseMatrixSimilarity object using the latest develop code in git, I get a TypeError.
Below is the complete output.
Note also the line: INFO:gensim.similarity.docsim:storing SimmatrixSparseMatrixSimilarity object to False and False.npy doesn't look very correct. (my SimmatrixSparseMatrixSimilarity class just extends SparseMatrixSimilarity and adds custom load/save.. HOLD ON, let me figure this out myself first. probably my own mistake :p )


INFO:gensim.similarity.docsim:creating sparse index
INFO:gensim.matutils:creating sparse matrix from corpus
INFO:gensim.matutils:PROGRESS: at document #0
INFO:gensim.matutils:PROGRESS: at document #10000
INFO:gensim.matutils:PROGRESS: at document #20000
INFO:gensim.matutils:PROGRESS: at document #30000
INFO:gensim.matutils:PROGRESS: at document #40000
INFO:gensim.matutils:PROGRESS: at document #50000
INFO:gensim.matutils:PROGRESS: at document #60000
INFO:gensim.matutils:PROGRESS: at document #70000
INFO:gensim.matutils:PROGRESS: at document #80000
INFO:gensim.matutils:PROGRESS: at document #90000
INFO:gensim.matutils:PROGRESS: at document #100000
INFO:gensim.matutils:PROGRESS: at document #110000
INFO:gensim.matutils:PROGRESS: at document #120000
INFO:gensim.matutils:PROGRESS: at document #130000
INFO:gensim.matutils:PROGRESS: at document #140000
INFO:gensim.matutils:PROGRESS: at document #150000
INFO:gensim.matutils:PROGRESS: at document #160000
INFO:gensim.matutils:PROGRESS: at document #170000
INFO:gensim.similarity.docsim:created <178643x3248178 sparse matrix of type '<type 'numpy.float32'>'
    with 14085600 stored elements in Compressed Sparse Row format>
DEBUG:root:build_matrix doc/min: 59928.944920
INFO:gensim.similarity.docsim:storing SimmatrixSparseMatrixSimilarity object to False and False.npy
Traceback (most recent call last):
  File "./build-models.py", line 321, in <module>
    rebuild_data_files(r, args.tag, args.force) #takes 10minutes or so (most part dictionary)
  File "./build-models.py", line 155, in rebuild_data_files
    sim.save(force)
  File "/usr/local/lib/python2.6/dist-packages/gensim-0.8.0rc1-py2.6.egg/gensim/similarities/docsim.py", line 513, in save
    utils.pickle(self, fname) # store array-less object
  File "/usr/local/lib/python2.6/dist-packages/gensim-0.8.0rc1-py2.6.egg/gensim/utils.py", line 427, in pickle
    with open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
TypeError: coercing to Unicode: need string or buffer, bool found

inconsistent return values of Similarity[]

Hi,
a list of similarityscores (say, with num_best = 2) looks something like: [(15, 0.845), (2, 0.452)]. I call such a list a 'topsim'.

Now, what I notice, if you query Similarity[input], there are 3 cases:

input is a single query. -> result is a topsim
input is a list of queries (more then 1 query) -> result is a list of topsims
input is a list of queries, but has only 1 element -> result is a topsim

The 3rd case is inconsistent IMHO, and I would expect to get a list back with only 1 element.

tests fail

python setup.py test

gives me the following errors in the github head.

ERROR: testPersistence (gensim.test.test_models.TestLdaModel)

Traceback (most recent call last):
File "/Users/joelr/Work/python/gensim/src/gensim/test/test_models.py", line 165, in testPersistence
self.assertTrue(numpy.allclose(model.logProbW, model2.logProbW))
AttributeError: 'LdaModel' object has no attribute 'logProbW'

FAIL: testTransform (gensim.test.test_models.TestLdaModel)

Traceback (most recent call last):
File "/Users/joelr/Work/python/gensim/src/gensim/test/test_models.py", line 157, in testTransform
self.assertTrue(passed, "Error in randomized LDA test")
AssertionError: Error in randomized LDA test

Ran 17 tests in 0.362s

FAILED (failures=1, errors=1)

respect stochastic SVD params in LsiModel

Currently, the extra_samples and power_iters parameters to LsiModel are only used in the multi-pass variant (=when onepass is False).

Pass them also to the one-pass variant, for consistency and greater flexibility.

Alternatively, mention explicitly in the docs that they are not used, to avoid confusion.

This issue was brought up by Brian Murphy at http://groups.google.com/group/gensim/browse_thread/thread/6de956d4300ab7da

typo in changelog

Hi, I found a typo in the changelog. my name is Plaetinck, not Plaetnick

thanks,
Dieter

Sharding the similarity index

Scale up similarities memory-wise, so that users can create an index over corpora of arbitrary size.

Currently there's the Similarity class that can do this, but that's very slow. Reimplement Similarity so that it splits the index into smaller pieces (shards) that can be processed efficiently in RAM.

The splitting should be completely transparent to the user, i.e. index[vec] and index[chunk] work as expected, as does for sims in index: ....

Also, write shards so that in the future, processing each shard can be done separately (in parallel), either with multicore or on a distributed cluster.

AttributeError: dtype not found

I'm using the SparseMatrixSimilarity class.
in its getSimilarities() function it first checks whether utils.isCorpus(query), which returns true, because i specify multiple queries (i.e. I want to use the chunking feature). i.e. query is an iterable, here's 3 of its elements:

[(81, 0.48507125007266594), (162, 0.48507125007266594), (6, 0.24253562503633297), (96, 0.24253562503633297), (166, 0.24253562503633297), (207, 0.24253562503633297), (235, 0.24253562503633297), (492, 0.24253562503633297), (503, 0.24253562503633297), (515, 0.24253562503633297), (547, 0.24253562503633297)]
[(279, 0.4), (375, 0.4), (408, 0.4), (447, 0.4), (544, 0.4), (373, 0.2), (404, 0.2), (446, 0.2), (549, 0.2), (566, 0.2)]
[(88, 1), (181, 1), (325, 1), (400, 1), (520, 1)]

however, the code then tries to fetch a value from self.corpus (self.corpus.dtype) which cannot be resolved, which is weird because self.corpus is a sparse matrix created in the constructor.
So this sparse matrix does not have a dtype attribute?
Here is the complete error trace:

Traceback (most recent call last):
  File "./build-models.py", line 352, in <module>
    s = t.timeit(number=1)
  File "/usr/lib/python2.7/timeit.py", line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "./build-models.py", line 225, in rebuild_topsimilarities
    str = format_topsimilarities(target_chunk, corpus, mofi2blei, blei2mofi, sim, tfidf, sls, settings)
  File "./build-models.py", line 202, in format_topsimilarities
    sims = sim[targets_docs]
  File "/home/dieter/code/python/datafiles.py", line 112, in __getitem__
    return self.data[key]
  File "/usr/lib/python2.7/site-packages/gensim/interfaces.py", line 194, in __getitem__
    result = self.getSimilarities(query)
  File "/usr/lib/python2.7/site-packages/gensim/similarities/docsim.py", line 222, in getSimilarities
    query = matutils.corpus2csc(query, self.corpus.shape[1], dtype=self.corpus.dtype)
  File "/usr/lib/python2.7/site-packages/scipy/sparse/base.py", line 384, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: dtype not found

tell() in writeCorpus() takes a lot of time

Hi,

I have cProfile'd the code (it ran quite slow), and I have discovered that fout.tell() in writeCorpus() takes a LOT of time (21 seconds for a 23M MmCorpus file). I was wondering if it is strictly necessary to have it in the code?

Now my understanding is that you need it for reverse indexing. However, wouldn't it be feasible to just count the number of bytes written? In my case, it would make the whole code run 30% faster.

print_topics() should work with logging disabled

When I ran my first LDA model, I was sad when I tried lda.print_topics() and nothing was printed to the screen. It wasn't until I looked in the source code that I saw that logging must be turned on in order to see the topics.

This probably isn't the best behavior for this function as a user will probably want to see the topics with or without logging. Consider replacing logger.info with print in gensim/models/ldamodel.py line 558.

LSA dimensionality

Try the automated dimensionality setting for Latent Semantic Analysis, via MDL:

http://www.springerlink.com/content/500651582r310t05/

This means: reproduce Fig.1 from that article. See what it does on the Lee corpus. Does the curve make sense? Is MDL robust enough, across several corpora?

models/tfidfmodel.py variablename idfs should be dfs

src/gensim/models/tfidfmodel.py

in the initialize method:
idfs = {}
(...)
for termId, termCount in bow:
idfs[termId] = idfs.get(termId, 0) + 1

shouldn't this variable called "dfs" or something? because they are just document frequencies which are not yet inversed.

speeding up similarity queries

Currently if one is only interested in the top-n most similar documents, the MatrixSimilarity and SparseMatrixSimilarity classes compute all similarities, then sort them, then clip to top-n.

Actually the sorting is slower than the matrix multiplication:
http://groups.google.com/group/gensim/browse_thread/thread/f6b839ceaa16c834
so for a start, speed up the post-processing (sorting) part.

improve LsiModel (SVD) documentation

Extend the theoretical part of LsiModel documentation:

how do multipass/singlepass SVD relate and how can users turn them on/off
describe speed/accuracy trade-offs of multipass vs. singlepass
where do users find topic "details" (U, S matrices for LSI, lambda and gamma for LDA...)
but don't confuse newbie users! find a proper place for this "advanced" info

http://groups.google.com/group/gensim/browse_thread/thread/4b605b72f8062770#

Logo & Website Layout

Improve the landing page and create a logo for gensim.

This is a chance for a skilled web designer to step up and contribute, as I have neither the skill nor the time to do it :-)

dynamic Dictionary

When adding new document via Dictionary.addDocuments(), the id2token cache is not updated. token2id may get new ids and new words, but id2token stays the same.

Update id2token to be in sync with token2id whenever it's requested.

Use `tox` for testing

I've heard good things about automated testing tool tox: http://codespeak.net/~hpk/tox/

It runs unit tests automatically, supports integration testing etc. Try it out and include it in gensim.

The goal is to automatically check that dependencies (old version of numpy & scipy, old python 2.5) still work.

scipy 0.6.0 not supported

Gensim doesn't work with scipy 0.6.0, but the install page claims it does.

Fix:

by removing scipy 0.6.0 from the list of supported versions;
or see if there is a workaround that could make things work even for 0.6.0 (probably not, there were big changes in scipy 0.7)

Reported in:
http://groups.google.com/group/gensim/browse_thread/thread/c74b07b57dc6256c

Add other distance measures to Similarity

Currently, Similarity works purely over cosine similarity (~the angle between query and indexed document).

Make this more general, using e.g. Hellinger distance for models that represent the documents as probability distributions.

At the same time, try to still keep things computationally efficient (using BLAS & mmap etc.).

test_similarities fails

after a fresh install some of the tests fail on my machine with the following output:

(gen)dedan@client194-95:~/.virtualenvs/gen/lib/python2.7/site-packages/gensim: nosetests
.......................................................EEEEE.................................................................EEEEE..........
======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testChunking
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 90, in testChunking
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testFull
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 55, in testFull
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testIter
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 114, in testIter
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testNumBest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 79, in testNumBest
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testPersistency
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 135, in testPersistency
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testChunking
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 90, in testChunking
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testFull
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 55, in testFull
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testIter
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 114, in testIter
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testNumBest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 79, in testNumBest
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

======================================================================
ERROR: gensim.gensim.test.test_similarities.TestSimilarityABC.testPersistency
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dedan/.virtualenvs/gen/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/dedan/projects/mpi/gensim/gensim/test/test_similarities.py", line 135, in testPersistency
    if self.cls == similarities.Similarity:
AttributeError: 'TestSimilarityABC' object has no attribute 'cls'

----------------------------------------------------------------------
Ran 140 tests in 27.542s

FAILED (errors=10)

If I find some time later I will look into it but I cannot promise when I can do it..

I use python 2.7 on osx 10.7

cPickle error when saving similarity matrix

Note: applies only to similarities.Similarity and not similarities.MatrixSimilarity similarities.SparseMatrixSimilarity.

After being created, attempting to save similarities.Similarity matrix results in error when using cPickle.

Traceback (most recent call last):
File "/Users/alan/Projects/LSIA/docs/code/Similarities01.py", line 45, in
Q = querier(corpus)
File "/Users/alan/Projects/LSIA/docs/code/Similarities01.py", line 15, in init
self.index.save(self.workdir+'/ops/sims.index')
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/gensim-0.7.7-py2.6.egg/gensim/utils.py", line 120, in save
cPickle.dump(self, f, protocol=-1) # -1 to use the highest available protocol, for efficiency
PicklingError: Can't pickle <class 'gensim.interfaces.TransformedCorpus'>: attribute lookup gensim.interfaces.TransformedCorpus failed

Gensim 0.7.7 on OSX.

cPickle.PicklingError: Can't pickle <type 'ellipsis'>: attribute lookup builtin.ellipsis failed

When I have an instance of SparseMatrixSimilarity, and I try to save() it, I get this:

INFO:gensim.utils:saving Similarity object to 846c57f-dirty--CHNK100-EB0-FW1-FW_NA0.5-FW_NB5-M0-NFdata__bom-nerfile-withmediaobjectfragmentids-NUMBEST10-PATRN_INCL-S_L0-S_P0-SQ_K5-SQ_R1-TFIDF0_sim_dense_disk
Traceback (most recent call last):
  File "./build-models.py", line 250, in <module>
    rebuild_data_files(r, args.tag)
  File "./build-models.py", line 118, in rebuild_data_files
    sim.save(sim_filename(tag))
  File "/usr/lib/python2.7/site-packages/gensim/utils.py", line 118, in save
    pickle(self, fname)
  File "/usr/lib/python2.7/site-packages/gensim/utils.py", line 414, in pickle
    cPickle.dump(obj, fout, protocol=protocol)
cPickle.PicklingError: Can't pickle <type 'ellipsis'>: attribute lookup __builtin__.ellipsis failed

Interestingly, I have done this hundreds of times before without issues.
I wonder if it has anything to do with an update to python or numpy, but I don't think so (I did upgrade scipy and python2 a week ago, but reverting didn't fix it)
tested with:

python-scipy 0.8.0-4
python-scipy 0.9.0-1
python2-numpy 1.5.1-2
python2 2.7.1-7
python2 2.7.1-9

cPickle Problem

As described in this thread on the mailinglist I have a problem when pickling models that contain very large matrices. As a intermediate solution I would like to contribute something which stores the matrices using the numpy methods.

@piskvorky: We wanted to discuss this when you have been in Berlin but then forgot about it. Do you have some ideas of how to implement this nicely? I would like to hear your suggestions so that the solution fits your ideas of gensim. If you have no time or no ideas, I will just implement something and send you a pull request

comply to PEP8

Change camelCase vars and function names and parameters to python_style. This will break backward compatibility, so it best be done all at once (rather than incrementally).

http://groups.google.com/group/gensim/browse_thread/thread/73f4301723cea765

pre-commit hook for git

Find out how commit hooks work on git; write one. Enforce it on GitHub if possible, or at least distribute with the repo (and instruct contributers how to set it up and use it locally).

Should contain:

check that non-empty lines in *.py files don't have trailing whitespace/strip such trailing space. Keep logical indentation on empty lines (in-between code blocks, between methods etc), if possible, though this is no big deal.
?

SVD accuracy suffers with many power iterations steps?

According to the experiments http://groups.google.com/group/gensim/browse_thread/thread/4b605b72f8062770 , accuracy of the randomized SVD algo actually decreases with more power iterations steps (with steps above a certain small threshold, such as 6 for that data).

That is not good, because in theory the algorithm can arrive at arbitrary precision, exactly by increasing the number of power iteration steps (=trading off run-time for greater accuracy).

The culprit seems to be matrix multiplications during the power iteration steps, which eventually cause numeric overflows for larger amount of steps.

Find a fix so that SVD can be run to arbitrary precision.

make normalization a transformation

Right now the similarity computations assume cosine similarity and transform all vectors to unit length implicitly, inside the *Similarity classes.

Instead, make normalization an explicit transformation and leave it up to the user to choose which (or whether) normalization to use. Example vector = norm_l2[lsi_model[norm_l2[tfidf_model[bow_vector]]]].

The most common cases (L2 norm, L1 norm, identity) should be pre-defined, probably in gensim.matutils.

Also connected is issue #64 (allow custom similarity metrics).

LDA convergence parameters

There are variational parameters VAR_MAXITER and VAR_THRESH that guide convergence of LDA inference (both during training and document transformations).

Currently they are set to a magic value which works well for online training over large corpora, but perhaps not so well for batch training over different corpora: http://groups.google.com/group/gensim/browse_thread/thread/d394a1fd8ee86450#

Add an option (on by default?) that sets these parameters automatically and transparently, based on the training dataset.

Dictionary.save_as_text/load_from_text is dangerous

I thought that Dictionary.save_as_text and load_from_text is equivalent to Dictionary.save/load, but it isn't. The text format does not keep the "num_docs" and after loading a Dict from a text, several methods do not work anymore::

>>> dct = Dictionary()
>>> _ = dct.doc2bow(['Hi', 'there'], allow_update=True)
>>> _ = dct.doc2bow(['Hi', 'all'], allow_update=True)
>>> _ = dct.doc2bow(['Hi', 'there', 'world'], allow_update=True)
>>> print dct
Dictionary(4 unique tokens)
>>> print dct.num_docs
3
>>> dct.save('./test.dct')
>>> d2 = Dictionary.load('./test.dct')
>>> print d2.num_docs
3
>>> # Filter extremes
>>> d2.filter_extremes(no_below=1)
>>> print d2
Dictionary(2 unique tokens)
>>> print d2.token2id
{'world': 0, 'all': 1}

Everything works as expected. Now the text version (assuming the dct from above)::

>>> dct.save_as_text('./test.txt')
>>> d2 = Dictionary.load_from_text('./test.txt')
>>> print d2.num_docs
0
>>> print d2 # Everything is fine here, expecting 4 tokens
Dictionary(4 unique tokens)
>>> # Filter extremes
>>> d2.filter_extremes(no_below=1)
>>> print d2 # Whoops, all tokens are gone, expected 2, see above
Dictionary(0 unique tokens)
>>> print d2.token2id
{}

This behaviour is not documented anywhere and I'd think that save_as_text and load_from_text should lead to a fully functional Dictionary object

"chunks" parameter is confusing

Similarity.chunks means (if i understand correctly) how many vectors should go into 1 chunk. The parametername chunks suggests that it's about "how many chunks do you want to use"
I suggest to rename this parameter to chunksize

corpus: why not update self.length after iterating all

Hi,
why not do in every corpus, something like:

def __iter__(self):
     (...)
     length = 0   
     for lineNo, line in enumerate(...):
          (....)
          length += 1
          yield doc
     self.length = length

this reduces the chance of needing to run the highly expensive iteration for the sole sake of returning the length, in the len function.

rename README.txt

... to README.rst, so that github renders the welcome page properly.

Approximate similarity search

Google published a whitepaper http://www.google.com/trends/correlate on approximate kNN queries. See how it could apply to gensim & semantic similarity searches.

Background:

currently, gensim does a linear scan (compare query against every indexed vector, by means of a matrix multiplication)
I tried fancier indexing techniques, but they degenerate into linearly checking each datum anyway, for the high-dimensional vectors
even worse, they access objects out-of-order (thrashing caches & HW buffers), so much much slower than plain linear scan in reality (matrix multiplication is linear in index size, but the constant factors are super low)
Google claims this new technique works well for high-dim data as well => maybe finally something faster than a linear scan?

Hierarchical logging

Make all loggers in gensim have a common root, other than logging.root (as it is now).

That way, handlers can be added/removed from this logger without affecting the global logging.root, which will be less confusing and cleaner.

See http://groups.google.com/group/gensim/browse_thread/thread/ff363fb5f07b6d01# .

Similarity sharding broken

Paul Rudin found a bug in Similarity sharding: http://groups.google.com/group/gensim/browse_thread/thread/dc3fc4c0cd9e8814

Fix the bug, extend the unit tests.

(and also print a warning when the requested num_best is greater than shard_size -- very rare, but potentially confusing combination when it does come up)

Create utility functions that perform basic sanity checks on user's input data:

check that the all feature ids in a corpus are compatible with the user-provided dictionary (should avoid issues like http://projects.scipy.org/scipy/ticket/1582 )
check that the data range is valid -- look for NaNs, Infs, explicit zeros => these are all illegal in gensim input.
check that the data is not degenerate => all vectors identical/empty/?/model looks weird
~~check corpus type and warn the user if it's plain list (promote the memory-friendly generator interface, shown in tutorials)~~ NOT NEEDED