tropcomplique / lda2vec-pytorch


116.0 6.0 35.0 1.78 MB

Topic modeling with word vectors

License: MIT License

Jupyter Notebook 99.07% Python 0.93%

pytorch topic-modeling word-vectors


lda2vec-pytorch's Issues

torch.load model_state

I get the error No such file 'model_state.python' and I can't find this file. Can anyone help me, please?

Run failure

My setup: NVIDIA CUDA 9.0.176 driver, Torch 0.3.0.
When I run train.py, it always stops at the following screen.

Please help me. Thank you very much.

spacy nlp() - unknown arguments

Running on Google Colab with Python 3 and a GPU.
The issue is in preprocess.py, line 26, in the call to nlp():

nlp = spacy.load('en')
text = nlp(text, tag=True, parse=False, entity=False)

nlp() does not accept these keyword arguments (tag, parse, entity), so I changed the call to:
text = nlp(text)

How can I evaluate an lda2vec model?

Can you explain how to evaluate an lda2vec model?

Where do the .npy files loaded in train.py come from?

train.py tries to load a number of .npy files, such as word_vector.npy. Where and how are they supposed to be generated?

How to train this model with my own data

I want to train this model on my own data, which is stored in a MySQL database. How do I do that, and which module needs to be modified? Please help.

ValueError: Object arrays cannot be loaded when allow_pickle=False

I am running explore_trained_model.ipynb on Google Colab with Python 3.6 and NumPy 1.16.4.

ValueError                                Traceback (most recent call last)
/content/lda2vec-pytorch/20newsgroups/explore_trained_model.ipynb in <module>()
      6 
      7 # "integer -> word" decoder
----> 8 decoder = np.load('decoder.npy')[()]
      9 
     10 # for restoring document ids, "id used while training -> initial id"

1 frames
/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
    694         # The array contained Python objects. We need to unpickle the data.
    695         if not allow_pickle:
--> 696             raise ValueError("Object arrays cannot be loaded when "
    697                              "allow_pickle=False")
    698         if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False

This seems relevant:
https://stackoverflow.com/questions/55824625/how-to-fix-object-arrays-cannot-be-loaded-when-allow-pickle-false-in-the-sketc
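
Since NumPy 1.16.3, np.load defaults to allow_pickle=False, and a dict saved with np.save is stored as a pickled object array, so the call in the notebook needs the flag passed explicitly. The following is a minimal sketch of that change, not the repository's code: the decoder contents and the temporary path are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the notebook's "integer -> word" decoder.
decoder = {0: "topic", 1: "model"}

# np.save wraps the dict in a 0-dim object array and pickles it.
path = os.path.join(tempfile.mkdtemp(), "decoder.npy")
np.save(path, decoder)

# allow_pickle=True is required to read object arrays back;
# [()] unwraps the 0-dim array into the original dict.
loaded = np.load(path, allow_pickle=True)[()]
print(loaded[0])
```

Unpickling can execute arbitrary code, so this flag should only be enabled for .npy files you generated yourself or otherwise trust.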

Different vocabulary size with decoder

Please give an example to explain how to run this code.

How to infer topics distribution for new documents

I ran all your code successfully. In explore_trained_model.ipynb, I see that you get predictions for the documents used in training. However, I want to infer the topic distribution for new documents. Could you tell me how to do that?
Thank you very much.

RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /pytorch/aten/src/TH/THRandom.cpp:320

This exception occurs when training on 20newsgroups with embed_dimension 300 on Google Colab with PyTorch 1.1.0 and CUDA 10.
Curiously, it does not occur when training on my Mac with only the CPU.

From alias_multinomial.py:

    def draw(self, N):
        """Draw N samples from the distribution."""

        K = self.J.size(0)
        r = torch.LongTensor(np.random.randint(0, K, size=N))
        q = self.q.index_select(0, r)
        j = self.J.index_select(0, r)
        b = torch.bernoulli(q)
        #print("K r q j b r.shape, q.shape j.shape b.shape, j.shape", K, r, q, j, b, r.shape, q.shape, j.shape, b.shape, j.shape)
        oq = r.mul(b.long())
        oj = j.mul((1 - b).long())
        return oq + oj

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "train.py", line 32, in main
    save_every=20, grad_clip=5.0
  File "../utils/training.py", line 127, in train
    neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 82, in forward
    neg_loss = self.neg(pivot_words, target_words, doc_vectors, w)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 152, in forward
    noise = self.multinomial.draw(batch_size*window_size*self.num_sampled)
  File "../utils/alias_multinomial.py", line 60, in draw
    b = torch.bernoulli(q)
RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /pytorch/aten/src/TH/THRandom.cpp:320

q.min() and q.max():
0.21869047
1.0066459

The maximum is slightly above 1, which is outside the range that torch.bernoulli accepts.

This seems relevant:

pytorch/pytorch#9917
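
One possible workaround, sketched below under the assumption that the overshoot above 1 is numerical error in the alias table rather than a logic bug, is to clamp q into [0, 1] before sampling. The helper name draw_bernoulli_safe is hypothetical and is not part of alias_multinomial.py.

```python
import torch


def draw_bernoulli_safe(q):
    """Clamp probabilities into [0, 1], then draw Bernoulli samples.

    Hypothetical helper: it masks small overshoots such as 1.0066459
    instead of fixing how the alias table is built.
    """
    q = q.clamp(min=0.0, max=1.0)
    return torch.bernoulli(q)


# The min and max of q reported in this issue.
q = torch.tensor([0.21869047, 1.0066459])
b = draw_bernoulli_safe(q)
print(b)
```

Clamping only hides the symptom. If q is far outside [0, 1], the table construction itself is worth checking.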

Encounter the problem "IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number".

Running on Google Colab, I hit the error in the title. The solution from this issue worked for me:

pytorch/pytorch#15585
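
The error message itself names the fix: a 0-dim tensor can no longer be indexed and has to be converted with .item(). A minimal sketch, assuming the failing code indexes a scalar loss in the old style (the issue does not say which line fails, and the loss value here is made up):

```python
import torch

# A scalar (0-dim) tensor, like the value returned by a loss function.
loss = torch.tensor(3.5)

# Old style, which raises IndexError on 0-dim tensors:
#     value = loss.data[0]
# Replacement suggested by the error message:
value = loss.item()
print(value)
```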

RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'

Try moving the noise tensor to the GPU in the forward method of
class negative_sampling_loss(nn.Module) in utils/lda2vec_loss.py.
For example:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
noise = noise.to(device)
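
A variant of the same idea, shown here as a self-contained sketch, takes the device from the module's own parameters instead of a global, so the index tensor always follows the embedding it is used with. The embedding sizes and the noise tensor are stand-ins, not the repository's values.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in for the word embedding used by the loss (sizes are made up).
embedding = torch.nn.Embedding(10, 4).to(device)

# Indices sampled on the CPU, as the alias sampler produces them.
noise = torch.randint(0, 10, (6,))

# Move the indices to wherever the embedding weights live before the lookup.
noise = noise.to(embedding.weight.device)
vectors = embedding(noise)
print(vectors.shape)
```

On a CPU-only machine the .to() call is a no-op, so the same code runs in both environments.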

RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1

I am getting this error when running lda2vec-pytorch on Google Colab on a text file with 5,171 news articles from global websites.

from lda2vec_loss.py:
print(doc_weights.shape, doc_probs.shape, unsqueezed_doc_probs.shape, unsqueezed_topic_vectors.shape)

torch.Size([7168, 20]) torch.Size([7168, 20]) torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])

%run get_stories_windows.ipynb
100%|██████████| 5171/5171 [22:09<00:00,  1.36it/s]
number of removed short documents: 5
total number of tokens: 5841072
number of tokens to be removed: 2660533
number of additionally removed short documents: 3
total number of tokens: 3180503

minimum word count number: 18
this number can be less than MIN_COUNTS because of document removal
5163it [00:08, 616.10it/s]
CPU times: user 1h 23min 15s, sys: 2.38 s, total: 1h 23min 18s
Wall time: 42min 10s
  2%|▏         | 108/5163 [00:00<00:04, 1074.25it/s]CPU times: user 12.1 s, sys: 7.26 s, total: 19.3 s
Wall time: 9.84 s
topic 0 : Point Vladimir Union Moscow progress migration air Mosul Insight safe
topic 1 : Hong Morocco Kong les sur Jamaica Carnoustie Escocia Park Turkish
topic 2 : Got Talent Baron Sacha Club UAE AMERICA SHOWTIME Trapeze talent
topic 3 : Kentucky Lil Peoria Song Minnesota CBS hai Cyclone boy KZN
topic 4 : Game NBA Oman heat Tokyo Utah Bangla Soccer Basketball Mail
topic 5 : Ganga Chennai Bengaluru Hyderabad ordeal Theatre thREAD CLOSE Sabha Mumbai
topic 6 : Army syrian AGT Philippines Syrian Baltimore February October army January
topic 7 : ADVERTISEMENT Deutsch Programs migrant teach Podcasts turkish Puigdemont XXL Reuters
topic 8 : Air Force NYC Iraq Turkmenistan Kazakhstan Nordic Energy CBS Affairs
topic 9 : Philadelphia Mix Pennsylvania Route Baltimore Episode Summer Deep Massachusetts Park
topic 10 : Herald ZEALAND NEW IOL NZME Property crash Northern Pakuranga serial
topic 11 : RFI GMT Paris attachment Mon gmt Fri flash bulletin analysis
topic 12 : Nairobi Counties Rift switch Ethiopia NTV hours Ruaraka Eritrea Gold
topic 13 : Texas Russian Oman Newscasts Image Star Fry summer funny Vermont
topic 14 : Premium Content Zuckerberg Conference arab Gaza jewish JPost Careers Diaspora
topic 15 : Messenger WhatsApp external LinkedIn Queen window Prince Kilmeade School Elizabeth
topic 16 : Director Documentary London Chicago Game Tower Urdu Airlines english Egypt
topic 17 : Peninsula EBITDA Khmer Hungary SPAIN amazing thai Partners khmer lottery
topic 18 : Amsterdam Mark Fox Amazon Swansea Song Qatar TIME Indonesia durationStr
topic 19 : Pak Headlines pak modi india England pakistani Reaction indian Pakistani
100%|██████████| 5163/5163 [00:05<00:00, 977.59it/s]

!python train.py

number of documents: 5163
number of windows: 3180503
number of topics: 25
vocabulary size: 19224
word embedding dim: 50
../utils/lda2vec_loss.py:47: UserWarning: nn.init.normal is now deprecated in favor of nn.init.normal_.
  init.normal(self.doc_weights.weight, std=DOC_WEIGHTS_INIT)
number of batches: 444 

epoch 1
  0% 0/444 [00:00<?, ?it/s]../utils/lda2vec_loss.py:196: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  doc_probs = F.softmax(doc_weights)
torch.Size([7168, 20]) torch.Size([7168, 20]) torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "train.py", line 32, in main
    save_every=20, grad_clip=5.0
  File "../utils/training.py", line 127, in train
    neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 70, in forward
    doc_vectors = self.topics(doc_weights)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 208, in forward
    doc_vectors = (unsqueezed_doc_probs*unsqueezed_topic_vectors).sum(1)
RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1

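Could this be a PyTorch version issue? Perhaps related:

https://github.com/marvis/pytorch-yolo2/issues/106

In any case, the shapes don't match:
torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])

Note that the notebook output above lists 20 topics (0 to 19), while train.py reports 25 topics, so the two settings may simply disagree.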
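
The failing line multiplies document probabilities of shape [batch, n_topics, 1] by topic vectors of shape [1, n_topics, dim], which only broadcasts when both use the same number of topics. The sketch below reproduces that computation with small made-up sizes and adds an explicit check. The function name and the check are illustrative and are not part of lda2vec_loss.py.

```python
import torch


def doc_vectors_from(doc_weights, topic_vectors):
    """Mix topic vectors by per-document topic probabilities.

    doc_weights:   [batch, n_topics]
    topic_vectors: [n_topics, dim]
    """
    if doc_weights.shape[1] != topic_vectors.shape[0]:
        raise ValueError(
            "doc_weights has %d topics but topic_vectors has %d"
            % (doc_weights.shape[1], topic_vectors.shape[0])
        )
    doc_probs = torch.softmax(doc_weights, dim=1)
    # [batch, n_topics, 1] * [1, n_topics, dim] -> [batch, n_topics, dim]
    return (doc_probs.unsqueeze(2) * topic_vectors.unsqueeze(0)).sum(1)


topic_vectors = torch.randn(25, 50)  # 25 topics, as train.py reports

# Matching topic counts work.
good = doc_vectors_from(torch.randn(4, 25), topic_vectors)

# Weights initialised for 20 topics are rejected with a readable message.
try:
    doc_vectors_from(torch.randn(4, 20), topic_vectors)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

If the initial document weights were generated for 20 topics, regenerating them with the topic count that train.py uses (or changing train.py to match) should make the shapes agree.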

Run lda2vec on my custom dataset

The run failed in get_windows.ipynb.

__call__() got an unexpected keyword argument 'tag'

I get this error in:
encoded_docs, decoder, word_counts = preprocess(
    docs, nlp, MIN_LENGTH, MIN_COUNTS, MAX_COUNTS
)
I don't know why I get this error.

Running python train.py fails in "../utils/alias_multinomial.py", line 57

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "train.py", line 32, in main
    save_every=20, grad_clip=5.0
  File "../utils/training.py", line 127, in train
    neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
  File "/Users/macbook/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 72, in forward
    neg_loss = self.neg(pivot_words, target_words, doc_vectors, w)
  File "/Users/macbook/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 142, in forward
    noise = self.multinomial.draw(batch_size*window_size*self.num_sampled)
  File "../utils/alias_multinomial.py", line 57, in draw
    b = torch.bernoulli(q)
RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /Volumes/OSX/Downloads/pytorch/aten/src/TH/THRandom.c:300
