lda2vec-pytorch's People

Contributors

tropcomplique

lda2vec-pytorch's Issues

torch.load model_state

I get the error No such file 'model_state.python', and I cannot find this file anywhere. Can anyone help me, please?

Run failure

My setup: NVIDIA CUDA 9.0.176 driver, Torch 0.3.0.
When I run "train.py", it always stops at the following screen.

Please help me. Thank you very much.

spacy nlp() - unknown arguments

Running on Google Colab with Python 3 + GPU.
The issue is in preprocess.py, line 26, in the call to nlp():

nlp = spacy.load('en')
text = nlp(text, tag=True, parse=False, entity=False)

nlp() rejects these keyword arguments (tag, parse, entity) as unknown.
I changed the call to this:
text = nlp(text)
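
A minimal sketch of a version-tolerant call, assuming the error comes from a spaCy release whose nlp() no longer accepts tag, parse and entity. The names parse_text and FakeNlp are hypothetical and used only for illustration; FakeNlp stands in for a loaded pipeline so the snippet runs without spaCy installed.

```python
def parse_text(nlp, text):
    """Try the call used in preprocess.py, then fall back to the plain call."""
    try:
        # Keyword form from preprocess.py, line 26.
        return nlp(text, tag=True, parse=False, entity=False)
    except TypeError:
        # The pipeline rejected the keywords, so pass the text only.
        return nlp(text)


class FakeNlp:
    """Hypothetical stand-in for a pipeline that rejects the old keywords."""

    def __call__(self, text):
        return text.split()


tokens = parse_text(FakeNlp(), "topic models are fun")
print(tokens)
```

Catching TypeError this broadly can also hide unrelated errors raised inside the pipeline, so this is a stopgap rather than a fix. In spaCy 2 and later, unneeded components can instead be switched off at load time, for example spacy.load(name, disable=['parser', 'ner']).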

ValueError: Object arrays cannot be loaded when allow_pickle=False

I am running explore_trained_model.ipynb on Google Colab with Python 3.6 and NumPy 1.16.4.

ValueError                                Traceback (most recent call last)
/content/lda2vec-pytorch/20newsgroups/explore_trained_model.ipynb in <module>()
      6 
      7 # "integer -> word" decoder
----> 8 decoder = np.load('decoder.npy')[()]
      9 
     10 # for restoring document ids, "id used while training -> initial id"

1 frames
/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
    694         # The array contained Python objects. We need to unpickle the data.
    695         if not allow_pickle:
--> 696             raise ValueError("Object arrays cannot be loaded when "
    697                              "allow_pickle=False")
    698         if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False

This seems relevant:
https://stackoverflow.com/questions/55824625/how-to-fix-object-arrays-cannot-be-loaded-when-allow-pickle-false-in-the-sketc
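
A minimal sketch of the usual workaround, assuming decoder.npy holds a pickled Python dict: NumPy 1.16.4, as reported above, defaults to allow_pickle=False in np.load, so the flag has to be passed explicitly. The small decoder dict and the temporary file are hypothetical stand-ins for the real file produced by the notebooks.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the "integer -> word" decoder.
decoder = {0: "topic", 1: "model"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "decoder.npy")
    np.save(path, decoder)  # a dict is stored as a pickled 0-d object array

    # Opt in to unpickling explicitly; only do this for files you trust.
    restored = np.load(path, allow_pickle=True)[()]

print(restored)
```

In the notebook, the equivalent change would be np.load('decoder.npy', allow_pickle=True)[()]. Unpickling can execute arbitrary code, which is why the default was tightened.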

How to infer topics distribution for new documents

I ran all your code successfully. In explore_trained_model.ipynb, I see that you get predictions for the documents used in training. However, I want to infer the topic distribution for new documents. How can I do that?
Thank you very much.

RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /pytorch/aten/src/TH/THRandom.cpp:320

This exception occurs when training on 20newsgroups with embed_dimension 300 on Google Colab with PyTorch 1.1.0 and CUDA 10.
Curiously, it does not happen when training on my Mac with only the CPU.

From alias_multinomial.py:

    def draw(self, N):
        """Draw N samples from the distribution."""

        K = self.J.size(0)
        r = torch.LongTensor(np.random.randint(0, K, size=N))
        q = self.q.index_select(0, r)
        j = self.J.index_select(0, r)
        b = torch.bernoulli(q)
        #print("K r q j b r.shape, q.shape j.shape b.shape, j.shape", K, r, q, j, b, r.shape, q.shape, j.shape, b.shape, j.shape)
        oq = r.mul(b.long())
        oj = j.mul((1 - b).long())
        return oq + oj

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "train.py", line 32, in main
    save_every=20, grad_clip=5.0
  File "../utils/training.py", line 127, in train
    neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 82, in forward
    neg_loss = self.neg(pivot_words, target_words, doc_vectors, w)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 152, in forward
    noise = self.multinomial.draw(batch_size*window_size*self.num_sampled)
  File "../utils/alias_multinomial.py", line 60, in draw
    b = torch.bernoulli(q)
RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /pytorch/aten/src/TH/THRandom.cpp:320

q.min() = 0.21869047
q.max() = 1.0066459

This seems relevant:

pytorch/pytorch#9917
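
A minimal sketch of one possible guard, assuming the excess above 1 in q is numerical error from building the alias table (the cause is not confirmed here). It clamps the acceptance probabilities into [0, 1] before the Bernoulli draw. NumPy is used so the snippet runs without PyTorch; in draw() the analogous change would be q = q.clamp(0, 1) before torch.bernoulli(q). The two-entry table is hypothetical and only reuses the reported minimum and maximum.

```python
import numpy as np


def draw(q, J, n, rng):
    """Draw n samples from an alias table (q: acceptance probs, J: aliases)."""
    K = len(J)
    r = rng.integers(0, K, size=n)   # pick a column uniformly
    q_r = np.clip(q[r], 0.0, 1.0)    # guard: values such as 1.0066 become 1.0
    keep = rng.random(n) < q_r       # Bernoulli(q_r) draw
    return np.where(keep, r, J[r])   # keep the column or jump to its alias


# Hypothetical two-entry table using the reported q.min() and q.max().
q = np.array([0.21869047, 1.0066459])
J = np.array([1, 0])
samples = draw(q, J, 1000, np.random.default_rng(0))
print(np.bincount(samples, minlength=2))
```

Clamping hides the symptom rather than explaining why q exceeds 1, so it is worth checking how the table is built if the overshoot is large.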

RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1

I am getting this error when running lda2vec-pytorch on Google's Colab platform on a text file with 5,171 news articles from global websites:
RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1

from lda2vec_loss.py:
print(doc_weights.shape, doc_probs.shape, unsqueezed_doc_probs.shape, unsqueezed_topic_vectors.shape)

torch.Size([7168, 20]) torch.Size([7168, 20]) torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])

%run get_stories_windows.ipynb
100%|██████████| 5171/5171 [22:09<00:00,  1.36it/s]
number of removed short documents: 5
total number of tokens: 5841072
number of tokens to be removed: 2660533
number of additionally removed short documents: 3
total number of tokens: 3180503

minimum word count number: 18
this number can be less than MIN_COUNTS because of document removal
5163it [00:08, 616.10it/s]
CPU times: user 1h 23min 15s, sys: 2.38 s, total: 1h 23min 18s
Wall time: 42min 10s
  2%|▏         | 108/5163 [00:00<00:04, 1074.25it/s]CPU times: user 12.1 s, sys: 7.26 s, total: 19.3 s
Wall time: 9.84 s
topic 0 : Point Vladimir Union Moscow progress migration air Mosul Insight safe
topic 1 : Hong Morocco Kong les sur Jamaica Carnoustie Escocia Park Turkish
topic 2 : Got Talent Baron Sacha Club UAE AMERICA SHOWTIME Trapeze talent
topic 3 : Kentucky Lil Peoria Song Minnesota CBS hai Cyclone boy KZN
topic 4 : Game NBA Oman heat Tokyo Utah Bangla Soccer Basketball Mail
topic 5 : Ganga Chennai Bengaluru Hyderabad ordeal Theatre thREAD CLOSE Sabha Mumbai
topic 6 : Army syrian AGT Philippines Syrian Baltimore February October army January
topic 7 : ADVERTISEMENT Deutsch Programs migrant teach Podcasts turkish Puigdemont XXL Reuters
topic 8 : Air Force NYC Iraq Turkmenistan Kazakhstan Nordic Energy CBS Affairs
topic 9 : Philadelphia Mix Pennsylvania Route Baltimore Episode Summer Deep Massachusetts Park
topic 10 : Herald ZEALAND NEW IOL NZME Property crash Northern Pakuranga serial
topic 11 : RFI GMT Paris attachment Mon gmt Fri flash bulletin analysis
topic 12 : Nairobi Counties Rift switch Ethiopia NTV hours Ruaraka Eritrea Gold
topic 13 : Texas Russian Oman Newscasts Image Star Fry summer funny Vermont
topic 14 : Premium Content Zuckerberg Conference arab Gaza jewish JPost Careers Diaspora
topic 15 : Messenger WhatsApp external LinkedIn Queen window Prince Kilmeade School Elizabeth
topic 16 : Director Documentary London Chicago Game Tower Urdu Airlines english Egypt
topic 17 : Peninsula EBITDA Khmer Hungary SPAIN amazing thai Partners khmer lottery
topic 18 : Amsterdam Mark Fox Amazon Swansea Song Qatar TIME Indonesia durationStr
topic 19 : Pak Headlines pak modi india England pakistani Reaction indian Pakistani
100%|██████████| 5163/5163 [00:05<00:00, 977.59it/s]

!python train.py

number of documents: 5163
number of windows: 3180503
number of topics: 25
vocabulary size: 19224
word embedding dim: 50
../utils/lda2vec_loss.py:47: UserWarning: nn.init.normal is now deprecated in favor of nn.init.normal_.
  init.normal(self.doc_weights.weight, std=DOC_WEIGHTS_INIT)
number of batches: 444 

epoch 1
  0% 0/444 [00:00<?, ?it/s]../utils/lda2vec_loss.py:196: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  doc_probs = F.softmax(doc_weights)
torch.Size([7168, 20]) torch.Size([7168, 20]) torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "train.py", line 32, in main
    save_every=20, grad_clip=5.0
  File "../utils/training.py", line 127, in train
    neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 70, in forward
    doc_vectors = self.topics(doc_weights)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 208, in forward
    doc_vectors = (unsqueezed_doc_probs*unsqueezed_topic_vectors).sum(1)
RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1

Could this be a PyTorch version issue? Perhaps:

https://github.com/marvis/pytorch-yolo2/issues/106

In any case, the shapes don't match up:
torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])
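
A minimal sketch of why the multiplication fails, using NumPy in place of PyTorch (both follow the same broadcasting rule). The log above lists 20 topics from the notebook while train.py reports 25, so one plausible reading, which is an assumption and not confirmed by the author, is that the two steps were configured with different topic counts. The batch size of 4 and the variable names are illustrative only.

```python
import numpy as np

batch, emb_dim = 4, 50   # small batch for illustration; the log shows 7168
n_topics_init = 20       # topic count in the document weights
n_topics_train = 25      # "number of topics: 25" reported by train.py

doc_probs = np.full((batch, n_topics_init, 1), 1.0 / n_topics_init)


def doc_vectors(doc_probs, topic_vectors):
    """Weighted sum of topic vectors, mirroring the failing line."""
    return (doc_probs * topic_vectors).sum(1)


try:
    doc_vectors(doc_probs, np.zeros((1, n_topics_train, emb_dim)))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True   # 20 vs 25 cannot be broadcast on dimension 1

# With matching topic counts the product broadcasts to (batch, topics, emb_dim).
ok = doc_vectors(doc_probs, np.zeros((1, n_topics_init, emb_dim)))
print(mismatch_raised, ok.shape)
```

If that reading is right, using the same number of topics in the notebook and in train.py should make dimension 1 agree.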

run failed in get_windows.ipynb

__call__() got an unexpected keyword argument 'tag'

I hit this problem in the following call:
encoded_docs, decoder, word_counts = preprocess(
    docs, nlp, MIN_LENGTH, MIN_COUNTS, MAX_COUNTS
)
I don't know why this happens.

I run python train.py and get error with File "../utils/alias_multinomial.py", line 57

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    main()
  File "train.py", line 32, in main
    save_every=20, grad_clip=5.0
  File "../utils/training.py", line 127, in train
    neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
  File "/Users/macbook/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 72, in forward
    neg_loss = self.neg(pivot_words, target_words, doc_vectors, w)
  File "/Users/macbook/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "../utils/lda2vec_loss.py", line 142, in forward
    noise = self.multinomial.draw(batch_size*window_size*self.num_sampled)
  File "../utils/alias_multinomial.py", line 57, in draw
    b = torch.bernoulli(q)
RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /Volumes/OSX/Downloads/pytorch/aten/src/TH/THRandom.c:300
