tropcomplique / lda2vec-pytorch
Topic modeling with word vectors
License: MIT License
I get the error No such file 'model_state.python' and cannot find this file anywhere. Can anyone help, please?
My computer: NVIDIA CUDA 9.0.176 driver, Torch 0.3.0.
When I run "train.py", it always stops at the following screen.
Please help me, thank you very much.
Running on Google Colab with Python 3 + GPU:
Issue in preprocess.py on line 26, in the call to nlp():
nlp = spacy.load('en')
text = nlp(text, tag=True, parse=False, entity=False)
nlp() does not accept these keyword arguments (tag, parse, entity).
I changed the call to this:
text = nlp(text)
Can you explain how to evaluate an lda2vec model?
train.py tries to load a number of .npy files, such as word_vector.npy. Where and how are they supposed to be generated?
I want to train this code on my own data, which is stored in a MySQL database. How do I do it? Which module should I modify? Please help.
I am running explore_trained_model.ipynb on Google Colab with Python 3.6 and NumPy 1.16.4.
ValueError Traceback (most recent call last)
/content/lda2vec-pytorch/20newsgroups/explore_trained_model.ipynb in <module>()
6
7 # "integer -> word" decoder
----> 8 decoder = np.load('decoder.npy')[()]
9
10 # for restoring document ids, "id used while training -> initial id"
1 frames
/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
694 # The array contained Python objects. We need to unpickle the data.
695 if not allow_pickle:
--> 696 raise ValueError("Object arrays cannot be loaded when "
697 "allow_pickle=False")
698 if pickle_kwargs is None:
ValueError: Object arrays cannot be loaded when allow_pickle=False
This seems relevant:
https://stackoverflow.com/questions/55824625/how-to-fix-object-arrays-cannot-be-loaded-when-allow-pickle-false-in-the-sketc
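Since NumPy 1.16.3, np.load defaults to allow_pickle=False, so a pickled object such as a dict saved with np.save has to be loaded with an explicit opt-in. Here is a minimal sketch of that fix; the small decoder dict and the temporary path are stand-ins for illustration, not the notebook's actual data:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the notebook's "integer -> word" decoder.
decoder = {0: "topic", 1: "model", 2: "vector"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "decoder.npy")
    # Saving a dict stores it as a pickled 0-dimensional object array.
    np.save(path, decoder)

    # Opt in to unpickling explicitly; only do this for files you trust.
    restored = np.load(path, allow_pickle=True)[()]

print(restored[1])
```

In the notebook, the corresponding change would presumably be np.load('decoder.npy', allow_pickle=True)[()], and the same for any other .npy file that holds Python objects.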
I ran all your code successfully. In explore_trained_model.ipynb, I see that you get prediction results for the documents used in training. However, I want to infer the topic distribution for new documents. Could you tell me how to do that?
Thank you very much.
This exception occurs when training on 20newsgroups with embed_dimension set to 300 in Google Colab with PyTorch 1.1.0 and CUDA 10.
Curiously, it does not happen when training on my Mac with only the CPU.
From alias_multinomial.py:
    def draw(self, N):
        """Draw N samples from the distribution."""
        K = self.J.size(0)
        r = torch.LongTensor(np.random.randint(0, K, size=N))
        q = self.q.index_select(0, r)
        j = self.J.index_select(0, r)
        b = torch.bernoulli(q)
        #print("K r q j b r.shape, q.shape j.shape b.shape, j.shape", K, r, q, j, b, r.shape, q.shape, j.shape, b.shape, j.shape)
        oq = r.mul(b.long())
        oj = j.mul((1 - b).long())
        return oq + oj
Traceback (most recent call last):
File "train.py", line 36, in <module>
main()
File "train.py", line 32, in main
save_every=20, grad_clip=5.0
File "../utils/training.py", line 127, in train
neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../utils/lda2vec_loss.py", line 82, in forward
neg_loss = self.neg(pivot_words, target_words, doc_vectors, w)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../utils/lda2vec_loss.py", line 152, in forward
noise = self.multinomial.draw(batch_size*window_size*self.num_sampled)
File "../utils/alias_multinomial.py", line 60, in draw
b = torch.bernoulli(q)
RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /pytorch/aten/src/TH/THRandom.cpp:320
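Printing q.min() and q.max() right before the failing call gives:
0.21869047
1.0066459
The maximum is above 1, which is exactly what torch.bernoulli rejects.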
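One possible workaround (it does not explain why q drifts above 1 in the first place) is to clip the probabilities into [0, 1] before sampling. The sketch below shows the idea in NumPy so it runs without a GPU; bernoulli_safe is a hypothetical helper, not part of the repository:

```python
import numpy as np


def bernoulli_safe(q, rng):
    """Draw one 0/1 sample per entry of q after clipping q into [0, 1]."""
    q = np.clip(np.asarray(q, dtype=np.float64), 0.0, 1.0)
    # A uniform draw in [0, 1) falls below q with probability q.
    return (rng.random(q.shape) < q).astype(np.int64)


rng = np.random.default_rng(0)
# The two extremes reported above; the second is outside the valid range.
q = np.array([0.21869047, 1.0066459])
samples = bernoulli_safe(q, rng)
```

In alias_multinomial.py the analogous change would presumably be q = q.clamp(0.0, 1.0) just before b = torch.bernoulli(q); I have not tested that against the repository.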
This seems relevant:
Running on Google Colab.
I encounter the problem "IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number".
This solution worked for me:
Try moving the noise tensor to the GPU in the forward method of class negative_sampling_loss(nn.Module) in utils/lda2vec_loss.py, e.g.:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
noise = noise.to(device)
I am getting this error when running lda2vec-pytorch on Google's Colab platform on a text file with 5,171 news articles from global websites:
RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1
from lda2vec_loss.py:
print(doc_weights.shape, doc_probs.shape, unsqueezed_doc_probs.shape, unsqueezed_topic_vectors.shape)
torch.Size([7168, 20]) torch.Size([7168, 20]) torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])
%run get_stories_windows.ipynb
100%|██████████| 5171/5171 [22:09<00:00, 1.36it/s]
number of removed short documents: 5
total number of tokens: 5841072
number of tokens to be removed: 2660533
number of additionally removed short documents: 3
total number of tokens: 3180503
minimum word count number: 18
this number can be less than MIN_COUNTS because of document removal
5163it [00:08, 616.10it/s]
CPU times: user 1h 23min 15s, sys: 2.38 s, total: 1h 23min 18s
Wall time: 42min 10s
2%|▏ | 108/5163 [00:00<00:04, 1074.25it/s]CPU times: user 12.1 s, sys: 7.26 s, total: 19.3 s
Wall time: 9.84 s
topic 0 : Point Vladimir Union Moscow progress migration air Mosul Insight safe
topic 1 : Hong Morocco Kong les sur Jamaica Carnoustie Escocia Park Turkish
topic 2 : Got Talent Baron Sacha Club UAE AMERICA SHOWTIME Trapeze talent
topic 3 : Kentucky Lil Peoria Song Minnesota CBS hai Cyclone boy KZN
topic 4 : Game NBA Oman heat Tokyo Utah Bangla Soccer Basketball Mail
topic 5 : Ganga Chennai Bengaluru Hyderabad ordeal Theatre thREAD CLOSE Sabha Mumbai
topic 6 : Army syrian AGT Philippines Syrian Baltimore February October army January
topic 7 : ADVERTISEMENT Deutsch Programs migrant teach Podcasts turkish Puigdemont XXL Reuters
topic 8 : Air Force NYC Iraq Turkmenistan Kazakhstan Nordic Energy CBS Affairs
topic 9 : Philadelphia Mix Pennsylvania Route Baltimore Episode Summer Deep Massachusetts Park
topic 10 : Herald ZEALAND NEW IOL NZME Property crash Northern Pakuranga serial
topic 11 : RFI GMT Paris attachment Mon gmt Fri flash bulletin analysis
topic 12 : Nairobi Counties Rift switch Ethiopia NTV hours Ruaraka Eritrea Gold
topic 13 : Texas Russian Oman Newscasts Image Star Fry summer funny Vermont
topic 14 : Premium Content Zuckerberg Conference arab Gaza jewish JPost Careers Diaspora
topic 15 : Messenger WhatsApp external LinkedIn Queen window Prince Kilmeade School Elizabeth
topic 16 : Director Documentary London Chicago Game Tower Urdu Airlines english Egypt
topic 17 : Peninsula EBITDA Khmer Hungary SPAIN amazing thai Partners khmer lottery
topic 18 : Amsterdam Mark Fox Amazon Swansea Song Qatar TIME Indonesia durationStr
topic 19 : Pak Headlines pak modi india England pakistani Reaction indian Pakistani
100%|██████████| 5163/5163 [00:05<00:00, 977.59it/s]
!python train.py
number of documents: 5163
number of windows: 3180503
number of topics: 25
vocabulary size: 19224
word embedding dim: 50
../utils/lda2vec_loss.py:47: UserWarning: nn.init.normal is now deprecated in favor of nn.init.normal_.
init.normal(self.doc_weights.weight, std=DOC_WEIGHTS_INIT)
number of batches: 444
epoch 1
0% 0/444 [00:00<?, ?it/s]../utils/lda2vec_loss.py:196: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
doc_probs = F.softmax(doc_weights)
torch.Size([7168, 20]) torch.Size([7168, 20]) torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])
Traceback (most recent call last):
File "train.py", line 36, in <module>
main()
File "train.py", line 32, in main
save_every=20, grad_clip=5.0
File "../utils/training.py", line 127, in train
neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../utils/lda2vec_loss.py", line 70, in forward
doc_vectors = self.topics(doc_weights)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../utils/lda2vec_loss.py", line 208, in forward
doc_vectors = (unsqueezed_doc_probs*unsqueezed_topic_vectors).sum(1)
RuntimeError: The size of tensor a (20) must match the size of tensor b (25) at non-singleton dimension 1
Pytorch version issue? Perhaps:
https://github.com/marvis/pytorch-yolo2/issues/106
In any case, the shapes don't match up, e.g.:
torch.Size([7168, 20, 1]) torch.Size([1, 25, 50])
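The logs above list 20 topics (topic 0 to topic 19) from the notebook, while train.py reports "number of topics: 25", so a plausible cause is that the initial document weights were built for 20 topics and the topic vectors for 25. The following NumPy sketch reproduces the same broadcast and adds an explicit check; mix_topics is a hypothetical helper written for illustration, not the repository's code:

```python
import numpy as np


def mix_topics(doc_probs, topic_vectors):
    """Weighted sum of topic vectors per document, with an explicit shape check."""
    n_docs, n_topics_docs = doc_probs.shape
    n_topics, dim = topic_vectors.shape
    if n_topics_docs != n_topics:
        raise ValueError(
            f"document weights cover {n_topics_docs} topics "
            f"but there are {n_topics} topic vectors"
        )
    # [n_docs, n_topics, 1] * [1, n_topics, dim], summed over topics -> [n_docs, dim]
    return (doc_probs[:, :, None] * topic_vectors[None, :, :]).sum(axis=1)


doc_probs = np.full((4, 20), 1.0 / 20)  # 20 topics, as in the printed shapes
ok = mix_topics(doc_probs, np.ones((20, 50)))

try:
    mix_topics(doc_probs, np.ones((25, 50)))  # 25 topics, as in train.py's log
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

If this diagnosis is right, using the same number of topics in the notebook that produces the initial weights and in train.py should make the two shapes agree.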
__call__() got an unexpected keyword argument 'tag'
I have this problem in:
encoded_docs, decoder, word_counts = preprocess(
docs, nlp, MIN_LENGTH, MIN_COUNTS, MAX_COUNTS
)
I don't know why I have this problem.
Traceback (most recent call last):
File "train.py", line 36, in <module>
main()
File "train.py", line 32, in main
save_every=20, grad_clip=5.0
File "../utils/training.py", line 127, in train
neg_loss, dirichlet_loss = model(doc_indices, pivot_words, target_words)
File "/Users/macbook/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "../utils/lda2vec_loss.py", line 72, in forward
neg_loss = self.neg(pivot_words, target_words, doc_vectors, w)
File "/Users/macbook/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "../utils/lda2vec_loss.py", line 142, in forward
noise = self.multinomial.draw(batch_size*window_size*self.num_sampled)
File "../utils/alias_multinomial.py", line 57, in draw
b = torch.bernoulli(q)
RuntimeError: invalid argument 1: must be >= 0 and <= 1 at /Volumes/OSX/Downloads/pytorch/aten/src/TH/THRandom.c:300