tkm's Issues
Error when running setup file
Hi,
I tried to run your code, but it raises an error that I could not resolve,
and I could find very little about it online.
/home/saria/anaconda3/envs/condaenv27/bin/python2.7 /home/saria/Downloads/tkm-master/setup.py
Warning: Extension name '_topicAssign' does not match fully qualified name 'tkm-master._topicAssign' of '_topicAssign.pyx'
Compiling _topicAssign.pyx because it changed.
[1/1] Cythonizing _topicAssign.pyx
Error compiling Cython file:
...
#cython: language_level=3
^
_topicAssign.pyx:1:0: 'tkm-master._topicAssign' is not a valid module name
Traceback (most recent call last):
  File "/home/saria/Downloads/tkm-master/setup.py", line 22, in <module>
    ext_modules = cythonize(extensions),
  File "/home/saria/anaconda3/envs/condaenv27/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 1027, in cythonize
    cythonize_one(*args)
  File "/home/saria/anaconda3/envs/condaenv27/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 1149, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: _topicAssign.pyx
Process finished with exit code 1
Would you please have a look at this?
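A note on the likely cause, going only by the log above: the qualified name 'tkm-master._topicAssign' takes its first component from the folder name, and a hyphen is not allowed in a Python module name. The sketch below is not Cython's own check; it is a small stand-in (the helper name is_valid_module_name is made up here) that shows why one name is rejected and the other would pass. Renaming the folder, for example to tkm_master, is an assumed workaround and has not been verified against this repository.

```python
import keyword


def is_valid_module_name(qualified_name):
    """Return True if every dotted component is a legal Python identifier.

    Hypothetical helper for illustration; it approximates the rule that
    module names must be identifiers and must not be keywords.
    """
    parts = qualified_name.split(".")
    return all(p.isidentifier() and not keyword.iskeyword(p) for p in parts)


# The name from the error message contains a hyphen in its first component.
print(is_valid_module_name("tkm-master._topicAssign"))
# The same name with an underscore instead of the hyphen.
print(is_valid_module_name("tkm_master._topicAssign"))
```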
output does not make sense
Hello John,
I have run your code on my dataset, which is large (about 500k txt documents).
The result does not make sense: most words got a weight of 0, and only 4 topics were output.
Could you please explain this?
Thanks
0: cvasoci 0.00000, mcireview 0.00000, panhypopituitarismreview 0.00000, calltrig 0.00000, orderslength 0.00000, showmiss 0.00000, gaitpt 0.00000, notepsychotherapi 0.00000, ucohl 0.00000, pulmonarydismiss 0.00000, wellnl 0.00000, utiresult 0.00000, welldc 0.00000, mammogramneg 0.00000, boquet 0.00000,
1: cvasoci 0.00000, mcireview 0.00000, panhypopituitarismreview 0.00000, calltrig 0.00000, orderslength 0.00000, showmiss 0.00000, gaitpt 0.00000, notepsychotherapi 0.00000, ucohl 0.00000, pulmonarydismiss 0.00000, wellnl 0.00000, utiresult 0.00000, welldc 0.00000, mammogramneg 0.00000, boquet 0.00000,
2: prescript 0.100, mouth 0.098, tablet 0.095, skin 0.088, clear 0.053, site 0.051, joint 0.050, lesion 0.046, neg 0.044, mild 0.043, deni 0.042, sleep 0.041, procedur 0.040, bowel 0.040, think 0.039,
3: baalmann 0.896, damian 0.669, cuffautomat 0.648, rachelen 0.442, balkcom 0.431, schwarz 0.368, klusmann 0.296, deceas 0.253, calv 0.250, rightarm 0.235, scholz 0.229, sztajnkrycer 0.211, sadosti 0.201, bernic 0.188, negativegu 0.162,
4: patient 0.207, time 0.155, mg 0.151, tablet 0.143, daili 0.092, mouth 0.091, pain 0.085, medic 0.068, dai 0.067, instruct 0.064, left 0.060, right 0.058, indic 0.058, dr 0.056, need 0.055,
I have not changed your code except for the way the dataset is read:
import glob
import re

# The enclosing "def" line was not posted; the name read_docs is assumed.
def read_docs():
    # Collect every .txt file in the dataset directory.
    documents = glob.glob("/infodev1/phi-data/sohn/biobank/saria/biobank_65up_CI_dx_CN_cp/*.txt")
    docs = []
    for fi in documents:
        with open(fi, 'r') as myfile:
            d = myfile.read().replace('\n', '')
        # Strip POS tags such as "/at " or "/nn-tl ",
        # e.g. "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl".
        d = re.sub(r"/[A-Za-z0-9_-]+ ", " ", d)
        docs.append(d)
    return docs
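One thing in the reading code may contribute to tokens such as mammogramneg or panhypopituitarismreview in the output: replace('\n', '') removes line breaks without inserting a space, so the last word of one line is glued to the first word of the next. The sketch below uses made-up sample text, and it assumes the files separate some words only by line breaks. Whether this also explains the zero weights is not something the output above can confirm.

```python
import re

# Made-up sample; the real files are not shown in the issue.
raw = "mammogram\nneg for malignancy\nreview of systems"

# Removing newlines outright merges words across line boundaries.
tokens_merged = raw.replace("\n", "").split()

# Replacing newlines with a space keeps the words apart;
# collapsing runs of whitespace afterwards tidies the result.
tokens_separate = re.sub(r"\s+", " ", raw.replace("\n", " ")).split()

print(tokens_merged)
print(tokens_separate)
```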
function for calculating the score
Thanks again for all your answers.
I want to apply some of your ideas in my research.
I don't need the topic assignment part, only the way you calculate the score for each word.
Could you help me with your code? Do I need only computeKeywordScore, or are there other dependencies I should consider?
Thank you so much for your time :)
evaluation part
Hi John,
Could you please share the code for your evaluation part?
Have you tried to visualize your data?
Which implementation of LDA did you compare your results against? If you have a link, I would appreciate it if you shared it with me.
Thanks for your time :)
visualization
Hi,
Is there any way to save the model so that the result can be visualized, similar to what PyLDA does?
Thanks.
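Until there is a documented way to save the model, one option is to capture the printed topics and store them in a neutral format that a plotting tool can read. The sketch below only assumes the output format seen in the other issues here ("2: prescript 0.100, mouth 0.098, ..."). The function name parse_topic_line is made up for illustration and is not part of tkm.

```python
import json


def parse_topic_line(line):
    """Parse '2: prescript 0.100, mouth 0.098,' into (2, [(word, weight), ...])."""
    topic_id, _, rest = line.partition(":")
    pairs = []
    for item in rest.split(","):
        item = item.strip()
        if not item:  # skip the empty piece after the trailing comma
            continue
        word, weight = item.rsplit(" ", 1)
        pairs.append((word, float(weight)))
    return int(topic_id), pairs


# Two lines copied (shortened) from the printed output in another issue.
printed = [
    "2: prescript 0.100, mouth 0.098, tablet 0.095,",
    "4: patient 0.207, time 0.155, mg 0.151,",
]
topics = dict(parse_topic_line(line) for line in printed)

# JSON text that could be written to a file and loaded by a plotting script.
as_json = json.dumps(topics)
print(as_json)
```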
word distributions with 0 weight
Hi again,
I have applied this model to data scraped from web news. It printed about 18 to 20 topics for each subject. However, for some subjects, such as "drugs", it printed 18 topics, and in one of the topic clusters every word has a weight of 0.
Do you have any idea why this is happening?
For example, this topic cluster belongs to the "Obesity" subject:
1: faerch 0.00000, perrier 0.00000, mottola 0.00000, augustin 0.00000, kadono 0.00000, hamstr 0.00000, flexor 0.00000, paperboard 0.00000, tanz 0.00000, lenihan 0.00000, aa 0.00000, dysphoria 0.00000, christel 0.00000, yukiko 0.00000, sveikata 0.00000,
Thanks!
2.1 Modeling keywords
I have a question that is not about the code but about the paper,
and I hope you can answer it.
In Section 2.1 of the paper, column 2, formula (7):
the main difference between the two formulas (ignoring B) is that the first has a log, while the second, which you argue gives broader topics, does not.
Could you explain how you arrived at this?
I mean, how can you conclude that the first formula gives more specific topics and the second broader ones when the only difference is the log?
Thank you so much, and sorry if my question is not a coding issue and perhaps somewhat naive.
Python 2.x or 3.x?
I have inherited some code that uses TKM and I'm in the process of running it. It compiled OK, but it seems to be failing now because it's designed for 2.7 and I'm running 3.5.
Should I roll back and try to run it using Python 2.7? Or does the new TKM code here work in Python 3.5? I can't find any directions in the documentation about which Python version to use.
Thanks
Reverse of entropy
Hello,
I found your project very interesting, but I am confused about the reverse-of-entropy idea.
My question is about the implementation:
suppose we have a 10 x 2 matrix representing 10 words and 2 topics.
From your code, I understand that you apply the entropy to each row, so in this case we end up with 10 entropy values. How do you then decide on the distinctiveness of the topic clusters?
To put it another way, how can you conclude which words are distinctive for the two available topics?
Thank you
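To make the question concrete, here is a minimal sketch of a per-row entropy computation on a small word-by-topic table. It uses plain Shannon entropy and a made-up score, 1 - H / log(k), which is 1 when a word's weight sits entirely on one topic and 0 when it is spread evenly. This is only an assumption about what "reverse of entropy" could mean; it is not taken from the tkm code, and the weights below are invented.

```python
import math


def distinctiveness(weights):
    """Return 1 - H/log(k) for one word's topic weights (hypothetical score).

    H is the Shannon entropy of the normalized weights and k is the number
    of topics. All-zero rows and single-topic rows get 0.0 by convention.
    """
    total = sum(weights)
    k = len(weights)
    if total == 0 or k < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(k)


# Invented word-by-topic weights: 3 words, 2 topics.
word_topic = {
    "lesion": [1.0, 0.0],   # all weight on the first topic
    "tablet": [0.9, 0.1],   # mostly the first topic
    "patient": [0.5, 0.5],  # evenly spread
}
scores = {word: distinctiveness(row) for word, row in word_topic.items()}
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

Under this reading, each row yields one number per word, and the topic a distinctive word belongs to is simply the column holding its largest weight.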
How did you track the context
Hi :),
I have a question about the way you create the corpus. As stated in the paper, the algorithm considers context when assigning topics to words, that is, it looks at a window before and after the target word.
I may have misunderstood, but how did you track the context of the words?
I looked at the methods for creating the corpus in AlgTools, and it seems the corpus is created without regard to the words' positions in the documents.
So how were you able to see which words surround the target word?
Sorry for the many questions; I need to understand your approach to this part.
Many thanks :)
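For reference, this is what a context window looks like when a document is kept as an ordered list of tokens. It is a generic sketch, not the tkm implementation; the function name context_windows and the radius parameter are made up for illustration. The point is that a window can only be computed if word order is preserved, which a bag-of-words corpus does not do.

```python
def context_windows(tokens, radius=2):
    """Yield (index, word, neighbors) with up to `radius` tokens on each side."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - radius):i]
        right = tokens[i + 1:i + 1 + radius]
        yield i, word, left + right


# Invented example document, already tokenized in reading order.
doc = ["patient", "takes", "tablet", "by", "mouth", "daily"]
windows = list(context_windows(doc, radius=1))
for i, word, neighbors in windows:
    print(i, word, neighbors)
```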
PMI evaluation
I need to evaluate your approach on my dataset.
To compute the PMI you reported in the paper, do I need to calculate the median of all the values within each topic and then take the mean over the topics?
Could you please let me know whether this is what you did in the paper?
I got good results and only need to evaluate the model, so I would appreciate your help with this.
Thanks :)
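The procedure described in the question can be sketched as follows: compute the PMI for every pair of top words in a topic, take the median per topic, then average over topics. Whether this matches the paper is exactly what is being asked, so treat it as one reading only. The probabilities here come from document-level co-occurrence, and pairs that never co-occur are skipped; both are assumptions (smoothing would be an alternative), and the documents and topics are invented.

```python
import math
from itertools import combinations
from statistics import mean, median


def pmi(w1, w2, docs):
    """PMI from document-level co-occurrence; None if any probability is zero."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p1 == 0 or p2 == 0 or p12 == 0:
        return None
    return math.log(p12 / (p1 * p2))


def topic_score(words, docs):
    """Median PMI over all pairs of a topic's top words (undefined pairs skipped)."""
    values = [pmi(a, b, docs) for a, b in combinations(words, 2)]
    values = [v for v in values if v is not None]
    return median(values) if values else None


# Invented corpus: each document is the set of words it contains.
docs = [
    {"tablet", "mouth", "daily"},
    {"tablet", "mouth"},
    {"pain", "left"},
    {"pain", "right", "left"},
]
topics = [["tablet", "mouth"], ["pain", "left"]]

scores = [topic_score(t, docs) for t in topics]
overall = mean(s for s in scores if s is not None)
print(scores, overall)
```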
assign
:)