johntailor / tkm Goto Github PK

Python 100.00%

tkm's People

Contributors

Stargazers

Watchers

Forkers

solms llcchen vendeloeranu littlestar-angel

tkm's Issues

Error when running setup file

Hi,
I tried to run your code, but it raises an error that I could not resolve,
and I found very few workarounds for it online.

/home/saria/anaconda3/envs/condaenv27/bin/python2.7 /home/saria/Downloads/tkm-master/setup.py
Warning: Extension name '_topicAssign' does not match fully qualified name 'tkm-master._topicAssign' of '_topicAssign.pyx'
Compiling _topicAssign.pyx because it changed.
[1/1] Cythonizing _topicAssign.pyx

Error compiling Cython file:

...
#cython: language_level=3
^

_topicAssign.pyx:1:0: 'tkm-master._topicAssign' is not a valid module name
Traceback (most recent call last):
  File "/home/saria/Downloads/tkm-master/setup.py", line 22, in <module>
    ext_modules = cythonize(extensions),
  File "/home/saria/anaconda3/envs/condaenv27/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 1027, in cythonize
    cythonize_one(*args)
  File "/home/saria/anaconda3/envs/condaenv27/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 1149, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: _topicAssign.pyx

Process finished with exit code 1

Would you please take a look at this?
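
The error message suggests that Cython derived the package name from the directory `tkm-master`, and a hyphen is not allowed in a Python module name. The following is a minimal sketch, assuming that is the cause, of how such a dotted name can be checked. `is_valid_module_name` is a hypothetical helper written for illustration; it is not part of Cython or tkm.

```python
import keyword


def is_valid_module_name(dotted_name):
    """Return True if every dot-separated part is a legal Python identifier."""
    parts = dotted_name.split(".")
    return all(
        part.isidentifier() and not keyword.iskeyword(part)
        for part in parts
    )


# The name from the error message fails because of the hyphen in "tkm-master".
print(is_valid_module_name("tkm-master._topicAssign"))  # False
# The same module under a hyphen-free directory name passes.
print(is_valid_module_name("tkm._topicAssign"))  # True
```

If this is indeed the cause, renaming the checkout directory to something without a hyphen (for example `tkm`) may avoid the error.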

output does not make sense

Hello John,

I have run your code on my dataset, a large one of about 500k text documents.
The result does not make sense: most words get a weight of 0 and only 4 topics are output.
Could you please explain this?

Thanks


0:    cvasoci 0.00000, mcireview 0.00000, panhypopituitarismreview 0.00000, calltrig 0.00000, orderslength 0.00000, showmiss 0.00000, gaitpt 0.00000, notepsychotherapi 0.00000, ucohl 0.00000, pulmonarydismiss 0.00000, wellnl 0.00000, utiresult 0.00000, welldc 0.00000, mammogramneg 0.00000, boquet 0.00000, 
1:    cvasoci 0.00000, mcireview 0.00000, panhypopituitarismreview 0.00000, calltrig 0.00000, orderslength 0.00000, showmiss 0.00000, gaitpt 0.00000, notepsychotherapi 0.00000, ucohl 0.00000, pulmonarydismiss 0.00000, wellnl 0.00000, utiresult 0.00000, welldc 0.00000, mammogramneg 0.00000, boquet 0.00000, 
2:    prescript 0.100, mouth 0.098, tablet 0.095, skin 0.088, clear 0.053, site 0.051, joint 0.050, lesion 0.046, neg 0.044, mild 0.043, deni 0.042, sleep 0.041, procedur 0.040, bowel 0.040, think 0.039, 
3:    baalmann 0.896, damian 0.669, cuffautomat 0.648, rachelen 0.442, balkcom 0.431, schwarz 0.368, klusmann 0.296, deceas 0.253, calv 0.250, rightarm 0.235, scholz 0.229, sztajnkrycer 0.211, sadosti 0.201, bernic 0.188, negativegu 0.162, 
4:    patient 0.207, time 0.155, mg 0.151, tablet 0.143, daili 0.092, mouth 0.091, pain 0.085, medic 0.068, dai 0.067, instruct 0.064, left 0.060, right 0.058, indic 0.058, dr 0.056, need 0.055,

I have not changed your code except for the way the dataset is read:

import glob
import re

def read_documents(pattern):
    """Read every file matching the glob pattern and return the cleaned texts."""
    docs = []
    for path in glob.glob(pattern):
        with open(path, 'r') as myfile:
            # Replace newlines with a space; replacing them with '' fuses the
            # last word of one line with the first word of the next.
            d = myfile.read().replace('\n', ' ')
        # Strip part-of-speech tags such as "/np-tl" from tokens like "Fulton/np-tl".
        d = re.sub(r"/[A-Za-z0-9_-]+ ", " ", d)
        docs.append(d)
    return docs

docs = read_documents("/infodev1/phi-data/sohn/biobank/saria/biobank_65up_CI_dx_CN_cp/*.txt")

function for calculating the score

Thanks again for all your answers.

I want to use some of your ideas in my research.
I don't need the topic assignment part, only the way you calculate the score for each word.
Could you help me with your code? Do I need computeKeywordScore, or are there other dependencies I should consider?

Thank you so much for your time :)

evaluation part

Hi John,

Could you please share the code for your evaluation part?
Have you tried to visualize your data?
Also, which implementation of LDA did you compare your results against? If you have a link, I would appreciate it if you shared it with me.

Thanks for your time :)

visualization

Hi,

Is there any way to save the model so that the result can be visualized, similar to what PyLDA does?

Thanks.

word distributions with 0 weight

Hi again,

I have applied this model to data scraped from web news. It printed about 18 to 20 topics for each subject. However, for some subjects, such as "drugs", it printed 18 topics, and in one of the topic clusters every word has a weight of 0.
Do you have any idea why this is happening?
For example, this topic cluster comes from the "Obesity" subject:
1: faerch 0.00000, perrier 0.00000, mottola 0.00000, augustin 0.00000, kadono 0.00000, hamstr 0.00000, flexor 0.00000, paperboard 0.00000, tanz 0.00000, lenihan 0.00000, aa 0.00000, dysphoria 0.00000, christel 0.00000, yukiko 0.00000, sveikata 0.00000,

Thanks!

2.1 Modeling keywords

I have a question about the paper rather than the code,
and I hope you can answer it.
It concerns Section 2.1 of the paper, column 2, formula (7).

Ignoring B, the main difference between the two formulas is that the first has a log and the second, which you argue gives broader topics, does not.

Could you explain how you arrived at this?
That is, how can you conclude that the first formula gives more specific topics and the second broader ones when the only difference is the log?

Thank you so much, and sorry if my question is not a coding issue and perhaps a bit naive.

Python 2.x or 3.x?

I have inherited some code that uses TKM and I'm in the process of running it. It compiled OK, but it now seems to fail because it was designed for Python 2.7 and I'm running 3.5.

Should I roll back and run it with Python 2.7, or does the current TKM code here work with Python 3.5? I can't find anything in the documentation about which Python version to use.

Thanks

Reverse of entropy

Hello,

I found your project very interesting, but I am confused about the reverse-of-entropy idea.
My question is mostly about the implementation:
suppose we have a 10×2 matrix representing 10 words and 2 topics.
From your code, I understand that you apply the entropy to each row, so in this case we end up with 10 entropy values. How do you then decide how distinctive the topic clusters are?
To put it another way, how do you determine which words are distinctive for each of the two topics?

Thank you
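
To make the row-wise reading in the question concrete, here is a minimal sketch of one common way to turn a per-word entropy into a distinctiveness score. It is an illustration under assumptions, not the code from this repository: the functions `row_entropy` and `distinctiveness`, the normalization by the maximum entropy, and the example weights are all hypothetical.

```python
import math


def row_entropy(weights):
    """Shannon entropy (natural log) of one word's weights across topics."""
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def distinctiveness(weights):
    """1 - H / H_max: near 1 if the word sits in one topic, 0 if spread evenly."""
    k = len(weights)
    # A word with no weight anywhere, or a single topic, carries no signal.
    if k < 2 or sum(weights) == 0:
        return 0.0
    return 1.0 - row_entropy(weights) / math.log(k)


# Hypothetical word-by-topic weights: one row per word, one column per topic.
word_topic = {
    "tablet": [0.9, 0.1],   # concentrated in topic 0
    "patient": [0.5, 0.5],  # spread evenly over both topics
}
for word, weights in word_topic.items():
    print(word, round(distinctiveness(weights), 3))
```

Under this reading, each word gets a single score saying how concentrated it is, and the topic it is distinctive for is simply the column with the largest weight in its row.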

How did you track the context

Hi :),

I have a question about the way you create the corpus. As stated in the paper, the algorithm considers the context when assigning a topic to a word, that is, it looks at a window before and after the target word.

I may have misunderstood, but how do you track the context of the words?
I looked at the methods for creating the corpus in AlgTools, and it seems the corpus is created without regard to word positions in the documents.
So how are you able to see which words surround the target word?
Sorry for the many questions; I need to understand your approach to this part.

Many thanks :)
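
For reference, a context window of the kind described in the question requires the token order of each document to be preserved. The sketch below shows the general technique only; `context_windows` and its `radius` parameter are hypothetical names, and this is not taken from AlgTools.

```python
def context_windows(tokens, radius):
    """Yield (word, neighbours) pairs, with up to `radius` tokens on each side."""
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - radius):i]
        right = tokens[i + 1:i + 1 + radius]
        yield word, left + right


# The document must be kept as an ordered token list, not a bag of words.
doc = ["patient", "takes", "one", "tablet", "daily"]
for word, neighbours in context_windows(doc, radius=1):
    print(word, neighbours)
```

If the corpus is stored as an ordered list of word ids per document, the same slicing applies to the ids.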

PMI evaluation

I need to evaluate your approach on my dataset.
To compute the PMI reported in the paper, should I take the median of all the values within each topic and then the mean of those per-topic values?
Could you please confirm whether this is what you did in the paper?
I got good results and only need to evaluate the model, so I would appreciate your help with this.

Thanks:)
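
The procedure described in the question (median PMI within each topic, then the mean over topics) can be sketched as follows. This follows the question's reading and is not confirmed to be the paper's exact procedure. The function names, the use of document-level co-occurrence, and the choice to skip pairs that never co-occur are all assumptions.

```python
import math
from itertools import combinations
from statistics import mean, median


def pmi(w1, w2, doc_sets):
    """Document-level PMI of two words; None if they never co-occur."""
    n = len(doc_sets)
    c1 = sum(1 for d in doc_sets if w1 in d)
    c2 = sum(1 for d in doc_sets if w2 in d)
    c12 = sum(1 for d in doc_sets if w1 in d and w2 in d)
    if c12 == 0:
        return None  # log(0) is undefined; the caller skips this pair
    return math.log((c12 * n) / (c1 * c2))


def topic_pmi(top_words, doc_sets):
    """Median PMI over all pairs of a topic's top words."""
    scores = []
    for a, b in combinations(top_words, 2):
        score = pmi(a, b, doc_sets)
        if score is not None:
            scores.append(score)
    return median(scores) if scores else None


def model_pmi(topics, doc_sets):
    """Mean of the per-topic medians, ignoring topics with no scored pair."""
    per_topic = [topic_pmi(words, doc_sets) for words in topics]
    per_topic = [s for s in per_topic if s is not None]
    return mean(per_topic) if per_topic else None


# Each document is represented as the set of words it contains (toy data).
docs = [{"tablet", "mouth"}, {"tablet", "mouth"}, {"pain"}, {"pain", "left"}]
print(model_pmi([["tablet", "mouth"], ["pain", "left"]], docs))
```

PMI is usually computed against a large reference corpus rather than the training data, so `doc_sets` would be built from whichever corpus the evaluation is meant to use.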
