cpsievert / LDAvis

R package for web-based interactive topic model visualization.

License: Other

R 29.03% HTML 0.48% CSS 0.35% JavaScript 70.00% Makefile 0.14%
topic-modeling r javascript visualization text-mining

LDAvis's Introduction

LDAvis


R package for interactive topic model visualization.

LDAvis icon

LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Installing the package

  • Stable version on CRAN:
install.packages("LDAvis")
  • Development version on GitHub (with devtools):
devtools::install_github("cpsievert/LDAvis")

Getting started

Once installed, we recommend a visit to the main help page:

library(LDAvis)
help(createJSON, package = "LDAvis")

The documentation and example at the bottom of that page should provide a quick sense of how to create (and share) your own visualizations. If you want more details about the technical specifications of the visualization, see the vignette:

vignette("details", package = "LDAvis")

Note that LDAvis itself does not provide facilities for fitting the model (only visualizing a fitted model). If you want to perform LDA in R, there are several packages, including mallet, lda, and topicmodels.
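For instance, a minimal sketch (not from the README) of extracting LDAvis-ready inputs from a topicmodels fit:

# sketch: fit an LDA model with topicmodels and pull out phi/theta for LDAvis
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:100, ], k = 10, control = list(seed = 1))
post <- posterior(fit)
phi <- post$terms     # topic-term distributions (rows sum to 1)
theta <- post$topics  # document-topic distributions (rows sum to 1)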

If you want to perform LDA with the R package lda and visualize the result with LDAvis, our example of a 20-topic model fit to 2,000 movie reviews may be helpful.

LDAvis does not limit you to topic modeling facilities in R. If you use other tools (MALLET and gensim are popular), we recommend that you visit our Twenty Newsgroups example to help quickly understand what components LDAvis will need.

Sharing a Visualization

To share a visualization that you created using LDAvis, you can encode the state of the visualization into the URL by appending a string of the form:

"#topic=k&lambda=l&term=s"

to the end of the URL, where "k", "l", and "s" are strings indicating the desired values of the selected topic, the value of lambda, and the selected term, respectively. For more details, see the last section of our Movie Reviews example, or for a quick example, see the link here:

https://ldavis.cpsievert.me/reviews/vis/#topic=3&lambda=0.6&term=cop

Video demos

More documentation

To read about the methodology behind LDAvis, see our paper, which we presented at the 2014 ACL Workshop on Interactive Language Learning, Visualization, and Interfaces in Baltimore on June 27, 2014.

Additional data

We included one data set in LDAvis, 'TwentyNewsgroups', which consists of a list with 5 elements (see the sketch after this list):

  • phi, a matrix with the topic-term distributions
  • theta, a matrix with the document-topic distributions
  • doc.length, a numeric vector with token counts for each document
  • vocab, a character vector containing the terms
  • term.frequency, a numeric vector of observed term frequencies
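For example, a minimal sketch (mirroring the help-page example) that feeds these five components to createJSON and views the result:

# sketch: visualize the bundled TwentyNewsgroups data
library(LDAvis)
data(TwentyNewsgroups, package = "LDAvis")
json <- with(TwentyNewsgroups,
             createJSON(phi, theta, doc.length, vocab, term.frequency))
serVis(json)  # opens the interactive visualization in a browser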

We also created a second data-only package called LDAvisData to hold additional example data sets. Currently there are three more examples available there:

  • Movie Reviews (a 20-topic model fit to 2,000 movie reviews)
  • AP (a 40-topic model fit to 2,246 news articles)
  • Jeopardy (a 100-topic model fit to approximately 20,000 Jeopardy questions)

LDAvis's People

Contributors

benmarwick, bmabey, cpsievert, crew102, dpmccabe, kshirley, marcinkosinski, matthieubizien


LDAvis's Issues

DataTable tied to LDAvis

Hi there,

I am fairly new to the web-dev side of things. I am using Shiny and LDAvis, but would like to also add a basic DataTable, such as this one: http://shiny.rstudio.com/gallery/basic-datatable.html

In addition, I would like to have the DataTable bound to LDAvis, such that if I select a topic, then the table will filter to the respective documents.

Is this possible with LDAvis as it stands at the moment? Could someone please point me in the right direction or provide resources on what I may need to consider in doing so?

Many thanks

Rcpp crash?

Hi,
On several PC's I've had the following problem trying to run the demo.

Sys.info()
  sysname  "Windows"
  release  "7 x64"
  version  "build 7601, Service Pack 1"

Error in withCallingHandlers(tryCatch(evalq((function (hash = TRUE, parent = parent.frame(), :
  object '.rcpp_warning_recorder' not found

R then aborts.

I'm using R 3.1.2

Any suggestions?

Using LDAvis is still a bummer

A little while ago I added an issue about runVis not working, and code has since been provided that shows how to use it. However, there are still two big difficulties with using the package:

  1. runVis relies on global variables. Is there a good reason for this? It is generally considered bad form to assume that variables are available when they aren't explicitly passed as parameters to a function.
  2. It's still not clear to me how to use the package, other than to reproduce the AP example. Can you provide an example that starts with a corpus, fits an LDA model, and visualizes the topics with runVis?

The visualization itself looks awesome and I'd like to incorporate it into a project I'm working on now with PubMed. However, the package's usability is still a big issue for me.

how to display chinese with LDAvis

Hello,
I provide an example data set:

1、科学学作为一门新兴学科,如果没有科学而严谨的逻辑体系,是会妨碍它的发展的。建立科学学的逻辑体系的前提是明确科学学的研究对象、范围、方面和内容以及内容之间的逻辑联系。在建立科学学体系时,不能硬套某种僵化的框架,而要围绕科学学研究的对象、本质、关系和规律,进行创造性探讨。
2、本文阐明了生产劳动的内涵及在各种社会历史形态下的变化,科学劳动在特定历史条件下成为生产劳动的根据和途径。科学劳动队伍形成的特点以及社会主义社会中科学劳动的性质和地位等问题。
3、科学发现的过程,类似于采掘过程。它总是沿着不同物质层次(或能量级别)不断推进的,不同时代总有不同的科学成为“当采学科”。当采学科转移的历史条件,乃是当采层次所包含的基本换能效应的发现。在这一条件尚未满足以前,科学史上往往出现大规模的“回采现象”。“当采”与“回采”都是影响科学发展的重要因素,也是预测未来科学发展战略趋势的有效手段。
4、科学研究就是探索未知的新事物或新规律,并从中导致科学发现和技术发明。科学研究是一种艰苦的创造性劳动,对推进社会历史发展起着重要的作用。取得科研成果的劳动之所以非常艰巨,主要原因之一在于实验条件有时是极其恶劣的,它必然危及科学家的身体健康甚至生命。以遭受过剂量辐射损害的诺贝尔物理奖获得者来说,贝克勒尔寿命是56岁,居里夫人是67岁,费米是53岁,劳伦斯是57岁。他们
5、本文探讨了科技发展战略的内容、特点、研究方法以及制定科技发展战略需要建立的组织系统、信息系统、预测系统、指标系统、模型系统等基础性工作。
6、本文以机械部门为例,探讨了有重点地发展新技术、加快传统工业的技术改造、坚持以应用研究为主、鼓励军用向民用转移、奖励成果推广应用、加强科技情报交流、发挥学术组织作用、引进国外先进技术、加强智力开发以及改革科技管理体制等政策问题。

I use topicmodels to process this data set.
I hope you can help me display it with LDAvis. Thank you!

Add some vignettes

Just a note to myself to create some Rmd vignettes that demonstrate use cases. Primarily, a MALLET example using either knitr's engine="bash" option (see here) or system commands -- then turn the output into an interactive viz -- all within the same document.

preprocess.R error

Missing commas in preprocess.R; I had to add a trailing comma to resolve the issue after forking the repository. The install is currently failing otherwise.

cat(paste0("\n", sum(category == -1), " additional documents removed ",
           "because they consisted entirely of punctuation or rare ",
           "terms that are not in the vocabulary."))

Text in final visualization in different language than english

I could provide a Polish version of LDAvis. We could add an additional parameter, output.language, to createJSON. If this is OK, I can prepare a pull request, but I need to know whether you are reading comments on this repo and whether such a proposition is OK with you and would in the future be submitted to CRAN.

Just a question.

Nice package; I liked the option to change between topical distances.
Is adding mallet support in the pipeline? That would make it easy to play with large corpora.

rethink runVis()

Apparently it's now possible to pass arguments to a "shiny app". This would be a better approach than the current one of assuming certain objects exist in the global workspace. See here for an example.

Clarification of license

The LICENSE indicates the copyright holder but not the type of license. Could you please specify what it is and choose one if needed? (MIT, Apache, etc..)

I would like to port the R portion of LDAvis to Python and ideally reuse the JS code, so I'm wondering whether that is possible under the project's license.

Swapping phi, token.frequency, vocab, and topic.proportion rda files removes some visualization features

Hello Carson,

I'd appreciate any thoughts on what might be causing an issue with your otherwise great visualization package.

One of your tutorials generates a beautiful Shiny application. I replaced your RDA files with my own - you had RDA files for phi, topic.proportion, token.frequency, and vocab - and got a picture of the topic regions, but I do not get lists of relevant terms for the topics I click on. I also do not get bar charts of the breakdown of tokens for each topic, only a list of the overall most salient terms for the corpus.

Initially, I got a NaN error when the Shiny application tried to build. I built my model using the super-fast Vowpal Wabbit for LDA. VW requires the vocabulary size to be a power of 2 plus 1, so if your vocabulary size is not 2^N + 1, you will have some rows of zeroes in phi. My guess is those zeros made the Kullback-Leibler divergence blow up. When I forced the zero entries in phi to equal 10^-6, the app ran and gave me a beautiful picture of the overlapping topics. However, when I selected a region, I no longer automatically got bar charts of relevant terms for that cluster. That feature worked beautifully before I replaced your RDA files. The app does still tell me how much of the corpus comes from each topic, and it still lists the overall most salient terms.

I've checked my phi, topic.proportion, token.frequency, and vocab. I'd appreciate any thoughts on what might be causing the issue. Thanks again for the great visualization package,

Anthony
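A hedged workaround sketch (not from the thread): after patching zero entries in phi, renormalize each row so every topic-term distribution still sums to exactly 1, since patching alone leaves the row sums slightly above 1:

# sketch: patch zeros in phi, then renormalize rows (assumes phi is topics x terms)
phi[phi == 0] <- 1e-6
phi <- phi / rowSums(phi)  # each row sums to 1 again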

Incorrect word frequencies in topics

I have run my topic model using the stm package, collected all the input required for LDAvis, and successfully created a topic model visual.

But: the bar chart in my example indicates that, for some topics, the number of occurrences of a term within the topic (red bar) is higher than the overall term occurrences across all documents (grey). I have checked all the input data for the visualization to see if the error is in the input:

  • my topic.proportions sum up to 1 across all topics
  • the columns of my phi matrix (the topics) sum up to 1 across all terms
  • the order of terms is consistent in phi, vocab and term.frequency

In the JSON file, I spotted in "tinfo" that for many terms "Freq" is greater than "Total". So it must have something to do with the normalization?

Missing "check.inputs" function

Hi Carson,
I recently updated R and reinstalled the latest packages, but discovered that the function "check.inputs" is no longer there. I didn't find any reference to this change; is it a design change, or perhaps an omission? Thanks.
Alex

Only one visualization can be displayed on a single page

When more than one visualization is displayed on a single page, most of the controls for the bottom visualization affect the top/first vis instead of their own elements.

Click here for an example.

I or @log0ymxm will submit a PR for this sometime next week unless you beat us to it. :)

Topic numbers from lda are not the same as those from LDAvis

Hi,

I'm using the 'lda' and 'LDAvis' packages for topic modeling.
I got topic numbers from the 'lda' package using the 'topics()' function,
but they don't correspond to the topic numbers from the 'LDAvis' package.

I guess there is some calculation involving the marginal distribution, but I don't know what it is.

Can I calculate it? Or how can I know the relationship between the two sets of topic numbers?
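For reference, the createJSON source (excerpted in a later issue below) reorders topics by decreasing corpus-wide proportion before numbering them, so the mapping between the two numberings can be reconstructed; a hedged sketch, assuming the theta and doc.length you passed to createJSON:

# sketch: recover the LDAvis topic ordering
topic.frequency <- colSums(theta * doc.length)
topic.proportion <- topic.frequency / sum(topic.frequency)
o <- order(topic.proportion, decreasing = TRUE)
# LDAvis topic k corresponds to original topic o[k]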

LDAvis display Chinese problem

I am a Chinese user of LDAvis. When I display Chinese text, it comes out garbled (mojibake):

lda.vis <- topicmodels_json_ldavis(title.abstract.model, myCorp, tfidf_dtm)

# visualize the lda model
serVis(lda.vis)

How can I make this work with UTF-8 or GBK encoding?
Please help me! Thank you!

createJSON causes segfault

I'm using mallet and LDAvis to analyze a corpus of approximately 1 million documents, following this example pretty closely: http://cpsievert.github.io/LDAvis/reviews/reviews.html

After the model is fit, I run checkInputs and it tells me everything looks good. However, once I call createJSON, R crashes with the following information:

Problem signature:
Problem Event Name: APPCRASH
Application Name: rsession.exe
Application Version: 0.98.1028.0
Application Timestamp: 53ed1f34
Fault Module Name: lapack.dll
Fault Module Version: 3.2.63987.0
Fault Module Timestamp: 5243035d
Exception Code: c0000005
Exception Offset: 000000000000464c
OS Version: 6.1.7601.2.1.0.18.10
Locale ID: 1033
Additional Information 1: 24cf
Additional Information 2: 24cfe87c8f9b583e72c7671cd3b0a66d
Additional Information 3: 557d
Additional Information 4: 557d717948b0b58e03da185573deb738

For reference, I'm using an HPC virtual desktop running Windows 2008 R2 Enterprise with 256 GB of RAM. I was able to save and transfer the object resulting from checkInputs (the "out" object from the example) to another machine (Mac OS X), and createJSON was successful.

How to combine LDAvis output into a shiny app

Dear Carson,

I recently discovered the LDAvis package and I am highly interested in the outputs you have been able to generate in a browser via the serVis function (I have no JS knowledge, so it's great to be able to leverage such tools in my analysis).
Working as part of a data-scientist team, I would like to be able to integrate the serVis output into some of our own Shiny apps, in order to enrich our text-mining capabilities.
Is that feasible? If yes, which render* and which *Output functions should be used?
Congratulations again on this great package!

Best regards,

Thomas
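A hedged sketch of one way this could look, assuming the renderVis() and visOutput() Shiny bindings that LDAvis ships (they are mentioned in a later issue about shinydashboard); the model inputs are assumed to already exist:

library(shiny)
library(LDAvis)

ui <- fluidPage(
  visOutput("ldavis")  # placeholder for the LDAvis widget
)

server <- function(input, output) {
  output$ldavis <- renderVis(
    createJSON(phi = phi, theta = theta, doc.length = doc.length,
               vocab = vocab, term.frequency = term.frequency)
  )
}

shinyApp(ui, server)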

computing doc.length with mallet

Hi,
I used to have a previous version of LDAvis (2014) installed with devtools.
In the version I had of LDAvis I would call createJSON as:
json <- createJSON(K, phi, term.frequency, vocab, topic.proportions)

Today I updated my R packages and have a newer version of LDAvis (from CRAN), which uses createJSON as:
json <- createJSON(phi, theta, doc.length, vocab, term.frequency)

I'm using MALLET for the LDA. I can easily access the phi and theta matrices, as well as vocab and term.frequency, but not so much doc.length.
According to the LDAvis documentation, it's a vector containing the number of tokens in each document of the corpus.

Question: how can I construct such vector from a MALLET instance (mallet.import)?

Thanks!
G.
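One hedged possibility (an assumption, not an answer from the thread): in the mallet R package, the unsmoothed, unnormalized document-topic counts sum, per document, to that document's token count:

# sketch: derive doc.length from a fitted MalletLDA model
doc.topic.counts <- mallet::mallet.doc.topics(topic.model,
                                              smoothed = FALSE,
                                              normalized = FALSE)
doc.length <- rowSums(doc.topic.counts)  # tokens per document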

LDAvis in Jupyter Notebook

Hi,

I'm currently working with the R kernel for Jupyter Notebooks and trying to include LDAvis output in such notebooks. Unfortunately this does not work, in contrast to usual plots, which work flawlessly. It only works when the visualisation is opened in a new browser window.

The Python adaptation pyLDAvis offers a little helper function with which visualisations can be included directly. Is this also possible for the R version?

Cheers

Percentage of explained variance in axes of final plot

At first it looks like PCA is performed as the multi-dimensional scaling that represents topics in 2 dimensions - that is the conclusion suggested by the axis labels.

But when one looks closely at the default jsPCA function, it appears the reduction is performed on a dissimilarity matrix (not on the regular data set); what is more, the cmdscale function (used in jsPCA) performs the dimension reduction, not prcomp.

  • How is this relevant to the PCA in the axis labels?
  • Shouldn't you write MDS1 and MDS2 instead of PCA1 and PCA2?
  • If cmdscale performs operations semi-similar to a PCA computation, is it possible to calculate an eigenvalue for every component and present the percentage of explained variance on the axes next to the PCAx text?
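A hedged sketch of the computation being requested, assuming the dist.mat built inside jsPCA is available; stats::cmdscale returns eigenvalues when eig = TRUE:

# sketch: percentage of "variance" captured by each MDS coordinate
fit <- stats::cmdscale(dist.mat, k = 2, eig = TRUE)
pct <- 100 * fit$eig / sum(abs(fit$eig))  # abs() guards against small negative eigenvalues
round(pct[1:2], 1)                        # shares for the two plotted axes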

Error with 2 topics

I was using this package's createJSON() function with a 2-topic model and received this error:
Error in stats::cmdscale(dist.mat, k = 2) : 'k' must be in {1, 2, .. n - 1}
Then I tested with the reproducible example given here:
http://cpsievert.github.io/LDAvis/reviews/reviews.html
by setting K = 2 and keeping everything else the same, and ran into this error again in createJSON().

Looking at the source code of createJSON(), the issue is in the function jsPCA().
In jsPCA(), when K = 2, dist.mat comes out to be a single value, which throws an error on the line
pca.fit <- stats::cmdscale(dist.mat, k = 2)

Any advice on how to get past this error?
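A hedged workaround sketch (not an official fix): createJSON() accepts an mds.method argument, so a drop-in variant of jsPCA that falls back to one MDS dimension when K = 2 might look like this (jsPCA2 is a hypothetical name):

# sketch: a jsPCA variant that tolerates K = 2
jsPCA2 <- function(phi) {
  jensenShannon <- function(x, y) {
    m <- 0.5 * (x + y)
    0.5 * sum(x * log(x/m)) + 0.5 * sum(y * log(y/m))
  }
  dist.mat <- proxy::dist(x = phi, method = jensenShannon)
  k <- min(2, nrow(phi) - 1)       # cmdscale requires k <= n - 1
  fit <- stats::cmdscale(dist.mat, k = k)
  if (k < 2) fit <- cbind(fit, 0)  # pad a zero second coordinate
  data.frame(x = fit[, 1], y = fit[, 2])
}

# usage: createJSON(phi, theta, doc.length, vocab, term.frequency, mds.method = jsPCA2)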

runVis() returns an error.

> library(LDAvis)
Loading required package: shiny
Loading required package: MASS
Loading required package: proxy

Attaching package: 'proxy'

The following objects are masked from 'package:stats':

    as.dist, dist

Loading required package: plyr
Loading required package: reshape2
> runVis()
Error in eval(expr, envir, enclos) : object 'topic.proportion' not found

Plot dimensions of LDAvis for shiny/shinydashboard

Hi,
Is there an effective way of adjusting plot dimensions (e.g. width, height) of LDAvis?
I am using it in shinydashboard (via the renderVis and visOutput functions), but the plots exceed the dashboard body's dimensions. Using box doesn't make a difference...
Thank you.
Marcelo Pita

[LDAvis] in IE + responsive design

Dear Carson,

Thanks to your answer to issue #27, I am now able to integrate LDAvis output in my shiny app. I have 2 observations:

  • The output displays in Google Chrome and Safari, but not in Internet Explorer: is there a way to fix this issue?
  • Is there a way to make the width of the output responsive (i.e. so that it resizes according to the browser's width when the user resizes)? The current width seems to be fixed.

Many thanks in advance for your views on this, and congratulations again for this great tool!

Best regards,

Thomas

Error: could not find function "check.inputs"

I am really interested in LDAvis and its cool visualizations. I have the basic code running, e.g. on the "Twenty Newsgroups" data. However, the generated visualization does not have the "cluster" button, the "topic distance calculation" button, etc., like "https://gallery.shinyapps.io/LDAelife/" does. I want to have all the buttons on the top.

I encounter a problem when I try a different code source: "Error: could not find function "check.inputs"". This code is from "https://github.com/cpsievert/cpsievert.github.com/blob/master/elife/elife.R".

The problem comes from this line: "z <- check.inputs(K=max(topic.id), W=max(token.id), phi, token.frequency, vocab, topic.proportion)", which gives "Error: could not find function "check.inputs"".

I am new to R and appreciate your feedback on this.

Thank you.

question: how are the topic numbers put into the circles?

Hello! I see that createJSON orders the topics in order of decreasing frequency to put the numbers in the circles. But does that correspond to the ordering of the columns in theta and/or the rows in phi? I ask because I am "zooming in" on topics recursively and running lda again to show subtopics (after pigeonholing each document into only one topic). Right now, my code assumes the numbering of the circles corresponds to the ordering of the columns in theta, but I believe that's not true. Thanks! :)

topic term frequency miscalculation

As reported and discussed in #32, some model fits produce erroneous visualizations. Specifically, red bars are not monotonically decreasing in size down the vertical axis when lambda = 1. See the second viz in this notebook: http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=

As @kshirley pointed out near the end of #32 this is occurring due to how the topic.term.frequency is being calculated/estimated.

LDAvis showing tokens not present in the data in the visualization.

Hi all, I cleaned a CSV file using the {tm} package and fitted an LDA collapsed Gibbs sampler model to it. After that I created a visualization for the model using {LDAvis}, but the visualization shows strange tokens that are not in the data, like "=" and numbers in quotes ("18"), etc.

URGENT help required.

dist() with jensenShannon returns NaN

I can't quite figure out why (the Jensen-Shannon distance function looks okay), but

jensenShannon <- function(x, y) {
  m <- 0.5 * (x + y)
  0.5 * sum(x * log(x/m)) + 0.5 * sum(y * log(y/m))
}

dist.mat <- proxy::dist(x = phi, method = jensenShannon)

returns NaN using phi.

dist() with jensenShannon returns NaN

I really enjoy this package and appreciate your work on it. I've previously used it successfully, but updated the package recently and now get an error that I previously had not encountered.

I can't quite figure out why (the Jensen-Shannon distance function looks okay), but

jensenShannon <- function(x, y) {
  m <- 0.5 * (x + y)
  0.5 * sum(x * log(x/m)) + 0.5 * sum(y * log(y/m))
}

dist.mat <- proxy::dist(x = parems$phi, method = jensenShannon)

returns NaN using phi.
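A hedged guess at the cause (not confirmed in this thread): if phi contains zeros, R evaluates 0 * log(0/m) as 0 * -Inf = NaN, so any topic-term distribution with a zero entry poisons the distance. A sketch that applies the 0 * log(0) = 0 convention by skipping zero entries:

# sketch: Jensen-Shannon divergence that skips zero entries
jensenShannonSafe <- function(x, y) {
  m <- 0.5 * (x + y)
  part <- function(p) sum(p[p > 0] * log(p[p > 0] / m[p > 0]))
  0.5 * part(x) + 0.5 * part(y)
}

dist.mat <- proxy::dist(x = phi, method = jensenShannonSafe)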

createJSON creates Infinite values for loglift

This problem might be due to my lack of understanding of preprocessing in mallet, but it might also be helpful/relevant to others so I'm posting anyway.

In a nutshell, the problem is that LDAvis::createJSON must have values greater than zero in the term.frequency parameter and it is fairly easy to accidentally include zeros when using mallet for preprocessing. Consider this example --

library(moviereviews)
data(reviews, package = "moviereviews")
reviews <- sapply(reviews, function(x) paste(x, collapse = ""))
library(tm) # just for the stopwords()
library(mallet) # for the model fitting
writeLines(stopwords(), "stopwords.txt")
doc.ids <- as.character(seq_along(reviews))
mallet.instances <- mallet.import(doc.ids, reviews, "stopwords.txt")
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(mallet.instances)
word.freqs <- mallet.word.freqs(topic.model)
# Eliminate infrequent words
stopwordz <- as.character(subset(word.freqs, term.freq <= 5)$words)
subset(word.freqs, term.freq == 0)

          words term.freq doc.freq
12641 parillaud         6        1

writeLines(c(stopwords(), stopwordz, "s", "t"), "stopwords.txt")
# Re-'initiate' topic model without the infrequent words
mallet.instances <- mallet.import(doc.ids, reviews, "stopwords.txt")
topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(mallet.instances)
word.freqs <- mallet.word.freqs(topic.model)
subset(word.freqs, term.freq == 0)

          words term.freq doc.freq
12641 parillaud         0        0

What seems to be happening is that Mallet automatically throws away "small" documents. This can cause "newly infrequent terms" (in this case 'parillaud' doesn't occur at all) even though we've removed infrequent terms once already. I'm not sure what the best approach is to avoid this situation, but at the very least createJSON should throw a warning if any values in term.frequency are zero.
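Until then, a hedged defensive sketch (an assumption about the pipeline, not package behavior): filter zero-frequency terms out of every aligned input before calling createJSON:

# sketch: drop zero-frequency terms from all aligned inputs
# (assumes phi columns, vocab, and term.frequency share the same term order)
keep <- term.frequency > 0
phi <- phi[, keep]
phi <- phi / rowSums(phi)  # renormalize each topic-term distribution
vocab <- vocab[keep]
term.frequency <- term.frequency[keep]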

possible bug in createJSON function when calculating top relevant terms

In making a Python port (almost done!), I think I may have found a bug in the R code. I'm probably wrong, since I am not proficient in R, but the problem I see is that the most relevant terms for each topic are being attributed to other topics. I've pulled out the relevant parts of code from the createJSON function and added comments about where I think the bug is:

createJSON <- function(phi = matrix(), theta = matrix(), doc.length = integer(), 
                       vocab = character(), term.frequency = integer(), R = 30, 
                       lambda.step = 0.01, mds.method = jsPCA, cluster, 
                       plot.opts = list(xlab = "PC1", ylab = "PC2"), 
                       ...) {
  # ...
  dt <- dim(theta)
  K <- dt[2]

  # asserts...

  topic.frequency <- colSums(theta * doc.length)
  topic.proportion <- topic.frequency/sum(topic.frequency)

  # re-order the K topics in order of decreasing proportion:
  o <- order(topic.proportion, decreasing = TRUE)
  ## N.B: phi is being reordered
  phi <- phi[o, ]
  # ... other reorderings

  #...

  phi <- t(phi)

  # ...
  ## N.B. topic_seq is being created with a consecutive range from 1->K
  topic_seq <- rep(seq_len(K), each = R)

  # ...
  find_relevance <- function(i) {
    relevance <- i*log(phi) + (1 - i)*log(lift)
    idx <- apply(relevance, 2, 
                 function(x) order(x, decreasing = TRUE)[seq_len(R)])
    # for matrices, we pick out elements by their row/column index
    indices <- cbind(c(idx), topic_seq)
    data.frame(Term = vocab[idx], Category = category,
               ### ** BUG?? here we index using topic_seq, but phi has been reordered above.
               ###   This seems like we are attributing top terms to the wrong topic.
               ###   For example, if our phi has been reordered so that topic 5 is the top
               ###   topic then the top terms for topic 5 will not be denoted as the
               ###   top terms for topic 1!
               logprob = round(log(phi[indices]), 4),
               loglift = round(log(lift[indices]), 4),
               stringsAsFactors = FALSE)
  }

  # .. code that calls find_relevance
  # ...
}

In summary I think this line:

topic_seq <- rep(seq_len(K), each = R)

should really be:

topic_seq <- rep(o, each = R)

Missing bar widths when estimated frequency is less than 1

I noticed this while playing around with my xkcd post. When I choose a very small value of lambda, the term rankings look odd:

[screenshot: term bar chart at a small value of lambda, showing odd rankings]

I also see these error messages:

[screenshot: browser console error messages]

Consider the term "bruce". Here is the data that createJSON() sends to the browser:

     Term logprob loglift      Freq Total Category
631 bruce -6.5192   3.338 0.8203917     2  Topic22

Note the estimated term frequency (within topic 22) is less than one. However, in ldavis.js, we set the lower bound of the domain to be 1. It should really be 0.

LDAvis URL - DESCRIPTION and website

Why not create an index file for the address cpsievert.github.io/LDAvis/ ?
It could include a table of contents like:

  1. Reviews -> cpsievert.github.io/LDAvis/reviews/reviews.html
  2. the same for the rest of the folders

Then this URL could be added to the DESCRIPTION file under the URL field: https://github.com/cpsievert/LDAvis/blob/master/DESCRIPTION#L34

I have no idea how I found the cpsievert.github.io/LDAvis/reviews/reviews.html example; the other examples I only found after visiting the gh-pages branch of this repository. Maybe not all users are so familiar with GitHub, and sharing your website in DESCRIPTION is a good idea?

createJSON error: Error in term.topic.frequency[as.matrix(tinfo[c("Category", "Term")])] : subscript out of bounds

Hi,

First off, congratulations on the amazing work on LDAvis!

I'm trying to use this package to explore themes in a set of tweets (see dataset attached https://s3.amazonaws.com/rtmpre8k2d5awybtxxod-segue/tweets.csv) using the code below.

Yet I'm facing the following problem:
The createJSON function returns the following error: Error in term.topic.frequency[as.matrix(tinfo[c("Category", "Term")])] : subscript out of bounds

Any idea on what the issue could be?
Thanks in advance for your answer.

library(LDAvis)
library(tm)
library(lda)

tweets.raw <- read.csv("/Rwd/tweets.csv")

txt <- as.list(as.character(tweets.raw$contents))
nms <- as.list(as.character(tweets.raw$authname))
nms <- gsub("[^[:graph:]]", " ",nms)
tweets <- setNames(txt, nms)
tweets <- sapply(tweets, function(x) paste(x, collapse = " "))

stop_words <- stopwords("SMART")

# tweets preparation
tweets <- gsub("'", "", tweets)  
tweets <- gsub("[[:punct:]]", " ", tweets)  
tweets <- gsub("[[:cntrl:]]", " ", tweets) 
tweets <- gsub("[^[:graph:]]", " ",tweets)
tweets <- tolower(tweets) 

# tokenize
doc.list <- strsplit(tweets, "[[:space:]]+")

#table of terms
term.table <- table(unlist(doc.list))
term.table <- sort(term.table, decreasing = TRUE)

#remove terms that are stop words or occur fewer than 5 times:
del <- names(term.table) %in% stop_words | term.table < 5
term.table <- term.table[!del]
vocab <- names(term.table)

#format to lda package:
get.terms <- function(x) {
  index <- match(x, vocab)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
documents <- lapply(doc.list, get.terms)

#the data set stats
D <- length(documents) 
W <- length(vocab)  
doc.length <- sapply(documents, function(x) sum(x[2, ]))  
N <- sum(doc.length) 
term.frequency <- as.integer(term.table)

# MCMC and model tuning parameters:
K <- 20
#G <- 5000
G <- 100
alpha <- 0.02
eta <- 0.02

# Fit the model:
set.seed(357)
t1 <- Sys.time()
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab, 
                                   num.iterations = G, alpha = alpha, 
                                   eta = eta, initial = NULL, burnin = 0,
                                   compute.log.likelihood = TRUE)
t2 <- Sys.time()
t2 - t1  

theta <- t(apply(fit$document_sums + alpha, 2, function(x) x/sum(x)))
phi <- t(apply(t(fit$topics) + eta, 2, function(x) x/sum(x)))

tweets.list <- list(phi = phi,
                     theta = theta,
                     doc.length = doc.length,
                     vocab = vocab,
                     term.frequency = term.frequency)

#create the JSON object to feed the visualization:
json <- createJSON(phi = tweets.list$phi, 
                   theta = tweets.list$theta, 
                   doc.length = tweets.list$doc.length, 
                   vocab = tweets.list$vocab, 
                   term.frequency = tweets.list$term.frequency)

#serVis(json, out.dir = 'vis', open.browser = FALSE)

Definition of doc.length and term.frequency

Dear Carson,

Thank you for the excellent topic modelling visualization tool!

In the doc.length object, is the number of tokens that appear in a document the number of unique tokens or the total number of tokens? For instance, if my document is {Larry, Larry, Larry}, would the corresponding entry in doc.length be 3 or 1? I am guessing "1".

In the term.frequency object, is the number of times a term appears the sum over documents of the number of unique appearances of the term in each document (i.e. a sum of numbers that, for each document, are either 1 or 0), or is it simply the total number of times the term appears in any document?

Thank you again

How to create .rda data

Hi,
Not an issue, just wanting to use LDAvis with a custom dataset. Are there any guidelines or some sort of tutorial on how to create a dataset from a bunch of text files in .rda format? Since LDAvis comes with APdata.rda, I'm trying to create my own .rda dataset in order to get LDAvis running with a custom dataset.
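A minimal sketch, assuming you have already computed the five components listed in the README (phi, theta, doc.length, vocab, term.frequency); an .rda file is just R objects written with base R's save():

# sketch: bundle precomputed LDAvis inputs into an .rda file and use it later
MyData <- list(phi = phi, theta = theta, doc.length = doc.length,
               vocab = vocab, term.frequency = term.frequency)
save(MyData, file = "MyData.rda")

# later / elsewhere:
load("MyData.rda")
json <- with(MyData, createJSON(phi, theta, doc.length, vocab, term.frequency))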

Phi test and error message in createJSON

Hi, I was running into this issue when using LDAvis:

> createJSON(phi=phi_n, theta = theta_n, doc.length = doc.length, vocab=vocab_vector, term.frequency=term_frequency)
Error in createJSON(phi = phi_n, theta = theta_n, doc.length = doc.length,  : 
  Columns of phi don't all sum to 1.

But the columns were in fact normalized to sum to 1 by previously running:

phi_n <- sweep(phi, 2, colSums(phi), FUN="/")

So, I checked the source code:
https://github.com/cpsievert/LDAvis/blob/master/R/createJSON.R#L144-L147

And it seems that the phi_test and the output message don't match:

phi_test is testing for phi rows summing to 1, and the error message is saying that the phi columns should sum to 1.

Thanks for this library, looks great!
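Given that the test actually checks rows, a hedged fix is to normalize phi across rows (topics) rather than columns; a sketch:

# sketch: normalize each row of phi so every topic-term distribution sums to 1
phi_n <- sweep(phi, 1, rowSums(phi), FUN = "/")
stopifnot(all(abs(rowSums(phi_n) - 1) < 1e-8))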

runShiny error

I am trying to run the examples but get this error:

Listening on http://127.0.0.1:7258
Joining by: Term
Joining by:
Error in eval(substitute(expr), envir, enclos) : object 'Freq2' not found
Error: object 'Freq2' not found

The issued commands were:
z <- with(APdata, check.inputs(K = 40, W = 10473, phi, term.frequency, vocab, topic.proportion))
with(z, runShiny(phi, term.frequency, vocab, topic.proportion))
