quanteda / tutorials.quanteda.io

Quanteda tutorials website

Home Page: https://tutorials.quanteda.io

License: MIT License

HTML 26.81% CSS 46.75% JavaScript 26.41% R 0.02%

quanteda hugo blogdown

tutorials.quanteda.io's Introduction

quanteda tutorials website

The quanteda tutorials website is built with Hugo using the Learn theme. Because Hugo accepts only Markdown and HTML, we use blogdown to generate those files from R Markdown.

How to add new pages?

You can add new pages to the content folder, but note that the file extension must be .Rmarkdown, not .rmd, because blogdown converts .rmd to .html and .Rmarkdown to .markdown. After adding pages, you will probably have to rebuild the website by clicking the 'Build Website' button in the Build panel in RStudio.

How to edit pages?

If you have blogdown installed, you can run blogdown::serve_site() in the console to check how your pages will look. It starts a web server on your local machine so that you can preview the changes in your browser. HTML files are saved in the 'public' folder.

How to update files on the server?

After editing the website locally, commit all the changes and push them to the master branch. Netlify will then detect the changes and update the files on the server for you within a minute or so.


tutorials.quanteda.io's Issues

Tutorial as pdf?

Apologies if this is not the right place to ask, but I was wondering whether there is a way to get a PDF version of the tutorial website. Many thanks, also for this wonderful package.

encoding is not specified properly

I noticed that ISO-8859-1 (Latin 1) is set as the character encoding for all the European languages:
https://tutorials.quanteda.io/import-data/multiple-files/

This is inappropriate because many European languages are not covered by Latin 1 (and the text is actually corrupted in GR and LV). The most appropriate approach is to specify ISO-8859-1, ISO-8859-2, ISO-8859-3, etc., but that requires the file names to identify the language...
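
To illustrate why the declared encoding matters, here is a minimal base R sketch (not taken from the tutorial) that decodes the same byte under two different ISO-8859 encodings. The byte value is chosen only for illustration, and the sketch assumes the platform's iconv supports both encodings.

```r
# The single byte 0xB3 stands for different characters in different encodings
raw_byte <- rawToChar(as.raw(0xB3))

# Decoded as ISO-8859-1 (Latin 1) it becomes U+00B3, superscript three
as_latin1 <- iconv(raw_byte, from = "ISO-8859-1", to = "UTF-8")

# Decoded as ISO-8859-2 (Latin 2) it becomes U+0142, the letter l with stroke
as_latin2 <- iconv(raw_byte, from = "ISO-8859-2", to = "UTF-8")

# Show the resulting Unicode code points
print(utf8ToInt(as_latin1))
print(utf8ToInt(as_latin2))
```

A file read with the wrong one of these encodings will therefore load without an error but contain the wrong characters.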

dfm_trim comment is not correct

tutorials.quanteda.io/content/machine-learning/topicmodel.en.Rmarkdown

Lines 32 to 40 in 995c5d2

    
Further, after removal of function words and punctuation in `dfm()`, we keep only the top 5% of the most frequent features (`min_termfreq = 0.8`) that appear in less than 10% of all documents (`max_docfreq = 0.1`) using `dfm_trim()` to focus on common but distinguishing features.

```{r}
toks_news <- tokens(corp_news_2016, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol = TRUE)
toks_news <- tokens_remove(toks_news, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))
dfmat_news <- dfm(toks_news) %>%
              dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile",
                       max_docfreq = 0.1, docfreq_type = "prop")
```

This does not in fact select the top 5% of features. Rather, it selects the features at or above the 80th percentile of frequency, so it keeps the top 20% of features.
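
The point about the quantile threshold can be checked on toy data. The sketch below uses base R only and hypothetical frequencies. It uses the default `quantile()` method, which is not necessarily the one `dfm_trim()` uses internally, so treat it as an illustration of the idea and not of quanteda's implementation.

```r
# Toy feature frequencies (hypothetical values, not from the tutorial corpus)
freq <- c(the = 50, news = 20, vote = 12, tax = 9, poll = 7,
          bank = 5, rail = 4, coal = 3, moss = 2, zinc = 1)

# Frequency value at the 80th percentile of the distribution
threshold <- quantile(freq, probs = 0.8, names = FALSE)

# Features whose frequency is at or above that threshold
kept <- names(freq)[freq >= threshold]

# Share of features retained
share_kept <- length(kept) / length(freq)
print(kept)
print(share_kept)
```

With a threshold at the 0.8 quantile, about one fifth of the features survive; a threshold of 0.95 would be needed to keep roughly the top 5%.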

code object names inconsistent in textstat_collocations()

A user reports:

"This morning, I got stuck on this example:

https://tutorials.quanteda.io/advanced-operations/compound-mutiword-expressions/
which creates "tstat_col_cap"

I ran the earlier example:

https://tutorials.quanteda.io/statistical-analysis/collocation/

That started similarly but created "tstat_col_caps" with an "s" on the end. Rather than start over, I just pasted the change onto the bottom, which didn't work because of the "s". It's minor, but if you could make those two consistent, it would smooth the road for future readers."

Korean text from historical sources leaves brackets

Once tokens() has been used, extractNouns() from the KoNLP library doesn't work (and vice versa).
extractNouns() has the required tokenizer but doesn't recognize brackets, which tokens() does. This leaves words with brackets unfiltered.

Examples: "목석간담(木石肝膽)이", "일변(一邊)".

Thus,
extractNouns() --> as.tokens()
tokenizes correctly --> leaves brackets

as.tokens() --> extractNouns()
tokenizes wrongly --> fails

A corresponding regex pattern with tokens_remove() just removes the entire token "일변(一邊)".
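
One possible workaround, sketched here in base R, is to strip the bracketed part from each string instead of removing the whole token. This assumes that the goal is to drop the parenthesised hanja gloss and keep the surrounding hangul; it has not been tested together with KoNLP.

```r
# The two example strings from the report above
tokens_raw <- c("목석간담(木石肝膽)이", "일변(一邊)")

# Delete each "(...)" group, keeping the text before and after it
tokens_clean <- gsub("\\([^)]*\\)", "", tokens_raw)

print(tokens_clean)
```

The same substitution could be applied to the raw text before tokenization, so that neither tokenizer ever sees the brackets.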

Add more language-specific pages

We have pages on how to process texts in different languages. We want to cover all the languages but the following are the languages with priority.

textstat_dist fails with Error in validityMethod(as(object, superClass))

I'm trying to run the code on the tutorial page for DOCUMENT/FEATURE SIMILARITY

require(quanteda)
require(quanteda.textstats)
toks_inaug <- tokens(data_corpus_inaugural)
dfmat_inaug <- dfm(toks_inaug, remove = stopwords("en"))
tstat_dist <- as.dist(textstat_dist(dfmat_inaug))
clust <- hclust(tstat_dist)
plot(clust, xlab = "Distance", ylab = NULL)

The call to textstat_dist triggers the following error:

Error in validityMethod(as(object, superClass)) : 
  object 'packedMatrix_validate' not found

Solutions tried:

Clearing workspace & restarting R for clean environment.

First I tried removing punctuation from the tokens, then changing the dfm call to remove the stop words separately (as the remove argument doesn't seem to be valid):

toks_inaug <- tokens(data_corpus_inaugural, remove_punct = T)
dfmat_inaug <- dfm_remove(dfm(toks_inaug), pattern = stopwords("en"))

Same error. Also tried just calling textstat_dist(dfmat_inaug) directly (without as.dist).

I double-checked the class of dfmat_inaug and confirmed it had valid data:

class(dfmat_inaug)
[1] "dfm"
attr(,"package")
[1] "quanteda"
topfeatures(dfmat_inaug, 10)
    people government         us        can       must       upon      great        may     states 
       584        564        505        487        376        371        344        343        334 
     world 
       319

I've also tried some of the other data_corpus data sets from textstats, same again.

I notice when I load the quanteda library, I get the following warning messages:

require(quanteda)
Loading required package: quanteda
Package version: 3.2.1
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 4 of 4 threads used.
See https://quanteda.io for tutorials and examples.
Warning messages:
1: In .recacheSubclasses(def@className, def, env) :
  undefined subclass "packedMatrix" of class "replValueSp"; definition not updated
2: In .recacheSubclasses(def@className, def, env) :
  undefined subclass "packedMatrix" of class "mMatrix"; definition not updated

System:

Win10 x64
R4.1.3
quanteda 3.2.1

All packages as per https://tutorials.quanteda.io/introduction/install/ installed with latest versions.

Update example object names according to the new wiki

After the extensive discussions and the updated style guide, we should apply the same rules (in particular object names and references) for the tutorials. I will adjust the code accordingly in the next couple of days.

Add acknowledgments section

In the past, we received excellent improvement suggestions from various users through Pull Requests. I suggest we add an acknowledgments section on the main page, listing the contributors' names and GitHub handles (R for Data Science is a good example of this approach). What do you think, @koheiw?

In newsmap, iso.alpha is not a function

When running the tutorial code in the newsmap section,
RStudio returns an error:
iso.alpha is not a function

The problem disappeared after rerunning the code.

Unable to instantiate any of the param variables within textplot_wordcloud

None of the parameters can be set as in the example below:
textplot_wordcloud(my_dfm, min_size=4, min_count =4, max_words = 10)
According to the warnings, each of these arguments "is not a graphical parameter". Any thoughts? Thanks.

Add short section on how to import pre-tokenized text

Hi,

Once in a while I happen to get a dataset that is already pre-tokenized (dataframe with columns for tokens and with doc_id). Every time that happens I need to search forever to figure out how to coerce that format to something that quanteda likes.

Maybe it is already in the docs, but google fails me when I try to search for that.

My solution is this one, but I am not sure whether that is the best way:

library(quanteda)
library(BTM)

# example data from the BTM package
data("brussels_reviews_anno")

# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")

# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id

# transform to quanteda dfm
converted_corpus <- l |> quanteda::as.tokens() |> 
  quanteda::dfm()
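
A slightly shorter variant of the same idea, sketched with a hypothetical data frame in place of the BTM example data: `split()` builds the named list in one step, and `as.tokens()` accepts it directly. Note that `split()` orders the documents by sorted doc_id, not by their order of appearance. The quanteda calls are guarded so the sketch still runs where the package is not installed.

```r
# Hypothetical pre-tokenized data: one row per token
df <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("nice", "quiet", "flat", "great", "host"),
  stringsAsFactors = FALSE
)

# Named list with one character vector of tokens per document
tok_list <- split(df$token, df$doc_id)

# Coerce the list to a tokens object and build a dfm, if quanteda is available
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- quanteda::as.tokens(tok_list)
  dfmat <- quanteda::dfm(toks)
  print(dfmat)
}
```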

quanteda.corpora not found

Hello.
This is my first report, plus I am short on time here. So I hope I am following the guidelines.

I cannot run the examples for TOPIC MODELS.
1st error I get:
Loading required package: quanteda.corpora
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘quanteda.corpora’

I tried to look for the package in R but I could not find it.

Best regards
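
For context, quanteda.corpora is distributed through GitHub and is not on CRAN, which is why a search from within R finds nothing. A minimal sketch of installing it follows; the helper name is hypothetical, and the call is left commented out because it needs network access.

```r
# Hypothetical helper: install quanteda.corpora from GitHub if it is missing
install_quanteda_corpora <- function() {
  if (!requireNamespace("quanteda.corpora", quietly = TRUE)) {
    # remotes provides install_github(); install it first if needed
    if (!requireNamespace("remotes", quietly = TRUE)) {
      install.packages("remotes")
    }
    remotes::install_github("quanteda/quanteda.corpora")
  }
  # Report invisibly whether the package can now be loaded
  invisible(requireNamespace("quanteda.corpora", quietly = TRUE))
}

# Uncomment to run (requires an internet connection):
# install_quanteda_corpora()
```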

Extend section on topic modelling

One of the most frequently asked questions is how to use a dfm for topic models. We cover topic models in one section, but I think we should also show how to convert the dfm object to a stm topic model and how to run a basic operation.

We might also add links to tutorials on topic models, such as:

What's your view, @koheiw?

textstat_valence not working

Hi,

My code was working perfectly a week ago. Now, whenever I try to use textstat_valence I get the following error:
"Error in UseMethod("valence") :
no applicable method for 'valence' applied to an object of class "dictionary3""

I see that some other people have experienced the same issue (https://stackoverflow.com/questions/71997257/r-quanteda-sentiment-error-in-usemethod-valence-no-applicable-method-for), but I didn't find any solutions.

Does anyone have any idea on how to solve the issue?

Images are not linked from HTML files

After rebuilding using new Hugo and the theme, images disappeared, for example:
https://tutorials.quanteda.io/machine-learning/wordfish/
This applies to all the pages.

Rename repo?

Should we rename this repo tutorials.quanteda.io ?

Update and extend Quanteda tutorials

Now that quanteda v2 is on CRAN, @koheiw and I should use this opportunity to revise, edit, and – if necessary – extend the Quanteda tutorials pages. In particular, we should add more explanations and comments to the chapters and check whether v2 allows implementing some of the operations more efficiently.

Find a way to generate slides from same content

Is there a way to generate slides with less text, from the same content as the tutorials that have more text and explanation?

The slides would be good for teaching, the text good for reading.

quanteda not loading in R after update to macOS Sonoma

I have already reinstalled the newest versions of both quanteda and R, but the issue continues.

Here is what I am getting in R when trying to load it:

library(quanteda)
Package version: 3.3.1
Unicode version: 14.0
ICU version: 71.1

*** caught segfault ***
address 0x245, cause 'invalid permissions'

Traceback:
1: RcppParallel::defaultNumThreads()
2: get_threads()
3: unname(min(get_threads(), na.rm = TRUE))
4: get_options_default()
5: quanteda_initialize()
6: quanteda_options(initialize = TRUE)
7: fun(libname, pkgname)
8: doTryCatch(return(expr), name, parentenv, handler)
9: tryCatchOne(expr, names, parentenv, handlers[[1L]])
10: tryCatchList(expr, classes, parentenv, handlers)
11: tryCatch(fun(libname, pkgname), error = identity)
12: runHook(".onAttach", ns, dirname(nspath), nsname)
13: attachNamespace(ns, pos = pos, deps, exclude, include.only)
14: doTryCatch(return(expr), name, parentenv, handler)
15: tryCatchOne(expr, names, parentenv, handlers[[1L]])
16: tryCatchList(expr, classes, parentenv, handlers)
17: tryCatch({ attr(package, "LibPath") <- which.lib.loc ns <- loadNamespace(package, lib.loc) env <- attachNamespace(ns, pos = pos, deps, exclude, include.only)}, error = function(e) { P <- if (!is.null(cc <- conditionCall(e))) paste(" in", deparse(cc)[1L]) else "" msg <- gettextf("package or namespace load failed for %s%s:\n %s", sQuote(package), P, conditionMessage(e)) if (logical.return && !quietly) message(paste("Error:", msg), domain = NA) else stop(msg, call. = FALSE, domain = NA)})
18: library(quanteda)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

topic modeling example

I would like to know how to convert the output of the last line in the example into a data frame:

head(topics(lda), 20)

Thank you!
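
A minimal sketch of one way to do this, assuming that topics() returns a named factor with one element per document (the stand-in object below is hypothetical and only mimics that shape):

```r
# Stand-in for the result of topics(lda): a factor named by document
topic_assign <- factor(c(text1 = "topic2", text2 = "topic1", text3 = "topic2"))

# One row per document: its name and its assigned topic
topic_df <- data.frame(
  doc_id = names(topic_assign),
  topic  = unname(as.character(topic_assign)),
  stringsAsFactors = FALSE
)

print(head(topic_df, 20))
```

With the real model, replacing the stand-in with `topics(lda)` should give the same layout if the assumption about its return value holds.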
