
tutorials.quanteda.io's Introduction

quanteda tutorials website

The quanteda tutorials website is built with Hugo, using the Learn theme. Since Hugo accepts only Markdown and HTML, we use blogdown to generate those files from R Markdown.

How to add new pages?

You can add new pages to the content folder, but note that the file extension must be .Rmarkdown, not .rmd, because blogdown converts .rmd to .html and .Rmarkdown to .markdown. After adding pages, you will probably have to rebuild the website by clicking the 'Build Website' button in the Build panel in RStudio.

How to edit pages?

If you have blogdown installed, you can run blogdown::serve_site() in the console to check how your pages will look. It starts a web server on your local machine so that you can preview the changes in your browser. The generated HTML files are saved in the 'public' folder.

How to update files on the server?

After editing the website locally, commit all the changes and push them to the master branch. Netlify will then detect the changes and update the files on the server for you within a minute or so.

tutorials.quanteda.io's People

Contributors

alexander-poon, andybega, broepke, cdedatos, dani-lbnl, datapumpernickel, datastrategist, enzedonline, fgeeri, fghjorth, klarahan, koheiw, lanabi, mohamednasr1, rochelleterman, rohanalexander, sarahashleyking, sarahjewett, stefan-mueller, yuanzhouir


tutorials.quanteda.io's Issues

Tutorial as pdf?

Apologies if this is not the right place to ask, but I was wondering whether there is a way to get a PDF version of the tutorial website. Many thanks, also for this wonderful package.

dfm_trim comment is not correct

In

Further, after removal of function words and punctuation in `dfm()`, we keep only the top 5% of the most frequent features (`min_termfreq = 0.8`) that appear in less than 10% of all documents (`max_docfreq = 0.1`) using `dfm_trim()` to focus on common but distinguishing features.
```{r}
toks_news <- tokens(corp_news_2016, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol = TRUE)
toks_news <- tokens_remove(toks_news, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))
dfmat_news <- dfm(toks_news) %>%
dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile",
max_docfreq = 0.1, docfreq_type = "prop")
```

this does not in fact select the top 5% of features. Rather, it selects the features that occur in the 80th percentile of frequency or higher. (So it's the top 20% of features.)

See also quanteda/quanteda#2200.
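
The arithmetic behind this point can be checked without quanteda. The following is a minimal base R sketch, assuming that a quantile-type threshold keeps every feature whose frequency is at or above the given quantile; the frequency vector is invented for illustration and this is not a reproduction of the internals of `dfm_trim()`.

```r
# Hypothetical feature frequencies: 200 distinct values, so there are no ties
set.seed(42)
term_freq <- sample(1:1000, 200)

# Threshold at the 0.8 quantile, mirroring min_termfreq = 0.8 with a quantile type
threshold <- quantile(term_freq, probs = 0.8, names = FALSE)

# Keep features at or above the threshold and compute the share that survives
kept <- term_freq[term_freq >= threshold]
share_kept <- length(kept) / length(term_freq)
share_kept
```

With distinct frequencies the surviving share is one fifth of the features, not one twentieth. With real data, ties at the threshold can push the share above that.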

code object names inconsistent in textstat_collocations()

A user reports:

"This morning, I got stuck on this example:

https://tutorials.quanteda.io/advanced-operations/compound-mutiword-expressions/
which creates "tstat_col_cap"

I ran the earlier example:

https://tutorials.quanteda.io/statistical-analysis/collocation/

That started similarly but created "tstat_col_caps" with an "s" on the end. Rather than start over, I just pasted the change onto the bottom, which didn't work due to the "s". It's minor, but if you could make those two consistent, it would smooth the road for future readers."

Korean text from historical sources leaves brackets

Once tokens() has been used, extractNouns() from the KoNLP library doesn't work (and vice versa).
extractNouns() has the required tokenizer but doesn't recognize brackets, which tokens() does. This leaves words with brackets unfiltered.

Example is: "목석간담(木石肝膽)이", "일변(一邊)".

Thus,
extractNouns() --> as.tokens()
tokenizes correctly --> leaves brackets

as.tokens() --> extractNouns()
tokenizes wrongly --> fails

A corresponding regex pattern with tokens_remove() just removes the entire token "일변(一邊)".
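
One possible workaround, sketched here in base R, is to strip the bracketed part from each string instead of removing whole tokens. This assumes the goal is to drop the parenthesised gloss and keep the surrounding Korean text, and that the brackets are plain ASCII parentheses. The helper name `strip_brackets` is hypothetical and is not part of quanteda or KoNLP.

```r
# Hypothetical helper: delete any "(...)" span inside each string
strip_brackets <- function(x) {
  gsub("\\([^)]*\\)", "", x)
}

# The two examples from the report
raw <- c("목석간담(木石肝膽)이", "일변(一邊)")
cleaned <- strip_brackets(raw)
cleaned
```

The same function could be applied to the raw text before it is passed to extractNouns(), or to each element of a token list before as.tokens(). Which order works best is left open here.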

textstat_dist fails with Error in validityMethod(as(object, superClass))

I'm trying to run the code on the tutorial page for DOCUMENT/FEATURE SIMILARITY

require(quanteda)
require(quanteda.textstats)
toks_inaug <- tokens(data_corpus_inaugural)
dfmat_inaug <- dfm(toks_inaug, remove = stopwords("en"))
tstat_dist <- as.dist(textstat_dist(dfmat_inaug))
clust <- hclust(tstat_dist)
plot(clust, xlab = "Distance", ylab = NULL)

The call to textstat_dist triggers the following error:

Error in validityMethod(as(object, superClass)) : 
  object 'packedMatrix_validate' not found

Solutions tried:

Clearing workspace & restarting R for clean environment.

First I tried removing punctuation from the tokens, then changing the dfm call to remove the stop words separately (as the original argument doesn't seem to be valid):

toks_inaug <- tokens(data_corpus_inaugural, remove_punct = T)
dfmat_inaug <- dfm_remove(dfm(toks_inaug), pattern = stopwords("en"))

Same error. Also tried just calling textstat_dist(dfmat_inaug) directly (without as.dist).

I double-checked the class of dfmat_inaug and confirmed that it contains valid data:

class(dfmat_inaug)
[1] "dfm"
attr(,"package")
[1] "quanteda"
topfeatures(dfmat_inaug, 10)
    people government         us        can       must       upon      great        may     states 
       584        564        505        487        376        371        344        343        334 
     world 
       319 

I've also tried some of the other data_corpus data sets from textstats, same again.

I notice when I load the quanteda library, I get the following warning messages:

require(quanteda)
Loading required package: quanteda
Package version: 3.2.1
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 4 of 4 threads used.
See https://quanteda.io for tutorials and examples.
Warning messages:
1: In .recacheSubclasses(def@className, def, env) :
  undefined subclass "packedMatrix" of class "replValueSp"; definition not updated
2: In .recacheSubclasses(def@className, def, env) :
  undefined subclass "packedMatrix" of class "mMatrix"; definition not updated

System:

Win10 x64
R4.1.3
quanteda 3.2.1

All packages as per https://tutorials.quanteda.io/introduction/install/ installed with latest versions.

Add acknowledgments section

In the past, we received excellent improvement suggestions from various users through Pull Requests. I suggest we add an acknowledgments section on the main page, listing the contributors' names and GitHub handles (R for Data Science is a good example of this approach). What do you think, @koheiw?

Add short section on how to import pre-tokenized text

Hi,

Once in a while I happen to get a dataset that is already pre-tokenized (dataframe with columns for tokens and with doc_id). Every time that happens I need to search forever to figure out how to coerce that format to something that quanteda likes.

Maybe it is already in the docs, but Google fails me when I search for it.

My solution is this one, but I am not sure whether that is the best way:

library(quanteda)
library(BTM)

# example data from the BTM package
data("brussels_reviews_anno")

# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")

# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id

# transform to quanteda dfm
converted_corpus <- l |> quanteda::as.tokens() |> 
  quanteda::dfm()
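
A shorter route to the same named list is base R's split(), which groups one vector by another. The sketch below uses an invented data frame in place of the BTM example data, and the quanteda calls are guarded so that the list-building step runs on its own. Note that split() orders the documents by sorted doc_id, not by first appearance.

```r
# Hypothetical pre-tokenized data: one row per token, with its document id
pretok <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("quanteda", "is", "fast", "tokens", "everywhere"),
  stringsAsFactors = FALSE
)

# split() returns a named list of character vectors, one element per document
tok_list <- split(pretok$token, pretok$doc_id)

# as.tokens() accepts a named list; this part runs only if quanteda is installed
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- quanteda::as.tokens(tok_list)
  dfmat <- quanteda::dfm(toks)
}
```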

quanteda.corpora not found

Hello.
This is my first report, plus I am short on time here. So I hope I am following the guidelines.

I cannot make the examples for TOPIC MODELS.
1st error I get:
Loading required package: quanteda.corpora
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘quanteda.corpora’

I tried to look for the package in R but I could not find it.

Best regards
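
For readers hitting the same warning: quanteda.corpora is distributed through GitHub rather than CRAN, which is why a search from within R does not find it. The following is a minimal sketch of one way to install it, assuming the remotes package and network access are available. The helper name `install_if_missing` is hypothetical.

```r
# Hypothetical helper: install a GitHub-hosted package only when it is missing
install_if_missing <- function(pkg, repo) {
  # Nothing to do if the package can already be loaded
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(TRUE))
  }
  # remotes provides install_github(); fetch it from CRAN if needed
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github(repo)
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage (needs network access, so it is left commented out here):
# install_if_missing("quanteda.corpora", "quanteda/quanteda.corpora")
```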

textstat_valence not working

Hi,

My code was working perfectly a week ago. Now, whenever I try to use textstat_valence I get the following error:
Error in UseMethod("valence") :
no applicable method for 'valence' applied to an object of class "dictionary3"

I see that some other people have experienced the same issue (https://stackoverflow.com/questions/71997257/r-quanteda-sentiment-error-in-usemethod-valence-no-applicable-method-for), but I didn't find any solutions.

Does anyone have any idea on how to solve the issue?

Rename repo?

Should we rename this repo tutorials.quanteda.io ?

Update and extend Quanteda tutorials

Now that quanteda v2 is on CRAN, @koheiw and I should use this opportunity to revise, edit, and – if necessary – extend the Quanteda tutorials pages. In particular, we should add more explanations and comments to the chapters and check whether v2 allows implementing some of the operations more efficiently.

Find a way to generate slides from same content

Is there a way to generate slides with less text, from the same content as the tutorials that have more text and explanation?

The slides would be good for teaching, the text good for reading.

quanteda not loading in R after update to macOS Sonoma

I have already reinstalled the newest versions of both quanteda and R, but the issue persists.

Here is what I am getting in R when trying to load it:

library(quanteda)
Package version: 3.3.1
Unicode version: 14.0
ICU version: 71.1

*** caught segfault ***
address 0x245, cause 'invalid permissions'

Traceback:
1: RcppParallel::defaultNumThreads()
2: get_threads()
3: unname(min(get_threads(), na.rm = TRUE))
4: get_options_default()
5: quanteda_initialize()
6: quanteda_options(initialize = TRUE)
7: fun(libname, pkgname)
8: doTryCatch(return(expr), name, parentenv, handler)
9: tryCatchOne(expr, names, parentenv, handlers[[1L]])
10: tryCatchList(expr, classes, parentenv, handlers)
11: tryCatch(fun(libname, pkgname), error = identity)
12: runHook(".onAttach", ns, dirname(nspath), nsname)
13: attachNamespace(ns, pos = pos, deps, exclude, include.only)
14: doTryCatch(return(expr), name, parentenv, handler)
15: tryCatchOne(expr, names, parentenv, handlers[[1L]])
16: tryCatchList(expr, classes, parentenv, handlers)
17: tryCatch({ attr(package, "LibPath") <- which.lib.loc ns <- loadNamespace(package, lib.loc) env <- attachNamespace(ns, pos = pos, deps, exclude, include.only)}, error = function(e) { P <- if (!is.null(cc <- conditionCall(e))) paste(" in", deparse(cc)[1L]) else "" msg <- gettextf("package or namespace load failed for %s%s:\n %s", sQuote(package), P, conditionMessage(e)) if (logical.return && !quietly) message(paste("Error:", msg), domain = NA) else stop(msg, call. = FALSE, domain = NA)})
18: library(quanteda)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

topic modeling example

I would like to know how to convert the output of the last line in the example into a data frame:

head(topics(lda), 20)

Thank you!
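
One way this could be done, sketched in base R: the code assumes that topics(lda) returns a named vector or factor with one entry per document, and uses an invented stand-in object so that it runs without a fitted model. With a real model, `doc_topics <- topics(lda)` would take its place.

```r
# Stand-in for the value of topics(lda): a named factor, one entry per document (assumed shape)
doc_topics <- factor(c(text1 = "topic1", text2 = "topic3", text3 = "topic1"))

# Document names become one column, the assigned topic the other
topics_df <- data.frame(
  doc_id = names(doc_topics),
  topic  = as.character(doc_topics),
  stringsAsFactors = FALSE
)
head(topics_df, 20)
```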
