qtalr / book

An Introduction to Quantitative Text Analysis for Linguistics: Reproducible Research Using R

Home Page: https://qtalr.com

Lua 0.14% CSS 0.41% TeX 99.27% JavaScript 0.02% Shell 0.12% R 0.04%
linguistics r text-analysis textbooks tidyverse

book's Introduction

An Introduction to Quantitative Text Analysis for Linguistics

Reproducible Research Using R

Book

The goal of this textbook is to provide readers with foundational knowledge and practical skills in quantitative text analysis using the R programming language. It is geared towards advanced undergraduates, graduate students, and researchers looking to expand their methodological toolbox. It assumes no prior knowledge of programming or quantitative methods and prioritizes practical application and intuitive understanding over technical details.

By the end of this textbook, readers will be able to identify, interpret and evaluate data analysis procedures and results to support research questions within language science. Additionally, readers will gain experience in designing and implementing research projects that involve processing and analyzing textual data employing modern programming strategies. This textbook aims to instill a strong sense of reproducible research practices, which are critical for promoting transparency, verification, and sharing of research findings.

Author

Dr. Jerid Francom is Associate Professor of Spanish and Linguistics at Wake Forest University. His research focuses on the use of language corpora from a variety of sources (news, social media, and other internet sources) to better understand the linguistic and cultural similarities and differences between language varieties for both scholarly and pedagogical projects. He has published on topics including the development, annotation, and evaluation of linguistic corpora and analyzed corpora through corpus, psycholinguistic, and computational methodologies. He also has experience working with and teaching statistical programming with R.

License

This work by Jerid C. Francom is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Citation

@book{Francom2025,
  title = {An {{Introduction}} to {{Quantitative Text Analysis}} for {{Linguistics}}: {{Reproducible Research Using R}}},
  shorttitle = {An {{Introduction}} to {{Quantitative Text Analysis}} for {{Linguistics}}},
  author = {Francom, Jerid},
  year = {2025},
  publisher = {Routledge},
  urldate = {2024-07-30},
  isbn = {978-1-03-249426-5},
  langid = {english},
}

book's People

Contributors: francojc

book's Issues

AA: only cover normal, skewed distros and no tests

It is sufficient to discuss normal and skewed distributions and how log transformation can help reduce skewing.

No need for kurtosis or skewness scores; we will be using simulation-based inference, so a nuanced understanding of distributions is not necessary.

The **Normal Distribution** is a theoretical distribution where the values are symmetrically dispersed around the central tendency (mean/median). In terms we can now understand, this means that the mean and median are the same. The Normal Distribution is important because many statistical tests assume that the data distribution is normal or near normal.
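To illustrate the point about skew, here is a minimal R sketch using simulated log-normal draws (not the book's data) showing how a log transformation pulls the mean and median of a right-skewed variable back together:

```r
# Simulated example: log-normal draws are right-skewed
set.seed(123)
freqs <- rlnorm(1000, meanlog = 2, sdlog = 1)

# Skewed: the mean is pulled well above the median
mean(freqs)
median(freqs)

# After a log transformation, mean and median are roughly equal
mean(log(freqs))
median(log(freqs))

# Visual check: the transformed values are near-symmetric
hist(log(freqs))
```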

Remove unweighted log odds

The log odds calculation should just jump to the weighted version, as we have already established that the unweighted version does not take into account sub-corpus size.

The `tidylo` package provides a convenient function `bind_log_odds()` to calculate the log odds ratio, and a weighted variant, for each type in each sub-corpus. Let's use this function to calculate the log odds ratio for each lemma in each modality, as seen in @exm-eda-masc-log-odds.
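A minimal sketch of the weighted calculation, using a hypothetical two-modality count table rather than the MASC data (`lemma_counts` and its values are invented for illustration):

```r
library(dplyr)
library(tidylo)

# Hypothetical lemma counts per modality (invented values)
lemma_counts <- tribble(
  ~modality, ~lemma,  ~n,
  "spoken",  "yeah",  50,
  "spoken",  "the",  200,
  "written", "yeah",   2,
  "written", "the",  250
)

# bind_log_odds() adds a weighted log odds column, log_odds_weighted
lemma_counts |>
  bind_log_odds(set = modality, feature = lemma, n = n) |>
  arrange(desc(log_odds_weighted))
```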

Transform datasets: remove verbose section on bigrams

The bigrams section goes into more detail than necessary; just make reference to the fact that other unit types can be used to tokenize text. Then move on to collapsing.

As we create derived datasets to explore, let's also create bigram tokens. We can do this by changing the `token` parameter to `"ngrams"` and specifying the value for $n$ with the `n` parameter. I will assign the result to `cabnc_bigrams_tbl` as we will have two-word tokens, as seen in @exm-td-cabnc-tokenization-bigrams-tidytext.
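For reference, the tokenization call in question can be sketched as follows; `cabnc_tbl` and its single utterance are invented stand-ins for the real CABNC table:

```r
library(dplyr)
library(tidytext)

# Invented single-utterance stand-in for the CABNC table
cabnc_tbl <- tibble(
  doc_id = 1,
  utterance = "well I suppose we could go"
)

# Bigram tokens: token = "ngrams" with n = 2
cabnc_bigrams_tbl <-
  cabnc_tbl |>
  unnest_tokens(bigram, utterance, token = "ngrams", n = 2)

cabnc_bigrams_tbl
#> bigrams: "well i", "i suppose", "suppose we", "we could", "could go"
```

Note that `unnest_tokens()` lowercases the text by default.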

Skip to result instead of breaking out spoken and written

book/exploration.qmd

Lines 694 to 695 in 7df19b6

We can further cull our results by filtering out lemmas that are not well-dispersed across the sub-corpora. Although it may be tempting to use the threshold we used earlier, we should consider that the sizes of the sub-corpora are different and the distribution of the dispersion measure may be different. With this in mind, we need to visualize the distribution of the dispersion measure for each modality, as seen in @fig-eda-masc-dispersion-threshold.

The result found with the elbow method is enough.
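One way the per-modality filtering could look, assuming the elbow-derived cutoffs are stored in a small lookup table (all names and values below are hypothetical):

```r
library(dplyr)

# Invented dispersion scores per lemma and modality
masc_disp_tbl <- tribble(
  ~modality, ~lemma,  ~dispersion,
  "spoken",  "yeah",  0.85,
  "spoken",  "erm",   0.20,
  "written", "the",   0.99,
  "written", "hence", 0.35
)

# Hypothetical per-modality cutoffs, e.g. read off the elbow in each curve
cutoffs <- tribble(
  ~modality, ~cutoff,
  "spoken",  0.50,
  "written", 0.40
)

# Filter against each modality's own threshold, not a single global one
masc_disp_tbl |>
  left_join(cutoffs, by = "modality") |>
  filter(dispersion >= cutoff)
```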

Fix the cumulative distro figure

Get the distro figure to show a line, not columns. Also add an x-axis scale that is appropriate. Add annotations for the 10 and 100 most frequent lemmas.

book/exploration.qmd

Lines 371 to 372 in 7df19b6

#| label: fig-eda-masc-count-cumulative
#| fig-cap: "Cumulative frequency of lemmas in the MASC dataset"
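A sketch of the requested revision, with simulated counts standing in for the MASC lemma frequencies; `geom_line()`, a log x-axis, and annotations at ranks 10 and 100 address the three points above:

```r
library(dplyr)
library(ggplot2)

# Simulated counts standing in for the MASC lemma frequencies
set.seed(42)
masc_counts_tbl <- tibble(
  lemma = paste0("lemma_", 1:5000),
  n = sort(rpois(5000, 10) + 1, decreasing = TRUE)
)

masc_counts_tbl |>
  mutate(rank = row_number(),
         cum_prop = cumsum(n) / sum(n)) |>
  ggplot(aes(x = rank, y = cum_prop)) +
  geom_line() +                                  # a line, not columns
  scale_x_log10() +                              # appropriate x-axis scale
  geom_vline(xintercept = c(10, 100), linetype = "dashed") +
  annotate("text", x = 10, y = 0.95, label = "top 10", hjust = -0.1) +
  annotate("text", x = 100, y = 0.85, label = "top 100", hjust = -0.1) +
  labs(x = "Lemma rank (log scale)", y = "Cumulative proportion")
```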

Using `kable()`, `kableExtra()`, and markdown tables

I should take a look at my tables and ensure that I'm consistent with the way I'm using them.

~doc_id, ~type, ~line_id, ~line,

  • I need to review the use of kable() across the book, as I think I am using it in different ways
    • kable()
    • kable(booktabs = TRUE)
    • kable(booktabs = TRUE) |> kable_styling()
    • kable(booktabs = TRUE) |> kable_styling(latex_options = c("striped", "hold_position"))
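If the fullest variant above is adopted as the single house style, every table call would reduce to the same pipeline, for example:

```r
library(knitr)
library(kableExtra)

# One consistent style across the book (the fullest variant listed above)
head(mtcars[, 1:3]) |>
  kable(booktabs = TRUE) |>
  kable_styling(latex_options = c("striped", "hold_position"))
```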

Summarize DTM as lesson 'Advanced objects' will cover

The Advanced objects lesson will cover the DTM, so this description is redundant.

book/exploration.qmd

Lines 1043 to 1044 in 7df19b6

To recast a data frame into a DTM, we can use the `cast_dtm()` function from the `tidytext` package. This function takes a data frame with a document identifier, a feature identifier, and a value for each observation and casts it into a matrix. Operations such as normalization are easily and efficiently performed in R on matrices, so initially we can cast a frequency table of lemmas and part-of-speech tags into a matrix and then normalize the matrix by documents. For the $tf-idf$ measure we use the `bind_tf_idf()` function from the `tidytext` package. This function takes a tidy frequency table with a term, a document, and a count for each observation and calculates the $tf-idf$ measure for each feature in each document. This is a normalized measure, so we do not need to normalize the matrix by documents. Let's see how this works with the MASC dataset in @exm-eda-masc-dtms.
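A minimal sketch of both steps on a toy frequency table (the table and its values are invented; the real example uses the MASC dataset):

```r
library(dplyr)
library(tidytext)

# Invented lemma counts per document
masc_freq_tbl <- tribble(
  ~doc_id, ~lemma, ~n,
  "doc1",  "the",  10,
  "doc1",  "cat",   2,
  "doc2",  "the",   8,
  "doc2",  "dog",   3
)

# Recast the tidy counts as a document-term matrix
masc_dtm <- masc_freq_tbl |>
  cast_dtm(document = doc_id, term = lemma, value = n)

# tf-idf is computed on the tidy table itself
masc_tfidf_tbl <- masc_freq_tbl |>
  bind_tf_idf(term = lemma, document = doc_id, n = n)
```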

Preface questions

Revise Questions.

Technical exercises: make sure that these align with, and do not overlap with, the Recipe/Lab.

Update the MASC dataset to that used in Recipe 7

The transformed dataset from Recipe 7 is cleaner.

To remove non-words:

pos

  • CD, FW, LS, SYM

lemma

  • ^\W$

Our first pass at calculating lemma frequency in @exm-eda-masc-count should bring something else to our attention. As we can see, among the most frequent lemmas are non-words such as `,` and `.`. As you can imagine, given the conventions of written and transcriptional language, these types are very frequent. For a frequency analysis focusing on words, however, we should probably remove them. Thinking ahead, there may also be other non-words that we want to remove, such as symbols, numbers, *etc*. Let's take a look at @fig-eda-masc-pos, where I've counted the part-of-speech tags `pos` in the dataset to see what other non-words we might want to remove.
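The removal described above could be sketched as a single `filter()` call; the miniature `masc_tbl` below is a hypothetical stand-in for the real tokens table:

```r
library(dplyr)
library(stringr)

# Invented miniature tokens table
masc_tbl <- tribble(
  ~lemma, ~pos,
  "cat",  "NN",
  ",",    ",",
  "3",    "CD",
  "&",    "SYM"
)

# Drop numerals (CD), foreign words (FW), list markers (LS), symbols (SYM),
# and single non-word-character lemmas (punctuation)
masc_words_tbl <- masc_tbl |>
  filter(!pos %in% c("CD", "FW", "LS", "SYM"),
         !str_detect(lemma, "^\\W$"))

masc_words_tbl   # only the "cat" row remains
```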

Orientation part overview

The Orientation part needs to be revised so that the overview is more representative of the new configuration.

Citations for packages not appearing

R and some R packages provide structured datasets that are available for use directly within R. For example, the `languageR` package [@R-languageR] provides the `dative` dataset, which is a dataset containing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection. The `janeaustenr` package [@R-janeaustenr] provides the `austen_books` dataset, which is a dataset of Jane Austen's novels. Package datasets are loaded into an R session using either the `data()` function, if the package is loaded, or the `::` operator, if the package is not loaded. For example, `data(dative)` or `languageR::dative`.
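The two loading strategies mentioned above, written out (assuming the `languageR` package is installed):

```r
# Strategy 1: attach the package, then load the dataset into the session
library(languageR)
data(dative)

# Strategy 2: access the dataset without attaching the package
dative <- languageR::dative

head(dative)
```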

Add idealized dataset in the CABNC orientation section

The other cases in this chapter include a tabular idealization. This case should follow suit.

Now, let's envision a scenario in which we want to organize a dataset that can be used in a study that aims to investigate the relationship between speaker demographics and utterances. An ideal dataset would contain information about speakers and their utterances. In their original format, the CABNC datasets separate information about utterances and speakers in separate tables, `cabnc_utterances` and `cabnc_participants`, respectively. The idealized dataset, then, will combine the variables from each of these tables into a single dataset.
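The combination described above amounts to a join on the shared speaker identifier; the miniature tables and column names below are hypothetical stand-ins for the real CABNC data:

```r
library(dplyr)

# Invented miniature versions of the two CABNC tables
cabnc_utterances <- tribble(
  ~doc_id, ~speaker_id, ~utterance,
  "KB1",   "PS002",     "well I suppose so"
)

cabnc_participants <- tribble(
  ~speaker_id, ~age, ~sex,
  "PS002",     34,   "female"
)

# Idealized dataset: utterances with speaker demographics attached
cabnc_tbl <-
  cabnc_utterances |>
  left_join(cabnc_participants, by = "speaker_id")
```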

Copy `skim()` output and add as plain text

Let's get a high-level summary of the variables in the dataset. We can use the `skim()` function from the `skimr` package [@R-skimr] to get a summary of the variables in the dataset^[Note I've modified the output of `skim()` for display purposes.].

The skimr output does not format well in the knitted document.
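One possible way to capture the `skim()` output as plain text for pasting into the source (a sketch, not necessarily the approach the book will take):

```r
library(skimr)

# Capture the printed skim() summary as plain-text lines,
# which can then be pasted into the document verbatim
skim_txt <- capture.output(skim(iris))
writeLines(skim_txt)
```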

Create GH publish

Create a GH workflow for publishing to the gh-pages branch, using `_freeze` for computations. Then remove the `docs` folder from the repository.
