qtalr / book

An Introduction to Quantitative Text Analysis for Linguistics: Reproducible Research Using R

Home Page: https://qtalr.com

Lua 0.14% CSS 0.41% TeX 99.27% JavaScript 0.02% Shell 0.12% R 0.04%
linguistics r text-analysis textbooks tidyverse

book's Introduction

An Introduction to Quantitative Text Analysis for Linguistics

Reproducible Research Using R

Book

The goal of this textbook is to provide readers with foundational knowledge and practical skills in quantitative text analysis using the R programming language. It is geared towards advanced undergraduates, graduate students, and researchers looking to expand their methodological toolbox. It assumes no prior knowledge of programming or quantitative methods and prioritizes practical application and intuitive understanding over technical details.

By the end of this textbook, readers will be able to identify, interpret and evaluate data analysis procedures and results to support research questions within language science. Additionally, readers will gain experience in designing and implementing research projects that involve processing and analyzing textual data employing modern programming strategies. This textbook aims to instill a strong sense of reproducible research practices, which are critical for promoting transparency, verification, and sharing of research findings.

Author

Dr. Jerid Francom is Associate Professor of Spanish and Linguistics at Wake Forest University. His research focuses on the use of language corpora from a variety of sources (news, social media, and other internet sources) to better understand the linguistic and cultural similarities and differences between language varieties for both scholarly and pedagogical projects. He has published on topics including the development, annotation, and evaluation of linguistic corpora and analyzed corpora through corpus, psycholinguistic, and computational methodologies. He also has experience working with and teaching statistical programming with R.

License

This work by Jerid C. Francom is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Citation

@book{Francom2025,
  title = {An {{Introduction}} to {{Quantitative Text Analysis}} for {{Linguistics}}: {{Reproducible Research Using R}}},
  shorttitle = {An {{Introduction}} to {{Quantitative Text Analysis}} for {{Linguistics}}},
  author = {Francom, Jerid},
  year = {2025},
  publisher = {Routledge},
  urldate = {2024-07-30},
  isbn = {978-1-03-249426-5},
  langid = {english},
}

book's People

Contributors: francojc

book's Issues

AA: only cover normal, skewed distros and no tests

It is sufficient to discuss normal and skewed distributions and how log transformation can help reduce skewing.

No need for kurtosis or skewness scores; we will be using simulation-based inference, so a nuanced understanding of distributions is not necessary.

The **Normal Distribution** is a theoretical distribution where the values are symmetrically dispersed around the central tendency (mean/median). In terms we can now understand, this means that the mean and median are the same. The Normal Distribution is important because many statistical tests assume that the data distribution is normal or near normal.
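To illustrate the point about skew, here is a minimal R sketch using simulated log-normal draws (not the book's data) showing how a log transformation pulls the mean and median of a right-skewed variable back together:

```r
# Simulated example: log-normal draws are right-skewed
set.seed(123)
freqs <- rlnorm(1000, meanlog = 2, sdlog = 1)

# Skewed: the mean is pulled well above the median
mean(freqs)
median(freqs)

# After a log transformation, mean and median are roughly equal
mean(log(freqs))
median(log(freqs))

# Visual check: the transformed values are near-symmetric
hist(log(freqs))
```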

Remove unweighted log odds

The log odds calculation should just jump to the weighted version, as we have already established that the unweighted version does not take into account sub-corpus size.

The `tidylo` package provides a convenient function `bind_log_odds()` to calculate the log odds ratio, and a weighted variant, for each type in each sub-corpus. Let's use this function to calculate the log odds ratio for each lemma in each modality, as seen in @exm-eda-masc-log-odds.
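A minimal sketch of the weighted calculation, using a hypothetical two-modality count table rather than the MASC data (`lemma_counts` and its values are invented for illustration):

```r
library(dplyr)
library(tidylo)

# Hypothetical lemma counts per modality (invented values)
lemma_counts <- tribble(
  ~modality, ~lemma,  ~n,
  "spoken",  "yeah",  50,
  "spoken",  "the",  200,
  "written", "yeah",   2,
  "written", "the",  250
)

# bind_log_odds() adds a weighted log odds column, log_odds_weighted
lemma_counts |>
  bind_log_odds(set = modality, feature = lemma, n = n) |>
  arrange(desc(log_odds_weighted))
```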

Transform datasets: remove verbose section on bigrams

The bigrams section goes into more detail than necessary; just make reference to the fact that other unit types can be used to tokenize text. Then move on to collapsing.

As we create derived datasets to explore, let's also create bigram tokens. We can do this by changing the `token` parameter to `"ngrams"` and specifying the value for $n$ with the `n` parameter. I will assign the result to `cabnc_bigrams_tbl` as we will have two-word tokens, as seen in @exm-td-cabnc-tokenization-bigrams-tidytext.
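For reference, the tokenization call in question can be sketched as follows; `cabnc_tbl` and its single utterance are invented stand-ins for the real CABNC table:

```r
library(dplyr)
library(tidytext)

# Invented single-utterance stand-in for the CABNC table
cabnc_tbl <- tibble(
  doc_id = 1,
  utterance = "well I suppose we could go"
)

# Bigram tokens: token = "ngrams" with n = 2
cabnc_bigrams_tbl <-
  cabnc_tbl |>
  unnest_tokens(bigram, utterance, token = "ngrams", n = 2)

cabnc_bigrams_tbl
#> bigrams: "well i", "i suppose", "suppose we", "we could", "could go"
```

Note that `unnest_tokens()` lowercases the text by default.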

Skip to result instead of breaking out spoken and written

book/exploration.qmd

Lines 694 to 695 in 7df19b6

We can further cull our results by filtering out lemmas that are not well-dispersed across the sub-corpora. Although it may be tempting to use the threshold we used earlier, we should consider that the sizes of the sub-corpora are different and the distribution of the dispersion measure may be different. With this in mind, we need to visualize the distribution of the dispersion measure for each modality, as seen in @fig-eda-masc-dispersion-threshold.

The result found with the elbow method is enough.
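One way the per-modality filtering could look, assuming the elbow-derived cutoffs are stored in a small lookup table (all names and values below are hypothetical):

```r
library(dplyr)

# Invented dispersion scores per lemma and modality
masc_disp_tbl <- tribble(
  ~modality, ~lemma,  ~dispersion,
  "spoken",  "yeah",  0.85,
  "spoken",  "erm",   0.20,
  "written", "the",   0.99,
  "written", "hence", 0.35
)

# Hypothetical per-modality cutoffs, e.g. read off the elbow in each curve
cutoffs <- tribble(
  ~modality, ~cutoff,
  "spoken",  0.50,
  "written", 0.40
)

# Filter against each modality's own threshold, not a single global one
masc_disp_tbl |>
  left_join(cutoffs, by = "modality") |>
  filter(dispersion >= cutoff)
```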

Fix the cumulative distro figure

Get the distro figure to show a line, not columns. Also add an x-axis scale that is appropriate. Add annotations for the 10 and 100 most frequent lemmas.

book/exploration.qmd

Lines 371 to 372 in 7df19b6

#| label: fig-eda-masc-count-cumulative
#| fig-cap: "Cumulative frequency of lemmas in the MASC dataset"
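A sketch of the requested revision, with simulated counts standing in for the MASC lemma frequencies; `geom_line()`, a log x-axis, and annotations at ranks 10 and 100 address the three points above:

```r
library(dplyr)
library(ggplot2)

# Simulated counts standing in for the MASC lemma frequencies
set.seed(42)
masc_counts_tbl <- tibble(
  lemma = paste0("lemma_", 1:5000),
  n = sort(rpois(5000, 10) + 1, decreasing = TRUE)
)

masc_counts_tbl |>
  mutate(rank = row_number(),
         cum_prop = cumsum(n) / sum(n)) |>
  ggplot(aes(x = rank, y = cum_prop)) +
  geom_line() +                                  # a line, not columns
  scale_x_log10() +                              # appropriate x-axis scale
  geom_vline(xintercept = c(10, 100), linetype = "dashed") +
  annotate("text", x = 10, y = 0.95, label = "top 10", hjust = -0.1) +
  annotate("text", x = 100, y = 0.85, label = "top 100", hjust = -0.1) +
  labs(x = "Lemma rank (log scale)", y = "Cumulative proportion")
```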

Using `kable()`, `kableExtra()`, and markdown tables

I should take a look at my tables and ensure that I'm consistent with the way I'm using them.

~doc_id, ~type, ~line_id, ~line,

  • I need to review the use of kable() across the book, as I think I am using it in different ways
    • kable()
    • kable(booktabs = TRUE)
    • kable(booktabs = TRUE) |> kable_styling()
    • kable(booktabs = TRUE) |> kable_styling(latex_options = c("striped", "hold_position"))
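If the fullest variant above is adopted as the single house style, every table call would reduce to the same pipeline, for example:

```r
library(knitr)
library(kableExtra)

# One consistent style across the book (the fullest variant listed above)
head(mtcars[, 1:3]) |>
  kable(booktabs = TRUE) |>
  kable_styling(latex_options = c("striped", "hold_position"))
```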

Summarize DTM as lesson 'Advanced objects' will cover

The Advanced objects lesson will cover the DTM, so this description is redundant.

book/exploration.qmd

Lines 1043 to 1044 in 7df19b6

To recast a data frame into a DTM, we can use the `cast_dtm()` function from the `tidytext` package. This function takes a data frame with a document identifier, a feature identifier, and a value for each observation and casts it into a matrix. Operations such as normalization are easily and efficiently performed in R on matrices, so initially we can cast a frequency table of lemmas and part-of-speech tags into a matrix and then normalize the matrix by documents. For the $tf-idf$ measure we use the `bind_tf_idf()` function from the `tidytext` package. This function takes a tidy frequency table with a term, a document, and a count for each observation and calculates the $tf-idf$ measure for each feature in each document. This is a normalized measure, so we do not need to normalize the matrix by documents. Let's see how this works with the MASC dataset in @exm-eda-masc-dtms.
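A minimal sketch of both steps on a toy frequency table (the table and its values are invented; the real example uses the MASC dataset):

```r
library(dplyr)
library(tidytext)

# Invented lemma counts per document
masc_freq_tbl <- tribble(
  ~doc_id, ~lemma, ~n,
  "doc1",  "the",  10,
  "doc1",  "cat",   2,
  "doc2",  "the",   8,
  "doc2",  "dog",   3
)

# Recast the tidy counts as a document-term matrix
masc_dtm <- masc_freq_tbl |>
  cast_dtm(document = doc_id, term = lemma, value = n)

# tf-idf is computed on the tidy table itself
masc_tfidf_tbl <- masc_freq_tbl |>
  bind_tf_idf(term = lemma, document = doc_id, n = n)
```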

Preface questions

Revise Questions.

Technical exercises: make sure that these align with, and do not overlap with, the Recipe/Lab.

Update the MASC dataset to that used in Recipe 7

The transformed dataset from Recipe 7 is cleaner.

To remove non-words:

pos

  • CD, FW, LS, SYM

lemma

  • ^\W$

Our first pass at calculating lemma frequency in @exm-eda-masc-count should bring something else to our attention. As we can see, among the most frequent lemmas are non-words such as `,` and `.`. As you can imagine, given the conventions of written and transcriptional language, these types are very frequent. For a frequency analysis focusing on words, however, we should probably remove them. Thinking ahead, there may also be other non-words that we want to remove, such as symbols, numbers, *etc*. Let's take a look at @fig-eda-masc-pos, where I've counted the part-of-speech tags `pos` in the dataset to see what other non-words we might want to remove.
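The removal described above could be sketched as a single `filter()` call; the miniature `masc_tbl` below is a hypothetical stand-in for the real tokens table:

```r
library(dplyr)
library(stringr)

# Invented miniature tokens table
masc_tbl <- tribble(
  ~lemma, ~pos,
  "cat",  "NN",
  ",",    ",",
  "3",    "CD",
  "&",    "SYM"
)

# Drop numerals (CD), foreign words (FW), list markers (LS), symbols (SYM),
# and single non-word-character lemmas (punctuation)
masc_words_tbl <- masc_tbl |>
  filter(!pos %in% c("CD", "FW", "LS", "SYM"),
         !str_detect(lemma, "^\\W$"))

masc_words_tbl   # only the "cat" row remains
```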

Orientation part overview

The Orientation part needs to be revised so that the overview is more representative of the new configuration.

Citations for packages not appearing

R and some R packages provide structured datasets that are available for use directly within R. For example, the `languageR` package [@R-languageR] provides the `dative` dataset, which is a dataset containing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection. The `janeaustenr` package [@R-janeaustenr] provides the `austen_books` dataset, which is a dataset of Jane Austen's novels. Package datasets are loaded into an R session using either the `data()` function, if the package is loaded, or the `::` operator, if the package is not loaded. For example, `data(dative)` or `languageR::dative`.
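The two loading strategies mentioned above, written out (assuming the `languageR` package is installed):

```r
# Strategy 1: attach the package, then load the dataset into the session
library(languageR)
data(dative)

# Strategy 2: access the dataset without attaching the package
dative <- languageR::dative

head(dative)
```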

Add idealized dataset in the CABNC orientation section

The other cases in this chapter include a tabular idealization. This case should follow suit.

Now, let's envision a scenario in which we want to organize a dataset that can be used in a study that aims to investigate the relationship between speaker demographics and utterances. An ideal dataset would contain information about speakers and their utterances. In their original format, the CABNC datasets separate information about utterances and speakers in separate tables, `cabnc_utterances` and `cabnc_participants`, respectively. The idealized dataset, then, will combine the variables from each of these tables into a single dataset.
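The combination described above amounts to a join on the shared speaker identifier; the miniature tables and column names below are hypothetical stand-ins for the real CABNC data:

```r
library(dplyr)

# Invented miniature versions of the two CABNC tables
cabnc_utterances <- tribble(
  ~doc_id, ~speaker_id, ~utterance,
  "KB1",   "PS002",     "well I suppose so"
)

cabnc_participants <- tribble(
  ~speaker_id, ~age, ~sex,
  "PS002",     34,   "female"
)

# Idealized dataset: utterances with speaker demographics attached
cabnc_tbl <-
  cabnc_utterances |>
  left_join(cabnc_participants, by = "speaker_id")
```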

Copy `skim()` output and add as plain text

Let's get a high-level summary of the variables in the dataset. We can use the `skim()` function from the `skimr` package [@R-skimr] to get a summary of the variables in the dataset^[Note I've modified the output of `skim()` for display purposes.].

The skimr output does not format well in the knitted document.
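One possible way to capture the `skim()` output as plain text for pasting into the source (a sketch, not necessarily the approach the book will take):

```r
library(skimr)

# Capture the printed skim() summary as plain-text lines,
# which can then be pasted into the document verbatim
skim_txt <- capture.output(skim(iris))
writeLines(skim_txt)
```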

Create GH publish

Create a GH workflow for publishing to the gh-pages branch, using `_freeze` for computations. Then remove the `docs` folder from the repository.
