thedatamine / the-examples-book Goto Github PK

View Code? Open in Web Editor NEW

20.0 4.0 35.0 1.18 GB

Supplementary material for solving projects assigned in The Data Mine.

Home Page: https://the-examples-book.com

Makefile 0.21% Shell 0.38% Python 0.22% Handlebars 11.67% Jupyter Notebook 28.12% HTML 59.40%

data-science database r python3 sql book purdue-university

the-examples-book's Introduction

Website: https://the-examples-book.com (main branch)

The Examples Book

Supplementary material for solving projects assigned in The Data Mine, Purdue University's integrative data science initiative. The "core" book can be found at https://the-examples-book.com. Complementary materials are available as appendices at the following URLs:

Containerization and Kubernetes Deployment: https://the-examples-book.com/starter-guides/data-engineering/containers/intro-to-containers
Prodigy Annotation Tool: https://the-examples-book.com/starter-guides/data-science/data-analysis/nlp/prodigy
Geospatial Analytics: https://the-examples-book.com/starter-guides/data-science/data-analysis/gis
Data Visualization: https://the-examples-book.com/starter-guides/data-visualiation/introduction-data-visualization
Natural Language Processing: https://the-examples-book.com/starter-guides/data-science/data-analysis/nlp/introduction-nlp
MATLAB: https://the-examples-book.com/starter-guides/tools-and-standards/matlab/introduction-matlab
Time Series: https://the-examples-book.com/starter-guides/data-science/data-analysis/time-series
Projects Archive: https://the-examples-book.com/projects
Corporate Partners Information: https://the-examples-book.com/crp
Anvil Resources & Guides: https://the-examples-book.com/starter-guides/anvil/
Registration: https://the-examples-book.com/registration
Think Summer: https://the-examples-book.com/think-summer

You can learn more about The Data Mine using the following links:

Contribution

Thank you for those that have already contributed. If you have an ignored issue or pull request, please know we are going to get to it and we really appreciate your patience.

Here is our guide on how to contribute. Please feel free to start a discussion or open up an issue.

Build

This book is written using AsciiDoc. AsciiDoc is an open and powerful format for writing notes, text documents, books, etc. It is easy to write technical documentation in AsciiDoc, and quickly convert the text to various mediums like websites, ebooks, pdfs, etc.

Search index

Search is handled by Meilisearch. For this repository -- the core book -- the following GitHub Action job automatically builds, deploys, and updates the search index. There is no additional work that must be done when a change is made to this repository.

on: 
  push:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-20.04
    steps:
    - uses: actions/checkout@v2
    - name: Wait for CF Pages
      id: cf-pages
      uses: WalshyDev/cf-pages-await@v1
      with:
        accountEmail: ${{ secrets.CLOUDFLARE_ACCOUNT_EMAIL }}
        apiKey: ${{ secrets.CLOUDFLARE_GLOBAL_API_KEY  }}
        accountId: 'c07da5a4aa8d50689311ae57df77e3a6'
        project: 'the-examples-book'
        # Add this if you want GitHub Deployments (see below)
        githubToken: ${{ secrets.GITHUB_TOKEN }}

  run-scraper:
    needs: build
    runs-on: ubuntu-20.04
    steps:      
    - name: Clone TheDataMine/docs-scraper
      uses: actions/checkout@v2
      with: 
        repository: TheDataMine/docs-scraper
    - name: Install pipenv
      run: |
        python3 -m pip install --upgrade pipenv wheel
    - id: cache-pipenv
      uses: actions/cache@v1
      with:
        path: ~/.local/share/virtualenvs
        key: ${{ runner.os }}-pipenv-${{ hashFiles('**/Pipfile.lock') }}
    - name: Install dependencies
      if: steps.cache-pipenv.outputs.cache-hit != 'true'
      run: 
        pipenv install
    - name: Run docs-scraper
      env:
        MEILISEARCH_HOST_URL: ${{ secrets.MEILISEARCH_HOST_URL }}
        MEILISEARCH_API_KEY: ${{ secrets.MEILISEARCH_API_KEY }}
      run: |
        pipenv run ./docs_scraper ./the-examples-book.config.json
        
  purge-cf-cache:
    needs: build
    runs-on: ubuntu-20.04
    steps:
    - name: Purge Cloudflare cache
      uses: jakejarvis/cloudflare-purge-action@master
      env:
        CLOUDFLARE_ZONE: ${{ secrets.CLOUDFLARE_ZONE }}
        CLOUDFLARE_TOKEN: ${{ secrets.CLOUDFLARE_TOKEN }}

— # —

the-examples-book's People

Contributors

Stargazers

Watchers

the-examples-book's Issues

[290/390-p9] Additional suggestions

Dear @kevinamstutz , here are some additional thoughts on the 290 and 390 Projects 9. Thank you for the consideration! (Sorry for putting everything together in the same issue post.)

Q1

Shall we add a hint for .tables? Currently the book section on SQL does not seem to include it. Or, will there be additional instructions about using SQLite3 in the terminal (so that a book section can be linked instead of directly telling the students to use .tables)?

Q2

The first hint says

Make sure you take a look at the data dictionary for the table and column names.

Is there a separate data dictionary for the students to use? According to the solution, it seems that all the students need to do is to show the data with the headers on. In that case, can we combine the two hints, maybe as follows?

Hint: Make sure you take a look at the column names and get familiar with the data tables. To see the header row as a part of each query result, run the following:

.headers on

Styles of hints for Q1-Q3 and Q4-Q7

Starting from Q4, there are no links in "Relevant topics" and there is the note referring the SQL examples. I think it better if we make the hint styles consistent. I prefer the way adopted in Q4-Q7. My reason is although different keywords all have links embedded in Q1-Q3, the links all direct to the same book section. What do you think?

Q9 of 390

The question says

... when appearing in a World Series (WSWin) or when league champion (LgWin).

The solution checks whether WSWin or LgWin is "Y". What about "N"? I thought that "N" would also mean appearances, and thus checked whether the two variables equal to "" for this question. As I am not so familiar with baseball, could you help me clarify the meanings of the values here?

[290-p4] Additional hints

Dear @kevinamstutz , I would suggest that we provide some additional notes and hints for the students.

Escape characters. Some students may forget that they need to escape special characters. Shall we mention this FAQ item as a note for Q4? For Q6-Q8, I think a separate note on escaping characters in R is necessary. The explanation here or the last paragraph under str_extract or could be useful. (However, it is possible that some students may get stuck about the use of \\ in R since they would not read about str_extract in Q8 until the very end.)
I think it also helpful to include somewhere in the project the regex cheat sheet by RStudio. It was introduced to the students last year, and can be a supplement to the Unix grep examples currently in the Examples Book (e.g., the use of parentheses and the list of quantifiers).

[290-p7] Q4 clarification

Dear @kevinamstutz , I would suggest the following clarification for Q4. Please feel free to modify it as you see fit!

awk is a really good tool to quickly get some data and manipulate it a little bit. ~~For example, let's see the number of kilometers and miles traveled in 1990.~~ The column Distance contains the distances of the flights in miles. Use awk to calculate the total distance travelled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.
~~Example output:~~ Your output should follow the format below. When printing out the results, you can add a new line by \n.

Miles: 12345
Kilometers: 19867.35168

The results I got is different from the solution. Could you also help me check it? Thanks!

Below is the results I got by running the code from the solution:

$ awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' 1990.csv
Miles: 3274877170 
Kilometers:5.2704e+09

The solution says:

Miles: 4343599210 
Kilometers: 6.99035e+09

[290/390-p10] Require `COUNT` in 290-Q5/390-Q4?

Dear @kevinamstutz , I have a question about the use of COUNT(DISTINCT playerID). This is something required in Questions 5-7 of 290 and Questions 4-6 in 390. Those questions ask the students to get the count of unique players.

I like these questions in that they help me better understand the use of DISTINCT. However, shall we make it required to use COUNT?

I will use 290 Question 5 as an example to explain why I ask this question. We would like to count the unique players that have more than 50 home runs (HR) in a season. The solution code is:

dbGetQuery(con, "SELECT COUNT(DISTINCT playerID) FROM batting WHERE HR>50;")

However, students may also manually count the players from the output of the following code:

dbGetQuery(con, "SELECT DISTINCT playerID FROM batting WHERE HR>50;")

(R even provided row index that indirectly counts the lines.)

I am guessing if we do not require COUNT, the students can equally get the answer. Personally, I think we do not need to impose restrictions on what code the students should use. In that case, no change of the questions is necessary. We simply need to include a note in the solution/for the grader to give full credits as long as the result is correct. What do you think?

Fix "Examples" anchor links.

Currently, clicking on a table of contents "Examples" section could bring you to an unintended section.

Add warning about long outputs to project(s)

Add R gsub content

[290-p4] Q3 ideas

Hi @kevinamstutz , two ideas about Question 3.

Shall we specify where to save the file jim_and_pam.csv? Since the code would be executed within the Rmd file, the working directory would be where the Rmd file is if the students do not specify the path. (Meanwhile, I agree that it does not matter whether the students want to save the file. This is just in case some students ask about the proper path.)

The other issue is that the cut command seems not working well with commas in quotations. This makes it a bit difficult to keep only the five columns. For example,

head -n5 /class/datamine/data/movies_and_tv/the_office_dialogue.csv | cut -d, -f4,7-9,12

But the final field of data in the four lines are not always the air_date. Shall we ask the students to keep all the columns instead? (Although awk has a solution (Stack Overflow), I guess we do not want to mention awk yet in this project.)

Thanks for consideration!

[290-p8] Q4 additional hint

Dear @kevinamstutz , for Question 4 is it possible to include the number of flights in the description or as a hint? This question involves code that is a bit more complicated, and the total number of flights would allow the students to verify their answer.

Also, could you give me a hint about how to use echo (listed under "Relevant topics")? I see it used in the solution under ### 5 - but I am not sure how the solution code for Questions 5-7 correspond to the current version of Project 8.

Just for your reference, my code for Question 4 is:

cd /class/datamine/data/flights/subset

# get the airports
awk -F, '{if ($4 ~ /state|AZ|FL/) {print}}' ~/new_airports.csv > ~/az_fl_airports.txt
wc -l ~/az_fl_airports.txt # 160 = 159 + 1 (header)

# get the flights
grep -w -F -f <(cut -d, -f1 ~/az_fl_airports.txt) 2008.csv > ~/az_fl_flights_2008.csv
wc -l ~/az_fl_flights_2008.csv # 484705

I thought that we need process substitution to first obtain the IATA codes for grep, but this is not used in the solution. Could you help me clarify it? Thanks!!

Add bash script content to book

Dear @kevinamstutz , it seems that the section for writing bash scripts in the Examples Book is still under preparation. Shall we utilize perhaps materials from 290 Fall 2019?

290 Project 7 in Fall 2019: https://datamine.purdue.edu/seminars/fall2019/stat29000project7.html

Pages loading videos bounce around

I've noticed that when navigating the book the pages bounce around a lot as the videos load. I'm wondering if we should change the embedded videos to instead be bolded links to videos?

[290-p5] Q6 clarification

Hi @kevinamstutz , I realize that I may have misunderstood Q6 after checking the solution.

Here is my thought process. We are looking for the ProductId:

that have the highest HelpfulnessNumerator and Score == 5, and
that have the highest HelpfulnessNumerator and Score == 1.

So my code is

cut -d, -f2,5,7 amazon_fine_food_reviews.csv | grep ',5$' | sort -t, -k2 -n | tail -n3
# B009K2BBT8,559,5
# B001PQTYN2,808,5
# B000FI4O90,866,5

cut -d, -f2,5,7 amazon_fine_food_reviews.csv | grep ',1$' | sort -t, -k2 -n | tail -n3
# B0081XPTBS,436,1
# B0099HD3YA,446,1
# B001F10XUU,580,1

However, the two IDs are different from the solution, which is

sort -t, -k7r,7 -k5r,5 amazon_fine_food_reviews.csv | head -n2
# Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
# 207719,B00012182G,A21V2QODUYH7Q8,Ty,99,115,5,1251072000,Very Fresh!,I was sent a live rabbit a hammer and a skinning knife.  It doesn't get fresher than this folks.
# sort: write failed: standard output: Broken pipe
# sort: write error

sort -t, -k7,7 -k5r,5 amazon_fine_food_reviews.csv | head -n2
# 103960,B00065LI0A,A2WJ1V4LK1578Q,Paul E. Seidel,97,103,1,1278979200,Wouldn't Send to Anyone,"Received this for Fathers Day and found the basket contained mostly stuffing and over-sized boxes. The very decorative box of Angilina's ""Sweet Butter Cookies"" looked like it could hold 6-8 ozs.of cookies. Inside there were 2.2 ozs of tiny ordinary cookies laden with palm oil and made in Indonesia. The 2 oz.""California Pantry"" gourmet water crackers were edible with the Wisconsin cheese but had a lot of trans-fat and were made in Hong Kong. The 4 oz. ""Focaccia Crisps Tuscan Style Crackers"" were ho-hum slightly not-crisp crackers made with palm oil in Indonesia. A 2oz. slice of ""Italian Rasperry Cake"" was dry. The most weight in the basket came from the 3.75 oz. pack of cheese the 5 oz beef salami and a small jar of mustard. All 3 were ok but at what price?"
# 134829,B000SKS8JM,A31UJUB82AWL1,Dean Guattare,96,134,1,1240963200,Save your money,The seeds I got were clearly non viable and I threw all but one in the garbage. Terrible customer service as well.  I emailed 4 times to 3 different email addresses they have posted and never got a response.
# sort: write failed: standard output: Broken pipe
# sort: write error

While writing this post, I guess that the solution code should have included the option -n for sort. However, after adding -n, I still get results that are different from mine. What do you think?

With the correction, we may no longer need the following sentence:

In the case of a tie, write down all ProductId's to get full credit.

[290/390-p11] Q1: MiB vs. MB; ls vs. du

Dear @kevinamstutz , Question 1 asks about file sizes in MB. I am curious whether it is easy to add a brief note explaining MiB and MB. Another potential issue is the use of ls (listed as a relevant topic) versus du (not listed as a relevant topic). What do you think?

Or, we could skip adding notes, and be flexible in grading. Please read on for more details.

When attempting this question myself, I tried several different approaches:

$ ls -lh /class/datamine/data/lahman/lahman.db 
-rw-rw-r-- 1 kamstut tdm-admin 63M Jun 25 11:33 /class/datamine/data/lahman/lahman.db

$ ls -l --block-size=M /class/datamine/data/lahman/lahman.db 
-rw-rw-r-- 1 kamstut tdm-admin 63M Jun 25 11:33 /class/datamine/data/lahman/lahman.db

$ ls -l --block-size=MB /class/datamine/data/lahman/lahman.db 
-rw-rw-r-- 1 kamstut tdm-admin 67MB Jun 25 11:33 /class/datamine/data/lahman/lahman.db

$ du -h /class/datamine/data/lahman/lahman.db
38M	/class/datamine/data/lahman/lahman.db

$ du -h --block-size=M /class/datamine/data/lahman/lahman.db
38M	/class/datamine/data/lahman/lahman.db

$ du -m /class/datamine/data/lahman/lahman.db
38	/class/datamine/data/lahman/lahman.db

$ du -h --block-size=MB /class/datamine/data/lahman/lahman.db
40MB	/class/datamine/data/lahman/lahman.db

Some students may only go so far as ls -lh or du -h, while some may use ls -l --block-size=M instead of MB. The answers from the students may not be identical. In particular, the difference between ls and du results could be large.

Another concern of mine is the distinction between MiB and MB. Personally, I have the impression that "1 MB = 1024^2 Bytes" and have not heard about MiB until recently. However, I am not sure how prevalent this impression is.

[290-p7] Q3 example on for loop

Dear @kevinamstutz , I would suggest that we provide another example or some reference link about using script arguments for the range of a loop.

Here are my reasons.

Currently, the last example on the script section has for i in {1987..2008}. However, they would soon find out for i {$1..$2} does not work.
Some students may not have experience in other programming languages, and as a result, the approach of ((f=$1; f<=$2; f++)) may not be intuitive enough for them. (They may get stuck at some details, such as the use of two layers of parentheses.)

While I think we should not cover everything in the projects by examples, some students may not be able to identify the proper resources for help other than the Examples Book. Hence, I suggest some additional hint. What do you think?

Add variable examples to unix/bash scripts section

[290/390-p6] Question 5 note on space

Dear @kevinamstutz , the Tutorials Point page on comm gives examples that include a space between < and (. However, it would not work in the terminal. For example:

$ cd /class/datamine/data/flights
$ comm -23 < (cut -d, -f15 2008.csv | sort | uniq) < (cut -d, -f25 2008.csv | sort | uniq)
bash: syntax error near unexpected token `('
$ comm -23 <(cut -d, -f15 2008.csv | sort | uniq) < (cut -d, -f25 2008.csv | sort | uniq)
bash: syntax error near unexpected token `('

Shall we include a note saying that there should be no space between < and (?

[290-p6] On writing bash scripts

Dear @kevinamstutz , it seems that the section for writing bash scripts in the Examples Book is still under preparation. Shall we utilize perhaps materials from 290 Fall 2019?

290 Project 7 in Fall 2019: https://datamine.purdue.edu/seminars/fall2019/stat29000project7.html

Also, how about moving Q2 to right before Q7 so that the two questions about scripts would be right next to each other?

[1900-P3] Removing read metadata.csv from question1

Dear @kevinamstutz in question 1 students are asked to read two datasets however the metadata.csv data is not used later in the project. Is the line asking to read that data a typo?

Section "About" in 290 projects

Dear @kevinamstutz , I just notice an additional section under the 290 projects: https://thedatamine.github.io/the-examples-book/projects.html#about. Is it something intended to be written there? (Not sure how to identify the commit that added those texts...) Thanks!

[290-p5] Q8 quick clarification

Hi @kevinamstutz , last post about 290 Project 5 here. Question 8 says:

... Based on this comparison, decide (anecdotally) whether you think people found the review helpful because the product is overrated, underrated, or correctly reviewed by the masses.

I was trying to understand the clause "because the product is overrated, underrated, or correctly reviewed by the masses". Is it okay to say the following instead?

because the distribution of scores that the product receives.

I make this suggestion because I am not sure whether Score would imply something being, say, overrated. "Overrating" to me sounds like a product gets many high scores when it should be rated lower. However, based on only the Score column, we could not see what is the "appropriate" score for a product. Thus, I think it could be difficult to tell whether something is over- or under-rated.

But this is just my perspective. Question 8 already explicitly asks the students to compare the histograms. So feel free to ignore this post in case I am over-thinking. Thanks!

Add old questions as table examples

[290-p5] `sort | head`

Dear @kevinamstutz , in Q1, Q4, and Q6, the students may use sort and then head in their pipes. For example, we would like to see the helpful-ness in Q1:

cut -d, -f1,5 amazon_fine_food_reviews.csv | sort -t, -k2,2 -nr | head -n3

The output is

190734,866
207713,844
566780,808
sort: write failed: standard output: Broken pipe
sort: write error

According to this Stack Overflow post, this is an expected error as sort is stopped by the limited number of rows by head. However, this may make some students wonder whether their code was correct. Shall we give a note under Q1, saying that the error messages are okay?

I am thinking about a note like this:

Note: You can always use head in the pipe to print only the relevant rows in case the complete result is very long. Note that if you used sort before head in the pipe, you may see error messages as follows:

sort: write failed: standard output: Broken pipe
sort: write error

This is because head would truncate the output from sort. It is fine to have error messages usually. See this discussion for more details.

As you can see, this note would also nudge the students to use head so we do not receive too much output when grading. (Always feel free to modify or reject my suggestion as needed!)

[390-p8] Q5 hint

Dear @kevinamstutz , is it possible to include a hint for "two-dimensional" arrays in awk?

I ask this because "by month, and by year" seems to require a two-dim array. However, according to the "# 2" post on this page, awk does not support two-dim arrays, and actually pastes the two columns into one. (Then I understand your code {M[$1","$2]++} in the solution.)

Moreover, are the students required to get the data for all the years available in /class/datamine/data/flights/subset? It would be helpful if we could clarify the range of years in Question 5.

Thanks!

Folding the table of contents?

Dear editors,

I wonder whether the table of contents could be folded so that only the chapter titles are shown by default, and that the section titles will be shown if the reader click a particular chapter. I am referring to the behavior of R for Data Science.

I notice this because currently there are many FAQ items shown and readers would need to scroll down to find the projects. (I know from the office hours that some students are accessing the Examples Book using the embedded browser in Brightspace, which is smaller.)

Perhaps the folding would make the access easier? However, I am not sure whether this would require some big changes of the entire book...

Thanks!

Yumin

[290-P3] Resources on Unix

Dear @kevinamstutz , my last post for 290/390 Project 3 here.

Keyboard shortcut

Where do you think it appropriate to introduce some keyboard shortcuts when working with the Terminal? I think of this because only when using "control + C" for copying (in RStudio Server on my Windows computer) did I recall this combo is used differently in UNIX.
I checked the example for 190 Project 7 in Fall 2019 (/class/datamine/data/examples/project7examples.txt), and there seems no mentioning of the keyboard combos. Also, since this confusion may not arise for Mac users, do you think it okay that we include a brief topic of the shortcuts under the Unix section?

Resources for Unix (and other topics)

Perhaps this is already something under planning: I would suggest that we have a resource subsection under each topic. Perhaps we could start with the recommended readings from Dr. @mdw333 ?

Thank you for the consideration!!

[290-p5] Q2 clarification

Dear @kevinamstutz , I would like to suggest some clarifications for Q2. Q2 asks

What proportion of all Summarys are unique?

I first thought this question is asking the Summary value that appears only once (instead of more than once). The word "proportion" may make some students try to find how many Summary values appear only once, out of all the unique Summary values. That question could be a bit tricky.

According the solution, I think you intended to ask the ratio of #{unique values} over #{all the values}. What about the following way of asking?

Some entries under the Summary column appear more than once. What is the proportion of such duplicate entries out of all the Summary values? Use two lines of UNIX commands to find the ~~answer~~_numerator and the denominator, and you can then manually calculate the proportion._

I have to say that the question could still be confusing. For example, if a value appears $n$ times, some students may wonder whether to use $n-1$ or $n$ in the numerator. The code would probably be different, too. What do you think?

As you can see, I also suggest a modification for the second half of the question, because I think it might be less obvious which two lines to use. Some students may even guess perhaps the second line is for the calculation...

[290/390-p11] Q4 format

Dear @kevinamstutz , Question 4 asks the students to show the query result in a particular way. However, the output may be confusing.

I can imagine that both +-------------+ and | are automatic formatting when SQL queries are submitted via terminals. However, I notice some students usually write the code directly in RMarkdown code chunks. As a result, they may not see the terminal format of the query output, and thus try to include +-------------+. Or, if the students use RMariaDB::dbGetQuery, they would not see the formatting, either.

For example, upon the first look at the question, I forgot that +-------------+ was added automatically, and tried something like the following:

```{sql connection=con}
SELECT CONCAT('+-------------+\n', 
              '| Donations   |\n', 
              '+-------------+\n', 
              '| IN: ', COUNT(*), ' |\n',
              '+-------------+\n')
FROM elections WHERE state = 'IN';
```

Do you think there is a way of fixing this?

Also, do you think we should replace the value 1111778 in Question 4 by some fake number or a placeholder (otherwise, the students may directly use it for Question 3)?

[290/390-p10] Clarification for Purdue-IU comparison

Dear @kevinamstutz , pardon me for raising another question on Question 7 in 290 Project 10 and Question 6 in 390 Project 10. The question is about creating a "misleading" plot that makes Purdue look better than IU. I would like to suggest that we simplify the question by saying:

Use the information you have in the database, and the power of R to create a ~~misleading~~ graphic that ~~makes~~ compares Purdue ~~look better than~~ against IU ~~, even if just at first glance~~.

(We can even add "barplot" as a relevant topic under the question.)

I would like to explain my suggestion here. Personally, I find it a bit difficult to come up with an idea about the misleading graph. I ran the code in the solution, and got the following plot:

While the y-axis is the log-count, Purdue still has a bar shorter than than IU's. To me the plot does not seem to imply that Purdue is better than IU.

However, please let me know if there is anything that I missed. Thank you in advance for clarifying my understanding!

[290-p4] Q1 File and permission

Hi @kevinamstutz , the main issue I notice about Q1 is the permission that students have. Please see my code below and the Terminal output in the attachment.

date; grep -Ri "bears. beets. battlestar galactica" /class/datamine/data; date

Attachment: f20-290-p4-preview-q1out.txt

Meanwhile, there seem more than one the_office_dialogue.csv file under /class/datamine/data/:

/class/datamine/data/movies_and_tv/the_office_dialogue.csv
/class/datamine/data/spring2020/the_office_dialogue.csv

However, the path in the solution

/class/datamine/data/the_office/the_office_dialogue.csv

is not available. (I used the one under /class/datamine/data/movies_and_tv/ for previewing.)

Could you double check the file location?

[1900-p3] Question#5 Incomplete

Dear @kevinamstutz the last sentence in Question 5 seems to be incomplete. Only the words " How many" appear. Should it be " How many quarters do you get?" instead ?

Thank you

[290-p4] Q6 & Q7 understanding

Hi @kevinamstutz , this will be one of the last two posts about 290 Project 4. The two posts will be about my understanding of the code: The questions are well designed, and it's just about my misunderstanding of the code. Thanks in advance for helping me to learn!

Q6 is to identify dialogue directions enclosed in "[" and "]". and Q7 is about identifying multiple directions in one line. The solution says:

# Q6 (2a in solution)
dat$has_direction <- grepl("(\\[.*\\])+", dat$text_w_direction)

# Q7 (2b in solution)
length(grep("(\\[.*\\].*){3,}", dat$text_w_direction))
length(grep("(\\[.*\\].*){6,}", dat$text_w_direction))

It seems that the pattern "\\[.*\\]" will get the same results as "(\\[.*\\])+" in the Q6 solution. Is it a general recommendation/good practice to use quantifier. or is there some important difference that I may neglect?

I would also like to share three mistakes I made about Q7, which motivated me to suggest more notes or hints on, for example, the parentheses and quantifiers.

# Mistake 1: Not using parentheses
sum(grepl('\\[.*\\]{2,}', dat$text_w_direction)) # 1
# Mistake 2: Checking consecutive pairs of "[...]"
sum(grepl('(\\[.*\\]){2,}', dat$text_w_direction)) # 10
# Mistake 3: Mis-using quantifiers
sum(grepl('(\\[.*\\]+){2,}', dat$text_w_direction)) # 10

I am not sure how to best help the students to learn such details. While I can compare my code with the solution, it could be difficult to debug such mistakes on the students' side as manually checking the data may or may not reveal all the patterns. Do you think there are any online resources or examples that may be relevant to be included in the project?

Add this example to the book

https://www.reddit.com/r/Rlanguage/comments/irir2a/i_love_r_but_im_still_a_beginner_i_try_to/

[290-P3] Q2 clarification

Hi @kevinamstutz , this would be the first of a few posts that I have about 290/390 Project 3. As I am not sure about how multiple pull requests would work, I will try to illustrate my suggestions in the description below.

Question description

The suggestion was to make the "target" of this question more obvious - Meanwhile, this would inevitably make this question almost too straight-forward... I suggest the following removal also because the students still get opportunities to look at the other commands later.

There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a bash shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient!
[YZ: Empty line added.]
What is the default directory of your bash shell?
[YZ: Empty line added.]
In our list of relevant topics, we've included links to a variety of UNIX commands that may help you solve this problem. Some of the tools are super simple to use, and some are a little bit more difficult.
...
Relevant topics: man, cd, pwd ~~, ls, ~, ..,~~.

Item(s) to submit

I suggest the following editing in case some students may see the term "working directory" for the first time.

The bash code used to show your home directory or current ~~working~~ directory (also known as the working directory) when the bash shell is first launched.

[290-p7] Q2 clarification

Dear @kevinamstutz , Question 2 in Project 7 of 290 asks for the number of lines that do not have 29 columns. It seems that all the lines have 29 columns. Is this correct? (This is just to verify my answer.)

Meanwhile, I would suggest the following clarifications.

Question description

Our files should have 29 columns. For a given file, write a line of code that prints any lines ~~in a file~~ that do not have 29 columns.

This modification would hopefully prevent some students from using redirection to "print any lines in a file" (although their answer would probably still be correct).

Hint about `NF`

It seems that the current section on awk in the Examples Book (link) does not introduce the variable NF. Shall we add a hint about it?

[290/390-p6] In-field commas

Dear @kevinamstutz , the flight data contain commas in the data entries (e.g., the 16th column ORIGIN_CITY_NAME). As a result, the column index counted according to the header would not work if used in the code (cut -d, or awk -F,). Shall we somehow give a hint or include a note about it?

To see the issue, for example:

head -n3 /class/datamine/data/flights/1987.csv | cut -d, -f24
# "DEST"
# 32575
# 32575

This would be an issue for the following questions:

Question 2. The question asks the students to print out any column using the script. Students may spend long time debugging if they try to print out any field after the 15th column.
Questions 3-6. These questions involve the destination airport. DEST appears as the 24th variable name, but the actual value seems to be in the 25th field (due to the comma in the 16th field ORIGIN_CITY_NAME). (If we do not use FPAT of awk -- see details below -- a manual fix to get DEST is to use cut -d, -f25 or awk -F, '{print $25}'.)

One way to fix the in-field comma is to use FPAT = "([^,]*)|(\"[^\"]+\")" in the BEGIN part of awk. For example,

head -n3 /class/datamine/data/flights/1987.csv | awk 'BEGIN {FPAT = "([^,]*)|(\"[^\"]+\")"} {print $24}'

This solution is found from https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html. It is also mentioned in the Stack Overflow question: https://stackoverflow.com/questions/29642102/how-to-make-awk-ignore-the-field-delimiter-inside-double-quotes/29650812.

Without a thorough search, it seems that cut could not ignore the commas in double quotations, according to this Stack Exchange answer: https://unix.stackexchange.com/questions/452508/how-to-use-the-command-cut-to-cut-out-fields-in-a-csv-file-when-fields-contain/452516#452516

[290/390-p11] Q2: nine-digit zip codes?

Dear @kevinamstutz , Question 2 asks the students to find zip codes that begin with "479". From the solution, I notice that the condition is WHERE zip_code LIKE "479__". However, this will exclude the nine-digit zip codes (found by 479%). Using 479% will get us many more zip codes.

May I ask which way was intended when you designed the question? I think a clarification would be helpful, especially when Question 5 also depends on Question 2. Thanks!

[290-P3] Q3 ideas

Dear @kevinamstutz , I have only suggestion for minor changes for the bullet points in Q3 (shown in italics).

Write a command to list the files and directories within the data directory. (You do not need to recursively list the subdirectories and files contained therein.) What are the names of the files and directors?

~~What are the names of the files?~~ Write another command to return back to your home directory. [Yumin: I moved the first part to the previous bullet point, thinking they are more closely related]

[290-P3] Links in Q5 and Q6

Hi @kevinamstutz , it seems that the links to du and rmdir in the "Relevant topics` of Q5 and Q6, respectively, do not work as intended. They currently direct the users to the Projects section in the Examples Book, and there seem no contents about the two commands under the "Unix" section. Could you double check it? Thanks!

Add plot_usmap content

[290/390-p11] Q4 solution

Dear @kevinamstutz , this is a second post on Question 4 in Project 11 for 290 and 390. I totally understand code in the solution:

SELECT CONCAT(state, ": ", COUNT(*)) AS 'Donations' 
FROM elections WHERE state='IN';

It works completely fine. However, in my first attempt, I used the following code and it did not work:

SELECT CONCAT("IN: ", COUNT(*)) AS Donations
FROM elections WHERE state = "IN";

The following two screenshots show what I got when running the code in R console and in RMarkdown:

With the help from a post online, I managed to show the count within in CONCAT by:

SELECT CONCAT("IN: ", CAST(COUNT(*) AS char(7))) AS Donations
FROM elections WHERE state = "IN";

Do you have any clue what may lead to the different results?

I am asking this because the code in my first attempt may be some students' attempts as well. And I prefer not to suggest CAST for them so that we keep things simple. What do you think?

[290-p4] Miscellaneous typesetting

Dear @kevinamstutz , this post is about things that I notice about typesetting. I may not write everything using the most complete sentences - please let me know if anything is confusing! Thanks!

Introduction

Shall we remove the paragraph of "Important note" since we are requiring the students to use RStudio Server this semester?
The link the_office_dialogue.csv under the section "Dataset" does not work. Could you double check it?

Question 1

~~Login to Scholar~~ Open the Terminal in RStudio ...
The dataset we will use is the only dataset in the data directory (/class/datamine/data) [Yumin: This will clarify the folder to search, in case some students did not remember the directory from Project 3 or wonder whether to start.]
In "Item(s) to submit", remove "Use grep and grepl within R to solve a data-driven problem."

Question 2

The main body of the question may look a bit confusing:

In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. ... You can count to see which column the various bits of data live in.

It sounds like students should use head and explicitly report the number of columns in the dataset, which are actually not required according to "Item(s) to submit". Do you think a clarification is necessary?
In "Item(s) to submit":

The line ot UNIX commands used to ~~perform the operation~~ find the character and original dialogue line that contains "bears. beets. battlestar galactica." [Yumin: This is to clarify the required code, as we probably do not need the students to include also the head part.]
In the Piping & Redirection section, is it possible to include an overview sentence for piping in the opening paragraph? Currently the opening paragraph looks like implying that the section is about redirection only.

Question 6

I think it better if we use the link for grep in R (https://thedatamine.github.io/the-examples-book/r.html#r-grep) for the "Relevant topics". Currently it directs to the Unix grep. (The link for grepl is correct.)

Question 7

I would suggest the following re-arrangement of the sentences.

7. Modify your regular expression in (7) to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5.

We count the sets of direction in each line by the pairs of square brackets.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

For example, the following data line has 2 directions: dat$text_w_direction[2789].

For "Item(s) to submit", shall we use ≥ in place of >?

Question 8

The "Note" looks redundant when the example is already given above. Shall we remove it?

Solution

I notice that the solution was prepared with different numbering and some minor discrepancy compared to the questions. Could you update it accordingly?

[290-P3] Q4 idea

Dear @kevinamstutz , my suggestion for Q4 is again only related to type-setting.

Use paragraphs

I suggest that we break the question description into paragraphs. For example,

Let's learn about two more important concepts. . refers to the current working directory, or the directory displayed when you run pwd. Unlike pwd you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called my_file.txt that lives in /home/kamstut (so, a full path of /home/kamstut/my_file.txt), and you are currently in /home/kamstut, you could run: cat ./my_file.txt.

.. represents the parent folder or the folder in which your current folder is contained. So let's say I was in /home/kamstut/projects/ and I wanted to get the contents of the file /home/kamstut/my_file.txt. You could do: cat ../my_file.txt.

When you navigate a directory tree using ~~., .., and ~~~ . and .. you create paths that are called relative paths because they are relative to your current directory. Alternatively, a full path or (absolute path) is the path starting from the root directory. So /home/kamstut/my_file.txt is the absolute path for my_file.txt and ../my_file.txt is a relative path.

Perform the following actions, in order:

On `~`

In the third paragraph above, I also remove ~. To my understanding (without reference to any bash or UNIX documentation), ~ would point to $HOME. I can imagine that it being used in a relative path such as ~/../../class/datamine/apps. But I regard ~ in this example as an alias for $HOME. Perhaps there are other types of relative paths where ~ plays an important role?
Another reason for the removal is that in the second bullet point, the students are asked to write a relative path without using ~.

"Plain `cd`"

May I ask what you meant by "plain cd" here? Did it mean that the students should not use an absolute path?

Write a single command to navigate back to your home directory using a relative path. Do not use ~ or plain cd.

Thank you for the clarification!

[290/390-p10] Clarifying "Batting Average"

Dear @kevinamstutz , I would like to suggest a re-phrase of Question 4 in 290 Project 10 (and Question 3 in 390 Project 10).

Currently the question description starts with

Calculate the Batting Average of batters between 2000 and 2010, ...

I first thought the "average" needs to be taken across players (which is a wrong understanding). Hence, I would like to suggest the following question description.

The Batting Average is a metric for a batter's performance. The Batting Average in a year is calculated by H / AB (the number of hits divided by at-bats). Calculate the seasonal Batting Average for batters between 2000 and 2010 who had more than 300 at-bats in one year. List the top 5 batting averages next to playerID, teamID, and yearID.

What do you think?

[290-p6] `sort` examples

Dear @kevinamstutz , I would like to suggest a bit more explanation for sort in the Unix Chapter. Specifically, I think it would be helpful to introduce how the option -k works.

In the examples, we see -k18,18 and -k18,18 -k4,4r. In the solution to Question 6, we see -k7rn,7 -k5rn,5 and -k7n,7 -k5n,5. (Of course, the students would not see the solution when attempting this project.)

In man sort,

-k, --key=KEYDEF
       sort via a key; KEYDEF gives location and type

The manual page seems not to explain why a column index should be used twice. As a result, some students may use, for example, -k4 instead of -k4,4 if they want to sort by the fourth column. (When trying Project 5, I actually wrote the column index only once and things seem to work for me.) According to an online search, to use the column index twice would limit the sorting action to only that column without all the subsequent columns. Is this correct?

To be more clear, my questions are:

Do you think it okay if we explain why the column index is used twice?
When -k is used together with other options (e.g., -n and -r), does it matter where the single-letter options is written (after the first column index or the second column index)?

Please let me know if you there is anything else that I may clarify. Thanks!

[290-p4] Q8 choice between `str_extract` and `str_extract_all`

Hi @kevinamstutz , the relevant topic of Q8 is str_extract_all instead of str_extract. May I ask why one is recommended over the other?

Basically, my question is: What would be the regex pattern that can extract each pair of "[" and "]"? More details below.

I understand that the two functions differ in the following way: str_extract gets the first instance, while str_extract_all gets all the instances. Since we allow the extraction of everything between the first "[" and the last "]", the two stringr functions seem to do the same - at least in the following code:

library(stringr)

# Version 1
q8_v11 = str_extract_all(dat$text_w_direction, '\\[.*\\]')
q8_v12 = str_extract(dat$text_w_direction, '\\[.*\\]')
all.equal(unlist(q8_v11), 
          q8_v12[!is.na(q8_v12)])

# Version 2, based on solution
q8_v21 = str_extract_all(dat$text_w_direction, '(\\[.*\\])+', simplify = F)
q8_v22 = str_extract(dat$text_w_direction, '(\\[.*\\])+')
all.equal(unlist(q8_v21), 
          q8_v22[!is.na(q8_v22)])

(I asked about the difference of the patterns in the two versions in the previous post #66 .)

I guess the difference between str_extract and str_extract_all depends on how multiple directions in the same line are handled. I tried the four patterns: "\\[.*\\]", "(\\[.*\\])+", "(\\[.*\\])", and "(\\[.*\\].*)" - str_extract_all using the four patterns gave the same result.

all.equal(
  str_extract_all(dat$text_w_direction, '\\[.*\\]'),
  str_extract_all(dat$text_w_direction, '(\\[.*\\])+', simplify = F), # the solution code
  str_extract_all(dat$text_w_direction, '(\\[.*\\])'), 
  str_extract_all(dat$text_w_direction, '(\\[.*\\].*)')
)

However, if we use the regex pattern that identify each pair of "[" and "]", instead of the first "[" and the last "]", I think the result of str_extract would differ from that of str_extract_all. This is why I asked the question in the second paragraph above. Could you help me clarify my understanding?

(This is also my last post about 290 Project 4 - I should have got in my ideas earlier... Thank you so much for the consideration!!)

[290/390-p10] On the submission

Dear @kevinamstutz , I like the first important note under Question 2 for Project 10 of 290 and 390 about the RMarkdown code chunk. Do you think it possible to modify the wording so that the students would realize that they do not need to run library or set up the database connection in each code chunk?

I ask this because the R code chunk under Question 5 in the project template is

```{r}
library(RSQLite)

# This is where we define and initiate a connection.
con <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/chinook/chinook.db")

# We can then use the connection to run queries.
dat <- dbGetQuery(con, "SELECT * FROM employees LIMIT 5;")
head(dat)
```

I am afraid that some students would only copy and paste things (an example on Piazza is here) without trimming. However, there would not be any syntax error if they repeat library and dbConnect in each code chunk.

I am not sure how to re-phrase the note, and not sure whether this is concerning at all. So feel free to close this issue as you see fit. Thanks!

Add important notes to P7, P8, P9 about checking PDF and submission

[290/390-p11] Clarifying question on Intro

Dear @kevinamstutz , the last paragraph in the "Dataset" part in the introduction section of the project says

... As fantastic as this database is, it would be trivial to load up the entire database in R or Python and do your analysis using merge-like functions. ...

May I ask what was intended by this sentence? Also, I am not sure what "merge-like functions" refers to.

Thanks in advance for your clarification!

[290-p4] Overall impression

Dear @kevinamstutz , I am not sure how well the students remember the regex grammar from last year, some of the questions could be a bit difficult. (Actually, it could also be me not having much regex experience - literally I only learned and used it in the Data Mine!) I think we may want to provide more directions or hints, or trim one or two questions and save them for a second regex project.

Below is my impression about the questions. I will provide in separate posts more specific ideas, questions, or suggestions.

First of all, I think the questions are designed at a nice pace! Q1-Q3 are straight-forward given the examples in the Unix section. Q4 extends the case in Q3 with more names of interest. Q5 is about quantifiers. Q6-Q8 demonstrate data analysis in R (including the useful str_extract). Important aspects of regular expression are comprehensively covered by the questions!
However, it is a relatively long list of regex details. I think more help would be necessary (e.g., the use of parentheses).

Ah, I realize that I do not have a specific suggestion in this post after writing the paragraphs above... Please take a look at the other posts - perhaps we could use this issue for discussion in case any trimming is needed. Thanks!

[P6] 29000/39000 move questions

Also, how about moving Q2 to right before Q7 so that the two questions about scripts would be right next to each other?

thedatamine / the-examples-book Goto Github PK

the-examples-book's Introduction

The Examples Book

Contribution

Build

Search index

the-examples-book's People

Contributors

Stargazers

Watchers

Forkers

the-examples-book's Issues

Q1

Q2

Styles of hints for Q1-Q3 and Q4-Q7

Q9 of 390

Keyboard shortcut

Resources for Unix (and other topics)

Question description

Item(s) to submit

Question description

Hint about NF

Introduction

Question 1

Question 2

Question 6

Question 7

7. Modify your regular expression in (7) to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5.

Question 8

Solution

Use paragraphs

.. represents the parent folder or the folder in which your current folder is contained. So let's say I was in /home/kamstut/projects/ and I wanted to get the contents of the file /home/kamstut/my_file.txt. You could do: cat ../my_file.txt.

Perform the following actions, in order:

On ~

"Plain cd"

Recommend Projects

Recommend Topics

Recommend Org

Hint about `NF`

`..` represents the parent folder or the folder in which your current folder is contained. So let's say I was in `/home/kamstut/projects/` and I wanted to get the contents of the file `/home/kamstut/my_file.txt`. You could do: `cat ../my_file.txt`.

On `~`

"Plain `cd`"