
openintrostat / oilabs-tidy

πŸ‘©πŸΏβ€πŸ’» OpenIntro Labs in R using the tidyverse design philosophy, grammar, and data structures

Home Page: http://openintrostat.github.io/oilabs-tidy/

License: Creative Commons Attribution Share Alike 4.0 International

HTML 99.87% R 0.06% CSS 0.08%

oilabs-tidy's People

Contributors

actions-user, ameliamn, andrewpbray, atheobold, beanumber, benjamin-feder, gungormetehan, ismayc, mine-cetinkaya-rundel, oromendia, rudeboybert, staceyhancock, stevehua


oilabs-tidy's Issues

use of ABs as explanatory variable in SLR lab?

What was the thinking behind making at_bats the central explanatory variable in the SLR lab? It isn't very intuitive to me. The most obvious explanatory variables would be home_runs, hits, or bat_avg, all of which measure offensive production. at_bats instead measures offensive opportunities, and it is correlated with runs because innings end after three outs but can, in principle, go on indefinitely. So if your team makes outs at a lower rate, you get more plate appearances per inning, and thus more at_bats per inning, and along the way presumably more runs. The strong positive correlation is therefore not surprising, but it is a backwards and counter-intuitive way to think about explaining the variability in team runs scored.

add five single-table verbs to "intro to data" lab

I just finished the "Intro to Data" lab. It went OK, but I recommend:

  • renaming this "Intro to Data Wrangling"
  • introducing the five single-table verbs at the beginning (e.g. select, filter, mutate, arrange, and summarize)
  • including a short discussion of the "grammar of data manipulation"
  • focusing the content on applications of the five verbs and creating simple pipelines
  • keeping the group_by stuff, as I think this is a necessary skill

Without the discussion of the five single table verbs I'm not sure how we expect students to assimilate these skills.
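As a concrete illustration of what such a lab could build toward, here is a minimal sketch of a pipeline combining all five verbs (assuming the lab's nycflights data frame from the openintro package is loaded; the column names follow that data set):

```r
library(dplyr)

# One pipeline using all five single-table verbs
nycflights %>%
  filter(dest == "SFO") %>%                    # keep rows: flights to SFO
  select(month, dep_delay) %>%                 # keep only the columns of interest
  mutate(delayed = dep_delay > 0) %>%          # derive a new column
  group_by(month) %>%
  summarize(prop_delayed = mean(delayed)) %>%  # collapse to one row per month
  arrange(desc(prop_delayed))                  # order the result
```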

Shiny app server access problems

Many of my students had trouble running the labs that rely on Shiny apps (sampling distributions) because they could not stay connected to the shiny server.

However, very little of the lab actually requires a Shiny app (only Exercise 6 in the sampling distributions lab, https://openintro.shinyapps.io/sampling_distributions/). If that Shiny app were not embedded in the lab (e.g., if it were a stand-alone app linked from the lab), then all students could access the static lab webpage and we would have less trouble getting so many students working on the lab at the same time.

ims-01 arbuthnot data

I'm not sure how extensive this issue is. I think the problem is about learnr vs your own .Rmd file. I thought I'd put it here for us to track.

contact_reason β†’ typo
book β†’ ims
feedback_type β†’ labs
lab_software β†’ r_core
contact_name β†’ Mike Keim
contact_email β†’ [email protected]
reply_needed β†’ no_but_okay

contact_message
In Lab 1 (http://openintrostat.github.io/oilabs-tidy/01_intro_to_r/intro_to_r.html) this language appears after the first "arbuthnot" R Command:

"One advantage of RStudio is that it comes with a built-in data viewer. The Environment tab (in the upper right pane) lists the objects in your environment. "

When you do this lab in RMarkdown, running "arbuthnot" does not make that dataframe appear in the Environment. (It does appear later in exercise 2 when you use piping to create the new column named "total"). My students were slowed down by expecting to see something appear in Environment but nothing appearing. Maybe this language could get cleaned up by distinguishing between when things should appear in Environment or not.

"conditioning commands"

In intro_to_data we use the term "conditioning commands", but a quick search suggests we might be the only ones calling them that. Should we be looking for different terminology? They're not really "conditioning"; they're typically called "logical operators". If we want to avoid that term, we should look for something closer to the standard usage.

Note that if we were to make a change here we would want to carry it across other versions of these labs as well.

Sampling distributions - Does science benefit you? Exercise 1 issues

The exercise text says "Depending on which 50 people you selected, your estimate could be a bit above or a bit below the true population proportion of 0.26. In general, though, the sample proportion turns out to be a pretty good estimate of the true population proportion, and you were able to get it by sampling less than 1% of the population."

(1) I thought that the true population parameter was "0.2".
(2) As far as I can tell, we sampled 50 out of 100,000 observations. Isn't that well below "1% of the population"? Your statement is not wrong, if I'm right, but I think "1%" overstates the size of the sample relative to the population.

Lab 6 Exercise 2

Lab 6 Exercise 2 asks

  1. What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?

However, the discussion following this exercise makes it clear that the question was meant to be asking for the proportion of those who have texted while driving among those who never wear helmets.

Arbuthnot not adding to environment

Hello! I'm working through Lab 1 in preparation for the spring semester. I've found that when I follow the directions in the lab and simply type arbuthnot into my console (or in a chunk) it does not add arbuthnot to my global environment. Am I doing something incorrectly? Is there an option in RStudio that I need to change?

I can add it by either typing arbuthnot <- arbuthnot or data(arbuthnot) which seems to be how it used to be coded. Any tips?

Testing the labs while writing them with global chunk options

The final version of the labs usually has eval = FALSE for all code chunks, since we show the code and not the output. However when writing the lab we want to make sure all code runs properly. I found the following approach an easier way to make sure of this than what we're doing currently:

  • Add the following chunk on top of the lab with the option include = FALSE so it doesn't show up in the lab
knitr::opts_chunk$set(eval = TRUE/FALSE)
  • In this chunk set eval = TRUE when testing, and convert to eval = FALSE when done testing before publishing.
  • Remove eval = FALSE statements from the individual code chunks in the lab

Note that this can be applied to other versions of the labs as well
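A minimal sketch of the suggested setup chunk (placed at the top of the .Rmd with the chunk option include = FALSE so it is hidden in the rendered lab):

```r
# Hidden setup chunk: flip eval to TRUE while testing the lab,
# and back to FALSE before publishing, instead of editing every chunk.
knitr::opts_chunk$set(eval = FALSE)
```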

building sampling distributions

Note that in practice one rarely gets to build sampling distributions,
because we rarely have access to data from the entire population.

Huh? Do you mean "true sampling distributions"? We approximate sampling distributions all the time, don't we?

Language about RStudio server + instructor help

This is in reference to https://github.com/andrewpbray/oiLabs-dplyr-ggplot/blob/master/intro_to_r/intro_to_r.Rmd:

Since these labs are for wider distribution, for the generic versions (to be hosted on OpenIntro) I suggest we remove references to the RStudio server.

Also in the past we've made an effort to avoid language like "ask me", "ask TAs", etc. so that self learners or those at institutions with different structures could also adopt the labs easily.

Convert all data files to RData?

Looks like we've been sourcing .R files for the labs that don't have custom functions. It seems really awkward to have the data in an R script. The most straightforward fix would be to have them download and load an RData file instead, which is what the later labs do anyway. Here's an example.

Even better is putting it all in the oiLabs package. Are there any arguments in favor of keeping the data hosted on the openintro website instead of in a package on github?

Inference for numerical data lab

In the inf for numerical data lab The dplyr chain

nc %>%
  group_by(habit) %>%
  summarize(mean(weight))

does not work for other numerical variables (such as weeks, gained, visits) due to missing data. Do we want to scrub the data set of rows with missing values, or do we want to complicate the 3rd element in the chain by

summarize(mean(weight, na.rm=TRUE))

I vote for the latter since this is a teachable moment about using real data.

Lab 6 more practice

In lab 6, more practice question number 1 seems like it could use clarification:

  1. Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week?

Is this intended to compare those who sleep 10+ hours per day with those who don't sleep 10+ hours per day, or is it intended to compare training every day of the week with other frequencies? I could see either reading being reasonable.

Intro to data: more airports serving Los Angeles than LAX

"How delayed were flights that were headed to Los Angeles?"
There are more airports in your data set that serve Los Angeles.
You have data for 31 flights to BUR (Burbank) and 66 to LGB (Long Beach). You may want to be more specific in your question and explicitly mention LAX, or include the other two airports.

I know, I know...

Pipe mismatch with tutorials

The tutorials have been updated to use the native pipe, but the labs still use %>%

This will confuse students who are using both the tutorials and the labs.

Normal Distribution Lab

Two notes in the Normal Distribution lab:

  • The qqplots should have diagonal lines indicating when observed quantiles are exactly equal to theoretical normal quantiles. For instance in the mosaic version of the labs they do. Not sure if this is best achieved via ggplot commands, or a wrapper function.
  • Also almost all students got tripped up in trying to plot the qqplot of sim_norm. They would replace data and not sample in the command qplot(sample = hgt, data = fdims, stat = "qq"), not realizing that since sim_norm is a standalone vector (and not a variable within a data frame) they should be using qplot(sample = sim_norm, stat = "qq"). Perhaps assign sim_norm to the data frame fdims?
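One possible ggplot2 approach to the first point (a sketch, not necessarily what the labs should adopt; it assumes the fdims data frame with an hgt column, and geom_qq_line() requires ggplot2 >= 3.0.0):

```r
library(ggplot2)

# QQ plot of hgt with the diagonal reference line where observed
# quantiles equal theoretical normal quantiles
ggplot(fdims, aes(sample = hgt)) +
  geom_qq() +
  geom_qq_line()
```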

SLR Lab Exercise 8 Has No Correct Answer

I am passing on an issue pointed out by an instructor trying to use the Rguroo labs. The same issue exists in the tidyverse lab.

Exercise 8 in the tidyverse lab asks the student to compute the residual for a country with pf_expression_control = 7.4. However, the pf_expression_control scores are rated to the nearest 0.25 of a point. So it is impossible to compute a residual in this exercise.

error when knitting intro_to_data.Rmd

I first had to do some edits to get nycflights to load, by entering the following install.packages commands into the console:

install.packages("tidyverse")
install.packages("devtools")
library(devtools)
install_github("OpenIntroStat/openintro-r-package")

Then when knitting I get an error on line 374:

Error in eval(lhs, parent, parent): object 'nycflights' not found

My file is at
https://github.com/ohlone-math159/lab-02-intro-to-data/blob/master/02_intro_to_data/intro_to_data.Rmd

Convert everything to data frames and remove square bracket notation?

This would be a major rewrite, but (without looking through all the labs) it seems like it'd be possible to remove all references to the vector structure of R and just use a lot of select(). An alternative would be to not go whole hog in the data frame direction and leave some vectors in. But the arguments for the full rewrite:

Pros:

  • dplyr literate syntax only. So instead of dollar-sign, vector subsetting, matrix subsetting, and the subset() function, there'd only be filter() and select().
  • We could introduce chains from lab 1 as the new normal.
  • The logic of dplyr is similar to that of ggplot, so it opens up more powerful graphics options.
  • Removal of the word vector, which might be a bit daunting to many students.

Cons:

  • I haven't gone through all the labs yet, so there might be some topics that will require an ugly hack or be dropped altogether.
  • For-loops will require reworking. I think our best option would likely be to use the do() function in mosaic.
  • This gets fairly far away from traditional R syntax, which will make googling around more confusing for students.

One thing that I think we would need to add if we did this is a lab that did focus on vectors, constructing data frames, and manipulating them using tidyr.

Lab 05a_sampling_distributions: replicates filtered out in sample_props_small

sample_props_small often has fewer than the requested 25 elements.

The call to
filter(scientist_work == "Doesn't benefit")
is filtering out any replicates where there are no "Doesn't benefit"s in the small sample. As a result any replicates with p_hat=0 are filtered out and are not displayed.

This issue is caused by using a small sample size and a true proportion close to 0 (p=.2).
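A sketch of one possible fix: compute p_hat directly per replicate rather than filtering first, so replicates with zero "Doesn't benefit" responses are kept with p_hat = 0 (the samples object and column names below are assumed from the lab's setup):

```r
library(dplyr)

# One row per replicate, including replicates where p_hat = 0
sample_props_small <- samples %>%
  group_by(replicate) %>%
  summarize(p_hat = mean(scientist_work == "Doesn't benefit"))
```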

vague wording in Exercise 5, Intro to Data lab

Which month has the highest average departure delay from an NYC airport? What about the highest median departure delay? Which of these measures is more reliable for deciding which month(s) to avoid flying if you really dislike delayed flights?

I'm not sure what you are going for here because of the ambiguity of "really dislike delayed flights". It seems to me that if you are trying to avoid a single really long delay, then you should base your judgment on the mean. If you are trying to avoid any kind of delay, then you might base your judgment on the median.

Could this be tightened up?

switch to tidyverse?

Can we rename this project oiLabs-tidyverse and use library(tidyverse) at the beginning of each lab?

SLR Lab Correlation Code Produces NA

This code right before the Sum of squared residuals section:

hfi %>% summarise(cor(pf_expression_control, pf_score))

produces an NA value. I am not sure whether you prefer to fix with

hfi %>% summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))

or

hfi_small_clean %>% summarise(cor(pf_expression_control, pf_score))

cans of worms opened by ggplot'ification

  • The plot_ss (SLR lab), inference, and multiLines (multiple regression lab) functions rely on base graphics. From a cursory look, ggplot'ifying them will take a bit of work, so I wanted to discuss this before jumping down that rabbit hole.
  • In the multiple regression lab with teacher eval data, if we want students to create scatterplots of multiple pairs of variables Γ  la plot(evals[,13:19]) then we'll have to either
    • stick with base R
    • Use the ggpairs function in the GGally package, thereby increasing the number of package dependencies.
  • In the simple linear regression lab with MLB data, we ask students to create scatterplots with regression lines and qqplots with qqlines. I don't think there is a way to superimpose such lines without resorting to stat_smooth(method="lm", se = F)

What are your thoughts on

  • Jumping back and forth between plotting formats in the scenarios above?
  • Busting out of self contained qplot() calls and using layers?

Residuals vs. standard residuals in simple linear regression lab

Looks like the baseball lab uses .stdresid in the residuals plots. I think we should stick with .resid since that is what is in the textbook, and unless I'm missing it the lab doesn't explain what standardized residuals are, and I'd vote against going there... Happy to do a pull request but wanted to discuss it first in case this was a deliberate decision.

kobe_basket data set is missing

Hey, I think a data set is missing for running the Probability lab. In the openintro package, there does not appear to be a data set called kobe_basket. I attempted to work around this with the data files in the repo but am not able to find a data set called kobe_basket or a data set with a variable called shot.

data() <Promise> in RStudio

@beanumber and I have discussed this prior.

If you load a data set via data() in RStudio, all you get in the environment panel is a <Promise> of the data set i.e. if you click the variable name, the Data Viewer spreadsheet does not pop up. It is only after you run some function on the object that you get this ability. Students are extremely puzzled by this fact. Ex:

library(openintro)
data(county)

I'm starting to lean more on the Data Viewer than commands like names(), head(), or tail() as it gets students close to the data with all layers of abstraction removed. Also students love the sort and filter functionality.

One could argue that:

  • If the package lazy loads data sets, then data() is unnecessary. However, I still like explicitly having students run data() to make things seem less magical.
  • One can use the View() command, but I find clicking on the variable name in the Environment panel quicker. And why the uppercase V?

Do any of you know of a solution to this problem? i.e. students run data() and should immediately be able to load the Data Viewer by clicking the variable name in the Environment panel? A cursory google search seems to suggest the lazy eval nature of R might preclude a solution.
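One partial workaround (a sketch, not a real fix for the UI behavior): force evaluation immediately after data(), which resolves the lazy-load promise so the Environment panel entry is immediately clickable:

```r
library(openintro)
data(county)
invisible(force(county))  # evaluate the promise without printing the data
```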

Exercise 4 in Confidence Intervals lab

What does β€œ95% confidence” mean? If you’re not sure, see Section 4.2.2.

This only works for the regular version of the textbook. With the Randomization and Simulation version, it is Section 2.8.4.

Lab 7 exercise 2

This exercise asks students to make a side-by-side violin plot, but the next paragraph mentions box plots comparing medians. I'm guessing the question originally asked students to make a box plot. For consistency, it might be best either to have the question ask about box plots, or to have the next paragraph discuss violin plots.

Use of `n = n()`

My students found this:

sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(n = n())

to be really confusing. I agree. I would recommend naming the derived column N or num_rows instead of n, which is the name of a function!

Incidentally, I am a big believer in always counting the number of rows, so I'd also recommend putting that into every summarise() call out of habit.
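The suggested renaming, as a sketch (same pipeline as above, with the derived column given a name that is not also a function name):

```r
library(dplyr)

# Name the count column num_rows instead of n
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(num_rows = n())
```

Alternatively, count(origin) produces the same result in one step, though it names the column n by default.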

normal prob plot in lab 4

In lab 4 problem 3, it says:

  1. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)

This is probably due to a change in ggplot2, but it looks like ggplot() can't accept a bare vector as an argument, though the geom can. So the options seem to be either to have the parenthetical note that the aesthetic mapping should be moved into the geom instead of the ggplot() call, or to coerce the vector into a data frame when it's created.
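A sketch of the second option, wrapping sim_norm in a data frame when it's created so it can be plotted the same way as the real data (sim_norm is assumed to already exist as a numeric vector):

```r
library(ggplot2)

# Coerce the simulated vector into a data frame for plotting
sim_norm_df <- data.frame(sim_norm = sim_norm)

ggplot(sim_norm_df, aes(sample = sim_norm)) +
  geom_qq()
```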

code to generate histQQmatch in Normal Distributions labs?

This places a PDF in the wrong directory, then calls a PNG??

pdf("histQQmatch.pdf", height = 20, width = 15)
multiplot(p1, p2, p3, p4, p5, p6, p7, p8,
          layout = matrix(1:8, ncol = 2, byrow = TRUE))
dev.off()
![histQQmatch](more/histQQmatch.png)

ambiguity in Exercise 10 on intro to data lab

In Exercise 10, there is some ambiguity in the definition of "on time". A flight gets arr_type as on time if the delay is <= 0 minutes. However, previously in the lab, dep_type was defined as on time if it was less than 5 minutes late. Do you expect that students will re-define dep_type with a definition that matches that of arr_type in Exercise 10? Or simply use the previous definition, which is logically mismatched? Apparently, virtually all of my students did the latter.
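If the intent is for students to redefine dep_type to match, a sketch of that step (using the lab's variable names and the arr_type cutoff of at most 0 minutes) would be:

```r
library(dplyr)

# Redefine dep_type with the same cutoff as arr_type
nycflights <- nycflights %>%
  mutate(dep_type = if_else(dep_delay <= 0, "on time", "delayed"))
```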

Refactoring of regression labs

As a todo for Andrew: the SLR has a bit about diagnostics which is not needed since there is no inference done here. Move to the MLR lab.

www/lab.css?

Is there a reason why labs 5 and 8 use a custom CSS? The CSS files themselves aren't different. Is this because of the Shiny stuff?

Bibliographies and Citations

How do we feel about having bibliographies and citations? This is easily done in markdown.

This is more of an aesthetic issue. In particular note the preamble to the Multiple Regression lab.

Inference for categorical data lab

In the On Your Own section, Q1.a) you ask students to "Form confidence intervals for the true proportion of atheists in both years, and determine whether they overlap." I think a better approach is via a single confidence interval on the difference in proportions.

Even though two individual confidence intervals may overlap, suggesting they are not different, the confidence interval of the difference might still suggest they are in fact different. (If you need an example of this, let me know) This is a common misinterpretation of bar plots with error bars (i.e. dynamite plots).

Intro to data, Exercise 3

"Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.

Another useful technique is quickly calculating summary statistics for various groups in your data frame. For example, we can modify the above command using the group_by function to get the same summary stats for each origin airport:"

You ask for arrival delays and instead calculate departure delays:
"sfo_feb_flights %>%
group_by(origin) %>%
summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())"
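A sketch of the same command with arrival delays substituted in, so the code matches the question:

```r
library(dplyr)

# Summarize arrival delays (not departure delays) per origin airport
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_ad = median(arr_delay),
            iqr_ad = IQR(arr_delay),
            n_flights = n())
```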

lab report template leading to errors: duplicate chunk labels

The Lab Report template provided in the openintro package has a sample code chunk for Exercise 1 and that code chunk has a label ("code-chunk-label").

Many of my students copy and paste this code chunk when completing later Exercises, which leads to an error caused by duplicate code chunk labels. This issue is well outside the scope of what I want them to be considering in an introductory class.

A lot of unnecessary confusion could be avoided if the template simply didn't have the "code-chunk-label" label in that first code chunk.

Similarly, the ellipsis at the bottom of the document (presumably meant to indicate to students that they should continue the document after Exercise 2) seems to cause confusion to students new to RMarkdown because many assume it is an important part of the document, like the chunk delimiters ("```" vs "...").

To summarize, I recommend avoiding unnecessary confusion by modifying the Lab Report template by

  1. deleting the code chunk label "code-chunk-label"
  2. deleting the ellipsis ("...") at the bottom of the document

ggplot2?

The project name references dplyr, but what about ggplot2? Are others in favor of converting the plots to ggplot2 as well?

Pros:

  • Easier to make multivariable plots
  • Default look is a lot more this century

Cons:

  • Harder to customize -- though we do very little customization of plots in the labs anyway
