3-4 December 2018, Cambridge University Bioinformatics Training
Instructors: Hugo Tavares & Sandra Cortijo (Sainsbury Laboratory)
This is a general introduction to R for data analysis.
Our practicals will be very hands-on, focusing on the syntax you need to do data analysis in R, from data manipulation to visualisation. We will focus on tabular data, which is general enough to allow you to apply these skills to a wide range of problems.
Below, we provide links to detailed materials for your reference, many of which were developed by the Data Carpentry organisation.
If you have any queries, please post a new issue on our GitHub repository.
Setup
All necessary software and data will be available on the training machines at the Bioinformatics Training Room (Craik-Marshall Building).
However, you are welcome to use your own laptop, in which case you need to:
- Download and install R (here)
- Download and install RStudio (here)
- Install the R package tidyverse (open RStudio and go to Tools > Install Packages)
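The package can also be installed from the R console instead of the menu. The snippet below is a minimal sketch: the helper name missing_packages is our own invention for illustration (it is not part of R or tidyverse), and the install line is left commented out because it needs an internet connection.

```r
# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a hypothetical helper written for this example.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Check whether tidyverse still needs installing
to_install <- missing_packages("tidyverse")
print(to_install)

# Uncomment to install anything that is missing:
# install.packages(to_install)
```

If to_install comes back empty, tidyverse is already available and can be loaded with library(tidyverse).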
Digital data recording often starts in spreadsheet software (e.g. Excel). For an effective data analysis, it's crucial to start with a well-structured and well-formatted dataset. Because of this, before diving into R, we will start with a discussion of common issues to consider when recording data in spreadsheets.
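To make "well-structured" more concrete, here is a small sketch of one common issue: the same measurement spread across several columns. The plant names and heights are made up for illustration, and only base R is used so that it runs without extra packages.

```r
# A "wide" table, as often typed into a spreadsheet:
# one column per measurement day (values are invented)
wide <- data.frame(
  plant = c("A", "B"),
  height_day1 = c(1.2, 1.5),
  height_day7 = c(3.4, 3.9)
)

# The same data in a "long" (tidy) layout:
# one row per observation, one column per variable
long <- data.frame(
  plant = rep(wide$plant, times = 2),
  day = rep(c(1, 7), each = nrow(wide)),
  height = c(wide$height_day1, wide$height_day7)
)

print(long)
```

In the long layout, adding a new measurement day means adding rows rather than columns, which tends to be easier to filter, summarise and plot later on.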
Further reading:
- Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10
- Hadley Wickham (2013) Tidy Data, Journal of Statistical Software, 59:10
This lesson will cover the very basics of using R with RStudio.
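As a flavour of what "the very basics" might look like, the sketch below creates a vector, handles a missing value and calls a function. The variable names and numbers are our own and are not taken from the lesson materials.

```r
# Create a numeric vector; NA marks a missing measurement
heights_cm <- c(150, 160, 170, NA)

# Functions take arguments; na.rm = TRUE ignores missing values
mean_height <- mean(heights_cm, na.rm = TRUE)

# Vectors can be subset with a logical condition
tall <- heights_cm[!is.na(heights_cm) & heights_cm > 155]

print(mean_height)
print(tall)
```

In RStudio, these lines can be typed in a script and sent to the console one at a time with Ctrl+Enter.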
Detailed reference materials:
This lesson will cover some functions to effectively manipulate and summarise tabular data using the dplyr package, and we will start to learn how to visualise data with the ggplot2 package.
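The sketch below gives a rough idea of how dplyr and ggplot2 fit together. It assumes both packages are installed and uses R's built-in iris dataset rather than the course data. The threshold and column choices are arbitrary.

```r
library(dplyr)
library(ggplot2)

# Manipulate and summarise: keep some rows, then compute
# a mean and a count for each species
species_summary <- iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length), n = n())

print(species_summary)

# Visualise: build a scatter plot object, coloured by species
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point()

# Run print(p) to draw the plot
```

Each dplyr verb takes a table and returns a table, which is what allows the steps to be chained with the pipe.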
Detailed reference materials:
In this session we will apply the concepts learned so far to a worked example of an exploratory data analysis of transcriptomic data.
- Find the lesson materials here
During the lesson, we will also learn a few more tricks in R, including:
Further reading:
- Conesa et al. (2016) A survey of best practices for RNA-seq data analysis, Genome Biology 17, 13
- Jake Lever, Martin Krzywinski & Naomi Altman (2017) Principal component analysis, Nature Methods 14, 641–642
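Principal component analysis, covered in the last reference above, is a common first step when exploring expression data. The sketch below is not the lesson's own analysis: it runs base R's prcomp on a simulated matrix, and the gene and sample names, group sizes and effect size are all invented for illustration.

```r
set.seed(1)

# Simulated expression matrix: 50 genes (rows) x 6 samples (columns)
n_genes <- 50
n_samples <- 6
expr <- matrix(
  rnorm(n_genes * n_samples),
  nrow = n_genes,
  dimnames = list(paste0("gene", 1:n_genes), paste0("sample", 1:n_samples))
)

# Pretend samples 4-6 are "treated": shift the first 10 genes upwards
expr[1:10, 4:6] <- expr[1:10, 4:6] + 3

# prcomp expects observations (samples) in rows, so transpose
pca <- prcomp(t(expr), center = TRUE, scale. = TRUE)

# Fraction of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first two components, ready for plotting
scores <- as.data.frame(pca$x[, 1:2])
print(scores)
```

Plotting PC1 against PC2 from scores is a quick way to check whether samples group by condition or by some unwanted factor such as batch.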
- Summary of R basics
- Summary of dplyr functions and their equivalent in base R (will also add data.table equivalents at some point)
- Cheatsheets for dplyr, ggplot2 and more
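To illustrate the kind of dplyr-to-base-R correspondence such a summary describes, here is a small sketch using the built-in mtcars dataset. It assumes dplyr is installed. The examples are our own and are not copied from the linked summary.

```r
library(dplyr)

# Filter rows and select columns
dplyr_subset <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt)
base_subset <- mtcars[mtcars$cyl == 4, c("mpg", "wt")]

# Grouped summary: mean mpg per number of cylinders
dplyr_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
base_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

print(dplyr_means)
print(base_means)
```

Both approaches give the same numbers. The dplyr version returns a table, while tapply returns a named vector.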
Extra materials/books:
- R for Data Science - a nice follow-up from this course focusing on "tidyverse" packages
- Introduction to Statistical Learning - an introductory book about machine learning using R
- Also see this course material for a practical introduction to this topic
- Statistical Rethinking (not freely available) - an introductory book about statistical modelling using R
- The "Think X" series of books, which focus on Python, but are freely available