2018-12-03-bioinformatics_for_biologists's Introduction

Introduction to data analysis with R

3-4 December 2018, Cambridge University Bioinformatics Training

Instructors: Hugo Tavares & Sandra Cortijo (Sainsbury Laboratory)

This is a general introduction to R for data analysis.

Our practicals will be very hands-on, focusing on the syntax you need to do data analysis in R, from data manipulation to visualisation. We will focus on tabular data, which is general enough for you to apply these skills to a wide range of problems.

Below, we provide links to detailed materials for your reference, many of which were developed by the Data Carpentry organisation.

If you have any queries, please post a new issue on our GitHub repository.

Setup

All necessary software and data will be available on the training machines at the Bioinformatics Training Room (Craik-Marshall Building).

However, you are welcome to use your own laptop, in which case you need to:

Download and install R (here)
Download and install RStudio (here)
Install the R package tidyverse (open RStudio and go to Tools > Install Packages)
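
The same installation can also be done from the R console instead of the RStudio menu. The following is a minimal sketch, assuming an internet connection and a working CRAN mirror; the variable names `needed`, `is_installed` and `to_install` are only illustrative.

```r
# Packages required for the course (only tidyverse is listed in the setup steps)
needed <- c("tidyverse")

# Check which of them are already installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install only what is missing; the interactive() guard avoids
# downloading anything when the script is run non-interactively
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Once installed, `library(tidyverse)` should load without errors in a new R session.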

Data Organisation in Spreadsheets

Digital data recording often starts with spreadsheet software (e.g. Excel). For an effective data analysis, it's crucial to start with a well-structured and consistently formatted dataset. Because of this, before diving into R, we will discuss common issues to consider when recording data in spreadsheets.

Download data for this lesson here
Find detailed materials here
- example of tidy data
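
As a rough illustration of what a well-structured table looks like once it reaches R, the sketch below builds a small table with one row per observation and one column per variable. The values and column names (`sample_id`, `treatment`, `weight_g`) are made up for this example and are not taken from the course data.

```r
# Made-up measurements, for illustration only
tidy_example <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4"),
  treatment = c("control", "control", "drug", "drug"),
  weight_g  = c(20.1, 19.8, 22.4, NA),   # missing value recorded as NA, not as a comment or 0
  stringsAsFactors = FALSE
)

# With one variable per column, a per-group summary is a single call
# (rows with a missing weight are dropped by default)
mean_weight <- aggregate(weight_g ~ treatment, data = tidy_example, FUN = mean)
print(mean_weight)
```

Keeping units in the column name rather than in the cells, and using a single consistent code for missing values, means the table can be read into R without manual clean-up.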

Introduction to R

This lesson will cover the very basics of using R with RStudio.

Detailed reference materials:

exercises

Data manipulation and visualisation in R

This lesson covers functions for manipulating and summarising tabular data with the dplyr package, and introduces data visualisation with the ggplot2 package.
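
To give a flavour of the two packages, here is a minimal sketch. It uses R's built-in `iris` dataset as a stand-in, which is an assumption made so the example runs without downloading anything; the course itself works with its own data.

```r
library(dplyr)
library(ggplot2)

# dplyr: summarise a table by group
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_petal_length = mean(Petal.Length),
    n = n()
  )
print(iris_summary)

# ggplot2: map columns to aesthetics, then add a geometry layer
p <- ggplot(iris, aes(Species, Petal.Length)) +
  geom_boxplot()
print(p)
```

The pipe (`%>%`) passes the result of one step to the next, while ggplot2 layers are combined with `+`.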

Detailed reference materials:

Exploratory RNAseq data analysis in R

In this session we will apply the concepts learned so far to a worked example of an exploratory data analysis of transcriptomic data.

Find the lesson materials here
- rnaseq starting script

During the lesson, we will also learn a few more tricks in R, including:

Further resources

Summary of R basics
Summary of dplyr functions and their equivalent in base R (will also add data.table equivalents at some point)
Cheatsheets for dplyr, ggplot2 and more

Extra materials/books:

R for Data Science - a nice follow-up from this course focusing on "tidyverse" packages
Introduction to Statistical Learning - an introductory book about machine learning using R
- Also see this course material for a practical introduction to this topic
Statistical Rethinking (not freely available) - an introductory book about statistical modelling using R
The "Think X" series of books, which focus on Python, but are freely available

2018-12-03-bioinformatics_for_biologists's Issues

instructor notes

The exercises for the course have been compiled here.

Outline of things to cover:

Create Rproj and folders "data_output" and "scripts"
Intro
- skip factors
- exercises 1.1 and 1.2
data.frames
- use read_csv() from the beginning to simplify things
- don't spend too much time here, main thing is to explain [rows, columns] for subsetting and $ to access a column
- exercise 1.3
dplyr
- skip spread/gather (covered in extra RNAseq lesson)
- exercises 2.1, 2.2, 2.3
- Do exercise 2.4 with students to save time
ggplot2:
- skip themes and customisation (simply mention them at the end)
- extra: see note below to mention factors
- exercises 3.1-4 (if time is short do some exercises together)
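
The `[rows, columns]` and `$` idioms from the data.frames item can be sketched as follows. The table is made up for illustration and built with base R so that it runs without any packages; note that a tibble returned by `read_csv()` behaves slightly differently, in that `df[, "weight"]` returns a one-column tibble rather than a vector.

```r
# Made-up table, for illustration only
df <- data.frame(
  species = c("DM", "DO", "PF"),
  weight  = c(40, 48, 8),
  stringsAsFactors = FALSE
)

first_two_rows <- df[1:2, ]            # rows 1 and 2, all columns
weight_column  <- df[, "weight"]       # all rows, one column (a vector for a base data.frame)
same_column    <- df$weight            # $ accesses a single column by name
heavy          <- df[df$weight > 20, ] # rows selected by a logical condition

print(heavy)
```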

note: extra material for ggplot2 section

So that students intuitively understand factors, introduce them in the plotting
section.

For example:

When doing this plot:

surveys_complete %>% 
  ggplot(aes(sex, hindfoot_length)) +
  geom_boxplot()

What if we want to change the order of the x-axis labels to be "M" first?

Then we need to learn about factors, which are a special way that R has to
encode categorical variables.

Let's look at factors using a simple example first. Then go through the example
in the course materials (here), but only its very first section.
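
One possible simple example, using a made-up character vector rather than the survey data:

```r
# Made-up values, for illustration only
sex <- c("F", "M", "M", "F", "M")

# By default, levels are ordered alphabetically
sex_default <- factor(sex)
levels(sex_default)
# "F" "M"

# The levels argument sets the order explicitly
sex_reordered <- factor(sex, levels = c("M", "F"))
levels(sex_reordered)
# "M" "F"

# Internally, each value is stored as the integer position of its level
as.integer(sex_reordered)
# 2 1 1 2 1
```

The level order is what ggplot2 uses to order a categorical axis, which is why reordering the levels changes the plot.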

From there, jump back to the plotting problem and resolve it:

surveys_complete %>% 
  mutate(sex = factor(sex, levels = c("M", "F"))) %>% 
  ggplot(aes(sex, hindfoot_length)) +
  geom_boxplot()

Exercise 3.4 applies this concept again.
