Note: This is work in development, mainly as a template for personal use, and it is provided "as-is", without guarantees of robustness, correctness or optimality. Please use responsibly.
This RNA-seq workflow consists of a Snakefile
, a configuration file (config.yaml
) and a set of R scripts to perform quality control, preprocessing and differential expression analysis of RNA-seq data. The output can be combined with the iResViewer
R package to generate a shiny application for browsing and sharing the results.
To use the RNA-seq workflow on your own data, first clone this repository to your local machine and then follow the steps below:
- Put the gzipped fastq files in a
FASTQ
directory. TheSnakefile
assumes that these files are named according to the pattern<sample-name>.fastq.gz
(or<sample-name>_R1.fastq.gz
and<sample-name>_R2.fastq.gz
for paired-end data), if this is not the case you need to rename the files or modify theSnakefile
accordingly. - Create a tab-separated metadata text file. This file should have at least two columns: one named
names
, which contains all the values of<sample-name>
from the fastq files, and one namedtype
which is either SE or PE depending on whether the samples were obtained with a single-end or paired-end protocol. In addition, any number of columns can be included and used later in the analysis. All variables required for the differential expression analysis should be included as columns in the metadata text file.
- Include the path to the metadata file in
config.yaml
at the requested line (metatxt :=
). - Add paths to reference files in
config.yaml
. Note that you will also have to add desired paths for indexes etc that will be generated by the workflow. The following reference files are used in the workflow, and can be downloaded from Ensembl or Gencode:- Genome fasta file: Must be uncompressed
- Corresponding GTF file: Must be uncompressed
- A single cDNA or ncRNA fasta file: or a fasta file with cDNA and ncRNA combined (
cat cdna.fastq.gz ncrna.fastq.gz > cdna.ncrna.fastq.gz
). Can be compressed.
- Specify the source of your reference files in
config.yaml
. Please be consistent with build and release versions and fill in:- annotation: Either
Ensembl
orGencode
- organism: Species name separated by an underscore
_
(e.g.Homo_sapiens
) - build: Genome build (e.g.
GRCh38
) - release: Release number (e.g.
93
)
- annotation: Either
- Add the readlength for your RNA-seq reads to
config.yaml
. - Add a "group variable" to
config.yaml
. This will be used to color the samples in downstream visualizations, and should correspond to one of the column names in the metadata text file. - Set the maximal number of cores to use for the tools that support multi-threading.
-
The workflow assumes that all the necessary software is in your path. Alternatively, you can set up a
conda
environment, which will contain all necessary software. First, ensure thatconda
is available and, if necessary, add the channelsr
,conda-forge
andbioconda
(see e.g. here). Then, to create an environment namedrnaseqworkflow
, doconda env create -n rnaseqworkflow --file envs/environment.yaml
Note that after the environment has been created, it can be updated after editing
envs/environment.yaml
, in the case where new versions are needed, usingconda env update -n rnaseqworkflow --file envs/environment.yaml
Before running the workflow, activate the environment with
source activate rnaseqworkflow
-
If you don't want to use
conda
, make sure that all necessary software is installed. The following software is used by the workflow: -
Make sure that all necessary R packages are installed. The workflow uses the following packages:
- The
scripts/run_dge_edgeR.R
script contains the basic code to perform differential expression analysis with edgeR. However, you need to modify it in order to perform the proper analysis for your data. As a minimum, define the design and the contrast(s) you would like to use, based on the variables defined in the metadata text file. If you make additional modifications, make sure that the output of the script follows the requirements outlined inscripts/run_dge_edgeR.R
.
If all the instructions above have been followed, the workflow can be run from the command line by simply typing
snakemake
This will generate all necessary output directories and run the analysis for the indicated samples. If you want to use multiple cores, just do
snakemake --cores 12
The workflow contains two rules to check the versions of the software that has been used. To check the versions of R packages, running snakemake listpackages
will parse the output files generated by R CMD BATCH
and extract all used R packages. The results will be written to a text file. To check the versions of other software, running snakemake softwareversions
will check the versions of the software. Finally, the log
directory contains log files, which should state the version of all software that was used.
The output file of the workflow (output/shiny_results_list.rds
) can be directly used as input to the iResViewer
package in order to generate a shiny application where the results can be viewed. After installing iResViewer
, make sure that the bigWig files listed in the bwFiles
slot of output/shiny_results.rds
are reachable (they will be streamed into the shiny application for coverage visualizations). The following code will start a shiny application where you can browse the results:
library(iResViewer)
res <- readRDS("output/shiny_results_list.rds")
do.call(iResViewer, res)
The title of the shiny application can also be specified:
do.call(iResViewer, c(res, list(appTitle = "myTitle")))
The output file of the workflow (output/shiny_results_sce.rds
) can be directly used as input to the iSEE
package in order to generate a shiny application where the results can be viewed. After installing iSEE
, the following code will start a shiny application where you can browse the results:
library(iSEE)
sce <- readRDS("output/shiny_results_sce.rds")
app <- iSEE(sce)
shiny::runApp(app)