DMRichR

A workflow for the statistical analysis and visualization of differentially methylated regions (DMRs) of CpG count matrices (Bismark cytosine reports) from the CpG_Me pipeline.

Overview

The goal of DMRichR is to make the comprehensive statistical analysis of whole genome bisulfite sequencing (WGBS) data accessible to the larger epigenomics community, so that it no longer remains a niche methodology. Whether it be peripheral samples from a large-scale human epidemiological study or a select set of precious samples from model and non-model organisms, WGBS can provide novel insight into the epigenome and its role in the regulation of gene expression. Furthermore, the functions and workflow are written with the goal of bridging the gap for those transitioning from Illumina's Infinium assay technology (450K and EPIC arrays) by providing statistical analysis and visualization functions that present the data in a familiar format.

The overarching theme of DMRichR is the synthesis of popular Bioconductor R packages for the analysis of genomic data with the tidyverse philosophy of R programming. This allows for a streamlined and tidy approach to downstream data analysis and visualization. In addition to functioning as an R package, the central component of DMRichR is an executable script that is meant to be run as a single call from the command line. While this is a non-traditional approach for R programming, it simplifies the analysis process while also providing a backbone to build custom workflows on (in a manner similar to a traditional vignette).

DMRichR leverages the statistical algorithms from two popular R packages, dmrseq and bsseq, which enable the inference of differentially methylated regions (DMRs) from low-pass WGBS. In these smoothing-based approaches, CpG sites with higher coverage are given a higher weight and used to infer the methylation level of neighboring CpGs with lower coverage. This approach favors a larger sample size over a deeper sequencing depth, and only requires 1-5x coverage for each sample. By focusing on the differences in methylation levels between groups, rather than the absolute levels within a group, the methodologies utilized allow for a low-pass WGBS approach that assays ~10x more of the genome for only ~2x the price of competing reduced representation methods (i.e. arrays and RRBS). In our experience, it is these unexplored regions of the genome that contain the most informative results for studies outside of the cancer research domain; however, these regions should also provide novel insight for cancer researchers. In order to facilitate an understanding of these DMRs and global methylation levels, DMRichR also works as a traditional R package with a number of downstream functions for statistical analysis and data visualization that can be viewed in the R folder.
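To make the idea of coverage weighting concrete, here is a toy sketch in base R. It is not the smoother used by dmrseq or bsseq, which rely on more elaborate local regression; the positions, read counts, the 300 bp window, and the helper name smooth_meth() are all made-up assumptions for illustration.

```r
# Hypothetical CpG data for one sample: position, methylated reads, total reads
cpg <- data.frame(
  pos = c(100, 150, 220, 400, 460),
  m   = c(4, 0, 1, 9, 0),
  cov = c(5, 1, 1, 10, 0)
)

# Coverage-weighted local estimate: pool methylated and total reads from all
# CpGs within `window` bp, so well-covered sites dominate the estimate
smooth_meth <- function(cpg, window = 300) {
  vapply(cpg$pos, function(p) {
    near <- abs(cpg$pos - p) <= window / 2
    sum(cpg$m[near]) / sum(cpg$cov[near])
  }, numeric(1))
}

raw      <- cpg$m / cpg$cov   # NaN where coverage is zero
smoothed <- smooth_meth(cpg)
data.frame(cpg, raw = round(raw, 2), smoothed = round(smoothed, 2))
```

In this toy data the last CpG has no reads of its own, yet it still receives an estimate borrowed from its well-covered neighbor.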

A single command line call performs the steps described under Workflow and Output.

DMR Approach and Interpretation

The main statistical approach applied by the executable script located in the exec folder is dmrseq::dmrseq(), which identifies DMRs in a two-step approach:

DMR Detection: The differences in CpG methylation for the effect of interest are pooled and smoothed to give CpG sites with higher coverage a higher weight, and candidate DMRs with a difference between groups are assembled.
Statistical Analysis: A region statistic for each DMR, which is comparable across the genome, is estimated via the application of a generalized least squares (GLS) regression model with a nested autoregressive correlated error structure for the effect of interest. Then, permutation testing of a pooled null distribution enables the identification of significant DMRs. This approach accounts for both inter-individual and inter-CpG variability across the entire genome.
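The permutation step can be illustrated with a simplified base R sketch. This is an assumption-laden toy, not the dmrseq implementation: the region statistics are invented, the pooled null is simulated rather than derived from permuted group labels, and the exact p-value formula used by dmrseq may differ.

```r
set.seed(1)

# Hypothetical region statistics: observed candidates and a pooled null that
# would normally come from candidate regions found after permuting the groups
observed    <- c(6.2, -4.8, 1.1, 0.4, -7.5)
pooled_null <- rnorm(2000, mean = 0, sd = 2)

# Empirical two-sided p-value: share of null statistics at least as extreme;
# the +1 terms avoid p-values of exactly zero
emp_p <- vapply(observed, function(s) {
  (sum(abs(pooled_null) >= abs(s)) + 1) / (length(pooled_null) + 1)
}, numeric(1))

# Benjamini-Hochberg adjustment to obtain q-values
q <- p.adjust(emp_p, method = "BH")
data.frame(stat = observed, pval = signif(emp_p, 3), qval = signif(q, 3))
```

Because every candidate is compared against the same pooled null, the resulting p-values are comparable across the genome, and the smallest attainable p-value depends on how many null statistics were collected.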

The main estimate of a difference in methylation between groups is not a fold change but rather a beta coefficient, which is representative of the average effect size. However, it is on the scale of the arcsine-transformed differences and must be divided by π (3.14) to be similar to the mean methylation difference over a DMR, which is provided in the percentDifference column. Since the testing is permutation-based, it provides empirical p-values as well as FDR-corrected q-values.
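The rescaling of the beta coefficient is simple arithmetic, shown here with hypothetical values. The numbers are invented for illustration and are not taken from any real DMR output.

```r
# Hypothetical beta coefficients for three DMRs (arcsine scale)
beta <- c(0.5, -0.25, 1.2)

# Divide by pi to get a value comparable to a methylation difference
# expressed as a proportion
approx_diff <- beta / pi

# Express as an approximate percent difference
round(100 * approx_diff, 1)
```

The sign is preserved, so positive values still indicate hypermethylation in the experimental group and negative values indicate hypomethylation.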

One of the key differences between dmrseq and other DMR identification packages, like bsseq, is that dmrseq performs statistical testing on the DMRs themselves rather than testing for differences in single CpGs that are then assembled into DMRs, as bsseq::dmrFinder() does. This approach helps with controlling the false discovery rate and testing the correlated nature of CpG sites in a regulatory region, while also enabling complex experimental designs. However, since dmrseq::dmrseq() does not provide individual smoothed methylation values, bsseq::BSmooth() is utilized to generate individual smoothed methylation values for the DMRs. Therefore, while the DMRs themselves are adjusted for covariates, the individual smoothed methylation values for these DMRs are not.

You can also read my general summary of the dmrseq approach on EpiGenie.

Example DMR: Each dot represents the methylation level of an individual CpG in a single sample, where the size of the dot is representative of coverage. The lines represent smoothed methylation levels for each sample, either control (blue) or DS (red). Genic and CpG annotations are shown below the plot.

Installation

No manual installation of R packages is required, since the required packages are installed and updated automatically when you run the executable script located in the exec folder. However, the package does require Bioconductor, which you can install or update to using:

# requireNamespace() only checks one package per call, so loop over both
for (pkg in c("BiocManager", "remotes")) {
  if (!requireNamespace(pkg, quietly = TRUE))
    install.packages(pkg, repos = "https://cloud.r-project.org")
}
BiocManager::install(version = "3.11")

Additionally, if you are interested in creating your own workflow as opposed to using the executable script, you can download the package using:

Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = TRUE)
BiocManager::install("ben-laufer/DMRichR")

The Design Matrix and Covariates

This script requires a basic design matrix to identify the groups and covariates, which should be named sample_info.xlsx and contain header columns to identify the covariates. The first column of this file should contain the sample names under a header labelled Name. For the testCovariate column (i.e. Group or Diagnosis), the label for the experimental samples should start with a letter that comes later in the alphabet than the one used for the control samples, so that results are reported as experimental vs. control rather than control vs. experimental. You can select which samples from the working directory to analyze through the design matrix: pattern matching only selects Bismark cytosine report files whose name before the first underscore matches a sample name, which also means that sample names should not contain underscores. Within the script, covariates can be selected for adjustment. There are two different ways to account for covariates: adjust for them directly or balance the permutations. Overall, DMRichR supports pairwise comparisons with a minimum of 4 samples (2 per group). For each discrete covariate, you should also aim to have two samples per grouping level.

Name	Diagnosis	Age	Sex
SRR3537014	Idiopathic_ASD	14	M
SRR3536981	Control	42	F
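The labelling advice above follows from how R orders factor levels. The base R sketch below rebuilds the example table as a data frame (an illustrative stand-in for reading sample_info.xlsx) and checks the level order; treating the first level as the reference group is an assumption about how the comparison is set up.

```r
# Hypothetical design matrix mirroring the example rows of sample_info.xlsx
design <- data.frame(
  Name      = c("SRR3537014", "SRR3536981"),
  Diagnosis = c("Idiopathic_ASD", "Control"),
  Age       = c(14, 42),
  Sex       = c("M", "F")
)

# R sorts factor levels alphabetically; the first level acts as the reference
lv <- levels(factor(design$Diagnosis))
lv

# Sample names must not contain underscores, since file matching splits on them
stopifnot(!grepl("_", design$Name))
```

Here "Control" sorts before "Idiopathic_ASD", so the contrast reads as Idiopathic_ASD vs. Control. Underscores are acceptable in the group labels, just not in the sample names.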

Input

DMRichR utilizes Bismark cytosine reports, which are genome-wide CpG methylation count matrices that contain all the CpGs in your genome of interest, including CpGs that were not covered in the experiment. The genome-wide cytosine reports contain important information for merging the top and bottom strand of symmetric CpG sites, which is not present in Bismark coverage and bedGraph files. In general, cytosine reports have the following pattern: *_bismark_bt2_pe.deduplicated.bismark.cov.gz.CpG_report.txt.gz. CpG_Me will generate a folder called cytosine_reports after calling the final QC script (please don't use the cytosine_reports_merged folder for DMRichR). If you didn't use CpG_Me, then you can use the coverage2cytosine module in Bismark to generate the cytosine reports. The cytosine reports have the following format:

chromosome	position	strand	count methylated	count non-methylated	C-context	trinucleotide context
chr2	10470	+	1	0	CG	CGA
chr2	10471	-	0	0	CG	CGG
chr2	10477	+	0	1	CG	CGA
chr2	10478	-	0	0	CG	CGG
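To show why the strand information matters, the following base R sketch reads the four example rows and merges the top and bottom strand of each symmetric CpG. In the workflow this is handled by DMRichR::processBismark(); the column names and the merging logic here are illustrative assumptions rather than the package's code.

```r
# The example rows above as tab-delimited text (cytosine reports have no header)
report_txt <- "chr2\t10470\t+\t1\t0\tCG\tCGA
chr2\t10471\t-\t0\t0\tCG\tCGG
chr2\t10477\t+\t0\t1\tCG\tCGA
chr2\t10478\t-\t0\t0\tCG\tCGG"

report <- read.delim(
  text = report_txt, header = FALSE,
  col.names = c("chr", "pos", "strand", "meth", "unmeth", "context", "tri"),
  stringsAsFactors = FALSE
)

# Shift bottom-strand cytosines back by 1 bp so both strands of a symmetric
# CpG share one coordinate, then sum the counts per site
report$cpg_pos <- ifelse(report$strand == "-", report$pos - 1L, report$pos)
merged <- aggregate(cbind(meth, unmeth) ~ chr + cpg_pos, data = report, FUN = sum)
merged$cov <- merged$meth + merged$unmeth
merged
```

The four strand-specific rows collapse into two CpG sites, each with a combined coverage of 1.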

Before running the executable, ensure you have the following project directory tree structure for the cytosine reports and design matrix:

├── Project
│   ├── cytosine_reports
│   │   ├── sample1_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz
│   │   ├── sample2_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz
│   │   ├── sample_info.xlsx
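The sample selection rule (everything before the first underscore must match a Name in the design matrix) can be sketched in base R as follows. The file list and sample names are hypothetical; in a real project the files could be gathered with list.files() from the cytosine_reports folder.

```r
# Hypothetical file names and design-matrix sample names
files <- c(
  "sample1_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz",
  "sample2_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz",
  "sample3_bismark_bt2.deduplicated.bismark.cov.gz.CpG_report.txt.gz"
)
design_names <- c("sample1", "sample2")

# Everything before the first underscore is treated as the sample name
file_samples <- sub("_.*$", "", basename(files))

# Keep only the reports listed in the design matrix and flag missing samples
selected <- files[file_samples %in% design_names]
missing  <- setdiff(design_names, file_samples)
selected
```

A quick check like this before launching a long run can catch a design matrix entry that has no matching report.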

This workflow requires the following variables:

-g --genome Select either: hg38, hg19, mm10, mm9, rheMac10, rheMac8, rn6, danRer11, galGal6, bosTau9, panTro6, dm6, canFam3, susScr11, or TAIR9. It is also possible to add other genomes with BSgenome, TxDb, and org.db databases by modifying DMRichR::annotationDatabases().
-x --coverage CpG coverage cutoff for all samples, 1x is the default and minimum value.
-s --perGroup Percent of samples per group for CpG coverage cutoff, values range from 0 to 1. 1 (100%) is the default. 0.75 (75%) is recommended if you're getting less than 15 million CpGs assayed when this is set to 1.
-m --minCpGs Minimum number of CpGs for a DMR, 5 is default.
-p --maxPerms Number of permutations for DMR and block analyses, 10 is default.
-o --cutoff The cutoff value for the single CpG coefficient utilized to discover testable background regions, values range from 0 to 1, 0.05 (5%) is the default. If you get more than 5,000 DMRs you should try 0.1 (10%).
-t --testCovariate Covariate to test for significant differences between experimental and control (i.e. Diagnosis).
-a --adjustCovariate Adjust covariates that are continuous (i.e. Age) or discrete with two or more factor groups (i.e. Sex). More than one covariate can be adjusted for using single quotes and the ; delimiter, i.e. 'Sex;Age'
-m --matchCovariate Covariate to balance permutations, which is meant for two-group factor covariates in small sample sizes in order to prevent extremely unbalanced permutations. Only one two-group factor can be balanced (i.e. Sex). Note: This will not work for larger sample sizes (> 500,000 permutations) and is not needed for them, as the odds of sampling an extremely unbalanced permutation for a covariate decrease with increasing sample size. Furthermore, we generally do not use this in our analyses, since we prefer to directly adjust for sex.
-c --cores The number of cores to use. 20 is the default and recommended value, but you can go as low as 1. Using 20 cores requires between 128 and 256 GB of RAM, depending on the number of samples and coverage.
-e --cellComposition A logical (TRUE or FALSE) indicating whether to run an analysis to estimate cell composition in adult whole blood samples. The analysis will only run for hg38 and hg19. This is a beta feature and requires follow up comparisons with similar array-based papers to confirm accuracy.
-k --sexCheck A logical (TRUE or FALSE) indicating whether to run an analysis to confirm the sex listed in the design matrix based on the ratio of the coverage for the Y and X chromosomes. This argument assumes there is a column in the design matrix named "Sex" [case sensitive] with Males coded as either "Male", "male", "M", or "m" and Females coded as "Female", "female", "F", or "f".
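As an illustration of the ; delimiter used by --adjustCovariate, this base R sketch splits the string and assembles a model formula. It is only a sketch of the idea; the variable names are hypothetical and the way DM.R handles these options internally may differ.

```r
# Hypothetical values as they would arrive from the command line
testCovariate   <- "Diagnosis"
adjustCovariate <- "Sex;Age"

# Split the ;-delimited string into individual covariate names
adjust <- if (is.null(adjustCovariate)) character(0) else
  strsplit(adjustCovariate, ";", fixed = TRUE)[[1]]

# Assemble a one-sided model formula with the test covariate first
design_formula <- reformulate(c(testCovariate, adjust))
design_formula
```

Quoting the value on the command line matters because an unquoted ; would be interpreted by the shell as a command separator.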

Generic Example

Below is an example of how to execute the main R script (DM.R) in the exec folder on the command line. This should be called from the working directory that contains the cytosine reports.

call="Rscript \
--vanilla \
/share/lasallelab/programs/DMRichR/DM.R \
--genome hg38 \
--coverage 1 \
--perGroup '1' \
--minCpGs 5 \
--maxPerms 10 \
--cutoff '0.05' \
--testCovariate Diagnosis \
--adjustCovariate 'Sex;Age' \
--sexCheck TRUE \
--cores 20"

echo $call
eval $call

UC Davis Example

If you are using the Barbera cluster at UC Davis, the following commands can be used to execute DM.R from your login node (i.e. epigenerate), where htop should be called first to make sure the whole node is available. This should be called from the working directory that contains the cytosine reports and not from within a screen.

module load R

call="nohup \
Rscript \
--vanilla \
/share/lasallelab/programs/DMRichR/DM.R \
--genome hg38 \
--coverage 1 \
--perGroup '1' \
--minCpGs 5 \
--maxPerms 10 \
--cutoff '0.05' \
--testCovariate Diagnosis \
--adjustCovariate 'Sex;Age' \
--sexCheck TRUE \
--cores 20 \
> DMRichR.log 2>&1 &"

echo $call
eval $call 
echo $! > save_pid.txt

You can then check on the job using tail -f DMRichR.log and press ⌃ Control + c to exit the log view. You can cancel the job from the project directory using cat save_pid.txt | xargs kill. You can also check your running jobs using ps -ef | grep followed by your username, i.e. ps -ef | grep blaufer. Finally, if you still see leftover processes in htop, you can cancel all your processes using pkill -u followed by your username, i.e. pkill -u blaufer.

Alternatively, the executable can also be submitted to the cluster using the shell script via sbatch DM.R.sh.

Workflow and Output

This workflow carries out the following steps:

1) Process Cytosine Reports

DMRichR::processBismark() will load the genome-wide cytosine reports, assign the metadata from the design matrix, and filter the CpGs for equal coverage between the testCovariate as well as any discrete adjustCovariates. There is also an option to confirm the sex of each sample. The end result of this function is a class bsseq object (bs.filtered) that contains the methylated and total count data for each CpG.
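The interaction between --coverage and --perGroup can be sketched with a small coverage matrix in base R. This is a simplified illustration that only considers the groups of the testCovariate; the helper filter_cpgs() and the toy data are assumptions, not the code inside DMRichR::processBismark().

```r
# Hypothetical coverage matrix: rows are CpGs, columns are samples
cov <- matrix(
  c(3, 1, 2, 1,
    0, 2, 1, 1,
    1, 0, 0, 0,
    2, 2, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("cpg", 1:4), paste0("s", 1:4))
)
group <- c("Control", "Control", "DS", "DS")

# Keep a CpG when, in every group, at least `perGroup` of the samples
# reach the `coverage` cutoff
filter_cpgs <- function(cov, group, coverage = 1, perGroup = 1) {
  keep <- rep(TRUE, nrow(cov))
  for (g in unique(group)) {
    frac <- rowMeans(cov[, group == g, drop = FALSE] >= coverage)
    keep <- keep & frac >= perGroup
  }
  keep
}

strict  <- filter_cpgs(cov, group)                  # 1x in all samples
relaxed <- filter_cpgs(cov, group, perGroup = 0.5)  # 1x in half of each group
rbind(strict, relaxed)
```

Lowering perGroup retains more CpGs at the cost of some samples contributing no reads at a given site, which is the trade-off behind the 0.75 recommendation.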

2) Genome-wide Background

DMRichR::getBackground() will generate a csv file with the genome-wide background that is termed the bsseq background. The DM.R workflow uses the dmrseq defined background regions; however, the bsseq background is also provided. The dmrseq and bsseq regions offer very different perspectives, since the dmrseq background regions are defined as the testable regions that show a difference between groups and the DMRs overlap precisely with these background regions, which is ideal for many types of enrichment testing. The bsseq background regions are more representative of genome-wide CpG coverage; however, their size is highly variable and they do not overlap precisely with the DMRs. Therefore, each approach has its own strengths and weaknesses.

3) Blocks

The bsseq object is used to identify large blocks (> 5 kb in size) of differential methylation via dmrseq::dmrseq(), using a different smoothing approach than the DMR calling, one that "zooms out". Compared to the DMR calling, it uses 3x more permutations and a 2x higher minimum CpG cutoff. In addition to bed files and excel spreadsheets with the significant blocks (sigBlocks) and background blocks (blocks), plots of the blocks are generated by dmrseq::plotDMRs(), and an html report with gene annotations is generated through DMRichR::annotateRegions() and DMRichR::DMReport().

4) DMRs

The bsseq object is used to call DMRs through dmrseq::dmrseq(). The DMRs typically range in size from several hundred bp to a few kb. This will generate bed files with the significant DMRs (sigRegions) and background regions (regions).

5) Smoothed Individual Methylation Values

Since dmrseq::dmrseq() smooths the differences between groups, it isn't possible to get individual smoothed methylation values for downstream analyses and visualization. Therefore, the bsseq object is smoothed using bsseq::BSmooth() to create a new bsseq object (bs.filtered.bsseq) with individual smoothed methylation values.

6) ChromHMM and Roadmap Epigenomics Enrichments

Enrichment testing against the chromHMM core 15-state chromatin state model and the related 5 core histone modifications from the 127 Roadmap Epigenomics reference epigenomes is performed using the LOLA package through the DMRichR::chromHMM() and DMRichR::roadmap() functions. The results are also plotted on a heatmap by DMRichR::chromHMM_heatmap() and DMRichR::roadmap_heatmap(). This is currently restricted to the UC Davis cluster due to requiring large external databases; however, an advanced user can download the databases and modify the functions to refer to their local copy.

7) Smoothed DMR Plots

DMRichR::plotDMRs2() uses the smoothed bsseq object to plot the DMRs along with CpG and gene annotations for model organisms.

8) Global Methylation Analyses

DMRichR::globalStats() uses the smoothed bsseq object to test for differences in global and chromosomal methylation with the same adjustments as the DMR calling, where it generates an excel spreadsheet with the results and input. DMRichR::windowsPCA() generates a PCA plot of 20 kb windows and DMRichR::CGiPCA() generates a PCA of CpG island methylation for hg38, hg19, mm10, mm9, and rn6. The PCA plots are generated using ggbiplot and the ellipses in the PCAs represent the 68% confidence interval, which is 1 standard deviation from the mean for a normal distribution. Finally, DMRichR::densityPlot() generates a density plot of the average single CpG methylation levels for the testCovariate.
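For readers who want to see the mechanics of a window-based PCA, here is a minimal base R sketch on simulated data. It uses prcomp() rather than the DMRichR functions or ggbiplot, and the matrix of window methylation values, the group shift, and the sample names are all invented for illustration.

```r
set.seed(42)

# Hypothetical smoothed methylation (%) for 200 windows x 8 samples
group <- rep(c("Control", "DS"), each = 4)
meth <- matrix(rnorm(200 * 8, mean = 70, sd = 5), nrow = 200,
               dimnames = list(NULL, paste0("s", 1:8)))
meth[1:40, group == "DS"] <- meth[1:40, group == "DS"] + 10  # simulated shift

# Samples become rows for prcomp(); windows are centred and scaled
pca <- prcomp(t(meth), center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

round(100 * var_explained[1:2], 1)  # percent variance on PC1 and PC2
head(pca$x[, 1:2])                  # sample coordinates used for plotting

# Share of a normal distribution within 1 SD of the mean (the 68% figure)
pnorm(1) - pnorm(-1)
```

The last line confirms where the 68% figure for the ellipses comes from: roughly 68.3% of a normal distribution lies within one standard deviation of the mean.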

9) DMR Heatmap

DMRichR::smoothPheatmap() generates a heatmap of the results with annotations for discrete covariates. The heatmap shows the hierarchical clustering of Z-scores for the non-adjusted percent smoothed individual methylation values for each DMR, where the Z-score corresponds to the number of standard deviations from the mean value of each DMR.
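The per-DMR Z-score described above can be reproduced in base R as a quick sanity check. The methylation values and sample names below are hypothetical, and hclust() stands in for whatever clustering the heatmap function applies.

```r
# Hypothetical smoothed methylation (%) for 3 DMRs x 4 samples
meth <- matrix(
  c(80, 82, 60, 58,
    30, 28, 45, 47,
    55, 50, 52, 57),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("DMR", 1:3), c("C1", "C2", "DS1", "DS2"))
)

# scale() works on columns, so transpose to standardise each DMR (row)
z <- t(scale(t(meth)))
round(z, 2)

# Hierarchical clustering of the samples on the Z-scores
hc <- hclust(dist(t(z)))
hc$labels[hc$order]
```

After scaling, every DMR has a mean of 0 and a standard deviation of 1 across samples, so DMRs with very different absolute methylation levels become directly comparable in the heatmap.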

10) DMR Annotations

DMRichR::annotateRegions() annotates both the DMRs and background regions with gene symbols and the results are saved as excel spreadsheets. DMRichR::DMReport() creates an html report of the DMR annotations. DMRichR::annotateGenic() and DMRichR::annotateCpGs() will obtain gene region and CpG annotations for the DMRs and background regions for hg38, hg19, mm10, mm9, and rn6 using the annotatr package. Enrichment testing of genic and CpG annotations can be done using the CGi repository.

11) Manhattan and Q-Q plots

DMRichR::manQQ() will take the output of DMRichR::annotateRegions() and use it to generate Manhattan and Q-Q plots with the CMplot package.

12) Gene Ontology Enrichments

Gene ontology enrichments are performed separately for all DMRs, the hypermethylated DMRs, and the hypomethylated DMRs. All results are saved as excel spreadsheets. There are three approaches used, which are based on R programs that interface with widely used tools:

A) `rGREAT`

rGREAT enables the GREAT approach, which works for hg38, hg19, mm10, and mm9. It performs testing based on genomic coordinates and relative to the background regions. It uses the default GREAT settings, where regions are mapped to genes if they are within a basal regulatory domain of 5 kb upstream and 1 kb downstream; however, it also extends to further distal regions and includes curated regulatory domains.

B) `GOfuncR`

GOfuncR enables the FUNC approach, which works for all genomes and is our preferred method. It utilizes genomic coordinates and performs permutation-based enrichment testing for the DMRs relative to the background regions. Regions are only mapped to genes if they are between 5 kb upstream and 1 kb downstream.

C) `enrichR`

enrichR enables the Enrichr approach, which is based on gene symbols and uses the closest gene to a DMR. It works for all mammalian genomes. While it doesn't utilize genomic coordinates or background regions, it offers a number of extra databases.

Plots

Finally, DMRichR::GOplot() will take the significant results from all tools, slim them using REVIGO, and then plot the top least dispensable significant terms. This reduces the redundancy of closely related terms and allows for a more comprehensive overview of the top ontologies.

13) Machine Learning

DMRichR::methylLearn() utilizes random forest and support vector machine algorithms from Boruta and sigFeature in a feature selection approach to identify the most informative DMRs based on individual smoothed methylation values. It creates an excel spreadsheet and an html report of the results along with a heatmap.

14) Cell Composition Estimation

The epigenome is defined by its ability to create cell type specific differences. Therefore, when assaying heterogeneous sample sources, it is standard for array-based methylation studies to estimate cell type composition and adjust for it in their model. While this is standard for array-based studies, it is a significant challenge for WGBS studies due to differences in the nature of the data and the lack of appropriate reference sets and methods. In order to address this, we offer two approaches, both of which provide statistics and plots through DMRichR::CCstats() and DMRichR::CCplot(). However, unlike the rest of DMRichR, this is a beta feature that you need to further investigate by comparing to array studies that are similar to yours.

A) The Houseman Method

The Houseman method is a standard for arrays and we have adapted it to work with WGBS data. The workflow will convert the smoothed bsseq object to a matrix of beta values for all EPIC array probes. It will then estimate cell composition using the IDOL reference CpGs in a modified Houseman method via FlowSorted.Blood.EPIC::projectCellType_CP(). If you use the results from this method you should also cite: 1, 2, 3, and 4.

B) The methylCC Method

methylCC is designed to be technology independent by identifying DMRs that define cell types. The workflow uses bumphunter() to find cell type specific DMRs in an array reference database and then examines those regions within your dataset. In this case, it has been modified to utilize the FlowSorted.Blood.EPIC reference dataset and quantile normalization. If you use the results from this method you should also cite: 1 and 2.

15) RData

The output from the main steps is saved in the RData folder so that it can be loaded for custom analyses or to resume an interrupted run:

settings.RData contains the parsed command line options given to DMRichR as well as the annotation database variables. These variables are needed for many of the DMRichR functions, and if you need to reload them, you should also run DMRichR::annotationDatabases(genome) after, since some of the annotation databases have temporary pointers.

bismark.RData contains bs.filtered, which is a bsseq object that contains the filtered cytosine report data and the metadata from sample_info.xlsx in the pData.

Blocks.RData contains blocks, which is a GRanges object of the background blocks. This can be further filtered to produce the sigBlocks object if significant blocks are present.

DMRs.RData contains regions and sigRegions, which are GRanges objects with the background regions and DMRs, respectively.

bsseq.RData contains bs.filtered.bsseq, which is a bsseq object that has been smoothed by bsseq::BSmooth and is used for the individual methylation values (but not the DMR or block calling by dmrseq, which uses a different smoothing approach).

machineLearning.RData contains methylLearnOutput, which is the output from the machine learning feature selection.

cellComposition_Houseman.RData and cellComposition_methylCC.RData contain the output from the cell composition estimation analyses. CC is from the Houseman method, while methylCC and ccDMRs are from the methylCC method.
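Reloading the saved objects for a custom analysis might look like the sketch below. It assumes the RData folder sits in the current working directory and that settings.RData stores an object named genome (inferred from the DMRichR::annotationDatabases(genome) call mentioned above); files that are absent are simply skipped.

```r
# File names follow the list above; the RData folder is assumed to be in the
# current working directory
rdata_files <- file.path("RData", c("settings.RData", "bismark.RData",
                                    "DMRs.RData", "bsseq.RData"))

# Load whichever files exist and collect the names of the restored objects
loaded <- unlist(lapply(rdata_files[file.exists(rdata_files)], load,
                        envir = globalenv()))
loaded

# After reloading settings.RData, refresh the annotation databases
# (assumes an object called `genome` was restored)
if ("genome" %in% loaded) {
  DMRichR::annotationDatabases(genome)
}
```

Loading only the files needed for a given analysis keeps memory use down, since the bsseq objects can be large.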

Citation

If you use DMRichR in published research please cite the following 3 articles:

Laufer BI, Hwang H, Vogel Ciernia A, Mordaunt CE, LaSalle JM. Whole genome bisulfite sequencing of Down syndrome brain reveals regional DNA hypermethylation and novel disease insights. Epigenetics, 2019. doi: 10.1080/15592294.2019.1609867

Korthauer K, Chakraborty S, Benjamini Y, and Irizarry RA. Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics, 2018. doi: 10.1093/biostatistics/kxy007

Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 2012. doi: 10.1186/gb-2012-13-10-r83

Publications

The following publications utilize DMRichR:

Laufer BI, Hwang H, Jianu JM, Mordaunt CE, Korf IF, Hertz-Picciotto I, LaSalle JM. Low-Pass Whole Genome Bisulfite Sequencing of Neonatal Dried Blood Spots Identifies a Role for RUNX1 in Down Syndrome DNA Methylation Profiles. bioRxiv preprint. doi: 10.1101/2020.06.18.157693

Murat El Houdigui S, Adam-Guillermin C, Armant O. Ionising Radiation Induces Promoter DNA Hypomethylation and Perturbs Transcriptional Activity of Genes Involved in Morphogenesis during Gastrulation in Zebrafish. International Journal of Molecular Sciences, 2020. doi: 10.3390/ijms21114014

Wöste M, Leitão E, Laurentino S, Horsthemke B, Rahmann S, Schröder C. wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data. BMC Bioinformatics, 2020. doi: 10.1186/s12859-020-3470-5

Mordaunt CE, Jianu JM, Laufer BI, Zhu Y, Dunaway KW, Bakulski KM, Feinberg JI, Volk HE, Lyall K, Croen LA, Newschaffer CJ, Ozonoff S, Hertz-Picciotto I, Fallin DM, Schmidt RJ, LaSalle JM. Cord blood DNA methylome in newborns later diagnosed with autism spectrum disorder reflects early dysregulation of neurodevelopmental and X-linked genes. bioRxiv preprint. doi: 10.1101/850529

Lopez SJ, Laufer BI, Beitnere U, Berg E, Silverman JL, Segal DJ, LaSalle JM. Imprinting effects of UBE3A loss on synaptic gene networks and Wnt signaling pathways. Human Molecular Genetics, 2019. doi: 10.1093/hmg/ddz221

Vogel Ciernia A*, Laufer BI*, Hwang H, Dunaway KW, Mordaunt CE, Coulson RL, Yasui DH, LaSalle JM. Epigenomic convergence of genetic and immune risk factors in autism brain. Cerebral Cortex, 2019. doi: 10.1093/cercor/bhz115

Acknowledgements

The development of this program was supported by a Canadian Institutes of Health Research (CIHR) postdoctoral fellowship [MFE-146824] and a CIHR Banting postdoctoral fellowship [BPF-162684]. Hyeyeon Hwang developed methylLearn() and the sex checker for processBismark(). Charles Mordaunt developed getBackground() and plotDMRs2() as well as the CpG filtering approach in processBismark(). I would also like to thank Keegan Korthauer, Matt Settles, Rochelle Coulson, Blythe Durbin-Johnson, Annie Vogel Ciernia, Nikhil Joshi, and Ian Korf for invaluable discussions related to the bioinformatic approaches utilized in this repository.
