tushiqi / manorm2 Goto Github PK

MAnorm2 for Normalizing and Comparing ChIP-seq Samples

R 10.52% HTML 89.48%

chip-seq differential-analysis empirical-bayes r-package winsorize-values

manorm2's Introduction

Title	Author	Date	Contact
MAnorm2 1.2.2	Shiqi Tu	2022-10-28	[email protected]

Introduction

MAnorm2 is designed for normalizing and comparing ChIP-seq signals across individual samples or groups of samples. The latest version of MAnorm2 is always available in the CRAN repository and can thus be easily installed by typing install.packages("MAnorm2") in an R session.

For older versions of MAnorm2, select a package from under the dist folder, download it, and type install.packages("/path/to/the/package", repos = NULL). In this way, you may need to pre-install some dependencies of MAnorm2. The current dependencies of the latest MAnorm2 version include locfit (>= 1.5.9), scales (>= 0.3.0), and statmod (>= 1.4.34). All these packages are available in the CRAN repository. For dependencies of other MAnorm2 versions, refer to the Imports field in the corresponding DESCRIPTION file.

Sections below give a brief description of the application scope of MAnorm2 as well as its capability. For a full documentation of MAnorm2, download the HTML version of its vignette from here and use a browser to open it, or type the following code in an R session after installing MAnorm2:

browseVignettes("MAnorm2")

Format of Input Data

For employing the machinery implemented in MAnorm2, you need to prepare a table that profiles the ChIP-seq signal in each of a list of genomic intervals for each of a set of ChIP-seq samples. The following table provides such an instance:

chrom	start	end	s1.read_cnt	s2.read_cnt	s1.occupancy	s2.occupancy
chr1	28112	29788	115	4	1	0
chr1	164156	166417	233	194	1	1
chr1	166417	168417	465	577	1	1
chr1	168417	169906	15	34	0	1

(See the H3K27Ac dataset bundled with MAnorm2 for another example, and type library(MAnorm2); ?H3K27Ac in an R session for a detailed description of it.)

To be specific, each row of the table represents a genomic interval; each of the read_cnt variables corresponds to a ChIP-seq sample and gives the numbers of reads from the sample that fall within the genomic intervals (i.e., the raw read counts); the occupancy variables correspond to the read_cnt variables one by one and specify the occupancy status of each genomic interval in each sample. Note that an occupancy status of 1 indicates the interval is enriched with reads in the sample (compared with, for example, the surrounding genomic regions or the corresponding input sample). In practice, the occupancy status of a genomic interval in a certain ChIP-seq sample could be determined by its overlap with the peaks of the sample. Note also that MAnorm2 refers to an interval as occupied by a sample if the interval is enriched with reads in the sample.

MAnorm2_utils is specifically designed to coordinate with MAnorm2, and we strongly recommend using it to create input tables of MAnorm2.

Application Scope

Although MAnorm2 has been designed to process ChIP-seq data, it could be applied in principle to the analysis of any type of data with a similar structure, including DNase-seq, ATAC-seq and RNA-seq data. The only problem associated with such extensions is how to naturally define "peaks" for specific data types.

Most of the peak callers originally devised for ChIP-seq data (e.g., MACS 1.4) also work for DNase-seq and ATAC-seq data. For RNA-seq data, each row of the input table should stand for a gene, and we recommend setting a cutoff (e.g., 20) of raw read count to define "peak" genes.

Continuous Distribution

In spite of the discrete nature of read counts, MAnorm2 uses continuous distribution to model ChIP-seq data by first transforming raw read counts into raw signal intensities. By default, MAnorm2 completes the transformation by simply adding an offset count to each raw count and taking a base-2 logarithm. Practical ChIP-seq data sets, however, may be associated with various confounding factors, including batch effects, local sequence compositions and background signals measured by input samples. On this account, functions in MAnorm2 have been designed to be independent of the specific transformation employed. And any methods for correcting for confounding factors could be applied before invoking MAnorm2, as long as the resulting signal intensities could be approximately modeled as following the normal distribution (in particular, consider carefully whether it is necessary to apply a logarithmic transformation in the final step).

The primary reason for which MAnorm2 models ChIP-seq signals as continuous random variables is that the mathematical theory of count distributions is far less tractable than that of the normal distribution. For example, current statistical methods based on the negative binomial distribution are frequently relied on approximations of various kinds. Specifically, variance (or dispersion) estimates for individual genomic intervals are typically treated as known parameters, and their uncertainty can hardly be incorporated into the statistical tests for identifying differential signals.

Besides, after an extensive correction for confounding factors, the resulting data range is almost certainly not limited to non-negative integers, and the data may have lost their discrete nature and be more akin to a continuous distribution. Moreover, transforming read counts towards the normal distribution unlocks the application of a large repository of mature statistical methods that are initially developed for analyzing continuous measurements (e.g., intensity data from microarray experiments). Refer to the voom article for a detailed discussion of this topic.

Normalization

MAnorm2 implements a robust method for normalizing raw signal intensities across any number of ChIP-seq samples.

Technically, it considers the common peak regions of two ChIP-seq samples (i.e., the genomic intervals occupied by both samples; see also the section of Format of Input Data) to have globally invariant signals between them. Based on this assumption, MAnorm2 applies a linear transformation to the raw signal intensities of one of the two samples such that

the resulting M values (differences in signal intensities between the two samples) at the common peak regions have an arithmetic mean of 0;
sample Pearson correlation coefficient between the resulting M and A values at the common peak regions is 0 (A values refer to average signal intensities across the two samples).

This procedure is for normalizing a pair of ChIP-seq samples. It can be extended to normalization of any number of samples so that the normalized signal intensities are comparable across all of them. For more information regarding the normalization method implemented in MAnorm2, type the following code in an R session after installing it:

library(MAnorm2)
?normalize
?normBioCond

Differential and Hypervariable Analyses

MAnorm2 designs a self-contained system of utility functions for calling differential ChIP-seq signals between two or more biological conditions, each with or without replicate samples. It also implements a method named HyperChIP for calling hypervariable ChIP-seq signals across samples. To be noted, the framework implemented in MAnorm2 for differential and hypervariable analyses requires the input signal intensities to have been normalized but is independent of the specific normalization method. This means that any normalization and/or bias correction tools could be adapted to MAnorm2, as long as the resulting signal measurements are suited to be modeled by normal distribution. For example, for highly regular samples, you might want to perform the normalization based on their size factors (refer to the normalizeBySizeFactors function in MAnorm2 for details).

Technically, MAnorm2 has implemented an S3 class named "bioCond" for grouping ChIP-seq samples belonging to the same biological condition, and it has devised a number of functions for handling objects of this class. Taking advantage of these functions, you can

call genomic intervals with differential ChIP-seq signals between two or more bioConds;
call genomic intervals with hypervariable ChIP-seq signals across multiple samples or bioConds;
perform hierarchical clustering on a set of ChIP-seq samples or bioConds by measuring the distance (i.e., dissimilarity) between each pair of them.

Note that, for the samples grouped into a bioCond, MAnorm2 models the relationship between observed mean signal intensities at individual intervals and the associated (observed) variances. In practice, this modeling strategy could compensate for the lack of sufficient replicates for deriving accurate variance estimates for individual intervals. And each of the above analyses takes advantage of the modeling of mean-variance trend to improve variance estimation.

For an overview of the interface functions provided by MAnorm2, type the following code in an R session after installing it:

library(MAnorm2)
?MAnorm2

Citation

To cite the MAnorm2 package in publications, please use

Tu, S., et al., MAnorm2 for quantitatively comparing groups of ChIP-seq samples. Genome Res, 2021. 31(1): p. 131-145.

If you have performed MA normalization with a pseudo-reference profile as baseline, or have employed a Winsorization-based robust parameter estimation framework, or have performed a hypervariable analysis, please cite additionally

Chen, H., et al., HyperChIP: identification of hypervariable signals across ChIP-seq or ATAC-seq samples. Genome Biol, 2022. 23(1): p. 62.

manorm2's People

Contributors

Stargazers

Watchers

Forkers

wwang-chcn rnaimehaom

manorm2's Issues

printing the results of MAnorm2

Dear Shiqi,

thank you for your work on MAnorm2 and for the tutorial. Could you please let me know how can I print the results of the function diffTest ? Thank you !

Bogdan

fitMeanVarCurve error

library(MAnorm2)
data <- read.table("D:\12-8/1-7/epic/MAnorm2/epiccallpeak_0.05_h3k_profile_bins.xls",header = T)
norm <- normalize(data, count = 6:7, occupancy = 12:13)
norm <- normalize(norm, count = 8:9, occupancy = 14:15)
conds <- list(d65 = bioCond(norm[6:7], norm[12:13], name = "d65"),d90 = bioCond(norm[8:9], norm[14:15], name = "d90"))
conds <- normBioCond(conds
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE,init.coef = c(0.1, 10))

During the parametric fit procedure:
After iteration 1: coef = (0.0821818, 2.01018); 1361 (0.78%) outlier(s) detected
After iteration 2: coef = (0.0497606, 2.1688); 1439 (0.82%) outlier(s) detected
After iteration 3: coef = (0.0276251, 2.63155); 1388 (0.79%) outlier(s) detected
After iteration 4: coef = (0.0131094, 2.95489); 1379 (0.79%) outlier(s) detected
After iteration 5: coef = (0.00597452, 3.119); 1418 (0.81%) outlier(s) detected
After iteration 6: coef = (0.00337247, 3.17739); 1420 (0.81%) outlier(s) detected
After iteration 7: coef = (0.00145952, 3.1951); 1439 (0.82%) outlier(s) detected
After iteration 8: coef = (0.000429077, 3.24471); 1451 (0.83%) outlier(s) detected
After iteration 9: coef = (6.32897e-05, 3.25055); 1459 (0.83%) outlier(s) detected
After iteration 10: coef = (0.000623793, 3.23509); 1450 (0.83%) outlier(s) detected
After iteration 11: coef = (0.000624391, 3.23728); 1449 (0.83%) outlier(s) detected
After iteration 12: coef = (0.00129248, 3.22126); 1438 (0.82%) outlier(s) detected
After iteration 13: coef = (0.000451543, 3.24328); 1451 (0.83%) outlier(s) detected
After iteration 14: coef = (0.000940307, 3.22948); 1441 (0.82%) outlier(s) detected
After iteration 15: coef = (0.00266528, 3.1899); 1427 (0.82%) outlier(s) detected
After iteration 16: coef = (0.000354502, 3.24825); 1450 (0.83%) outlier(s) detected
After iteration 17: coef = (0.000917505, 3.23022); 1441 (0.82%) outlier(s) detected
After iteration 18: coef = (0.00239003, 3.19667); 1431 (0.82%) outlier(s) detected
After iteration 19: coef = (0.00504742, 3.13537); 1420 (0.81%) outlier(s) detected
After iteration 20: coef = (0.00144397, 3.2225); 1430 (0.82%) outlier(s) detected
After iteration 21: coef = (6.27676e-05, 3.2536); 1458 (0.83%) outlier(s) detected
After iteration 22: coef = (0.000623851, 3.23535); 1450 (0.83%) outlier(s) detected
After iteration 23: coef = (0.000624447, 3.23728); 1449 (0.83%) outlier(s) detected
After iteration 24: coef = (0.00133988, 3.2201); 1436 (0.82%) outlier(s) detected
After iteration 25: coef = (0.000587136, 3.24076); 1450 (0.83%) outlier(s) detected
After iteration 26: coef = (0.000538113, 3.23912); 1450 (0.83%) outlier(s) detected
Converged.

There were 50 or more warnings (use warnings() to see the first 50)
1: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
2: step size truncated due to divergence
3: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
4: step size truncated due to divergence
5: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
6: step size truncated due to divergence
7: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
8: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
9: step size truncated due to divergence
10: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
11: step size truncated due to divergence
12: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
13: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
14: step size truncated due to divergence
15: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
16: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
17: step size truncated due to divergence
18: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
19: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
20: step size truncated due to divergence
21: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
22: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
23: step size truncated due to divergence
24: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
25: glm.fit: algorithm did not converge
26: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
27: step size truncated due to divergence
28: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
29: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
30: step size truncated due to divergence
31: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
32: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
33: step size truncated due to divergence
34: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
35: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
36: step size truncated due to divergence
37: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
38: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
39: step size truncated due to divergence
40: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
41: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
42: step size truncated due to divergence
43: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
44: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
45: step size truncated due to divergence
46: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
47: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
48: step size truncated due to divergence
49: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs
50: In log(ifelse(y == 0, 1, y/mu)) : 产生了NaNs

Do these warning will affect the final results?
why the warnings occur?
Thanks advance!

fitMeanVarCurve: Error: no valid set of coefficients

Hi, I am testing MAnorm2 on my ChIP-seq data. Its two samples with two replicates each, of a transcription factor.

To save time I am currently working with one chromosome, I successfully created the input-file with profile_bins:

Produced by profile_bins, version=1.0.0

Equivalent parameter settings:
peaks=s11_peaks.bed,s12_peaks.bed,s21_peaks.bed,s22_peaks.bed
reads=S11.bed,S12.bed,S21.bed,S22.bed
labs=s1,s1,s2,s2
n=MAnorm2_result
min-peak-gap=150
typical-bin-size=2000
summits=using_middle_points
shiftsize=100
paired=False
keep-dup=all
method=byBin
fix-bin-size=False

Summary of peak numbers in peak files:
peak_file	peaks_after_filtering	peaks_after_merging
s11_peaks.bed	5491	5491
s12_peaks.bed	5491	5491
s21_peaks.bed	5393	5393
s22_peaks.bed	5393	5393

Divide merged peaks into reference genomic bins:
9685 bins in total

Summary of read counts in read files:
read_file	total_reads	redundant_reads	redundancy_ratio
S11.bed	909875	0	0.0%
S12.bed	911507	0	0.0%
S21.bed	908564	0	0.0%
S22.bed	909814	0	0.0%

Number of reads falling within reference genomic bins:
read_file	reads_after_filtering	reads_within_bins	within_ratio
S11.bed	909875	308201	33.9%
S12.bed	911507	309885	34.0%
S21.bed	908564	307243	33.8%
S22.bed	909814	307258	33.8%

Finished, check the output file:
MAnorm2_result_profile_bins.xls

and then in R this happens:

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(MAnorm2)
MAnorm2 1.0.0 2019-12-01
> data <- read.table('MAnorm2_result_profile_bins.xls',header = T)
> norm <- normalize(data, count = 4:7, occupancy = 8:11)
> conds <- list(s1 = bioCond(norm[4:5], norm[8:9], name = 's1'), s2 = bioCond(norm[6:7], norm[10:11], name = 's2'))
> conds <- normBioCond(conds)
> conds <- fitMeanVarCurve(conds, method = 'parametric', occupy.only = TRUE)
During the parametric fit procedure:
Error: no valid set of coefficients has been found: please supply starting values
In addition: Warning message:
In log(ifelse(y == 0, 1, y/mu)) : NaNs produced
Execution halted

> head(data)
  chrom   start     end s1.read_cnt s1.read_cnt.1 s2.read_cnt s2.read_cnt.1 s1.occupancy
1 chr19 3015000 3016000          52            40          34            43            1
2 chr19 3028000 3031000          52            46          54            49            1
3 chr19 3037000 3038000           6             8          12             8            0
4 chr19 3046000 3047000          18            18           9            11            1
5 chr19 3066000 3067000          31            47          43            39            0
6 chr19 3080000 3081500          23            26          23            34            1
  s1.occupancy.1 s2.occupancy s2.occupancy.1
1              1            0              0
2              1            0              0
3              0            1              1
4              1            0              0
5              0            1              1
6              1            0              0

I could not find any obvious differences to the demo data set, but when I us the local regression method, it works:

> conds <- fitMeanVarCurve(conds, method = 'local', occupy.only = TRUE)
During the local regression procedure:
After iteration 1: 382 (3.57%) outlier(s) detected
After iteration 2: 383 (3.57%) outlier(s) detected
After iteration 3: 384 (3.58%) outlier(s) detected
After iteration 4: 384 (3.58%) outlier(s) detected
Converged.

Specific difference peak between two tissues

Hi, I have a preliminary question for you.
In a result with no sample duplication, for example, in terms of the sample data, does it mean that the Mval value is positive and the GM12892 group is more open than the GM12891 group?

Because I just learned this part of knowledge, I am not quite clear about this basic question, please reply to me and answer, thank you!

Multi variable analysis

This isn't really an issue, but I wanted to know if MAnorm2 was capable of multi-variant analysis in the same way DESeq2 is. For example, if we wanted to remove the effects of a "condition" vs the "group" in DESeq2 we would use the design ~ group + condition. Is there anything like that for MAnorm2?

about cutoff for HVP selection?

Dear MAnorm2 developer,
We get result of HVPs by HyperChIP of MAnorm2 as follow.
We wonder what cutoff is better for HVP selection?
Two questions:

padj<0.05 is one basic cutoff?
What is the role of "observed.var" and "fold.change" and how can we select them?

#> observed.mean observed.var prior.var fold.change pval padj
#> 1 4.123172 0.62767325 3.3719506 0.18614545 0.108561188 0.3391233
#> 2 8.835486 0.01362584 0.2844984 0.04789425 0.008610023 0.1013005
#> 3 5.794082 0.16260797 1.4398955 0.11293040 0.043945034 0.2164067
#> 4 4.580327 4.82749982 2.6687559 1.80889526 0.247900708 0.5171521
#> 5 8.302970 0.07726829 0.3872625 0.19952434 0.122594081 0.3598123
#> 6 4.440024 1.65003664 2.8669925 0.57552876 0.639231994 0.8224108

How to get the common, specific peaks list?

Dear Shiqi,

Using the MAplot function, we can visualize the common, X specific, Y specific and others peaks plot.

But, how to get the common, X specific, Y specific and others peaks list?

My understanding is that the occupancy state: 0,1, represents whether or not the sample contains the peak. 1 means yes, 0 means no, and if both are 1, then the peak is common. I wonder if my understanding is correct？

Thank you very much for your reply~

issues about Figure2B in the article

Dear MAnorm2 developer,

I am new to the Chip-Seq Analysis. And recently, I am learing to use MAnorm2 to normalize my data, and also repet the result of Figure2B. Here are my question:

1、What are the X and Y axis of Figure2B mean, how do you rank promoters, is it throught the T statistic?
2、And how do you calculate the proportion linked with DEGs？

I am sorry if I asked silly questions. Any answers of you would be appreciated.
Looking forward to your precious reply. Thanks a lot!!

Peak files

Do the peak files are produced by using Ip samples to callpeak with the normalization of Input samples or just using Ip samples to callepeak？