pb-CpG-tools

A collection of tools for the analysis of CpG/5mC data from PacBio HiFi reads aligned to a reference genome (e.g., an aligned BAM). To use these tools, the HiFi reads should already contain 5mC base modification tags, generated on-instrument or by using primrose. The aligned BAM should also be sorted and indexed.

Available Tools

aligned_bam_to_cpg_scores.py: Obtain a list of CpG/5mC sites and modification probabilities from a BAM file of HiFi reads aligned to a reference genome.

Requirements

Environment

There are several Python packages required to run the aligned_bam_to_cpg_scores.py script. These can easily be installed using conda and the conda_env_cpg.yaml file provided:

# create conda environment
$ conda env create -f conda_env_cpg.yaml

# activate environment
$ conda activate cpg

These packages could also be installed manually. However, the specific package versions listed in the yaml file must be used to ensure a stable environment.

Input BAM

The input BAM should be sorted and have an associated index file (.bai). The HiFi reads in the aligned input BAM file must contain the SAM tags MM and ML which encode base modifications. The tags are described in the SAM tag specification document. These are the default tags generated by primrose for HiFi CpG datasets.

Tag	Type	Description
`MM`	`Z`	Base modifications / methylation
`ML`	`B,C`	Base modification probabilities

Notes for ML: The continuous probability range of 0.0 to 1.0 is remapped to the discrete integers 0 to 255 inclusively. The probability range corresponding to an integer N is N/256 to (N + 1)/256.

These tools are explicitly designed for 5mC tags (e.g., MM:Z:C+m) and do not support other modification types or tags in multiple-modification format.

Usage

Usage:
  python aligned_bam_to_cpg_scores.py -b input.bam -f ref.fasta -o label [options]

Mandatory arguments:
  -b, --bam             FILE    The aligned BAM file.
  -f, --fasta           FILE	The reference fasta file.
  -o, --output_label    STR     Label for output files, which results in [label].bed/bw.
  
Optional arguments:
  -p, --pileup_mode     count/model    Use a model-based approach to score modifications across sites 
                                       (model) or a simple count-based approach (count). [count]
  -d, --model_dir       STR     Full path to the directory containing the model (*.pb files) to 
                                load. [None]
  -m, --modsites        denovo/reference    Only output CG sites with a mod probability > 0 (denovo), 
                                            or output all CG sites based on the supplied reference 
                                            fasta (reference). [denovo]
  -c, --min_coverage    INT     Minimum coverage required for filtered outputs. [4]
  -q, --min_mapq        INT     Ignore alignments with MAPQ < N. [0]
  -a, --hap_tag         STR     The SAM tag containing haplotype information. [HP]
  -s, --chunksize       INT     Break reference regions into chunks of this size for parallel 
                                processing. [500000]
  -t, --threads         INT     Number of threads for parallel processing. [1]
  -h, --help                    Show this help message and exit.

Details

The -p, --pileup_mode selects the method used to calculate modification probabilities.

model: This approach is preferred. It uses distributions of modification scores and a machine-learning model to calculate the modification probabilities across CG sites. Using this option requires providing the path to the directory containing the model files with the -d, --model_dir flag. The required model is available in the pileup_calling_model/ directory.
count: This method is simplistic. For a given site, the number of reads with a modification score of >0.5 and <0.5 are counted and the modification probability is given as a percentage.

The -m, --modsites flag determines which sites will be reported.

denovo: This option will identify and output all CG sites found in the consensus sequence from the reads in the pileup. This allows reporting of CG sites with zero modification probability. This mode does not ensure that the reference also displays a CG site (e.g., there could be sequence mismatches between the reads and reference).
reference: This option will identify and output all CG sites found in the reference sequences. This allows reporting of CG sites with zero modification probability. This mode does not ensure that aligned reads also display a CG site (e.g., there could be sequence mismatches between the reads and reference).

Using the -a, --hap_tag flag allows an arbitrary SAM tag to be used to identify haplotypes, rather than the default HP tag. The haplotype values must be 0, 1, and 2, where 0 is not assigned/ambiguous.

Outputs

There are bed format (.bed) and bigwig format (.bw) files generated for the complete read set and each separate haplotype (when available).

In addition, there are coverage filtered version of these files produced that are based on the minimum coverage value set for -c, --min_coverage.

A log file is also produced ([label]-aligned_bam_to_cpg_scores.log), which contains useful information.

If haplotype information is absent, the following 4 files are expected:

[label].total.[pileup_mode].bed
[label].total.[pileup_mode].bw
[label].total.[pileup_mode].mincov[X].bed
[label].total.[pileup_mode].mincov[X].bw

If haplotype information is present, the following 12 files are expected:

[label].total.[pileup_mode].bed
[label].hap1.[pileup_mode].bed 
[label].hap2.[pileup_mode].bed
[label].total.[pileup_mode].bw
[label].hap1.[pileup_mode].bw 
[label].hap2.[pileup_mode].bw
[label].total.[pileup_mode].mincov[X].bed
[label].hap1.[pileup_mode].mincov[X].bed 
[label].hap2.[pileup_mode].mincov[X].bed
[label].total.[pileup_mode].mincov[X].bw
[label].hap1.[pileup_mode].mincov[X].bw 
[label].hap2.[pileup_mode].mincov[X].bw

The bed file columns will differ between the -p model and -p count methods, but both share the first six columns:

reference name
start coordinate
end coordinate
modification probability
haplotype
coverage

For -p count, four additional columns are present:

modified site count
unmodified site count
avg mod score
avg unmod score

For -p model, three additional columns are present:

estimated modified site count (extrapolated from model modification probability)
estimated unmodified site count (extrapolated from model modification probability)
discretized modification probability (calculated from estimated mod/unmod site counts)

The bigwig files are in an indexed binary format and contain columns 1-4 listed above, and are preferred for loading CpG/5mC tracks in IGV.

Example Data

An aligned BAM file containing HiFi reads with 5mC tags (HG002.GRCh38.haplotagged.bam) is freely available for download: https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/

The sample is HG002/NA24385 from the Human Pangenome Reference Consortium HG002 Data Freeze v1.0, and is aligned to GRCh38. There are also four unaligned bam files containing the HiFi reads.

Performance

The current -s, --chunksize default of 500,000 requires 1-3 Gb memory per thread with a 30X coverage aligned BAM. A higher coverage dataset would use more memory per thread.

Benchmarks were run using the HG002.GRCh38.haplotagged.bam dataset described in the above section. Peak memory was estimated using 3Gb per thread.

Threads (`-t`)	Chunk Size (`-s`)	Wall Time (`h:m:s`)	Estimated Peak Memory
48	500000 (default)	3:15:56	144 Gb
36	500000 (default)	3:58:06	108 Gb
24	500000 (default)	5:19:03	72 Gb
12	500000 (default)	8:58:01	36 Gb

Changelog

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

miller-alexander / pb-cpg-tools Goto Github PK

pb-cpg-tools's Introduction

pb-CpG-tools

Available Tools

Requirements

Environment

Input BAM

Usage

Details

Outputs

Example Data

Performance

Changelog

DISCLAIMER

pb-cpg-tools's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent