Coder Social home page Coder Social logo

miller-alexander / pb-cpg-tools Goto Github PK

View Code? Open in Web Editor NEW

This project forked from pacificbiosciences/pb-cpg-tools

0.0 0.0 0.0 223 KB

Collection of tools for the analysis of CpG data

License: BSD 3-Clause Clear License

Python 73.10% PureBasic 26.90%

pb-cpg-tools's Introduction

pb-CpG-tools

A collection of tools for the analysis of CpG/5mC data from PacBio HiFi reads aligned to a reference genome (e.g., an aligned BAM). To use these tools, the HiFi reads should already contain 5mC base modification tags, generated on-instrument or by using primrose. The aligned BAM should also be sorted and indexed.

Available Tools

  • aligned_bam_to_cpg_scores.py: Obtain a list of CpG/5mC sites and modification probabilities from a BAM file of HiFi reads aligned to a reference genome.

Requirements

Environment

There are several Python packages required to run the aligned_bam_to_cpg_scores.py script. These can easily be installed using conda and the conda_env_cpg.yaml file provided:

# create conda environment
$ conda env create -f conda_env_cpg.yaml

# activate environment
$ conda activate cpg

These packages could also be installed manually. However, the specific package versions listed in the yaml file must be used to ensure a stable environment.

Input BAM

The input BAM should be sorted and have an associated index file (.bai). The HiFi reads in the aligned input BAM file must contain the SAM tags MM and ML which encode base modifications. The tags are described in the SAM tag specification document. These are the default tags generated by primrose for HiFi CpG datasets.

Tag Type Description
MM Z Base modifications / methylation
ML B,C Base modification probabilities

Notes for ML: The continuous probability range of 0.0 to 1.0 is remapped to the discrete integers 0 to 255 inclusively. The probability range corresponding to an integer N is N/256 to (N + 1)/256.

These tools are explicitly designed for 5mC tags (e.g., MM:Z:C+m) and do not support other modification types or tags in multiple-modification format.

Usage

Usage:
  python aligned_bam_to_cpg_scores.py -b input.bam -f ref.fasta -o label [options]

Mandatory arguments:
  -b, --bam             FILE    The aligned BAM file.
  -f, --fasta           FILE	The reference fasta file.
  -o, --output_label    STR     Label for output files, which results in [label].bed/bw.
  
Optional arguments:
  -p, --pileup_mode     count/model    Use a model-based approach to score modifications across sites 
                                       (model) or a simple count-based approach (count). [count]
  -d, --model_dir       STR     Full path to the directory containing the model (*.pb files) to 
                                load. [None]
  -m, --modsites        denovo/reference    Only output CG sites with a mod probability > 0 (denovo), 
                                            or output all CG sites based on the supplied reference 
                                            fasta (reference). [denovo]
  -c, --min_coverage    INT     Minimum coverage required for filtered outputs. [4]
  -q, --min_mapq        INT     Ignore alignments with MAPQ < N. [0]
  -a, --hap_tag         STR     The SAM tag containing haplotype information. [HP]
  -s, --chunksize       INT     Break reference regions into chunks of this size for parallel 
                                processing. [500000]
  -t, --threads         INT     Number of threads for parallel processing. [1]
  -h, --help                    Show this help message and exit.

Details

The -p, --pileup_mode selects the method used to calculate modification probabilities.

  • model: This approach is preferred. It uses distributions of modification scores and a machine-learning model to calculate the modification probabilities across CG sites. Using this option requires providing the path to the directory containing the model files with the -d, --model_dir flag. The required model is available in the pileup_calling_model/ directory.
  • count: This method is simplistic. For a given site, the number of reads with a modification score of >0.5 and <0.5 are counted and the modification probability is given as a percentage.

The -m, --modsites flag determines which sites will be reported.

  • denovo: This option will identify and output all CG sites found in the consensus sequence from the reads in the pileup. This allows reporting of CG sites with zero modification probability. This mode does not ensure that the reference also displays a CG site (e.g., there could be sequence mismatches between the reads and reference).
  • reference: This option will identify and output all CG sites found in the reference sequences. This allows reporting of CG sites with zero modification probability. This mode does not ensure that aligned reads also display a CG site (e.g., there could be sequence mismatches between the reads and reference).

Using the -a, --hap_tag flag allows an arbitrary SAM tag to be used to identify haplotypes, rather than the default HP tag. The haplotype values must be 0, 1, and 2, where 0 is not assigned/ambiguous.

Outputs

There are bed format (.bed) and bigwig format (.bw) files generated for the complete read set and each separate haplotype (when available).

In addition, there are coverage filtered version of these files produced that are based on the minimum coverage value set for -c, --min_coverage.

A log file is also produced ([label]-aligned_bam_to_cpg_scores.log), which contains useful information.

If haplotype information is absent, the following 4 files are expected:

[label].total.[pileup_mode].bed
[label].total.[pileup_mode].bw
[label].total.[pileup_mode].mincov[X].bed
[label].total.[pileup_mode].mincov[X].bw

If haplotype information is present, the following 12 files are expected:

[label].total.[pileup_mode].bed
[label].hap1.[pileup_mode].bed 
[label].hap2.[pileup_mode].bed
[label].total.[pileup_mode].bw
[label].hap1.[pileup_mode].bw 
[label].hap2.[pileup_mode].bw
[label].total.[pileup_mode].mincov[X].bed
[label].hap1.[pileup_mode].mincov[X].bed 
[label].hap2.[pileup_mode].mincov[X].bed
[label].total.[pileup_mode].mincov[X].bw
[label].hap1.[pileup_mode].mincov[X].bw 
[label].hap2.[pileup_mode].mincov[X].bw

The bed file columns will differ between the -p model and -p count methods, but both share the first six columns:

  1. reference name
  2. start coordinate
  3. end coordinate
  4. modification probability
  5. haplotype
  6. coverage

For -p count, four additional columns are present:

  1. modified site count
  2. unmodified site count
  3. avg mod score
  4. avg unmod score

For -p model, three additional columns are present:

  1. estimated modified site count (extrapolated from model modification probability)
  2. estimated unmodified site count (extrapolated from model modification probability)
  3. discretized modification probability (calculated from estimated mod/unmod site counts)

The bigwig files are in an indexed binary format and contain columns 1-4 listed above, and are preferred for loading CpG/5mC tracks in IGV.

Example Data

An aligned BAM file containing HiFi reads with 5mC tags (HG002.GRCh38.haplotagged.bam) is freely available for download: https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/

The sample is HG002/NA24385 from the Human Pangenome Reference Consortium HG002 Data Freeze v1.0, and is aligned to GRCh38. There are also four unaligned bam files containing the HiFi reads.

Performance

The current -s, --chunksize default of 500,000 requires 1-3 Gb memory per thread with a 30X coverage aligned BAM. A higher coverage dataset would use more memory per thread.

Benchmarks were run using the HG002.GRCh38.haplotagged.bam dataset described in the above section. Peak memory was estimated using 3Gb per thread.

Threads (-t) Chunk Size (-s) Wall Time (h:m:s) Estimated Peak Memory
48 500000 (default) 3:15:56 144 Gb
36 500000 (default) 3:58:06 108 Gb
24 500000 (default) 5:19:03 72 Gb
12 500000 (default) 8:58:01 36 Gb

Changelog

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

pb-cpg-tools's People

Contributors

dportik avatar ctsa avatar rhallpb avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.