PATTY: a computational method for correcting open chromatin bias in bulk and single-cell CUT&Tag data

Precise profiling of epigenomes, including histone modifications and transcription factor binding sites, is essential for better understanding gene regulatory mechanisms. Cleavage Under Targets & Tagmentation (CUT&Tag) is an easy and low-cost epigenomic profiling method that can be performed on a low number of cells and at the single-cell level. A large number of CUT&Tag datasets have been generated in cancer samples, providing a valuable resource. CUT&Tag experiments use the hyperactive transposase Tn5 for tagmentation. We found that the preference of Tn5 for accessible chromatin regions can influence the distribution of CUT&Tag reads and cause open chromatin biases, which confound the analysis of CUT&Tag data. The high sparsity of single-cell sequencing data makes the open chromatin biases more substantial than in bulk sequencing data. Here, we present a comprehensive computational method, PATTY (Propensity Analyzer for Tn5 Transposase Yielded bias), to mitigate the open chromatin bias inherent in CUT&Tag data at both bulk and single-cell levels. By integrating existing transcriptome and epigenome data using machine learning and comprehensive modeling, we demonstrate that PATTY yields more accurate and robust detection of histone modification occupancy sites for bulk CUT&Tag data than existing methods. We further design a single-cell CUT&Tag analysis framework by utilizing the model trained from bulk data and show improved cell clustering from bias-corrected single-cell CUT&Tag data compared to using raw data. This model paves the way for further development of computational tools for improving bulk and single-cell CUT&Tag data analysis.

0. Introduction of PATTY package

PATTY corrects the open chromatin bias of CUT&Tag data at both bulk and single-cell levels. The open chromatin bias was pre-estimated and the correction model pre-trained for a given type of histone modification using CUT&Tag data in the K562 cell line; PATTY applies the pre-trained model to the bulk/sc CUT&Tag data. For bulk data, PATTY estimates the bias-expected cleavages on chromatin accessibility regions (peaks) and compares them with the observed cleavages. For single-cell data, PATTY estimates a summarized bias score on each candidate chromatin accessibility region (peak bias score, PBS) and uses the peaks with low PBS for single-cell clustering analysis.

  • Changelog
    v1.0.0 First version of PATTY with both single-cell (sc) and bulk modes.

1. Installation

  • Package requirements
    PATTY requires Python 3.6+ and Rscript v3+ to run.
    PATTY requires the Python 3 package numpy to be pre-installed.

# for root user

$ cd PATTY
$ sudo python setup.py install  

# if you are not root user, you can install PATTY at a specific location where you have the write permission

$ python setup.py install --prefix /home/PATTY  # Here you can replace "/home/PATTY" with any location
$ export PATH=/home/PATTY/bin:$PATH    # set up PATH for the software
$ export PYTHONPATH=/home/PATTY/lib/python3.6/site-packages:$PYTHONPATH    # set up PYTHONPATH for module import; replace python3.6 with your Python version

# To check the PATTY package, just type:

$ PATTY --help  # If you see the help manual, you have successfully installed PATTY

# NOTE:

  • To install PATTY on macOS, users need to download and install the Command Line Tools beforehand.
  • Bedtools (Quinlan et al., Bioinformatics. 2010) and UCSC tools (Kuhn et al., Brief Bioinform. 2013) are recommended for data pre-processing. The PATTY package will install both tools automatically if the user does not have them pre-installed in the default PATH.
  • Some functions (single-cell clustering) of PATTY require the related packages to be pre-installed (see Section 6).
  • The installation should be finished in about one minute.

2. Run PATTY (usage)

Essential parameters

To run PATTY with default parameters, set the following options:

  • -m MODE, --mode=MODE Mode of PATTY, choose from sc (single-cell) or bulk
  • -c CUTTAG, --cuttag=CUTTAG Input fragments file in bed format for CUT&Tag data, with .bed extension. For sc mode, the 4th (name) column of the bed file gives the name of the corresponding individual cell
  • -a ATAC, --atac=ATAC Input fragments file in bed format for ATAC-seq data, with .bed extension. The ATAC-seq fragments are used as bulk data in both sc and bulk modes.
  • -f FACTOR, --factor=FACTOR Factor type of the CUT&Tag data. Currently PATTY supports H3K27me3 (default) and H3K27ac
  • -o OUTNAME, --outname=OUTNAME Name of output results
  • -p PEAK, --peak=PEAK
    [optional] External peak/region file for the candidate peaks/regions.

Example of running PATTY with default parameters:

# sc mode

$ PATTY -m sc -c ${path}/testdata_CUTTAGreads.bed.gz -a ${path}/testdata_ATACreads.bed.gz -f H3K27me3 -o testsc  

# bulk mode

$ PATTY -m bulk -c ${path}/testdata_CUTTAGreads.bed.gz -a ${path}/testdata_ATACreads.bed.gz -f H3K27me3 -o testbulk

4. Customize candidate peaks/regions.

PATTY provides an option (-p) to take a user-supplied peak file as the target regions for the PATTY analysis. The peak file should be in BED format (plain text), have >= 4 columns (chrom, start, end, name), and cover a total length of >= 200 kbp (i.e., 1,000 bins of 200 bp). By default, PATTY calls peaks from the same dataset (e.g., the fragments.bed file) using SICER, which ensures sufficient cleavages/signal in the peak regions. Below is the example with an external/customized peak file:
The test files (testdata_reads.bed.gz and testpeak.bed) in the following cmd lines can be downloaded via the links in Section 8.

# sc mode

$ PATTY -m sc -c ${path}/testdata_CUTTAGreads.bed.gz -a ${path}/testdata_ATACreads.bed.gz -p ${path}/testpeak.bed  -f H3K27me3 -o testsc  

# bulk mode

$ PATTY -m bulk -c ${path}/testdata_CUTTAGreads.bed.gz -a ${path}/testdata_ATACreads.bed.gz -p ${path}/testpeak.bed  -f H3K27me3 -o testbulk
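
Before passing a custom peak file with -p, it can help to confirm that it meets the requirements stated above (>= 4 tab-separated columns and a total length of >= 200 kbp). The following is a minimal sketch of such a check, not part of PATTY; the function name check_peak_file is hypothetical, and the total length is computed as a plain sum of peak lengths, so overlapping peaks would be counted twice.

```python
MIN_COLUMNS = 4          # chrom, start, end, name
MIN_TOTAL_BP = 200_000   # 200 kbp, i.e. 1,000 bins of 200 bp


def check_peak_file(path):
    """Return (ok, total_bp, problems) for a plain-text BED peak file.

    Hypothetical helper: checks only the requirements listed in this
    section, not everything PATTY itself may validate.
    """
    total_bp = 0
    problems = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            # Skip blank lines and common BED header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split("\t")
            if len(fields) < MIN_COLUMNS:
                problems.append(f"line {lineno}: only {len(fields)} columns")
                continue
            try:
                start, end = int(fields[1]), int(fields[2])
            except ValueError:
                problems.append(f"line {lineno}: non-integer coordinates")
                continue
            if end <= start:
                problems.append(f"line {lineno}: end <= start")
                continue
            total_bp += end - start
    if total_bp < MIN_TOTAL_BP:
        problems.append(f"total length {total_bp} bp is below {MIN_TOTAL_BP} bp")
    return (not problems, total_bp, problems)
```

Calling check_peak_file("testpeak.bed") returns a flag, the summed peak length in bp, and a list of human-readable problems that can be printed before starting a run.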

5. Pre-processing steps for generating the input fragments file.

PATTY takes aligned fragment files (in .bed format) as input. Users can perform any pre-processing steps to customize the fragment files. We recommend keeping only high-quality reads with confident alignments (e.g., MAPQ > 30) to run PATTY. For bulk data, using unique paired-end fragments (unique loci) can reduce the potential influence of PCR over-amplification. For single-cell data, users can keep only the unique fragments within each individual cell.
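
The de-duplication described above can be sketched in a few lines of Python. This is an illustrative helper, not part of PATTY: the name unique_fragments is hypothetical, and it assumes tab-separated BED lines with the cell name in the 4th column for single-cell data. The MAPQ filter is not covered here, since mapping quality is not stored in a 3- or 4-column BED file and has to be applied earlier, on the alignments.

```python
def unique_fragments(lines, per_cell=False):
    """Yield BED lines, keeping the first occurrence of each fragment.

    A fragment is identified by (chrom, start, end). With per_cell=True
    the 4th column (cell name) is added to the key, so identical
    coordinates observed in different cells are all kept.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        key = (fields[0], fields[1], fields[2])
        if per_cell:
            if len(fields) < 4:
                continue  # no cell name available for this line
            key += (fields[3],)
        if key not in seen:
            seen.add(key)
            yield line
```

For a gzip-compressed input, the lines can be read with gzip.open(path, "rt") and the kept lines written to a new file. The set of seen keys is held in memory, so for very large fragment files a sort-based approach may be preferable.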

6. Install and use published single-cell clustering methods based on PATTY bias correction.

PATTY sc mode implements several cell clustering methods for the single-cell clustering analysis in addition to the default Kmeans analysis. To activate these methods (see --clusterMethod in Section 9 for the available choices), users need to install the related package and specify the method with the --clusterMethod parameter. If a method is declared by the --clusterMethod parameter but is not installed, PATTY will skip the single-cell clustering analysis.

PATTY also provides UMAP/t-SNE visualization for the single-cell clustering analysis. Users can activate this function by the --UMAP parameter. For the PCAkm method, the umap package in R is required.

7. Output files

  1. NAME_summaryReports.pdf is the summary pdf file which contains information on:

    • Input file and parameter description
    • Basic QC of the data
    • Summary of the PATTY bias estimation/correction results

    Note: This pdf file is only generated if pdflatex is pre-installed. A NAME_summaryReports.txt file is generated as well. A .tex file is also generated in case users want to build the pdf document later.

  2. NAME_peaks.bed contains the peaks detected from the fragment files (using SICER). Each peak is split into 200 bp bins.

  3. NAME_PATTYscore.bdg (bulk mode only) contains the PATTY score for each candidate 200 bp bin, in bedGraph format. The scores are transformed into a 200 bp-resolution track for the input regions. For genomic regions not covered by the input regions, PATTY assigns 0 as the score.

  4. NAME_binXcell.txt.gz (sc mode only) is the bin-by-cell PATTY score matrix generated from the single-cell analysis. Cells are filtered by the total read count per cell (default >= 10,000 reads; can be changed through the parameters).

  5. NAME_scClustering.txt.gz (sc mode only) is the cell clustering result using the PATTY bias corrected matrix.
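
As a quick look at the bulk-mode score track, the sketch below reads a bedGraph file and reports how many bins carry a non-zero score. It relies only on the standard bedGraph layout (chrom, start, end, value); the function names read_bedgraph and summarize_scores are hypothetical and not part of PATTY, and no interpretation of the score values is implied.

```python
def read_bedgraph(path):
    """Read a bedGraph file into a list of (chrom, start, end, score)."""
    records = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines, comments, and track/browser header lines.
            if len(fields) < 4 or line.startswith("#"):
                continue
            if fields[0] in ("track", "browser"):
                continue
            records.append(
                (fields[0], int(fields[1]), int(fields[2]), float(fields[3]))
            )
    return records


def summarize_scores(records):
    """Count bins and base pairs with a non-zero score."""
    scored = [rec for rec in records if rec[3] != 0.0]
    return {
        "bins": len(records),
        "nonzero_bins": len(scored),
        "nonzero_bp": sum(end - start for _, start, end, _ in scored),
    }
```

For example, summarize_scores(read_bedgraph("NAME_PATTYscore.bdg")) gives a small dictionary that can be compared against the number of 200 bp bins in NAME_peaks.bed.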

8. Testing data and example of output files

We provide test data for users to test PATTY. The sc/bulk output can also be generated with the cmd lines in Sections 2 and 4 using the testing data as input. Click the file names to download (copy the backupLink for cmdline download).

  • testing data: Dropbox
  • testing peak file(optional for -p): Dropbox
  • testing cellnames (optional for --cellnames in sc mode): Dropbox
  • output for PATTY bulk mode with testing data input: Dropbox
  • output for PATTY sc mode with testing data input: Dropbox
  • A PATTY run on the testing data (e.g., using sc mode) should finish within 30 minutes.

9. Other parameters in the PATTY pipeline

You can also set the following parameters for more accurate bias estimation and correction:

  • --cellnames=CELLNAMES
    [sc optional] Single-column file listing the names of the individual cells to use, one cell name per line. This parameter is only used in sc mode and is rarely needed.
  • --readCutoff=READCUTOFF
    [sc optional] Read number cutoff for high-quality cells. Cells with < 10000 (default) reads will be discarded in the analysis. Users can lower this parameter for samples with low sequencing depth to include more cells in the analysis, but a lower cutoff may decrease the accuracy of the clustering results because low-quality cells are included.
  • --peakMinReads=PEAKMINREADS
    [sc optional] Peaks covered by < 10 (default) cleavages (across all high-quality cells) will be discarded in the analysis.
  • --peakMaxReads=PEAKMAXREADS
    [sc optional] Peaks covered by > X cleavages (across all high-quality cells) will be discarded in the analysis. Set to 0 to disable this filter (default).
  • --clusterMethod=CLUSTERMETHOD
    [sc optional] Method used for single-cell clustering analysis. The default is Kmeans (PCA dimension reduction + K-means clustering). Optional choices (Seurat, scran, and APEC) require the related packages to be installed (described in Section 6).
  • --clusterNum=CLUSTERNUM
    [sc optional] Number of clusters specified for K-means clustering; only used for the PCAkm method (set by --clusterMethod). The default is 7.
  • --topDim=TOPDIM
    [sc optional] Number of dimensions (with the highest variance) used for clustering. Only used for PCAkm (PC) and ArchR (latent variable). This number is suggested to be >= 30 (default = 60).
  • --UMAP
    [sc optional] Turn on this parameter to generate a UMAP plot for the clustering results
  • --overwrite
    [optional] Force overwrite; setting this parameter will remove the existing result! PATTY will terminate if there is a folder with the same name as -o in the working directory. Set this parameter to force PATTY to run.
  • --keeptmp
    [optional] Whether or not to keep the intermediate results (tmpResults/)
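
To choose a sensible --readCutoff before a run, it can be useful to see how many cells would pass a given cutoff. The sketch below counts fragments per cell from the 4th column of a single-cell fragments file and keeps cells at or above the cutoff. It is an illustration only: the name high_quality_cells is hypothetical, and it assumes one BED line counts as one read, which may differ from how PATTY counts reads internally.

```python
from collections import Counter


def high_quality_cells(lines, read_cutoff=10000):
    """Return {cell_name: read_count} for cells with >= read_cutoff reads.

    Assumes tab-separated BED lines whose 4th column is the cell name,
    and treats each line as one read (an assumption of this sketch).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            counts[fields[3]] += 1
    return {cell: n for cell, n in counts.items() if n >= read_cutoff}
```

Running it with a few candidate cutoffs (for instance 10000, 5000, and 2000) on the same file shows how many cells each setting would retain.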

Reproduce cell clustering results using the PATTY package

Users can reproduce the clustering results for the nano-CT data in the manuscript (Figure 6, H3K27me3 nano-CT data in mouse brain, K-means clustering) by running PATTY with the following cmd line:

$ PATTY -m sc -i ${path}/testdata_reads.bed.gz -f H3K27me3 -o nanoCT_H3K27me3 --UMAP --overwrite --keeptmp --cellnames ${path}/testsc_cellnames.txt

The test files (testdata_reads.bed.gz and testsc_cellnames.txt) in the cmd line can be downloaded via the links in Section 8.

Supplementary data

  • PATTY pre-trained models for H3K27me3 (backupLink) and H3K27ac (backupLink). Both models were trained on CUT&Tag data in the K562 cell line. Note that these pre-trained models are already built into the PATTY package and used by it.
