pipe4C - a 4C-seq processing pipeline

A pipeline that processes multiplexed 4C-seq reads directly from FASTQ files. It generates files in a range of widely used formats to facilitate visualization and further data analysis using standard genome browsers and tools, including peakC, our recently developed peak caller for 4C-seq data.

If you have any difficulties using the pipeline, please do not hesitate to contact us.

Citation

Krijger PHL, Geeven G, Bianchi V, Hilvering CRE, de Laat W. 4C-seq from beginning to end: A detailed protocol for sample preparation and data analysis. Methods. 2019 Jul 26. pii: S1046-2023(18)30474-2. doi: 10.1016/j.ymeth.2019.07.014. https://doi.org/10.1016/j.ymeth.2019.07.014

Prerequisites

Installation

Download the latest version of the pipeline from this git repository using:

    $ wget https://github.com/deLaatLab/pipe4C/archive/master.zip
    $ unzip master.zip
    $ cd ./pipe4C-master

note: the pipe4C.R and functions.R files need to be placed in the same folder.

Files required to run the pipeline:

  • Reads in (compressed) FASTQ format.

    • Illumina Sequencing Systems generate raw data files in binary base call (BCL) format. Illumina offers bcl2fastq conversion software to demultiplex (based on the index used in the non-reading primer) and convert BCL files. If multiple flow cell lanes have been used to sequence the library, a single Read 1 FASTQ file should be created per index per sequence run by combining the FASTQ files for each flow cell lane using a standard “cat” command after BCL conversion or by using the no-lane-splitting option when running bcl2fastq.
  • Configuration file (conf.yml)

    • Global and system-specific parameters (such as paths and installed genome assemblies) that are likely to remain constant across different runs of the pipeline are defined in the global configuration file (conf.yml). In each run, the pipeline first loads the parameters defined in this global configuration file, and then loads the run-specific parameters and the experiment-specific data defined in a separate file (vpFile). The global configuration file can be edited using any standard text editor. The parameters that need to be set at least once in the global configuration file upon installation on a system are listed in Table 1.

Name             Description
fragFolder       Path to the folder containing the fragment end libraries of the reference genomes
normalizeFactor  Reads mapped to the 4C fragment end library are normalized to account for sequencing depth according to the normalizeFactor
enzymes          Enzyme names used in the viewpoint file and their corresponding recognition motifs
genomes          Genome names used in the viewpoint file plus corresponding BSgenome packages
bowtie2          Path to the corresponding bowtie2 index of the reference genome. The genome assembly used to generate the index should match the one used to generate the BSgenome package
maxY             Maximal Y value in local 4C cis plot
plotView         Number of bp to plot around viewpoint in local 4C cis plot
xaxisUnit        X-axis unit (Mb, Kb or bp)
plotType         Plots will be saved as either PDF or PNG
binSize          Genome bin size used in the genome plot
qualityCutoff    Q-score. Trim 3′-end of all sequences using a sliding window as soon as 2 out of 5 nucleotides have quality encoding less than the Q-score. 0 = no trimming
trimLength       Trim reads to defined capture length from 3′-end. 0 = no trimming
minAmountReads   Minimum required number of reads containing the primer sequence. If fewer reads are identified, the experiment will not be processed further
readsQuality     Bowtie2 minimum required mapping quality score for mapped reads
mapUnique        Extract uniquely mapped reads, based on the lack of XS tag
cores            Number of CPU cores for parallelization
wSize            The running mean window size
nTop             Top fragment ends discarded for calculation of normalizeFactor
nonBlind         Only keep non-blind fragments
wig              Create wig files for all samples
plot             Create viewpoint coverage plot for all samples
genomePlot       Create genomeplot for all samples (only possible if analysis is “all” in vpFile)
tsv              Create tab separated value file for all samples
bins             Count reads for binned regions
mismatchMax      The maximum number of mismatches allowed during demultiplexing

Table 1. Description of parameters that need to be defined in the configuration file.
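
The minAmountReads setting refers to the number of reads that contain the primer sequence. As a rough pre-check outside the pipeline, one can count the reads in a FASTQ file that start with a given primer. The sketch below is not the pipeline's own demultiplexing code. It assumes exact matches only (as if mismatchMax were 0), no spacer, and a primer located at the very start of the read. The fastq_record helper, the file name and the three reads are made up for the demonstration.

```shell
# Rough pre-check, not part of pipe4C: count reads that start with a primer.
# Assumptions: exact match only, no spacer, primer at the start of the read.
primer="GAGGGTAATTTTAGCCGATC"

# Hypothetical helper: print one 4-line FASTQ record with a dummy quality
# string of the same length as the sequence.
fastq_record() {
  printf '@%s\n%s\n+\n%s\n' "$1" "$2" "$(printf '%s' "$2" | tr 'ACGT' 'IIII')"
}

# Demo input: a tiny gzip-compressed FASTQ file with three made-up reads.
fastq="$(mktemp -d)/demo.fastq.gz"
{
  fastq_record r1 "${primer}AAAACCCC"
  fastq_record r2 "${primer}GGGGTTTT"
  fastq_record r3 "TTGCACCCGTCTTCTTGATCAAAACCCC"
} | gzip > "$fastq"

# Line 2 of every 4-line FASTQ record holds the sequence.
total=$(gzip -dc "$fastq" | awk 'NR % 4 == 2' | wc -l | tr -d ' ')
with_primer=$(gzip -dc "$fastq" |
  awk -v p="$primer" 'NR % 4 == 2 && index($0, p) == 1 { n++ } END { print n + 0 }')

echo "reads: $total, starting with primer: $with_primer"
```

If the count for a real file is far below the configured minAmountReads, the experiment would presumably be skipped, so this kind of check can point to a wrong primer sequence or FASTQ file before a full run.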


  • Viewpoint file
    • Experiment specific parameters for each 4C-seq experiment are organized in a viewpoint file. Parameters in this file are stored in a tab-delimited format, with each row containing information for a separate experiment:
expname     primer                firstenzyme  secondenzyme  genome  vpchr  vppos     analysis  fastq
mESC_Sox2   GAGGGTAATTTTAGCCGATC  DpnII        Csp6I         mm9     3      34547661  all       index1.fastq.gz
mESC_Mccc1  TTGCACCCGTCTTCTTGATC  DpnII        Csp6I         mm9     3      35873313  cis       index1.fastq.gz

Table 2. Example of a viewpoint file in which two experiments are demultiplexed from the same FASTQ file based on their primer sequence.
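
Because the viewpoint file is tab-delimited, a stray space or a missing field is an easy mistake to make. The following is a minimal sketch of a sanity check that could be run before starting the pipeline; it is not part of pipe4C. It writes the two example experiments to a hypothetical file (VPinfo_demo.txt in a temporary directory) and uses awk to count rows that have a different number of tab-separated fields than the header, a vppos that is not a whole number, or an analysis value other than all or cis.

```shell
# Sanity check for a viewpoint file; an illustration, not part of pipe4C.
vpfile="$(mktemp -d)/VPinfo_demo.txt"   # hypothetical file name

# Write the example viewpoint file with real tab characters between fields.
{
  printf 'expname\tprimer\tfirstenzyme\tsecondenzyme\tgenome\tvpchr\tvppos\tanalysis\tfastq\n'
  printf 'mESC_Sox2\tGAGGGTAATTTTAGCCGATC\tDpnII\tCsp6I\tmm9\t3\t34547661\tall\tindex1.fastq.gz\n'
  printf 'mESC_Mccc1\tTTGCACCCGTCTTCTTGATC\tDpnII\tCsp6I\tmm9\t3\t35873313\tcis\tindex1.fastq.gz\n'
} > "$vpfile"

# Count rows with a wrong field count, a non-numeric vppos or an unknown
# analysis value (0 means the file passed all three checks).
problems=$(awk -F '\t' '
  NR == 1 { n = NF; for (i = 1; i <= NF; i++) col[$i] = i; next }
  NF != n                                                    { bad++; next }
  $(col["vppos"]) !~ /^[0-9]+$/                              { bad++; next }
  $(col["analysis"]) != "all" && $(col["analysis"]) != "cis" { bad++ }
  END { print bad + 0 }
' "$vpfile")

echo "problems found: $problems"
```

Looking up columns by header name rather than by position keeps the check valid when an extra column is appended to the file.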


Name               Description
expname            Unique experiment name
primer             Primer sequence
firstenzyme        First restriction enzyme name (nearest to reading primer)
secondenzyme       Second restriction enzyme name
genome             Reference genome of interest
vpchr              The chromosome that contains the viewpoint (see note 15)
vppos              Coordinate of viewpoint position. Any bp position within the VP can be used except the RE motifs (see note 15)
analysis           The final output tables will contain all reads (all) or only the reads that have been mapped to the VP chromosome (cis). For most analyses, cis is sufficient; the generated output files will be smaller and therefore easier to process on local computers
fastq              Name of the FASTQ file
spacer (optional)  Spacer length. Number of nt included as spacer in the primer to enable out-of-phase sequencing. Default = 0. The spacer sequence will not be used for demultiplexing. If the spacer sequence is used as a barcode, include the sequence in the primer sequence and set the spacer length to 0

Table 3. Description of parameters that are required in the viewpoint file for processing a 4C-seq experiment.
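
For the FASTQ input, the lane-merging step with “cat” described above works directly on compressed files, because a concatenation of gzip streams is itself a valid gzip stream. The sketch below illustrates this with two tiny per-lane Read 1 files created in a temporary directory. The file names and reads are hypothetical and only stand in for the per-lane files of one index.

```shell
# Demo setup: two tiny per-lane Read 1 files (hypothetical names and reads).
workdir=$(mktemp -d)
printf '@lane1_read1\nACGTACGT\n+\nIIIIIIII\n' | gzip > "$workdir/index1_L001_R1.fastq.gz"
printf '@lane2_read1\nTGCATGCA\n+\nIIIIIIII\n' | gzip > "$workdir/index1_L002_R1.fastq.gz"

# Merge the lanes into a single Read 1 FASTQ file for this index.
# No decompression is needed before concatenating.
cat "$workdir"/index1_L00*_R1.fastq.gz > "$workdir/index1.fastq.gz"

# Sanity check: the merged file should hold 4 lines per read.
merged_lines=$(gzip -dc "$workdir/index1.fastq.gz" | wc -l | tr -d ' ')
echo "merged FASTQ lines: $merged_lines"
```

The name of the merged file is what would then be entered in the fastq column of the viewpoint file.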

Running the pipeline:

    Rscript <path to pipe4C.R script> [any additional arguments]

A list of the required and optional parameters recognized by the pipe4C.R script is shown in Table 4. Default values (except for vpFile, fqFolder, outFolder and confFile) are stored in the configuration file.

For example,

    Rscript pipe4C.R --vpFile [path to vpFile] --fqFolder [path to folder containing the FASTQ files] --outFolder [path to output folder] --cores 8 --wig --plot --genomePlot

will run the pipeline using 8 cores and generate a wig file, a viewpoint plot and a genome plot as output, in addition to the default outputs.

Name            Description
vpFile*         Path to the viewpoint file
fqFolder*       Path to the folder containing the FASTQ files
outFolder*      Path to the output folder
confFile        Path to the configuration file (default: conf.yml in the folder containing the pipeline script)
mismatchMax     The maximum number of mismatches allowed during demultiplexing
qualityCutoff   Q-score. Trim 3′-end of all sequences using a sliding window as soon as 2 out of 5 nucleotides have quality encoding less than the Q-score
trimLength      Trim reads to defined capture length from 3′-end
minAmountReads  Minimum required number of reads containing the primer sequence. If fewer reads are identified, the experiment will not be processed further
readsQuality    Bowtie2 minimum required mapping quality score for mapped reads
mapUnique       Extract uniquely mapped reads, based on the lack of XS tag
cores           Number of CPU cores for parallelization
wSize           The running mean window size
nTop            Top fragment ends discarded for normalization
nonBlind        Only keep non-blind fragments
wig             Create wig files for all samples
plot            Create viewpoint coverage plot for all samples
genomePlot      Create genomeplot for all samples (only possible if analysis is “all” in vpFile)
tsv             Create tab separated value file for all samples
bins            Count reads for binned regions

Table 4. Description of parameters that are recognized by the pipe4C.R script. * are required.
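
Since pipe4C.R expects functions.R in the same folder, and by default looks for conf.yml there as well, a small wrapper can check for these files before launching a run. The sketch below is only an illustration: check_pipe4c_dir is a hypothetical helper, not part of pipe4C, and the demo uses empty placeholder files in a temporary folder instead of a real installation.

```shell
# Hypothetical helper, not part of pipe4C: verify that the files the pipeline
# script needs are present in one folder. Prints the missing file names.
check_pipe4c_dir() {
  missing=""
  for f in pipe4C.R functions.R conf.yml; do
    [ -f "$1/$f" ] || missing="$missing $f"
  done
  echo "$missing"
}

# Demo: a placeholder folder in which conf.yml is deliberately absent.
pipe_dir=$(mktemp -d)
touch "$pipe_dir/pipe4C.R" "$pipe_dir/functions.R"

missing_files=$(check_pipe4c_dir "$pipe_dir")
if [ -z "$missing_files" ]; then
  echo "all files present in $pipe_dir"
else
  echo "missing in $pipe_dir:$missing_files"
fi
```

When a configuration file is passed explicitly through confFile, the conf.yml entry could be dropped from the list of checked files.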

Step by step tutorial

In this tutorial we will explain how to run the pipeline and perform peakC analysis on the Sox2 4C-seq data from Geeven et al. (SRA files GSM2824300, GSM2824301 and GSM2824302).

Set up the pipeline

Download the pipeline including the example files

Download the pipeline and example files using the following command.

    $ wget https://github.com/deLaatLab/pipe4C/archive/master.zip
    $ unzip master.zip
    $ cd ./pipe4C-master

note: the pipe4C.R and functions.R files need to be placed in the same folder. The FASTQ files and viewpoint file can be found in the example folder.

Modify the configuration file (conf.yml) using any plain text editor

  • Change the location in which the fragmented genome will be generated (fragFolder).
  • Change the location in which the bowtie2 index is stored.

Run the pipeline

    Rscript pipe4C.R --vpFile=./example/VPinfo.txt --fqFolder=./example/ --outFolder=./outF/ --cores 8 --plot --wig

This will run the pipeline using 8 cores and generate a viewpoint plot and a wig file as output, in addition to the default outputs.

Let's have a look at the generated files

The report file

A report file is generated that contains quality metrics of all the 4C-libraries processed by the pipeline.

The report file indicates that the quality of the experiment is good, as:

  • ~80% of the reads map to the viewpoint chromosome (fragMappedCisPercCorr).
  • ~75% of the reads mapping to the viewpoint chromosome map within 1 Mb of the viewpoint (cov1Mb).
  • Furthermore, ~90% of the mappable DpnII fragment ends within 100 kb of the viewpoint (capt100Kb) have at least one read.

The viewpoint plots

The viewpoint plots can be found in the PLOT folder. A coverage plot of the viewpoint region is generated based on the normalized and smoothed data. The genomic region that is visualized, the height of the Y-axis, the unit of the X-axis (Mb, kb or bp), the file type (PDF or PNG) and the window size are defined in the configuration file and can be adjusted there. In addition, the quality metrics are displayed in the figure for a quick impression of the data.

Wig files

The normalized and smoothed data are written to a wig file for visualization in genome browsers such as the UCSC Genome Browser (https://genome.ucsc.edu/) and IGV.

RDS files

The pipeline produces R-objects stored as rds files, which contain all mapped 4C-seq reads as well as the mapping statistics and viewpoint information relevant to the experiment. This allows for a more thorough interactive analysis and visualization of the data in R.

Perform peak calling using peakC in R

The peakC method was designed to identify reproducible peaks in regions close to the viewpoint in 4C-seq data. It computes non-parametric statistics based on the ranks of the coverage of 4C fragment ends with respect to a background model. This model estimates the background contact frequency upstream and downstream of the viewpoint independently, and separately for each 4C-seq experiment. A statistical model is then used to identify genomic regions that are significantly contacted.

To run peakC on the rds files created by the pipe4C pipeline, you need to load the peakC R-package (make sure it's installed first!) and source the pipe4C functions in R:

    library(peakC)
    pipe4CFunctionsFile <- "functions.R"
    source(pipe4CFunctionsFile)

Next, select the rds files of the 4C-seq experiments that you want to analyze. All of them should have been generated with the same VP.

Select 4C-seq experiments generated using the same VP

    resultsDir <- "./outF/RDS/"
    setwd(resultsDir)
    rdsFiles <- c("set_1_viewpoint_10_ESC_replicate_1.rds", "set_1_viewpoint_10_ESC_replicate_2.rds", "set_1_viewpoint_10_ESC_replicate_3.rds")

Now run peakC, with the default parameters

    resPeakC <- doPeakC(rdsFiles = rdsFiles)

and plot the results

    plot_C(resPeakC, y.max=750)

Finally, extract the identified peak regions from the peakC analysis from the objects and export the results in a BED file which can be used for visualization in a genome browser together with the generated wig files.

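    resPeaks <- getPeakCPeaks(resPeakC=resPeakC)
    # The working directory was set to ./outF/RDS/ with setwd() earlier,
    # so "../" places the BED file in ./outF/
    peaksFile <- "../set_1_viewpoint_10_ESC_peakC_peaks.bed"
    exportPeakCPeaks(resPeakC=resPeakC, bedFile=peaksFile, name="set_1_viewpoint_10_ESC_peakC_peaks")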
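
Once exported, the BED file can also be inspected from the shell. The sketch below assumes the standard BED layout (chromosome, start and end in the first three columns) and skips any track or browser header lines, since it is not confirmed here whether exportPeakCPeaks writes one. The file contents are invented coordinates for demonstration only, not results from the Sox2 data.

```shell
# Demo input: a made-up BED file; the coordinates are NOT real peakC output.
bed="$(mktemp -d)/demo_peaks.bed"
{
  printf 'track name="demo_peaks"\n'
  printf 'chr3\t34600000\t34610000\n'
  printf 'chr3\t34650000\t34655000\n'
} > "$bed"

# Count the peak regions and sum their lengths (BED end minus start),
# ignoring track/browser header lines.
summary=$(awk '!/^(track|browser)/ { n++; bp += $3 - $2 } END { print n + 0, bp + 0 }' "$bed")
echo "peaks and total bp covered: $summary"
```

A quick summary like this can help spot an empty or unexpectedly large peak set before loading the file into a genome browser.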
