PRAM manuscript key results and scripts for reproducibility

Introduction
Setup dependent files
'Noise-free' benchmark
- Key results
- Reproducibility
Human master set
- Key results
- Reproducibility
Mouse hematopoietic system
- Key results
- Reproducibility
Reference
Contact

Introduction

This repository contains the key results reported in the PRAM manuscript and the R scripts to reproduce them on your local machine. We provide results for the 'noise-free' benchmark, the human master set transcript models, and the mouse hematopoietic transcript models; each is described in detail in the sections below. We recommend running all R scripts on Linux, where we have tested their reproducibility. Please set up the dependent files before running any other R script.

To obtain this repository, please use the following command:

git clone https://github.com/pliu55/pram_paper

It will create a directory pram_paper/ that contains the following folders and files:

0_setup/
- run.R: script to setup dependent software and files
1_benchmark/
- reported/: results for 'noise-free' benchmark
- run.R: script for reproducing the results
2_human/
- reported/: results for human master set transcript models
- prepareEncodeBam.R and run.R: scripts for reproducing the results
3_mouse/
- reported/: results for mouse hematopoietic system

Setup dependent files

To reproduce PRAM's results, we need to prepare required software and genomic files first with the following commands:

cd 0_setup/
./run.R

The script run.R will download and install:

the latest PRAM package
transcript-building software:
- Cufflinks
- StringTie
- TACO
human gene annotation from GENCODE v24
human genome version hg38

This script requires ~9 GB of hard drive space and takes ~10 minutes on a single 2.1 GHz CPU. All dependent software and files are saved in 0_setup/output/.
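Before starting, it may help to check that enough disk space is free. The sketch below is not part of this repository; it only compares the free space of the current filesystem against the approximate requirements quoted in this README for each script. The function names are our own illustration, and each step is checked on its own rather than cumulatively.

```python
import shutil

# Approximate disk space quoted in this README for each script, in GB.
REQUIRED_GB = {
    "0_setup/run.R": 9,
    "1_benchmark/run.R": 23,
    "2_human/prepareEncodeBam.R": 500,
    "2_human/run.R": 20,
}


def free_gb(path="."):
    """Return free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def steps_that_fit(available_gb, required=REQUIRED_GB):
    """Return the steps whose quoted requirement fits in `available_gb`."""
    return [step for step, need in required.items() if need <= available_gb]


if __name__ == "__main__":
    available = free_gb(".")
    print(f"free space: {available:.0f} GB")
    for step, need in REQUIRED_GB.items():
        status = "ok" if need <= available else "NOT enough space"
        print(f"{step}: needs ~{need} GB -> {status}")
```

Run it from the top of pram_paper/ so that the filesystem being checked is the one the scripts will write to.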

'Noise-free' benchmark

Key results

Results for the 'noise-free' benchmark test are in the folder 1_benchmark/reported/, with their descriptions listed in the table below:

file name	description
target_transcript_ids.txt	GENCODE v24 transcript IDs for the 1,256 target transcripts
plcf.gtf	transcript models predicted by PRAM's pooling + Cufflinks method
plst.gtf	transcript models predicted by PRAM's pooling + StringTie method
cfmg.gtf	transcript models predicted by PRAM's Cufflinks + Cuffmerge method
stmg.gtf	transcript models predicted by PRAM's StringTie + merging method
cftc.gtf	transcript models predicted by PRAM's Cufflinks + TACO method
model_eval.tsv	precision and recall for transcript models predicted by the above five methods in terms of exon nucleotide (row name: `exon_nuc`), individual junction (row name: `indi_jnc`), and transcript structure (row name: `tr_jnc`)
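For readers unfamiliar with the two metrics in model_eval.tsv, the sketch below shows the standard definitions of precision and recall applied to sets of features. It is an illustration only: this README does not spell out the exact matching rules used in PRAM's evaluation, and the junction tuples and the function name are hypothetical.

```python
def precision_recall(predicted, target):
    """Return (precision, recall) for two collections of comparable features."""
    predicted, target = set(predicted), set(target)
    matched = len(predicted & target)  # features present in both sets
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(target) if target else 0.0
    return precision, recall


# Hypothetical junctions, written as (chromosome, strand, intron start, intron end).
target_jnc = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "+", 500, 600),
    ("chr2", "-", 50, 80),
}
predicted_jnc = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "+", 510, 600),  # start differs from the target, so no match
}

precision, recall = precision_recall(predicted_jnc, target_jnc)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Here two of the three predicted junctions match, and two of the four target junctions are recovered, so precision is 2/3 and recall is 1/2. The same function applies to exon nucleotides if each covered position is represented as a (chromosome, strand, position) tuple.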

Reproducibility

To reproduce the model prediction results, run the following commands:

cd 1_benchmark/
./run.R

The script run.R will:

download 'noise-free' input RNA-seq BAM files to 1_benchmark/input/
predict transcript models by PRAM's five meta-assembly methods and save prediction results as GTF files in 1_benchmark/output/. Files will be named in the same way as in the table above
compare transcript models with GENCODE annotation and save the evaluation results in 1_benchmark/output/model_eval.tsv

The script run.R requires ~23 GB of hard drive space and takes ~3 hours using forty 2.1 GHz CPUs. To match the number of CPUs on your own machine, edit the njob_in_para and nthr_per_job variables in run.R so that njob_in_para * nthr_per_job does not exceed the number of available cores.
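One way to choose values that satisfy this constraint is sketched below. It is not part of run.R: the helper name is our own, the value of 4 threads per job is an arbitrary illustration, and os.cpu_count() reports logical cores, which may exceed the number of physical cores.

```python
import os


def fit_to_cores(ncores, nthr_per_job):
    """Pick (njob_in_para, nthr_per_job) so that their product is <= ncores."""
    if ncores < 1 or nthr_per_job < 1:
        raise ValueError("ncores and nthr_per_job must be positive")
    nthr = min(nthr_per_job, ncores)  # a job cannot use more threads than cores
    njob = ncores // nthr             # as many parallel jobs as still fit
    return njob, nthr


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    njob, nthr = fit_to_cores(cores, nthr_per_job=4)
    print(f"njob_in_para = {njob}, nthr_per_job = {nthr}")
```

The printed values can then be copied by hand into the corresponding variables in run.R.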

Human master set

Key results

Five meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty human ENCODE RNA-seq datasets. All five prediction results are saved in 2_human/reported/:

file name	PRAM method
plcf.gtf.gz	pooling + Cufflinks
plst.gtf.gz	pooling + StringTie
cfmg.gtf.gz	Cufflinks + Cuffmerge
stmg.gtf.gz	StringTie + merging
cftc.gtf.gz	Cufflinks + TACO

We quantified the expression levels of transcript models predicted by 'pooling + Cufflinks' together with GENCODE (v24)-annotated transcripts in each of the 30 ENCODE RNA-seq datasets. Their expression levels (in TPM) can be found in isoforms.tpm.gz.
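As a quick sanity check on the prediction files listed in the table above, the number of distinct transcript models in a GTF file can be counted with a few lines of Python. This sketch relies only on the standard GTF layout (nine tab-separated columns, with a transcript_id attribute in the ninth); the function name and the script name count_gtf.py are hypothetical.

```python
import gzip
import re
import sys

# GTF attributes look like: gene_id "g1"; transcript_id "t1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_transcripts(gtf_path):
    """Count distinct transcript_id values in a plain or gzipped GTF file."""
    opener = gzip.open if gtf_path.endswith(".gz") else open
    ids = set()
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header or comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # skip malformed or empty lines
            match = TRANSCRIPT_ID.search(fields[8])
            if match:
                ids.add(match.group(1))
    return len(ids)


if __name__ == "__main__":
    # Example: python count_gtf.py 2_human/reported/plcf.gtf.gz
    for path in sys.argv[1:]:
        print(path, count_transcripts(path))
```

The same function works on the uncompressed GTF files in 1_benchmark/reported/.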

Reproducibility

To reproduce the model prediction results, run the following commands:

cd 2_human/
./prepareEncodeBam.R
./run.R

The script prepareEncodeBam.R will download thirty human RNA-seq BAM files from ENCODE, index them, and save them in 2_human/input/. It requires ~500 GB of hard drive space and takes ~3 hours using thirty 2.1 GHz CPUs. You can adjust the number of CPUs used with the njob_in_para variable in prepareEncodeBam.R.

The script run.R will predict transcript models in human intergenic regions based on the downloaded BAM files. It requires ~20 GB of hard drive space and takes ~4.5 hours using forty 2.1 GHz CPUs. The number of CPUs can be customized in the same way as for reproducing the benchmark results. Predicted models will be saved as GTF files in 2_human/output/, named in the same way as in the table above.

Mouse hematopoietic system

Key results

Three meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty-two RNA-seq datasets from the mouse hematopoietic system, followed by selection of transcript models that do not overlap with RefSeq genes and have mappability ≥ 0.8. All three prediction results are saved in 3_mouse/reported/:

file name	PRAM method
plcf.gtf.gz	pooling + Cufflinks
cfmg.gtf.gz	Cufflinks + Cuffmerge
cftc.gtf.gz	Cufflinks + TACO
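The selection step described above (no overlap with RefSeq genes and mappability ≥ 0.8) can be illustrated with the minimal sketch below. It is not the code used for the manuscript. It assumes that each model carries a single mappability score, that overlap is tested on whole transcript spans, and that strand is ignored; none of these details are stated in this README. The record types and function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical record types; coordinates are 1-based and inclusive, as in GTF.
Model = namedtuple("Model", "name chrom start end mappability")
Gene = namedtuple("Gene", "chrom start end")


def overlaps(model, gene):
    """True if the two intervals share at least one base on the same chromosome."""
    return (model.chrom == gene.chrom
            and model.start <= gene.end
            and gene.start <= model.end)


def select_models(models, genes, min_mappability=0.8):
    """Keep models with mappability >= threshold that overlap no known gene."""
    return [m for m in models
            if m.mappability >= min_mappability
            and not any(overlaps(m, g) for g in genes)]


genes = [Gene("chr1", 1000, 2000)]
models = [
    Model("m1", "chr1", 3000, 4000, 0.95),  # kept
    Model("m2", "chr1", 1500, 2500, 0.99),  # dropped: overlaps the gene
    Model("m3", "chr1", 5000, 6000, 0.50),  # dropped: mappability too low
    Model("m4", "chr2", 1500, 1800, 0.80),  # kept: other chromosome, at threshold
]
selected = select_models(models, genes)
print([m.name for m in selected])
```

The pairwise comparison is quadratic in the number of models and genes, which is fine for illustration; an interval tree or sorted sweep would be the usual choice for genome-scale inputs.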

Reproducibility

PRAM is used to predict intergenic transcript models for the mouse hematopoietic system in the same way as for the human master set. Please refer to 2_human/run.R for the usage of PRAM.

We do not provide scripts for automatically reproducing the results because:

Some mouse ENCODE RNA-seq datasets, such as ENCSR000CLU and ENCSR000CLY, do not have alignment BAM files available.
Other mouse ENCODE RNA-seq datasets, such as ENCSR000CHV and ENCSR000CHY, have alignment BAM files available, but these were based on GENCODE vM4 rather than vM9, which we used to define known genes and intergenic regions.
The mouse RNA-seq alignments (BAM) that we generated take ~750 GB of hard drive space, which would take users a long time to download.

Therefore, we simply provided the results instead. You are always welcome to contact us regarding the details on reproducing these results.

Reference

PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments. Peng Liu, Alexandra A. Soukup, Emery H. Bresnick, Colin N. Dewey, and Sündüz Keleş. Genome Research 2020 https://doi.org/10.1101/gr.252445.119

Contact

Got a question? Please report it at the issues tab in this repository.
