pram_paper

PRAM manuscript key results and scripts for reproducibility

Introduction

This repository contains the key results reported in the PRAM manuscript and the R scripts to reproduce them on a user's local machine. We provide results for the 'noise-free' benchmark, the human master set transcript models, and the mouse hematopoietic transcript models. The sections below describe each of them in detail. To reproduce these results, we recommend running all the R scripts on Linux, where we have tested their reproducibility. Also, please make sure to set up the dependent files first before running any other R scripts.

To obtain this repository, please use the following command:

git clone https://github.com/pliu55/pram_paper

It will create a directory pram_paper/ that contains the following folders and files:

  • 0_setup/
    • run.R: script to set up dependent software and files
  • 1_benchmark/
    • reported/: results for 'noise-free' benchmark
    • run.R: script for reproducing the results
  • 2_human/
    • reported/: results for human master set transcript models
    • prepareEncodeBam.R and run.R: scripts for reproducing the results
  • 3_mouse/
    • reported/: results for the mouse hematopoietic system

Setup dependent files

To reproduce PRAM's results, first prepare the required software and genomic files with the following commands:

cd 0_setup/
./run.R

The script run.R will download and install:

  • the latest PRAM package
  • transcript-building software:
    • Cufflinks
    • StringTie
    • TACO
  • human gene annotation from GENCODE v24
  • human genome version hg38

This script requires ~9 GB of hard drive space and takes ~10 minutes using a single 2.1 GHz CPU. All the dependent software and files will be saved in 0_setup/output/.

'Noise-free' benchmark

Key results

Results for the 'noise-free' benchmark test are in the folder 1_benchmark/reported/, with their descriptions listed in the table below:

| file name | description |
| --- | --- |
| target_transcript_ids.txt | GENCODE v24 transcript IDs for the 1,256 target transcripts |
| plcf.gtf | transcript models predicted by PRAM's pooling + Cufflinks method |
| plst.gtf | transcript models predicted by PRAM's pooling + StringTie method |
| cfmg.gtf | transcript models predicted by PRAM's Cufflinks + Cuffmerge method |
| stmg.gtf | transcript models predicted by PRAM's StringTie + merging method |
| cftc.gtf | transcript models predicted by PRAM's Cufflinks + TACO method |
| model_eval.tsv | precision and recall for the transcript models predicted by the above five methods in terms of exon nucleotide (row name: exon_nuc), individual junction (row name: indi_jnc), and transcript structure (row name: tr_jnc) |
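
The precision and recall values in model_eval.tsv can be read with the usual definitions in mind. The sketch below illustrates them at the individual-junction level. It assumes each junction is a (chromosome, strand, intron start, intron end) tuple; the function name and the toy junctions are hypothetical, and the actual evaluation is carried out by PRAM inside run.R, possibly with different matching rules.

```python
def precision_recall(predicted, target):
    """Return (precision, recall) for two collections of features.

    precision = |predicted AND target| / |predicted|
    recall    = |predicted AND target| / |target|
    """
    predicted, target = set(predicted), set(target)
    n_shared = len(predicted & target)
    precision = n_shared / len(predicted) if predicted else 0.0
    recall = n_shared / len(target) if target else 0.0
    return precision, recall


# Toy junctions (hypothetical, not taken from the benchmark data)
target_jnc = {
    ("chr1", "+", 201, 300),
    ("chr1", "+", 401, 500),
    ("chr2", "-", 151, 250),
    ("chr2", "-", 351, 450),
}
predicted_jnc = {
    ("chr1", "+", 201, 300),   # matches a target junction
    ("chr1", "+", 401, 500),   # matches a target junction
    ("chr2", "-", 161, 250),   # wrong donor site, no match
}

precision, recall = precision_recall(predicted_jnc, target_jnc)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Two of the three predicted junctions are correct (precision 2/3), and two of the four target junctions are recovered (recall 2/4).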

Reproducibility

To reproduce the model prediction results, run the following commands:

cd 1_benchmark/
./run.R

The script run.R will:

  • download 'noise-free' input RNA-seq BAM files to 1_benchmark/input/
  • predict transcript models with PRAM's five meta-assembly methods and save the prediction results as GTF files in 1_benchmark/output/. Files will be named in the same way as in the table above.
  • compare transcript models with GENCODE annotation and save the evaluation results in 1_benchmark/output/model_eval.tsv
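
For readers who want to inspect the output GTF files themselves, the following is a minimal sketch of how splice junctions could be derived from exon records. It relies only on the standard GTF layout (nine tab-separated columns, feature type in column 3, 1-based inclusive coordinates in columns 4 and 5, strand in column 7, and a transcript_id attribute in column 9). The function name and the toy records are hypothetical, and this is not the code PRAM uses for its comparison.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def junctions_from_gtf(lines):
    """Return a set of (chrom, strand, intron_start, intron_end) tuples."""
    exons = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records define junctions
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            key = (fields[0], fields[6], match.group(1))
            exons[key].append((int(fields[3]), int(fields[4])))

    junctions = set()
    for (chrom, strand, _), coords in exons.items():
        coords.sort()
        # An intron spans the gap between two consecutive exons.
        for (_, left_end), (right_start, _) in zip(coords, coords[1:]):
            junctions.add((chrom, strand, left_end + 1, right_start - 1))
    return junctions


# Toy records (hypothetical): t1 has three exons, t2 skips the middle one
toy_gtf = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t301\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t501\t600\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
    'chr1\ttoy\texon\t501\t600\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
]
toy_junctions = junctions_from_gtf(toy_gtf)
print(sorted(toy_junctions))
```

To apply it to a real file, pass an open file handle (for example one of the GTF files in 1_benchmark/output/) instead of the toy list.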

The script run.R requires ~23 GB of hard drive space and takes ~3 hours using forty 2.1 GHz CPUs. To match the number of CPUs on your own machine, edit the njob_in_para and nthr_per_job variables in run.R so that njob_in_para * nthr_per_job does not exceed the number of available cores.
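
Before editing run.R, it may help to check candidate values against the machine's core count. The snippet below is a small sketch of that arithmetic. The helper names are hypothetical, only njob_in_para and nthr_per_job come from run.R, and os.cpu_count() reports logical cores, which may differ from what a cluster scheduler actually grants.

```python
import os


def fits_core_budget(njob_in_para, nthr_per_job, n_cores=None):
    """Return True if njob_in_para * nthr_per_job does not exceed n_cores."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1  # cpu_count() may return None
    return njob_in_para * nthr_per_job <= n_cores


def suggest_njob(nthr_per_job, n_cores=None):
    """Largest njob_in_para (at least 1) that fits for a given nthr_per_job."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    return max(1, n_cores // nthr_per_job)


print(fits_core_budget(njob_in_para=10, nthr_per_job=4, n_cores=40))  # True
print(fits_core_budget(njob_in_para=10, nthr_per_job=4, n_cores=32))  # False
print(suggest_njob(nthr_per_job=4, n_cores=8))                        # 2
```

Leaving n_cores unset makes both helpers use the core count of the current machine.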

Human master set

Key results

Five meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty human ENCODE RNA-seq datasets. All five prediction results are saved in 2_human/reported/:

| file name | PRAM method |
| --- | --- |
| plcf.gtf.gz | pooling + Cufflinks |
| plst.gtf.gz | pooling + StringTie |
| cfmg.gtf.gz | Cufflinks + Cuffmerge |
| stmg.gtf.gz | StringTie + merging |
| cftc.gtf.gz | Cufflinks + TACO |

We quantified the expression levels of the transcript models predicted by 'pooling + Cufflinks', together with GENCODE (v24)-annotated transcripts, in each of the 30 ENCODE RNA-seq datasets. Their expression levels (in TPM) can be found in isoforms.tpm.gz.
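
One simple way to explore such a TPM table is to list transcripts that are not expressed in any dataset. The sketch below does this for a gzip-compressed, tab-separated table. The column layout is an assumption (a header line, the transcript ID in the first column, one TPM column per dataset); the README does not describe the format of isoforms.tpm.gz, so check the file before relying on this. The function name and toy IDs are hypothetical.

```python
import csv
import gzip
import io


def all_zero_transcripts(handle):
    """Return IDs of rows whose TPM is 0 in every sample column.

    ASSUMED layout (not confirmed by the README): tab-separated text with a
    header line, transcript ID in column 1, one TPM column per dataset after.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    return [
        row[0]
        for row in reader
        if row and all(float(value) == 0.0 for value in row[1:])
    ]


# Toy table standing in for the real file (hypothetical IDs and values)
toy = (
    "transcript_id\tsample1\tsample2\n"
    "tx_annotated\t1.5\t0\n"
    "tx_novel_a\t0\t0\n"
    "tx_novel_b\t0.0\t2.25\n"
)
toy_gz = gzip.compress(toy.encode())

# gzip.open accepts a path or a binary file object; "rt" yields text lines.
with gzip.open(io.BytesIO(toy_gz), "rt") as fh:
    zero_ids = all_zero_transcripts(fh)
print(zero_ids)  # ['tx_novel_a']
```

For the real data, replace io.BytesIO(toy_gz) with the path to isoforms.tpm.gz, provided the assumed layout holds.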

Reproducibility

To reproduce the model prediction results, run the following commands:

cd 2_human/
./prepareEncodeBam.R
./run.R

The script prepareEncodeBam.R will download thirty human RNA-seq BAM files from ENCODE, index them, and save them in 2_human/input/. It requires ~500 GB of hard drive space and takes ~3 hours using thirty 2.1 GHz CPUs. You can adjust the number of CPUs used with the njob_in_para variable in prepareEncodeBam.R.

The script run.R will predict transcript models in human intergenic regions based on the downloaded BAM files. It requires ~20 GB of space and takes ~4.5 hours using forty 2.1 GHz CPUs. The number of CPUs can be customized for your own machine in the same way as for the benchmark results. Predicted models will be saved as GTF files in 2_human/output/. Files will be named in the same way as in the table above.

Mouse hematopoietic system

Key results

Three meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty-two RNA-seq datasets from the mouse hematopoietic system, followed by selection of transcript models that do not overlap with RefSeq genes and have mappability ≥ 0.8. All three prediction results are saved in 3_mouse/reported/:

| file name | PRAM method |
| --- | --- |
| plcf.gtf.gz | pooling + Cufflinks |
| cfmg.gtf.gz | Cufflinks + Cuffmerge |
| cftc.gtf.gz | Cufflinks + TACO |
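
The selection step described above (no overlap with RefSeq genes, mappability ≥ 0.8) can be pictured with the following sketch. Everything in it is an illustrative assumption: the Interval type, the function names, the per-model mappability score, and the choice to ignore strand when testing overlap are not taken from the PRAM scripts.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """True if two intervals share at least one base (strand is ignored)."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def select_models(models, known_genes, min_mappability=0.8):
    """Keep models with enough mappability that overlap no known gene.

    models: list of (Interval, mappability) pairs.
    """
    return [
        interval
        for interval, mappability in models
        if mappability >= min_mappability
        and not any(overlaps(interval, gene) for gene in known_genes)
    ]


# Toy data (hypothetical coordinates and scores)
known_genes = [Interval("chr1", 1000, 2000)]
models = [
    (Interval("chr1", 1500, 2500), 0.95),  # dropped: overlaps a known gene
    (Interval("chr1", 3000, 4000), 0.60),  # dropped: mappability below 0.8
    (Interval("chr1", 5000, 6000), 0.80),  # kept
    (Interval("chr2", 1500, 1800), 0.90),  # kept: different chromosome
]
kept = select_models(models, known_genes)
print(kept)
```

The linear scan over known_genes is fine for a toy example; a genome-wide run would normally use a sorted or indexed interval structure instead.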

Reproducibility

PRAM is used to predict intergenic transcript models for the mouse hematopoietic system in the same way as for the human master set. You can refer to the script run.R for the human master set for the usage of PRAM.

We do not provide scripts for automatically reproducing the results because:

  • Some mouse ENCODE RNA-seq datasets do not have alignment BAM files available, such as ENCSR000CLU and ENCSR000CLY.
  • Some mouse ENCODE RNA-seq datasets do have alignment BAM files available, such as ENCSR000CHV and ENCSR000CHY, but these were based on GENCODE vM4 rather than vM9, which we used to define known genes and intergenic regions.
  • The mouse RNA-seq alignment BAM files we generated take ~750 GB of hard drive space, which would take users a long time to download.

Therefore, we simply provide the results instead. You are always welcome to contact us regarding the details of reproducing these results.

Reference

PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments. Peng Liu, Alexandra A. Soukup, Emery H. Bresnick, Colin N. Dewey, and Sündüz Keleş. Genome Research 2020 https://doi.org/10.1101/gr.252445.119

Contact

Got a question? Please report it at the issues tab in this repository.

pram_paper's Issues

A question about PRAM result

Hi, I used the PRAM pipeline to predict intergenic transcripts from the transcript data of 300 plant samples. There are 20,313 transcript models grouped into 12,019 gene models. I then added this intergenic GTF to my genome annotation GTF to quantify each sample's gene TPM, using STAR to align the reads and StringTie to calculate the TPM values. After that, I found that 5,000 novel intergenic transcripts had 0 TPM in all samples. I checked the data that you quantified in isoforms.tpm.gz. It also has nearly 900 novel intergenic transcripts with 0 TPM in all samples, but that is much better than my result.
I want to ask why the PRAM pipeline produces these 0 TPM intergenic transcripts, and how you quantified your intergenic transcripts. My results do not look credible.
Thank you very much.
