ribohmm's Introduction

RiboHMM

Installation

The RiboHMM package can be installed through pip from the GitHub repository:

pip install git+https://github.com/djf604/RiboHMM

This will install two executables:

  • ribohmm, which is the interface for learning the model and inferring translated sequences
  • ribohmm-utils, which is used to pre-process the input data and compute mappability

Pre-processing the Sample Data

Aligned Ribo-seq and RNA-seq data (as BAM files) must first be converted to a tabix-indexed tabular count format. The ribohmm-utils executable has a sub-program called bam-to-counts which automates much of this process. Given multiple BAM files, bam-to-counts outputs a single tabix-indexed counts file containing counts aggregated across all inputs. For example:

ribohmm-utils bam-to-counts --bams sample1.riboseq.bam sample2.riboseq.bam --bam-type riboseq \
--output-prefix example_output

The above will read in sample1.riboseq.bam and sample2.riboseq.bam and produce two files:

  • example_output.ribo.counts.bed.gz, the tabular counts file
  • example_output.ribo.counts.bed.gz.tbi, the tabix index of the above file
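
As a quick sanity check after the conversion, the expected output names can be derived from the prefix and tested for existence. The following is a minimal Python sketch, not part of RiboHMM. The helper names are hypothetical, the "ribo" tag is taken from the example above, and the "rnaseq" key with its "rna" tag is an assumption inferred from the file names used later in this README.

```python
from pathlib import Path

# Assumed mapping from --bam-type values to the short tag used in file names.
# "riboseq" -> "ribo" comes from the example above; "rnaseq" -> "rna" is an
# assumption inferred from the file names used later in this README.
BAM_TYPE_TAGS = {"riboseq": "ribo", "rnaseq": "rna"}


def expected_counts_files(output_prefix, bam_type):
    """Return the (counts, index) paths bam-to-counts is expected to write."""
    tag = BAM_TYPE_TAGS[bam_type]
    counts = Path(f"{output_prefix}.{tag}.counts.bed.gz")
    index = Path(f"{counts}.tbi")
    return counts, index


def missing_outputs(output_prefix, bam_type):
    """List the expected output files that do not exist on disk."""
    return [p for p in expected_counts_files(output_prefix, bam_type) if not p.exists()]


if __name__ == "__main__":
    for path in missing_outputs("example_output", "riboseq"):
        print(f"missing: {path}")
```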

The sample BAMs can also be passed directly to the main ribohmm executable, which converts them as above on the fly.

For more detailed documentation refer to the wiki (coming soon).

Computing Mappability for Ribo-seq Data

Since ribosome footprints are typically short (28-31 base pairs), footprints originating from many positions in the transcriptome are likely not to be uniquely mappable. Thus, with standard parameters for mapping ribosome footprint sequencing data, a large fraction of the transcriptome will have no footprints mapping to it due to mappability issues. While RiboHMM can be used without accounting for missing data due to mappability, we have observed the results to be substantially more accurate when mappability is properly handled.
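
The effect of read length on uniqueness can be illustrated with a toy Python sketch. It only counts exact repeats of each k-mer within a single made-up sequence. Real mappability depends on the aligner, its parameters, and the whole reference, so this is an illustration of the idea, not how RiboHMM computes mappability. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_mask(sequence, k):
    """Return one boolean per start position: True if the k-mer starting
    there occurs exactly once in the sequence (exact matches only)."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


if __name__ == "__main__":
    # Toy sequence containing a repeated "ACGT" block.
    toy = "ACGTACGTTTGA"
    for k in (3, 6):
        mask = unique_kmer_mask(toy, k)
        print(f"k={k}: {sum(mask)} of {len(mask)} start positions are unique")
```

In this toy sequence, 6 of the 10 start positions are unique for k=3, while all 7 are unique for k=6: shorter fragments are more likely to occur more than once.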

Given a GTF file that contains the transcriptome, mappability information (i.e., whether each position in the transcriptome can produce a uniquely mappable ribosome footprint or not) can be obtained in 3 steps:

  1. For each desired footprint length (default 28-31bp), build a FASTQ with all footprints that could originate from the given transcriptome. This is done with the ribohmm-utils sub-program mappability-generate.

    ribohmm-utils mappability-generate --gtf-file transcriptome.gtf --fasta-reference genome.fa \
    --footprint-lengths 28 29 30 31 --output-fastq-stub example_output
    

    The above produces four files:

    • example_output_footprint28.fq.gz
    • example_output_footprint29.fq.gz
    • example_output_footprint30.fq.gz
    • example_output_footprint31.fq.gz
  2. Align the created synthetic FASTQs, using the same mapping strategy used for the original ribosome footprint profiling data. The BAM alignments will be the input into the next step.

  3. For each desired footprint length, build a tabix-indexed mappability file from the BAM produced in step 2. This marks whether a footprint originating from a given position uniquely mapped back to the same place. This is done with the ribohmm-utils sub-program mappability-compute.

    ribohmm-utils mappability-compute --mappability-bam example_output_length28.bam \
    --output-tabix example_output_mappability_28.bed 
    

    The above produces two files:

    • example_output_mappability_28.bed.gz
    • example_output_mappability_28.bed.gz.tbi
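
The logic behind steps 1 and 3 can be sketched in Python as follows. This is a simplified illustration, not the implementation used by mappability-generate or mappability-compute. The read-name scheme, the placeholder quality string, and both function names are assumptions, and the sketch treats the origin and the alignment as sharing one coordinate system.

```python
def footprint_fastq_records(name, sequence, length):
    """Yield one FASTQ record (as a string) for every window of the given
    length in the sequence. Quality values are a constant placeholder."""
    for start in range(len(sequence) - length + 1):
        read = sequence[start:start + length]
        # Hypothetical read name that encodes where the footprint came from.
        header = f"@{name}:{start}:{length}"
        yield f"{header}\n{read}\n+\n{'I' * length}\n"


def maps_back_uniquely(read_name, aligned_start, num_alignments):
    """True if the read aligned exactly once, at the start encoded in its name.

    read_name is the FASTQ header without the leading '@'.
    """
    _, start, _ = read_name.rsplit(":", 2)
    return num_alignments == 1 and aligned_start == int(start)


if __name__ == "__main__":
    for record in footprint_fastq_records("tx1", "ACGTACGT", 5):
        print(record, end="")
```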

For more detailed documentation refer to the wiki (coming soon).

Running the RiboHMM Algorithm

The main interface to RiboHMM is through the ribohmm executable. By default, both the parameter learning and inference steps are run in sequence. Some flags can be given to change that behavior:

  • --learn-only causes the program to learn the model parameters, save them to a model parameters JSON, and exit
  • --infer-only causes the program to skip the model learning step, instead accepting a model parameters JSON and moving directly to inference
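
The behavior of these two flags can be modeled with a small Python sketch. It is not RiboHMM's actual argument parser. The program name ribohmm-sketch and the function names are hypothetical, and treating the two flags as mutually exclusive is an assumption; only the flag names themselves come from this README.

```python
import argparse


def build_parser():
    """Build a toy parser with the mode flags described above."""
    parser = argparse.ArgumentParser(prog="ribohmm-sketch")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--learn-only", action="store_true")
    mode.add_argument("--infer-only", action="store_true")
    parser.add_argument("--model-parameters")
    return parser


def steps_to_run(args):
    """Return the ordered list of steps implied by the parsed flags."""
    if args.learn_only:
        return ["learn"]
    if args.infer_only:
        # Inference alone needs previously learned parameters.
        if args.model_parameters is None:
            raise SystemExit("--infer-only needs --model-parameters")
        return ["infer"]
    # Default: learn the parameters, then infer.
    return ["learn", "infer"]


if __name__ == "__main__":
    print(steps_to_run(build_parser().parse_args()))
```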

For more detailed documentation refer to the wiki (coming soon).

Learning the Model

In general the necessary inputs to learn the model parameters are:

  • Reference genome in FASTA format
  • Transcriptome in GTF format
  • Ribo-seq BAM(s) or tabix-indexed counts file (created with ribohmm-utils bam-to-counts)

Optional but helpful inputs include:

  • Corresponding RNA-seq BAM(s) or tabix-indexed counts file (created with ribohmm-utils bam-to-counts)
  • Mappability tabix-indexed counts files (created with ribohmm-utils mappability-generate and mappability-compute)

An example:

ribohmm --learn-only --reference ref/genome.fa --transcriptome ref/transcriptome.gtf \
    --riboseq-counts data/example_output.ribo.counts.bed.gz \
    --rnaseq-counts data/example_output.rna.counts.bed.gz \
    --mappability-tabix-prefix data/mappability/example_output_mappability \
    --output run001 --batch-size 10

This produces a directory called run001 containing a single file, model_parameters.json, which can be passed to future --infer-only runs of RiboHMM.
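
Before starting an inference run, it can be worth confirming that the learned parameters file exists and parses as JSON. The Python sketch below does only that. The helper name is hypothetical, and nothing is assumed about the keys stored in the file.

```python
import json
from pathlib import Path


def load_model_parameters(run_dir):
    """Load model_parameters.json from a run directory and return its content."""
    path = Path(run_dir) / "model_parameters.json"
    if not path.is_file():
        raise FileNotFoundError(f"no model parameters file at {path}")
    with path.open() as handle:
        return json.load(handle)


if __name__ == "__main__":
    try:
        params = load_model_parameters("run001")
        print(f"loaded model parameters ({type(params).__name__})")
    except FileNotFoundError as error:
        print(error)
```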

Inferring Translated Sequences

Unless ribohmm is run with the --learn-only flag, this inference step is automatically run following the learning step. This step can also be run directly by giving the --infer-only flag and providing a model parameters JSON.

If run directly, this step generally needs the same inputs as the learning step, plus a model parameters JSON, which is often called model_parameters.json.

An example:

ribohmm --infer-only --reference ref/genome.fa --transcriptome ref/transcriptome.gtf \
    --riboseq-counts data/example_output.ribo.counts.bed.gz \
    --rnaseq-counts data/example_output.rna.counts.bed.gz \
    --mappability-tabix-prefix data/mappability/example_output_mappability \
    --model-parameters run001/model_parameters.json \
    --output run001

This writes a file called inferred_CDS.bed, which contains the inferred translated sequences, to the run001 directory.
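
The output can then be inspected with a few lines of Python. The sketch below assumes only that inferred_CDS.bed follows the standard BED convention of tab-separated lines whose first three columns are chromosome, start, and end. Any further columns RiboHMM may write are not interpreted here, and the function name is hypothetical.

```python
def read_bed_intervals(lines):
    """Yield (chrom, start, end) tuples from BED-formatted lines.

    Blank lines and comment, track, or browser lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        yield fields[0], int(fields[1]), int(fields[2])


def summarize_bed(path):
    """Print the number of intervals in a BED file and the span they cover."""
    with open(path) as handle:
        intervals = list(read_bed_intervals(handle))
    total = sum(end - start for _, start, end in intervals)
    print(f"{len(intervals)} intervals spanning {total} bases")
```

For the example run above, this would be called as summarize_bed("run001/inferred_CDS.bed").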

Support

If errors are encountered, please open an issue on this repository with a detailed bug report.

ribohmm's Issues

learn_infer error

I got the following log message by running

ribohmm learn-infer --reference-fasta ./reference/hg19.fa \
    --transcriptome-gtf ./reference/RNAseqGeuvadis_STAR_combined.gtf \
    --riboseq-bam ./test_data/all_uniquely_mapped_reads.sort.bam \
    --rnaseq-bam ./test_data/RNAseqGeuvadis_STAR_combined.final.correct.sort.bam \
    --kozak-model ./build/lib/ribohmm/include/kozak_model.npz \
    --log-output ./log/ --batch-size 10 \
    --mappability-tabix-prefix ./mappability/old/tophat_out/accepted_hits_mappability \
    --output-directory ./output/batch_10

The following have been reloaded with a version change:

  1. GCC/4.9.2 => GCC/8.2.0
  2. binutils/2.25 => binutils/2.30-GCCcore-8.2.0
  3. icc/2016.u3-GCC-4.9.2 => icc/2018.u4-GCC-8.2.0
  4. iccifort/2016.u3-GCC-4.9.2 => iccifort/2018.u4-GCC-8.2.0
  5. ifort/2016.u3-GCC-4.9.2 => ifort/2018.u4-GCC-8.2.0
  6. iimpi/2016.u3-GCC-4.9.2 => iimpi/2018.u4-GCC-8.2.0
  7. imkl/11.3.3.210-iimpi-2016.u3-GCC-4.9.2 => imkl/2018.4.274-iimpi-2018.u4-GCC-8.2.0
  8. impi/5.1.3.223-iccifort-2016.u3-GCC-4.9.2 => impi/2018.4.274-iccifort-2018.u4-GCC-8.2.0
  9. intel/2016.u3 => intel/2018.u4

Completed chr1
Completed chr2
Completed chr3
Completed chr4
Completed chr5
Completed chr6
Completed chr7
Completed chr8
Completed chr9
Completed chr10
Completed chr11
Completed chr12
Completed chr13
Completed chr14
Completed chr15
Completed chr16
Completed chr17
Completed chr18
Completed chr19
Completed chr20
Completed chr21
Completed chr22
Completed chrX
Completed chrY
Completed chrMT
Traceback (most recent call last):
  File "/home/zzhou/.local/bin/ribohmm", line 11, in <module>
    load_entry_point('RiboHMM==1.0.0', 'console_scripts', 'ribohmm')()
  File "/home/zzhou/.local/lib/python3.7/site-packages/RiboHMM-1.0.0-py3.7.egg/ribohmm/__init__.py", line 30, in execute_from_command_line
  File "/home/zzhou/.local/lib/python3.7/site-packages/RiboHMM-1.0.0-py3.7.egg/ribohmm/_cmds/learn_infer.py", line 10, in main
  File "/home/zzhou/.local/lib/python3.7/site-packages/RiboHMM-1.0.0-py3.7.egg/ribohmm/_cmds/_main.py", line 107, in execute_ribohmm
  File "/home/zzhou/.local/lib/python3.7/site-packages/RiboHMM-1.0.0-py3.7.egg/ribohmm/contrib/bam_to_tbi.py", line 109, in convert_riboseq
  File "/usr/local/easybuild/software/Anaconda3/5.3.1/lib/python3.7/subprocess.py", line 304, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/easybuild/software/Anaconda3/5.3.1/lib/python3.7/subprocess.py", line 756, in __init__
    restore_signals, start_new_session)
  File "/usr/local/easybuild/software/Anaconda3/5.3.1/lib/python3.7/subprocess.py", line 1413, in _execute_child
    executable = os.fsencode(executable)
  File "/usr/local/easybuild/software/Anaconda3/5.3.1/lib/python3.7/os.py", line 809, in fsencode
    filename = fspath(filename)  # Does type-checking of filename.
TypeError: expected str, bytes or os.PathLike object, not NoneType
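
The final frames show subprocess receiving None where an executable path was expected. One common way for that to happen is a lookup such as shutil.which() returning None because an external tool is not on PATH. Whether that is what happens inside bam_to_tbi.py is an assumption, not something the traceback confirms. The Python sketch below checks a list of tool names against PATH. The function name is hypothetical, and the tool names are placeholders rather than a confirmed list of RiboHMM's dependencies.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Placeholder tool names: replace with whatever the pipeline shells out to.
    for name in missing_executables(["samtools", "bgzip", "tabix"]):
        print(f"not found on PATH: {name}")
```

If a tool is reported missing, loading the corresponding environment module or adding its directory to PATH before rerunning would be the first thing to try.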
