Coder Social home page Coder Social logo

althapalignr_py's Introduction

Introduction

This repository holds the development history of the make_a_table.py script embedded in jknightlab/AltHapAlignR, plus a few extra modifications. Please report all issues and pull-requests there, not here.

The script is designed to read a set of BAM files representing reads aligned to various haplotypes, combine them and intersect the alignments with gene coordinates. The output tells for each read the read it belongs to and on which haplotypes it is found.

Installation

The python script has a few dependencies:

  • pybam: "Pure Python" -but fast- library to read BAM files
  • intervaltree: "Pure Python" library that implements interval trees
  • quicksect: C/Python library that implements interval trees too but is about 4x faster than intervaltree. Note that its installation may require Cython and a compiler (e.g. gcc) setup.

Only one of the last two is needed, quicksect being the preferred option for performance reasons.

There are several ways of bringing them in, the easiest being with pip. Note that you may want to first setup a virtualenv before installing the dependencies, to ensure your environment is clean and self-contained. For instance:

# Where the files are going to be stored
ALTHAPALIGN_VENV=$PWD/althapalign_virtualenv

# To create a "virtualenv" (only the first time)
virtualenv $ALTHAPALIGN_VENV

# To start using the "virtualenv"
source $ALTHAPALIGN_VENV/bin/activate

# To install the dependencies
pip install https://github.com/JohnLonginotto/pybam/zipball/master
pip install cython
pip install quicksect

# To stop using it, once finished
deactivate

Execution

Inputs

Command line

The general syntax is

Usage: make_a_table.py [options] gtf_file bam_file_1 bam_file_2 ...

Options:
  -h, --help            show this help message and exit
  -r NAME_TO_REPLACE NEW_NAME, --rename_gene=NAME_TO_REPLACE NEW_NAME
                        Replace some erroneous gene names
  -g GENE_TYPES, --gene_types=GENE_TYPES
                        Comma-separated list of gene biotypes to use [default:
                        protein_coding]. Use an empty string for no filtering
  -t TRANSCRIPT_TYPES, --transcript_types=TRANSCRIPT_TYPES
                        Comma-separated list of transcript biotypes to use for
                        the exon-overlap filtering [default: protein_coding].
                        Use an empty string for no filtering
  -n, --no_gtf_filter   Do not use a GTF file to filter the reads. The command
                        line arguments are then expected to all be BAM files.

The -r option can be repeated if several genes have the wrong name. For instance, we use Gencode 21 in the paper, which has to be fixed with -r VARSL VARS2 -r C6orf205 MUC21

Output

The output is usually a tab-separated file with n+2 columns, n being the number of BAM files.

read name gene name edit distance in BAM file 1 edit distance in BAM file 2 ...

Each edit distance is actually the sum of the edit distances of the two reads in each pair.

If the --no_gtf_filter option is used, the second column is skipped.

Runtime statistics

The script's CPU-time is linear in the total number of reads found in the BAM files. Depending on the CPU, it can parse between 30,000 and 80,000 reads per second. The analysis from the paper over 8 BAM files comprising more than 13 million reads takes less than 3 minutes on an Intel i7-7500U CPU @ 2.70GHz (on 1 core only). The memory consumption is limited than 20MB with quicksect, 35MB with intervaltree, regardless of the number of reads. It only depends on the number of genes and exons in the GTF file.

Example command-line

./make_a_table.py -r VARSL VARS2 -r C6orf205 MUC21 gencode.v21.only_MHC.annotation.gtf.gz */*picard-sorted*bam > output_file

althapalignr_py's People

Contributors

muffato avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.