BED Annotation

A tool that assigns gene names to regions in a BED file based on Ensembl genomic features overlap.

Requirements

Python 3.6, 3.7, 3.8, 3.9, 3.10.

Installation

pip install bed_annotation

Usage

bed_annotation INPUT.bed -g hg19 -o OUTPUT.bed

The script checks each BED region against the Ensembl genomic features database and writes a BED file in a standardized format, with the gene symbol, exon rank and strand in columns 4-6:

INPUT.bed:

chr1    69090   70008
chr1    367658  368597

OUTPUT.bed:

chr1    69090   70008   OR4F5   1       +
chr1    367658  368597  OR4F29  1       +

Available genomes (to provide with -g): GRCh37, hg19, hg38.

Transcripts order

The priority for choosing transcripts for annotation is as follows:

  • Overlap % with transcript
  • Overlap % with CDS
  • Overlap % with exons
  • Biotype (protein_coding > others > *RNA > *_decay > sense_* > antisense > translated_* > transcribed_*)
  • TSL (1 > NA > others > 2 > 3 > 4 > 5)
  • Presence of a HUGO gene symbol
  • Is cancer canonical
  • Transcript size
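The ordering above can be read as a lexicographic sort key. The following is a minimal sketch of that idea, not the tool's actual implementation: the `Candidate` class, its field names, and the helper functions are all hypothetical, and it assumes that larger overlaps and longer transcripts are preferred (the list does not state the direction for transcript size).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical record describing one transcript overlapping a region."""
    transcript_id: str
    tx_overlap: float    # % of the region overlapping the transcript
    cds_overlap: float   # % of the region overlapping CDS
    exon_overlap: float  # % of the region overlapping exons
    biotype: str
    tsl: str             # transcript support level: "1".."5" or "NA"
    has_hugo: bool
    is_canonical: bool
    size: int            # transcript length in bases


def biotype_rank(biotype: str) -> int:
    """Lower is better, following the biotype order listed above."""
    if biotype == "protein_coding":
        return 0
    if biotype.startswith("transcribed_"):
        return 7
    if biotype.startswith("translated_"):
        return 6
    if biotype == "antisense":
        return 5
    if biotype.startswith("sense_"):
        return 4
    if biotype.endswith("_decay"):
        return 3
    if biotype.endswith("RNA"):
        return 2
    return 1  # "others"


# TSL order from the list above: 1 > NA > others > 2 > 3 > 4 > 5.
_TSL_RANK = {"1": 0, "NA": 1, "2": 3, "3": 4, "4": 5, "5": 6}


def tsl_rank(tsl: str) -> int:
    return _TSL_RANK.get(tsl, 2)  # any other value counts as "others"


def priority_key(c: Candidate) -> Tuple:
    # Tuples compare element by element, so earlier criteria dominate.
    # Assumption: bigger overlaps and longer transcripts are preferred.
    return (
        -c.tx_overlap,
        -c.cds_overlap,
        -c.exon_overlap,
        biotype_rank(c.biotype),
        tsl_rank(c.tsl),
        not c.has_hugo,       # False sorts before True
        not c.is_canonical,
        -c.size,
    )


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from most to least preferred."""
    return sorted(candidates, key=priority_key)
```

Because the key is a tuple, a later criterion such as TSL only matters when every earlier criterion ties.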

Extended annotation

Use the --extended option to report extra columns with details on features, biotype, overlapping transcripts and overlap sizes:

bed_annotation INPUT.bed -g hg19 -o OUTPUT.bed --extended

OUTPUT.bed:

## Tx_overlap_%: part of region overlapping with transcripts
## Exon_overlaps_%: part of region overlapping with exons
## CDS_overlaps_%: part of region overlapping with protein coding regions
#Chrom  Start   End     Gene    Exon  Strand  Feature Biotype         Ensembl_ID      TSL HUGO    Tx_overlap_% Exon_overlaps_% CDS_overlaps_% Ori_Fields
chr1    69090   70008   OR4F5   1     +       capture protein_coding  ENST00000335137 NA  OR4F5   100.0        100.0           99.7
chr1    367658  368597  OR4F29  1     +       capture protein_coding  ENST00000426406 NA  OR4F29  100.0        100.0           99.7
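For downstream processing, the extended output can be loaded into dictionaries keyed by column name. Below is a minimal sketch, assuming the file is tab-delimited (as is usual for BED; the listing above is shown with aligned spacing), that `##` lines are comments, and that the single `#` line is the header. The function name `read_extended_bed` is hypothetical and not part of bed_annotation.

```python
from typing import Dict, Iterable, List


def read_extended_bed(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse extended output into one dict per annotated region."""
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and column descriptions
        if line.startswith("#"):
            header = line[1:].split("\t")  # "#Chrom" becomes "Chrom"
            continue
        if header is None:
            raise ValueError("data line found before the '#Chrom' header")
        # zip() stops at the shorter sequence, so a missing trailing
        # column (such as Ori_Fields) is simply left out of the dict.
        rows.append(dict(zip(header, line.split("\t"))))
    return rows
```

A typical call would be `rows = read_extended_bed(open("OUTPUT.bed"))`, after which `rows[0]["Gene"]` gives the gene symbol of the first region.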

Ambiguous annotations

Regions may overlap multiple genes. The --ambiguities option controls how the script resolves such ambiguities:

  • --ambiguities all -- report all reliable overlaps (in the priority order given under "Transcripts order" above)
  • --ambiguities all_ask -- stop execution and ask the user which annotation to pick
  • --ambiguities best_all (default) -- find the best overlap, and if several are equally good (in terms of the priority order above), report all of them
  • --ambiguities best_ask -- find the best overlap, and if several are equally good, ask the user
  • --ambiguities best_one -- find the best overlap, and if several are equally good, report any one of them

Note that the first 4 options might output multiple lines per region, e.g.:

bed_annotation INPUT.bed -g hg19 -o OUTPUT.bed --extended --ambiguities best_all

OUTPUT.bed:

## Tx_overlap_%: part of region overlapping with transcripts
## Exon_overlaps_%: part of region overlapping with exons
## CDS_overlaps_%: part of region overlapping with protein coding regions
#Chrom  Start   End     Gene    Exon    Strand  Feature Biotype Ensembl_ID      TSL     HUGO    Tx_overlap_%    Exon_overlaps_% CDS_overlaps_%
chr1    69090   70008   OR4F5   1       +       capture protein_coding  ENST00000335137 NA      OR4F5   100.0   100.0   100.0
chr1    367658  368597  OR4F29  1       +       capture protein_coding  ENST00000426406 NA      OR4F29  100.0   100.0   100.0
chr1    367658  368597  OR4F29  1       +       capture protein_coding  ENST00000412321 NA      OR4F29  100.0   100.0   100.0
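The non-interactive modes can be pictured as different ways of cutting a ranked list of overlaps. This is a minimal sketch under that assumption, not the tool's code: `resolve_ambiguity` is a hypothetical name, the `key` argument stands in for whatever priority ordering is used, and the interactive `all_ask` and `best_ask` modes are left out.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def resolve_ambiguity(
    candidates: Sequence[T],
    key: Callable[[T], Tuple],
    mode: str = "best_all",
) -> List[T]:
    """Pick the annotations to report for one region.

    `key` returns a sortable tuple where smaller means better.
    """
    if mode not in ("all", "best_all", "best_one"):
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    ranked = sorted(candidates, key=key)  # stable: ties keep input order
    if not ranked:
        return []
    if mode == "all":
        return ranked
    best_key = key(ranked[0])
    best = [c for c in ranked if key(c) == best_key]
    if mode == "best_all":
        return best
    return best[:1]  # best_one: any of the ties would do; take the first
```

With the two equally good OR4F29 transcripts from the example above, `best_all` would return both and `best_one` would return a single entry.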

Other options

  • --coding-only: take only the features of type protein_coding for annotation
  • --high-confidence: annotate with only high confidence regions (TSL is 1 or NA, with HUGO symbol, total overlap size > 50%)
  • --canonical: use only canonical transcripts to annotate (which for the most part means the longest transcript, by SnpEff definition)
  • --short: add only the 4th "Gene" column (outputs a 4-col BED file instead of 6-col)
  • --output-features: good for debugging. Under each BED file region, also output the Ensembl features that were used to annotate it
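As an illustration of the --high-confidence criteria, the check below applies the same three conditions to a row of extended output represented as a dict of column name to string. It is a sketch only: the function name is hypothetical, it assumes "total overlap size" corresponds to the Tx_overlap_% column, and it assumes a missing HUGO symbol appears as an empty string or ".".

```python
from typing import Mapping


def is_high_confidence(row: Mapping[str, str], min_overlap: float = 50.0) -> bool:
    """Return True if a row meets the criteria described for --high-confidence."""
    tsl_ok = row.get("TSL") in ("1", "NA")
    # Assumption: an absent HUGO symbol is written as "" or ".".
    has_hugo = row.get("HUGO", "") not in ("", ".")
    try:
        # Assumption: "total overlap" is the Tx_overlap_% column.
        overlap = float(row.get("Tx_overlap_%", ""))
    except ValueError:
        return False  # missing or non-numeric overlap value
    return tsl_ok and has_hugo and overlap > min_overlap
```

Filtering after the fact like this is not a substitute for the option itself, since the tool applies its criteria while choosing among features rather than on the final output.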
