standage / tag

Genome annotation data analysis and management implemented in pure Python

Home Page: http://tag.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 99.66%, Makefile 0.34%

Topics: genomics, python, gff3, genome-annotation, bioinformatics, hacktoberfest

tag's Introduction

tag: Toolkit for Annotating Genomes

Computational biology is 90% text formatting and ID cross-referencing!
-- discouraged graduate students everywhere

tag is a free open-source software package for analyzing genome annotation data.

# Compute number of exons per gene
import tag
reader = tag.GFF3Reader(infilename='/data/genomes/mybug.gff3.gz')
for gene in tag.select.features(reader, type='gene'):
    exons = [feat for feat in gene if feat.type == 'exon']
    print('num exons:', len(exons))

To install the most recent stable release, run pip install tag from your terminal. Full installation instructions and project documentation are available at https://tag.readthedocs.io.

tag's People

Contributors

standage

Forkers

aipolly

tag's Issues

Reader and writer classes

  • support for gzip compression
  • writer resets IDs by default
  • more tests for sorted input data (and assumed sorted mode)
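
A minimal sketch of how transparent gzip support could look, assuming compression is inferred from the file extension. The helper name open_maybe_gzip is hypothetical and not part of tag's API.

```python
import gzip


def open_maybe_gzip(filename, mode='r'):
    """Open a plain-text or gzip-compressed file in text mode.

    Hypothetical helper: compression is inferred from a '.gz' suffix.
    """
    if filename.endswith('.gz'):
        # gzip.open defaults to binary mode, so request text mode with 't'.
        return gzip.open(filename, mode + 't')
    return open(filename, mode)
```

With something like this, a reader or writer could accept mybug.gff3 and mybug.gff3.gz interchangeably.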

Implement collapse mode for bacterial annotation evaluation

It would be helpful for visualization purposes to collapse redundant gene predictions from different sources. If 3 ab initio gene predictors agree on a single gene model and 2 others agree on a different model, it would be better to print two glyphs instead of 5, with the supporting annotation sources printed above.
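
A rough sketch of the collapsing step, assuming each prediction is reduced to a (source, seqid, exons) tuple and that two models count as redundant only when their coordinates match exactly. The function name and the tuple layout are made up for illustration.

```python
from collections import defaultdict


def collapse_models(predictions):
    """Group identical gene models and collect their supporting sources.

    `predictions` is an iterable of (source, seqid, exons) tuples, where
    `exons` is a sequence of (start, end) pairs. Two models are treated as
    identical when they share a seqid and the same exon coordinates.
    """
    support = defaultdict(list)
    for source, seqid, exons in predictions:
        key = (seqid, tuple(sorted(exons)))
        support[key].append(source)
    return dict(support)


# Five predictions, but only two distinct models (hypothetical data).
predictions = [
    ('predictorA', 'chr1', [(100, 250)]),
    ('predictorB', 'chr1', [(100, 250)]),
    ('predictorC', 'chr1', [(100, 250)]),
    ('predictorD', 'chr1', [(130, 250)]),
    ('predictorE', 'chr1', [(130, 250)]),
]
collapsed = collapse_models(predictions)
```

Each key in collapsed would correspond to one glyph, and its list of sources would be the label printed above it.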

Improve setup.py

It's very minimal at this point. Need to specify scripts and other info to allow proper installation.

Add more tests for tag occ

There is now a simple CDS test and a complex test involving multi-features. One or two tests of intermediate complexity would be nice: some overlap, but no multi-features.

Add ParsEval test

Should probably be a subset of an entire genome annotation to allow running in < 1 minute.

Error handling Parent attribute for features with multiple parents

Handling of the Parent attribute is wonky when a feature has multiple parents. In the following example, the first two mat_pep features should have both CDS1 and CDS2 as parents, but only CDS2 is shown. This makes me suspect the first parent ID is being overwritten when the second parent ID is set.

>>> import tag
>>> 
>>> gene = tag.Feature("chrom", "gene", 0, 100)
>>> cds1 = tag.Feature("chrom", "CDS", 0, 60)
>>> cds2 = tag.Feature("chrom", "CDS", 0, 90)
>>> pep1 = tag.Feature("chrom", "mat_pep", 1, 58)
>>> pep1 = tag.Feature("chrom", "mat_pep", 1, 28)
>>> pep2 = tag.Feature("chrom", "mat_pep", 31, 58)
>>> pep3 = tag.Feature("chrom", "mat_pep", 61, 88)
>>> 
>>> gene.add_child(cds1)
>>> gene.add_child(cds2)
>>> 
>>> cds1.add_child(pep1)
>>> cds2.add_child(pep1)
>>> 
>>> cds1.add_child(pep2)
>>> cds2.add_child(pep2)
>>> 
>>> cds2.add_child(pep3)
>>> 
>>> print(repr(gene))
chrom   tag     gene    1       100     .       .       .       .
chrom   tag     CDS     1       60      .       .       .       .
chrom   tag     CDS     1       90      .       .       .       .
chrom   tag     mat_pep 2       28      .       .       .       .
chrom   tag     mat_pep 32      58      .       .       .       .
chrom   tag     mat_pep 62      88      .       .       .       .
>>> 
>>> w = tag.GFF3Writer([gene])
>>> w.write()
##gff-version 3
chrom   tag     gene    1       100     .       .       .       ID=gene1
chrom   tag     CDS     1       60      .       .       .       ID=CDS1;Parent=gene1
chrom   tag     CDS     1       90      .       .       .       ID=CDS2;Parent=gene1
chrom   tag     mat_pep 2       28      .       .       .       Parent=CDS2
chrom   tag     mat_pep 32      58      .       .       .       Parent=CDS2
chrom   tag     mat_pep 62      88      .       .       .       Parent=CDS2
###
>>>

I thought this might just be a quirk of how features created in silico without ID and Parent attributes are handled. So I tried the following, and got an unexpected result.

>>> import tag
>>> 
>>> gene = tag.Feature("chrom", "gene", 0, 100, attrstr="ID=g1")
>>> cds1 = tag.Feature("chrom", "CDS", 0, 60, attrstr="ID=c1")
>>> cds2 = tag.Feature("chrom", "CDS", 0, 90, attrstr="ID=c2")
>>> pep1 = tag.Feature("chrom", "mat_pep", 1, 58)
>>> pep1 = tag.Feature("chrom", "mat_pep", 1, 28)
>>> pep2 = tag.Feature("chrom", "mat_pep", 31, 58)
>>> pep3 = tag.Feature("chrom", "mat_pep", 61, 88)
>>> 
>>> gene.add_child(cds1)
>>> cds1.add_attribute("Parent", "g1")
>>> gene.add_child(cds2)
>>> cds2.add_attribute("Parent", "g1")
>>> 
>>> cds1.add_child(pep1)
>>> cds2.add_child(pep1)
>>> pep1.add_attribute("Parent", "c1", append=True)
>>> pep1.add_attribute("Parent", "c2", append=True)
>>> 
>>> cds1.add_child(pep2)
>>> cds2.add_child(pep2)
>>> pep2.add_attribute("Parent", "c1", append=True)
>>> pep2.add_attribute("Parent", "c2", append=True)
>>> 
>>> cds2.add_child(pep3)
>>> pep3.add_attribute("Parent", "c2")
>>> 
>>> print(repr(gene))
chrom   tag     gene    1       100     .       .       .       ID=g1
chrom   tag     CDS     1       60      .       .       .       ID=c1;Parent=g1
chrom   tag     CDS     1       90      .       .       .       ID=c2;Parent=g1
chrom   tag     mat_pep 2       28      .       .       .       Parent=c1,c2
chrom   tag     mat_pep 32      58      .       .       .       Parent=c1,c2
chrom   tag     mat_pep 62      88      .       .       .       Parent=c2
>>> 
>>> w = tag.GFF3Writer([gene])
>>> w.write()
##gff-version 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/daniel.standage/Software/tag/tag/writer.py", line 87, in write
    feature.add_attribute('ID', fid)
  File "/Users/daniel.standage/Software/tag/tag/feature.py", line 444, in add_attribute
    oldvalue=oldid)
  File "/Users/daniel.standage/Software/tag/tag/feature.py", line 455, in add_attribute
    assert oldvalue in self._attrs[attrkey]
AssertionError
>>>
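
To make the suspected failure mode concrete, here is a toy model of a multi-valued attribute store. It is not tag's implementation; the Attributes class and its add method are invented for illustration, and it only shows how overwriting versus appending would produce Parent=CDS2 versus Parent=c1,c2.

```python
class Attributes:
    """Toy multi-valued attribute store (not tag's implementation)."""

    def __init__(self):
        self._attrs = {}

    def add(self, key, value, append=False):
        if append:
            # Keep existing values and add the new one if it is not present.
            values = self._attrs.setdefault(key, [])
            if value not in values:
                values.append(value)
        else:
            # Overwrite: any previously stored value is lost.
            self._attrs[key] = [value]

    def __str__(self):
        pairs = ('{}={}'.format(k, ','.join(v)) for k, v in self._attrs.items())
        return ';'.join(pairs)


overwritten = Attributes()
overwritten.add('Parent', 'CDS1')
overwritten.add('Parent', 'CDS2')

appended = Attributes()
appended.add('Parent', 'c1', append=True)
appended.add('Parent', 'c2', append=True)
```

Here str(overwritten) loses the first parent while str(appended) keeps both, which matches the two behaviors seen in the sessions above.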

Spitballing

  • tag jury: evm
  • tag gavel: gaeval
  • tag db: geneannology

New 'infer' module

...with generator functions focused on inference of implicit features in gene structure annotations.
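
As one example of what such a generator might look like, here is a sketch that infers introns from the exons of a single transcript. It assumes half-open (start, end) coordinate pairs and is independent of tag's actual classes; the function name is hypothetical.

```python
def infer_introns(exons):
    """Yield (start, end) intron intervals implied by a transcript's exons.

    `exons` is an iterable of half-open (start, end) pairs from one
    transcript; introns are the gaps between consecutive sorted exons.
    """
    ordered = sorted(exons)
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        if right_start > left_end:
            yield (left_end, right_start)


introns = list(infer_introns([(50, 60), (0, 10), (20, 30)]))
```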

Better handling of feature score

Currently, score is either stored as None with . as the textual GFF3 representation, or it's stored as a float with {:.3f} as the textual GFF3 representation. However, the GFF3 spec makes it clear that the semantics of the score are ill-defined (other than the fact that it's a floating point number). It might be worth handling this a bit better: autodetecting decimal notation vs scientific notation, or perhaps simply setting cutoffs below/above which scientific notation is used.
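
A sketch of the cutoff idea, with made-up thresholds: scores whose magnitude falls outside the range [low, high) are written in scientific notation, and everything else keeps the current three-decimal format. The function name and both default cutoffs are assumptions.

```python
def format_score(score, low=1e-3, high=1e6):
    """Render a feature score as text for the GFF3 score column.

    Hypothetical cutoffs: nonzero magnitudes below `low` or at/above `high`
    use scientific notation; everything else uses fixed three-decimal
    notation. A missing score is rendered as '.'.
    """
    if score is None:
        return '.'
    magnitude = abs(score)
    if magnitude != 0 and (magnitude < low or magnitude >= high):
        return '{:.3E}'.format(score)
    return '{:.3f}'.format(score)
```

For instance, format_score(0.5) gives '0.500' while format_score(1e-10) gives '1.000E-10', so very small scores such as E-values no longer collapse to 0.000.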

Add some basic generators

  • retrieve entries by type (directive, feature, sequence, etc)
  • retrieve features from a specified window
  • retrieve features by type (gene, exon, etc)
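
The generators above could be sketched roughly as follows. The Feature namedtuple is a hypothetical stand-in for a real feature object, non-feature entries are represented as plain strings, and coordinates are assumed to be half-open intervals.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation feature record.
Feature = namedtuple('Feature', 'seqid type start end')


def features_by_type(entries, ftype):
    """Yield only features whose type matches `ftype`."""
    for entry in entries:
        if isinstance(entry, Feature) and entry.type == ftype:
            yield entry


def features_in_window(entries, seqid, start, end):
    """Yield features overlapping the half-open window [start, end)."""
    for entry in entries:
        if not isinstance(entry, Feature):
            continue  # skip directives, comments, and other entry types
        if entry.seqid == seqid and entry.start < end and entry.end > start:
            yield entry


entries = [
    '##gff-version 3',
    Feature('chr1', 'gene', 0, 100),
    Feature('chr1', 'exon', 0, 40),
    Feature('chr1', 'exon', 60, 100),
    Feature('chr2', 'gene', 10, 50),
]
exons = list(features_by_type(entries, 'exon'))
in_window = list(features_in_window(entries, 'chr1', 50, 70))
```

Because both functions are generators, they can be chained over a reader without loading the whole file into memory.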

Sorting of all object types

Feature, directive, and comment objects all need to be able to sort correctly with respect to each other. As I see it, ##gff-version < ##sequence-region < other directives < comments < features.
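
One way to get that ordering is a sort key that ranks each entry by kind before comparing its contents. This is only a sketch: entries are modeled as hypothetical (kind, payload) pairs rather than real objects, and features are compared by a (seqid, start, end) tuple.

```python
def entry_sort_key(entry):
    """Return a tuple that orders entries by kind, then by payload.

    `entry` is a (kind, payload) pair where kind is 'directive', 'comment',
    or 'feature'. Directive and comment payloads are strings; feature
    payloads are (seqid, start, end) tuples.
    """
    kind, payload = entry
    if kind == 'directive':
        if payload.startswith('##gff-version'):
            return (0,)
        if payload.startswith('##sequence-region'):
            return (1, payload)
        return (2, payload)
    if kind == 'comment':
        return (3, payload)
    return (4, payload)


entries = [
    ('feature', ('chr1', 5, 10)),
    ('comment', '# note'),
    ('directive', '##species x'),
    ('directive', '##sequence-region chr1 1 100'),
    ('directive', '##gff-version 3'),
]
ordered = sorted(entries, key=entry_sort_key)
```

In real classes the same ranking could live in __lt__, so that sorted() works on a mixed list of objects directly.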

Migrate aeneas --> tag

  • shorter
  • relevant ("tagging" the genome with annotations)
  • available on PyPI
  • most important: easy to pronounce, remember, and google!

FeatureIndex class

  • interval forest (dictionary of interval trees)
  • sequence regions, both inferred and explicit (if any)
  • functions to consume features, directives, and files
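
A toy version of the idea, to pin down the interface. Note the substitution: a plain list with a linear scan stands in for each interval tree, so this shows the dictionary-of-indexes shape but not the query performance. The class and method names are hypothetical, and intervals are assumed to be half-open.

```python
from collections import defaultdict


class SimpleFeatureIndex:
    """Toy index holding one interval list per sequence ID.

    A plain list with a linear scan stands in for a real interval tree.
    """

    def __init__(self):
        self._intervals = defaultdict(list)

    def consume(self, seqid, start, end, label):
        """Add one interval to the index for the given sequence."""
        self._intervals[seqid].append((start, end, label))

    def query(self, seqid, start, end):
        """Return labels of intervals overlapping half-open [start, end)."""
        hits = self._intervals.get(seqid, [])
        return [lab for s, e, lab in hits if s < end and e > start]

    def inferred_region(self, seqid):
        """Infer a sequence region from the extent of indexed intervals."""
        hits = self._intervals.get(seqid)
        if not hits:
            return None
        return (min(s for s, _, _ in hits), max(e for _, e, _ in hits))


index = SimpleFeatureIndex()
index.consume('chr1', 0, 100, 'gene1')
index.consume('chr1', 200, 300, 'gene2')
index.consume('chr2', 50, 80, 'gene3')
```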
