genometools / genometools

GenomeTools genome analysis system.

License: Other

Lua 0.73% Python 1.55% Ruby 7.36% Shell 0.72% Haskell 0.01% Perl 0.03% C 79.85% C++ 1.88% Makefile 0.28% HTML 7.49% CSS 0.02% sed 0.01% Go 0.08%

genome bioinformatics gff3 genomics library annotation repeats python ruby lua

genometools's Introduction

GenomeTools

The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named libgenometools which contains a wide variety of classes for efficient and convenient implementation of sequence and annotation processing software.

If you are interested in gene prediction, have a look at GenomeThreader.

Platforms

GenomeTools has been designed to run on every POSIX compliant UNIX system, for example, Linux, macOS, and OpenBSD.

Building and Installation

Debian-based operating systems

Debian and Ubuntu users can install the most recent stable version simply using apt, e.g.

apt-get install genometools

(as root) to install the gt executable. To install the library and development headers, use

apt-get install libgenometools0 libgenometools0-dev

instead. This is not required to just use the tools.

macOS (via Homebrew)

If Homebrew is installed, GenomeTools can be installed on supported macOS versions using brew:

brew install genometools

Building from source

To use GenomeTools on systems that do not have native packages, or to modify GenomeTools at build time, you need to build from source. Source tarballs are available from GitHub. For instructions on how to build the source by yourself, have a look at the INSTALL file. In most cases (e.g. on a 64-bit Linux system) something like

make -j4

should suffice. On 32-bit systems, add the 32bit=yes option. Add cairo=no if you do not have the Cairo libraries and their development headers installed. This will, however, remove AnnotationSketch support from the resulting binary. When your binary has been built, use the install target and prefix option to install the compiled binary on your system. Make sure you repeat all the options from the original make run. So

make -j4 install prefix=~/gt

would install the software in the gt subdirectory in the current user's home directory. If no prefix option is given, the software will be installed system-wide (requires root access).

Contributing

GenomeTools uses a collective code construction contract for contributions (and the process explains how to submit a patch). Basically, just fork this repository on GitHub, start hacking on your own feature branch and submit a pull request when you are ready. Our recommended coding style is explained in the developer's guide (among other technical guidelines).

To report a bug, ask a question, or suggest new features, use the GenomeTools issue tracker.

genometools's Issues

GtSeqabstract len issues

The class GtSeqabstract requires a parameter len, but this parameter is not used in any way.

The class therefore does not work as one would intuitively expect.

Some serious refactoring is needed, and unit tests should be added to show how the class is intended to be used.

Honor sequence-region pragmas when specified

The GFF3 spec discusses the utility of ##sequence-region directives for bounds checking. Many GFF3 files do not provide these directives, but the GtGFF3InStream class conveniently infers these boundaries from the annotations corresponding to each sequence.

However, it seems that when ##sequence-region pragmas are provided, they are ignored or overridden by the GtGFF3InStream class. For example, I have a GFF3 file with the entry ##sequence-region chr8 1 100000 to indicate that the corresponding sequence is 100kb long. However, the right-most annotation in the file spans 88551-92176. When I parse these data with a GtGFF3InStream object, I get [1-92176] as the range, rather than [1-100000].

Being able to infer ranges from the annotations is no doubt a very useful feature, but when ##sequence-region directives are explicitly provided, shouldn't these be honored and used?
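The behaviour requested above could be sketched roughly as follows. This is only an illustration in plain C: DemoRange and demo_effective_range are hypothetical names and are not part of the GenomeTools API. The idea is to prefer the declared range when one exists and to flag annotations that fall outside it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 1-based, inclusive range; not the GenomeTools range type. */
typedef struct {
  unsigned long start, end;
} DemoRange;

/* Return the range to report for a sequence: the declared ##sequence-region
   range if one was given, otherwise the range inferred from the annotations.
   *out_of_bounds is set if annotations extend beyond a declared range. */
DemoRange demo_effective_range(bool has_declared, DemoRange declared,
                               DemoRange inferred, bool *out_of_bounds)
{
  assert(out_of_bounds != NULL);
  *out_of_bounds = false;
  if (!has_declared)
    return inferred;
  if (inferred.start < declared.start || inferred.end > declared.end)
    *out_of_bounds = true;
  return declared;
}
```

With a declared range of 1 to 100000 and annotations ending at 92176, this sketch would report the declared range and no bounds violation.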

Allow removal/deletion of feature nodes from the annotation graph

The functionality of removing a node from an annotation graph is still missing. This is important to gain full graph manipulation support, as it is the only operation missing IIRC.

We need to agree on an API, with the following questions in mind:

Do we only support removing leaves, or also internal nodes (and their children, if not otherwise connected to the CC)?
How do we modify refcounts?
How do we update the tree/non-tree property?

Travis runtime enhancement

I have to take a closer look at Travis for this, but judging from the run times of our many Travis tests, clang-compiled code performs better.
It should be possible to compile with gcc and run only gt -test, and to run all the other tests only with clang. This would reduce the total runtime, and I doubt that we would find bugs that are compiler-dependent.

Add *_try_cast functions to API

The GtNodeVisitor interface provides a flexible solution for processing genome nodes. The various *_try_cast functions are not nearly as elegant, but can be quite useful. Unfortunately, they are not yet part of the public API. I suggest exposing these functions to 3rd party users.
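For readers unfamiliar with the pattern, a try_cast function can be sketched as below. All types and names here (DemoNode, demo_feature_node_try_cast, and so on) are hypothetical stand-ins and do not reproduce the GenomeTools implementation. The sketch assumes that every node stores a pointer to its class descriptor as its first member.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class descriptors; every node points to its class. */
typedef struct { const char *name; } DemoNodeClass;
typedef struct { const DemoNodeClass *c_class; } DemoNode;

/* "Subclasses" embed the base node as their first member. */
typedef struct { DemoNode parent; const char *type; } DemoFeatureNode;
typedef struct { DemoNode parent; const char *text; } DemoCommentNode;

const DemoNodeClass demo_feature_class = { "feature" };
const DemoNodeClass demo_comment_class = { "comment" };

/* Return the node as a feature node, or NULL if it is another node type. */
DemoFeatureNode *demo_feature_node_try_cast(DemoNode *node)
{
  assert(node != NULL);
  if (node->c_class == &demo_feature_class)
    return (DemoFeatureNode*) node;
  return NULL;
}
```

Callers test the result against NULL instead of implementing a full visitor, which is less elegant but often shorter for one-off checks.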

AnnotationSketch weirdness

I encountered some strange behavior with the sketch API recently, and I've isolated a small use case to reproduce it. See http://gremlin2.soic.indiana.edu/tmp/sketch-issue/.

When I run with the default files (first command in README), a separate track is created for the 5' UTR but also seems to duplicate the 3' UTR.

When I remove the ID from the 5' UTR (second command in README), everything renders correctly.

Bug with the AnnotationSketch logic?

Issues with GtFeatureOutStream class

I just noted two concerns with the GtFeatureOutStream class.

The seqstr object (GtStr *) is not being freed properly.
The regioncache object (GtArray) is being used as a queue, whereas the GtQueue is of course better suited for this. (The featurecache object is also being used as a queue, but since it is being created by another function, it doesn't make sense to copy all of the data to a new queue--reversing the array and popping entries is an acceptable alternative in this case).

gt_xfread vs gt_xfwrite

There is a discrepancy between gt_xfread and gt_xfwrite: read returns the number of bytes read, but write does not return the number of bytes written. Both already check for the right number of bytes and exit with an error if there is a difference, so there is no need for gt_xfread to return the number of bytes read.

Would it be ok to change this?

~0 is a bad choice to represent an undefined INT.

GT_UNDEF_INT is defined as ~0, which on systems using 2s complement (is there any current system where it is not?) means -1.

I think a better choice would be to define it as INT_MIN or INT_MAX: -1 is a value which does often mean something. Indeed I noticed this for example because I created in my development branch an integer option which has -1 as default value and the parser shows in this case "default: undefined" (which per se can be fixed by showing the default manually, but this is not the point).
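The clash can be shown in a few lines. The macro and function names below are made up for illustration and only mirror the idea of GT_UNDEF_INT. The sketch assumes a two's complement system.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_UNDEF_INT_OLD (~0)      /* equals -1 on two's complement */
#define DEMO_UNDEF_INT_NEW INT_MIN   /* rarely a meaningful option value */

/* An option default counts as "undefined" only if it equals the sentinel. */
bool demo_default_is_undefined(int default_value, int sentinel)
{
  return default_value == sentinel;
}
```

With the old sentinel, a legitimate default of -1 is indistinguishable from "undefined". With INT_MIN as the sentinel it is not.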

Name of the printf macros GT_LU etc

I think a good name for the macros such as GT_LU should reflect the fact that they are for GtUword and GtWord. Instead LU only refers to the "lu", which is not always true, and this is the reason why the macro exists.

Therefore, I propose renaming the macros to GT_WU, GT_WUS, GT_WD, GT_WDS.
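As an illustration of word-sized format macros, here is a minimal sketch. It assumes, purely for the example, that the word types are aliases of uintptr_t and intptr_t. The DEMO_ names are hypothetical and the real GT_ macros may be defined differently.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical word-sized types standing in for GtUword and GtWord. */
typedef uintptr_t DemoUword;
typedef intptr_t DemoWord;

/* Format macros named after the types rather than after "lu"/"ld". */
#define DEMO_WU "%" PRIuPTR
#define DEMO_WD "%" PRIdPTR

/* Write both values into buf; returns what snprintf returns. */
int demo_format_words(char *buf, size_t size, DemoUword u, DemoWord d)
{
  assert(buf != NULL);
  return snprintf(buf, size, DEMO_WU " " DEMO_WD, u, d);
}
```

Because the macro expands to whatever conversion matches the underlying type, the same call compiles without format warnings whether the word is an unsigned long or an unsigned long long.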

join more tools in toolboxes

The list of tools is quite long.

Some tools, like csa and cds, do similar things (in this case for GFF3). These could be put into one toolbox to shorten the list of tools.

Documentation for deprecated functions and renaming API-functions

As I mentioned in #68, it would be nice to have a way to mark functions as deprecated in the documentation. One could mention it in the comment, but how about changing the text colour or something similar?

And the second idea: if the API change is a simple renaming, it is possible to use a typedef to introduce the new function name, but it will not appear in the documentation. How about adding functionality to recognize such things and document them as if they were functions?
I think it's complicated, but it might be interesting if more API changes are proposed in the future.

hmm class unit tests

using

make opt=no curses=no cairo=no test -j6 testthreads=6

hmm class...error
first error: gt_ensure(gt_double_equals_double(gt_hmm_rmsd(fair_hmm, fair_hmm), 0.0)) failed: function gt_hmm_unit_test, file src/extended/hmm.c, line 644.
This is probably a bug, please report.

this fails in the current master.

Range macros for new types.

type_api.h should also define the macros GT_WORD_MIN, GT_WORD_MAX, GT_UWORD_MAX to support the GtUword and GtWord types.
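A minimal sketch of what such range macros might look like, assuming (for this example only) that the word types are aliases of intptr_t and uintptr_t. The DEMO_ names are hypothetical, and the saturating add is just one example of code that needs the maximum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical word-sized types standing in for GtUword and GtWord. */
typedef uintptr_t DemoUword;
typedef intptr_t DemoWord;

#define DEMO_WORD_MIN  INTPTR_MIN
#define DEMO_WORD_MAX  INTPTR_MAX
#define DEMO_UWORD_MAX UINTPTR_MAX

/* Add two unsigned words, clamping at the maximum instead of wrapping. */
DemoUword demo_uword_add_saturating(DemoUword a, DemoUword b)
{
  return (a > DEMO_UWORD_MAX - b) ? DEMO_UWORD_MAX : a + b;
}
```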

Implement complete relationship resolution.

In the current implementation, part_of relationships are not inherited by child terms in an is_a relationship from their parent term. Example: match_part is currently not recognized by the GtTypeChecker as being part_of cDNA_match, but match_part is generally part_of a match, and cDNA_match is_a match. This breaks relationship checking in such cases. As other SO-aware projects must have solved this problem somehow, I will read http://www.gmod.org/wiki/Chado_CV_Module#Transitive_Closure in detail and try to find out how to do this accurately.
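The inheritance rule described above can be sketched with two small adjacency matrices. This is an illustration only: DemoOntology and demo_resolve_relationships are hypothetical names, the three terms are hard-coded, and the rule implemented is exactly the one from the example (if A is part_of B and C is_a B, then A is part_of C). It is not a statement about how Chado or the GtTypeChecker handle this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MATCH, CDNA_MATCH, MATCH_PART, NUM_TERMS };

/* is_a[c][p]: term c is_a term p; part_of[a][b]: term a is part_of term b. */
typedef struct {
  bool is_a[NUM_TERMS][NUM_TERMS];
  bool part_of[NUM_TERMS][NUM_TERMS];
} DemoOntology;

/* Make is_a transitive (Warshall's algorithm), then let every term inherit
   the part_of relationships that point at any of its is_a ancestors. */
void demo_resolve_relationships(DemoOntology *o)
{
  int a, b, c, k;
  assert(o != NULL);
  for (k = 0; k < NUM_TERMS; k++)
    for (a = 0; a < NUM_TERMS; a++)
      for (b = 0; b < NUM_TERMS; b++)
        if (o->is_a[a][k] && o->is_a[k][b])
          o->is_a[a][b] = true;
  for (a = 0; a < NUM_TERMS; a++)
    for (b = 0; b < NUM_TERMS; b++)
      for (c = 0; c < NUM_TERMS; c++)
        if (o->part_of[a][b] && o->is_a[c][b])
          o->part_of[a][c] = true;
}
```

Starting from "cDNA_match is_a match" and "match_part part_of match", the second loop adds "match_part part_of cDNA_match", which is the relationship missing in the example.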

Node stream for resetting GtFeatureNode source

Features created in memory by default have no source. When working with these, perhaps in combination with feature nodes from other sources, a node stream that enables resetting GtFeatureNode source values would be very helpful.

unsigned long on Windows

The implicit assumption we make in GenomeTools so far is that unsigned long has the word size of the machine (4 byte on 32-bit systems and 8 byte on 64-bit systems).
Unfortunately, on Windows unsigned long is always 4 bytes wide which leads to problems in the code.

How do we want to deal with this problem? We could reuse GtUlong (or introduce GtUword) in such a way that the unsigned long behaviour on Linux is mirrored on Windows for GtUlong, and use GtUlong where appropriate.

Or we switch to types which always have a defined size, regardless of the machine.
No matter what, we have to make sure that we don't break existing APIs.

Any comments?

bit pack string module unit test fails with seed 509083971

This was noticed when Travis failed on only one build but not the others.
This also fails with older versions of gt.

Assertion failed: (numBitsList[0] <= sizeof (val[0])*CHAR_BIT), function gt_bsStoreNonUniformUInt8Array, file src/core/bitpackstringop8.c, line 513.

make spgt fails

After the change to Gt(Uw|W)ord, make spgt outputs hundreds of warnings.

Setting ID attribute for feature nodes

The API documentation states that gt_feature_node_set_attribute and gt_feature_node_add_attribute should not be used for ID and Parent attributes. This makes sense to some extent, but what about for features created in memory using the API? For example, I have code that creates several coding sequences in memory and then outputs them. I try to set IDs via the API--I get no error/warning messages, but the ID attribute remains unchanged and I still get warning messages like

warning: feature ID "" not unique: changing to .1
warning: feature ID "" not unique: changing to .2

Comments or suggestions?

Add parser for Dbxref abbreviations and validate their format in GFF3

The GFF3 spec states that Dbxref and Ontology_term attributes should have the form DBTAG:ID where the list of valid DBTAGs comes from a separate abbreviation file defined here: ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec

We should be able to parse such a file and use its content to validate the values of Dbxref and Ontology_term reserved attributes in GFF3 files.

gt_malloc _realloc etc and the "possible NULL pointer" problem

Hi,

as I see it our memory allocation functions will never return NULL, right?

Is there a way to make compilers and static checkers aware of that? There are many gt_asserts checking this after calling a function that uses gt_malloc. As I see it, it makes sense to assert for not NULL in a function that gets a pointer as a parameter, but not after calling gt_malloc.

And: compiling with assert=no results in static checkers complaining about possible NULL-pointer dereference.
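One possible direction, shown only as a sketch: GCC and Clang provide a returns_nonnull function attribute that tells the compiler a function never returns NULL. Whether a given static checker honours it is something that would have to be tested. demo_xmalloc and DEMO_RETURNS_NONNULL are hypothetical names, not the GenomeTools allocator.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Expand to the attribute on GCC-compatible compilers, to nothing elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_RETURNS_NONNULL __attribute__((returns_nonnull))
#else
#define DEMO_RETURNS_NONNULL
#endif

DEMO_RETURNS_NONNULL void *demo_xmalloc(size_t size);

/* Allocate memory or terminate the program; never returns NULL. */
void *demo_xmalloc(size_t size)
{
  void *p = malloc(size ? size : 1);
  if (p == NULL) {
    fprintf(stderr, "cannot allocate %lu bytes\n", (unsigned long) size);
    exit(EXIT_FAILURE);
  }
  return p;
}
```

With the attribute on the declaration, NULL checks directly after the call become redundant from the compiler's point of view, so the asserts could be dropped at the call sites.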

SVG suggestion, if possible.

I don't know how complicated this would be, but it would be great if there was a way to add SVG mouseover or click actions to AnnotationSketch. I've been using it very successfully now for a while to generate hard-coded images that I can send to coworkers without requiring them to access a database or paying an expensive commercial license for VectorNTI when all we need is a very simple GenBank/GFF viewer (i.e. not Artemis). It would be even more useful to be able to layer additional information (i.e. nucleotide positions, or even whole sequences) into the image without cluttering up labels.

It looks like you use Cairo to render, and the resulting SVGs are not a bit human readable, so I have no idea of the complexity here, but I also have no familiarity with Cairo. But mouseover and click events are built into the SVG spec, so maybe it's not impossible? Just a thought.

And thanks for the great set of tools!

Include API - extern variables in documentation

I am currently working on an extension to the automatic documentation for the API to include extern variables.
If these are part of a module it works fine, but if defined within a class file there need to be changes for them to be recognized.
I will add 2 modules that will benefit from this later.

Right now there is only the gt_jobs variable in core/thread_api.h and to include this into the API (which in itself makes sense) there are 3 possible changes:

Rename gt_jobs so that it is recognized as part of one of the classes in thread_api.h. It has to be moved to a line after the class definition for that to work.
Move gt_jobs to multithread_api, which is a module. This could be complicated, because the definition has to be moved too.
By far the easiest way: add a comment line before the extern variable with a name and the codeword module in it. This will add a module with only that variable to the documentation. I propose to name the module Thread, like the file it resides in.

src_check is too aggressive

The word "long" is not that uncommon in strings for some printout, but it triggers the warning in src_check.

Is there a way around that which does not need a C parser?

store encseq to disk from EncseqBuilder

There is no way to generate an encseq in memory and then store it directly to disk.
It would be nice to create an encseq with EncseqBuilder and then store it. Right now, I have to write the sequences into a FASTA file and then convert the FASTA file to an encseq.

get random sequences with encseq decode

add an option to encseq decode that takes a number as a parameter and extracts this number of random sequences from the encseq.

no duplicates

maybe extend this, so that parameter can be total length of extracted sequences.

gt_ensure uses err

gt_ensure is a macro in the form of a function; it uses GtError err without err being one of the parameters of that macro.
This should be changed to:

#define gt_ensure(had_err, err, exp)
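A macro with explicit parameters could look roughly like this. The sketch is self-contained and does not use GtError: here err is a plain character buffer, and demo_ensure, DEMO_ERR_SIZE and demo_unit_test are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_ERR_SIZE 256

/* Explicit-parameter variant: records the first failed expression in err,
   sets had_err, and skips all later checks once an error is set. */
#define demo_ensure(had_err, err, expr)                                   \
  do {                                                                    \
    if (!(had_err) && !(expr)) {                                          \
      snprintf((err), DEMO_ERR_SIZE, "demo_ensure(%s) failed: file %s, "  \
               "line %d", #expr, __FILE__, __LINE__);                     \
      (had_err) = -1;                                                     \
    }                                                                     \
  } while (0)

/* Example unit test; err must point to at least DEMO_ERR_SIZE bytes. */
int demo_unit_test(char *err)
{
  int had_err = 0;
  assert(err != NULL);
  demo_ensure(had_err, err, 1 + 1 == 2);
  demo_ensure(had_err, err, 2 * 2 == 5);   /* fails and sets had_err */
  demo_ensure(had_err, err, 0);            /* skipped: error already set */
  return had_err;
}
```

Passing had_err and err explicitly makes the dependencies visible at the call site, at the cost of slightly longer test code.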

Outstream using feature index

The GtFeatureStream class is convenient when the data to be processed will ultimately reside in a GtFeatureIndex object--analogous to the GtArrayInStream class for GtArray objects.

However, no analog of the GtArrayOutStream class exists for feature index. This would also be convenient to facilitate use of a feature index as the starting point for a node stream.

type checker part_of reflexivity

The type checker currently reports for any feature type that it is part_of itself, which is obviously not always correct but rather depends on the contents of the underlying ontology. The reason for that behaviour is that in gt_type_node_has_parent in that case the check for parent and child ID equality (which I guess was intended for recursion termination) is always triggered right away, without looking at the ontology.

To fix this one would have to base every decision solely on what is in the adjacency matrices for the ontology DAG, using recursive search in the nodes only to extend the matrices. @gordon do you agree?

wrong line number in encseq encode error message

When using encseq encode with -protein argument and the input below, an error is reported in line 3 although the lower case character causing the error is located in line 4.
Why does a lower case character cause an error at all? If the -protein argument is not given, there is no error.

Edit: there are other examples where the error occurs although the -protein argument is not given.

Input fasta file:

>YAL001C
M
>YAL002W
Ms

check format of GFF3 Gap attribute

The GFF3 format specifies a Gap attribute used for describing alignments in detail using a CIGAR-like string. It would be nice if the GenomeTools were able to check this string for validity, both syntactically as well as length-wise (i.e. comparing the number of matches and indels vs. the length of the feature).
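A rough sketch of such a check follows. It assumes that M and D operations consume reference positions while I does not, which matches the example "M8 D3 M6 I1 M6" for a feature spanning 23 positions in the GFF3 specification. The F and R frameshift operations and protein alignments are deliberately left out and are rejected by this sketch. demo_gap_reference_length is a hypothetical helper, not a GenomeTools function.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Return the number of reference positions covered by a Gap string such as
   "M8 D3 M6 I1 M6", or -1 on a syntax error. Only M, I and D are handled. */
long demo_gap_reference_length(const char *gap)
{
  long total = 0;
  assert(gap != NULL);
  if (*gap == '\0')
    return -1;
  while (*gap != '\0') {
    char op = *gap++;
    char *end;
    long len;
    if (op != 'M' && op != 'I' && op != 'D')
      return -1;
    if (!isdigit((unsigned char) *gap))
      return -1;
    len = strtol(gap, &end, 10);
    if (len <= 0)
      return -1;
    if (op == 'M' || op == 'D')   /* both consume reference positions */
      total += len;
    gap = end;
    if (*gap == ' ') {
      gap++;
      if (*gap == '\0')
        return -1;                /* trailing separator */
    }
    else if (*gap != '\0')
      return -1;                  /* operations must be space-separated */
  }
  return total;
}
```

A validator could compare the returned value with end - start + 1 of the feature and report a mismatch.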

Seed for Testsuite

Right now there is no way to find the seed with which a testsuite run was started.
So if travis fails just on one test and this is not reproducible, this might be due to some random value, which is very hard to debug without the seed.

Here are some ideas to fix this:

testsuite.rb produces a random number, outputs it, and uses it as the seed for all tests in that run.

To pass the seed value to all the thousands of tests, each calling gt multiple times, one either has to change all the calls to gt in the testsuite and add -seed=X, or we change the gt binary to check for an environment variable GT_SEED. If it is defined, gt will use it, so the testsuite simply has to set that variable and no test code has to be changed.
The gt option -seed takes precedence over GT_SEED for those cases where a specific test has to use exactly one seed value, and the testsuite only generates a random value if GT_SEED is not defined, so it is possible to reproduce errors within the testsuite that depend on random values.
[edit: typos]
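The proposed precedence (the -seed option first, then GT_SEED, then a random value) can be sketched as a small helper. demo_choose_seed is a hypothetical function. In real code the env_value argument would come from getenv("GT_SEED") and the fallback from a random source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Pick the seed: an explicit -seed option wins, then a parsable GT_SEED
   value, then the caller-supplied random fallback. */
unsigned long demo_choose_seed(bool has_option, unsigned long option_seed,
                               const char *env_value, unsigned long fallback)
{
  if (has_option)
    return option_seed;
  if (env_value != NULL && *env_value != '\0') {
    char *end;
    unsigned long seed;
    errno = 0;
    seed = strtoul(env_value, &end, 10);
    if (errno == 0 && *end == '\0')   /* whole string was a valid number */
      return seed;
  }
  return fallback;
}
```

Whichever value is chosen should also be printed by the test suite, so that a failing run can be repeated with the same seed.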

Documentation of the allowed types

Currently it is not written in the developer manual (at least I did not find it).

expand gt encseq info

this all assumes that files usually represent some logical subset of a sequence set in an encseq

alternatively this could be in a different subtool seqstats or similar

current master does not do much with make test

bin/gt -test gets executed but no test starts with the testsuite.

Renaming UNDEF macros

I think the GT_UNDEF_ULONG and GT_UNDEF_LONG macro should be renamed to GT_UNDEF_UWORD and GT_UNDEF_WORD.

Multi-features with different attributes

Hi everyone (e.g. @gordon ),

is there a specific reason why we are disallowing multi-features with different attributes? I have a few CDS's here containing an additional 'rank' attribute which takes different values. I had a look at the spec, and I could not find any mention of this restriction.

Sascha

Assertion messages ("This is a bug, report it") should include link to GitHub issue tracker.

These messages should also include a link to the issue tracker to make sure that bugs receive more attention and can be handled in a centralized way. Moreover, issues and their solutions become searchable in the web interface.

Usage of -usedesc/-matchdesc/MD5

AFAICS, when someone uses -seqfile, -seqfiles, and -encseq to specify a sequence to be used in a tool (using the GtRegionMapping interface), they must decide how to map sequence IDs (e.g. headers) to region IDs in an annotation. This is currently done by specifying -usedesc or -matchdesc options, or by tagging region IDs with MD5 hashes identifying the respective sequences, in which case the connection is made automagically. However, it is currently possible not to specify -usedesc or -matchdesc options as they are not mandatory. So in principle, it is possible to give a sequence file (or a GtEncseq, or a set of sequence files) with no hint about region mapping and no MD5 tagging. In this case, the first sequence in the set is (silently) taken as the source for all regions in the annotation. This can be very confusing and misleading if someone simply expects things to work (e.g. when using legacy LTRharvest output with 'seq0', 'seq1'... tags as input for the new LTRdigest which supports -seqfile, -seqfiles, and -encseq). All you eventually get is an error message if you try to access regions outside the sequence length.
Is there any reason why this behaviour was there in the first place? I am fairly certain I did not implement it on purpose like this, but there were tests that expected it IIRC. I'd propose the following as a suggestion to make things more straightforward:

Usually expect -usedesc or -matchdesc as options if someone uses -seqfile, -seqfiles, and -encseq
If -regionmapping is used, use the entries in the mapping file to refer to the first sequence in the respective sequence files
If none of these is given, try to use MD5 mapping or fail when no MD5 is present!

In this case, users would not get unexpected results. Any comments or ideas?

make manuals fails

On my Ubuntu Linux 32-bit machine make manuals fails with:

! LaTeX Error: File `bbm.sty' not found.

and the file bbm.sty is not part of GenomeTools.
Is this a bug or do I have to install another Latex package?

Rename option_new for (u)long

The methods:

gt_option_new_long
gt_option_new_ulong
gt_option_new_ulong_min
gt_option_new_ulong_min_max

should be renamed to

gt_option_new_word
gt_option_new_uword
gt_option_new_uword_min
gt_option_new_uword_min_max

in order to reflect the type of variable where the value is stored after the change to GtUword/GtWord.

test 'gt repfind small' fails for some seeds

seeds where it fails to run:

471897537
71207745

The error message is the same in both cases, but it occurs at different stages of program progress:

Assertion failed: (queryrep != NULL && pos < queryrep->length), function gt_mmsearch_accessquery, file src/match/esa-mmsearch.c, line 55.

travis build time

The Travis build time depends mainly on our Ruby testsuite.
I will split the tests up, so every build will be done multiple times with different tests.

Signature for 'gt_feature_index_has_seqid'

It seems a bit strange that the function always returns a 0 and accepts a pointer to a boolean. Is there a particular reason for this function signature instead of just returning the boolean?

make compact ulong store dynamic

Right now, compact ulong store is a class to store fixed-width values of arbitrary width in an array-like fashion.
Its only drawback, in my opinion, is its rigidity: the number of elements to be stored has to be known in advance.

I think it would be fairly easy to change this to be dynamic, resizing the array when setting values outside of the bounds.

What do you think?
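The resizing idea can be sketched on a plain array. Note the simplification: this example stores full unsigned long values and does no bit packing, so it only illustrates the grow-on-set logic, not the compact ulong store itself. DemoStore and demo_store_set are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified store: plain unsigned long values, no bit packing. */
typedef struct {
  unsigned long *values;
  size_t allocated;
} DemoStore;

/* Set values[idx], growing (at least doubling) when idx is out of bounds.
   New elements are zeroed. Returns 0 on success, -1 if allocation fails. */
int demo_store_set(DemoStore *s, size_t idx, unsigned long value)
{
  assert(s != NULL);
  if (idx >= s->allocated) {
    size_t new_size = s->allocated ? s->allocated * 2 : 16;
    unsigned long *tmp;
    if (new_size <= idx)
      new_size = idx + 1;
    tmp = realloc(s->values, new_size * sizeof *tmp);
    if (tmp == NULL)
      return -1;
    memset(tmp + s->allocated, 0, (new_size - s->allocated) * sizeof *tmp);
    s->values = tmp;
    s->allocated = new_size;
  }
  s->values[idx] = value;
  return 0;
}
```

Doubling keeps the amortized cost of appending constant. A bit-packed version would additionally have to recompute the backing size from the element width.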

GFF3 validator should not stop on first error

Currently, the GFF3 parser stops on and outputs only the first error encountered in the input. For automated use, e.g. in a GFF/Git-based annotation tracking system, it might be convenient to have it output all errors instead of only the first one. This functionality could, for example, be optional in the parser or input stream, so it may be activated in the validator tool but not in the other GtGFF3InStream-using tools.
I see that this is difficult to do because we try to ensure that everything that comes out of an input stream is guaranteed to be valid, and some errors simply cannot lead to valid annotation graphs. However, one might restrict this more relaxed error output to typically non-fatal errors in GFF3 input.

Make class alloc lock module API public

The GtAllocLock module is used in node stream and node visitor implementations for thread safety while casting (if I understand correctly). Both the node stream interface and the node visitor interface are described in the public API, but the GtAllocLock module is not accessible in the public API, which makes it unavailable for use by third party implementations leveraging these interfaces. I suggest we make this module available via the public API.

GenomeTools Windows Port

I'm currently porting GenomeTools to Windows (MinGW).
Some notes can be found here:
https://github.com/genometools/genometools/wiki/GenomeTools-Windows-Port
I'm pushing in this branch:
https://github.com/gordon/genometools/tree/windows
It's not ready for a pull request, though. Comments welcome.

clang 3.4 does not compile tre library

src/external/tre/lib/tre-compile.c does not compile with clang 3.4, because it complains about

errcode = errcode

They use goto in that part of the code. I felt bad just reading that.

Didn't we have some problems with warnings in external code before? Is it possible to turn them off for some of the external stuff? (Preferably all of it, because we are not the maintainers of the external files.)