Comments (10)
I think this is best done in a separate tool, except for the human readable output, which should be available in encseq info as well.
from genometools.
Why do you want this in a separate tool? It is "just" extending the information.
I am not sure about the N50, because I might have seen it somewhere in the output of another tool. But don't ask me which.
Concerning the human readable output: we'll leave the original numbers in there, too, so parsing the output can still give the absolute numbers.
Another question: KB or KiB?
I think it should be done in a separate tool (or at least by setting an additional option in encseq info) because per-file output can get rather large when many sequence files are in the encseq, and the purpose of the encseq info tool is primarily to get a quick overview of the encseq's properties. These may drown in the long output when including output on a per-file basis.
I do not think there is a proper calculation tool for N50 values. There might be one in readjoiner (maybe ask @ggonnella), but nothing generic for any encseq IIRC. Grepping for 'N50' shows some readjoiner-related Ruby script, but that seems to be it.
Regarding the human readable output, I propose to introduce an -h option (similar to what GNU df or du has) which outputs '400K', '123M', '3G', '23T' etc. instead of the absolute numbers. I think this follows the principle of least surprise for the user. If they want to parse the numbers from the output, they just do not specify -h. What do you think?
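To make the -h proposal and the KB/KiB question concrete, here is a minimal Python sketch of such a formatter. The function name human_readable and its base parameter are made up for illustration and are not part of gt. With base=1024 the suffixes mean KiB-style units, with base=1000 they mean KB-style units. The rounding is a simple choice and does not claim to match GNU df or du exactly.

```python
# Hypothetical helper, not a gt function: format a count like "df -h" would.
SUFFIXES = ["K", "M", "G", "T", "P"]


def human_readable(n: int, base: int = 1024) -> str:
    """Return n with a unit suffix; base=1024 (KiB-style) or 1000 (KB-style)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < base:
        # Small values are printed as absolute numbers without a suffix.
        return str(n)
    value = float(n)
    for suffix in SUFFIXES:
        value /= base
        # One decimal below 10, whole numbers otherwise.
        rounded = round(value, 1) if value < 10 else float(round(value))
        # Stop once the rounded value fits this unit (or no larger unit is left).
        if rounded < base or suffix == SUFFIXES[-1]:
            break
    if rounded < 10:
        return f"{rounded:.1f}{suffix}"
    return f"{rounded:.0f}{suffix}"


if __name__ == "__main__":
    for size in (999, 1536, 400 * 1024, 123 * 1024**2, 23 * 1024**4):
        print(size, human_readable(size), human_readable(size, base=1000))
```

Whichever base is chosen, keeping the absolute numbers as the default output means scripts that parse the output are not affected.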
There are lots of redundancies in gt: there is already something to get assembly statistics in gt seqstat, which calculates N50 and N80. But then again, it uses FASTA input.
If I have an encseq and I want those statistics, what would be the workflow "of least surprise" (or whatever it is called) for a user?
If I have something to do on an encseq, I would first search in the encseq toolbox.
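For reference, the statistic itself only needs the sequence lengths, so it does not depend on whether they come from FASTA or an encseq. Below is a minimal Python sketch using the common definition (the length L such that sequences of length at least L cover the given percentage of all bases). The function name nxx is invented here, and this is not taken from the gt seqstat source.

```python
from typing import Iterable


def nxx(lengths: Iterable[int], percent: int = 50) -> int:
    """Return the Nxx value (N50 for percent=50) of the given sequence lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    if not ordered:
        raise ValueError("need at least one sequence length")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating point issues at the threshold.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]  # only reached if all lengths are zero


if __name__ == "__main__":
    example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", nxx(example, 50))
    print("N80:", nxx(example, 80))
```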
Well, in other tools (mostly those connecting annotations and sequences) there are options -seqfile, -seqfiles, -encseq etc. which allow the user to specify the sequence source. It is possible to access both GtEncseq and GtBioseq sequences (and their lengths) using the GtSeqCol interface. Depending on which tools you have been using before, it may be more or less surprising to have to use the 'other' way. I just think that either way, things that are not intimately tied to a specific representation (e.g. encseq encoding/decoding etc.) should be done generically. I admit that this has not been done consistently in the past, but we should keep an open mind about not continuing down that path in the future.
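The idea of doing things generically can be illustrated with a small Python sketch: a statistic written against a minimal interface works for any backing representation. All names below (SequenceCollection, FastaCollection, InMemoryCollection, total_length) are hypothetical and only mimic the idea. They are not the GtSeqCol API.

```python
from typing import Iterable, List, Protocol


class SequenceCollection(Protocol):
    """Hypothetical minimal interface: only what length statistics need."""

    def num_sequences(self) -> int: ...
    def sequence_length(self, index: int) -> int: ...


class FastaCollection:
    """Lengths taken from FASTA-formatted text (stand-in for a file-based source)."""

    def __init__(self, text: str) -> None:
        self._lengths: List[int] = []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                self._lengths.append(0)  # header line starts a new sequence
            elif self._lengths:
                self._lengths[-1] += len(line)

    def num_sequences(self) -> int:
        return len(self._lengths)

    def sequence_length(self, index: int) -> int:
        return self._lengths[index]


class InMemoryCollection:
    """Lengths taken from sequences already held in memory (stand-in for an index)."""

    def __init__(self, sequences: Iterable[str]) -> None:
        self._sequences = list(sequences)

    def num_sequences(self) -> int:
        return len(self._sequences)

    def sequence_length(self, index: int) -> int:
        return len(self._sequences[index])


def total_length(col: SequenceCollection) -> int:
    """Works for any source that implements the two methods above."""
    return sum(col.sequence_length(i) for i in range(col.num_sequences()))


if __name__ == "__main__":
    print(total_length(FastaCollection(">a\nACGT\nAC\n>b\nGG\n")))
    print(total_length(InMemoryCollection(["ACGT", "GG"])))
```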
I do not like the options that specify the type of input. Isn't it rather simple to have some file types that can be handled by us and recognized by gt?
A generic toolbox that could take anything as input, with tools that only need simple serial access to the data, could use a generic container that is either an encseq, the old indices (built with the seq tool? whatever that is) or sequences.
Then all tools that need random access would either load an existing encseq or, if given some sequence files, build it and load it.
The interfaces for that should be the same for all tools. I know that changing this would break a lot of old stuff.
I think a generic toolbox for sequences, besides the toolbox for encseq, would be a great idea! Then stuff like shredder, convertseq, extractseq, sequniq, seqfilter, seqstat, seqtransform, seqtranslate, seqorder, splitfasta, dev readreads, dev seqlensort would also find a very natural container, and all of them could gradually be implemented in a generalized form which accepts encseq as input.
Can this be closed?
yes yes... I have to show @joergi-w the way to autoclose an issue with a pull request :-)