this all assumes that files usually represent some logical subset of a sequence set in

expand gt encseq info (genometools, closed)

genometools commented on May 30, 2024

expand gt encseq info

from genometools.

Comments (10)

satta commented on May 30, 2024

I think this is best done in a separate tool, except for the human readable output, which should be available in encseq info as well.

Garonenur commented on May 30, 2024

Why do you want this in a separate tool? It is "just" extending the information.

I am not sure about the N50, because I might have seen that somewhere in the output of another tool. But don't ask me which.

Concerning the human-readable output: we'll leave the original numbers in there, too, so parsing the output can still give the absolute numbers.

Garonenur commented on May 30, 2024

Another question: KB or KiB?

satta commented on May 30, 2024

I think it should be done in a separate tool (or at least by setting an additional option in encseq info) because per-file output can get rather large when many sequence files are in the encseq, and the purpose of the encseq info tool is primarily to get a quick overview of the encseq's properties. These may drown in the long output when including output on a per-file basis.

I do not think there is a proper calculation tool for N50 values. There might be one in readjoiner (maybe ask @ggonnella) but nothing generic for any encseq IIRC. Grepping for 'N50' shows some readjoiner-related Ruby script, but that seems to be it.

Regarding the human-readable output, I propose to introduce an -h option (similar to what GNU df or du has) which outputs '400K', '123M', '3G', '23T' etc. instead of the absolute numbers. I think this follows the principle of least surprise for the user. If they want to parse the numbers from the output, they just do not specify -h. What do you think?
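
To make the -h proposal concrete, here is a minimal sketch in Python of the kind of formatting meant. It is not genometools code: the function name `human_readable`, the `base` parameter, and the choice to truncate rather than round are all assumptions made for illustration. The `base` parameter is also where the KB-or-KiB question would be decided (1000 versus 1024).

```python
# Hypothetical helper, not part of genometools.
SUFFIXES = ["", "K", "M", "G", "T", "P"]


def human_readable(n: int, base: int = 1024) -> str:
    """Format a non-negative count in df/du-like style, e.g. '400K' or '3G'.

    Assumption: values are truncated (floored) to whole units, which keeps
    the output short but loses precision compared to the absolute number.
    """
    if n < 0:
        raise ValueError("count must be non-negative")
    if base not in (1000, 1024):
        raise ValueError("base must be 1000 (KB) or 1024 (KiB)")
    scaled = n
    idx = 0
    # Divide by the base until the value fits below one unit of the next suffix.
    while scaled >= base and idx < len(SUFFIXES) - 1:
        scaled //= base
        idx += 1
    return f"{scaled}{SUFFIXES[idx]}"


if __name__ == "__main__":
    for value in (999, 400 * 1024, 123 * 1024**2, 3 * 1024**3, 23 * 1024**4):
        print(value, human_readable(value))
```

Without the option, a tool would keep printing the absolute numbers, so existing parsers of the output would be unaffected.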

Garonenur commented on May 30, 2024

There are lots of redundancies in gt; there is already something to get assembly statistics in gt seqstat, which calculates N50 and N80. But then again, it uses FASTA input.
If I have an encseq and I want those statistics, what would be the workflow "of least surprise" (or whatever it is called) for a user?
If I have something to do on an encseq, I would first search in the encseq toolbox.
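
For reference, N50 and N80 only need the list of sequence lengths, which is why the statistic does not depend on whether the input is FASTA or an encseq. The following Python sketch uses the common definition (the largest length L such that sequences of length at least L cover at least x% of the total); whether gt seqstat uses exactly this definition is an assumption, and the function name `nxx` is made up for illustration.

```python
from typing import Iterable


def nxx(lengths: Iterable[int], percent: int) -> int:
    """Return the Nxx value (e.g. percent=50 for N50) of a set of lengths.

    Uses integer arithmetic only, so there are no floating-point edge cases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the cumulative sum reaches the
        # requested share of the total.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]  # not reached for valid input


if __name__ == "__main__":
    contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50:", nxx(contig_lengths, 50))
    print("N80:", nxx(contig_lengths, 80))
```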

satta commented on May 30, 2024

Well, in other tools (mostly those connecting annotations and sequences) there are options -seqfile, -seqfiles, -encseq etc. which allow the user to specify the sequence source. It is possible to access both GtEncseq and GtBioseq sequences (and their lengths) using the GtSeqCol interface. Depending on which tools you have been using before, it may be less or more surprising to have to use the 'other' way. I just think that either way, things that are not intimately tied to a specific representation (e.g. encseq encoding/decoding etc.) should be done generically. I admit that this has not been done consistently in the past, but we should keep an open mind on not continuing that path in the future.

Garonenur commented on May 30, 2024

I do not like the options that specify the type of input. Isn't it rather simple to have some file types that can be handled by us and recognized by gt?

A generic toolbox that could take anything as input, with tools that only need simple serial access to the data, could use a generic container that is either an encseq, the old indices (built with the seq tool? whatever that is) or plain sequences.

Then all tools that need random access would either load an existing encseq or, if given some sequence files, build one and load it.

The interfaces for that should be the same for all tools. I know that changing this would break a lot of old stuff.
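
A rough Python sketch of what "recognize the input type, then give serial access" could mean. Everything here is hypothetical: the function names are invented, the detection rules (a leading '>' for FASTA, a companion `.esq` file for an encseq index name) are assumptions, and reading an encseq is deliberately left unimplemented because that would need the real library.

```python
import os
from typing import Iterator, Tuple


def detect_format(path: str) -> str:
    """Guess the input type from the file itself instead of from an option."""
    # Assumption: an encseq index name has a companion '<name>.esq' file.
    if os.path.exists(path + ".esq"):
        return "encseq"
    with open(path, "rb") as handle:
        if handle.read(1) == b">":
            return "fasta"
    raise ValueError(f"unrecognized sequence input: {path}")


def iter_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from a FASTA file, one at a time."""
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def iter_sequences(path: str) -> Iterator[Tuple[str, str]]:
    """Single entry point for tools that only need serial access."""
    kind = detect_format(path)
    if kind == "fasta":
        return iter_fasta(path)
    # Placeholder: a real implementation would read the encseq here.
    raise NotImplementedError("encseq reading is not part of this sketch")
```

A tool that only needs lengths (for example, for N50) could then call `iter_sequences(path)` without caring what kind of input it was given.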

ggonnella commented on May 30, 2024

I think a generic toolbox for sequences, besides the toolbox for encseq, would be a great idea! Then stuff like shredder, convertseq, extractseq, sequniq, seqfilter, seqstat, seqtransform, seqtranslate, seqorder, splitfasta, dev readreads, dev seqlensort would also find a very natural container and could all gradually be implemented in a generalized form that also accepts encseq as input.

satta commented on May 30, 2024

Can this be closed?

Garonenur commented on May 30, 2024

yes yes... I have to show @joergi-w the way to autoclose an issue with a pull request :-)
