Comments (10)

satta commented on May 30, 2024

I think this is best done in a separate tool, except for the human-readable output, which should be available in encseq info as well.

from genometools.

Garonenur commented on May 30, 2024

Why do you want this in a separate tool? It is "just" extending the information.

I am not sure about the N50, because I might have seen it somewhere in the output of another tool. But don't ask me which one.

Concerning the human-readable output: we'll leave the original numbers in there, too, so parsing the output can still give the absolute numbers.

Garonenur commented on May 30, 2024

Another question: KB or KiB?

satta commented on May 30, 2024

I think it should be done in a separate tool (or at least by setting an additional option in encseq info) because per-file output can get rather large when many sequence files are in the encseq, and the purpose of the encseq info tool is primarily to get a quick overview of the encseq's properties. These may drown in the long output when including output on a per-file basis.

I do not think there is a proper calculation tool for N50 values. There might be one in readjoiner (maybe ask @ggonnella), but nothing generic for any encseq, IIRC. Grepping for 'N50' shows some readjoiner-related Ruby script, but that seems to be it.

Regarding the human-readable output, I propose to introduce an -h option (similar to what GNU df or du has) which outputs '400K', '123M', '3G', '23T' etc. instead of the absolute numbers. I think this follows the principle of least surprise for the user. If they want to parse the numbers from the output, they just do not specify -h. What do you think?
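
To make the proposal concrete, here is a minimal sketch of df-style formatting. The function name `human_readable` and its `base` parameter are assumptions made for illustration only and are not part of gt. GNU df -h uses powers of 1024, which is the default here; passing `base=1000` gives the SI variant, which is what the KB-versus-KiB question comes down to. Values are truncated to one decimal place, so the output is lossy by design.

```python
def human_readable(n: int, base: int = 1024) -> str:
    """Format a non-negative count with a K/M/G/T suffix (hypothetical helper)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    suffixes = ("", "K", "M", "G", "T")
    idx = 0
    unit = 1
    # Pick the largest unit that does not exceed n, capped at 'T'.
    while n >= unit * base and idx < len(suffixes) - 1:
        unit *= base
        idx += 1
    if idx == 0:
        return str(n)
    # Integer arithmetic: keep one decimal digit, truncated rather than rounded.
    tenths = n * 10 // unit
    whole, frac = divmod(tenths, 10)
    if frac == 0:
        return f"{whole}{suffixes[idx]}"
    return f"{whole}.{frac}{suffixes[idx]}"


if __name__ == "__main__":
    for value in (999, 1536, 400 * 1024, 3 * 1024**3):
        print(value, human_readable(value))
```

Without the option the absolute numbers stay untouched, so scripts that parse the output keep working.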

Garonenur commented on May 30, 2024

There are lots of redundancies in gt; there is already something to get assembly statistics in gt seqstat, which calculates N50 and N80. But then again, it uses FASTA input.
If I have an encseq and I want those statistics, what would be the workflow "of least surprise" (or whatever it is called) for a user?
If I have something to do on an encseq, I would first search in the encseq toolbox.
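
For reference, the statistic itself only needs the sequence lengths, not the sequences. The following is a minimal sketch using one common definition (the length of the sequence at which the running sum of lengths, sorted in descending order, first reaches the given percentage of the total). It is not the gt seqstat implementation, and the function name `nxx` is an assumption for illustration.

```python
from typing import Iterable


def nxx(lengths: Iterable[int], percent: int = 50) -> int:
    """Return the Nxx value (N50 for percent=50, N80 for percent=80)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("no sequence data")
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]  # not reached for percent <= 100


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", nxx(contigs, 50))
    print("N80:", nxx(contigs, 80))
```

Because the input is just a list of lengths, the same function works no matter whether the lengths come from FASTA files or from an encseq.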

satta commented on May 30, 2024

Well, in other tools (mostly those connecting annotations and sequences) there are options such as -seqfile, -seqfiles, -encseq etc. which allow the user to specify the sequence source. It is possible to access both GtEncseq and GtBioseq sequences (and their lengths) using the GtSeqCol interface. Depending on which tools you have been using before, it may be more or less surprising to have to use the 'other' way. I just think that, either way, things that are not intimately tied to a specific representation (e.g. encseq encoding/decoding etc.) should be done generically. I admit that this has not been done consistently in the past, but we should try not to continue down that path in the future.
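
The idea of doing things generically can be sketched as one small interface with interchangeable backends. Everything below is hypothetical: `SeqCollection`, `FastaTextCollection`, `ConcatenatedCollection` and `total_length` are names invented for this illustration and do not reproduce the GtSeqCol API or the way GtEncseq stores data.

```python
from typing import List, Protocol


class SeqCollection(Protocol):
    """Hypothetical minimal interface; NOT the actual GtSeqCol API."""

    def num_of_seqs(self) -> int: ...

    def seq_length(self, index: int) -> int: ...


class FastaTextCollection:
    """Toy backend: record lengths parsed from FASTA text held in memory."""

    def __init__(self, text: str) -> None:
        self._lengths: List[int] = []
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                self._lengths.append(0)  # header line starts a new record
            elif line and self._lengths:
                self._lengths[-1] += len(line)

    def num_of_seqs(self) -> int:
        return len(self._lengths)

    def seq_length(self, index: int) -> int:
        return self._lengths[index]


class ConcatenatedCollection:
    """Toy backend: one concatenated string plus start offsets."""

    def __init__(self, seqs: List[str]) -> None:
        self._data = "".join(seqs)
        self._starts = [0]
        for seq in seqs:
            self._starts.append(self._starts[-1] + len(seq))

    def num_of_seqs(self) -> int:
        return len(self._starts) - 1

    def seq_length(self, index: int) -> int:
        return self._starts[index + 1] - self._starts[index]


def total_length(col: SeqCollection) -> int:
    """A representation-independent statistic: works with any backend."""
    return sum(col.seq_length(i) for i in range(col.num_of_seqs()))


if __name__ == "__main__":
    fasta = FastaTextCollection(">a\nACGT\nAC\n>b\nGG\n")
    packed = ConcatenatedCollection(["ACGTAC", "GG"])
    print(total_length(fasta), total_length(packed))
```

A statistic written against the interface, like `total_length` here, never needs to know which representation it was given.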

Garonenur commented on May 30, 2024

I do not like the options that specify the type of input. Isn't it rather simple to have some file types that can be handled by us and recognized by gt?

A generic toolbox that can take anything as input could serve the tools that only need simple serial access to the data. It would use a generic container that is backed by either an encseq, the old indices (built with the seq tool? whatever that is) or sequences.

Then all tools that need random access would either load an existing encseq or, if given some sequence files, build one and load it.

The interfaces for that should be the same for all tools. I know that changing this would break a lot of old stuff.

ggonnella commented on May 30, 2024

I think a generic toolbox for sequences, besides the toolbox for encseq, would be a great idea! Then stuff like shredder, convertseq, extractseq, sequniq, seqfilter, seqstat, seqtransform, seqtranslate, seqorder, splitfasta, dev readreads, dev seqlensort would also find a very natural container, and all of them could gradually be implemented in a generalized form that accepts encseq as input.

satta commented on May 30, 2024

Can this be closed?

Garonenur commented on May 30, 2024

yes yes... I have to show @joergi-w the way to autoclose an issue with a pull request :-)
