
Comments (18)

jnalanko commented on September 28, 2024

Finished! Matches 100% with Themisto color sets! Love to see it.

from fulgor.

jnalanko commented on September 28, 2024

99% sure :). I've been hit with some really rare and subtle bugs before.


jermp commented on September 28, 2024

Hi Jarno,
your interpretation is correct. This dumps the distinct color classes only, not the (expanded) map unitig -> color.
If you need it, I can easily implement it.

> By the way, this dump is missing a newline at the end of the file :).

Right! That was intentional, but better to keep style consistent. I'll add it. Thanks!


jnalanko commented on September 28, 2024

Would be nice to have, not urgent though! It might benefit others also for interoperability between tools.


jermp commented on September 28, 2024

Sure!
So I'm thinking about the following format:

num_references [X]
num_unitigs [Y]
num_color_lists [Z]
[unitig1_string] [color_list]
[unitig2_string] [color_list]
...
[unitigY_string] [color_list]

So, one line for each piece of information; fields are separated by a single space.

Please, provide feedback.

Q. Is it better to call num_references, num_documents instead?

Note that unitigs will be output sorted by color list in this format, as this is the way they are stored in Fulgor.


jnalanko commented on September 28, 2024

That looks good to me!

> Q. Is it better to call num_references, num_documents instead?

In my terminology, that would be num_colors. But either of those two is fine.

For reference, in Themisto I have a colored unitig dump command that produces two files:

  1. A fasta file of unitigs, with integer unitig id in the header. Like this:
>14789358964
AAAAAAAAAAAAAAAAAAAAAAAAAAGAGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>95
CTAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>22377361820
TCGCCAGGATTTTTCCGTTGCCATTTCGGATTTTTGGTATTTGCTATACGGCGCAACGCGAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Those ids are actually colexicographic ranks of some k-mer in the unitig, if I remember correctly.

  2. A file listing (unitig id, color list) pairs, like this:
14789358964 3015
95 3434 3757 3927
22377361820 4161 4175

The unitigs are listed in the same order as they come in the fasta file, so the unitig id here is actually a bit redundant. I don't write the lengths of the lists but that is a minor detail.

Anyway, my format should be quite easily comparable to yours.
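To make the two-file layout above concrete, here is a minimal Python sketch of a reader for it. It assumes exactly what is described in this comment: FASTA headers that are bare integer ids, color lines of space-separated integers with the unitig id first, and the same unitig order in both files. The function names (`read_fasta`, `read_color_lists`, `join_dump`) are made up for illustration and are not part of Themisto.

```python
from itertools import zip_longest


def read_fasta(lines):
    """Yield (unitig_id, sequence) pairs; headers are assumed to be '>' + integer id."""
    unitig_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if unitig_id is not None:
                yield unitig_id, "".join(chunks)
            unitig_id, chunks = int(line[1:]), []
        else:
            chunks.append(line)
    if unitig_id is not None:
        yield unitig_id, "".join(chunks)


def read_color_lists(lines):
    """Yield (unitig_id, colors) from lines of the form 'id c1 c2 ...'."""
    for line in lines:
        fields = line.split()
        if fields:
            yield int(fields[0]), [int(c) for c in fields[1:]]


def join_dump(fasta_lines, color_lines):
    """Pair the two files positionally and use the redundant id as a sanity check."""
    records = []
    pairs = zip_longest(read_fasta(fasta_lines), read_color_lists(color_lines))
    for fasta_rec, color_rec in pairs:
        if fasta_rec is None or color_rec is None:
            raise ValueError("the two files list a different number of unitigs")
        (unitig_id, seq), (color_uid, colors) = fasta_rec, color_rec
        if unitig_id != color_uid:
            raise ValueError(f"id mismatch: {unitig_id} vs {color_uid}")
        records.append((unitig_id, seq, colors))
    return records
```

Both arguments can be open file objects or plain lists of lines. The id check costs almost nothing and catches the case where the two files have drifted out of sync.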


jermp commented on September 28, 2024

I see. Your format makes sense, although I'd prefer to keep everything in one file.
I guess it's easier for Themisto to split stuff into two files since unitigs are not stored in color order, but in lex-order of some kmers, as you said.


jermp commented on September 28, 2024

(I'm reporting here part of our conversation on X, just to keep track of it.)
On second thought, I think a dump organized in three files is actually better.

  1. dump.metadata.txt contains useful metadata, such as num_references, num_unitigs, and num_color_lists. Others? Maybe the list of the original filenames?

  2. dump.unitigs.fa lists the unitigs in fasta format, where the header of a unitig is > x y, where x is the unitig-id and y is the color-id. Note: unitig-ids will always be increasing and exactly those returned by SSHash. Color-ids will be non-decreasing instead.

  3. dump.colors.txt lists all the distinct colors in the format [color-id] [color-size] [color-list].
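The ordering note in item 2 (unitig-ids always increasing, color-ids non-decreasing) is easy to check mechanically. A minimal Python sketch, assuming the (unitig-id, color-id) pairs have already been extracted from the FASTA headers in file order; `check_header_order` is a hypothetical helper name, not something Fulgor provides:

```python
def check_header_order(headers):
    """Return True if unitig ids strictly increase and color ids never decrease.

    `headers` is an iterable of (unitig_id, color_id) pairs in file order.
    """
    prev = None
    for unitig_id, color_id in headers:
        if prev is not None:
            if unitig_id <= prev[0] or color_id < prev[1]:
                return False
        prev = (unitig_id, color_id)
    return True
```

A failed check would suggest that the file was reordered or concatenated after it was written.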


jermp commented on September 28, 2024

Ok, done as of 9d5901e. Can you check it?

For the small example with the 10 salmonella files shipped with the repo, we can build the three files above as follows.

./fulgor build -l ../test_data/salmonella_10_filenames.txt -o ../test_data/salmonella_10 -k 31 -m 19 -d tmp_dir -g 1 -t 1 --verbose --check
./fulgor dump -i ../test_data/salmonella_10.fur

salmonella_10.metadata.txt:

num_references=10
num_unitigs=86630
num_color_classes=171

salmonella_10.unitigs.fa:

> unitig_id=0 color_id=0
GATTGAGCACCAACTGCGAGAATCAGGTGTTGAAGAGCAAGGGCGTGTGTTTATCGAAAAAGCTATTGAGCAGCCGCTTGATCCACAA
> unitig_id=1 color_id=0
GAAATTTAACGGCTGTTTTTCCGGCCAGATGTTATGTCTGGCTGGTTTTATTGTTTTGATTTTAAAGGAATTTACAGTGAATAAATGGCGTAACCCCACTGGGTGGTTATGTGCGGTAGCTATGCCTTTTG
> unitig_id=2 color_id=0
GCGCTGAACATCAGCGCCTTTCTGCGACAGCTCAATCATGCATTCGCCAATCACGGCAATC
...

salmonella_10.colors.txt:

color_id=0 size=4 0 3 7 8
color_id=1 size=1 8
color_id=2 size=10 0 1 2 3 4 5 6 7 8 9
color_id=3 size=6 1 2 4 5 6 9
...
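For reference, the three files shown above could be read back with something like the following Python sketch. It relies only on the layout visible in this example (`key=value` fields, FASTA headers of the form `> unitig_id=X color_id=Y`, and color lines of the form `color_id=X size=N` followed by N reference ids). The function names are hypothetical, and the layout may differ in other Fulgor versions.

```python
def parse_key_values(fields):
    """Turn ['a=1', 'b=2'] into {'a': 1, 'b': 2}."""
    return {key: int(value) for key, value in (f.split("=", 1) for f in fields)}


def parse_metadata(lines):
    """Read the metadata file: one 'key=value' entry per line."""
    return parse_key_values(line.strip() for line in lines if line.strip())


def parse_colors(lines):
    """Read the colors file into {color_id: [reference ids]}, checking 'size'."""
    colors = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        head = parse_key_values(fields[:2])
        members = [int(x) for x in fields[2:]]
        if len(members) != head["size"]:
            raise ValueError(f"color {head['color_id']}: size field does not match list")
        colors[head["color_id"]] = members
    return colors


def parse_unitigs(lines):
    """Yield (unitig_id, color_id, sequence) from the unitigs FASTA file."""
    head, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if head is not None:
                yield head["unitig_id"], head["color_id"], "".join(chunks)
            head, chunks = parse_key_values(line[1:].split()), []
        else:
            chunks.append(line)
    if head is not None:
        yield head["unitig_id"], head["color_id"], "".join(chunks)
```

Each function takes any iterable of lines, so an open file works directly. Looking up a unitig's color list is then `colors[color_id]`, which is how the distinct color classes expand back into the unitig -> color map.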


jermp commented on September 28, 2024

@jnalanko, have you had a chance to try it?


jnalanko commented on September 28, 2024

Not yet! I'll try to verify it against my Themisto index this weekend.


jnalanko commented on September 28, 2024

Update: running my verifier now on a big dataset. I could not compare the dumps directly because Themisto outputs both strands, whereas Fulgor outputs only the canonical strand, and cyclic unitigs are tricky as well. The verifier is not very optimized, so it might take a day to run.


jermp commented on September 28, 2024

But what are you trying to verify? Recall that GGCAT does not necessarily output maximal unitigs, so there might be discrepancies in the unitigs. Kmers and their color sets must instead always be the same.


jnalanko commented on September 28, 2024

I'm verifying the color set of each k-mer, which should be the same in both tools, or otherwise there is a bug somewhere.
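The idea of a per-k-mer check can be sketched in a few lines of Python. This is not the actual verifier, only an illustration of the technique: map every canonical k-mer to its color set for each tool, then compare the two maps. Canonicalizing (taking the lexicographically smaller of a k-mer and its reverse complement) makes the comparison independent of which strand a tool prints and of how the unitigs are cut. All names here are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def kmer_colors(unitigs, k):
    """Map each canonical k-mer to its color set.

    `unitigs` is an iterable of (sequence, colors) pairs. A k-mer seen twice
    with different color sets is reported as an error.
    """
    table = {}
    for seq, colors in unitigs:
        color_set = frozenset(colors)
        for i in range(len(seq) - k + 1):
            key = canonical(seq[i:i + k])
            if table.setdefault(key, color_set) != color_set:
                raise ValueError(f"conflicting color sets for k-mer {key}")
    return table


def diff_color_maps(a, b):
    """Sorted k-mers that are missing from one map or carry different color sets."""
    return sorted(kmer for kmer in a.keys() | b.keys() if a.get(kmer) != b.get(kmer))
```

For example, with k = 3, the single unitig `ACGGT` and the pair `ACGG` plus `ACC` (the reverse complement of `GGT`) produce the same map, so `diff_color_maps` returns an empty list for them. Holding everything in a dictionary is only practical for small inputs; a real dataset would need a more compact or streaming approach.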


jermp commented on September 28, 2024

Oh I see. When building the indexes with ./fulgor build, the user can specify --check. In this case, we compare the color set of each kmer against the colors as returned by GGCAT. Similarly, now assuming Fulgor is correct, we compare kmers and color sets of meta-colored Fulgor against Fulgor.


jermp commented on September 28, 2024

Good! Weren't you sure? :)


rob-p commented on September 28, 2024

We've all been there, but great to know!


jermp commented on September 28, 2024

Alright! Closing this now.

