Comments (18)
Finished! Matches 100% with Themisto color sets! Love to see it.
from fulgor.
99% sure :). I've been hit with some really rare and subtle bugs before.
Hi Jarno,
your interpretation is correct. This dumps the distinct color classes only, not the (expanded) map unitig -> color.
If you need it, I can easily implement it.
By the way, this dump is missing a newline at the end of the file :).
Right! That was intentional, but better to keep style consistent. I'll add it. Thanks!
Would be nice to have, not urgent though! It might benefit others also for interoperability between tools.
Sure!
So I'm thinking about the following format:
num_references [X]
num_unitigs [Y]
num_color_lists [Z]
[unitig1_string] [color_list]
[unitig2_string] [color_list]
...
[unitigY_string] [color_list]
So, one line for each piece of information; fields are separated by a single space.
Please, provide feedback.
Q. Is it better to call num_references, num_documents instead?
Note that unitigs will be output sorted by color list in this format, as this is the way they are stored in Fulgor.
That looks good to me!
Q. Is it better to call num_references, num_documents instead?
In my terminology, that would be num_colors. But either of those two is fine.
For reference, in Themisto I have a colored unitig dump command that produces two files:
- A fasta file of unitigs, with integer unitig id in the header. Like this:
>14789358964
AAAAAAAAAAAAAAAAAAAAAAAAAAGAGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>95
CTAGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>22377361820
TCGCCAGGATTTTTCCGTTGCCATTTCGGATTTTTGGTATTTGCTATACGGCGCAACGCGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Those ids are actually colexicographic ranks of some k-mer in the unitig, if I remember correctly.
- A file with pairs (unitig id, color list) list, like this:
14789358964 3015
95 3434 3757 3927
22377361820 4161 4175
The unitigs are listed in the same order as they come in the fasta file, so the unitig id here is actually a bit redundant. I don't write the lengths of the lists but that is a minor detail.
Anyway, my format should be quite easily comparable to yours.
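For anyone comparing the two tools, here is a minimal Python sketch of reading the (unitig id, color list) pairs file described above. It assumes one whitespace-separated record per line, as in the three example lines; the function name `parse_color_pairs` is made up for illustration and is not part of Themisto.

```python
def parse_color_pairs(lines):
    """Map each unitig id to its color list.

    Assumption: every non-empty line is "<unitig id> <color> <color> ...",
    with no explicit list length, as in the example lines of the thread.
    """
    colors_by_unitig = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        unitig_id, *colors = (int(f) for f in fields)
        colors_by_unitig[unitig_id] = colors
    return colors_by_unitig


# The three example lines quoted in the comment above.
example = """\
14789358964 3015
95 3434 3757 3927
22377361820 4161 4175
"""
pairs = parse_color_pairs(example.splitlines())
print(pairs[95])
```

Since the unitigs come in the same order as in the fasta file, the ids could also be dropped and the two files zipped line by line.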
I see. Your format makes sense, although I'd prefer to keep everything in one file.
I guess it's easier for Themisto to split stuff into two files since unitigs are not stored in color order, but in lex-order of some kmers, as you said.
(I'm reporting here part of our conversation on X, just to keep track of it.)
On second thought, I think a dump organized in three files is actually better.
- dump.metadata.txt contains useful metadata, like num_references, num_unitigs, and num_color_lists. Others? The list of the original filenames maybe?
- dump.unitigs.fa lists the unitigs in fasta format, where the header of a unitig is > x y, where x is the unitig-id and y is the color-id. Note: unitig-ids will always be increasing and exactly those returned by SSHash. Color-ids will be non-decreasing instead.
- dump.colors.txt lists all the distinct colors in the format [color-id] [color-size] [color-list].
Ok, done as of 9d5901e. Can you check it?
For the small example with the 10 salmonella files shipped with the repo, we can build the three files above as follows.
./fulgor build -l ../test_data/salmonella_10_filenames.txt -o ../test_data/salmonella_10 -k 31 -m 19 -d tmp_dir -g 1 -t 1 --verbose --check
./fulgor dump -i ../test_data/salmonella_10.fur
salmonella_10.metadata.txt:
num_references=10
num_unitigs=86630
num_color_classes=171
salmonella_10.unitigs.fa:
> unitig_id=0 color_id=0
GATTGAGCACCAACTGCGAGAATCAGGTGTTGAAGAGCAAGGGCGTGTGTTTATCGAAAAAGCTATTGAGCAGCCGCTTGATCCACAA
> unitig_id=1 color_id=0
GAAATTTAACGGCTGTTTTTCCGGCCAGATGTTATGTCTGGCTGGTTTTATTGTTTTGATTTTAAAGGAATTTACAGTGAATAAATGGCGTAACCCCACTGGGTGGTTATGTGCGGTAGCTATGCCTTTTG
> unitig_id=2 color_id=0
GCGCTGAACATCAGCGCCTTTCTGCGACAGCTCAATCATGCATTCGCCAATCACGGCAATC
...
salmonella_10.colors.txt:
color_id=0 size=4 0 3 7 8
color_id=1 size=1 8
color_id=2 size=10 0 1 2 3 4 5 6 7 8 9
color_id=3 size=6 1 2 4 5 6 9
...
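To make the layout of the three files concrete, here is a minimal Python sketch that parses them and expands the unitig -> color list map. It is only an illustration based on the excerpts above: the function names are hypothetical, and it assumes each unitig occupies a single sequence line after its header, as in the example.

```python
def parse_metadata(lines):
    """Parse `key=value` lines (e.g. num_references=10) into a dict of ints."""
    meta = {}
    for line in lines:
        line = line.strip()
        if line:
            key, value = line.split("=", 1)
            meta[key] = int(value)
    return meta


def parse_unitigs(lines):
    """Yield (unitig_id, color_id, sequence) for each record.

    Assumption: headers look like "> unitig_id=0 color_id=0" and every
    sequence fits on the single line that follows its header.
    """
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = dict(f.split("=", 1) for f in line[1:].split())
            header = (int(fields["unitig_id"]), int(fields["color_id"]))
        elif header is not None:
            yield header[0], header[1], line
            header = None


def parse_colors(lines):
    """Map color_id to its list of reference ids, checking the size field."""
    colors = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        color_id = int(fields[0].split("=", 1)[1])
        size = int(fields[1].split("=", 1)[1])
        refs = [int(f) for f in fields[2:]]
        if len(refs) != size:
            raise ValueError(f"color {color_id}: expected {size} ids, got {len(refs)}")
        colors[color_id] = refs
    return colors


# Lines copied from the salmonella_10 excerpts above.
meta = parse_metadata(["num_references=10", "num_unitigs=86630", "num_color_classes=171"])
colors = parse_colors(["color_id=0 size=4 0 3 7 8", "color_id=1 size=1 8"])
unitigs = list(parse_unitigs([
    "> unitig_id=2 color_id=0",
    "GCGCTGAACATCAGCGCCTTTCTGCGACAGCTCAATCATGCATTCGCCAATCACGGCAATC",
]))

# Expand the map unitig -> color list by following the color_id.
expanded = {uid: colors[cid] for uid, cid, _seq in unitigs}
print(expanded)
```

Keeping the distinct colors in their own file means each color list is written once, and the expanded map is a dictionary lookup away.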
@jnalanko, have you had a chance to try it?
Still no! I'll try to verify it against my Themisto index this weekend.
Update: running my verifier now on a big dataset. I could not compare dumps directly because Themisto outputs both strands, whereas Fulgor only canonical, and also cyclic unitigs are tricky. It's not a very optimized verifier so it might take a day to run.
But what are you trying to verify? Recall that GGCAT does not necessarily output maximal unitigs, so there might be discrepancies in the unitigs. Kmers and their color sets must instead always be the same.
I'm verifying the color set of each k-mer, which should be the same in both tools, or otherwise there is a bug somewhere.
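A rough Python sketch of this kind of check, not the actual verifier: it maps every k-mer to a canonical form so that a both-strands dump and a canonical-only dump become comparable, then compares the color set attached to each k-mer. The helper names and the toy input are invented for illustration, "canonical" is taken here to be the lexicographically smaller of a k-mer and its reverse complement, and cyclic unitigs are not handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_colors(unitigs, k):
    """Build {canonical k-mer: frozenset of colors} from (sequence, color_list) pairs."""
    table = {}
    for seq, colors in unitigs:
        color_set = frozenset(colors)
        for i in range(len(seq) - k + 1):
            key = canonical(seq[i:i + k])
            # A k-mer seen twice (e.g. once per strand) must keep the same colors.
            if table.setdefault(key, color_set) != color_set:
                raise ValueError(f"inconsistent colors for k-mer {key}")
    return table


def mismatches(a, b):
    """Return the sorted k-mers that are missing from one table or differ in colors."""
    return sorted(kmer for kmer in a.keys() | b.keys() if a.get(kmer) != b.get(kmer))


# Toy input (made up): one dump lists a unitig once, the other lists both strands.
k = 3
one_strand = kmer_colors([("ACGTT", [0, 1])], k)
both_strands = kmer_colors([("ACGTT", [0, 1]), ("AACGT", [0, 1])], k)
print(mismatches(one_strand, both_strands))
```

Because the comparison is per k-mer, it does not matter whether the two tools split the graph into the same unitigs.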
Oh I see. When building the indexes with ./fulgor build, the user can specify --check. In this case, we compare the color set of each kmer against the colors as returned by GGCAT. Similarly, now assuming Fulgor is correct, we compare kmers and color sets of meta-colored Fulgor against Fulgor.
Good! Weren't you sure? :)
We've all been there, but great to know!
Alright! Closing this now.