Comments (15)
Hi @jeremymsimon,
Fulgor
(the current version at least) is a purely color-based index. Hence, there is no positional information stored about where a sequence's constituent k-mers map. Therefore, the interpretation of a mapping from fulgor
is exactly as you suggest — it signifies that the sequence likely occurred (n >= 1 times) in the reported reference. It would potentially be possible to think about indexing abundance information (i.e. number of occurrences of a unitig in each reference) or approximate position information, but the current index is not focused on that use case. Is there an application where e.g. the number of occurrences would be useful to you? Or the approximate positions? If you want the full positions, then I think you'd have to fall back to a full positional index, which is quite powerful, but generally much larger.
Best,
Rob
from fulgor.
Hey @rob-p!
For the particular application I'm playing around with, knowing if a given sequence matches at all is certainly useful, but it would be great to know how many times it matched within a certain number of mismatches. I was able to install and run fulgor
easily and was quickly able to do some testing though, so it seems great in that regard!
I played around with pufferfish
also but given my application, I couldn't get an index generated efficiently without downsampling the references first. I have for now settled on an approach to pre-select genomic intervals of interest, then creating an index either per individual, or one mega-index with each individual coded as colors in the ccdBG (ie columns in the fulgor
output)
Also, this may be easiest to describe via a call to step through some data, but both fulgor
and pufferfish
seem to occasionally miss known occurrences of sequences in a somewhat kmer-length-dependent fashion. Briefly, most known matches are found easily, but some may drop out at k=X
and reappear at k=X+2
as though the seed got split and thus not found. This could be an artifact of my short query sequences and thus short k-mers in the index (k=11 or 13), so if this is an expected phenomenon given that, it would be good to know if there are any possible workarounds without introducing too many spurious matches.
If fulgor
were able to account for multiple matches and we were able to work around the above issue, I think it may be perfect for my specific use here. I'm working on a minimally reproducible example dataset as well and can share that when it's ready
Thanks!
from fulgor.
Hi @jeremymsimon,
Very interesting! We'd certainly be interested in investigating any minimal reproducible examples you find. There is a filtering that happens with highly-repetitive seeds in pufferfish
, but there shouldn't be in fulgor
(depending on the type of mapping options you pass). That is, if you use "strict intersection" or "threshold union" methods, they are the naive implementations and don't otherwise have any filtering or thresholding of repetitive seeds.
If you're interested in a more memory efficient index that supports positional information, you may be interested in looking at piscem
— which is very much like pufferfish
swapping out the previous representation of the k-mer to unitig mapping with sshash
. However, by virtue of having to store all of the positional information, it will still likely be substantially larger than fulgor
.
Best,
Rob
from fulgor.
Hi @jeremymsimon,
thank you for your interest in Fulgor! It looks like you would perhaps need Pufferfish2 if you're interested in positional information (exact kmer location) rather than just its color, although I would first runs Fulgor to see its performance.
As @rob-p mentioned, we're more then happy to assist you with a smaller dataset at hand! Please, let us know.
Cheers,
-Giulio
from fulgor.
Thanks @jermp and @rob-p - I'll hopefully have a small scale dataset to illustrate this later this week!
from fulgor.
Hi @jeremymsimon, any news regarding the dataset?
from fulgor.
Yes, so sorry for the delay. I've attached here a directory containing a set of query (queries.fasta
) and target sequences (*.minExampleSubset.fasta
), as well as the code I've used in testing for each of pufferfish
, fulgor
, vargas
, and blat
(*.sh
). I've been doing a bit of a bake-off comparing and contrasting the different methods to find which may be the most appropriate choice here given our question but also the limitations of having such short queries (and thus, very short k-mers)
The blat
output contains what I'd consider the "expected" case, and I'm comparing the results of the other methods to that and noting how and where they differ.
Generally:
-
fulgor
doesn't tell me where a match occurs (as you noted above; this is likely OK to be without) and doesn't tell me if a given match occurred more than once, which would be very useful in the case of multimappers. You can correct me if I'm wrong, butfulgor
may also be heuristic in that it can't guarantee to find a match (?). -
pufferfish
is close, but seems to suffer from random dropouts as noted above. I suspect this is when the sequence gets bisected/split across two k-mers, thus you sometimes get a match at certain k but miss it at other k. Is this a problem true forfulgor
as well? -
vargas
should be considered an exact search, but currently isn't reporting whether a given match occurred more than once (e.g. FANCF). I'm discussing this further with the Langmead lab currently
My hope is that this setup is mostly self explanatory once you dive into the directory and scripts, but absolutely feel free to ask here or by email if you'd like me to clarify anything or provide more context.
Thanks in advance!
from fulgor.
Hi @jeremymsimon and sorry for the late reply but it's a very busy period here (as usual!).
Your setup looks very interesting and, as we already said, we're both (I and @rob-p) happy to assist you in your research if you use Fulgor.
I do not know exactly what you're searching for but I can try to answer some questions:
fulgor may also be heuristic in that it can't guarantee to find a match (?).
Fulgor might have false positives but never false negatives. That is, a sequence might be falsely reported as a match because its kmers are found in the index whereas it does not really occur in any of the indexed genomes. This is a limitation of all kmer-based methods (not only of Fulgor). This is why a longer kmer length is probably useful for. You mentioned you use k=11 or 13, but we would use k=28 or 31 (current default value).
But Fulgor will never have false negatives, i.e., given a true positive query, all its kmers will be found in the index, hence the query will always be reported as present in the index.
when the sequence gets bisected/split across two k-mers, thus you sometimes get a match at certain k but miss it at other k.
Mmh I'm not sure I understand what you mean by "the sequence gets split across two k-mers". Can you please clarify or provide a toy example?
As Rob mentioned, pufferfish can be used with some filtering option, but we don't use any filtering in Fulgor, hence any indexed kmer must always be found at query time.
Hope this clarifies a bit! Please let me know.
Best,
-Giulio
from fulgor.
Hi @jeremymsimon,
Just some notes about what you're likely seeing in pufferfish
; it implements several heuristics to avoid having to repeatedly handle highly-repetitive matches during chaining. This can lead do dropping some hits if they occur in super-repetitive regions. At some point, we exposed variables to tweak these from the command line, but they don't have highly informative names (as they were more meant for developer tweaking than end-user use).
Regardless, if it turns out you really do need positional information for your hits, then one place to look might be piscem-cpp
and the piscem
wrapper. This is our next-generation positional index, and it builds upon @jermp's excellent sshash as the k-mer to unitig mapping, and so it's considerably more memory frugal than pufferfish. We are actively developing and improving it, and so are happy to consider feature requests that might help you.
If you don't need positional information, as Giulio suggests, there is no filtering currently done in fulgor. However, you won't get positional information (it doesn't compute this), and it also doesn't keep track of how many times in each reference a given k-mer occurs. Specifically, fulgor considers the "color" of a unitig as the distinct set of references in which the unitig occurs, regardless of the number of occurrences. This is what it considers when performing the mapping.
Finally, I'll add one other point to the excellent description Giulio gives above. When you are looking up the mapping information for reads that consist of more than one k-mer, then the result you get can depend heavily on which specific mapping algorithm you use. The strict pseudoalignment algorithm requires that, to report a reference r_i
as being a mapping for the read, then for every k-mer k_j
that occurs in the index at all, k_j
must be contained within r_i
. That is, if you have a read where 100 k-mers match the index, and r_i
contains 99 of them, but the last k-mer only maps to, say r_{i+1}
, then your read will not be reported as a mapping to r_i
(and may not be reported as mapping at all unless the other 99 k-mers also match to r_{i+1}
. The non-strict intersection lets you control the threshold of k-mers that match in the index that must be present in a reference to report that reference as a mapping for the read.
--Rob
from fulgor.
Thanks @rob-p and @jermp ! This is helpful. So my search has relatively short queries, on the order of 20-25bp as opposed to typical applications (e.g. RNA-seq reads) that are longer. Do you still recommend a k-mer length of 28 or 31? It isn't explicitly mentioned this way, but it seemed there was a pattern that the recommended k-mer length was roughly half the size of the query sequences (ie 28-31bp works for 50-75bp reads) but maybe that was me over-interpreting. And can you similarly provide guidelines for the minimizer parameter as I don't see this recommendation anywhere?
In the files I shared, there were example queries that behaved or didn't behave as we expected, and some variability from algorithm to algorithm, but before saying more it may be due to a poor choice in k-mer/minimizer lengths so it would be helpful to clarify that first.
If a known match is found with 100% confidence, then we likely can get away without knowing the locations, and in fact this may be preferable for our applications as it will obscure any potentially identifiable information in the results.
We do however need to know if a given match occurred more than once, and ideally how many times within a certain distance/number of mismatches. It sounds like fulgor
can't do that based on your description, unfortunately, but my guess is a lot of the above (including best practices for k-mer length) will apply to pufferfish
and piscem
as well
Thanks!
from fulgor.
Hi @jeremymsimon,
So my search has relatively short queries, on the order of 20-25bp as opposed to typical applications (e.g. RNA-seq reads) that are longer.
Well, I'd say 20-25 bp is even shorter than a single kmer....so kmer-based indexes using k < 20 might yield poor results in terms of accuracy. For the minimizer length: actually there is no good rule. It does really depend on the data you are indexing. Usually I use m=17 for less fragmented dBGs and m=19 or 20 for highly fragmented dBGs but that is for k=31 :)
I never tried k=20, but I guess something like m=13 could work well.
We do however need to know if a given match occurred more than once, and ideally how many times within a certain distance/number of mismatches. It sounds like fulgor can't do that based on your description, unfortunately, but my guess is a lot of the above (including best practices for k-mer length) will apply to pufferfish and piscem as well.
Correct, Fulgor does not encode any multiplicity, whereas pufferfish2 or piscem have everything :)
from fulgor.
Thanks @jermp - so is the general rule of thumb for the k-mer size then to be as long as your shortest possible query length? ie if I have a range of query sizes from 20-25bp, you'd recommend k=20
but if it was a range of 25-30 you'd recommend k=25
?
from fulgor.
You're very welcome @jeremymsimon!
so is the general rule of thumb for the k-mer size then to be as long as your shortest possible query length?
I do not know if it is a rule of thumb but I'd say "no" because, in general, one builds an index without assuming specific query workloads nor distributions.
if I have a range of query sizes from 20-25bp, you'd recommend k=20 but if it was a range of 25-30 you'd recommend k=25?
For these values of k yes, it is a reasonable choice. But I'm surprised you have so short queries! In general, if your shortest query length is, say, 200, I'd not recommend k=200 but k=31 or k=63.
from fulgor.
Thanks @jermp and apologies for the delay, I was out of town. I'm building this all for a very specific gRNA sequence screen, so the queries are short and fixed at that length, rather than a typical usage for sequencing reads or similar.
I will repeat my testing with k=20, and try your other suggestions above (including piscem and others). I'll close this for now and reopen if I have any further questions!
Thanks for all of your help!
from fulgor.
No problem at all @jeremymsimon. Keep us posted, we're both very curious about this experimentation.
Best,
-Giulio
from fulgor.
Related Issues (14)
- Error in ggcat_querier compilation HOT 5
- Warning due to 64-bit hash codes HOT 9
- Num_contigs must be less than 2^32 Aborted (core dumped) HOT 27
- Compilation error HOT 4
- Missing -lrt linker flag on Ubuntu HOT 3
- Add a note in the README HOT 1
- Even better build pipeline for meta colored dBG HOT 1
- Fulgor build failed because of 128bits integers HOT 10
- Feature request: creates a distinct color for each sequence in the input file HOT 5
- How to interpret the color dump file? HOT 18
- Consistent terminology HOT 9
- Fulgor Indexing Error Due to Empty Bucket Detection HOT 1
- Build fails due to missing header HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fulgor.