Comments (9)
I don’t remember if I ever knew this for sure, but yeah, I assume the same thing: that these are unique sequences that, based on the settings, weren’t successfully “assigned” to an already-inferred, dada2-"trusted" sequence, and weren’t strong enough on their own to be trusted as accurate (I’d guess due to errors and/or too few counts of that unique sequence). But that’s just a guess. I tried looking at the methods of the dada2 paper again, but the stats are just over my head :/ Maybe @benjjneb can chime in here on whether that guess is on the right track :)
I approve this explanation :)
A note: I would almost never recommend setting a high minOverlap. The main purpose of minOverlap is to prevent spurious merging due to random matching between a few bases at the beginning and the end, not to specify the final amplicon length. You can filter out off-target amplicon lengths once you get to the final merged table (e.g. this is shown in our tutorial, and perhaps in Mike's too). That's the better way to achieve that goal.
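For what it's worth, here's a minimal sketch of that length-filtering approach, assuming seqtab is the merged sequence table from makeSequenceTable(), and using ~250-254 bp as an example expected amplicon size (adjust that range for your own primer set):

# look at the distribution of merged sequence lengths
table(nchar(getSequences(seqtab)))
# keep only the sequences (columns) within the expected length range
seqtab_trimmed <- seqtab[, nchar(colnames(seqtab)) %in% 250:254]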
Oh, no! I'm sorry, Emily! I meant to come back to this forever ago when I had time and I totally dropped the ball :(
Seeing as how late I am, are you beyond this now? Or would it still help for me to take a look and give you my thoughts? Sorry again
Hiya!
- In your summary tab, you have a drop (only a couple of reads) from filtered to the dada_f and dada_r. Is this due to the error rate and the dereplication rates given for the parameters?
I don’t remember if I ever knew this for sure, but yeah, I assume the same thing: that these are unique sequences that, based on the settings, weren’t successfully “assigned” to an already-inferred, dada2-"trusted" sequence, and weren’t strong enough on their own to be trusted as accurate (I’d guess due to errors and/or too few counts of that unique sequence). But that’s just a guess. I tried looking at the methods of the dada2 paper again, but the stats are just over my head :/ Maybe @benjjneb can chime in here on whether that guess is on the right track :)
- If you place parameters for certain lengths within functions such as filterAndTrim() and mergePairs(), it should guarantee that the reads are greater than or equal to the parameter you set, correct? So for your example workflow, there shouldn't be a read that is less than 175?
So yea this should be the case if setting the “minLen” parameter in filterAndTrim(), and there shouldn’t be any reads shorter than 175 after that step in the example I have up. I don’t see a similar setting for mergePairs() though. Also, in looking again at the options regarding lengths (which I feel like I kind of confuse myself in a different way each time I think about them), it seems my setting “minLen=175” in that example is superfluous anyway, as the “truncLen” parameter not only cuts longer reads down to that size, but also throws away any reads shorter than that. So having truncLen=c(250,200) is saying the minimum length is going to be 250 for the forward and 200 for the reverse, in addition to chopping anything longer down to that. I should revisit that and update it, thanks for making me look at it again :)
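To make that concrete, here's a rough sketch of what such a call would look like (the file-path variable names and the maxEE values here are just placeholder/example settings, not anything specific to your data):

filtered_out <- filterAndTrim(forward_reads, filtered_forward_reads,
                              reverse_reads, filtered_reverse_reads,
                              truncLen=c(250,200), maxEE=c(2,2),
                              rm.phix=TRUE, multithread=TRUE)
# truncLen=c(250,200) truncates forward reads at 250 bp and reverse reads at
# 200 bp, and also discards any read shorter than its truncation length, so
# adding minLen=175 on top of that wouldn't change anything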
- Finally, does the error change each time you run it? Sometimes when I have run a sample set, it will use all the samples present to learn the error rate versus sometimes it will only use a couple of the samples. Is this due to the size of the samples for the best efficiency?
Yeppers, it’s due to the size of the dataset. By default it will pull reads from samples until it reaches 100 million bases and then stop; if all samples together don’t reach that, it will use all samples/reads.
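Just to show where that behavior comes from, a quick sketch (the object name is just from the example workflow; the nbases cutoff is what decides how many samples get used):

err_forward_reads <- learnErrors(filtered_forward_reads, multithread=TRUE)
# learnErrors() reads in samples one at a time until it has seen ~1e8 bases
# (the default nbases), so a small dataset ends up using every sample while a
# big one may only touch the first few; nbases can be raised if wanted, e.g.
# learnErrors(filtered_forward_reads, nbases=1e9, multithread=TRUE)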
Regarding your specific dropped stuff, I see why you were asking about the dropped sequences from the "filtered" to the "dada_f" and "dada_r" following the dada() step. For that to be explained by what we guessed above in your question 1, I would think you'd have to have a lot of unique sequences compared to the total sequences in the samples that lost a lot of reads there. I don't know what this ratio typically looks like, but here's something ugly I just threw together to see it for the rocks in the happy belly example:
num_uniqs_R1 <- integer() ; num_reads_R1 <- integer()
num_uniqs_R2 <- integer() ; num_reads_R2 <- integer()

# for each sample, count unique sequences ($uniques) and total reads ($map)
# in the dereplicated forward and reverse objects
for (index in 1:length(samples)) {
  print(samples[index])
  num_uniqs_R1[index] <- length(derep_forward[[samples[index]]]$uniques)
  num_reads_R1[index] <- length(derep_forward[[samples[index]]]$map)
  num_uniqs_R2[index] <- length(derep_reverse[[samples[index]]]$uniques)
  num_reads_R2[index] <- length(derep_reverse[[samples[index]]]$map)
}

df <- data.frame(row.names=samples,
                 "num_uniqs_R1"=num_uniqs_R1, "num_reads_R1"=num_reads_R1,
                 "num_uniqs_R2"=num_uniqs_R2, "num_reads_R2"=num_reads_R2,
                 "percent_uniq_R1"=round(num_uniqs_R1 / num_reads_R1 * 100, 2),
                 "percent_uniq_R2"=round(num_uniqs_R2 / num_reads_R2 * 100, 2))
df
num_uniqs_R1 num_reads_R1 num_uniqs_R2 num_reads_R2 percent_uniq_R1 percent_uniq_R2
B1 552 1498 514 1498 36.85 34.31
B2 224 529 203 529 42.34 38.37
B3 186 457 171 457 40.70 37.42
B4 203 475 189 475 42.74 39.79
BW1 723 2109 660 2109 34.28 31.29
BW2 2760 5527 2506 5527 49.94 45.34
R10 5501 10354 5054 10354 53.13 48.81
R11BF 3452 8028 3113 8028 43.00 38.78
R11 4846 8138 4568 8138 59.55 56.13
R12 9747 14423 9288 14423 67.58 64.40
R1A 6894 10906 6445 10906 63.21 59.10
R1B 9380 14672 8799 14672 63.93 59.97
R2 9310 15660 8762 15660 59.45 55.95
R3 10225 15950 9710 15950 64.11 60.88
R4 10407 17324 9613 17324 60.07 55.49
R5 10981 16728 10432 16728 65.64 62.36
R6 8520 13338 8047 13338 63.88 60.33
R7 4432 7331 4145 7331 60.46 56.54
R8 6171 11192 5696 11192 55.14 50.89
R9 4234 7853 3970 7853 53.92 50.55
> summary(df$percent_uniq_R1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
34.28 42.94 57.30 54.00 63.38 67.58
> summary(df$percent_uniq_R2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.29 39.54 53.19 50.34 59.32 64.40
So in this dataset we're seeing a little over 50% of reads being unique (the rocks are generally a pretty diverse environment). Maybe see what that looks like for your stuff?
And for the merging, I'm sure this is one of the primary angles you're looking into with Ben's help, but it's easy to lose a lot of reads there if they don't overlap enough to successfully merge, which is dependent on the expected amplicon size and sequencing setup. The step where we most have to think about this is probably the filterAndTrim() step. Have you spent time already making sure your trimming isn't cutting them too short to successfully merge?
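If it helps, here's a rough way to see how many reads survive merging per sample (assuming the dada and derep objects from the example workflow; the default minOverlap of 12 is shown here just to contrast with a high setting):

merged <- mergePairs(dada_forward, derep_forward, dada_reverse, derep_reverse,
                     minOverlap=12, verbose=TRUE)
# verbose=TRUE prints how many read pairs successfully merged per sample;
# summing the abundance column gives the merged read count for each sample
sapply(merged, function(x) sum(x$abundance))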
Well the example workflow data are 2x300, not sure if that's part of what's going on yet. What primers were used on yours?
The actual minOverlap I used for my run was 125. I came up with this based on the fact that the two trim lengths were 200 each and the original cycles were 250, so 200 + 200 - 251 = 149. But I dropped it to 125 just in case.
Is this saying the forward and reverse were trimmed down to length 200? If that's the case, for spanning 515-806, there'd only be a nominal overlap of 109 I think (forward going from 515+200=715; reverse going 806-200=606; so the overlap window is 715-606=109; unless my brain is not working properly right now, which happens a lot). So you might still be losing a lot of things due to that with a minOverlap of 125. I'd drop it really low and see if a bunch more seqs make it through.
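Just to lay that arithmetic out in one place (assuming the 515F/806R V4 primers, i.e. an amplicon spanning roughly positions 515 to 806, and both reads truncated to 200 bp):

amplicon_start <- 515 ; amplicon_end <- 806
fwd_trunc <- 200 ; rev_trunc <- 200
fwd_reach <- amplicon_start + fwd_trunc   # forward read reaches ~position 715
rev_reach <- amplicon_end - rev_trunc     # reverse read reaches back to ~position 606
fwd_reach - rev_reach                     # ~109 bases of nominal overlap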
I have another clarification question for you based on some points you make in the full workflow, if you don't mind. You make the point in the chimera step that if sum(seqtab.nochim)/sum(seqtab) is less than 80% you might want to go back to the filterAndTrim step and increase the maxEE. How high can the maxEE go? I've only really seen 5 as the highest.
I actually can't find this, can you tell me more specifically where I'm saying this, ha. I'm having a hard time figuring out the context or what I meant, and I'm not sure at the moment why I would relate chimeras to the quality filtering... But to the question of how high maxEE can go, there usually aren't set answers to things like that, unfortunately; it's up to the researcher to decide how loose or strict to be with the data, and like lots of stuff it's a balance between keeping as much data as we can vs keeping the highest-quality data we can. The only other thing I'd consider is that maxEE is the number of expected errors allowed in a read, so we'd also want to think about it differently for 50 bp reads vs 250 bp reads (allowing 5 errors in 50 bps is a bigger deal than allowing 5 errors in 250 bps).
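In case it's useful, a tiny sketch of what those expected errors actually are (just the sum of the per-base error probabilities implied by the quality scores):

quals <- rep(30, 250)     # a hypothetical 250 bp read that is Q30 at every position
sum(10^(-quals / 10))     # ~0.25 expected errors; maxEE=2 would comfortably keep this read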
But yeah, please let me know where I have that suggestion you're talking about, because I'd like to make sure I was making sense, ha