
Comments (9)

benjjneb commented on July 18, 2024

I don’t remember if I ever knew this for sure, but yeah, I assume the same thing: that these are unique sequences that, based on the settings, weren’t successfully "assigned" to an already inferred dada2-"trusted" sequence, and weren’t strong enough on their own to be trusted as accurate (I’d guess due to errors and/or too few counts of that unique sequence). But that’s just a guess. I tried looking at the methods of the dada2 paper again, but the stats are just over my head :/ Maybe @benjjneb can throw in here on whether that guess is on the right track or not :)

I approve this explanation :)


benjjneb commented on July 18, 2024

A note: I would almost never recommend setting a high minOverlap. The main purpose of minOverlap is to prevent spurious merging due to random matching between a few bases at the beginning and the end, not to specify the final amplicon length.

You can filter out off-target amplicon lengths once you get to the final merged table (e.g. in our tutorial this is shown, perhaps in Mike's too). That's the better way to achieve that goal.
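
For example, something like this on the merged sequence table (just a sketch; it assumes the dada2 package is loaded, the merged table is called seqtab, and that ~250-256 bp matches your expected amplicon length):

library(dada2)

# look at the distribution of merged sequence lengths
table(nchar(getSequences(seqtab)))

# keep only sequences whose length falls in the expected range for the amplicon
# (250:256 is just an example window; adjust it to your target region)
seqtab_filt <- seqtab[, nchar(colnames(seqtab)) %in% 250:256]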


AstrobioMike commented on July 18, 2024

Oh, no! I'm sorry, Emily! I meant to come back to this forever ago when I had time, and I totally dropped the ball :(

Seeing as how late I am, are you beyond this now? Or would it still help for me to take a look and give you my thoughts? Sorry again.


Jatbee32 commented on July 18, 2024


AstrobioMike commented on July 18, 2024

Hiya!

  1. In your summary tab, you have a drop (only a couple of reads) from
    filtered to the dada_f and dada_r. Is this due to the error rate and the
    dereplication rates given for the parameters?

I don’t remember if I ever knew this for sure, but yeah, I assume the same thing: that these are unique sequences that, based on the settings, weren’t successfully "assigned" to an already inferred dada2-"trusted" sequence, and weren’t strong enough on their own to be trusted as accurate (I’d guess due to errors and/or too few counts of that unique sequence). But that’s just a guess. I tried looking at the methods of the dada2 paper again, but the stats are just over my head :/ Maybe @benjjneb can throw in here on whether that guess is on the right track or not :)

  2. If you place parameters for certain lengths within functions such as
    filterAndTrim() and mergePairs(), it should guarantee that the reads are
    greater than or equal to the parameter you set, correct? So for your example
    workflow, there shouldn't be a read that is less than 175?

So yeah, this should be the case when setting the “minLen” parameter in filterAndTrim(), and there shouldn’t be any reads shorter than 175 after that in the example I have up. I don’t see a similar setting for mergePairs() though. Also, in looking again at the options regarding lengths (which I feel like I confuse myself in a different way each time I think about them), it seems my setting “minLen=175” in that example is superfluous anyway, as the “truncLen” parameter not only cuts reads longer than that down to that size, but also throws away any reads shorter than that. So truncLen=c(250,200) is saying the minimum length is going to be 250 for the forward and 200 for the reverse, in addition to chopping anything longer down to those lengths. I should revisit that and update it, thanks for making me look at it again :)
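
Just to illustrate that point (a rough sketch, not necessarily the exact call in the workflow; the file-path variable names here are placeholders for the raw and filtered fastq paths), with truncLen set, a separate minLen doesn't add anything:

library(dada2)

# truncLen=c(250,200) truncates forward reads to 250 bp and reverse reads to 200 bp,
# and any read shorter than its truncation length gets discarded, so adding
# minLen=175 on top of this is redundant
filtered_out <- filterAndTrim(forward_reads, filtered_forward_reads,
                              reverse_reads, filtered_reverse_reads,
                              truncLen=c(250,200), maxEE=c(2,2),
                              rm.phix=TRUE, multithread=TRUE)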

  3. Finally, does the error change each time you run it? Sometimes when I
    have run a sample set, it will use all the samples present to learn the
    error rate versus sometimes it will only use a couple of the samples. Is
    this due to the size of the samples for the best efficiency?

Yeppers, it’s due to the size of the dataset. By default it will pull in reads until it reaches 100 million bases and then stop adding samples; if all samples together don’t reach that, it will use all samples/reads.
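
For reference, that cutoff is the nbases argument of learnErrors() (a sketch; the input object name is a placeholder for your filtered forward reads):

library(dada2)

# reads are pulled in sample by sample until roughly nbases total bases are reached
# (default 1e8, i.e. 100 million); smaller datasets just end up using everything
err_forward_reads <- learnErrors(filtered_forward_reads, nbases=1e8, multithread=TRUE)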

Regarding your specific dropped stuff, I see why you were asking about the dropped sequences from the "filtered" to the "dada_f" and "dada_r" columns following the dada() step. For that to be explained by what we guessed above in your question 1, I would think the samples that lost a lot of reads there would have to have a lot of unique sequences compared to their total sequences. I don't know what this ratio typically looks like, but here's something ugly I just threw together to see it for the rocks in the happy belly example:

# tally unique sequences and total reads per sample from the derep objects
num_uniqs_R1 <- integer() ; num_reads_R1 <- integer() ; num_uniqs_R2 <- integer() ; num_reads_R2 <- integer()

for (index in seq_along(samples)) {
  print(samples[index])
  # $uniques has one entry per unique sequence; $map has one entry per read
  num_uniqs_R1[index] <- length(derep_forward[[samples[index]]]$uniques)
  num_reads_R1[index] <- length(derep_forward[[samples[index]]]$map)
  num_uniqs_R2[index] <- length(derep_reverse[[samples[index]]]$uniques)
  num_reads_R2[index] <- length(derep_reverse[[samples[index]]]$map)
}

# combine the counts with the percent of reads that are unique in each direction
df <- data.frame(row.names=samples, "num_uniqs_R1"=num_uniqs_R1, "num_reads_R1"=num_reads_R1, "num_uniqs_R2"=num_uniqs_R2, "num_reads_R2"=num_reads_R2, "percent_uniq_R1"=round(num_uniqs_R1 / num_reads_R1 * 100, 2), "percent_uniq_R2"=round(num_uniqs_R2 / num_reads_R2 * 100, 2))
df
      num_uniqs_R1 num_reads_R1 num_uniqs_R2 num_reads_R2 percent_uniq_R1 percent_uniq_R2
B1             552         1498          514         1498           36.85           34.31
B2             224          529          203          529           42.34           38.37
B3             186          457          171          457           40.70           37.42
B4             203          475          189          475           42.74           39.79
BW1            723         2109          660         2109           34.28           31.29
BW2           2760         5527         2506         5527           49.94           45.34
R10           5501        10354         5054        10354           53.13           48.81
R11BF         3452         8028         3113         8028           43.00           38.78
R11           4846         8138         4568         8138           59.55           56.13
R12           9747        14423         9288        14423           67.58           64.40
R1A           6894        10906         6445        10906           63.21           59.10
R1B           9380        14672         8799        14672           63.93           59.97
R2            9310        15660         8762        15660           59.45           55.95
R3           10225        15950         9710        15950           64.11           60.88
R4           10407        17324         9613        17324           60.07           55.49
R5           10981        16728        10432        16728           65.64           62.36
R6            8520        13338         8047        13338           63.88           60.33
R7            4432         7331         4145         7331           60.46           56.54
R8            6171        11192         5696        11192           55.14           50.89
R9            4234         7853         3970         7853           53.92           50.55
> summary(df$percent_uniq_R1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.28   42.94   57.30   54.00   63.38   67.58 
> summary(df$percent_uniq_R2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  31.29   39.54   53.19   50.34   59.32   64.40

So in this dataset we're seeing a little over 50% of reads being unique (the rocks are generally a pretty diverse environment). Maybe see what that looks like for your stuff?

And for the merging, I'm sure this is one of the primary angles you're looking into with Ben's help, but it's easy to lose a lot of reads there if they don't overlap enough to successfully merge, which depends on the expected amplicon size and the sequencing setup. The step where we most have to think about this is probably the filterAndTrim() step. Have you spent time already making sure your trimming isn't cutting the reads too short to successfully merge?
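
A quick back-of-the-envelope way to think about it (just arithmetic, with made-up example numbers; plug in your own truncation lengths and primer-trimmed amplicon length):

# expected overlap is roughly: truncLen_F + truncLen_R - amplicon length (after primer removal)
trunc_f <- 250 ; trunc_r <- 200   # example truncation lengths
amplicon_len <- 253               # e.g. a primer-trimmed ~253 bp V4 fragment
trunc_f + trunc_r - amplicon_len  # ~197 bp of overlap; you want this comfortably above minOverlap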


Jatbee32 commented on July 18, 2024


AstrobioMike commented on July 18, 2024

Well, the example workflow data are 2x300; not sure yet if that's part of what's going on. What primers were used on yours?


Jatbee32 commented on July 18, 2024


AstrobioMike commented on July 18, 2024

The actual minOverlap I used for my run was 125. I came up with this based
on the fact that the two trim lengths were 200 each and the original cycles
were 250, so 200 + 200 - 251 = 149. But I dropped it to 125 just in case.

Is this saying the forward and reverse were trimmed down to length 200? If that's the case, for spanning 515-806, there'd only be a nominal overlap of 109 I think (forward going from 515+200=715; reverse going 806-200=606; so the overlap window is 715-606=109; unless my brain is not working properly right now, which happens a lot). So you might still be losing a lot of things due to that with a minOverlap of 125. I'd drop it really low and see if a bunch more seqs make it through.
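
Something like this is all I'd mean by dropping it low (a sketch; the dada_* and derep_* object names are placeholders for your denoised and dereplicated objects, and 12 is just dada2's default minOverlap):

library(dada2)

# re-run the merge with a low minimum overlap and see how many more reads make it through
merged_amplicons <- mergePairs(dada_forward, derep_forward, dada_reverse, derep_reverse,
                               minOverlap=12, verbose=TRUE)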

I have another clarification question for you based on some points you make
in the full workflow if you don't mind. You make the point in the chimera
step that if the sum(seqtab.nochim)/sum(seqtab) is less than 80% you might
want to go back to the filterAndTrim step and increase the maxEE. How high
can the maxEE go? I've only really seen 5 as the highest.

I actually can't find this; can you tell me more specifically where I'm saying that, ha. I'm having a hard time figuring out the context or what I meant, and I'm not sure at the moment why I would relate chimeras to the quality filtering... But to the question of how high maxEE can go, there usually aren't set answers to things like that unfortunately, as it's up to the researcher to decide how loose or strict to be with the data, and like lots of things it's a balance between keeping as much data as we can vs. keeping the highest-quality data we can. The only other thing I'd say I might consider is that maxEE sets the maximum number of expected errors allowed in a read, so we'd also want to think about it differently for 50 bp reads versus 250 bp reads (allowing 5 expected errors in 50 bps is a bigger deal than allowing 5 in 250 bps).
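
For what it's worth, the expected-error calculation itself is just summed per-base error probabilities, which is why read length matters (toy numbers, not dada2 code):

# expected errors for a read = sum of 10^(-Q/10) over its quality scores
q <- 30                      # a uniform Phred quality score, just for illustration
per_base_error <- 10^(-q/10) # = 0.001 expected errors per base at Q30
per_base_error * 50          # ~0.05 expected errors across a 50 bp read
per_base_error * 250         # ~0.25 expected errors across a 250 bp read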

But yeah, please let me know where I have that suggestion you're talking about, because I'd like to make sure I was making sense, ha.

