
Comments (9)

benjjneb commented on July 18, 2024

I don’t remember if I ever knew this for sure, but yeah, I assume the same thing: that these are unique sequences that, based on the settings, weren’t successfully "assigned" to an already inferred dada2-"trusted" sequence, and weren’t strong enough on their own to be trusted as accurate (I’d guess due to errors and/or too few counts of that unique sequence). But that’s just a guess. I tried looking at the methods of the dada2 paper again, but the stats are just over my head :/ Maybe @benjjneb can throw in here on whether that guess is on the right track or not :)

I approve this explanation :)


benjjneb commented on July 18, 2024

A note: I would almost never recommend setting a high minOverlap. The main purpose of minOverlap is to prevent spurious merging due to random matching between a few bases at the beginning and the end, not to specify the final amplicon length.

You can filter out off-target amplicon lengths once you get to the final merged table (e.g. in our tutorial this is shown, perhaps in Mike's too). That's the better way to achieve that goal.
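
For example, something like this on the merged sequence table (just a sketch; it assumes the dada2 package is loaded, the merged table is called seqtab, and that ~250-256 bp matches your expected amplicon length):

library(dada2)

# look at the distribution of merged sequence lengths
table(nchar(getSequences(seqtab)))

# keep only sequences whose length falls in the expected range for the amplicon
# (250:256 is just an example window; adjust it to your target region)
seqtab_filt <- seqtab[, nchar(colnames(seqtab)) %in% 250:256]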


AstrobioMike commented on July 18, 2024

Oh, no! I'm sorry, Emily! I meant to come back to this forever ago when I had time, and I totally dropped the ball :(

Seeing as how late I am, are you beyond this now? Or would it still help for me to take a look and give you my thoughts? Sorry again.


Jatbee32 commented on July 18, 2024


AstrobioMike commented on July 18, 2024

Hiya!

  1. In your summary tab, you have a drop (only a couple of reads) from
    filtered to the dada_f and dada_r. Is this due to the error rate and the
    dereplication rates given for the parameters?

I don’t remember if I ever knew this for sure, but yeah, I assume the same thing: that these are unique sequences that, based on the settings, weren’t successfully "assigned" to an already inferred dada2-"trusted" sequence, and weren’t strong enough on their own to be trusted as accurate (I’d guess due to errors and/or too few counts of that unique sequence). But that’s just a guess. I tried looking at the methods of the dada2 paper again, but the stats are just over my head :/ Maybe @benjjneb can throw in here on whether that guess is on the right track or not :)

  2. If you place parameters for certain lengths within functions such as
    filterAndTrim() and mergePairs(), it should guarantee that the reads are
    greater than or equal to the parameter you set, correct? So for your example
    workflow, there shouldn't be a read that is less than 175?

So yeah, this should be the case when setting the “minLen” parameter in filterAndTrim(), and there shouldn’t be any reads shorter than 175 after that in the example I have up. I don’t see a similar setting for mergePairs() though. Also, in looking again at the options regarding lengths (which I feel like I confuse myself in a different way each time I think about them), it seems my setting “minLen=175” in that example is superfluous anyway, as the “truncLen” parameter not only cuts reads longer than that down to that size, but also throws away any reads shorter than that. So truncLen=c(250,200) is saying the minimum length is going to be 250 for the forward and 200 for the reverse, in addition to chopping anything longer down to those lengths. I should revisit that and update it, thanks for making me look at it again :)
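
Just to illustrate that point (a rough sketch, not necessarily the exact call in the workflow; the file-path variable names here are placeholders for the raw and filtered fastq paths), with truncLen set, a separate minLen doesn't add anything:

library(dada2)

# truncLen=c(250,200) truncates forward reads to 250 bp and reverse reads to 200 bp,
# and any read shorter than its truncation length gets discarded, so adding
# minLen=175 on top of this is redundant
filtered_out <- filterAndTrim(forward_reads, filtered_forward_reads,
                              reverse_reads, filtered_reverse_reads,
                              truncLen=c(250,200), maxEE=c(2,2),
                              rm.phix=TRUE, multithread=TRUE)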

  3. Finally, does the error change each time you run it? Sometimes when I
    have run a sample set, it will use all the samples present to learn the
    error rate versus sometimes it will only use a couple of the samples. Is
    this due to the size of the samples for the best efficiency?

Yeppers, it’s due to the size of the dataset. By default it will pull in reads until it reaches 100 million bases and then stop adding samples; if all samples together don’t reach that, it will use all samples/reads.
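
For reference, that cutoff is the nbases argument of learnErrors() (a sketch; the input object name is a placeholder for your filtered forward reads):

library(dada2)

# reads are pulled in sample by sample until roughly nbases total bases are reached
# (default 1e8, i.e. 100 million); smaller datasets just end up using everything
err_forward_reads <- learnErrors(filtered_forward_reads, nbases=1e8, multithread=TRUE)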

Regarding your specific dropped stuff, I see why you were asking about the dropped sequences from the "filtered" to the "dada_f" and "dada_r" columns following the dada() step. For that to be explained by what we guessed above in your question 1, I would think the samples that lost a lot of reads there would have to have a lot of unique sequences compared to their total sequences. I don't know what this ratio typically looks like, but here's something ugly I just threw together to see it for the rocks in the happy belly example:

# tally unique sequences and total reads per sample from the derep objects
num_uniqs_R1 <- integer() ; num_reads_R1 <- integer() ; num_uniqs_R2 <- integer() ; num_reads_R2 <- integer()

for (index in seq_along(samples)) {
  print(samples[index])
  # $uniques has one entry per unique sequence; $map has one entry per read
  num_uniqs_R1[index] <- length(derep_forward[[samples[index]]]$uniques)
  num_reads_R1[index] <- length(derep_forward[[samples[index]]]$map)
  num_uniqs_R2[index] <- length(derep_reverse[[samples[index]]]$uniques)
  num_reads_R2[index] <- length(derep_reverse[[samples[index]]]$map)
}

# combine the counts with the percent of reads that are unique in each direction
df <- data.frame(row.names=samples, "num_uniqs_R1"=num_uniqs_R1, "num_reads_R1"=num_reads_R1, "num_uniqs_R2"=num_uniqs_R2, "num_reads_R2"=num_reads_R2, "percent_uniq_R1"=round(num_uniqs_R1 / num_reads_R1 * 100, 2), "percent_uniq_R2"=round(num_uniqs_R2 / num_reads_R2 * 100, 2))
df
      num_uniqs_R1 num_reads_R1 num_uniqs_R2 num_reads_R2 percent_uniq_R1 percent_uniq_R2
B1             552         1498          514         1498           36.85           34.31
B2             224          529          203          529           42.34           38.37
B3             186          457          171          457           40.70           37.42
B4             203          475          189          475           42.74           39.79
BW1            723         2109          660         2109           34.28           31.29
BW2           2760         5527         2506         5527           49.94           45.34
R10           5501        10354         5054        10354           53.13           48.81
R11BF         3452         8028         3113         8028           43.00           38.78
R11           4846         8138         4568         8138           59.55           56.13
R12           9747        14423         9288        14423           67.58           64.40
R1A           6894        10906         6445        10906           63.21           59.10
R1B           9380        14672         8799        14672           63.93           59.97
R2            9310        15660         8762        15660           59.45           55.95
R3           10225        15950         9710        15950           64.11           60.88
R4           10407        17324         9613        17324           60.07           55.49
R5           10981        16728        10432        16728           65.64           62.36
R6            8520        13338         8047        13338           63.88           60.33
R7            4432         7331         4145         7331           60.46           56.54
R8            6171        11192         5696        11192           55.14           50.89
R9            4234         7853         3970         7853           53.92           50.55
> summary(df$percent_uniq_R1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.28   42.94   57.30   54.00   63.38   67.58 
> summary(df$percent_uniq_R2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  31.29   39.54   53.19   50.34   59.32   64.40

So in this dataset we're seeing a little over 50% of reads being unique (the rocks are generally a pretty diverse environment). Maybe see what that looks like for your stuff?

And for the merging, I'm sure this is one of the primary angles you're looking into with Ben's help, but it's easy to lose a lot of reads there if they don't overlap enough to successfully merge, which depends on the expected amplicon size and the sequencing setup. The step where we most have to think about this is probably the filterAndTrim() step. Have you spent time already making sure your trimming isn't cutting the reads too short to successfully merge?
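
A quick back-of-the-envelope way to think about it (just arithmetic, with made-up example numbers; plug in your own truncation lengths and primer-trimmed amplicon length):

# expected overlap is roughly: truncLen_F + truncLen_R - amplicon length (after primer removal)
trunc_f <- 250 ; trunc_r <- 200   # example truncation lengths
amplicon_len <- 253               # e.g. a primer-trimmed ~253 bp V4 fragment
trunc_f + trunc_r - amplicon_len  # ~197 bp of overlap; you want this comfortably above minOverlap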


Jatbee32 commented on July 18, 2024


AstrobioMike commented on July 18, 2024

Well, the example workflow data are 2x300; not sure yet if that's part of what's going on. What primers were used on yours?


Jatbee32 commented on July 18, 2024


AstrobioMike commented on July 18, 2024

The actual minOverlap I used for my run was 125. I came up with this based
on the fact that the two trim lengths were 200 each and the original cycles
were 250, so 200 + 200 - 251 = 149. But I dropped it to 125 just in case.

Is this saying the forward and reverse were trimmed down to length 200? If that's the case, for spanning 515-806, there'd only be a nominal overlap of 109 I think (forward going from 515+200=715; reverse going 806-200=606; so the overlap window is 715-606=109; unless my brain is not working properly right now, which happens a lot). So you might still be losing a lot of things due to that with a minOverlap of 125. I'd drop it really low and see if a bunch more seqs make it through.
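
Something like this is all I'd mean by dropping it low (a sketch; the dada_* and derep_* object names are placeholders for your denoised and dereplicated objects, and 12 is just dada2's default minOverlap):

library(dada2)

# re-run the merge with a low minimum overlap and see how many more reads make it through
merged_amplicons <- mergePairs(dada_forward, derep_forward, dada_reverse, derep_reverse,
                               minOverlap=12, verbose=TRUE)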

I have another clarification question for you based on some points you make
in the full workflow if you don't mind. You make the point in the chimera
step that if the sum(seqtab.nochim)/sum(seqtab) is less than 80% you might
want to go back to the filterAndTrim step and increase the maxEE. How high
can the maxEE go? I've only really seen 5 as the highest.

I actually can't find this; can you tell me more specifically where I'm saying that, ha. I'm having a hard time figuring out the context or what I meant, and I'm not sure at the moment why I would relate chimeras to the quality filtering... But to the question of how high maxEE can go, there usually aren't set answers to things like that unfortunately, as it's up to the researcher to decide how loose or strict to be with the data, and like lots of things it's a balance between keeping as much data as we can vs. keeping the highest-quality data we can. The only other thing I'd say I might consider is that maxEE sets the maximum number of expected errors allowed in a read, so we'd also want to think about it differently for 50 bp reads versus 250 bp reads (allowing 5 expected errors in 50 bps is a bigger deal than allowing 5 in 250 bps).
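
For what it's worth, the expected-error calculation itself is just summed per-base error probabilities, which is why read length matters (toy numbers, not dada2 code):

# expected errors for a read = sum of 10^(-Q/10) over its quality scores
q <- 30                      # a uniform Phred quality score, just for illustration
per_base_error <- 10^(-q/10) # = 0.001 expected errors per base at Q30
per_base_error * 50          # ~0.05 expected errors across a 50 bp read
per_base_error * 250         # ~0.25 expected errors across a 250 bp read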

But yeah, please let me know where I have that suggestion you're talking about, because I'd like to make sure I was making sense, ha.

