Questions: 1. In analyzing an assembly made from diploid DNA using Illumina reads from


diploid assembly, haploid reads about merfin HOT 4 CLOSED

rwhetten commented on September 27, 2024

diploid assembly, haploid reads

from merfin.

Comments (4)

gf777 commented on September 27, 2024

Hi @rwhetten

What a big genome!

"I take this to mean that the Illumina reads should lack kmers found in the assembly, but any kmers present in the reads and not in the assembly should be either sequencing errors or assembly errors." YES

1. If your read set represents a haploid genome, I think ideally you'd want to separate the two haplotypes as much as possible (i.e. have a haploid representation of the genome). You will still have issues with kmers from SNPs and such, as the two haplotypes will be scrambled, but at least the diploid fraction will be represented faithfully.
2. Even if you were to purge the genome, you'd normally combine primary and alternate assemblies into one when running Merfin. Here the complication is that the reads are haploid. So I guess 1) applies, and you'd want to purge first.
3. Yes, definitely. I suppose that this big genome is due to some massive stuff repeated over and over. Check the gs2 (GenomeScope2) plot in log scale: all of that is essentially not modelled by gs2.

K-mers not found in reads (missing) --> this is usually evidence of either false duplications or retained haplotypes. Have you tried to generate Merqury spectra plots with the Illumina data? Not sure what to expect given the haploid nature of your Illumina data, but it could be helpful.
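
As a rough illustration of what "missing" means here, the sketch below lists canonical k-mers that occur in an assembly but never in the reads. It is a toy, in-memory version for short strings only. It is not how Merfin or Merqury compute this (they work from meryl k-mer databases), and the function names `canonical_kmers` and `missing_from_reads` are made up for this example.

```python
def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k):
    """Set of canonical k-mers (the lexicographically smaller of a k-mer
    and its reverse complement); k-mers with non-ACGT bases are skipped."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def missing_from_reads(assembly_seqs, read_seqs, k):
    """Canonical k-mers present in the assembly but absent from the reads."""
    asm = set().union(*(canonical_kmers(s, k) for s in assembly_seqs))
    reads = set().union(*(canonical_kmers(s, k) for s in read_seqs))
    return asm - reads


# Toy example (hypothetical sequences): the reads cover only the start
# of the assembled sequence, so the remaining k-mers are "missing".
missing = missing_from_reads(["ACGTTGCA"], ["ACGTT"], k=4)
print(sorted(missing))
```

A set difference like this says nothing about why a k-mer is missing. Telling false duplications or retained haplotypes apart from plain assembly errors still needs the copy-number view from the spectra plots.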


rwhetten commented on September 27, 2024

Thanks for the quick reply! I had run the GenomeScope2 modeling with -m 100 because the interesting peaks are at lower multiplicity. Repeating that analysis with -m 200 million gave a size estimate of 17 Gb: not quite correct, but closer than before. There are a lot of kmers with multiplicity > 1 million.
A Merqury spectra_cn plot of the diploid wtdbg2 assembly compared to the haploid Illumina reads shows kmers at the same multiplicity in the reads with varying copy number in the assembly, from 0 to 4. I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. I'm running a minimap2 alignment of the long reads to the assembly for use in purging, but everything takes a long time with this size genome.
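
To see why the multiplicity cap changes the size estimate so much, here is a minimal sketch that sums k-mer mass from a two-column histogram (multiplicity, number of distinct k-mers) with and without an upper cutoff. The estimate used, total k-mer occurrences divided by the haploid peak coverage, is a deliberately naive stand-in and not GenomeScope2's fitted model. The histogram values, the peak position, and the function names are invented for illustration.

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into a dict {multiplicity: count}."""
    hist = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            hist[int(parts[0])] = int(parts[1])
    return hist


def naive_genome_size(hist, haploid_peak, min_mult=1, max_mult=None):
    """Total k-mer occurrences within [min_mult, max_mult] divided by the
    haploid peak coverage. max_mult=None means no upper cap."""
    total = sum(
        mult * count
        for mult, count in hist.items()
        if mult >= min_mult and (max_mult is None or mult <= max_mult)
    )
    return total / haploid_peak


# Hypothetical toy histogram: an error peak at 1, a haploid peak at 10,
# a small 2-copy peak at 20, and two extremely high-copy repeat k-mers.
toy = parse_histogram("1 1000\n10 500\n20 100\n1000000 2\n")

capped = naive_genome_size(toy, haploid_peak=10, min_mult=5, max_mult=100)
uncapped = naive_genome_size(toy, haploid_peak=10, min_mult=5)
print(capped, uncapped)
```

In the toy data, two k-mers at a multiplicity of one million carry far more sequence than all the low-multiplicity k-mers combined, which is the kind of repeat content a low cap discards.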


gf777 commented on September 27, 2024

That's cool. Yes, I'd say the results with gs2 are expected, but why are you limiting kmer cov? It sounds like an example where you don't want to do that.

"I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. " YES

What is interesting to me is that the different copies are all at the same kmer multiplicity. I'd say that with haploid reads you expect both the haploid and diploid fraction of the genome to be at haploid coverage; however, 4-copy and above will be seen 2, 3, 4, etc. times even in a haploid read set. So basically all of the green, purple, and orange, and some of the blue, is false duplications.
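
The reasoning above can be sketched as a simple rule of thumb. The code assumes that the expected haploid copy number of a k-mer is its read multiplicity divided by the haploid peak coverage, and that an unpurged diploid assembly may legitimately carry up to twice that many copies (one per haplotype). Both the thresholds and the labels are assumptions made for this example, not something Merqury or Merfin reports.

```python
def expected_haploid_copies(read_mult, haploid_peak):
    """Estimated copies per haploid genome, rounded half up, at least 1."""
    return max(1, int(read_mult / haploid_peak + 0.5))


def classify_kmer(read_mult, asm_copies, haploid_peak):
    """Label a k-mer by comparing its assembly copy number with what the
    haploid read multiplicity suggests (heuristic, see assumptions above)."""
    if read_mult == 0:
        return "assembly-only"
    if asm_copies == 0:
        return "missing from assembly"
    expected = expected_haploid_copies(read_mult, haploid_peak)
    if asm_copies < expected:
        return "under-represented"
    if asm_copies == expected:
        return "consistent"
    if asm_copies <= 2 * expected:
        # Could be both haplotypes kept in the diploid assembly,
        # or a false duplication: the counts alone cannot tell.
        return "ambiguous"
    return "likely false duplication"


# Hypothetical haploid peak at 30x; assembly copy numbers 0 to 4 for
# k-mers that all sit at the haploid read multiplicity.
for copies in range(5):
    print(copies, classify_kmer(30, copies, haploid_peak=30))
```

Under these assumptions, k-mers at the haploid peak that appear 3 or 4 times in the assembly are flagged as likely false duplications, while those appearing twice stay ambiguous. That matches the reading that only some of the 2-copy k-mers are false duplications.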


rwhetten commented on September 27, 2024

@gf777 I included the -m option because I thought it was required - 200 million is larger than the highest multiplicity, so the results from genomescope2 are the same without an upper limit.
Thanks for your guidance in interpreting the output plots. I'll try some different approaches to purging the false duplications.

