
diploid assembly, haploid reads (merfin issue, closed)

rwhetten commented on September 27, 2024
diploid assembly, haploid reads

from merfin.

Comments (4)

gf777 commented on September 27, 2024

Hi @rwhetten

What a big genome!

"I take this to mean that the Illumina reads should lack kmers found in the assembly, but any kmers present in the reads and not in the assembly should be either sequencing errors or assembly errors." YES

  1. If your read set represents a haploid genome, I think ideally you'd want to try to separate the two haplotypes as much as possible (i.e. have a haploid representation of the genome). You will still have issues with kmers from SNPs and such, as the two haplotypes will be scrambled, but at least the diploid fraction will be represented faithfully.

  2. Even if you were to purge the genome, you'd normally combine primary and alternate assemblies into one when running Merfin. Here the complication is that the reads are haploid. So I guess 1) applies, and you'd want to purge first.

  3. Yes, definitely. I suppose that this big genome is due to some massive stuff repeated over and over. Check the gs2 plot in log scale; all of that is essentially not modelled by gs2.

K-mers not found in reads (missing) --> this is usually evidence of either false duplications or retained haplotypes. Have you tried to generate Merqury spectra plots with the Illumina data? Not sure what to expect given the haploid nature of your Illumina data, but it could be helpful.
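To make the idea of "missing" k-mers concrete, here is a minimal Python sketch that lists canonical k-mers present in an assembly but absent from a read set. It is a toy illustration only, not how Merfin or Merqury work internally. The function names `canonical_kmers` and `assembly_only_kmers` are made up for this example, and in-memory sets would not scale to a genome of this size.

```python
def canonical_kmers(seq, k):
    """Return the set of canonical k-mers (lexicographic min of k-mer and its reverse complement)."""
    complement = str.maketrans("ACGT", "TGCA")
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous bases such as N.
        if set(kmer) - set("ACGT"):
            continue
        revcomp = kmer.translate(complement)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def assembly_only_kmers(assembly_seqs, reads, k):
    """K-mers found in the assembly but in none of the reads (hypothetical helper)."""
    asm = set().union(*(canonical_kmers(s, k) for s in assembly_seqs))
    rds = set().union(*(canonical_kmers(r, k) for r in reads))
    return asm - rds


missing = assembly_only_kmers(["ACGTTGCA"], ["ACGTT"], k=4)
print(sorted(missing))
```

K-mers reported this way are candidates for assembly errors, false duplications, or retained haplotypes, as described above. The sketch cannot tell those cases apart.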


rwhetten commented on September 27, 2024

Thanks for the quick reply! I had run the Genomescope2 modeling with -m 100 because the interesting peaks are at lower multiplicity. Repeating that analysis with -m 200 million gave a size estimate of 17 Gb, which is not quite correct but closer than before. There are a lot of kmers with multiplicity > 1 million.
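A rough Python sketch of why the multiplicity cutoff matters for the size estimate: genome size is approximated here as the total number of k-mer occurrences divided by the haploid coverage, so discarding high-multiplicity k-mers removes the repeat fraction from the total. This is a simplification for illustration. GenomeScope2 fits a model to the histogram rather than computing this plain sum, and the histogram values and the `estimate_genome_size` name below are invented for the example.

```python
def estimate_genome_size(histogram, haploid_cov, max_cov=None):
    """Approximate genome size as sum(multiplicity * count) / haploid coverage.

    histogram maps k-mer multiplicity -> number of distinct k-mers.
    max_cov, if given, drops k-mers above that multiplicity (assumed to
    mimic an upper cutoff such as the -m setting discussed here).
    """
    total_occurrences = 0
    for multiplicity, count in histogram.items():
        if max_cov is not None and multiplicity > max_cov:
            continue
        total_occurrences += multiplicity * count
    return total_occurrences / haploid_cov


# Toy histogram: a single-copy peak at 20x, a two-copy peak at 40x,
# and one extremely abundant repeat k-mer (all values are made up).
toy_histogram = {20: 1000, 40: 200, 2_000_000: 1}

truncated = estimate_genome_size(toy_histogram, haploid_cov=20, max_cov=100)
full = estimate_genome_size(toy_histogram, haploid_cov=20)
print(truncated, full)
```

In the toy data, a single high-multiplicity k-mer accounts for most of the estimated size. That matches the pattern described above, where raising the cutoff moves the estimate closer to the expected value.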
A Merqury spectra_cn plot of the diploid wtdbg2 assembly compared to the haploid Illumina reads shows kmers at the same multiplicity in the reads with varying copy number in the assembly, from 0 to 4. I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. I'm running a minimap2 alignment of the long reads to the assembly for use in purging, but everything takes a long time with this size genome.
[Image: Merqury spectra-cn plot]


gf777 commented on September 27, 2024

That's cool. Yes, I'd say the results with gs2 are expected, but why are you limiting kmer cov? It sounds like an example where you don't want to do that.

"I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. " YES

What is interesting to me is that the different copies are all at the same kmer multiplicity. I'd say that with haploid reads you expect both the haploid and diploid fraction of the genome to be at haploid coverage; however, 4-copy and above will be seen 2, 3, 4, etc. times even in a haploid read set. So basically all of the green, purple, and orange, and some of the blue, are false duplications.
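One way to read the reasoning above is as a per-k-mer check, sketched below in Python. The assumptions are mine rather than anything stated in the thread: the expected haploid copy number is taken as read multiplicity divided by haploid coverage, and a two-haplotype assembly may legitimately hold up to twice that many copies. `classify_kmer` and its labels are hypothetical names, not part of Merqury or Merfin.

```python
def classify_kmer(read_mult, asm_copies, haploid_cov, asm_ploidy=2):
    """Label a k-mer by comparing its assembly copy number with haploid read depth.

    read_mult   : multiplicity of the k-mer in the haploid read set
    asm_copies  : number of times the k-mer occurs in the assembly
    haploid_cov : read depth of a single-copy haploid k-mer
    asm_ploidy  : haplotypes represented in the assembly (assumed 2 here)
    """
    if read_mult == 0:
        return "assembly-only" if asm_copies > 0 else "absent"
    if asm_copies == 0:
        return "missing-from-assembly"
    # Copies expected in one haplotype, inferred from read depth.
    expected_haploid_copies = max(1, round(read_mult / haploid_cov))
    if asm_copies > expected_haploid_copies * asm_ploidy:
        return "likely-false-duplication"
    # Within the allowed range. This cannot rule out a false duplication,
    # for example a haplotype-specific k-mer that appears twice.
    return "consistent"


print(classify_kmer(read_mult=20, asm_copies=3, haploid_cov=20))
print(classify_kmer(read_mult=20, asm_copies=2, haploid_cov=20))
```

Under these assumptions, k-mers at single-copy read depth with three or more assembly copies are flagged, while two-copy k-mers stay ambiguous. That is consistent with only some of the blue being attributed to false duplication.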


rwhetten commented on September 27, 2024

@gf777 I included the -m option because I thought it was required. 200 million is larger than the highest multiplicity, so the results from genomescope2 are the same as they would be without an upper limit.
Thanks for your guidance in interpreting the output plots. I'll try some different approaches to purging the false duplications.

