Comments (4)
Hi @rwhetten
What a big genome!
"I take this to mean that the Illumina reads should lack kmers found in the assembly, but any kmers present in the reads and not in the assembly should be either sequencing errors or assembly errors." YES
1) If your read set represents a haploid genome, I think ideally you'd want to separate the two haplotypes as much as possible (i.e. have a haploid representation of the genome). You will still have issues with kmers from SNPs and such, as the two haplotypes will be scrambled, but at least the diploid fraction will be represented faithfully.
2) Even if you were to purge the genome, you'd normally combine primary and alternate assemblies into one when running Merfin. Here the complication is that the reads are haploid. So I guess 1) applies, and you'd want to purge first.
3) Yes, definitely. I suppose that this big genome is due to some massive stuff repeated over and over. Check the gs2 plot in log scale; all of that is essentially not modelled by gs2.
K-mers not found in reads (missing) --> this is usually evidence of either false duplications or retained haplotypes. Have you tried generating Merqury spectra plots with the Illumina data? I'm not sure what to expect given the haploid nature of your Illumina data, but it could be helpful.
from merfin.
Thanks for the quick reply! I had run the Genomescope2 modeling with -m 100 because the interesting peaks are at lower multiplicity - repeating that analysis with -m 200 million gave a size estimate of 17 Gb - not quite correct, but closer than before. There are a lot of kmers with multiplicity > 1 million.
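To illustrate why a low maximum k-mer coverage shrinks the size estimate, here is a minimal sketch using a naive "total k-mers divided by haploid coverage" estimate. This is not GenomeScope2's mixture model, and the function name, the toy histogram, and the coverage values are all hypothetical, chosen only to show the effect of a cutoff on a repeat-rich genome.

```python
def estimate_genome_size(histogram, haploid_peak, min_cov=1, max_cov=None):
    """Naive genome size estimate from a k-mer histogram.

    histogram: dict mapping multiplicity -> number of distinct k-mers.
    haploid_peak: k-mer coverage of single-copy sequence (assumed known).
    min_cov: multiplicities below this are treated as sequencing errors.
    max_cov: multiplicities above this are ignored (None means no limit).
    """
    total_kmers = 0
    for multiplicity, n_distinct in histogram.items():
        if multiplicity < min_cov:
            continue
        if max_cov is not None and multiplicity > max_cov:
            continue
        total_kmers += multiplicity * n_distinct
    return total_kmers / haploid_peak


# Hypothetical toy histogram: a few extremely high-multiplicity k-mers
# stand in for the massive repeats discussed in this thread.
toy_histogram = {
    1: 5_000,        # mostly sequencing errors
    20: 1_000,       # single-copy sequence at the haploid peak
    40: 200,         # two-copy sequence
    2_000: 50,       # moderate repeats
    1_000_000: 2,    # very high-copy repeats
}

capped = estimate_genome_size(toy_histogram, haploid_peak=20, min_cov=5, max_cov=100)
uncapped = estimate_genome_size(toy_histogram, haploid_peak=20, min_cov=5)
print(capped, uncapped)
```

In this toy case the capped estimate ignores almost all of the repeat content, which is the same direction of bias as the low estimate obtained with -m 100.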
A Merqury spectra_cn plot of the diploid wtdbg2 assembly compared to the haploid Illumina reads shows kmers at the same multiplicity in the reads with varying copy number in the assembly, from 0 to 4. I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. I'm running a minimap2 alignment of the long reads to the assembly for use in purging, but everything takes a long time with this size genome.
That's cool. Yes, I'd say the results with gs2 are expected, but why are you limiting kmer cov? It sounds like an example where you don't want to do that.
"I take this to mean the assembly is missing some non-error sequences present in the Illumina data, while some sequences present once in the Illumina data are present 2 to 4 times in the assembly. " YES
What is interesting to me is that the different copies are all at the same kmer multiplicity. I'd say that with haploid reads you expect both the haploid and diploid fractions of the genome to be at haploid coverage; however, 4-copy sequences and above will be seen 2, 3, 4, etc. times even in a haploid read set. So basically all of the green, purple, and orange, and some of the blue, is false duplications.
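The reasoning above can be sketched as a small tally in the spirit of a spectra-cn plot: group read k-mers by their copy number in the assembly, then look at their multiplicity in the reads. This is only an illustration on toy strings, not how Merqury is implemented; the function names are hypothetical, and for brevity it does not use canonical k-mers.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectra_cn(read_counts, asm_counts, max_cn=4):
    """Tally read k-mers by assembly copy number and read multiplicity.

    Returns (table, asm_only) where table[cn][multiplicity] is the number
    of distinct read k-mers found cn times in the assembly, and asm_only
    is the number of distinct assembly k-mers absent from the reads.
    """
    table = {}
    for kmer, multiplicity in read_counts.items():
        cn = min(asm_counts.get(kmer, 0), max_cn)  # copy numbers above max_cn are pooled
        table.setdefault(cn, Counter())[multiplicity] += 1
    asm_only = sum(1 for kmer in asm_counts if kmer not in read_counts)
    return table, asm_only


# Toy data: every read k-mer has multiplicity 2, but the assembly holds
# some of them twice and lacks others entirely.
reads = ["ACGTTG", "ACGTTG"]
assembly = ["ACGTACGT"]

table, asm_only = spectra_cn(count_kmers(reads, 3), count_kmers(assembly, 3))
for cn in sorted(table):
    print(cn, dict(table[cn]))
print("assembly-only k-mers:", asm_only)
```

K-mers with the same read multiplicity but different assembly copy numbers, as in this toy output, correspond to the pattern described above: copy number 0 means sequence missing from the assembly, and copy number 2 or more at single-copy read multiplicity points to duplication in the assembly.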
@gf777 I included the -m option because I thought it was required - 200 million is larger than the highest multiplicity, so the results from genomescope2 are the same without an upper limit.
Thanks for your guidance in interpreting the output plots. I'll try some different approaches to purging the false duplications.