marbl / merqury Goto Github PK

View Code? Open in Web Editor NEW

272.0 272.0 19.0 3.87 MB

k-mer based assembly evaluation

License: Other

Shell 57.74% R 10.56% Java 31.71%

merqury's People

Contributors

Stargazers

Watchers

Forkers

bredeson yedomon dendisuhubdy kant sailfish009 javibio-git shehongbing biocko yi1873 wook2014 wolongac alguoo314 licheng0921 alex-dtg fjruizruano ningshuang-yao pickettbd kbarslan odielwoolmore

merqury's Issues

kmer completeness file empty

Hello,

When I run merqury, it runs without error, I get all the spectra-cn plots, and the qv file populates okay, I can even see meryl dbs generated for my assemblies, but the kmer completeness file is empty. Also, the asm only wig files are empty as well.
What do you think the issue could be?

Thank you,
Unnati

Error reported when run meryl, but looks like it has run fine

Hi,
I've just been running meryl on some fastq.gz files, and for one of the files I got the following errors in the log. But it looks like it has completed fine, and I have the output files. Can I just ignore it?

gzip: ERR126028_2.fastq.gz: invalid compressed data--crc error

gzip: ERR126028_2.fastq.gz: invalid compressed data--length error

Thanks,
Jenni

Whether high ploidy affects the assessment of completeness

We evaluated the diploid genome, the result of qv is ~60, the result of k-mer completeness ~95. However, when we evaluating the high ploidy genome(same species, different ploidy), the result of qv is also ~60, k-mer completeness was only 70-80.

On one hand , We think that, the assembly of a high-ploidy genome is more difficult than diploid genome, the quality of the resulting genome is relatively low, so the k-mer completeness is not very good. On the other hand, Merqury is mainly used to evaluate the quality of diploid genomes. whether Merqury can perform well in evaluating high ploidy genomes?

Best
JX

Difference in 2 and 3 copy kmers between read data sets

Hi,

I was looking to see what else I could pull out of the data and decided to compare the 2-copy and 3-copy kmers in different genomes.

For this I used the output files:
genome.genome.spectra-cn.hist
genome.genome.only.hist

And then I used awk to extract 2 and 3 copy kmers from the spectra-cn.hist file and sum them ($3 in file).
For the 2-copy kmers I then added the 2-copy number from the only.hist file.

However, using the same genome, but two different data sets I found different results:

Genome    | 2-copy     | 3 copy*readdata
Tc_171026 | 66,301,944 | 12,801,905*Tci2W&NW
Tc_171026 | 65,692,035 | 13,050,376*poolseq

I can't understand why this would be? I'm assuming that the 2 and 3 copy etc are related to the number of times they occur in the genome. So, for the 2-copy, although the output in only.hist will be different between different read sets, I don't understand why the sum of all the 2-copy kmers in both files would change.

What am I getting wrong?

Thanks!
Jenni

Merqury over-estimates genome completeness

Hey,
Great tool, it's very useful and easy to use!

I ran Merqury on an Illumina-based assembly I had and obtained a completeness estimate much larger than I expected, given the genome's known size. I think I found a semantic bug in it's completeness calculation:

When calculating the size of a genome with k-mers, the calculation must take into account the multiplicity of the k-mer as well as its frequency in order to normalize for genomic copy number. As Merqury is currently written, it only sums up the frequency and doesn't take copy number into account (it is basically assuming all k-mers have a copy number of 1 in the genome).

The equation for calculating the fraction of genome assembled is:

G = k_d * sum [ k_i * f_a(k_i) ] / sum [ k_i * f_r(k_i) ] ,  for i in 1 -> n

where:
G is the fraction of genome assembled (i.e., the completeness)
k_i is the value at multiplicity i
k_d is the value of the diploid peak multiplicity
f_a(k_i) is the frequency at multiplicity k_i in the assembly
f_r(k_i) is the frequency at multiplicity k_i in the reads

See spectra-cn.sh, lines 132-134:

TOTAL=`meryl statistics $read_solid | head -n3 | tail -n1 | awk '{print $2}'`
ASM=`meryl statistics $asm.solid.meryl | head -n3 | tail -n1 | awk '{print $2}'`
echo -e "${asm}\tall\t${ASM}\t${TOTAL}" | awk '{print $0"\t"((100*$3)/$4)}' >> completeness.stats

If we let $dmult to be the full-depth diploid peak multiplicity, then the above lines could be modified to instead be:

TOTAL=`meryl statistics $read_solid | tail -n +11 | awk 'BEGIN{s=0} {s+=($1*$2)} END{print s}'`
ASM=`meryl statistics $asm.solid.meryl | tail -n +11 | awk 'BEGIN{s=0} {s+=($1*$2)} END{print s}'
echo -e "${asm}\tall\t${ASM}\t${TOTAL}" | awk -v d=$dmult '{print $0"\t"((100*d*$3)/$4)}' >> completeness.stats

I hope this is useful.

Best,
Jessen

Getting a bed file of kmers in assembly but not in reads

Firstly, thanks for this very useful tool!

I'd like to create a bed file annotating kmers in my assembly which are not in the reads.

As far as I can see, this is done as part of spectra-cn.sh by running

meryl-lookup -bed -sequence $asm_fa -mers ${asm}.0.meryl > ${asm}_only.bed

However, when I try running this, meryl-lookup gives me an error message:

Unknown option '-bed'.
No report-type (-existence, etc) supplied.

I'm using meryl-lookup v1.2 within the merqury conda package.

Does meryl-lookup no longer support bed output, or am I missing something? Is there another way to do this?

Thanks,
Mike

Interpreting the new wig files for QV and hapmer density

Hi Arang,
The latest changes are great, it is significantly faster and easier to work with! My question was around some of the new .wig files, which replaced the .tdf files I never got around to exploring. In this figure, the rows are QV, paternal hapmers, and maternal hapmers respectively.

Is the interpretation that the numbers are respectively the number of

mismatched bases from the readmers
bases contained in paternal/maternal hapmers matching the sequence base

Also when zooming in significantly (100bp), they all tend to take on triangular shapes, which I wouldn't naively expect if the above definitions are near correct. It may be related to some windowed overlap (like the convolution of two boxcar functions), but that doesn't particularly seem realistic given the genomic context and presumed variability in coverage.

Thanks,
Alex

Help interpreting .spectra-cn.ln plots

Hi again,

I am looking over different .spectra-cn.ln plots of Illumina data for different strains of highly repetitive Borrelial genomes. The sequencing depth is between 100-150x but the kmer_multiplicity peak seems to be shifted by 50x or more and I am not sure if that makes a difference. Some of the plots look ok like the first plot I attached. Others look like the second plot where there is a plateau of read only kmers (I am assuming this is because that specific strain had a lower completeness, ~94%) and then I got one weird ones (the last one attached) where the plot of the kmers just plateaus (this happens in datasets with ~1000x coverage). Not sure if there are any issues out right with the second plot or if that is just showing that there are missing kmers in the assembly and I know there is something wrong with the last plot. Any guidance would be appreciated. Thanks!

Test data

Do you have any test data so that I can confirm that my setup is working?

Completeness stats too high, filt count much too high...

Hi,
Thanks again for Merqury, and all your help so far.

I have two different Illumina read sets (one straight sequenced (probably with some PCR) and one with WGA). I ran Merqury with them both together, then separately. The non-WGA completeness stats look sensible to me - about 50% (so, I'm interpreting this as: about 50% of the distinct kmers with counts > the .filt file cut-off present in the read data are also found in the genome assembly).
But the WGA completeness stats are much higher - about 86%, yet the spectra plots are very similar. The filter count for the WGA is set at about 237. I re-plotted the data, extending the x-axis to 250 to check if there was something really odd going on - a massive peak at this multiplicity, but nothing there.

So I moved all of my output files into a new directory, copied the .filt file and changed it to 4 (the count cut-off for the non-WGA, when both read databases were used together the cut-off had been 6). And re-ran Merqury. It ran fine, but the completeness stats are empty.

Is there any particular reason why it might be doing this, or any way that I can over-ride it?

This is the spectra plot:

And this is an extended x-axis one:

Also, could you please clarify for me - when you say distinct kmers in the read data/genome - this is a kmer sequence. It doesn't matter how many times it is present in the data, it is only counted once? And that if a kmer is present in both the genome assembly and the read database, but at a low multiplicity in the read database (so that it is excluded) it is not included in the assembly total kmers?

Thanks!
Jenni

Unable to run meryl with a for script

Hi,

Thanks for Merqury!

I was following your example script on https://github.com/marbl/merqury/wiki/1.-Prepare-meryl-dbs and found that it would not work.

I changed it to:

for i in $i.fastq.gz ;
do

1. Build meryl dbs

/export/home2/jmi45g/anaconda3/envs/Merqury/bin/meryl count k=20 $i.fastq.gz output $i.meryl ;
done

at which point it seemed to want to run, but was doing something strange with my files and was complaining that it wasn't sure what to do with a file of .fastq.gz.fastq.gz

I've run it with them individually, which has worked fine, they just didn't seem able to handle my for script.

Thanks.

Error when running

I ran Meryl without issues on all fastq files + merged the resulting files. After running merqury as follows:

$MERQURY/merqury.sh ./manatee.meryl /blue/conesa/share/manatee_assembly/ont_polishing/polished_sequences.fasta CBtest

It runs without stdout error but the files are empty and there are errors in the log file. See below:

/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 88
Copy = 3 ..

Copy = 4 ..

Copy >4 ..

Copy numbers in k-mers found only in asm

No asm2_fa given. Done.

polished_sequences only

Write output

Get asm only for spectra-asm

Plot CBtest.spectra-asm.hist

Rscript /home/christinaboucher/merqury-1.3/plot/plot_spectra_cn.R -f CBtest.spectra-asm.hist -o CBtest.spectra-asm -z CBtest.dist_only.hist
[1] "x_max: "

Clean up

Done!
cannot remove ‘read.k.polished_sequences.3.meryl’: No such file or directory
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 88: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 89: meryl: command not found
rm: cannot remove ‘read.k.polished_sequences.4.meryl’: No such file or directory
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 95: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 96: meryl: command not found
rm: cannot remove ‘read.k.polished_sequences.gt4.meryl’: No such file or directory
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 102: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 103: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 104: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 105: -: syntax error: operand expected (error token is "-")
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 117: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 121: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 122: meryl: command not found
/home/christinaboucher/merqury-1.3/eval/spectra-cn.sh: line 125: meryl: command not found
Loading required package: argparse
Loading required package: ggplot2
Loading required package: scales
Error in [.data.frame(dat_0, , 3) : undefined columns selected
Calls: spectra_cn_plot -> [ -> [.data.frame
In addition: Warning message:
In max(dat[dat[, 1] != "read-total" & dat[, 1] != "read-only" & :
no non-missing arguments to max; returning -Inf
Execution halted
rm: cannot remove ‘polished_sequences.0.meryl’: No such file or directory
rm: cannot remove ‘read.k.polished_sequences.0.meryl’: No such file or directory
rm: cannot remove ‘read.k.polished_sequences.meryl’: No such file or directory
rm: cannot remove ‘manatee.gt0.meryl’: No such file or directory

Insight into correcting would be great.

Thanks.
Christina

block_n_stats.sh

Dear author
I'm running the example you' ve given, but I ran the script with an error, as follows:

I added a path to the environment variable, why is it still like this

Hope to get your reply

Phase

Dear author
I would like to consult where the color is inconsistent in the picture, (the changing contig),What is the generating file that lets me know which contig it is?

I personally feel, looking from the picture, this switch error is somewhat serious
Last question
Why in the generated switches.txt,Show only the switch error results of one phase, where the other phase is to see?

Discrepancy between Genomescope vs Merqury in predicted genome size

Hi there,

I am working on some Borrelia genomes (lots of linear plasmids [10kb to 200kb] some circular plasmids [30kb-90kb] and a linear chromosome [~900kb]) which contain a high amount of repetitive content. I used Merqury (which is really cool and may prove super useful) and got encouraging results in terms of accuracy and completeness and confirmed my suspicions about some areas that I thought didn't assemble properly. I wanted to quantify the repetitive genome content/confirm predicted genome size and looked to use the read.hist file on Genomescope. The predicted genome size on Genomescope is much larger than what the completeness reflects in the Merqury output (e.g. my assembly is ~1.5Mb and Merqury indicates this is 98% complete but on Genomescope, k=15, ploidy 1, the predicted size is ~2Mb). Am I conceptually missing something here in how Merqury and Genomescope calculate the predicted genome size?

Error running merqury a message from util.sh

Hello, an amazing tool,

I am trying to run merqury but I get this error:

./merqury.sh: line 16: /util/util.sh: No such file or directory
link: missing operand after 'read_db.meryl'
Try 'link --help' for more information.
read:

No haplotype dbs provided.
Running Merqury in non-trio mode...

link: missing operand after 'asm1.fasta'
Try 'link --help' for more information.
asm1:
out : test

Get spectra-cn plots and QV stats

the command that I am using is

./merqury.sh read_21.meryl datura.fasta test

Also when i check logs I get this message:

./merqury.sh: line 71: /eval/spectra-cn.sh: No such file or directory

any help?

Best regards
Ivan

Question: a way to extract read-only k-mers

Hi again Dr. Rhie,
I'm having a great time using merqury to asses quality and completeness of different versions of our assembly. It's an awesome tool and it is allowing us to make more informed decisions about haplotig purging and other parameter tweaks. So thanks again!

As I was examining our spectra I noticed our assembly is missing some heterozygosity ie a black (read-only) peak at 1-copy.

I was wondering if there is a way to pull out those kmers specifically and then the reads they came from. My thought is that we could then assemble at least part of the missing heterozygosity using the short reads then add that to our haplotigs to get a more complete assembly.

Can you suggest a way to extract those kmers and their sequence for such a purpose?

Thanks in advance,
Ben

No plots generated

Hi, thanks for developing this cool tool! I used merqury.sh command, and it works smoothly, except that no plots were generated. I didn't get any errors while running the command. The mercury tool was installed via miniconda3, and I had build an environment specific for mercury. To solve the problem, I tried copying all the .R and .sh executables into /miniconda3/envs/merqury/bin, and I also tried update all packages in the environment. But still I did not solve the problem. Could you help, please?

Codes are here:
merqury.sh sample.meryl sample1.fasta sample_try

Files generated after running the command:
sample_try.completeness.stats
sample_try.dist_only.hist
sample_try.sample1.only.hist
sample_try.sample1.qv
sample_try.sample1.spectra-cn.hist
sample_try.qv
sample.hist
sample.hist.ploidy
sample.filt

No phase block result

Hi,
I have been using Merqury to assess a trio-binning assembly. The first three steps successfully completed. However, two phaseblock steps generate files with no data.
I checked result.phase-block1.log. It indicates "Not enough memory to load 943454956 distinct 21-kmers.
Need at least 4.738 GB memory".
But I assigned 140gb of memory in my slurm script.
Can you suggest how to resolve the issue.
Thank you.
Zhen

remove module load?

Hi there,

Thanks for making this interesting and useful tool! I have found on our cluster that the lines that check if R, bedtools, and samtools are installed wrecks some havoc on getting this to run. Specifically it loads the default version of R, when packages are not all installed for whatever that default version is, and the different capitalization of bedtools (vs BEDTools).

Anyway, just wonder if simpler to assume these are installed as dependencies rather than having these hard set module load calls? I commented them all out to get my jobs to run. Just thought I'd make this suggestion if I am not unique here.

It would, for example, save time when I go to pull the latest update, to not have to comment these out :)

e.g. this:

has_module=$(check_module)
if [[ $has_module -gt 0 ]]; then
        echo "No modules available.."
else
	module load bedtools
	module load samtools
	module load R
fi

thanks!

Question: Non parental hap-mer identification

Hello,
I'm very excited to test and assess the quality of our phased assembly (FALCON-Unzip -> FALCON-Phase).
However due to the nature of our system we are unable to acquire the parents of our sequenced individual.
In the preprint you mention:

"Merqury’s methods are extensible to polyploid genomes and it is possible to identify hap-mers using orthogonal datasets (e.g. Hi-C, Strand-seq). To support alternative k-mer classification methods, Merqury is designed to receive any pre-computed hap-mer set as input."

Though this is beyond the scope of Mergury's pipeline could you direct me to methods for creating such a hap-mer set using Hi-C data?

While a little circular (Hi-C was used to phase the genome) It would be great to be able to asses the phasing accuracy using this method.

Thanks,
Ben

QV at +inf

Hi Folks

I am very happy with this tool but cannot figure out the value of Quality for my assembly... I have a value of +inf

head *.qv
==> MerquryOutput.qv <==
tog7291FlyeRacon3Medaka	0	9680225839	+inf	0

The completeness is at 100%. I suspect the +inf means we reach the maximum theoreticaly possible, but can we have a value (e.g. 60 ?)

Help in a weird spectra-cn.fl.png, QV and C interpretation

Hi @arangrhie,

Merqury is such a great tool, congrats!

I assembled a fungus genome assembly with Canu software using PacBio reads, polished it with Illumina datasets, and scaffolded it with Nanopore as the best strategy. The genome was haploid and presented high repetitive content of 90%. Then, I wanted to know about my quality results with Merqury. However, I obtained very weird results from which I can not give some reasonable interpretation, especially in the huge kmers proportion in reads_only, this appears that the fungus assembly could be bigger than the one that I got? (100 MB).

kindly, I would like to ask you for some help with this interpretation.

Thanks a million.
I attached my a spectra-cn.fl.png plot.
The QV and C results are asm_scaf 60590 93691673 44.6792 3.4047e-05 and asm_scaf all 1186554 2790136 42.5267 respectively.

Best wishes!

Slurm is an unlisted dependency for parallelized runs.

Add Slurm to the dependency list.

Low completeness with high QV

Hi Arang,

Thanks for the wonderful software.
We were trying to apply it to our fish data but the results look a bit off.
The QV stats are as follows -
referencegenome 2207970 963657040 39.6167 0.000109226

However the completeness looks very low -
referencegenome all 624153086 2400034559 26.006

I noticed that you declare $asm.solid.meryl. Is it generated from filt.sh or am I missing something here?
For some reason the filt.sh generates a read.filt with a value of 0 - ploidy, depth and boundary are all 0. The histogram of my illumina reads db is as follows -
1 1669096430
2 730938129
Is there any missing output prefix in #line52 of spectra-cn.sh?

We have larger nodes available than multiple smaller ones hence need run it as a single node job rather than a batch job.

hap_blot

dear author
I finished hap_blot this step, why is it like this, what caused it?

I found out in the log that this step was not wrong

Low consensus quality value

Hi,
it's a so great tool. I used it to measure the QV, and the value is only 23.3546. And i notice that the QV is more 30 in our example data. So should i need to further polish my genome using Illumina data?

Thanks

Hongbing She

How to understand the error rate

Dear Arangrhie,

I used mercury to assess the QV and error rate in my assemblies, and run into a question about how much difference it is between error rate e.g. 9.12E-06 and 1.24E-06? Could we understand the error rate 9.12E-06 as 9.12 bp errors per million bp? From the equation written in the mercury paper, I guess I can't understand in this way, right? Sorry if this question is dumb.

The reason why I ask this question is that I have several assemblies, and it's difficult for me to choose the best one according to QV and N50. Please see below:

Assembly. QV Error rate. N50 N75
assembly1 54.92 3.2E-06 16Mbp 6Mbp
assembly2 59.06 1.24E-06 10Mbp 4Mbp
assembly3 50.39 9.12E-06 18Mbp 15Mbp

I would like to know how difference the QV means among these assemblies, in order to decide whether I should choose the one with the best N50 and N75 values or the one with the highest QV.

Could you please give me some hints?

Thanks a lot!

Merqury installed with conda

We are trying to run merqury which has been installed with conda. I am assuming all of the dependencies are provided in the .yml conda recipes. We also have other lmod modules available like R, bedtools, samtools, etc. Which in this case would be incompatible with merqury module. It looks like when running merqury it's trying to load some of these modules with the provided scripts:

share/merqury/merqury-mash.sh: module load mash
share/merqury/trio/hapmers.sh: module load R
share/merqury/trio/spectra-hap.sh: module load R
share/merqury/trio/hapmers_to_bigwig.sh:module load ucsc
share/merqury/trio/hap_blob.sh: module load R
share/merqury/trio/switch_error.sh: module load R
share/merqury/trio/block_n_stats.sh: module load bedtools
share/merqury/trio/block_n_stats.sh: module load samtools
share/merqury/trio/block_n_stats.sh: module load R
share/merqury/util/bed_to_bigwig.sh:module load bedtools
share/merqury/util/bed_to_bigwig.sh:module load ucsc/406
share/merqury/eval/asm_multiplicity.sh:module load samtools
share/merqury/eval/asm_multiplicity.sh:module load ucsc/396
share/merqury/eval/spectra-cn.sh: module load R
share/merqury/eval/read_multiplicity.sh:module load samtools
share/merqury/eval/read_multiplicity.sh:module load ucsc/396

My first question is why does this behavior exist? If the software was installed with conda all of the dependencies should be available in the $PATH which they are as far as I can tell.
My second question is how can we mitigate this? Merqury is trying to load R which is inside of the container which will not work and the modules ucsc have different version on our cluster.

Segmentation fault in meryl-lookup

Hi,
Thank you for a very useful tool! I'm currently trying on a trio data set where we had CCS reads for the F1 and Illumina for the progenitors. I assembled it with HiCanu (no trio data involved) and got a very good assembly. Now I'm running merqury with the parent hap-mers to help evaluate the ploidy and the phasing.
I'm getting a consistent crash in meryl-lookiup when executed via the phase_block.sh wrapper. It seems like the default stdout output causes it to seg fault. I could work around it by adding -output to the command and then passing that file to the awk (see below).

meryl-lookup -dump -memory 4 -sequence $scaff.fasta -mers $hap.meryl  |  awk -v hap=$hap_short -v k=$k '$(NF-4)=="T" {print $1"\t"$(NF-5)"\t"($(NF-5)+k)"\t"hap}' > $out.$hap.bed

-- Loading kmers from 'sw_orange.hapmer.meryl' into lookup table.

 p       prefixes             bits gigabytes (allowed: 4 GB)
-- -------------- ---------------- ---------
15          32768       1131534860     0.132
16          65536       1106740638     0.129
17         131072       1084043568     0.126
18         262144       1065540802     0.124
19         524288       1055426644     0.123 (smallest)
20        1048576       1062089702     0.124
21        2097152       1102307192     0.128
22        4194304       1209633546     0.141
23        8388608       1451177628     0.169
24       16777216       1961157166     0.228
25       33554432       3008007616     0.350
26       67108864       5128599890     0.597
27      134217728       9396675812     1.094
28      268435456      17959719030     2.091 (used)
29      536870912      35112696840     4.088
30     1073741824      69445543834     8.085
31     2147483648     138138129196    16.081
32     4294967296     275550191294    32.078
-- -------------- ---------------- ---------

For 26891374 distinct 20-mers (with 28 bits used for indexing and 12 bits for tags):
    2.091 GB memory
    2.000 GB memory for index (268435456 elements 64 bits wide)
    0.038 GB memory for tags  (26891374 elements 12 bits wide)
    0.053 GB memory for data  (26891374 elements 17 bits wide)

Will load 26891374 kmers.  Skipping 0 (too low) and 0 (too high) kmers.
Allocating space for 26891374 suffixes of 12 bits each -> 322696488 bits (0.038 GB) in blocks of 32.000 MB
                     26891374 values   of 17 bits each -> 457153358 bits (0.053 GB) in blocks of 32.000 MB
Loaded 26891374 kmers.  Skipped 0 (too low) and 0 (too high) kmers.
-- Opening sequences in 'canu.FG5.contigs.fasta'.

Failed with 'Segmentation fault'; backtrace (libbacktrace):
utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
(null)::0 in (null)()
meryl/meryl-lookup.C::97 in _Z13dumpExistenceP10dnaSeqFileP20compressedFileWriterRSt6vectorIP20kmerCountExactLookupSaIS5_EERS3_IPKcSaISA_EE()
meryl/meryl-lookup.C::426 in main()
(null)::0 in (null)()
(null)::0 in (null)()
Segmentation fault

My workaround:

meryl-lookup -dump -memory 4 -sequence $scaff.fasta -mers $hap.meryl -threads 36 -output dump_tmp                                                                                              
cat dump_tmp | awk -v hap=$hap_short -v k=$k '$(NF-4)=="T" {print $1"\t"$(NF-5)"\t"($(NF-5)+k)"\t"hap}' > $out.$hap.bed  && rm dump_tmp

[completes normally]

As a side note, I also noticed that the -threads flag is not being passed to the program. It sped up the db IO greatly when I added it.

Best,
Eugene

exclude read for HiC scaffolding

Hello, @arangrhie

In the VGP preprint, you use exclude_read.sh for binning the HiC reads. But in my trio project, we only have child hifi reads and parents Illumina reads.

So can I just use the paternal or maternal spciefic reads or using HiFi kmer to get hapmer?

Thanks,
Zhigui Bao

HG002.k21.meryl.tar and NA12878.k21.meryl.tar are missing

Hi ,

I am downloading data of HG002 and NA12878 from here but the HG002.k21.meryl.tar and NA12878.k21.meryl.tar links failed before my downloads were finished. Is there any other available links to download them?

Best wishes
Lidong Guo

Issue with link...

Hi all

I have containerized my merqury to be able to test it from anywhere, but I am facing issues with link when launching mercury.sh:

"""
Singularity> echo $MERCURY
/opt/tools/merqury-1.1
Singularity> $MERCURY/merqury.sh tog7291FlyeRacon3Medaka.meryl tog7291FlyeRacon3Medaka.fasta MerquryOutput
/opt/tools/merqury-1.1/merqury.sh: line 16: /merqury-1.1//util/util.sh: No such file or directory
link: missing operand after 'tog7291FlyeRacon3Medaka.meryl'
Try 'link --help' for more information.
read:

No haplotype dbs provided.
Running Merqury in non-trio mode...

link: missing operand after 'tog7291FlyeRacon3Medaka.fasta'
Try 'link --help' for more information.
asm1:
out : MerquryOutput

mkdir: cannot create directory 'logs': Read-only file system

Get spectra-cn plots and QV stats
/opt/tools/merqury-1.1/merqury.sh: line 71: logs/MerquryOutput.spectra-cn.log: No such file or directory
Singularity>
"""

Does anyone have an idea ? I am in a writable folder, and the link command ask for two parameters, the source and the target

The container is based on ubuntu 20.04

Conda r-packages problem

Hello,

I am trying to run merqury but it does not output the png plots. I checked inside the log directory and I noticed this error:

Don't know what to do with 'purged.0.meryl'.
/data/SBCS-MartinDuranLab/03-Giacomo/src/anaconda3/merqury_env2/share/merqury/eval/spectra-cn.sh: line 111: -: syntax error: operand expected (error token is "-")
Loading required package: argparse
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘argparse’
Loading required package: ggplot2
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘ggplot2’
Loading required package: scales
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘scales’
Error in ArgumentParser(description = "Make spectra-cn plots. Line, filled, and stacked spectra-cn plots will be generated.") : 
  could not find function "ArgumentParser"
Execution halted

I have installed it through conda and I am getting this "warning" message:

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done

My .condarc file looks like this:

channels:
  - bioconda
  - conda-forge
  - r
  - defaults
  - etetoolkit

I suspect something is stopping conda from setting the right paths needed for the r-packages.
What can I try to do? Thanks for your help!

Best regards,

Giacomo

Hap_blot.sh

When I run merqury, the second step: hap_blob.sh ,the problem arises
This is the code I run:
/home/duhuipeng/DHP/merqury-1.1/merqury.sh /home/duhuipeng/DHP/DRC/DRC.SON.k21.meryl/ ../../DRC.father.k21.meryl/ ../../DRC.mother.k21.meryl/ ../father_purged.fasta ../purged_mo.fasta test-2
the error as follow in hap_blob

at first I thought core dump was out of memory, but I found that I couldn't run when I used the size of 3T memory. How can I solve this problem?
Because I want to draw a picture of its phase separation,But this step is wrong is not drawn, I hope to get your advice.

Looking forward to your reply!
Best

Non-count Meryl operations don't appear to use all available cores without explicitly set "threads="

Hi,

On the LSF cluster I am using, I've noticed that the Meryl "count" operation behaves as described with respect to the parallelisation section in the README, i.e. using all available cores.

export OMP_NUM_THREADS=T; meryl count threads=T ...

behaves the same as

export OMP_NUM_THREADS=T; meryl count ...

On the other hand, this doesn't appear to be true for other operations (I've only checked intersect and difference), where

export OMP_NUM_THREADS=T; meryl difference threads=T ...

will run with correct threading while

export OMP_NUM_THREADS=T; meryl difference ...

will run with only a single core.

I couldn't find much information if these other operations can be safely parallelised, but they appear to give equal output after some manual checking. I was able to run the eval/spectra-cn.sh sequence much faster after manually adding "threads=T" to the many intersection/difference calls.

I wasn't sure if this is some system based issue of SLURM vs LSF, some race condition risk, or if you had any other thoughts on this.

Thanks,
Alex

Executing scripts using sh is giving errors.

Hi,
I am trying to use Merqury (hapmers.sh specifically) but am getting errors like:

/srv/netscratch/dep_mercier/grp_schneeberger/software/merqury-1.1/build/intersect.sh: 3: /srv/netscratch/dep_mercier/grp_schneeberger/software/merqury-1.1/build/intersect.sh: [[: not found

This is happening because the scripts (intersect.sh, filt.sh etc) are being called using SH. Indeed, when I change the specific lines from sh $MERQURY/build/intersect.sh $child_meryl $hap2.only.meryl $hap2.inherited to bash $MERQURY/build/intersect.sh $child_meryl $hap2.only.meryl $hap2.inherited, it works fine. As all scripts are supposed to be run using bash (they all have #!/bin/bash), I am not sure why they are being called using sh. If there is no specific reasoning for it, then it would be great if the commands are changed to use bash instead of sh.

Expanding X axis in spectra plots...

Hi there,

Very nice tool you have here!! I'm working with a polyploid (4X) and was wondering if there is an easy way to change the x axis in the produced spectra plots. The view cuts off at 45 multiplicity-ish and I really need to see probably up to 80 multiplicity.

I'm somewhat versed in R, but I thought that maybe instead of me accidentally messing something up, you might know the line(s) off the top of your head to change to see higher kmer multiplicity?

Thank you for any help you can provide!

Kindly,
Charity

Meryl-lookup runs single threaded in phase_block regardless of thread count

Hi Arang,
I noticed that the different instances of meryl-lookup across merqury run single core, even when env params are set like OMP_NUM_THREADS. I tried manually running the scripts with the meryl-lookup explicitly set with the -threads param it can take, and the results did run parallel and had the same md5sum, so I believe this is a valid behaviour.

I've come across meryl issues in the past (#12) because I was using the meryl tip version, but I was able to recreate this with the meryl tip and v1.3. There is a chance it is just something on my end, but I tried to be as thorough as possible.

I did some testing, and the threading is explicitly set to 1 for the meryl-lookup operation (maybe because of AS_configure in here), while the meryl print operation in the same phase_block file correctly runs with the OMP_NUM_Threads value.

Thanks,
Alex

Issue with finding util.sh

Hi,

When running merqury I run into a problem. I can run

merqury.sh --help

and get the help manual no problem.

But when I try to run it with my actual files (a read_db.meryl, genome.fasta output_prefix) it complains about three things, of which two I am thinking might follow on from the first.

~/Merqury/share/merqury/merqury.sh: line 16: /util/util.sh: No such file or directory
link: missing operand after reads_jenni.meryl' Try link --help' for more information.
link: missing operand after jenni.fasta' Try link --help' for more information.

The script is:

#!/bin/sh
#PBS -l walltime=30:00:00
#PBS -l cput=30:00:00
#PBS -l nodes=1:centos6
#PBS -d ~/Jenni_genome_Nov2020/Merqury

export PATH="~/Merqury/share/merqury/:$PATH"

~/Merqury/share/merqury/merqury.sh reads_jenni.meryl jenni.fasta jenni

In the ~/Merqury/share/merqury directory I have both merqury.sh and util/util.sh

If I do:

more $pwd/util.sh

I get the util.sh script. And when I list the contents of the util folder the util.sh has a size (i.e. its not 0).

So I wondered if it needed the $MERQURY in the merqury.sh script specified and I tried to do it by inserting a line in my script above the merqury.sh command:

~/Merqury/share/merqury=$MERQURY

But it threw up an error and said that there was No such file or directory for this.

Any advice would be great. Thanks!

Meryl new-old

Dear author

I am operating on the updated meryl now
I'm thinking of two data sets, specific kmer for each individual
My code are as follows

meryl count k=21 reads1.fasta oiutput k1_21.meryl
meryl count k=21 reads2fasta output k2_21.meryl
meryl difference output k1_not_k2.meryl k1_21.meryl k2_21.meryl
But now I want to convert these specific folders into sequences, There was a mistake in my operation
my code:
meryl threads=25 print k1_not_k2.meryl/ |awk '{if (match($1,">")) { COUNT=substr($1, 2, length($1)); } else {print $1" "COUNT}}' |awk '{if ($NF < 150) print $0}' > k1.counts
But k1.counts generated 0

I look forward to your reply

Limit threads

Hi,
How to limit threads when running merqury (merqury.sh)? I used default parameter and it seemed to take up too many cores.
I'll really appreciate your suggestions.
Thanks.

Is it possible that switch error is zero ?

Hi Arang,

I evaluated two NA12878 haplotype-aware assemblies by pre-built meryl dbs(from here) and all metrics are fine except the swtich errors are all zero. I had evaluated one of the assembly with a self-built NA12878 meryl and pre-built hapmers and the swtich error was not zero but 0.14%.

Below are the contents of *stats and switch.txt from pre-built meryl dbs of NA12878

NA12878.Pat	all	2232613826	2294572208	97.2998
NA12878.Mat	all	2233271654	2294572208	97.3285
both	all	2277921738	2294572208	99.2744
NA12878.Pat	NA12878.pat	17953494	18740902	95.7985
NA12878.Pat	NA12878.mat	32750	19878688	0.164749
NA12878.Mat	NA12878.pat	6469	18740902	0.0345181
NA12878.Mat	NA12878.mat	19022934	19878688	95.6951
both	NA12878.pat	17954819	18740902	95.8055
both	NA12878.mat	19025207	19878688	95.7066

NA12878.NA12878.Pat.100_20000	12,261	2,695,070,073	22	219,808	3,849,155	18,432,329	0	18144557	0%
NA12878.NA12878.Mat.100_20000	11,977	2,719,270,146	22	227,041	5,188,431	28,072,484	0	19170889	0%

NA12878.NA12878.Mat.100_20000 switch error rate (%) (Num. switches / Total markers found): 0	19170889	0%

Below are the contents of *stats and swiches.txt from self-built NA12878 meryl and pre-built hapmers.

ma_fab.mother	all	2232023738	2300926387	97.0054
pa_fab.father	all	2231414126	2300926387	96.9789
both	all	2276242413	2300926387	98.9272
ma_fab.mother	Pat.hapmer	6469	18740902	0.0345181
ma_fab.mother	Mat.hapmer	19022934	19878688	95.6951
pa_fab.father	Pat.hapmer	17953494	18740902	95.7985
pa_fab.father	Mat.hapmer	32750	19878688	0.164749
both	Pat.hapmer	17954819	18740902	95.8055
both	Mat.hapmer	19025207	19878688	95.7066

na12878.ma_fab.mother.100_20000	12,057	2,717,709,829	22	225,405	5,188,431	28,072,484	3215	19170841	0.0167703%
na12878.pa_fab.father.100_20000	12,614	2,684,659,568	22	212,832	3,683,712	18,432,329	26403	18144465	0.145515%

na12878.pa_fab.father.100_20000 switch error rate (%) (Num. switches / Total markers found): 26403	18144465	0.145515%

Although those file names are different , they are linked to the identical files. I am sure the input fastas are identical and the software and parameters are the same. The hapmers are also the same and the only difference is different NA12878 meryl db.

Another NA12878 haplotype-aware assembly is pulibic data from :
ftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/asm/NA12878-denovo-H1.fa.gz and ftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/asm/NA12878-denovo-H2.fa.gz

The merqury I used stays in 0d67a1d ( Wed May 13 15:24:14 2020 -0400 ) . Should I update it to the lastest ?

I had evaluated HG002 with pre-built meryl dbs and the swith errors are not zero.

I do not know how to debug this . I appreciate any advice in advance.

Best wishes.

Lidong Guo

Question: why the error bar in spectra_cn.png is higher than that in asm.st.png

Hi ,

I noticed that the error bar in spectra_cn.png is higher than than in asm.st.png. Howerver , if both of them refer to the assembly k-mers absent from the read set, they should be consistent.

To figure out their difference, I had read related souce codes but only found I was even more confused.

Here is my understanding of related source codes:

In asm.st.png , the total height of error bar refers to "the total number of assembly k-mer types absent from the read set".
1. 1 In meryl statistics this value is named by "distinct".
In spectra_cn.png, the total height of error bar refers to the total number of assembly k-mers absent from the read set , where k-mers=accumulation( each k-mer type's multiplicity).
2.1 In meryl statistics this value is named by "persent".
2.2 The height of error bar colored by copy-number-one color refer to "distinct".
2.2 The height of error bar colored by copy-number-two color refer to "persent"-"distinct" .
2.2.1 Thoes kmers in "persent"-"distinct" may have a copy number larger than 2.

I'm not sure whether this is a specific design or a bug.

Best wishes
Lidong Guo

Quality value equal to Inf

Hello and thank you for developing Merqury,

I have launched it on a plant genome and it finishes without crashing. However, the QVs I obtain are equal to inf, with k-mers specific to the assemblies equal to 0. The _only.bed and _only.wig files of the assembly are also empty. Here is the command I launched:

bash merqury/merqury.sh /env/cns/bigtmp2/reapr_pahang/reads/reads.meryl assembly_v2.fasta merqury_v2

Here is the content of the .qv file:

assembly_v2	0	405325924	inf	0

And finally, here is the spectra-cn log, could you please help me solve this issue?
merqury_v2.spectra-cn.log

The words on the picture are garbled

Hi,
I have a problem with this output picture. Why is the text above scrambled? I have tried to reanalyze it for many times, but the result is the same.

sincerely,
YiRan

conda install

I'm really excited merqury! Will you be supporting a conda install of the program any time soon?

NGS reads and Hifi reads which will be better for running Merqury?

Hi,
I assemble a genome using Hifi reads. However, I am confused that NGS reads and Hifi reads which might be better for qualifying our genome. I thinks NGS reads will be biased on high GC and repetitive region which might not be perfect for qualifying genome, but in the cookbook of Merqury I only see explanation on how to use NGS reads to run Merqury. So here I ask about this question that NGS reads and Hifi reads which will be better for running Merqury?

Awk error and looking for more docs/tutorial on merqury

Good day and thank you sharing great software!

I am trying merqury for the first time. I installed both Meryl and Merqury but i am having trouble with the first step estimating ideal kmer size:

sh $MERQURY/best_k.sh 924902646 [tolerable_collision_rate=0.001]

But after I run this line I get an error:

genome: 924902646 tolerable collision rate: [tolerable_collision_rate=0.001] awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted

Have I missed an argument in the command?

Also I was wondering if there are any currently available merqury tutorials or examples available online?

Thanks and kind regards
Crystal

Adding $out to static file names to avoid filename conflicts

Hi Arang,
This tool has been extremely useful in assessing new assemblies, but just wanted to offer a suggesting similar to issue #14. Currently, there are multiple places where files are written without prefixes, like

completeness.stats
switches.txt
cutoffs.txt

This isn't an issue if merqury is run in individual folders, but does if for example a primary assembly and trio-binned assemblies are assessed in the same folder. Most of these files exist in scripts with an existing $out variable, and so I was able to modify it to work better for my use case by just using e.g. $out.switches.txt.

It may cause issues for other users who already directly use these file names, but otherwise I've experienced no issues in running merqury on multiple assemblies (from the same reads) in the same folder.

Many thanks,
Alex

marbl / merqury Goto Github PK

merqury's People

Contributors

Stargazers

Watchers

Forkers

merqury's Issues

1. Build meryl dbs

Copy numbers in k-mers found only in asm

polished_sequences only

Write output

Get asm only for spectra-asm

Plot CBtest.spectra-asm.hist

Clean up

Recommend Projects

Recommend Topics

Recommend Org