Dear medaka developers, I run medaka on a 2.7 Gb genome to polish wi

medaka stitch runtime about medaka HOT 8 CLOSED

nanoporetech commented on July 17, 2024

medaka stitch runtime

from medaka.

Comments (8)

cjw85 commented on July 17, 2024

The main consensus step of medaka can indeed be parallelized: the steps of the medaka_consensus program can be run individually.

Medaka comes with a script mini_align which is responsible for running minimap2, sorting, and indexing the bam. This can run this as:

basecalls=...
draft=...
alignments=align.bam
align_threads=32
mini_align -i ${basecalls} -r ${draft} -P -m -p ${alignments} -t ${align_threads}

After which the consensus calling can be run on regions independently:

region="chr1:0-10000000" # for example
model=r941_flip235 # or as appropriate
probs=${region}_probs.hdf
medaka consensus ${alignments} ${probs} --model ${model} --batch_size --threads 8

When running on multiple regions, it would be recommended to provide an overlap of 500 bases.

After calculating all _probs.hdf files, the final consensus for a contig can be calculated by running, e.g.:

medaka stitch chr1*_probs.hdf chr1.fasta

The medaka stitch program does not provide any parallelism (we will implement that in a future release), but it can be run on contigs independently. I am still surprised that it has taken more than a day! I do not recall it taking so long when run on a human genome. Can you share the current output, has it processed multiple contigs or does it seem completely stuck?

For your larger dataset, after performing alignments medaka itself scales linearly with the draft size and is (largely) independent of depth.

from medaka.

MichelMoser commented on July 17, 2024

Thank you for the info. So i will start stitching multiple contigs in parallel as this is taking ages and stitching is still ongoing =).

Stitch is progressing through the contigs but in a very slow manner, sometimes spending hours resolving some non-overlap issue as you can see here:

[12:06:25 - Stitch] Processing ctg1487Racon2.
[12:06:34 - Stitch] Processing ctg148Racon2.
[12:38:21 - Stitch] Processing ctg1490Racon2.
[12:38:44 - Stitch] Processing ctg1491Racon2.
[12:39:05 - Stitch] Processing ctg1492Racon2.
[12:40:16 - Stitch] Processing ctg1493Racon2.
[12:40:23 - Stitch] Processing ctg1494Racon2.
[12:40:34 - Stitch] Processing ctg1495Racon2.
[12:41:13 - Stitch] Processing ctg1496Racon2.
[12:41:49 - Stitch] Processing ctg1497Racon2.
[12:42:00 - Stitch] Processing ctg1498Racon2.
[12:42:10 - Stitch] Processing ctg149Racon2.
[12:57:57 - Stitch] Processing ctg14Racon2.
[15:10:05 - Stitch] There is no overlap betwen ctg14Racon2:20146771.0-20151911.2492 and ctg14Racon2:20151910.10493-20155625.0
[15:10:08 - Stitch] ctg14Racon2:20151910.1493-20151911.11492 ends before ctg14Racon2:20151910.10493-20155625.0, skipping.
[15:58:05 - Stitch] Processing ctg1500Racon2.
[15:58:16 - Stitch] Processing ctg1502Racon2.
[15:58:22 - Stitch] Processing ctg1503Racon2.
[15:58:26 - Stitch] Processing ctg1504Racon2.
[16:00:36 - Stitch] Processing ctg1506Racon2.
[16:00:40 - Stitch] Processing ctg1507Racon2.

Those contigs are in the 500 Kb to 4 Mb range, so not large at all. Assembly was produced by wtdbg2 and previousley polished two times with racon as recommended.

Could you tell me where medaka is saving intermediate stitching output so i could get the processed contigs up till now? Can it be retrieved somewhere from the job-running node? So i dont have to wait for stitch to finish on the whole genome.

Thank you,
Michel

from medaka.

MichelMoser commented on July 17, 2024

So as the prob.hdf file already has been generated, i ended up calling :

medaka stitch   consensus_probs.hdf   batch${id}.polished.fasta   --regions $contigs

where contigs is a space separated string of contig names:

ctg601Racon2 ctg602Racon2 ctg603Racon2 ctg604Racon2 ctg605Racon2 ctg606Racon2 ctg607Racon2 ctg608Racon2...

Unfortunately, i just saw that you reurn a list of tuples with seq id and corrected seq from a loop. So unless the loop is finished correctly, there are no intermediate files written out. I will fix that as i run repeatedly into Keyerrors when looking for contig names, which should actually be present.

Example:

[11:41:07 - Stitch] Processing ctg718Racon2.
[11:41:08 - Stitch] Processing ctg719Racon2.
[11:41:09 - Stitch] Processing ctg720Racon2.
Traceback (most recent call last):
File "/mnt/users/michelmo/.conda/envs/medaka06/bin/medaka", line 11, in <module>
sys.exit(main())
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/medaka.py", line 261, in main
 args.func(args)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/stitch.py", line 343, in stitch
 joined = stitch_from_probs(args.inputs, regions=args.regions)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/stitch.py", line 51, in stitch_from_probs
 s1 = next(data_gen)
 File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/datastore.py", line 247, in yield_from_feature_files 
for d in self.index[ref_name]:
KeyError: 'ctg720Racon2'

But when i check bam-index of calls_to_draft.bam, the contig is actually present:

$samtools view -H ../cattle_medaka/calls_to_draft.bam |grep ctg720
@SQ     SN:ctg720Racon2 LN:22288

Have you experienced similar behaviour?

from medaka.

cjw85 commented on July 17, 2024

It does look like things are running a little slower than our experience, is your data on a network file server? Currently medaka stitch needs to read back quite a bit of data from the file. What's the total size of your consensus_probs.hdf file?

Returning the list of tuples from the function comes hasn't been an issue for smaller datasets, we'll get this changed to yield results as they are produced.

The KeyErrors are likely resulting from regions that medaka was unable to process; there are a few reasons this can occur. It is currently up to the use to determine uncorrected regions by comparing the input and output sequences. We will think about how to track these and provide a report.

Motivated by your observations, we are looking to remove the need for running medaka stitch by doing the work on the fly, though that will take some time to implement.

from medaka.

MichelMoser commented on July 17, 2024

Size is 141 G. So nothing enormous.

So at the moment, the KeyErrors are causing medaka stitch to fail for all follow-up contigs. That shouldnt happen, right?
I will try to fix this, cause i actually get quite a lot of KeyErrors.

from medaka.

MichelMoser commented on July 17, 2024

So do i get this right?
If you run medaka_consensus, stitching runs under the hood and KeyErrors get caught, meaning some of the sequences pass unpolished (because they dont have aligned reads) but also unnoticed,
whereas if you run medaka stitch, such contigs throw a KeyError and cause the programm to exit.

from medaka.

cjw85 commented on July 17, 2024

That is in essence correct. (The KeyErrors do not arise during medaka_consensus as medaka stitch simply inspects the files to see which contigs are available to be processed).

from medaka.

MichelMoser commented on July 17, 2024

Ok, thanks. So i just have to catch the KeyError and guide stitch to the next contig to continue.

from medaka.

medaka stitch runtime about medaka HOT 8 CLOSED

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent