Coder Social home page Coder Social logo

medaka stitch runtime about medaka HOT 8 CLOSED

nanoporetech avatar nanoporetech commented on July 17, 2024
medaka stitch runtime

from medaka.

Comments (8)

cjw85 avatar cjw85 commented on July 17, 2024

The main consensus step of medaka can indeed be parallelized: the steps of the medaka_consensus program can be run individually.

Medaka comes with a script mini_align which is responsible for running minimap2, sorting, and indexing the bam. This can run this as:

basecalls=...
draft=...
alignments=align.bam
align_threads=32
mini_align -i ${basecalls} -r ${draft} -P -m -p ${alignments} -t ${align_threads}

After which the consensus calling can be run on regions independently:

region="chr1:0-10000000" # for example
model=r941_flip235 # or as appropriate
probs=${region}_probs.hdf
medaka consensus ${alignments} ${probs} --model ${model} --batch_size --threads 8

When running on multiple regions, it would be recommended to provide an overlap of 500 bases.

After calculating all _probs.hdf files, the final consensus for a contig can be calculated by running, e.g.:

medaka stitch chr1*_probs.hdf chr1.fasta

The medaka stitch program does not provide any parallelism (we will implement that in a future release), but it can be run on contigs independently. I am still surprised that it has taken more than a day! I do not recall it taking so long when run on a human genome. Can you share the current output, has it processed multiple contigs or does it seem completely stuck?

For your larger dataset, after performing alignments medaka itself scales linearly with the draft size and is (largely) independent of depth.

from medaka.

MichelMoser avatar MichelMoser commented on July 17, 2024

Thank you for the info. So i will start stitching multiple contigs in parallel as this is taking ages and stitching is still ongoing =).

Stitch is progressing through the contigs but in a very slow manner, sometimes spending hours resolving some non-overlap issue as you can see here:

[12:06:25 - Stitch] Processing ctg1487Racon2.
[12:06:34 - Stitch] Processing ctg148Racon2.
[12:38:21 - Stitch] Processing ctg1490Racon2.
[12:38:44 - Stitch] Processing ctg1491Racon2.
[12:39:05 - Stitch] Processing ctg1492Racon2.
[12:40:16 - Stitch] Processing ctg1493Racon2.
[12:40:23 - Stitch] Processing ctg1494Racon2.
[12:40:34 - Stitch] Processing ctg1495Racon2.
[12:41:13 - Stitch] Processing ctg1496Racon2.
[12:41:49 - Stitch] Processing ctg1497Racon2.
[12:42:00 - Stitch] Processing ctg1498Racon2.
[12:42:10 - Stitch] Processing ctg149Racon2.
[12:57:57 - Stitch] Processing ctg14Racon2.
[15:10:05 - Stitch] There is no overlap betwen ctg14Racon2:20146771.0-20151911.2492 and ctg14Racon2:20151910.10493-20155625.0
[15:10:08 - Stitch] ctg14Racon2:20151910.1493-20151911.11492 ends before ctg14Racon2:20151910.10493-20155625.0, skipping.
[15:58:05 - Stitch] Processing ctg1500Racon2.
[15:58:16 - Stitch] Processing ctg1502Racon2.
[15:58:22 - Stitch] Processing ctg1503Racon2.
[15:58:26 - Stitch] Processing ctg1504Racon2.
[16:00:36 - Stitch] Processing ctg1506Racon2.
[16:00:40 - Stitch] Processing ctg1507Racon2.

Those contigs are in the 500 Kb to 4 Mb range, so not large at all. Assembly was produced by wtdbg2 and previousley polished two times with racon as recommended.

Could you tell me where medaka is saving intermediate stitching output so i could get the processed contigs up till now? Can it be retrieved somewhere from the job-running node? So i dont have to wait for stitch to finish on the whole genome.

Thank you,
Michel

from medaka.

MichelMoser avatar MichelMoser commented on July 17, 2024

So as the prob.hdf file already has been generated, i ended up calling :

medaka stitch   consensus_probs.hdf   batch${id}.polished.fasta   --regions $contigs

where contigs is a space separated string of contig names:

ctg601Racon2 ctg602Racon2 ctg603Racon2 ctg604Racon2 ctg605Racon2 ctg606Racon2 ctg607Racon2 ctg608Racon2...

Unfortunately, i just saw that you reurn a list of tuples with seq id and corrected seq from a loop. So unless the loop is finished correctly, there are no intermediate files written out. I will fix that as i run repeatedly into Keyerrors when looking for contig names, which should actually be present.

Example:

[11:41:07 - Stitch] Processing ctg718Racon2.
[11:41:08 - Stitch] Processing ctg719Racon2.
[11:41:09 - Stitch] Processing ctg720Racon2.
Traceback (most recent call last):
File "/mnt/users/michelmo/.conda/envs/medaka06/bin/medaka", line 11, in <module>
sys.exit(main())
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/medaka.py", line 261, in main
 args.func(args)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/stitch.py", line 343, in stitch
 joined = stitch_from_probs(args.inputs, regions=args.regions)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/stitch.py", line 51, in stitch_from_probs
 s1 = next(data_gen)
 File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/datastore.py", line 247, in yield_from_feature_files 
for d in self.index[ref_name]:
KeyError: 'ctg720Racon2'

But when i check bam-index of calls_to_draft.bam, the contig is actually present:

$samtools view -H ../cattle_medaka/calls_to_draft.bam |grep ctg720
@SQ     SN:ctg720Racon2 LN:22288

Have you experienced similar behaviour?

from medaka.

cjw85 avatar cjw85 commented on July 17, 2024

It does look like things are running a little slower than our experience, is your data on a network file server? Currently medaka stitch needs to read back quite a bit of data from the file. What's the total size of your consensus_probs.hdf file?

Returning the list of tuples from the function comes hasn't been an issue for smaller datasets, we'll get this changed to yield results as they are produced.

The KeyErrors are likely resulting from regions that medaka was unable to process; there are a few reasons this can occur. It is currently up to the use to determine uncorrected regions by comparing the input and output sequences. We will think about how to track these and provide a report.

Motivated by your observations, we are looking to remove the need for running medaka stitch by doing the work on the fly, though that will take some time to implement.

from medaka.

MichelMoser avatar MichelMoser commented on July 17, 2024

Size is 141 G. So nothing enormous.

So at the moment, the KeyErrors are causing medaka stitch to fail for all follow-up contigs. That shouldnt happen, right?
I will try to fix this, cause i actually get quite a lot of KeyErrors.

from medaka.

MichelMoser avatar MichelMoser commented on July 17, 2024

So do i get this right?
If you run medaka_consensus, stitching runs under the hood and KeyErrors get caught, meaning some of the sequences pass unpolished (because they dont have aligned reads) but also unnoticed,
whereas if you run medaka stitch, such contigs throw a KeyError and cause the programm to exit.

from medaka.

cjw85 avatar cjw85 commented on July 17, 2024

That is in essence correct. (The KeyErrors do not arise during medaka_consensus as medaka stitch simply inspects the files to see which contigs are available to be processed).

from medaka.

MichelMoser avatar MichelMoser commented on July 17, 2024

Ok, thanks. So i just have to catch the KeyError and guide stitch to the next contig to continue.

from medaka.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.