Comments (8)
The main consensus step of medaka can indeed be parallelized: the steps of the medaka_consensus
program can be run individually.
Medaka comes with a script mini_align
which is responsible for running minimap2, sorting, and indexing the bam. This can run this as:
basecalls=...
draft=...
alignments=align.bam
align_threads=32
mini_align -i ${basecalls} -r ${draft} -P -m -p ${alignments} -t ${align_threads}
After which the consensus calling can be run on regions independently:
region="chr1:0-10000000" # for example
model=r941_flip235 # or as appropriate
probs=${region}_probs.hdf
medaka consensus ${alignments} ${probs} --model ${model} --batch_size --threads 8
When running on multiple regions, it would be recommended to provide an overlap of 500 bases.
After calculating all _probs.hdf
files, the final consensus for a contig can be calculated by running, e.g.:
medaka stitch chr1*_probs.hdf chr1.fasta
The medaka stitch program does not provide any parallelism (we will implement that in a future release), but it can be run on contigs independently. I am still surprised that it has taken more than a day! I do not recall it taking so long when run on a human genome. Can you share the current output, has it processed multiple contigs or does it seem completely stuck?
For your larger dataset, after performing alignments medaka itself scales linearly with the draft size and is (largely) independent of depth.
from medaka.
Thank you for the info. So i will start stitching multiple contigs in parallel as this is taking ages and stitching is still ongoing =).
Stitch is progressing through the contigs but in a very slow manner, sometimes spending hours resolving some non-overlap issue as you can see here:
[12:06:25 - Stitch] Processing ctg1487Racon2.
[12:06:34 - Stitch] Processing ctg148Racon2.
[12:38:21 - Stitch] Processing ctg1490Racon2.
[12:38:44 - Stitch] Processing ctg1491Racon2.
[12:39:05 - Stitch] Processing ctg1492Racon2.
[12:40:16 - Stitch] Processing ctg1493Racon2.
[12:40:23 - Stitch] Processing ctg1494Racon2.
[12:40:34 - Stitch] Processing ctg1495Racon2.
[12:41:13 - Stitch] Processing ctg1496Racon2.
[12:41:49 - Stitch] Processing ctg1497Racon2.
[12:42:00 - Stitch] Processing ctg1498Racon2.
[12:42:10 - Stitch] Processing ctg149Racon2.
[12:57:57 - Stitch] Processing ctg14Racon2.
[15:10:05 - Stitch] There is no overlap betwen ctg14Racon2:20146771.0-20151911.2492 and ctg14Racon2:20151910.10493-20155625.0
[15:10:08 - Stitch] ctg14Racon2:20151910.1493-20151911.11492 ends before ctg14Racon2:20151910.10493-20155625.0, skipping.
[15:58:05 - Stitch] Processing ctg1500Racon2.
[15:58:16 - Stitch] Processing ctg1502Racon2.
[15:58:22 - Stitch] Processing ctg1503Racon2.
[15:58:26 - Stitch] Processing ctg1504Racon2.
[16:00:36 - Stitch] Processing ctg1506Racon2.
[16:00:40 - Stitch] Processing ctg1507Racon2.
Those contigs are in the 500 Kb to 4 Mb range, so not large at all. Assembly was produced by wtdbg2 and previousley polished two times with racon as recommended.
Could you tell me where medaka is saving intermediate stitching output so i could get the processed contigs up till now? Can it be retrieved somewhere from the job-running node? So i dont have to wait for stitch to finish on the whole genome.
Thank you,
Michel
from medaka.
So as the prob.hdf file already has been generated, i ended up calling :
medaka stitch consensus_probs.hdf batch${id}.polished.fasta --regions $contigs
where contigs is a space separated string of contig names:
ctg601Racon2 ctg602Racon2 ctg603Racon2 ctg604Racon2 ctg605Racon2 ctg606Racon2 ctg607Racon2 ctg608Racon2...
Unfortunately, i just saw that you reurn a list of tuples with seq id and corrected seq from a loop. So unless the loop is finished correctly, there are no intermediate files written out. I will fix that as i run repeatedly into Keyerrors when looking for contig names, which should actually be present.
Example:
[11:41:07 - Stitch] Processing ctg718Racon2.
[11:41:08 - Stitch] Processing ctg719Racon2.
[11:41:09 - Stitch] Processing ctg720Racon2.
Traceback (most recent call last):
File "/mnt/users/michelmo/.conda/envs/medaka06/bin/medaka", line 11, in <module>
sys.exit(main())
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/medaka.py", line 261, in main
args.func(args)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/stitch.py", line 343, in stitch
joined = stitch_from_probs(args.inputs, regions=args.regions)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/stitch.py", line 51, in stitch_from_probs
s1 = next(data_gen)
File "/mnt/users/michelmo/.conda/envs/medaka06/lib/python3.6/site-packages/medaka/datastore.py", line 247, in yield_from_feature_files
for d in self.index[ref_name]:
KeyError: 'ctg720Racon2'
But when i check bam-index of calls_to_draft.bam, the contig is actually present:
$samtools view -H ../cattle_medaka/calls_to_draft.bam |grep ctg720
@SQ SN:ctg720Racon2 LN:22288
Have you experienced similar behaviour?
from medaka.
It does look like things are running a little slower than our experience, is your data on a network file server? Currently medaka stitch
needs to read back quite a bit of data from the file. What's the total size of your consensus_probs.hdf
file?
Returning the list of tuples from the function comes hasn't been an issue for smaller datasets, we'll get this changed to yield results as they are produced.
The KeyErrors
are likely resulting from regions that medaka
was unable to process; there are a few reasons this can occur. It is currently up to the use to determine uncorrected regions by comparing the input and output sequences. We will think about how to track these and provide a report.
Motivated by your observations, we are looking to remove the need for running medaka stitch
by doing the work on the fly, though that will take some time to implement.
from medaka.
Size is 141 G. So nothing enormous.
So at the moment, the KeyErrors are causing medaka stitch to fail for all follow-up contigs. That shouldnt happen, right?
I will try to fix this, cause i actually get quite a lot of KeyErrors.
from medaka.
So do i get this right?
If you run medaka_consensus, stitching runs under the hood and KeyErrors get caught, meaning some of the sequences pass unpolished (because they dont have aligned reads) but also unnoticed,
whereas if you run medaka stitch, such contigs throw a KeyError and cause the programm to exit.
from medaka.
That is in essence correct. (The KeyError
s do not arise during medaka_consensus
as medaka stitch
simply inspects the files to see which contigs are available to be processed).
from medaka.
Ok, thanks. So i just have to catch the KeyError
and guide stitch to the next contig to continue.
from medaka.
Related Issues (20)
- Missing pyabpoa in Docker image HOT 3
- process pool error in medaka tandem HOT 10
- Bad heap free list error in medaka stitch HOT 2
- medaka_consensus run without pyabpoa HOT 1
- Medaka error
- medaka error pyabpoa and libgsl.so.25 HOT 1
- Duplicate entries in annotated VCF file HOT 2
- Unable to run medaka_consensus on Mac M3 HOT 5
- Empty vcfs when running medaka_haploid_variant command HOT 2
- Makefile:158: recipe for target 'check_lfs' failed make: *** [check_lfs] Error 1 HOT 2
- 1.6.0 release unavailable in pypi HOT 2
- Medaka Compatibility with Fungal Reads HOT 6
- I run medaka consensus in HPC, it only generated HDF5 data, how to generate consensus. fasta? HOT 6
- Is it possible to use medaka in offline mode? HOT 11
- Unable to install medaka on Mac M3 HOT 9
- help please with minimap2, tabix, bgzip and bcftools binary files
- Python 3.12 compatibility for pip HOT 1
- batch size and GPU use HOT 4
- ModelStoreTF exception <class 'tensorflow.python.framework.errors_impl.InternalError'> HOT 1
- Need help with 'AVX instructions not available' error HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from medaka.