adamtaranto / teloclip Goto Github PK
View Code? Open in Web Editor NEWA tool for the recovery of unassembled telomeres from soft-clipped read alignments.
License: Other
A tool for the recovery of unassembled telomeres from soft-clipped read alignments.
License: Other
Hi Adam,
love the tool, but came across something strange.
I ran the tool like this:
minimap2 -ax map-ont P9424_final.fasta ../P9424.correctedReads.fasta.gz | teloclip --ref P9424_final.fasta.fai | samtools sort > P9424_teloclip.bam
When I look at the bam files I get a lot of these NNNNNNN reads as in the below image mapping to contig ends.
But when I go and look that read name up in the actual fasta reads file, they are proper reads without any Ns.
Any idea where that might come from?
Cheers
Add support for plain text reporting of alignments of interest.
Aim: Summary file that can be awk'd to list all OHs at one end of a given contig for visual inspection.
Dear Adam,
It is amazing to find such a great tool for telomere anchoring! But I would like to say some expression you mentioned in the tutorial is blurred, especially the input and output file (like in.sam, out.sam and out.bam...). If you can clarify them, I would appreciate it. By the way, do you have any recommendations about citing your work, just github link or you have published it?
Thanks
hi,
I just found this tool and I was trying to fill telomere to our assembly.
I just installed this tool by conda as you wrote
conda install -c bioconda teloclip
then I tried use the test data like:
[$]> lh> samtools view -h test.bam | teloclip --ref test.fna.fai | samtools sort > out.bam
total 180K
-rw-r--r-- 1 pliu ebio 17K Nov 1 13:28 test.bam
-rw-r--r-- 1 pliu ebio 336 Nov 1 13:28 test.bam.bai
-rw-r--r-- 1 pliu ebio 5.4K Nov 1 13:28 test.fna
-rw-r--r-- 1 pliu ebio 95 Nov 1 13:28 test.fna.fai
-rw-r--r-- 1 pliu ebio 141K Nov 1 13:28 test.sam
[$]
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
samtools sort: failed to read header from "-";
then I tried run another command:
[$]~> samtools view -h test.bam | teloclip --ref test.fna.fai --motifs TTAGGG | teloclip-extract --refIdx test.fna.fai --extractReads --extractDir SplitOverhangs
Importing reference contig info from: test.fna.fai
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip-extract", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_extract.py", line 283, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
contigInfo = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
do you have any idea about this issue?
Best, PF
Current alignment length is taken as "M" length adjacent to a clipped end "S".
Most long read alignments have more complex alignments with many indels. This leads to underestimation of alignment length and causes teloclip to miss most 3' overhangs.
Task: Explicitly sum all indel and alignment blocks from CIGAR string when reporting alignment length.
Existing modules:
New module:
Tasks:
Dear autor:
I am assembling a plant genome, and now it is assembled to the level without gap. It have 21 chromosomes, but so far only 39 have been counted, and there are 3 motifs (AAACCCT) without telomeres.
Seeing the tools you developed, I think it will save a lot of manual work. I have tried step by step,
1. First index the reference assembly
samtools faidx hifiasm.fasta
2. Reading alignments from SAM file
minimap2 -ax map-hifi -t 120 hifiasm.fasta ../hifi/all.hifi.fasta | samtools view -@ 40 -h -F 0x100 >in.sam
3. Report clipped alignments containing target motifs
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam| samtools sort -@ 60 > out.motifs.bam
4. Extract clipped reads
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam | teloclip-extract --refIdx hifiasm.fasta.fai --extractReads --extractDir SplitOverhangs
but I am not familiar with the use of miniasm. the result is empty !
minimap2 -t 60 -x ava-pb B01_L.fasta no-gap.id_B01.fasta > telo-read.paf
miniasm -f no-gap.id_B01.fasta telo-read.paf > telo-min.gfa
awk '/^S/{print ">"$2"\n"$3}' telo-min.gfa | seqkit seq > telo-min.fasta
Could you give me some suggestions? Thank you very much.
When installing and running teloclip through conda (Python v 3.11.5), I received the following error:
Traceback (most recent call last):
File "/home/sk893857/miniconda3/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
note: This appears to be an invalid file read mode in Python 3. I solved this by manually replacing all instances of 'rU' with 'r' in both scripts. Maybe the conda version of teloclip needs a fix?
Motif runs in clipped regions of raw ONT reads may contain homopolymer errors where the number of sequential repeated bases is incorrect. Currently teloclip
has an option --noPoly
which compresses homopolymer runs in both the motif and the read befor looking for matches. However, this is likely to reduce specificity and increase false positive matches.
Proposal: Replace homopolymer compression with a regex based fuzzy search method.
Steps:
--noPoly
to be removed in future major release.Should default behaviour be exact match or fuzzy search?
Hi, @Adamtaranto , very nice tool!
Is it possible to add a new functionality that extracts reads soft-clipped at NNN gaps in scaffolds? This may be useful to reconstruct centromeric sequences.
Best regards,
Jun
Add option to report n telo motifs within x bases of a contig end using fasta input. Output annotations + text summary.
Hi,
I am wondering is there any way (not the manual checking) to extend the contigs according to the telomere overhangs identified?
Thx
Motif search currently requires at least one instance of any user-specified motif in a read that is clipped at a contig end. This match can occur anywhere in the read, and may not necessarily be in the distal clipped region.
Add option to restrict motif search to the clipped end (or ends for long reads that span entire contigs).
main app tries to import isMotifInClip() which has been removed. Add replacement func.
In the meantime v0.0.4 from PyPi still works fine.
May need to have isClipMotif() function check if the current alignment is on the negative strand before extracting sequence for motif search.
Assumptions:
If above are true, then 3' overhang seq for 1000M500S should be retrieved as:
and for a 5' alignment 500S1000M:
Note: If printing overhang sequence to a summary file we should RC the extracted OH for neg strand alignments.
Add module for interactive extension of contigs.
Note: Need to write wrappers for racon, minimap2, and miniasm.
A general refresh of the package is needed.
Phase 1 - Code cleanup and packaging
Phase 2 - Readability
teloclip-extract
will write reads with terminal clipped-overhangs to fasta files. To aide in sanity checks we should include some metadata to the fasta header.
Proposed fields:
Other behaviour changes:
Continuing from issue #13. Update codebase to improve readability.
Tasks include:
Add typehints
Update logging
Add Docstrings
Split long functions into sub-functions
Add as default behaviour - search given motif and its reverse complement.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.