adamtaranto / teloclip Goto Github PK
View Code? Open in Web Editor NEWA tool for the recovery of unassembled telomeres from soft-clipped read alignments.
License: Other
A tool for the recovery of unassembled telomeres from soft-clipped read alignments.
License: Other
Add module for interactive extension of contigs.
Note: Need to write wrappers for racon, minimap2, and miniasm.
Add as default behaviour - search given motif and its reverse complement.
Dear autor:
I am assembling a plant genome, and now it is assembled to the level without gap. It have 21 chromosomes, but so far only 39 have been counted, and there are 3 motifs (AAACCCT) without telomeres.
Seeing the tools you developed, I think it will save a lot of manual work. I have tried step by step,
1. First index the reference assembly
samtools faidx hifiasm.fasta
2. Reading alignments from SAM file
minimap2 -ax map-hifi -t 120 hifiasm.fasta ../hifi/all.hifi.fasta | samtools view -@ 40 -h -F 0x100 >in.sam
3. Report clipped alignments containing target motifs
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam| samtools sort -@ 60 > out.motifs.bam
4. Extract clipped reads
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam | teloclip-extract --refIdx hifiasm.fasta.fai --extractReads --extractDir SplitOverhangs
but I am not familiar with the use of miniasm. the result is empty !
minimap2 -t 60 -x ava-pb B01_L.fasta no-gap.id_B01.fasta > telo-read.paf
miniasm -f no-gap.id_B01.fasta telo-read.paf > telo-min.gfa
awk '/^S/{print ">"$2"\n"$3}' telo-min.gfa | seqkit seq > telo-min.fasta
Could you give me some suggestions? Thank you very much.
hi,
I just found this tool and I was trying to fill telomere to our assembly.
I just installed this tool by conda as you wrote
conda install -c bioconda teloclip
then I tried use the test data like:
[$]> lh> samtools view -h test.bam | teloclip --ref test.fna.fai | samtools sort > out.bam
total 180K
-rw-r--r-- 1 pliu ebio 17K Nov 1 13:28 test.bam
-rw-r--r-- 1 pliu ebio 336 Nov 1 13:28 test.bam.bai
-rw-r--r-- 1 pliu ebio 5.4K Nov 1 13:28 test.fna
-rw-r--r-- 1 pliu ebio 95 Nov 1 13:28 test.fna.fai
-rw-r--r-- 1 pliu ebio 141K Nov 1 13:28 test.sam
[$]
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
samtools sort: failed to read header from "-";
then I tried run another command:
[$]~> samtools view -h test.bam | teloclip --ref test.fna.fai --motifs TTAGGG | teloclip-extract --refIdx test.fna.fai --extractReads --extractDir SplitOverhangs
Importing reference contig info from: test.fna.fai
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip-extract", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_extract.py", line 283, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
contigInfo = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
do you have any idea about this issue?
Best, PF
Dear Adam,
I hope you are doing well. I just got to know about this tool to find telomers in assembled genomes. I Am trying to run this on one of my assembled genomes but the documentation in readme file is confusing me.
I also checked a previously opened issue #20 by @cyycyj but stat also increased some confusions so I will try to put all confusions in one go so yu can help.
As from the Documentation I see 3 major steps to follow
samtools faidx ref.fa
minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam
Step by step as mentioned by @cyycyj in #20
1. minimap2 -ax map-ont ref.fa pacbio.fq.gz > mapped.sam
2. samtools view -h -F 0x100 mapped.sam > FILTERED.SAM
3. teloclip --refIdx ref.fa.fai filtered.sam > teloclip_filtered.sam
5. samtools sort teloclip_filtered.sam -o TELOCLIP_FILTERED.BAM
samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam
Step by step as mentioned by @cyycyj in #20
1. samtools view -h in.bam > header.sam #convert bam to sam including header information
2. teloclip --ref ref.fa.fai --motifs TTAGGG header.sam > teloclip_TTAGGG_overhangs.sam
3. samtools sort teloclip_TTAGGG_overhangs.sam > teloclip_TTAGGG_overhangs.bam
QUESTION: This in.bam
is causig confusion. as @cyycyj had asked already in #20
in.bam
will be the output from step 2 being the FILTERED.SAM (after sam2bam conversion)
file ?in.bam
will be the final output from step 2 being the TELOCLIP_FILTERED.BAM
file ?QUESTION regarding Matching noisy target motifs*
when using the --fuzzy
option with teloclip, i get error teloclip: error: unrecognized arguments: --fuzzy
samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs
QUESTION: here the in.bam
will be which of the following ?
in.bam
will be the output from step 2 being the FILTERED.SAM (after sam2bam conversion)
file ?in.bam
will be the final output from step 2 being the TELOCLIP_FILTERED.BAM
file ?in.bam
will be the final output from step 3 being the TELOCLIP_TTAGGG_OVERHANGS.BAM
file ?" the bam
or sam
that gets passed to teloclip should always be either raw aligments from an alignment tool like minimap2
OR alignments that have been filtered with samtools view
to remove low quality alignments."
This is point of confusion, So in both step 2,3,4 the in.bam
file should be the either the raw alignment file mapped.sam
or filteres alignment file FILTERED.SAM
which are generated in step2
. can you please shed some light on this.
all these steps should be executed to run teloclip ??
Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam
Step3: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam
Step4: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs
Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs
If you are able to help with the mwntioned questions it will be great help.
Best regards
Hi,
I am wondering is there any way (not the manual checking) to extend the contigs according to the telomere overhangs identified?
Thx
Add support for plain text reporting of alignments of interest.
Aim: Summary file that can be awk'd to list all OHs at one end of a given contig for visual inspection.
Current alignment length is taken as "M" length adjacent to a clipped end "S".
Most long read alignments have more complex alignments with many indels. This leads to underestimation of alignment length and causes teloclip to miss most 3' overhangs.
Task: Explicitly sum all indel and alignment blocks from CIGAR string when reporting alignment length.
When installing and running teloclip through conda (Python v 3.11.5), I received the following error:
Traceback (most recent call last):
File "/home/sk893857/miniconda3/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
note: This appears to be an invalid file read mode in Python 3. I solved this by manually replacing all instances of 'rU' with 'r' in both scripts. Maybe the conda version of teloclip needs a fix?
main app tries to import isMotifInClip() which has been removed. Add replacement func.
In the meantime v0.0.4 from PyPi still works fine.
Existing modules:
New module:
Tasks:
teloclip-extract
will write reads with terminal clipped-overhangs to fasta files. To aide in sanity checks we should include some metadata to the fasta header.
Proposed fields:
Other behaviour changes:
Add option to report n telo motifs within x bases of a contig end using fasta input. Output annotations + text summary.
Continuing from issue #13. Update codebase to improve readability.
Tasks include:
Add typehints
Update logging
Add Docstrings
Split long functions into sub-functions
A general refresh of the package is needed.
Phase 1 - Code cleanup and packaging
Phase 2 - Readability
Motif search currently requires at least one instance of any user-specified motif in a read that is clipped at a contig end. This match can occur anywhere in the read, and may not necessarily be in the distal clipped region.
Add option to restrict motif search to the clipped end (or ends for long reads that span entire contigs).
Hi, @Adamtaranto , very nice tool!
Is it possible to add a new functionality that extracts reads soft-clipped at NNN gaps in scaffolds? This may be useful to reconstruct centromeric sequences.
Best regards,
Jun
Motif runs in clipped regions of raw ONT reads may contain homopolymer errors where the number of sequential repeated bases is incorrect. Currently teloclip
has an option --noPoly
which compresses homopolymer runs in both the motif and the read befor looking for matches. However, this is likely to reduce specificity and increase false positive matches.
Proposal: Replace homopolymer compression with a regex based fuzzy search method.
Steps:
--noPoly
to be removed in future major release.Should default behaviour be exact match or fuzzy search?
Hi Adam,
love the tool, but came across something strange.
I ran the tool like this:
minimap2 -ax map-ont P9424_final.fasta ../P9424.correctedReads.fasta.gz | teloclip --ref P9424_final.fasta.fai | samtools sort > P9424_teloclip.bam
When I look at the bam files I get a lot of these NNNNNNN reads as in the below image mapping to contig ends.
But when I go and look that read name up in the actual fasta reads file, they are proper reads without any Ns.
Any idea where that might come from?
Cheers
Dear Adam,
It is amazing to find such a great tool for telomere anchoring! But I would like to say some expression you mentioned in the tutorial is blurred, especially the input and output file (like in.sam, out.sam and out.bam...). If you can clarify them, I would appreciate it. By the way, do you have any recommendations about citing your work, just github link or you have published it?
Thanks
May need to have isClipMotif() function check if the current alignment is on the negative strand before extracting sequence for motif search.
Assumptions:
If above are true, then 3' overhang seq for 1000M500S should be retrieved as:
and for a 5' alignment 500S1000M:
Note: If printing overhang sequence to a summary file we should RC the extracted OH for neg strand alignments.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.