adamtaranto / teloclip Goto Github PK

View Code? Open in Web Editor NEW

35.0 5.0 4.0 839 KB

A tool for the recovery of unassembled telomeres from soft-clipped read alignments.

License: Other

Python 97.74% Dockerfile 2.26%

bioinformatics genome-assembly telomere telomere-length telomeres telomeric

teloclip's Issues

Interactive Contig extension

Add module for interactive extension of contigs.

Take teloclip filtered reads as input
For each contig end display overhangs
Prompt user to accept an extension, or
Error correct + assemble reads and realign to ref contig

Note: Need to write wrappers for racon, minimap2, and miniasm.

Add rev complement motif search

Add as default behaviour - search given motif and its reverse complement.

Could you explain in detail how to extend reads?

Dear autor：
I am assembling a plant genome, and now it is assembled to the level without gap. It have 21 chromosomes, but so far only 39 have been counted, and there are 3 motifs (AAACCCT) without telomeres.
Seeing the tools you developed, I think it will save a lot of manual work. I have tried step by step,
1. First index the reference assembly
samtools faidx hifiasm.fasta
2. Reading alignments from SAM file
minimap2 -ax map-hifi -t 120 hifiasm.fasta ../hifi/all.hifi.fasta | samtools view -@ 40 -h -F 0x100 >in.sam
3. Report clipped alignments containing target motifs
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam| samtools sort -@ 60 > out.motifs.bam
4. Extract clipped reads
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam | teloclip-extract --refIdx hifiasm.fasta.fai --extractReads --extractDir SplitOverhangs

but I am not familiar with the use of miniasm. the result is empty !

minimap2 -t 60 -x ava-pb B01_L.fasta no-gap.id_B01.fasta > telo-read.paf 
miniasm -f no-gap.id_B01.fasta telo-read.paf > telo-min.gfa 
awk '/^S/{print ">"$2"\n"$3}' telo-min.gfa | seqkit seq > telo-min.fasta

Could you give me some suggestions? Thank you very much.

No 'rU' option in Python 3.12

hi,
I just found this tool and I was trying to fill telomere to our assembly.
I just installed this tool by conda as you wrote
conda install -c bioconda teloclip
then I tried use the test data like:
[$]> lh
total 180K
-rw-r--r-- 1 pliu ebio 17K Nov 1 13:28 test.bam
-rw-r--r-- 1 pliu ebio 336 Nov 1 13:28 test.bam.bai
-rw-r--r-- 1 pliu ebio 5.4K Nov 1 13:28 test.fna
-rw-r--r-- 1 pliu ebio 95 Nov 1 13:28 test.fna.fai
-rw-r--r-- 1 pliu ebio 141K Nov 1 13:28 test.sam
[$]> samtools view -h test.bam | teloclip --ref test.fna.fai | samtools sort > out.bam
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
samtools sort: failed to read header from "-";
then I tried run another command:
[$]~> samtools view -h test.bam | teloclip --ref test.fna.fai --motifs TTAGGG | teloclip-extract --refIdx test.fna.fai --extractReads --extractDir SplitOverhangs
Importing reference contig info from: test.fna.fai
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip-extract", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_extract.py", line 283, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
contigInfo = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
do you have any idea about this issue?
Best, PF

Execution of Teloclip and input output streams

Dear Adam,

I hope you are doing well. I just got to know about this tool to find telomers in assembled genomes. I Am trying to run this on one of my assembled genomes but the documentation in readme file is confusing me.
I also checked a previously opened issue #20 by @cyycyj but stat also increased some confusions so I will try to put all confusions in one go so yu can help.
As from the Documentation I see 3 major steps to follow

Step1: First index the reference assembly

samtools faidx ref.fa

Step2: Streaming SAM records from aligner

minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam

Step by step as mentioned by @cyycyj in #20

1. minimap2 -ax map-ont ref.fa pacbio.fq.gz > mapped.sam
2. samtools view -h -F 0x100 mapped.sam > FILTERED.SAM
3. teloclip --refIdx ref.fa.fai filtered.sam > teloclip_filtered.sam
5. samtools sort teloclip_filtered.sam -o TELOCLIP_FILTERED.BAM

Step 3: Report clipped alignments containing target motifs

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam

Step by step as mentioned by @cyycyj in #20

1. samtools view -h in.bam > header.sam    #convert bam to sam including header information
2. teloclip --ref ref.fa.fai --motifs TTAGGG header.sam > teloclip_TTAGGG_overhangs.sam
3. samtools sort teloclip_TTAGGG_overhangs.sam > teloclip_TTAGGG_overhangs.bam

QUESTION: This in.bam is causig confusion. as @cyycyj had asked already in #20

This in.bam will be the output from step 2 being the FILTERED.SAM (after sam2bam conversion) file ?
This in.bam will be the final output from step 2 being the TELOCLIP_FILTERED.BAM file ?

QUESTION regarding Matching noisy target motifs*
when using the --fuzzy option with teloclip, i get error teloclip: error: unrecognized arguments: --fuzzy

Step 4: Extract clipped reads

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

QUESTION: here the in.bam will be which of the following ?

This in.bam will be the output from step 2 being the FILTERED.SAM (after sam2bam conversion) file ?
This in.bam will be the final output from step 2 being the TELOCLIP_FILTERED.BAM file ?
This in.bam will be the final output from step 3 being the TELOCLIP_TTAGGG_OVERHANGS.BAM file ?

Major Confusions 1 in #20 you answered as

" the bam or sam that gets passed to teloclip should always be either raw aligments from an alignment tool like minimap2 OR alignments that have been filtered with samtools view to remove low quality alignments."

This is point of confusion, So in both step 2,3,4 the in.bam file should be the either the raw alignment file mapped.sam or filteres alignment file FILTERED.SAM which are generated in step2. can you please shed some light on this.

Major Confusions 2

all these steps should be executed to run teloclip ??

Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam
Step3: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam
Step4: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

OR step 2 3 4 are just different ways to run teloclip and can be combined into one step as below ?

Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

If you are able to help with the mwntioned questions it will be great help.
Best regards

Automatically extend contigs

Hi,
I am wondering is there any way (not the manual checking) to extend the contigs according to the telomere overhangs identified?

Thx

Automate testing

Add test cases for core functions.
Add github action to run test on PRs

Summary report option

Add support for plain text reporting of alignments of interest.

Option to give a summary output file name
For each alignment: Read ID, bitflag, contig ID, contig start, contig end, contig OH location (left/right), OH seq
Add padding to internal (contig side) boundary of soft-clip if the alignment does not end exactly at contig end
Add option to flip left-end extension (i.e report left OH as RC) to help with spotting approximate alignments by eye.
Write two lines for alignments that overhang both ends of a small contig.

Aim: Summary file that can be awk'd to list all OHs at one end of a given contig for visual inspection.

Calculate alignment end position from CIGAR

Current alignment length is taken as "M" length adjacent to a clipped end "S".

Most long read alignments have more complex alignments with many indels. This leads to underestimation of alignment length and causes teloclip to miss most 3' overhangs.

Task: Explicitly sum all indel and alignment blocks from CIGAR string when reporting alignment length.

error "invalid mode: 'rU' " in teloclip (v. bioconda)

When installing and running teloclip through conda (Python v 3.11.5), I received the following error:

Traceback (most recent call last):
File "/home/sk893857/miniconda3/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'

note: This appears to be an invalid file read mode in Python 3. I solved this by manually replacing all instances of 'rU' with 'r' in both scripts. Maybe the conda version of teloclip needs a fix?

Missing isMotifInClip()

main app tries to import isMotifInClip() which has been removed. Add replacement func.

In the meantime v0.0.4 from PyPi still works fine.

Feature: Align and Extend

Existing modules:

Base Teloclip extracts any reads that are soft-clipped at contig ends (optionally checking for telomeric motifs)
Teloclip-extract: bins reads from teloclip into output files per contig end (left / right for each contig)

New module:

Teloclip-extend: Extend a contig with the soft-clipped overhang of a single aligned read or contig.

Tasks:

Write function that takes a single aligned read and extends a contig with the overhanging (soft-clipped) segment of the alignment.
Handle cases where alignment is clipped at both ends of contig
Update argparser to use submodule keywords [filter, extract, extend]

Add metadata to extracted overhang reads

teloclip-extract will write reads with terminal clipped-overhangs to fasta files. To aide in sanity checks we should include some metadata to the fasta header.

Proposed fields:

Anchor len: how much of the read is aligned to the ref sequence
Overhang len: how long is the overhang
Ref name: which sequence is it aligned to
Motifs : Motifs of regex patterns used in filtering (comma delim list)
Motif counts: Total count of motif matches in overhang (list matching order of motifs)

Other behaviour changes:

Sort fasta by overhang length
Include reference segment in output fasta (from earliest alignment start coord)
Log total read counts for each end of each reference sequence.
Print histogram of overhang depths
Log warning if unbalanced overhangs (i.e. most ends have 5 and one has 500 reads)

Find existing telo repeats in contigs

Add option to report n telo motifs within x bases of a contig end using fasta input. Output annotations + text summary.

Improve readability

Continuing from issue #13. Update codebase to improve readability.

Tasks include:

Add typehints
Update logging
Add Docstrings
Split long functions into sub-functions

Package Distribution

Add github action to package and push to pypi on new release
Update Bioconda version

Refresh codebase

A general refresh of the package is needed.

Phase 1 - Code cleanup and packaging

Restructure package
Migrate packaging to pyproject toml
Add codespace + gitpod config files

Phase 2 - Readability

Add typehints
Update logging
Add Docstrings

Restrict motif search to clipped region

Motif search currently requires at least one instance of any user-specified motif in a read that is clipped at a contig end. This match can occur anywhere in the read, and may not necessarily be in the distal clipped region.

Add option to restrict motif search to the clipped end (or ends for long reads that span entire contigs).

functionality to extract reads at NNN gaps in scaffolds?

Hi, @Adamtaranto , very nice tool!

Is it possible to add a new functionality that extracts reads soft-clipped at NNN gaps in scaffolds? This may be useful to reconstruct centromeric sequences.

Best regards,
Jun

Feature: Fuzzy motif search

Motif runs in clipped regions of raw ONT reads may contain homopolymer errors where the number of sequential repeated bases is incorrect. Currently teloclip has an option --noPoly which compresses homopolymer runs in both the motif and the read befor looking for matches. However, this is likely to reduce specificity and increase false positive matches.

Proposal: Replace homopolymer compression with a regex based fuzzy search method.

Steps:

Convert input motif to regex where all runs of > 1 base allow for a +/- 1 range.
Search given sequence with pattern
Return count of non-overlapping matches
Add noisy reporting of motif counts per readname in L/R end of contigname
Add Warning for depreciated --noPoly to be removed in future major release.

Should default behaviour be exact match or fuzzy search?

NNNNN reads

Hi Adam,
love the tool, but came across something strange.

I ran the tool like this:

minimap2 -ax map-ont P9424_final.fasta ../P9424.correctedReads.fasta.gz | teloclip --ref P9424_final.fasta.fai  | samtools sort > P9424_teloclip.bam

When I look at the bam files I get a lot of these NNNNNNN reads as in the below image mapping to contig ends.
But when I go and look that read name up in the actual fasta reads file, they are proper reads without any Ns.
Any idea where that might come from?

Cheers

Confusing about input and output file

Dear Adam,

It is amazing to find such a great tool for telomere anchoring! But I would like to say some expression you mentioned in the tutorial is blurred, especially the input and output file (like in.sam, out.sam and out.bam...). If you can clarify them, I would appreciate it. By the way, do you have any recommendations about citing your work, just github link or you have published it?

Thanks

Verify: Correct read slice coords for negative strand alignments

May need to have isClipMotif() function check if the current alignment is on the negative strand before extracting sequence for motif search.

Assumptions:

Align coords are reported in fwd orientation from leftmost ref position.
CIGAR coords are reported for ref fwd orientation (left to right on ref)
Read string is not altered from raw read (may be reversed relative to CIGAR coords)

If above are true, then 3' overhang seq for 1000M500S should be retrieved as:

If forward alignment: Seq[-500:]
If reverse alignment: Seq[0:500]

and for a 5' alignment 500S1000M:

If forward alignment: Seq[0:500]
If reverse alignment: Seq[-500:]

Note: If printing overhang sequence to a summary file we should RC the extracted OH for neg strand alignments.