Coder Social home page Coder Social logo

adamtaranto / teloclip Goto Github PK

View Code? Open in Web Editor NEW
33.0 5.0 3.0 839 KB

A tool for the recovery of unassembled telomeres from soft-clipped read alignments.

License: Other

Python 97.74% Dockerfile 2.26%
bioinformatics genome-assembly telomere telomere-length telomeres telomeric

teloclip's Issues

NNNNN reads

Hi Adam,
love the tool, but came across something strange.

I ran the tool like this:

minimap2 -ax map-ont P9424_final.fasta ../P9424.correctedReads.fasta.gz | teloclip --ref P9424_final.fasta.fai  | samtools sort > P9424_teloclip.bam

When I look at the bam files I get a lot of these NNNNNNN reads as in the below image mapping to contig ends.
But when I go and look that read name up in the actual fasta reads file, they are proper reads without any Ns.
Any idea where that might come from?

Cheers

image

Summary report option

Add support for plain text reporting of alignments of interest.

  • Option to give a summary output file name
  • For each alignment: Read ID, bitflag, contig ID, contig start, contig end, contig OH location (left/right), OH seq
  • Add padding to internal (contig side) boundary of soft-clip if the alignment does not end exactly at contig end
  • Add option to flip left-end extension (i.e report left OH as RC) to help with spotting approximate alignments by eye.
  • Write two lines for alignments that overhang both ends of a small contig.

Aim: Summary file that can be awk'd to list all OHs at one end of a given contig for visual inspection.

Package Distribution

  • Add github action to package and push to pypi on new release
  • Update Bioconda version

Confusing about input and output file

Dear Adam,

It is amazing to find such a great tool for telomere anchoring! But I would like to say some expression you mentioned in the tutorial is blurred, especially the input and output file (like in.sam, out.sam and out.bam...). If you can clarify them, I would appreciate it. By the way, do you have any recommendations about citing your work, just github link or you have published it?

Thanks

No 'rU' option in Python 3.12

hi,
I just found this tool and I was trying to fill telomere to our assembly.
I just installed this tool by conda as you wrote
conda install -c bioconda teloclip
then I tried use the test data like:
[$]> lh
total 180K
-rw-r--r-- 1 pliu ebio 17K Nov 1 13:28 test.bam
-rw-r--r-- 1 pliu ebio 336 Nov 1 13:28 test.bam.bai
-rw-r--r-- 1 pliu ebio 5.4K Nov 1 13:28 test.fna
-rw-r--r-- 1 pliu ebio 95 Nov 1 13:28 test.fna.fai
-rw-r--r-- 1 pliu ebio 141K Nov 1 13:28 test.sam
[$]
> samtools view -h test.bam | teloclip --ref test.fna.fai | samtools sort > out.bam
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
samtools sort: failed to read header from "-";
then I tried run another command:
[$]~> samtools view -h test.bam | teloclip --ref test.fna.fai --motifs TTAGGG | teloclip-extract --refIdx test.fna.fai --extractReads --extractDir SplitOverhangs
Importing reference contig info from: test.fna.fai
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip", line 10, in
Traceback (most recent call last):
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/bin/teloclip-extract", line 10, in
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
sys.exit(main())
^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/run_extract.py", line 283, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
contigInfo = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ebio/abt5_projects/3D_genome_of_brown_algae/envs/hic/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'
do you have any idea about this issue?
Best, PF

Calculate alignment end position from CIGAR

Current alignment length is taken as "M" length adjacent to a clipped end "S".

Most long read alignments have more complex alignments with many indels. This leads to underestimation of alignment length and causes teloclip to miss most 3' overhangs.

Task: Explicitly sum all indel and alignment blocks from CIGAR string when reporting alignment length.

Feature: Align and Extend

Existing modules:

  • Base Teloclip extracts any reads that are soft-clipped at contig ends (optionally checking for telomeric motifs)
  • Teloclip-extract: bins reads from teloclip into output files per contig end (left / right for each contig)

New module:

  • Teloclip-extend: Extend a contig with the soft-clipped overhang of a single aligned read or contig.

Tasks:

  • Write function that takes a single aligned read and extends a contig with the overhanging (soft-clipped) segment of the alignment.
  • Handle cases where alignment is clipped at both ends of contig
  • Update argparser to use submodule keywords [filter, extract, extend]

Could you explain in detail how to extend reads?

Dear autor:
I am assembling a plant genome, and now it is assembled to the level without gap. It have 21 chromosomes, but so far only 39 have been counted, and there are 3 motifs (AAACCCT) without telomeres.
Seeing the tools you developed, I think it will save a lot of manual work. I have tried step by step,
1. First index the reference assembly
samtools faidx hifiasm.fasta
2. Reading alignments from SAM file
minimap2 -ax map-hifi -t 120 hifiasm.fasta ../hifi/all.hifi.fasta | samtools view -@ 40 -h -F 0x100 >in.sam
3. Report clipped alignments containing target motifs
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam| samtools sort -@ 60 > out.motifs.bam
4. Extract clipped reads
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam | teloclip-extract --refIdx hifiasm.fasta.fai --extractReads --extractDir SplitOverhangs

but I am not familiar with the use of miniasm. the result is empty !

minimap2 -t 60 -x ava-pb B01_L.fasta no-gap.id_B01.fasta > telo-read.paf 
miniasm -f no-gap.id_B01.fasta telo-read.paf > telo-min.gfa 
awk '/^S/{print ">"$2"\n"$3}' telo-min.gfa | seqkit seq > telo-min.fasta 

Could you give me some suggestions? Thank you very much.

Automate testing

  • Add test cases for core functions.
  • Add github action to run test on PRs

error "invalid mode: 'rU' " in teloclip (v. bioconda)

When installing and running teloclip through conda (Python v 3.11.5), I received the following error:

Traceback (most recent call last):
File "/home/sk893857/miniconda3/bin/teloclip", line 10, in
sys.exit(main())
^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/run_self.py", line 44, in main
ContigDict = teloclip.read_fai(args.refIdx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sk893857/miniconda3/lib/python3.11/site-packages/teloclip/init.py", line 247, in read_fai
with open(path, "rU") as f:
^^^^^^^^^^^^^^^^
ValueError: invalid mode: 'rU'


note: This appears to be an invalid file read mode in Python 3. I solved this by manually replacing all instances of 'rU' with 'r' in both scripts. Maybe the conda version of teloclip needs a fix?

Feature: Fuzzy motif search

Motif runs in clipped regions of raw ONT reads may contain homopolymer errors where the number of sequential repeated bases is incorrect. Currently teloclip has an option --noPoly which compresses homopolymer runs in both the motif and the read befor looking for matches. However, this is likely to reduce specificity and increase false positive matches.

Proposal: Replace homopolymer compression with a regex based fuzzy search method.

Steps:

  • Convert input motif to regex where all runs of > 1 base allow for a +/- 1 range.
  • Search given sequence with pattern
  • Return count of non-overlapping matches
  • Add noisy reporting of motif counts per readname in L/R end of contigname
  • Add Warning for depreciated --noPoly to be removed in future major release.

Should default behaviour be exact match or fuzzy search?

Automatically extend contigs

Hi,
I am wondering is there any way (not the manual checking) to extend the contigs according to the telomere overhangs identified?

Thx

Restrict motif search to clipped region

Motif search currently requires at least one instance of any user-specified motif in a read that is clipped at a contig end. This match can occur anywhere in the read, and may not necessarily be in the distal clipped region.

Add option to restrict motif search to the clipped end (or ends for long reads that span entire contigs).

Missing isMotifInClip()

main app tries to import isMotifInClip() which has been removed. Add replacement func.

In the meantime v0.0.4 from PyPi still works fine.

Verify: Correct read slice coords for negative strand alignments

May need to have isClipMotif() function check if the current alignment is on the negative strand before extracting sequence for motif search.

Assumptions:

  • Align coords are reported in fwd orientation from leftmost ref position.
  • CIGAR coords are reported for ref fwd orientation (left to right on ref)
  • Read string is not altered from raw read (may be reversed relative to CIGAR coords)

If above are true, then 3' overhang seq for 1000M500S should be retrieved as:

  • If forward alignment: Seq[-500:]
  • If reverse alignment: Seq[0:500]

and for a 5' alignment 500S1000M:

  • If forward alignment: Seq[0:500]
  • If reverse alignment: Seq[-500:]

Note: If printing overhang sequence to a summary file we should RC the extracted OH for neg strand alignments.

Interactive Contig extension

Add module for interactive extension of contigs.

  • Take teloclip filtered reads as input
  • For each contig end display overhangs
  • Prompt user to accept an extension, or
  • Error correct + assemble reads and realign to ref contig

Note: Need to write wrappers for racon, minimap2, and miniasm.

Refresh codebase

A general refresh of the package is needed.

Phase 1 - Code cleanup and packaging

  • Restructure package
  • Migrate packaging to pyproject toml
  • Add codespace + gitpod config files

Phase 2 - Readability

  • Add typehints
  • Update logging
  • Add Docstrings

Add metadata to extracted overhang reads

teloclip-extract will write reads with terminal clipped-overhangs to fasta files. To aide in sanity checks we should include some metadata to the fasta header.

Proposed fields:

  • Anchor len: how much of the read is aligned to the ref sequence
  • Overhang len: how long is the overhang
  • Ref name: which sequence is it aligned to
  • Motifs : Motifs of regex patterns used in filtering (comma delim list)
  • Motif counts: Total count of motif matches in overhang (list matching order of motifs)

Other behaviour changes:

  • Sort fasta by overhang length
  • Include reference segment in output fasta (from earliest alignment start coord)
  • Log total read counts for each end of each reference sequence.
  • Print histogram of overhang depths
  • Log warning if unbalanced overhangs (i.e. most ends have 5 and one has 500 reads)

Improve readability

Continuing from issue #13. Update codebase to improve readability.

Tasks include:

Add typehints
Update logging
Add Docstrings
Split long functions into sub-functions

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.