The pairtools from open2c

pairsamtools merge fails when symbolic links are provided as an input

Here is an example:
pairsamtools merge pairsam_chunk* --nproc 2 -o MATa_R2.lane1.pairsam.gz
where
pairsam_chunk1 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.01.pairsam.gz pairsam_chunk2 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.00.pairsam.gz
fail with the message:

...
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_merge.py", line 59, in merge_py
      h, _ = _headerops.get_header(f)
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/_headerops.py", line 38, in get_header
    for line in instream:
File "/home/polzovatel/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

While running

pairsamtools merge /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.0* --nproc 2 -o MATa_R2.lane1.pairsam.gz

yields non-empty MATa_R2.lane1.pairsam.gz

add chrom.sizes to pairs header

multimer compatability?

maybe not worth thinking about... but maybe it is? it looks like the pairsam should be flexible enough to extend to the case of things like c-walks?

in merge, add an option to take a header from the first file and ignore other headers

pairsamtools merge single file no action?

Wouldn't it be a good approach not to do anything with an input file, in case pairsamtools merge is provided with only a single file ?

Right now it's up to distiller to check if it's a single-file input. Single files are treated differently - they're copied by distiller and are left untouched.
pairsamtools merge however does something to the pairsam header, even if it's just a single file and there is nothing to merge. Modified headers cause downstream issues in case of hierarchical merge implemented in distiller. Dealing with that in pairsamtools merge would make distiller code cleaner.

add continuous integration

Trim column list in header after split

After splitting off SAMs from pairs the #columns header still reads:

#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type sam1 sam2

make a conda recipe

dedup should check that the read pairs are dedupped and do not contain unmapped reads

maybe make dedup skip unmapped reads?

in stats, report cis_1kb+, cis_3kb+, cis_10kb+, cis_30kb+

use line.rstrip() to avoid bugs

line might not end with newline symbol, so strip it instead of line[:-1] ...

https://github.com/mirnylab/pairsamtools/blob/cd269671cc493819b68c021a2b4300abc1f34bdc/pairsamtools/pairsam_select.py#L138

in stats, split dist freq by distance first, directionality second

default to bgzip if pbgzip is absent

add an option to use other compressors for input/output (lz4, etc)

split _common into _io and _pairsam_format

allow specifying column order for sorting and other tools

merge pairsam-related tools into pairsam tools

-- use click

in merge, add an option to concatenate pairsam files instead of merging sorted files

in parse, add other options to store (CIGAR, genomic span, NM, AS, XA, etc)

optical dedup stats

Enhancement:

readid contains info about spatial position of the physical read (short DNA molecule) on a sequencing flowcell.
This information can be used to find out how many of the observed duplicates are originating from PCR preceding sequencing step, and how many have been formed in the flowcell due to non-ideal conditions/read migration etc.

Proposition:

we can add an onlineDeduplicator-style buffer in the stats module that would keep track of incoming read-pairs and would start accumulating them as soon as it hits a DD-one, until the next one in the buffer in non-DD-one:

rid  c1   c2   p1   p2   s1   s2  pt  index
.    .    .    .    .    .    .   LL    1
.    .    .    .    .    .    .   DD    2
.    .    .    .    .    .    .   DD    3
.    .    .    .    .    .    .   DD    4
.    .    .    .    .    .    .   DD    5
.    .    .    .    .    .    .   LL    6

so with the input as above the content of the buffer would include pairs with index from 1-5 (including).

read_id (rid) column of that buffer can be further used to cluster pairs by their X,Y - position on the flowcell and to generate corresponding statistics.
We could use the read-clustering algorithm developed by Betul @betulakgol in the lab, and would have to optional because it is not really known, if readid of "any" sequencer would have such information and if it is in the same exact format.

Software:

from coding perspective, we would only need to make stats.add_pair(c1,p1,s1,c2,p2,s2,pt) to accept a read-id as well stats.add_pair(rid,c1,p1,s1,c2,p2,s2,pt) (either permanently or optionally - how? dunno ...) and implement that onlineDedup-style buffer inside the stats, so that everyhting will be nice and hidden from the dedup perspective.

stats are becoming pretty heavy at that point, and some Cythonization might be applied to it (split stats into _stats.pyx with the class and methods and pairsamtools_stats.py - a stand alone script capable of doing stats).

Seems like a doable and not very intrusive thing, which I already discussed with @hakanozadam,
@nvictus, @golobor - what do you think?

flip and resort pairs in sort

Versioning Pairsamtools

I think it would be nice to start versioning pairsamtools. I am seeing differences in functions and APIs between the versions of pairsamtools from May and December 2017. With versioning, we can impose a particular version on tools & pipelines using pairsamtools. Right now, in pairsamtools, I see

__version__ = '0.0.1-dev'

We'll try to debug this, but in the meantime, maybe you guys could provide some extra insight into this?

Techincal Qs:

should we parse line_buffer[i] right next to where we're writing it to outstream?
should we instead activate cols_buffer[i] (used for mark duplicates) for stats and skip parsing line_buffer[i]?
stats-related: should we change stats-API, and instead of out_stat.add_pair(algn2, algn1, pair_type), say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}?
Such an API change would simplify stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78
It was written the way it is right now, in order to please and simplify parse-code, i.e. avoid unpacking align dictionaries there.

What do you @golobor , @nvictus guys think ?

open2c / pairtools Goto Github PK

pairtools's People

Contributors

Stargazers

Watchers

Forkers

pairtools's Issues

Enhancement:

Proposition:

Software:

Recommend Projects

Recommend Topics

Recommend Org