open2c / pairtools Goto Github PK
View Code? Open in Web Editor NEWExtract 3D contacts (.pairs) from sequencing alignments
License: MIT License
Extract 3D contacts (.pairs) from sequencing alignments
License: MIT License
Here is an example:
pairsamtools merge pairsam_chunk* --nproc 2 -o MATa_R2.lane1.pairsam.gz
where
pairsam_chunk1 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.01.pairsam.gz pairsam_chunk2 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.00.pairsam.gz
fail with the message:
...
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_merge.py", line 59, in merge_py
h, _ = _headerops.get_header(f)
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/_headerops.py", line 38, in get_header
for line in instream:
File "/home/polzovatel/miniconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
While running
pairsamtools merge /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.0* --nproc 2 -o MATa_R2.lane1.pairsam.gz
yields non-empty MATa_R2.lane1.pairsam.gz
maybe not worth thinking about... but maybe it is? it looks like the pairsam should be flexible enough to extend to the case of things like c-walks?
Wouldn't it be a good approach not to do anything with an input file, in case pairsamtools merge
is provided with only a single file ?
Right now it's up to distiller
to check if it's a single-file input. Single files are treated differently - they're copied by distiller
and are left untouched.
pairsamtools merge
however does something to the pairsam
header, even if it's just a single file and there is nothing to merge. Modified headers cause downstream issues in case of hierarchical merge implemented in distiller
. Dealing with that in pairsamtools merge
would make distiller
code cleaner.
After splitting off SAMs from pairs the #columns
header still reads:
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type sam1 sam2
maybe make dedup skip unmapped reads?
line might not end with newline symbol, so strip it instead of line[:-1] ...
-- use click
readid
contains info about spatial position of the physical read (short DNA molecule) on a sequencing flowcell.
This information can be used to find out how many of the observed duplicates are originating from PCR preceding sequencing step, and how many have been formed in the flowcell due to non-ideal conditions/read migration etc.
we can add an onlineDeduplicator
-style buffer in the stats
module that would keep track of incoming read-pairs and would start accumulating them as soon as it hits a DD
-one, until the next one in the buffer in non-DD
-one:
rid c1 c2 p1 p2 s1 s2 pt index
. . . . . . . LL 1
. . . . . . . DD 2
. . . . . . . DD 3
. . . . . . . DD 4
. . . . . . . DD 5
. . . . . . . LL 6
so with the input as above the content of the buffer would include pairs with index from 1-5 (including).
read_id (rid)
column of that buffer can be further used to cluster pairs by their X
,Y
- position on the flowcell and to generate corresponding statistics.
We could use the read-clustering algorithm developed by Betul @betulakgol in the lab, and would have to optional because it is not really known, if readid
of "any" sequencer would have such information and if it is in the same exact format.
from coding perspective, we would only need to make stats.add_pair(c1,p1,s1,c2,p2,s2,pt)
to accept a read-id
as well stats.add_pair(rid,c1,p1,s1,c2,p2,s2,pt)
(either permanently or optionally - how? dunno ...) and implement that onlineDedup
-style buffer inside the stats, so that everyhting will be nice and hidden from the dedup
perspective.
stats
are becoming pretty heavy at that point, and some Cythonization might be applied to it (split stats
into _stats.pyx
with the class and methods and pairsamtools_stats.py
- a stand alone script capable of doing stats).
Seems like a doable and not very intrusive thing, which I already discussed with @hakanozadam,
@nvictus, @golobor - what do you think?
I think it would be nice to start versioning pairsamtools. I am seeing differences in functions and APIs between the versions of pairsamtools from May and December 2017. With versioning, we can impose a particular version on tools & pipelines using pairsamtools. Right now, in pairsamtools, I see
__version__ = '0.0.1-dev'
we can calculate Pc(s) during stats computation. We'd need chromsizes (can take from pairs header, or supply an external file) and centromeres/gaps (can take from a curated database/online/external file).
via @Phlya
Eg have a flag --col NAME,index that can be repeated multiple types.
This is just a preliminary thing - more of a question rather than issue:
In case we have a 30'000 pieces scaffold instead of a genome with ~20 well defined chromosomes, would you guys expect any "delays" with the parsing step?
In our experience it took more than a day on 100' million read chunks, and after that we killed it. Same step takes minutes for hg19
and ~10-30 million reads chunks.
We'll try to debug this, but in the meantime, maybe you guys could provide some extra insight into this?
In order to address open2c/distiller-nf#80 we agreed to add stats to the dedup
.
And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)
Techincal Qs:
line_buffer[i]
right next to where we're writing it to outstream
?cols_buffer[i]
(used for mark duplicates) for stats and skip parsing line_buffer[i]
?stats
-API, and instead of out_stat.add_pair(algn2, algn1, pair_type)
, say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type)
, where algn={'chrom':c,'pos':p,'strand':s}
?stats
module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78parse
-code, i.e. avoid unpacking align
dictionaries there.Add tests from dedup, merge and stats
Got a failure using system samtools 1.2. Are we using any deprecated samtools invocations?
-- use DCIC pairs header
-- change the order of the columns
-- lump all sams into a single column
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.