Coder Social home page Coder Social logo

open2c / pairtools Goto Github PK

View Code? Open in Web Editor NEW
96.0 11.0 32.0 3.33 MB

Extract 3D contacts (.pairs) from sequencing alignments

License: MIT License

Python 96.75% Makefile 0.18% Cython 3.07%
3d-genome bioinformatics file-formatter hi-c ngs python pairs-file

pairtools's People

Contributors

agalitsyna avatar gfudenberg avatar golobor avatar hbbrandao avatar itsameerkat avatar nvictus avatar phlya avatar sergpolly avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pairtools's Issues

pairsamtools merge fails when symbolic links are provided as an input

Here is an example:
pairsamtools merge pairsam_chunk* --nproc 2 -o MATa_R2.lane1.pairsam.gz
where
pairsam_chunk1 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.01.pairsam.gz pairsam_chunk2 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.00.pairsam.gz
fail with the message:

...
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_merge.py", line 59, in merge_py
      h, _ = _headerops.get_header(f)
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/_headerops.py", line 38, in get_header
    for line in instream:
File "/home/polzovatel/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

While running

pairsamtools merge /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.0* --nproc 2 -o MATa_R2.lane1.pairsam.gz

yields non-empty MATa_R2.lane1.pairsam.gz

multimer compatability?

maybe not worth thinking about... but maybe it is? it looks like the pairsam should be flexible enough to extend to the case of things like c-walks?

pairsamtools merge single file no action?

Wouldn't it be a good approach not to do anything with an input file, in case pairsamtools merge is provided with only a single file ?

  • Right now it's up to distiller to check if it's a single-file input. Single files are treated differently - they're copied by distiller and are left untouched.

  • pairsamtools merge however does something to the pairsam header, even if it's just a single file and there is nothing to merge. Modified headers cause downstream issues in case of hierarchical merge implemented in distiller. Dealing with that in pairsamtools merge would make distiller code cleaner.

optical dedup stats

Enhancement:

readid contains info about spatial position of the physical read (short DNA molecule) on a sequencing flowcell.
This information can be used to find out how many of the observed duplicates are originating from PCR preceding sequencing step, and how many have been formed in the flowcell due to non-ideal conditions/read migration etc.

Proposition:

we can add an onlineDeduplicator-style buffer in the stats module that would keep track of incoming read-pairs and would start accumulating them as soon as it hits a DD-one, until the next one in the buffer in non-DD-one:

rid  c1   c2   p1   p2   s1   s2  pt  index
.    .    .    .    .    .    .   LL    1
.    .    .    .    .    .    .   DD    2
.    .    .    .    .    .    .   DD    3
.    .    .    .    .    .    .   DD    4
.    .    .    .    .    .    .   DD    5
.    .    .    .    .    .    .   LL    6

so with the input as above the content of the buffer would include pairs with index from 1-5 (including).

read_id (rid) column of that buffer can be further used to cluster pairs by their X,Y - position on the flowcell and to generate corresponding statistics.
We could use the read-clustering algorithm developed by Betul @betulakgol in the lab, and would have to optional because it is not really known, if readid of "any" sequencer would have such information and if it is in the same exact format.

Software:

from coding perspective, we would only need to make stats.add_pair(c1,p1,s1,c2,p2,s2,pt) to accept a read-id as well stats.add_pair(rid,c1,p1,s1,c2,p2,s2,pt) (either permanently or optionally - how? dunno ...) and implement that onlineDedup-style buffer inside the stats, so that everyhting will be nice and hidden from the dedup perspective.

stats are becoming pretty heavy at that point, and some Cythonization might be applied to it (split stats into _stats.pyx with the class and methods and pairsamtools_stats.py - a stand alone script capable of doing stats).

Seems like a doable and not very intrusive thing, which I already discussed with @hakanozadam,
@nvictus, @golobor - what do you think?

Versioning Pairsamtools

I think it would be nice to start versioning pairsamtools. I am seeing differences in functions and APIs between the versions of pairsamtools from May and December 2017. With versioning, we can impose a particular version on tools & pipelines using pairsamtools. Right now, in pairsamtools, I see

__version__ = '0.0.1-dev'

calculate Pc(s) in stats

we can calculate Pc(s) during stats computation. We'd need chromsizes (can take from pairs header, or supply an external file) and centromeres/gaps (can take from a curated database/online/external file).
via @Phlya

`parse_sort` takes forever when reads mapped to scaffold

This is just a preliminary thing - more of a question rather than issue:
In case we have a 30'000 pieces scaffold instead of a genome with ~20 well defined chromosomes, would you guys expect any "delays" with the parsing step?

In our experience it took more than a day on 100' million read chunks, and after that we killed it. Same step takes minutes for hg19 and ~10-30 million reads chunks.

We'll try to debug this, but in the meantime, maybe you guys could provide some extra insight into this?

adding stats to dedup

In order to address open2c/distiller-nf#80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Techincal Qs:

  • should we parse line_buffer[i] right next to where we're writing it to outstream?
  • should we instead activate cols_buffer[i] (used for mark duplicates) for stats and skip parsing line_buffer[i]?
  • stats-related: should we change stats-API, and instead of out_stat.add_pair(algn2, algn1, pair_type), say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}?
    Such an API change would simplify stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78
    It was written the way it is right now, in order to please and simplify parse-code, i.e. avoid unpacking align dictionaries there.

What do you @golobor , @nvictus guys think ?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.