Co-occurrence analysis

Scripts to correlate the presence of contigs in samples. Useful to find all segments of segmented viruses.

The script takes an abundance table (contigs as rows, samples as columns) and by default filters out all contigs that are present in fewer than 10% of the samples. The output is a correlation matrix, which is filtered to the preferred correlation coefficient (0.3 by default) and converted to a pairwise dataframe of all contigs above the correlation threshold.
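The steps described above can be sketched with pandas and numpy. This is a minimal illustration, not the script's actual code: the function name `pairwise_correlations`, its parameter names, and the use of Pearson correlation (the pandas default) are assumptions, since the README does not state which correlation method is used.

```python
import numpy as np
import pandas as pd


def pairwise_correlations(abundance, min_prevalence=0.10, min_corr=0.3):
    """Hypothetical helper: filter rare contigs, correlate, list pairs."""
    # Presence/absence: a contig counts as present with at least one read.
    presence = (abundance > 0).astype(int)

    # Drop contigs present in fewer than `min_prevalence` of the samples.
    prevalence = presence.mean(axis=1)
    kept = presence.loc[prevalence >= min_prevalence]

    # Contigs are rows, so transpose before correlating (Pearson is assumed).
    corr = kept.T.corr()

    # Keep each unordered pair once: upper triangle without the diagonal.
    rows, cols = np.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = pd.DataFrame({
        "contig_a": corr.index[rows].to_numpy(),
        "contig_b": corr.columns[cols].to_numpy(),
        "correlation": corr.to_numpy()[rows, cols],
    })

    # Undefined (NaN) correlations compare as False and are dropped here.
    return pairs[pairs["correlation"] >= min_corr].reset_index(drop=True)


# The example abundance table from this README.
abundance = pd.DataFrame(
    {
        "Sample1": [1839, 0, 1303, 0],
        "Sample2": [0, 729, 0, 0],
        "Sample3": [868, 0, 69, 0],
        "Sample4": [0, 0, 0, 90],
    },
    index=["Contig1", "Contig2", "Contig3", "Contig4"],
)
pairs = pairwise_correlations(abundance)
print(pairs)
```

On this small table only Contig1 and Contig3 pass the threshold, because they occur in exactly the same samples.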

The abundance table is either transformed to a presence/absence table or, if you provide a file with the contig lengths (-l/--lengths), the read count is divided by the contig length. The latter option also takes the abundance of each contig into account without a bias towards contig length (longer contigs recruit more reads, even though they might not be more abundant).
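The two transformations could look roughly like this in pandas. The function names `to_presence_absence` and `normalize_by_length` are hypothetical and chosen for illustration; the script itself may implement these steps differently.

```python
import pandas as pd


def to_presence_absence(abundance):
    """Hypothetical helper: 1 where a contig has any reads, else 0."""
    return (abundance > 0).astype(int)


def normalize_by_length(abundance, lengths):
    """Hypothetical helper: reads per base for every contig and sample."""
    # Align the lengths to the rows of the abundance table.
    aligned = lengths.reindex(abundance.index)
    if aligned.isna().any():
        missing = list(aligned[aligned.isna()].index)
        raise ValueError(f"no length given for: {missing}")
    # Divide each row (contig) by its own length.
    return abundance.div(aligned, axis=0)


# Two contigs from the example input in this README.
abundance = pd.DataFrame(
    {"Sample1": [1839, 1303], "Sample3": [868, 69]},
    index=["Contig1", "Contig3"],
)
lengths = pd.Series({"Contig1": 7493, "Contig3": 3092})

presence = to_presence_absence(abundance)
normalized = normalize_by_length(abundance, lengths)
print(normalized)
```

In Sample1, Contig1 has more reads than Contig3 (1839 versus 1303), but after dividing by length Contig3 has the higher value (about 0.42 versus about 0.25 reads per base), which is the length bias the normalization removes.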

The script also allows you to correlate specific contigs of interest with all other contigs in your study. It needs the names of your contigs of interest (-s/--segments). The output is a correlation matrix of all contigs above the correlation threshold with your contigs of interest.
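A minimal sketch of this mode, assuming presence/absence data and Pearson correlation (neither is confirmed by the README). The function name `correlate_with_segments` is hypothetical, and the prevalence filter is left out for brevity.

```python
import pandas as pd


def correlate_with_segments(abundance, segments, min_corr=0.3):
    """Hypothetical helper: correlate contigs of interest with all contigs."""
    presence = (abundance > 0).astype(int)
    corr = presence.T.corr()  # contigs are rows, so transpose first

    missing = [name for name in segments if name not in corr.index]
    if missing:
        raise KeyError(f"contigs of interest not in table: {missing}")

    # Rows: all contigs. Columns: only the contigs of interest.
    sub = corr.loc[:, segments]

    # Keep contigs that reach the threshold with at least one contig of
    # interest. Each contig of interest correlates perfectly with itself,
    # so it always stays in the result.
    keep = (sub >= min_corr).any(axis=1)
    return sub.loc[keep]


# The example abundance table and segments file from this README.
abundance = pd.DataFrame(
    {
        "Sample1": [1839, 0, 1303, 0],
        "Sample2": [0, 729, 0, 0],
        "Sample3": [868, 0, 69, 0],
        "Sample4": [0, 0, 0, 90],
    },
    index=["Contig1", "Contig2", "Contig3", "Contig4"],
)
result = correlate_with_segments(abundance, ["Contig1", "Contig3"])
print(result)
```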

Caveats:

  1. You need a decent number of samples in which your virus is a) present and b) absent; otherwise there will be a) no correlations or b) too many correlations with unrelated contigs (e.g. host sequences).
  2. The correlations should always be checked: the output is not perfect and is meant as a tool to help you recover unknown segments. You could, for example, check for approximately equal coverage of your related contigs, check the open reading frames of all your contigs, etc.
  3. It is always a good idea to apply a threshold to your abundance table before considering a contig present in a sample. This reduces the risk of drawing false conclusions because of index hopping. It also matters for this analysis: if you apply no correction and only consider presence/absence (i.e. a single read counts as presence), the correlation analysis will not perform optimally and you might find false positive or false negative correlations. You could set a threshold of e.g. 50% horizontal coverage to consider a contig present (if the horizontal coverage is below 50%, you set the read count to 0). A more relaxed approach is to set a fixed threshold of e.g. 50 reads and set everything below it to 0.
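The two thresholds from the third caveat can be applied to the abundance table before running the script. The sketch below is one possible way to do that in pandas. The function names, the small read count table and the coverage table (horizontal coverage as a fraction per contig and sample) are all made up for illustration; the README does not describe where coverage values come from.

```python
import pandas as pd


def apply_read_threshold(abundance, min_reads=50):
    """Hypothetical helper: set read counts below `min_reads` to 0."""
    return abundance.where(abundance >= min_reads, 0)


def apply_coverage_threshold(abundance, coverage, min_coverage=0.5):
    """Hypothetical helper: zero counts where horizontal coverage is low."""
    # Align the coverage table to the abundance table. Cells without a
    # coverage value become NaN, fail the comparison and are set to 0.
    coverage = coverage.reindex(index=abundance.index,
                                columns=abundance.columns)
    return abundance.where(coverage >= min_coverage, 0)


# Made-up read counts: two cells hold only a handful of reads.
counts = pd.DataFrame(
    {"Sample1": [1839, 12], "Sample2": [3, 729]},
    index=["Contig1", "Contig2"],
)
# Made-up horizontal coverage fractions for the same cells.
coverage = pd.DataFrame(
    {"Sample1": [0.98, 0.30], "Sample2": [0.04, 0.91]},
    index=["Contig1", "Contig2"],
)

by_reads = apply_read_threshold(counts, min_reads=50)
by_coverage = apply_coverage_threshold(counts, coverage, min_coverage=0.5)
print(by_reads)
print(by_coverage)
```

In this made-up example both thresholds zero the same two low-count cells, which are the kind of stray counts that index hopping can produce.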

Examples of input:

  1. Abundance table (-i/--input)

              Sample1  Sample2  Sample3  Sample4
     Contig1  1839     0        868      0
     Contig2  0        729      0        0
     Contig3  1303     0        69       0
     Contig4  0        0        0        90

  2. Segments file (-s/--segments)

     Contig1
     Contig3
     etc.

  3. Lengths file (-l/--lengths)

     Contig1 7493
     Contig2 2923
     Contig3 3092
     Contig4 1490
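To inspect such files before running the script, they could be loaded with pandas as sketched below. This assumes the files are whitespace-delimited, as they appear in the examples above; the README does not state the delimiter, so adjust `sep` if your files use tabs or commas. The in-memory strings stand in for real file paths.

```python
from io import StringIO

import pandas as pd

abundance_text = """\
         Sample1  Sample2  Sample3  Sample4
Contig1  1839     0        868      0
Contig2  0        729      0        0
Contig3  1303     0        69       0
Contig4  0        0        0        90
"""
segments_text = """\
Contig1
Contig3
"""
lengths_text = """\
Contig1 7493
Contig2 2923
Contig3 3092
Contig4 1490
"""

# The header has one field fewer than the data rows, so pandas uses the
# first column (the contig names) as the index.
abundance = pd.read_csv(StringIO(abundance_text), sep=r"\s+")

# One contig name per line; strip stray whitespace and skip empty lines.
segments = [line.strip() for line in StringIO(segments_text) if line.strip()]

# Two columns without a header: contig name and contig length.
lengths = pd.read_csv(
    StringIO(lengths_text),
    sep=r"\s+",
    header=None,
    names=["contig", "length"],
    index_col="contig",
)["length"]

print(abundance)
print(segments)
print(lengths)
```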

Dependencies:

  1. pandas
  2. numpy

Contributors:

  1. landerdc
