lincs's Introduction

Transcriptional signatures of perturbation from LINCS L1000

Python analysis of the LINCS L1000 data.

The repository consists of Python notebooks, which are executed in the following order:

  1. api.ipynb retrieves metadata from the L1000 API. Retrieved data is converted into a dataframe and saved as a tsv. Files are created for perturbations, signatures, cells, and probes.
  2. database.ipynb creates a SQLite database containing the metadata retrieved from the API. Data cleaning occurs here. The database resides at data/l1000.db but is ignored due to file size. However, the populated database is available on figshare.
  3. unichem.ipynb maps compounds to external databases and adds the mapping to the database. See this comment for more information.
  4. chemical-similarity.ipynb computes chemical similarities between compounds and adds these similarities to the database.
  5. consensi.ipynb computes consensus signatures for each perturbagen. The following consensus files are created:
  6. significance.ipynb converts consensus z-scores into significant up/down-regulation values. The following files are created:

See this comment for more information on steps 5 & 6.
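
Steps 1 and 2 above (metadata saved as a tsv, then loaded into SQLite) can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: it uses only the standard library instead of dataframes, an in-memory database instead of data/l1000.db, and the function names, the `cells` table layout, and the two records are hypothetical placeholders.

```python
import csv
import io
import sqlite3


def write_tsv(records, fieldnames, handle):
    """Write a list of dicts to a tab-separated file handle."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)


def load_tsv_into_sqlite(handle, connection, table):
    """Create `table` from a TSV handle, storing every column as TEXT."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames
    column_sql = ", ".join(f'"{c}" TEXT' for c in columns)
    connection.execute(f'CREATE TABLE "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(row[c] for c in columns) for row in reader]
    connection.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})', rows
    )
    connection.commit()
    return len(rows)


# Hypothetical metadata records standing in for an API response.
cells = [
    {"cell_id": "HEPG2", "cell_type": "cell line"},
    {"cell_id": "MCF7", "cell_type": "cell line"},
]

buffer = io.StringIO()  # stands in for a .tsv file on disk
write_tsv(cells, ["cell_id", "cell_type"], buffer)
buffer.seek(0)

conn = sqlite3.connect(":memory:")  # stands in for data/l1000.db
n_rows = load_tsv_into_sqlite(buffer, conn, "cells")
print(n_rows, conn.execute('SELECT cell_id FROM "cells"').fetchall())
```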

Note: This is not an official LINCS L1000 repository. Users are warned that our modifications may have introduced errors or removed signal that was present in the original data.

Inputs

This repository depends on modzs.gctx — a legacy probe × signature matrix of differential expression z-scores. Due to large file size (42.5 GB) this file is not uploaded to GitHub. To recreate this analysis rather than just use the results, users should retrieve modzs.gctx from figshare and place it in the download directory.

Citation

See the Transcriptional signatures of perturbation from LINCS L1000 section of the Rephetio manuscript for the final description of this work. Citations related to this repository are below:

  1. Systematic integration of biomedical knowledge prioritizes drugs for repurposing
    Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
    eLife (2017-09-22) https://doi.org/cdfk
    DOI: 10.7554/elife.26726 · PMID: 28936969 · PMCID: PMC5640425

  2. Consensus signatures for LINCS L1000 perturbations
    Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini
    Figshare (2016-03-08) https://doi.org/f3mqvs
    DOI: 10.6084/m9.figshare.3085426.v1

  3. dhimmel/lincs v2.0: Refined Consensus Signatures From Lincs L1000
    Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini
    Zenodo (2016-03-08) https://doi.org/f3mqvr
    DOI: 10.5281/zenodo.47223

  4. Computing consensus transcriptional profiles for LINCS L1000 perturbations
    Daniel Himmelstein, Caty Chung
    ThinkLab (2015-03-26) https://doi.org/f3mqwc
    DOI: 10.15363/thinklab.d43

Environment

Create the conda environment for this repository using:

conda env create --file environment.yml

License

All original content in this repository is released under CC0 1.0. LINCS data and derivatives are released under CC BY 4.0 — please refer to the LINCS data policy and attribute this repository and LINCS L1000.

lincs's People

Contributors

dhimmel, orthographic-pedant

lincs's Issues

README for the repo

Hey @dhimmel,

Thank you for such amazing work putting together the scripts to process and analyze the Lincs dataset. If it is not too much, could you add a README to the repo to guide us through the process?

Thank You.

Regards,
Yojana Gadiya

Effect of over- and underexpression of a gene on itself

Hi Daniel,

Thank you very much for sharing this work. As a computational biologist, I find this data very useful for checking hypotheses derived from another dataset against wet-lab data. Great!

I had a look at the datasets you kindly provided in https://github.com/dhimmel/lincs/tree/gh-pages/data/consensi and checked the effect of overexpression/underexpression of a gene as perturbagen on itself:

About a third of the genes showed nominally significant underexpression (z-score <= -1.96) when the gene itself was the target of the repressing perturbagen. For overexpression, about 10 percent of genes showed overexpression when they were themselves the overexpressed perturbagen.
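
The check described above can be sketched roughly as follows. The gene names and z-scores are made up for illustration, and `fraction_significant` is a hypothetical helper, not a function from this repository. It assumes you have already extracted, for each gene, the consensus z-score of that gene in the signature where it is itself the perturbagen.

```python
def fraction_significant(self_z_scores, threshold=1.96, direction="down"):
    """Fraction of genes whose z-score on themselves passes the threshold.

    self_z_scores maps gene -> z-score of that gene when it is the perturbagen.
    direction="down" counts z <= -threshold, direction="up" counts z >= threshold.
    """
    if direction not in {"down", "up"}:
        raise ValueError("direction must be 'down' or 'up'")
    if not self_z_scores:
        return 0.0
    if direction == "down":
        hits = [gene for gene, z in self_z_scores.items() if z <= -threshold]
    else:
        hits = [gene for gene, z in self_z_scores.items() if z >= threshold]
    return len(hits) / len(self_z_scores)


# Hypothetical self-effect z-scores (not real LINCS values).
knockdown = {"GENE_A": -2.5, "GENE_B": -0.3, "GENE_C": -1.96, "GENE_D": 1.2}
overexpression = {"GENE_A": 2.1, "GENE_B": 0.4, "GENE_C": -0.2, "GENE_D": 1.0}

down_fraction = fraction_significant(knockdown, direction="down")
up_fraction = fraction_significant(overexpression, direction="up")
print(down_fraction, up_fraction)
```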

My first question is: While this is clearly an enrichment in the right direction, is this rather low efficiency of a gene as a perturbagen on itself expected?

My second question is: Do you suggest filtering for genes that show an effect on themselves as perturbagen, as a quality control step?

To illustrate this issue, here is a histogram of z-scores showing the effect of perturbagens on themselves vs. their effect on other genes:
(Image: s309_1_distribution_zscores_over_under_itselves_effect)

Thanks and best, Holger

Level 4 replicates don't match with level 5 signature

I am trying to plot some genes using data level 4 for my compound (BRD-K55591206) on HepG2 cells.

There are two signatures with HepG2 cells at level 5:
LJP008_HEPG2_24H:J01
POL001_HEPG2_24H:J09
To make sure I was using the same data from these level 5 signatures I checked the replicates at level 4 of each of these signatures above. The average of the two LJP008 experiments (distil_ids: LJP008_HEPG2_24H_X2_B20:J01|LJP008_HEPG2_24H_X3_B20:J01) matches the signature of each gene at level 5. Perfect.

However, the level 4 data for signature POL001_HEPG2 (distil_ids: POL001_HEPG2_24H_X1.L2_B23:J07|POL001_HEPG2_24H_X2.L2_B23:J07|POL001_HEPG2_24H_X3.L2_B23:J07) does not match level 5.

If we use the NAT2 gene as an example, we have the following level 5 value: 0.004413

On the other hand, the values for the level 4 replicates are:
POL001_HEPG2_24H_X1.L2_B23:J07 = -0.386299998
POL001_HEPG2_24H_X2.L2_B23:J07 = 0.110600002
POL001_HEPG2_24H_X3.L2_B23:J07 = 0.38409999
The average (0.036133) does not match the level 5 value (0.004413).
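
The arithmetic above can be checked directly. The sketch below also illustrates one possible reason for the discrepancy, offered as an assumption rather than something confirmed in this thread: L1000 level 5 signatures are described as moderated z-scores, a correlation-weighted average of the level 4 replicates rather than a plain mean. The `modz` function, its `min_weight` default, and the synthetic profiles are illustrative only and are not taken from cmapR or the CMap pipeline.

```python
import numpy as np

# Level 4 NAT2 values quoted above for the three POL001 replicates.
nat2_level4 = np.array([-0.386299998, 0.110600002, 0.38409999])
plain_mean = nat2_level4.mean()
print(f"plain mean: {plain_mean:.6f}")


def rank(values):
    """Ordinal ranks (ties are not averaged, acceptable for continuous data)."""
    return np.argsort(np.argsort(values))


def modz(replicates, min_weight=0.01):
    """Correlation-weighted average of replicate profiles (assumed scheme).

    replicates: array of shape (n_replicates, n_genes).
    Each replicate is weighted by its mean Spearman correlation with the
    other replicates, clipped at min_weight and normalized to sum to 1.
    """
    replicates = np.asarray(replicates, dtype=float)
    n = replicates.shape[0]
    if n == 1:
        return replicates[0], np.array([1.0])
    ranks = np.array([rank(r) for r in replicates])
    corr = np.corrcoef(ranks)  # Spearman = Pearson computed on ranks
    np.fill_diagonal(corr, 0.0)  # ignore self-correlation
    raw = np.clip(corr.sum(axis=1) / (n - 1), min_weight, None)
    weights = raw / raw.sum()
    return weights @ replicates, weights


# Synthetic profiles over four hypothetical genes (not real data).
two_reps = np.array([[1.0, -2.0, 0.5, 3.0],
                     [0.8, -1.5, 0.7, 2.0]])
three_reps = np.vstack([two_reps, [-1.0, 2.0, 0.1, -0.5]])

consensus2, weights2 = modz(two_reps)
consensus3, weights3 = modz(three_reps)
print("two replicates, weights:", weights2)
print("three replicates, weights:", weights3)
```

Under this scheme, two replicates always receive equal weights, so the result equals the plain mean. With three or more replicates, a poorly correlated replicate is down-weighted and the result can differ from the plain mean. Note that the weights depend on the whole profile, not on a single gene, so they cannot be recovered from the NAT2 values alone.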

The compound is BRD-K55591206, 10 µM, 24 h.

Why don’t they match?

I am using cmapR to retrieve the data from these files:
https://clue.io/releases/data-dashboard
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/level5/level5_beta_trt_cp_n720216x12328.gctx
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/level4/level4_beta_all_n3026460x12328.gctx
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/siginfo_beta.txt
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/instinfo_beta.txt

Which set of GSEA data is equivalent to modzs.gctx

Hello Daniel,

I am in the process of creating auto-update scripts for all the nodes with hetio, and in order to do that, I will need a copy of the most up-to-date modzs.gctx file. I know that GSEA has some LINCS datasets in there, and I was curious which files best correspond to the modzs file that you used in this repository. Any input or feedback would be greatly appreciated. Thank you!

Best,
Krish

Download modzs.gctx

I can't figure out where to get modzs.gctx, which is needed to construct the signature dataframe sig_expr_df in consensi.ipynb. From here, you say:

The z-score signature vectors are retrieved from the /xchip/cogs/data/build/a2y13q1/modzs.gctx file on the C3 cloud.

But this was 2 years ago and the link doesn't work anymore. Also, I'm not sure exactly what this file is or how it was generated.

I appreciate your help in advance!

Code to generate l1000.db or the downloadable l1000.db

As shown in database.ipynb, there is a large l1000.db file containing the gene expression profiles and metadata of LINCS. Could you publish the code that produces l1000.db and/or the l1000.db file itself? Thank you.
