dhimmel / lincs

Library of Integrated Cellular Signatures L1000

Home Page: https://think-lab.github.io/d/43/

Python 17.25% Jupyter Notebook 82.75%

lincs l1000 gene-expression rephetio

lincs's Introduction

Transcriptional signatures of perturbation from LINCS L1000

Python analysis of the LINCS L1000 data.

The repository consists of Python notebooks, which are executed in the following order:

1. api.ipynb retrieves metadata from the L1000 API. The retrieved data is converted into a dataframe and saved as a TSV file. Files are created for perturbations, signatures, cells, and probes.
2. database.ipynb creates a SQLite database containing the metadata retrieved from the API. Data cleaning occurs here. The database resides at data/l1000.db but is excluded from version control due to its file size. However, the populated database is available on figshare.
3. unichem.ipynb maps compounds to external databases and adds the mapping to the database. See this comment for more information.
4. chemical-similarity.ipynb computes chemical similarities between compounds and adds these similarities to the database.
5. consensi.ipynb computes consensus signatures for each perturbagen. The following consensus files are created:

   - consensi-drugbank.tsv.bz2 with consensus signatures for each mapped DrugBank compound
   - consensi-knockdown.tsv.bz2 with consensus signatures for each gene knockdown
   - consensi-overexpression.tsv.bz2 with consensus signatures for each gene over-expression
   - consensi-pert_id.tsv.bz2 with consensus signatures for each L1000 pert_id. This file is too large for GitHub (500 MB), but is available on figshare.

6. significance.ipynb converts consensus z-scores into significant up/down-regulation values. The following files are created:

   - DrugBank dysregulated genes (dysreg-drugbank.tsv) and counts (dysreg-drugbank-summary.tsv)
   - Knockdown dysregulated genes (dysreg-knockdown.tsv) and counts (dysreg-knockdown-summary.tsv)
   - Overexpression dysregulated genes (dysreg-overexpression.tsv) and counts (dysreg-overexpression-summary.tsv)
   - All perturbagens dysregulated genes (dysreg-pert_id.tsv.gz) and counts (dysreg-pert_id-summary.tsv)

See this comment for more information on steps 5 & 6.
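As a rough illustration of the last two steps, the sketch below combines several z-scores into one consensus value with a weighted Stouffer's method and then labels the result as up- or down-regulated. It is not the notebooks' actual code: the function names, the equal default weights, and the 1.96 cutoff are hypothetical choices made only for illustration.

```python
import math


def stouffer_consensus(z_scores, weights=None):
    """Combine z-scores with a weighted Stouffer's method (illustrative only)."""
    if not z_scores:
        raise ValueError("at least one z-score is required")
    if weights is None:
        # Assumption: every signature contributes equally.
        weights = [1.0] * len(z_scores)
    if len(weights) != len(z_scores):
        raise ValueError("weights and z_scores must have the same length")
    denominator = math.sqrt(sum(w * w for w in weights))
    if denominator == 0:
        raise ValueError("weights must not all be zero")
    return sum(w * z for w, z in zip(weights, z_scores)) / denominator


def call_direction(z, threshold=1.96):
    """Label a consensus z-score; the threshold is a hypothetical cutoff."""
    if z >= threshold:
        return "up"
    if z <= -threshold:
        return "down"
    return "none"


# Two equally weighted signatures with z = 1.0 each give sqrt(2).
consensus = stouffer_consensus([1.0, 1.0])
direction = call_direction(consensus)
```

Dividing by the square root of the summed squared weights keeps the combined value on the z-score scale when the inputs are independent, which is why agreeing signatures yield a larger consensus than either one alone.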

Note: This is not an official LINCS L1000 repository. Users are warned that our modifications may have introduced errors or removed signal that was present in the original data.

Inputs

This repository depends on modzs.gctx, a legacy probe × signature matrix of differential expression z-scores. Due to its large size (42.5 GB), this file is not uploaded to GitHub. To recreate this analysis rather than just use the results, users should retrieve modzs.gctx from figshare and place it in the download directory.
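Before running the notebooks, it can help to confirm that the input file is where they expect it. The helper below is a minimal sketch: the function name is hypothetical, and it assumes the download directory is given relative to the current working directory.

```python
from pathlib import Path


def find_modzs(download_dir="download", filename="modzs.gctx"):
    """Return the path to the input matrix, or fail with a clear message."""
    path = Path(download_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; retrieve {filename} and place it in {download_dir}/"
        )
    return path
```

Calling find_modzs() from the repository root either returns the path or raises an error that names the missing file.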

Citation

See the Transcriptional signatures of perturbation from LINCS L1000 section of the Rephetio manuscript for the final description of this work. Citations related to this repository are below:

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
eLife (2017-09-22) https://doi.org/cdfk
DOI: 10.7554/elife.26726 · PMID: 28936969 · PMCID: PMC5640425
Consensus signatures for LINCS L1000 perturbations
Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini
Figshare (2016-03-08) https://doi.org/f3mqvs
DOI: 10.6084/m9.figshare.3085426.v1
dhimmel/lincs v2.0: Refined Consensus Signatures From Lincs L1000
Daniel Himmelstein, Leo Brueggeman, Sergio Baranzini
Zenodo (2016-03-08) https://doi.org/f3mqvr
DOI: 10.5281/zenodo.47223
Computing consensus transcriptional profiles for LINCS L1000 perturbations
Daniel Himmelstein, Caty Chung
ThinkLab (2015-03-26) https://doi.org/f3mqwc
DOI: 10.15363/thinklab.d43

Environment

Create the conda environment for this repository using:

conda env create --file environment.yml

License

All original content in this repository is released under CC0 1.0. LINCS data and derivatives are released under CC BY 4.0 — please refer to the LINCS data policy and attribute this repository and LINCS L1000.

lincs's Issues

Where can I find the ligand consensus signature, and the file that lists the number of gold signatures used in the analysis?

README for the repo

Hey @dhimmel,

Thank you for such amazing work putting together the scripts to process and analyze the LINCS dataset. If it is not too much to ask, could you add a README to the repo to guide us through the process?

Thank You.

Regards,
Yojana Gadiya

Effect of over- and underexpression of a gene on itself

Hi Daniel,

Thank you very much for sharing this work. As a computational biologist, I find this data very interesting for looking up hypotheses derived from another dataset in wet-lab data. Great!

I had a look at the datasets you kindly provided in https://github.com/dhimmel/lincs/tree/gh-pages/data/consensi and checked the effect of overexpression/underexpression of a gene as perturbagen on itself:

About a third of the genes showed nominally significant underexpression (z-score <= -1.96) when the gene itself was the repressing perturbagen. For overexpression, about 10 percent of genes showed overexpression when they were themselves the overexpressed perturbagen.

My first question is: While this is truly a clear enrichment in the right direction, is this rather low efficiency of a gene as perturbagen on itself expected?

My second question is: Would you suggest filtering for genes that, as perturbagens, have an effect on themselves, as a quality control step?

To illustrate this issue, here is a histogram of z-scores showing the effect of perturbagens on themselves vs. their effect on other genes:

Thanks and best, Holger
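The check described in this post can be sketched as follows. The nested-dictionary layout (perturbed gene mapped to measured gene mapped to z-score), the function name, and the toy values are all hypothetical; the actual consensus files would need to be loaded and reshaped first.

```python
def self_effect_fraction(z_by_perturbagen, threshold=1.96, direction="down"):
    """Fraction of perturbed genes whose own z-score passes the cutoff.

    direction="down" counts z <= -threshold (knockdown check);
    direction="up" counts z >= threshold (overexpression check).
    Genes that are not measured under their own perturbation are skipped.
    """
    if direction not in ("down", "up"):
        raise ValueError("direction must be 'down' or 'up'")
    self_scores = [
        scores[gene]
        for gene, scores in z_by_perturbagen.items()
        if gene in scores
    ]
    if not self_scores:
        raise ValueError("no gene is measured under its own perturbation")
    if direction == "down":
        hits = sum(1 for z in self_scores if z <= -threshold)
    else:
        hits = sum(1 for z in self_scores if z >= threshold)
    return hits / len(self_scores)


# Toy knockdown data (made-up numbers, for illustration only).
toy_knockdown = {
    "GENE_A": {"GENE_A": -2.5, "GENE_B": 0.1},
    "GENE_B": {"GENE_A": 0.3, "GENE_B": -0.5},
    "GENE_C": {"GENE_A": 1.0, "GENE_C": -3.0},
    "GENE_D": {"GENE_A": 0.0},  # GENE_D itself is not measured
}
fraction_down = self_effect_fraction(toy_knockdown)
```

Skipping perturbed genes that are not among the measured genes matters for the denominator, since only genes that can be observed under their own perturbation can pass or fail the check.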

Level 4 replicates don't match the level 5 signature

I am trying to plot some genes using level 4 data for my compound (BRD-K55591206) in HepG2 cells.

There are two signatures with HepG2 cells at level 5:
LJP008_HEPG2_24H:J01
POL001_HEPG2_24H:J09
To make sure I was using the same data from these level 5 signatures I checked the replicates at level 4 of each of these signatures above. The average of the two LJP008 experiments (distil_ids: LJP008_HEPG2_24H_X2_B20:J01|LJP008_HEPG2_24H_X3_B20:J01) matches the signature of each gene at level 5. Perfect.

However, the level 4 data for signature POL001_HEPG2 (distil_ids: POL001_HEPG2_24H_X1.L2_B23:J07|POL001_HEPG2_24H_X2.L2_B23:J07|POL001_HEPG2_24H_X3.L2_B23:J07) does not match level 5.

If we use the NAT2 gene as an example, we have the following level 5 value: 0.004413

On the other hand, the values for the level 4 replicates are:
POL001_HEPG2_24H_X1.L2_B23:J07 = -0.386299998
POL001_HEPG2_24H_X2.L2_B23:J07 = 0.110600002
POL001_HEPG2_24H_X3.L2_B23:J07 = 0.38409999
The average, 0.036133, does not match the level 5 value, 0.004413.

The compound is BRD-K55591206, 10 µM, 24 h.

Why don’t they match?

I am using cmapR to retrieve the data from these files:
https://clue.io/releases/data-dashboard
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/level5/level5_beta_trt_cp_n720216x12328.gctx
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/level4/level4_beta_all_n3026460x12328.gctx
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/siginfo_beta.txt
https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/instinfo_beta.txt
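The arithmetic in this report can be reproduced in a few lines of Python. The sketch also includes a weighted mean, on the assumption (not confirmed anywhere in this thread) that a level 5 value may be a weighted rather than a plain average of its level 4 replicates. The weights below are hypothetical placeholders, not values taken from the LINCS data.

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean; weights are normalised so they need not sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for w, v in zip(weights, values)) / total


# Level 4 NAT2 values quoted in the report above.
level4_nat2 = [-0.386299998, 0.110600002, 0.38409999]

plain = mean(level4_nat2)
print(round(plain, 6))  # 0.036133

# Hypothetical weights, only to show that unequal weights shift the result.
hypothetical_weights = [0.2, 0.5, 0.3]
weighted = weighted_mean(level4_nat2, hypothetical_weights)
print(weighted)
```

With equal weights the weighted mean reduces to the plain mean, so a plain average would reproduce a weighted signature only when all replicates happen to carry the same weight. Whether that explains the mismatch here remains a guess.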

Which set of GSEA data is equivalent to modzs.gctx?

Hello Daniel,

I am in the process of creating auto-update scripts for all the nodes with hetio, and in order to do that, I will need a copy of the most up-to-date modzs.gctx file. I know that GSEA has some LINCS datasets in there, and I was curious which files best correspond to the modzs file that you used in this repository. Any input or feedback would be greatly appreciated. Thank you!

Best,
Krish

Download modzs.gctx

I can't figure out where to get modzs.gctx, which is needed to construct the signature dataframe sig_expr_df in consensi.ipynb. From here, you say:

The z-score signature vectors are retrieved from the /xchip/cogs/data/build/a2y13q1/modzs.gctx file on the C3 cloud.

But this was 2 years ago and the link doesn't work anymore. Also, I'm not sure exactly what this file is or how it was generated.

I appreciate your help in advance!

Code to generate l1000.db or the downloadable l1000.db

As shown in database.ipynb, there is a large l1000.db file containing the gene expression profiles and metadata of LINCS. Could you publish the code that produces l1000.db and/or the l1000.db file itself? Thank you.
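Once a copy of l1000.db is in hand, its contents can be inspected without knowing the schema in advance. The sketch below uses only the standard sqlite3 module and makes no assumption about table names; the function name and the path in the usage comment are illustrative.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    # Note: sqlite3.connect creates an empty database if db_path does not exist.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Example usage (path is an assumption about where the file was saved):
# print(list_tables("data/l1000.db"))
```

Because connecting to a missing path silently creates an empty file, an empty result may simply mean the path is wrong.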