
Processed Cell Painting Data for the LINCS Drug Repurposing Project

License: BSD 3-Clause "New" or "Revised" License

drug-repurposing cell-painting lincs cell-morphology data-repository

lincs-cell-painting's Introduction

LINCS Cell Painting profile data repository


The Library of Integrated Network-Based Cellular Signatures (LINCS) Project aims to create publicly available resources to characterize how cells respond to perturbation. This repository stores Cell Painting readouts and associated data-processing pipelines for the LINCS Cell Painting dataset.

In this project, the Connectivity Map team perturbed A549 cells with 1,571 compounds across 6 doses in 5 technical replicates. The data represent a subset of the Broad Drug Repurposing Hub collection of compounds.

We refer to this dataset as LINCS Pilot 1. We also include data for the second batch of LINCS Cell Painting data, which we refer to as LKCP.

For a specific list of compounds tested, see metadata. You can interactively explore information about the compounds in the CLUE Repurposing app.

The Morphology Connectivity Hub is the primary source of this dataset.

Image-based profiling

We apply a unified, image-based profiling pipeline to all 136 of the 384-well plates from LINCS Pilot 1 and all 135 of the 384-well plates from LKCP. We use pycytominer as the primary tool for image-based profiling.

We process and store level 3 to level 5 profiles in the profiles/ directory. Furthermore, spherized and consensus profiles can be found in their respective folders.

See profiles/README.md for more details and for instructions on how to reproduce the pipeline. For further details about image-based profiling in general, please refer to Caicedo et al. 2017.

Computational environment

We use conda to manage the computational environment.

To install conda, see the instructions.

We recommend installing conda by downloading and executing the .sh installer and accepting the defaults.

After installing conda, execute the following to install and navigate to the environment:

# First, install the `lincs` conda environment
conda env create --force --file environment.yml

# If you had already installed this environment and now want to update it
conda env update --file environment.yml --prune

# Then, activate the environment and you're all set!
conda activate lincs

Also note that when contributing to the repository, make sure to add any new packages to the environment.yml file.

License

We use a dual license in this repository. We license the source code as BSD 3-Clause, and license the data, results, and figures as CC0 1.0.

Citation

If you use these data or software, please cite our Zenodo archive:

Natoli, Ted, Way, Gregory, Lu, Xiaodong, Logan, David, Alimova, Maria, Hartland, Kate, Golub, Todd, Carpenter, Anne, Singh, Shantanu, Subramanian, Aravind. (2021). broadinstitute/lincs-cell-painting: Full release of LINCS Cell Painting dataset (Version v1) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.5008187

lincs-cell-painting's People

Contributors

gwaybio, michaelbornholdt, niranjchandrasekaran, shntnu, tnat1031


lincs-cell-painting's Issues

Consensus data still performs really poorly for my metrics

This is a reminder to tackle soon.
I don't know what I am doing wrong, but when I download some non-spherized consensus data and run enrichment over it, the results are awful. From past experience I know that this is due to extremely high correlation values from the normalization.

I thought we had fixed this... maybe not.

Add whitening normalization to this repo

The profiles deposited in #34 do not include whitening normalization. Previously (see #4 (comment)), I elected to leave the whitened data to a future data upload because of this caveat:

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

@shntnu also notes in #4 (comment)

Going forward, we will very likely produce at least two different Level 4a profiles

  • whole-well z-scored
  • DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized feature selected) versions of the two 4a profiles.
We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.
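For reference, here is a minimal sketch of how the two 4a variants could be produced with pycytominer's normalize function (the samples query string and the 'DMSO' label are assumptions about the metadata; the actual pipeline lives in profiles/):

from pycytominer import normalize

# Whole-well (whole-plate) z-scoring: every well contributes to the
# location and scale estimates
whole_plate_df = normalize(
    profiles=df,  # hypothetical level 3 dataframe
    features="infer",
    meta_features="infer",
    samples="all",
    method="mad_robustize",
)

# DMSO z-scoring: only negative-control wells define location and scale
# (assumes DMSO wells carry 'DMSO' in Metadata_broad_sample)
dmso_df = normalize(
    profiles=df,
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",
    method="mad_robustize",
)

A 4w (whitened) variant would swap in the spherize/whiten method once its implementation is stable.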

Create visualizations of the similarities among profiles

It is very useful for researchers to be able to browse a heat map with a dendrogram attached (or some other representation - it's hard!) and look at relationships among the samples (drugs in this case).

It would therefore be great to create such a visualization in sharable format, for Cell Painting profiles and also for L1000 profiles, to compare them qualitatively.

InChIKey14s can contain duplicate MOA/Target Info

In #12 we used InChIKey14 to map broad_ids and in #11 we discussed why this is important.

While processing some data, I noticed that InChIKey14s do not map uniquely to MOA and Targets. I guess this is not surprising given that drugs are often used for different indications in various clinical phases, but it is worth documenting here! It is dangerous to use InChIKey14s to map directly to MOA/Targets.

For example, InChIKey14 KTEIFNKAUNYNJU maps to two MOA/Targets. However, it looks like the full InChIKey does map uniquely. I didn't comprehensively explore this.
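A quick way to flag these ambiguous cases (a sketch; the file name and the InChIKey/moa column names are assumptions about the metadata table):

import pandas as pd

# hypothetical metadata table with InChIKey and moa columns
meta_df = pd.read_csv("repurposing_samples.csv")

# InChIKey14 is the first 14 characters (the connectivity block) of the
# 27-character InChIKey
meta_df["InChIKey14"] = meta_df["InChIKey"].str[:14]

# InChIKey14s that map to more than one MOA are ambiguous
moa_counts = meta_df.groupby("InChIKey14")["moa"].nunique()
print(moa_counts[moa_counts > 1])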


@niranjchandrasekaran - maybe I missed this, but was there a reason to use InChIKey14 instead of the full InChIKey?

Spherized data UMAP figures

In #63 I added spherized profiles for batch 1 and batch 2 LINCS data. Here is a bird's-eye view of what the profiles look like:

Summary

It is hard to determine from this view exactly how much the profile quality has improved. The DMSO profiles are still distributed widely in the UMAP space, but many compounds form distinct islands. It also doesn't look too different (at least at a cursory glance) from the non-spherized (level 4 profiles) LINCS dataset (see here).

Batch 2 profiles are potentially more interesting. We see distinct islands separated by cell type! This is expected, but also quite exciting. I do not have UMAP coordinates with level 4 profiles.

Batch 1

[UMAP figure: lincs_whole_plate_spherized_batch1]

Batch 2

[UMAP figure: lincs_whole_plate_spherized_batch2]

[UMAP figure: lincs_whole_plate_spherized_batch2_cell_line]
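For context, a minimal sketch of how UMAP coordinates like these can be computed from a profile file (the file path is illustrative, and umap-learn is assumed to be installed):

import pandas as pd
import umap
from pycytominer.cyto_utils import infer_cp_features

# illustrative path; any level 4 or spherized profile file works
df = pd.read_csv("spherized_profiles/example_spherized_profiles.csv.gz")

features = infer_cp_features(df)

# 2D embedding of the feature space; fix the seed so figures are reproducible
embedding = umap.UMAP(random_state=0).fit_transform(df.loc[:, features])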

Broad ID to MOA Discrepancy

Hello @gwaygenomics
I noticed that there is a discrepancy in some mappings from Broad ID to MOA.
For instance, in repurposing_info_long.tsv, the Broad ID BRD-K66035042-001-10-1 maps to the MOA mucolytic agent,
while in repurposing_info_external_moa_map_resolved.tsv, the same Broad ID maps to the MOA diuretic. Which .tsv file is correct? Thank you!

Two discrepancies in MOA `pert_iname` between samples and drugs files

I am working on step 4 of #5 (comment) and came across two discrepancies between the samples and drugs files. They are likely very minor, and can easily be resolved, but I am noting them here for completeness.

When comparing pert_iname between the two files (drugs and samples), every single pert_iname entry in the drugs file is found in the samples file. However, two pert_iname entries are found in the samples file and not in the drugs file.

The two pert_iname entries are:

  1. YM-298198-desmethyl
  2. golgicide-A

Reconciliation of YM-298198-desmethyl

The compound YM-298198-desmethyl is missing from the drugs file, but the entry YM-298198 is present.
YM-298198-desmethyl is a derivative of YM-298198, and therefore has a different structure.

(Screenshots of the corresponding samples-file and drugs-file entries omitted.)

Conclusion

I think it is safe to duplicate the YM-298198 drug entry and make one pert_iname entry YM-298198-desmethyl.

Reconciliation of golgicide-A

This appears to be, simply, an issue of capitalization. Note the exact same SMILES string, but different broad_ids (because of different purities and vendors). (Screenshots of the corresponding samples-file and drugs-file entries omitted.)

Conclusion

I will rename the first entry in the samples file for BRD-A57886255-001-02-9 to have a lower-case golgicide-A.

Summary

Two pert_iname entries had conflicts. The proposed solutions will remedy them. Note that I will include the Jupyter notebook that performs this adjustment in the repo.

Evaluate profile normalization strategies

Currently, all profiles are normalized with mad_robustize. We can use this repository to systematically evaluate whether one strategy is better than another.

Rationale

As noted in #4 (comment)

The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.

Yes. Rationale: mostly empirical – robustize resulted in higher (compared to standardize) replicate correlations of Level 4 across a few experiments we tested this in.
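For intuition, here is the difference between the two methods in rough pseudocode (a sketch of the general idea, not the exact cytominer or pycytominer internals):

import numpy as np

def standardize(x):
    # classic z-score; sensitive to outlier wells
    return (x - x.mean()) / x.std()

def mad_robustize(x, epsilon=1e-18):
    # robust z-score; median and median absolute deviation resist outliers
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    return (x - median) / (mad + epsilon)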

Channel order

Hello! Where can I find which markers/stains each of the 5 image channels correspond to?

Dropping outlier features

MB said:

I have found an "error" in the LINCS dataset and I was wondering if you guys knew of this and if there needs to be some fixing of the pycytominer pipeline? I am analyzing the Level 5 consensus data from here. When running the cytominer-eval functions on this data, I noticed some very high correlations. They come from this one feature (Nuclei_AreaShape_MedianRadius) that is 10^13 times larger than the others. The image shows a scatter plot of two samples which have a 1.000 similarity but are different compounds.

[scatter plot omitted]

This is almost definitely because the MAD of these features is zero in DMSO (at least for the plates that those compounds come from). A sketch of the proposed fix follows the steps below.

https://github.com/cytomining/pycytominer/blob/a04397d9cd7e25828d2f24f986a3386a79e6193d/pycytominer/operations/transform.py#L142

  1. Add drop_outliers to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py
  2. Reprocess
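A sketch of the proposed fix using pycytominer's feature_select (the cutoff value is an assumption; check the pycytominer defaults):

from pycytominer import feature_select

# drop features whose normalized values blow up, e.g. when the DMSO MAD is ~0
cleaned_df = feature_select(
    profiles=normalized_df,  # hypothetical level 4a dataframe
    features="infer",
    operation="drop_outliers",
    outlier_cutoff=15,  # assumed threshold on absolute feature values
)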

Required Steps for Depositing Profiles

I am working towards processing all Drug Repurposing data and adding the results in this repository. The cell health project (https://github.com/broadinstitute/cell-health) now requires that the data are uniformly processed, documented, and made available here.

I will outline below the necessary steps required to get the data and processing pipelines uploaded.

  1. Make sure there are only small floating point differences between cytominer-derived profiles and pycytominer-derived profiles.
    • We are discussing this in #3
    • I noted a potential discrepancy in cytominer-based documentation that needs addressing
  2. Implement broad sample specific annotations
  3. Rerun the "all" profiles pipeline described in broadinstitute/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad#3 (currently a private repo)
    • This needs to be rerun with the updated mad_robustize normalization strategy, which will also require a decision on whole-plate or DMSO-specific normalization.
  4. Rerun 4.apply module in cell-health
    • Only after steps 1-3 are complete, can I rerun the 4.apply module
    • I will explore whether or not to make the lincs-cell-painting profile repository a submodule of the cell-health project

Using InChIKey as the common field for mapping

I had previously settled on using InChIKey14 as the common field for mapping across different repurposing hub versions (#13) partly due to the success in manually mapping three compounds across all the versions (#11 (comment)). Also, since there are only 45/1514 compounds (#11 (comment)) from the repurposing profiles dataset that do not map to any broad_ids in the most recent repurposing hub version (20200324), this approach may be the most effective.

But given #17, it may be worth repeating this pipeline with InChIKey as the common field for merging, as InChIKey does uniquely identify stereoisomers. My current assumption is that there will be many more than 45 compounds from the repurposing profiles dataset that do not map to the most recent broad_ids, but I believe it will be useful to know the actual number.

@gwaygenomics I can begin by creating a new PR by modifying the mapping code (2.map-broad_id.ipynb) and perhaps you could re-run the rest of the pipeline to generate a table similar to #11 (comment)?

Add --no-name gzip flag to compression file output

We get annoying file diff triggers when reprocessing the pipeline, even if nothing changes in the file. This is important to fix so that we are able to isolate actual changes that result from reprocessing output data.

As @shntnu notes in #48, the reason the gzip files trigger positive diffs is an added timestamp.

The way to remove the timestamp from the file is to pass a --no-name (-n) flag to the gzip command. See http://linuxcommand.org/lc3_man_pages/gzip1.html

Fortunately, it looks like pandas-dev/pandas#33398 has added the ability to include args to pandas gzip compression. This improvement will be included in pandas version 1.1, which is scheduled for an Aug 1 release.

Three Options

  • pandas v1.1 option (assuming that it solves this problem!)
  • base python option (outlined in #48 (comment))
  • bash option (outlined in #48 (comment)).

For the pandas or python option, the solution should ideally live in pycytominer. I've created a stub for this at cytomining/pycytominer#83
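Sketches of the base-python and pandas options (the pandas form assumes the compression-args feature from that PR works as described):

import gzip
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0]})  # stand-in for a profile dataframe

# Base python: mtime=0 writes a fixed timestamp, so identical content
# produces identical bytes across reprocessing runs
with gzip.GzipFile("profiles.csv.gz", mode="wb", mtime=0) as f:
    f.write(df.to_csv(index=False).encode("utf-8"))

# pandas >= 1.1 (assumed): forward extra keyword args to gzip via a dict
df.to_csv("profiles.csv.gz", index=False, compression={"method": "gzip", "mtime": 0})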

Comparing Cytominer and Pycytominer Profiles

In this issue, I will discuss results of step 3 outlined in #22 (comment)

Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results

Comparing Pycytominer and Cytominer Processing

We have previously processed all of the Drug Repurposing Hub Cell Painting data using cytominer. Cytominer is an R-based image-based profiling tool. In this repo, we reprocess the data with pycytominer. As the name connotes, pycytominer is a Python-based image-based profiling tool.

We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.

We output several metrics comparing the two approaches.

Metrics

In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.

  1. Mean, median, and sum of element-wise differences
  2. Per feature mean, median, and sum of element-wise differences
  3. Feature selection procedure differences per feature (level 4b only)

In addition, we confirm alignment of the following metadata columns:

  • Well
  • Broad Sample Name
  • Plate

Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
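A sketch of the comparison logic (cyto_df and pycyto_df are hypothetical dataframes already aligned row-by-row and sharing the same feature columns):

# element-wise absolute differences between the two pipelines
diff = (pycyto_df[features] - cyto_df[features]).abs()

# 1. overall mean, median, and sum
overall = {"mean": diff.mean().mean(), "median": diff.median().median(), "sum": diff.sum().sum()}

# 2. per-feature mean, median, and sum
per_feature = diff.agg(["mean", "median", "sum"]).T

# confirm metadata alignment (column name is an assumption)
assert (pycyto_df["Metadata_Well"] == cyto_df["Metadata_Well"]).all()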

Data Levels

Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.

| Data | Level | Comparison |
| --- | --- | --- |
| Images | Level 1 | NA |
| SQLite File (single cell profiles) | Level 2 | NA |
| Aggregated Profiles with Well Information (metadata) | Level 3 | Yes |
| Normalized Aggregated Profiles with Metadata | Level 4a | Yes |
| Normalized and Feature Selected Aggregated Profiles with Metadata | Level 4b | Yes |
| Perturbation Profiles created Summarizing Replicates | Level 5 | No |

Finalizing authors

We need to finalize authors for the version 1 release of this repository.

The authors should be those involved in the LINCS Cell Painting data creation. This means individuals involved in:

  • funding acquisition
  • experiment planning
  • assay optimization
  • data collection
  • data processing
  • data curation

I know that this is a complex task, but its complexity matches its importance. Once we define authors we can ensure proper attribution to all papers that use this data. @shntnu - please help me with this :)

Once we define authors, I will also:

Create consensus spherized profiles

Given that we create a single CSV file for the spherized profiles in this notebook, it will be easiest to compute the consensus in the same notebook.

The output should be stored at lincs-cell-painting/spherized_profiles/consensus and be named

  • 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_median.csv.gz
  • 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_median.csv.gz
  • 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_modz.csv.gz
  • 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_modz.csv.gz

i.e. median and modz consensus for each of the two Batch 1 files in this directory.

And same for Batch 2 (2017_12_05_Batch2)
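A sketch of how these consensus files could be generated with pycytominer's consensus function (the replicate columns follow the broad_sample + dose convention used in the evaluation example later on this page; exact argument names may differ by version):

from pycytominer import consensus

# median consensus: collapse replicate wells per compound and dose
median_df = consensus(
    profiles=spherized_df,  # hypothetical spherized dataframe
    replicate_columns=["Metadata_broad_sample", "Metadata_mg_per_ml"],
    operation="median",
    features="infer",
)

# modz consensus: replicate-correlation-weighted average
modz_df = consensus(
    profiles=spherized_df,
    replicate_columns=["Metadata_broad_sample", "Metadata_mg_per_ml"],
    operation="modz",
    features="infer",
)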

Improve spherize documentation

The spherize notebook documentation should be improved.

@shntnu notes in #60

The notebook says

Here, we load in all normalized profiles (level 4a) data across all plates and apply a spherize transform using the DMSO profiles as the background distribution.

but it should say

Here, we load in all normalized profiles (level 4a) data across all plates, apply the standard set of feature selection operations, and then apply a spherize transform using the DMSO profiles as the background distribution.

This is a very easy fix, and a good beginner issue!

Java dependency in environment.yml

I keep getting this popup when running

conda env create --force --file environment.yml

[screenshot of the popup omitted]

No clue what this is about.

I'm on macOS 10.15.7.

Make single cell .SQLite files publicly available

We'd ideally like to make all single cell SQLite files publicly available. As @shntnu noted to me in a separate email, the lab has a process in place to accomplish this, which is great!

To summarize the plan that @shntnu outlined:

Step 1: Make SQLite files available via RODA

However, this step has two blocking tasks:

Step 2: Unarchive SQLite files

The big lift here is unarchiving the data.

Unarchiving notes:

Step 3: Copy SQLite files

All we need to do here is copy the unarchived SQLite to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}.
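A sketch of step 3 with boto3 (the source bucket and key layout are hypothetical placeholders for wherever the unarchived files land):

import boto3

s3 = boto3.client("s3")

batch = "2016_04_01_a549_48hr_batch1"
plate = "SQ00014813"
source_bucket = "example-archive-bucket"  # hypothetical unarchive location

# server-side copy of one backend file into the Cell Painting Gallery layout
s3.copy(
    CopySource={"Bucket": source_bucket, "Key": f"backend/{batch}/{plate}/{plate}.sqlite"},
    Bucket="cellpainting-gallery",
    Key=f"cpg0004-lincs/broad/workspace/backend/{batch}/{plate}/{plate}.sqlite",
)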

Make decisions up to consensus clearer

Talking to Mattias, I noticed that there are some holes in the explanation of what happens to get to the consensus data.

  1. It is not made clear anywhere that 'mad_robustize' is the normalization method for all the consensus data.
  2. As Shantanu has already pointed out, the consensus data should be all in one place.
  3. Can we have the spherized_profiles in the profiles folder? That makes more sense, I think.
  4. Can we have a high-level explanation of the normalization techniques, so people don't need to go into the actual pycytominer code to understand what normalizing by DMSO vs. normalizing by plate means, and also what the difference between MAD, standard, and MAD-robustize is?
  5. Finally, normalizing by plate is actually normalizing by the entire batch (136 plates), right? If so, it's a suboptimal name.

Sorry for all these suggestions at once. I don't know what is on your plate (haha) right now, @gwaygenomics, so I'm unsure how to move forward with them.

CC: @FloHu, @shntnu @niranjchandrasekaran

Add consensus perturbation signatures

My current plan is as follows:

  • 1. Process each plate independently (✅ in #34)
  • 2. Generate an across-plate consensus signature on broad_sample and dose.
  • 3. The consensus signature will be based on median and MODZ
  • 4. Output one single file for the full consensus signature
  • 5. Output a separate file for a feature selected consensus signature (derived after calculating consensus)

Unifying documentation for this step from #4 (comment) and #34 (comment)

Plate SQ00015049 is not processing

Something is wrong with plate SQ00015049. We successfully processed all other plates except this one. Below is the error:

Now processing... Plate: SQ00015049
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    cursor, statement, parameters, context
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlite3.DatabaseError: database disk image is malformed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "profile.py", line 56, in <module>
    ap = AggregateProfiles(sql_file=sql_file, strata=strata, operation=aggregate_method)
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pycytominer/aggregate.py", line 86, in __init__
    self.load_image()
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pycytominer/aggregate.py", line 118, in load_image
    self.image_df = pd.read_sql(sql=image_query, con=self.conn)
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 438, in read_sql
    chunksize=chunksize,
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 1218, in read_query
    result = self.execute(*args)
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 1087, in execute
    return self.connectable.execute(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 976, in execute
    return self._execute_text(object_, multiparams, params)
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1151, in _execute_text
    parameters,
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
    e, statement, parameters, cursor, context
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1482, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    cursor, statement, parameters, context
  File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: select TableNumber, ImageNumber, Image_Metadata_Plate, Image_Metadata_Well from image]
(Background on this error at: http://sqlalche.me/e/4xp6)

Add Download + Usage Instructions

In the past, I've noticed that some users of the data struggle to access data in git lfs. We need to add downloading (and perhaps submodule setup) instructions to the README of this repo.

Update pycytominer version

We have made substantial progress in pycytominer since the version 0.1 release. We need to update the environment.yml file and update the profiling pipeline to account for this change.

Get per-plate evaluation metrics

Use the cytominer-eval library. An example (https://github.com/jump-cellpainting/develop-computational-pipeline/issues/4#issuecomment-693006903) is pasted below:

After installing with:

pip install git+https://github.com/cytomining/cytominer-eval@56bd9e545d4ce5dea8c2d3897024a4eb241d06db

This now works:

import pandas as pd
from cytominer_eval import evaluate
from pycytominer.cyto_utils import infer_cp_features

file = "https://github.com/broadinstitute/lincs-cell-painting/raw/master/profiles/2016_04_01_a549_48hr_batch1/SQ00014813/SQ00014813_normalized_feature_select_dmso.csv.gz"
df = pd.read_csv(file)

features = infer_cp_features(df)
meta_features = infer_cp_features(df, metadata=True)

replicate_groups = ["Metadata_broad_sample", "Metadata_mg_per_ml"]

evaluate(
    profiles=df,
    features=features,
    meta_features=meta_features,
    replicate_groups=replicate_groups,
    operation="percent_strong",
    percent_strong_quantile=0.95
)

# Output: 0.32598039215686275

operation="grit" and operation="precision_recall" are also implemented.

(see https://github.com/cytomining/cytominer-eval/blob/master/cytominer_eval/evaluate.py for details)

Update profile workflow figure

In #73 @michaelbornholdt fixed the workflow diagram for processing the LINCS Cell Painting profiles.

We have since realized that this workflow diagram is incorrect.

@michaelbornholdt - are you able to adjust the diagram and file a pull request to correct the figure? We will need to do this before we submit the manuscript (which will be soon!)

Plate causing numerical issues

In the cell health project, I noticed some strange behavior with a specific plate.

The plate is SQ00015221 coming from plate map C-7161-01-LM6-011. The offending features seem to be based on Correlation_RWC.

We include the plate in this repo, but I am adding a note here that we should revisit. I have a sneaking suspicion that the issue stems from the missing values and zero issue noted in cytomining/pycytominer#79 and described in cytomining/cytominergallery#62

Second batch of lincs data

@shntnu - I remember you mentioning that we have another batch of Cell Painting data for this project. Can you point me to where this data lives?

We should work towards getting this data on here and processed. @sMyn42 is looking for a good dataset to test different batch effect correction tools (e.g. Harmony) to extend his Summer Research project. Having the second batch on here will help!

Old/Updated Broad IDs

I have encountered perhaps a significant hurdle in adding Cell Painting Repurposing Hub profiles to this repo.

There are broad ids (pert_id) in the profile data that are absent from the updated moa information.

For example, in one plate (SQ00014814) the following pert_ids are present (with annotations) in the profile data, but are absent in the repurposing moa files in this repo:

['BRD-A69275535',
'BRD-A69636825',
'BRD-A69815203',
'BRD-A72309220',
'BRD-A72390365',
'BRD-A74980173',
'BRD-A82156122',
'BRD-K50691590',
'BRD-K68164687',
'BRD-K71480163',
'BRD-K81258678',
'BRD-K81957469']

Given that these pert_ids have annotations in cytominer-derived profiles, this indicates that the pert_ids have changed somewhere.

Before I pursue this issue, I was wondering if there are any known solutions or datasets that map old to updated pert_ids. cc @shntnu @niranjchandrasekaran

Perhaps also @jrsacher has insight here. Josh, I scanned the CLUE and DepMap resources and was not able to find a map. I also checked the column deprecated_broad_id and I was able to recover 3 of the profiles (['BRD-K50691590', 'BRD-K50691590', 'BRD-K81258678']).

Any insights or pointers here would be greatly appreciated!
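For reference, the deprecated_broad_id check mentioned above could look something like this (a sketch; the column names and the prefix-matching logic are assumptions about the metadata):

# hypothetical: moa_df is the repurposing metadata with broad_id and
# deprecated_broad_id columns
deprecated_map = (
    moa_df.dropna(subset=["deprecated_broad_id"])
    .set_index("deprecated_broad_id")["broad_id"]
    .to_dict()
)

# matching may need to happen on the 13-character pert_id prefix
missing = ["BRD-A69275535", "BRD-A69636825", "BRD-A69815203"]  # subset of the list above
recovered = {
    pert_id: new_id
    for old_id, new_id in deprecated_map.items()
    for pert_id in missing
    if old_id.startswith(pert_id)
}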

What to do with pert_ids with conflicting information? [RESOLVED]

I am following up #7 with an additional notebook to create a simple, basic mapping file with only a handful of columns. This includes creating a pert_id column, which is a 13-character subset of the full 22-character broad_id column. The additional 9 characters contain batch info about the compound. More details about this procedure are here: #5 (comment)
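Concretely, using a broad_id that appears earlier on this page:

broad_id = "BRD-K66035042-001-10-1"  # 22 characters: pert_id plus batch info

pert_id = broad_id[:13]
print(pert_id)  # BRD-K66035042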

In generating this data, I noticed that 16 perturbations (by pert_id) contain conflicting information (by pert_iname, moa, or target). I paste all of the conflicting info below:

| pert_id | pert_iname | moa | target |
| --- | --- | --- | --- |
| BRD-A03204438 | allopregnanolone | GABA receptor positive allosteric modulator | GABRA1 |
| BRD-A03204438 | pregnanolone | GABA receptor positive allosteric modulator | nan |
| BRD-K05674516 | sofosbuvir | RNA polymerase inhibitor | nan |
| BRD-K05674516 | PSI-7976 | HCV inhibitor | nan |
| BRD-K17498618 | betaxolol | adrenergic receptor antagonist | ADRB1 |
| BRD-K17498618 | cisatracurium | acetylcholine receptor antagonist | CHRNA2 |
| BRD-K20672254 | pyrantel-tartrate | acetylcholine receptor agonist | CHRNA1 |
| BRD-K20672254 | pyrantel-pamoate | neuromuscular blocker | nan |
| BRD-K25650355 | physostigmine-salicylate | acetylcholinesterase inhibitor | nan |
| BRD-K25650355 | physostigmine | cholinesterase inhibitor | ACHE |
| BRD-K29713308 | mebhydrolin | antihistamine | nan |
| BRD-K29713308 | mebhydroline-1,5-naphtalenedisulfonate | nan | nan |
| BRD-K35952844 | calcium-gluceptate | nan | nan |
| BRD-K35952844 | sodium-glucoheptonate | nan | nan |
| BRD-K41260949 | valproic-acid | HDAC inhibitor | ABAT |
| BRD-K41260949 | divalproex-sodium | benzodiazepine receptor agonist | ALDH5A1 |
| BRD-K66035042 | mannitol-D | diuretic | nan |
| BRD-K66035042 | sorbitol | mucolytic agent | nan |
| BRD-K71013094 | neomycin-sulfate | bacterial 30S ribosomal subunit inhibitor | nan |
| BRD-K71013094 | neomycin | bacterial 30S ribosomal subunit inhibitor | CXCR4 |
| BRD-K79450420 | INCB-024360 | indoleamine 2,3-dioxygenase inhibitor | IDO1 |
| BRD-K79450420 | epacadostat | indoleamine 2,3-dioxygenase inhibitor | IDO1 |
| BRD-K87202646 | isoniazid | FABI inhibitor | CYP1A2 |
| BRD-K87202646 | pasiniazid | cyclooxygenase inhibitor | nan |
| BRD-K93632104 | salicylic-acid | cyclooxygenase inhibitor | AKR1C1 |
| BRD-K93632104 | sodium-salicylate | prostanoid receptor antagonist | ASIC3 |
| BRD-K97799481 | theophylline | adenosine receptor antagonist | nan |
| BRD-K97799481 | aminophylline | adenosine receptor antagonist | ADORA1 |
| BRD-K97799481 | oxtriphylline | adenosine receptor antagonist | ADORA1 |
| BRD-M55114534 | pyrvinium | androgen receptor antagonist | nan |
| BRD-M55114534 | pyrvinium-pamoate | androgen receptor antagonist | AR |

Note about citing clue.io

We concluded that it is ok to have Level 3-5 data on GitHub, although we will cite clue.io as the primary source and references for this data, similar to this README by @gwaygenomics

The following data will eventually be made available on clue.io/morphology

Level 1-5 + connectivity file of

  • REP (1571 compounds, A549, 48h, 6 doses)
  • LKCP (~360 compounds, MCF7/A549/U2OS, 6h/24h/48h, 3 doses)
  • DBG (a subset of LKCP)

How do updated moa/target annotations influence moa/target recall?

A potentially fun analysis would be to evaluate how the moa/target annotations, which have been updated over time (in different CLUE drugs/samples versions), influence moa/target recall.

Essentially, we would set up an eval framework (I imagine there is a traditional moa/target recall eval) where we use the same input profiles and alter the moa/target information as it has been updated over time.

If we see improvement over time this tells us that annotations are improving and, potentially, that there is even more room to improve categorization.

Add License

@shntnu - we should add an open source license to this repo before adding profiles. Have we thought about which license we should apply here?

Delete column names after spherizing

@FloHu noticed that the original names of the features are carried through after spherizing.
We might not want to do that, since it can confuse people using the consensus data later on. I think the features should just be named feature_1 ... feature_x.
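A sketch of the proposed renaming (assumes metadata columns lead the column order and can be inferred with pycytominer's infer_cp_features helper):

from pycytominer.cyto_utils import infer_cp_features

meta_features = infer_cp_features(spherized_df, metadata=True)  # hypothetical spherized dataframe
n_features = spherized_df.shape[1] - len(meta_features)

# keep metadata names; replace spherized feature names with generic labels
spherized_df.columns = meta_features + [f"feature_{i + 1}" for i in range(n_features)]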

Compute per channel percent strong

While working on CPJUMP-Stain2, @shntnu and I observed that the proportion of compounds with a strong signal (percent strong metric) was similar if the analysis was performed with individual channels or across all channels. We wanted to find out if this behavior was seen in other datasets as well.

I chose one of the platemaps (H-BIOA-002-1) from BBBC022 and computed the channel-wise correlation values. Based on the results below, it looks like BBBC022 also behaves similarly.

[Figure: BBBC022 channel-wise percent strong comparison (BBBC022_H-BIOA-002-1_channels)]

[Figure: BBBC022 all-channels percent strong (BBBC022_H-BIOA-002-1_all)]

Performing this experiment with a larger dataset, such as LINCS, may help answer whether the above plots are technical artifacts or if this behavior is consistent across datasets.
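A sketch of how the per-channel analysis could be run on LINCS with cytominer-eval, reusing df, features, meta_features, and replicate_groups from the per-plate example above (the channel names and token matching are assumptions about the CellProfiler feature naming):

from cytominer_eval import evaluate

# standard Cell Painting channel tags as they typically appear in feature names (assumed)
channels = ["DNA", "RNA", "ER", "AGP", "Mito"]

for channel in channels:
    # crude token filter; Correlation features mention two channels and will match both
    channel_features = [f for f in features if channel in f.split("_")]
    score = evaluate(
        profiles=df,
        features=channel_features,
        meta_features=meta_features,
        replicate_groups=replicate_groups,
        operation="percent_strong",
        percent_strong_quantile=0.95,
    )
    print(channel, score)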


Adding profiles to dvc

I am working on this now.

Asks

@shntnu

  • Can you provide me with access (and a pointer) to which AWS bucket to use for permanent dvc storage and access?
  • I also remember you wanting me to document adding DVC to this repo somewhere else, but I cannot find the link to where you want me to document steps. Can you also provide me this pointer again? Thanks! (see cross references below)

Cross references

A couple cross-references to track history of DVC discussions:

  • Discussing DVC after frozen data in #62
  • Some discussion of migration from git lfs to DVC after adding frozen data version 1 in #63
  • Discussing adding DVC to the standard profiling recipe cytomining/profiling-template#13

Download profiles via dvc command line

Hi @gwaygenomics and team,

We've worked on some cell profiling tools and would be interested to try them on this dataset. Unfortunately, I am having trouble downloading the profile data. It would be great if you had some pointers to help with that.

At the moment I tried the dvc get command line tool (I am new to it but quite excited by the concept 😊) but I am probably doing something wrong (I tried this on Windows 10 from a PowerShell terminal).

[screenshot of the dvc error omitted]

Thanks for your help,

Kind regards,
Benoit

Adding Level 3-5 Cell Painting Data Questions

I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.

  1. Confirm what the levels actually are! 😆
  2. I assume I should add cytominer profiles here? We should consider the pycytominer-based profiles less mature (and therefore less stable)? The cytominer profiles are the ones that were originally computed.
    • Located here: /home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1
    • Were they processed using the standard profiling workflow?

Upload Image Files to IDR

We will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps.

From IDR:

  1. study file describing the overall study and the screens that were performed e.g. cell health
  2. library file(s) describing the plate layout of each screen e.g. cell health
  3. processed data file(s) containing summary results and/or a 'hit' list for each screen

All files should be in tab-delimited text format.
Templates are provided but can be modified to suit your experiment.
Add or remove columns from the templates as necessary.

@gwaygenomics Did you have a processed data file for cell health?

Perturbation Metadata File - Perturbation ID and MOA

We have additional information for each compound assayed in the Drug Repurposing Hub Cell Painting Dataset.

There are at least four files on AWS that could all work as a reference to describe compound metadata.

| File Name | Columns |
| --- | --- |
| pert_info.txt | pert_id, pert_iname, pert_type, moa |
| pert_iname_moa.txt | pert_iname, moa, source, url, support, num_sources |
| pert_id_to_iname.txt | pert_id, pert_iname, pert_type |
| pert_iname_moa_aggregated.txt | pert_iname, moa, pert_iname_modified |

Below I summarize each of the files. (Screenshots previewing pert_info.txt, pert_iname_moa.txt, pert_id_to_iname.txt, pert_iname_moa_aggregated.txt, and a confirmed pert_info.txt subset are omitted.)

Should we reprocess all profiles before frozen data release?

I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:

  • release pycytominer version 0.1. It will be great to include a stable pycytominer version in the conda environment. We've upgraded pycytominer so much since the original reprocessing, and rerunning profiles will ease headaches (see below). (Decided not to pursue)
  • update MOA map for batch 2 data (see #61 (comment))

What headaches will an updated pycytominer resolve?

  • the updated pycytominer fixes the no-name gzip flag (#50)
  • updated naming convention "blacklist" -> "blocklist"
  • potential to change epsilon in spherize()

Rerunning the pipeline will also enable us to migrate from git lfs to dvc.

Time estimate

  1. Runtime will take non-negligible time, probably ~1 week, but it will increase confidence and organization of the data.
  2. Migrating from git lfs to dvc will take 4 hours
  3. Releasing pycytominer version 0.1 will take longer. I think we are close to an official version 0.1 release https://github.com/cytomining/pycytominer/milestone/1

Add cell count files

In #34, I added preliminary cell count files. However, they included an extra column and were tab separated (see #34 (comment))

In 2141da9 I removed the cell count files in order to process them more consistently. The files are currently being generated, and this issue will be closed once they are added back.

Updated Strategy for Adding Profiles

At the profiling check-in today, we discussed our strategy for adding (and evaluating) profiles in this project.

This issue supersedes #4

  1. I will complete #21 and @niranjchandrasekaran will review
  2. We will add median profiles to this repository as a first step
  3. I will confirm floating point differences between pycytominer and cytominer processed profiles (note the limitations discussed in #3 (comment))
  4. We will add mean profiles to this repo next
  5. We will perform an evaluation of sorts to compare mean vs. median profiles

Processing data using DeepProfiler

Juan asked this:

I need to process the LINCS dataset to proceed with the plan we discussed for LUAD. I'm going to need access to the images, which is a lot of data! Here is my plan to make it efficient, and I would like to get your feedback and recommendations:

  • Get all the plates back from Glacier for a few days. If I remember correctly, the entire set is 21TB or so.
  • Use EC2 instances to compress the images using DeepProfiler. I did this in the past and we can get down to 800GB or so.
  • Save the compressed dataset in S3 to work with it during the next couple of months, then send it to Glacier.  In fact, with that size we can probably keep it in the DGX and the GPU-cluster too.

Does this make sense? Do you have any recommendations for me before moving forward?

By the way, the compression should take roughly 4 hours per plate (pessimistic estimate), and can be run in parallel, with multiple plates per machine (one per CPU). So using 30 cheap instances (with 4 cores each) in spot mode should do the trick in one day, including the operator's time :)
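A quick sanity check of that estimate (assuming both batches, 136 + 135 = 271 plates, are processed):

plates = 271
machines = 30
cpus_per_machine = 4

waves = -(-plates // (machines * cpus_per_machine))  # ceiling division -> 3 waves
hours = waves * 4  # pessimistic 4 hours per plate
print(waves, hours)  # 3 waves, ~12 hours: consistent with the one-day estimate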
