wayscience / benchmarking_nf1_data

Benchmarking data processing strategies for Cell Painting data of NF1 Schwann cells. See analysis repository (https://github.com/WayScience/NF1_SchwannCell_data_analysis) for information on how the data was interpreted.

License: Creative Commons Zero v1.0 Universal

Languages: Jupyter Notebook 98.40%, Python 1.24%, R 0.35%, Shell 0.01%
Topics: image-analysis, machine-learning, cell-painting, neurofibromatosis-type-1, schwann-cells

benchmarking_nf1_data's Introduction

NF1 Schwann Cell Data Project

Data

The data used in this project come from a modified Cell Painting assay on Schwann cells from patients with Neurofibromatosis type 1 (NF1). This modified assay uses three channels:

  • DAPI (Nuclei)
  • GFP (Endoplasmic Reticulum)
  • RFP (Actin)

Modified_Cell_Painting.png

There are two genotypes of the NF1 gene in these cells:

  • Wild type (WT +/+): In column 6 of the plate (e.g., C6, D6)
  • Null (Null -/-): In column 7 of the plate (e.g., C7, D7)
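The plate layout above can be expressed as a small lookup. This is a minimal sketch, not code from the repository: the function name `genotype_from_well` and the labels "WT" and "Null" are chosen here for illustration, and only columns 6 and 7 are covered because those are the only ones described.

```python
import re

# Column-to-genotype layout taken from the list above (column 6 = WT, column 7 = Null).
GENOTYPE_BY_COLUMN = {6: "WT", 7: "Null"}


def genotype_from_well(well: str) -> str:
    """Return the NF1 genotype label for a well ID such as 'C6' or 'D07'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", well.strip())
    if match is None:
        raise ValueError(f"Unrecognized well ID: {well!r}")
    column = int(match.group(2))
    if column not in GENOTYPE_BY_COLUMN:
        raise ValueError(f"No genotype recorded for plate column {column}")
    return GENOTYPE_BY_COLUMN[column]


for well in ["C6", "D6", "C7", "D7"]:
    print(well, genotype_from_well(well))
```

Raising an error for any other column keeps wells with an unknown genotype from being silently labeled.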

It is important to study Schwann cells from NF1 patients because NF1 causes patients to develop neurofibromas, which are red bumps on the skin (tumors) that appear due to the loss of Ras-GAP neurofibromin. This loss occurs when the NF1 gene is mutated (NF1 +/-).

Goal

The goal of this project is to predict NF1 genotype from Schwann cell morphology. We apply cell image analysis to Cell Painting images and use representation learning to extract morphology features. We will apply machine learning to the morphology features to discover a biomarker of NF1 genotype. Once we discover a biomarker from these cells, we hope that our method can be used for drug discovery to treat this rare disease.

Repository Structure

| Module | Purpose | Description |
| --- | --- | --- |
| 0_download_data | Download NF1 pilot data | Download images from each NF1 dataset (e.g. pilot and second plate) for analysis |
| 1_preprocessing_data | Perform Illumination Correction (IC) | Use BaSiCPy to perform IC on images per channel |
| 2_segmenting_data | Segment Objects | Perform segmentation using Cellpose and output center (x,y) coordinates for each object |
| 3_extracting_features | Extract features | Use center (x,y) coordinates in DeepProfiler to extract features from all channels |
| 4_processing_features | Normalize CellProfiler features | Use Pycytominer functions to merge and normalize features acquired from CellProfiler |
| CellProfiler_pipelines | Perform a full pipeline on NF1 data using CellProfiler (from IC to feature extraction) | We run two CellProfiler pipelines (1. illumination correction and 2. segmentation and feature extraction) |
| TBD | TBD | TBD |

benchmarking_nf1_data's People

Contributors

gwaybio, jenna-tomkinson

benchmarking_nf1_data's Issues

Current CP illum pipeline is not automated + saves images as different bit depth

For both CP and DP, the corrected images are saved at a different bit depth than the original images during the IC step. I am not 100% sure why we rescaled and converted to 8-bit in the PyBaSiC method (which is reflected in CellProfiler), but there might be a good reason for it.

From what I have read, bit depth matters. When reducing bit depth as we do, we follow the correct approach (see https://bioimagebook.github.io/chapters/1-concepts/3-bit_depths/python.html) of rescaling first and then converting. The open question is whether we need to reduce the bit depth at all.
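To illustrate why the order matters, here is a minimal NumPy sketch of rescaling before converting to 8-bit. It is not the code used in the pipeline: the function name `rescale_to_uint8` and the pixel values are made up, and the per-image min-max stretch is an assumption of this sketch.

```python
import numpy as np


def rescale_to_uint8(image: np.ndarray) -> np.ndarray:
    """Rescale an image to the 0-255 range, then convert it to 8-bit."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Flat image: nothing to stretch, so return zeros to avoid dividing by zero.
        return np.zeros(image.shape, dtype=np.uint8)
    scaled = (image - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Hypothetical 16-bit pixel values, for illustration only.
pixels_16bit = np.array([[0, 1000], [30000, 65535]], dtype=np.uint16)
print(pixels_16bit.astype(np.uint8))   # direct cast wraps around (modulo 256)
print(rescale_to_uint8(pixels_16bit))  # rescaled first, so intensity order is preserved
```

Because each image is stretched independently here, absolute intensities are no longer comparable across images, which is one reason to question whether the conversion is needed.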

In addition, the CP illumination correction pipeline is not automated, unlike the CellProfiler analysis pipeline.

ComplexHeatmap error for Pybasic Cellpose pipeline

I get an error when running the PyBasic Cellpose pipeline output (normalized and feature selected) through the ComplexHeatmap notebook.

I am not sure what this error means:

Error in hclust(get_dist(submat, distance), method = method): NA/NaN/Inf in foreign function call (arg 10)

All other pipelines work fine and produce the expected outputs.

MeasureImageIntensity not measuring whole image intensity

All plate 1 and plate 2 pipelines along with extracting single cell features will need to be rerun.

The NF1_analysis.cppipe file contains a mistake: the MeasureImageIntensity module measures intensity within objects rather than across the whole image. I set it up this way at the beginning because I misinterpreted what needed to be measured.

This is an easy fix and will need to be done to make sure these whole image features are included.

Apply pycytominer-transform to real world CP-derived SQLite

@d33bs - consider applying pycytominer-transform to the existing SQLite data here to test your approach with real world data.

We would be particularly interested in comparing the results from pycytominer.cyto_utils.cells.SingleCells to the pycytominer-transform derived parquet file.

More details about the SingleCells processing are here: https://github.com/WayScience/NF1_SchwannCell_data/blob/5536d86330aba0f66c74e2bd44c6e82ed1c985f9/4_processing_features/extract_single_cell_features.ipynb

SQLite File Creation Issue

A barrier that I came across while creating a CellProfiler pipeline was making a SQLite file from the extracted features with the collate() function from pycytominer. My goal was to create one SQLite file from the multiple .csv files that CellProfiler produced.

Through prototyping and testing, I hit a roadblock.

The roadblock was that the directory structure was not the same as the one described in the Image-based Profiling Handbook appendix. There was meant to be an Experiment.csv file in every "well-site" folder. I then learned from the CellProfiler documentation that Experiment.csv was only being saved once.

To find a solution, I created an issue in the pycytominer repository (see cytomining/pycytominer#229). Beth Cimini kindly gave advice on how to reach my goal.

She explained that the collate function works best when you have a large dataset that you cannot run locally. In contrast, she recommended using the ExportToDatabase module within CellProfiler to export all of the extracted features to a SQLite file if your dataset is smaller and/or can be run on your local machine.

Since the NF1 pilot data is very small, I decided to pivot from the collate() function to using the ExportToDatabase module. The whole analysis pipeline in CellProfiler took approximately 15 minutes on my MacBook Air.
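For reference, the original goal of gathering several .csv files into one SQLite file can be sketched with the Python standard library alone. This is neither pycytominer's collate() nor CellProfiler's ExportToDatabase, and it does not reproduce the schema either tool writes. The function name, table name, and file names are hypothetical, and the sketch assumes every .csv shares the same header.

```python
import csv
import sqlite3
from pathlib import Path


def csvs_to_sqlite(csv_paths: list[Path], sqlite_path: Path, table: str) -> int:
    """Append rows from CSV files that share a header into one SQLite table."""
    rows_written = 0
    conn = sqlite3.connect(sqlite_path)
    try:
        for csv_path in csv_paths:
            with open(csv_path, newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader)
                columns = ", ".join(f'"{name}"' for name in header)
                placeholders = ", ".join("?" for _ in header)
                # Columns are created without types, so values are stored as text.
                conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
                rows = list(reader)
                conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
                rows_written += len(rows)
        conn.commit()
    finally:
        conn.close()
    return rows_written
```

A call such as `csvs_to_sqlite(sorted(Path("output").glob("*/Cells.csv")), Path("NF1.sqlite"), "Cells")` would combine one hypothetical `Cells.csv` per well-site folder into a single table.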

Improved CellProfiler segmentation parameters -> Need to rerun Plate 1

While running test mode for Plate 2 with the NF1_analysis.cpproj file used for Plate 1, I noticed that the actin segmentation was not very good.

I updated the parameters to:

  • Select method for identifying secondary objects => Propagation (previously Watershed - Image)
  • Threshold correction factor => 1.0 (previously 1.5)

This change improved the segmentation: it no longer undersegmented the cells and captured the full cell. The only caveat is that the improved parameters cause any holes in the actin to be filled and counted as a part of the cell. This is better than losing data.

image

And this image shows the improvement of the segmentation (along with the inclusion of the hole as a part of the cell) in the same cell using the updated parameters:

image

These better parameters will need to be used for Plate 1 to stay consistent. Plate 1 will need to be rerun at some point in the near future.

Missing index.csv file in DP Projects

I am trying to look at the distribution of single cells per well in the DP feature extraction analysis (4_processing_features/extract_sc_features_dp.ipynb) but I cannot run the notebook.

I am receiving an error in cell 4:

deep_data_nuc = DeepProfiler_processing.DeepProfilerData(
    index_file_nuc, profile_dir_nuc, filename_delimiter="/", file_extension=".npz"
)

FileNotFoundError: [Errno 2] No such file or directory: '../3_extracting_features/NF1_nuc_project-DP/inputs/metadata/index.csv'
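A small pre-flight check can report missing inputs before the notebook reaches the DeepProfilerData call. This is a minimal sketch: the helper name `missing_inputs` is hypothetical, and the path is copied from the error message above.

```python
from pathlib import Path


def missing_inputs(paths: list[Path]) -> list[Path]:
    """Return the subset of expected input paths that do not exist on disk."""
    return [path for path in paths if not path.exists()]


# Path copied from the FileNotFoundError above.
index_file_nuc = Path("../3_extracting_features/NF1_nuc_project-DP/inputs/metadata/index.csv")

missing = missing_inputs([index_file_nuc])
if missing:
    print("Missing DeepProfiler inputs:", ", ".join(str(path) for path in missing))
```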

Second Plate Issues

Many issues have come up with the second plate of NF1 data.

The main issue at the moment is that when we first received the images (all .tif files), they were RGB (3-channel) images and contained a scale.

When we received the updated second plate dataset, the images were still RGB but no longer contained the scale.

When splitting the images in CellProfiler to take the one channel that corresponds to the colored channel in a composite Cell Painting image, I noticed discrepancies in how CellProfiler displayed the split channels.

Here is what it looks like with the images with scale:

image

Here is what it looks like with the updated images without the scale:

image

As seen in both images, the OrigGreen channel in the image with the scale is less intense than in the image without the scale, where OrigGreen and GrayBlue look like exactly the same image.

I am unsure whether this makes a difference in terms of image quality. I am only taking one channel from the split RGB images, so some data could be lost, but maybe not. These are my observations of the current discrepancies between the NF1 datasets.
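The visual comparison of split channels could also be made numerically. The sketch below assumes an image has already been loaded as a (height, width, 3) NumPy array; loading the .tif files is not shown. It reports the largest pixel difference between each pair of channels, so a value of 0 means two channels are identical. The function name and the toy arrays are hypothetical.

```python
import numpy as np


def channel_differences(rgb: np.ndarray) -> dict[str, int]:
    """Return the largest absolute pixel difference between each pair of RGB channels."""
    # Cast to a signed type so the subtraction cannot wrap around.
    red, green, blue = (rgb[..., i].astype(np.int64) for i in range(3))
    return {
        "red_vs_green": int(np.abs(red - green).max()),
        "red_vs_blue": int(np.abs(red - blue).max()),
        "green_vs_blue": int(np.abs(green - blue).max()),
    }


# Toy example: one grayscale image copied into all three channels.
gray = np.array([[0, 10], [200, 255]], dtype=np.uint8)
identical_channels = np.stack([gray, gray, gray], axis=-1)
print(channel_differences(identical_channels))
```

If every difference is 0, taking a single channel from the RGB image discards nothing, because the other two channels carry the same values.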

Linear model and power analysis for both DP feature sets

Currently, we use only one of the DP feature sets for linear modeling and power analysis. We should use both.

It would also be good to clarify what is different between the feature spaces. Are there differences per single cell feature? This might help us to zero in on efficientnet features that are significantly capturing nuclei or cytoplasm!

Rename github repo

Should clarify (make it obvious!) that this is benchmarking.

Also link to the other standard nf1_cell_painting repo

Linking single cells between DeepProfiler projects

Due to the structure of Cellpose and DeepProfiler, there was a need to separate out the objects into two DeepProfiler projects. The two main reasons include:

  1. Cellpose only being able to extract center x and y coordinates for one object at a time
  2. DeepProfiler needing different box size parameters per object

Currently, only univariate tests can be performed. To perform multivariate tests, the processed features will need to be merged into a single DeepProfiler data frame that includes each single cell along with all of the features from the nuclei and cytoplasm coordinates.

There is a need for a linking function that can combine features per single cell (row) based on the center x and y coordinates. Essentially, if the coordinates are within a certain number of pixels of each other, the two single cells from each project can be merged.
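One possible shape for such a linking function is sketched below. It is not an existing DeepProfiler or pycytominer function: the name `link_cells`, the greedy closest-first matching, and the example coordinates are all assumptions of this sketch. In practice it would be applied separately to the objects from each image or site.

```python
import math


def link_cells(
    nuclei: list[tuple[float, float]],
    cytoplasm: list[tuple[float, float]],
    max_distance: float,
) -> list[tuple[int, int]]:
    """Greedily pair nuclei and cytoplasm centers that lie within max_distance pixels.

    Returns (nucleus_index, cytoplasm_index) pairs; each object is used at most once.
    """
    candidates = []
    for i, (nx, ny) in enumerate(nuclei):
        for j, (cx, cy) in enumerate(cytoplasm):
            distance = math.hypot(nx - cx, ny - cy)
            if distance <= max_distance:
                candidates.append((distance, i, j))
    # Accept the closest pairs first so each object is linked at most once.
    candidates.sort()
    used_nuclei, used_cytoplasm, links = set(), set(), []
    for _, i, j in candidates:
        if i not in used_nuclei and j not in used_cytoplasm:
            used_nuclei.add(i)
            used_cytoplasm.add(j)
            links.append((i, j))
    return sorted(links)


# Hypothetical center coordinates from the two DeepProfiler projects.
nuclei_centers = [(10.0, 10.0), (50.0, 50.0), (200.0, 200.0)]
cytoplasm_centers = [(52.0, 49.0), (11.0, 12.0)]
print(link_cells(nuclei_centers, cytoplasm_centers, max_distance=5.0))
```

The returned index pairs can then be used to join the matching rows of the two feature tables. Objects with no partner inside the threshold are left unlinked.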

Open data access

Hi @jenna-tomkinson - let's make the data public.

I've reviewed the agreement and there is nothing binding on the timing of data release or the nature of our analyses (as long as we perform our statement of work!). Releasing the data now makes it a bunch easier for us to analyze it in multiple ways.

Can you let me know how big the data are? This will help us to decide where to put it.
