li-li-github / cytoself

This project forked from royerlab/cytoself

Self-supervised models for encoding protein localization patterns from microscopy images

License: BSD 3-Clause "New" or "Revised" License

Python 96.41% Jupyter Notebook 3.59%

cytoself

This is the PyTorch implementation of cytoself. The original TensorFlow implementation is archived in the branch cytoself-tensorflow.

Note: Branch names have been changed. cytoself-pytorch -> main, the previous main -> cytoself-tensorflow.

Rotating_3DUMAP

cytoself is a self-supervised platform for learning features of protein subcellular localization from microscopy images [1]. The representations derived from cytoself capture highly specific features from which functional insights into proteins can be drawn solely on the basis of their localization.

Applying cytoself to images of endogenously labeled proteins from the recently released OpenCell database creates a highly resolved protein localization atlas [2].

[1] Kobayashi, Hirofumi, et al. "Self-Supervised Deep-Learning Encodes High-Resolution Features of Protein Subcellular Localization." Nature Methods (2022). https://www.nature.com/articles/s41592-022-01541-z
[2] Cho, Nathan H., et al. "OpenCell: Endogenous tagging for the cartography of human cellular organization." Science 375.6585 (2022): eabi6983. https://www.science.org/doi/10.1126/science.abi6983

How cytoself works

cytoself learns the localization patterns of proteins from two inputs: cell images in which only a single type of protein is fluorescently labeled, and the identity of that protein (protein ID), which serves as the label.

Workflow_diagram

Installation

Recommended: create a new environment and install cytoself into it. cytoself is distributed on PyPI; the last command below instead installs an editable copy from a local clone of this repository, so run it from the repository root.

(Optional) To run cytoself on GPUs, install the GPU version of PyTorch before installing cytoself, following the official instructions. The installation procedure for the GPU version depends on your OS and CUDA version.

conda create -y -n cytoself python=3.9
conda activate cytoself
# (Optional: Install pytorch GPU following the official instruction)
pip install -e .
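
Before training, it can help to confirm which device PyTorch will use. The following is a minimal sketch and not part of cytoself: the helper name describe_torch_device is hypothetical, and it relies only on torch.cuda.is_available().

```python
import importlib.util


def describe_torch_device() -> str:
    """Return 'cuda', 'cpu', or 'torch not installed' (hypothetical helper)."""
    # Check for the package first so the sketch also runs without PyTorch.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_torch_device())
```

If this prints cpu on a machine with a GPU, the CPU-only build of PyTorch was probably installed.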

(For the developers) Install from this repository

Install development dependencies

pip install -r requirements/development.txt
pre-commit install

How to use cytoself on the sample data

Download one set of the image and label data listed under Data Availability. A Colab notebook is also available.

1. Prepare Data

from cytoself.datamanager.opencell import DataManagerOpenCell

data_ch = ['pro', 'nuc']
datapath = 'sample_data'  # path to download sample data
DataManagerOpenCell.download_sample_data(datapath)  # download data
datamanager = DataManagerOpenCell(datapath, data_ch, fov_col=None)
datamanager.const_dataloader(batch_size=32, label_name_position=1)

A folder, sample_data, will be created and sample data will be downloaded to this folder. The sample_data folder will be created in the "current working directory," which is where you are running the code. Use os.getcwd() to check where the current working directory is.
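
The working directory, and therefore the location where sample_data ends up, can be checked as follows. This is a small standard-library sketch; it only resolves the path and does not download anything.

```python
import os
from pathlib import Path

datapath = "sample_data"  # same relative path as in the data preparation step

# A relative path is interpreted against the current working directory.
print("Current working directory:", os.getcwd())
print("Sample data folder:", Path(datapath).resolve())
```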

Nine sets of data, one per protein with 4 files each (36 files in total), will be downloaded. File names have the form <protein_name>_<channel or label>.npy.

  • *_label.npy file: Contains label information in 3 columns, i.e. Ensembl ID, protein name and localization.
  • *_pro.npy file: Image data of the protein channel. Size 100x100. Images were cropped with the nucleus centered (see details in the paper).
  • *_nuc.npy file: Image data of the nucleus channel. Size 100x100. Images were cropped with the nucleus centered (see details in the paper).
  • *_nucdist.npy file: Nucleus distance map. Size 100x100. Images were cropped with the nucleus centered (see details in the paper).
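
To inspect the downloaded files directly, the four arrays of one protein can be loaded with NumPy. This is a minimal sketch based on the file naming described above; the helper load_protein_set is hypothetical and not part of cytoself, and allow_pickle=True for the label file is an assumption (string labels may be stored as object arrays).

```python
from pathlib import Path

import numpy as np

SUFFIXES = ("label", "pro", "nuc", "nucdist")


def load_protein_set(datapath, protein_name):
    """Load the four .npy files of one protein into a dict keyed by suffix."""
    data = {}
    for suffix in SUFFIXES:
        path = Path(datapath) / f"{protein_name}_{suffix}.npy"
        # Assumption: only the label file may need pickle support.
        data[suffix] = np.load(path, allow_pickle=(suffix == "label"))
    return data
```

Printing the shape of each returned array is a quick way to confirm the 100x100 image size listed above.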

2. Create and train a cytoself model

from cytoself.trainer.cytoselflite_trainer import CytoselfFullTrainer

model_args = {
    'input_shape': (2, 100, 100),
    'emb_shapes': ((25, 25), (4, 4)),
    'output_shape': (2, 100, 100),
    'fc_output_idx': [2],
    'vq_args': {'num_embeddings': 512, 'embedding_dim': 64},
    'num_class': len(datamanager.unique_labels),
    'fc_input_type': 'vqvec',
}
train_args = {
    'lr': 1e-3,
    'max_epoch': 1,
    'reducelr_patience': 3,
    'reducelr_increment': 0.1,
    'earlystop_patience': 6,
}
trainer = CytoselfFullTrainer(train_args, homepath='demo_output', model_args=model_args)
trainer.fit(datamanager, tensorboard_path='tb_logs')

3. Plot UMAP

from cytoself.analysis.analysis_opencell import AnalysisOpenCell

analysis = AnalysisOpenCell(datamanager, trainer)
umap_data = analysis.plot_umap_of_embedding_vector(
    data_loader=datamanager.test_loader,
    group_col=2,
    output_layer=f'{model_args["fc_input_type"]}2',
    title=f'UMAP {model_args["fc_input_type"]}2',
    xlabel='UMAP1',
    ylabel='UMAP2',
    s=0.3,
    alpha=0.5,
    show_legend=True,
)

The output UMAP plot will be saved at demo_output/analysis/umap_figures/UMAP_vqvec2.png by default.

Result_UMAP

4. Plot feature spectrum

# Compute bi-clustering heatmap
analysis.plot_clustermap(num_workers=4)

# Prepare image data
img = next(iter(datamanager.test_loader))['image'].detach().cpu().numpy()[:1]

# Compute index histogram
vqindhist1 = trainer.infer_embeddings(img, 'vqindhist1')

# Reorder the index histogram according to the bi-clustering heatmap
ft_spectrum = analysis.compute_feature_spectrum(vqindhist1)

# Generate a plot
import numpy as np
import matplotlib.pyplot as plt

x_max = ft_spectrum.shape[1] + 1
x_ticks = np.arange(0, x_max, 50)
fig, ax = plt.subplots(figsize=(10, 3))
ax.stairs(ft_spectrum[0], np.arange(x_max), fill=True)
ax.spines[['right', 'top']].set_visible(False)
ax.set_xlabel('Feature index')
ax.set_ylabel('Counts')
ax.set_xlim([0, x_max])
ax.set_xticks(x_ticks, analysis.feature_spectrum_indices[x_ticks])
fig.tight_layout()
fig.show()
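
Conceptually, an index histogram counts how often each codebook entry is used for an image, and the feature spectrum is that histogram with its bins rearranged. The NumPy sketch below illustrates only this idea and is not cytoself's implementation: the index map is random, and feature_order is a hypothetical stand-in for the order obtained from the bi-clustering heatmap.

```python
import numpy as np

num_embeddings = 512  # codebook size, matching vq_args in the training step
rng = np.random.default_rng(0)

# Hypothetical quantized index map for one image: each entry is a codebook index.
vq_indices = rng.integers(0, num_embeddings, size=(25, 25))

# Index histogram: number of times each codebook entry is used.
index_hist = np.bincount(vq_indices.ravel(), minlength=num_embeddings)

# Hypothetical ordering standing in for the order given by the clustering.
feature_order = rng.permutation(num_embeddings)

# Feature spectrum: the same counts with the bins rearranged.
feature_spectrum = index_hist[feature_order]

# Reordering moves bins around but keeps the total count (25 * 25 = 625).
print(index_hist.sum(), feature_spectrum.sum())
```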

Tested Environments

Rocky Linux 8.6, NVIDIA A100, CUDA 11.7 (GPU)
Ubuntu 20.04.3 LTS, NVIDIA 3090, CUDA 11.4 (GPU)
Ubuntu 22.04.3 LTS, NVIDIA 4090, CUDA 12.2 (GPU)

Known Issues

There seem to be compatibility issues with Python multiprocessing on Windows that leave a DataLoader unable to load data (issue, issue). Please try the temporary workaround.

Data Availability

The full data used in this work can be found here. The image data have the shape of [batch, 100, 100, 4], in which the last channel dimension corresponds to [target protein, nucleus, nuclear distance, nuclear segmentation].
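
The full data store channels in the last dimension, whereas the model_args in the example above use a channel-first input_shape of (2, 100, 100), so a conversion along the following lines may be useful when working with the raw arrays. This is a minimal sketch on a synthetic array, assuming the channel order stated above; to_channel_first and the other names are illustrative and not part of cytoself.

```python
import numpy as np

CHANNELS = ("target protein", "nucleus", "nuclear distance", "nuclear segmentation")


def to_channel_first(images, channel_ids=(0, 1)):
    """Select channels from [batch, H, W, C] data and return [batch, C', H, W]."""
    selected = images[..., list(channel_ids)]
    return np.moveaxis(selected, -1, 1)


# Synthetic stand-in for real image data with shape [batch, 100, 100, 4].
fake_images = np.zeros((3, 100, 100, 4), dtype=np.float32)
protein_and_nucleus = to_channel_first(fake_images)
print(protein_and_nucleus.shape)  # (3, 2, 100, 100)
```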

Due to its large size, the whole dataset is split into 10 files. The files are intended to be concatenated to form one large numpy file or one large csv.

Image_data00.npy
Image_data01.npy
Image_data02.npy
Image_data03.npy
Image_data04.npy
Image_data05.npy
Image_data06.npy
Image_data07.npy
Image_data08.npy
Image_data09.npy
Label_data00.csv
Label_data01.csv
Label_data02.csv
Label_data03.csv
Label_data04.csv
Label_data05.csv
Label_data06.csv
Label_data07.csv
Label_data08.csv
Label_data09.csv
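
The files listed above could be concatenated roughly as follows. This is a sketch under stated assumptions rather than an official loader: the function names are hypothetical, the files are assumed to sit in one folder under the names listed, and has_header=True assumes that every csv file starts with a header row, which should be checked against the actual files.

```python
import csv
from pathlib import Path

import numpy as np


def concat_images(datapath, n_files=10):
    """Concatenate Image_data00.npy, Image_data01.npy, ... along the batch axis."""
    paths = [Path(datapath) / f"Image_data{i:02d}.npy" for i in range(n_files)]
    # mmap_mode="r" defers reading each file until the arrays are concatenated.
    return np.concatenate([np.load(p, mmap_mode="r") for p in paths], axis=0)


def concat_labels(datapath, n_files=10, has_header=True):
    """Concatenate Label_data00.csv, Label_data01.csv, ... into one list of rows.

    has_header is an assumption: if True, only the first file's header is kept.
    """
    rows = []
    for i in range(n_files):
        with open(Path(datapath) / f"Label_data{i:02d}.csv", newline="") as f:
            file_rows = list(csv.reader(f))
        if has_header and i > 0:
            file_rows = file_rows[1:]
        rows.extend(file_rows)
    return rows
```

The concatenated image array is held in memory in full, so enough RAM for the whole dataset is required.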
