
SCeptre

(Single Cell proteomics readout of expression)

SCeptre is a python package that extends the functionalities of Scanpy to analyze multiplexed single-cell proteomics data.

Installation from Github

Tested on Ubuntu 20.04.1 LTS.
It's recommended to work in a dedicated conda environment, e.g.:

conda create -n sceptre python=3.7
conda activate sceptre

Clone the repository and cd into its root directory. Then:

pip install .

Usage

Replication of Schoof et al.

Usage is exemplified in the notebooks for the analysis from the manuscript "Quantitative Single-Cell Proteomics as a Tool to Characterize Cellular Hierarchies".

The analysis can be replicated using the provided conda environment:

conda env create -f Schoof_et_al/code/environment.yml
conda activate sceptre
pip install Schoof_et_al/code/sceptre-0.1-py3-none-any.whl

The required data can be downloaded from http://proteomecentral.proteomexchange.org using the dataset identifier PXD020586.

Find the notebooks in the subdirectory Schoof_et_al/code, place the required data in Schoof_et_al/data, and create the folder Schoof_et_al/results/tmp/.

The following notebooks process the different datasets:

Notebook          Description
300ms.ipynb       SCeptre analysis of the 'medium' dataset.
500ms.ipynb       SCeptre analysis of the 'high' dataset.
bulk.ipynb        SCeptre analysis of the 'bulk' dataset.
integrated.ipynb  SCeptre analysis of the 'integrated' dataset.

Functions and example workflow

Each function has its own docstring explaining the function in depth. A typical workflow makes use of the following steps:

Create meta data

To create the meta data for each cell from a collection of tables describing the experimental design and the layouts of the 384-well plates, as done in Schoof et al., the following function is used. For details on the required tables, have a look at Schoof_et_al/data/500ms.

import sceptre as spt

res_dir = "../results/"  # example output directory, reused below
spt.create_meta_data(input_dir="../data/500ms/", output_dir=res_dir)

Alternatively, the meta data table can be created by the user. It requires the columns File ID and Channel to map the meta data to each cell.
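
For illustration, a minimal sketch of such a user-created table (all columns besides File ID and Channel, the file name, and the tab-separated format are assumptions for this example):

import pandas as pd

# hypothetical example; only the 'File ID' and 'Channel' columns are required
meta = pd.DataFrame({
    "File ID": ["F1", "F1", "F2", "F2"],
    "Channel": ["127N", "128N", "127N", "128N"],
    "Plate": [1, 1, 2, 2],  # example annotation column
})
# assumed tab-separated, matching the meta path passed to load_dataset below
meta.to_csv(res_dir + "meta.txt", sep="\t", index=False)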

Load dataset

To load the dataset into python, the following function is used. Currently, only output tables from Proteome Discoverer are supported.

dataset = spt.load_dataset(proteins = "../data/500ms/500ms_Proteins.txt",
                           psms = "../data/500ms/500ms_PSMs.txt",
                           msms = "../data/500ms/500ms_MSMSSpectrumInfo.txt",
                           files = "../data/500ms/500ms_InputFiles.txt",
                           meta = res_dir + "meta.txt")

LC-MS QC

The dataset object can be used to quality control each LC-MS run with the following functions.

spt.plot_psms_msms(dataset)
spt.plot_avg_sn(dataset)
spt.plot_set_overview(dataset)

s_c_channels = ['127N', '128N', '128C', '129N', '129C', '130N', '130C',
                '131N','131C', '132N', '132C', '133N', '133C', '134N']
spt.print_ms_stats(dataset, s_c_channels=s_c_channels)

spt.plot_interference(dataset)

Load dataset into Scanpy

Subsequently, the dataset object is used to create a Scanpy AnnData object.

adata = spt.dataset_to_scanpy(dataset)
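
The resulting object can be inspected with the standard AnnData accessors, e.g.:

print(adata)      # dimensions and registered annotations
adata.obs.head()  # per-cell meta data, e.g. 'File ID' and 'Channel'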

Filtering and normalization

Non-single cell channels have to be removed.

adata = adata[adata.obs['Channel'] != '126'].copy()
adata = adata[adata.obs['Channel'] != '127C'].copy()
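
Equivalently, the single-cell channel list from the QC step above can be reused (a sketch assuming every channel outside that list is a non-single-cell channel):

adata = adata[adata.obs['Channel'].isin(s_c_channels)].copy()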

Then the dataset can be normalized.

spt.normalize(adata)

Cell QC

The following functions are used to filter out outlier cells.

spt.calculate_cell_filter(adata)
spt.apply_cell_filter(adata)

Batch QC

To detect potential systematic bias introduced by the sample preparation or measurement, the following functions can be used.

spt.plot_batch_qc(adata)
spt.plot_plate_qc(adata)

Scanpy analysis

The adata object can now be used to perform a Scanpy analysis.
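
For orientation, a minimal downstream sketch with standard Scanpy calls (the parameters are illustrative defaults, not the settings used in the manuscript):

import scanpy as sc

sc.pp.pca(adata)                   # linear dimensionality reduction
sc.pp.neighbors(adata)             # kNN graph in PCA space
sc.tl.umap(adata)                  # 2D embedding
sc.tl.leiden(adata)                # graph-based clustering
sc.pl.umap(adata, color='leiden')  # visualize clusters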

sceptre's Issues

Normalization with Pandas 2.0

Hey:) Awesome work!

I had a look at the normalization function in the scope of re-analyzing the data. The code relies on default behaviour that is already deprecated as of version 2.0.

Some data

import pandas as pd

print(f"{pd.__version__ = }")

# create a test dataset of 3 files and 3 plexes for 5 protein groups
# with different levels of expressions
data = {
    'P1' : {('F1', 'plex1'): 1.0,
            ('F1', 'plex2'): 1.4,
            ('F1', 'plex3'): 1.6,
            ('F2', 'plex1'): 1.2,
            ('F2', 'plex2'): 1.6,
            ('F2', 'plex3'): 1.8,
            ('F3', 'plex1'): 1.1,
            ('F3', 'plex2'): 1.5,
            ('F3', 'plex3'): 1.7,},
    'P2' : {('F1', 'plex1'): 2.0,
            ('F1', 'plex2'): 2.4,
            ('F1', 'plex3'): 2.6,
            ('F2', 'plex1'): 2.2,
            ('F2', 'plex2'): 2.6,
            ('F2', 'plex3'): 2.8,
            ('F3', 'plex1'): 2.1,
            ('F3', 'plex2'): 2.5,
            ('F3', 'plex3'): 2.7,},
    'P3' : {('F1', 'plex1'): 3.0,
            ('F1', 'plex2'): 3.4,
            ('F1', 'plex3'): 3.6,
            ('F2', 'plex1'): 3.2,
            ('F2', 'plex2'): 3.6,
            ('F2', 'plex3'): 3.8,
            ('F3', 'plex1'): 3.1,
            ('F3', 'plex2'): 3.5,
            ('F3', 'plex3'): 3.7,}           
}
data = pd.DataFrame(data)
data.index = data.index.rename(['File ID', 'Channel'])
data
pd.__version__ = '1.5.3'
P1 P2 P3
File ID Channel
F1 plex1 1.0 2.0 3.0
plex2 1.4 2.4 3.4
plex3 1.6 2.6 3.6
F2 plex1 1.2 2.2 3.2
plex2 1.6 2.6 3.6
plex3 1.8 2.8 3.8
F3 plex1 1.1 2.1 3.1
plex2 1.5 2.5 3.5
plex3 1.7 2.7 3.7

apply to groupby (which returns a DataFrame)

I was a bit confused by how you use groupby (see below) and investigated the behaviour a bit. Interestingly, the implicit (or flexible) apply
using instance-dispatched methods and the explicit apply don't behave exactly the same.

Compare the implicit usage

data.groupby(axis=0, level=0).describe()
P1 P2 P3
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
File ID
F1 3.0 1.333333 0.305505 1.0 1.2 1.4 1.5 1.6 3.0 2.333333 ... 2.5 2.6 3.0 3.333333 0.305505 3.0 3.2 3.4 3.5 3.6
F2 3.0 1.533333 0.305505 1.2 1.4 1.6 1.7 1.8 3.0 2.533333 ... 2.7 2.8 3.0 3.533333 0.305505 3.2 3.4 3.6 3.7 3.8
F3 3.0 1.433333 0.305505 1.1 1.3 1.5 1.6 1.7 3.0 2.433333 ... 2.6 2.7 3.0 3.433333 0.305505 3.1 3.3 3.5 3.6 3.7

3 rows × 24 columns

to the explicit usage

data.groupby(axis=0, level=0).apply(lambda df: df.describe())
P1 P2 P3
File ID
F1 count 3.000000 3.000000 3.000000
mean 1.333333 2.333333 3.333333
std 0.305505 0.305505 0.305505
min 1.000000 2.000000 3.000000
25% 1.200000 2.200000 3.200000
50% 1.400000 2.400000 3.400000
75% 1.500000 2.500000 3.500000
max 1.600000 2.600000 3.600000
F2 count 3.000000 3.000000 3.000000
mean 1.533333 2.533333 3.533333
std 0.305505 0.305505 0.305505
min 1.200000 2.200000 3.200000
25% 1.400000 2.400000 3.400000
50% 1.600000 2.600000 3.600000
75% 1.700000 2.700000 3.700000
max 1.800000 2.800000 3.800000
F3 count 3.000000 3.000000 3.000000
mean 1.433333 2.433333 3.433333
std 0.305505 0.305505 0.305505
min 1.100000 2.100000 3.100000
25% 1.300000 2.300000 3.300000
50% 1.500000 2.500000 3.500000
75% 1.600000 2.600000 3.600000
max 1.700000 2.700000 3.700000

In pandas version 1.5 they highlight an API change to come

[screenshot: pandas 1.5 deprecation notice about group keys in DataFrameGroupBy.apply]

which leads to errors in the normalization function (below) when using pandas=2, because the group keys are added back. The behaviour on the
example data is:

data.groupby(axis=0, level=0, group_keys=False).apply(lambda df: df.describe())
P1 P2 P3
count 3.000000 3.000000 3.000000
mean 1.333333 2.333333 3.333333
std 0.305505 0.305505 0.305505
min 1.000000 2.000000 3.000000
25% 1.200000 2.200000 3.200000
50% 1.400000 2.400000 3.400000
75% 1.500000 2.500000 3.500000
max 1.600000 2.600000 3.600000
count 3.000000 3.000000 3.000000
mean 1.533333 2.533333 3.533333
std 0.305505 0.305505 0.305505
min 1.200000 2.200000 3.200000
25% 1.400000 2.400000 3.400000
50% 1.600000 2.600000 3.600000
75% 1.700000 2.700000 3.700000
max 1.800000 2.800000 3.800000
count 3.000000 3.000000 3.000000
mean 1.433333 2.433333 3.433333
std 0.305505 0.305505 0.305505
min 1.100000 2.100000 3.100000
25% 1.300000 2.300000 3.300000
50% 1.500000 2.500000 3.500000
75% 1.600000 2.600000 3.600000
max 1.700000 2.700000 3.700000

For methods returning scalars, the behaviour of implicit and explicit applying is the same in pandas=1.5:

data.groupby(axis=0, level=0).median()
P1 P2 P3
File ID
F1 1.4 2.4 3.4
F2 1.6 2.6 3.6
F3 1.5 2.5 3.5
data.groupby(axis=0, level=0).apply(lambda df: df.median())
P1 P2 P3
File ID
F1 1.4 2.4 3.4
F2 1.6 2.6 3.6
F3 1.5 2.5 3.5

Current implementation

It now raises warnings and uses mixed APIs for grouping:

SCeptre/sceptre/sceptre.py

Lines 568 to 603 in d1d4015

quant = pd.DataFrame(
    adata.X.T.copy(),
    columns=adata.obs[["File ID", "Channel"]]
    .set_index(["File ID", "Channel"])
    .index,
).replace(0, np.nan)
if method == "file_channel":
    for i in range(100):  # iterate to converge to normalized channel and file
        quant_0 = quant.copy()
        # file bias normalization
        # calculate median for each protein in each sample
        med = quant.T.reset_index().groupby("File ID").median().T
        # calculate the factors needed for a median shift
        med_tot = med.median(axis=1)
        factors = med.divide(med_tot, axis=0)
        quant = quant.groupby(axis=1, level=0).apply(
            lambda x: x.divide(factors.loc[:, x.name], axis=0)
        )
        # channel bias normalization
        # calculate median for each protein in each channel
        med = quant.T.reset_index().groupby("Channel").median().T
        # calculate the factors needed for a median shift
        med_tot = med.median(axis=1)
        factors = med.divide(med_tot, axis=0)
        quant = quant.groupby(axis=1, level=1).apply(
            lambda x: x.divide(factors.loc[:, x.name], axis=0)
        )
        # stop iterating when the change in quant to the previous iteration is below iter_thresh
        if (abs(quant - quant_0).max().max()) <= iter_thresh:
            break
    print("performed {} iterations".format(i + 1))

iter_thresh = 0.01
iter_max = 100

# for documentation: How data is accessed from anndata object:
# pd.DataFrame(
#         adata.X.T.copy(),
#         columns=adata.obs[["File ID", "Channel"]]
#         .set_index(["File ID", "Channel"])
#         .index,
#     ).replace(0, np.nan)

I replaced the anndata object directly with the transformed DataFrame:

  • the reset_index operation leads to one column being non-numeric -> throws a warning that the behaviour is going to change
  • groupby uses apply directly on instances (DataFrame), where in pandas=2.0 the keys are added repeatedly to the columns (not shown here)
quant = data.copy().T

for i in range(iter_max):  # iterate to converge to normalized channel and file
    quant_0 = quant.copy()
    # file bias normalization
    # calculate median for each protein in each sample
    med = quant.T.reset_index().groupby("File ID").median().T
    # calculate the factors needed for a median shift
    med_tot = med.median(axis=1)
    factors = med.divide(med_tot, axis=0)

    quant = quant.groupby(axis=1, level=0).apply(
        lambda x: x.divide(factors.loc[:, x.name], axis=0)
    )

    # channel bias normalization
    # calculate median for each protein in each channel
    med = quant.T.reset_index().groupby("Channel").median().T
    # calculate the factors needed for a median shift
    med_tot = med.median(axis=1)
    factors = med.divide(med_tot, axis=0)
    quant = quant.groupby(axis=1, level=1).apply(
        lambda x: x.divide(factors.loc[:, x.name], axis=0)
    )

    # stop iterating when the change in quant to the previous iteration is below iter_thresh
    if (abs(quant - quant_0).max().max()) <= iter_thresh:
        print("Max deviation: {:.2f}".format(abs(quant - quant_0).max().max()))
        break
print("performed {} iterations".format(i + 1))
quant.T
Max deviation: 0.00
performed 2 iterations


<ipython-input-10-7d4d30a6d175>:7: FutureWarning: The default value of numeric_only in DataFrameGroupBy.median is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  med = quant.T.reset_index().groupby("File ID").median().T
<ipython-input-10-7d4d30a6d175>:12: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

	>>> .groupby(..., group_keys=True)
  quant = quant.groupby(axis=1, level=0).apply(
<ipython-input-10-7d4d30a6d175>:18: FutureWarning: The default value of numeric_only in DataFrameGroupBy.median is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  med = quant.T.reset_index().groupby("Channel").median().T
<ipython-input-10-7d4d30a6d175>:22: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

	>>> .groupby(..., group_keys=True)
  quant = quant.groupby(axis=1, level=1).apply(
(the same four FutureWarnings are emitted again for the second iteration)
P1 P2 P3
File ID Channel
F1 plex1 1.461039 2.480159 3.486717
plex2 1.500000 2.500000 3.500000
plex3 1.512605 2.507716 3.505564
F2 plex1 1.534091 2.518315 3.512545
plex2 1.500000 2.500000 3.500000
plex3 1.488971 2.492877 3.494745
F3 plex1 1.500000 2.500000 3.500000
plex2 1.500000 2.500000 3.500000
plex3 1.500000 2.500000 3.500000

Updated version

  • only uses a groupby operation (no implicit or explicit apply)
  • dividing by the factors relies on classic index alignment
  • the repetition of the code blocks is easier to see
  • hopefully easier to understand and more stable

It seems to converge faster and with fewer floating point errors.

quant = data.copy()
for i in range(iter_max):  # iterate to converge to normalized channel and file
    quant_0 = quant.copy()

    # file bias normalization
    # calculate median for each protein in each sample
    med = quant.groupby(axis=0, level=0).median()
    # calculate the factors needed for a median shift of features (e.g. proteins)
    med_tot = med.median(axis=0)
    factors = med.divide(med_tot, axis=1)
    # apply the factors to the data
    quant = quant.divide(factors)

    # channel bias normalization
    # calculate median for each protein in each channel
    med = quant.groupby(axis=0, level=1).median()
    # calculate the factors needed for a median shift of features (e.g. proteins)
    med_tot = med.median(axis=0)
    factors = med.divide(med_tot, axis=1)
    # apply the factors to the data
    quant = quant.divide(factors)

    # stop iterating when the change in quant to the previous iteration is below iter_thresh
    max_dev = abs(quant - quant_0).max().max()
    if (max_dev) <= iter_thresh:
        print(f"Max deviation: {max_dev:.2f}")
        break
    print("performed {} iterations".format(i + 1))
quant
performed 1 iterations
Max deviation: 0.00
P1 P2 P3
File ID Channel
F1 plex1 1.461039 2.480159 3.486717
plex2 1.500000 2.500000 3.500000
plex3 1.512605 2.507716 3.505564
F2 plex1 1.534091 2.518315 3.512545
plex2 1.500000 2.500000 3.500000
plex3 1.488971 2.492877 3.494745
F3 plex1 1.500000 2.500000 3.500000
plex2 1.500000 2.500000 3.500000
plex3 1.500000 2.500000 3.500000

I hope I did not miss any implementation detail. If not, I am happy to create a PR.
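
For reference, a self-contained sketch of the proposed normalization on the toy data from above, additionally avoiding the axis= keyword of groupby (deprecated since pandas 2.1) by folding the two bias corrections into one inner loop:

quant = data.copy()  # rows: (File ID, Channel) MultiIndex, columns: proteins
for i in range(iter_max):
    quant_0 = quant.copy()
    for level in ("File ID", "Channel"):  # file bias, then channel bias
        # median for each protein within each group
        med = quant.groupby(level=level).median()
        # factors needed for a median shift of each protein
        factors = med.divide(med.median(axis=0), axis=1)
        # divide each row by its group's factors
        quant = quant.divide(factors)
    # stop when the change to the previous iteration is below iter_thresh
    if abs(quant - quant_0).max().max() <= iter_thresh:
        break
print(f"performed {i + 1} iterations")

The division relies on the same level-based index alignment between the flat index of factors and the row MultiIndex of quant that the proposed version above already uses.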
