
SCeptre

(Single Cell proteomics readout of expression)

SCeptre is a python package that extends the functionalities of Scanpy to analyze multiplexed single-cell proteomics data.

Installation from Github

Tested on Ubuntu 20.04.1 LTS.
It's recommended to work in a dedicated conda environment, e.g.:

conda create -n sceptre python=3.7
conda activate sceptre

Clone the repository and cd into its root directory. Then:

pip install .

Usage

Replication of Schoof et al.

Usage is exemplified in the notebooks for the analysis from the manuscript "Quantitative Single-Cell Proteomics as a Tool to Characterize Cellular Hierarchies".

The analysis can be replicated using the provided conda environment:

conda env create -f Schoof_et_al/code/environment.yml
conda activate sceptre
pip install Schoof_et_al/code/sceptre-0.1-py3-none-any.whl

The required data can be downloaded from http://proteomecentral.proteomexchange.org using the dataset identifier PXD020586.

Find the notebooks in the subdirectory Schoof_et_al/code, place the required data in Schoof_et_al/data, and create the folder Schoof_et_al/results/tmp/.

The following notebooks process the different datasets:

Notebook          Description
300ms.ipynb       SCeptre analysis of the 'medium' dataset.
500ms.ipynb       SCeptre analysis of the 'high' dataset.
bulk.ipynb        SCeptre analysis of the 'bulk' dataset.
integrated.ipynb  SCeptre analysis of the 'integrated' dataset.

Functions and example workflow

Each function has its own docstring explaining the function in depth. A typical workflow makes use of the following steps:

Create meta data

To create the meta data for each cell from a collection of tables describing the experimental design and the layouts of the 384-well plates, as done in Schoof et al., the following function is used. For details on the required tables, have a look at Schoof_et_al/data/500ms.

import sceptre as spt

res_dir = "../results/"  # example output directory, reused below
spt.create_meta_data(input_dir="../data/500ms/", output_dir=res_dir)

Alternatively, the meta data table can be created by the user. It requires the columns File ID and Channel to map the meta data to each cell.
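
For illustration, a minimal sketch of such a user-created table (all columns besides File ID and Channel, the file name, and the tab-separated format are assumptions for this example):

import pandas as pd

# hypothetical example; only the 'File ID' and 'Channel' columns are required
meta = pd.DataFrame({
    "File ID": ["F1", "F1", "F2", "F2"],
    "Channel": ["127N", "128N", "127N", "128N"],
    "Plate": [1, 1, 2, 2],  # example annotation column
})
# assumed tab-separated, matching the meta path passed to load_dataset below
meta.to_csv(res_dir + "meta.txt", sep="\t", index=False)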

Load dataset

To load the dataset into python, the following function is used. Currently, only output tables from Proteome Discoverer are supported.

dataset = spt.load_dataset(proteins = "../data/500ms/500ms_Proteins.txt",
                           psms = "../data/500ms/500ms_PSMs.txt",
                           msms = "../data/500ms/500ms_MSMSSpectrumInfo.txt",
                           files = "../data/500ms/500ms_InputFiles.txt",
                           meta = res_dir + "meta.txt")

LC-MS QC

The dataset object can be used to quality control each LC-MS run with the following functions.

spt.plot_psms_msms(dataset)
spt.plot_avg_sn(dataset)
spt.plot_set_overview(dataset)

s_c_channels = ['127N', '128N', '128C', '129N', '129C', '130N', '130C',
                '131N','131C', '132N', '132C', '133N', '133C', '134N']
spt.print_ms_stats(dataset, s_c_channels=s_c_channels)

spt.plot_interference(dataset)

Load dataset into Scanpy

Subsequently, the dataset object is used to create a Scanpy AnnData object.

adata = spt.dataset_to_scanpy(dataset)
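
The resulting object can be inspected with the standard AnnData accessors, e.g.:

print(adata)      # dimensions and registered annotations
adata.obs.head()  # per-cell meta data, e.g. 'File ID' and 'Channel'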

Filtering and normalization

Non-single cell channels have to be removed.

adata = adata[adata.obs['Channel'] != '126'].copy()
adata = adata[adata.obs['Channel'] != '127C'].copy()
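
Equivalently, the single-cell channel list from the QC step above can be reused (a sketch assuming every channel outside that list is a non-single-cell channel):

adata = adata[adata.obs['Channel'].isin(s_c_channels)].copy()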

Then the dataset can be normalized.

spt.normalize(adata)

Cell QC

The following functions are used to filter out outlier cells.

spt.calculate_cell_filter(adata)
spt.apply_cell_filter(adata)

Batch QC

To detect potential systematic bias introduced by the sample preparation or measurement, the following functions can be used.

spt.plot_batch_qc(adata)
spt.plot_plate_qc(adata)

Scanpy analysis

The adata object can now be used to perform a Scanpy analysis.
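
For orientation, a minimal downstream sketch with standard Scanpy calls (the parameters are illustrative defaults, not the settings used in the manuscript):

import scanpy as sc

sc.pp.pca(adata)                   # linear dimensionality reduction
sc.pp.neighbors(adata)             # kNN graph in PCA space
sc.tl.umap(adata)                  # 2D embedding
sc.tl.leiden(adata)                # graph-based clustering
sc.pl.umap(adata, color='leiden')  # visualize clusters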

sceptre's Issues

Normalization with Pandas 2.0

Hey:) Awesome work!

I had a look at the normalization function in the scope of re-analyzing the data. The code relies on default behaviour that is already deprecated as of version 2.0.

Some data

import pandas as pd

print(f"{pd.__version__ = }")

# create a test dataset of 3 files and 3 plexes for 5 protein groups
# with different levels of expressions
data = {
    'P1' : {('F1', 'plex1'): 1.0,
            ('F1', 'plex2'): 1.4,
            ('F1', 'plex3'): 1.6,
            ('F2', 'plex1'): 1.2,
            ('F2', 'plex2'): 1.6,
            ('F2', 'plex3'): 1.8,
            ('F3', 'plex1'): 1.1,
            ('F3', 'plex2'): 1.5,
            ('F3', 'plex3'): 1.7,},
    'P2' : {('F1', 'plex1'): 2.0,
            ('F1', 'plex2'): 2.4,
            ('F1', 'plex3'): 2.6,
            ('F2', 'plex1'): 2.2,
            ('F2', 'plex2'): 2.6,
            ('F2', 'plex3'): 2.8,
            ('F3', 'plex1'): 2.1,
            ('F3', 'plex2'): 2.5,
            ('F3', 'plex3'): 2.7,},
    'P3' : {('F1', 'plex1'): 3.0,
            ('F1', 'plex2'): 3.4,
            ('F1', 'plex3'): 3.6,
            ('F2', 'plex1'): 3.2,
            ('F2', 'plex2'): 3.6,
            ('F2', 'plex3'): 3.8,
            ('F3', 'plex1'): 3.1,
            ('F3', 'plex2'): 3.5,
            ('F3', 'plex3'): 3.7,}           
}
data = pd.DataFrame(data)
data.index = data.index.rename(['File ID', 'Channel'])
data
pd.__version__ = '1.5.3'
P1 P2 P3
File ID Channel
F1 plex1 1.0 2.0 3.0
plex2 1.4 2.4 3.4
plex3 1.6 2.6 3.6
F2 plex1 1.2 2.2 3.2
plex2 1.6 2.6 3.6
plex3 1.8 2.8 3.8
F3 plex1 1.1 2.1 3.1
plex2 1.5 2.5 3.5
plex3 1.7 2.7 3.7

apply to groupby (which returns a DataFrame)

I was a bit confused by how you use groupby (see below) and investigated the behaviour a bit. Interestingly, the implicit (or flexible) apply
using instance-dispatched methods and the explicit apply don't behave exactly the same.

Compare the implicit usage

data.groupby(axis=0, level=0).describe()
P1 P2 P3
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
File ID
F1 3.0 1.333333 0.305505 1.0 1.2 1.4 1.5 1.6 3.0 2.333333 ... 2.5 2.6 3.0 3.333333 0.305505 3.0 3.2 3.4 3.5 3.6
F2 3.0 1.533333 0.305505 1.2 1.4 1.6 1.7 1.8 3.0 2.533333 ... 2.7 2.8 3.0 3.533333 0.305505 3.2 3.4 3.6 3.7 3.8
F3 3.0 1.433333 0.305505 1.1 1.3 1.5 1.6 1.7 3.0 2.433333 ... 2.6 2.7 3.0 3.433333 0.305505 3.1 3.3 3.5 3.6 3.7

3 rows × 24 columns

to the explicit usage

data.groupby(axis=0, level=0).apply(lambda df: df.describe())
P1 P2 P3
File ID
F1 count 3.000000 3.000000 3.000000
mean 1.333333 2.333333 3.333333
std 0.305505 0.305505 0.305505
min 1.000000 2.000000 3.000000
25% 1.200000 2.200000 3.200000
50% 1.400000 2.400000 3.400000
75% 1.500000 2.500000 3.500000
max 1.600000 2.600000 3.600000
F2 count 3.000000 3.000000 3.000000
mean 1.533333 2.533333 3.533333
std 0.305505 0.305505 0.305505
min 1.200000 2.200000 3.200000
25% 1.400000 2.400000 3.400000
50% 1.600000 2.600000 3.600000
75% 1.700000 2.700000 3.700000
max 1.800000 2.800000 3.800000
F3 count 3.000000 3.000000 3.000000
mean 1.433333 2.433333 3.433333
std 0.305505 0.305505 0.305505
min 1.100000 2.100000 3.100000
25% 1.300000 2.300000 3.300000
50% 1.500000 2.500000 3.500000
75% 1.600000 2.600000 3.600000
max 1.700000 2.700000 3.700000

In pandas version 1.5 they highlight an API change to come

[screenshot: pandas 1.5 deprecation notice about group keys in DataFrameGroupBy.apply]

which leads to errors in the normalization function (below) when using pandas=2, because the group keys are added back. The behaviour on the
example data is:

data.groupby(axis=0, level=0, group_keys=False).apply(lambda df: df.describe())
P1 P2 P3
count 3.000000 3.000000 3.000000
mean 1.333333 2.333333 3.333333
std 0.305505 0.305505 0.305505
min 1.000000 2.000000 3.000000
25% 1.200000 2.200000 3.200000
50% 1.400000 2.400000 3.400000
75% 1.500000 2.500000 3.500000
max 1.600000 2.600000 3.600000
count 3.000000 3.000000 3.000000
mean 1.533333 2.533333 3.533333
std 0.305505 0.305505 0.305505
min 1.200000 2.200000 3.200000
25% 1.400000 2.400000 3.400000
50% 1.600000 2.600000 3.600000
75% 1.700000 2.700000 3.700000
max 1.800000 2.800000 3.800000
count 3.000000 3.000000 3.000000
mean 1.433333 2.433333 3.433333
std 0.305505 0.305505 0.305505
min 1.100000 2.100000 3.100000
25% 1.300000 2.300000 3.300000
50% 1.500000 2.500000 3.500000
75% 1.600000 2.600000 3.600000
max 1.700000 2.700000 3.700000

For methods returning scalars, the behaviour of implicit and explicit applying is the same in pandas=1.5:

data.groupby(axis=0, level=0).median()
P1 P2 P3
File ID
F1 1.4 2.4 3.4
F2 1.6 2.6 3.6
F3 1.5 2.5 3.5
data.groupby(axis=0, level=0).apply(lambda df: df.median())
P1 P2 P3
File ID
F1 1.4 2.4 3.4
F2 1.6 2.6 3.6
F3 1.5 2.5 3.5

Current implementation

It now raises warnings and uses mixed APIs for grouping:

SCeptre/sceptre/sceptre.py

Lines 568 to 603 in d1d4015

quant = pd.DataFrame(
    adata.X.T.copy(),
    columns=adata.obs[["File ID", "Channel"]]
    .set_index(["File ID", "Channel"])
    .index,
).replace(0, np.nan)
if method == "file_channel":
    for i in range(100):  # iterate to converge to normalized channel and file
        quant_0 = quant.copy()
        # file bias normalization
        # calculate median for each protein in each sample
        med = quant.T.reset_index().groupby("File ID").median().T
        # calculate the factors needed for a median shift
        med_tot = med.median(axis=1)
        factors = med.divide(med_tot, axis=0)
        quant = quant.groupby(axis=1, level=0).apply(
            lambda x: x.divide(factors.loc[:, x.name], axis=0)
        )
        # channel bias normalization
        # calculate median for each protein in each channel
        med = quant.T.reset_index().groupby("Channel").median().T
        # calculate the factors needed for a median shift
        med_tot = med.median(axis=1)
        factors = med.divide(med_tot, axis=0)
        quant = quant.groupby(axis=1, level=1).apply(
            lambda x: x.divide(factors.loc[:, x.name], axis=0)
        )
        # stop iterating when the change in quant to the previous iteration is below iter_thresh
        if (abs(quant - quant_0).max().max()) <= iter_thresh:
            break
    print("performed {} iterations".format(i + 1))

iter_thresh = 0.01
iter_max = 100

# for documentation: How data is accessed from anndata object:
# pd.DataFrame(
#         adata.X.T.copy(),
#         columns=adata.obs[["File ID", "Channel"]]
#         .set_index(["File ID", "Channel"])
#         .index,
#     ).replace(0, np.nan)

I replaced the anndata object directly with the transformed DataFrame:

  • the reset_index operation leads to one column being non-numeric -> throws a warning that the behaviour is going to change
  • groupby uses apply directly on instances (DataFrame), where in pandas=2.0 the keys are added repeatedly to the columns (not shown here)
quant = data.copy().T

for i in range(iter_max):  # iterate to converge to normalized channel and file
    quant_0 = quant.copy()
    # file bias normalization
    # calculate median for each protein in each sample
    med = quant.T.reset_index().groupby("File ID").median().T
    # calculate the factors needed for a median shift
    med_tot = med.median(axis=1)
    factors = med.divide(med_tot, axis=0)

    quant = quant.groupby(axis=1, level=0).apply(
        lambda x: x.divide(factors.loc[:, x.name], axis=0)
    )

    # channel bias normalization
    # calculate median for each protein in each channel
    med = quant.T.reset_index().groupby("Channel").median().T
    # calculate the factors needed for a median shift
    med_tot = med.median(axis=1)
    factors = med.divide(med_tot, axis=0)
    quant = quant.groupby(axis=1, level=1).apply(
        lambda x: x.divide(factors.loc[:, x.name], axis=0)
    )

    # stop iterating when the change in quant to the previous iteration is below iter_thresh
    if (abs(quant - quant_0).max().max()) <= iter_thresh:
        print("Max deviation: {:.2f}".format(abs(quant - quant_0).max().max()))
        break
print("performed {} iterations".format(i + 1))
quant.T
Max deviation: 0.00
performed 2 iterations


<ipython-input-10-7d4d30a6d175>:7: FutureWarning: The default value of numeric_only in DataFrameGroupBy.median is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  med = quant.T.reset_index().groupby("File ID").median().T
<ipython-input-10-7d4d30a6d175>:12: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

	>>> .groupby(..., group_keys=True)
  quant = quant.groupby(axis=1, level=0).apply(
<ipython-input-10-7d4d30a6d175>:18: FutureWarning: The default value of numeric_only in DataFrameGroupBy.median is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  med = quant.T.reset_index().groupby("Channel").median().T
<ipython-input-10-7d4d30a6d175>:22: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

	>>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

	>>> .groupby(..., group_keys=True)
  quant = quant.groupby(axis=1, level=1).apply(
(the same four FutureWarnings are emitted again for the second iteration)
P1 P2 P3
File ID Channel
F1 plex1 1.461039 2.480159 3.486717
plex2 1.500000 2.500000 3.500000
plex3 1.512605 2.507716 3.505564
F2 plex1 1.534091 2.518315 3.512545
plex2 1.500000 2.500000 3.500000
plex3 1.488971 2.492877 3.494745
F3 plex1 1.500000 2.500000 3.500000
plex2 1.500000 2.500000 3.500000
plex3 1.500000 2.500000 3.500000

Updated version

  • only uses a groupby operation (no implicit or explicit apply)
  • dividing by the factors relies on classic index alignment
  • the repetition of the code blocks is easier to see
  • hopefully easier to understand and more stable

It seems to converge faster and with fewer floating point errors.

quant = data.copy()
for i in range(iter_max):  # iterate to converge to normalized channel and file
    quant_0 = quant.copy()

    # file bias normalization
    # calculate median for each protein in each sample
    med = quant.groupby(axis=0, level=0).median()
    # calculate the factors needed for a median shift of features (e.g. proteins)
    med_tot = med.median(axis=0)
    factors = med.divide(med_tot, axis=1)
    # apply the factors to the data
    quant = quant.divide(factors)

    # channel bias normalization
    # calculate median for each protein in each channel
    med = quant.groupby(axis=0, level=1).median()
    # calculate the factors needed for a median shift of features (e.g. proteins)
    med_tot = med.median(axis=0)
    factors = med.divide(med_tot, axis=1)
    # apply the factors to the data
    quant = quant.divide(factors)

    # stop iterating when the change in quant to the previous iteration is below iter_thresh
    max_dev = abs(quant - quant_0).max().max()
    if (max_dev) <= iter_thresh:
        print(f"Max deviation: {max_dev:.2f}")
        break
    print("performed {} iterations".format(i + 1))
quant
performed 1 iterations
Max deviation: 0.00
P1 P2 P3
File ID Channel
F1 plex1 1.461039 2.480159 3.486717
plex2 1.500000 2.500000 3.500000
plex3 1.512605 2.507716 3.505564
F2 plex1 1.534091 2.518315 3.512545
plex2 1.500000 2.500000 3.500000
plex3 1.488971 2.492877 3.494745
F3 plex1 1.500000 2.500000 3.500000
plex2 1.500000 2.500000 3.500000
plex3 1.500000 2.500000 3.500000

I hope I did not miss any implementation detail. If not, I am happy to create a PR.
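
For reference, a self-contained sketch of the proposed normalization on the toy data from above, additionally avoiding the axis= keyword of groupby (deprecated since pandas 2.1) by folding the two bias corrections into one inner loop:

quant = data.copy()  # rows: (File ID, Channel) MultiIndex, columns: proteins
for i in range(iter_max):
    quant_0 = quant.copy()
    for level in ("File ID", "Channel"):  # file bias, then channel bias
        # median for each protein within each group
        med = quant.groupby(level=level).median()
        # factors needed for a median shift of each protein
        factors = med.divide(med.median(axis=0), axis=1)
        # divide each row by its group's factors
        quant = quant.divide(factors)
    # stop when the change to the previous iteration is below iter_thresh
    if abs(quant - quant_0).max().max() <= iter_thresh:
        break
print(f"performed {i + 1} iterations")

The division relies on the same level-based index alignment between the flat index of factors and the row MultiIndex of quant that the proposed version above already uses.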
