Refactor coda stuff into coda subfolder

Title

Minor issues in milopy tools

Noting down issues to fix and potential improvements to milopy, encountered while writing tutorial (theislab/pertpy-tutorials#4) - issues entirely inherited from my code ofc :)

milo.add_covariate_to_nhoods_var

change name to add_covariate_to_nhoods_obs and update docs
Check types of added columns (e.g. can go from category to object and then nhood_counts_by_cond complains

pt.pl.milo.nhood_counts_by_cond

add informative error for when covariate is missing (directing to use add_covariate_to_nhoods_var)
convert strings to categoricals internally (or add informative errors for dtype object)

milo.da_nhoods:

at the moment the use of subset_samples and subset_nhoods is buggy, in some edge cases this messes up the SpatialFDR calculation. While the bug is easy enough to catch with diagnostic plots (randomized SpatialFDR) and to fix (emdann/milopy#30), I'm wondering whether these parameters should be dropped completely in the new implementation in pertpy. The use case of subset_samples is essentially covered by specifying contrasts (+ adding a new column in adata.obs if needed to specify groups). Using subset_nhoods is somewhat of a pathological hack (I don't think I've ever used it and I can't even remember the original use case for adding this parameter). In practice if one wants to test on a specific subset of cell phenotypes the best solution is to subset cells and restart from KNN graph building.
Add informative errors for missing columns in model matrix (i.e. when the reference level is specified in contrasts)

Happy to open a PR to implement these changes.

ATAC-Seq support

Would be great to have support for Perturb-ATAC in pertpy.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116249 has the ATAC data. Not sure where we get the information on the perturbations from?

The first row of this data are the cells (remove the suffixes) and the chromosomes are the var. Rest are counts.

Will come up with a plan to work with this.

Add scperturb datasets

Write dataloaders for all of

https://zenodo.org/record/7278143

that we are not covering yet.

Source: http://projects.sanderlab.org/scperturb/

Add one dataloader as a PR. Ensure that it has proper docstrings. We have to encourage people to cite the original scperturb paper. The code for the datasets is here -> https://github.com/theislab/pertpy/blob/development/pertpy/data/_datasets.py
Add all other dataloaders

AttributeError: module 'pertpy.tools' has no attribute 'Milopy'

Report

I installed the latest version of pertpy from GitHub pull but getting an error at the step below:

> milo = pt.tl.Milopy()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[47], line 1
----> 1 milo = pt.tl.Milopy()

AttributeError: module 'pertpy.tools' has no attribute 'Milopy'

However this works-
> milo = pt.tl.Milo()

The tutorial needs to be updated.

Version information

No response

Think about a cleaner way to structure augurpy

Currently folders in tl and pl. Not totally happy about it and the resulting docs.

I shall think about it. Likely has to be a class.

guide RNA assignment

We received the request to add support for guide RNA assignment.

Read up on literature -> post links here
Get datasets that we can test our implementations on and understand them
Suggest a couple of easy solutions here and discuss them
Consider implementing more complex models

Error with multiple covariates using pertypy.tl.Sccoda()

Report

I am trying to using code from the "Compositional Analysis" tutorial from Single Cell Best Practices to analyze compositional differences in cell types along two dimensions (two covariates). When I run sccoda_model.load() using my two covariates ('age' and 'region', see below):

sccoda_model = pt.tl.Sccoda() sccoda_data = sccoda_model.load( adata, type="cell_level", generate_sample_level=True, cell_type_identifier="cell_type", sample_identifier="replicate", covariate_obs=["age","region"], )

I get the following error:

"Covariate region has non-unique values! Skipping...". However, the categories in "region" are unique. The same results when I run the above with "region" as the sole covariate, but not when I run with "age" as the sole covariate. Both are categorical types in pandas series. Any ideas on the source of this issue would be greatly appreciated, thanks!

Version information

No response

Link to documentation is outdated

The link to the usage in readme doesn't work. If this issue is not closed in one of the PRs, I suggest changing it from
https://pertpy.readthedocs.io/en/latest/usage.html
to
https://pertpy.readthedocs.io/en/latest/usage/usage.html

Remove darglint

Just like I did for ehrapy

Determine baseline with mixscape

To be able to validate our approach and to compare against mixscape we need to run the complete mixscape vignette on our data (and ideally their data if available).

Create a Conda environment/Docker container for the complete mixscape vignette
Run the complete mixscape vignette on their data if available
Run the complete mixscape vignette on our own data

Benchmarking speed of all tools

We should benchmark the speed of all tools to highlight that our versions are faster and scale better.

Let's use our cluster and keep the runtime environment consistent -> always the same pertpy version (please start with the current main branch because I made lots of updates...) and the same number of CPUs/RAM on our cluster.

The tutorials can be found here: https://pertpy.readthedocs.io/en/latest/tutorials/index.html . The respective tutorials of the tools define the pipeline that we want to benchmark.

We always use the datasets that are defined in the pertpy dataloaders. They should also always exist for the R versions and the other versions. Always collect the runtime.

Create a barplot of the runtimes
This should be super reproducible -> add all scripts and notebooks to https://github.com/theislab/pertpy-reproducibility/ as pull requests

Implement scikit-learn-intelex for random forest and logistic regression

The intelex patch could make the random forest and logistic regression computations several times faster.

For reference: https://github.com/intel/scikit-learn-intelex

Example implementation:
https://github.com/theislab/ehrapy/blob/development/ehrapy/preprocessing/_data_imputation.py#L217

Enable development builds for documentation

python:
    version: 3.8
    install:
        - method: pip
          path: .
        - requirements: docs/requirements.txt

in .readthedocs.yml

mixscape barplot broken

Following the mixscape tutorial

pt.pl.ms.barplot(mdata['rna'])

results in

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/bp_perturbation_modeling/lib/python3.8/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~/miniconda3/envs/bp_perturbation_modeling/lib/python3.8/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/miniconda3/envs/bp_perturbation_modeling/lib/python3.8/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'NT'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/home/zeth/PycharmProjects/extended-single-cell-best-practices/jupyter-book/conditions/perturbation_modeling.ipynb Cell 139 in <cell line: 1>()
----> [1](vscode-notebook-cell:/home/zeth/PycharmProjects/extended-single-cell-best-practices/jupyter-book/conditions/perturbation_modeling.ipynb#ch0000137?line=0) pt.pl.ms.barplot(mdata['rna'])

File ~/miniconda3/envs/bp_perturbation_modeling/lib/python3.8/site-packages/pertpy/plot/_mixscape.py:71, in MixscapePlot.barplot(adata, control, mixscape_class_global, axis_text_x_size, axis_text_y_size, axis_title_size, strip_text_size, panel_spacing_x, panel_spacing_y, legend_title_size, legend_text_size, show, save)
     69 if mixscape_class_global not in adata.obs:
     70     raise ValueError("Please run `pt.tl.mixscape` first.")
---> 71 count = pd.crosstab(index=adata.obs[mixscape_class_global], columns=adata.obs[control])
     72 all_cells_percentage = pd.melt(count / count.sum(), ignore_index=False).reset_index()
     73 KO_cells_percentage = all_cells_percentage[all_cells_percentage[mixscape_class_global] == "KO"]

File ~/miniconda3/envs/bp_perturbation_modeling/lib/python3.8/site-packages/pandas/core/frame.py:3505, in DataFrame.__getitem__(self, key)
   3503 if self.columns.nlevels > 1:
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):
   3507     indexer = [indexer]

File ~/miniconda3/envs/bp_perturbation_modeling/lib/python3.8/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'NT'

differential Response detection

Algorithms to identify the degree of responders/not responder

Approach 1 :

reimplement augor in python (both for binary (ctrl vs treatment and multiple perturbations, or regression see here https://www.nature.com/articles/s41596-021-00561-x): https://github.com/neurorestore/Augur
Augor works on three problems/settings as explained here:
RNA velocity as input instead of gene expression (rn scvelo and then augor, see box 1 in the link)
Binary, multiclass and continuous labels (see box 2)
Differential prioritization: run auger multiple times and then report differential auc difference

compare to: mixscape perturbed vs not-perturbed states

Approach 2

simple t-test thresholding as explored in augor paper (one vs all t-test and rank according to of n_degs passing certain threshold (LFC and p-values) and then rank

Thinks about how to visualize:

How to visualize :

a plot to visualizing differential responses by ranking them similar to fig1 here or see fig2 here

Add Distances

Distances to implement

E-distance
MMD
Wasserstein distance
Pseudobulk distance (distances b/w centroids)

Tests to implement

E-test [[ general distance-based permutation test ]]

Implementation plan

Rename Augurpy to Augur

Also in tutorials + book

`calculate_multifactor_PMD()` adds `int` keys to `.obs` leading to error on `adata.write_h5ad()`

Report

pt.tl.Dialogue().calculate_multifactor_PMD() adds MCPs to .obs with int keys instead of str leading to error on adata.write_h5ad():

Traceback (most recent call last):
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/utils.py", line 214, in func_wrapper
    return func(elem, key, val, *args, **kwargs)
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/specs/registry.py", line 175, in write_elem
    _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/specs/registry.py", line 24, in wrapper
    result = func(g, k, *args, **kwargs)
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/specs/methods.py", line 499, in write_dataframe
    col_names = [check_key(c) for c in df.columns]
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/specs/methods.py", line 499, in <listcomp>
    col_names = [check_key(c) for c in df.columns]
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/utils.py", line 109, in check_key
    raise TypeError(f"{key} of type {typ} is an invalid key. Should be str.")
TypeError: 0 of type <class 'int'> is an invalid key. Should be str.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/eliaskahl/workspace/github.com/theislab-gobi-sc-dialogue/dialogue-experiments-elias/mre-mcp-str-keys.py", line 19, in <module>
    adata.write_h5ad("data/dialogue-example_pmd.h5ad")
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1918, in write_h5ad
    _write_h5ad(
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/h5ad.py", line 98, in write_h5ad
    write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
  File "/Users/eliaskahl/miniconda3/envs/pertpy-apple-silicon/lib/python3.10/site-packages/anndata/_io/utils.py", line 220, in func_wrapper
    raise type(e)(
TypeError: 0 of type <class 'int'> is an invalid key. Should be str.

Minimal reproducible example:

import pertpy as pt
import scanpy as sc

import session_info
session_info.show(html=False, dependencies=True)

adata = pt.dt.dialogue_example()
sc.pp.pca(adata)

dl = pt.tl.Dialogue()
adata, mcps, ws, ct_subs = dl.calculate_multifactor_PMD(
    adata,
    sample_id="clinical.status",
    celltype_key="cell.subtypes",
    n_counts_key="nCount_RNA",
    normalize=True,
)

adata.write_h5ad("data/dialogue-example_pmd.h5ad")

Version information

pertpy 0.4.0
scanpy 1.9.2
session_info 1.0.0

PIL 9.4.0
PyQt5 NA
adjustText NA
anndata 0.8.0
arrow 1.2.3
arviz 0.14.0
asttokens NA
attr 22.1.0
backcall 0.2.0
certifi 2022.12.07
cffi 1.15.1
cftime 1.6.2
charset_normalizer 3.0.1
comm 0.1.2
custom_inherit 2.4.1
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
decorator 5.1.1
decoupler 1.3.3
defusedxml 0.7.1
entrypoints 0.4
ete3 3.1.2
executing 0.8.3
fontTools 4.38.0
google NA
h5py 3.8.0
idna 3.4
igraph 0.10.4
ipykernel 6.19.2
ipython_genutils 0.2.0
ipywidgets 7.6.5
jax 0.4.4
jaxlib 0.4.4
jedi 0.18.1
jinja2 3.1.2
joblib 1.2.0
kiwisolver 1.4.4
leidenalg 0.9.1
llvmlite 0.39.1
markupsafe 2.1.1
matplotlib 3.7.0
mizani 0.8.1
mpl_toolkits NA
mudata 0.2.1
multipledispatch 0.6.0
muon 0.1.3
natsort 8.2.0
netCDF4 1.6.2
numba 0.56.4
numpy 1.23.5
numpyro 0.10.1
opt_einsum v3.3.0
packaging 22.0
palettable 3.3.0
pandas 1.5.3
parso 0.8.3
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 2.5.2
plotnine 0.10.1
ply 3.11
png 0.20220715.0
prompt_toolkit 3.0.36
psutil 5.9.0
ptyprocess 0.7.0
pure_eval 0.2.2
pycparser 2.21
pygments 2.11.2
pynndescent 0.5.8
pyomo 6.5.0
pyparsing 3.0.9
pypi_latest 0.1.2
pytz 2022.7
questionary 1.10.0
reportlab 3.6.12
requests 2.28.2
rich NA
rpy2 3.5.8
scipy 1.10.1
seaborn 0.11.2
setuptools 65.6.3
sip NA
six 1.16.0
sklearn 1.2.1
skmisc 0.1.4
sparsecca 0.3.0
stack_data 0.2.0
statannotations 0.5.0
statsmodels 0.13.5
switchlang 0.1.0
texttable 1.6.7
threadpoolctl 3.1.0
tornado 6.2
toyplot 1.0.3
toytree 2.0.1
tqdm 4.64.1
traitlets 5.7.1
typing_extensions NA
tzlocal NA
umap 0.5.3
urllib3 1.26.14
wcwidth 0.2.5
xarray 2023.2.0
xarray_einstats 0.5.1
zmq 23.2.0
zoneinfo NA

IPython 8.10.0
jupyter_client 7.4.9
jupyter_core 5.2.0
notebook 6.5.2

Python 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:26:08) [Clang 14.0.6 ]
macOS-13.1-arm64-arm-64bit

Session information updated at 2023-03-01 13:34

clearer dataset requirements for mixscape

Description of feature

Hello! As a python/scanpy user, I am happy to see that you are working on a scverse impelementation of mixscape. I have attempted to follow along with this notebook : https://github.com/theislab/pertpy-tutorials/blob/main/mixscape.ipynb using a third-party dataset (this. I have been using pertpy 0.3

I have had difficulty, however, because the names of some columns/features are occasionally required to have a certain value. For example, pt.tl.Mixscape.pert_sign apparently requires that the supplied AnnData object has a column in its .obs table of the format <gene_target>g<#> that has the same name as the nontargeting control. As far as I can tell, this is not documented.

Similarly, in pt.tl.Mixscape.lda value of the labels keyword argument must be 'gene_target' for proper function, since that is hard-coded elsewhere in the function.

These are just the functions that I have been able to make work. pt.pl.ms.barplot fails to work, and I cannot decipher the traceback to understand how to reconfigure my AnnData.obs table properly. I also cannot find how to make pt.pl.ms.lda work properly, since some array is getting reshaped during function runtime in ways that I cannot understand.

If this pipeline is working as designed, I would greatly appreciate a tool or guide to make sure that my AnnData.obs table has all required columns formatted properly.

high-level analysis

The goal of this module should be around answering questions about high-level biological changes and enrichment analysis. For example which terms will appear after a certain perturbation?

@ilibarra what would be ideas to implement here?

Integrate Jax scGen

https://github.com/theislab/jaxpert/tree/main/jaxscgen/jaxscgen

Will need to figure out how to make the dependencies optional. Jax is not really installable on Windows.

Implement MCP methods

A reimplementation of https://github.com/livnatje/DIALOGUE could be awesome.

Then we could also talk to @davidsebfischer et al. to see how far their MCP method is.

Datasets

We have multiple datasets that we need to prepare:

multimodal single-cell CRISPR screens:

https://nature.com/articles/s41588-021-00779-1
https://www.nature.com/articles/s41588-021-00778-2 (mixscape)
Christians Bocks dataset (we already have it)
Hanifta large scale cite-se datasets with covid 19 (Chris has the data, large-scale CITE-seq, maybe goo to show scale)

Uni Modal KO datasets #17

Norman et al. https://science.sciencemag.org/content/365/6455/786.abstract (we have)

Normal scRNA-seq:

continues covariates to test auger ( https://science.sciencemag.org/content/367/6473/45, we have the data)
cross-species dataset with continuous-time but only perturbation (https://www.nature.com/articles/s41586-018-0657-2, we have the data)

RNA velocity perturbation analysis

dataset which was analyzed augor protocol paper

Integrate Milopy

We want to integrate https://github.com/emdann/milopy into pertpy with an updated version

Tasks (roughly in order) are:

Mixscape data type error

Report

Hello!

When I'm trying to reproduce the tutorial with a different dataset, I receive the following error in the function Mixscape.mixscape.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[70], line 1
----> 1 mixscape_identifier.mixscape(adata = adata, control = 'control', labels='perturbation', layer='X_pert')

File ~/miniconda3/envs/pertpy/lib/python3.9/site-packages/pertpy/tools/_mixscape.py:244, in Mixscape.mixscape(self, adata, labels, control, new_class_name, min_de_genes, layer, logfc_threshold, iter_num, split_by, pval_cutoff, perturbation_type, copy)
    240 vec = np.mean(X[guide_cells][:, de_genes_indices], axis=0) - np.mean(
    241     X[nt_cells][:, de_genes_indices], axis=0
    242 )
    243 # project cells onto the perturbation vector
--> 244 pvec = np.sum(np.multiply(dat.toarray(), vec), axis=1) / np.sum(np.multiply(vec, vec))
    245 pvec = pd.Series(np.asarray(pvec).flatten(), index=list(all_cells.index[all_cells]))
    246 if n_iter == 0:

AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Nevertheless, it doesn't happen when running with the dataset of the tutorial. I think that the problem is that Mixscape.pert_sign does not create .layers['X_pert'] as a sparse matrix, but as a normal numpy array.

Thanks!

Version information

No response

Add ehrapy documentation structure

Likely pertpy-tutorials as a submodule with the new docstyle that we're using at ehrapy.

Redesign tests

As discussed, we should not use test data is input for the tests, but rather design artificial AnnData objects with clear distribution shifts that we can test for.

Redesign and reimplement Augurpy tests
Redesign and reimplement mixscape tests

Add statsmodels support for milopy

Our current milo implementation uses edgeR from R to make a few calculations here: https://github.com/theislab/pertpy/blob/development/pertpy/tools/_milo.py#L310 . However, we want pertpy to be a pure Python experience. Hence, we want to offer an alternative that does NOT use edgeR but instead uses statsmodels to do a very similar calculations.

Figure out what exactly edgeR does here.
Implement these steps with statsmodels.
Compare the results with the original edgeR implementation. If we're not crazily off submit a PR and add it in another if-case

Improve dataloader descriptions

https://pertpy.readthedocs.io/en/latest/usage/usage.html#data should not just be scperturb

Install fails on M1 macs

Describe the bug

Unable to install using pip on M1 mac (Tessa and @stefanpeidli)
To Reproduce

mamba create -n pertpy_env python=3.9 conda activate pertpy_env pip install pertpy

(pertpy_env) stefanpeidli@Stefans-MBP ~ % pip install pertpy Collecting pertpy Using cached pertpy-0.3.0-py3-none-any.whl (82 kB) Collecting muon>=0.1.2 Using cached muon-0.1.2-py3-none-any.whl (287 kB) Collecting pypi-latest>=0.1.1 Using cached pypi_latest-0.1.2-py3-none-any.whl (10 kB) Collecting scipy<2.0.0,>=1.9.3 Downloading scipy-1.10.0-cp39-cp39-macosx_12_0_arm64.whl (28.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━28.9/28.9 MB 12.3 MB/s eta 0:00:00
Collecting PyYAML>=5.4.1
Downloading PyYAML-6.0-cp39-cp39-macosx_11_0_arm64.whl (173 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
174.0/174.0 kB 8.3 MB/s eta 0:00:00
Collecting arviz<0.15.0,>=0.14.0
Using cached arviz-0.14.0-py3-none-any.whl (1.7 MB)
Collecting ipywidgets>=7.6.5
Downloading ipywidgets-8.0.4-py3-none-any.whl (137 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 137.8/137.8 kB 8.5 MB/s eta 0:00:00
Collecting scanpy>=1.8.1
Downloading scanpy-1.9.1-py3-none-any.whl (2.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 12.1 MB/s eta 0:00:00 Collecting pyqt5<6.0.0,>=5.15.7 Using cached PyQt5-5.15.9.tar.gz (3.2 MB) Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... error error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [29 lines of output]
Traceback (most recent call last):
File "/Users/stefanpeidli/miniconda3/envs/pertpy_env/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in prepare_metadata_for_build_wheel
hook = backend.prepare_metadata_for_build_wheel
AttributeError: module 'sipbuild.api' has no attribute 'prepare_metadata_for_build_wheel'

  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/Users/stefanpeidli/miniconda3/envs/pertpy_env/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 351, in <module>
      main()
    File "/Users/stefanpeidli/miniconda3/envs/pertpy_env/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 333, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/Users/stefanpeidli/miniconda3/envs/pertpy_env/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 148, in prepare_metadata_for_build_wheel
      whl_basename = backend.build_wheel(metadata_directory, config_settings)
    File "/private/var/folders/fr/z238tpm931s466144gcdpdj40000gn/T/pip-build-env-hvo51gpt/overlay/lib/python3.9/site-packages/sipbuild/api.py", line 46, in build_wheel
      project = AbstractProject.bootstrap('wheel',
    File "/private/var/folders/fr/z238tpm931s466144gcdpdj40000gn/T/pip-build-env-hvo51gpt/overlay/lib/python3.9/site-packages/sipbuild/abstract_project.py", line 87, in bootstrap
      project.setup(pyproject, tool, tool_description)
    File "/private/var/folders/fr/z238tpm931s466144gcdpdj40000gn/T/pip-build-env-hvo51gpt/overlay/lib/python3.9/site-packages/sipbuild/project.py", line 585, in setup
      self.apply_user_defaults(tool)
    File "project.py", line 68, in apply_user_defaults
      super().apply_user_defaults(tool)
    File "/private/var/folders/fr/z238tpm931s466144gcdpdj40000gn/T/pip-build-env-hvo51gpt/overlay/lib/python3.9/site-packages/pyqtbuild/project.py", line 70, in apply_user_defaults
      super().apply_user_defaults(tool)
    File "/private/var/folders/fr/z238tpm931s466144gcdpdj40000gn/T/pip-build-env-hvo51gpt/overlay/lib/python3.9/site-packages/sipbuild/project.py", line 236, in apply_user_defaults
      self.builder.apply_user_defaults(tool)
    File "/private/var/folders/fr/z238tpm931s466144gcdpdj40000gn/T/pip-build-env-hvo51gpt/overlay/lib/python3.9/site-packages/pyqtbuild/builder.py", line 69, in apply_user_defaults
      raise PyProjectOptionException('qmake',
  sipbuild.pyproject.PyProjectOptionException
  [end of output]

` note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.`

Expected behavior

Installs and loads

System [please complete the following information]:

OS: Mac OS Ventura 13.01 [Tessa had the same problem in 13.02]
Language Version: Python 3.9 [Tessa had same in python 3.10]
Virtual environment: miniconda (stefan), miniforge (tessa)

Additional context

brew install of Qt5 seems like it might be necessary, but we're still having trouble with it freezing on the license step for pyqt5 install even after that

Perturbation metadata annotation

We should provide tooling for a "perturbation dictionary" to allow more metadata to be easily fetched and added to perturbations. To facilitate this we query a couple of databases to annotate things. The actual API and design needs to be discussed.

cell line metadata (mutation status, tissue type, etc.) - exists as an excel sheet from some public resource
perturbation metadata (MoA) fetched from clue.io + drugbank (had to apply for 1-yr data access here, kinda jank, need to discuss)
perturbation metadata (chemical) fetched from pubchempy
cell line x perturbation interactions from GDSC

Things that we have on the table but never wrote code for:

synergy scores from AZ-DREAM

Add dataloaders for CPA datasets

https://github.com/theislab/cpa/tree/main/docs/tutorials

Useful for the book

Not compatible with latest Python 3.10+

Report

Is it possible to install with version compatible with latest python? Otherwise it's hard to work with updated pacakges like scvi-tools that require python >3.10.

> pip install pertpy
ERROR: Ignored the following versions that require a different python version: 0.1.0 Requires-Python >=3.8.0,<3.10; 0.2.0 Requires-Python >=3.8.0,<3.10; 0.3.0 Requires-Python >=3.8.0,<3.10
ERROR: Could not find a version that satisfies the requirement pertpy (from versions: none)
ERROR: No matching distribution found for pertpy

Version information

No response

Import problem with poetry install

Report

Bug is occurring on server and on my laptop (M1). Both use conda. Python version 3.10.6.

I create a new environment following the directions in the contributor guide. poetry install doesn't install pyomo, so I get an error when running poetry run pertpy from that, then conda install pyomo, run poetry install again, which seems like it works, and then when I try to actually import and use the package I get:

``

Installed version 0.4.0 of pertpy is newer than the latest release 0.3.0! You are running a nightly version and features may break! ryp2 is not installed. Install with pip install rpy2 to run tools with R support. To use sccoda or tasccoda please install ete3 with pip install ete3 Traceback (most recent call last): File "<string>", line 1, in <module> File "/n/groups/marks/software/anaconda_o2/envs/pertpy-dev/lib/python3.10/importlib/__init__.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "<frozen importlib._bootstrap>", line 1050, in _gcd_import File "<frozen importlib._bootstrap>", line 1027, in _find_and_load File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked ModuleNotFoundError: No module named 'pertpy.__main__'

I tried installing from github via pip, which will load into python but won't let me run nox without erroring (" The Poetry configuration is invalid" error when running tests or attempting make poetry install again

Version information

No response

Mixed effect analysis and drug dose response

It would be great to have support in pertpy for complex mixed effect models and drug dose response prediction. Some things are in CPA, but I'd like to see CPA being published first before we consider integration a fast JAX based implementation.

Other options:

Linear model that was in CPA at some point?
Other simple linear models?

-> @moinfar please talk to @MO-L about this. He had an idea here once. As soon as we know exactly what this is supposed to look like we'll discuss it in this issue and plan the implementation.

Cell-type composition analysis

compostion analysis

We need to have a function like perp.composition(adata, method, cell_type:...). I have already talked to Beni, it seems scCODA already has good wrappers for statistical methods (some in R but they have wrapper for them) and has already 12 test statistics implemented

We need to include them in here and for some of them reimplement (fastest ones).

For now, we can use the ones that are already there to reimplement stuff we have to talk to Johannes who maintains them.

Augurpy: AttributeError: 'numpy.ndarray' object has no attribute 'power'

<Processing data... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

RemoteTraceback                          Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
    r = call_item()
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call
    return self.fn(self.args, **self.kwargs)
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/parallel_backends.py", line 595, in _call
    return self.func(args, kwargs)
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/parallel.py", line 262, in call
    return [func(*args, kwargs)
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(args, *kwargs)
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/pertpy/tools/_augurpy.py", line 318, in cross_validate_subsample
    subsample = self.draw_subsample(
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/pertpy/tools/_augurpy.py", line 281, in draw_subsample
    return self.sample(
  File "/home/niklas/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/pertpy/tools/_augurpy.py", line 234, in sample
    subsample.var["var"] = np.ravel(subsample.X.power(2).mean(axis=0) - np.power(subsample.X.mean(axis=0), 2))
AttributeError: 'numpy.ndarray' object has no attribute 'power'
"""

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 ag_rfc.predict(loaded_data1,
      2                n_subsamples=20,  # number of random subsamples 
      3                subsample_size=10, # number of cells to subsample 
      4                folds=3,
      5                min_cells=200)

File ~/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/pertpy/tools/_augurpy.py:703, in Augurpy.predict(self, adata, n_subsamples, subsample_size, folds, min_cells, feature_perc, var_quantile, span, filter_negative_residuals, n_threads, augur_mode, select_variance_features, key_added, random_state, zero_division)
    698     print(
    699         f"[bold red]Skipping {cell_type} cell type - the number of samples for at least one class type is less than "
    700         f"subsample size {subsample_size}."
    701     )
    702 else:
--> 703     results[cell_type] = Parallel(n_jobs=n_threads)(
    704         delayed(self.cross_validate_subsample)(
    705             adata=cell_type_subsample,
    706             augur_mode=augur_mode,
    707             subsample_size=subsample_size,
    708             folds=folds,
    709             feature_perc=feature_perc,
    710             subsample_idx=i,
    711             random_state=random_state,
    712             zero_division=zero_division,
    713         )
    714         for i in range(n_subsamples)
    715     )
    716     # summarize scores for cell type
    717     results["summary_metrics"][cell_type] = self.average_metrics(results[cell_type])

File ~/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/parallel.py:1056, in Parallel.call(self, iterable)
   1053     self._iterating = False
   1055 with self._backend.retrieval_context():
-> 1056     self.retrieve()
   1057 # Make sure that we get a last message telling us we are done
   1058 elapsed_time = time.time() - self._start_time

File ~/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/parallel.py:935, in Parallel.retrieve(self)
    933 try:
    934     if getattr(self._backend, 'supports_timeout', False):
--> 935         self._output.extend(job.get(timeout=self.timeout))
    936     else:
    937         self._output.extend(job.get())

File ~/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/site-packages/joblib/_parallel_backends.py:542, in LokyBackend.wrap_future_result(future, timeout)
    539 """Wrapper for Future.result to implement the same behaviour as
    540 AsyncResults.get from multiprocessing."""
    541 try:
--> 542     return future.result(timeout=timeout)
    543 except CfTimeoutError as e:
    544     raise TimeoutError from e

File ~/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/concurrent/futures/_base.py:444, in Future.result(self, timeout)
    442     raise CancelledError()
    443 elif self._state == FINISHED:
--> 444     return self.__get_result()
    445 else:
    446     raise TimeoutError()

File ~/miniconda3/envs/niche_fibrosis_env_pertpy/lib/python3.8/concurrent/futures/_base.py:389, in Future.__get_result(self)
    387 if self._exception:
    388     try:
--> 389         raise self._exception
    390     finally:
    391         # Break a reference cycle with the exception in self._exception
    392         self = None

AttributeError: 'numpy.ndarray' object has no attribute 'power'
>

Details to reproduce on MM.

CC @niklaslang

TypeError: arguments did not match any overloaded call and QImage(variant: Any): too many arguments

Report

I'm running the composition analysis per the tutorial and encountering an error at the tree generation step as below:

> tasccoda_model = pt.tl.Tasccoda()
> tasccoda_data = tasccoda_model.load(
    adata,
    type="cell_level",
    cell_type_identifier="nsbm_level_1",
    sample_identifier=batch_key,
    covariate_obs=["genotype"],
    levels_orig=["nsbm_level_4", "nsbm_level_3", "nsbm_level_2", "nsbm_level_1"],
    add_level_name=True,
)
> tasccoda_data

MuData object with n_obs × n_vars = 23039 × 18297
  var:	'n_cells'
  2 modalities
    rna:	23034 x 18227
      obs:	'genotype', 'replicate', 'batch', 'scDblFinder_class', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'size_factors', 'leiden', 'leiden_res0_25', 'leiden_res0_5', 'leiden_res1', 'S_score', 'G2M_score', 'phase', 'manual_celltype_annotation', 'gt_rep', 'scCODA_sample_id', 'nsbm_level_0', 'nsbm_level_1', 'nsbm_level_2', 'nsbm_level_3', 'nsbm_level_4', 'nsbm_level_5', 'nsbm_level_6'
      var:	'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_deviant', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
      uns:	'genotype_colors', 'gt_rep_colors', 'hvg', 'leiden', 'leiden_res0_25_colors', 'leiden_res0_5_colors', 'leiden_res1_colors', 'manual_celltype_annotation_colors', 'neighbors', 'pca', 'phase_colors', 'replicate_colors', 'tsne', 'umap', 'schist', 'nsbm_level_1_colors', 'nsbm_level_2_colors'
      obsm:	'X_pca', 'X_tsne', 'X_umap', 'CM_nsbm_level_0', 'CM_nsbm_level_1', 'CM_nsbm_level_2', 'CM_nsbm_level_3', 'CM_nsbm_level_4', 'CM_nsbm_level_5', 'CM_nsbm_level_6'
      varm:	'PCs'
      layers:	'PFlog1pPF_normalization', 'counts', 'log1pPF_normalization', 'raw_counts', 'scTransform_normalization', 'scran_normalization', 'soupX_counts'
      obsp:	'connectivities', 'distances'
    coda:	5 x 70
      obs:	'genotype', 'gt_rep'
      var:	'n_cells'
      uns:	'tree'

> pt.pl.coda.draw_tree(tasccoda_data)

File ~/opt/anaconda3/envs/scanpy/lib/python3.10/site-packages/ete3/coretype/tree.py:1392, in TreeNode.render(self, file_name, layout, w, h, tree_style, units, dpi)
   1388     return drawer.get_img(self, w=w, h=h,
   1389                           layout=layout, tree_style=tree_style,
   1390                           units=units, dpi=dpi, return_format=file_name)
   1391 else:
-> 1392     return drawer.render_tree(self, file_name, w=w, h=h,
   1393                             layout=layout, tree_style=tree_style,
   1394                               units=units, dpi=dpi)

File ~/opt/anaconda3/envs/scanpy/lib/python3.10/site-packages/ete3/treeview/drawer.py:111, in render_tree(t, imgName, w, h, layout, tree_style, header, units, dpi)
    109 scene.addItem(scene.master_item)
    110 if imgName.startswith("%%inline"):
--> 111     imgmap = save(scene, imgName, w=w, h=h, units=units, dpi=dpi)
    112 else:
    113     x_scale, y_scale = save(scene, imgName, w=w, h=h, units=units, dpi=dpi)

File ~/opt/anaconda3/envs/scanpy/lib/python3.10/site-packages/ete3/treeview/main.py:752, in save(scene, imgName, w, h, dpi, take_region, units)
    750 else:
    751     targetRect = QRectF(0, 0, w, h)
--> 752     ii= QImage(w, h, QImage.Format_ARGB32)
    753     ii.fill(QColor(Qt.white).rgb())
    754     ii.setDotsPerMeterX(dpi / 0.0254) # Convert inches to meters

TypeError: arguments did not match any overloaded call:
  QImage(): too many arguments
  QImage(size: QSize, format: QImage.Format): argument 1 has unexpected type 'float'
  QImage(width: int, height: int, format: QImage.Format): argument 1 has unexpected type 'float'
  QImage(data: bytes, width: int, height: int, format: QImage.Format): argument 1 has unexpected type 'float'
  QImage(data: PyQt5.sip.voidptr, width: int, height: int, format: QImage.Format): argument 1 has unexpected type 'float'
  QImage(data: bytes, width: int, height: int, bytesPerLine: int, format: QImage.Format): argument 1 has unexpected type 'float'
  QImage(data: PyQt5.sip.voidptr, width: int, height: int, bytesPerLine: int, format: QImage.Format): argument 1 has unexpected type 'float'
  QImage(xpm: List[str]): argument 1 has unexpected type 'float'
  QImage(fileName: str, format: typing.Optional[str] = None): argument 1 has unexpected type 'float'
  QImage(a0: QImage): argument 1 has unexpected type 'float'
  QImage(variant: Any): too many arguments

Can you please guide me how to fix this?

Version information

altair 4.2.2
anndata 0.8.0
matplotlib 3.7.0
mudata 0.2.1
numpy 1.23.5
pandas 1.5.3
pertpy 0.4.0
scanpy 1.9.2
schist v0.7.15
seaborn 0.12.2
session_info 1.0.0
tensorflow 2.11.0

PIL 9.4.0
PyQt5 NA
aa8f2297d25b4dc6fd3d98411eb3ba53823c4f42 NA
absl NA
adjustText NA
anyio NA
appnope 0.1.3
arrow 1.2.3
arviz 0.14.0
asttokens NA
astunparse 1.6.3
attr 22.2.0
babel 2.12.1
backcall 0.2.0
beta_ufunc NA
binom_ufunc NA
cairo 1.23.0
certifi 2022.12.07
cffi 1.15.1
cftime 1.6.2
charset_normalizer 3.0.1
colorama 0.4.6
comm 0.1.2
custom_inherit 2.4.1
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 5.1.1
decoupler 1.3.4
defusedxml 0.7.1
dot_parser NA
entrypoints 0.4
ete3 3.1.2
etils 1.0.0
executing 1.2.0
fastjsonschema NA
flatbuffers 23.1.21
fontTools 4.38.0
fqdn NA
fsspec 2023.1.0
gast NA
google NA
graph_tool 2.46 (commit , )
h5py 3.7.0
hypergeom_ufunc NA
idna 3.4
igraph 0.10.4
ipykernel 6.21.2
ipython_genutils 0.2.0
ipywidgets 8.0.4
isoduration NA
jax 0.4.4
jaxlib 0.4.4
jedi 0.18.2
jinja2 3.1.2
joblib 1.2.0
json5 NA
jsonpointer 2.3
jsonschema 4.17.3
jupyter_server 1.23.6
jupyterlab_server 2.19.0
keras 2.11.0
kiwisolver 1.4.4
leidenalg 0.9.1
llvmlite 0.39.1
louvain 0.8.0
markupsafe 2.1.2
matplotlib_inline 0.1.6
mizani 0.8.1
mpl_toolkits NA
multipledispatch 0.6.0
muon 0.1.3
natsort 8.3.0
nbformat 5.7.3
nbinom_ufunc NA
ncf_ufunc NA
netCDF4 1.6.2
numba 0.56.4
numpyro 0.10.1
opt_einsum v3.3.0
packaging 23.0
palettable 3.3.0
parso 0.8.3
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.0.0
plotnine 0.10.1
ply 3.11
png 0.20220715.0
prometheus_client NA
prompt_toolkit 3.0.38
psutil 5.9.4
ptyprocess 0.7.0
pure_eval 0.2.2
pvectorc NA
pycparser 2.21
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pydot 1.4.2
pygments 2.14.0
pynndescent 0.5.8
pyomo 6.5.0
pyparsing 3.0.9
pypi_latest 0.1.2
pyrsistent NA
pytz 2022.7.1
pytz_deprecation_shim NA
questionary 1.10.0
reportlab 3.6.12
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rich NA
rpy2 3.5.9
scipy 1.9.3
send2trash NA
setuptools 65.6.3
sip NA
six 1.16.0
sklearn 1.2.1
skmisc 0.1.4
sniffio 1.3.0
sparsecca 0.3.0
stack_data 0.6.2
statannotations 0.5.0
statsmodels 0.13.5
switchlang 0.1.0
tensorboard 2.11.2
termcolor NA
terminado 0.17.1
texttable 1.6.7
threadpoolctl 3.1.0
toolz 0.12.0
tornado 6.2
toyplot 1.0.3
toytree 2.0.1
tqdm 4.64.1
traitlets 5.9.0
typing_extensions NA
tzlocal NA
umap 0.5.3
uri_template NA
urllib3 1.26.14
wcwidth 0.2.6
webcolors 1.11.1
websocket 1.5.1
wrapt 1.15.0
xarray 2023.2.0
xarray_einstats 0.5.1
yaml 6.0
zmq 25.0.0
zoneinfo NA
zstandard 0.19.0

IPython 8.11.0
jupyter_client 7.4.1
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.2

Python 3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ]
macOS-10.16-x86_64-i386-64bit

Session information updated at 2023-03-01 21:14

Add support for cell line bulk gene expression

https://depmap.org/portal/download/all/?releasename=DepMap+Public+22Q2&filename=CCLE_expression.csv

Instead of one-hot encoding cell lines we could throw in the whole bulk expression as input into classification models. It might be required to PCA those though.

Maybe also offer PCAed versions of those in the first place.

ete3 warning masks other ImportErrors

Describe the bug

To Reproduce

Steps to reproduce the behavior:

Install pyqt5, ete3 and pertpy
Assert import ete3 works fine
import pertpy as pt -> shows warning for ete3

Actual behaviour
Warning appears:

To use sccoda or tasccoda please install ete3 with `pip install ete3`

Expected behavior
No warning when ete3 available.

System [please complete the following information]:

OS: macOS 13.1, Apple M1 Max
Language Version: Python 3.9

Virtual environment: micromamba with:

# environment.yaml
name: pertpy-apple-silicon
channels:
  - conda-forge
dependencies:
  - python=3.9
  - rpy2

brew install qt5
# make sure to add all the environment variables to your .zshrc
# as described at the end of the output of the brew install command
# then reload your shell:
zsh
create environment with mamba (conda should work as well, but I haven't tested it)
mamba create -f environment.yaml -n pertpy-apple-silicon
mamba activate pertpy-apple-silicon
pip install pyqt5 --config-settings --confirm-license= --verbose
pip install ete3
pip install git+https://github.com/theislab/pertpy.git

Additional context

Bug in pertpy/tools/init.py#L10
There are faulty import paths getting masked by the catching of the ImportError for the ete3 warning.
from pertpy.tools.coda._sccoda import Sccoda works fine.

Follow up on #201

Further Augurpy improvements

Data not normalized. Normalizing now using scanpy log1p normalize. not the best message
Add comparison result table to the documentation
Feature importances could also be saved in .uns

Add tascCODA 2.0

Add new implementation of tascCODA together with an associated new tutorial.

perturbation space

The idea is the central approach to first visual analysis, similar to mixescape where they perform 20 nn analysis and subtract the mean of top n non-perturbed analysis and then visualize cells.

I would suggest a more sophisticated method compared to PCA representation here :

Approach 1: fast embedding with:

Kernel-PCA (non-linear)
ICA (non-linear)
PCA (linear, mixscape)
and repeat same analysis as mixscape and use this for dewnstream analysis. here we can benchmark to mixscape.

in addition to that we can ->
We can come up with a asbtract vector for each of those embeddings for each pertrubation conditioned on cell-types not cell type meaning: for each pertrubation in the latent space we average representation for each pertrbation and plot a latent pertrbation space.

Approach 2:

Option 1 : use a super light version of scGen which takes less than 5 mins to run. @Koncopd and then benchmark against approach 2, I think we can have it there any way

Option 2: very light CVAE with embeddings as conditions

option 3: very light CPA

all options in approach 2 would allow to have latent space representation per cell, embeddings for cell-types, pertrubations.

Look into Raghavan et al

Description of feature

pertpy/scanpy related note (not sure what to do with it): in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8822455/ under scBasal and scClassical programs, they use a score (like from sc.tl.score_genes ) and then infer the markers most representative of that score using plain correlations and FDR. Is there something like this already and if not, can we trivially add it to something? I think it’s super useful and also a much better way of identifying gene programs.

I shall look into this

Reversed labele

Thank you for opening the script!
I used the following jupyter-notebook as a reference to filter my data at hand and enter the following command:
https://github.com/theislab/pertpy-tutorials/blob/main/mixscape.ipynb
mixscape_identifier = pt.tl.Mixscape()
mixscape_identifier.pert_sign(adata, 'perturbation', 'NT')
mixscape_identifier.mixscape(adata = adata, control = 'NT', labels='grna')
I got the mislabeling result that KO labels were reversed in some gRNAs (see RPL11, LDB1).

Since RACK1 is labeled correctly, I do not believe it is an alphabetical issue.
I am wondering if this result is due to the labeling of KO on the higher cell number of gRNA-transfected cells.
Could you please tell us why this is happening?

Preprocessing module

The ECCITE-seq data consists of transcriptomics and surface protein data. Both types of data require quality control and preprocessing to ensure that the data is ready for downstream tasks.

- QC and normalization of transcriptomics data:
We don't need to do much just similar prep from Scanpy. Basically, we just need to ensure that our example notebook has a quick run down of scRNA-seq QC and normalization.

- QC and normalization of protein data:

QC is still open, but @Zethson is looking into it. Maybe adapt from https://scanpy-tutorials.readthedocs.io/en/latest/cite-seq/pbmc5k.html
CLR (centered log-ratio transformation) similar to mixescape
LogNormalize also in mixscape
Relative counts also in miix scape (https://satijalab.org/seurat/reference/normalizedata)
Advanced method: Denoised and Scaled by Background (DSB) the method is in R but works with seurat object (https://www.biorxiv.org/content/10.1101/2020.02.24.963603v1.full.pdf) and code here (https://github.com/niaid/dsb)

Better support for DE in scverse

Continuation of a discussion in theislab/single-cell-best-practices#141

Goal

Implementation of useful helper methods for common DE tools. We don't plan to reimplement several DE methods. Other people are trying this already.

Tools

We could implement wrapper functions that make it streamlined to run DE tools within the scverse ecosystem. The output of the tools should be in the right slots of the AnnData object.

Most important

wrap basic methods from scanpy (wilcoxon, t-test, ...)
wrap pydesq2, see also owkin/PyDESeq2#15

Other options

Disclaimer: while the rpy2 code snippets should work in principle, I haven't used them extensively and am not sure if they follow the best practices of the respective tool.

edgeR (through rpy2) [code snippet]
MAST (through rpy2) [code snippet] -> should additionally support LME models.
Limma (through rpy2)
statsmodels linear model wrapper [code snippet]

Input specification

I envisage something like

def de_analysis(
    adata: AnnData,
    groupby: str,
    method: Literal["t-test", "wilcoxon", "pydeseq2"],
    *,
    formula: Optional[str],
    contrast: Optional[Any],
    inplace: bool = True,
    key_addeed: Optional[str] = None,
) -> pd.DataFrame: 
    """
    Perform differential expression analysis
    
    Parameters
    ----------
    adata
        single-cell or pseudobulk AnnData object
    groupby
        column in adata.obs that contains the factor to test, e.g. `treatment`. 
        For simple statistical tests (t-test, wilcoxon), it is sufficient to specify groupby. Linear models require to specify
        a formula. In that case, the `groupby` column is used to compute the contrast. 
    method
        Which method to use to perform the DE test
    formula
        model specification for linear models. E.g. `~ treatment + sex + age`. 
        MUST contain the factor specified in `groupby`.         
    contrast
        TODO
        See e.g. https://www.statsmodels.org/devel/contrasts.html for more information.
    inplace
        if True, save the result in `adata.varm[key_added]`
    key_added
        Key under which the result is saved in `adata.varm` if inplace is True.
        If set to None this defaults to `de_{method}_{groupby}`. 
    """

Clearly, the challenge here is to find an interface that works with both simple methods like t-test and the more flexible linear models. It will also always be a tradeoff between simplicity and the possibility to use all features of the underlying methods (e.g. do we want to support arbitrary contrasts? Do we want to support multiple contrasts for a single model fit?).
Some of these cases could require splitting the function up into a fit and a compute_contrast step. In that case an OOP interface might be more appropriate.

Output specification

Each DE tool should output a data frame with at least the following columns

gene_id
log2 fold change
mean expression
unadjusted p-value
adjusted p-value

and optionally other columns (may depend on the method).

Plots

Add volcano plots from decoupler/wrap them (plot_volcano)

Note: an (optionally) interactive version based on plotly/altair could be really nice -- hover over the dots to view the gene name.
Add paired or unpaired scatter+boxplots are nice ways of visualizing individual genes (each dot is a pseudobulk-sample)

[code snipped based on seaborn]
Add bar charts. Scatterplot on top for paired observations. Each dot represents the fold change for one paired observation.
[code snipped based on altair]
heatmap of coefficients with FDR correction

[code snippet based on altair]

Support for genetics data

A highly requested feature was support for genetics data. Currently there are already toolkits for genetics data out there: namely https://github.com/pystatgen/sgkit and https://github.com/hail-is/hail

We don't want to reimplement any of the genetics data tools. However, we want to enable bridges to their ecosystem by providing functions that convert the data structures from one into another.

The tasks are roughly:

Explore hail and sgkit. How do they work? How do their data structures work?
Find some example single-cell genetics dataset. Can we reasonably work with it in AnnData + scverse tools? What would it look like?
Write AnnData -> sgkit
Write sgkit -> AnnData
Write AnnData -> hail
Write hail -> AnnData
Write a notebook that loads some genetics data (can also add a pertpy dataloader) into an AnnData object -> converts to sgkit/hail -> does something -> converts back to AnnData. Add this notebook to https://github.com/theislab/pertpy-tutorials/

theislab / pertpy Goto Github PK

pertpy's Introduction

pertpy

Documentation

Installation

pertpy's People

Contributors

Stargazers

Watchers

Forkers

pertpy's Issues

Report

Version information

Report

Version information

Algorithms to identify the degree of responders/not responder

Approach 1 :

Approach 2

Report

Version information

pertpy 0.4.0 scanpy 1.9.2 session_info 1.0.0

IPython 8.10.0 jupyter_client 7.4.9 jupyter_core 5.2.0 notebook 6.5.2

Python 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:26:08) [Clang 14.0.6 ] macOS-13.1-arm64-arm-64bit

Description of feature

multimodal single-cell CRISPR screens:

Uni Modal KO datasets #17

Normal scRNA-seq:

RNA velocity perturbation analysis

Report

Version information

Report

Version information

Report

Version information

compostion analysis

Report

Version information

altair 4.2.2 anndata 0.8.0 matplotlib 3.7.0 mudata 0.2.1 numpy 1.23.5 pandas 1.5.3 pertpy 0.4.0 scanpy 1.9.2 schist v0.7.15 seaborn 0.12.2 session_info 1.0.0 tensorflow 2.11.0

IPython 8.11.0 jupyter_client 7.4.1 jupyter_core 5.2.0 jupyterlab 3.6.1 notebook 6.5.2

Python 3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ] macOS-10.16-x86_64-i386-64bit

Approach 1: fast embedding with:

Approach 2:

Description of feature

Goal

Tools

Input specification

Output specification

Plots

Recommend Projects

Recommend Topics

Recommend Org

pertpy 0.4.0
scanpy 1.9.2
session_info 1.0.0

IPython 8.10.0
jupyter_client 7.4.9
jupyter_core 5.2.0
notebook 6.5.2

Python 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:26:08) [Clang 14.0.6 ]
macOS-13.1-arm64-arm-64bit

altair 4.2.2
anndata 0.8.0
matplotlib 3.7.0
mudata 0.2.1
numpy 1.23.5
pandas 1.5.3
pertpy 0.4.0
scanpy 1.9.2
schist v0.7.15
seaborn 0.12.2
session_info 1.0.0
tensorflow 2.11.0

IPython 8.11.0
jupyter_client 7.4.1
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.2

Python 3.10.9 (main, Jan 11 2023, 09:18:20) [Clang 14.0.6 ]
macOS-10.16-x86_64-i386-64bit