calico / scnym


Semi-supervised adversarial neural networks for classification of single cell transcriptomics data

Home Page: https://scnym.research.calicolabs.com

License: Apache License 2.0



scNym - Semi-supervised adversarial neural networks for single cell classification

scNym is a neural network model for predicting cell types from single cell profiling data (e.g. scRNA-seq) and deriving cell type representations from these models. While cell type classification is the main use case, these models can map single cell profiles to arbitrary output classes (e.g. experimental conditions).

We've described scNym in detail in a recent paper in Genome Research.
Please cite our work if you find this tool helpful.
We also have a research website that describes scNym in brief -- https://scnym.research.calicolabs.com

Semi-supervised adversarial neural networks for single cell classification.
Jacob C. Kimmel and David R. Kelley.
Genome Research. 2021. doi: https://doi.org/10.1101/gr.268581.120

BibTeX

@article{kimmel_scnym_2021,
	title = {Semi-supervised adversarial neural networks for single-cell classification},
	issn = {1088-9051, 1549-5469},
	url = {https://genome.cshlp.org/content/early/2021/02/24/gr.268581.120},
	doi = {10.1101/gr.268581.120},
	pages = {gr.268581.120},
	journaltitle = {Genome Research},
	shortjournal = {Genome Res.},
	author = {Kimmel, Jacob C. and Kelley, David R.},
	urldate = {2021-02-26},
	date = {2021-02-24},
	langid = {english},
	pmid = {33627475}
}

If you have any questions, please feel free to email me.

Jacob C. Kimmel
[email protected]
Calico Life Sciences, LLC

Model

The scNym model is a neural network leveraging modern best practices in architecture design. Gene expression vectors are transformed by non-linear functions at each layer in the network. Each of these functions has parameters that are learned from data.

scNym uses the MixMatch semi-supervision framework (Berthelot et al. 2019) and domain adversarial training to take advantage of both labeled training data and unlabeled target data to learn these parameters. Given a labeled dataset X and an unlabeled dataset U, scNym uses the model to guess "pseudolabels" for each unlabeled observation. All observations are then augmented using the "MixUp" weighted averaging method prior to computing losses.
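
The two steps described above, pseudolabel sharpening and MixUp, can be sketched in plain Python. This is an illustration, not code from the scNym source: the function names and toy vectors are hypothetical, and the `max(lam, 1 - lam)` step follows the MixMatch paper and is assumed here. The values `T=0.5` and `alpha=0.3` mirror the defaults listed under "Advanced configuration options".

```python
import random


def sharpen(probs, T=0.5):
    """Temperature-sharpen a probability vector: p_i^(1/T), renormalized."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def mixup(x_a, y_a, x_b, y_b, alpha=0.3, rng=random):
    """Weighted average of two (profile, label-vector) pairs."""
    lam = rng.betavariate(alpha, alpha)
    # assumption from the MixMatch paper: keep the mix closer to the first example
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# toy example: one labeled cell and one unlabeled cell, 3 genes, 2 classes
labeled_x, labeled_y = [1.0, 0.0, 2.0], [1.0, 0.0]
unlabeled_x = [0.5, 1.5, 0.0]
model_guess = [0.6, 0.4]            # stand-in for the classifier's softmax output
pseudolabel = sharpen(model_guess)  # sharpened guess used as the training target

mixed_x, mixed_y, lam = mixup(
    labeled_x, labeled_y, unlabeled_x, pseudolabel, rng=random.Random(0)
)
```

Sharpening pushes the guessed distribution toward its most likely class, and the mixed label vector still sums to one because both inputs do.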

We also introduce a domain adversarial network (Ganin et al. 2016) that predicts the domain of origin (e.g. {target, train} or {method_A, method_B, method_C}) for each observation. We invert the adversary's gradients during backpropagation so the model learns to "compete" with the adversary by adapting across domains. Model parameters are then trained to minimize a supervised cross-entropy loss applied to the labeled examples, an unsupervised mean squared error loss applied to the unlabeled examples, and a classification loss across domains for the domain adversary.
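
As a rough sketch of how the three loss terms and the gradient inversion fit together, the toy code below uses plain Python lists in place of tensors. It is not the scNym implementation: the class and function names are hypothetical, and the default weights are only assumed to correspond to `unsup_max_weight` and `dan_max_weight` from the configuration listing.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, target))


def mean_squared_error(probs, target):
    """Mean squared error between two probability vectors."""
    return sum((p - t) ** 2 for p, t in zip(probs, target)) / len(probs)


class GradientReversal:
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    def __init__(self, weight=1.0):
        self.weight = weight

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.weight * g for g in grad_output]


def total_loss(sup, unsup, domain, unsup_weight=1.0, dan_weight=0.1):
    """Weighted sum of the supervised, unsupervised, and adversarial terms."""
    return sup + unsup_weight * unsup + dan_weight * domain


sup = cross_entropy([0.8, 0.2], [1.0, 0.0])          # labeled example
unsup = mean_squared_error([0.6, 0.4], [0.7, 0.3])   # unlabeled example vs. pseudolabel
dom = cross_entropy([0.5, 0.5], [1.0, 0.0])          # adversary's domain prediction
loss = total_loss(sup, unsup, dom)

grl = GradientReversal(weight=1.0)
embedding = [0.2, -1.0]
reversed_grad = grl.backward([0.5, -0.25])  # gradient the classifier would receive
```

Because the reversal layer leaves the forward pass untouched, the adversary sees the real embedding, while the classifier is updated in the direction that makes the adversary's task harder.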

Tutorials

The best way to become acquainted with scNym is to walk through one of our interactive tutorials. We've prepared tutorials using Google Colab so that all computation can be performed using free GPUs. You can even analyze data on your cell phone!

Semi-supervised cell type classification using cell atlas references

This tutorial demonstrates how to train a semi-supervised scNym model using a pre-prepared cell atlas as a training data set and a new data set as the target. You can upload your own data through Google Drive to classify cell types in a new experiment.

Transferring labels from a cell atlas

Transferring annotations across technologies in human PBMCs

This tutorial shows how to use scNym to transfer annotations across experiments using different sequencing technologies. We use human peripheral blood mononuclear cell profiles generated with different versions of the 10x Genomics chemistry for demonstration.

Cross-technology annotation transfer

Installation

First, clone the repository:

We recommend creating a virtual environment for use with scNym. This is easily accomplished using virtualenv or conda. We recommend using python=3.8 for scNym, as some of our dependencies don't currently support the newest Python versions.

# Option 1: virtualenv
$ python3 -m venv scnym_env # python3 is python3.8
$ source scnym_env/bin/activate

# Option 2: conda
$ conda create -n scnym_env -c conda-forge python=3.8
$ conda activate scnym_env

Once the environment is set up, simply run:

$ cd scnym
$ pip install -e ./

After installation completes, you should be able to import scnym in Python and run scNym as a command line tool:

$ python -c "import scnym; print(scnym.__file__)"
$ scnym --help

Usage

Data Preprocessing

Data inputs for scNym should be log(CPM + 1) normalized counts, where CPM is Counts Per Million and log is the natural logarithm. This transformation is crucial if you would like to use any of our pre-trained model weights, provided in the tutorials above.
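
To make the transformation concrete, here is a minimal pure-Python sketch of log(CPM + 1) on a small count table. The function name and the toy counts are illustrative only. On an AnnData object, the usual scanpy equivalent is `scanpy.pp.normalize_total(adata, target_sum=1e6)` followed by `scanpy.pp.log1p(adata)`.

```python
import math


def log_cpm(counts):
    """Return ln(CPM + 1) for a [cells][genes] table of raw counts."""
    normalized = []
    for cell in counts:
        depth = sum(cell)
        if depth == 0:
            # an empty cell stays all zeros rather than dividing by zero
            normalized.append([0.0 for _ in cell])
            continue
        scale = 1e6 / depth  # rescale each cell to one million total counts
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


raw = [[10, 0, 90], [1, 1, 2]]
norm = log_cpm(raw)
```

Zero counts remain zero after the transformation because log(0 + 1) = 0.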

For the recommended Python API interface, data should be formatted as an AnnData object with normalized counts in the main .X observations attribute.

For the command line tool, data can be stored as a dense [Cells, Genes] CSV of normalized counts, an AnnData h5ad object, or a Loompy loom object.

Python API

We recommend users take advantage of our Python API for scNym, suitable for use in scripts and Jupyter notebooks. The API follows the scanpy functional style and has a single end-point for training and prediction.

To begin with the Python API, load your training and test data into anndata.AnnData objects using scanpy.

Training

Training an scNym model using the Python API is simple. We provide an example below.

from scnym.api import scnym_api

scnym_api(
    adata=adata,
    task='train',
    groupby='cell_ontology_class',
    out_path='./scnym_output',
    config='no_new_identity',
)

The groupby keyword specifies a column in adata.obs containing annotations to use for model training. This API supports semi-supervised adversarial training using a special token in the annotation column. Any cell with the annotation "Unlabeled" will be treated as part of the target dataset and used for semi-supervised and adversarial training.

We also provide two predefined configurations for model training.

no_new_identity -- This configuration assumes every cell in the target set belongs to one of the classes in the training set. This assumption improves performance, but can lead to erroneously high confidence scores if new cell types are present in the target data.
new_identity_discovery -- This configuration is useful for experiments where new cell type discoveries may occur. It uses pseudolabel thresholding to avoid the assumption above. If new cell types are present in the target data, they correctly receive low confidence scores.

Prediction

from scnym.api import scnym_api

scnym_api(
    adata=adata,
    task='predict',
    key_added='scNym',
    trained_model='./scnym_output',
    out_path='./scnym_output',
    config='no_new_identity',
)

The prediction task adds a key to adata.obs that contains the scNym annotation predictions, as well as the associated confidence scores. The key is defined by key_added and the confidence scores are stored as adata.obs[key_added + '_confidence'].

The prediction task also extracts the activations of the penultimate scNym layer as an embedding and stores the result in adata.obsm["X_scnym"].
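
One way to use the confidence scores is to flag low-confidence calls for manual review. The sketch below uses plain lists as stand-ins for `adata.obs[key_added]` and `adata.obs[key_added + '_confidence']`. The function name, the 0.5 threshold, and the "Unassigned" label are illustrative assumptions, not scNym defaults.

```python
def flag_low_confidence(labels, confidences, threshold=0.5, unknown="Unassigned"):
    """Replace predictions whose confidence falls below `threshold`."""
    return [
        label if conf >= threshold else unknown
        for label, conf in zip(labels, confidences)
    ]


# stand-ins for the prediction and confidence columns in `adata.obs`
predicted = ["T cell", "B cell", "T cell", "monocyte"]
confidence = [0.98, 0.41, 0.77, 0.30]

flagged = flag_low_confidence(predicted, confidence)
n_uncertain = flagged.count("Unassigned")
```

An appropriate threshold depends on the dataset and on which training configuration was used, so it is worth inspecting the confidence distribution before choosing one.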

Interpretation

scNym models can be interpreted using the expected gradients technique to estimate Shapley values (Erion et al. 2020). Briefly, expected gradient estimation computes the gradient on the predicted output class score with respect to an input vector, where the input vector is a random interpolation between an observation x and some reference vector x'.
Intuitively, we are using gradients on the input vector to highlight important genes that influence class predictions. We can then rank the importance of various genes using the resulting expected gradient value.
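
The estimate described above can be sketched on a toy model. This is a plain-Python illustration under stated assumptions, not the `scnym_interpret` implementation: the function names are hypothetical, the "class score" is a hand-written linear function with a known gradient, and each reference is visited once with a single random interpolation weight.

```python
import random


def expected_gradients(x, references, grad_fn, rng=random):
    """Monte Carlo estimate of expected gradients for one observation.

    For each reference x', draw alpha ~ U(0, 1), evaluate the gradient at the
    interpolated point x' + alpha * (x - x'), and scale it by (x - x').
    """
    n_features = len(x)
    totals = [0.0] * n_features
    for ref in references:
        alpha = rng.random()
        point = [r + alpha * (xi - r) for xi, r in zip(x, ref)]
        grad = grad_fn(point)
        for i in range(n_features):
            totals[i] += (x[i] - ref[i]) * grad[i]
    return [t / len(references) for t in totals]


# toy "class score": a linear function of three genes
weights = [2.0, 0.0, -1.0]


def score(v):
    return sum(w * vi for w, vi in zip(weights, v))


def score_grad(v):
    # the gradient of a linear score does not depend on the input
    return list(weights)


cell = [3.0, 1.0, 2.0]
reference_cells = [[1.0, 1.0, 0.0], [0.0, 2.0, 1.0]]
attributions = expected_gradients(
    cell, reference_cells, score_grad, rng=random.Random(0)
)
# rank genes by the magnitude of their attribution
ranking = sorted(range(len(cell)), key=lambda i: abs(attributions[i]), reverse=True)
```

For this linear toy model the attributions sum exactly to the difference between the cell's score and the mean reference score. For a neural network the same property holds only approximately and improves with more samples.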

Computing expected gradients in scNym is accomplished with the scnym_interpret API endpoint.

from scnym.api import scnym_interpret

expected_gradients = scnym_interpret(
    adata=adata,
    groupby="cell_type",
    target="target_cell_type",
    source="all", # use all data except target cells as a reference    
    trained_model=PATH_TO_TRAINED_MODEL,
    config=CONFIG_USED_FOR_TRAINING,
)

# `expected_gradients["saliency"]` is a pandas.Series ranking genes by their mean
# expected gradient across cells
# `expected_gradients["gradients"]` is a pd.DataFrame [Cells, Features] table of expected
# gradient estimates for each feature in each `target` cell

Training and predicting with Cell Atlas References

We also provide a set of preprocessed cell atlas references for human, mouse, and rat, as well as pretrained weights for each.

It's easy to use the scNym API to transfer labels from these atlases to your own data.

Semi-supervised training with cell atlas references

The best way to transfer labels is by training an scNym model using your data as the target dataset for semi-supervised learning. Below, we demonstrate how to train a model on a cell atlas with your data as the target.

We provide access to cell atlases for the mouse and rat through the scNym API, but we encourage users to thoughtfully consider which training data are most appropriate for their experiments.

import anndata
from scnym.api import scnym_api, atlas2target

# load your data
adata = anndata.read_h5ad(path_to_your_data)

# first, we create a single object with both the cell
# atlas and your data
# `atlas2target` will take care of passing annotations
joint_adata = atlas2target(
    adata=adata,
    species='mouse',
    key_added='annotations',
)

# now train an scNym model as above
scnym_api(
    adata=joint_adata,
    task='train',
    groupby='annotations',
    out_path='./scnym_output',
    config='new_identity_discovery',
)

Multi-domain training

By default, scNym treats training cells as one domain, and target cells as another. scNym also offers the ability to integrate across multiple training and target domains through the domain adversary. This feature can be enabled by providing domain labels for each training cell in the AnnData object and passing the name of the relevant annotation column to scNym.

# load multiple training and target datasets
# ...
# set unique labels for each domain
training_adata_00.obs['domain_label'] = 'train_0'
training_adata_01.obs['domain_label'] = 'train_1'

target_adata_00.obs['domain_label'] = 'target_0'
target_adata_01.obs['domain_label'] = 'target_1'

# set target annotations to "Unlabeled"
target_adata_00.obs['annotations'] = 'Unlabeled'
target_adata_01.obs['annotations'] = 'Unlabeled'

# concatenate 
adata = training_adata_00.concatenate(
    training_adata_01,
    target_adata_00,
    target_adata_01,
)

# provide the `domain_groupby` argument to `scnym_api`
scnym_api(
    adata=adata,
    task='train',
    groupby='annotations',
    domain_groupby='domain_label',
    out_path='./scnym_output',
    config='no_new_identity',
)

Advanced configuration options

We provide two configurations for scNym model training, as noted above. However, users may wish to experiment with different configuration options for new applications of scNym models.

To experiment with custom configuration options, users can simply copy one of the pre-defined configurations and modify as desired. All pre-defined configurations are stored as Python dictionaries in scnym.api.CONFIGS.

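import copy
import scnym

# copy the predefined configuration so the shared entry in
# `scnym.api.CONFIGS` is not modified in place
config = copy.deepcopy(scnym.api.CONFIGS["no_new_identity"])
# increase the number of training epochs
config["n_epochs"] = 500
# increase the weight of the domain adversary 0.1 -> 0.3
config["ssl_kwargs"]["dan_max_weight"] = 0.3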

# descriptions of all parameters and their default values
"default": {
    "n_epochs": 100, # number of training epochs
    "patience": 40, # number of epochs to wait before early stopping
    "lr": 1.0, # learning rate
    "optimizer_name": "adadelta", # optimizer
    "weight_decay": 1e-4, # weight decay for the optimizer
    "batch_size": 256, # minibatch size
    "mixup_alpha": 0.3, # shape parameter for MixUp: lambda ~ Beta(alpha, alpha)
    "unsup_max_weight": 1.0, # maximum weight for the MixMatch loss
    "unsup_mean_teacher": False, # use a mean teacher for MixMatch pseudolabeling
    "ssl_method": "mixmatch", # semi-supervised learning method to use
    "ssl_kwargs": {
        "augment_pseudolabels": False, # perform augmentations before pseudolabeling
        "augment": "log1p_drop", # augmentation to use if `augment_pseudolabels`
        "unsup_criterion": "mse", # criterion fxn for MixMatch loss
        "n_augmentations": 1, # number of augmentations per observation
        "T": 0.5, # temperature scaling parameter
        "ramp_epochs": 100, # number of epochs to ramp up the MixMatch loss
        "burn_in_epochs": 0, # number of epochs to wait before ramping MixMatch
        "dan_criterion": True, # use a domain adversary
        "dan_ramp_epochs": 20, # ramp epochs for the adversarial loss
        "dan_max_weight": 0.1, # max weight for the adversarial loss
        "min_epochs": 20, # minimum epochs to train before saving a best model
    },
    "model_kwargs": {
        "n_hidden": 256, # number of hidden units per hidden layer
        "n_layers": 2, # number of hidden layers
        "init_dropout": 0.0, # dropout on the initial layer
        "residual": False, # use residual layers
    },
    # save logs for tensorboard. enables nice visualizations, but can slow down 
    # training if filesystem I/O is limiting.
    "tensorboard": False,
}
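
The `ramp_epochs`, `burn_in_epochs`, and `*_max_weight` parameters together describe a loss weight that grows over training. The sketch below assumes a simple linear ramp. The function name and the linear shape are assumptions for illustration, and the schedule scNym actually uses may differ.

```python
def ramp_weight(epoch, max_weight, ramp_epochs, burn_in_epochs=0):
    """Linear ramp from 0 to `max_weight`, starting after `burn_in_epochs`."""
    if epoch < burn_in_epochs:
        return 0.0
    if ramp_epochs <= 0:
        return max_weight
    progress = (epoch - burn_in_epochs) / ramp_epochs
    return max_weight * min(1.0, progress)


# MixMatch loss weight with the defaults above: ramp over 100 epochs to 1.0
unsup = [ramp_weight(e, max_weight=1.0, ramp_epochs=100) for e in (0, 50, 100, 150)]
# adversarial loss weight with the defaults above: ramp over 20 epochs to 0.1
dan = [ramp_weight(e, max_weight=0.1, ramp_epochs=20) for e in (0, 10, 20, 99)]
```

Under this assumption, raising `dan_max_weight` from 0.1 to 0.3 changes only the plateau of the adversarial weight, not how quickly it is reached.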

CLI

Models can also be trained using the included command line interface, scnym. The CLI allows for more detailed model configuration, but should only be used for experimentation.

The CLI accepts configuration files in YAML or JSON formats, with parameters carrying the same names as command line arguments.

To see a list of command line arguments/configuration parameters, run:

$ scnym -h

A sample configuration is included as default_config.txt.

Demo Script

A CLI demo shell script is provided that downloads data from the Tabula Muris and trains an scNym model.

To execute the script, run:

chmod +x demo_script.sh
source demo_script.sh

in the repository directory.

Processed Data

We provide processed data we used to evaluate scNym in the common AnnData format.


scnym's Issues

segmentation fault after training finished

I'm training in the train_tissue_independent mode. Training is successful. But after Final Eval Acc is reported, a segmentation fault is thrown:

Let me know if you would like more information.

scNym Will Not Install

The scNym package (and tutorials) are broken as it attempts to install leidenalg==0.8.0 from cache and fails. Please see the below error message. Please fix the scNym package requirements.

Collecting scnym
Using cached scnym-0.3.2-py2.py3-none-any.whl (68 kB)
Collecting anndata==0.7.4 (from scnym)
Using cached anndata-0.7.4-py3-none-any.whl (118 kB)
Collecting ConfigArgParse==1.1 (from scnym)
Using cached ConfigArgParse-1.1.tar.gz (41 kB)
Preparing metadata (setup.py) ... done
Collecting h5py==2.10.0 (from scnym)
Using cached h5py-2.10.0.tar.gz (301 kB)
Preparing metadata (setup.py) ... done
Collecting leidenalg==0.8.0 (from scnym)
Using cached leidenalg-0.8.0.tar.gz (4.1 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (pyproject.toml) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Annotating Seurat clusters

Hi,

I have a question about using scNym with my Seurat analysis. I have several samples corresponding to different treatments. I have followed the Seurat workflow including integrating the datasets and dimension reduction followed by clustering. I now have a few clusters from Seurat. I would like to annotate these clusters. Can scNym accomplish this?

I ask because I'm going through the README and it seems that scNym only takes the count matrix as an input, without any a priori clustering or dimension reduction, let alone integration, and attempts to annotate each individual cell? In other words, it seems an scNym workflow foregoes the standard workflow past preprocessing and does the clustering itself, via the automatic annotation.

If I've understood this correctly, is it still possible nonetheless to incorporate scNym into my current framework? I understand I need to convert my Seurat object to a scanpy format in order to even attempt it.

Finally, I was also wondering what kind of performance I might be able to expect from scNym on cell lines where the cells are not expected to really differentiate? In particular, I am seeking cluster annotation tools because the manual annotation by comparing markers is proving too difficult.

Thanks.

RuntimeError: NaN loss encountered in training

Hi, I have been trying to use scNym for annotating a target dataset but training terminates with a Runtime error message "NaN loss encountered in training". I am using the colab notebook as a guide with my own datasets for test and train. I tried running the training a couple of times on a local GPU node and the training was interrupted at epochs 12 and 18 respectively, with the above message.

Any help is appreciated!

Command:

scnym_api(
    adata=adata,
    task='train',
    groupby='ref_annotations',
    out_path='./scnym_outputs',
    config='no_new_identity',
)

CUDA compute device found.
23428 unlabeled observations found.
Using unlabeled data as a target set for semi-supervised, adversarial training.

training examples:  (242783, 1651)
target   examples:  (23428, 1651)
X:  (242783, 1651)
y:  (242783,)
Not weighting classes and not balancing classes.
Found 2 unique domains.
Using MixMatch for semi-supervised learning
Scaling ICL over 100 epochs, 0 epochs for burn in.
Scaling ICL over 20 epochs, 0 epochs for burn in.
Using a Domain Adaptation Loss.
Training...
Epoch 12/99|---___________________________|

Traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/Users/nagendraKU/atlas_integ_preprocess/scnym_labeltransfer.ipynb Cell 14' in <cell line: 1>()
----> 1 scnym_api(
      2     adata=adata,
      3     task='train',
      4     groupby='ref_annotations',
      5     out_path='./scnym_outputs',
      6     config='no_new_identity',
      7 )

File /home/projects/xxx_condaenv/scnym/lib/python3.8/site-packages/scnym/api.py:339, in scnym_api(adata, task, groupby, domain_groupby, out_path, trained_model, config, key_added, copy)
    336         msg = f'{groupby} is not a variable in `adata.obs`'
    337         raise ValueError(msg)
--> 339     scnym_train(
    340         adata=adata,
    341         config=config,
    342     )
    343 else:
    344     # check that a pre-trained model was specified or 
    345     # provided for prediction
    346     if trained_model is None:

File /home/projects/xxx_condaenv/scnym/lib/python3.8/site-packages/scnym/api.py:514, in scnym_train(adata, config)
    511         msg = f'{pretrained} file not found.'
    512         raise FileNotFoundError(msg)
--> 514 acc, loss = main.fit_model(
    515     X=X,
    516     y=y,
    517     traintest_idx=traintest_idx,
    518     val_idx=val_idx,
    519     batch_size=config['batch_size'],
    520     n_epochs=config['n_epochs'],
    521     lr=config['lr'],
    522     optimizer_name=config['optimizer_name'],
    523     weight_decay=config['weight_decay'],
    524     ModelClass=model.CellTypeCLF,
    525     out_path=config['out_path'],
    526     mixup_alpha=config['mixup_alpha'],
    527     unlabeled_counts=unlabeled_counts,
    528     input_domain=input_domain,
    529     unlabeled_domain=unlabeled_domain,
    530     unsup_max_weight=config['unsup_max_weight'],
    531     unsup_mean_teacher=config['unsup_mean_teacher'],
    532     ssl_method=config['ssl_method'],
    533     ssl_kwargs=config['ssl_kwargs'],
    534     pretrained=pretrained,
    535     patience=config.get('patience', None),
    536     save_freq=config.get('save_freq', None),
    537     tensorboard=config.get('tensorboard', False),
    538     **config['model_kwargs'],
    539 )
    541 # add the final model results to `adata`
    542 results = {
    543     'model_path': osp.realpath(osp.join(config['out_path'], '00_best_model_weights.pkl')),
    544     'final_acc': acc,
   (...)
    552     'val_idx': val_idx,
    553 }

File /home/projects/xxx_condaenv/scnym/lib/python3.8/site-packages/scnym/main.py:519, in fit_model(X, y, traintest_idx, val_idx, batch_size, n_epochs, lr, optimizer_name, weight_decay, ModelClass, out_path, n_genes, mixup_alpha, unlabeled_counts, unsup_max_weight, unsup_mean_teacher, ssl_method, ssl_kwargs, weighted_classes, balanced_classes, input_domain, unlabeled_domain, pretrained, patience, save_freq, tensorboard, **kwargs)
    509     T = SemiSupervisedTrainer(
    510         unsup_dataloader=unsup_dataloader,
    511         unsup_criterion=USL,
   (...)
    515         **trainer_kwargs,
    516     )
    518 print('Training...')
--> 519 T.train()
    520 print('Training complete.')
    521 print()

File /home/projects/xxx_condaenv/scnym/lib/python3.8/site-packages/scnym/trainer.py:452, in Trainer.train(self)
    449 print(msg, end=end_char)
    451 # training epoch
--> 452 self.train_epoch()
    453 # evaluate model
    454 self.val_epoch()

File /home/projects/xxx_condaenv/scnym/lib/python3.8/site-packages/scnym/trainer.py:637, in SemiSupervisedTrainer.train_epoch(self)
    635     print('total loss: ', loss.item())
    636 if np.isnan(loss.data.cpu().numpy()):
--> 637     raise RuntimeError('NaN loss encountered in training')
    639 # compute gradients in a backward pass, update parameters
    640 loss.backward()

RuntimeError: NaN loss encountered in training

scnym not working; Reason: image not found

I followed the installation instructions and when I run

scnym --help

I get the following:

(scnym_env) user-comp:scnym user$ scnym --help
Traceback (most recent call last):
  File "/Users/user/anaconda3/envs/scnym_env/bin/scnym", line 11, in <module>
    load_entry_point('scnym==0.1', 'console_scripts', 'scnym')()
  File "/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/pkg_resources/__init__.py", line 490, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2854, in load_entry_point
    return ep.load()
  File "/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2445, in load
    return self.resolve()
  File "/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2451, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/Users/user/anaconda3/scnym/scnym/__init__.py", line 8, in <module>
    from . import dataprep, model, predict, trainer, utils
  File "/Users/user/anaconda3/scnym/scnym/dataprep.py", line 1, in <module>
    import torch
  File "/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/torch/__init__.py", line 79, in <module>
    from torch._C import *
ImportError: dlopen(/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/torch/_C.cpython-37m-darwin.so, 9): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/torch/lib/libshm.dylib
  Reason: image not found

I get a similar error when I open a jupyter notebook and try to import scnym:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-116a8c9e65d1> in <module>
----> 1 import scnym
      2 from scnym.predict import Predicter

~/anaconda3/envs/scnym_env/lib/python3.7/site-packages/scnym-0.1-py3.7.egg/scnym/__init__.py in <module>
      6 # e.g.
      7 # >> from scnym.model import CellTypeCLF
----> 8 from . import dataprep, model, predict, trainer, utils

~/anaconda3/envs/scnym_env/lib/python3.7/site-packages/scnym-0.1-py3.7.egg/scnym/dataprep.py in <module>
----> 1 import torch
      2 import numpy as np
      3 from torch.utils.data import Dataset
      4 from typing import Callable
      5 

~/anaconda3/envs/scnym_env/lib/python3.7/site-packages/torch/__init__.py in <module>
     77 del _dl_flags
     78 
---> 79 from torch._C import *
     80 
     81 __all__ += [name for name in dir(_C)

ImportError: dlopen(/Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/torch/_C.cpython-37m-darwin.so, 9): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/user/anaconda3/envs/scnym_env/lib/python3.7/site-packages/torch/lib/libshm.dylib
  Reason: image not found

Multiple categories

Is there functionality to incorporate multiple categories, in addition to domain? For instance, I'd also like the adversary to classify other annotations such as patient and sequencer.

Lack of support for sparse matrices makes scNym unusable for large datasets

scNym requires log(CPMs) rather than raw counts. This requirement makes scNym unusable for large datasets, since log(CPMs) require a dense matrix, and dense matrices usually can't fit in memory for datasets with more than a few hundred thousand cells.

domain_groupby argument missing

Thank you for developing scNym. I've been using it to infer cell populations between datasets. However, after going over your guidelines on how to use scNym I noticed that the argument domain_groupby seems to be missing from the scnym_api function. As such, I'm wondering which steps I need to take in order to be able to run scNym using multiple domains.
I've been using the scNym version 0.3.0 in python 3.8.5.

Problems with scnym installation using conda

I attended the scNym talk at ISMB and I'm interested in trying out scNym! However, I am having some issues setting up a local environment using conda to run scNym. I will detail the issues chronologically as well as my workarounds.

Issue 1: The README instructs users to create an empty conda environment, with no Python distribution specified, and then run pip. Even after activating the environment, there is no Python and hence no pip, so the pip call either fails or, worse, falls back to a Python installed natively on the machine.

Solution: Instead of running conda create -n scnym_env, add in something like conda create -n scnym_env -c conda-forge python.

Issue 2: After doing the above and running pip install scnym, scNym (command-line) does not work.

$ scnym --help
Traceback (most recent call last):
  File "/home/timlai/anaconda3/envs/scnym_env/bin/scnym", line 8, in <module>
    sys.exit(main())
  File "/home/timlai/anaconda3/envs/scnym_env/lib/python3.8/site-packages/scnym/main.py", line 1316, in main
    import yaml
ModuleNotFoundError: No module named 'yaml'

Solution: https://stackoverflow.com/a/56992964/13906501 takes care of this. Perhaps update the requirements to include this?

Issue 3: Although numbered 3, this seems to be independent of 2. Specifically, attempting this immediately after 1 or after the fix mentioned in 2 results in the same error.

Testing the API:

$ echo "from scnym.api import scnym_api" > test.py
$ python test.py
Traceback (most recent call last):
  File "/home/timlai/scRNA/PAGA/scNYM/scnym.py", line 12, in <module>
    from scnym.api import scnym_api
ModuleNotFoundError: No module named 'scnym.api'; 'scnym' is not a package

I do not have a solution for this and would appreciate some advice.
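
One plausible reading of the traceback in Issue 3: the file that raised the error is itself named scnym.py (/home/timlai/scRNA/PAGA/scNYM/scnym.py), and a script with that name on the import path would shadow the installed package and produce "'scnym' is not a package". The helper below is a hypothetical diagnostic sketch, not part of scNym, that lists such files.

```python
import os
import sys


def find_shadowing_files(module_name, search_paths=None):
    """List `<module_name>.py` files on the import path that could shadow a package."""
    if search_paths is None:
        search_paths = sys.path
    hits = []
    for entry in search_paths:
        directory = entry or os.getcwd()  # '' on sys.path means the current directory
        candidate = os.path.join(directory, module_name + ".py")
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


shadowing = find_shadowing_files("scnym")
```

If the list is non-empty, renaming the reported file and removing any stale `__pycache__` entries next to it is the usual remedy.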

Human atlas not available in notebook?

Hi there,

Could the scnym_atlas_transfer notebook in the README be updated to include the human atlas?

Thanks!

ValueError                                Traceback (most recent call last)
in ()
      5 if ATLAS2USE not in CELL_ATLASES.keys():
      6     msg = f'{ATLAS2USE} is not available in the cell atlas directory.'
----> 7     raise ValueError(msg)

ValueError: human is not available in the cell atlas directory.

Unable to run the model successfully.

When I run the example "scNym Cell Type Classification with Cell Atlas References", something goes wrong. I think there may be a problem with the model.

No longer installs

Attempting to install yields the error

  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in leidenalg setup command: use_2to3 is invalid.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Downgrading setuptools via liftoff/pyminifier#132 doesn't seem to help.

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm`

Hi,

Thank you for developing scNym. I have been using it a lot for label transfer tasks and it is great!
So far, my workflow has worked flawlessly until I moved to a new workstation.

When I run the following:

scnym.api.scnym_api(
    adata = combined_object,
    task = 'train',
    groupby = 'cell_states',
    domain_groupby='domain_label',
    out_path = '/scnym_models/healthy/',
    config = 'new_identity_discovery',
)

It fails with the following error:

CUDA compute device found.
32767 unlabeled observations found.
Using unlabeled data as a target set for semi-supervised, adversarial training.

training examples:  (307282, 15412)
target   examples:  (32767, 15412)
X:  (307282, 15412)
y:  (307282,)
Using user provided domain labels.
Found 164 source domains and 6 target domains.
Not weighting classes and not balancing classes.
Found 170 unique domains.
Using MixMatch for semi-supervised learning
Scaling ICL over 100 epochs, 0 epochs for burn in.
Scaling ICL over 20 epochs, 0 epochs for burn in.
Using a Domain Adaptation Loss.
Training...
Epoch 0/99|______________________________|
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
221123_train_scNym_reference-Healthy_model.ipynb Cell 18 in <cell line: 1>()
----> 1 scnym.api.scnym_api(
      2     adata = combined_object,
      3     task = 'train',
      4     groupby = 'cell_states',
      5     domain_groupby='domain_label',
      6     out_path = '/scnym_models/healthy_hlca/',
      7     config = 'new_identity_discovery',
      8 )

File ~/mambaforge/envs/scnym/lib/python3.8/site-packages/scnym/api.py:339, in scnym_api(adata, task, groupby, domain_groupby, out_path, trained_model, config, key_added, copy)
    336         msg = f'{groupby} is not a variable in `adata.obs`'
    337         raise ValueError(msg)
--> 339     scnym_train(
    340         adata=adata,
    341         config=config,
    342     )
    343 else:
    344     # check that a pre-trained model was specified or 
    345     # provided for prediction
    346     if trained_model is None:

File ~/mambaforge/envs/scnym/lib/python3.8/site-packages/scnym/api.py:514, in scnym_train(adata, config)
...
-> 1370     ret = torch.addmm(bias, input, weight.t())
   1371 else:
   1372     output = input.matmul(weight.t())

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Since this happened after I changed workstations, I assume it has to do with some compatibility issues with CUDA, but I can't really get my head around it.

Do you think you could help me with this?

Thank you!

Session info here:

The `sinfo` package has changed name and is now called `session_info` to become more discoverable and self-explanatory. The `sinfo` PyPI package will be kept around to avoid breaking old installs and you can downgrade to 0.3.2 if you want to use it without seeing this message. For the latest features and bug fixes, please install `session_info` instead. The usage and defaults also changed slightly, so please review the latest README at https://gitlab.com/joelostblom/session_info.
-----
anndata     0.8.0
scanpy      1.6.0
sinfo       0.3.4
-----
PIL                         9.3.0
absl                        NA
asttokens                   NA
backcall                    0.2.0
certifi                     2022.09.24
chardet                     3.0.4
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.5.1
decorator                   5.1.1
dunamai                     1.14.1
entrypoints                 0.4
executing                   0.8.3
get_version                 3.5.4
google                      NA
h5py                        3.7.0
idna                        2.10
igraph                      0.10.2
importlib_metadata          NA
ipykernel                   6.9.1
jedi                        0.18.1
joblib                      1.2.0
kiwisolver                  1.4.4
legacy_api_wrap             1.2
leidenalg                   0.8.0
llvmlite                    0.32.1
louvain                     0.7.0
matplotlib                  3.5.3
mpl_toolkits                NA
natsort                     8.2.0
numba                       0.49.1
numexpr                     2.8.4
numpy                       1.23.5
packaging                   21.3
pandas                      1.5.1
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prompt_toolkit              3.0.20
ptyprocess                  0.7.0
pure_eval                   0.2.2
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.11.2
pyparsing                   3.0.9
pytz                        2022.6
requests                    2.23.0
scipy                       1.4.1
scnym                       0.3.2
setuptools                  65.5.1
setuptools_scm              NA
six                         1.16.0
sklearn                     0.22.2.post1
stack_data                  0.2.0
tables                      3.6.1
tensorboard                 2.2.1
texttable                   1.6.5
torch                       1.4.0
torchvision                 0.5.0
tornado                     6.1
tqdm                        4.44.1
traitlets                   5.1.1
typing_extensions           NA
urllib3                     1.25.8
wcwidth                     0.2.5
yaml                        5.3.1
zipp                        NA
zmq                         23.2.0
-----
IPython             8.4.0
jupyter_client      7.2.2
jupyter_core        4.10.0
-----
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) [GCC 10.4.0]
Linux-6.0.8-200.fc36.x86_64-x86_64-with-glibc2.10
16 logical CPU cores, x86_64
-----
Session information updated at 2022-11-23 15:06
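
Because the error appeared only after the hardware change, one thing that may be worth checking is whether the installed PyTorch build targets a CUDA version that supports the new GPU. This is one possible cause, not a confirmed diagnosis. The sketch below collects the relevant details for a bug report; the function name `cuda_report` is hypothetical, and it uses only standard `torch` attributes.

```python
import importlib


def cuda_report():
    """Collect PyTorch / CUDA details worth attaching to a bug report.

    `cuda_report` is a hypothetical helper, not part of scnym.
    """
    report = {
        "torch": None,
        "built_for_cuda": None,
        "cuda_available": False,
        "devices": [],
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return report

    report["torch"] = torch.__version__
    # CUDA toolkit version the binary was built against (None for CPU-only builds).
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()

    if report["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            major, minor = torch.cuda.get_device_capability(i)
            report["devices"].append((name, f"{major}.{minor}"))
    return report


print(cuda_report())
```

If the reported compute capability is newer than what the installed PyTorch release was built for, installing a more recent PyTorch build in a fresh environment is a reasonable thing to try.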

Add edge-case testing for "jackpot" cells

Some cell profiles contain a large majority of reads mapping to a single gene (e.g. the Rn45s locus for low quality cells). These cells are usually filtered out during quality control, but in the event they persist in a dataset passed to scnym, they can lead to unstable training dynamics and failures to converge.

We should add simple quality control checks in scnym.api.scnym_api that search user-provided datasets for these cells and raise a warning if any are found.
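
A minimal sketch of such a check, not the scnym implementation. It assumes `X` is a cells-by-genes matrix of non-negative raw counts (a NumPy array or SciPy sparse matrix); the function name `warn_on_jackpot_cells` and the 50% threshold are hypothetical choices for illustration.

```python
import warnings

import numpy as np


def warn_on_jackpot_cells(X, max_fraction=0.5):
    """Flag cells where a single gene holds more than `max_fraction` of counts.

    Hypothetical helper: assumes `X` is cells x genes with non-negative
    raw counts. Returns a boolean mask over cells.
    """
    if not hasattr(X, "toarray"):
        X = np.asarray(X, dtype=float)

    # Total counts per cell and the largest single-gene count per cell.
    totals = np.asarray(X.sum(axis=1)).ravel()
    top = X.max(axis=1)
    if hasattr(top, "toarray"):  # sparse matrices return a sparse result
        top = top.toarray()
    top = np.asarray(top, dtype=float).ravel()

    # Fraction of each cell's counts taken by its top gene (0 for empty cells).
    frac = np.divide(top, totals, out=np.zeros_like(top), where=totals > 0)
    mask = frac > max_fraction

    if mask.any():
        warnings.warn(
            f"{int(mask.sum())} cell(s) have more than {max_fraction:.0%} "
            "of their counts in a single gene; consider filtering them.",
            UserWarning,
        )
    return mask


counts = np.array([[5, 5, 5, 5], [97, 1, 1, 1], [0, 0, 0, 0]])
print(warn_on_jackpot_cells(counts).tolist())  # [False, True, False]
```

A warning rather than an exception keeps the check non-blocking, which matches the proposal above. The threshold would need tuning on real data, and the fraction is less meaningful if the matrix has already been log-normalized.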

Install instructions don't work: scnym won't install with Python 3.10

By default, conda installs Python 3.10, but many of the package versions that scnym requires are not available for it.

(scnym) nicholas@sci-pvm-nicholas:~$ pip install scnym==0.3.2
Collecting scnym==0.3.2
  Using cached scnym-0.3.2-py2.py3-none-any.whl (68 kB)
Collecting scikit-misc==0.1.3
  Using cached scikit-misc-0.1.3.tar.gz (887 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/nicholas/miniconda3/envs/scnym/bin/python3.1 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-mwr6kflj/scikit-misc_98238cf960f14555a7581927dfe035cc/setup.py'"'"'; __file__='"'"'/tmp/pip-install-mwr6kflj/scikit-misc_98238cf960f14555a7581927dfe035cc/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-f3_rfcls
       cwd: /tmp/pip-install-mwr6kflj/scikit-misc_98238cf960f14555a7581927dfe035cc/
  Complete output (7 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-mwr6kflj/scikit-misc_98238cf960f14555a7581927dfe035cc/setup.py", line 170, in <module>
      setup_package()
    File "/tmp/pip-install-mwr6kflj/scikit-misc_98238cf960f14555a7581927dfe035cc/setup.py", line 144, in setup_package
      from numpy.distutils.core import setup, numpy_cmdclass
  ModuleNotFoundError: No module named 'numpy'
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/b3/42/1cb9d9aa545e2a459dbc235e2d15e733876431397b1e20b28b80b5e3755e/scikit-misc-0.1.3.tar.gz#sha256=439bded1d0b549c06bd8d0f167d7b9ac6ed18fd18bbd15eec02b31820b0bb4dc (from https://pypi.org/simple/scikit-misc/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement scikit-misc==0.1.3 (from scnym) (from versions: 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4)
ERROR: No matching distribution found for scikit-misc==0.1.3
(scnym) nicholas@sci-pvm-nicholas:~$ python 
Python 3.10.0 | packaged by conda-forge | (default, Oct 12 2021, 21:24:52) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

Python 3.8 does work.