
theislab / scarches

310 stars · 14 watchers · 49 forks · 802.67 MB

Reference mapping for single-cell genomics

Home Page: https://docs.scarches.org/en/latest/

License: BSD 3-Clause "New" or "Revised" License

Python 0.90% Jupyter Notebook 99.10%
single-cell deep-learning rna-seq-analysis data-integration batch-correction multimodal-deep-learning multiomics single-cell-genomics scrna-seq human-cell-atlas

scarches's Introduction


Single-cell architecture surgery (scArches) is a package for reference-based analysis of single-cell data.

What is scArches?

scArches allows your single-cell query data to be analyzed by integrating it into a reference atlas. By mapping your data into an integrated reference, you can transfer cell-type annotations from the reference to the query, identify disease states by mapping to a healthy atlas, and perform advanced applications such as imputing missing data modalities or spatial locations.
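
A minimal sketch of the mapping step, assuming the SCANVI-based API that appears in the issues further down this page; `ref_path` and `target_adata` are placeholders for a trained reference model and a query AnnData:

import numpy as np
import scarches as sca

# Hedged sketch: adapt a trained reference model to query data
# ("architecture surgery"), then treat all query cells as unlabeled.
model = sca.models.SCANVI.load_query_data(
    target_adata,         # query AnnData, aligned to the reference genes
    ref_path,             # path to the trained reference model
    freeze_dropout=True,  # freeze most weights; only new condition weights train
)
model._unlabeled_indices = np.arange(target_adata.n_obs)
model._labeled_indices = []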

Usage and installation

See here for documentation and tutorials.

Support and contribute

If you have a question, or a new architecture or model that could be integrated into our pipeline, you can post an issue or reach us by email.

Reference

If scArches is helpful in your research, please consider citing the following paper:

@article{lotfollahi2021mapping,
  title={Mapping single-cell data to reference atlases by transfer learning},
  author={Lotfollahi, Mohammad and Naghipourfar, Mohsen and Luecken, Malte D and Khajavi,
  Matin and B{\"u}ttner, Maren and Wagenstetter, Marco and Avsec, {\v{Z}}iga and Gayoso,
  Adam and Yosef, Nir and Interlandi, Marta and others},
  journal={Nature Biotechnology},
  pages={1--10},
  year={2021},
  publisher={Nature Publishing Group}}

scarches's People

Contributors

aidinbii, alextopalova, alitinet, arcaneemergence, cdedonno, chelseabright96, cottoneyejoe95, cuongqn, dineshpalli, dreast, elihei2, evanbiederstedt, hrovatin, koncopd, lcmmichielsen, lisasikkema, m0hammadl, maarten-devries, matinkhajavi, mbuttner, mohsennaghipourfar, mohsennp, moinfar, naghipourfar, natbutter, zethson


scarches's Issues

Using GPU in parallel

Good day!

I have a question: is it possible to set the parameters to use several GPUs in parallel to train the models that support this feature?

Btw, the use_gpu parameter does not work in the latest version of scarches (v0.3.5). It took me some time to find that the new flag is use_cuda (saw it in one of the issues raised before) ;)

Thanks in advance!

Confidence of predictions

Dear authors,

Congratulations on this amazing and promising software. It's of great value for the scientific community.

I'm testing the unsupervised model with SCANVI on my data. I have looked at the documentation, tutorials and issues, but I'm unable to find whether the models give a measure of the confidence of the predictions.

Thanks and best,

Juanlu.

Training without gpu does not use learning rate decrease (lr_reducer)

If I see correctly, training without a GPU does not use the lr_reducer parameter / does not decrease the learning rate as training proceeds:

def _train_on_batch(self, adata, ...

Also, some other packages include saving and plotting of the loss over training so that the user can quickly evaluate whether training went well (scVI saves the loss and provides a plotting function; CellBender remove-background saves the loss plot image, which is even more convenient).
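
A minimal sketch of such a plotting helper, assuming the trainer's logs dict of per-epoch lists (the structure visible in the trainer traceback elsewhere on this page):

import matplotlib.pyplot as plt

def plot_training_loss(logs, keys=("epoch_loss", "val_loss")):
    # Plot the requested loss curves recorded during training.
    for key in keys:
        if key in logs:
            plt.plot(logs[key], label=key)
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()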

trVAE: MMD loss always zero

Hi,

My reference and query data do not mix when I use scANVI, so I decided to go ahead and try trVAE, as it applies an additional MMD loss. However, I noticed that the MMD loss always seems to be zero:

In [114]: trvae.train(
     ...:     n_epochs=trvae_epochs,
     ...:     alpha_epoch_anneal=200,
     ...:     early_stopping_kwargs=early_stopping_kwargs
     ...: )
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
 |███-----------------| 15.0%  - epoch_loss:   13730 - epoch_recon_loss:   13723 - epoch_kl_loss:      18 - epoch_mmd_loss:       0 - val_loss:   13714 - val_recon_loss:   13707 - val_kl_loss:      18 - val_mmd_loss:       0
ADJUSTED LR
 |████----------------| 20.0%  - epoch_loss:   13717 - epoch_recon_loss:   13708 - epoch_kl_loss:      17 - epoch_mmd_loss:       0 - val_loss:   13708 - val_recon_loss:   13699 - val_kl_loss:      17 - val_mmd_loss:       0
ADJUSTED LR
 |████----------------| 21.4%  - epoch_loss:   13693 - epoch_recon_loss:   13684 - epoch_kl_loss:      17 - epoch_mmd_loss:       0 - val_loss:   13713 - val_recon_loss:   13704 - val_kl_loss:      17 - val_mmd_loss:       0
Stopping early: no improvement of more than 0 nats in 20 epochs
If the early stopping criterion is too strong, please instantiate it with different parameters in the train method.
Saving best state of network...
Best State was in Epoch 85

[...]

In [119]: new_trvae.train(
     ...:     n_epochs=surgery_epochs,
     ...:     alpha_epoch_anneal=200,
     ...:     early_stopping_kwargs=early_stopping_kwargs,
     ...:     weight_decay=0
     ...: )
Trying to set attribute `.obs` of view, copying.
Trying to set attribute `.obs` of view, copying.
 |████████████████████| 100.0%  - epoch_loss:   14079 - epoch_recon_loss:   14064 - epoch_kl_loss:      15 - epoch_mmd_loss:       0 - val_loss:   14220 - val_recon_loss:   14205 - val_kl_loss:      15 - val_mmd_loss:       0
Saving best state of network...
Best State was in Epoch 499

I am using the readthedocs code without any modifications, just with my own data.

Any ideas what may be wrong? Thanks in advance!

Multiple batches with trvae

I've noticed that when I run trVAE to build my reference, I run into an error when I have more batches than latent dimensions (error below). When I increase the number of latent dimensions to at least the number of batches, the error goes away! I am curious whether this behavior is an attribute of the underlying model or of the implementation. It is important for me to integrate >200 batches, and it seems impractical to use >200 latent dimensions to do so.

Thanks for your help and insight!

In : trvae_full.train(
...: n_epochs=trvae_epochs,
...: alpha_epoch_anneal=trvae_epochs,
...: early_stopping_kwargs=early_stopping_kwargs
...: )
Traceback (most recent call last):
File "", line 4, in
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/models/trvae/trvae_model.py", line 282, in train
self.trainer.train(n_epochs, lr, eps)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/trainers/trvae/trainer.py", line 167, in train
self.on_iteration(batch_data)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/trainers/trvae/trainer.py", line 249, in on_iteration
self.current_loss = loss = self.loss(**batch_data)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/trainers/trvae/unsupervised.py", line 62, in loss
recon_loss, kl_loss, mmd_loss = self.model(**total_batch)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/models/trvae/trvae.py", line 179, in forward
outputs = self.decoder(z1, batch)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/models/trvae/modules.py", line 188, in forward
dec_latent = self.FirstL(z_cat)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/PHShome/ik936/anaconda3/envs/gpu/lib/python3.7/site-packages/scarches/models/trvae/modules.py", line 25, in forward
expr, cond = torch.split(x, x.shape[1] - self.n_cond, dim=1)
ValueError: too many values to unpack (expected 2)
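
For what it's worth, the failure mode can be reproduced in isolation: torch.split with an integer size yields ceil(total/size) chunks, so once the number of conditions exceeds the latent dimension, the two-way unpack fails. A hedged illustration (not the scArches code itself):

import torch

z_dim, n_cond = 10, 25                               # more batches than latent dims
x = torch.randn(4, z_dim + n_cond)                   # concatenated [z, one-hot cond]
chunks = torch.split(x, x.shape[1] - n_cond, dim=1)  # split size = 10
print(len(chunks))                                   # 4 chunks, not 2
# expr, cond = chunks                                # ValueError: too many values to unpack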

API suggestion - add load() convenience function

It would be super useful if there were a scarches.models load() class method, parallel to save(), creating/restoring a model with one call. Currently, it takes 3-4 calls to achieve the same.
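
A hedged sketch of what such a convenience function could wrap, based on the calls shown elsewhere on this page (a proposal, not an existing scArches API):

import scarches as sca

def load(config_path, model_path, task_name):
    # Recreate a saved scArches model from its config, then restore weights.
    network = sca.models.scArches.from_config(config_path, construct=True, compile=True)
    network.model_path = model_path
    network.task_name = task_name
    network.restore_model_weights(compile=True)
    return network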

zero filling missing genes in the query

Hi @Cottoneyejoe95, could you please add an option so that if the query data has some missing genes, they are zero-filled? Add an option called zero_fill=False to the load_query function, and explain that if True it will fill in zeros for genes missing from the query data.
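
Until such an option exists, a hedged workaround is to pad the query AnnData with zero columns for the reference genes it lacks and reorder to the reference gene order (a sketch, not scArches functionality; densifies sparse matrices for simplicity):

import numpy as np
import pandas as pd
import anndata as ad
import scipy.sparse as sp

def zero_fill_query(query, reference_genes):
    # Add all-zero columns for reference genes missing from the query,
    # then return the query in reference gene order.
    missing = [g for g in reference_genes if g not in set(query.var_names)]
    X = query.X.toarray() if sp.issparse(query.X) else np.asarray(query.X)
    padded = np.hstack([X, np.zeros((query.n_obs, len(missing)), dtype=X.dtype)])
    var = pd.DataFrame(index=list(query.var_names) + missing)
    out = ad.AnnData(padded, obs=query.obs.copy(), var=var)
    return out[:, list(reference_genes)].copy()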

Effects of cell cycle, mitochondrial counts

Thank you for a great tool. I'm wondering how you recommend dealing with cell-cycle / mitochondrial-count effects on the data. Do you recommend regressing out the effect prior to training? And is any particular loss function better post-regression? Thanks a lot.

loading scArches from local and train with new data

Hi team,
I've trained a network successfully following the tutorial, and the network is saved locally. When I reload the network and train with a query data set using the train() function, my kernel restarts itself after a couple of warning messages from tensorflow: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead. Not sure if this is related to scArches, but have you seen this before? Since there is no tutorial for loading a network from local storage, I would also like to confirm that what I did is correct. Any advice is appreciated!

config_path = './models/scArches_v4_sse/IBD_epi_reference/scArches.json'

pre_trained_scArches = sca.models.scArches.from_config(config_path, construct=True, compile=True)
pre_trained_scArches.model_path = './models/scArches_v4_sse/IBD_epi_reference/'
pre_trained_scArches.task_name = 'IBD_epi_reference'
pre_trained_scArches.restore_model_weights(compile=True)

target_conditions = adata.obs[condition_key].unique().tolist()

new_network = sca.operate(pre_trained_scArches,
                          new_conditions=target_conditions,
                          new_task_name="IBD_epi_UC_CD")
new_network.model_path = './models/scArches_v4_sse/IBD_epi_UC_CD/'

new_network.train(adata,
                  condition_key=condition_key,
                  batch_size=128,
                  n_epochs=100)  # this is where the kernel restarts

example pancreas dataset not there

Hi! In your example notebook "pancreas_pipeline.ipynb", the file pancreas_normalized_hvg.h5ad is loaded from "./tests/data/pancreas_normalized_hvg.h5ad". However, the file is not in the repo. Is it possible to upload it again?

scArches google colab dependencies

I am trying to install scArches in Google Colab but am receiving the following error messages:

ERROR: scvi-tools 0.8.1 has requirement importlib-metadata<2.0,>=1.0; python_version < "3.8", but you'll have importlib-metadata 3.4.0 which is incompatible.
ERROR: scarches 0.3.5 has requirement matplotlib>=3.3.1, but you'll have matplotlib 3.2.2 which is incompatible.
ERROR: scarches 0.3.5 has requirement scikit-learn>=0.23.2, but you'll have scikit-learn 0.22.2.post1 which is incompatible.
ERROR: scarches 0.3.5 has requirement scipy>=1.5.2, but you'll have scipy 1.4.1 which is incompatible.

How do I address these version issues?
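
A hedged fix, assuming the conflicts can be resolved by re-pinning the packages named in the error messages inside the Colab runtime:

!pip install "importlib-metadata<2.0,>=1.0" "matplotlib>=3.3.1" "scikit-learn>=0.23.2" "scipy>=1.5.2"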

"AttributeError: 'float' object has no attribute 'cpu'"

Dear Authors,

I am running scArches 0.3.0 with trVAE. I run into a problem (full error message below) with my datasets when I try to either build a reference or map a query while either the source or target has only one batch-variable value. Fortunately, when I add a second batch-variable value to the source, trVAE works fine, and likewise for the target, scArches works fine. Unfortunately, I could not reproduce this error by downsampling batches in the notebooks you provided. I found the same main error message in a separate GitHub issue, but it was not clear whether it applied here.

Thank you in advance for your help!


AttributeError Traceback (most recent call last)
in ()
10 n_epochs=trvae_epochs,
11 alpha_epoch_anneal=trvae_epochs,
---> 12 early_stopping_kwargs=early_stopping_kwargs
13 )

/PHShome/ik936/anaconda3/envs/scarches3/lib/python3.6/site-packages/scarches/models/trvae/trvae_model.py in train(self, n_epochs, lr, eps, **kwargs)
280 condition_key=self.condition_key_,
281 **kwargs)
--> 282 self.trainer.train(n_epochs, lr, eps)
283 self.is_trained_ = True
284

/PHShome/ik936/anaconda3/envs/scarches3/lib/python3.6/site-packages/scarches/trainers/trvae/trainer.py in train(self, n_epochs, lr, eps)
168
169 # Validation of Model, Monitoring, Early Stopping
--> 170 self.on_epoch_end()
171 if self.use_early_stopping:
172 if not self.check_early_stop():

/PHShome/ik936/anaconda3/envs/scarches3/lib/python3.6/site-packages/scarches/trainers/trvae/trainer.py in on_epoch_end(self)
266 if "loss" in key:
267 self.logs["epoch_" + key].append(
--> 268 sum(self.iter_logs[key][:]).cpu().detach().numpy() / len(self.iter_logs[key][:]))
269
270 # Validate Model

AttributeError: 'float' object has no attribute 'cpu'
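
The traceback suggests that some logged losses arrive as plain Python floats (e.g. an MMD loss of 0 when there is a single batch), and floats have no .cpu() method. A hedged sketch of a guard for the averaging step, with the fix-up logic inferred from the traceback rather than taken from scArches:

import torch

def epoch_mean(values):
    # Average logged losses that may be torch tensors or plain floats.
    total = sum(v.cpu().detach().item() if torch.is_tensor(v) else float(v)
                for v in values)
    return total / len(values)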

Update to scvi-tools>=0.9.0

Hi all,

It would be great to update to use the new scvi-tools version. We have a tutorial here that should show everything necessary to change. From your end, it's just going to be these minor API changes in the train method. We also don't have accuracy tracked for scanvi, but do have the classification loss.

CUDA error: no kernel image is available for execution on the device

Hi!
I was trying to use the load_query_data function but got an error:

model = sca.models.SCANVI.load_query_data(
    target_adata,
    ref_path,
    freeze_dropout = True,
)

INFO     Using data from adata.X                                                             
INFO     Computing library size prior per batch                                              
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels']     
INFO     Successfully registered anndata object containing 128592 cells, 2000 vars, 16       
         batches, 29 labels, and 0 proteins. Also registered 0 extra categorical covariates  
         and 0 extra continuous covariates.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-d8e8bb4ae33d> in <module>
      2     target_adata,
      3     ref_path,
----> 4     freeze_dropout = True,
      5 )
      6 model._unlabeled_indices = np.arange(target_adata.n_obs)

~/miniconda3/envs/scarches_0_3_5/lib/python3.7/site-packages/scvi_tools-0.8.1-py3.7.egg/scvi/core/models/archesmixin.py in load_query_data(cls, adata, reference_model, inplace_subset_query_vars, use_cuda, unfrozen, freeze_dropout, freeze_expression, freeze_decoder_first_layer, freeze_batchnorm_encoder, freeze_batchnorm_decoder, freeze_classifier)
    116             else:
    117                 dim_diff = new_ten.size()[-1] - load_ten.size()[-1]
--> 118                 fixed_ten = torch.cat([load_ten, new_ten[..., -dim_diff:]], dim=-1)
    119                 load_state_dict[key] = fixed_ten
    120 

RuntimeError: CUDA error: no kernel image is available for execution on the device

Any idea what could be the issue here?
Thanks!

UnicodeEncodeError in progress bar

Occasionally I get the error below during training when I run the following code.
Version: scArches-0.1.2

network.train(adata,
              n_epochs=params['n_epochs'],
              batch_size=params['batch_size'], 
              condition_key='study_sample',
              save=True,
              retrain=True
             )

Traceback (most recent call last):
  File "code/diabetes_analysis/integration/scArches/1_run_ref_scArches_script.py", line 103, in <module>
    retrain=True
  File "/home/icb/karin.hrovatin/miniconda3/envs/py3.6-scarches/lib/python3.6/site-packages/scarches/models/cvae.py", line 749, in train
    verbose)
  File "/home/icb/karin.hrovatin/miniconda3/envs/py3.6-scarches/lib/python3.6/site-packages/scarches/models/scarches.py", line 496, in _train_on_batch
    print_progress(i, logs, n_epochs)
  File "/home/icb/karin.hrovatin/miniconda3/envs/py3.6-scarches/lib/python3.6/site-packages/scarches/models/_utils.py", line 125, in print_progress
    _print_progress_bar(epoch + 1, n_epochs, prefix='', suffix=message, decimals=1, length=20)
  File "/home/icb/karin.hrovatin/miniconda3/envs/py3.6-scarches/lib/python3.6/site-packages/scarches/models/_utils.py", line 132, in _print_progress_bar
    sys.stdout.write('\r%s |%s| %s%s %s' % (prefix, bar, percent, '%', suffix)),
UnicodeEncodeError: 'ascii' codec can't encode character '\u2588' in position 3: ordinal not in range(128)
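
A hedged workaround (an environment-level fix, not a scArches API): force UTF-8 on stdout before training so the '█' progress-bar glyph can be encoded under an ASCII locale.

import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")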

scarches switches to ipython when imported

Hello,

I've found that when importing scarches within the Python interpreter, it switches to IPython. Is this expected behaviour? I've been able to reproduce this on three different computers.


Something similar occurs when loading scarches via reticulate in R: the Python session restarts.

The session information is:

  • Python 3.7.7
  • scArches 0.3.2
Package                       Version
----------------------------- -------------------
anndata                       0.7.5
annoy                         1.17.0
appnope                       0.1.0
astroid                       2.4.2
attrs                         19.3.0
backcall                      0.2.0
backports.functools-lru-cache 1.6.1
bbknn                         1.3.12
bleach                        3.1.5
certifi                       2020.12.5
cffi                          1.14.1
chardet                       3.0.4
colorama                      0.4.4
commonmark                    0.9.1
conda                         4.9.2
conda-package-handling        1.7.0
cryptography                  2.9.2
cycler                        0.10.0
Cython                        0.29.21
decorator                     4.4.2
defusedxml                    0.6.0
Deprecated                    1.2.10
entrypoints                   0.3
et-xmlfile                    1.0.1
filelock                      3.0.12
future                        0.18.2
gdown                         3.12.2
get-version                   2.1
graphtools                    1.5.1
h5py                          2.10.0
hyperopt                      0.1.2
idna                          2.9
importlib-metadata            1.6.1
ipykernel                     5.3.4
ipython                       7.16.1
ipython-genutils              0.2.0
ipywidgets                    7.5.1
isort                         5.7.0
jdcal                         1.4.1
jedi                          0.17.2
Jinja2                        2.11.2
joblib                        0.15.1
json5                         0.9.4
jsonschema                    3.2.0
jupyter                       1.0.0
jupyter-client                6.1.6
jupyter-console               6.1.0
jupyter-core                  4.6.3
jupyterlab                    2.2.2
jupyterlab-server             1.2.0
kiwisolver                    1.2.0
lazy-object-proxy             1.4.3
legacy-api-wrap               1.2
leidenalg                     0.8.1
llvmlite                      0.33.0+1.g022ab0f
magic-impute                  2.0.3
MarkupSafe                    1.1.1
matplotlib                    3.3.3
mccabe                        0.6.1
mistune                       0.8.4
mkl-fft                       1.1.0
mkl-random                    1.1.1
mkl-service                   2.3.0
mock                          4.0.2
more-itertools                8.4.0
natsort                       7.0.1
nbconvert                     5.6.1
nbformat                      5.0.6
networkx                      2.4
notebook                      6.0.3
notedown                      1.5.1
numba                         0.50.1
numexpr                       2.7.1
numpy                         1.19.5
openpyxl                      3.0.5
packaging                     20.4
pandas                        1.1.5
pandoc-attributes             0.1.7
pandocfilters                 1.4.2
parso                         0.7.1
patsy                         0.5.1
pexpect                       4.8.0
phate                         1.0.4
pickleshare                   0.7.5
Pillow                        7.1.2
pip                           20.1.1
pluggy                        0.13.1
prometheus-client             0.8.0
prompt-toolkit                3.0.5
ptyprocess                    0.6.0
py                            1.8.2
pycairo                       1.19.1
pycosat                       0.6.3
pycparser                     2.20
Pygments                      2.6.1
PyGSP                         0.5.1
pylint                        2.6.0
pymongo                       3.11.2
pyOpenSSL                     19.1.0
pyparsing                     3.0.0a1
pyrsistent                    0.16.0
PySocks                       1.7.1
pytest                        5.4.3
python-dateutil               2.8.1
python-igraph                 0.8.2
pytz                          2020.1
pyzmq                         19.0.1
qtconsole                     4.7.5
QtPy                          1.9.0
requests                      2.23.0
rich                          9.6.1
rpy2                          3.3.4
ruamel-yaml                   0.15.87
s-gd2                         1.7
scanpy                        1.6.0
scArches                      0.3.2
scikit-learn                  0.23.2
scikit-misc                   0.1.3
scipy                         1.5.4
scprep                        1.0.5.post2
scvi-tools                    0.8.1
seaborn                       0.11.1
Send2Trash                    1.5.0
setuptools                    47.1.1.post20200604
setuptools-scm                4.1.2
sinfo                         0.3.1
six                           1.15.0
statsmodels                   0.12.0
stdlib-list                   0.8.0
tables                        3.6.1
tasklogger                    1.0.0
terminado                     0.8.3
testpath                      0.4.4
texttable                     1.6.2
threadpoolctl                 2.1.0
toml                          0.10.2
torch                         1.6.0
tornado                       6.0.4
tqdm                          4.49.0
traitlets                     4.3.3
typed-ast                     1.4.2
typing-extensions             3.7.4.3
tzlocal                       2.1
umap-learn                    0.4.6
urllib3                       1.25.8
wcwidth                       0.2.5
webencodings                  0.5.1
wheel                         0.34.2
widgetsnbextension            3.5.1
wrapt                         1.12.1
zipp                          3.1.0

Thanks!

how to improve resolution

Hi team,

Thanks so much for the great tool. I was able to install it pretty smoothly and run through a public data set without errors (so far), based on the nicely documented tutorials.
One issue I have, though, is the separation of the cells. The object was previously analyzed with a standard Scanpy pipeline and is now used to train the model. When I check my latent_adata, the cell types are not separated as nicely as they looked when analyzed by Scanpy, and many previously seen smaller branches in the UMAP are now merged into one big blob.
I've tried following the 'tips': different z_dimension values (10 or 20), architectures (from the default to [128,128,128,128]), alpha (down to even 0.000001), n_epochs (100, 150, 200), and hvg (2k or 5k genes), but none seems to help (loss_fn is always 'nb').
Any advice will be greatly appreciated! Thanks!

Best,
Xiao

Semi-supervised scArches

Thank you for creating this tool. I've been following your vignette for semi-supervised scArches with SCANVI, with a reference dataset (1 batch) and a query dataset (1 batch). However, when I train SCANVI on the reference, it achieves a prediction accuracy of only 59%. Is there anything I can do to improve the performance?

documentation - minor typos/errors

Minor issues I noted while reading the current documentation:

  1. about.html: Where To Start has a formatting issue with the pancreas pipeline link.
  2. api/data.html:
    • Params for normalize_hvg() are misformatted.
    • "here" link in scarches.data.read() is a dead link
  3. api/utils.html: param formatting for scarches.util.label_encoder() is bad
  4. api/zenodo.html: upload_adaptor() and others -- missing capitalization on first sentence ("upload"), and other scattered accidental capitalization, eg, "Creates A deposition...."
  5. latest/model_sharing.html: "toturial" misspelled
  6. latest/pancrease_pipeline.html: "a new query stuy ...", study is misspelled
  7. the conditions param for sca.models.scArches is documented as an 'int', but it is really a list per the source code.

Problems integrating cell and nuclei datasets

Good day,

I am excited to use this tool to build a reference for my study. However, I am having problems integrating single-cell and single-nuclei datasets. When I use the standard integration method in Seurat, I get the expected output, but scArches is not able to integrate the data coming from snRNA-seq. For both pipelines, I used the different datasets as the batch key. See the attached picture.

What do you think the issue could be? Thanks in advance!
[attached screenshot]

suggestions for docs improvements

Couple of suggested improvements in the docs, based upon a first read:

  • reference API docs for important API used in the examples (eg, sca.operate, scc.ann.*, etc)
  • a sketch of the various models, highlighting their behavior and when it is appropriate to try them
  • examples of appropriate use of normalize_hvg() and the semantics of size_factors (required by subsequent network training).

Install through conda

It would be great to be able to install with conda.

I tried to install the requirements through conda but tensorflow==1.15.2 is not available there.
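
A hedged workaround, assuming pip inside a fresh conda environment can supply the pinned TensorFlow (1.15.x requires Python <= 3.7):

conda create -n scarches python=3.7
conda activate scarches
pip install tensorflow==1.15.2
pip install scarches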

Sizes of tensors must match except in dimension 1

Hi,

I have successfully run scArches with one pair of reference and query sets. Now I am trying to do it with another pair, and I get the following error message at the query-data loading stage:

In [123]: model = sca.models.SCANVI.load_query_data(
     ...:     target_adata.copy(),
     ...:     vae,
     ...: )
INFO     Using data from adata.X                                                             
INFO     Computing library size prior per batch                                              
/opt/anaconda3/lib/python3.7/site-packages/scvi/data/_anndata.py:795: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
  logger_data_loc
INFO     Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels']     
INFO     Successfully registered anndata object containing 19755 cells, 6137 vars, 9 batches,
         17 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0     
         extra continuous covariates.                                                        
WARNING  Make sure the registered X field in anndata contains unnormalized count data.       
Traceback (most recent call last):

  File "<ipython-input-123-d90f5f2be02a>", line 3, in <module>
    vae,

  File "/opt/anaconda3/lib/python3.7/site-packages/scvi/core/models/archesmixin.py", line 118, in load_query_data
    fixed_ten = torch.cat([load_ten, new_ten[..., -dim_diff:]], dim=-1)

RuntimeError: Sizes of tensors must match except in dimension 1. Got 9 and 17 in dimension 0 (The offending index is 1)

Any advice on what may be causing the error is greatly appreciated!

gpus with trVAE model

Hi team,

I am trying to run trVAE with GPUs. I made sure that scArches is able to use GPUs successfully by testing the use_cuda flag with the scVI model. However, I am not certain whether the trVAE model is making use of GPUs or CPUs. Do you have any suggestions for how to make sure that trVAE uses GPUs?

Thanks!

Ilya

save/restore creates network with incorrect model_path

If I create, train and save a network, and then use scarches.models.from_config() to reload it, the new network is created with an incorrect model_path. This then makes it impossible to use the various restore_* methods.

Example:

import scarches as sca
import scanpy as sc
import anndata as ad

condition_key = 'study'
adata = sca.datasets.pancreas()
# slice out a chunk to make this test run fast
adata = adata[0:1000, :]

condition_labels = adata.obs[condition_key].unique().tolist()
network = sca.models.scArches(task_name='test',
                             x_dimension=adata.shape[1],
                             z_dimension=10,
                             conditions=condition_labels,
                             model_path='./models/')

network.train(adata,
              n_epochs=150,
              condition_key=condition_key,
              batch_size=128)
network.save()

new_network = sca.models.scArches.from_config('./models/test/scArchesNB.json')
new_network.restore_model_config()  # Fails, returns False
print(new_network.model_path)  # prints:  ./models/test/test

If I manually clobber new_network.model_path = './models/test/' and then re-try restore_model_config(), it works.

vae.train crashes with no messages

Dear scArches team,

I'm very excited to give this latest version a whirl, especially with the .*vi and trvae integration. I'm running into a problem running the sample notebooks. Each time I get to the vae.train command, the kernel crashes and restarts. I've run this in a notebook and on a terminal. Both times, I get no warnings or error messages to help track the problem. The first line INFO Training for 500 epochs prints and then the kernel crashes. I've double checked that I installed the appropriate versions of packages from your requirements file. Do you have any suggestions for how to track down the issue?

Thanks for your help!

Best,
Ilya

scarches google colab and tqdm~=4.49.0 installation

Hi,
I am trying to install scarches on Google Colab, but I get this error even when I try to specify the indicated tqdm version. Can you help me fix it?

ERROR: scarches 0.3.4 has requirement tqdm~=4.49.0, but you'll have tqdm 4.56.0 which is incompatible.

Thank you in advance
Best

Problem with installation (Fedora 25)

I tried to install scarches in a conda environment using
pip install -U scarches
And I received the following error message:

Collecting scarches
  Downloading scArches-0.1.2-py3-none-any.whl (50 kB)
     |████████████████████████████████| 50 kB 1.5 MB/s 
Collecting keras==2.2.4
  Using cached Keras-2.2.4-py2.py3-none-any.whl (312 kB)
ERROR: Could not find a version that satisfies the requirement tensorflow==1.15.2 (from scarches) (from versions: 2.2.0rc1, 2.2.0rc2, 2.2.0rc3, 2.2.0rc4, 2.2.0, 2.3.0rc0, 2.3.0rc1, 2.3.0rc2, 2.3.0)
ERROR: No matching distribution found for tensorflow==1.15.2 (from scarches)

reference model question re: conditions

In the case where a reference model is published, all downstream users who want to transfer that to their query data must know both the gene names and conditions for training their derivative model. Gene names are easily accessed, as they are embedded in the model. Condition labels are also saved, but the model does not include information about how the condition labels were generated (eg, the original condition_key, or in more complex cases, a condition derived from multiple obs columns).

In the reference model re-use cases, do you have guidance on how model publishers should inform query users of proper condition label generation (or minimally, the condition key)?

Side note: it would be helpful to have a sample notebook showing best practices for the "query" user - i.e., once you have a reference model, how you normalize, train, classify, etc. with only the query dataset in hand.

Thanks again!

import fails on latest release (1.14)

With a clean virtual environment and a fresh install of all modules, importing scarches fails:

$ python
Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scarches
Using TensorFlow backend.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/scarches/__init__.py", line 4, in <module>
    from scarches.models import Adaptor
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/scarches/models/__init__.py", line 1, in <module>
    from .cvae import CVAE
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/scarches/models/cvae.py", line 15, in <module>
    from tensorflow.random import set_seed
ImportError: cannot import name 'set_seed'

It looks like a TF 2.0 API (set_seed) was used, when you probably want TF 1.15's set_random_seed().
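
A hedged compatibility shim illustrating the fix (assumes TF 1.15, where the equivalent function lives under tf.compat.v1):

import tensorflow as tf

try:
    from tensorflow.random import set_seed   # TF 2.x API
except ImportError:
    set_seed = tf.compat.v1.set_random_seed  # TF 1.15 equivalent

set_seed(0)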

fewer cell types predicted than in the reference

Hi, I'm following the tutorial and I ran into the following issue:

My reference data set contains 9 cell types. When I run vae.predict(), only 5 are predicted, but this is nothing out of the ordinary, because the remaining 4 contain few cells, so no wonder they are not predicted correctly. The accuracy of the learned classifier is still 91%. Importantly, the 5 that are predicted are not consecutive - they are ['0','1','2','3','8'], where '0' is represented by the largest number of cells.

When I run model.predict(), the only predicted IDs are ['1','2','3']. Does this mean that only these reference cell types could be predicted, or does it look like something went wrong?

My code:

sca.dataset.setup_anndata(source_adata,
                          batch_key='sample',
                          labels_key='leiden')

vae = sca.models.SCANVI(
    source_adata,
    "Unknown",
    n_layers=2,
    encode_covariates=True,
    deeply_inject_covariates=False,
    use_layer_norm="both",
    use_batch_norm="none",
)

vae.train(
    n_epochs_unsupervised=vae_epochs,
    n_epochs_semisupervised=scanvi_epochs,
    unsupervised_trainer_kwargs=dict(early_stopping_kwargs=early_stopping_kwargs),
    semisupervised_trainer_kwargs=dict(metrics_to_monitor=["elbo","accuracy"],
                                       early_stopping_kwargs=early_stopping_kwargs_scanvi),
    frequency=1
)

target_adata.obs['leiden'] = vae.unlabeled_category_

model = sca.models.SCANVI.load_query_data(
    target_adata,
    vae,
    freeze_dropout = True,
)
model._unlabeled_indices = np.arange(target_adata.n_obs)
model._labeled_indices = []

model.train(
    n_epochs_semisupervised=surgery_epochs,
    train_base_model=False,
    semisupervised_trainer_kwargs=dict(metrics_to_monitor=["accuracy", "elbo"],
                                       weight_decay=0,
                                       early_stopping_kwargs=early_stopping_kwargs_surgery
                                      ),
    frequency=1
)

Error using use_batchnorm in operate

I get an error if I use use_batchnorm=False in operate. This does not happen if I comment out the relevant line.
(Using the dev version from a few days ago.)

    network = sca.operate(network,
        new_task_name=params['task_name'],
        new_conditions=query_adata.obs['study_sample'].unique(),
        # Does not work, so change below
        #new_network_kwargs={'model_path':path_out}
        new_network_kwargs={'use_batchnorm':False},
                         )

scArches' network has been successfully constructed!
scArches' network has been successfully compiled!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-125a7587b68d> in <module>
     16         # Does not work, so change below
     17         #new_network_kwargs={'model_path':path_out}
---> 18         new_network_kwargs={'use_batchnorm':False}
     19                          )
     20     network.model_path=path_out+network.task_name+os.sep

~/miniconda3/envs/py3.6-scarches/lib/python3.6/site-packages/scarches/scarches/__init__.py in operate(network, new_task_name, new_conditions, adaptors, init, version, remove_dropout, print_summary, new_training_kwargs, new_network_kwargs)
    194     for idx, encoder_layer in enumerate(new_network.encoder_model.layers):
    195         if encoder_layer.name != 'first_layer' and encoder_layer.get_weights() != []:
--> 196             encoder_layer.set_weights(network.encoder_model.layers[idx].get_weights())
    197 
    198     for idx, decoder_layer in enumerate(new_network.decoder_model.layers):

~/miniconda3/envs/py3.6-scarches/lib/python3.6/site-packages/keras/engine/base_layer.py in set_weights(self, weights)
   1045                              str(len(params)) +
   1046                              ' weights. Provided weights: ' +
-> 1047                              str(weights)[:50] + '...')
   1048         if not params:
   1049             return

ValueError: You called `set_weights(weights)` on layer "dense_85" with a  weight list of length 0, but the layer was expecting 1 weights. Provided weights: []...

conditions question

It appears that the models can only be conditioned on a single dataframe column, so if I want multiple conditions, I have to create a single column with unique values representing them. Is that reading correct, or is there a way to condition on multiple columns?
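
A hedged one-liner for the single-column workaround the question describes (the obs column names 'study' and 'sample' are illustrative):

# Collapse several obs columns into one composite condition column.
adata.obs["condition"] = (
    adata.obs["study"].astype(str) + "_" + adata.obs["sample"].astype(str)
)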

scanpy compatibility

Hi,

I'm getting package-compatibility issues between scArches and Scanpy. Do you have a recommended version of Scanpy to go with scArches? I was hoping to use a recent SCTransform branch, but its dependencies require a newer Python than 3.7.

Thanks!

scArches incompatible with latest version of scvi

When using scvi-tools-0.9.0a1, I hit the following error when importing scArches:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-b440abd12709> in <module>
      6 import zipfile
      7 import scipy.io
----> 8 import scarches as sca
      9 
     10 sc.set_figure_params(dpi=200)

~/.local/lib/python3.8/site-packages/scarches/__init__.py in <module>
----> 1 from . import metrics, models, dataset, trainers, zenodo, plotting
      2 from .surgery import scvi_operate, trvae_operate
      3 
      4 __author__ = ', '.join([
      5     'Marco Wagenstetter',

~/.local/lib/python3.8/site-packages/scarches/models/__init__.py in <module>
----> 1 from .trvae import trVAE, TRVAE
      2 
      3 from scvi.model import SCVI, SCANVI, TOTALVI

~/.local/lib/python3.8/site-packages/scarches/models/trvae/__init__.py in <module>
      1 from .trvae import trVAE
----> 2 from .trvae_model import TRVAE

~/.local/lib/python3.8/site-packages/scarches/models/trvae/trvae_model.py in <module>
      9 
     10 from .trvae import trVAE
---> 11 from scarches.trainers.trvae.unsupervised import trVAETrainer
     12 from ._utils import _validate_var_names
     13 

~/.local/lib/python3.8/site-packages/scarches/trainers/__init__.py in <module>
----> 1 from .scvi import (
      2     scVITrainer,
      3     scANVITrainer,
      4     totalTrainer
      5 )

~/.local/lib/python3.8/site-packages/scarches/trainers/scvi/__init__.py in <module>
----> 1 from .trainers import scVITrainer, scANVITrainer, totalTrainer
      2 
      3 __all__ = [
      4     "scVITrainer",
      5     "totalTrainer",

~/.local/lib/python3.8/site-packages/scarches/trainers/scvi/trainers.py in <module>
      1 from typing import Union
----> 2 from scvi.core.trainers import UnsupervisedTrainer, SemiSupervisedTrainer, TotalTrainer
      3 
      4 
      5 class scVITrainer(UnsupervisedTrainer):

ModuleNotFoundError: No module named 'scvi.core'

This does not occur when using scvi-tools-0.8.1

Training of the model

Dear scArches team!

Thank you for already helping me a lot with getting the code to work.

I have attached a PDF of my notebook with the best model I was able to build. I am wondering if there are any parameters I can change to get a better prediction? I have already tried the different loss_fn options and increasing the complexity of the architecture.

About the data:
The reference data is from dorsal root ganglia and is single-nuclei sequenced with inDrops.
The query data is also from dorsal root ganglia but is single-cell sequenced with 10x.
The two datasets were made in different labs.

Do you think the two datasets are too different to get a better prediction?

Thanks again,

Best regards Sara
scArches-Renthal_as_ref_Avraham_as_query_cleaned_up (1).pdf

scarches.models has no attribute scgen

Hi,

I was following the tutorial on reference mapping using scGen, but got stuck at the initial step.
Here is the error:
[attached screenshot of the error]

And here is the code I ran:

import os
import sys
sys.path.insert(0, "../")

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import scanpy as sc
import scarches as sca
from scarches.dataset.trvae.data_handling import remove_sparsity
import matplotlib.pyplot as plt
import numpy as np
import gdown
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))

condition_key = 'study'
cell_type_key = 'cell_type'
target_conditions = ['Pancreas CelSeq2', 'Pancreas SS2']

epoch = 50

early_stopping_kwargs = {
    "early_stopping_metric": "val_loss",
    "patience": 20,
    "threshold": 0,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}
adata = sc.read('../pancreas.h5ad')
adata = remove_sparsity(adata) # remove sparsity
source_adata = adata[~adata.obs[condition_key].isin(target_conditions)].copy()
target_adata = adata[adata.obs[condition_key].isin(target_conditions)].copy()

network = sca.models.scgen(adata=source_adata,
                           hidden_layer_sizes=[256, 128])

Integration of datasets with different normalization methods

Good day!

I am trying to use your great tool to integrate several public single-cell datasets. The data has been generated using different platforms (10x, smart-seq2, microwell, fluidigm C1). However, the data matrix for each study has been shared using different normalization methods (raw counts, TPM, normalized by downscaling, log-normalized).

I tried to follow the same methodology shown in scIB (https://github.com/theislab/scib/blob/master/notebooks/data_preprocessing/pancreas/02_normalize_human_pancreas.ipynb): normalizing the data via scran and then log-transforming it (except for the TPM-normalized dataset). However, I still do not get a nice integration of the data. What would you recommend in this case? How could I assess the 'efficiency' of the integration?

Thanks in advance for the help!

question on query data and reference data

Thanks for your amazing work. I have some general questions regarding query data and reference data.

  1. In a setup where the query data is an 'all cell atlas' and the reference contains only cells from one organ: will scArches overfit the huge number of unknown cells to the reference after training? Is it epoch-sensitive?

  2. In a setup where the reference data is an 'all cell healthy atlas' and the query contains only cells from one organ plus tumor: I saw the COVID-19 case in the paper, and it seems to work well. Are there any tips on training in this setup?

  3. Have you tried to train between scRNA and snRNA?

  4. If I have enough GPUs, and I only have one reference and one query, shall I train them at the same time instead of 'fine-tuning'? I guess in this case there is no such thing as 'fine-tuning'. Then, in the setups I mentioned above, will overfitting be a problem if I train two very different datasets from scratch?

I will try to benchmark these questions myself, and it would be great to get some insights from you. Thanks!

Random seed

Hi, thank you for the great package!

Thanks for adding a random seed recently. However, there are still some parts of the workflow for which I couldn't find how to fix the randomness.

Currently, I have this:

import scarches as sca
import scanpy as sc
import matplotlib as mpl
import scipy.sparse
import numpy as np
import pandas as pd
import os

ds = sc.read_h5ad("../../data/31scarches-habermann/habermann.h5ad")
…
adata = ds[:, ds.var.highly_variable]
network = sca.models.scArches(task_name='habermann_ref',
                              x_dimension=adata.shape[1],
                              z_dimension=10,
                              architecture=[128, 128],
                              gene_names=adata.var_names.tolist(),
                              conditions=adata.obs[condition_key].unique().tolist(),
                              alpha=0.001,
                              loss_fn='sse',
                              model_path="./models/scArches/",
                              seed=1066)
network.train(adata,
              condition_key=condition_key,
              n_epochs=200,
              batch_size=128,
              save=True,
              retrain=False)
latent_adata = network.get_latent(adata, condition_key)
latent_adata_t = network.get_latent(adata, condition_key)

Then,

np.sum(latent_adata.X[:10, :10] == latent_adata.X[:10, :10])
np.sum(latent_adata.X[:10, :10] == latent_adata_t.X[:10, :10])

Outputs 100 and 0 respectively.

Is there a way to fix the result of get_latent call?

If you need it, I think I can create a reproducible small notebook, but it'll take time. Thank you

Exception: set of gene names in train adata are inconsistent with class' gene_names

Dear scArches team!

Thank you for this exciting package.

I am not a coder, but thanks to your tutorial and the notebook from Nikolay Markov et al. I have been able to create a model with my reference data. Unfortunately, when I try to project query data on top of my reference data with new_network.train(), I get the following error: Exception: set of gene names in train adata are inconsistent with class' gene_names

I have tried to make sure that the genes in the query dataset are also present in the reference dataset, but I still get the same error. I used this to make sure the query dataset (data_lin) doesn't have any extra genes compared to the reference dataset (data):
data_var_names=data.var_names
data_lin2 = data_lin[:, data_lin.var_names.isin(data_var_names)]

I have extra genes in the reference dataset that I don’t have in the query dataset, but I guess that is normal?

Here is the full traceback:

Exception Traceback (most recent call last)
in
5 batch_size=512,
6 save=True,
----> 7 retrain=False)

~\Anaconda\lib\site-packages\scarches\models\cvae.py in train(self, adata, condition_key, train_size, cell_type_key, n_epochs, batch_size, early_stop_limit, lr_reducer, n_per_epoch, score_filename, save, retrain, verbose)
754 return self._train_on_batch(adata, condition_key, train_size, cell_type_key, n_epochs, batch_size,
755 early_stop_limit, lr_reducer, n_per_epoch, score_filename, save, retrain,
--> 756 verbose)
757
758 def _fit(self, adata,

~\Anaconda\lib\site-packages\scarches\models\scarches.py in _train_on_batch(self, adata, condition_key, train_size, cell_type_key, n_epochs, batch_size, early_stop_limit, lr_reducer, n_per_epoch, score_filename, save, retrain, verbose)
423 train_adata = train_adata[:, self.gene_names]
424 else:
--> 425 raise Exception("set of gene names in train adata are inconsistent with class' gene_names")
426
427 if set(self.gene_names).issubset(set(valid_adata.var_names)):

Exception: set of gene names in train adata are inconsistent with class' gene_names
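
A hedged diagnostic for this check (names taken from the snippets above; per the traceback, train() requires the model's gene_names to be a subset of the adata's var_names):

# Which reference genes does the query lack?
ref_genes = set(new_network.gene_names)
missing = ref_genes - set(data_lin2.var_names)
print(len(missing), "reference genes are missing from the query")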

Thank you very much for your help,

Best regards Sara

Scarches refactor controlling

It would be cool if another pair of eyes could check the functionality of my refactor branch. If everything works, I can push to master and release a new version.
marco/trvae_refactor is the name of the branch.

KL scaling parameter

This is a really cool project! I was taking a look at your paper and found that you used a fixed scaling parameter on the KL term of the ELBO. I'm curious: did you do any benchmarking on whether this led to better performance? In scVI we actually warm up the KL scaling term so that it reaches 1.0 by the end of training. We have options to warm it up per epoch or per minibatch. In terms of integration, it's possible you'd get better results, as part of the correction comes from the independence of the prior: p(z | s) = p(z).
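
A minimal sketch of the warm-up schedule being described (scVI-style linear annealing; n_warmup_epochs is illustrative):

def kl_weight(epoch, n_warmup_epochs=400):
    # Linearly anneal the KL scaling term from 0 to 1 over the warm-up period.
    return min(1.0, epoch / n_warmup_epochs)

# loss = recon_loss + kl_weight(epoch) * kl_loss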


genes to include for query adata

Hi team,
This could be somewhat related to my previous issue, but it may be easier to track in a separate one.
It seems that to train a new network based on a pre-trained scArches model, the query data set needs to have the same genes as the reference data set; otherwise it errors. Is there a way to shrink or expand the gene sets for training? For now, I've added zeros for genes that are missing in my query data set but present in the reference, to avoid the error message. I can also imagine that in some cases there will be genes with important features for the query data sets that were not captured in the original training with the reference data sets. What should the best practice be?
Thanks,
Xiao

Issue with the Semi-supervised surgery pipeline with scANVI

Hello,

I have already used scArches when you have just released it and I thank you again for this great tool.

I tried to use your latest version with the scANVI model, but I encountered issues at the step of training the model on the fully labelled reference dataset.

It seems that none of the arguments given to vae.train is recognized by the function:

vae.train(
    n_epochs_unsupervised=vae_epochs,
    n_epochs_semisupervised=scanvi_epochs,
    unsupervised_trainer_kwargs=dict(early_stopping_kwargs=early_stopping_kwargs),
    semisupervised_trainer_kwargs=dict(metrics_to_monitor=["elbo", "accuracy"],
                                       early_stopping_kwargs=early_stopping_kwargs_scanvi),
    frequency=1
)

INFO     Training for 400 epochs.                                                            
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-d682f12d66fe> in <module>
      5     semisupervised_trainer_kwargs=dict(metrics_to_monitor=["elbo", "accuracy"],
      6                                        early_stopping_kwargs=early_stopping_kwargs_scanvi),
----> 7     frequency=1
      8 )

/mnt/DD/Sc RNA-Seq/Cortex_env/lib/python3.6/site-packages/scvi_tools-0.9.0a2-py3.6.egg/scvi/model/_scanvi.py in train(self, max_epochs, n_samples_per_label, check_val_every_n_epoch, train_size, validation_size, batch_size, use_gpu, plan_kwargs, **kwargs)
    273             callbacks=sampler_callback,
    274             check_val_every_n_epoch=check_val_every_n_epoch,
--> 275             **kwargs,
    276         )
    277         if len(self.validation_indices_) != 0:

/mnt/DD/Sc RNA-Seq/Cortex_env/lib/python3.6/site-packages/scvi_tools-0.9.0a2-py3.6.egg/scvi/lightning/_trainer.py in __init__(self, gpus, benchmark, flush_logs_every_n_steps, check_val_every_n_epoch, max_epochs, default_root_dir, checkpoint_callback, num_sanity_val_steps, weights_summary, early_stopping, early_stopping_monitor, early_stopping_min_delta, early_stopping_patience, early_stopping_mode, progress_bar_refresh_rate, simple_progress_bar, logger, **kwargs)
    120             logger=logger,
    121             progress_bar_refresh_rate=progress_bar_refresh_rate,
--> 122             **kwargs,
    123         )
    124 

/mnt/DD/Sc RNA-Seq/Cortex_env/lib/python3.6/site-packages/pytorch_lightning-1.2.0rc0-py3.6.egg/pytorch_lightning/trainer/connectors/env_vars_connector.py in overwrite_by_env_vars(self, *args, **kwargs)
     39 
     40         # all args were already moved to kwargs
---> 41         return fn(self, **kwargs)
     42 
     43     return overwrite_by_env_vars

TypeError: __init__() got an unexpected keyword argument 'n_epochs_unsupervised'

I tried to change the arguments (max_epochs instead of n_epochs...) but it is not working.

How can I fix these issues?

I thank you in advance,
Best regards

Runtime error

Hi,

I am using scArches to project and integrate query datasets on top of a reference. It works well up to training on the reference dataset, but training on the query dataset gives me a runtime error...

model = sca.models.SCANVI.load_query_data(
    query_adata,
    ref_path,
    freeze_dropout=True,
)
model._unlabeled_indices = np.arange(query_adata.n_obs)
model._labeled_indices = []
print("Labelled Indices: ", len(model._labeled_indices))
print("Unlabelled Indices: ", len(model._unlabeled_indices))

INFO Using data from adata.X
INFO Computing library size prior per batch
INFO Registered keys:['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels']
INFO Successfully registered anndata object containing 1099 cells, 4102 vars, 28 batches,
88 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0
extra continuous covariates.

RuntimeError Traceback (most recent call last)
in
2 query_adata,
3 ref_path,
----> 4 freeze_dropout = True,
5 )
6 model._unlabeled_indices = np.arange(query_adata.n_obs)

~/anaconda3/envs/epigenf2/lib/python3.7/site-packages/scvi/core/models/archesmixin.py in load_query_data(cls, adata, reference_model, inplace_subset_query_vars, use_cuda, unfrozen, freeze_dropout, freeze_expression, freeze_decoder_first_layer, freeze_batchnorm_encoder, freeze_batchnorm_decoder, freeze_classifier)
116 else:
117 dim_diff = new_ten.size()[-1] - load_ten.size()[-1]
--> 118 fixed_ten = torch.cat([load_ten, new_ten[..., -dim_diff:]], dim=-1)
119 load_state_dict[key] = fixed_ten
120

RuntimeError: Sizes of tensors must match except in dimension 1. Got 77 and 88 in dimension 0 (The offending index is 1)

while the query_adata has 73 labels (even if I select only 77 labels, I still get the same error with 99 labels)...
I'll be grateful for any help!

Best,
