kevinmenden / scaden

Deep Learning based cell composition analysis with Scaden.

Home Page: https://scaden.readthedocs.io

License: MIT License

Python 99.33% Dockerfile 0.67%

bioinformatics cell-composition-analysis deconvolution deep-learning machine-learning rna-seq single-cell-rna-seq

scaden's Issues

Bulk simulation

Hi,

I processed the scRNA-seq dataset(s) that I want to use for training. I used Seurat for this and obtained cell type labels. I then created two input files (_norm_counts_all.txt for the count data, _celltypes.txt for the cell type labels). But when I run bulk_simulation.py to do the bulk simulation, I get an error: IndexError: list index out of range. I don't think my files have problems. What is the problem?

Error: failed to generate example data using scaden example

I want to generate the files "example_counts.txt", "example_celltypes.txt" and "example_bulk_data.txt" in "example_data" using scaden example, but the following error occurs.

(lnc) [liziyue@login01 ~]$ scaden example
2020-12-29 01:06:39.852508: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Segmentation fault

Can anybody give me some advice about this error?
Thanks a lot!

Can you release the processed real bulk data for model validation?

Hi Kevin,
I came across your paper and found it really useful for dealing with bulk RNA-seq data, which is exactly what I need. Could you release the processed real bulk data listed in Table S2 (PBMC1, PBMC2, Xin, ROSMAP, and Ascites)? That way I could explore the existing real bulk RNA-seq data first and repeat your experiments on it.

Thanks in advance.
Best,
Fan

numpy.core._exceptions.MemoryError: Unable to allocate array with shape (24012, 10716) and data type float32

I am testing Scaden using the demo data. But when I run the training step, I get an error like this: "Training on: ['data8k', 'donorA', 'donorC']
Training M256 Model ...
2019-11-04 17:59:27.410629: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410714: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410746: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410766: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410791: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/bin/scaden", line 125, in
cli()
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 764, in call
return self.main(*args, **kwargs)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/bin/scaden", line 63, in train
num_steps=steps)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/scaden_main.py", line 64, in training
cdn256.train(input_path=data_path, train_datasets=train_datasets)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/model/scaden.py", line 252, in train
self.build_model(input_path=input_path, train_datasets=train_datasets, mode="train")
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/model/scaden.py", line 213, in build_model
self.load_h5ad_file(input_path=input_path, batch_size=self.batch_size, datasets=train_datasets)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/model/scaden.py", line 158, in load_h5ad_file
raw_input = raw_input[raw_input.obs['ds'] != ds].copy()
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 1230, in getitem
return self._getitem_view(index)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 1234, in _getitem_view
return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 561, in init
self._init_as_view(X, oidx, vidx)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 628, in _init_as_view
self._init_X_as_view()
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 641, in _init_X_as_view
X = self._adata_ref._X[self._oidx, self._vidx]
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (24012, 10716) and data type float32
"

I googled the error. I am sure I have enough memory, and I am using the CPU build of TensorFlow. I am not root, so I cannot apply a fix like "echo 1 > /proc/sys/vm/overcommit_memory". Is the problem really a MemoryError? The Scaden version is 0.9.0. Could someone help me? Thanks in advance.
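As a rough check of the size involved, the array named in the error message can be sized from its shape and dtype alone. The sketch below is only arithmetic on the numbers in the traceback; it says nothing about how much memory the rest of the process already holds or about any per-user limits on the cluster.

```python
# Shape and dtype taken from the MemoryError message above.
n_rows, n_cols = 24012, 10716
bytes_per_float32 = 4

n_elements = n_rows * n_cols
n_bytes = n_elements * bytes_per_float32

# Report the size of this single array in GiB.
print(f"{n_elements:,} elements, {n_bytes:,} bytes, {n_bytes / 2**30:.2f} GiB")
```

The failing line in the traceback copies a subset of the training data, so this allocation comes on top of the data that is already loaded.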

Implement testing via GitHub actions

Implement testing and building of docker images via GitHub actions instead of Travis or manual building.

Availability of sc-rnaseq data

Hi,
I would like to know whether the scRNA-seq data (already clustered and labeled) given to CibersortX, for example for the 6k PBMC and 8k PBMC datasets, are available somewhere.
Thanks for your article and your great work.

Error when running simulate

Hi,
I am trying to run scaden simulate on my data.
This is how my data currently looks:
data_counts.txt: 2627 rows (cell types) and 26405 columns (genes)
data_celltypes.txt: 2627 cell type labels corresponding to the rows in data_counts.txt.

When I run the command

scaden simulate --data data_scaden/ -n 100 --pattern "*_counts.txt"

I get the following output and the .h5ad file is not created:

Datasets: ['data']
Loading data dataset ...
Index(['Celltype'], dtype='object')
CRITICAL:root:No common genes found. Exiting.

Please let me know if I am going wrong somewhere, and how I can fix this.

Thanks,
RK

Data Scaden required

Hi Kevin!
I found that the example data you provided are non-zero integers. Does Scaden require raw counts as input? Did you apply any preprocessing step to remove zeros?
Ruan

Scaden simulate error: KeyError: 'Celltype'

Hi,

Thanks for your great work on scaden.

I've recently been trying to use Scaden to estimate cell fractions. With the newly released Scaden v0.9.5, I ran the 'scaden simulate' command, but it resulted in the same error, indicating that something is wrong with 'Celltype'.

Log/Error is:
`Datasets: ['']
Loading dataset ...
No. of common genes: 14003
Merging unknown cell types: ['unknown']
Traceback (most recent call last):
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Celltype'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/bin/scaden", line 8, in
sys.exit(main())
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/main.py", line 31, in main
cli()
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/main.py", line 211, in simulate
simulation(
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/preprocessing/simulate.py", line 13, in simulation
simulate_bulk(
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/preprocessing/bulk_simulation.py", line 308, in simulate_bulk
ys[i] = merge_unkown_celltypes(ys[i], unknown_celltypes)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/preprocessing/bulk_simulation.py", line 194, in merge_unkown_celltypes
celltypes = list(y["Celltype"])
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in getitem
indexer = self.columns.get_loc(key)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
raise KeyError(key) from err
KeyError: 'Celltype'
`

My command for 'scaden simulate' is:
scaden simulate --cells 150 --n_samples 1000 --data ../ --pattern _.txt

The count data file '_.txt' was generated following the normalization steps of the example in your Jupyter notebook and contains 14003 genes (columns) and 2298 cells (rows). It is formatted as follows (I also tried replacing the row names with barcodes, and setting genes as rows and cells as columns, but the error was the same):

The _celltypes.txt file has the following format (2298 rows; I was not sure about the 'n x 2' format mentioned in your tutorial, or whether to add column and row names, so I tried adding column names and/or row names to this file, but the error appears to be the same):

I suppose something is wrong with the format, but I have tried several times and couldn't figure out the reason.
May I have your suggestion on this issue? Thanks a lot!

Marcus

Bulk generation

Hi,

I was wondering how I can obtain the scripts for generating bulk data from my single-cell data (bulk_generation.py)?

Best,
Ida

Bump conda

Hi! Would you consider bumping the bioconda package? It doesn't really fit the documentation anymore. Thanks!

Cheers,
Rasmus

simulate and preprocess

Hi Kevin,

I am confused about 'simulate' and 'process'. A simple end-to-end example, starting from count matrices and going all the way to prediction, with examples of all input, intermediate, and output files, would help a lot.

Practically, my inputs are

test_celltypes.txt ##celltypes of single cells
test_counts.txt ##raw read counts of single cells
bulk.txt ##raw counts from bulk RNA-seq (3' sequencing, so no gene length normalization needed); genes are the same and in the same order as in test_counts.txt

Is this the correct pipeline?

scaden simulate --cells 100 --n_samples 32000 --data ./ --pattern '*_counts.txt' 
scaden train data.h5ad --steps 20000
scaden predict bulk.txt

This runs without error messages, but the predictions are grossly wrong, so I wondered whether I missed something. The single cells and the bulk come from the same piece of tissue, so I am 100% sure the single cells match the cells within the bulks.

I was surprised that replacing 'bulk.txt' with its log2-transformed and [0,1]-scaled version yields the exact same prediction. Is that expected?

I tried to run 'process':

scaden process data.h5ad bulk.txt

It completed, but then 'train' generated an error (see below). Is 'process' actually needed after 'simulate' if my single-cell and bulk matrices have the same genes in the same order?

Another question: my bulks are actually mini-bulks of 10-100 cells with very low coverage. What parameters would you advise for 'simulate'?

Again, thank you so much for your help.

Vincent

scaden train processed.h5ad 
2020-12-18 18:32:19.194073: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-12-18 18:32:19.194143: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

     ____                _            
    / ___|  ___ __ _  __| | ___ _ __  
    \___ \ / __/ _` |/ _` |/ _ \ '_ \ 
     ___) | (_| (_| | (_| |  __/ | | |
    |____/ \___\__,_|\__,_|\___|_| |_|

    
Training M256 Model ...
2020-12-18 18:32:20.448871: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-12-18 18:32:20.448929: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-12-18 18:32:20.448954: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7603195280b0): /proc/driver/nvidia/version does not exist
2020-12-18 18:32:20.449306: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-18 18:32:20.456092: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3312000000 Hz
2020-12-18 18:32:20.456259: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5af1120 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-18 18:32:20.456304: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py:353: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py:354: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py:356: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/scaden/model/scaden.py:53: dense (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/legacy_tf_layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/scaden/model/scaden.py:54: dropout (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
2020-12-18 18:32:33.561962: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 969984000 exceeds 10% of free system memory.
2020-12-18 18:32:33.833938: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 969984000 exceeds 10% of free system memory.
Model parameters restored successfully
  0%|                                                                     | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [128,7578], In[1]: [11994,256]
	 [[{{node scaden_model/dense1/MatMul}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/scaden", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/scaden/__main__.py", line 31, in main
    cli()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/scaden/__main__.py", line 90, in train
    training(data_path=data_path,
  File "/usr/local/lib/python3.8/dist-packages/scaden/scaden/training.py", line 63, in training
    cdn256.train(input_path=data_path, train_datasets=train_datasets)
  File "/usr/local/lib/python3.8/dist-packages/scaden/model/scaden.py", line 285, in train
    _, loss, summary = self.sess.run([self.optimizer, self.loss, self.merged_summary_op])
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 957, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1180, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [128,7578], In[1]: [11994,256]
	 [[node scaden_model/dense1/MatMul (defined at /lib/python3.8/dist-packages/scaden/model/scaden.py:53) ]]

Errors may have originated from an input operation.
Input Source operations connected to node scaden_model/dense1/MatMul:
 IteratorGetNext (defined at /lib/python3.8/dist-packages/scaden/model/scaden.py:234)

Original stack trace for 'scaden_model/dense1/MatMul':
  File "/bin/scaden", line 8, in <module>
    sys.exit(main())
  File "/lib/python3.8/dist-packages/scaden/__main__.py", line 31, in main
    cli()
  File "/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/lib/python3.8/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/lib/python3.8/dist-packages/scaden/__main__.py", line 90, in train
    training(data_path=data_path,
  File "/lib/python3.8/dist-packages/scaden/scaden/training.py", line 63, in training
    cdn256.train(input_path=data_path, train_datasets=train_datasets)
  File "/lib/python3.8/dist-packages/scaden/model/scaden.py", line 265, in train
    self.build_model(input_path=input_path, train_datasets=train_datasets, mode="train")
  File "/lib/python3.8/dist-packages/scaden/model/scaden.py", line 245, in build_model
    self.logits = self.model_fn(X=self.x, n_classes=self.n_classes)
  File "/lib/python3.8/dist-packages/scaden/model/scaden.py", line 53, in model_fn
    layer1 = tf.compat.v1.layers.dense(X, units=self.hidden_units[0], activation=activation , name="dense1")
  File "/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.8/dist-packages/tensorflow/python/keras/legacy_tf_layers/core.py", line 187, in dense
    return layer.apply(inputs)
  File "/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 1701, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/lib/python3.8/dist-packages/tensorflow/python/keras/legacy_tf_layers/base.py", line 547, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 776, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 255, in wrapper
    return converted_call(f, args, kwargs, options=options)
  File "/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 532, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 339, in _call_unconverted
    return f(*args, **kwargs)
  File "/lib/python3.8/dist-packages/tensorflow/python/keras/layers/core.py", line 1193, in call
    return core_ops.dense(
  File "/lib/python3.8/dist-packages/tensorflow/python/keras/layers/ops/core.py", line 53, in dense
    outputs = gen_math_ops.mat_mul(inputs, kernel)
  File "/lib/python3.8/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5640, in mat_mul
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3477, in _create_op_internal
    ret = Operation(
  File "/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
    self._traceback = tf_stack.extract_stack()

Recommended #s for cells and # of samples

Hello,

I am wondering if you have specific recommendations for the # of cells to include in each sample, and for how many samples to create? I see the defaults are 100 cells in 8000 samples -- in your experience, is this sufficient for most use-cases? In your manuscript, it looks like you used 500 cells per sample, but the total # of samples seemed to vary quite a bit between datasets.

Thanks for your help!
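To make the two parameters in this question concrete: the number of cells is how many single-cell profiles are combined into one artificial bulk sample, and the number of samples is how many such profiles are generated. The sketch below illustrates the idea in plain Python, assuming the usual scheme of summing single-cell profiles drawn according to random cell type fractions. It is not Scaden's implementation; `simulate_sample`, the toy data, and the way fractions are drawn are assumptions for illustration only.

```python
import random

def simulate_sample(cells_by_type, n_cells, rng):
    """Build one artificial bulk sample (hypothetical helper).

    cells_by_type maps a cell type to a list of expression vectors,
    one list of per-gene counts per cell.
    """
    types = sorted(cells_by_type)
    # Draw random weights and normalise them into fractions that sum to 1.
    weights = [rng.random() for _ in types]
    total = sum(weights)
    fractions = [w / total for w in weights]

    # Turn the fractions into per-type cell counts that sum to n_cells.
    counts = [int(f * n_cells) for f in fractions]
    counts[0] += n_cells - sum(counts)  # give the rounding remainder to one type

    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0] * n_genes
    for cell_type, k in zip(types, counts):
        # Sample with replacement so small populations can still fill the quota.
        for cell in rng.choices(cells_by_type[cell_type], k=k):
            bulk = [b + c for b, c in zip(bulk, cell)]

    # The realised fractions serve as the training labels for this sample.
    realised = {t: k / n_cells for t, k in zip(types, counts)}
    return bulk, realised

rng = random.Random(0)
toy = {
    "B": [[5, 0, 1], [4, 1, 0]],
    "T": [[0, 7, 2], [1, 6, 3]],
}
# Eight samples of 100 cells each; real runs use far more samples.
samples = [simulate_sample(toy, n_cells=100, rng=rng) for _ in range(8)]
print(len(samples), samples[0][1])
```

In this picture, more cells per sample smooths each simulated profile, while more samples gives the network more distinct fraction combinations to learn from.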

Refactor with TF 2.0

It would be good to refactor the code properly and make use of TF 2. This will require some restructuring of the code.

Once this is done and everything is working properly we can do the 1.0.0 release.

"scaden example --out example_data/" doesn't work

Hello Kevin,

After installing Scaden via pip (I am using macOS), I tried running it on the example data, but it gives an error saying:

Error: No such command 'example'.

Could you suggest why this is happening?

deprecated `as_matrix` in create_h5ad_file.py

Using the script create_h5ad_file.py I get the error

Traceback (most recent call last):
  File "Scaden/create_h5ad_file.py", line 109, in <module>
    adata.append(anndata.AnnData(X=x.as_matrix(),
  File "/home/luca/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 5136, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'as_matrix'

However, substituting .as_matrix() with .values fixes the problem.

scaden/scaden/preprocessing/create_h5ad_file.py

Line 109 in b2fd12d

adata.append(anndata.AnnData(X=x.as_matrix(),
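For context, `DataFrame.as_matrix()` was deprecated in pandas 0.23 and removed in pandas 1.0, which is why the attribute lookup fails on newer installations. The sketch below shows the two replacements on a made-up table; the variable `x` mirrors the name in the line above, and the values are illustrative only.

```python
import pandas as pd

# Toy count table: rows are cells, columns are genes (made-up values).
x = pd.DataFrame({"geneA": [1.0, 0.0], "geneB": [3.0, 2.0]})

# Both of the following return the underlying NumPy array,
# which is what as_matrix() used to provide.
as_values = x.values
as_numpy = x.to_numpy()  # available since pandas 0.24

print(as_values.shape)
```

Either form can be passed as `X=` in place of `x.as_matrix()`.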

PBMC2 (monaco) dataset in pbmc_data.h5ad

Hello, I downloaded pbmc_data.h5ad and it has 4 simulated datasets and 2 real datasets named sdy67 and GSE65133.
Could you please also release the PBMC2 dataset in the pbmc_data.h5ad?

Best,
Minji

Dataset recognition is not robust

When dataset names contain multiple underscores, Scaden fails to read the individual datasets correctly, as it simply splits on underscores.

This has to be made more robust.
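One possible way to make this robust is to strip a known file suffix instead of splitting on underscores. The sketch assumes every count file ends in `_counts.txt`, the pattern used with `scaden simulate` elsewhere on this page. `dataset_name` is a hypothetical helper, not a function from Scaden.

```python
from pathlib import Path

def dataset_name(path, suffix="_counts.txt"):
    """Return the dataset name for a count file, keeping inner underscores."""
    filename = Path(path).name
    if not filename.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    return filename[: -len(suffix)]

print(dataset_name("data/pbmc_donor_A_counts.txt"))  # pbmc_donor_A
# A naive split keeps only the first piece of the name:
print("pbmc_donor_A_counts.txt".split("_")[0])       # pbmc
```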

Order of genes in prediction data

Hi,

Maybe I have missed this, but I couldn't find it in the code, so apologies!

Can you confirm that the prediction gene matrix is re-indexed so that its gene order matches the gene order of the training data matrix?

Thanks!
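The order matters because the first layer of a dense network ties each input position to one gene, so a prediction matrix in a different order would feed the wrong gene into each weight. Whether Scaden performs this step is the question being asked here. The sketch below only shows what such a re-indexing could look like, with hypothetical names (`align_to_training_order`) and toy data where genes are keys and values are per-sample counts.

```python
def align_to_training_order(prediction, training_genes):
    """Return the shared genes in training order with their prediction rows."""
    shared = [gene for gene in training_genes if gene in prediction]
    return shared, [prediction[gene] for gene in shared]

training_genes = ["CD3E", "CD19", "MS4A1"]
# Prediction data in a different order, with one extra and one missing gene.
prediction = {"MS4A1": [4, 0], "CD3E": [9, 2], "GAPDH": [50, 61]}

genes, rows = align_to_training_order(prediction, training_genes)
print(genes)  # ['CD3E', 'MS4A1']
print(rows)   # [[9, 2], [4, 0]]
```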

Improve Documentation

The documentation needs more detail. Things that are currently missing:

gene identifiers used in the training datasets
link to a docker file
proper documentation of how to create training datasets

Curious about generating samples and labels from existing data

Hi,

Great work on the SCADEN paper. I was just curious whether there is a way to generate my own samples with labels, as shown in the example file 'data6k_500_.txt'?
Is prediction only possible on the data attached here? - https://figshare.com/articles/code/Publication_Figures/8234030?file=17855789

Thanks!

Normalization of the training expression data

Hi Kevin Menden,

Thank you for your excellent work.

I was wondering what kind of normalization was applied to the values in the .X data matrix of the .h5ad file in the PBMC training expression dataset (https://scaden.readthedocs.io/en/latest/datasets/): TPM, RPKM, or log-transformed counts?

Thank you for your time.

Yi Han

Automatic cell type labeling for easier data preparation

Currently the biggest hurdle for using Scaden is preparing the training data.

This could be substantially eased by developing a semi-automatic (or maybe fully automatic) pipeline that converts raw scRNA-seq datasets into training data.

The most important part of this is an automatic cell type identification tool, because cell type labeling is the main manual step at the moment.

Make Scaden runs reproducible

Owing to randomness in batching, initialization, and GPU computing, different training runs with Scaden give different results.
This should be fixed by setting the seeds manually and switching to deterministic GPU behaviour.
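A minimal sketch of the seeding part is shown below. It covers only Python's and NumPy's generators. The TensorFlow call appears as a comment because it depends on the installed version, and deterministic GPU behaviour is not addressed here. `set_seeds` is a hypothetical helper, not part of Scaden.

```python
import random

def set_seeds(seed):
    """Seed the random number generators that are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    # With TensorFlow 2.x one would additionally call:
    #     tf.random.set_seed(seed)

set_seeds(42)
first = random.random()
set_seeds(42)
second = random.random()
print(first == second)  # True
```

Seeding alone fixes batching and initialization only if every source of randomness in the pipeline draws from one of the seeded generators.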

Problems with pattern finding in 'scaden simulate'

scaden simulate sometimes can't find the files even if the pattern is specified correctly.

Appending a / solves this problem.
os.path.join() doesn't solve it.

Need to implement a more robust way of finding the data.
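The underlying cause is not stated in this issue. One common way to make file discovery independent of a trailing slash is to let `pathlib` normalise the directory and do the globbing, as sketched below. `find_count_files` and the file names are hypothetical; the default pattern is the one used with `scaden simulate` elsewhere on this page.

```python
import tempfile
from pathlib import Path

def find_count_files(data_dir, pattern="*_counts.txt"):
    """Return files in data_dir matching pattern, sorted by name."""
    # Path() normalises the directory, so "data" and "data/" behave the same.
    return sorted(Path(data_dir).glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    # Create one matching and one non-matching file in a scratch directory.
    (Path(tmp) / "pbmc_counts.txt").write_text("")
    (Path(tmp) / "pbmc_celltypes.txt").write_text("")
    without_slash = find_count_files(tmp)
    with_slash = find_count_files(tmp + "/")

print([p.name for p in without_slash])  # ['pbmc_counts.txt']
print(without_slash == with_slash)      # True
```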

How to repeat the experiments in your paper

Hi Kevin,

I am new to scRNA-seq deconvolution. I noticed that there are many interesting experiments in your paper. However, from the code and datasets you provide, I don't know how to reproduce your experimental results (the current code and web tool are more like tools for people to use on their own datasets). For example, the first experiment leaves one PBMC dataset out for validation and uses the others as training data. I guess all four PBMC datasets are combined in the pbmc_data.h5ad file you provide. How can I split the dataset to perform the experiments? Also, for the real tissue datasets PBMC1 and PBMC2, can you provide the processed datasets used in your experiments? I don't know which file to download. I would appreciate detailed instructions on how to reproduce your experiments, along with all the datasets used in your paper.

Thanks a lot for your time.

Best,
Xiaohan

Problem with Scaden process

Hi! I'm back with a new issue. While trying to run the Scaden process command I get this error:

p209-114:preprocessing idala384$ scaden process barcoding.h5ad DMSO_3.txt

     ____                _            
    / ___|  ___ __ _  __| | ___ _ __  
    \___ \ / __/ _` |/ _` |/ _ \ '_ \ 
     ___) | (_| (_| | (_| |  __/ | | |
    |____/ \___\__,_|\__,_|\___|_| |_|

    
Found 0 common genes.
Pre-processing raw data ...
Subsetting genes ...
Traceback (most recent call last):
  File "/usr/local/bin/scaden", line 13, in <module>
    main()
  File "/usr/local/lib/python3.8/site-packages/scaden/cli.py", line 137, in main
    cli()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/scaden/cli.py", line 121, in process
    processing(data_path=prediction_data,
  File "/usr/local/lib/python3.8/site-packages/scaden/scaden_main.py", line 165, in processing
    preprocess_h5ad_data(raw_input_path=training_data,
  File "/usr/local/lib/python3.8/site-packages/scaden/model/functions.py", line 60, in preprocess_h5ad_data
    raw_input = raw_input[:, sig_genes]
  File "/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py", line 1088, in __getitem__
    return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
  File "/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py", line 305, in __init__
    self._init_as_view(X, oidx, vidx)
  File "/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py", line 348, in _init_as_view
    self._varm = adata_ref.varm._view(self, (vidx,))
  File "/usr/local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 92, in _view
    return self._view_class(self, parent, subset_idx)
  File "/usr/local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 246, in __init__
    self.dim_names = parent_mapping.dim_names[subset_idx]
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4111, in __getitem__
    result = getitem(key)
IndexError: arrays used as indices must be of integer (or boolean) type

I guess the problem is that it cannot find any common genes between the training data and prediction data. I've formatted the prediction files as you described (genes as rows, samples as columns, tab-delimited). Any ideas?
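A quick way to test that guess is to intersect the gene names of the two inputs before running scaden process. The sketch below is not part of Scaden; the helper shared_genes and the gene names are hypothetical, and it assumes the bulk file has genes as rows and samples as columns, as described above.

```python
import io
import pandas as pd

def shared_genes(training_genes, bulk):
    """Return the gene names present in both the training data and the bulk data."""
    return sorted(set(training_genes) & set(bulk.index))

# Hypothetical tab-delimited bulk file: genes as rows, samples as columns.
bulk_text = "gene\tsample1\tsample2\nENSG01\t5\t7\nENSG02\t0\t3\n"
bulk = pd.read_csv(io.StringIO(bulk_text), sep="\t", index_col=0)

# Hypothetical training genes that use symbols instead of Ensembl-style IDs.
training_genes = ["CD3E", "CD19"]
overlap = shared_genes(training_genes, bulk)
print(f"{len(overlap)} shared genes")
```

An empty overlap at this stage would point to mismatched gene identifiers rather than to the file layout.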

Error when installing Scaden

Hello,

I am trying to install Scaden on my computer (macOS Mojave 10.14.6).

I tried installing it using pip (v19.3.1), but I have the following error:

ERROR: Could not find a version that satisfies the requirement tensorflow==1.12.1 (from scaden) (from versions: 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 1.15.0rc0, 1.15.0rc1, 1.15.0rc2, 1.15.0rc3, 1.15.0, 2.0.0a0, 2.0.0b0, 2.0.0b1, 2.0.0rc0, 2.0.0rc1, 2.0.0rc2, 2.0.0)
ERROR: No matching distribution found for tensorflow==1.12.1 (from scaden)

When trying to install it with conda, I also get an error:

UnsatisfiableError: The following specifications were found to be incompatible with each other:                                        

Package setuptools conflicts for:
python=3.7 -> pip -> setuptools
scaden -> tensorflow[version='<=1.13.0'] -> markdown[version='>=2.6.8'] -> setuptools[version='>=36']
scaden -> matplotlib -> setuptools
Package certifi conflicts for:
python=3.7 -> pip -> setuptools -> certifi[version='>=2016.09|>=2016.9.26']
Package python-dateutil conflicts for:
scaden -> click -> python-dateutil[version='>=2.5.*|>=2.6.1']
Package python conflicts for:
python=3.7
Package pip conflicts for:
python=3.7 -> pip
scaden -> python -> pip
Package wheel conflicts for:
python=3.7 -> pip -> wheel
scaden -> python -> pip -> wheel
Package ca-certificates conflicts for:
scaden -> matplotlib -> setuptools -> ca-certificates
python=3.7 -> openssl[version='>=1.0.2o,<1.0.3a'] -> ca-certificates

Thank you in advance for your help,
Clara

Enable multiprocessing for data simulation

Data simulation can be done with multiple threads/processes relatively easily.
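As a rough illustration of the idea, and not Scaden's actual simulation code, each artificial bulk sample can be generated independently and handed to a worker pool. The functions simulate_sample and simulate_parallel below are hypothetical names, and the sampling scheme (summing randomly drawn cells) is a simplifying assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def simulate_sample(seed, counts, n_cells=100):
    """Build one artificial bulk sample by summing randomly drawn cells."""
    rng = np.random.default_rng(seed)
    picked = rng.integers(0, counts.shape[0], size=n_cells)
    return counts[picked].sum(axis=0)

def simulate_parallel(counts, n_samples, workers=4):
    """Generate n_samples bulk samples, one independent task per sample."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        samples = list(pool.map(lambda seed: simulate_sample(seed, counts),
                                range(n_samples)))
    return np.vstack(samples)

# Toy single-cell matrix: 50 cells x 8 genes.
counts = np.random.default_rng(0).poisson(2.0, size=(50, 8))
bulk = simulate_parallel(counts, n_samples=10)
print(bulk.shape)
```

Using one seed per sample keeps the result reproducible regardless of scheduling. A ProcessPoolExecutor offers the same map interface, but it needs a picklable top-level function in place of the lambda.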

Variance cut-off in get_signature_genes( )

Hi Kevin,

Firstly, thanks for developing the package! The underlying concept is very cool.

Secondly, this is less of an issue and more of a heads-up. I've been working through your usage documentation in an attempt to apply scaden to my own data. One bug I ran into while running through the workflow was during the scaden process step. It turns out that one of the functions, get_signature_genes( ), filters the genes of the prediction samples by their within-gene variance (data.var(axis=1) > var_cutoff, where var_cutoff is set to 0.1 by default). As I was using normalised counts (raw counts/total library size) in my bulk data, which has been sequenced to a very high depth, the resulting within-gene variances were tiny and much smaller than the cut-off. For this reason, no common genes were found between my training data and prediction data, and scaden process didn't run. Once I scaled my normalised counts by multiplying by 10,000, I had no problem. It might be worth pointing this out to future users who might run into a similar issue.

All the best,
Regina
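The effect described in this report can be reproduced with a small sketch. Only the expression data.var(axis=1) > var_cutoff is taken from the report; the helper name genes_above_variance and the toy counts are assumptions made for illustration.

```python
import pandas as pd

def genes_above_variance(data, var_cutoff=0.1):
    """Keep genes (rows) whose variance across samples exceeds the cut-off."""
    keep = data.var(axis=1) > var_cutoff
    return list(data.index[keep])

# Toy raw counts: genes as rows, samples as columns.
raw = pd.DataFrame(
    [[100, 300, 200], [500, 400, 600], [400, 300, 200]],
    index=["G1", "G2", "G3"],
    columns=["S1", "S2", "S3"],
)

# Library-size normalisation gives fractions whose variances are far below 0.1.
fractions = raw / raw.sum(axis=0)
kept_fractions = genes_above_variance(fractions)

# Multiplying by 10,000 rescales the variances above the cut-off.
kept_scaled = genes_above_variance(fractions * 10_000)
print(len(kept_fractions), len(kept_scaled))
```

With fraction-scale values every gene is filtered out, which matches the "Found 0 common genes" symptom; after rescaling, all three toy genes pass the filter.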

Add proper error messages

Currently Scaden doesn't throw useful error messages.

Memory footprint too big for simulation of large datasets

When a lot of datasets are used for simulation (e.g. over 80), Scaden uses a lot of memory because every dataset is stored in memory.

This can be done better, by first iterating through the datasets to quickly get the common genes between them, and then subsampling every dataset separately.
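A minimal sketch of that two-pass approach, assuming tab-delimited count files with cells as rows and genes as columns. The helper names common_genes and iter_subsets and the toy file names are hypothetical and do not come from the Scaden code base.

```python
import os
import tempfile
import pandas as pd

def common_genes(paths):
    """Pass 1: read only each header row to intersect the gene names."""
    genes = None
    for path in paths:
        header = pd.read_csv(path, sep="\t", index_col=0, nrows=0).columns
        genes = set(header) if genes is None else genes & set(header)
    return sorted(genes)

def iter_subsets(paths, genes):
    """Pass 2: load one dataset at a time, restricted to the shared genes."""
    for path in paths:
        yield pd.read_csv(path, sep="\t", index_col=0)[genes]

# Two toy datasets written to a temporary directory.
tmpdir = tempfile.mkdtemp()
paths = []
for name, cols in [("a", ["G1", "G2", "G3"]), ("b", ["G2", "G3", "G4"])]:
    path = os.path.join(tmpdir, name + "_counts.txt")
    pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols).to_csv(path, sep="\t")
    paths.append(path)

genes = common_genes(paths)
shapes = [subset.shape for subset in iter_subsets(paths, genes)]
print(genes, shapes)
```

Because iter_subsets is a generator, only one dataset is held in memory at a time while it is subsampled.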

tensorflow vs tensorflow-gpu

When using pip to install scaden, scaden always pulls tensorflow and never tensorflow-gpu.

The real problem is that pip cannot deal with GPU variants, as described at tensorflow/tensorflow#7166. If pip could, tensorflow would not need two separate packages and everything would work as expected.

There are three possible scenarios:

BAD: We depend on tensorflow, and it is not possible to install scaden using the tensorflow-gpu package because they conflict. (this is the current status)
UGLY: We publish to pypi two packages scaden and scaden-gpu, depending on tensorflow and tensorflow-gpu respectively.
ALSO UGLY: We optionally depend on tensorflow and tensorflow-gpu. The instructions to install scaden through pip would be: pip install scaden[cpu] or pip install scaden[gpu]. pip install scaden would not install any tensorflow variant and lead to an error (we could warn the user then, though).

I can easily implement ALSO UGLY, but the community is split between the two ugly solutions. Do you have any preference, @KevinMenden ?
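For reference, the ALSO UGLY option could look roughly like the sketch below. It is an illustration only: the EXTRAS_REQUIRE table and the helper names are assumptions, not the project's actual setup.py. The table would be passed to setuptools.setup(extras_require=...), and the runtime check gives users a warning when neither variant is installed.

```python
import importlib.util
import warnings

# Hypothetical extras table for setuptools.setup(extras_require=EXTRAS_REQUIRE).
# It enables `pip install scaden[cpu]` and `pip install scaden[gpu]`.
EXTRAS_REQUIRE = {
    "cpu": ["tensorflow"],
    "gpu": ["tensorflow-gpu"],
}

def tensorflow_installed():
    """Both pip packages provide the same importable module, `tensorflow`."""
    return importlib.util.find_spec("tensorflow") is not None

def warn_if_tensorflow_missing():
    """Warn the user when a plain `pip install scaden` pulled no TensorFlow."""
    if tensorflow_installed():
        return False
    warnings.warn(
        "No TensorFlow found. Install with `pip install scaden[cpu]` "
        "or `pip install scaden[gpu]`."
    )
    return True
```

The check uses find_spec so that it does not import TensorFlow just to test for its presence.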

Normalization of input bulk data

Hi Kevin
What kind of normalization of the input bulk data is suitable for Scaden? Although normalization to sequencing depth is most common, isn't normalization to gene length more appropriate when estimating different cell types that have different marker genes?

mouse_brain.h5ad file

Hi there,

I'd like to use the training dataset for the mouse brain to slot into CibersortX as a signature matrix.

I tried opening the .h5ad file in Seurat so I could export the matrix with the phenotype labels (e.g. genes on rows, and cells on columns), but I ran into this error:

file =  ReadH5AD("C:/Users/James Hong/Desktop/CVT Paper/mouse_brain.h5ad")
Pulling expression matrices and metadata
Data is scaled
Error in ReadH5AD.H5File(file = hfile, assay = assay, layers = layers,  : 
  Seurat requires normalized data present in the raw slot when X is scaled

I think the raw slot is missing and only the scaled (perhaps SCTransform) data is available.

Could you help me with the right .h5ad file or perhaps generate the expression matrix above for CibersortX?

Thanks

Is L1 loss function correct?

I saw the L1 loss function in your code:
loss = tf.reduce_mean(input_tensor=tf.math.square(logits-targets))
at https://github.com/KevinMenden/scaden/blob/master/scaden/model/scaden.py#L67

Is it correct?

In my understanding, it should be the sum of absolute values, right?

This is the definition of the L1 norm:
https://mathworld.wolfram.com/L1-Norm.html
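The difference between the two quantities can be checked numerically. The NumPy sketch below mirrors the quoted line (mean of squared differences) next to a mean absolute error; the example vectors are made up for illustration.

```python
import numpy as np

logits = np.array([0.2, 0.5, 0.3])
targets = np.array([0.1, 0.6, 0.3])

# What the quoted line computes: the mean of squared differences (an L2-type loss).
mean_squared = np.mean(np.square(logits - targets))

# An L1-type loss averages absolute differences instead.
mean_absolute = np.mean(np.abs(logits - targets))

print(mean_squared, mean_absolute)
```

The two values differ for the same inputs, so the expression in the quoted line is a mean squared error rather than an L1 loss.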

predicting bulk RNA-seq

Hi,

I have trained scaden on a .h5ad file I produced.

I am looking to run scaden predict on bulk RNA-seq data if possible.
Could you tell me what format the bulk RNA-seq data would need to be in?

Regards

Jon8991

Problem creating h5ad-file

Hello again,

I've now managed to get bulk_generation.py to work on my data, but encounter an error when trying to create the h5ad-file. The error message is:

Celltypes: [1, 2, 3, 4, 5, 6]
Traceback (most recent call last):
  File "create_h5ad_file.py", line 98, in <module>
    y = sort_celltypes(y, labels, celltypes)
  File "create_h5ad_file.py", line 55, in sort_celltypes
    idx = [labels.index(x) for x in ref_labels]
  File "create_h5ad_file.py", line 55, in <listcomp>
    idx = [labels.index(x) for x in ref_labels]
ValueError: 1 is not in list

Does this tell you anything?
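One possible cause, not confirmed in this thread, is a type mismatch: list.index raises exactly this message when the reference labels are integers while the stored labels are strings (or the other way round). The sketch below uses made-up label lists to show the failure and a cast that avoids it.

```python
labels = ["1", "2", "3"]   # e.g. labels read from a text file as strings
ref_labels = [1, 2, 3]     # e.g. cell types parsed as integers

# Same lookup as in the traceback: fails because 1 != "1".
try:
    idx = [labels.index(x) for x in ref_labels]
    message = None
except ValueError as err:
    message = str(err)

# Casting both sides to str makes the lookup succeed.
labels_str = [str(x) for x in labels]
idx = [labels_str.index(str(x)) for x in ref_labels]
print(message, idx)
```

If this is the cause, naming the cell types with non-numeric strings in the input files would avoid the cast altogether.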

Scaden example sometimes breaks

No 'simulate' command in docker image? And doc for h5ad file content

Hello,

I am trying to run Scaden from the docker image on OSX.

The scaden command is found and version is 1.0.0.

Now, following the example from this page,
I run
scaden simulate --cells 100 --n_samples 1000 --data preprocessing
and got quite a few FutureWarnings and this error:

Usage: scaden [OPTIONS] COMMAND [ARGS]...
Try 'scaden --help' for help.

Error: No such command 'simulate'.

Is the container (installed with 'docker pull kevinmenden/scaden' last Friday, Dec 11th) up to date?

In addition, I'll have to write my own bulk simulator for my specific application. It would be very useful to get a description of what the h5ad file provided to train contains, and/or toy examples of inputs for create_h5ad_file.py. As an R programmer, this would also save me the trouble of writing the simulated bulk data to file, since I could generate the h5ad file directly.

Nevertheless, I was actually able run create_h5ad_file.py on the file I generated in R:

python ../../../scaden/scaden-master/scaden/preprocessing/create_h5ad_file.py --data ./preprocessing/ --out ./preprocessing/train.h5ad
Celltypes: ['c5', 'c11', 'c9', 'c16', 'c2', 'c14', 'c3', 'c18', 'c19', 'c20', 'c10', 'c12', 'c22', 'c17', 'c8', 'c21', 'c13', 'c7', 'c4', 'c1', 'c15', 'c6', 'c0']

No error message, and the file train.h5ad is created and non-empty. But when I next ran 'scaden train', I got an error:

scaden train ./preprocessing/train.h5ad --steps 20000    
Processing TESTSAMPLE
/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py:119: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
... storing 'ds' as categorical
(base) root@e4ff2c3d5805:/Volumes/VD-MISC-2TB/ST/experiments/PTC7SN1/scaden_proportion_model# scaden train ./preprocessing/train.h5ad --steps 20000
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/core/anndata.py:17: FutureWarning: pandas.core.index is deprecated and will be removed in a future version.  The public classes are available in the top-level namespace.
  from pandas.core.index import RangeIndex
/opt/conda/envs/scaden/lib/python3.6/site-packages/scanpy/api/__init__.py:6: FutureWarning: 

In a future version of Scanpy, `scanpy.api` will be removed.
Simply use `import scanpy as sc` and `import scanpy.external as sce` instead.

  FutureWarning
Training on: []
Training M256 Model ...
2020-12-12 09:41:01.301049: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-12-12 09:41:01.327354: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
  File "/opt/conda/envs/scaden/bin/scaden", line 125, in <module>
    cli()
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/envs/scaden/bin/scaden", line 63, in train
    num_steps=steps)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/scaden_main.py", line 64, in training
    cdn256.train(input_path=data_path, train_datasets=train_datasets)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/model/scaden.py", line 252, in train
    self.build_model(input_path=input_path, train_datasets=train_datasets, mode="train")
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/model/scaden.py", line 213, in build_model
    self.load_h5ad_file(input_path=input_path, batch_size=self.batch_size, datasets=train_datasets)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/model/scaden.py", line 151, in load_h5ad_file
    raw_input = sc.read_h5ad(input_path)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/readwrite/read.py", line 447, in read_h5ad
    constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/readwrite/read.py", line 502, in _read_args_from_h5ad
    return AnnData._args_from_dict(d)
  File "/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/core/anndata.py", line 2157, in _args_from_dict
    if key in d_true_keys[true_key].dtype.names:
AttributeError: 'dict' object has no attribute 'dtype'

Any idea?

All the best,

Vincent

Scaden test

Hi Kevin,

I am running some tests on Scaden. I found that after adding cells from other sources to the original scRNA-seq data, prediction performance drops considerably. Scaden's results report cell types that should not exist in the bulk data, and some of them even account for a large proportion.
In my test, I use PBMC scRNA-seq and whole-blood bulk RNA-seq as the test data. Tumor cells from LUAD (lung adenocarcinoma) were added to the PBMC scRNA-seq data.

Ruan

issues with the webtool

Hi Kevin

I am currently trying out the Scaden webtool. The demo run works well. However, when I run the webtool with my own data, it gives an error saying "Error preparing job directories and files". Do you know where the problem might be? Thanks for your help.

Best,
Yingjun

 scaden simulate --data ./ -n 3000 --pattern "*_counts.txt"
Normal samples: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [02:10<00:00, 11.47it/s]
Sparse samples: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [01:27<00:00, 17.23it/s

Is this wrong? I am confused because I did not see the number 3000 appear in the result.

Thank you so much for your help!
Ruan
