kevinmenden / scaden
Deep Learning based cell composition analysis with Scaden.
Home Page: https://scaden.readthedocs.io
License: MIT License
Hi,
I processed the scRNA-seq dataset(s) that I want to use for training. I used Seurat for this and obtained cell type labels. Then I created two input files (_norm_counts_all.txt for the count data, _celltypes.txt for the cell type labels). But when I run bulk_simulation.py to do the bulk simulation, I get an error: IndexError: list index out of range. I don't think there is anything wrong with my files. What is the problem?
I want to generate the files "example_counts.txt", "example_celltypes.txt" and "example_bulk_data.txt" in "example_data" by using scaden example, but the following error occurs.
(lnc) [liziyue@login01 ~]$ scaden example
2020-12-29 01:06:39.852508: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Segmentation fault
Can anybody give me some advice about this error?
Thanks a lot!
Hi Kevin,
I came across your paper and find it really useful for dealing with bulk RNAseq data, which is exactly what I need. I wonder if you could release the processed real bulk data listed in Table S2 (PBMC1, PBMC2, Xin, ROSMAP, and Ascites), so that I can explore the existing real bulk RNAseq data first and repeat your experiment on it.
Thanks ahead.
Best,
Fan
I am testing Scaden using the demo data. But when I run the training step, I get an error like this: "Training on: ['data8k', 'donorA', 'donorC']
Training M256 Model ...
2019-11-04 17:59:27.410629: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410714: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410746: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410766: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-11-04 17:59:27.410791: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/bin/scaden", line 125, in
cli()
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 764, in call
return self.main(*args, **kwargs)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/bin/scaden", line 63, in train
num_steps=steps)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/scaden_main.py", line 64, in training
cdn256.train(input_path=data_path, train_datasets=train_datasets)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/model/scaden.py", line 252, in train
self.build_model(input_path=input_path, train_datasets=train_datasets, mode="train")
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/model/scaden.py", line 213, in build_model
self.load_h5ad_file(input_path=input_path, batch_size=self.batch_size, datasets=train_datasets)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/scaden/model/scaden.py", line 158, in load_h5ad_file
raw_input = raw_input[raw_input.obs['ds'] != ds].copy()
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 1230, in getitem
return self._getitem_view(index)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 1234, in _getitem_view
return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 561, in init
self._init_as_view(X, oidx, vidx)
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 628, in _init_as_view
self._init_X_as_view()
File "/hpc/dhl_ec/kcui/tools/anaconda3/envs/myenv/lib/python3.6/site-packages/anndata/core/anndata.py", line 641, in _init_X_as_view
X = self._adata_ref._X[self._oidx, self._vidx]
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (24012, 10716) and data type float32
"
I googled the error. I am sure I have enough memory, and I am using the CPU build of TensorFlow. I am not root, so I cannot change the setting with "echo 1 > /proc/sys/vm/overcommit_memory". Is the problem really a MemoryError? The version of Scaden is 0.9.0. Could someone help me? Thanks in advance.
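For scale, the size of the allocation that failed can be worked out from the shape and dtype in the last line of the traceback. A back-of-the-envelope sketch in plain Python, nothing Scaden-specific:

```python
# Size of the array NumPy failed to allocate, taken from the MemoryError above.
rows, cols = 24012, 10716      # shape reported in the traceback
bytes_per_value = 4            # float32 uses 4 bytes per element

n_bytes = rows * cols * bytes_per_value
print(f"{n_bytes} bytes = {n_bytes / 2**30:.2f} GiB")   # 1029250368 bytes = 0.96 GiB
```

So one copy is just under 1 GiB, and the .copy() in load_h5ad_file needs that on top of the data already loaded. Whether a per-job memory limit on the cluster is what gets hit here is a guess; the log does not confirm it.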
Implement testing and building of docker images via GitHub actions instead of Travis or manual building.
Hi,
I would like to know if the scRNA-seq data (already clustered and labeled) given to CibersortX, for example the 6k PBMC and 8k PBMC datasets, are available somewhere?
Thanks for your article and your great work.
Hi ,
I am trying to run Scaden simulate for my data .
This is how my data looks currently :
data_counts.txt : 2627 rows (cell types ) and 26405 columns ( genes )
data_celltypes.txt : 2627 cell type labels corresponding to the rows in data_counts.txt.
When I run the command
scaden simulate --data data_scaden/ -n 100 --pattern "*_counts.txt"
I get the following output and the .h5ad file is not created:
Datasets: ['data']
Loading data dataset ...
Index(['Celltype'], dtype='object')
CRITICAL:root:No common genes found. Exiting.
Please let me know if I am going wrong somewhere, and how I can fix this.
Thanks ,
RK
Hi Kevin!
I found that the example data that you provided are non-zero integers. Does Scaden require raw counts as input? Is there any preprocessing step that you did to remove 0s?
Ruan
Hi,
Thanks for your great work on scaden.
I've recently been trying to use scaden to recognize cell fractions. With the newly released Scaden v0.9.5, I tried to run the 'scaden simulate' command, but it resulted in the same error, which points to something wrong with 'Celltype'.
Log/Error is:
`Datasets: ['']
Loading dataset ...
No. of common genes: 14003
Merging unknown cell types: ['unknown']
Traceback (most recent call last):
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Celltype'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/bin/scaden", line 8, in
sys.exit(main())
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/main.py", line 31, in main
cli()
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/main.py", line 211, in simulate
simulation(
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/preprocessing/simulate.py", line 13, in simulation
simulate_bulk(
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/preprocessing/bulk_simulation.py", line 308, in simulate_bulk
ys[i] = merge_unkown_celltypes(ys[i], unknown_celltypes)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/scaden/preprocessing/bulk_simulation.py", line 194, in merge_unkown_celltypes
celltypes = list(y["Celltype"])
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in getitem
indexer = self.columns.get_loc(key)
File "/disk1/marcus/Le/scaden_analysis/scaden-0.9.5/venv/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
raise KeyError(key) from err
KeyError: 'Celltype'
`
My command for 'scaden simulate' is:
scaden simulate --cells 150 --n_samples 1000 --data ../ --pattern _.txt
The count data file '_.txt' has been generated following the normalization steps of your example in jupyter script, containing 14003 genes (column) and 2298 cells (row). Formatted as follows (I've also tried to replace the row name with barcodes, or set the genes as row and cells as column, however the error is the same):
And the _celltypes.txt has following format (2298 rows. Actually I was not sure about the 'n x 2' format mentioned in your tutorial, whether to add column name and row name. So I've tried to add column name or/and row name to this file but the error appears to be the same.):
I suppose something is wrong with the format, but have tried several times and couldn't figure out the reason.
May I have your suggestion on this issue? Thanks a lot!
Marcus
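The traceback above ends in list(y["Celltype"]), so the label table that pandas loaded had no column named exactly "Celltype". A minimal sketch for checking a label file before running simulate, assuming the file is tab-delimited with a header row. The helper name, the second column and the labels are made up for illustration; only the "Celltype" column name comes from the traceback.

```python
import csv
import io

def celltype_header_ok(text, delimiter="\t"):
    """Return True if the first line of a label file contains a 'Celltype' column."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])
    # Exact match: 'celltype' or ' Celltype' would still raise the KeyError above.
    return "Celltype" in header

good = "Celltype\textra\nTcell\tx\nBcell\ty\n"   # made-up labels, header present
bad = "Tcell\nBcell\n"                           # no header row at all
print(celltype_header_ok(good), celltype_header_ok(bad))   # True False
```

For a real file, pass the file contents (or just its first line) instead of the inline strings.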
Hi,
I was wondering how I obtain the scripts for generating bulk from my single cell data (bulk_generation.py)?
Best,
Ida
Hi! Would you consider bumping the bioconda package? It doesn't really fit the documentation anymore. Thanks!
Cheers,
Rasmus
Hi Kevin,
I am confused about 'simulate' and 'process'. Actually, a simple end-to-end example starting from count matrices up to prediction with examples for all input/intermediate/output files would help a lot.
Practically, my inputs are
test_celltypes.txt ##celltypes of single cells
test_counts.txt ##raw read counts of single cells
bulk.txt ##raw counts of by bulk rna-seq (3' sequencing no need for gene length norm), genes are the same and in the same order as in test_counts.txt
Is this the correct pipeline?
scaden simulate --cells 100 --n_samples 32000 --data ./ --pattern '*_counts.txt'
scaden train data.h5ad --steps 20000
scaden predict bulk.txt
This runs without error messages but the predictions are grossly wrong, so wondered if I missed something. The single cell and bulk come from the same piece of tissue, so I am 100% sure the single cells match the cells within the bulks.
I was surprised that replacing 'bulk.txt' by its log2-transformed and [0,1]-scaled version yields the exact same prediction. Is that expected?
I tried to run 'process':
scaden process data.h5ad bulk.txt
it completed, but then 'train' generated an error (see below). Is 'process' actually needed after 'simulate' if my single cell bulk matrix has the same genes in the same order?
Another question: my bulks are actually mini bulks with 10-100 cells and super low coverage. What parameters would you advise for 'simulate' ?
Again, thank you so much for your help
Vincent
scaden train processed.h5ad
2020-12-18 18:32:19.194073: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-12-18 18:32:19.194143: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
____ _
/ ___| ___ __ _ __| | ___ _ __
\___ \ / __/ _` |/ _` |/ _ \ '_ \
___) | (_| (_| | (_| | __/ | | |
|____/ \___\__,_|\__,_|\___|_| |_|
Training M256 Model ...
2020-12-18 18:32:20.448871: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-12-18 18:32:20.448929: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-12-18 18:32:20.448954: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7603195280b0): /proc/driver/nvidia/version does not exist
2020-12-18 18:32:20.449306: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-18 18:32:20.456092: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3312000000 Hz
2020-12-18 18:32:20.456259: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5af1120 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-18 18:32:20.456304: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py:353: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py:354: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/iterator_ops.py:356: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/scaden/model/scaden.py:53: dense (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/legacy_tf_layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/scaden/model/scaden.py:54: dropout (from tensorflow.python.keras.legacy_tf_layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
2020-12-18 18:32:33.561962: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 969984000 exceeds 10% of free system memory.
2020-12-18 18:32:33.833938: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 969984000 exceeds 10% of free system memory.
Model parameters restored successfully
0%| | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [128,7578], In[1]: [11994,256]
[[{{node scaden_model/dense1/MatMul}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/scaden", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/scaden/__main__.py", line 31, in main
cli()
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/scaden/__main__.py", line 90, in train
training(data_path=data_path,
File "/usr/local/lib/python3.8/dist-packages/scaden/scaden/training.py", line 63, in training
cdn256.train(input_path=data_path, train_datasets=train_datasets)
File "/usr/local/lib/python3.8/dist-packages/scaden/model/scaden.py", line 285, in train
_, loss, summary = self.sess.run([self.optimizer, self.loss, self.merged_summary_op])
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 957, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1180, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1358, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Matrix size-incompatible: In[0]: [128,7578], In[1]: [11994,256]
[[node scaden_model/dense1/MatMul (defined at /lib/python3.8/dist-packages/scaden/model/scaden.py:53) ]]
Errors may have originated from an input operation.
Input Source operations connected to node scaden_model/dense1/MatMul:
IteratorGetNext (defined at /lib/python3.8/dist-packages/scaden/model/scaden.py:234)
Original stack trace for 'scaden_model/dense1/MatMul':
File "/bin/scaden", line 8, in <module>
sys.exit(main())
File "/lib/python3.8/dist-packages/scaden/__main__.py", line 31, in main
cli()
File "/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/lib/python3.8/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/lib/python3.8/dist-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/lib/python3.8/dist-packages/scaden/__main__.py", line 90, in train
training(data_path=data_path,
File "/lib/python3.8/dist-packages/scaden/scaden/training.py", line 63, in training
cdn256.train(input_path=data_path, train_datasets=train_datasets)
File "/lib/python3.8/dist-packages/scaden/model/scaden.py", line 265, in train
self.build_model(input_path=input_path, train_datasets=train_datasets, mode="train")
File "/lib/python3.8/dist-packages/scaden/model/scaden.py", line 245, in build_model
self.logits = self.model_fn(X=self.x, n_classes=self.n_classes)
File "/lib/python3.8/dist-packages/scaden/model/scaden.py", line 53, in model_fn
layer1 = tf.compat.v1.layers.dense(X, units=self.hidden_units[0], activation=activation , name="dense1")
File "/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/lib/python3.8/dist-packages/tensorflow/python/keras/legacy_tf_layers/core.py", line 187, in dense
return layer.apply(inputs)
File "/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 1701, in apply
return self.__call__(inputs, *args, **kwargs)
File "/lib/python3.8/dist-packages/tensorflow/python/keras/legacy_tf_layers/base.py", line 547, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/lib/python3.8/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 776, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 255, in wrapper
return converted_call(f, args, kwargs, options=options)
File "/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 532, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 339, in _call_unconverted
return f(*args, **kwargs)
File "/lib/python3.8/dist-packages/tensorflow/python/keras/layers/core.py", line 1193, in call
return core_ops.dense(
File "/lib/python3.8/dist-packages/tensorflow/python/keras/layers/ops/core.py", line 53, in dense
outputs = gen_math_ops.mat_mul(inputs, kernel)
File "/lib/python3.8/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5640, in mat_mul
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3477, in _create_op_internal
ret = Operation(
File "/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
self._traceback = tf_stack.extract_stack()
Hello,
I am wondering if you have specific recommendations for the # of cells to include in each sample, and for how many samples to create? I see the defaults are 100 cells in 8000 samples -- in your experience, is this sufficient for most use-cases? In your manuscript, it looks like you used 500 cells per sample, but the total # of samples seemed to vary quite a bit between datasets.
Thanks for your help!
Would be good to properly refactor and make use of TF 2. This will require some restructuring of the code.
Once this is done and everything is working properly we can do the 1.0.0 release.
Hello Kevin,
After installing scaden via pip (I am using macOS), I tried running it on the example data, but it gives an error saying:
Error: No such command 'example'.
Could you suggest why this is happening?
Using the script create_h5ad_file.py I get the error
Traceback (most recent call last):
File "Scaden/create_h5ad_file.py", line 109, in <module>
adata.append(anndata.AnnData(X=x.as_matrix(),
File "/home/luca/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 5136, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'as_matrix'
However, substituting .as_matrix() with .values fixes the problem.
Hello, I downloaded pbmc_data.h5ad and it has 4 simulated datasets and 2 real datasets named sdy67 and GSE65133.
Could you please also release the PBMC2 dataset in the pbmc_data.h5ad?
Best,
Minji
When dataset names contain multiple underscores, Scaden fails to read the individual datasets correctly because it simply splits on underscores.
This has to be made more robust.
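One possible direction, sketched under the assumption that count files end in a fixed suffix such as "_counts.txt" (the pattern used elsewhere in these issues): strip the known suffix instead of splitting on every underscore. The function and the file name below are hypothetical and not taken from the Scaden code base.

```python
def dataset_name(filename, suffix="_counts.txt"):
    """Recover the dataset name by stripping a known suffix instead of splitting on '_'."""
    if not filename.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    return filename[: -len(suffix)]

name = "pbmc_donor_A_counts.txt"     # hypothetical file name
print(name.split("_")[0])            # naive split keeps only 'pbmc'
print(dataset_name(name))            # suffix stripping keeps 'pbmc_donor_A'
```

The same helper works for label files by passing suffix="_celltypes.txt".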
Hi,
Maybe I have missed this but couldn't find it in the code so apologies!
Can you confirm that the prediction gene matrix is re-indexed so that the order of genes is the same as in the training data gene matrix?
Thanks!
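Independent of what Scaden does internally, the gene order can be aligned by hand before prediction. A minimal sketch of the idea with plain Python structures; the gene names are placeholders, and filling missing genes with 0.0 is a choice made for this sketch only.

```python
def reorder_genes(expression, training_genes, fill=0.0):
    """Return one sample's values in the training gene order.

    expression: dict mapping gene name -> value for a prediction sample.
    Genes absent from the sample get `fill` (a choice made for this sketch).
    """
    return [expression.get(gene, fill) for gene in training_genes]

training_genes = ["GENE_A", "GENE_B", "GENE_C"]           # placeholder names
sample = {"GENE_C": 3.0, "GENE_A": 1.0, "GENE_B": 2.0}    # same genes, different order
print(reorder_genes(sample, training_genes))               # [1.0, 2.0, 3.0]
```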
The documentation needs more detail. Things that are currently missing:
Hi,
Great work on the SCADEN paper. I was just curious if there is a way to generate my custom samples with labels as shown in the example file 'data6k_500_.txt' ?
Is prediction only possible on the data attached here? - https://figshare.com/articles/code/Publication_Figures/8234030?file=17855789
Thanks!
Hi Kevin Menden,
Thank you for your excellent work.
I was wondering what kind of normalization was applied to the numbers in the .X data matrix of the .h5ad file in the PBMC training expression dataset (https://scaden.readthedocs.io/en/latest/datasets/): TPM, RPKM, or log-transformed counts?
Thank you for your time.
Yi Han
Currently the biggest hurdle for using Scaden is to prepare the training data.
This could be substantially eased by developing a semi-automatic pipeline (or maybe fully automatic), which converts raw scRNA-seq datasets into training data.
The most important part for this is an automatic cell type identification tool, because that's the most important manual step at the moment.
Owing to randomness in batching, initialization and also GPU computing, different training runs with Scaden will give different results.
This should be fixed by setting the seeds manually and switching to deterministic GPU behaviour.
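A minimal sketch of the seeding part, assuming a single helper called once at start-up. The helper name is hypothetical, and deterministic GPU behaviour is not covered here.

```python
import os
import random

def seed_everything(seed):
    """Seed the random number generators this sketch knows about."""
    # Only affects subprocesses started later, not the running interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np          # optional: only seeded if installed
        np.random.seed(seed)
    except ImportError:
        pass
    # TensorFlow keeps its own generator; in TF 2 the call is
    # tf.random.set_seed(seed). It is left out here to keep the sketch light.

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)   # True
```

Seeding alone covers batching and initialization; GPU kernels can still be non-deterministic and need separate handling.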
scaden simulate sometimes can't find the files even if the pattern is specified correctly.
Appending a / solves this problem. os.path.join() doesn't solve it.
Need to implement a more robust way of finding the data.
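One candidate for a more robust lookup is pathlib, which normalizes a trailing slash away before globbing. This is only a sketch with a hypothetical helper name; whether it covers the failing case seen in Scaden is untested.

```python
from pathlib import Path
import tempfile

def find_count_files(data_dir, pattern="*_counts.txt"):
    """Return matching files in data_dir, with or without a trailing slash."""
    return sorted(Path(data_dir).glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data_counts.txt").write_text("placeholder\n")
    without_slash = find_count_files(tmp)
    with_slash = find_count_files(tmp + "/")
    print(without_slash == with_slash, len(with_slash))   # True 1
```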
Hi Kevin,
I am new to scRNA deconvolution. I noticed that there are many interesting experiments in your paper. However, from the code and datasets you provided, I don't know how to reproduce your experimental results (the current code and web tool are more like tools for people to use on their own datasets). For example, the first experiment leaves one PBMC dataset out for validation and uses the others as training data. I guess all four PBMC datasets are mixed in the pbmc_data.h5ad you provided. How can I split the dataset to perform the experiments? Besides, for the real tissue datasets PBMC1 and PBMC2, can you provide the processed datasets used in your experiments? I don't know which file to download. I would appreciate it if you could give detailed instructions on how to reproduce your experiments and provide all the datasets used in your paper.
Thanks a lot for your time.
Best,
Xiaohan
Hi! I'm back with a new issue. While trying to run the Scaden process command I get this error:
p209-114:preprocessing idala384$ scaden process barcoding.h5ad DMSO_3.txt
____ _
/ ___| ___ __ _ __| | ___ _ __
\___ \ / __/ _` |/ _` |/ _ \ '_ \
___) | (_| (_| | (_| | __/ | | |
|____/ \___\__,_|\__,_|\___|_| |_|
Found 0 common genes.
Pre-processing raw data ...
Subsetting genes ...
Traceback (most recent call last):
File "/usr/local/bin/scaden", line 13, in <module>
main()
File "/usr/local/lib/python3.8/site-packages/scaden/cli.py", line 137, in main
cli()
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/scaden/cli.py", line 121, in process
processing(data_path=prediction_data,
File "/usr/local/lib/python3.8/site-packages/scaden/scaden_main.py", line 165, in processing
preprocess_h5ad_data(raw_input_path=training_data,
File "/usr/local/lib/python3.8/site-packages/scaden/model/functions.py", line 60, in preprocess_h5ad_data
raw_input = raw_input[:, sig_genes]
File "/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py", line 1088, in __getitem__
return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
File "/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py", line 305, in __init__
self._init_as_view(X, oidx, vidx)
File "/usr/local/lib/python3.8/site-packages/anndata/_core/anndata.py", line 348, in _init_as_view
self._varm = adata_ref.varm._view(self, (vidx,))
File "/usr/local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 92, in _view
return self._view_class(self, parent, subset_idx)
File "/usr/local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 246, in __init__
self.dim_names = parent_mapping.dim_names[subset_idx]
File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4111, in __getitem__
result = getitem(key)
IndexError: arrays used as indices must be of integer (or boolean) type
I guess the problem is that it cannot find any common genes between the training data and prediction data. I've formatted the prediction files as you described (genes as rows, samples as columns, tab-delimited). Any ideas?
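The "Found 0 common genes." line in the log supports that guess. A quick way to check the overlap outside Scaden is to intersect the two gene lists directly. The sketch below uses made-up identifiers to show how one possible cause, a mismatch in identifier type, gives zero overlap. It says nothing about what the actual files contain.

```python
def common_genes(training_genes, prediction_genes):
    """Genes present in both lists, kept in the training order."""
    available = set(prediction_genes)
    return [gene for gene in training_genes if gene in available]

# Hypothetical identifiers, only to show how an ID-type mismatch gives zero overlap.
training = ["CD3E", "CD19", "NKG7"]
prediction_symbols = ["NKG7", "CD3E", "MS4A1"]
prediction_other_ids = ["ENSG_ID_1", "ENSG_ID_2"]   # placeholders, not real IDs

print(common_genes(training, prediction_symbols))          # ['CD3E', 'NKG7']
print(len(common_genes(training, prediction_other_ids)))   # 0
```

Another issue in this list reports the same symptom when the variance filter in scaden process removes every gene, so a non-empty overlap here does not rule that out.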
Hello,
I am trying to install Scaden on my computer (macOS Mojave 10.14.6).
I tried installing it using pip (v19.3.1), but I have the following error:
ERROR: Could not find a version that satisfies the requirement tensorflow==1.12.1 (from scaden) (from versions: 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 1.15.0rc0, 1.15.0rc1, 1.15.0rc2, 1.15.0rc3, 1.15.0, 2.0.0a0, 2.0.0b0, 2.0.0b1, 2.0.0rc0, 2.0.0rc1, 2.0.0rc2, 2.0.0)
ERROR: No matching distribution found for tensorflow==1.12.1 (from scaden)
When trying to install it with conda, I also get an error:
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package setuptools conflicts for:
python=3.7 -> pip -> setuptools
scaden -> tensorflow[version='<=1.13.0'] -> markdown[version='>=2.6.8'] -> setuptools[version='>=36']
scaden -> matplotlib -> setuptools
Package certifi conflicts for:
python=3.7 -> pip -> setuptools -> certifi[version='>=2016.09|>=2016.9.26']
Package python-dateutil conflicts for:
scaden -> click -> python-dateutil[version='>=2.5.*|>=2.6.1']
Package python conflicts for:
python=3.7
Package pip conflicts for:
python=3.7 -> pip
scaden -> python -> pip
Package wheel conflicts for:
python=3.7 -> pip -> wheel
scaden -> python -> pip -> wheel
Package ca-certificates conflicts for:
scaden -> matplotlib -> setuptools -> ca-certificates
python=3.7 -> openssl[version='>=1.0.2o,<1.0.3a'] -> ca-certificates
Thank you in advance for your help,
Clara
Data simulation can be done with multiple threads/processes relatively easily.
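A minimal sketch of what parallel simulation could look like. This is not Scaden's implementation: the functions `simulate_sample` and `simulate_parallel` are hypothetical, and the "simulation" here is a toy stand-in that simply sums the counts of randomly drawn cells.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def simulate_sample(counts, n_cells, seed):
    """Toy bulk sample: sum the counts of n_cells randomly drawn cells (rows)."""
    rng = np.random.default_rng(seed)
    picked = rng.integers(0, counts.shape[0], size=n_cells)
    return counts[picked].sum(axis=0)


def simulate_parallel(counts, n_samples, n_cells, workers=4):
    """Simulate n_samples independent samples, one task per sample."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = pool.map(lambda seed: simulate_sample(counts, n_cells, seed),
                        range(n_samples))
        return np.vstack(list(rows))


# Toy input: 20 cells x 5 genes, every count equal to 1.
toy_counts = np.ones((20, 5))
toy_bulk = simulate_parallel(toy_counts, n_samples=8, n_cells=10)
```

Each sample uses its own seed, so the tasks do not share random state. `ProcessPoolExecutor` offers the same `map` interface if separate processes are preferred, although it needs a module-level function instead of the lambda.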
Hi Kevin,
Firstly, thanks for developing the package! The underlying concept is very cool.
Secondly, this is less of an issue and more of a heads-up. I've been working through your usage documentation in an attempt to apply scaden to my own data. One bug I ran into while running through the workflow was during the scaden process step. It turns out that one of the functions, get_signature_genes(), filters prediction samples by their within-gene variance (data.var(axis=1) > var_cutoff, where var_cutoff is set to 0.1 by default). As I was using normalised counts (raw counts/total library size) in my bulk data, which has been sequenced to a very high depth, the resulting within-gene variances I got were tiny and much smaller than the cut-off. For this reason, no common genes were found between my training data and prediction data and scaden process didn't run. Once I scaled my normalised counts by multiplying by 10,000 I had no problem. It might be worth pointing this out to future users who might run into a similar issue.
All the best,
Regina
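The effect described above can be reproduced on a toy matrix. This is only a sketch: it assumes nothing beyond the filter expression quoted in the post (`data.var(axis=1) > var_cutoff` with `var_cutoff = 0.1`), and the helper `genes_passing_variance_filter` and the gene and sample names are made up for illustration.

```python
import pandas as pd


def genes_passing_variance_filter(data, var_cutoff=0.1):
    """Return the genes (rows) whose variance across samples exceeds var_cutoff."""
    keep = data.var(axis=1) > var_cutoff
    return data.index[keep]


# Toy bulk matrix: genes as rows, samples as columns, raw counts.
# Every sample has a library size of 10,000.
raw = pd.DataFrame(
    [[100, 400, 250], [50, 60, 900], [10, 10, 10], [9840, 9530, 8840]],
    index=["geneA", "geneB", "geneC", "geneD"],
    columns=["s1", "s2", "s3"],
    dtype=float,
)

# Library-size normalisation turns counts into small fractions.
fractions = raw / raw.sum(axis=0)

# Variances of fractions are tiny, so no gene passes the cut-off.
passed_fractions = genes_passing_variance_filter(fractions)

# After multiplying by 10,000 the variable genes pass again.
passed_scaled = genes_passing_variance_filter(fractions * 10_000)
```

Scaling by a constant c multiplies each variance by c squared, which is why a fixed cut-off behaves so differently on fractions and on scaled counts.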
Currently Scaden doesn't throw useful error messages.
When a lot of datasets are used for simulation (e.g. over 80), Scaden uses a lot of memory because every dataset is stored in memory.
This can be done better, by first iterating through the datasets to quickly get the common genes between them, and then subsampling every dataset separately.
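A rough sketch of that two-pass idea. It assumes each dataset is a tab-delimited count file with cells as rows and genes as columns, which may not match every setup. The helpers `common_genes` and `subsample_dataset` are hypothetical names, not Scaden functions.

```python
import os
import tempfile
from functools import reduce

import pandas as pd


def common_genes(paths):
    """First pass: read only the header row of each file and intersect gene names."""
    gene_sets = []
    for path in paths:
        header = pd.read_csv(path, sep="\t", nrows=0)
        # The first header field belongs to the cell-name column, so skip it.
        gene_sets.append(set(header.columns[1:]))
    return sorted(reduce(set.intersection, gene_sets))


def subsample_dataset(path, genes, n_cells, seed=0):
    """Second pass: load one dataset, keep the shared genes, and sample cells."""
    counts = pd.read_csv(path, sep="\t", index_col=0)[genes]
    n = min(n_cells, len(counts))
    return counts.sample(n=n, random_state=seed)


# Toy demonstration with two small files that share the genes g2 and g3.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, genes in [("a", ["g1", "g2", "g3"]), ("b", ["g2", "g3", "g4"])]:
        df = pd.DataFrame(1, index=[f"cell{i}" for i in range(5)], columns=genes)
        path = os.path.join(tmp, f"{name}_counts.txt")
        df.to_csv(path, sep="\t")
        paths.append(path)
    shared = common_genes(paths)
    samples = [subsample_dataset(p, shared, n_cells=3) for p in paths]
```

Only one full dataset is read at a time in the second pass, so what stays in memory is the subsampled frames rather than every complete dataset.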
When using pip to install scaden, scaden always pulls tensorflow and never tensorflow-gpu.
The real problem is that pip is not able to deal with GPU variants, as described at tensorflow/tensorflow#7166. If pip were able to, tensorflow would not need those two separate packages and everything would work as expected.
There are three possible scenarios:
scaden depends on tensorflow, and it is not possible to install scaden using the tensorflow-gpu package because they conflict (this is the current status).
Two packages, scaden and scaden-gpu, depending on tensorflow and tensorflow-gpu respectively.
One package with optional extras for tensorflow and tensorflow-gpu. The instructions to install scaden through pip would be: pip install scaden[cpu] or pip install scaden[gpu]. pip install scaden would not install any tensorflow variant and lead to an error (we could warn the user then, though).
I can easily implement ALSO UGLY; the community is split between the two ugly solutions. Do you have any preference, @KevinMenden?
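For the variant with `pip install scaden[cpu]` and `pip install scaden[gpu]`, a sketch of the two pieces involved might look like this. The `EXTRAS_REQUIRE` mapping and both helper functions are hypothetical and not taken from Scaden's setup script. The only fact relied on is that both tensorflow and tensorflow-gpu install an importable module named `tensorflow`.

```python
import importlib.util
import warnings

# Hypothetical extras mapping, in the form accepted by
# setuptools.setup(extras_require=...).
EXTRAS_REQUIRE = {
    "cpu": ["tensorflow"],
    "gpu": ["tensorflow-gpu"],
}


def tensorflow_installed():
    """True if a module named 'tensorflow' can be found, whichever variant provides it."""
    return importlib.util.find_spec("tensorflow") is not None


def warn_if_tensorflow_missing():
    """Warn the user when neither variant is installed; return True if a warning was issued."""
    if tensorflow_installed():
        return False
    warnings.warn(
        "No TensorFlow installation found. Install one with "
        "'pip install scaden[cpu]' or 'pip install scaden[gpu]'."
    )
    return True
```

Calling `warn_if_tensorflow_missing()` at start-up would cover the case where a plain `pip install scaden` was used and no tensorflow variant is present.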
Hi Kevin
What kind of normalization of the input bulk data is suitable for Scaden? Although normalization to sequencing depth is most common, isn't normalization to gene length more appropriate when estimating different cell types that have different marker genes?
Hi there,
I'd like to use the training dataset for the mouse brain to slot into CibersortX as a signature matrix.
I tried opening the .h5ad file in Seurat so I could export the matrix with the phenotype labels (e.g. genes on rows, and cells on columns). But ran into this error:
file = ReadH5AD("C:/Users/James Hong/Desktop/CVT Paper/mouse_brain.h5ad")
Pulling expression matrices and metadata
Data is scaled
Error in ReadH5AD.H5File(file = hfile, assay = assay, layers = layers, :
Seurat requires normalized data present in the raw slot when X is scaled
I think the raw slot is missing and only the scaled (perhaps SCTransform) data is available.
Could you help me with the right .h5ad file or perhaps generate the expression matrix above for CibersortX?
Thanks
I saw the L1 loss function in your code:
loss = tf.reduce_mean(input_tensor=tf.math.square(logits-targets))
at https://github.com/KevinMenden/scaden/blob/master/scaden/model/scaden.py#L67
Is it correct?
As I understand it, it should be the sum of absolute values, right?
This is the definition of the L1 norm:
https://mathworld.wolfram.com/L1-Norm.html
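To make the distinction concrete, here is a small NumPy comparison of the quantities under discussion. The input values are arbitrary toy numbers, and the first expression mirrors the quoted TensorFlow line only in form.

```python
import numpy as np

# Arbitrary toy predictions and targets.
logits = np.array([0.2, 0.5, 0.3])
targets = np.array([0.1, 0.7, 0.2])

diff = logits - targets

# Mean of squared differences: the NumPy analogue of
# tf.reduce_mean(tf.math.square(logits - targets)).
mean_squared_error = np.mean(np.square(diff))

# Mean of absolute differences.
mean_absolute_error = np.mean(np.abs(diff))

# L1 norm as in the MathWorld definition: sum of absolute values.
l1_norm = np.sum(np.abs(diff))
```

The quoted line squares the differences before averaging, so it computes a mean squared error and not a quantity based on absolute values.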
Hi,
I have trained scaden on a .h5ad file I produced.
I am looking to run scaden predict on bulk RNA-seq data if possible.
Could you tell me what format the bulk RNA-seq data would need to be in?
Regards
Jon8991
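Another post on this page describes the prediction file as tab-delimited with genes as rows and samples as columns. A sketch of writing and re-reading such a table with pandas follows. The gene names, sample names and values are made up, and any further requirements of scaden predict are not covered here.

```python
import io

import pandas as pd

# Toy bulk expression table: genes as rows, samples as columns.
bulk = pd.DataFrame(
    {"sample1": [12.0, 0.0, 301.5], "sample2": [8.0, 3.0, 280.0]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Write tab-delimited text; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
bulk.to_csv(buffer, sep="\t")
text = buffer.getvalue()

# Read it back with the first column as the gene index.
reloaded = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)
```

To write to disk, pass a file path to `to_csv` in place of the buffer.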
Hello again,
I've now managed to get bulk_generation.py to work on my data, but encounter an error when trying to create the h5ad-file. The error message is:
Celltypes: [1, 2, 3, 4, 5, 6]
Traceback (most recent call last):
File "create_h5ad_file.py", line 98, in <module>
y = sort_celltypes(y, labels, celltypes)
File "create_h5ad_file.py", line 55, in sort_celltypes
idx = [labels.index(x) for x in ref_labels]
File "create_h5ad_file.py", line 55, in <listcomp>
idx = [labels.index(x) for x in ref_labels]
ValueError: 1 is not in list
Does this tell you anything?
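One possible cause, offered as an assumption and not a diagnosis: `list.index` compares values with their types, so integer cell type IDs are never found in a list of strings. The sketch below reproduces the same error message with made-up data and shows a type-safe lookup.

```python
# Made-up data: labels as strings (e.g. read from a text header),
# reference cell types as integers.
labels = ["1", "2", "3"]
ref_labels = [1, 2, 3]

# Same lookup as in the traceback; it fails because 1 != "1".
try:
    idx = [labels.index(x) for x in ref_labels]
    message = ""
except ValueError as err:
    message = str(err)

# Converting both sides to str makes the lookup independent of the input type.
str_labels = [str(label) for label in labels]
idx = [str_labels.index(str(x)) for x in ref_labels]
```

If this is the cause, using non-numeric cell type names or reading the labels as strings throughout would avoid the mismatch.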
Hello,
I am trying to run Scaden from the docker image on OSX.
The scaden command is found and version is 1.0.0.
Now, following the example from this page,
I run
scaden simulate --cells 100 --n_samples 1000 --data preprocessing
and got quite a few FutureWarnings and this error:
Usage: scaden [OPTIONS] COMMAND [ARGS]...
Try 'scaden --help' for help.
Error: No such command 'simulate'.
Is the container (installed with 'docker pull kevinmenden/scaden' last Friday, Dec 11th) up to date?
In addition, I'll have to write my own bulk simulator for my specific application. It would be very useful to get a description of what the h5ad file provided to train contains, and/or toy examples of inputs for create_h5ad_file.py. As an R programmer, this would also save me the trouble of writing the simulated bulk data to file, so that I could generate the h5ad file directly.
Nevertheless, I was actually able to run create_h5ad_file.py on the file I generated in R:
python ../../../scaden/scaden-master/scaden/preprocessing/create_h5ad_file.py --data ./preprocessing/ --out ./preprocessing/train.h5ad
Celltypes: ['c5', 'c11', 'c9', 'c16', 'c2', 'c14', 'c3', 'c18', 'c19', 'c20', 'c10', 'c12', 'c22', 'c17', 'c8', 'c21', 'c13', 'c7', 'c4', 'c1', 'c15', 'c6', 'c0']
No error message, and the file train.h5ad is created and non-empty. But when I then ran 'scaden train', I got an error:
scaden train ./preprocessing/train.h5ad --steps 20000
Processing TESTSAMPLE
/opt/conda/lib/python3.7/site-packages/anndata/_core/anndata.py:119: ImplicitModificationWarning: Transforming to str index.
warnings.warn("Transforming to str index.", ImplicitModificationWarning)
... storing 'ds' as categorical
(base) root@e4ff2c3d5805:/Volumes/VD-MISC-2TB/ST/experiments/PTC7SN1/scaden_proportion_model# scaden train ./preprocessing/train.h5ad --steps 20000
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/core/anndata.py:17: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
from pandas.core.index import RangeIndex
/opt/conda/envs/scaden/lib/python3.6/site-packages/scanpy/api/__init__.py:6: FutureWarning:
In a future version of Scanpy, `scanpy.api` will be removed.
Simply use `import scanpy as sc` and `import scanpy.external as sce` instead.
FutureWarning
Training on: []
Training M256 Model ...
2020-12-12 09:41:01.301049: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-12-12 09:41:01.327354: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
File "/opt/conda/envs/scaden/bin/scaden", line 125, in <module>
cli()
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/opt/conda/envs/scaden/bin/scaden", line 63, in train
num_steps=steps)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/scaden_main.py", line 64, in training
cdn256.train(input_path=data_path, train_datasets=train_datasets)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/model/scaden.py", line 252, in train
self.build_model(input_path=input_path, train_datasets=train_datasets, mode="train")
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/model/scaden.py", line 213, in build_model
self.load_h5ad_file(input_path=input_path, batch_size=self.batch_size, datasets=train_datasets)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/scaden/model/scaden.py", line 151, in load_h5ad_file
raw_input = sc.read_h5ad(input_path)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/readwrite/read.py", line 447, in read_h5ad
constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/readwrite/read.py", line 502, in _read_args_from_h5ad
return AnnData._args_from_dict(d)
File "/opt/conda/envs/scaden/lib/python3.6/site-packages/anndata/core/anndata.py", line 2157, in _args_from_dict
if key in d_true_keys[true_key].dtype.names:
AttributeError: 'dict' object has no attribute 'dtype'
Any Idea?
All the best,
Vincent
Hi Kevin,
I am doing some tests on Scaden. I found that after adding some cells from other sources to the original scRNA-seq data, prediction performance drops a lot. Scaden's results report cell types that shouldn't exist in the bulk data, and some of them even account for a large proportion.
In my test, I use PBMC scRNAseq and whole blood bulk RNAseq as the test data. Tumor cells from LUAD (Lung Adenocarcinoma) were added to the PBMC scRNAseq.
Ruan
Hi Kevin
I am currently trying out the Scaden webtool. The demo run works well. However, when I run the webtool with my own data, it gives an error saying "Error preparing job directories and files". Do you know where the problem might be? Thanks for your help.
Best,
Yingjun
If a user specifies datasets which are not part of the training data, Scaden will not warn the user.
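A sketch of the kind of check that could produce such a warning. The function `check_datasets` is hypothetical and does not exist in Scaden. The dataset names are borrowed from a training log quoted elsewhere on this page, except `donorX`, which is invented.

```python
import warnings


def check_datasets(requested, available):
    """Warn about requested datasets absent from the training data; return the usable ones."""
    available = set(available)
    missing = [d for d in requested if d not in available]
    if missing:
        warnings.warn(f"Datasets not found in training data: {missing}")
    return [d for d in requested if d in available]


# Toy usage: 'donorX' is not part of the training data.
requested = ["data8k", "donorA", "donorX"]
available = ["data8k", "donorA", "donorC"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usable = check_datasets(requested, available)
```

An empty result would also be worth reporting explicitly, since training on an empty list of datasets is hard to diagnose from a later traceback.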
There's currently no proper documentation for how to generate training data from scRNA-seq data.
This must be updated urgently. Furthermore, the whole training data generation procedure must be made more usable.
As some users have had trouble with input file formats, it might be useful to have a small linting function that checks whether the input is correctly formatted and, if not, points to the issues with it.
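A possible starting point for such a linter, written as a sketch. The function name `lint_expression_table` and the specific checks are assumptions about what "correctly formatted" might mean and are not Scaden's actual requirements.

```python
import pandas as pd


def lint_expression_table(df):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if df.empty:
        problems.append("table is empty")
    if df.index.has_duplicates:
        problems.append("duplicate names in the index")
    non_numeric = [str(col) for col, dtype in df.dtypes.items()
                   if not pd.api.types.is_numeric_dtype(dtype)]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    if df.isna().any().any():
        problems.append("missing values present")
    numeric = df.select_dtypes("number")
    if (numeric < 0).any().any():
        problems.append("negative values present")
    return problems


# Toy tables: one clean, one with three different problems.
good = pd.DataFrame({"s1": [1.0, 2.0], "s2": [0.0, 5.0]}, index=["g1", "g2"])
bad = pd.DataFrame({"s1": [1.0, -2.0], "s2": ["x", "y"]}, index=["g1", "g1"])

good_report = lint_expression_table(good)
bad_report = lint_expression_table(bad)
```

Returning a list of messages, as opposed to raising on the first problem, lets the user see every issue in one run.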
It may still be a beta, but it is worth a try.
Add some form of continuous integration testing.
Hi, kevin
great work!
Can you provide the single-cell expression matrix? Or does it already exist somewhere and I missed it?
All the best
HanLuo
The docker publish action should be modified such that it builds two images, one with Scaden installed for CPU usage and one for GPU usage
Hi Kevin,
I want to generate 3000 samples of training data. My input is:
scaden simulate --data ./ -n 3000 --pattern "*_counts.txt"
Normal samples: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [02:10<00:00, 11.47it/s]
Sparse samples: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [01:27<00:00, 17.23it/s
Is this wrong? I am confused because I did not see the number 3000 appear in the result.
Thank you so much for your help!
Ruan