Hi Kevin, I am new to the scRNA deconvolution. I notice that there a

How to repeat the experiments in your paper about scaden (closed)

kevinmenden commented on August 26, 2024

How to repeat the experiments in your paper

from scaden.

Comments (19)

KevinMenden commented on August 26, 2024

Dear Xiaohan,

Scaden has a --train_datasets option, to which you give a comma-separated list of datasets. This way you can decide which datasets to use for training. This functionality was basically implemented for these experiments.

If you want to repeat these experiments, I would encourage you to start from the raw data. I did not do any special processing to the data (PBMC1 and PBMC2), other than merging some cell types together (described in the paper). If you want to generate some datasets yourself, you can also have a look at the processing scripts I provided on figshare:
https://figshare.com/projects/Scaden/62834

Some of the datasets used in the study are under some sort of restricted access or I got them directly from the authors - so I can't just share them, sorry.
I know some of the datasets are a bit tricky to find, so just let me know where exactly you are struggling and I might be able to help you out.

Cheers,
Kevin

hathawayxxh commented on August 26, 2024

Hi Kevin,

Thanks a lot for your reply. I have tried the option "scaden --training_datasets", but it reports the following error:

I have downloaded the datasets from the webtool:

but the train-test separation is not clear to me. For example, if I want to train the model on "data6k, data8k, donorA" and test it on "donorC", what command should I use?

Best,
Xiaohan

KevinMenden commented on August 26, 2024

Hi Xiaohan,

It should be usable when calling scaden train. For instance, when you call scaden train --help you get:

     ____                _            
    / ___|  ___ __ _  __| | ___ _ __  
    \___ \ / __/ _` |/ _` |/ _ \ '_ \ 
     ___) | (_| (_| | (_| |  __/ | | |
    |____/ \___\__,_|\__,_|\___|_| |_|

    
Usage: scaden train [OPTIONS] <training data>

  Train a Scaden model

Options:
  --train_datasets TEXT  Comma-separated list of datasets used for training.
                         Uses all by default.

  --model_dir TEXT       Path to store the model in
  --batch_size INTEGER   Batch size to use for training. Default: 128
  --learning_rate FLOAT  Learning rate used for training. Default: 0.0001
  --steps INTEGER        Number of training steps
  --seed INTEGER         Set random seed
  --help                 Show this message and exit.

You would indicate the datasets you want to use for training like you did, and store the model with --model_dir.
Then you can use this model directory during prediction, using scaden predict --model_dir <your_model_dir>:

Usage: scaden predict [OPTIONS] <prediction data>

  Predict cell type composition using a trained Scaden model

Options:
  --model_dir TEXT  Path to trained model
  --outname TEXT    Name of predictions file.
  --seed INTEGER    Set random seed
  --help            Show this message and exit.

Let me know if that works!

hathawayxxh commented on August 26, 2024

Hi Kevin,

I used the command "scaden train "./pbmc_data.h5ad" --train_datasets 'data6k, data8k, donorA' --steps 5000 --model_dir model", and it finally works. However, it reports another error:

Do you know how to solve it?

KevinMenden commented on August 26, 2024

Did you run scaden process before?

hathawayxxh commented on August 26, 2024

Is the following command correct?
scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/"

It runs into another error:

KevinMenden commented on August 26, 2024

Not quite, you need to point it to the file (expression matrix) you want to run prediction on:

Usage: scaden process [OPTIONS] <training dataset to be processed> <data for
                      prediction>

  Process a dataset for training

Options:
  --processed_path TEXT  Path of processed file. Must end with .h5ad
  --var_cutoff FLOAT     Filter out genes with a variance less than the
                         specified cutoff. A low cutoff is recommended,this
                         should only remove genes that are obviously
                         uninformative.

  --help                 Show this message and exit.

Have a look at the demo with example data simulation that I provide in the README.md for all the steps you need to do to perform training and prediction. It also generates example data which you can inspect to check if your data is formatted correctly:

https://github.com/KevinMenden/scaden/blob/master/README.md

hathawayxxh commented on August 26, 2024

Hi Kevin,

As I understand it, the "data simulation" part is only needed when I want to generate a new training dataset. Is that right?
I want to train with 'data6k, data8k, donorA' and test with 'donorC', and I think all of these datasets are contained in the 'pbmc_data.h5ad' file. So should I use "scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad"?
I am a bit confused.

KevinMenden commented on August 26, 2024

Ahh I see. No, you need a text file containing the gene expression.
Have a look at the dataset I shared earlier and download this one:
https://figshare.com/articles/software/Publication_Figures/8234030

For the simulated PBMC data there are some expression matrices inside. Specifically:
paper_data_v3/figures/figure2/data

hathawayxxh commented on August 26, 2024

Hi Kevin,

It finally works. Thanks a lot for your help.
I will reimplement the experiments and compare them with the results in your paper.

Best,
Xiaohan

KevinMenden commented on August 26, 2024

Perfect, let me know if you encounter any other issues! :)

hathawayxxh commented on August 26, 2024

Hi Kevin,

sorry for bothering you again. I have trained a model with "data6k, data8k, donorA" and tested on "donorC_500_samples.txt". However, the result is quite different from your paper. The following is my result on the donorC dataset:

My operations are:

preprocess data: scaden process "./pbmc_data.h5ad" "./donorC_500_samples.txt"
train model: scaden train "processed.h5ad" --train_datasets 'data6k, data8k, donorA' --steps 5000 --model_dir model
test model: scaden predict --model_dir model donorC_500_samples.txt

Is there anything wrong with my commands? Maybe the training data and testing data are not matched in their distributions?

Looking forward to your reply.

Best,
Xiaohan
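
For reference, the three steps listed in the comment above can be sketched in Python as plain command lists. This is only an illustration: the helper name `build_scaden_commands` is hypothetical, only flags that appear in the help output quoted in this thread are used, and `processed.h5ad` is assumed to be the file written by `scaden process`, as in the commands above. The dataset names are joined with bare commas, since the help text asks for a comma-separated list.

```python
import shlex


def build_scaden_commands(training_h5ad, prediction_txt, datasets,
                          steps, model_dir, processed="processed.h5ad"):
    """Build the process/train/predict commands as argument lists.

    Hypothetical helper: nothing is executed here, the lists are only built.
    """
    # Join the dataset names with bare commas, dropping stray whitespace.
    dataset_arg = ",".join(name.strip() for name in datasets)
    return [
        ["scaden", "process", training_h5ad, prediction_txt],
        ["scaden", "train", processed,
         "--train_datasets", dataset_arg,
         "--steps", str(steps),
         "--model_dir", model_dir],
        ["scaden", "predict", "--model_dir", model_dir, prediction_txt],
    ]


commands = build_scaden_commands(
    training_h5ad="./pbmc_data.h5ad",
    prediction_txt="./donorC_500_samples.txt",
    datasets=["data6k", "data8k", "donorA"],
    steps=5000,
    model_dir="model",
)

for cmd in commands:
    # shlex.join quotes arguments where a shell would need it.
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`, assuming scaden is installed and on the PATH.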

KevinMenden commented on August 26, 2024

Dear Xiaohan,

nice that you could get it running.
The results look pretty normal at first glance, but you're right that the RMSE should probably be a bit lower and the CCC higher.
How did you calculate those values?
It would be nice to also calculate them over all data points, and not by cell type (which is the main metric we used).
And maybe run it for another of those datasets. I'll replicate those steps myself when I get to it to make sure nothing is wrong with the training dataset.
But I'm not sure if I will get around to doing this in the next few days.

Best,
Kevin
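
To make the difference between the two ways of scoring concrete, here is a small sketch that computes the RMSE and Lin's concordance correlation coefficient (CCC) both per cell type and over all data points pooled together. It is not the code from the paper's notebook: the cell type labels and the fractions are made up for illustration, and population (1/n) variances are assumed in the CCC.

```python
import math


def rmse(truth, pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def ccc(truth, pred):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Made-up cell type fractions for three samples (illustration only).
truth = {"type_a": [0.10, 0.30, 0.50], "type_b": [0.90, 0.70, 0.50]}
pred = {"type_a": [0.15, 0.25, 0.55], "type_b": [0.85, 0.75, 0.45]}

# Per cell type: one score for each column of the fraction table.
for cell_type in truth:
    print(cell_type,
          "RMSE:", round(rmse(truth[cell_type], pred[cell_type]), 4),
          "CCC:", round(ccc(truth[cell_type], pred[cell_type]), 4))

# All data points: pool every (sample, cell type) pair into one vector.
all_truth = [value for cell_type in truth for value in truth[cell_type]]
all_pred = [value for cell_type in truth for value in pred[cell_type]]
print("all", "RMSE:", round(rmse(all_truth, all_pred), 4),
      "CCC:", round(ccc(all_truth, all_pred), 4))
```

The pooled score and the per-cell-type scores can differ noticeably, so it matters which of the two is compared against the numbers in the paper.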

KevinMenden commented on August 26, 2024

Could you show me the output of Scaden after you type scaden train ... ?
It usually tells you which datasets were used for training. I have a feeling that it didn't use all the datasets but just one, as you didn't supply a plain comma-separated list but one that also contains white spaces. So if you could try again using
'data6k,data8k,donorA'
instead of
'data6k, data8k, donorA'

that would be great :)
Let me know if that helps!
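
Why the white spaces matter can be seen with a few lines of Python. This is a sketch of the general pitfall, not Scaden's actual parsing code; the helper `parse_dataset_list` is hypothetical.

```python
def parse_dataset_list(raw):
    """Split a comma-separated list and strip whitespace around each name."""
    return [name.strip() for name in raw.split(",") if name.strip()]


# A plain split keeps the leading spaces, so only the first name is clean.
naive = "data6k, data8k, donorA".split(",")
print(naive)    # ['data6k', ' data8k', ' donorA']

# Stripping each entry recovers the intended names.
cleaned = parse_dataset_list("data6k, data8k, donorA")
print(cleaned)  # ['data6k', 'data8k', 'donorA']
```

A name such as ' data8k' (with a leading space) would not match a dataset called 'data8k', which is consistent with only one dataset being picked up.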

hathawayxxh commented on August 26, 2024

Hi Kevin,

Thanks for your reply.
The information looks correct. It seems the model is trained on the three datasets. But I will try to remove the white spaces as you suggested.

For the calculation of the metrics, I used the code you provided in scaden_paper_data_v3\figures\figure2\fig2_comparison_plots.ipynb
I also used this code to evaluate the results you provided at scaden_paper_data_v3\figures\figure2\scaden_predictions\scaden_predictions_donorC.txt
and got metrics very close to those in your paper:

Thus, I think the metric computation is alright. I will do experiments on other datasets.

Best,
Xiaohan

hathawayxxh commented on August 26, 2024

Yesterday this error disappeared without my knowing why, but today it is back. According to the solutions I found online, it is related to GPU memory usage (my GPU has 32 GB of memory). I have tried several methods, but none of them solved the problem. Could you check the code to see whether something is occupying the GPU memory? Thanks a lot.

KevinMenden commented on August 26, 2024

Hi,
if you have a look at the "training on" message, it has appended those white spaces to the datasets and thus probably didn't use them for training. So it would be good to check again with the proper dataset description.

Sorry I have never encountered that error before and I have been testing Scaden a lot with a 6 GB GPU ... there really isn't anything special in the code that could lead to this (not that I can think of at least).

hathawayxxh commented on August 26, 2024

Hi Kevin,

Thanks for your reply. Now the problems are solved and I can get results very close to those reported in your paper.

Thanks a lot.

Best,
Xiaohan

KevinMenden commented on August 26, 2024

Awesome!

I already made a new issue to remind me that I should add a warning if datasets are supplied which are not part of the training datasets. That's very easy to miss ....

Best,
Kevin
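
A check of the kind described above might look like the following sketch. It is not Scaden's implementation: the function name `select_training_datasets` and the warning text are assumptions made for illustration, and the dataset names are the ones mentioned in this thread.

```python
import warnings


def select_training_datasets(requested, available):
    """Return the requested datasets that exist and warn about the rest."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        # repr() makes stray whitespace in a name visible in the message.
        warnings.warn(
            f"Requested datasets not found in the training data: {unknown!r}"
        )
    return [name for name in requested if name in available]


available = {"data6k", "data8k", "donorA", "donorC"}

# The list with white spaces from earlier in the thread, split naively.
requested = "data6k, data8k, donorA".split(",")
selected = select_training_datasets(requested, available)
print(selected)  # ['data6k']
```

With such a warning, the two names that carry a leading space would be reported instead of being dropped silently.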
