Comments (19)

KevinMenden avatar KevinMenden commented on August 26, 2024

Dear Xiaohan,

Scaden has a --training_datasets option, which takes a comma-separated list of datasets. This way you can decide which datasets to use for training. This functionality was implemented essentially for these experiments.

If you want to repeat these experiments, I would encourage you to start from the raw data. I did not do any special processing to the data (PBMC1 and PBMC2), other than merging some cell types together (described in the paper). If you want to generate some datasets yourself, you can also have a look at the processing scripts I provided on figshare:
https://figshare.com/projects/Scaden/62834

Some of the datasets used in the study are under some sort of restricted access or I got them directly from the authors - so I can't just share them, sorry.
I know some of the datasets are a bit tricky to find, so just let me know where exactly you are struggling and I might be able to help you out.

Cheers,
Kevin

from scaden.

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

Thanks a lot for your reply. I tried the option "scaden --training_datasets", but it reports the following error:
[image: screenshot of the error, not preserved]
I have downloaded the datasets from the webtool:
[image: screenshot, not preserved]
but the train-test separation is not clear to me. For example, if I want to train the model on "data6k, data8k, donorA" and test it on the dataset "donorC", what command should I use?

Best,
Xiaohan

KevinMenden avatar KevinMenden commented on August 26, 2024

Hi Xiaohan,

It should be usable when calling scaden train. For instance, when you call scaden train --help you get:

     ____                _            
    / ___|  ___ __ _  __| | ___ _ __  
    \___ \ / __/ _` |/ _` |/ _ \ '_ \ 
     ___) | (_| (_| | (_| |  __/ | | |
    |____/ \___\__,_|\__,_|\___|_| |_|

    
Usage: scaden train [OPTIONS] <training data>

  Train a Scaden model

Options:
  --train_datasets TEXT  Comma-separated list of datasets used for training.
                         Uses all by default.

  --model_dir TEXT       Path to store the model in
  --batch_size INTEGER   Batch size to use for training. Default: 128
  --learning_rate FLOAT  Learning rate used for training. Default: 0.0001
  --steps INTEGER        Number of training steps
  --seed INTEGER         Set random seed
  --help                 Show this message and exit.

You would indicate the datasets you want to use for training like you did, and store the model with --model_dir.
Then you can use this model directory during prediction, using scaden predict --model_dir <your_model_dir>:

Usage: scaden predict [OPTIONS] <prediction data>

  Predict cell type composition using a trained Scaden model

Options:
  --model_dir TEXT  Path to trained model
  --outname TEXT    Name of predictions file.
  --seed INTEGER    Set random seed
  --help            Show this message and exit.

Let me know if that works!
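
The train and predict steps above can be sketched as a small Python helper that only assembles the two command lines. It assumes nothing beyond the flags shown in the help output (--train_datasets, --model_dir, --steps, --outname); the function names and the file names in the example are hypothetical and not part of Scaden.

```python
import shlex


def build_train_cmd(training_data, datasets, model_dir, steps=None):
    """Assemble a `scaden train` command using only flags from the help text."""
    cmd = ["scaden", "train",
           "--train_datasets", ",".join(datasets),  # comma-separated, no spaces
           "--model_dir", model_dir]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    cmd.append(training_data)
    return cmd


def build_predict_cmd(prediction_data, model_dir, outname=None):
    """Assemble a `scaden predict` command that reuses the trained model dir."""
    cmd = ["scaden", "predict", "--model_dir", model_dir]
    if outname is not None:
        cmd += ["--outname", outname]
    cmd.append(prediction_data)
    return cmd


# Hypothetical file names, for illustration only.
train_cmd = build_train_cmd("training_data.h5ad",
                            ["data6k", "data8k", "donorA"],
                            "model", steps=5000)
predict_cmd = build_predict_cmd("bulk_expression.txt", "model")

# Print shell-ready strings; pass the lists to subprocess.run to execute them.
print(shlex.join(train_cmd))
print(shlex.join(predict_cmd))
```

Building the dataset argument with ",".join keeps stray spaces out of the list, and keeping the commands as lists avoids shell quoting problems if they are later run with subprocess.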

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

I used the command "scaden train "./pbmc_data.h5ad" --train_datasets 'data6k, data8k, donorA' --steps 5000 --model_dir model", and it finally works. However, it reports another error:

[image: screenshot of the error, not preserved]

Do you know how to solve it?

KevinMenden avatar KevinMenden commented on August 26, 2024

Did you run scaden process before?

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Is the following command correct?
scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/"

It runs into another error:
[image: screenshot of the error, not preserved]

KevinMenden avatar KevinMenden commented on August 26, 2024

Not quite, you need to point it to the file (expression matrix) you want to run prediction on:

Usage: scaden process [OPTIONS] <training dataset to be processed> <data for
                      prediction>

  Process a dataset for training

Options:
  --processed_path TEXT  Path of processed file. Must end with .h5ad
  --var_cutoff FLOAT     Filter out genes with a variance less than the
                         specified cutoff. A low cutoff is recommended,this
                         should only remove genes that are obviously
                         uninformative.

  --help                 Show this message and exit.

Have a look at the demo with example data simulation that I provide in the README.md for all the steps you need to do to perform training and prediction. It also generates example data which you can inspect to check if your data is formatted correctly:

https://github.com/KevinMenden/scaden/blob/master/README.md
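
As a rough sketch of the scaden process usage shown above, the following Python helper assembles the command from its two positional arguments and the optional --processed_path flag. The only rule it enforces is the one stated in the help text (the processed path must end with .h5ad); the function name and the example file names are hypothetical.

```python
from pathlib import Path


def build_process_cmd(training_h5ad, prediction_file, processed_path=None):
    """Assemble `scaden process <training dataset> <data for prediction>`."""
    cmd = ["scaden", "process"]
    if processed_path is not None:
        # The help text states that the processed path must end with .h5ad.
        if Path(processed_path).suffix != ".h5ad":
            raise ValueError("processed_path must end with .h5ad")
        cmd += ["--processed_path", processed_path]
    # First the training dataset, then the expression matrix to predict on.
    cmd += [training_h5ad, prediction_file]
    return cmd


# Hypothetical file names, for illustration only.
print(build_process_cmd("training_data.h5ad", "bulk_expression.txt",
                        processed_path="processed.h5ad"))
```

The second positional argument is the expression matrix you want predictions for, not the training file again.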

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

In my understanding, the "data simulation" part is only needed when I want to generate a new training dataset. Is that right?
If I want to train with 'data6k, data8k, donorA' and test with 'donorC', I think all of these datasets are contained in the 'pbmc_data.h5ad' file. So should I use "scaden process "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad" "/apdcephfs/share_1364273/shared_info/xiaohanxing/sc_deconv_datasets/pbmc_data.h5ad""?
I am a bit confused.

KevinMenden avatar KevinMenden commented on August 26, 2024

Ahh I see. No, you need a text file containing the gene expression.
Have a look at the dataset I shared earlier and download this one:
https://figshare.com/articles/software/Publication_Figures/8234030

For the simulated PBMC data there are some expression matrices inside. Specifically:
paper_data_v3/figures/figure2/data

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

It finally works. Thanks a lot for your help.
I will reimplement the experiments and compare them with the results in your paper.

Best,
Xiaohan

KevinMenden avatar KevinMenden commented on August 26, 2024

Perfect, let me know if you encounter any other issues! :)

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

Sorry for bothering you again. I have trained a model with "data6k, data8k, donorA" and tested it on "donorC_500_samples.txt". However, the result is quite different from the one in your paper. The following is my result on the donorC dataset:
[image: my_donorC_metric]

My operations are:

  1. preprocess data: scaden process "./pbmc_data.h5ad" "./donorC_500_samples.txt"
  2. train model: scaden train "processed.h5ad" --train_datasets 'data6k, data8k, donorA' --steps 5000 --model_dir model
  3. test model: scaden predict --model_dir model donorC_500_samples.txt

Is there anything wrong with my commands? Maybe the training data and testing data are not matched in their distributions?

Looking forward to your reply.

Best,
Xiaohan

KevinMenden avatar KevinMenden commented on August 26, 2024

Dear Xiaohan,

nice that you could get it running.
The results look pretty normal at first glance, but you're right that the RMSE should probably be a bit lower and the CCC higher.
How did you calculate those values?
It would be nice to also calculate them over all data points (which is the main metric we used), and not only by cell type.
And maybe run it for another one of those datasets. I'll replicate those steps myself when I get to it, to make sure nothing is wrong with the training dataset.
But I'm not sure whether I will get around to doing this in the next few days.

Best,
Kevin
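
To illustrate the difference between per-cell-type and all-data-point metrics mentioned above, here is a minimal pure-Python sketch of RMSE and Lin's concordance correlation coefficient (CCC). It is not the code from the paper's notebook; the function names and the cell-type fractions are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def flatten(per_cell_type):
    """Concatenate the per-cell-type value lists into one list."""
    return [v for values in per_cell_type.values() for v in values]


# Made-up fractions for three samples and two cell types.
truth = {"Bcells": [0.10, 0.20, 0.30], "Monocytes": [0.90, 0.80, 0.70]}
pred = {"Bcells": [0.12, 0.18, 0.33], "Monocytes": [0.88, 0.82, 0.67]}

# Per cell type: one value per cell type.
for cell_type in truth:
    print(cell_type, rmse(truth[cell_type], pred[cell_type]),
          ccc(truth[cell_type], pred[cell_type]))

# All data points: pool every (sample, cell type) pair before computing.
print("all", rmse(flatten(truth), flatten(pred)),
      ccc(flatten(truth), flatten(pred)))
```

Pooling all data points usually gives a different CCC than averaging per-cell-type values, so both sides of a comparison need to use the same convention.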

KevinMenden avatar KevinMenden commented on August 26, 2024

Could you show me the output of Scaden after you type scaden train ... ?
It usually tells you which datasets were used for training. I have a feeling that it didn't use all the datasets but just one, as you didn't supply a plain comma-separated list but one that also contains white spaces. So if you could try again using
'data6k,data8k,donorA'
instead of
'data6k, data8k, donorA'

that would be great :)
Let me know if that helps!
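
The whitespace problem described above can be reproduced and avoided in a few lines of Python. This is only a sketch, assuming the list is split on commas without trimming (which is what the behaviour in this thread suggests); normalize_dataset_list is a hypothetical helper, not a Scaden function.

```python
def normalize_dataset_list(raw):
    """Strip whitespace around each comma-separated name; drop empty entries."""
    names = [name.strip() for name in raw.split(",")]
    return ",".join(name for name in names if name)


raw = "data6k, data8k, donorA"

# A plain split keeps the leading spaces, so ' data8k' does not equal 'data8k'.
print(raw.split(","))

# After normalization the names match the dataset labels exactly.
print(normalize_dataset_list(raw))
```

Running the list through such a helper before passing it to --train_datasets removes the stray spaces regardless of how the list was typed.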

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

Thanks for your reply.
The information looks correct. It seems the model is trained on the three datasets. But I will try to remove the white spaces as you suggested.

[image: screenshot of the training output, not preserved]

For the calculation of the metrics, I used the code you provided in scaden_paper_data_v3\figures\figure2\fig2_comparison_plots.ipynb.
I also used this code to evaluate the results you provided at scaden_paper_data_v3\figures\figure2\scaden_predictions\scaden_predictions_donorC.txt
and got metrics very close to those in your paper:
[image: paper_donorC_metric]
Thus, I think the metric computation is alright. I will run experiments on other datasets.

Best,
Xiaohan

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Yesterday this error disappeared without my knowing why, but today it has come back. According to solutions I found on the internet, it is related to GPU memory usage (my GPU has 32 GB of memory). I have tried several methods but could not solve the problem. Could you help check the code to see whether something is occupying the GPU memory? Thanks a lot.
[image: screenshot of the error, not preserved]

KevinMenden avatar KevinMenden commented on August 26, 2024

Hi,
if you have a look at the "training on" message, it has kept those white spaces as part of the dataset names and thus probably didn't use those datasets for training. So it would be good to check again with the proper dataset list.

Sorry, I have never encountered that error before, and I have been testing Scaden a lot with a 6 GB GPU ... there really isn't anything special in the code that could lead to this (not that I can think of, at least).

hathawayxxh avatar hathawayxxh commented on August 26, 2024

Hi Kevin,

Thanks for your reply. Now the problems are solved and I can get results very close to those reported in your paper.

Thanks a lot.

Best,
Xiaohan

KevinMenden avatar KevinMenden commented on August 26, 2024

Awesome!

I already made a new issue to remind me that I should add a warning if datasets are supplied which are not part of the training datasets. That's very easy to miss ....

Best,
Kevin
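
One way such a warning could look is sketched below in Python. This is purely illustrative and not Scaden's actual implementation; check_requested_datasets and its arguments are hypothetical names.

```python
import warnings


def check_requested_datasets(requested, available):
    """Warn about requested names that are missing; return the usable ones."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        # repr() makes stray whitespace such as ' data8k' visible in the message.
        warnings.warn(
            "Datasets not found in the training data and ignored: "
            + ", ".join(repr(name) for name in unknown)
        )
    return [name for name in requested if name in available]


# Names as they would result from splitting 'data6k, data8k, donorA' on commas.
requested = "data6k, data8k, donorA".split(",")
available = ["data6k", "data8k", "donorA", "donorC"]
print(check_requested_datasets(requested, available))
```

Quoting the unknown names with repr() is a deliberate choice here, since the mismatch in this thread came from leading spaces that are otherwise easy to overlook.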
