I attempted to train a model with ImmuneSEQRearrangement data using ImmuneML. The proc

ImmuneML: parsing the specification... about immuneml HOT 9 CLOSED

uio-bmi commented on July 29, 2024

ImmuneML: parsing the specification...

from immuneml.

Comments (9)

LonnekeScheffer commented on July 29, 2024

Hi Genokarma,

Thanks for reaching out! You mentioned the process has been running for 5 hours, but depending on the size of the dataset and specific methods and parameters used, some processes may be very computationally expensive and can indeed run for a long time. Since I don't have more information on the analysis you are trying to run, I cannot help you determine what the cause of this long running time may be. If you would like my input on that, you're welcome to share the YAML analysis specification with me.

As for debugging the problem: You could try running a small example to ensure everything works, for example using only a small number of repertoires or sequences. There is also an automatic test instruction which can be run, to check if the immuneML installation works at all: https://docs.immuneml.uio.no/latest/installation/install_with_package_manager.html#testing-immuneml

Since this issue as of now does not point towards a concrete bug, I will close it for now. Feel free to reach out on [email protected] if you have more questions.

Genokarma commented on July 29, 2024

Hi LonnekeScheffer,
It's been more than 12 hours and it is still showing the same thing.
My sample details are: 15 breast cancer study samples and 30 control samples.
I have attached my YAML specifications here.
1. Converted the ImmunoSEQRearrangement data into ImmuneML format.
Script for conversion:
definitions:
  datasets:
    dataset:
      format: ImmunoSEQSample
      params:
        is_repertoire: true
        metadata_file: /data/metadata2.csv
        path: /data/DataS2/
        region_type: IMGT_CDR3
        result_path: /data/
instructions:
  my_dataset_generation_instruction:
    datasets:
    - dataset
    export_formats:
    - ImmuneML
    type: DatasetExport

2. Used the following script to train the model:
definitions:
  datasets:
    dataset:
      format: ImmuneML
      params:
        path: /data/
        result_path: /data/results
  encodings:
    encoding_1:
      KmerFrequency:
        k: 3
        reads: all
        sequence_encoding: CONTINUOUS_KMER
  ml_methods:
    k_nearest_neighbors:
      KNN:
        n_neighbors:
        - 3
        - 5
        - 7
        show_warnings: true
      model_selection_cv: true
      model_selection_n_folds: 5
    logistic_regression:
      LogisticRegression:
        C:
        - 0.01
        - 0.1
        - 1
        - 10
        - 100
        class_weight:
        - balanced
        penalty:
        - l1
        show_warnings: true
      model_selection_cv: true
      model_selection_n_folds: 5
    random_forest:
      RandomForestClassifier:
        class_weight:
        - balanced
        n_estimators:
        - 10
        - 50
        - 100
        show_warnings: true
      model_selection_cv: true
      model_selection_n_folds: 5
    support_vector_machine:
      SVC:
        C:
        - 0.01
        - 0.1
        - 1
        - 10
        - 100
        class_weight:
        - balanced
        dual: false
        penalty:
        - l1
        show_warnings: true
      model_selection_cv: true
      model_selection_n_folds: 5
  motifs: {}
  preprocessing_sequences: {}
  reports:
    benchmark:
      MLSettingsPerformance:
        name: benchmark
        single_axis_labels: false
        x_label_position: -0.12
        y_label_position: -0.08
    coefficients:
      Coefficients:
        coefs_to_plot:
        - N_LARGEST
        n_largest:
        - 25
        name: coefficients
  signals: {}
  simulations: {}
instructions:
  inst1:
    assessment:
      reports:
        models:
        - coefficients
      split_count: 5
      split_strategy: random
      training_percentage: 0.7
    dataset: dataset
    labels:
    - signal_disease
    metrics: []
    number_of_processes: 10
    optimization_metric: accuracy
    refit_optimal_model: true
    reports:
    - benchmark
    selection:
      split_count: 1
      split_strategy: random
      training_percentage: 0.7
    settings:
    - encoding: encoding_1
      ml_method: random_forest
      preprocessing: null
    - encoding: encoding_1
      ml_method: logistic_regression
      preprocessing: null
    - encoding: encoding_1
      ml_method: support_vector_machine
      preprocessing: null
    - encoding: encoding_1
      ml_method: k_nearest_neighbors
      preprocessing: null
    strategy: GridSearch
    type: TrainMLModel
output:
  format: HTML

LonnekeScheffer commented on July 29, 2024

Hi Genokarma,

I don't think there is necessarily any reason why this should not work. The dataset does not seem extremely large. For debugging purposes, I recommend the following steps:

1. Kill the existing run.
2. Make sure you have the latest version of immuneML installed.
3. Test if the immuneML installation works correctly according to the documentation: https://docs.immuneml.uio.no/latest/installation/install_with_package_manager.html#testing-immuneml
4. Try running the quickstart example: https://docs.immuneml.uio.no/latest/quickstart/cli_yaml.html
5. Try to run the TrainMLModel instruction with a minimal example, for instance, only running logistic regression, or perhaps a smaller dataset as well.

By following these steps, we can pinpoint where the issue might be (e.g., if there is something wrong with the installation, the computer setup, or the dataset). I don't believe there is a bug in immuneML that is causing this, since everything runs like normal on our end, but if we do find such indication we will of course fix it as soon as possible.
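
One way to get the "smaller dataset" for such a minimal run is to write a reduced copy of the metadata file that keeps only a few repertoires per class. The following is a rough sketch, not part of immuneML: the helper name `subsample_metadata` is hypothetical, and it assumes the metadata file is a plain CSV with one row per repertoire and a label column such as the `signal_disease` label used in the specification above.

```python
import csv
from collections import defaultdict


def subsample_metadata(src_path, dst_path, label_column, per_class=3):
    """Copy a metadata CSV, keeping at most `per_class` rows per label value.

    Hypothetical helper: assumes one row per repertoire and that
    `label_column` exists in the header.
    """
    kept = defaultdict(int)
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            value = row[label_column]
            if kept[value] < per_class:
                writer.writerow(row)
                kept[value] += 1
    # Number of rows kept per label value, e.g. to confirm both classes are present
    return dict(kept)
```

Pointing `metadata_file` at the reduced copy (for example, one written from `/data/metadata2.csv`) should then limit the run to the listed repertoires, assuming the import only reads the files named in the metadata.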

As a side note, it is not necessary to convert the dataset to immuneML format first (you can simply use the ImmunoSEQSample import in the same yaml as where the training happens), although it should work like this as well. Also, you have set the number of processes to 10, which may be alright, but please make sure the system you are running this on supports that number of CPUs (specifying too many processes can also slow down the runtime).
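
To check the point about the number of processes, a small sketch like the one below can compare the requested value with the CPUs the current process may use. The function names are hypothetical and this is not an immuneML feature. It relies only on the standard library; note that CPU quotas set on a Docker container may not be reflected in these numbers.

```python
import os


def available_cpus():
    """CPUs this process may use; falls back to the machine total."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def capped_processes(requested, available=None):
    """Return `requested`, limited to the available CPUs and at least 1."""
    if available is None:
        available = available_cpus()
    return max(1, min(requested, available))
```

For example, `capped_processes(10)` gives a value that could be used for `number_of_processes` instead of a hard-coded 10.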

Genokarma commented on July 29, 2024

Hi LonnekeScheffer,
I want to express my gratitude for your assistance; your time and efforts are highly appreciated. For your reference, I've attached my dataset files and YAML script. I am utilizing a Docker container, and the command details are provided in the attached README file. Link for dataset and yaml file is https://github.com/Genokarma/ImmuneMLTest

I've encountered an issue while running the process on two different systems. On my macOS system with 16 CPUs and 16GB RAM, the process gets stuck at "parsing the specification." On the Linux system with 48GB RAM, it encounters an issue with encoding (encoding 1...). It's been more than 24 hours with no progress.

I have attempted to troubleshoot the problem on both systems without success. Could you please attempt to execute the process or provide any suggestions to address this issue? Your assistance in resolving this issue is invaluable.

Genokarma commented on July 29, 2024

Hello again LonnekeScheffer,

I want to express my gratitude for your assistance; your time and efforts are highly appreciated. For your reference, I've attached my dataset files and YAML script. I am utilizing a Docker container, and the command details are provided in the attached README file. Link for the dataset and yaml file is: https://github.com/Genokarma/ImmuneMLTest

LonnekeScheffer commented on July 29, 2024

Dear GenoKarma,

Thanks for sharing the test dataset and YAML. I'm currently very busy (in preparation of my PhD defence), and I will have more time available in the last week of January. In the meantime, it would be helpful to try to run the test and Quickstart examples as mentioned in my previous comments. These examples are small and known to take only a short time to run, and can help us find an indication of whether immuneML is actually getting "stuck" on your system, or simply takes a long time to run.

Genokarma commented on July 29, 2024

Thank you for your prompt response and for sharing the information. I completely understand that you're currently occupied with your PhD defense preparations. Wishing you the best of luck with your PhD defense.

I used the demo data during the installation process. In the meantime, I took your advice and ran the Quickstart example again as per your previous suggestions. I'm pleased to inform you that the quickstart/demo went smoothly, and the process was completed successfully. I have attached the screenshots for your reference.

I look forward to connecting with you again in the last week of January.

LonnekeScheffer commented on July 29, 2024

Dear GenoKarma,

My apologies for the delay, it was a busy period. But I have good news: I finally managed to take a deeper look into this issue and implement a solution. I cloned your GitHub repository and tried to reproduce your immuneML run. I indeed discovered two issues, one bug and one performance issue, both of which were introduced during our recent large refactoring for the alpha version of immuneML 3.

Firstly, there was a bug in KmerFrequencyEncoder due to some changed variable names. If you encountered this bug, you would run into the following error message:

--- Exception in _encode_examples : 'SequenceMetadata' object has no attribute 'count'

This bug was solved in the latest version of immuneML. I originally thought that this bug may have been the culprit for your analysis. But when I tried to run immuneML on your entire dataset, I found that I did not even reach the error above, because immuneML was taking a long time at an earlier step in the encoding process. I was able to locate and fix the issue, so the KmerFrequencyEncoder should now be a lot faster. With the updated code, encoding your dataset with 4 parallel processes took 8 minutes on my computer.
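
For readers unfamiliar with the encoding step discussed here, the sketch below shows the general idea of continuous k-mer frequencies in plain Python. It is only an illustration and not immuneML's KmerFrequencyEncoder: the function names are made up, the normalisation is a simple relative frequency (an assumption), and per-sequence read counts, which the `reads` setting presumably refers to, are ignored.

```python
from collections import Counter


def continuous_kmers(sequence, k=3):
    """All overlapping substrings of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_frequencies(sequences, k=3):
    """Relative frequency of each k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(continuous_kmers(seq, k))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Each repertoire becomes one such frequency vector, so the cost of this step grows with the total number of sequences across all repertoires.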

So in conclusion, if you reinstall the latest version of immuneML (version 3.0.0a3), encoding will be a lot faster. Since immuneML 3 is still in its 'alpha' version, there have been major refactorings and ongoing developments which have not yet been thoroughly tested. We therefore highly appreciate the user feedback, and I will try my best to resolve issues as soon as I can. However, if some issue is halting your work, it is always possible to downgrade to the latest stable immuneML release (v2.2.5).
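
To confirm which version is actually installed (for example inside a Docker container), something along these lines can be run with the Python interpreter that runs immuneML. It uses only the standard library (Python 3.8+) and assumes the package was installed under the distribution name `immuneML`; the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution name; prints None if it is not installed here
    print(installed_version("immuneML"))
```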

All the best,
Lonneke

Genokarma commented on July 29, 2024

Hi Lonneke,

Hope your viva went well! Thank you for reaching out.

Yes, I have tried your suggestions with both the newer and the stable versions. However, I am not able to complete a run, as the newer version produces a different error. I have attached log.txt for your reference. Have you prepared a Docker image for the newer version of ImmuneML? If so, please share it with me.

log.txt
Once again, thank you.
With regards.
