Hi, I followed the guide to retrain DeepVariant in here: [link truncated]

error in training DeepVariant about deepvariant HOT 13 CLOSED

sophienguyen01 commented on June 8, 2024

error in training DeepVariant

from deepvariant.

Comments (13)

lucasbrambrink commented on June 8, 2024

I'm seeing an OOM in the logs:

OP_REQUIRES failed at conv_ops.cc:698 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[16384,32,37,110]

It also shows your training params:

Training Examples: 8264746
Batch Size: 16384
Epochs: 1
Steps per epoch: 504
Steps per tune: 1500000
Num train steps: 504

It seems that --config.batch_size=512 is not being picked up: the log reports a batch size of 16384. This could be related to setting num_epochs=0, so try changing that back to the original 10. If that doesn't work, you could edit batch_size in dv_config.py directly.
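As a rough cross-check of the numbers in the log, the sketch below recomputes the steps per epoch for both batch sizes and estimates the size of the tensor that failed to allocate. It assumes partial batches are dropped (floor division) and that the tensor holds 4-byte float32 values; neither is confirmed by the log, and the helper names are made up for illustration.

```python
# Numbers copied from the training log quoted above.
TRAINING_EXAMPLES = 8_264_746
LOGGED_STEPS_PER_EPOCH = 504


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Assumption: partial batches are dropped, so floor division is used."""
    return num_examples // batch_size


def tensor_gib(shape, bytes_per_element: int = 4) -> float:
    """Size of a dense tensor in GiB. Assumption: float32 (4 bytes per element)."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bytes_per_element / 2**30


for batch_size in (512, 16384):
    steps = steps_per_epoch(TRAINING_EXAMPLES, batch_size)
    matches = steps == LOGGED_STEPS_PER_EPOCH
    print(f"batch_size={batch_size}: {steps} steps/epoch, matches log: {matches}")

# Shape taken from the OOM message: [16384, 32, 37, 110]
oom_gib = tensor_gib((16384, 32, 37, 110))
print(f"OOM tensor: about {oom_gib:.2f} GiB")
```

Under these assumptions, only a batch size of 16384 reproduces the logged 504 steps per epoch, and the single tensor in the OOM message would need close to 8 GiB, which is consistent with the command-line batch size not taking effect.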

Let me know if that helps!

sophienguyen01 commented on June 8, 2024

Hi @lucasbrambrink,

I actually tried different batch sizes (32 and 512), but batch_size=32 takes longer, so I switched to 512. I also tried num_epochs=10 but still encountered the same error. I just updated my error log file with the error "No checkpoint found".

danielecook commented on June 8, 2024

@sophienguyen01 can you try to run this again without using --debug=true? That flag runs TensorFlow in eager mode, which is very inefficient.

The other issue is that you don't have a checkpoint file because you didn't train long enough - and no checkpoint outperformed the existing performance on your tune dataset.

Try re-running with --debug=false and --config.num_epochs=10 and see where that gets you. If you get an OOM error with batch_size=512, reduce it and try again.

If training produces a better model, it will be output in the experiment_dir.

sophienguyen01 commented on June 8, 2024

Hi @danielecook ,
I tried without --debug=true and with --config.num_epochs=10, but I still get the same error. I attached my log file here:

train_040324.log

This is the command I used:

BIN_VERSION="1.6.1"
DOCKER_IMAGE="google/deepvariant:${BIN_VERSION}"

time sudo docker run --gpus 1 \
    -v /home/${USER}:/home/${USER} \
    -w /home/${USER} \
    ${DOCKER_IMAGE}-gpu \
    train \
    --config=s3-mount/deepvariant_training/script/dv_config.py:base \
    --config.train_dataset_pbtxt="${SHUFFLE_DIR}/training_set.dataset_config.pbtxt" \
    --config.tune_dataset_pbtxt="${SHUFFLE_DIR}/validation_set.dataset_config.pbtxt" \
    --config.init_checkpoint="${GCS_PRETRAINED_WGS_MODEL}" \
    --config.num_epochs=10 \
    --config.learning_rate=0.02 \
    --config.num_validation_examples=0 \
    --experiment_dir="model_train" \
    --strategy=mirrored \
    --config.batch_size=512

Did I miss anything?

danielecook commented on June 8, 2024

@sophienguyen01 - from the log file it looks like everything worked.

Here are all the tune/categorical_accuracy values from your training log.

tune/categorical_accuracy=0.9944317936897278
tune/categorical_accuracy=0.9909400343894958
tune/categorical_accuracy=0.9915463924407959
tune/categorical_accuracy=0.9925118088722229
tune/categorical_accuracy=0.9921825528144836
tune/categorical_accuracy=0.9924613237380981
tune/categorical_accuracy=0.9926846623420715
tune/categorical_accuracy=0.9929667711257935
tune/categorical_accuracy=0.9925829172134399
tune/categorical_accuracy=0.9926416277885437
tune/categorical_accuracy=0.9923893213272095
tune/categorical_accuracy=0.9925225377082825

The first number is the accuracy of the pretrained model itself, before any further training. Since none of the subsequent tuning evaluations outperformed it, no checkpoints were created.
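The comparison described above can be reproduced with a few lines of Python. This is only an illustration of the "save a checkpoint only when the tune metric improves" logic as explained in this thread, using the values copied from the list; it is not DeepVariant's actual checkpointing code, and the variable names are made up.

```python
# tune/categorical_accuracy values copied from the log, in order.
accuracies = [
    0.9944317936897278,  # evaluation of the pretrained model (baseline)
    0.9909400343894958,
    0.9915463924407959,
    0.9925118088722229,
    0.9921825528144836,
    0.9924613237380981,
    0.9926846623420715,
    0.9929667711257935,
    0.9925829172134399,
    0.9926416277885437,
    0.9923893213272095,
    0.9925225377082825,
]

baseline, tuned = accuracies[0], accuracies[1:]
best_tuned = max(tuned)
best_index = tuned.index(best_tuned) + 1  # 1-based position among tuning evaluations

# A checkpoint would only be written if a tuning evaluation beat the baseline.
improved = best_tuned > baseline

print(f"baseline:   {baseline:.6f}")
print(f"best tuned: {best_tuned:.6f} (evaluation {best_index})")
print(f"gap:        {baseline - best_tuned:.6f}")
print(f"any improvement over baseline: {improved}")
```

The best tuning evaluation stays about 0.0015 below the baseline, so under this logic no checkpoint is expected.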

One thing you could try: reduce the learning rate, and see if that helps.

pichuan commented on June 8, 2024

Hi @sophienguyen01 , let me know if you have a chance to try and provide some updates here. Thanks!

sophienguyen01 commented on June 8, 2024

Hi Pichuan,

You can close this issue now. I tried lowering the learning rate, but the result still does not exceed the performance of the default model.

I will have to train on different samples.

Thanks

sophienguyen01 commented on June 8, 2024

Hi @pichuan,

I trained on a new dataset and ran into a similar issue. This time files are created in the checkpoints directory, but I still get the same error. Only the first epoch has a low tune/categorical_accuracy; in the remaining epochs the accuracy is higher than 0.9. I attached the log file here:
train_041924.log

Here are the parameters I used to train:

    --config.learning_rate=0.0001 \
    --config.num_validation_examples=0 \
    --experiment_dir="model_train" \
    --strategy=mirrored \
    --config.batch_size=32 \

Would you take a look and let me know what's going wrong? Thank you

pichuan commented on June 8, 2024

Hi @sophienguyen01 ,
Is there a reason why you're setting --config.num_validation_examples=0? You'll need to have a reasonable amount of num_validation_examples for the model to be able to evaluate and pick a reasonable checkpoint.

sophienguyen01 commented on June 8, 2024

According to dv_config.py:

  # If set to 0, use full validation dataset.
  config.num_validation_examples = 0

The training tutorial also uses --config.num_validation_examples=0.
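To make the quoted comment concrete, here is a minimal sketch of how a "0 means use everything" setting could be resolved. The function name and the capping behaviour for non-zero values are assumptions for illustration only; this is not taken from the DeepVariant source.

```python
def effective_validation_examples(num_validation_examples: int, dataset_size: int) -> int:
    """Hypothetical helper: resolve the number of tune examples actually evaluated.

    Mirrors the comment in dv_config.py: 0 means "use the full validation dataset".
    Assumption: a non-zero value is capped at the dataset size.
    """
    if num_validation_examples < 0:
        raise ValueError("num_validation_examples must be >= 0")
    if num_validation_examples == 0:
        return dataset_size
    return min(num_validation_examples, dataset_size)


# With the setting used in this thread, every tune example is evaluated.
print(effective_validation_examples(0, 100_000))
print(effective_validation_examples(1_500, 100_000))
```

Read this way, --config.num_validation_examples=0 does not disable validation; it evaluates on the whole tune set.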

kishwarshafin commented on June 8, 2024

Hi @sophienguyen01, can you point out the specific line that contains the error? All the lines in the logs are API warnings, which you can safely ignore.

danielecook commented on June 8, 2024

@sophienguyen01 the logs indicate checkpoints are output:

I0423 18:41:59.026870 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.9114237 step=3352 epoch=1 path=model_train/checkpoints/ckpt-3352
I0423 18:44:53.215049 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.91949123 step=6704 epoch=2 path=model_train/checkpoints/ckpt-6704
I0423 18:47:47.292658 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.92320794 step=10056 epoch=3 path=model_train/checkpoints/ckpt-10056

But as @kishwarshafin suggests, the warnings at the end are normal and can be ignored.
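When a log is long and mostly warnings, a small script can pull out just the checkpoint lines. The sketch below parses the three lines quoted above with a regular expression; the pattern is written against exactly this line format, which may differ in other DeepVariant versions, and the names are illustrative.

```python
import re

# The three "Saved checkpoint" lines quoted above.
LOG = """\
I0423 18:41:59.026870 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.9114237 step=3352 epoch=1 path=model_train/checkpoints/ckpt-3352
I0423 18:44:53.215049 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.91949123 step=6704 epoch=2 path=model_train/checkpoints/ckpt-6704
I0423 18:47:47.292658 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.92320794 step=10056 epoch=3 path=model_train/checkpoints/ckpt-10056
"""

# Assumption: the line format matches the quoted log exactly.
PATTERN = re.compile(
    r"Saved checkpoint (?P<metric>[^=\s]+)=(?P<value>[0-9.]+) "
    r"step=(?P<step>\d+) epoch=(?P<epoch>\d+) path=(?P<path>\S+)"
)

checkpoints = [
    {
        "metric": m.group("metric"),
        "value": float(m.group("value")),
        "step": int(m.group("step")),
        "epoch": int(m.group("epoch")),
        "path": m.group("path"),
    }
    for m in PATTERN.finditer(LOG)
]

# Pick the checkpoint with the highest tune metric.
best = max(checkpoints, key=lambda c: c["value"])

for c in checkpoints:
    print(f"epoch {c['epoch']}: {c['metric']}={c['value']} -> {c['path']}")
print(f"best checkpoint: {best['path']}")
```

To run it on a real log, replace LOG with the contents of the log file, for example by reading train_041924.log from disk.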

sophienguyen01 commented on June 8, 2024

Thank you for your input, I am able to find the checkpoints with this training.
