
Comments (13)

lucasbrambrink commented on June 8, 2024

I'm seeing an OOM in the logs:

OP_REQUIRES failed at conv_ops.cc:698 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[16384,32,37,110]
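For a rough sense of scale, the size of the tensor named in that message can be estimated from its shape. This is only a sketch: it assumes 4-byte float32 elements, which the log line does not state.

```python
import math

# Shape copied from the OOM message above.
shape = (16384, 32, 37, 110)

# Assumption: float32 elements (4 bytes each); the log does not state the dtype.
BYTES_PER_ELEMENT = 4

num_elements = math.prod(shape)
size_gib = num_elements * BYTES_PER_ELEMENT / 2**30

print(f"{num_elements:,} elements, about {size_gib:.2f} GiB")
```

Under that assumption this single activation tensor is roughly 7.95 GiB. Since the first dimension is the batch size, the same tensor at a batch size of 512 would be 32 times smaller, about 0.25 GiB.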

It also shows your training params:

Training Examples: 8264746
Batch Size: 16384
Epochs: 1
Steps per epoch: 504
Steps per tune: 1500000
Num train steps: 504

It seems that --config.batch_size=512 is not being picked up. This could be related to setting num_epochs=0; try changing that back to the original 10. If that doesn't work, you could edit the batch_size in dv_config.py directly.
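The logged "Steps per epoch: 504" is what points to a batch size of 16384 rather than 512. A quick check, assuming steps per epoch is the number of training examples divided by the batch size and rounded down (a relationship inferred from the numbers in the log, not taken from the DeepVariant source; the function name is hypothetical):

```python
# Value copied from the "Training Examples" line in the log.
train_examples = 8264746


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Assumed relationship: full batches only, so floor division."""
    return num_examples // batch_size


print(steps_per_epoch(train_examples, 16384))  # 504, matches the log
print(steps_per_epoch(train_examples, 512))    # 16142, expected if 512 were applied
```

If the override had taken effect, the log would be expected to show a much larger step count per epoch.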

Let me know if that helps!

from deepvariant.

sophienguyen01 commented on June 8, 2024

Hi @lucasbrambrink,

I actually tried different batch_size values (32 and 512), but the smaller batch_size takes longer, so I switched to 512. I also tried num_epochs=10 but still encountered the same error. I just updated my error log file with the error "No checkpoint found".


danielecook commented on June 8, 2024

@sophienguyen01 can you try running this again without --debug=true? That flag runs TensorFlow in eager mode, which is very inefficient.

The other issue is that you don't have a checkpoint file because you didn't train long enough - and no checkpoint outperformed the existing performance on your tune dataset.

Try re-running with --debug=false and --config.num_epochs=10 and see where that gets you. If you get an OOM error with batch_size=512, reduce it and try again.

If training produces a better model, it will be output in the experiment_dir.


sophienguyen01 commented on June 8, 2024

Hi @danielecook,
I ran without --debug=true and set --config.num_epochs=10, but I still get the same error. I attached my log file here:

train_040324.log

This is the command I used:

BIN_VERSION="1.6.1"
DOCKER_IMAGE="google/deepvariant:${BIN_VERSION}"

time sudo docker run --gpus 1 \
    -v /home/${USER}:/home/${USER} \
    -w /home/${USER} \
    ${DOCKER_IMAGE}-gpu \
    train \
    --config=s3-mount/deepvariant_training/script/dv_config.py:base \
    --config.train_dataset_pbtxt="${SHUFFLE_DIR}/training_set.dataset_config.pbtxt" \
    --config.tune_dataset_pbtxt="${SHUFFLE_DIR}/validation_set.dataset_config.pbtxt" \
    --config.init_checkpoint="${GCS_PRETRAINED_WGS_MODEL}" \
    --config.num_epochs=10 \
    --config.learning_rate=0.02 \
    --config.num_validation_examples=0 \
    --experiment_dir="model_train" \
    --strategy=mirrored \
    --config.batch_size=512

Did I miss anything?


danielecook commented on June 8, 2024

@sophienguyen01 - from the log file it looks like everything worked.

Here are all the tune/categorical_accuracy values from your training run.

tune/categorical_accuracy=0.9944317936897278
tune/categorical_accuracy=0.9909400343894958
tune/categorical_accuracy=0.9915463924407959
tune/categorical_accuracy=0.9925118088722229
tune/categorical_accuracy=0.9921825528144836
tune/categorical_accuracy=0.9924613237380981
tune/categorical_accuracy=0.9926846623420715
tune/categorical_accuracy=0.9929667711257935
tune/categorical_accuracy=0.9925829172134399
tune/categorical_accuracy=0.9926416277885437
tune/categorical_accuracy=0.9923893213272095
tune/categorical_accuracy=0.9925225377082825

The first number represents the accuracy obtained directly from the pretrained model. Since none of the subsequent tuning evaluations outperformed the original, no checkpoints were created.
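The rule described here can be illustrated with a minimal sketch of a "save only if better than the best so far" check, applied to the values listed above. This is not DeepVariant's training code; the function name is hypothetical and the first value is treated as the pretrained baseline.

```python
# tune/categorical_accuracy values copied from the list above.
tune_accuracies = [
    0.9944317936897278,  # baseline from the pretrained model
    0.9909400343894958,
    0.9915463924407959,
    0.9925118088722229,
    0.9921825528144836,
    0.9924613237380981,
    0.9926846623420715,
    0.9929667711257935,
    0.9925829172134399,
    0.9926416277885437,
    0.9923893213272095,
    0.9925225377082825,
]


def improved_evaluations(metrics):
    """Return (index, value) for every evaluation that beats the best so far.

    Hypothetical helper: metrics[0] is the baseline, and only strictly
    better evaluations would trigger a checkpoint save.
    """
    best = metrics[0]
    improved = []
    for index, value in enumerate(metrics[1:], start=1):
        if value > best:
            best = value
            improved.append((index, value))
    return improved


print(improved_evaluations(tune_accuracies))  # []
```

The empty result matches the observed behavior: no evaluation exceeded the baseline of about 0.9944, so nothing would have been saved under this rule.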

One thing you could try: reduce the learning rate, and see if that helps.


pichuan commented on June 8, 2024

Hi @sophienguyen01 , let me know if you have a chance to try and provide some updates here. Thanks!


sophienguyen01 commented on June 8, 2024

Hi Pichuan,

You can close this issue now. I tried lowering the learning rate, but the result still does not exceed the performance of the default model.

I will have to train on different samples.

Thanks


sophienguyen01 commented on June 8, 2024

Hi @pichuan,

I trained on a new dataset and ran into a similar issue. This time files are created in the checkpoints directory, but I still get the same error. Only the first epoch has a low tune/categorical_accuracy; in the remaining epochs the accuracy is higher than 0.9. I attached the log file here:
train_041924.log

Here are the parameters I used to train:

    --config.learning_rate=0.0001 \
    --config.num_validation_examples=0 \
    --experiment_dir="model_train" \
    --strategy=mirrored \
    --config.batch_size=32 \

Would you take a look and let me know what's going wrong? Thank you


pichuan commented on June 8, 2024

Hi @sophienguyen01 ,
Is there a reason why you're setting --config.num_validation_examples=0? You'll need to have a reasonable amount of num_validation_examples for the model to be able to evaluate and pick a reasonable checkpoint.


sophienguyen01 commented on June 8, 2024

According to dv_config.py:

  # If set to 0, use full validation dataset.
  config.num_validation_examples = 0

The training tutorial also uses --config.num_validation_examples=0.


kishwarshafin commented on June 8, 2024

Hi @sophienguyen01, can you point out the specific line of the error? All the lines in the logs are API warnings; you can safely ignore those.


danielecook commented on June 8, 2024

@sophienguyen01 the logs indicate checkpoints are output:

I0423 18:41:59.026870 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.9114237 step=3352 epoch=1 path=model_train/checkpoints/ckpt-3352
I0423 18:44:53.215049 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.91949123 step=6704 epoch=2 path=model_train/checkpoints/ckpt-6704
I0423 18:47:47.292658 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.92320794 step=10056 epoch=3 path=model_train/checkpoints/ckpt-10056

But as @kishwarshafin suggests, the warnings at the end are normal and can be ignored.
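To confirm which checkpoints were written without reading the whole log by eye, the "Saved checkpoint" lines can be pulled out with a small script. This is a sketch that assumes the log line format shown above; the regular expression and function name are illustrative, not part of DeepVariant.

```python
import re

# Sample lines copied from the log excerpt above.
log_lines = [
    "I0423 18:41:59.026870 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.9114237 step=3352 epoch=1 path=model_train/checkpoints/ckpt-3352",
    "I0423 18:44:53.215049 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.91949123 step=6704 epoch=2 path=model_train/checkpoints/ckpt-6704",
    "I0423 18:47:47.292658 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.92320794 step=10056 epoch=3 path=model_train/checkpoints/ckpt-10056",
]

# Assumed format: "Saved checkpoint <metric>=<value> step=<n> epoch=<n> path=<path>".
SAVED_CHECKPOINT = re.compile(
    r"Saved checkpoint (?P<metric>[^\s=]+)=(?P<value>[0-9.]+) "
    r"step=(?P<step>\d+) epoch=(?P<epoch>\d+) path=(?P<path>\S+)"
)


def saved_checkpoints(lines):
    """Return one dict per 'Saved checkpoint' line; other lines are skipped."""
    found = []
    for line in lines:
        match = SAVED_CHECKPOINT.search(line)
        if match:
            found.append({
                "metric": match["metric"],
                "value": float(match["value"]),
                "step": int(match["step"]),
                "epoch": int(match["epoch"]),
                "path": match["path"],
            })
    return found


checkpoints = saved_checkpoints(log_lines)
best = max(checkpoints, key=lambda c: c["value"])
print(best["path"], best["value"])
```

For a real run, the same function can be given the lines of the training log file instead of the sample list.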


sophienguyen01 commented on June 8, 2024

Thank you for your input. I was able to find the checkpoints from this training run.

