Comments (13)
I'm seeing an OOM in the logs:
OP_REQUIRES failed at conv_ops.cc:698 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[16384,32,37,110]
It also shows your training params:
Training Examples: 8264746
Batch Size: 16384
Epochs: 1
Steps per epoch: 504
Steps per tune: 1500000
Num train steps: 504
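The OOM is plausible from the tensor shape in the error alone. A quick back-of-envelope, assuming float32 activations (4 bytes per element; the exact dtype is an assumption):

```python
# Back-of-envelope for the failing allocation in the OOM message above,
# assuming float32 activations (4 bytes per element).
shape = (16384, 32, 37, 110)   # shape from the RESOURCE_EXHAUSTED error

elements = 1
for dim in shape:
    elements *= dim

size_gib = elements * 4 / 2**30
print(f"{size_gib:.2f} GiB")   # roughly 8 GiB for this single tensor
```

That is close to (or beyond) the memory of many single GPUs, before counting weights, gradients, and other activations.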
It seems that --config.batch_size=512 is not being picked up. It could be related to setting num_epochs=0; try changing that back to the original 10. If that doesn't work, you could edit batch_size in dv_config.py directly.
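The logged numbers themselves show which batch size was actually used; a quick arithmetic check (illustrative, not DeepVariant code):

```python
# Which batch size do the logged training parameters imply?
train_examples = 8_264_746   # "Training Examples" from the log
steps_per_epoch = 504        # "Steps per epoch" from the log

print(train_examples // 16384)  # 504   -> matches the log: the default batch size was used
print(train_examples // 512)    # 16142 -> what we'd expect if --config.batch_size=512 applied
```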
Let me know if that helps!
from deepvariant.
Hi @lucasbrambrink,
I actually tried different batch sizes (32 and 512), but the smaller batch size takes longer, so I switched to 512. I also tried with epoch=10 but still encountered the same error. I just updated my error log file with the error No checkpoint found.
@sophienguyen01 can you try to run this again without using --debug=true? That flag runs TensorFlow in eager mode, which is very inefficient.
The other issue is that you don't have a checkpoint file because you didn't train long enough, and no checkpoint outperformed the existing performance on your tune dataset.
Try re-running with --debug=false and --config.num_epochs=10 and see where that gets you. If you get an OOM error with batch_size=512, reduce it and try again.
If training produces a better model, it will be output in the experiment_dir.
Hi @danielecook ,
I ran without --debug and set --config.num_epochs=10, but I still get the same error. I attached my log file here.
This is the command I used:
BIN_VERSION="1.6.1"
DOCKER_IMAGE="google/deepvariant:${BIN_VERSION}"
time sudo docker run --gpus 1 \
-v /home/${USER}:/home/${USER} \
-w /home/${USER} \
${DOCKER_IMAGE}-gpu \
train \
--config=s3-mount/deepvariant_training/script/dv_config.py:base \
--config.train_dataset_pbtxt="${SHUFFLE_DIR}/training_set.dataset_config.pbtxt" \
--config.tune_dataset_pbtxt="${SHUFFLE_DIR}/validation_set.dataset_config.pbtxt" \
--config.init_checkpoint="${GCS_PRETRAINED_WGS_MODEL}" \
--config.num_epochs=10 \
--config.learning_rate=0.02 \
--config.num_validation_examples=0 \
--experiment_dir="model_train" \
--strategy=mirrored \
--config.batch_size=512
Did I miss anything?
@sophienguyen01 - from the log file it looks like everything worked.
Here are all the tune/categorical accuracies from your training data.
tune/categorical_accuracy=0.9944317936897278
tune/categorical_accuracy=0.9909400343894958
tune/categorical_accuracy=0.9915463924407959
tune/categorical_accuracy=0.9925118088722229
tune/categorical_accuracy=0.9921825528144836
tune/categorical_accuracy=0.9924613237380981
tune/categorical_accuracy=0.9926846623420715
tune/categorical_accuracy=0.9929667711257935
tune/categorical_accuracy=0.9925829172134399
tune/categorical_accuracy=0.9926416277885437
tune/categorical_accuracy=0.9923893213272095
tune/categorical_accuracy=0.9925225377082825
The first number represents the accuracy directly from the pretrained model. Since none of the subsequent tuning evaluations outperformed the original, no checkpoints were created.
One thing you could try: reduce the learning rate and see if that helps.
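A minimal sketch of this keep-only-improvements behavior (illustrative, not DeepVariant's actual implementation), using the tune accuracies from the log:

```python
# Illustrative best-metric checkpointing: save only when the tune metric
# beats the best seen so far. The first value is the pretrained baseline.
accuracies = [
    0.9944317936897278,  # initial eval of the pretrained init_checkpoint
    0.9909400343894958, 0.9915463924407959, 0.9925118088722229,
    0.9921825528144836, 0.9924613237380981, 0.9926846623420715,
    0.9929667711257935, 0.9925829172134399, 0.9926416277885437,
    0.9923893213272095, 0.9925225377082825,
]

best = accuracies[0]      # baseline set by the pretrained model
saved = []
for step, acc in enumerate(accuracies[1:], start=1):
    if acc > best:        # checkpoint only on improvement
        best = acc
        saved.append(step)

print(saved)  # [] -> no tuning eval beat the baseline, so no checkpoints
```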
Hi @sophienguyen01, let me know if you've had a chance to try this and can share some updates here. Thanks!
Hi Pichuan,
You can close this issue now. I tried lowering the learning rate, but the model still does not exceed the performance of the default model, so I will try training on different samples.
Thanks
Hi @pichuan,
I trained on a new dataset and ran into a similar issue. This time there are files created in the checkpoints directory, but I still get the same error. Only the first epoch has a low tune/categorical_accuracy; in the remaining epochs the accuracy is higher than 0.9. I attached the log file here:
train_041924.log
Here are the parameters I used to train:
--config.learning_rate=0.0001 \
--config.num_validation_examples=0 \
--experiment_dir="model_train" \
--strategy=mirrored \
--config.batch_size=32 \
Would you take a look and let me know what's going wrong? Thank you
Hi @sophienguyen01 ,
Is there a reason why you're setting --config.num_validation_examples=0? You'll need a reasonable number of validation examples for the model to be able to evaluate and pick a reasonable checkpoint.
According to dv_config.py:
# If set to 0, use full validation dataset.
config.num_validation_examples = 0
Also, the training tutorial uses --config.num_validation_examples=0.
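That comment reads as a sentinel convention: 0 means "evaluate on the full tune dataset", not "skip validation". A hedged sketch of that convention (resolve_num_validation_examples is a hypothetical helper, not a function from dv_config.py):

```python
def resolve_num_validation_examples(configured: int, dataset_size: int) -> int:
    # Hypothetical helper illustrating the documented sentinel:
    # 0 -> "use full validation dataset"; otherwise cap at the dataset size.
    return dataset_size if configured == 0 else min(configured, dataset_size)

print(resolve_num_validation_examples(0, 150_000))      # 150000 (full dataset)
print(resolve_num_validation_examples(5_000, 150_000))  # 5000
```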
Hi @sophienguyen01, can you point out the specific line of the error? All the lines in the logs are API warnings; you can safely ignore those.
@sophienguyen01 the logs indicate checkpoints are output:
I0423 18:41:59.026870 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.9114237 step=3352 epoch=1 path=model_train/checkpoints/ckpt-3352
I0423 18:44:53.215049 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.91949123 step=6704 epoch=2 path=model_train/checkpoints/ckpt-6704
I0423 18:47:47.292658 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.92320794 step=10056 epoch=3 path=model_train/checkpoints/ckpt-10056
But as @kishwarshafin suggests, the warnings at the end are normal and can be ignored.
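To confirm on disk, you can list the checkpoint steps under the experiment directory; a minimal sketch assuming the model_train/checkpoints/ckpt-&lt;step&gt; layout shown in the log lines above:

```python
from pathlib import Path

def list_checkpoint_steps(experiment_dir: str) -> list[int]:
    # Each checkpoint writes several files (e.g. ckpt-3352.index and
    # ckpt-3352.data-*), so collect the unique ckpt-<step> stems and
    # return the steps in ascending order.
    ckpt_dir = Path(experiment_dir) / "checkpoints"
    steps = {int(p.name.split(".")[0].split("-")[1])
             for p in ckpt_dir.glob("ckpt-*")}
    return sorted(steps)
```

For the log above this would return [3352, 6704, 10056]; adjust the path if your DeepVariant version lays out checkpoints differently.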
Thank you for your input. I am able to find the checkpoints with this training.