Issues with Incompatible TensorRT libraries in docker image google/deepvariant:latest-gpu and google/deepvariant:1.6.1-gpu (deepvariant, CLOSED)

nlopez94 commented on June 9, 2024
Issues with Incompatible TensorRT libraries in docker image google/deepvariant:latest-gpu and google/deepvariant:1.6.1-gpu

from deepvariant.

Comments (9)

kishwarshafin commented on June 9, 2024

@nlopez94 ,

We also observe these warnings, but DeepVariant does not use any TensorRT APIs for training or inference, so these warnings are usually not actionable for the DeepVariant pipeline. Are you running inference and seeing that the machine's GPU is not being utilized?

nlopez94 commented on June 9, 2024

Hi @kishwarshafin,

The training process starts as expected with GPU activity visible, but it abruptly stops without any error message while processing the first epoch and determining the best checkpoint metric (code snippet). This step completes as expected when using the CPU image with the same dataset and parameters. Initially, I thought TensorRT issues might be causing this stop, but I'll share the logs with you to get your perspective and an extra set of eyes on the problem.

Command:

( time sudo docker run --runtime=nvidia --gpus 1 \
    -v ${HOME}:${HOME} \
    -w ${HOME} \
    google/deepvariant:1.6.1-gpu \
    train \
    --config="${BASE}/dv_config.py":base \
    --config.train_dataset_pbtxt="${BASE}/training_set.pbtxt" \
    --config.tune_dataset_pbtxt="${BASE}/validation_set.pbtxt" \
    --config.init_checkpoint="${BASE}/checkpoint/deepvariant.wgs.ckpt" \
    --config.num_epochs=10 \
    --config.learning_rate=0.0001 \
    --config.num_validation_examples=0 \
    --experiment_dir=${TRAINING_DIR} \
    --strategy=mirrored \
    --config.batch_size=512 \
) > "${LOG_DIR}/train.log" 2>&1 &

Log:

I0508 17:53:46.544947 140534986602304 train.py:384] Starting epoch 0
I0508 17:53:46.545100 140534986602304 train.py:391] Performing initial evaluation of warmstart model.
I0508 17:53:46.545171 140534986602304 train.py:361] Running tune at step=0 epoch=0
I0508 17:53:46.545287 140534986602304 train.py:366] Tune step 0 / 15 (0.0%)
I0508 17:54:10.069682 140512707213056 logging_writer.py:48] [0] tune/categorical_accuracy=0.22617188096046448, tune/categorical_crossentropy=1.3209192752838135, tune/f1_het=0.02283571846783161, tune/f1_homalt=0.09889934211969376, tune/f1_homref=0.843934178352356, tune/f1_macro=0.3218897581100464, tune/f1_micro=0.22617188096046448, tune/f1_weighted=0.21346084773540497, tune/false_negatives_1=6123.0, tune/false_positives_1=5727.0, tune/loss=1.3209190368652344, tune/precision_1=0.21375617384910583, tune/precision_het=0.19323670864105225, tune/precision_homalt=0.05127762258052826, tune/precision_homref=0.9494163393974304, tune/recall_1=0.20273438096046448, tune/recall_het=0.007176175247877836, tune/recall_homalt=0.834269642829895, tune/recall_homref=0.6971428394317627, tune/true_negatives_1=9633.0, tune/true_positives_1=1557.0
I0508 17:54:10.083408 140534986602304 train.py:394] Warmstart checkpoint best checkpoint metric: tune/f1_weighted=0.21346085

real    1m12.933s
user    0m0.037s
sys     0m0.013s

train.log

kishwarshafin commented on June 9, 2024

@nlopez94 can you cat validation_set.pbtxt and check how many examples you have in the tune data? It looks like everything ended normally, but there is too little data.

nlopez94 commented on June 9, 2024

@kishwarshafin As I mentioned before, I used the same dataset and parameters when running on CPU, and there the process continued without ending abruptly as it did on GPU. Below are the contents of my validation_set.pbtxt:

# Generated by shuffle_tfrecords_lowmem.py

name: "ASM3060704"
tfrecord_path: "/home/nlopez/training-case-study/customized_training/validation_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 7762
#
# --input_pattern_list=/home/nlopez/training-case-study/customized_training/validation_set.with_label.tfrecord-?????-of-00024.gz
# --output_pattern_prefix=/home/nlopez/training-case-study/customized_training/validation_set.with_label.shuffled
#
# class1: 5628
# class0: 1774
# class2: 360
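
The example count asked about above can be read from the pbtxt file without opening it. The sketch below is only an illustration: the helper names num_examples_from_pbtxt and full_batches are hypothetical, and dividing by the batch size is an assumption about how the tune step count is derived, not confirmed DeepVariant logic. With the values in this thread (7762 examples, batch size 512) the division gives 15, which is consistent with the "Tune step 0 / 15" line in the log.

```shell
# Hypothetical helper: read the num_examples field from a dataset pbtxt file.
# Commented lines (starting with '#') do not match the pattern and are skipped.
num_examples_from_pbtxt() {
  sed -n 's/^num_examples:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: number of full batches of size $2 in $1 examples
# (integer division; an assumption, not taken from the DeepVariant source).
full_batches() {
  echo $(( $1 / $2 ))
}

# Usage (file name and batch size taken from the thread):
#   n=$(num_examples_from_pbtxt validation_set.pbtxt)
#   full_batches "$n" 512
```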

kishwarshafin commented on June 9, 2024

@nlopez94 can you remove the parameter --config.num_validation_examples=0 and rerun, please?

nlopez94 commented on June 9, 2024

@kishwarshafin I will try this and update you on the results I get. Thank you so much for the support!

nlopez94 commented on June 9, 2024

@kishwarshafin I just found the error that was causing this to abruptly exit without warning. I was running the script with insufficient memory, and after changing my instance type, everything ran as expected. Thank you very much for answering my questions!
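
A process that runs out of memory is often killed by a signal and so leaves nothing in its own log, which may be why the exit looked silent here. One way to spot this is to inspect the exit status of the command. The sketch below is an assumption-laden illustration, not something taken from the thread: describe_exit is a hypothetical helper, and it relies on the shell convention that a status above 128 means "terminated by signal (status - 128)", so 137 corresponds to SIGKILL (signal 9). A status of 137 is typical of an out-of-memory kill but can have other causes.

```shell
# Hypothetical helper: turn an exit status into a short description.
# Relies on the shell convention that status > 128 means the process was
# terminated by signal (status - 128); 137 therefore means SIGKILL.
describe_exit() {
  if [ "$1" -eq 0 ]; then
    echo "exited normally"
  elif [ "$1" -gt 128 ]; then
    echo "killed by signal $(( $1 - 128 ))"
  else
    echo "failed with status $1"
  fi
}

# Usage (some_command is a placeholder for the training command):
#   some_command; describe_exit "$?"
```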

kunmonster commented on June 9, 2024

Hi, you can find the paths of the missing libraries mentioned above and add them to LD_LIBRARY_PATH; the warnings will then go away. I had the same problem and solved it this way.
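
The approach described above could be sketched as follows. This is a minimal illustration under stated assumptions: find_lib_dirs and prepend_ld_path are hypothetical helper names, and the pattern 'libnvinfer*.so*' and the search root /usr are guesses, since the thread does not name the missing libraries or where they are installed. Use the names reported in your own warnings.

```shell
# Hypothetical helper: print each directory under search root $2 that
# contains a file matching the name pattern $1.
find_lib_dirs() {
  find "$2" -name "$1" 2>/dev/null |
    while IFS= read -r lib; do dirname "$lib"; done |
    sort -u
}

# Hypothetical helper: prepend directory $1 to LD_LIBRARY_PATH unless it is
# already listed there.
prepend_ld_path() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$1:"*) ;;  # already present, nothing to do
    *)
      LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
      export LD_LIBRARY_PATH
      ;;
  esac
}

# Usage (pattern and search root are assumptions; assumes paths without spaces):
#   for dir in $(find_lib_dirs 'libnvinfer*.so*' /usr); do
#     prepend_ld_path "$dir"
#   done
```

The change only affects the shell where it is run, so for the Docker image it would have to be applied inside the container.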

kishwarshafin commented on June 9, 2024

Thanks for confirming @nlopez94, I will close the issue.
