Comments (9)
We also observe these warnings, but DeepVariant does not use any TensorRT APIs for training or inference, so these warnings are usually not actionable for the DeepVariant pipeline. Are you running inference and seeing that the machine's GPU is not being utilized?
from deepvariant.
Hi @kishwarshafin,
The training process starts as expected with GPU activity visible, but it abruptly stops without any error message while processing the first epoch and determining the best checkpoint metric (code snippet). This step completes as expected when using the CPU image with the same dataset and parameters. Initially, I thought TensorRT issues might be causing this stop, but I'll share the logs with you to get your perspective and an extra set of eyes on the problem.
Command:
( time sudo docker run --runtime=nvidia --gpus 1\
-v ${HOME}:${HOME} \
-w ${HOME} \
google/deepvariant:1.6.1-gpu \
train \
--config="${BASE}/dv_config.py":base \
--config.train_dataset_pbtxt="${BASE}/training_set.pbtxt" \
--config.tune_dataset_pbtxt="${BASE}/validation_set.pbtxt" \
--config.init_checkpoint="${BASE}/checkpoint/deepvariant.wgs.ckpt" \
--config.num_epochs=10 \
--config.learning_rate=0.0001 \
--config.num_validation_examples=0 \
--experiment_dir=${TRAINING_DIR} \
--strategy=mirrored \
--config.batch_size=512 \
) > "${LOG_DIR}/train.log" 2>&1 &
I0508 17:53:46.544947 140534986602304 train.py:384] Starting epoch 0
I0508 17:53:46.545100 140534986602304 train.py:391] Performing initial evaluation of warmstart model.
I0508 17:53:46.545171 140534986602304 train.py:361] Running tune at step=0 epoch=0
I0508 17:53:46.545287 140534986602304 train.py:366] Tune step 0 / 15 (0.0%)
I0508 17:54:10.069682 140512707213056 logging_writer.py:48] [0] tune/categorical_accuracy=0.22617188096046448, tune/categorical_crossentropy=1.3209192752838135, tune/f1_het=0.02283571846783161, tune/f1_homalt=0.09889934211969376, tune/f1_homref=0.843934178352356, tune/f1_macro=0.3218897581100464, tune/f1_micro=0.22617188096046448, tune/f1_weighted=0.21346084773540497, tune/false_negatives_1=6123.0, tune/false_positives_1=5727.0, tune/loss=1.3209190368652344, tune/precision_1=0.21375617384910583, tune/precision_het=0.19323670864105225, tune/precision_homalt=0.05127762258052826, tune/precision_homref=0.9494163393974304, tune/recall_1=0.20273438096046448, tune/recall_het=0.007176175247877836, tune/recall_homalt=0.834269642829895, tune/recall_homref=0.6971428394317627, tune/true_negatives_1=9633.0, tune/true_positives_1=1557.0
I0508 17:54:10.083408 140534986602304 train.py:394] Warmstart checkpoint best checkpoint metric: tune/f1_weighted=0.21346085
real 1m12.933s
user 0m0.037s
sys 0m0.013s
@nlopez94 can you cat validation_set.pbtxt and see how many examples you have in the tune data? It looks like everything ended regularly but there's too little data.
@kishwarshafin As I mentioned before, I used the same dataset and parameters when running on CPU, and that run continued without ending abruptly as it did on GPU. Below is the content of my validation_set.pbtxt:
# Generated by shuffle_tfrecords_lowmem.py
name: "ASM3060704"
tfrecord_path: "/home/nlopez/training-case-study/customized_training/validation_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 7762
#
# --input_pattern_list=/home/nlopez/training-case-study/customized_training/validation_set.with_label.tfrecord-?????-of-00024.gz
# --output_pattern_prefix=/home/nlopez/training-case-study/customized_training/validation_set.with_label.shuffled
#
# class1: 5628
# class0: 1774
# class2: 360
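As a quick sanity check on the "too little data" question, here is a minimal sketch that reads `num_examples` from a dataset config like the one above and estimates the number of tune steps for the `--config.batch_size=512` used in the command. The helper names are hypothetical, and the floor division is an assumption about how the step count is derived; it is at least consistent with the `Tune step 0 / 15` line in the log.

```python
import re


def num_examples_from_pbtxt(text: str) -> int:
    """Return the num_examples value from a dataset config pbtxt string."""
    # Anchor at line start so commented lines (e.g. "# class1: 5628") are ignored.
    match = re.search(r"^\s*num_examples:\s*(\d+)", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no num_examples field found")
    return int(match.group(1))


def tune_steps(num_examples: int, batch_size: int) -> int:
    """Estimate tune steps, assuming only full batches are counted."""
    return num_examples // batch_size


# Shortened copy of the validation_set.pbtxt shown above.
pbtxt = '''# Generated by shuffle_tfrecords_lowmem.py
name: "ASM3060704"
num_examples: 7762
'''

examples = num_examples_from_pbtxt(pbtxt)
print(examples, tune_steps(examples, 512))  # prints: 7762 15
```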
@nlopez94 can you remove this parameter: --config.num_validation_examples=0
and rerun please
@kishwarshafin I will try this and update you on the results I get. Thank you so much for the support!
@kishwarshafin I just found the error that was causing this to abruptly exit without warning. I was running the script with insufficient memory, and after changing my instance type, everything ran as expected. Thank you very much for answering my questions!
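For anyone hitting the same silent exit, a pre-flight memory check before launching training might look like the sketch below. It parses the `MemAvailable` field from `/proc/meminfo` on Linux. The function names and the 32 GiB threshold are hypothetical; the thread does not say how much memory the run actually needed.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)  # kB -> GiB
    raise ValueError("MemAvailable not found")


def has_enough_memory(meminfo_text: str, required_gib: float) -> bool:
    """Return True if available memory meets the (user-chosen) requirement."""
    return mem_available_gib(meminfo_text) >= required_gib


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the kernel's memory report (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Illustrative sample text, not taken from the machine in this thread.
sample = "MemTotal:       16384000 kB\nMemAvailable:    8388608 kB\n"
print(mem_available_gib(sample))        # prints: 8.0
print(has_enough_memory(sample, 32.0))  # prints: False
```

On a real machine, `has_enough_memory(read_meminfo(), 32.0)` would run the same check against the live values.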
Hi, in fact, you can find the paths of the missing libraries mentioned above and add them to LD_LIBRARY_PATH; the warnings will then be eliminated. I had the same problem and solved it this way.
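A rough sketch of that approach: walk a few directories looking for the library files named in the warnings, then build an extended `LD_LIBRARY_PATH` value. The function names, the library name `libexample.so.1`, and the search root are placeholders, so substitute the names from your own warnings. The variable has to be set before the process starts (for example in the shell, or with `-e` on `docker run`).

```python
import os


def find_library_dirs(library_names, search_roots):
    """Return sorted directories under search_roots containing any named library."""
    wanted = set(library_names)
    found = set()
    for root in search_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if wanted.intersection(filenames):
                found.add(dirpath)
    return sorted(found)


def extend_ld_library_path(current, new_dirs):
    """Append new_dirs to a colon-separated path string, skipping duplicates."""
    parts = [p for p in current.split(":") if p]
    for directory in new_dirs:
        if directory not in parts:
            parts.append(directory)
    return ":".join(parts)


# Hypothetical library name and search root; replace with your own.
dirs = find_library_dirs(["libexample.so.1"], ["/usr/local"])
new_value = extend_ld_library_path(os.environ.get("LD_LIBRARY_PATH", ""), dirs)
print(f'export LD_LIBRARY_PATH="{new_value}"')
```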
Thanks for confirming @nlopez94, I will close the issue.