
nvidia-merlin / hugectr


HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training

License: Apache License 2.0

CMake 1.34% C++ 39.83% Cuda 26.36% Shell 0.48% Python 15.76% Jupyter Notebook 16.19% Makefile 0.01% HTML 0.01% Batchfile 0.01% C 0.01%
cpp deep-learning gpu-acceleration recommendation-system recommender-system

hugectr's People

Contributors

aleckohlhoff, alexeedm, amukkara, ashishsardana, bashimao, benfred, bkarsin, chirayug-nvidia, emmaqiaoch, georgeliu95, janekl, jershi425, jianbing-d, kingsleyliu-nv, kunlunl, lgardenhire, mengran-nvidia, miguelusque, mikemckiernan, minseokl, nyrio, oyilmaz-nvidia, raywang96, reoptnvidia, shijieliu, vinhngx, wl1136, xiaoleishi-nv, yingcanw, zehuanw


hugectr's Issues

question on backward computation

Hi Hugectr experts,

I have a question on backward computation. Take the localized slot as an example:
I notice that HugeCTR performs an all-to-all after the forward propagation, and in the backward pass it performs an all-to-all again before the backward propagation. Why are there two all-to-all operations between the forward and backward passes?
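
For anyone else puzzling over the same thing, here is a minimal MPI sketch of the data movement only (an illustration under the assumption of slot-partitioned embeddings; HugeCTR's actual implementation uses NCCL-based all-to-all, not MPI): each rank owns the vectors for a subset of slots but needs all slots for its own samples, so each direction needs its own exchange.

#include <mpi.h>
#include <vector>

// Forward: every rank has looked up embeddings for the slots it owns, for
// ALL samples; the all-to-all redistributes them so each rank ends up with
// ALL slots for ITS samples, which the dense layers below consume.
void forward_exchange(const std::vector<float>& local_slot_vecs,
                      std::vector<float>& my_sample_vecs, int count_per_rank) {
  MPI_Alltoall(local_slot_vecs.data(), count_per_rank, MPI_FLOAT,
               my_sample_vecs.data(), count_per_rank, MPI_FLOAT, MPI_COMM_WORLD);
}

// Backward: gradients arrive in the per-sample layout, but each embedding
// row must be updated on the rank that owns its slot, so the same exchange
// has to run again in the reverse direction before the embedding backward.
void backward_exchange(const std::vector<float>& my_sample_grads,
                       std::vector<float>& local_slot_grads, int count_per_rank) {
  MPI_Alltoall(my_sample_grads.data(), count_per_rank, MPI_FLOAT,
               local_slot_grads.data(), count_per_rank, MPI_FLOAT, MPI_COMM_WORLD);
}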

dump_to_tf throwing memory error

I followed the instructions given in /hugectr/tutorials/dump_to_tf/ReadMe.
But when running

python3 main.py ../../samples/dcn/dcn_bin.json ../../samples/dcn/train/0.data ../../samples/dcn/_dense_9999.model ../../samples/dcn/0_sparse_9999.model

I am getting a memory exception. Please refer to the attached screenshots for the actual error.

Note: I have used NVTabular with binary format to preprocess and train with HugeCTR; hence the config file used in the above command is dcn_bin.json.

[screenshots: dump_to_tf_error, free-memory-in-gb]

How does the data collector send its CSR buffers to the remote node?

When I read the source code, I found the data collector is supposed to work as commented:

/**************************************
 * Each node will have one DataCollector.
 * Each iteration, one of the data collectors will
 *   send its CSR buffers to the remote node.
 **************************************/

However, I cannot find the specific code that does this. Can somebody give some explanation? Thanks~

Will HugeCTR add more support for TensorFlow model transfer?

Currently, there is only one tutorial about transferring a HugeCTR model to a TensorFlow model: https://github.com/NVIDIA/HugeCTR/tree/master/tutorial/dump_to_tf . The tutorial code is not well architected; it looks like a specific example rather than a common reusable module.
My question is: what is the HugeCTR team's plan to develop a common Python module with the following behavior:

  • Input: hugectr model config file path, tensorflow output path
  • Output: tensorflow model under tensorflow output path

Fail to build docker image with ENABLE_MULTINODES=ON

Here is the command

docker build --build-arg ENABLE_MULTINODES=ON -t hugectr:devel -f ./tools/dockerfiles/build.Dockerfile .

and got errors below:

In file included from /HugeCTR/HugeCTR/include/gpu_resource.hpp:19:0,
                 from /HugeCTR/HugeCTR/src/gpu_resource.cpp:17:
/HugeCTR/HugeCTR/include/common.hpp:29:10: fatal error: mpi.h: No such file or directory
 #include <mpi.h>
          ^~~~~~~
compilation terminated.

I think the Dockerfile does not meet the requirements of a multi-node build: mpi.h comes from an MPI development package, which is apparently not installed when ENABLE_MULTINODES=ON.

Build success but failed to run with CUDA 10.1

I want to run HugeCTR on a device with CUDA 10.1.

Change the docker config in tools/dockerfiles/build.Dockerfile or dev.a100.Dockerfile:
FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 --> FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

Everything built OK and I got the HugeCTR binary files.

But then the driver seems to break down and nothing can be run. Running nvidia-smi gives:
Failed to initialize NVML: Driver/library version mismatch

Trying to debug, I found the driver broke after libarrow-cuda-dev was installed, i.e. this line: apt update && apt install -y libarrow-dev=0.17.1-1 libarrow-cuda-dev=0.17.1-1
It pulls in another package, libnvidia-compute-435, and after libnvidia-compute-435 is installed the driver no longer works correctly.

Any way to solve it?

Should hugectr add batch normalization offset and scale

Saving a HugeCTR model with a batch normalization layer, we can get gamma and beta but not offset and scale, which should be estimators of the training data:

[images]

And when we transfer a HugeCTR model to a TensorFlow model, we need to set offset and scale in tf.nn.batch_normalization(x, mean, variance, offset, scale, variance_epsilon, name=None).
Can HugeCTR add the offset and scale parameters to the saved binary model?
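
For reference, this is the formula tf.nn.batch_normalization applies at inference time (the standard batch-norm definition), which shows why all four tensors are needed; a one-line C++ rendering:

#include <cmath>

// y = scale * (x - mean) / sqrt(variance + eps) + offset, so reproducing the
// layer outside HugeCTR needs the saved mean/variance as well as gamma/beta.
float batch_norm_inference(float x, float mean, float variance,
                           float offset /* beta */, float scale /* gamma */,
                           float eps) {
  return scale * (x - mean) / std::sqrt(variance + eps) + offset;
}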

[GFN RecSys] RMM memory allocation fail for parquet Datareader

Description:

Overview:

The GFN dataset is pre-processed with NVTabular which resulted in 8 parquet files for training. I'm using just 1 parquet file for testing HugeCTR. I've modified _metadata.json to include only 1 filename (corresponding to the 1 parquet file).
While training DLRM with the least possible embedding_vec_size=1, I'm getting the following error:

terminate called after throwing an instance of 'rmm::bad_alloc'
what(): std::bad_alloc: CNMEM error at: /opt/conda/envs/rapids/include/rmm/mr/device/cnmem_memory_resource.hpp:168: CNMEM_STATUS_OUT_OF_MEMORY

The error is discussed in detail here

Minimal reproducing steps:

The dataset is available on NGC Batch (dataset id: 68926) which contains:

  1. parquet file
  2. _metadata.json
  3. _file_list_try.txt

The docker image was built using this script and is available on NGC Batch as nvidian/tme-gfnmerlin/hugectr_rel:1

Attached is the config used for training - dlrm_fp32_256_local.json

An NGC Batch job can be run using:

ngc batch run --name "gfn-hugectr" --preempt RUNONCE --ace nv-us-west-2 --instance dgx1v.32g.8.norm --commandline "bash -c 'source activate rapids && pip install gdown && jupyter notebook --allow-root --ip 0.0.0.0 --no-browser --NotebookApp.token='admin' --NotebookApp.allow_origin='*' --notebook-dir=/'" --result /results --image "nvidian/tme-gfnmerlin/hugectr_rel:1" --org nvidian --team sae --port 8786 --port 8787 --port 8888 --datasetid 68926:/gfn-merlin/data/preprocessed/preprocessed-53-jan-sept-parquet/

Please add your workspace and change your team.

Full error:

Attached is the error trace after running huge_ctr --train dlrm_fp32_256_local.json - error.log
Comments:

when cache_size_ >1 the train loss is zero

We find that setting cache_size_ > 1 in DataCollector makes the train loss almost zero. In DataCollector.hpp:

template <typename TypeKey>
void DataCollector<TypeKey>::collect() {
  if (counter_ < cache_size_ || cache_size_ == 0) {
    collect_();
  } else {
    collect_blank_(); 
  }
}

counter_ only increments, so once it exceeds cache_size_ it will never be less than cache_size_ again. collect_() then never runs, the training data stays at an old version, the model overfits on it, and the loss is almost zero.
The correct code is supposed to be

template <typename TypeKey>
void DataCollector<TypeKey>::collect() {
  if (counter_ % internal_buffers_.size() < cache_size_ || cache_size_ == 0) {
    collect_();
  } else {
    collect_blank_(); 
  }
}
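
A toy check of the proposed condition (internal_buffers_.size() is assumed to be 4 here, purely for illustration): with cache_size_ = 2 the original test stops collecting forever once counter_ reaches 2, while the modulo version keeps cycling through the cached buffers.

#include <cstdio>

int main() {
  const int cache_size = 2, num_buffers = 4;  // assumed values
  for (int counter = 0; counter < 8; ++counter) {
    bool original = (counter < cache_size);             // old condition
    bool fixed = (counter % num_buffers < cache_size);  // proposed fix
    std::printf("counter=%d original:%s fixed:%s\n", counter,
                original ? "collect" : "blank", fixed ? "collect" : "blank");
  }
  return 0;
}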

Test failed when I input decimals

Hi HugeCTR experts:
in master/test/utest/layers/fully_connected_layer_test.cpp
107:for (size_t i = 0; i < k * n; ++i) h_weight[i] = (float)(rand() % 100);
108:for (size_t i = 0; i < m * k; ++i) h_in[i] = (float)(rand() % 100);

when I use decimals:
107:for (size_t i = 0; i < k * n; ++i) h_weight[i] = (float)((rand() % 100) * 0.1);
108:for (size_t i = 0; i < m * k; ++i) h_in[i] = (float)((rand() % 100) * 0.1);
the test fails; the max_diff between CPU and GPU is > 0.1 (for example: 0.3125). Why?
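
This looks like ordinary floating-point behavior rather than a broken kernel: 0.1 has no exact binary representation, and the CPU and cuBLAS accumulate the dot products in different orders (possibly with FMA), so a fixed absolute max_diff bound becomes too strict as the result magnitudes grow. A sketch of a relative-error comparison that is usually more robust for such tests (compare_array_relative is a hypothetical helper, not part of the test suite):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>

// Tolerance scales with the magnitude of the values being compared,
// falling back to an absolute bound near zero.
bool compare_array_relative(const float* a, const float* b, size_t len,
                            float rel_eps) {
  for (size_t i = 0; i < len; ++i) {
    float mag = std::max(std::fabs(a[i]), std::fabs(b[i]));
    if (std::fabs(a[i] - b[i]) > rel_eps * std::max(mag, 1.0f)) {
      std::printf("mismatch at %zu: %f vs %f\n", i, a[i], b[i]);
      return false;
    }
  }
  return true;
}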

Custom models on HugeCTR

Hi HugeCTR experts,

I want to implement a custom model on HugeCTR. So far, I could not find docs that show how to import layers/optimizers to build a custom model. Or is there anything I missed?

I wonder if you have released, or will release, documentation that shows how to build a custom model?

Thanks

Docker for v2.3 release

Description:
add scikit-learn python module (Dmitry)

cudf 0.16 (Chirayu)

  1. @jianbingd will help to check NGC docker of TF to run Embedding plugin
  2. @xiaoleis will send email and ask if we need to update SWIAPT.

Plan B:
Four docker containers in total:
build.tfplugin.dockerfile + dev.tfplugin.dockerfile
build.dockerfile + dev.dockerfile

  • upload docker to NGC (depends on QA)

Comments:

[FEA] Make optional the number of files in Norm Dataset File List

Hi!

I am not sure if starting the norm dataset file list with the number of files in the list is the best option.

IMO, that value should not be needed, because it can easily be calculated by the parser. It is also a possible source of future errors if the parser doesn't double-check that the specified number matches the number of files listed in the norm dataset file.

Therefore, I would suggest making that value optional.

https://github.com/NVIDIA/HugeCTR/blob/master/docs/hugectr_user_guide.md#file-list
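
For illustration, a sketch of what the suggestion implies on the parser side (read_norm_file_list is hypothetical; the format per the user guide is a leading count followed by one path per line):

#include <fstream>
#include <string>
#include <vector>

// Derive the file count from the entries themselves; the leading number is
// read but only validated, never trusted.
std::vector<std::string> read_norm_file_list(const std::string& path) {
  std::ifstream in(path);
  std::string first_line;
  std::getline(in, first_line);  // the declared count
  std::vector<std::string> files;
  for (std::string line; std::getline(in, line);)
    if (!line.empty()) files.push_back(line);
  if (std::stol(first_line) != static_cast<long>(files.size())) {
    // mismatch: warn, or simply prefer files.size()
  }
  return files;
}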

Hope it helps!

Running DLRM model got Runtime error: cublas_status_not_supported

Following dlrm_fp32_64k.json, we tested DLRM on our data: label_dim=1, dense_dim=5, slot_num=75, and got an error in the first fully connected layer.

What's wrong with my data or model config? Or is there a bug in HugeCTR?

log

[10d12h27m36s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[6421.34, init_end, ]
[6421.35, run_start, ]
HugeCTR training start:
[6421.36, train_epoch_start, 0, ]
[HCDEBUG][ERROR] Runtime error: cublas_status_not_supported /tmp/HugeCTR/HugeCTR/src/layers/fully_connected_layer.cu:143 

[HCDEBUG][ERROR] Runtime error: operation not permitted when stream is capturing /tmp/HugeCTR/HugeCTR/src/session.cpp:451 

[HCDEBUG][ERROR] Runtime error: cublas_status_not_supported /tmp/HugeCTR/HugeCTR/src/layers/fully_connected_layer.cu:143 

Terminated with error

model config

{
  "solver": {
    "lr_policy": "fixed",
    "display": 1,
    "max_iter": 2,
    "gpu": [
        0
    ],
    "batchsize": 32,
    "snapshot": 1,
    "snapshot_prefix": "./tmp/daw",
    "eval_interval": 1,
    "batchsize_eval":32,
    "eval_metrics": [
        "AUC:0.9",
        "AverageLoss"
    ],
    "eval_batches": 1,
    "input_key_type": "I64"
},
"optimizer": {
    "type": "Adam",
    "global_update": false,
    "adam_hparam": {
        "learning_rate": 0.0001,
        "beta1": 0.9,
        "beta2": 0.999,
        "epsilon": 1e-08
    }
},
"layers": [
    {
        "name": "data",
        "type": "Data",
        "source": "./tmp/file_list.txt",
        "eval_source": "./tmp/file_list_test.txt",
        "check": "Sum",
        "label": {
            "top": "label",
            "label_dim": 1
        },
        "dense": {
            "top": "dense",
            "dense_dim": 5
        },
        "sparse": [
            {
                "top": "data1",
                "type": "DistributedSlot",
                "max_feature_num_per_sample": 180,
                "slot_num": 75
            }
        ]
    },
    {
        "name": "sparse_embedding1",
        "type": "DistributedSlotSparseEmbeddingHash",
        "bottom": "data1",
        "top": "sparse_embedding1",
        "sparse_embedding_hparam": {
            "max_vocabulary_size_per_gpu": 24000000,
            "load_factor": 0.75,
            "embedding_vec_size": 16,
            "combiner": 1
        }
    },
  
      {
        "name": "fc1",
        "type": "InnerProduct",
        "bottom": "dense",
        "top": "fc1",
         "fc_param": {
          "num_output": 512
        }
      },
  
   
    {
        "name": "relu1",
        "type": "ReLU",
        "bottom": "fc1",
        "top": "relu1" 
      },
  
      {
        "name": "fc2",
        "type": "InnerProduct",
        "bottom": "relu1",
        "top": "fc2",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu2",
        "type": "ReLU",
        "bottom": "fc2",
        "top": "relu2"     
      },
      
      {
        "name": "fc3",
        "type": "InnerProduct",
        "bottom": "relu2",
        "top": "fc3",
         "fc_param": {
          "num_output": 16
        }
      },
  
      {
        "name": "relu3",
        "type": "ReLU",
        "bottom": "fc3",
        "top": "relu3"     
      },
      
      {
        "name": "interaction1",
        "type": "Interaction",
        "bottom": ["relu3", "sparse_embedding1"],
        "top": "interaction1"
      },
  
      {
        "name": "fc4",
        "type": "InnerProduct",
        "bottom": "interaction1",
        "top": "fc4",
         "fc_param": {
          "num_output": 1024
        }
      },
  
      {
        "name": "relu4",
        "type": "ReLU",
        "bottom": "fc4",
        "top": "relu4" 
      },
        
  
      {
        "name": "fc5",
        "type": "InnerProduct",
        "bottom": "relu4",
        "top": "fc5",
         "fc_param": {
          "num_output": 1024
        }
      },
  
      {
        "name": "relu5",
        "type": "ReLU",
        "bottom": "fc5",
        "top": "relu5"     
      },
      
      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
        }
      },
  
      {
        "name": "relu6",
        "type": "ReLU",
        "bottom": "fc6",
        "top": "relu6"     
      },
  
      {
        "name": "fc7",
        "type": "InnerProduct",
        "bottom": "relu6",
        "top": "fc7",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu7",
        "type": "ReLU",
        "bottom": "fc7",
        "top": "relu7"     
      },
      
      {
        "name": "fc8",
        "type": "InnerProduct",
        "bottom": "relu7",
        "top": "fc8",
         "fc_param": {
          "num_output": 1
        }
      },
      
      {
        "name": "loss",
        "type": "BinaryCrossEntropyLoss",
        "bottom": ["fc8","label"],
        "top": "loss"
      } 
    ]
  }

v2.2 GeneralBuffer is empty error

We ran v2.2 on a V100 with CUDA 10.1 and hit this error:
[HCDEBUG][ERROR] Runtime error: GeneralBuffer is empty /tmp/HugeCTR/HugeCTR/include/general_buffer.hpp:136

Our config is:

{
    "solver": {
      "lr_policy": "fixed",
      "display":  100,
      "max_iter":  1000,
      "gpu":  [0],
      "input_key_type":"I64",
      "batchsize":  4096,
      "batchsize_eval":4096,
      "snapshot": 10000000,
      "snapshot_prefix": "./",
      "eval_interval": 100,
      "eval_metrics": ["AUC:0.9","AverageLoss"],
      "eval_batches": 500
    },
    
    "optimizer": {
      "type": "Adam",
      "global_update": true,
      "adam_hparam": {
        "learning_rate": 0.001,
        "alpha": 0.001,
        "beta1": 0.9,
        "beta2": 0.999,
        "epsilon": 0.00000001
      }
    },
  
    "layers": [ 
        {
        "name": "data",
        "type": "Data",
        "source": "./file_list.txt",
        "eval_source": "./file_list_test.txt",
        "check": "Sum",
        "label": {
          "top": "label",
          "label_dim": 1
        },
        "dense": {
          "top": "dense",
          "dense_dim": 0
        },
        "sparse": [
          {
            "top": "data1",
            "type": "DistributedSlot",
            "max_feature_num_per_sample": 100,
            "slot_num": 75
          }        
        ]
      },
  
      {
        "name": "sparse_embedding1",
        "type": "DistributedSlotSparseEmbeddingHash",
        "bottom": "data1",
        "top": "sparse_embedding1",
        "sparse_embedding_hparam": {
          "max_vocabulary_size_per_gpu": 20000000,
          "load_factor": 0.75,
          "embedding_vec_size": 16,
          "combiner": 1
        }
      },
  
      {
        "name": "reshape1",
        "type": "Reshape",
        "bottom": "sparse_embedding1",
        "top": "reshape1",
        "leading_dim": 1200
      },
  
  
      {
        "name": "concat1",
        "type": "Concat",
        "bottom": ["reshape1","dense"],
        "top": "concat1"
      },
  
      {
        "name": "slice1",
        "type": "Slice",
        "bottom": "concat1",
        "ranges": [[0,1200], [0,1200]],
        "top": ["slice11", "slice12"]
      },
  
  
      {
        "name": "multicross1",
        "type": "MultiCross",
        "bottom": "slice11",
        "top": "multicross1",
        "mc_param": {
          "num_layers": 3
        }
      },
  
      {
        "name": "fc1",
        "type": "InnerProduct",
        "bottom": "slice12",
        "top": "fc1",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu1",
        "type": "ReLU",
        "bottom": "fc1",
        "top": "relu1" 
      },
        
      {
        "name": "dropout1",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu1",
        "top": "dropout1" 
      },
  
      {
        "name": "fc2",
        "type": "InnerProduct",
        "bottom": "dropout1",
        "top": "fc2",
         "fc_param": {
          "num_output": 128
        }
      },
  
      {
        "name": "relu2",
        "type": "ReLU",
        "bottom": "fc2",
        "top": "relu2"     
      },
  
      {
        "name": "dropout2",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu2",
        "top": "dropout2" 
      },
         {
        "name": "fc3",
        "type": "InnerProduct",
        "bottom": "dropout2",
        "top": "fc3",
         "fc_param": {
          "num_output": 64
        }
      },
  
      {
        "name": "relu3",
        "type": "ReLU",
        "bottom": "fc3",
        "top": "relu3"     
      },
  
      {
        "name": "dropout3",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu3",
        "top": "dropout3" 
      },
      {
        "name": "concat2",
        "type": "Concat",
        "bottom": ["dropout3","multicross1"],
        "top": "concat2"
      },
      
      {
        "name": "fc4",
        "type": "InnerProduct",
        "bottom": "concat2",
        "top": "fc4",
         "fc_param": {
          "num_output": 1
        }
      },
      
      {
        "name": "loss",
        "type": "BinaryCrossEntropyLoss",
        "bottom": ["fc4","label"],
        "top": "loss"
      } 
    ]
  }

HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(276): error: identifier "__syncwarp" is undefined

HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(276): error: identifier "__syncwarp" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(287): error: identifier "__any_sync" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(300): error: identifier "__all_sync" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(313): error: identifier "__ballot_sync" is undefined

4 errors detected in the compilation of "/tmp/tmpxft_0000e907_00000000-6_embedding_creator.cpp1.ii".
HugeCTR/src/CMakeFiles/huge_ctr_static.dir/build.make:101: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/embedding_creator.cu.o' failed
make[2]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/embedding_creator.cu.o] Error 1
CMakeFiles/Makefile2:156: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all' failed
make[1]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
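
For context, an assumption based on the identifiers involved: __syncwarp, __any_sync, __all_sync, and __ballot_sync only exist from CUDA 9.0 onward, so this failure usually means the CUDA toolkit nvcc is using is older than the bundled cub expects. A tiny program to confirm which toolkit headers are in play:

#include <cstdio>
#include <cuda.h>

int main() {
  // CUDA_VERSION is e.g. 9000 for CUDA 9.0; the warp *_sync intrinsics
  // in cub's util_ptx.cuh require at least that.
  std::printf("CUDA_VERSION = %d (need >= 9000 for __syncwarp)\n", CUDA_VERSION);
  return 0;
}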

Can hugectr add a predict command?

Currently, the hugectr main process supports: [--train] [--help] [--version]. However, it is a common scenario that when training is done we predict on the test data and print the result to the screen, which can be redirected to a file.

With the predict result, we can :

  1. Compare the HugeCTR predict result with the transferred TensorFlow model's, to make sure the transfer process is correct.
  2. Use the result to calculate other metrics, like AUC, precision, etc.
  3. Do batch prediction.

Can hugectr add a predict command?

  • huge_ctr --predict model_config.json test_file_list.txt

Questions on HashTable

Hi, thanks for the nice work. I viewed the code and have the following questions.

  1. Where is the hash map placed?
     The hash map is responsible for mapping the input to a value. For example, given the input 10 and the bucket size 100, its hash value is
     822 = hash('10') % 100

But what I can find is only the HashTable that stores the mapping <10, 822>. I want to know the part that generates the 822.

  2. Does the embedding table support dynamic growth?
     I see the embedding table behind the hash table has a fixed size, see here. So the HashTable is dynamic but the embedding table is fixed?
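
Not an authoritative answer, but a host-side sketch of the scheme the code suggests: the HashTable dynamically maps each new key to the next free row of a pre-allocated, fixed-size embedding table, so the key-to-row mapping grows while the value storage does not (the real code is a GPU concurrent hash map, not std::unordered_map):

#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

struct EmbeddingTableSketch {
  std::unordered_map<long long, size_t> key_to_row;  // dynamic key mapping
  std::vector<float> rows;                           // fixed-capacity values
  size_t vec_size, capacity, next_row = 0;

  EmbeddingTableSketch(size_t cap, size_t dim)
      : rows(cap * dim), vec_size(dim), capacity(cap) {}

  // New keys are assigned the next free row; no modulo-style bucketing.
  float* get_or_insert(long long key) {
    auto it = key_to_row.find(key);
    if (it == key_to_row.end()) {
      if (next_row == capacity) throw std::runtime_error("embedding table full");
      it = key_to_row.emplace(key, next_row++).first;
    }
    return rows.data() + it->second * vec_size;
  }
};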

Documentations for v2.3

Description:

  • ReadMe (PIC Minseok):
  • Notebook: move release note here and link to the features in User Guide. / @KingsleyL will add notebook.
  • User Guide (PIC Minseok): Connections between term and feature introduction. + Known Issue
  • Samples
  • Tutorial @aleliu multi-node training
  • Question and Answers

Finish a draft version by contributors by 9th Nov.
Reorganization starts from 9th Nov (PIC Lamont).
Comments:

v2.2 build error

build v2.2 with command : mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -DNCCL_A2A=ON -DSM=70 .. && make -j

got error:
[ 3%] Building CUDA object HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o
nvcc fatal : Value 'all-warnings' is not defined for option 'Werror'
HugeCTR/src/CMakeFiles/huge_ctr_static.dir/build.make:134: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o' failed
make[2]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o] Error 1
CMakeFiles/Makefile2:124: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all' failed
make[1]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

and

data/HugeCTR/test/utest/layers/multi_cross_layer_test.cpp:276:25: note: suggested alternative: 'compare_array_approx'
/data/HugeCTR/test/utest/layers/multi_cross_layer_test.cpp:276:7: error: expected primary-expression before '(' token
ASSERT_TRUE(test::compare_array_approx_with_ratio(

Solution:
We fixed the error by deleting some configuration in CMakeLists.txt:
we removed the "--Werror all-warnings" flag and the test module.

Does the v2.2 testing use the dockerfile?

[BUG] Runtime error: an illegal memory access

After processing the Criteo dataset with NVTabular and generating the output parquet files, I get Runtime error: an illegal memory access when I try to train the DLRM model with HugeCTR.

[06d20h48m42s][HUGECTR][INFO]: Iter: 14000 Time(1000 iters): 51.684892s Loss: 0.131229 lr:24.000000
[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/embeddings/update_params_functor.cu:571 

[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/embeddings/update_params_functor.cu:571 

[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/session.cpp:427 

terminate called after throwing an instance of 'HugeCTR::internal_runtime_error'
  what():  [HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/include/general_buffer2.hpp:37

[FEA] Support cudf 0.16

Currently HugeCTR does not support cudf 0.16. It keeps throwing the following error.

/home/rapids/hugectr/HugeCTR/include/data_readers/parquet_data_reader_worker.hpp:29:10: fatal error: cudf/io/functions.hpp: No such file or directory
 #include <cudf/io/functions.hpp>

cudf 0.16 has refactored some code, and functions.hpp does not exist anymore. The includes in HugeCTR have to be updated.

setting seed can't reproduce the results

What have I done?

  • set seed in config file: "solver": {"seed": 100}

  • set maxiter=2 , eval_interval=1

  • set file_list.txt with 1 file; set file_list_with.txt with 1 file

  • set train and eval reader chunk_size to 1:

    data_reader.reset(new DataReader(source_data, batch_size, label_dim, dense_dim,
                                     check_type, data_reader_sparse_param_array,
                                     gpu_resource_group, 1, use_mixed_precision));

  • run the train process ./huge_ctr --train model.json twice

What happened?

The AverageLoss values (1.200125 and 1.18949) are too far apart.

the first train log:

 [05d17h49m17s][HUGECTR][INFO]: Iter: 1 Time(1 iters): 0.101207s Loss: 1.211278 lr:0.000100

[05d17h49m18s][HUGECTR][INFO]: Evaluation, AUC: 0.501446
[05d17h49m18s][HUGECTR][INFO]: Evaluation, AverageLoss: 1.200125

the second train log:

 
[05d17h51m37s][HUGECTR][INFO]: Iter: 1 Time(1 iters): 0.093456s Loss: 1.200530 lr:0.000100

[05d17h51m37s][HUGECTR][INFO]: Evaluation, AUC: 0.397724

[05d17h51m37s][HUGECTR][INFO]: Evaluation, AverageLoss: 1.18949
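
One possible contributor, offered as an assumption rather than a diagnosis: even with a fixed seed, any parallel floating-point accumulation whose order varies between runs (e.g. atomic gradient updates on the GPU) changes the result, because float addition is not associative. A toy demonstration:

#include <cstdio>
#include <vector>

int main() {
  std::vector<float> xs(1000);
  for (int i = 0; i < 1000; ++i) xs[i] = 1.0f / (i + 1);
  float fwd = 0.f, bwd = 0.f;
  for (int i = 0; i < 1000; ++i) fwd += xs[i];   // one summation order
  for (int i = 999; i >= 0; --i) bwd += xs[i];   // the reverse order
  std::printf("forward: %.9f  backward: %.9f\n", fwd, bwd);  // differ slightly
  return 0;
}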

Runtime error: out of memory /mnt/HugeCTR/HugeCTR/include/general_buffer.hpp:64

Hi, when I was trying to run DLRM on the Terabyte dataset with one GPU, I got a runtime error message like this. My guess is that I ran out of GPU memory. I've also tried decreasing the mini-batch size or batchsize_eval but still get this error. Does anyone know how to solve this issue?

I was running the following command:
./huge_ctr --train ./dlrm_fp16_64k.json

And the solver in my dlrm_fp16_64k.json looks like this:
"solver": {
"lr_policy": "fixed",
"display": 1000,
"max_iter":64013,
"gpu": [0],
"batchsize": 1024,
"batchsize_eval": 131072,
"snapshot": 10000000,
"snapshot_prefix": "./",
"eval_interval": 3200,
"eval_batches": 681,
"mixed_precision": 1024,
"eval_metrics": ["AUC:0.8025"]
}
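
A back-of-the-envelope sketch that may help size the problem (the numbers below are placeholders, not values taken from the full dlrm_fp16_64k.json, which is not shown): the embedding weights alone take vocabulary x vector size x bytes per element, and any optimizer state multiplies that.

#include <cstddef>
#include <cstdio>

int main() {
  std::size_t vocab_per_gpu = 24000000;  // placeholder
  std::size_t vec_size = 128;            // placeholder
  std::size_t bytes_per_elem = 4;        // 2 if embeddings are fp16
  double gib = 1024.0 * 1024.0 * 1024.0;
  double weights = double(vocab_per_gpu) * vec_size * bytes_per_elem / gib;
  std::printf("embedding weights: %.1f GiB per GPU\n", weights);
  std::printf("with an optimizer keeping two extra states: ~%.1f GiB\n",
              3 * weights);
  return 0;
}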

Can HugeCTR support Sequence Models?

Currently the embedding layer supports mean or sum pooling for variable-length features. In the deep learning world, using LSTM or Attention is normal. For example, the DIN model uses an attention layer to merge the user behavior sequence.

Can HugeCTR support sequence models, such as LSTM, GRU, Attention, etc.?

what's the meaning of local_id = feature_ids[k] + slot_offset_[k]?

Hi,
I read the v2.2 code. When the hash type is LocalizedSlotSparseEmbeddingOneHot, why is local_id = feature_ids[k] + slot_offset_[k]? What is the meaning of this?

if (params_.size() == 1 && params_[0].type == DataReaderSparse_t::Localized &&
    !slot_offset_.empty()) {
  auto& param = params_[0];
  for (int k = 0; k < param.slot_num; k++) {
    int dev_id = k % csr_chunk->get_num_devices();
    T local_id = feature_ids[k] + slot_offset_[k];
    csr_chunk->get_csr_buffer(param_id, dev_id).push_back_new_row(local_id);
  }
}
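
A hedged reading, from the surrounding code rather than the docs: slot_offset_ shifts each slot's local feature ids into a disjoint global range (the offsets being prefix sums of the per-slot vocabulary sizes), so ids from different slots cannot collide in the single localized embedding table.

#include <vector>

// Illustration: with per-slot vocabularies {10, 20, 30} the offsets are the
// prefix sums {0, 10, 30}; adding the offset makes every id globally unique.
long long to_global_id(int k, long long feature_id,
                       const std::vector<long long>& slot_offset) {
  return feature_id + slot_offset[k];
}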

No reaction appeared after the training start

Hi professionals,
We tried the steps in the HugeCTR tutorial and picked DeepFM for a trial. We successfully started the training, but nothing happened after the 'HugeCTR training start' text (we had waited for several days).

We tried several network configs, which however only varied max_iter (the network architecture was not changed); same problem.

System: Ubuntu 18.04.4 LTS
GPU: GeForce RTX 2080 Ti
Driver Version: 440.44
CUDA Version: 10.2

hugeCTR train performance question

I watched Zehuan Wang's talk "HugeCTR - 端到端点击率预估训练解决方案介绍" ("HugeCTR: An End-to-End CTR Estimation Training Solution").
In this deck, the "PERFORMANCE" slide reports only 17.8 ms per iteration on 8 GPUs.
17.8 ms per iteration seems too fast; is this an error?

running hugectr with multi nodes

Is there any complete tutorial about running HugeCTR with multiple nodes?

I have tried this:

Following the example (https://github.com/NVIDIA/HugeCTR/tree/master/samples/dcn2nodes), what I have done is:
Build a multi-node support image:

  • based on the dockerfile in HugeCTR
  • install hwloc 2.2.0
  • install UCX 1.8.0
  • install OpenMPI 4.0.3 with UCX support
  • install mpi4py 3.0.3
  • build HugeCTR: cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_MULTINODES=ON ..

Run HugeCTR on two NVLink-connected 8*V100 (32G) physical machines.

  • Start command is:
    export SSH_PORT="xxx"
    export NP="2"
    export WORK_DIR="/data/dcn_data/"
    export HOSTS="ip1:1,ip2:1"
    export ARGS=" ./bin/huge_ctr --train ./data/dcn-dist.json "
    cd $WORK_DIR
    bash start_dist.sh

start_dist.sh:
set -x

mpirun --bind-to none --allow-run-as-root -np $NP -H ${HOSTS} -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x LIBRARY_PATH=${LIBRARY_PATH} -x PATH=${PATH} -wdir ${PWD} --mca plm_rsh_agent "$PWD/ssh_resolver.sh" --mca btl_tcp_if_include ib0 $ARGS > logs.txt 2>&1 &

ssh_resolver.sh:

#!/bin/bash
HOSTNAME=$1
shift
ARGS=$*

ssh -p "$SSH_PORT" "$HOSTNAME" "$ARGS"

My question is:

Is my mpirun command correct? Should I specify UCX in mpirun? How does HugeCTR use UCX and hwloc? And how can I use InfiniBand/RDMA to accelerate HugeCTR?

For example, a UCX-enabled command looks like:
mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

https://github.com/openucx/ucx

Criteo dataset sample processing issue

I was trying to run HugeCTR on the Criteo Kaggle dataset. When I was converting the original Kaggle dataset to HugeCTR format using the Criteo2hugeCTR_legacy tool, I ran the command line as follows:

$ ./criteo2hugectr_legacy 1 ../../tools/criteo_script_legacy/train.out criteo/sparse_embedding file_list.txt
$ ./criteo2hugectr_legacy 1 ../../tools/criteo_script_legacy/test.out criteo_test/sparse_embedding file_list_test.txt

However, I'm not able to get the file_list.txt and file_list_test.txt from these scripts. I'm not sure what I did wrong here, since I pretty much followed the online readme from the beginning.

I also did some trials and realized that the problem might be in criteo2hugectr_legacy.cpp, since I wasn't able to read the EOF of txt_file (line 95).

I'd really appreciate it if you guys could explain this a bit. Thank you very much!

DataReader Refactoring TODO list

Description:

  • Python-friendly APIs: how to make the code more uniform?

void DataReader::set_source() {
  worker_group_.reset(new xxx_data_reader_worker_group);
}
// and no explicit call of start() is needed
  • Completely eliminate repeat from DataReader when it is ready, e.g., enable set_source for Raw
  • Remove all the default arguments from DataReader
  • Decouple Dataset source type from DataReader type (#169 #138)
  • Make the number of DataReaders configurable (and perhaps automatic configuration for a given system)
  • Support Eval for one epoch instead of specifying n_batches. It may require the change to Metrics as well. (@minseokl will see how TF and Pytorch tackle this issue)
  • Remove the duplicate code if possible

Comments:
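
One sketch of the "decouple source type from reader type" item above (the class names here are hypothetical, not the team's plan): choose the worker group by source type inside set_source(), so a single DataReader serves Norm/Raw/Parquet and no explicit start() is needed.

#include <memory>
#include <string>

// Hypothetical minimal interfaces, for illustration only.
struct WorkerGroup { virtual ~WorkerGroup() = default; };
struct NormWorkerGroup : WorkerGroup { explicit NormWorkerGroup(const std::string&) {} };
struct RawWorkerGroup : WorkerGroup { explicit RawWorkerGroup(const std::string&) {} };
struct ParquetWorkerGroup : WorkerGroup { explicit ParquetWorkerGroup(const std::string&) {} };

enum class SourceType { Norm, Raw, Parquet };

struct DataReader {
  std::unique_ptr<WorkerGroup> worker_group_;
  // The reader no longer encodes the source type; swapping the worker group
  // here also removes the need for an explicit start() call.
  void set_source(SourceType type, const std::string& path) {
    switch (type) {
      case SourceType::Norm:    worker_group_.reset(new NormWorkerGroup(path)); break;
      case SourceType::Raw:     worker_group_.reset(new RawWorkerGroup(path)); break;
      case SourceType::Parquet: worker_group_.reset(new ParquetWorkerGroup(path)); break;
    }
  }
};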

Error of running './huge_ctr --train ./deepfm_bin.json'

Hi there,
I tried running the HugeCTR Docker example of DeepFM with NVTabular preprocessing, but after running the command in the title, it shows errors and stops at the training start. Is there a bug? Thanks.

System: Ubuntu 18.04.4 LTS
GPU: GeForce RTX 2080 Ti
Driver Version: 440.44
CUDA Version: 10.2

[0.001, init_start, ]
HugeCTR Version: 2.2.1
Config file: ./deepfm_bin.json
[21d09h02m26s][HUGECTR][INFO]: batchsize_eval is not specified using default: 512
[21d09h02m26s][HUGECTR][INFO]: Default evaluation metric is AUC without threshold value
[21d09h02m26s][HUGECTR][INFO]: algorithm_search is not specified using default: 1
[21d09h02m26s][HUGECTR][INFO]: Algorithm search: ON
[21d09h02m26s][HUGECTR][INFO]: cuda_graph is not specified using default: 1
[21d09h02m26s][HUGECTR][INFO]: CUDA Graph: ON
[21d09h02m26s][HUGECTR][INFO]: Initial seed is 3545387129
[21d09h02m28s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: GeForce RTX 2080 Ti
[21d09h02m30s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[21d09h02m30s][HUGECTR][INFO]: max_nnz is not specified using default: 30
[21d09h02m30s][HUGECTR][INFO]: num_internal_buffers 1
[21d09h02m30s][HUGECTR][INFO]: num_internal_buffers 1
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
(the line above repeats many times, interleaved across the data reader worker threads)
[21d09h02m30s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=1737709
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69
(the line above likewise repeats many times)
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

: /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[21d09h02m30s][HUGECTR][INFO]: gpu0 start to init embedding
[21d09h02m30s][HUGECTR][INFO]: gpu0 init embedding done
[21d09h02m30s][HUGECTR][INFO]: warmup_steps is not specified using default: 1
[21d09h02m30s][HUGECTR][INFO]: decay_start is not specified using default: 0
[21d09h02m30s][HUGECTR][INFO]: decay_steps is not specified using default: 1
[21d09h02m30s][HUGECTR][INFO]: decay_power is not specified using default: 2.000000
[21d09h02m30s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[3538.92, init_end, ]
[3538.94, run_start, ]
HugeCTR training start:
[3538.95, train_epoch_start, 0, ]
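
For reference, the DataHeaderError raised at data_reader_worker.hpp:58 comes from the header check on a Norm-format binary file, and the "failed to read a file" error at line 69 usually follows from the same malformed or truncated file. A minimal inspection sketch in Python — assuming the documented 64-byte Norm header (error_check, number_of_samples, label_dim, dense_dim, slot_num, plus three reserved int64 fields) and a hypothetical file path — can confirm whether a file carries the header the reader expects:

    import struct

    def inspect_norm_header(path):
        # The Norm format begins with eight little-endian int64 fields.
        with open(path, "rb") as f:
            raw = f.read(64)
        if len(raw) < 64:
            raise RuntimeError("file too short to hold a Norm header")
        names = ["error_check", "number_of_samples", "label_dim",
                 "dense_dim", "slot_num", "reserved0", "reserved1", "reserved2"]
        for name, value in zip(names, struct.unpack("<8q", raw)):
            print(f"{name}: {value}")

    inspect_norm_header("./train/0.data")  # hypothetical path

If label_dim, dense_dim, or slot_num disagree with the values in the data layer config, the worker rejects the file at line 58.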

Parser doesn't check if a given layer name is already in use

Description:
Currently, our parser doesn't check if a specified layer "name" is already being used by a preceding layer.
As a result, erroneous layers like the following can be silently inserted into the network.
Without any safety measure, this kind of config bug can result in a disconnected network whose parameters are not appropriately trained; a stopgap validation sketch follows the snippet.

      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
                }
      },

      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
        }
      },

      {
        "name": "relu6",
        "type": "ReLU",
        "bottom": "fc6",
        "top": "relu6"
      },
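
Until the parser enforces this, a few lines of Python can catch duplicate layer names in a config before it is submitted. This is only a stopgap sketch over a hypothetical config path, not part of HugeCTR:

    import json
    from collections import Counter

    def check_unique_layer_names(config_path):
        with open(config_path) as f:
            layers = json.load(f)["layers"]
        counts = Counter(layer["name"] for layer in layers)
        dupes = [name for name, n in counts.items() if n > 1]
        if dupes:
            raise ValueError(f"duplicate layer names: {dupes}")

    check_unique_layer_names("dcn.json")  # hypothetical config file

The same check could be applied to the "top" tensor names, since a duplicated output tensor silently disconnects the network in the same way.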

Comments:

parquet datareader illegal memory access when training on 2xDGXA100

Description:
commit:
commit 53a2ff8 (HEAD -> v2.3-integration, origin/v2.3-integration, origin/data-power-law-kingsley)
Merge: e75c290 3c49da0
Author: Joey Wang [email protected]
Date: Sat Oct 31 01:24:10 2020 -0700
Merge branch 'fea-multinode-auc-dmitry-2.3' into 'v2.3-integration'
Multinode AUC
See merge request zehuanw/hugectr!257

dataset: /mnt/dldata/criteo_1TB/albertoa/test_dask/output/ in dlcluster

config: 2xdgxa100.json

error log:hugectr-test-1604632076.log

reproduce step:
Currently I am facing this bug when using raplab. I will update the reproduction steps when I have access to Selene.
Comments:

Why is my AUC so high in the first 1000 iters?

I just want to run a DCN sample training, using the following model JSON:

{
  "solver": {
    "lr_policy": "fixed",
    "display": 1000,
    "max_iter": 10000,
    "gpu": [0],
    "batchsize": 512,
    "snapshot": 10000000,
    "snapshot_prefix": "./",
    "eval_interval": 1000,
    "eval_batches": 60,
    "input_key_type": "I64"
  },
  
  "optimizer": {
    "type": "Adam",
    "global_update": true,
    "adam_hparam": {
      "learning_rate": 0.001,
      "beta1": 0.9,
      "beta2": 0.999,
      "epsilon": 0.0000001
    }
  },

  "layers": [ 
      {
      "name": "data",
      "type": "Data",
      "format": "Parquet",
      "slot_size_array": [1461, 558, 335378, 211710, 306, 20, 12136, 634, 4, 51298, 5302, 332600, 3179, 27, 12191, 301211, 11, 4841, 2086, 4, 324273, 17, 16, 79734, 96, 58622],
      "source": "./dcn_data/train/_file_list.txt",
      "eval_source": "./dcn_data/val/_file_list.txt",
      "check": "None",
      "label": {
        "top": "label",
        "label_dim": 1
      },
      "dense": {
        "top": "dense",
        "dense_dim": 13
      },
      "sparse": [
        {
          "top": "data1",
          "type": "DistributedSlot",
          "max_feature_num_per_sample": 30,
          "slot_num": 26
        }        
      ]
    },

    {
      "name": "sparse_embedding1",
      "type": "DistributedSlotSparseEmbeddingHash",
      "bottom": "data1",
      "top": "sparse_embedding1",
      "sparse_embedding_hparam": {
        "max_vocabulary_size_per_gpu": 1737709,
        "embedding_vec_size": 16,
        "combiner": 0
      }
    },

    {
      "name": "reshape1",
      "type": "Reshape",
      "bottom": "sparse_embedding1",
      "top": "reshape1",
      "leading_dim": 416
    },


    {
      "name": "concat1",
      "type": "Concat",
      "bottom": ["reshape1","dense"],
      "top": "concat1"
    },

    {
      "name": "slice1",
      "type": "Slice",
      "bottom": "concat1",
      "ranges": [[0,429], [0,429]],
      "top": ["slice11", "slice12"]
    },


    {
      "name": "multicross1",
      "type": "MultiCross",
      "bottom": "slice11",
      "top": "multicross1",
      "mc_param": {
        "num_layers": 6
      }
    },

    {
      "name": "fc1",
      "type": "InnerProduct",
      "bottom": "slice12",
      "top": "fc1",
       "fc_param": {
        "num_output": 1024
      }
    },

    {
      "name": "relu1",
      "type": "ReLU",
      "bottom": "fc1",
      "top": "relu1" 
    },
      
    {
      "name": "dropout1",
      "type": "Dropout",
      "rate": 0.5,
      "bottom": "relu1",
      "top": "dropout1" 
    },

    {
      "name": "fc2",
      "type": "InnerProduct",
      "bottom": "dropout1",
      "top": "fc2",
       "fc_param": {
        "num_output": 1024
      }
    },

    {
      "name": "relu2",
      "type": "ReLU",
      "bottom": "fc2",
      "top": "relu2"     
    },

    {
      "name": "dropout2",
      "type": "Dropout",
      "rate": 0.5,
      "bottom": "relu2",
      "top": "dropout2" 
    },
    
    {
      "name": "concat2",
      "type": "Concat",
      "bottom": ["dropout2","multicross1"],
      "top": "concat2"
    },
    
    {
      "name": "fc4",
      "type": "InnerProduct",
      "bottom": "concat2",
      "top": "fc4",
       "fc_param": {
        "num_output": 1
      }
    },
    
    {
      "name": "loss",
      "type": "BinaryCrossEntropyLoss",
      "bottom": ["fc4","label"],
      "top": "loss"
    } 
  ]
}
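
As an aside, the hard-coded dimensions in a config like this are easy to sanity-check: the Reshape leading_dim should equal slot_num × embedding_vec_size, and the Slice ranges should stay within that value plus dense_dim. A quick check with the values from the config above:

    slot_num = 26
    embedding_vec_size = 16
    dense_dim = 13

    leading_dim = slot_num * embedding_vec_size  # 416, matches reshape1's "leading_dim"
    concat_width = leading_dim + dense_dim       # 429, matches the Slice ranges [0, 429]

    assert leading_dim == 416 and concat_width == 429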

But the metrics reported for my first 1000 iters look very strange:

[04d15h08m10s][HUGECTR][INFO]: Iter: 1000 Time(1000 iters): 6.113479s Loss: 0.527308 lr:0.001000
[8665.98, eval_start, 0.1, ]
[04d15h08m10s][HUGECTR][INFO]: Evaluation, AUC: 0.692035
[8708.16, eval_accuracy, 0.692035, 0.1, 1000, ]
[04d15h08m10s][HUGECTR][INFO]: Eval Time for 60 iters: 0.042175s
[8708.18, eval_stop, 0.1, ]
[04d15h08m16s][HUGECTR][INFO]: Iter: 2000 Time(1000 iters): 6.171510s Loss: 0.426323 lr:0.001000
[14837.7, eval_start, 0.2, ]
....

Is it normal that my AUC already hits 0.692035 after the first 1000 iters? I also find that the AUC decreases over many of the following 1000-iter evaluations.

[ QUESTION ] training without eval set

Is there a way to run training without a validation set? Whenever I don't provide anything for eval_source, I get a file_empty kind of error.

On top of that, HugeCTR really needs to work on its error messages; I had to trace the code to see what was happening.
