
nvidia-merlin / hugectr


HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training

License: Apache License 2.0

CMake 1.34% C++ 39.83% Cuda 26.36% Shell 0.48% Python 15.76% Jupyter Notebook 16.19% Makefile 0.01% HTML 0.01% Batchfile 0.01% C 0.01%
cpp deep-learning gpu-acceleration recommendation-system recommender-system

hugectr's People

Contributors

aleckohlhoff, alexeedm, amukkara, ashishsardana, bashimao, benfred, bkarsin, chirayug-nvidia, emmaqiaoch, georgeliu95, janekl, jershi425, jianbing-d, kingsleyliu-nv, kunlunl, lgardenhire, mengran-nvidia, miguelusque, mikemckiernan, minseokl, nyrio, oyilmaz-nvidia, raywang96, reoptnvidia, shijieliu, vinhngx, wl1136, xiaoleishi-nv, yingcanw, zehuanw


hugectr's Issues

question on backward computation

Hi Hugectr experts,

I have a question on backward computation. Take the localized slot as an example:
I notice that HugeCTR performs an all-to-all after the forward propagation, and in the backward pass it performs an all-to-all again before the backward propagation. Why are there two all-to-all operations between the forward and backward passes?
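
For anyone else puzzling over the same thing, here is a minimal MPI sketch of the data movement only (an illustration under the assumption of slot-partitioned embeddings; HugeCTR's actual implementation uses NCCL-based all-to-all, not MPI): each rank owns the vectors for a subset of slots but needs all slots for its own samples, so each direction needs its own exchange.

#include <mpi.h>
#include <vector>

// Forward: every rank has looked up embeddings for the slots it owns, for
// ALL samples; the all-to-all redistributes them so each rank ends up with
// ALL slots for ITS samples, which the dense layers below consume.
void forward_exchange(const std::vector<float>& local_slot_vecs,
                      std::vector<float>& my_sample_vecs, int count_per_rank) {
  MPI_Alltoall(local_slot_vecs.data(), count_per_rank, MPI_FLOAT,
               my_sample_vecs.data(), count_per_rank, MPI_FLOAT, MPI_COMM_WORLD);
}

// Backward: gradients arrive in the per-sample layout, but each embedding
// row must be updated on the rank that owns its slot, so the same exchange
// has to run again in the reverse direction before the embedding backward.
void backward_exchange(const std::vector<float>& my_sample_grads,
                       std::vector<float>& local_slot_grads, int count_per_rank) {
  MPI_Alltoall(my_sample_grads.data(), count_per_rank, MPI_FLOAT,
               local_slot_grads.data(), count_per_rank, MPI_FLOAT, MPI_COMM_WORLD);
}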

dump_to_tf throwing memory error

I followed the instructions given in /hugectr/tutorials/dump_to_tf/ReadMe.
But when running

python3 main.py ../../samples/dcn/dcn_bin.json ../../samples/dcn/train/0.data ../../samples/dcn/_dense_9999.model ../../samples/dcn/0_sparse_9999.model

I am getting a memory exception. Please refer to the attached screenshots for the actual error.

Note: I have used NVTabular with binary format to preprocess and train with HugeCTR; hence the config file used in the above command is dcn_bin.json.

[screenshots: dump_to_tf_error, free-memory-in-gb]

How does the data collector send its CSR buffers to the remote node?

When I read the source code, I found the data collector is supposed to work as commented:

/**************************************
 * Each node will have one DataCollector.
 * Each iteration, one of the data collectors will
 *   send its CSR buffers to the remote node.
 **************************************/

However, I cannot find the specific code that does this. Can somebody give some explanation? Thanks~

Will HugeCTR add more support for TensorFlow model transfer?

Currently, there is only one tutorial about transferring a HugeCTR model to a TensorFlow model: https://github.com/NVIDIA/HugeCTR/tree/master/tutorial/dump_to_tf . The tutorial code is not well architected; it looks like a specific example rather than a common reusable module.
My question is: what is the HugeCTR team's plan to develop a common Python module with the following behavior:

  • Input: hugectr model config file path, tensorflow output path
  • Output: tensorflow model under tensorflow output path

Fail to build docker image with ENABLE_MULTINODES=ON

Here is the command

docker build --build-arg ENABLE_MULTINODES=ON -t hugectr:devel -f ./tools/dockerfiles/build.Dockerfile .

and got errors below:

In file included from /HugeCTR/HugeCTR/include/gpu_resource.hpp:19:0,
                 from /HugeCTR/HugeCTR/src/gpu_resource.cpp:17:
/HugeCTR/HugeCTR/include/common.hpp:29:10: fatal error: mpi.h: No such file or directory
 #include <mpi.h>
          ^~~~~~~
compilation terminated.

I think the Dockerfile does not meet the requirements of a multi-node build: mpi.h comes from an MPI development package, which is apparently not installed when ENABLE_MULTINODES=ON.

Build success but failed to run with CUDA 10.1

I want to run HugeCTR on a device with CUDA 10.1.

Change the docker config in tools/dockerfiles/build.Dockerfile or dev.a100.Dockerfile:
FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 --> FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

Everything built OK and I got the HugeCTR binary files.

But then the driver seems to break down and nothing can be run. Running nvidia-smi gives:
Failed to initialize NVML: Driver/library version mismatch

Trying to debug, I found the driver broke after libarrow-cuda-dev was installed, i.e. this line: apt update && apt install -y libarrow-dev=0.17.1-1 libarrow-cuda-dev=0.17.1-1
It pulls in another package, libnvidia-compute-435, and after libnvidia-compute-435 is installed the driver no longer works correctly.

Any way to solve it?

Should hugectr add batch normalization offset and scale

Saving a HugeCTR model with a batch normalization layer, we can get gamma and beta but not offset and scale, which should be estimators of the training data:

[images]

And when we transfer a HugeCTR model to a TensorFlow model, we need to set offset and scale in tf.nn.batch_normalization(x, mean, variance, offset, scale, variance_epsilon, name=None).
Can HugeCTR add the offset and scale parameters to the saved binary model?
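
For reference, this is the formula tf.nn.batch_normalization applies at inference time (the standard batch-norm definition), which shows why all four tensors are needed; a one-line C++ rendering:

#include <cmath>

// y = scale * (x - mean) / sqrt(variance + eps) + offset, so reproducing the
// layer outside HugeCTR needs the saved mean/variance as well as gamma/beta.
float batch_norm_inference(float x, float mean, float variance,
                           float offset /* beta */, float scale /* gamma */,
                           float eps) {
  return scale * (x - mean) / std::sqrt(variance + eps) + offset;
}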

[GFN RecSys] RMM memory allocation fail for parquet Datareader

Description:

Overview:

The GFN dataset is pre-processed with NVTabular which resulted in 8 parquet files for training. I'm using just 1 parquet file for testing HugeCTR. I've modified _metadata.json to include only 1 filename (corresponding to the 1 parquet file).
While training DLRM with the least possible embedding_vec_size=1, I'm getting the following error:

terminate called after throwing an instance of 'rmm::bad_alloc'
what(): std::bad_alloc: CNMEM error at: /opt/conda/envs/rapids/include/rmm/mr/device/cnmem_memory_resource.hpp:168: CNMEM_STATUS_OUT_OF_MEMORY

The error is discussed in detail here

Minimal reproducing steps:

The dataset is available on NGC Batch (dataset id: 68926) which contains:

  1. parquet file
  2. _metadata.json
  3. _file_list_try.txt

The docker image was built using this script and is available on NGC Batch as nvidian/tme-gfnmerlin/hugectr_rel:1

Attached is the config used for training - dlrm_fp32_256_local.json

An NGC Batch job can be run using:

ngc batch run --name "gfn-hugectr" --preempt RUNONCE --ace nv-us-west-2 --instance dgx1v.32g.8.norm --commandline "bash -c 'source activate rapids && pip install gdown && jupyter notebook --allow-root --ip 0.0.0.0 --no-browser --NotebookApp.token='admin' --NotebookApp.allow_origin='*' --notebook-dir=/'" --result /results --image "nvidian/tme-gfnmerlin/hugectr_rel:1" --org nvidian --team sae --port 8786 --port 8787 --port 8888 --datasetid 68926:/gfn-merlin/data/preprocessed/preprocessed-53-jan-sept-parquet/

Please add your workspace and change your team.

Full error:

Attached is the error trace after running huge_ctr --train dlrm_fp32_256_local.json - error.log
Comments:

when cache_size_ >1 the train loss is zero

We find that setting cache_size_ > 1 in DataCollector makes the train loss almost zero. In DataCollector.hpp:

template <typename TypeKey>
void DataCollector<TypeKey>::collect() {
  if (counter_ < cache_size_ || cache_size_ == 0) {
    collect_();
  } else {
    collect_blank_(); 
  }
}

counter_ only increments, so once it exceeds cache_size_ it will never be less than cache_size_ again. collect_() then never runs, the training data stays at an old version, the model overfits on it, and the loss is almost zero.
The correct code is supposed to be

template <typename TypeKey>
void DataCollector<TypeKey>::collect() {
  if (counter_ % internal_buffers_.size() < cache_size_ || cache_size_ == 0) {
    collect_();
  } else {
    collect_blank_(); 
  }
}
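
A toy check of the proposed condition (internal_buffers_.size() is assumed to be 4 here, purely for illustration): with cache_size_ = 2 the original test stops collecting forever once counter_ reaches 2, while the modulo version keeps cycling through the cached buffers.

#include <cstdio>

int main() {
  const int cache_size = 2, num_buffers = 4;  // assumed values
  for (int counter = 0; counter < 8; ++counter) {
    bool original = (counter < cache_size);             // old condition
    bool fixed = (counter % num_buffers < cache_size);  // proposed fix
    std::printf("counter=%d original:%s fixed:%s\n", counter,
                original ? "collect" : "blank", fixed ? "collect" : "blank");
  }
  return 0;
}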

Test failed when I input decimals

Hi HugeCTR experts:
in master/test/utest/layers/fully_connected_layer_test.cpp
107:for (size_t i = 0; i < k * n; ++i) h_weight[i] = (float)(rand() % 100);
108:for (size_t i = 0; i < m * k; ++i) h_in[i] = (float)(rand() % 100);

when I use decimals:
107:for (size_t i = 0; i < k * n; ++i) h_weight[i] = (float)((rand() % 100) * 0.1);
108:for (size_t i = 0; i < m * k; ++i) h_in[i] = (float)((rand() % 100) * 0.1);
the test fails; the max_diff between CPU and GPU is > 0.1 (for example: 0.3125). Why?
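
This looks like ordinary floating-point behavior rather than a broken kernel: 0.1 has no exact binary representation, and the CPU and cuBLAS accumulate the dot products in different orders (possibly with FMA), so a fixed absolute max_diff bound becomes too strict as the result magnitudes grow. A sketch of a relative-error comparison that is usually more robust for such tests (compare_array_relative is a hypothetical helper, not part of the test suite):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>

// Tolerance scales with the magnitude of the values being compared,
// falling back to an absolute bound near zero.
bool compare_array_relative(const float* a, const float* b, size_t len,
                            float rel_eps) {
  for (size_t i = 0; i < len; ++i) {
    float mag = std::max(std::fabs(a[i]), std::fabs(b[i]));
    if (std::fabs(a[i] - b[i]) > rel_eps * std::max(mag, 1.0f)) {
      std::printf("mismatch at %zu: %f vs %f\n", i, a[i], b[i]);
      return false;
    }
  }
  return true;
}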

Custom models on HugeCTR

Hi HugeCTR experts,

I want to implement a custom model on HugeCTR. So far, I could not find docs that show how to import layers/optimizers to build a custom model. Or is there anything I missed?

I wonder if you have released, or will release, documentation that shows how to build a custom model?

Thanks

Docker for v2.3 release

Description:
add scikit-learn python module (Dmitry)

cudf 0.16 (Chirayu)

  1. @jianbingd will help to check NGC docker of TF to run Embedding plugin
  2. @xiaoleis will send email and ask if we need to update SWIAPT.

Plan B:
Four docker containers in total:
build.tfplugin.dockerfile + dev.tfplugin.dockerfile
build.dockerfile + dev.dockerfile

  • upload docker to NGC (depends on QA)

Comments:

[FEA] Make optional the number of files in Norm Dataset File List

Hi!

I am not sure if starting the norm dataset file list with the number of files in the list is the best option.

IMO, that value should not be needed, because it can easily be calculated by the parser. It is also a possible source of future errors if the parser doesn't double-check that the specified number matches the number of files listed in the norm dataset file.

Therefore, I would suggest making that value optional.

https://github.com/NVIDIA/HugeCTR/blob/master/docs/hugectr_user_guide.md#file-list
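
For illustration, a sketch of what the suggestion implies on the parser side (read_norm_file_list is hypothetical; the format per the user guide is a leading count followed by one path per line):

#include <fstream>
#include <string>
#include <vector>

// Derive the file count from the entries themselves; the leading number is
// read but only validated, never trusted.
std::vector<std::string> read_norm_file_list(const std::string& path) {
  std::ifstream in(path);
  std::string first_line;
  std::getline(in, first_line);  // the declared count
  std::vector<std::string> files;
  for (std::string line; std::getline(in, line);)
    if (!line.empty()) files.push_back(line);
  if (std::stol(first_line) != static_cast<long>(files.size())) {
    // mismatch: warn, or simply prefer files.size()
  }
  return files;
}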

Hope it helps!

Running DLRM model got Runtime error: cublas_status_not_supported

Following dlrm_fp32_64k.json, we tested DLRM on our data: label_dim=1, dense_dim=5, slot_num=75, and got an error in the first fully connected layer.

What's wrong with my data or model config? Or is there a bug in HugeCTR?

log

[10d12h27m36s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[6421.34, init_end, ]
[6421.35, run_start, ]
HugeCTR training start:
[6421.36, train_epoch_start, 0, ]
[HCDEBUG][ERROR] Runtime error: cublas_status_not_supported /tmp/HugeCTR/HugeCTR/src/layers/fully_connected_layer.cu:143 

[HCDEBUG][ERROR] Runtime error: operation not permitted when stream is capturing /tmp/HugeCTR/HugeCTR/src/session.cpp:451 

[HCDEBUG][ERROR] Runtime error: cublas_status_not_supported /tmp/HugeCTR/HugeCTR/src/layers/fully_connected_layer.cu:143 

Terminated with error

model config

{
  "solver": {
    "lr_policy": "fixed",
    "display": 1,
    "max_iter": 2,
    "gpu": [
        0
    ],
    "batchsize": 32,
    "snapshot": 1,
    "snapshot_prefix": "./tmp/daw",
    "eval_interval": 1,
    "batchsize_eval":32,
    "eval_metrics": [
        "AUC:0.9",
        "AverageLoss"
    ],
    "eval_batches": 1,
    "input_key_type": "I64"
},
"optimizer": {
    "type": "Adam",
    "global_update": false,
    "adam_hparam": {
        "learning_rate": 0.0001,
        "beta1": 0.9,
        "beta2": 0.999,
        "epsilon": 1e-08
    }
},
"layers": [
    {
        "name": "data",
        "type": "Data",
        "source": "./tmp/file_list.txt",
        "eval_source": "./tmp/file_list_test.txt",
        "check": "Sum",
        "label": {
            "top": "label",
            "label_dim": 1
        },
        "dense": {
            "top": "dense",
            "dense_dim": 5
        },
        "sparse": [
            {
                "top": "data1",
                "type": "DistributedSlot",
                "max_feature_num_per_sample": 180,
                "slot_num": 75
            }
        ]
    },
    {
        "name": "sparse_embedding1",
        "type": "DistributedSlotSparseEmbeddingHash",
        "bottom": "data1",
        "top": "sparse_embedding1",
        "sparse_embedding_hparam": {
            "max_vocabulary_size_per_gpu": 24000000,
            "load_factor": 0.75,
            "embedding_vec_size": 16,
            "combiner": 1
        }
    },
  
      {
        "name": "fc1",
        "type": "InnerProduct",
        "bottom": "dense",
        "top": "fc1",
         "fc_param": {
          "num_output": 512
        }
      },
  
   
    {
        "name": "relu1",
        "type": "ReLU",
        "bottom": "fc1",
        "top": "relu1" 
      },
  
      {
        "name": "fc2",
        "type": "InnerProduct",
        "bottom": "relu1",
        "top": "fc2",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu2",
        "type": "ReLU",
        "bottom": "fc2",
        "top": "relu2"     
      },
      
      {
        "name": "fc3",
        "type": "InnerProduct",
        "bottom": "relu2",
        "top": "fc3",
         "fc_param": {
          "num_output": 16
        }
      },
  
      {
        "name": "relu3",
        "type": "ReLU",
        "bottom": "fc3",
        "top": "relu3"     
      },
      
      {
        "name": "interaction1",
        "type": "Interaction",
        "bottom": ["relu3", "sparse_embedding1"],
        "top": "interaction1"
      },
  
      {
        "name": "fc4",
        "type": "InnerProduct",
        "bottom": "interaction1",
        "top": "fc4",
         "fc_param": {
          "num_output": 1024
        }
      },
  
      {
        "name": "relu4",
        "type": "ReLU",
        "bottom": "fc4",
        "top": "relu4" 
      },
        
  
      {
        "name": "fc5",
        "type": "InnerProduct",
        "bottom": "relu4",
        "top": "fc5",
         "fc_param": {
          "num_output": 1024
        }
      },
  
      {
        "name": "relu5",
        "type": "ReLU",
        "bottom": "fc5",
        "top": "relu5"     
      },
      
      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
        }
      },
  
      {
        "name": "relu6",
        "type": "ReLU",
        "bottom": "fc6",
        "top": "relu6"     
      },
  
      {
        "name": "fc7",
        "type": "InnerProduct",
        "bottom": "relu6",
        "top": "fc7",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu7",
        "type": "ReLU",
        "bottom": "fc7",
        "top": "relu7"     
      },
      
      {
        "name": "fc8",
        "type": "InnerProduct",
        "bottom": "relu7",
        "top": "fc8",
         "fc_param": {
          "num_output": 1
        }
      },
      
      {
        "name": "loss",
        "type": "BinaryCrossEntropyLoss",
        "bottom": ["fc8","label"],
        "top": "loss"
      } 
    ]
  }

v2.2 GeneralBuffer is empty error

We ran v2.2 on a V100 with CUDA 10.1 and hit this error:
[HCDEBUG][ERROR] Runtime error: GeneralBuffer is empty /tmp/HugeCTR/HugeCTR/include/general_buffer.hpp:136

Our config is:

{
    "solver": {
      "lr_policy": "fixed",
      "display":  100,
      "max_iter":  1000,
      "gpu":  [0],
      "input_key_type":"I64",
      "batchsize":  4096,
      "batchsize_eval":4096,
      "snapshot": 10000000,
      "snapshot_prefix": "./",
      "eval_interval": 100,
      "eval_metrics": ["AUC:0.9","AverageLoss"],
      "eval_batches": 500
    },
    
    "optimizer": {
      "type": "Adam",
      "global_update": true,
      "adam_hparam": {
        "learning_rate": 0.001,
        "alpha": 0.001,
        "beta1": 0.9,
        "beta2": 0.999,
        "epsilon": 0.00000001
      }
    },
  
    "layers": [ 
        {
        "name": "data",
        "type": "Data",
        "source": "./file_list.txt",
        "eval_source": "./file_list_test.txt",
        "check": "Sum",
        "label": {
          "top": "label",
          "label_dim": 1
        },
        "dense": {
          "top": "dense",
          "dense_dim": 0
        },
        "sparse": [
          {
            "top": "data1",
            "type": "DistributedSlot",
            "max_feature_num_per_sample": 100,
            "slot_num": 75
          }        
        ]
      },
  
      {
        "name": "sparse_embedding1",
        "type": "DistributedSlotSparseEmbeddingHash",
        "bottom": "data1",
        "top": "sparse_embedding1",
        "sparse_embedding_hparam": {
          "max_vocabulary_size_per_gpu": 20000000,
          "load_factor": 0.75,
          "embedding_vec_size": 16,
          "combiner": 1
        }
      },
  
      {
        "name": "reshape1",
        "type": "Reshape",
        "bottom": "sparse_embedding1",
        "top": "reshape1",
        "leading_dim": 1200
      },
  
  
      {
        "name": "concat1",
        "type": "Concat",
        "bottom": ["reshape1","dense"],
        "top": "concat1"
      },
  
      {
        "name": "slice1",
        "type": "Slice",
        "bottom": "concat1",
        "ranges": [[0,1200], [0,1200]],
        "top": ["slice11", "slice12"]
      },
  
  
      {
        "name": "multicross1",
        "type": "MultiCross",
        "bottom": "slice11",
        "top": "multicross1",
        "mc_param": {
          "num_layers": 3
        }
      },
  
      {
        "name": "fc1",
        "type": "InnerProduct",
        "bottom": "slice12",
        "top": "fc1",
         "fc_param": {
          "num_output": 256
        }
      },
  
      {
        "name": "relu1",
        "type": "ReLU",
        "bottom": "fc1",
        "top": "relu1" 
      },
        
      {
        "name": "dropout1",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu1",
        "top": "dropout1" 
      },
  
      {
        "name": "fc2",
        "type": "InnerProduct",
        "bottom": "dropout1",
        "top": "fc2",
         "fc_param": {
          "num_output": 128
        }
      },
  
      {
        "name": "relu2",
        "type": "ReLU",
        "bottom": "fc2",
        "top": "relu2"     
      },
  
      {
        "name": "dropout2",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu2",
        "top": "dropout2" 
      },
         {
        "name": "fc3",
        "type": "InnerProduct",
        "bottom": "dropout2",
        "top": "fc3",
         "fc_param": {
          "num_output": 64
        }
      },
  
      {
        "name": "relu3",
        "type": "ReLU",
        "bottom": "fc3",
        "top": "relu3"     
      },
  
      {
        "name": "dropout3",
        "type": "Dropout",
        "rate": 0.5,
        "bottom": "relu3",
        "top": "dropout3" 
      },
      {
        "name": "concat2",
        "type": "Concat",
        "bottom": ["dropout3","multicross1"],
        "top": "concat2"
      },
      
      {
        "name": "fc4",
        "type": "InnerProduct",
        "bottom": "concat2",
        "top": "fc4",
         "fc_param": {
          "num_output": 1
        }
      },
      
      {
        "name": "loss",
        "type": "BinaryCrossEntropyLoss",
        "bottom": ["fc4","label"],
        "top": "loss"
      } 
    ]
  }

HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(276): error: identifier "__syncwarp" is undefined

HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(276): error: identifier "__syncwarp" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(287): error: identifier "__any_sync" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(300): error: identifier "__all_sync" is undefined
HugeCTR-2.1_beta/cub/cub/device/dispatch/../../agent/../thread/../util_ptx.cuh(313): error: identifier "__ballot_sync" is undefined

4 errors detected in the compilation of "/tmp/tmpxft_0000e907_00000000-6_embedding_creator.cpp1.ii".
HugeCTR/src/CMakeFiles/huge_ctr_static.dir/build.make:101: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/embedding_creator.cu.o' failed
make[2]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/embedding_creator.cu.o] Error 1
CMakeFiles/Makefile2:156: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all' failed
make[1]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
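
For context, an assumption based on the identifiers involved: __syncwarp, __any_sync, __all_sync, and __ballot_sync only exist from CUDA 9.0 onward, so this failure usually means the CUDA toolkit nvcc is using is older than the bundled cub expects. A tiny program to confirm which toolkit headers are in play:

#include <cstdio>
#include <cuda.h>

int main() {
  // CUDA_VERSION is e.g. 9000 for CUDA 9.0; the warp *_sync intrinsics
  // in cub's util_ptx.cuh require at least that.
  std::printf("CUDA_VERSION = %d (need >= 9000 for __syncwarp)\n", CUDA_VERSION);
  return 0;
}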

Can hugectr add a predict command?

Currently, the hugectr main process supports: [--train] [--help] [--version]. However, it is a common scenario that when training is done we predict on the test data and print the result to the screen, which can be redirected to a file.

With the predict result, we can :

  1. Compare the HugeCTR predict result with the transferred TensorFlow model's, to make sure the transfer process is correct.
  2. Use the result to calculate other metrics, like AUC, precision, etc.
  3. Do batch prediction.

Can hugectr add a predict command?

  • huge_ctr --predict model_config.json test_file_list.txt

Questions on HashTable

Hi, thanks for the nice work. I viewed the code and have the following questions.

  1. Where is the hash map placed?
     The hash map is responsible for mapping the input to a value. For example, given the input 10 and the bucket size 100, its hash value is
     822 = hash('10') % 100

But what I can find is only the HashTable that stores the mapping <10, 822>. I want to know the part that generates the 822.

  2. Does the embedding table support dynamic growth?
     I see the embedding table behind the hash table has a fixed size, see here. So the HashTable is dynamic but the embedding table is fixed?
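
Not an authoritative answer, but a host-side sketch of the scheme the code suggests: the HashTable dynamically maps each new key to the next free row of a pre-allocated, fixed-size embedding table, so the key-to-row mapping grows while the value storage does not (the real code is a GPU concurrent hash map, not std::unordered_map):

#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

struct EmbeddingTableSketch {
  std::unordered_map<long long, size_t> key_to_row;  // dynamic key mapping
  std::vector<float> rows;                           // fixed-capacity values
  size_t vec_size, capacity, next_row = 0;

  EmbeddingTableSketch(size_t cap, size_t dim)
      : rows(cap * dim), vec_size(dim), capacity(cap) {}

  // New keys are assigned the next free row; no modulo-style bucketing.
  float* get_or_insert(long long key) {
    auto it = key_to_row.find(key);
    if (it == key_to_row.end()) {
      if (next_row == capacity) throw std::runtime_error("embedding table full");
      it = key_to_row.emplace(key, next_row++).first;
    }
    return rows.data() + it->second * vec_size;
  }
};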

Documentations for v2.3

Description:

  • ReadMe (PIC Minseok):
  • Notebook: move release note here and link to the features in User Guide. / @KingsleyL will add notebook.
  • User Guide (PIC Minseok): Connections between term and feature introduction. + Known Issue
  • Samples
  • Tutorial @aleliu multi-node training
  • Question and Answers

Finish a draft version by contributors by 9th Nov.
Reorganization starts from 9th Nov (PIC Lamont).
Comments:

v2.2 build error

build v2.2 with command : mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release -DNCCL_A2A=ON -DSM=70 .. && make -j

got error:
[ 3%] Building CUDA object HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o
nvcc fatal : Value 'all-warnings' is not defined for option 'Werror'
HugeCTR/src/CMakeFiles/huge_ctr_static.dir/build.make:134: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o' failed
make[2]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/layers/batch_norm_layer.cu.o] Error 1
CMakeFiles/Makefile2:124: recipe for target 'HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all' failed
make[1]: *** [HugeCTR/src/CMakeFiles/huge_ctr_static.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

and

data/HugeCTR/test/utest/layers/multi_cross_layer_test.cpp:276:25: note: suggested alternative: 'compare_array_approx'
/data/HugeCTR/test/utest/layers/multi_cross_layer_test.cpp:276:7: error: expected primary-expression before '(' token
ASSERT_TRUE(test::compare_array_approx_with_ratio(

Solution:
We fixed the error by deleting some configuration in CMakeLists.txt:
we removed the "--Werror all-warnings" flag and the test module.

Does the v2.2 testing use the dockerfile?

[BUG] Runtime error: an illegal memory access

After processing the Criteo dataset with NVTabular and generating the output parquet files, I get Runtime error: an illegal memory access when I try to train the DLRM model with HugeCTR.

[06d20h48m42s][HUGECTR][INFO]: Iter: 14000 Time(1000 iters): 51.684892s Loss: 0.131229 lr:24.000000
[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/embeddings/update_params_functor.cu:571 

[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/embeddings/update_params_functor.cu:571 

[HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/src/session.cpp:427 

terminate called after throwing an instance of 'HugeCTR::internal_runtime_error'
  what():  [HCDEBUG][ERROR] Runtime error: an illegal memory access was encountered /HugeCTR/HugeCTR/include/general_buffer2.hpp:37

[FEA] Support cudf 0.16

Currently HugeCTR does not support cudf 0.16. It keeps throwing the following error.

/home/rapids/hugectr/HugeCTR/include/data_readers/parquet_data_reader_worker.hpp:29:10: fatal error: cudf/io/functions.hpp: No such file or directory
 #include <cudf/io/functions.hpp>

cudf 0.16 has refactored some code, and functions.hpp does not exist anymore. The includes in HugeCTR have to be updated.

setting seed can't reproduce the results

What have I done?

  • set seed in config file: "solver": {"seed": 100}

  • set maxiter=2 , eval_interval=1

  • set file_list.txt with 1 file; set file_list_with.txt with 1 file

  • set train and eval reader chunk_size to 1:

    data_reader.reset(new DataReader(source_data, batch_size, label_dim, dense_dim,
                                     check_type, data_reader_sparse_param_array,
                                     gpu_resource_group, 1, use_mixed_precision));

  • run the train process ./huge_ctr --train model.json twice

What happened?

The AverageLoss values (1.200125 and 1.18949) are too far apart.

the first train log:

 [05d17h49m17s][HUGECTR][INFO]: Iter: 1 Time(1 iters): 0.101207s Loss: 1.211278 lr:0.000100

[05d17h49m18s][HUGECTR][INFO]: Evaluation, AUC: 0.501446
[05d17h49m18s][HUGECTR][INFO]: Evaluation, AverageLoss: 1.200125

the second train log:

 
[05d17h51m37s][HUGECTR][INFO]: Iter: 1 Time(1 iters): 0.093456s Loss: 1.200530 lr:0.000100

[05d17h51m37s][HUGECTR][INFO]: Evaluation, AUC: 0.397724

[05d17h51m37s][HUGECTR][INFO]: Evaluation, AverageLoss: 1.18949
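
One possible contributor, offered as an assumption rather than a diagnosis: even with a fixed seed, any parallel floating-point accumulation whose order varies between runs (e.g. atomic gradient updates on the GPU) changes the result, because float addition is not associative. A toy demonstration:

#include <cstdio>
#include <vector>

int main() {
  std::vector<float> xs(1000);
  for (int i = 0; i < 1000; ++i) xs[i] = 1.0f / (i + 1);
  float fwd = 0.f, bwd = 0.f;
  for (int i = 0; i < 1000; ++i) fwd += xs[i];   // one summation order
  for (int i = 999; i >= 0; --i) bwd += xs[i];   // the reverse order
  std::printf("forward: %.9f  backward: %.9f\n", fwd, bwd);  // differ slightly
  return 0;
}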

Runtime error: out of memory /mnt/HugeCTR/HugeCTR/include/general_buffer.hpp:64

Hi, when I was trying to run DLRM on the Terabyte dataset with one GPU, I got a runtime error message like this. My guess is that I ran out of GPU memory. I've also tried decreasing the mini-batch size or batchsize_eval but still get this error. Does anyone know how to solve this issue?

I was running the following command:
./huge_ctr --train ./dlrm_fp16_64k.json

And the solver in my dlrm_fp16_64k.json looks like this:
"solver": {
"lr_policy": "fixed",
"display": 1000,
"max_iter":64013,
"gpu": [0],
"batchsize": 1024,
"batchsize_eval": 131072,
"snapshot": 10000000,
"snapshot_prefix": "./",
"eval_interval": 3200,
"eval_batches": 681,
"mixed_precision": 1024,
"eval_metrics": ["AUC:0.8025"]
}
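
A back-of-the-envelope sketch that may help size the problem (the numbers below are placeholders, not values taken from the full dlrm_fp16_64k.json, which is not shown): the embedding weights alone take vocabulary x vector size x bytes per element, and any optimizer state multiplies that.

#include <cstddef>
#include <cstdio>

int main() {
  std::size_t vocab_per_gpu = 24000000;  // placeholder
  std::size_t vec_size = 128;            // placeholder
  std::size_t bytes_per_elem = 4;        // 2 if embeddings are fp16
  double gib = 1024.0 * 1024.0 * 1024.0;
  double weights = double(vocab_per_gpu) * vec_size * bytes_per_elem / gib;
  std::printf("embedding weights: %.1f GiB per GPU\n", weights);
  std::printf("with an optimizer keeping two extra states: ~%.1f GiB\n",
              3 * weights);
  return 0;
}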

Can HugeCTR support Sequence Models?

Currently the embedding layer supports mean or sum pooling for variable-length features. In the deep learning world, using LSTM or Attention is normal. For example, the DIN model uses an attention layer to merge the user behavior sequence.

Can HugeCTR support sequence models, such as LSTM, GRU, Attention, etc.?

what's the meaning of local_id = feature_ids[k] + slot_offset_[k]?

Hi,
I read the v2.2 code. When the hash type is LocalizedSlotSparseEmbeddingOneHot, why is local_id = feature_ids[k] + slot_offset_[k]? What is the meaning of this?

if (params_.size() == 1 && params_[0].type == DataReaderSparse_t::Localized &&
    !slot_offset_.empty()) {
  auto& param = params_[0];
  for (int k = 0; k < param.slot_num; k++) {
    int dev_id = k % csr_chunk->get_num_devices();
    T local_id = feature_ids[k] + slot_offset_[k];
    csr_chunk->get_csr_buffer(param_id, dev_id).push_back_new_row(local_id);
  }
}
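
A hedged reading, from the surrounding code rather than the docs: slot_offset_ shifts each slot's local feature ids into a disjoint global range (the offsets being prefix sums of the per-slot vocabulary sizes), so ids from different slots cannot collide in the single localized embedding table.

#include <vector>

// Illustration: with per-slot vocabularies {10, 20, 30} the offsets are the
// prefix sums {0, 10, 30}; adding the offset makes every id globally unique.
long long to_global_id(int k, long long feature_id,
                       const std::vector<long long>& slot_offset) {
  return feature_id + slot_offset[k];
}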

No reaction appeared after the training start

Hi professionals,
We tried the steps in the HugeCTR tutorial and picked DeepFM for a trial. We successfully started the training, but nothing happened after the 'HugeCTR training start' text (we had waited for several days).

We tried several network configs, which however only varied max_iter (the network architecture was not changed); same problem.

System: Ubuntu 18.04.4 LTS
GPU: GeForce RTX 2080 Ti
Driver Version: 440.44
CUDA Version: 10.2

hugeCTR train performance question

I watched Zehuan Wang's talk "HugeCTR - 端到端点击率预估训练解决方案介绍" ("HugeCTR: An End-to-End CTR Estimation Training Solution").
In this deck, the "PERFORMANCE" slide reports only 17.8 ms per iteration on 8 GPUs.
17.8 ms per iteration seems too fast; is this an error?

running hugectr with multi nodes

Is there any complete tutorial about running HugeCTR with multiple nodes?

I have tried this:

Following the example (https://github.com/NVIDIA/HugeCTR/tree/master/samples/dcn2nodes), what I have done is:
Build a multi-node support image:

  • based on the dockerfile in HugeCTR
  • install hwloc 2.2.0
  • install UCX 1.8.0
  • install OpenMPI 4.0.3 with UCX support
  • install mpi4py 3.0.3
  • build HugeCTR: cmake -DCMAKE_BUILD_TYPE=Release -DSM=70 -DENABLE_MULTINODES=ON ..

Run HugeCTR on two NVLink-connected 8*V100 (32G) physical machines.

  • Start command is:
    export SSH_PORT="xxx"
    export NP="2"
    export WORK_DIR="/data/dcn_data/"
    export HOSTS="ip1:1,ip2:1"
    export ARGS=" ./bin/huge_ctr --train ./data/dcn-dist.json "
    cd $WORK_DIR
    bash start_dist.sh

start_dist.sh:
set -x

mpirun --bind-to none --allow-run-as-root -np $NP -H ${HOSTS} -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} -x LIBRARY_PATH=${LIBRARY_PATH} -x PATH=${PATH} -wdir ${PWD} --mca plm_rsh_agent "$PWD/ssh_resolver.sh" --mca btl_tcp_if_include ib0 $ARGS > logs.txt 2>&1 &

ssh_resolver.sh:

#!/bin/bash
HOSTNAME=$1
shift
ARGS=$*

ssh -p "$SSH_PORT" "$HOSTNAME" "$ARGS"

My question is:

Is my mpirun command correct? Should I specify UCX in mpirun? How does HugeCTR use UCX and hwloc? And how can I use InfiniBand/RDMA to accelerate HugeCTR?

For example, a UCX-enabled command looks like:
mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

https://github.com/openucx/ucx

Criteo dataset sample processing issue

I was trying to run HugeCTR on the Criteo Kaggle dataset. When I was converting the original Kaggle dataset to HugeCTR format using the Criteo2hugeCTR_legacy tool, I ran the command line as follows:

$ ./criteo2hugectr_legacy 1 ../../tools/criteo_script_legacy/train.out criteo/sparse_embedding file_list.txt
$ ./criteo2hugectr_legacy 1 ../../tools/criteo_script_legacy/test.out criteo_test/sparse_embedding file_list_test.txt

However, I'm not able to get the file_list.txt and file_list_test.txt from these scripts. I'm not sure what I did wrong here, since I pretty much followed the online readme from the beginning.

I also did some trials and realized that the problem might be in criteo2hugectr_legacy.cpp, since I wasn't able to read the EOF of txt_file (line 95).

I'd really appreciate it if you guys could explain this a bit. Thank you very much!

DataReader Refactoring TODO list

Description:

  • Python-friendly APIs: how to make the code more uniform?

void DataReader::set_source() {
  worker_group_.reset(new xxx_data_reader_worker_group);
}
// and no explicit call of start() is needed
  • Completely eliminate repeat from DataReader when it is ready, e.g., enable set_source for Raw
  • Remove all the default arguments from DataReader
  • Decouple Dataset source type from DataReader type (#169 #138)
  • Make the number of DataReaders configurable (and perhaps automatic configuration for a given system)
  • Support Eval for one epoch instead of specifying n_batches. It may require the change to Metrics as well. (@minseokl will see how TF and Pytorch tackle this issue)
  • Remove the duplicate code if possible

Comments:
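
One sketch of the "decouple source type from reader type" item above (the class names here are hypothetical, not the team's plan): choose the worker group by source type inside set_source(), so a single DataReader serves Norm/Raw/Parquet and no explicit start() is needed.

#include <memory>
#include <string>

// Hypothetical minimal interfaces, for illustration only.
struct WorkerGroup { virtual ~WorkerGroup() = default; };
struct NormWorkerGroup : WorkerGroup { explicit NormWorkerGroup(const std::string&) {} };
struct RawWorkerGroup : WorkerGroup { explicit RawWorkerGroup(const std::string&) {} };
struct ParquetWorkerGroup : WorkerGroup { explicit ParquetWorkerGroup(const std::string&) {} };

enum class SourceType { Norm, Raw, Parquet };

struct DataReader {
  std::unique_ptr<WorkerGroup> worker_group_;
  // The reader no longer encodes the source type; swapping the worker group
  // here also removes the need for an explicit start() call.
  void set_source(SourceType type, const std::string& path) {
    switch (type) {
      case SourceType::Norm:    worker_group_.reset(new NormWorkerGroup(path)); break;
      case SourceType::Raw:     worker_group_.reset(new RawWorkerGroup(path)); break;
      case SourceType::Parquet: worker_group_.reset(new ParquetWorkerGroup(path)); break;
    }
  }
};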

Error of running './huge_ctr --train ./deepfm_bin.json'

Hi there,
I tried running the HugeCTR Docker example of DeepFM with NVTabular preprocessing, but after running the command in the title, it shows errors and stops at the training start. Is there a bug? Thanks.

System: Ubuntu 18.04.4 LTS
GPU: GeForce RTX 2080 Ti
Driver Version: 440.44
CUDA Version: 10.2

[0.001, init_start, ]
HugeCTR Version: 2.2.1
Config file: ./deepfm_bin.json
[21d09h02m26s][HUGECTR][INFO]: batchsize_eval is not specified using default: 512
[21d09h02m26s][HUGECTR][INFO]: Default evaluation metric is AUC without threshold value
[21d09h02m26s][HUGECTR][INFO]: algorithm_search is not specified using default: 1
[21d09h02m26s][HUGECTR][INFO]: Algorithm search: ON
[21d09h02m26s][HUGECTR][INFO]: cuda_graph is not specified using default: 1
[21d09h02m26s][HUGECTR][INFO]: CUDA Graph: ON
[21d09h02m26s][HUGECTR][INFO]: Initial seed is 3545387129
[21d09h02m28s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: GeForce RTX 2080 Ti
[21d09h02m30s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[21d09h02m30s][HUGECTR][INFO]: max_nnz is not specified using default: 30
[21d09h02m30s][HUGECTR][INFO]: num_internal_buffers 1
[21d09h02m30s][HUGECTR][INFO]: num_internal_buffers 1
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
(the line above repeats many times, interleaved across the data reader worker threads)
[21d09h02m30s][HUGECTR][INFO]: max_vocabulary_size_per_gpu_=1737709
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69
(the line above likewise repeats many times)
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

: /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] DataHeaderError /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:58
[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[HCDEBUG][ERROR] Runtime error: failed to read a file /hugectr/HugeCTR/include/data_readers/data_reader_worker.hpp:69

[21d09h02m30s][HUGECTR][INFO]: gpu0 start to init embedding
[21d09h02m30s][HUGECTR][INFO]: gpu0 init embedding done
[21d09h02m30s][HUGECTR][INFO]: warmup_steps is not specified using default: 1
[21d09h02m30s][HUGECTR][INFO]: decay_start is not specified using default: 0
[21d09h02m30s][HUGECTR][INFO]: decay_steps is not specified using default: 1
[21d09h02m30s][HUGECTR][INFO]: decay_power is not specified using default: 2.000000
[21d09h02m30s][HUGECTR][INFO]: end_lr is not specified using default: 0.000000
[3538.92, init_end, ]
[3538.94, run_start, ]
HugeCTR training start:
[3538.95, train_epoch_start, 0, ]
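
For reference, the DataHeaderError raised at data_reader_worker.hpp:58 comes from the header check on a Norm-format binary file, and the "failed to read a file" error at line 69 usually follows from the same malformed or truncated file. A minimal inspection sketch in Python — assuming the documented 64-byte Norm header (error_check, number_of_samples, label_dim, dense_dim, slot_num, plus three reserved int64 fields) and a hypothetical file path — can confirm whether a file carries the header the reader expects:

    import struct

    def inspect_norm_header(path):
        # The Norm format begins with eight little-endian int64 fields.
        with open(path, "rb") as f:
            raw = f.read(64)
        if len(raw) < 64:
            raise RuntimeError("file too short to hold a Norm header")
        names = ["error_check", "number_of_samples", "label_dim",
                 "dense_dim", "slot_num", "reserved0", "reserved1", "reserved2"]
        for name, value in zip(names, struct.unpack("<8q", raw)):
            print(f"{name}: {value}")

    inspect_norm_header("./train/0.data")  # hypothetical path

If label_dim, dense_dim, or slot_num disagree with the values in the data layer config, the worker rejects the file at line 58.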

Parser doesn't check if a given layer name is already in use

Description:
Currently, our parser doesn't check if a specified layer "name" is already being used by a preceding layer.
As a result, erroneous layers like the following can be silently inserted into the network.
Without any safety measure, this kind of config bug can result in a disconnected network whose parameters are not appropriately trained; a stopgap validation sketch follows the snippet.

      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
                }
      },

      {
        "name": "fc6",
        "type": "InnerProduct",
        "bottom": "relu5",
        "top": "fc6",
         "fc_param": {
          "num_output": 512
        }
      },

      {
        "name": "relu6",
        "type": "ReLU",
        "bottom": "fc6",
        "top": "relu6"
      },
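
Until the parser enforces this, a few lines of Python can catch duplicate layer names in a config before it is submitted. This is only a stopgap sketch over a hypothetical config path, not part of HugeCTR:

    import json
    from collections import Counter

    def check_unique_layer_names(config_path):
        with open(config_path) as f:
            layers = json.load(f)["layers"]
        counts = Counter(layer["name"] for layer in layers)
        dupes = [name for name, n in counts.items() if n > 1]
        if dupes:
            raise ValueError(f"duplicate layer names: {dupes}")

    check_unique_layer_names("dcn.json")  # hypothetical config file

The same check could be applied to the "top" tensor names, since a duplicated output tensor silently disconnects the network in the same way.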

Comments:

parquet datareader illegal memory access when training on 2xDGXA100

Description:
commit:
commit 53a2ff8 (HEAD -> v2.3-integration, origin/v2.3-integration, origin/data-power-law-kingsley)
Merge: e75c290 3c49da0
Author: Joey Wang [email protected]
Date: Sat Oct 31 01:24:10 2020 -0700
Merge branch 'fea-multinode-auc-dmitry-2.3' into 'v2.3-integration'
Multinode AUC
See merge request zehuanw/hugectr!257

dataset: /mnt/dldata/criteo_1TB/albertoa/test_dask/output/ in dlcluster

config: 2xdgxa100.json

error log:hugectr-test-1604632076.log

reproduce step:
Currently I am facing this bug when using raplab. I will update the reproduction steps when I have access to Selene.
Comments:

Why is my AUC so high in the first 1000 iters?

I just want to run a DCN sample training, using the following model JSON:

{
  "solver": {
    "lr_policy": "fixed",
    "display": 1000,
    "max_iter": 10000,
    "gpu": [0],
    "batchsize": 512,
    "snapshot": 10000000,
    "snapshot_prefix": "./",
    "eval_interval": 1000,
    "eval_batches": 60,
    "input_key_type": "I64"
  },
  
  "optimizer": {
    "type": "Adam",
    "global_update": true,
    "adam_hparam": {
      "learning_rate": 0.001,
      "beta1": 0.9,
      "beta2": 0.999,
      "epsilon": 0.0000001
    }
  },

  "layers": [ 
      {
      "name": "data",
      "type": "Data",
      "format": "Parquet",
      "slot_size_array": [1461, 558, 335378, 211710, 306, 20, 12136, 634, 4, 51298, 5302, 332600, 3179, 27, 12191, 301211, 11, 4841, 2086, 4, 324273, 17, 16, 79734, 96, 58622],
      "source": "./dcn_data/train/_file_list.txt",
      "eval_source": "./dcn_data/val/_file_list.txt",
      "check": "None",
      "label": {
        "top": "label",
        "label_dim": 1
      },
      "dense": {
        "top": "dense",
        "dense_dim": 13
      },
      "sparse": [
        {
          "top": "data1",
          "type": "DistributedSlot",
          "max_feature_num_per_sample": 30,
          "slot_num": 26
        }        
      ]
    },

    {
      "name": "sparse_embedding1",
      "type": "DistributedSlotSparseEmbeddingHash",
      "bottom": "data1",
      "top": "sparse_embedding1",
      "sparse_embedding_hparam": {
        "max_vocabulary_size_per_gpu": 1737709,
        "embedding_vec_size": 16,
        "combiner": 0
      }
    },

    {
      "name": "reshape1",
      "type": "Reshape",
      "bottom": "sparse_embedding1",
      "top": "reshape1",
      "leading_dim": 416
    },


    {
      "name": "concat1",
      "type": "Concat",
      "bottom": ["reshape1","dense"],
      "top": "concat1"
    },

    {
      "name": "slice1",
      "type": "Slice",
      "bottom": "concat1",
      "ranges": [[0,429], [0,429]],
      "top": ["slice11", "slice12"]
    },


    {
      "name": "multicross1",
      "type": "MultiCross",
      "bottom": "slice11",
      "top": "multicross1",
      "mc_param": {
        "num_layers": 6
      }
    },

    {
      "name": "fc1",
      "type": "InnerProduct",
      "bottom": "slice12",
      "top": "fc1",
       "fc_param": {
        "num_output": 1024
      }
    },

    {
      "name": "relu1",
      "type": "ReLU",
      "bottom": "fc1",
      "top": "relu1" 
    },
      
    {
      "name": "dropout1",
      "type": "Dropout",
      "rate": 0.5,
      "bottom": "relu1",
      "top": "dropout1" 
    },

    {
      "name": "fc2",
      "type": "InnerProduct",
      "bottom": "dropout1",
      "top": "fc2",
       "fc_param": {
        "num_output": 1024
      }
    },

    {
      "name": "relu2",
      "type": "ReLU",
      "bottom": "fc2",
      "top": "relu2"     
    },

    {
      "name": "dropout2",
      "type": "Dropout",
      "rate": 0.5,
      "bottom": "relu2",
      "top": "dropout2" 
    },
    
    {
      "name": "concat2",
      "type": "Concat",
      "bottom": ["dropout2","multicross1"],
      "top": "concat2"
    },
    
    {
      "name": "fc4",
      "type": "InnerProduct",
      "bottom": "concat2",
      "top": "fc4",
       "fc_param": {
        "num_output": 1
      }
    },
    
    {
      "name": "loss",
      "type": "BinaryCrossEntropyLoss",
      "bottom": ["fc4","label"],
      "top": "loss"
    } 
  ]
}
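
As an aside, the hard-coded dimensions in a config like this are easy to sanity-check: the Reshape leading_dim should equal slot_num × embedding_vec_size, and the Slice ranges should stay within that value plus dense_dim. A quick check with the values from the config above:

    slot_num = 26
    embedding_vec_size = 16
    dense_dim = 13

    leading_dim = slot_num * embedding_vec_size  # 416, matches reshape1's "leading_dim"
    concat_width = leading_dim + dense_dim       # 429, matches the Slice ranges [0, 429]

    assert leading_dim == 416 and concat_width == 429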

But the metrics reported for my first 1000 iters look very strange:

[04d15h08m10s][HUGECTR][INFO]: Iter: 1000 Time(1000 iters): 6.113479s Loss: 0.527308 lr:0.001000
[8665.98, eval_start, 0.1, ]
[04d15h08m10s][HUGECTR][INFO]: Evaluation, AUC: 0.692035
[8708.16, eval_accuracy, 0.692035, 0.1, 1000, ]
[04d15h08m10s][HUGECTR][INFO]: Eval Time for 60 iters: 0.042175s
[8708.18, eval_stop, 0.1, ]
[04d15h08m16s][HUGECTR][INFO]: Iter: 2000 Time(1000 iters): 6.171510s Loss: 0.426323 lr:0.001000
[14837.7, eval_start, 0.2, ]
....

Is it normal that my AUC already hits 0.692035 after the first 1000 iters? I also find that the AUC decreases over many of the following 1000-iter evaluations.

[ QUESTION ] training without eval set

Is there a way to run training without a validation set? Whenever I don't provide anything for eval_source, I get a file_empty kind of error.

On top of that, HugeCTR really needs to work on its error messages; I had to trace the code to see what was happening.
