Comments (22)
Hi @zpcalan, I think we need more information:
- Have you tried running without ETC? If so, does it run well?
- How did you preprocess the dataset for ETC? How did you generate the keyset for day()?
from hugectr.
I appreciate your quick reply!
- Yes, I ran without ETC but it still does not converge. The exception is the same.
- I preprocessed the day0~day23 data using
bash preprocess.sh $i /root/HugeCTR/etc_data/day$i nvt 1 0 1
as described in this link. As for the keyset, I ran this command:
cmd_str="python generate_keyset.py --src_dir_path ./etc_data/day"+str(i)+"/train --keyset_path ./etc_data/day"+str(i)+"/train/_hugectr.keyset"
os.system(cmd_str)
As you can see, I generated the keyset without a slot size array. That should not affect convergence.
from hugectr.
Another piece of information you might be interested in: I adjusted the script preprocess.sh so that it processes the whole day's dataset instead of only the first 5000000 samples. I think this limit might be a minor issue in the script?
--- a/tools/preprocess.sh
+++ b/tools/preprocess.sh
@@ -50,10 +50,11 @@ fi
SCRIPT_TYPE=$3
-echo "Getting the first few examples from the uncompressed dataset..."
+sample_num=`wc -l day_$1|awk '{print $1}'`
+echo "Getting the first few examples from the uncompressed dataset... $sample_num"
mkdir -p $DST_DATA_DIR/train && \
mkdir -p $DST_DATA_DIR/val && \
-head -n 5000000 day_$1 > $DST_DATA_DIR/day_$1_small
+head -n $sample_num day_$1 > $DST_DATA_DIR/day_$1_small
if [ $? -ne 0 ]; then
echo "Warning: fallback to find original compressed data day_$1.gz..."
echo "Decompressing day_$1.gz..."
@@ -62,7 +63,7 @@ if [ $? -ne 0 ]; then
echo "Error: failed to decompress the file."
exit 2
fi
- head -n 5000000 day_$1 > $DST_DATA_DIR/day_$1_small
+ head -n $sample_num day_$1 > $DST_DATA_DIR/day_$1_small
if [ $? -ne 0 ]; then
echo "Error: day_$1 file"
exit 2
@@ -111,7 +112,7 @@ if [[ $SCRIPT_TYPE == "nvt" ]]; then
--freq_limit 6 \
--device_limit_frac 0.5 \
--device_pool_frac 0.5 \
- --out_files_per_proc 8 \
+ --out_files_per_proc 20 \
--devices "0" \
--num_io_threads 2 \
--parquet_format=$IS_PARQUET_FORMAT \
from hugectr.
Hi @zpcalan, I guess you included more samples. Did you modify the workspace_size_per_gpu_in_mb or slot_size_array value? Those parameters should be increased accordingly.
from hugectr.
Yes, I changed workspace_size_per_gpu_in_mb to 4048 because I set embedding_vec_size to 240.
But I didn't change slot_size_array when I generated the keyset files, as this comment says. In my script I also didn't change slot_size_array, so it's all 0s.
from hugectr.
Hi @zpcalan, I assume you are using the Parquet dataset, right? What does your slot_size_array look like in DataReaderParams?
Is the script in #395 throwing this error? If so, I think the problem might be that you have left slot_size_array unset.
Please refer to the doc:
slot_size_array: List[int], specify the maximum key value for each slot. Refer to the following equation. The array should be consistent with that of the sparse input. HugeCTR requires this argument for Parquet format data and RawAsync format when you want to add an offset to the input key. The default value is an empty list.
PS. Let's focus on the model without ETC feature first.
from hugectr.
BTW, @zpcalan have you ever tried without mixed precision training?
from hugectr.
@JacoCheung
Yes, the script in issue #395 throws this error as well (without ETC).
And yes, I have tried without mixed precision training. The result is correct and no exception is thrown.
slot_size_array: List[int], specify the maximum key value for each slot. Refer to the following equation. The array should be consistent with that of the sparse input. HugeCTR requires this argument for Parquet format data and RawAsync format when you want to add an offset to the input key. The default value is an empty list.
I didn't set slot_size_array because I don't need to add an offset to the keys.
I think the categorical feature keys of each sample are globally unique, so I don't quite understand why an offset should be added when I use a single GPU to train this model.
Everything described above is without ETC.
Do you have any suggestions about this?
from hugectr.
No. If you're using our preprocessing script, there is no guarantee that the key ranges of two slots are disjoint. For instance, C0 and C1 can hold identical keys, e.g. [12, 12, ...].
from hugectr.
Do you mean that C0 and C1 of one sample could both be 12?
If so, I understand that an offset must be added so that the keys of each slot in a sample are unique. But why is this happening?
I will set slot_size_array and run FP16 training.
It is printed when preprocessing the dataset, right? Just like this:
Preprocessing
Train Datasets Preprocessing.....
[932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34]
Valid Datasets Preprocessing.....
[932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34]
So I will set slot_size_array to [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34].
from hugectr.
But why is this happening.
The uniqueness requirement derives from the Embedding. You can assume that for Norm or Raw, the keys are already guaranteed to be unique by the preprocessing. But for Parquet, it's the data reader's duty to add an offset to make them unique. The preprocessing for Parquet is done per feature, so different slots do not interfere with each other. For example, C0 may hold the user_id while C1 may hold the item_id; nvt processes them individually, so C0 and C1 both start from 0. Sorry for the inconsistency; we're trying to improve the user experience.
You can try embedding collection, which is a new, unified embedding type.
It is printed when preprocessing dataset, right?
Yes.
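To make the offset idea concrete, here is a minimal sketch in plain Python. It assumes the offset of a slot is the running total of the sizes of all earlier slots (an exclusive prefix sum); the exact equation in the HugeCTR documentation is not reproduced in this thread, so treat that as an assumption. The helper names slot_offsets and add_offsets and the toy sizes are hypothetical and are not part of HugeCTR.

```python
def slot_offsets(slot_size_array):
    """Exclusive prefix sum: the offset of slot i is the total size of slots 0..i-1."""
    offsets, running = [], 0
    for size in slot_size_array:
        offsets.append(running)
        running += size
    return offsets


def add_offsets(sample_keys, offsets):
    """Shift each slot-local key into its own disjoint global range."""
    return [key + off for key, off in zip(sample_keys, offsets)]


# Hypothetical toy sizes, not taken from the thread.
toy_slot_sizes = [4, 3, 5]
offsets = slot_offsets(toy_slot_sizes)

# C0, C1 and C2 all carry the same slot-local key 2 in this sample.
local_keys = [2, 2, 2]
global_keys = add_offsets(local_keys, offsets)

print(offsets)      # [0, 4, 7]
print(global_keys)  # [2, 6, 9]
```

Three identical slot-local keys end up as three distinct global keys, which is the property the embedding needs.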
from hugectr.
Thanks for explaining!
After the offset is added to each key, the total number of unique keys will be the sum of [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34], right?
If so, a single card's memory will not be enough for the embedding, and ETC must be introduced.
My original goal is to run day0-23 ETC training with FP16.
I can run FP16 training with a small dataset first. If it converges, can I use ETC training to run day0's dataset?
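A rough way to check the size in question is to add up the array and multiply by the vector width. The sketch below assumes fp32 weights (4 bytes each) and counts only the embedding weights themselves: optimizer state (Adam keeps extra values per weight) and any hash-table overhead are ignored, so this is a lower bound and not HugeCTR's actual memory accounting.

```python
slot_size_array = [
    932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268,
    55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951,
    15, 838640, 446266, 793237, 130145, 10112, 83, 34,
]
embedding_vec_size = 240  # value mentioned earlier in the thread
bytes_per_weight = 4      # assumption: fp32 embedding weights

# Upper bound on distinct keys once every slot has its own key range.
total_keys = sum(slot_size_array)
weight_bytes = total_keys * embedding_vec_size * bytes_per_weight
weight_gib = weight_bytes / 2**30

print(f"total keys: {total_keys}")
print(f"embedding weights only: {weight_gib:.2f} GiB")
```

The array lists the maximum key per slot, so the sum is an upper bound on the number of rows the embedding table may need.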
from hugectr.
Hi, developers of HugeCTR.
I tried to run FP16 training with a small dataset in Parquet format and with slot_size_array correctly assigned, but after several steps the AUC kept declining and training finally failed to converge.
The script is:
import hugectr
from mpi4py import MPI
solver = hugectr.CreateSolver(
max_eval_batches=300,
batchsize_eval=16384,
batchsize=16384,
lr=0.001,
vvgpu=[[0]],
repeat_dataset=False,
i64_input_key=True,
use_mixed_precision=True
)
reader = hugectr.DataReaderParams(
data_reader_type=hugectr.DataReaderType_t.Parquet,
source=["./etc_data/small_day0/train/_file_list.txt"],
keyset = ["./etc_data/small_day0/train/_hugectr.keyset"],
eval_source="./etc_data/small_day0/val/_file_list.txt",
slot_size_array=[13332, 31854, 10950, 6598, 8743, 4261, 10704, 4, 5009, 991, 28, 12198, 8710, 13823, 10, 1426, 3437, 48, 4, 643, 15, 10765, 11862, 11316, 8432, 5954, 42, 33],
check_type=hugectr.Check_t.Non,
)
optimizer = hugectr.CreateOptimizer(
optimizer_type=hugectr.Optimizer_t.Adam,
update_type=hugectr.Update_t.Global,
beta1=0.9,
beta2=0.999,
epsilon=0.0000001,
)
model = hugectr.Model(solver, reader, optimizer)
model.add(
hugectr.Input(
label_dim=1,
label_name="label",
dense_dim=13,
dense_name="dense",
data_reader_sparse_param_array=[
hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
hugectr.DataReaderSparseParam("deep_data", 2, False, 26),
],
)
)
model.add(
hugectr.SparseEmbedding(
embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
workspace_size_per_gpu_in_mb=69,
embedding_vec_size=1,
combiner="sum",
sparse_embedding_name="sparse_embedding2",
bottom_name="wide_data",
optimizer=optimizer,
)
)
model.add(
hugectr.SparseEmbedding(
embedding_type=hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash,
workspace_size_per_gpu_in_mb=1024,
embedding_vec_size=240,
combiner="sum",
sparse_embedding_name="sparse_embedding1",
bottom_name="deep_data",
optimizer=optimizer,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Reshape,
bottom_names=["sparse_embedding1"],
top_names=["reshape1"],
leading_dim=6240,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Reshape,
bottom_names=["sparse_embedding2"],
top_names=["reshape2"],
leading_dim=1,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Concat, bottom_names=["reshape1", "dense"], top_names=["concat1"]
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["concat1"],
top_names=["fc1"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu1"],
top_names=["dropout1"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout1"],
top_names=["fc2"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu2"],
top_names=["dropout2"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout2"],
top_names=["fc3"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc3"], top_names=["relu3"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu3"],
top_names=["dropout3"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout3"],
top_names=["fc4"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc4"], top_names=["relu4"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu4"],
top_names=["dropout4"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout4"],
top_names=["fc5"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc5"], top_names=["relu5"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu5"],
top_names=["dropout5"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout5"],
top_names=["fc6"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc6"], top_names=["relu6"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu6"],
top_names=["dropout6"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout6"],
top_names=["fc7"],
num_output=1024,
)
)
model.add(
hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc7"], top_names=["relu7"])
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Dropout,
bottom_names=["relu7"],
top_names=["dropout7"],
dropout_rate=0.5,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.InnerProduct,
bottom_names=["dropout7"],
top_names=["fc8"],
num_output=1,
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.Add, bottom_names=["fc8", "reshape2"], top_names=["add1"]
)
)
model.add(
hugectr.DenseLayer(
layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
bottom_names=["add1", "label"],
top_names=["loss"],
)
)
model.compile()
model.summary()
model.fit(max_iter=1, display=10, eval_interval=40, snapshot=1000000, snapshot_prefix="wdl", num_epochs=100)
The result:
[HCTR][12:21:15.423][INFO][RK0][main]: Eval Time for 300 iters: 0.66509s
[HCTR][12:21:15.900][INFO][RK0][main]: Iter: 4170 Time(10 iters): 1.13876s Loss: 0.0917276 lr:0.001
[HCTR][12:21:16.438][INFO][RK0][main]: Iter: 4180 Time(10 iters): 0.533409s Loss: 0.0895059 lr:0.001
[HCTR][12:21:16.939][INFO][RK0][main]: Iter: 4190 Time(10 iters): 0.495925s Loss: 0.086796 lr:0.001
[HCTR][12:21:17.395][INFO][RK0][main]: Iter: 4200 Time(10 iters): 0.451196s Loss: 0.0831752 lr:0.001
[HCTR][12:21:18.081][INFO][RK0][main]: Evaluation, AUC: 0.602723
[HCTR][12:21:18.081][INFO][RK0][main]: Eval Time for 300 iters: 0.684147s
[HCTR][12:21:18.552][INFO][RK0][main]: Iter: 4210 Time(10 iters): 1.15236s Loss: 0.0809434 lr:0.001
[HCTR][12:21:18.684][INFO][RK0][main]: train drop incomplete batch. batchsize:10752
[HCTR][12:21:18.684][INFO][RK0][main]: -----------------------------------Epoch 43-----------------------------------
[HCTR][12:21:19.083][INFO][RK0][main]: Iter: 4220 Time(10 iters): 0.526233s Loss: 0.077393 lr:0.001
[HCTR][12:21:19.552][INFO][RK0][main]: Iter: 4230 Time(10 iters): 0.464746s Loss: 0.078123 lr:0.001
[HCTR][12:21:20.005][INFO][RK0][main]: Iter: 4240 Time(10 iters): 0.447292s Loss: 0.0808783 lr:0.001
[HCTR][12:21:20.694][INFO][RK0][main]: Evaluation, AUC: 0.611602
[HCTR][12:21:20.694][INFO][RK0][main]: Eval Time for 300 iters: 0.688137s
Traceback (most recent call last):
File "small_wdl.py", line 267, in <module>
model.fit(max_iter=1, display=10, eval_interval=40, snapshot=1000000, snapshot_prefix="wdl", num_epochs=1000)
RuntimeError: Train Runtime error: Loss cannot converge /hugectr/HugeCTR/src/pybind/model.cpp:2019
Is there something wrong with the script?
from hugectr.
Hi, any progress on this issue? There seems to be a convergence issue when using use_mixed_precision=True.
from hugectr.
It looks like you were using epoch mode and the dataset contained about 1642300 (= 4210 / 42 * 16384) samples, right?
Could you please post the AUC for each epoch? I'd like to know when the AUC started to drop.
In addition, we have not tried enabling fp16 training for this model, and the hyperparameters may differ subtly from those of fp32 training, for example the scaler, learning_rate, etc. Please refer to the solver documentation for more details.
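The sample estimate above can be reproduced from the training log. This is only a back-of-the-envelope sketch: it reuses the numbers quoted in the comment, and it ignores the incomplete batch that the log reports as dropped at the end of each epoch.

```python
iters_done = 4210   # last iteration logged before the "Epoch 43" banner
epochs_done = 42    # divisor used in the estimate above
batchsize = 16384   # batchsize passed to CreateSolver in the posted script

iters_per_epoch = iters_done / epochs_done
approx_samples = iters_per_epoch * batchsize

print(f"{iters_per_epoch:.1f} iterations per epoch, about {approx_samples:,.0f} samples")
```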
from hugectr.
Because HugeCTR does not support a dynamic scaler, divergence sometimes occurs when there is fp16 overflow. For example, if a weight is too large, the GEMM produces an intermediate value larger than 65,504 (the largest finite fp16 value), which becomes Inf, and the Inf then propagates.
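The overflow can be illustrated with the standard library alone. The sketch below is a simplified model, not HugeCTR code: it rounds every intermediate of a dot product to fp16, whereas real GPU kernels may accumulate in higher precision. The helpers to_fp16 and fp16_dot are hypothetical names introduced for this illustration.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x):
    """Round x to the nearest fp16 value; out-of-range values become +/-inf."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        # The "e" format packs a float as IEEE 754 binary16.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Python refuses to pack values beyond the fp16 range; emulate hardware inf.
        return math.copysign(math.inf, x)


def fp16_dot(xs, ws):
    """Dot product in which every product and partial sum is rounded to fp16."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = to_fp16(acc + to_fp16(x * w))
    return acc


small = fp16_dot([1.0] * 4, [100.0] * 4)    # stays well inside the fp16 range
large = fp16_dot([300.0] * 4, [300.0] * 4)  # each product is 90000 > FP16_MAX

print(small)          # 400.0
print(large)          # inf
print(large - large)  # nan: once inf appears, later arithmetic can turn it into nan
```

Once a single intermediate exceeds the fp16 range, every value computed from it is Inf or NaN, which is consistent with the loss eventually being reported as unable to converge.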
from hugectr.
I see. It seems fp16 training has not been fully tested, and some hyperparameters need to be adjusted.
The AUC:
[HCTR][03:41:14.346][INFO][RK0][main]: -----------------------------------Epoch 0-----------------------------------
[HCTR][03:41:17.143][INFO][RK0][main]: Evaluation, AUC: 0.614141
[HCTR][03:41:19.711][INFO][RK0][main]: Evaluation, AUC: 0.660234
[HCTR][03:41:20.493][INFO][RK0][main]: -----------------------------------Epoch 1-----------------------------------
[HCTR][03:41:22.209][INFO][RK0][main]: Evaluation, AUC: 0.539144
[HCTR][03:41:24.770][INFO][RK0][main]: Evaluation, AUC: 0.678606
[HCTR][03:41:26.449][INFO][RK0][main]: -----------------------------------Epoch 2-----------------------------------
[HCTR][03:41:27.352][INFO][RK0][main]: Evaluation, AUC: 0.682721
[HCTR][03:41:30.001][INFO][RK0][main]: Evaluation, AUC: 0.674805
[HCTR][03:41:32.636][INFO][RK0][main]: Evaluation, AUC: 0.631986
[HCTR][03:41:33.225][INFO][RK0][main]: -----------------------------------Epoch 3-----------------------------------
[HCTR][03:41:35.237][INFO][RK0][main]: Evaluation, AUC: 0.691028
[HCTR][03:41:37.849][INFO][RK0][main]: Evaluation, AUC: 0.705744
[HCTR][03:41:39.359][INFO][RK0][main]: -----------------------------------Epoch 4-----------------------------------
[HCTR][03:41:40.451][INFO][RK0][main]: Evaluation, AUC: 0.681173
[HCTR][03:41:43.091][INFO][RK0][main]: Evaluation, AUC: 0.705617
[HCTR][03:41:45.716][INFO][RK0][main]: Evaluation, AUC: 0.700359
[HCTR][03:41:46.108][INFO][RK0][main]: -----------------------------------Epoch 5-----------------------------------
[HCTR][03:41:48.327][INFO][RK0][main]: Evaluation, AUC: 0.696902
[HCTR][03:41:50.903][INFO][RK0][main]: Evaluation, AUC: 0.659046
[HCTR][03:41:52.239][INFO][RK0][main]: -----------------------------------Epoch 6-----------------------------------
[HCTR][03:41:53.527][INFO][RK0][main]: Evaluation, AUC: 0.700403
[HCTR][03:41:56.142][INFO][RK0][main]: Evaluation, AUC: 0.698039
[HCTR][03:41:58.755][INFO][RK0][main]: Evaluation, AUC: 0.688778
[HCTR][03:41:58.935][INFO][RK0][main]: -----------------------------------Epoch 7-----------------------------------
[HCTR][03:42:01.342][INFO][RK0][main]: Evaluation, AUC: 0.684459
[HCTR][03:42:03.961][INFO][RK0][main]: Evaluation, AUC: 0.689478
[HCTR][03:42:05.085][INFO][RK0][main]: -----------------------------------Epoch 8-----------------------------------
[HCTR][03:42:06.556][INFO][RK0][main]: Evaluation, AUC: 0.693916
[HCTR][03:42:09.151][INFO][RK0][main]: Evaluation, AUC: 0.691887
[HCTR][03:42:11.748][INFO][RK0][main]: Evaluation, AUC: 0.685392
[HCTR][03:42:11.748][INFO][RK0][main]: -----------------------------------Epoch 9-----------------------------------
[HCTR][03:42:14.344][INFO][RK0][main]: Evaluation, AUC: 0.682643
[HCTR][03:42:16.962][INFO][RK0][main]: Evaluation, AUC: 0.683189
[HCTR][03:42:17.884][INFO][RK0][main]: -----------------------------------Epoch 10-----------------------------------
[HCTR][03:42:19.543][INFO][RK0][main]: Evaluation, AUC: 0.6777
[HCTR][03:42:22.155][INFO][RK0][main]: Evaluation, AUC: 0.67126
[HCTR][03:42:23.952][INFO][RK0][main]: -----------------------------------Epoch 11-----------------------------------
[HCTR][03:42:24.773][INFO][RK0][main]: Evaluation, AUC: 0.676683
[HCTR][03:42:27.377][INFO][RK0][main]: Evaluation, AUC: 0.678563
[HCTR][03:42:30.015][INFO][RK0][main]: Evaluation, AUC: 0.677846
[HCTR][03:42:30.718][INFO][RK0][main]: -----------------------------------Epoch 12-----------------------------------
[HCTR][03:42:32.602][INFO][RK0][main]: Evaluation, AUC: 0.675074
[HCTR][03:42:35.200][INFO][RK0][main]: Evaluation, AUC: 0.669867
[HCTR][03:42:36.814][INFO][RK0][main]: -----------------------------------Epoch 13-----------------------------------
[HCTR][03:42:37.815][INFO][RK0][main]: Evaluation, AUC: 0.67586
[HCTR][03:42:40.420][INFO][RK0][main]: Evaluation, AUC: 0.667391
[HCTR][03:42:43.063][INFO][RK0][main]: Evaluation, AUC: 0.668702
[HCTR][03:42:43.559][INFO][RK0][main]: -----------------------------------Epoch 14-----------------------------------
[HCTR][03:42:45.641][INFO][RK0][main]: Evaluation, AUC: 0.675038
[HCTR][03:42:48.260][INFO][RK0][main]: Evaluation, AUC: 0.670299
[HCTR][03:42:49.677][INFO][RK0][main]: -----------------------------------Epoch 15-----------------------------------
[HCTR][03:42:50.854][INFO][RK0][main]: Evaluation, AUC: 0.671158
[HCTR][03:42:53.461][INFO][RK0][main]: Evaluation, AUC: 0.663839
[HCTR][03:42:56.069][INFO][RK0][main]: Evaluation, AUC: 0.659254
[HCTR][03:42:56.352][INFO][RK0][main]: -----------------------------------Epoch 16-----------------------------------
[HCTR][03:42:58.663][INFO][RK0][main]: Evaluation, AUC: 0.656877
[HCTR][03:43:01.294][INFO][RK0][main]: Evaluation, AUC: 0.663891
[HCTR][03:43:02.550][INFO][RK0][main]: -----------------------------------Epoch 17-----------------------------------
[HCTR][03:43:03.950][INFO][RK0][main]: Evaluation, AUC: 0.672403
[HCTR][03:43:06.594][INFO][RK0][main]: Evaluation, AUC: 0.665769
[HCTR][03:43:09.210][INFO][RK0][main]: Evaluation, AUC: 0.653346
[HCTR][03:43:09.284][INFO][RK0][main]: -----------------------------------Epoch 18-----------------------------------
[HCTR][03:43:11.797][INFO][RK0][main]: Evaluation, AUC: 0.648996
[HCTR][03:43:14.461][INFO][RK0][main]: Evaluation, AUC: 0.663004
[HCTR][03:43:15.481][INFO][RK0][main]: -----------------------------------Epoch 19-----------------------------------
[HCTR][03:43:17.026][INFO][RK0][main]: Evaluation, AUC: 0.667177
[HCTR][03:43:19.647][INFO][RK0][main]: Evaluation, AUC: 0.6584
[HCTR][03:43:21.530][INFO][RK0][main]: -----------------------------------Epoch 20-----------------------------------
[HCTR][03:43:22.291][INFO][RK0][main]: Evaluation, AUC: 0.653862
[HCTR][03:43:24.931][INFO][RK0][main]: Evaluation, AUC: 0.648851
[HCTR][03:43:27.568][INFO][RK0][main]: Evaluation, AUC: 0.657728
[HCTR][03:43:28.377][INFO][RK0][main]: -----------------------------------Epoch 21-----------------------------------
[HCTR][03:43:30.161][INFO][RK0][main]: Evaluation, AUC: 0.663705
[HCTR][03:43:32.785][INFO][RK0][main]: Evaluation, AUC: 0.648212
[HCTR][03:43:34.498][INFO][RK0][main]: -----------------------------------Epoch 22-----------------------------------
[HCTR][03:43:35.435][INFO][RK0][main]: Evaluation, AUC: 0.623344
[HCTR][03:43:38.061][INFO][RK0][main]: Evaluation, AUC: 0.642473
[HCTR][03:43:40.682][INFO][RK0][main]: Evaluation, AUC: 0.641618
[HCTR][03:43:41.272][INFO][RK0][main]: -----------------------------------Epoch 23-----------------------------------
[HCTR][03:43:43.283][INFO][RK0][main]: Evaluation, AUC: 0.653915
[HCTR][03:43:45.875][INFO][RK0][main]: Evaluation, AUC: 0.635579
[HCTR][03:43:47.386][INFO][RK0][main]: -----------------------------------Epoch 24-----------------------------------
[HCTR][03:43:48.470][INFO][RK0][main]: Evaluation, AUC: 0.641135
[HCTR][03:43:51.080][INFO][RK0][main]: Evaluation, AUC: 0.629272
[HCTR][03:43:53.721][INFO][RK0][main]: Evaluation, AUC: 0.637833
[HCTR][03:43:54.113][INFO][RK0][main]: -----------------------------------Epoch 25-----------------------------------
[HCTR][03:43:56.325][INFO][RK0][main]: Evaluation, AUC: 0.634068
[HCTR][03:43:58.966][INFO][RK0][main]: Evaluation, AUC: 0.632516
[HCTR][03:44:00.293][INFO][RK0][main]: -----------------------------------Epoch 26-----------------------------------
[HCTR][03:44:01.604][INFO][RK0][main]: Evaluation, AUC: 0.63881
[HCTR][03:44:04.189][INFO][RK0][main]: Evaluation, AUC: 0.629136
[HCTR][03:44:06.816][INFO][RK0][main]: Evaluation, AUC: 0.636865
[HCTR][03:44:06.995][INFO][RK0][main]: -----------------------------------Epoch 27-----------------------------------
[HCTR][03:44:09.392][INFO][RK0][main]: Evaluation, AUC: 0.630674
[HCTR][03:44:12.019][INFO][RK0][main]: Evaluation, AUC: 0.646079
[HCTR][03:44:13.148][INFO][RK0][main]: -----------------------------------Epoch 28-----------------------------------
[HCTR][03:44:14.626][INFO][RK0][main]: Evaluation, AUC: 0.652931
[HCTR][03:44:17.289][INFO][RK0][main]: Evaluation, AUC: 0.64743
[HCTR][03:44:19.890][INFO][RK0][main]: Evaluation, AUC: 0.636667
[HCTR][03:44:19.891][INFO][RK0][main]: -----------------------------------Epoch 29-----------------------------------
[HCTR][03:44:22.500][INFO][RK0][main]: Evaluation, AUC: 0.628411
[HCTR][03:44:25.026][INFO][RK0][main]: Evaluation, AUC: 0.643507
[HCTR][03:44:25.901][INFO][RK0][main]: -----------------------------------Epoch 30-----------------------------------
[HCTR][03:44:27.528][INFO][RK0][main]: Evaluation, AUC: 0.624483
[HCTR][03:44:30.098][INFO][RK0][main]: Evaluation, AUC: 0.627787
[HCTR][03:44:31.862][INFO][RK0][main]: -----------------------------------Epoch 31-----------------------------------
[HCTR][03:44:32.693][INFO][RK0][main]: Evaluation, AUC: 0.610337
[HCTR][03:44:35.310][INFO][RK0][main]: Evaluation, AUC: 0.629604
[HCTR][03:44:37.944][INFO][RK0][main]: Evaluation, AUC: 0.643551
[HCTR][03:44:38.666][INFO][RK0][main]: -----------------------------------Epoch 32-----------------------------------
[HCTR][03:44:40.527][INFO][RK0][main]: Evaluation, AUC: 0.625997
[HCTR][03:44:43.141][INFO][RK0][main]: Evaluation, AUC: 0.639994
[HCTR][03:44:44.730][INFO][RK0][main]: -----------------------------------Epoch 33-----------------------------------
[HCTR][03:44:45.742][INFO][RK0][main]: Evaluation, AUC: 0.626142
[HCTR][03:44:48.390][INFO][RK0][main]: Evaluation, AUC: 0.624291
[HCTR][03:44:51.018][INFO][RK0][main]: Evaluation, AUC: 0.62945
[HCTR][03:44:51.515][INFO][RK0][main]: -----------------------------------Epoch 34-----------------------------------
[HCTR][03:44:53.610][INFO][RK0][main]: Evaluation, AUC: 0.640553
[HCTR][03:44:56.249][INFO][RK0][main]: Evaluation, AUC: 0.634215
[HCTR][03:44:57.689][INFO][RK0][main]: -----------------------------------Epoch 35-----------------------------------
[HCTR][03:44:58.893][INFO][RK0][main]: Evaluation, AUC: 0.634598
[HCTR][03:45:01.556][INFO][RK0][main]: Evaluation, AUC: 0.618847
[HCTR][03:45:04.184][INFO][RK0][main]: Evaluation, AUC: 0.633745
[HCTR][03:45:04.461][INFO][RK0][main]: -----------------------------------Epoch 36-----------------------------------
[HCTR][03:45:06.769][INFO][RK0][main]: Evaluation, AUC: 0.634274
[HCTR][03:45:09.360][INFO][RK0][main]: Evaluation, AUC: 0.631094
[HCTR][03:45:10.607][INFO][RK0][main]: -----------------------------------Epoch 37-----------------------------------
[HCTR][03:45:11.965][INFO][RK0][main]: Evaluation, AUC: 0.635363
[HCTR][03:45:14.611][INFO][RK0][main]: Evaluation, AUC: 0.63397
[HCTR][03:45:17.225][INFO][RK0][main]: Evaluation, AUC: 0.62498
[HCTR][03:45:17.298][INFO][RK0][main]: -----------------------------------Epoch 38-----------------------------------
[HCTR][03:45:19.805][INFO][RK0][main]: Evaluation, AUC: 0.631844
[HCTR][03:45:22.421][INFO][RK0][main]: Evaluation, AUC: 0.616634
[HCTR][03:45:23.465][INFO][RK0][main]: -----------------------------------Epoch 39-----------------------------------
[HCTR][03:45:25.027][INFO][RK0][main]: Evaluation, AUC: 0.637862
[HCTR][03:45:27.672][INFO][RK0][main]: Evaluation, AUC: 0.62419
[HCTR][03:45:29.549][INFO][RK0][main]: -----------------------------------Epoch 40-----------------------------------
[HCTR][03:45:30.251][INFO][RK0][main]: Evaluation, AUC: 0.605228
[HCTR][03:45:32.900][INFO][RK0][main]: Evaluation, AUC: 0.622567
[HCTR][03:45:35.538][INFO][RK0][main]: Evaluation, AUC: 0.634089
[HCTR][03:45:36.353][INFO][RK0][main]: -----------------------------------Epoch 41-----------------------------------
[HCTR][03:45:38.122][INFO][RK0][main]: Evaluation, AUC: 0.633255
[HCTR][03:45:40.747][INFO][RK0][main]: Evaluation, AUC: 0.637025
[HCTR][03:45:42.439][INFO][RK0][main]: -----------------------------------Epoch 42-----------------------------------
[HCTR][03:45:43.339][INFO][RK0][main]: Evaluation, AUC: 0.628744
[HCTR][03:45:45.979][INFO][RK0][main]: Evaluation, AUC: 0.627138
[HCTR][03:45:48.624][INFO][RK0][main]: Evaluation, AUC: 0.61648
[HCTR][03:45:49.228][INFO][RK0][main]: -----------------------------------Epoch 43-----------------------------------
[HCTR][03:45:51.225][INFO][RK0][main]: Evaluation, AUC: 0.622604
[HCTR][03:45:53.792][INFO][RK0][main]: Evaluation, AUC: 0.634602
[HCTR][03:45:55.257][INFO][RK0][main]: -----------------------------------Epoch 44-----------------------------------
[HCTR][03:45:56.344][INFO][RK0][main]: Evaluation, AUC: 0.631658
[HCTR][03:45:58.894][INFO][RK0][main]: Evaluation, AUC: 0.621491
[HCTR][03:46:01.440][INFO][RK0][main]: Evaluation, AUC: 0.625775
[HCTR][03:46:01.818][INFO][RK0][main]: -----------------------------------Epoch 45-----------------------------------
[HCTR][03:46:03.927][INFO][RK0][main]: Evaluation, AUC: 0.614879
[HCTR][03:46:06.446][INFO][RK0][main]: Evaluation, AUC: 0.621894
[HCTR][03:46:07.737][INFO][RK0][main]: -----------------------------------Epoch 46-----------------------------------
[HCTR][03:46:08.961][INFO][RK0][main]: Evaluation, AUC: 0.623026
[HCTR][03:46:11.477][INFO][RK0][main]: Evaluation, AUC: 0.620515
[HCTR][03:46:14.014][INFO][RK0][main]: Evaluation, AUC: 0.62285
[HCTR][03:46:14.190][INFO][RK0][main]: -----------------------------------Epoch 47-----------------------------------
[HCTR][03:46:16.517][INFO][RK0][main]: Evaluation, AUC: 0.609457
[HCTR][03:46:19.059][INFO][RK0][main]: Evaluation, AUC: 0.630078
[HCTR][03:46:20.150][INFO][RK0][main]: -----------------------------------Epoch 48-----------------------------------
[HCTR][03:46:21.592][INFO][RK0][main]: Evaluation, AUC: 0.640645
[HCTR][03:46:24.096][INFO][RK0][main]: Evaluation, AUC: 0.630915
[HCTR][03:46:26.641][INFO][RK0][main]: Evaluation, AUC: 0.626881
[HCTR][03:46:26.642][INFO][RK0][main]: -----------------------------------Epoch 49-----------------------------------
[HCTR][03:46:29.174][INFO][RK0][main]: Evaluation, AUC: 0.624196
[HCTR][03:46:31.738][INFO][RK0][main]: Evaluation, AUC: 0.635639
[HCTR][03:46:32.630][INFO][RK0][main]: -----------------------------------Epoch 50-----------------------------------
[HCTR][03:46:34.234][INFO][RK0][main]: Evaluation, AUC: 0.625333
[HCTR][03:46:36.771][INFO][RK0][main]: Evaluation, AUC: 0.606624
[HCTR][03:46:38.501][INFO][RK0][main]: -----------------------------------Epoch 51-----------------------------------
[HCTR][03:46:39.301][INFO][RK0][main]: Evaluation, AUC: 0.615676
[HCTR][03:46:41.823][INFO][RK0][main]: Evaluation, AUC: 0.62022
[HCTR][03:46:44.350][INFO][RK0][main]: Evaluation, AUC: 0.624788
[HCTR][03:46:45.024][INFO][RK0][main]: -----------------------------------Epoch 52-----------------------------------
[HCTR][03:46:46.837][INFO][RK0][main]: Evaluation, AUC: 0.617381
[HCTR][03:46:49.374][INFO][RK0][main]: Evaluation, AUC: 0.635112
[HCTR][03:46:50.935][INFO][RK0][main]: -----------------------------------Epoch 53-----------------------------------
[HCTR][03:46:51.903][INFO][RK0][main]: Evaluation, AUC: 0.631386
[HCTR][03:46:54.435][INFO][RK0][main]: Evaluation, AUC: 0.616767
[HCTR][03:46:56.954][INFO][RK0][main]: Evaluation, AUC: 0.61728
[HCTR][03:46:57.431][INFO][RK0][main]: -----------------------------------Epoch 54-----------------------------------
[HCTR][03:46:59.458][INFO][RK0][main]: Evaluation, AUC: 0.628254
[HCTR][03:47:01.989][INFO][RK0][main]: Evaluation, AUC: 0.61775
[HCTR][03:47:03.370][INFO][RK0][main]: -----------------------------------Epoch 55-----------------------------------
[HCTR][03:47:04.515][INFO][RK0][main]: Evaluation, AUC: 0.629893
[HCTR][03:47:07.081][INFO][RK0][main]: Evaluation, AUC: 0.609286
[HCTR][03:47:09.617][INFO][RK0][main]: Evaluation, AUC: 0.627195
[HCTR][03:47:09.895][INFO][RK0][main]: -----------------------------------Epoch 56-----------------------------------
[HCTR][03:47:12.126][INFO][RK0][main]: Evaluation, AUC: 0.61822
[HCTR][03:47:15.025][INFO][RK0][main]: Evaluation, AUC: 0.619697
[HCTR][03:47:16.418][INFO][RK0][main]: -----------------------------------Epoch 57-----------------------------------
[HCTR][03:47:18.096][INFO][RK0][main]: Evaluation, AUC: 0.628677
[HCTR][03:47:21.540][INFO][RK0][main]: Evaluation, AUC: 0.632847
[HCTR][03:47:24.633][INFO][RK0][main]: Evaluation, AUC: 0.626757
[HCTR][03:47:24.758][INFO][RK0][main]: -----------------------------------Epoch 58-----------------------------------
[HCTR][03:47:27.734][INFO][RK0][main]: Evaluation, AUC: 0.630998
[HCTR][03:47:31.249][INFO][RK0][main]: Evaluation, AUC: 0.624877
[HCTR][03:47:32.528][INFO][RK0][main]: -----------------------------------Epoch 59-----------------------------------
[HCTR][03:47:34.863][INFO][RK0][main]: Evaluation, AUC: 0.626591
[HCTR][03:47:38.615][INFO][RK0][main]: Evaluation, AUC: 0.613877
[HCTR][03:47:40.785][INFO][RK0][main]: -----------------------------------Epoch 60-----------------------------------
[HCTR][03:47:41.855][INFO][RK0][main]: Evaluation, AUC: 0.617227
[HCTR][03:47:45.216][INFO][RK0][main]: Evaluation, AUC: 0.604119
[HCTR][03:47:48.435][INFO][RK0][main]: Evaluation, AUC: 0.610586
[HCTR][03:47:49.617][INFO][RK0][main]: -----------------------------------Epoch 61-----------------------------------
[HCTR][03:47:51.822][INFO][RK0][main]: Evaluation, AUC: 0.619046
[HCTR][03:47:55.262][INFO][RK0][main]: Evaluation, AUC: 0.61385
[HCTR][03:47:57.630][INFO][RK0][main]: -----------------------------------Epoch 62-----------------------------------
[HCTR][03:47:58.712][INFO][RK0][main]: Evaluation, AUC: 0.622791
[HCTR][03:48:02.060][INFO][RK0][main]: Evaluation, AUC: 0.617612
[HCTR][03:48:05.560][INFO][RK0][main]: Evaluation, AUC: 0.607431
[HCTR][03:48:06.490][INFO][RK0][main]: -----------------------------------Epoch 63-----------------------------------
[HCTR][03:48:09.092][INFO][RK0][main]: Evaluation, AUC: 0.622129
[HCTR][03:48:12.705][INFO][RK0][main]: Evaluation, AUC: 0.616801
[HCTR][03:48:14.504][INFO][RK0][main]: -----------------------------------Epoch 64-----------------------------------
[HCTR][03:48:15.979][INFO][RK0][main]: Evaluation, AUC: 0.611538
[HCTR][03:48:19.146][INFO][RK0][main]: Evaluation, AUC: 0.61975
[HCTR][03:48:22.412][INFO][RK0][main]: Evaluation, AUC: 0.601556
[HCTR][03:48:22.810][INFO][RK0][main]: -----------------------------------Epoch 65-----------------------------------
[HCTR][03:48:26.073][INFO][RK0][main]: Evaluation, AUC: 0.605704
[HCTR][03:48:29.257][INFO][RK0][main]: Evaluation, AUC: 0.617283
[HCTR][03:48:30.897][INFO][RK0][main]: -----------------------------------Epoch 66-----------------------------------
[HCTR][03:48:32.606][INFO][RK0][main]: Evaluation, AUC: 0.592738
[HCTR][03:48:35.761][INFO][RK0][main]: Evaluation, AUC: 0.60789
[HCTR][03:48:38.972][INFO][RK0][main]: Evaluation, AUC: 0.607424
[HCTR][03:48:39.223][INFO][RK0][main]: -----------------------------------Epoch 67-----------------------------------
[HCTR][03:48:42.298][INFO][RK0][main]: Evaluation, AUC: 0.611191
[HCTR][03:48:45.413][INFO][RK0][main]: Evaluation, AUC: 0.621895
[HCTR][03:48:46.675][INFO][RK0][main]: -----------------------------------Epoch 68-----------------------------------
[HCTR][03:48:48.483][INFO][RK0][main]: Evaluation, AUC: 0.625431
[HCTR][03:48:51.841][INFO][RK0][main]: Evaluation, AUC: 0.581377
[HCTR][03:48:54.986][INFO][RK0][main]: Evaluation, AUC: 0.580077
[HCTR][03:48:54.986][INFO][RK0][main]: -----------------------------------Epoch 69-----------------------------------
[HCTR][03:48:58.409][INFO][RK0][main]: Evaluation, AUC: 0.589024
from hugectr.
Yes, the hyperparameters should be adjusted (in the worst case, you may have to choose another optimizer). We have not fully tested fp16 training for all models.
There are two remarks on the AUC log you posted:
- The AUC in the first epoch is much smaller than with fp32. This may imply that this model with the current configuration is not suitable for fp16 training.
- The AUC tends to drop after many epochs. This may imply overfitting.
Anyway, I recommend adjusting some hyperparameters.
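To make the downward trend easier to see, the evaluation log can be reduced to one mean AUC per epoch. The following is a minimal sketch, not part of HugeCTR: the helper name `mean_auc_per_epoch` is hypothetical, and it assumes the log lines keep the exact `Epoch N` / `Evaluation, AUC:` format shown in the log above.

```python
import re
from collections import defaultdict

# Patterns assume the log format shown in this thread.
EPOCH_RE = re.compile(r"-+Epoch (\d+)-+")
AUC_RE = re.compile(r"Evaluation, AUC: ([0-9.]+)")


def mean_auc_per_epoch(lines):
    """Return {epoch: mean AUC} for an iterable of log lines (hypothetical helper)."""
    aucs = defaultdict(list)
    epoch = None
    for line in lines:
        match = EPOCH_RE.search(line)
        if match:
            epoch = int(match.group(1))
            continue
        match = AUC_RE.search(line)
        # Ignore evaluations that appear before the first epoch banner.
        if match and epoch is not None:
            aucs[epoch].append(float(match.group(1)))
    return {e: sum(values) / len(values) for e, values in aucs.items()}


# A few lines copied from the log above.
sample_log = """\
[HCTR][03:46:45.024][INFO][RK0][main]: -----------------------------------Epoch 52-----------------------------------
[HCTR][03:46:46.837][INFO][RK0][main]: Evaluation, AUC: 0.617381
[HCTR][03:46:49.374][INFO][RK0][main]: Evaluation, AUC: 0.635112
[HCTR][03:48:54.986][INFO][RK0][main]: -----------------------------------Epoch 69-----------------------------------
[HCTR][03:48:58.409][INFO][RK0][main]: Evaluation, AUC: 0.589024
""".splitlines()

for epoch, auc in sorted(mean_auc_per_epoch(sample_log).items()):
    print(f"epoch {epoch}: mean AUC {auc:.6f}")
```

Running it over the full log instead of `sample_log` gives one number per epoch, which is easier to compare across runs than the raw evaluation lines.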
from hugectr.
Thanks. I will adjust the hyperparameters as you suggest.
By the way, could you please check this issue about ETC? I am trying to run this test regardless of the non-convergence.
I appreciate it a lot!
from hugectr.
I am trying not to use the use_mixed_precision parameter and to manually add a hugectr.Layer_t.Cast layer instead, in order to fix this convergence issue while keeping the performance consistent. But the API documentation seems to lack a description of the hugectr.Layer_t.Cast layer. How can I use this type of DenseLayer to cast the output to float16, for example?
from hugectr.
Hi @zpcalan, I'm afraid you cannot manually cast the data type and feed the tensor to a layer of a different data type. The use_mixed_precision flag has a global impact. If it is off, HugeCTR assumes that all input tensors of all layers have the fp32 data type; if it is on, all inputs to layers must have the fp16 data type. HugeCTR does not support fp16 for a specific layer. In fp32 mode, the cast layer casts fp32 into fp16; in fp16 mode, the cast layer casts from fp16 to fp32.
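The rule above can be summarized in a small, self-contained sketch. This is not HugeCTR code: `cast_direction` and `to_fp16` are hypothetical helpers that only model the behavior described in this comment, using Python's standard `struct` module to round a value to half precision.

```python
import struct
from typing import Tuple


def cast_direction(use_mixed_precision: bool) -> Tuple[str, str]:
    """Return (input dtype, output dtype) of a cast layer, per the rule described above."""
    if use_mixed_precision:
        # fp16 mode: layers receive fp16, so the cast layer goes fp16 -> fp32.
        return ("fp16", "fp32")
    # fp32 mode: layers receive fp32, so the cast layer goes fp32 -> fp16.
    return ("fp32", "fp16")


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


for flag in (False, True):
    src, dst = cast_direction(flag)
    print(f"use_mixed_precision={flag}: cast layer converts {src} -> {dst}")

# Casting down to fp16 loses precision, shown here on one AUC value from the log.
print(to_fp16(0.617381))
```

The point of the sketch is that the direction of the cast follows from the global flag, so a cast layer cannot be used to pick a precision for a single layer.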
from hugectr.
I use dlrm instead of W&D because the parameters in the script have been already configured correctly for mixed precision.
Issue closed. : )
from hugectr.