Comments (22)

kanghui0204 commented on July 17, 2024

Hi @zpcalan, I think we need more information:

  • Have you tried running without ETC? If so, does it run well?

  • How did you preprocess the dataset for ETC? How did you generate the keyset for each day?

zpcalan commented on July 17, 2024

Hi @zpcalan, I think we need more information:

  • Have you tried running without ETC? If so, does it run well?
  • How did you preprocess the dataset for ETC? How did you generate the keyset for each day?

I appreciate your quick reply!

  1. Yes, I ran without ETC but it still does not converge. The exception is the same.
  2. I preprocessed the day0~day23 data using bash preprocess.sh $i /root/HugeCTR/etc_data/day$i nvt 1 0 1 as in this link. As for the keyset, I ran this command:
cmd_str="python generate_keyset.py  --src_dir_path ./etc_data/day"+str(i)+"/train --keyset_path ./etc_data/day"+str(i)+"/train/_hugectr.keyset"
os.system(cmd_str)

As you can see, I generated the keyset without a slot size array. That should not affect convergence.

zpcalan commented on July 17, 2024

One more piece of information you might be interested in: I adjusted the preprocess.sh script so that it processes each whole day's dataset instead of only the first 5,000,000 samples. Could this be a small issue with the script?

--- a/tools/preprocess.sh
+++ b/tools/preprocess.sh
@@ -50,10 +50,11 @@ fi

 SCRIPT_TYPE=$3

-echo "Getting the first few examples from the uncompressed dataset..."
+sample_num=`wc -l day_$1|awk '{print $1}'`
+echo "Getting the first few examples from the uncompressed dataset... $sample_num"
 mkdir -p $DST_DATA_DIR/train                         && \
 mkdir -p $DST_DATA_DIR/val                           && \
-head -n 5000000 day_$1 > $DST_DATA_DIR/day_$1_small
+head -n $sample_num day_$1 > $DST_DATA_DIR/day_$1_small
 if [ $? -ne 0 ]; then
        echo "Warning: fallback to find original compressed data day_$1.gz..."
        echo "Decompressing day_$1.gz..."
@@ -62,7 +63,7 @@ if [ $? -ne 0 ]; then
                echo "Error: failed to decompress the file."
                exit 2
        fi
-       head -n 5000000 day_$1 > $DST_DATA_DIR/day_$1_small
+       head -n $sample_num day_$1 > $DST_DATA_DIR/day_$1_small
        if [ $? -ne 0 ]; then
                echo "Error: day_$1 file"
                exit 2
@@ -111,7 +112,7 @@ if [[ $SCRIPT_TYPE == "nvt" ]]; then
                --freq_limit 6                        \
                --device_limit_frac 0.5               \
                --device_pool_frac 0.5                \
-               --out_files_per_proc 8                \
+               --out_files_per_proc 20                \
                --devices "0"                         \
                --num_io_threads 2                    \
         --parquet_format=$IS_PARQUET_FORMAT   \

kanghui0204 commented on July 17, 2024

Hi @zpcalan, since you now include more samples, did you modify the workspace_size_per_gpu_in_mb or slot_size_array values? Those parameters should be increased accordingly.

zpcalan commented on July 17, 2024

Yes, I changed workspace_size_per_gpu_in_mb to 4048 because I set embedding_vec_size to 240.
But I didn't change slot_size_array when I generated the keyset files, as this comment says. In my script I also didn't set slot_size_array, so it's all 0s.

JacoCheung commented on July 17, 2024

Hi @zpcalan, I assume you are using the Parquet dataset, right? What does your slot_size_array in DataReaderParams look like?
Is the script in #395 throwing this error? If so, the problem might be that you have left slot_size_array alone.
Please refer to the doc:

slot_size_array: List[int], specify the maximum key value for each slot. Refer to the following equation. The array should be consistent with that of the sparse input. HugeCTR requires this argument for Parquet format data and RawAsync format when you want to add an offset to the input key. The default value is an empty list.

PS: Let's focus on the model without the ETC feature first.

JacoCheung commented on July 17, 2024

BTW, @zpcalan, have you ever tried training without mixed precision?

zpcalan commented on July 17, 2024

@JacoCheung
Yes, the script in issue #395 throws this error as well (without ETC).
And yes, I have tried training without mixed precision. The result is correct and no exception is thrown.

slot_size_array: List[int], specify the maximum key value for each slot. Refer to the following equation. The array should be consistent with that of the sparse input. HugeCTR requires this argument for Parquet format data and RawAsync format when you want to add an offset to the input key. The default value is an empty list.

I didn't set slot_size_array because I don't need to add an offset to the keys.
I think each categorical feature of each sample is globally unique, so I don't quite understand why an offset should be added when I train this model on a single GPU.

Everything described above is without ETC.
Do you have any suggestions?

JacoCheung commented on July 17, 2024

No; if you're using our preprocessing script, there is no guarantee that the key ranges of two slots are disjoint. For instance,
C0 and C1 have a chance to hold identical keys, e.g. [12, 12, ...].

zpcalan commented on July 17, 2024

No; if you're using our preprocessing script, there is no guarantee that the key ranges of two slots are disjoint. For instance, C0 and C1 have a chance to hold identical keys, e.g. [12, 12, ...].

Do you mean C0 and C1 of one sample could both be 12?
If so, I understand that the offset must be added so that the keys of each slot in a sample are unique. But why is this happening?

I will set slot_size_array and run FP16 training.
The array is printed when preprocessing the dataset, right? Just like this:

Preprocessing
Train Datasets Preprocessing.....
[932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34]
Valid Datasets Preprocessing.....
[932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34]

So I will set slot_size_array to [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34].

JacoCheung commented on July 17, 2024

But why is this happening?

The uniqueness requirement derives from the embedding. For Norm or Raw, you can assume the keys are already guaranteed to be unique after preprocessing, but for Parquet it is the data reader's duty to add an offset to make them unique. The data preprocessing for Parquet is done per feature, so different slots do not interfere with each other: for example, C0 may be the user_id while C1 may be the item_id; nvt processes them individually, so C0 and C1 both start from 0. Sorry for the inconsistency; we're trying to improve the user experience.
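
A minimal sketch of the offsetting idea (illustrative only, not HugeCTR's actual reader code; the slot sizes below are made up): the per-slot offsets are just a prefix sum of slot_size_array, so two slots can both hold the raw key 12 without colliding after the shift.

# Illustrative sketch -- not HugeCTR's reader implementation.
# Each slot's raw keys start from 0; a prefix-sum offset per slot
# maps them into disjoint global ranges.
from itertools import accumulate

slot_size_array = [15, 20, 10]                    # hypothetical cardinalities
offsets = [0, *accumulate(slot_size_array)][:-1]  # [0, 15, 35]

def to_global_keys(raw_keys):
    # raw_keys[i] is the raw key of slot i in one sample
    return [k + off for k, off in zip(raw_keys, offsets)]

print(to_global_keys([12, 12, 2]))                # [12, 27, 37] -- no collision
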
You can also try the embedding collection, which is a new, unified embedding type.

The array is printed when preprocessing the dataset, right?

Yes.

zpcalan commented on July 17, 2024

Thanks for explaining!

After the offset is added to each key, the total number of unique keys will be the sum of [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4, 6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868, 64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112, 83, 34], right?

If so, a single card's memory will not be enough for the embedding, and ETC must be introduced.
My original goal is to run day0-23 ETC training with FP16.
I can run FP16 training with a small dataset, and if it converges, can I use ETC training to run day0's dataset?
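
For a rough sense of scale, here is my back-of-envelope (my own arithmetic; fp32 weights only, so optimizer state and hash-table overhead come on top):

# Back-of-envelope only: fp32 embedding weights, ignoring optimizer state
# (Adam keeps m and v, roughly tripling this) and hash-table overhead.
slot_size_array = [932326, 1066648, 831979, 24672, 14847, 7123, 19357, 4,
                   6469, 1268, 55, 696499, 171522, 121926, 11, 2200, 8868,
                   64, 4, 951, 15, 838640, 446266, 793237, 130145, 10112,
                   83, 34]
embedding_vec_size = 240
total_keys = sum(slot_size_array)                        # 6,125,325 unique keys
weight_mb = total_keys * embedding_vec_size * 4 / 2**20  # fp32 = 4 bytes
print(f"{total_keys:,} keys -> ~{weight_mb:,.0f} MB of embedding weights")
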

zpcalan commented on July 17, 2024

Hi, developers of HugeCTR.
I tried to run FP16 training with a small dataset in Parquet format and with slot_size_array correctly assigned, but after several steps the AUC continued to decline and the model finally failed to converge.
The script is:

import hugectr
from mpi4py import MPI

solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.001,
    vvgpu=[[0]],
    repeat_dataset=False,
    i64_input_key=True,
    use_mixed_precision=True
)

reader = hugectr.DataReaderParams(
    data_reader_type=hugectr.DataReaderType_t.Parquet,
    source=["./etc_data/small_day0/train/_file_list.txt"],
    keyset = ["./etc_data/small_day0/train/_hugectr.keyset"],
    eval_source="./etc_data/small_day0/val/_file_list.txt",
    slot_size_array=[13332, 31854, 10950, 6598, 8743, 4261, 10704, 4, 5009, 991, 28, 12198, 8710, 13823, 10, 1426, 3437, 48, 4, 643, 15, 10765, 11862, 11316, 8432, 5954, 42, 33],
    check_type=hugectr.Check_t.Non,
)

optimizer = hugectr.CreateOptimizer(
    optimizer_type=hugectr.Optimizer_t.Adam,
    update_type=hugectr.Update_t.Global,
    beta1=0.9,
    beta2=0.999,
    epsilon=0.0000001,
)

model = hugectr.Model(solver, reader, optimizer)
model.add(
    hugectr.Input(
        label_dim=1,
        label_name="label",
        dense_dim=13,
        dense_name="dense",
        data_reader_sparse_param_array=[
            hugectr.DataReaderSparseParam("wide_data", 30, True, 1),
            hugectr.DataReaderSparseParam("deep_data", 2, False, 26),
        ],
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=69,
        embedding_vec_size=1,
        combiner="sum",
        sparse_embedding_name="sparse_embedding2",
        bottom_name="wide_data",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.SparseEmbedding(
        embedding_type=hugectr.Embedding_t.LocalizedSlotSparseEmbeddingHash,
        workspace_size_per_gpu_in_mb=1024,
        embedding_vec_size=240,
        combiner="sum",
        sparse_embedding_name="sparse_embedding1",
        bottom_name="deep_data",
        optimizer=optimizer,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding1"],
        top_names=["reshape1"],
        leading_dim=6240,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Reshape,
        bottom_names=["sparse_embedding2"],
        top_names=["reshape2"],
        leading_dim=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Concat, bottom_names=["reshape1", "dense"], top_names=["concat1"]
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["concat1"],
        top_names=["fc1"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc1"], top_names=["relu1"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu1"],
        top_names=["dropout1"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout1"],
        top_names=["fc2"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc2"], top_names=["relu2"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu2"],
        top_names=["dropout2"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout2"],
        top_names=["fc3"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc3"], top_names=["relu3"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu3"],
        top_names=["dropout3"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout3"],
        top_names=["fc4"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc4"], top_names=["relu4"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu4"],
        top_names=["dropout4"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout4"],
        top_names=["fc5"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc5"], top_names=["relu5"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu5"],
        top_names=["dropout5"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout5"],
        top_names=["fc6"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc6"], top_names=["relu6"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu6"],
        top_names=["dropout6"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout6"],
        top_names=["fc7"],
        num_output=1024,
    )
)
model.add(
    hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU, bottom_names=["fc7"], top_names=["relu7"])
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Dropout,
        bottom_names=["relu7"],
        top_names=["dropout7"],
        dropout_rate=0.5,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.InnerProduct,
        bottom_names=["dropout7"],
        top_names=["fc8"],
        num_output=1,
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.Add, bottom_names=["fc8", "reshape2"], top_names=["add1"]
    )
)
model.add(
    hugectr.DenseLayer(
        layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
        bottom_names=["add1", "label"],
        top_names=["loss"],
    )
)
model.compile()
model.summary()
model.fit(max_iter=1, display=10, eval_interval=40, snapshot=1000000, snapshot_prefix="wdl", num_epochs=100)

The result:

[HCTR][12:21:15.423][INFO][RK0][main]: Eval Time for 300 iters: 0.66509s
[HCTR][12:21:15.900][INFO][RK0][main]: Iter: 4170 Time(10 iters): 1.13876s Loss: 0.0917276 lr:0.001
[HCTR][12:21:16.438][INFO][RK0][main]: Iter: 4180 Time(10 iters): 0.533409s Loss: 0.0895059 lr:0.001
[HCTR][12:21:16.939][INFO][RK0][main]: Iter: 4190 Time(10 iters): 0.495925s Loss: 0.086796 lr:0.001
[HCTR][12:21:17.395][INFO][RK0][main]: Iter: 4200 Time(10 iters): 0.451196s Loss: 0.0831752 lr:0.001
[HCTR][12:21:18.081][INFO][RK0][main]: Evaluation, AUC: 0.602723
[HCTR][12:21:18.081][INFO][RK0][main]: Eval Time for 300 iters: 0.684147s
[HCTR][12:21:18.552][INFO][RK0][main]: Iter: 4210 Time(10 iters): 1.15236s Loss: 0.0809434 lr:0.001
[HCTR][12:21:18.684][INFO][RK0][main]: train drop incomplete batch. batchsize:10752
[HCTR][12:21:18.684][INFO][RK0][main]: -----------------------------------Epoch 43-----------------------------------
[HCTR][12:21:19.083][INFO][RK0][main]: Iter: 4220 Time(10 iters): 0.526233s Loss: 0.077393 lr:0.001
[HCTR][12:21:19.552][INFO][RK0][main]: Iter: 4230 Time(10 iters): 0.464746s Loss: 0.078123 lr:0.001
[HCTR][12:21:20.005][INFO][RK0][main]: Iter: 4240 Time(10 iters): 0.447292s Loss: 0.0808783 lr:0.001
[HCTR][12:21:20.694][INFO][RK0][main]: Evaluation, AUC: 0.611602
[HCTR][12:21:20.694][INFO][RK0][main]: Eval Time for 300 iters: 0.688137s
Traceback (most recent call last):
  File "small_wdl.py", line 267, in <module>
    model.fit(max_iter=1, display=10, eval_interval=40, snapshot=1000000, snapshot_prefix="wdl", num_epochs=1000)
RuntimeError: Train Runtime error: Loss cannot converge /hugectr/HugeCTR/src/pybind/model.cpp:2019

Is there something wrong with the script?

zpcalan commented on July 17, 2024

Hi, any progress on this issue? There seems to be a convergence issue when using use_mixed_precision=True.

JacoCheung commented on July 17, 2024

It looks like you were using epoch mode and the dataset contained about 1,642,300 (= 4210 / 42 * 16384) samples, right?
Could you please post the AUC for each epoch? I'd like to know when the AUC started to drop.

In addition, we have not tried enabling fp16 training for this model; the hyperparameters may need to differ subtly from those of fp32 training, for example the scaler, learning_rate, etc. Please refer to the solver documentation for more details.

JacoCheung commented on July 17, 2024

Because HugeCTR does not support a dynamic scaler, the divergence issue sometimes occurs when there is fp16 overflow (for example, a weight grows too large and the GEMM produces an intermediate value larger than 65,504, which becomes Inf, and the Inf then propagates).
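
A tiny demonstration of that failure mode (NumPy stands in for the GPU math here; 65,504 is the fp16 maximum, the rest is illustrative):

# fp16 saturates just above 65504; an overflowing intermediate becomes inf,
# and inf then propagates through every subsequent operation.
import numpy as np

w = np.float16(300.0)
x = np.float16(300.0)
prod = w * x                     # 90000 exceeds the fp16 max (65504) -> inf
print(prod)                      # inf
print(prod * np.float16(0.001))  # still inf: the overflow poisons the result
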

zpcalan commented on July 17, 2024

It looks like you were using epoch mode and the dataset contained about 1,642,300 (= 4210 / 42 * 16384) samples, right? Could you please post the AUC for each epoch? I'd like to know when the AUC started to drop.

In addition, we have not tried enabling fp16 training for this model; the hyperparameters may need to differ subtly from those of fp32 training, for example the scaler, learning_rate, etc. Please refer to the solver documentation for more details.

I see. It seems fp16 training is not fully tested and some hyperparameters should be adjusted.
The AUC:

[HCTR][03:41:14.346][INFO][RK0][main]: -----------------------------------Epoch 0-----------------------------------
[HCTR][03:41:17.143][INFO][RK0][main]: Evaluation, AUC: 0.614141
[HCTR][03:41:19.711][INFO][RK0][main]: Evaluation, AUC: 0.660234
[HCTR][03:41:20.493][INFO][RK0][main]: -----------------------------------Epoch 1-----------------------------------
[HCTR][03:41:22.209][INFO][RK0][main]: Evaluation, AUC: 0.539144
[HCTR][03:41:24.770][INFO][RK0][main]: Evaluation, AUC: 0.678606
[HCTR][03:41:26.449][INFO][RK0][main]: -----------------------------------Epoch 2-----------------------------------
[HCTR][03:41:27.352][INFO][RK0][main]: Evaluation, AUC: 0.682721
[HCTR][03:41:30.001][INFO][RK0][main]: Evaluation, AUC: 0.674805
[HCTR][03:41:32.636][INFO][RK0][main]: Evaluation, AUC: 0.631986
[HCTR][03:41:33.225][INFO][RK0][main]: -----------------------------------Epoch 3-----------------------------------
[HCTR][03:41:35.237][INFO][RK0][main]: Evaluation, AUC: 0.691028
[HCTR][03:41:37.849][INFO][RK0][main]: Evaluation, AUC: 0.705744
[HCTR][03:41:39.359][INFO][RK0][main]: -----------------------------------Epoch 4-----------------------------------
[HCTR][03:41:40.451][INFO][RK0][main]: Evaluation, AUC: 0.681173
[HCTR][03:41:43.091][INFO][RK0][main]: Evaluation, AUC: 0.705617
[HCTR][03:41:45.716][INFO][RK0][main]: Evaluation, AUC: 0.700359
[HCTR][03:41:46.108][INFO][RK0][main]: -----------------------------------Epoch 5-----------------------------------
[HCTR][03:41:48.327][INFO][RK0][main]: Evaluation, AUC: 0.696902
[HCTR][03:41:50.903][INFO][RK0][main]: Evaluation, AUC: 0.659046
[HCTR][03:41:52.239][INFO][RK0][main]: -----------------------------------Epoch 6-----------------------------------
[HCTR][03:41:53.527][INFO][RK0][main]: Evaluation, AUC: 0.700403
[HCTR][03:41:56.142][INFO][RK0][main]: Evaluation, AUC: 0.698039
[HCTR][03:41:58.755][INFO][RK0][main]: Evaluation, AUC: 0.688778
[HCTR][03:41:58.935][INFO][RK0][main]: -----------------------------------Epoch 7-----------------------------------
[HCTR][03:42:01.342][INFO][RK0][main]: Evaluation, AUC: 0.684459
[HCTR][03:42:03.961][INFO][RK0][main]: Evaluation, AUC: 0.689478
[HCTR][03:42:05.085][INFO][RK0][main]: -----------------------------------Epoch 8-----------------------------------
[HCTR][03:42:06.556][INFO][RK0][main]: Evaluation, AUC: 0.693916
[HCTR][03:42:09.151][INFO][RK0][main]: Evaluation, AUC: 0.691887
[HCTR][03:42:11.748][INFO][RK0][main]: Evaluation, AUC: 0.685392
[HCTR][03:42:11.748][INFO][RK0][main]: -----------------------------------Epoch 9-----------------------------------
[HCTR][03:42:14.344][INFO][RK0][main]: Evaluation, AUC: 0.682643
[HCTR][03:42:16.962][INFO][RK0][main]: Evaluation, AUC: 0.683189
[HCTR][03:42:17.884][INFO][RK0][main]: -----------------------------------Epoch 10-----------------------------------
[HCTR][03:42:19.543][INFO][RK0][main]: Evaluation, AUC: 0.6777
[HCTR][03:42:22.155][INFO][RK0][main]: Evaluation, AUC: 0.67126
[HCTR][03:42:23.952][INFO][RK0][main]: -----------------------------------Epoch 11-----------------------------------
[HCTR][03:42:24.773][INFO][RK0][main]: Evaluation, AUC: 0.676683
[HCTR][03:42:27.377][INFO][RK0][main]: Evaluation, AUC: 0.678563
[HCTR][03:42:30.015][INFO][RK0][main]: Evaluation, AUC: 0.677846
[HCTR][03:42:30.718][INFO][RK0][main]: -----------------------------------Epoch 12-----------------------------------
[HCTR][03:42:32.602][INFO][RK0][main]: Evaluation, AUC: 0.675074
[HCTR][03:42:35.200][INFO][RK0][main]: Evaluation, AUC: 0.669867
[HCTR][03:42:36.814][INFO][RK0][main]: -----------------------------------Epoch 13-----------------------------------
[HCTR][03:42:37.815][INFO][RK0][main]: Evaluation, AUC: 0.67586
[HCTR][03:42:40.420][INFO][RK0][main]: Evaluation, AUC: 0.667391
[HCTR][03:42:43.063][INFO][RK0][main]: Evaluation, AUC: 0.668702
[HCTR][03:42:43.559][INFO][RK0][main]: -----------------------------------Epoch 14-----------------------------------
[HCTR][03:42:45.641][INFO][RK0][main]: Evaluation, AUC: 0.675038
[HCTR][03:42:48.260][INFO][RK0][main]: Evaluation, AUC: 0.670299
[HCTR][03:42:49.677][INFO][RK0][main]: -----------------------------------Epoch 15-----------------------------------
[HCTR][03:42:50.854][INFO][RK0][main]: Evaluation, AUC: 0.671158
[HCTR][03:42:53.461][INFO][RK0][main]: Evaluation, AUC: 0.663839
[HCTR][03:42:56.069][INFO][RK0][main]: Evaluation, AUC: 0.659254
[HCTR][03:42:56.352][INFO][RK0][main]: -----------------------------------Epoch 16-----------------------------------
[HCTR][03:42:58.663][INFO][RK0][main]: Evaluation, AUC: 0.656877
[HCTR][03:43:01.294][INFO][RK0][main]: Evaluation, AUC: 0.663891
[HCTR][03:43:02.550][INFO][RK0][main]: -----------------------------------Epoch 17-----------------------------------
[HCTR][03:43:03.950][INFO][RK0][main]: Evaluation, AUC: 0.672403
[HCTR][03:43:06.594][INFO][RK0][main]: Evaluation, AUC: 0.665769
[HCTR][03:43:09.210][INFO][RK0][main]: Evaluation, AUC: 0.653346
[HCTR][03:43:09.284][INFO][RK0][main]: -----------------------------------Epoch 18-----------------------------------
[HCTR][03:43:11.797][INFO][RK0][main]: Evaluation, AUC: 0.648996
[HCTR][03:43:14.461][INFO][RK0][main]: Evaluation, AUC: 0.663004
[HCTR][03:43:15.481][INFO][RK0][main]: -----------------------------------Epoch 19-----------------------------------
[HCTR][03:43:17.026][INFO][RK0][main]: Evaluation, AUC: 0.667177
[HCTR][03:43:19.647][INFO][RK0][main]: Evaluation, AUC: 0.6584
[HCTR][03:43:21.530][INFO][RK0][main]: -----------------------------------Epoch 20-----------------------------------
[HCTR][03:43:22.291][INFO][RK0][main]: Evaluation, AUC: 0.653862
[HCTR][03:43:24.931][INFO][RK0][main]: Evaluation, AUC: 0.648851
[HCTR][03:43:27.568][INFO][RK0][main]: Evaluation, AUC: 0.657728
[HCTR][03:43:28.377][INFO][RK0][main]: -----------------------------------Epoch 21-----------------------------------
[HCTR][03:43:30.161][INFO][RK0][main]: Evaluation, AUC: 0.663705
[HCTR][03:43:32.785][INFO][RK0][main]: Evaluation, AUC: 0.648212
[HCTR][03:43:34.498][INFO][RK0][main]: -----------------------------------Epoch 22-----------------------------------
[HCTR][03:43:35.435][INFO][RK0][main]: Evaluation, AUC: 0.623344
[HCTR][03:43:38.061][INFO][RK0][main]: Evaluation, AUC: 0.642473
[HCTR][03:43:40.682][INFO][RK0][main]: Evaluation, AUC: 0.641618
[HCTR][03:43:41.272][INFO][RK0][main]: -----------------------------------Epoch 23-----------------------------------
[HCTR][03:43:43.283][INFO][RK0][main]: Evaluation, AUC: 0.653915
[HCTR][03:43:45.875][INFO][RK0][main]: Evaluation, AUC: 0.635579
[HCTR][03:43:47.386][INFO][RK0][main]: -----------------------------------Epoch 24-----------------------------------
[HCTR][03:43:48.470][INFO][RK0][main]: Evaluation, AUC: 0.641135
[HCTR][03:43:51.080][INFO][RK0][main]: Evaluation, AUC: 0.629272
[HCTR][03:43:53.721][INFO][RK0][main]: Evaluation, AUC: 0.637833
[HCTR][03:43:54.113][INFO][RK0][main]: -----------------------------------Epoch 25-----------------------------------
[HCTR][03:43:56.325][INFO][RK0][main]: Evaluation, AUC: 0.634068
[HCTR][03:43:58.966][INFO][RK0][main]: Evaluation, AUC: 0.632516
[HCTR][03:44:00.293][INFO][RK0][main]: -----------------------------------Epoch 26-----------------------------------
[HCTR][03:44:01.604][INFO][RK0][main]: Evaluation, AUC: 0.63881
[HCTR][03:44:04.189][INFO][RK0][main]: Evaluation, AUC: 0.629136
[HCTR][03:44:06.816][INFO][RK0][main]: Evaluation, AUC: 0.636865
[HCTR][03:44:06.995][INFO][RK0][main]: -----------------------------------Epoch 27-----------------------------------
[HCTR][03:44:09.392][INFO][RK0][main]: Evaluation, AUC: 0.630674
[HCTR][03:44:12.019][INFO][RK0][main]: Evaluation, AUC: 0.646079
[HCTR][03:44:13.148][INFO][RK0][main]: -----------------------------------Epoch 28-----------------------------------
[HCTR][03:44:14.626][INFO][RK0][main]: Evaluation, AUC: 0.652931
[HCTR][03:44:17.289][INFO][RK0][main]: Evaluation, AUC: 0.64743
[HCTR][03:44:19.890][INFO][RK0][main]: Evaluation, AUC: 0.636667
[HCTR][03:44:19.891][INFO][RK0][main]: -----------------------------------Epoch 29-----------------------------------
[HCTR][03:44:22.500][INFO][RK0][main]: Evaluation, AUC: 0.628411
[HCTR][03:44:25.026][INFO][RK0][main]: Evaluation, AUC: 0.643507
[HCTR][03:44:25.901][INFO][RK0][main]: -----------------------------------Epoch 30-----------------------------------
[HCTR][03:44:27.528][INFO][RK0][main]: Evaluation, AUC: 0.624483
[HCTR][03:44:30.098][INFO][RK0][main]: Evaluation, AUC: 0.627787
[HCTR][03:44:31.862][INFO][RK0][main]: -----------------------------------Epoch 31-----------------------------------
[HCTR][03:44:32.693][INFO][RK0][main]: Evaluation, AUC: 0.610337
[HCTR][03:44:35.310][INFO][RK0][main]: Evaluation, AUC: 0.629604
[HCTR][03:44:37.944][INFO][RK0][main]: Evaluation, AUC: 0.643551
[HCTR][03:44:38.666][INFO][RK0][main]: -----------------------------------Epoch 32-----------------------------------
[HCTR][03:44:40.527][INFO][RK0][main]: Evaluation, AUC: 0.625997
[HCTR][03:44:43.141][INFO][RK0][main]: Evaluation, AUC: 0.639994
[HCTR][03:44:44.730][INFO][RK0][main]: -----------------------------------Epoch 33-----------------------------------
[HCTR][03:44:45.742][INFO][RK0][main]: Evaluation, AUC: 0.626142
[HCTR][03:44:48.390][INFO][RK0][main]: Evaluation, AUC: 0.624291
[HCTR][03:44:51.018][INFO][RK0][main]: Evaluation, AUC: 0.62945
[HCTR][03:44:51.515][INFO][RK0][main]: -----------------------------------Epoch 34-----------------------------------
[HCTR][03:44:53.610][INFO][RK0][main]: Evaluation, AUC: 0.640553
[HCTR][03:44:56.249][INFO][RK0][main]: Evaluation, AUC: 0.634215
[HCTR][03:44:57.689][INFO][RK0][main]: -----------------------------------Epoch 35-----------------------------------
[HCTR][03:44:58.893][INFO][RK0][main]: Evaluation, AUC: 0.634598
[HCTR][03:45:01.556][INFO][RK0][main]: Evaluation, AUC: 0.618847
[HCTR][03:45:04.184][INFO][RK0][main]: Evaluation, AUC: 0.633745
[HCTR][03:45:04.461][INFO][RK0][main]: -----------------------------------Epoch 36-----------------------------------
[HCTR][03:45:06.769][INFO][RK0][main]: Evaluation, AUC: 0.634274
[HCTR][03:45:09.360][INFO][RK0][main]: Evaluation, AUC: 0.631094
[HCTR][03:45:10.607][INFO][RK0][main]: -----------------------------------Epoch 37-----------------------------------
[HCTR][03:45:11.965][INFO][RK0][main]: Evaluation, AUC: 0.635363
[HCTR][03:45:14.611][INFO][RK0][main]: Evaluation, AUC: 0.63397
[HCTR][03:45:17.225][INFO][RK0][main]: Evaluation, AUC: 0.62498
[HCTR][03:45:17.298][INFO][RK0][main]: -----------------------------------Epoch 38-----------------------------------
[HCTR][03:45:19.805][INFO][RK0][main]: Evaluation, AUC: 0.631844
[HCTR][03:45:22.421][INFO][RK0][main]: Evaluation, AUC: 0.616634
[HCTR][03:45:23.465][INFO][RK0][main]: -----------------------------------Epoch 39-----------------------------------
[HCTR][03:45:25.027][INFO][RK0][main]: Evaluation, AUC: 0.637862
[HCTR][03:45:27.672][INFO][RK0][main]: Evaluation, AUC: 0.62419
[HCTR][03:45:29.549][INFO][RK0][main]: -----------------------------------Epoch 40-----------------------------------
[HCTR][03:45:30.251][INFO][RK0][main]: Evaluation, AUC: 0.605228
[HCTR][03:45:32.900][INFO][RK0][main]: Evaluation, AUC: 0.622567
[HCTR][03:45:35.538][INFO][RK0][main]: Evaluation, AUC: 0.634089
[HCTR][03:45:36.353][INFO][RK0][main]: -----------------------------------Epoch 41-----------------------------------
[HCTR][03:45:38.122][INFO][RK0][main]: Evaluation, AUC: 0.633255
[HCTR][03:45:40.747][INFO][RK0][main]: Evaluation, AUC: 0.637025
[HCTR][03:45:42.439][INFO][RK0][main]: -----------------------------------Epoch 42-----------------------------------
[HCTR][03:45:43.339][INFO][RK0][main]: Evaluation, AUC: 0.628744
[HCTR][03:45:45.979][INFO][RK0][main]: Evaluation, AUC: 0.627138
[HCTR][03:45:48.624][INFO][RK0][main]: Evaluation, AUC: 0.61648
[HCTR][03:45:49.228][INFO][RK0][main]: -----------------------------------Epoch 43-----------------------------------
[HCTR][03:45:51.225][INFO][RK0][main]: Evaluation, AUC: 0.622604
[HCTR][03:45:53.792][INFO][RK0][main]: Evaluation, AUC: 0.634602
[HCTR][03:45:55.257][INFO][RK0][main]: -----------------------------------Epoch 44-----------------------------------
[HCTR][03:45:56.344][INFO][RK0][main]: Evaluation, AUC: 0.631658
[HCTR][03:45:58.894][INFO][RK0][main]: Evaluation, AUC: 0.621491
[HCTR][03:46:01.440][INFO][RK0][main]: Evaluation, AUC: 0.625775
[HCTR][03:46:01.818][INFO][RK0][main]: -----------------------------------Epoch 45-----------------------------------
[HCTR][03:46:03.927][INFO][RK0][main]: Evaluation, AUC: 0.614879
[HCTR][03:46:06.446][INFO][RK0][main]: Evaluation, AUC: 0.621894
[HCTR][03:46:07.737][INFO][RK0][main]: -----------------------------------Epoch 46-----------------------------------
[HCTR][03:46:08.961][INFO][RK0][main]: Evaluation, AUC: 0.623026
[HCTR][03:46:11.477][INFO][RK0][main]: Evaluation, AUC: 0.620515
[HCTR][03:46:14.014][INFO][RK0][main]: Evaluation, AUC: 0.62285
[HCTR][03:46:14.190][INFO][RK0][main]: -----------------------------------Epoch 47-----------------------------------
[HCTR][03:46:16.517][INFO][RK0][main]: Evaluation, AUC: 0.609457
[HCTR][03:46:19.059][INFO][RK0][main]: Evaluation, AUC: 0.630078
[HCTR][03:46:20.150][INFO][RK0][main]: -----------------------------------Epoch 48-----------------------------------
[HCTR][03:46:21.592][INFO][RK0][main]: Evaluation, AUC: 0.640645
[HCTR][03:46:24.096][INFO][RK0][main]: Evaluation, AUC: 0.630915
[HCTR][03:46:26.641][INFO][RK0][main]: Evaluation, AUC: 0.626881
[HCTR][03:46:26.642][INFO][RK0][main]: -----------------------------------Epoch 49-----------------------------------
[HCTR][03:46:29.174][INFO][RK0][main]: Evaluation, AUC: 0.624196
[HCTR][03:46:31.738][INFO][RK0][main]: Evaluation, AUC: 0.635639
[HCTR][03:46:32.630][INFO][RK0][main]: -----------------------------------Epoch 50-----------------------------------
[HCTR][03:46:34.234][INFO][RK0][main]: Evaluation, AUC: 0.625333
[HCTR][03:46:36.771][INFO][RK0][main]: Evaluation, AUC: 0.606624
[HCTR][03:46:38.501][INFO][RK0][main]: -----------------------------------Epoch 51-----------------------------------
[HCTR][03:46:39.301][INFO][RK0][main]: Evaluation, AUC: 0.615676
[HCTR][03:46:41.823][INFO][RK0][main]: Evaluation, AUC: 0.62022
[HCTR][03:46:44.350][INFO][RK0][main]: Evaluation, AUC: 0.624788
[HCTR][03:46:45.024][INFO][RK0][main]: -----------------------------------Epoch 52-----------------------------------
[HCTR][03:46:46.837][INFO][RK0][main]: Evaluation, AUC: 0.617381
[HCTR][03:46:49.374][INFO][RK0][main]: Evaluation, AUC: 0.635112
[HCTR][03:46:50.935][INFO][RK0][main]: -----------------------------------Epoch 53-----------------------------------
[HCTR][03:46:51.903][INFO][RK0][main]: Evaluation, AUC: 0.631386
[HCTR][03:46:54.435][INFO][RK0][main]: Evaluation, AUC: 0.616767
[HCTR][03:46:56.954][INFO][RK0][main]: Evaluation, AUC: 0.61728
[HCTR][03:46:57.431][INFO][RK0][main]: -----------------------------------Epoch 54-----------------------------------
[HCTR][03:46:59.458][INFO][RK0][main]: Evaluation, AUC: 0.628254
[HCTR][03:47:01.989][INFO][RK0][main]: Evaluation, AUC: 0.61775
[HCTR][03:47:03.370][INFO][RK0][main]: -----------------------------------Epoch 55-----------------------------------
[HCTR][03:47:04.515][INFO][RK0][main]: Evaluation, AUC: 0.629893
[HCTR][03:47:07.081][INFO][RK0][main]: Evaluation, AUC: 0.609286
[HCTR][03:47:09.617][INFO][RK0][main]: Evaluation, AUC: 0.627195
[HCTR][03:47:09.895][INFO][RK0][main]: -----------------------------------Epoch 56-----------------------------------
[HCTR][03:47:12.126][INFO][RK0][main]: Evaluation, AUC: 0.61822
[HCTR][03:47:15.025][INFO][RK0][main]: Evaluation, AUC: 0.619697
[HCTR][03:47:16.418][INFO][RK0][main]: -----------------------------------Epoch 57-----------------------------------
[HCTR][03:47:18.096][INFO][RK0][main]: Evaluation, AUC: 0.628677
[HCTR][03:47:21.540][INFO][RK0][main]: Evaluation, AUC: 0.632847
[HCTR][03:47:24.633][INFO][RK0][main]: Evaluation, AUC: 0.626757
[HCTR][03:47:24.758][INFO][RK0][main]: -----------------------------------Epoch 58-----------------------------------
[HCTR][03:47:27.734][INFO][RK0][main]: Evaluation, AUC: 0.630998
[HCTR][03:47:31.249][INFO][RK0][main]: Evaluation, AUC: 0.624877
[HCTR][03:47:32.528][INFO][RK0][main]: -----------------------------------Epoch 59-----------------------------------
[HCTR][03:47:34.863][INFO][RK0][main]: Evaluation, AUC: 0.626591
[HCTR][03:47:38.615][INFO][RK0][main]: Evaluation, AUC: 0.613877
[HCTR][03:47:40.785][INFO][RK0][main]: -----------------------------------Epoch 60-----------------------------------
[HCTR][03:47:41.855][INFO][RK0][main]: Evaluation, AUC: 0.617227
[HCTR][03:47:45.216][INFO][RK0][main]: Evaluation, AUC: 0.604119
[HCTR][03:47:48.435][INFO][RK0][main]: Evaluation, AUC: 0.610586
[HCTR][03:47:49.617][INFO][RK0][main]: -----------------------------------Epoch 61-----------------------------------
[HCTR][03:47:51.822][INFO][RK0][main]: Evaluation, AUC: 0.619046
[HCTR][03:47:55.262][INFO][RK0][main]: Evaluation, AUC: 0.61385
[HCTR][03:47:57.630][INFO][RK0][main]: -----------------------------------Epoch 62-----------------------------------
[HCTR][03:47:58.712][INFO][RK0][main]: Evaluation, AUC: 0.622791
[HCTR][03:48:02.060][INFO][RK0][main]: Evaluation, AUC: 0.617612
[HCTR][03:48:05.560][INFO][RK0][main]: Evaluation, AUC: 0.607431
[HCTR][03:48:06.490][INFO][RK0][main]: -----------------------------------Epoch 63-----------------------------------
[HCTR][03:48:09.092][INFO][RK0][main]: Evaluation, AUC: 0.622129
[HCTR][03:48:12.705][INFO][RK0][main]: Evaluation, AUC: 0.616801
[HCTR][03:48:14.504][INFO][RK0][main]: -----------------------------------Epoch 64-----------------------------------
[HCTR][03:48:15.979][INFO][RK0][main]: Evaluation, AUC: 0.611538
[HCTR][03:48:19.146][INFO][RK0][main]: Evaluation, AUC: 0.61975
[HCTR][03:48:22.412][INFO][RK0][main]: Evaluation, AUC: 0.601556
[HCTR][03:48:22.810][INFO][RK0][main]: -----------------------------------Epoch 65-----------------------------------
[HCTR][03:48:26.073][INFO][RK0][main]: Evaluation, AUC: 0.605704
[HCTR][03:48:29.257][INFO][RK0][main]: Evaluation, AUC: 0.617283
[HCTR][03:48:30.897][INFO][RK0][main]: -----------------------------------Epoch 66-----------------------------------
[HCTR][03:48:32.606][INFO][RK0][main]: Evaluation, AUC: 0.592738
[HCTR][03:48:35.761][INFO][RK0][main]: Evaluation, AUC: 0.60789
[HCTR][03:48:38.972][INFO][RK0][main]: Evaluation, AUC: 0.607424
[HCTR][03:48:39.223][INFO][RK0][main]: -----------------------------------Epoch 67-----------------------------------
[HCTR][03:48:42.298][INFO][RK0][main]: Evaluation, AUC: 0.611191
[HCTR][03:48:45.413][INFO][RK0][main]: Evaluation, AUC: 0.621895
[HCTR][03:48:46.675][INFO][RK0][main]: -----------------------------------Epoch 68-----------------------------------
[HCTR][03:48:48.483][INFO][RK0][main]: Evaluation, AUC: 0.625431
[HCTR][03:48:51.841][INFO][RK0][main]: Evaluation, AUC: 0.581377
[HCTR][03:48:54.986][INFO][RK0][main]: Evaluation, AUC: 0.580077
[HCTR][03:48:54.986][INFO][RK0][main]: -----------------------------------Epoch 69-----------------------------------
[HCTR][03:48:58.409][INFO][RK0][main]: Evaluation, AUC: 0.589024

JacoCheung commented on July 17, 2024

Yes, the hyperparameters should be adjusted (in the worst case, you may have to opt for another optimizer). We have not fully tested fp16 training for all models.

There are two remarks from the AUC log you posted:

  1. The AUC in the first epoch is much lower than with fp32. This may imply that this model plus the current configuration is not suitable for fp16 training.

  2. The AUC tends to drop after many epochs. This may imply overfitting.

Anyway, I recommend adjusting some hyperparameters, for instance along the lines of the sketch below.
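
A hedged sketch of the kind of adjustment I mean (the exact values are illustrative assumptions, not tested recommendations):

import hugectr

# Illustrative only: a lower lr and an explicit static loss scaler are common
# first knobs for fp16 divergence. These exact values are untested assumptions.
solver = hugectr.CreateSolver(
    max_eval_batches=300,
    batchsize_eval=16384,
    batchsize=16384,
    lr=0.0005,                # smaller than the original 0.001
    vvgpu=[[0]],
    repeat_dataset=False,
    i64_input_key=True,
    use_mixed_precision=True,
    scaler=1024,              # static loss scaling; HugeCTR has no dynamic scaler
)
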

zpcalan commented on July 17, 2024

Thanks. I will adjust as you suggest.
By the way, could you please check this issue about ETC? I am trying to run that test regardless of the divergence.
I appreciate it a lot!

zpcalan commented on July 17, 2024

I am trying to avoid the use_mixed_precision parameter and manually add a hugectr.Layer_t.Cast layer instead, to fix this convergence issue while keeping performance consistent. But the API documentation seems to lack a description of the hugectr.Layer_t.Cast layer. How can I use this type of DenseLayer to cast an output to float16, for example?

JacoCheung commented on July 17, 2024

Hi @zpcalan, I'm afraid you cannot manually cast the data type and feed the tensor to a layer of a different data type. The use_mixed_precision flag has a global impact: if it's off, HugeCTR assumes all input tensors of all layers have the fp32 data type, while if it's on, all inputs to layers must have the fp16 data type. HugeCTR does not support fp16 for a specific layer. In fp32 mode, the Cast layer casts fp32 to fp16; in fp16 mode, it casts fp16 to fp32.

zpcalan commented on July 17, 2024

I used DLRM instead of W&D because the parameters in that script are already configured correctly for mixed precision.
Issue closed. :)
