westlake-repl / IDvs.MoRec
End-to-end Training for Multimodal Recommendation Systems
License: Apache License 2.0
Thank you for your excellent work.
I encountered some problems while running the code. Could you help answer them? Here are the training parameters.
import os

root_data_dir = '../../'
dataset = 'dataset/HM'
behaviors = 'hm_50w_users.tsv'
images = 'hm_50w_items.tsv'
lmdb_data = 'hm_50w_items.lmdb'
logging_num = 2
testing_num = 1

CV_resize = 224
CV_model_load = 'swin_tiny'
freeze_paras_before = 0

mode = 'train'
item_tower = 'modal'
epoch = 150
load_ckpt_name = 'None'

l2_weight_list = [0.01]
drop_rate_list = [0.1]
batch_size_list = [16]
lr_list_ct = [(1e-4, 1e-4), (5e-5, 5e-5), (1e-4, 5e-5)]
embedding_dim_list = [512]

for l2_weight in l2_weight_list:
    for batch_size in batch_size_list:
        for drop_rate in drop_rate_list:
            for embedding_dim in embedding_dim_list:
                for lr_ct in lr_list_ct:
                    lr = lr_ct[0]
                    fine_tune_lr = lr_ct[1]
                    label_screen = '{}_bs{}_ed{}_lr{}_dp{}_L2{}_Flr{}'.format(
                        item_tower, batch_size, embedding_dim, lr,
                        drop_rate, l2_weight, fine_tune_lr)
                    run_py = "CUDA_VISIBLE_DEVICES='2,3' \
                        /home/zwy/anaconda3/envs/m/bin/python -m torch.distributed.launch --nproc_per_node 2 --master_port 1289 \
                        run.py --root_data_dir {} --dataset {} --behaviors {} --images {} --lmdb_data {} \
                        --mode {} --item_tower {} --load_ckpt_name {} --label_screen {} --logging_num {} --testing_num {} \
                        --l2_weight {} --drop_rate {} --batch_size {} --lr {} --embedding_dim {} \
                        --CV_resize {} --CV_model_load {} --epoch {} --freeze_paras_before {} --fine_tune_lr {}".format(
                            root_data_dir, dataset, behaviors, images, lmdb_data,
                            mode, item_tower, load_ckpt_name, label_screen, logging_num, testing_num,
                            l2_weight, drop_rate, batch_size, lr, embedding_dim,
                            CV_resize, CV_model_load, epoch, freeze_paras_before, fine_tune_lr)
                    os.system(run_py)
Here is the error that occurred.
/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING]
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
[--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
[--behaviors BEHAVIORS] [--images IMAGES]
[--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
[--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
[--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
[--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
[--l2_weight L2_WEIGHT]
[--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
[--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
[--freeze_paras_before FREEZE_PARAS_BEFORE]
[--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
[--num_attention_heads NUM_ATTENTION_HEADS]
[--transformer_block TRANSFORMER_BLOCK]
[--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
[--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
[--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
[--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=0
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
[--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
[--behaviors BEHAVIORS] [--images IMAGES]
[--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
[--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
[--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
[--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
[--l2_weight L2_WEIGHT]
[--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
[--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
[--freeze_paras_before FREEZE_PARAS_BEFORE]
[--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
[--num_attention_heads NUM_ATTENTION_HEADS]
[--transformer_block TRANSFORMER_BLOCK]
[--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
[--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
[--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
[--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=1
[2023-10-14 21:32:30,604] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 3708157) of binary: /home/zwy/anaconda3/envs/m/bin/python
Traceback (most recent call last):
File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-10-14_21:32:30
host : gpuserver
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 3708158)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-14_21:32:30
host : gpuserver
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 3708157)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Looking forward to your reply, thank you.
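For what it's worth, the deprecation warning at the top of the log seems to point at the cause: newer PyTorch launchers pass --local-rank (hyphen), while run.py's parser, according to the usage message, only declares --local_rank (underscore). Below is a minimal sketch of a compatible parser, assuming run.py uses argparse as the usage message indicates (only the relevant argument is shown):

# Sketch: accept both flag spellings and fall back to the LOCAL_RANK
# environment variable, as the launcher's deprecation warning suggests.
import argparse
import os

parser = argparse.ArgumentParser()
# argparse stores both option strings under the dest 'local_rank'.
parser.add_argument('--local_rank', '--local-rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args, _ = parser.parse_known_args()

Alternatively, launching with torchrun instead of python -m torch.distributed.launch and reading the rank from os.environ['LOCAL_RANK'] avoids the flag entirely, as the warning itself recommends.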
First of all, thanks for your insightful work.
Where can I find the training code for DSSM?
I am running bce-text/main-end2end/train_id.py on a single 3090.
Training runs at about 3 min/epoch, but validation is extremely slow, and I am not sure why.
I am intrigued by your work and have a few questions to discuss with you. You conducted a hyperparameter search for IDRec and MoRec; could you provide the optimal hyperparameters that yielded the best performance?
Which article proposed the in-batch debiased cross-entropy loss? Can you provide the relevant literature?
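For context, this correction is usually written as an in-batch softmax whose logits are shifted down by the log of each item's sampling probability (the "logQ" correction); the formulation is commonly attributed to "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Yi et al., RecSys 2019). A minimal sketch, with illustrative variable names not taken from this repo:

import torch
import torch.nn.functional as F

def inbatch_debiased_ce(user_emb, item_emb, item_log_prob):
    # user_emb, item_emb: [B, D]; item_log_prob: [B], log sampling
    # probability of each in-batch item (e.g. its empirical popularity).
    logits = user_emb @ item_emb.t()              # [B, B] in-batch scores
    logits = logits - item_log_prob.unsqueeze(0)  # logQ popularity correction
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)        # diagonal entries = positives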
self.fc = MLP_Layers(word_embedding_dim=num_fc_ftr,
                     item_embedding_dim=args.embedding_dim,
                     layers=[args.embedding_dim] * (args.dnn_layer + 1),
                     drop_rate=args.drop_rate)

During training, this code transforms the input embeddings and then computes similarities and the BCE loss against the candidate positive and negative samples. Why, at prediction time, are the item_embeddings used directly rather than being passed through the MLP_Layers above? The prediction code is:
item_embeddings = item_embeddings.to(local_rank)
with torch.no_grad():
    eval_all_user = []
    item_rank = torch.Tensor(np.arange(item_num) + 1).to(local_rank)
    for data in eval_dl:
        user_ids, input_embs, log_mask, labels = data
        user_ids, input_embs, log_mask, labels = \
            user_ids.to(local_rank), input_embs.to(local_rank), \
            log_mask.to(local_rank), labels.to(local_rank).detach()
        prec_emb = model.module.user_encoder(input_embs, log_mask, local_rank)[:, -1].detach()
        scores = torch.matmul(prec_emb, item_embeddings.t()).squeeze(dim=-1).detach()
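One consistent reading of the two code paths (an assumption on my part, not something confirmed by the repo) is that the item_embeddings tensor fed into this loop was already precomputed by the item tower, with the MLP in self.fc applied at that stage, so the eval loop only needs the raw matrix product. A sketch of such a precomputation step, reusing model and local_rank from the snippet above; item_dl is a hypothetical loader over all items, and model.module.fc stands in for the MLP_Layers shown earlier:

# Hypothetical precomputation, assuming the MLP in self.fc is applied once
# over all items before the eval loop rather than inside it.
all_item_embs = []
with torch.no_grad():
    for raw_feats in item_dl:                        # hypothetical item loader
        emb = model.module.fc(raw_feats.to(local_rank))  # same MLP as training
        all_item_embs.append(emb)
item_embeddings = torch.cat(all_item_embs, dim=0)    # consumed by the loop above

If that is how item_embeddings is built, the MLP is not skipped at prediction time; it is simply applied once up front.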
Thanks for sharing this interesting work. I was wondering if you are going to share the scripts for fine-tuning LLMs.