westlake-repl / IDvs.MoRec
End-to-end Training for Multimodal Recommendation Systems
License: Apache License 2.0
Thank you for your excellent work.
I encountered some problems while running the code. Could you help answer them? Here are the training parameters.
import os

root_data_dir = '../../'
dataset = 'dataset/HM'
behaviors = 'hm_50w_users.tsv'
images = 'hm_50w_items.tsv'
lmdb_data = 'hm_50w_items.lmdb'
logging_num = 2
testing_num = 1

CV_resize = 224
CV_model_load = 'swin_tiny'
freeze_paras_before = 0

mode = 'train'
item_tower = 'modal'
epoch = 150
load_ckpt_name = 'None'

l2_weight_list = [0.01]
drop_rate_list = [0.1]
batch_size_list = [16]
lr_list_ct = [(1e-4, 1e-4), (5e-5, 5e-5), (1e-4, 5e-5)]
embedding_dim_list = [512]

for l2_weight in l2_weight_list:
    for batch_size in batch_size_list:
        for drop_rate in drop_rate_list:
            for embedding_dim in embedding_dim_list:
                for lr_ct in lr_list_ct:
                    lr = lr_ct[0]
                    fine_tune_lr = lr_ct[1]
                    label_screen = '{}_bs{}_ed{}_lr{}_dp{}_L2{}_Flr{}'.format(
                        item_tower, batch_size, embedding_dim, lr,
                        drop_rate, l2_weight, fine_tune_lr)
                    run_py = "CUDA_VISIBLE_DEVICES='2,3' \
                        /home/zwy/anaconda3/envs/m/bin/python -m torch.distributed.launch --nproc_per_node 2 --master_port 1289 \
                        run.py --root_data_dir {} --dataset {} --behaviors {} --images {} --lmdb_data {} \
                        --mode {} --item_tower {} --load_ckpt_name {} --label_screen {} --logging_num {} --testing_num {} \
                        --l2_weight {} --drop_rate {} --batch_size {} --lr {} --embedding_dim {} \
                        --CV_resize {} --CV_model_load {} --epoch {} --freeze_paras_before {} --fine_tune_lr {}".format(
                            root_data_dir, dataset, behaviors, images, lmdb_data,
                            mode, item_tower, load_ckpt_name, label_screen, logging_num, testing_num,
                            l2_weight, drop_rate, batch_size, lr, embedding_dim,
                            CV_resize, CV_model_load, epoch, freeze_paras_before, fine_tune_lr)
                    os.system(run_py)
Here is the error that occurred.
/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING]
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
[--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
[--behaviors BEHAVIORS] [--images IMAGES]
[--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
[--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
[--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
[--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
[--l2_weight L2_WEIGHT]
[--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
[--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
[--freeze_paras_before FREEZE_PARAS_BEFORE]
[--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
[--num_attention_heads NUM_ATTENTION_HEADS]
[--transformer_block TRANSFORMER_BLOCK]
[--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
[--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
[--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
[--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=0
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
[--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
[--behaviors BEHAVIORS] [--images IMAGES]
[--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
[--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
[--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
[--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
[--l2_weight L2_WEIGHT]
[--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
[--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
[--freeze_paras_before FREEZE_PARAS_BEFORE]
[--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
[--num_attention_heads NUM_ATTENTION_HEADS]
[--transformer_block TRANSFORMER_BLOCK]
[--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
[--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
[--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
[--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=1
[2023-10-14 21:32:30,604] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 3708157) of binary: /home/zwy/anaconda3/envs/m/bin/python
Traceback (most recent call last):
File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-10-14_21:32:30
host : gpuserver
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 3708158)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-14_21:32:30
host : gpuserver
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 3708157)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Looking forward to your reply, thank you.
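For what it's worth, the deprecation warning at the top of the log seems to point at the cause: newer PyTorch launchers pass --local-rank (hyphen), while run.py's parser, according to the usage message, only declares --local_rank (underscore). Below is a minimal sketch of a compatible parser, assuming run.py uses argparse as the usage message indicates (only the relevant argument is shown):

# Sketch: accept both flag spellings and fall back to the LOCAL_RANK
# environment variable, as the launcher's deprecation warning suggests.
import argparse
import os

parser = argparse.ArgumentParser()
# argparse stores both option strings under the dest 'local_rank'.
parser.add_argument('--local_rank', '--local-rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)))
args, _ = parser.parse_known_args()

Alternatively, launching with torchrun instead of python -m torch.distributed.launch and reading the rank from os.environ['LOCAL_RANK'] avoids the flag entirely, as the warning itself recommends.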
First of all, thanks for your insightful work.
Where can I find the training code for DSSM?
I am running bce-text/main-end2end/train_id.py on a single 3090.
Training runs at about 3 min/epoch, but validation is extremely slow, and I am not sure why.
I am intrigued by your work and have a few questions to discuss with you. You conducted a hyperparameter search for IDRec and MoRec; could you provide the optimal hyperparameters that yielded the best performance?
Which article proposed the in-batch debiased cross-entropy loss? Can you provide the relevant literature?
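For context, this correction is usually written as an in-batch softmax whose logits are shifted down by the log of each item's sampling probability (the "logQ" correction); the formulation is commonly attributed to "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Yi et al., RecSys 2019). A minimal sketch, with illustrative variable names not taken from this repo:

import torch
import torch.nn.functional as F

def inbatch_debiased_ce(user_emb, item_emb, item_log_prob):
    # user_emb, item_emb: [B, D]; item_log_prob: [B], log sampling
    # probability of each in-batch item (e.g. its empirical popularity).
    logits = user_emb @ item_emb.t()              # [B, B] in-batch scores
    logits = logits - item_log_prob.unsqueeze(0)  # logQ popularity correction
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)        # diagonal entries = positives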
self.fc = MLP_Layers(word_embedding_dim=num_fc_ftr,
                     item_embedding_dim=args.embedding_dim,
                     layers=[args.embedding_dim] * (args.dnn_layer + 1),
                     drop_rate=args.drop_rate)

During training, this code transforms the input embeddings and then computes similarities and the BCE loss against the candidate positive and negative samples. Why, at prediction time, are the item_embeddings used directly rather than being passed through the MLP_Layers above? The prediction code is:
item_embeddings = item_embeddings.to(local_rank)
with torch.no_grad():
    eval_all_user = []
    item_rank = torch.Tensor(np.arange(item_num) + 1).to(local_rank)
    for data in eval_dl:
        user_ids, input_embs, log_mask, labels = data
        user_ids, input_embs, log_mask, labels = \
            user_ids.to(local_rank), input_embs.to(local_rank), \
            log_mask.to(local_rank), labels.to(local_rank).detach()
        prec_emb = model.module.user_encoder(input_embs, log_mask, local_rank)[:, -1].detach()
        scores = torch.matmul(prec_emb, item_embeddings.t()).squeeze(dim=-1).detach()
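One consistent reading of the two code paths (an assumption on my part, not something confirmed by the repo) is that the item_embeddings tensor fed into this loop was already precomputed by the item tower, with the MLP in self.fc applied at that stage, so the eval loop only needs the raw matrix product. A sketch of such a precomputation step, reusing model and local_rank from the snippet above; item_dl is a hypothetical loader over all items, and model.module.fc stands in for the MLP_Layers shown earlier:

# Hypothetical precomputation, assuming the MLP in self.fc is applied once
# over all items before the eval loop rather than inside it.
all_item_embs = []
with torch.no_grad():
    for raw_feats in item_dl:                        # hypothetical item loader
        emb = model.module.fc(raw_feats.to(local_rank))  # same MLP as training
        all_item_embs.append(emb)
item_embeddings = torch.cat(all_item_embs, dim=0)    # consumed by the loop above

If that is how item_embeddings is built, the MLP is not skipped at prediction time; it is simply applied once up front.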
Thanks for sharing this interesting work. I was wondering if you are going to share the scripts for fine-tuning LLMs.