
sshaoshuai / mtr


MTR: Motion Transformer with Global Intention Localization and Local Movement Refinement, NeurIPS 2022.

License: Apache License 2.0

Languages: Python 76.87%, C++ 8.39%, CUDA 14.10%, C 0.35%, Shell 0.29%

mtr's Introduction

Motion Transformer (MTR): A Strong Baseline for Multimodal Motion Prediction in Autonomous Driving

[teaser figure]

This repository is the official implementation of the NeurIPS 2022 paper (oral presentation) "Motion Transformer with Global Intention Localization and Local Movement Refinement".

Authors: Shaoshuai Shi, Li Jiang, Dengxin Dai, Bernt Schiele

[MTR (arXiv)]    [MTR++ (arXiv)]

News

[2023-06] The formal MTR++ paper has been released at arXiv:2306.17770; it supports multi-agent motion prediction and achieves state-of-the-art performance on the Waymo Open Dataset.

[2023-05] MTR++ won the championship of the Waymo Open Dataset Motion Prediction Challenge 2023; see the leaderboard here.

[2022-06] MTR won the championship of the Waymo Open Dataset Motion Prediction Challenge 2022; see the official post here.

Abstract

Predicting the multimodal future behavior of traffic participants is essential for robotic vehicles to make safe decisions. Existing works either directly predict future trajectories from latent features or use dense goal candidates to identify agents' destinations; the former strategy converges slowly, since all motion modes are derived from the same feature, while the latter has efficiency issues, since its performance depends heavily on the density of the goal candidates. In this paper, we propose the Motion TRansformer (MTR) framework, which models motion prediction as the joint optimization of global intention localization and local movement refinement. Instead of using goal candidates, MTR incorporates spatial intention priors by adopting a small set of learnable motion query pairs. Each motion query pair takes charge of trajectory prediction and refinement for a specific motion mode, which stabilizes the training process and facilitates better multimodal predictions. Experiments show that MTR achieves state-of-the-art performance on both the marginal and joint motion prediction challenges, ranking $1^{st}$ on the leaderboards of the Waymo Open Motion Dataset.

Highlights

MTR Codebase

  • State-of-the-art performance with a clear code structure that is easy to extend
  • A very simple context encoder for modeling agent/map relations
  • Motion decoder with learnable queries on intention points (see the sketch after this list)
  • Loss with a Gaussian Mixture Model for multimodal motion prediction
  • Clear data processing and organization for the Waymo Open Motion Dataset
  • Local evaluation tool with the official Waymo Motion Evaluation API
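As a rough illustration of the "learnable queries on intention points" idea, here is a minimal sketch (not the repository's actual module; the class name, shapes, and MLP structure are illustrative assumptions) of turning K k-means intention points into one learnable query per motion mode:

import torch.nn as nn

class IntentionQueries(nn.Module):
    """Hedged sketch: one query per intention point (e.g. K=64 k-means centers)."""
    def __init__(self, intention_points, d_model=256):
        # intention_points: (K, 2) tensor of k-means centers of GT trajectory endpoints
        super().__init__()
        self.register_buffer('intention_points', intention_points)
        # embed each 2D intention point into a d_model-dimensional query vector
        self.query_mlp = nn.Sequential(
            nn.Linear(2, d_model), nn.ReLU(), nn.Linear(d_model, d_model),
        )

    def forward(self):
        return self.query_mlp(self.intention_points)  # (K, d_model)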

Method

  • Simple: a pure transformer-based context encoder and motion decoder
  • Efficient: models multimodal future prediction with a small number of learnable intention queries
  • Accurate: ranked 1st on the Waymo Motion Prediction leaderboard (last update: Feb 2023)

[teaser figure]

Getting Started

Main Results

Performance on the validation set of Waymo Open Motion Dataset

Model     Training Set   minADE   minFDE   Miss Rate   mAP
MTR       20%            0.6697   1.3712   0.1668      0.3437
MTR       100%           0.6046   1.2251   0.1366      0.4164
MTR-e2e   100%           0.5160   1.0404   0.1234      0.3245

Performance on the testing set of Waymo Open Motion Dataset

Model         Training Set   minADE   minFDE   Miss Rate   mAP
MTR           100%           0.6050   1.2207   0.1351      0.4129
MTR-A (ens)   100%           0.5640   1.1344   0.1160      0.4492

Citation

If you find this work useful in your research, please consider citing:

@article{shi2022motion,
  title={Motion transformer with global intention localization and local movement refinement},
  author={Shi, Shaoshuai and Jiang, Li and Dai, Dengxin and Schiele, Bernt},
  journal={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{shi2023mtr,
  title={MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying},
  author={Shi, Shaoshuai and Jiang, Li and Dai, Dengxin and Schiele, Bernt},
  journal={arXiv preprint arXiv:2306.17770},
  year={2023}
}

@article{shi2022mtra,
  title={MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge--Motion Prediction},
  author={Shi, Shaoshuai and Jiang, Li and Dai, Dengxin and Schiele, Bernt},
  journal={arXiv preprint arXiv:2209.10033},
  year={2022}
}

mtr's People

Contributors

sshaoshuai


mtr's Issues

License

Could you please add a LICENSE to the repo?

Any difference for Argoverse2?

If I want to try MTR on Argoverse or Argoverse2, apart from the data loader, are any configurations or hyper-parameters different from the Waymo version?

About Data Normalization

I found that your paper only mentions "we adopt the agent-centric strategy that normalizes all inputs to the coordinate system centered at this agent", but is it necessary to further normalize the input data (tracks and road map), e.g. min-max or z-score?
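For reference, the agent-centric normalization quoted above is a rigid transform into the agent's frame; a minimal sketch (function and argument names are illustrative, not the repository's actual code):

import numpy as np

def to_agent_frame(points_xy, agent_xy, agent_heading):
    # points_xy: (N, 2) world-frame positions (trajectory points or map polylines)
    # translate to the agent's position, then rotate so its heading aligns with +x
    c, s = np.cos(-agent_heading), np.sin(-agent_heading)
    rotation = np.array([[c, -s], [s, c]])
    return (points_xy - agent_xy) @ rotation.T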

Python setup.py failed

I have an issue with the installation: when I run setup.py, it gives me this error:

[screenshot]

How can I solve this issue?

Error while running train.py

Hi, when I run the command bash scripts/dist_train.sh 8 --cfg_file cfgs/waymo/mtr+100_percent_data.yaml --batch_size 80 --epochs 30 --extra_tag my_first_exp, I get the following error:

+ NGPUS=8
+ PY_ARGS='--cfg_file cfgs/waymo/mtr+100_percent_data.yaml --batch_size 30 --epochs 100 --extra_tag my_first_exp'
+ true
+ PORT=10808
++ nc -z 127.0.0.1 10808
++ echo 1
+ status=1
+ '[' 1 '!=' 0 ']'
+ break
+ echo 10808
10808
+ python -m torch.distributed.launch --nproc_per_node=8 --rdzv_endpoint=localhost:10808 train.py --launcher pytorch --cfg_file cfgs/waymo/mtr+100_percent_data.yaml --batch_size 30 --epochs 100 --extra_tag my_first_exp
/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions
  warnings.warn(
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 

usage: train.py [-h] [--cfg_file CFG_FILE] [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--workers WORKERS] [--extra_tag EXTRA_TAG] [--ckpt CKPT] [--pretrained_model PRETRAINED_MODEL]
                [--launcher {none,pytorch,slurm}] [--tcp_port TCP_PORT] [--without_sync_bn] [--fix_random_seed] [--ckpt_save_interval CKPT_SAVE_INTERVAL] [--local_rank LOCAL_RANK]
                [--max_ckpt_save_num MAX_CKPT_SAVE_NUM] [--merge_all_iters_to_one_epoch] [--set ...] [--max_waiting_mins MAX_WAITING_MINS] [--start_epoch START_EPOCH] [--save_to_file] [--not_eval_with_train]
                [--logger_iter_interval LOGGER_ITER_INTERVAL] [--ckpt_save_time_interval CKPT_SAVE_TIME_INTERVAL] [--add_worker_init_fn]
train.py: error: unrecognized arguments: --local-rank=0
train.py: error: unrecognized arguments: --local-rank=1
train.py: error: unrecognized arguments: --local-rank=2
train.py: error: unrecognized arguments: --local-rank=3
train.py: error: unrecognized arguments: --local-rank=4
train.py: error: unrecognized arguments: --local-rank=5
train.py: error: unrecognized arguments: --local-rank=6
train.py: error: unrecognized arguments: --local-rank=7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 10113) of binary: /home/febin/anaconda3/envs/mtr/bin/python
Traceback (most recent call last):
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/febin/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

train.py FAILED

Failures:
[1]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 10114)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 10115)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 10116)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 4 (local_rank: 4)
  exitcode  : 2 (pid: 10117)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 5 (local_rank: 5)
  exitcode  : 2 (pid: 10118)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 10119)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 10120)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-03-17_18:49:28
  host      : febin-ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 10113)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Any help will be appreciated
Thank you
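A common workaround for this class of failure (hedged, since it depends on the PyTorch version): newer versions of torch.distributed.launch pass --local-rank with a dash, while the script's argparse expects --local_rank. Registering both spellings on the (assumed) parser in train.py usually resolves it:

# sketch: accept both spellings; `parser` is train.py's existing ArgumentParser
parser.add_argument('--local_rank', '--local-rank', type=int, default=0,
                    help='filled in by torch.distributed.launch / torchrun')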

Does the ops folder have a Python version?

Is there a Python version of the attention file and the KNN file?
We encountered difficulties when converting to the ONNX format: the attentionWeightComputation op cannot be converted. We deduce that this is because it is implemented as a C++/CUDA extension.
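For export experiments, such a CUDA op can often be approximated with plain torch ops. A hedged sketch (the shapes, names, and the assumption about what attentionWeightComputation computes are mine, not the repository's actual interface):

import torch

def attention_weights_py(query, key, knn_idx):
    # query: (N, H, D) per-token query features, H heads of dimension D
    # key:   (M, H, D) key features; knn_idx: (N, K) neighbor indices into key
    neighbor_keys = key[knn_idx]  # (N, K, H, D) via advanced indexing
    # per-head dot product between each query and its K neighbor keys
    return torch.einsum('nhd,nkhd->nhk', query, neighbor_keys)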

Run MTR in conda/Pycharm environment

Hello,

Thank you very much for this amazing model. I am trying to run it in PyCharm using a conda virtual environment, but I cannot make it work.
I cloned the code using GitHub > Clone in PyCharm and ran pip install -r requirements.txt in the conda virtual environment.

I think the first step of your code is to run dataset > Waymo > data_preprocess.py, right?
When I do that, I get the following error: "ModuleNotFoundError: No module named 'waymo_open_dataset'", so I tried the line you advise for installing the Waymo dataset package, "pip install waymo-open-dataset-tf-2-6-0", in the virtual environment, but I keep getting the same error no matter what I change:

"ERROR: Could not find a version that satisfies the requirement waymo-open-dataset-tf-2-6-0 (from versions: none)
ERROR: No matching distribution found for waymo-open-dataset-tf-2-6-0"

Do you know what I can do to use waymo-open-dataset? I did also clone it into a separate folder, but it does not help.
Or is it another file than data_preprocess.py that I should run?

[screenshot]

Just to avoid confusion, the very first folder is wrongly named Waymo; it should be named MTR, as this is where I cloned your model.

Is it because Waymo can only be used on Google Colab? Is there a way to run your code locally to get a better idea of the different steps of your algorithm?

Thank you very much

Best k-means cluster number

I noticed that the k-means cluster number is set to 64. Is this a value obtained from experiments or an empirical choice?
How will the prediction results change as the cluster number increases or decreases?

question about motion query pair initialization

Dear author, thanks for your excellent work and code release! I have read your paper and have a question about motion query pair initialization. I notice that the intention points are generated by applying the k-means clustering algorithm to the endpoints of GT trajectories, and are used to initialize the motion query pairs and predicted trajectories in the 1st decoder layer. I'm confused about how to initialize the query pairs and trajectories during inference, since there is no GT in that phase.

How to reproduce MTR-e2e results?

Hello,

Thank you for sharing your code. I would like to understand how to replicate the results of the MTR-e2e model. I noticed that there is no corresponding configuration in the repository, and I am finding it difficult to understand the precise method for selecting the "positive mixture component".

MTR-e2e for end-to-end motion prediction. We also propose an end-to-end variant of MTR, called MTR-e2e, where only 6 motion query pairs are adopted so as to remove NMS post processing. In the training process, instead of using static intention points for target assignment as in MTR, MTR-e2e selects positive mixture component by calculating the distances between its 6 predicted trajectories and the GT trajectory, since 6 intention points are too sparse to well cover all potential future motions.

Could you provide further clarification on the definition of the "positive mixture component" and explain how to reproduce the results of the end-to-end (e2e) approach? Thank you.
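Going only by the paragraph quoted above, the "positive mixture component" should be the predicted trajectory closest to the GT. A minimal sketch of such a selection rule (the average-L2 metric and all names here are assumptions):

import torch

def select_positive_component(pred_trajs, gt_traj, gt_valid_mask):
    # pred_trajs: (M, T, 2), gt_traj: (T, 2), gt_valid_mask: (T,) float in {0, 1}
    dist = torch.norm(pred_trajs - gt_traj[None], dim=-1)  # (M, T)
    ade = (dist * gt_valid_mask[None]).sum(-1) / gt_valid_mask.sum().clamp(min=1)
    return ade.argmin()  # index of the positive (closest) component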

KNN_CUDA

Hello, I'm glad to see your code, but I encountered some problems during setup: I cannot import KNN_CUDA after successful compilation. I have searched for relevant information online but haven't solved it. I hope you can help me with this problem. Thank you very much, and I look forward to your reply!
[screenshot]

err in training

Have you encountered the following problem?

torch.cuda.device_count(): 2
Traceback (most recent call last):
  File "train.py", line 217, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids=[0], find_unused_parameters=True)
  File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    main()
  File "train.py", line 217, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids=[0], find_unused_parameters=True)
  File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

My torch version is 1.11.0, CUDA 10.2, and the GPU is an RTX 3090.
What is your experimental environment?
Thanks

overfitting issue

Hi @sshaoshuai,

Thank you for sharing your great work. In your paper, you mention that you trained the model for 30 epochs without data augmentation. I wonder whether you encountered any overfitting issues (especially in terms of the validation loss)? How was 30 epochs selected?

Thank you very much!

MTR++_Ens and MTR++

Hello dear sshaoshuai,
Is the MTR++_Ens and MTR++ code the same as this repo? If not, could you tell me what differs from this repo, or point me to another MTR++ / MTR++_Ens repo?

Visualization.

Thank you so much for such a great job. I see that you provide very nice visualization results, and I get pkl files containing the results after evaluation. Could you please provide the script for visualizing the experimental results?

dynamic_map_infos

Hi, @sshaoshuai Thanks for your awesome work!
I wonder why the features of dynamic_map_infos in the original pkl files are not extracted in waymo_dataset.py? It seems they are also useful.

do not understand gmm loss

if use_square_gmm:
    log_std1 = log_std2 = torch.clip(nearest_trajs[:, :, 2], min=log_std_range[0], max=log_std_range[1])
    std1 = std2 = torch.exp(log_std1)   # (0.2m to 150m)
    rho = torch.zeros_like(log_std1)
else:
    log_std1 = torch.clip(nearest_trajs[:, :, 2], min=log_std_range[0], max=log_std_range[1])
    log_std2 = torch.clip(nearest_trajs[:, :, 3], min=log_std_range[0], max=log_std_range[1])
    std1 = torch.exp(log_std1)  # (0.2m to 150m)
    std2 = torch.exp(log_std2)  # (0.2m to 150m)
    rho = torch.clip(nearest_trajs[:, :, 4], min=-rho_limit, max=rho_limit)

gt_valid_mask = gt_valid_mask.type_as(pred_scores)
if timestamp_loss_weight is not None:
    gt_valid_mask = gt_valid_mask * timestamp_loss_weight[None, :]

# -log(a^-1 * e^b) = log(a) - b
reg_gmm_log_coefficient = log_std1 + log_std2 + 0.5 * torch.log(1 - rho**2)  # (batch_size, num_timestamps)
reg_gmm_exp = (0.5 * 1 / (1 - rho**2)) * ((dx**2) / (std1**2) + (dy**2) / (std2**2) - 2 * rho * dx * dy / (std1 * std2))  # (batch_size, num_timestamps)

reg_loss = ((reg_gmm_log_coefficient + reg_gmm_exp) * gt_valid_mask).sum(dim=-1)

Hello, I do not understand this GMM loss; could you please explain it in detail?
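For what it's worth, the snippet is the negative log-likelihood of a bivariate Gaussian evaluated at the residual (dx, dy), with the constant term dropped; this is a standard identity that matches the two terms in the code:

-\log p(dx, dy)
  = \underbrace{\log\sigma_1 + \log\sigma_2 + \tfrac{1}{2}\log(1-\rho^2)}_{\texttt{reg\_gmm\_log\_coefficient}}
  + \underbrace{\frac{1}{2(1-\rho^2)} \left( \frac{dx^2}{\sigma_1^2} + \frac{dy^2}{\sigma_2^2} - \frac{2\rho\,dx\,dy}{\sigma_1 \sigma_2} \right)}_{\texttt{reg\_gmm\_exp}}
  + \log 2\pi

Here \sigma_i = \exp(\text{log\_std}_i), and the clipping keeps \sigma in roughly the (0.2 m, 150 m) range noted in the code comments.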

About Validation data for submission to waymo

Hi, we used your evaluation method on the validation data, which is Waymo 1.2, where the total number of scenarios is 44097. But when we uploaded the result to the Waymo motion challenge, we got the error:
INVALID_ARGUMENT: Not enough scenario predictions in submission : 0.
Can you give us some guidance about this issue?
Thank you!
The evaluation code is:
def eval_one_epoch(cfg, model, dataloader, epoch_id, logger, dist_test=False, save_to_file=True, result_dir=None, logger_iter_interval=50):
    result_dir.mkdir(parents=True, exist_ok=True)

    final_output_dir = result_dir / 'final_result' / 'data'
    if save_to_file:
        final_output_dir.mkdir(parents=True, exist_ok=True)

    dataset = dataloader.dataset

    logger.info('*************** EPOCH %s EVALUATION *****************' % epoch_id)
    if dist_test:
        if not isinstance(model, torch.nn.parallel.DistributedDataParallel):
            num_gpus = torch.cuda.device_count()
            local_rank = cfg.LOCAL_RANK % num_gpus
            model = torch.nn.parallel.DistributedDataParallel(
                    model,
                    device_ids=[local_rank],
                    broadcast_buffers=False
            )
    model.eval()

    if cfg.LOCAL_RANK == 0:
        progress_bar = tqdm.tqdm(total=len(dataloader), leave=True, desc='eval', dynamic_ncols=True)
    start_time = time.time()

    pred_dicts = []
    for i, batch_dict in enumerate(dataloader):
        with torch.no_grad():
            batch_pred_dicts = model(batch_dict)
            final_pred_dicts = dataset.generate_prediction_dicts(batch_pred_dicts, output_path=final_output_dir if save_to_file else None)
            pred_dicts += final_pred_dicts

        disp_dict = {}

        if cfg.LOCAL_RANK == 0 and (i % logger_iter_interval == 0 or i == 0 or i + 1 == len(dataloader)):
            past_time = progress_bar.format_dict['elapsed']
            second_each_iter = past_time / max(i, 1.0)
            remaining_time = second_each_iter * (len(dataloader) - i)
            disp_str = ', '.join([f'{key}={val:.3f}' for key, val in disp_dict.items() if key != 'lr'])
            batch_size = batch_dict.get('batch_size', None)
            logger.info(f'eval: epoch={epoch_id}, batch_iter={i}/{len(dataloader)}, batch_size={batch_size}, iter_cost={second_each_iter:.2f}s, '
                        f'time_cost: {progress_bar.format_interval(past_time)}/{progress_bar.format_interval(remaining_time)}, '
                        f'{disp_str}')

    if cfg.LOCAL_RANK == 0:
        progress_bar.close()

    if dist_test:
        pred_dicts = common_utils.merge_results_dist(pred_dicts, len(dataset), tmpdir=result_dir / 'tmpdir')

    if cfg.LOCAL_RANK != 0:
        return {}

    with open(result_dir / 'result.pkl', 'wb') as f:
        pickle.dump(pred_dicts, f)

    print('len_pred_dicts: ', len(pred_dicts))
    print('pred_trajs: ', pred_dicts[0]['pred_trajs'].shape)

    return pred_dicts

The print results:
len_pred_dicts: 192172
pred_trajs: (6, 80, 2)

Code for visualization

How can I visualize the predictions? I only have pred_trajs in the pickle file.
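A minimal plotting sketch, assuming the result.pkl layout printed in the submission issue above (each entry carries pred_trajs of shape (6, 80, 2) in an agent-centric frame):

import pickle
import matplotlib.pyplot as plt

with open('result.pkl', 'rb') as f:
    pred_dicts = pickle.load(f)

trajs = pred_dicts[0]['pred_trajs']  # (num_modes=6, num_timestamps=80, 2)
for mode_traj in trajs:
    plt.plot(mode_traj[:, 0], mode_traj[:, 1], alpha=0.7)  # one line per mode
plt.axis('equal')
plt.show()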

what is dense future prediction

Excuse me, I see there is a dense future prediction module after the transformer encoder, but I don't see any description of it in the paper. Can you explain what "Dense Future Prediction" is?

Looking Forward to Pretrain Weights

Wonderful work! However, training on the full dataset seems very resource-consuming. I wonder if you will release the pretrained weights of the model?

The issue in the prediction.

Hello, thanks for your work! However, I found that many trajectories make a U-turn like the one shown in the following figure when I use MTR to predict trajectories. May I ask what the cause of this issue might be? I understand that MTR first generates static intention points, followed by dynamic map collection to refine the trajectory. Can we assume that the target point position is correct, but there is a problem with trajectory generation?
[figure]

what's the version of your CUDA toolkit?

The PyTorch+CUDA version I am using is 1.10.1+cu111. I keep encountering the following error after running for a while:

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Epoch 0: 0%| | 361/97977 [01:36<7:14:33, 3.74it/s, loss=104, v_num=]
[W CUDAGuardImpl.h:113] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stack trace below might be incorrect.
For debugging, consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):

These errors always occur in the linear layer of transformer_encoder_layer. It feels like they are caused by the CUDA implementation of attention. What CUDA version did you use in the environment when building the attention operator?

Error occurred when training.

When I ran bash scripts/dist_train.sh 1 --cfg_file cfgs/waymo/mtr+100_percent_data.yaml --batch_size 80 --epochs 30 --extra_tag my_first_exp, I met the error below:
Traceback (most recent call last):
File "train.py", line 22, in <module>
from mtr.models import model as model_utils
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/models/model.py", line 12, in <module>
from .context_encoder import build_context_encoder
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/models/context_encoder/__init__.py", line 7, in <module>
from .mtr_encoder import MTREncoder
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/models/context_encoder/mtr_encoder.py", line 12, in <module>
from mtr.models.utils.transformer import transformer_encoder_layer, position_encoding_utils
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/models/utils/transformer/transformer_encoder_layer.py", line 16, in <module>
from .multi_head_attention_local import MultiheadAttentionLocal
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/models/utils/transformer/multi_head_attention_local.py", line 21, in <module>
from mtr.ops import attention
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/ops/attention/__init__.py", line 5, in <module>
from . import attention_utils
File "/mnt/sdb/jianghanhu/MTR/tools/../mtr/ops/attention/attention_utils.py", line 9, in <module>
from . import attention_cuda
ImportError: /mnt/sdb/jianghanhu/MTR/tools/../mtr/ops/attention/attention_cuda.cpython-38-x86_64-linux-gnu.so: failed to map segment from shared object
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13193) of binary: /home/jianghanhu/anaconda3/envs/mtr/bin/python
Traceback (most recent call last):
File "/home/jianghanhu/anaconda3/envs/mtr/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jianghanhu/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jianghanhu/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/jianghanhu/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/jianghanhu/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jianghanhu/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-12_22:11:14
host : psdz
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 13193)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


It is caused by 'from . import attention_cuda' in MTR/tools/../mtr/ops/attention/attention_utils.py, could you please tell me how to solve it?

'CENTER_OFFSET_OF_MAP', (30.0, 0)

Hi, I am very impressed by your great open-source work.

I want to ask how you got the (30.0, 0) default for the center_offset:
center_offset=self.dataset_cfg.get('CENTER_OFFSET_OF_MAP', (30.0, 0)),

Thank you in advance.

DETR for Multi-Modal Trajectory Prediction

I have attempted to predict trajectories using DETR and the hard-assignment strategy, without incorporating auxiliary loss. Unfortunately, this approach consistently results in straight, linear trajectories for each sample provided. I observed that your method is similar but includes auxiliary loss. I'm interested in understanding if the inclusion of this additional loss component effectively mitigates the collapsing problem. I would greatly appreciate any advice or suggestions you can provide on this matter. Thank you so much!

why is the top1 metric very bad ?

When I train the model, I find that the top-6 metrics (minADE, minFDE, and so on) are good, but the top-1 metrics (sorted by probability) are very bad. It seems that the classification between the multiple output trajectories is very hard to learn. Have you noticed this phenomenon? And do you have any ideas for resolving it? I am looking forward to your reply.

Argoverse 2

Is this code prepared to run on Argoverse 2?

Is there any plan to release the MTR++ code?

Dear Author,

First, congratulations on winning the Waymo 2023 Motion Prediction Challenge! The improvements in MTR++ are very interesting and effective.

Is there any plan to release the MTR++ code?

Best,

Joint motion prediction

Hi, I'm actually asking too many questions because I'm very interested in your work. I just want to ask how to do joint motion prediction; I can't seem to find the corresponding script.

python setup.py develop failed

Hello, I need your help. When I run python setup.py develop, the error below happens. I can find the directory /usr/local/cuda-11.6 and nvcc -V is OK, but there is no directory /bin/nvcc. What can I do to solve the problem?

~ which nvcc   
/usr/local/cuda-11.6/bin/nvcc
python setup.py develop                 
running develop
/home/zetlin/miniconda3/envs/mtr/lib/python3.8/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/home/zetlin/miniconda3/envs/mtr/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
writing MotionTransformer.egg-info/PKG-INFO
writing dependency_links to MotionTransformer.egg-info/dependency_links.txt
writing top-level names to MotionTransformer.egg-info/top_level.txt
/home/zetlin/miniconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'MotionTransformer.egg-info/SOURCES.txt'
writing manifest file 'MotionTransformer.egg-info/SOURCES.txt'
running build_ext
error: [Errno 2] No such file or directory: '/usr/local/cuda-11.6:/usr/local/cuda-11.6:/bin/nvcc'
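Judging from the error string alone, the build seems to be reading a colon-separated, PATH-style value ("/usr/local/cuda-11.6:/usr/local/cuda-11.6") where a single directory is expected, so it is worth checking whether CUDA_HOME (or CUDA_PATH) was accidentally exported PATH-style; setting it to a single directory, e.g. export CUDA_HOME=/usr/local/cuda-11.6, before rerunning setup.py may resolve this.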

About the intention point file.

Dear Author,
Thanks for your great work. During training, a bug is reported: "No file 'cluster_64_center_dict.pkl'". I wonder where I can find it, or should we generate the intention point file with some command? Looking forward to your reply.
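The paper describes the intention points as k-means cluster centers of GT trajectory endpoints, so the file can presumably be generated offline. A hedged sketch of how such a file could be produced (the per-object-type dict schema is an assumption; only the filename comes from the error message):

import pickle
from sklearn.cluster import KMeans

def build_intention_points(endpoints_by_type, num_clusters=64):
    # endpoints_by_type: {'TYPE_VEHICLE': (N, 2) array of GT endpoints, ...}
    centers = {}
    for obj_type, endpoints in endpoints_by_type.items():
        km = KMeans(n_clusters=num_clusters, n_init=10).fit(endpoints)
        centers[obj_type] = km.cluster_centers_  # (num_clusters, 2)
    with open('cluster_64_center_dict.pkl', 'wb') as f:
        pickle.dump(centers, f)
    return centers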

Code release date

Hi @sshaoshuai,

Thank you for sharing the great prediction method.
I am interested in the details of the implementation and the computational cost.
Do you plan to release the code soon?

Thank you in advance.

Best,

decode loss bug ?

layer_loss = loss_reg_gmm * weight_reg + loss_reg_vel * weight_vel + loss_cls.sum(dim=-1) * weight_cls; shouldn't it be layer_loss = loss_reg_gmm * weight_reg + loss_reg_vel * weight_vel + loss_cls * weight_cls? The 'loss_cls.sum(dim=-1)' looks wrong here. Do you agree with me?

About data pre-processing

Hi, @sshaoshuai Thanks for your wonderful work!

When I first tried the code for data pre-processing, it kept printing many error lines like the one below

2023-02-24 00:55:16.221545: I tensorflow/stream_executor/cuda/cuda_driver.cc:732] failed to allocate 2.2K (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

and I cannot see the tqdm progress bar. Is there anything wrong with the process?

In addition, may I know how much time this processing usually takes?

data process problem.

When I try to process the data, I run into a problem. I have used all the methods you recommended above, but nothing works, and I really can't find the cause.
[screenshot]

waymo dataset version

I see there are multiple versions of the Waymo Motion Dataset; which one did you use?
[screenshot]

Visualization codes

Thanks for your great work! But I am wondering whether code for visualization is provided.

About MTR-A

@sshaoshuai Hi, thank you for this wonderful work!

MTR is great, while MTR-A achieves even more impressive performance. Would you be kind enough to also release the code for reproducing MTR-A? Thanks.

Question Regarding the Auxiliary Task

Hi @sshaoshuai,

Thank you for sharing your great work. I have one question regarding the auxiliary task, more specifically the training process. According to your paper (Eq. 3), the auxiliary task only predicts a single-mode trajectory, and dynamic map collection is performed based on the endpoint of this trajectory. However, during training, do you still use the endpoint of the predicted trajectory as the reference center point for dynamic map collection? It would make more sense to use the endpoint of the GT during training, since it might help stabilize the training process.
In addition, for layer 0 of the transformer decoder, is the dynamic searching query initialized from the endpoint of the auxiliary prediction or initialized from all zeros?
I am looking forward to your reply. Thank you in advance.

Best,
