ch3cook-fdu / vote2cap-detr

[CVPR 2023] Vote2Cap-DETR and [T-PAMI 2024] Vote2Cap-DETR++; A set-to-set perspective towards 3D Dense Captioning; State-of-the-Art 3D Dense Captioning methods

License: MIT License

Python 98.02% Cython 1.10% Shell 0.88%
3d-detection 3d-models caption-generation dense-captioning multimodal-deep-learning vision-and-language deep-learning pytorch cvpr2023 t-pami

vote2cap-detr's Introduction

Vote2Cap-DETR: A Set-to-Set Perspective Towards 3D Dense Captioning

Official implementation of "End-to-End 3D Dense Captioning with Vote2Cap-DETR" (CVPR 2023) and "Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning" (T-PAMI 2024).

[Figure: pipeline]

Thanks to the implementations of 3DETR, Scan2Cap, and VoteNet.

0. News

  • 2024-04-07. 💥 Our state-of-the-art 3D dense captioning method Vote2Cap-DETR++ is accepted to T-PAMI 2024!

  • 2024-02-21. 💥 Code for Vote2Cap-DETR++ is released!

  • 2024-02-20. 🚩 Vote2Cap-DETR++ reaches 1st place on the Scan2Cap online test benchmark.

  • 2023-10-06. 🚩 Vote2Cap-DETR wins the Scan2Cap Challenge in the 3rd Language for 3D Scene Workshop at ICCV 2023.

  • 2023-09-07. 📃 We further propose an advanced model, Vote2Cap-DETR++, which decouples feature extraction for object localization and caption generation.

  • 2022-11-17. 🚩 Our model sets a new state-of-the-art on the Scan2Cap online test benchmark.

[Figure: pipeline]

1. Environment

Our code is tested with PyTorch 1.7.1, CUDA 11.0, and Python 3.8.13. Besides PyTorch, this repo also requires the following Python dependencies:

matplotlib
opencv-python
plyfile
'trimesh>=2.35.39,<2.35.40'
'networkx>=2.2,<2.3'
scipy
cython
transformers

If you wish to use the multi-view features extracted by Scan2Cap, you should also install h5py:

pip install h5py

It is also REQUIRED to compile the CUDA-accelerated PointNet++ ops and the gIoU support for fast training:

cd third_party/pointnet2
python setup.py install
cd utils
python cython_compile.py build_ext --inplace
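After the build, a quick sanity check (a hedged sketch; the extension's import path follows the 3DETR-style setup.py and may differ in your environment) is to import the compiled ops from Python:

import torch
import pointnet2._ext as _ext  # CUDA-accelerated PointNet++ ops compiled above

print("PointNet++ extension loaded; CUDA available:", torch.cuda.is_available())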

To enable the METEOR metric for evaluating captioning performance, Java must also be installed.

2. Dataset Preparation

We follow Scan2Cap's procedure to prepare datasets under the ./data folder (Scan2CAD NOT required).

Preparing 3D point clouds from ScanNet. Download the ScanNetV2 dataset, set SCANNET_DIR to the scans folder in data/scannet/batch_load_scannet_data.py, and run the following commands.

cd data/scannet/
python batch_load_scannet_data.py

Preparing Language Annotations. Please follow the official ScanRefer instructions to download the ScanRefer dataset, and put it under ./data.

[Optional] To prepare Nr3D, you also need to download it and put it under ./data. Since it comes in .csv format, run the following command to process the data.

cd data; python parse_nr3d.py

3. [Optional] Download Pretrained Weights

You can download all the ready-to-use weights from Hugging Face.

Model           | SCST | rgb | multi-view | normal | checkpoint
Vote2Cap-DETR   | -    | ✓   | -          | ✓      | [checkpoint]
Vote2Cap-DETR   | -    | -   | ✓          | ✓      | [checkpoint]
Vote2Cap-DETR   | ✓    | ✓   | -          | ✓      | [checkpoint]
Vote2Cap-DETR   | ✓    | -   | ✓          | ✓      | [checkpoint]
Vote2Cap-DETR++ | -    | ✓   | -          | ✓      | [checkpoint]
Vote2Cap-DETR++ | -    | -   | ✓          | ✓      | [checkpoint]
Vote2Cap-DETR++ | ✓    | ✓   | -          | ✓      | [checkpoint]
Vote2Cap-DETR++ | ✓    | -   | ✓          | ✓      | [checkpoint]
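If you want to peek at a downloaded checkpoint before plugging it into the training or evaluation scripts, a minimal sketch follows; the ./pretrained path and the internal key layout are assumptions, not a documented interface:

import torch

# load on CPU so no GPU is required just to inspect the file
ckpt = torch.load("pretrained/scanrefer_scst_vote2cap_detr_pp_XYZ_RGB_NORMAL.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # a "model" state dict is a common layout, but not guaranteed here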

4. Training and Evaluation

Though we provide commands for training from scratch, you can also start from the pretrained parameters provided under the ./pretrained folder and skip certain steps.

[Optional] 4.0 Pre-Training for Detection

You are free to SKIP the following procedures, as they only generate the pre-trained weights provided under the ./pretrained folder.

To train Vote2Cap-DETR's detection branch on point cloud input without additional 2D features (i.e. [xyz + rgb + normal + height]):

bash scripts/vote2cap-detr/train_scannet.sh

# Please also try our updated Vote2Cap-DETR++ model:
bash scripts/vote2cap-detr++/train_scannet.sh

To evaluate the pre-trained detection branch on ScanNet:

bash scripts/vote2cap-detr/eval_scannet.sh

# Our updated Vote2Cap-DETR++:
bash scripts/vote2cap-detr++/eval_scannet.sh

To train with additional 2D features (i.e. [xyz + multiview + normal + height]) rather than RGB inputs, replace --use_color with --use_multiview in the scripts.

4.1 MLE Training for 3D Dense Captioning

Please make sure there are pretrained checkpoints under the ./pretrained directory. To train the model for 3D dense captioning with MLE training on ScanRefer:

bash scripts/vote2cap-detr/train_mle_scanrefer.sh

# Our updated Vote2Cap-DETR++:
bash scripts/vote2cap-detr++/train_mle_scanrefer.sh

And on Nr3D:

bash scripts/vote2cap-detr/train_mle_nr3d.sh

# Our updated Vote2Cap-DETR++:
bash scripts/vote2cap-detr++/train_mle_nr3d.sh
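For intuition, MLE training optimizes a standard teacher-forcing cross-entropy over caption tokens. A minimal, generic sketch (illustrative tensor names only, not the repo's code):

import torch
import torch.nn.functional as F

def mle_caption_loss(logits, gt_tokens, pad_id=0):
    # logits: (B, T, V) next-token scores; gt_tokens: (B, T) ground-truth token ids
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gt_tokens.reshape(-1),
        ignore_index=pad_id,  # padded positions do not contribute to the loss
    )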

4.2 Self-Critical Sequence Training for 3D Dense Captioning

To train the model with Self-Critical Sequence Training (SCST), you can use the following command:

bash scripts/vote2cap-detr/train_scst_scanrefer.sh

# Our updated Vote2Cap-DETR++:
bash scripts/vote2cap-detr++/train_scst_scanrefer.sh

And on Nr3D:

bash scripts/vote2cap-detr/train_scst_nr3d.sh

# Our updated Vote2Cap-DETR++:
bash scripts/vote2cap-detr++/train_scst_nr3d.sh
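For intuition, SCST is a REINFORCE-style objective in which the reward of sampled captions is baselined by the reward of greedily decoded captions. A generic sketch (not the repo's exact implementation); note that the resulting loss can legitimately be negative:

import torch

def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    # sample_log_probs: (B,) summed log-probs of sampled captions
    # sample_reward / greedy_reward: (B,) caption rewards, e.g. CIDEr scores
    advantage = sample_reward - greedy_reward  # self-critical baseline
    return -(advantage.detach() * sample_log_probs).mean()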

4.3 Evaluating the Weights

You can evaluate any trained model by specifying the model and checkpoint. Change --dataset scene_scanrefer to --dataset scene_nr3d to evaluate on the Nr3D dataset.

bash scripts/eval_3d_dense_caption.sh

Run the following command to store object predictions and captions for each scene.

bash scripts/demo.sh
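To get a quick look at what demo.sh produced, you can load the JSON in Python; a hedged sketch, since the exact output path and schema depend on your run and are assumptions here:

import json

with open("path/to/your/demo_output.json") as f:  # adjust to the file written under your output folder
    preds = json.load(f)

print(type(preds))
if isinstance(preds, dict):  # e.g. a dict keyed by scene id (assumed layout)
    scene_id, scene_preds = next(iter(preds.items()))
    print(scene_id, str(scene_preds)[:200])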

5. Make Predictions for online test benchmark

We also provide inference code for the ScanRefer online test benchmark.

The following command will generate a .json file under the folder defined by --checkpoint_dir.

bash submit.sh

6. BibTex

If you find our work helpful, please kindly cite our papers:

@inproceedings{chen2023end,
  title={End-to-end 3d dense captioning with vote2cap-detr},
  author={Chen, Sijin and Zhu, Hongyuan and Chen, Xin and Lei, Yinjie and Yu, Gang and Chen, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={11124--11133},
  year={2023}
}

@misc{chen2023vote2capdetr,
  title={Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning}, 
  author={Sijin Chen and Hongyuan Zhu and Mingsheng Li and Xin Chen and Peng Guo and Yinjie Lei and Gang Yu and Taihao Li and Tao Chen},
  year={2023},
  eprint={2309.02999},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

7. License

Vote2Cap-DETR and Vote2Cap-DETR++ are both licensed under the MIT License.

8. Contact

If you have any questions or suggestions regarding this repo, please feel free to open issues!

vote2cap-detr's People

Contributors

ch3cook-fdu


vote2cap-detr's Issues

CUDA kernel failed : no kernel image is available for execution on the device

CUDA kernel failed : no kernel image is available for execution on the device
void furthest_point_sampling_kernel_wrapper(int, int, int, const float*, float*, int*) at L:231 in /home/imi1214/WY/wang/Vote2Cap-DETR-master/third_party/pointnet2/_ext_src/src/sampling_gpu.cu
Hello, I would like to ask whether the third GPU on the same machine can run normally; when I run on the first GPU, the above error is reported. Do you know what might be going on?
Thank you very much!

Question for ScanRefer benchmark, not Scan2cap

Dear authors,
I am wondering why the paper says that Vote2Cap is evaluated on the ScanRefer benchmark rather than the Scan2Cap benchmark.
As far as I understand, ScanRefer takes point clouds with a text query as inputs and finds the referred unique 3D box.
On the other hand, Scan2Cap takes only point clouds as input and estimates 3D boxes with descriptions.
I think Vote2Cap addresses a task like Scan2Cap, but in your paper it is written to be evaluated on ScanRefer.

Did you also evaluate your model on the ScanRefer benchmark? If so, can you share how that works, since the ScanRefer task requires two inputs, the point cloud scene and the query? If it was actually tested on Scan2Cap, is there any way to test your model on ScanRefer?

Thanks for your help in advance!

How to get file `ScanRefer_vocabulary.json`?

Hi, thank you for open-sourcing your work. When I download the ScanRefer dataset scanrefer.zip, I can't find the file ScanRefer_vocabulary.json:

Archive:  scanrefer.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
 30718184  2020-01-30 22:10   ScanRefer_filtered.json
 24370163  2020-01-30 22:09   ScanRefer_filtered_train.json
     7305  2020-01-20 19:03   ScanRefer_filtered_train.txt
  6348023  2020-01-30 22:09   ScanRefer_filtered_val.json
     1832  2020-01-20 19:03   ScanRefer_filtered_val.txt
---------                     -------
 61445507                     5 files

How can I get it?
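In case it helps while waiting for an answer, here is a hedged sketch of building a word vocabulary from the ScanRefer training captions yourself; the "token" field is assumed from the public ScanRefer release, and this is not necessarily how the repo generates ScanRefer_vocabulary.json:

import json
from collections import Counter

with open("data/ScanRefer_filtered_train.json") as f:
    annotations = json.load(f)

counter = Counter()
for ann in annotations:
    counter.update(word.lower() for word in ann["token"])  # pre-tokenized caption words

vocabulary = sorted(word for word, count in counter.items() if count >= 2)  # frequency cutoff is arbitrary
with open("ScanRefer_vocabulary_rebuilt.json", "w") as f:  # hypothetical output name
    json.dump(vocabulary, f)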

loss is negative when running SCST

Hello, when I reproduce SCST, the loss is negative. Is this normal? The final loss during MLE training was around 45. Screenshots of the loss are attached. Thank you very much.

Why is the 'nyu40id2class' of Vote2Cap different from that of these detection methods?

## VoteNet
(Pdb) DC18.nyu40id2class
{3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5, 9: 6, 10: 7, 11: 8, 12: 9, 14: 10, 16: 11, 24: 12, 28: 13, 33: 14, 34: 15, 36: 16, 39: 17}
(Pdb) len(DC18.nyu40id2class)
18
(Pdb) 

## Vote2Cap
(Pdb) self.dataset_config.nyu40id2class
{5: 2, 23: 17, 8: 5, 40: 17, 9: 6, 7: 4, 39: 17, 18: 17, 11: 8, 29: 17, 3: 0, 14: 10, 15: 17, 27: 17, 6: 3, 34: 15, 35: 17, 4: 1, 10: 7, 19: 17, 16: 11, 30: 17, 33: 14, 37: 17, 21: 17, 32: 17, 25: 17, 17: 17, 24: 12, 28: 13, 36: 16, 12: 9, 38: 17, 20: 17, 26: 17, 31: 17, 13: 17}
(Pdb) len(self.dataset_config.nyu40id2class)
37
(Pdb) 
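For what it's worth, the two dumps above are consistent with the following reconstruction (an observation from the printed dictionaries, not the repo's actual code): Vote2Cap keeps the same 18 target classes but additionally maps every other annotated NYU40 id (apparently excluding 1, 2, and 22: wall, floor, ceiling) to the last class index 17, instead of dropping those objects.

# Hypothetical reconstruction from the dumps above, not the repo's code
nyu40ids_18cls = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 24, 28, 33, 34, 36, 39]
nyu40id2class_18 = {nid: i for i, nid in enumerate(nyu40ids_18cls)}  # VoteNet-style mapping

extra_ids = [nid for nid in range(3, 41) if nid not in nyu40ids_18cls and nid != 22]
nyu40id2class_37 = dict(nyu40id2class_18, **{nid: 17 for nid in extra_ids})  # Vote2Cap-style mapping
assert len(nyu40id2class_37) == 37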

scannet_means.npz and scannet_reference_means.npz

Hi @ch3cook-fdu thanks for the great work! I was able to reproduce your results on the ScanRefer dataset and now I want to try it on a new dataset. I see that you use 2 mean arrays - data/scannet/meta_data/scannet_means.npz and data/scannet/meta_data/scannet_reference_means.npz in the model, both with shape (18, 3). Could you let me know how you computed these, and how to do it for a new dataset?

Thanks!
Chandan
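For reference, a hedged sketch of how per-class mean box sizes with shape (18, 3) are commonly computed in the VoteNet/ScanNet family of pipelines: average the (dx, dy, dz) extents of all training boxes belonging to each of the 18 classes. Array layouts and file names below are assumptions for a custom dataset, not the repo's exact procedure:

import numpy as np

def compute_mean_sizes(all_bboxes, all_class_ids, num_classes=18):
    # all_bboxes: (N, 6) with box sizes (dx, dy, dz) in columns 3:6
    # all_class_ids: (N,) class indices in [0, num_classes)
    means = np.zeros((num_classes, 3), dtype=np.float32)
    for c in range(num_classes):
        sizes = all_bboxes[all_class_ids == c, 3:6]
        if len(sizes) > 0:
            means[c] = sizes.mean(axis=0)
    return means

# np.savez("my_dataset_means.npz", compute_mean_sizes(train_boxes, train_labels))  # hypothetical usage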

dataset processing issue

Hi, I've noticed that batch_load_scannet_data.py is the same as the one in Scan2Cap, so it has the same issue.

Traceback (most recent call last):
  File "batch_load_scannet_data.py", line 84, in <module>
    batch_export()
  File "batch_load_scannet_data.py", line 79, in batch_export
    export_one_scan(scan_name, output_filename_prefix)
  File "batch_load_scannet_data.py", line 29, in export_one_scan
    mesh_vertices, aligned_vertices, semantic_labels, instance_labels, instance_bboxes, aligned_instance_bboxes = export(mesh_file, agg_file, seg_file, meta_file, LABEL_MAP_FILE, None)
  File "/root/autodl-tmp/Vote2Cap-DETR/data/scannet/load_scannet_data.py", line 56, in export
    mesh_vertices = scannet_utils.read_mesh_vertices_rgb_normal(mesh_file)
  File "/root/autodl-tmp/Vote2Cap-DETR/data/scannet/scannet_utils.py", line 99, in read_mesh_vertices_rgb_normal
    assert(os.path.isfile(filename))
AssertionError

So how should I fix it? By the way, here is the original link: daveredrum/Scan2Cap#11
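One quick way to narrow this down is to check which scans are missing the expected mesh file before running the export; a hedged helper, assuming the standard ScanNet release layout with *_vh_clean_2.ply meshes:

import os

SCANNET_DIR = "/path/to/scannet/scans"  # adjust to your scans folder
for scan in sorted(os.listdir(SCANNET_DIR)):
    mesh = os.path.join(SCANNET_DIR, scan, f"{scan}_vh_clean_2.ply")
    if not os.path.isfile(mesh):
        print("missing mesh:", mesh)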

How to visualize the result?

Hi, thanks for sharing this awesome work.

I noticed that you mentioned in another issue that

You can use the tools in this repo to help

but the demo.py outputs just a JSON file.

So could you give me some ideas on how to use the provided 3d-pc-box-viz repo to visualize the JSON file?

Suddenly terminates during debugging

Hello, when I debug, the program always breaks one step before the loss computation and jumps directly to do_train under main (args has been modified in main.py according to train_scannet.sh). A screenshot is attached.
Normally, it should keep running further down. Could you please tell me what's causing this? Thank you very much!

The performance gap between pretrained models and paper

Hello, @ch3cook-fdu!

Thanks for sharing your work on indoor 3D dense captioning. Recently I have tried to train Vote2Cap-DETR(++) with different configs, and I noticed that there is a slight performance gap between the metrics of my model / the pretrained model from this repo and the table results in the paper.

Take scst_Vote2Cap_DETRv2_RGB_NORMAL with SCST settings for example:

My Results:
----------------------Evaluation-----------------------
INFO: IoU@0.5 matched proposals: [1543 / 2068],
[BLEU-1] Mean: 0.6721, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.5761, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.4759, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.3892, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.7539, Max: 6.2306, Min: 0.0000
[ROUGE-L] Mean: 0.5473, Max: 0.9474, Min: 0.1015
[METEOR] Mean: 0.2638, Max: 0.5982, Min: 0.0448

Pretrained Model Results
----------------------Evaluation-----------------------
INFO: IoU@0.5 matched proposals: [1548 / 2068],
[BLEU-1] Mean: 0.6729, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.5787, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.4783, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.3916, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.7636, Max: 6.3784, Min: 0.0000
[ROUGE-L] Mean: 0.5496, Max: 1.0000, Min: 0.1015
[METEOR] Mean: 0.2641, Max: 1.0000, Min: 0.0448

and Paper Results
(screenshot of the paper's results table)

A performance gap of about 1%-2.5% exists across all the different configs and settings; I am wondering how to figure it out.

Thanks, Jiaqi

Questions about performance

Thanks for sharing your great work!

I have some questions about your paper.
There are two options for your inputs: w/o 2D and w/ 2D.
I initially thought the w/ 2D features would outperform the w/o 2D features, but that was not the case in your paper.
In Table 1 of Vote2Cap-DETR++, some metrics such as B-4, M, and R are better without 2D than with 2D.
How is this possible, and why should we use the multiview features if they do not improve performance and are also hard to extract?

In addition, 3DETR is used as the encoder/decoder of your model.
Since 3DETR does not perform as well on 3D detection benchmarks like ScanNet as some non-transformer-based architectures, can I substitute the encoder/decoder with other models, and would it perform well? For instance, the recently released V-DETR detector is based on 3DETR, so it could be another option for better performance with your model.

Question about caption evaluation

Hi,

Where can I find the pretrained weights from step 4.1 or 4.2? I have tested the weights provided on Hugging Face with step 4.3, but they all returned errors, as shown in the attached image.


Question about caption evaluation results

I use the pretrained weight "scanrefer_scst_vote2cap_detr_pp_XYZ_RGB_NORMAL.pth" and get the following result:

INFO: IoU@0.5 matched proposals: [1537 / 2068],
[BLEU-1] Mean: 0.6676, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.5745, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.4757, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.3895, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.7525, Max: 6.3784, Min: 0.0000
[ROUGE-L] Mean: 0.5467, Max: 1.0000, Min: 0.1015
[METEOR] Mean: 0.2631, Max: 1.0000, Min: 0.0448

This result is not consistent with the paper; is this normal?

Question about evaluation metric

Hi authors,

I am new to this task and want to ask a question about the metric used in 3D dense captioning, which seems a little contradictory after I checked several papers.

In your paper, the captioning metric is averaged over the number of ground-truth instances, so it does not penalize redundant bbox predictions. However, Scan2Cap and D3Net, which you put in the same Table 1, average the captioning metric by the percentage of correctly predicted bboxes; therefore, previous related work did account for redundant bbox predictions.

Is your metric unfair, or am I missing something here? I would really appreciate your help in clarifying this!
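To make the contrast concrete, a loose illustration with made-up numbers (this is not the repo's evaluation code, just the two averaging conventions described above):

# Hypothetical per-ground-truth caption scores; 0.0 means the GT object was not matched
per_gt_scores = [0.8, 0.0, 0.6, 0.0]
num_gt = len(per_gt_scores)
num_matched = sum(1 for s in per_gt_scores if s > 0)

avg_over_gt = sum(per_gt_scores) / num_gt            # averaging over all GT instances
avg_over_matched = sum(per_gt_scores) / num_matched  # averaging over matched predictions only
print(avg_over_gt, avg_over_matched)  # 0.35 vs 0.7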

Thanks for your great work! I have some questions

Dear authors, I have some questions about the lightweight caption head you proposed.
How does the lightweight caption head differ from existing captioning models in terms of architecture and computational efficiency, such that it counts as a "lightweight design"?
Hoping for your reply.

Inference on Custom DB

Dear authors,

I'd like to test your model on my custom 3D reconstructed point clouds + RGB map.
In order to directly use the checkpoints you shared, I need to extract the normal features from the 3D mesh (as your models are all trained with normal features).
However, I have no idea how to produce the normal features for my custom dataset. I understand the normal features are calculated in read_mesh_vertices_rgb_normal() from the faces of the 3D mesh, but I only have 3D point clouds and RGB, without mesh faces.
I initially tried to get the faces using the MeshLab tool (surface reconstruction - ball pivoting), and the result is as below (screenshot omitted).

Using the 3D mesh, I followed all the preprocessing steps (xyz, rgb, normal) to extract matching features and fed them to your model with the checkpoint (scanrefer_scst_vote2cap_detr_pp_XYZ_RGB_NORMAL.pth).
But the detection and captioning results were really poor, and I suspect the normal features are the issue.

Regarding this issue, can you share your insight?
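In case it is useful, here is a hedged sketch of estimating per-point normals directly from a point cloud (no mesh faces needed) using Open3D; note that Open3D is not a dependency of this repo, and whether point-cloud-estimated normals match the mesh-derived normals used in training is exactly the open question here:

import numpy as np
import open3d as o3d

xyz = np.load("my_points.npy")  # (N, 3) custom point cloud, hypothetical file
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(xyz)
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30)
)
pcd.orient_normals_consistent_tangent_plane(k=30)  # make normal directions consistent
normals = np.asarray(pcd.normals)  # (N, 3) per-point normals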
