heng-hw / spacap3d

[IJCAI 2022] Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds (official pytorch implementation)

Home Page: https://spacap3d.github.io/

Python 93.46% C 0.35% C++ 2.50% Cuda 3.68%
3d caption-generation computer-vision dense-captioning ijcai ijcai2022 natural-language-processing point-cloud pytorch scene-understanding

spacap3d's Introduction

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Official implementation of "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI 2022. ([arXiv] [project])

[teaser figure]

updates:

  • August 31, 2022: we are ranked 2nd on the Scan2Cap benchmark!
  • May 01, 2022: code is released!

[benchmark figure]

Main Results

ScanRefer

| Method | Input | C@0.5IoU | B-4@0.5IoU | M@0.5IoU | R@0.5IoU | mAP@0.5 | Model | Eval. Cmd. |
|---|---|---|---|---|---|---|---|---|
| Scan2Cap | xyz | 32.94 | 20.63 | 21.10 | 41.58 | 27.45 | - | - |
| Scan2Cap | xyz+rgb+normal | 35.20 | 22.36 | 21.44 | 43.57 | 29.13 | - | - |
| Scan2Cap | xyz+multiview+normal | 39.08 | 23.32 | 21.97 | 44.78 | 32.21 | - | - |
| Ours (base) | xyz | 40.19 (38.61*) | 24.71 | 22.01 | 45.49 | 32.32 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --late_guide --no_learnt_src_pos --folder 'SPACAP_BASE' |
| Ours | xyz | 42.53 (40.47*) | 25.02 | 22.22 | 45.65 | 34.44 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --folder 'SPACAP' |
| Ours | xyz+rgb+normal | 42.76 (39.80*) | 25.38 | 22.84 | 45.66 | 35.55 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --use_color --use_normal --folder 'SPACAP_RGB_NORMAL' |
| Ours | xyz+multiview+normal | 44.02 (42.40*) | 25.26 | 22.33 | 45.36 | 36.64 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --use_multiview --use_normal --folder 'SPACAP_MV_NORMAL' |

Nr3D/ReferIt3D

| Method | Input | C@0.5IoU | B-4@0.5IoU | M@0.5IoU | R@0.5IoU | mAP@0.5 | Model | Eval. Cmd. |
|---|---|---|---|---|---|---|---|---|
| Scan2Cap | xyz+multiview+normal | 24.10 | 15.01 | 21.01 | 47.95 | 32.21 | - | - |
| Ours (base) | xyz | 31.06 (28.55*) | 17.94 | 22.03 | 49.63 | 30.65 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --late_guide --no_learnt_src_pos --folder 'SPACAP_BASE_NR3D' --dataset ReferIt3D |
| Ours | xyz | 31.43 (29.35*) | 18.98 | 22.24 | 49.79 | 33.17 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --folder 'SPACAP_NR3D' --dataset ReferIt3D |
| Ours | xyz+rgb+normal | 33.24 (31.01*) | 19.46 | 22.61 | 50.41 | 33.23 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --use_color --use_normal --folder 'SPACAP_RGB_NORMAL_NR3D' --dataset ReferIt3D |
| Ours | xyz+multiview+normal | 33.71 (30.52*) | 19.92 | 22.61 | 50.50 | 38.11 | model | python scripts/eval.py --eval_tag 'muleval' --mul_eval --use_multiview --use_normal --folder 'SPACAP_MV_NORMAL_NR3D' --dataset ReferIt3D |

Notes:

  • * means the CIDEr score is averaged over multiple evaluation runs, since the randomness of the algorithm is large; the remaining metrics are reported for the run in which the CIDEr score is highest.
  • Ours (base): standard encoder with sinusoidal positional encoding, late-guide decoder, and no token-to-token spatial relation guidance.
  • Ours: token-to-token spatial-relation-guided encoder with learnable positional encoding and early-guide decoder.
  • To evaluate a model, put the downloaded model folder under ./outputs as ./outputs/[--folder]/model.pth and run the command in the Eval. Cmd. column (see the sketch after these notes). Evaluation takes roughly 4 hours.
  • To download all the models at once, please click here.
  • All experiments were trained on a single GeForce RTX 2080Ti GPU.
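
For concreteness, the evaluation workflow from the notes above can be scripted as follows. This is a minimal sketch assuming the checkpoint has been placed as ./outputs/[--folder]/model.pth; the folder name and command are taken from the "Ours, xyz" row of the ScanRefer table.

import os
import subprocess

folder = "SPACAP"  # downloaded model folder placed under ./outputs
ckpt = os.path.join("outputs", folder, "model.pth")
assert os.path.isfile(ckpt), "checkpoint not found: " + ckpt

# Eval. Cmd. from the ScanRefer table (Ours, xyz input); expect roughly 4 hours.
subprocess.run(
    ["python", "scripts/eval.py",
     "--eval_tag", "muleval", "--mul_eval",
     "--folder", folder],
    check=True,
)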

Installation

Please execute the following commands to create the conda environment and install PyTorch 1.6:

conda create -n spacap python=3.6.13
conda activate spacap
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2.89 -c pytorch

Install the necessary packages listed in requirements.txt:

pip install -r requirements.txt

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
export CUDA_HOME=/usr/local/cuda-10.2
python setup.py install
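
Before continuing, it can help to sanity-check the environment. The snippet below is a small sketch; the compiled extension name pointnet2._ext is an assumption based on common PointNet++ builds and may differ from what setup.py actually installs.

import torch

print(torch.__version__)          # expected: 1.6.0
print(torch.version.cuda)         # expected: 10.2
print(torch.cuda.is_available())  # must be True for training and evaluation

try:
    # Module name is an assumption; adjust it to whatever setup.py installed.
    import pointnet2._ext  # noqa: F401
    print("PointNet++ CUDA extension is importable")
except ImportError as err:
    print("PointNet++ CUDA extension not found:", err)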

Before moving on to the next step, don't forget to set CONF.PATH.BASE in lib/config.py to the project root path.
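
After editing lib/config.py, a quick check like the one below (run from the project root) confirms that CONF.PATH.BASE points to an existing directory. It is only a convenience sketch, not part of the repo.

import os

from lib.config import CONF

print(CONF.PATH.BASE)
assert os.path.isdir(CONF.PATH.BASE), "CONF.PATH.BASE does not point to an existing directory"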

Data Preparation

Download the preprocessed GloVe embeddings (~990MB) and put them under data/.
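
If you want to confirm the download is intact, a small check like the following can be used. Treating glove.p as a pickled word-to-vector mapping is an assumption about its contents.

import pickle

with open("data/glove.p", "rb") as f:
    glove = pickle.load(f)
print(type(glove).__name__, len(glove))  # assumed: one embedding per vocabulary word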

ScanRefer

  1. Download ScanRefer data HERE and unzip it under data/.

  2. Run python scripts/organize_scanrefer.py to generate the organized ScanRefer file ScanRefer_filtered_organized.json under data/.
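
A quick peek at the organized file confirms that the preprocessing ran. The exact nesting (scene id -> object id -> annotations) is an assumption carried over from Scan2Cap-style preprocessing, so only the size is printed here.

import json

with open("data/ScanRefer_filtered_organized.json") as f:
    organized = json.load(f)
print(len(organized), "top-level entries")
print("example entry/key:", next(iter(organized)))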

ReferIt3D

  1. Download ReferIt3D data (Nr3D only) HERE and put it under data/.

  2. Run python scripts/split_referit3d.py to generate nr3d_train.json and nr3d_val.json under data/.

  3. Run python scripts/organize_referit3d.py to generate the organized Nr3D file nr3d_organized.json under data/.

ScanNet

In addition to ScanRefer and ReferIt3D, you also need to access the original ScanNet dataset to get the scene data.

  1. Follow the instructions listed HERE to get the ScanNet data. After this step, there should be folders containing the ScanNet scene data under data/scannet/scans/ with names like scene0000_00.

  2. Pre-process the ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB of free space is needed for this step:

    cd data/scannet/
    python batch_load_scannet_data.py

    After this step, you can check whether the processed scene data are valid by running:

    python visualize.py --scene_id scene0000_00

    Then check the generated *.obj file under data/scannet/scannet_data.

  3. To further generate the axis-aligned mesh file [scene_id]_axis_aligned.ply under data/scannet/scans/[scene_id] for visualization:

    cd data/scannet/
    python align_axis.py
  4. (Optional) To use pretrained 2D multiview features as input:

    a. Download the pretrained ENet weights (1.4MB) and put them under data/.

    b. Download and unzip the extracted ScanNet frames (~13GB) under data/.

    c. Extract the ENet features:

    python scripts/compute_multiview_features.py

    d. Project the ENet features from ScanNet frames to point clouds; you need ~36GB to store the generated HDF5 database enet_feats_maxpool.hdf5 under data/scannet/scannet_data/:

    python scripts/project_multiview_features.py --maxpool

    You can check whether the projections make sense by projecting the semantic labels from the images onto the target point cloud:

    python scripts/project_multiview_labels.py --scene_id scene0000_00 --maxpool

    The projection will be saved under outputs/projections as scene0000_00.ply.
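
After step 4, you can also open the generated HDF5 database directly. The sketch below assumes h5py is installed and that the database is keyed per scene; both are assumptions rather than something documented above.

import h5py

with h5py.File("data/scannet/scannet_data/enet_feats_maxpool.hdf5", "r") as db:
    keys = list(db.keys())
    print(len(keys), "entries, e.g.", keys[:3])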

Relative Spatiality Label Generation

To equip the learning with token-to-token spatial relationship guidance, we need to generate the ground-truth spatiality labels for each scene in the train split. The relative spatiality labels along the three axes will be stored under data/scannet/scannet_data as [scene_id]_x.npy, [scene_id]_y.npy, and [scene_id]_z.npy after running the following scripts:

cd data/scannet/
python generate_spatiality_label.py --dataset 'scanrefer' --split 'train' --verbose
python generate_spatiality_label.py --dataset 'nr3d' --split 'train' --verbose

You can also check whether the relation labels along --axis for a scene --scene_id are valid by visualizing them:

python generate_spatiality_label.py --visualize --scene_id 'scene0011_00' --axis x --savefig --verbose

Note that the --savefig flag saves the visualization as ./scans/[--scene_id]/[--scene_id]_[--axis].png (see the example).
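
To inspect a generated label programmatically, something like the following works. Interpreting the array as a discrete object-to-object relation encoding is an assumption; the actual shape and values may differ.

import numpy as np

rel_x = np.load("data/scannet/scannet_data/scene0011_00_x.npy")
print(rel_x.shape, rel_x.dtype)
print(np.unique(rel_x))  # assumed: a small set of discrete relation labels along the x axis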

After data preparation, the dataset files are structured as follows (a small check script is sketched after the tree).

SpaCap
├── data
│   ├── ScanRefer_filtered_train.txt
│   ├── ScanRefer_filtered_val.txt
│   ├── ScanRefer_filtered.json 
│   ├── ScanRefer_filtered_train.json
│   ├── ScanRefer_filtered_val.json
│   ├── ScanRefer_filtered_organized.json
│   ├── nr3d.csv
│   ├── nr3d_train.json
│   ├── nr3d_val.json
│   ├── nr3d_organized.json 
│   ├── glove.p 
│   ├── scannet
│   │   ├── scans
│   │   │   ├── [scene_id]
│   │   │   │   ├── [scene_id]_vh_clean_2.ply & [scene_id].aggregation.json & [scene_id]_vh_clean_2.0.010000.segs.json & [scene_id].txt & [scene_id]_axis_aligned.ply
│   │   ├── scannet_data
│   │   │   ├── enet_feats_maxpool.hdf5 (optional if you do not use --use_multiview)
│   │   │   ├── [scene_id]_aligned_bbox.npy & [scene_id]_aligned_vert.npy & [scene_id]_bbox.npy & [scene_id]_vert.npy & [scene_id]_ins_label.npy & [scene_id]_sem_label.npy & [scene_id]_x.npy & [scene_id]_y.npy & [scene_id]_z.npy
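
To double-check the layout above, a small script like the following can be run from the project root. The scene id is just an example and the file list mirrors the tree shown above.

import os

scene_id = "scene0000_00"  # example scene id
expected = [
    "data/glove.p",
    "data/ScanRefer_filtered_organized.json",
    f"data/scannet/scans/{scene_id}/{scene_id}_vh_clean_2.ply",
    f"data/scannet/scannet_data/{scene_id}_aligned_bbox.npy",
    f"data/scannet/scannet_data/{scene_id}_aligned_vert.npy",
    f"data/scannet/scannet_data/{scene_id}_x.npy",
]
for path in expected:
    print("OK     " if os.path.isfile(path) else "MISSING", path)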

Usage

Training

To train our model with xyz as input (Training time: ~33h 22m):

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --tag 'spacap'

To train our model with xyz+rgb+normal as input (Training time: ~33h 47m):

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --tag 'spacap_rgb_normal' --use_color --use_normal

To train our model with xyz+multiview+normal as input (Training time: ~39h 40m):

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --tag 'spacap_mv_normal' --use_multiview --use_normal

Note: the increased training time is mainly due to the time spent fetching the pretrained multiview features.

To train our base model, Ours (base), with xyz as input (Training time: ~31h 14m):

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --tag 'spacap_base' --late_guide --no_relation --no_learnt_src_pos

Note that, unless specified otherwise, the scripts above train models on the ScanRefer dataset by default. To train a model on ReferIt3D (Nr3D), add the flag --dataset ReferIt3D.

The trained model as well as the intermediate results will be dumped into outputs/<output_folder>, where <output_folder> is timestamp_[--tag] (e.g., 2022-04-20_11-59-59_SPACAP).
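
Since each run folder is prefixed with a timestamp, a small helper like this can locate the most recent run; it is a convenience sketch, and the assumption that the tag appears uppercased in the folder name is taken from the example above.

import glob

runs = sorted(glob.glob("outputs/*_SPACAP"))  # adjust the suffix to your --tag
print(runs[-1] if runs else "no matching runs yet")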

Evaluation

To evaluate the model (at 0.5 IoU) multiple times and report the best CIDEr score, run the following script, changing <output_folder> accordingly:

CUDA_VISIBLE_DEVICES=0 python scripts/eval.py --eval_tag 'muleval'  --mul_eval --folder <output_folder> 

Specific evaluation commands for the different model settings are provided in the Eval. Cmd. column of the results tables above.

Visualization

To visualize the predicted bounding box and caption for each object, please run the following script:

CUDA_VISIBLE_DEVICES=0 python scripts/eval.py --eval_tag 'vis' --seed 25 --eval_visualize --folder <output_folder> --nodryrun

A folder vis/ will be created under outputs/<output_folder>. The predicted caption for each test scene is saved as [scene_id]/predictions.json under vis/, and each predicted object bounding box is saved as [scene_id]/pred-[obj_id]-[obj_name].ply.

Note that --seed can be any number, e.g., the seed with which your model achieves the highest CIDEr score.
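
To read the dumped captions back for a quick qualitative look, something like the sketch below can be used. The internal schema of predictions.json is an assumption, so adapt the access pattern to what the file actually contains.

import glob
import json

# Replace <output_folder> with your actual run folder under outputs/.
for path in sorted(glob.glob("outputs/<output_folder>/vis/*/predictions.json"))[:3]:
    with open(path) as f:
        preds = json.load(f)
    print(path, "->", type(preds).__name__, len(preds))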

Citation

If you find our work helpful in your research, please kindly cite our paper via:

@inproceedings{SpaCap3D,
    title={Spatiality-guided Transformer for 3{D} Dense Captioning on Point Clouds},
    author={Wang, Heng and Zhang, Chaoyi and Yu, Jianhui and Cai, Weidong},
    booktitle={Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, {IJCAI-22}},
    year={2022}
}

Acknowledgement

This repo is built mainly upon Scan2Cap. We also borrow code from annotated-transformer for the basic Transformer building blocks.

Contact

If you have any questions or suggestions about this repo, please feel free to contact me! ([email protected])

spacap3d's People

Contributors

heng-hw


spacap3d's Issues

Training results are much lower

I trained on the ScanNet dataset with xyz as input, but the training results are much lower. What might be the cause?
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --tag 'spacap'

glove embedding not used in the code?

I've noticed that the Transformer captioner does not use the GloVe embeddings as pretrained word embeddings. Would this hurt the captioner's performance?

training time

Hi, I would like to ask why GitHub shows more than 30 hours of training time, while it takes more than 70 hours when I download and reproduce the code. A Tesla P100 (PCIe, 16 GB) was used.
Thank you very much.

About the loss curve

Hello, thanks for your work and code!

Why are the loss curves for x and y unstable while the one for z is stable?
Did this happen during your training?

Errors encountered when evaluating on Nr3D

When evaluating SpaCap3D on Nr3D with the provided checkpoint, I encounter the following error, which implies that the vocabulary size differs from the checkpoint's.

RuntimeError: Error(s) in loading state_dict for SpaCapNet:
        size mismatch for caption.model.tgt_embed.0.lut.weight: copying a param with shape torch.Size([2862, 128]) from checkpoint, the shape in current model is torch.Size([2869, 128]).
        size mismatch for caption.model.generator.proj.weight: copying a param with shape torch.Size([2862, 128]) from checkpoint, the shape in current model is torch.Size([2869, 128]).
        size mismatch for caption.model.generator.proj.bias: copying a param with shape torch.Size([2862]) from checkpoint, the shape in current model is torch.Size([2869]).

Could you please provide the vocabulary JSON or the prediction JSON file?

Evaluation Results

I've been evaluating the rgb+normal checkpoint you provided, but the results are low. Is this normal?

python scripts/eval.py --eval_tag 'muleval' --use_color --use_normal --folder 'SPACAP_RGB_NORMAL' --min_iou 0.25 --eval_caption

[BLEU-1] Mean: 0.6417, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.5214, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.4093, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.3211, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.5393, Max: 7.1255, Min: 0.0000
[ROUGE-L] Mean: 0.5268, Max: 1.0000, Min: 0.1015
[METEOR] Mean: 0.2585, Max: 1.0000, Min: 0.0448

But the results of multiview+normal are pretty high:

python scripts/eval.py --eval_tag 'muleval' --use_multiview --use_normal --folder 'SPACAP_MV_NORMAL' --min_iou 0.25 --eval_caption

[BLEU-1] Mean: 0.6688, Max: 1.0000, Min: 0.0000
[BLEU-2] Mean: 0.5604, Max: 1.0000, Min: 0.0000
[BLEU-3] Mean: 0.4541, Max: 1.0000, Min: 0.0000
[BLEU-4] Mean: 0.3665, Max: 1.0000, Min: 0.0000
[CIDEr] Mean: 0.6350, Max: 5.0265, Min: 0.0000
[ROUGE-L] Mean: 0.5601, Max: 1.0000, Min: 0.1015
[METEOR] Mean: 0.2684, Max: 1.0000, Min: 0.0448
