andytianph / tgrs_pcvit

Official implementation for [TGRS'24] "PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery"

License: GNU General Public License v3.0

PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery

This branch contains the official PyTorch implementation of PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery [TGRS'24].

Updates

2024.3.5

The code for PCViT has been released. The weights and logs will be uploaded soon.

Introduction

This repository contains the code, models, and test results for the paper "PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery".

Fig. 1: **The structure of the baseline of the proposed PCViT.** The proposed backbone forms a multiscale pyramid with three scale stages. The first two stages consist of convolutional blocks, and the final stage consists of transformer blocks, which we refine with the PCM and LGKA modules. The multiscale features produced by the backbone are then fed into the FRPN neck, which facilitates contextual information interaction before the features are passed to the detection head.
Fig. 2: **The pipeline of the proposed MPP.** During pretraining, K masked perspectives of each image are randomly sampled in a mini-batch with MPM. They are then fed to the encoder and decoder, which reconstruct the invisible (masked) content against the targets.
Fig. 3: **Local/Global k-NN Attention.** In each group of three transformer subblocks, the first two layers use local attention, i.e. 16 × 16 window attention, which reduces computational complexity. To propagate information between windows, the third layer uses global attention.
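A minimal sketch of the sampling step, assuming MAE-style random masking: each of the K perspectives is an independent random split of the patch indices into visible and masked sets at the 75% mask ratio used by the pretraining command. The function name `sample_masked_views` is hypothetical, and the 14 × 14 grid assumes the 224 × 224 input and 16 × 16 patches implied by the model name; the repository's MPM may sample perspectives differently.

```python
import random


def sample_masked_views(num_patches, k, mask_ratio=0.75, seed=None):
    """Return k independent (visible, masked) index splits for one image.

    Hypothetical MAE-style random masking: shuffle the patch indices and keep
    the first (1 - mask_ratio) fraction as visible; the remaining patches are
    masked and serve as reconstruction targets.
    """
    rng = random.Random(seed)
    num_visible = int(num_patches * (1 - mask_ratio))
    views = []
    for _ in range(k):
        order = list(range(num_patches))
        rng.shuffle(order)
        visible = sorted(order[:num_visible])
        masked = sorted(order[num_visible:])
        views.append((visible, masked))
    return views


# A 224 x 224 input cut into 16 x 16 patches gives a 14 x 14 grid (196 patches).
views = sample_masked_views(num_patches=14 * 14, k=2, mask_ratio=0.75, seed=0)
for visible, masked in views:
    print(len(visible), len(masked))  # 49 147
```

Because every perspective draws its own permutation, the K views of one image generally expose different visible patches, which is what lets them act as distinct "perspectives" in the same mini-batch.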
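The layer schedule in the caption can be sketched as follows. We read "k-NN attention" as keeping only the top-k attention scores per query (as in KVT); that reading, the helper names, and the cost model (every token attends to exactly one full 16 × 16 window, so H and W are assumed divisible by 16) are our assumptions, not taken from the repository.

```python
import math


def attention_schedule(depth, group_size=3):
    """Label each layer 'local' or 'global': the last layer of every group is global."""
    return ["global" if (i + 1) % group_size == 0 else "local" for i in range(depth)]


def knn_attention(query, keys, values, k):
    """Single-query k-NN attention: softmax over the k highest-scoring keys only."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest scores; every other key gets zero weight.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(scores[i] - m) for i in top}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] / total * values[i][d] for i in top) for d in range(dim)]


def pairs_per_layer(h, w, window=16):
    """Query-key pairs per layer for global vs. non-overlapping window attention."""
    n = h * w
    return {"global": n * n, "local": n * window * window}


print(attention_schedule(6))  # ['local', 'local', 'global', 'local', 'local', 'global']
print(pairs_per_layer(64, 64))  # {'global': 16777216, 'local': 1048576}
```

For a 64 × 64 token map, window attention needs 16× fewer query-key pairs than global attention, and the gap grows with resolution (the ratio is H·W / 256), which is why only one layer per group pays the global cost.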

Results and Models

MillionAID

The models are pretrained on 4 × RTX 3090 GPUs with 32 images per GPU, which makes a total batch size of 128.

| Pretrain | Backbone | Input size | Params (M) | Pretrained model |
|---|---|---|---|---|
| MPP | PCViT | 224 × 224 | 112 | Weights / Baidu Cloud |

Results from this repo on DIOR

The models are trained on 2 × RTX 3090 GPUs with 2 images per GPU, which makes a total batch size of 4.

| Model | Pretrain | Machine | Framework | Box mAP@50 (%) | Config | Log | Weights |
|---|---|---|---|---|---|---|---|
| PCViT | MPP | GPU | Faster R-CNN | 80.25 | config | log | Weights / Baidu Cloud |

Usage

Environment:

  • Python 3.8.5
  • PyTorch 1.9.0+cu111
  • torchvision 0.10.0+cu111
  • timm 0.4.12
  • mmcv-full 1.3.9
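A small, hypothetical helper for checking an installed environment against the versions above, using only the standard library (`importlib.metadata`, available from Python 3.8). The finetuning instructions pin `timm==0.4.9`, so adjust `EXPECTED` to whichever stage you are setting up.

```python
from importlib import metadata

# Versions listed above, keyed by their PyPI distribution names.
EXPECTED = {
    "torch": "1.9.0+cu111",
    "torchvision": "0.10.0+cu111",
    "timm": "0.4.12",
    "mmcv-full": "1.3.9",
}


def check_environment(expected):
    """Map each package to (status, expected_version, installed_version)."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = ("missing", want, None)
            continue
        status = "ok" if have == want else "mismatch"
        report[name] = (status, want, have)
    return report


if __name__ == "__main__":
    for name, (status, want, have) in check_environment(EXPECTED).items():
        print(f"{name:12s} {status:9s} expected={want} installed={have}")
```

Exact string comparison is deliberate here: the `+cu111` local version tag matters, because a CPU-only or different-CUDA build of the same PyTorch release will not match the compiled mmcv-full ops.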

Pretrain (4 × 3090 GPUs, 1 week)

  1. Prepare MillionAID: download the MillionAID dataset, then record the image names and revise the corresponding pretraining code accordingly.

  2. To pretrain PCViT with distributed training, run the following on 1 node with 4 GPUs (only a mask ratio of 75% is supported; total batch size 128 = 4 × 32):

python -m torch.distributed.launch --nproc_per_node 4 main_pretrain.py \
--batch_size 32 --model fastconvmae_convvitae_base_patch16 \
--norm_pix_loss --mask_ratio 0.75 --epochs 100 \
--warmup_epochs 20 --blr 6.0e-4 --weight_decay 0.05
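Assuming `main_pretrain.py` follows MAE's convention, `--blr` is a base learning rate that is scaled linearly by the effective batch size: lr = blr × (batch_size × num_gpus × accum_iter) / 256. We have not verified this in the repository; the hypothetical helper below only illustrates the arithmetic for the command above.

```python
def effective_lr(blr, batch_per_gpu, num_gpus, accum_iter=1):
    """Absolute learning rate under MAE's linear scaling rule (lr = blr * eff_bs / 256)."""
    eff_batch_size = batch_per_gpu * num_gpus * accum_iter
    return blr * eff_batch_size / 256, eff_batch_size


# Values from the pretraining command: --blr 6.0e-4, --batch_size 32, 4 GPUs.
lr, eff_bs = effective_lr(blr=6.0e-4, batch_per_gpu=32, num_gpus=4)
print(f"effective batch size = {eff_bs}, lr = {lr:.1e}")
```

Under this rule the command trains with an effective batch size of 128 and an absolute learning rate of 3.0e-4. If you change the number of GPUs, keeping `--blr` fixed lets the absolute rate follow the batch size automatically.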

Note: before finetuning, pad the convolutional kernels of the PCM in the pretrained PCViT weights with convertK1toK3.py.
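We have not inspected convertK1toK3.py; the sketch below assumes it performs the standard conversion, zero-padding each 1 × 1 kernel to 3 × 3 with the original weight at the centre. With stride 1 and padding 1, the padded kernel produces exactly the same output as the original 1 × 1 kernel with padding 0, so finetuning starts from an equivalent model. Both helpers are hypothetical and use nested lists instead of tensors; in PyTorch the padding step corresponds to `torch.nn.functional.pad(w, (1, 1, 1, 1))` on a `[out, in, 1, 1]` weight.

```python
def pad_k1_to_k3(weight):
    """Zero-pad a conv weight of shape [out][in][1][1] to [out][in][3][3]."""
    padded = []
    for out_ch in weight:
        new_out = []
        for kernel in out_ch:
            assert len(kernel) == 1 and len(kernel[0]) == 1, "expected a 1x1 kernel"
            k3 = [[0.0] * 3 for _ in range(3)]
            k3[1][1] = kernel[0][0]  # original weight goes to the centre tap
            new_out.append(k3)
        padded.append(new_out)
    return padded


def conv2d_single(image, kernel, pad):
    """Naive single-channel 2D cross-correlation, stride 1, zero padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = image[i][j]
    out_h, out_w = h + 2 * pad - kh + 1, w + 2 * pad - kw + 1
    return [
        [sum(kernel[a][b] * padded[i + a][j + b] for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


# One output channel, one input channel, 1x1 kernel with weight 2.0.
w1 = [[[[2.0]]]]
w3 = pad_k1_to_k3(w1)
img = [[1.0, 2.0], [3.0, 4.0]]
assert conv2d_single(img, w1[0][0], pad=0) == conv2d_single(img, w3[0][0], pad=1)
```

The equivalence only holds at the moment of conversion; once finetuning updates the eight new taps, the 3 × 3 kernel can start using spatial context that the 1 × 1 kernel could not.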

Finetune

We use PyTorch 1.9.0 (or NGC Docker 21.06) and mmcv 1.3.9 for the experiments.

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/andytianph/TGRS_PCViT.git
cd TGRS_PCViT/finetune
pip install -v -e .

After installing the two repositories, install timm and einops:

pip install timm==0.4.9 einops

Download the pretrained models from MAE, ViTAE, or PCViT, then run the experiments with:

# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH>

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch
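The `--cfg-options model.pretrained=<Pretrained PATH>` argument overrides a single key of the nested mmcv config. The simplified, hypothetical helper below illustrates that effect on a plain dict; mmcv itself does this with `Config.merge_from_dict` and its `DictAction` argument parser, which additionally handle value parsing and list keys. The config contents and path are placeholders.

```python
import copy


def merge_cfg_options(cfg, options):
    """Apply dotted-key overrides such as {'model.pretrained': 'x.pth'} to a nested dict.

    Returns a new dict and leaves `cfg` untouched; missing intermediate dicts are created.
    """
    merged = copy.deepcopy(cfg)
    for dotted_key, value in options.items():
        node = merged
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return merged


base = {"model": {"type": "FasterRCNN", "pretrained": None}}
cfg = merge_cfg_options(base, {"model.pretrained": "/path/to/pcvit_pretrained.pth"})
print(cfg["model"]["pretrained"])  # /path/to/pcvit_pretrained.pth
```

Only the named leaf changes; sibling keys such as `model.type` keep their values from the config file, which is why the same config can be reused with MAE, ViTAE, or PCViT weights.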

Citation Details

If you find this code helpful, please kindly cite:

@ARTICLE{10417056,
  author={Li, Jiaojiao and Tian, Penghao and Song, Rui and Xu, Haitao and Li, Yunsong and Du, Qian},
  journal={IEEE Transactions on Geoscience and Remote Sensing}, 
  title={PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery}, 
  year={2024},
  volume={62},
  number={},
  pages={1-15},
  keywords={Transformers;Feature extraction;Task analysis;Object detection;Detectors;Nickel;Semantics;Convolutional neural network (CNN);feature pyramid network (FPN);multiscale object detection;remote-sensing images (RSIs);vision transformer (ViT)},
  doi={10.1109/TGRS.2024.3360456}}

Acknowledgements

We acknowledge the excellent implementations of mmdetection, MAE, and Remote-Sensing-RVSA.
