
repvgg's People

Contributors

dingxiaoh, lmk123568


repvgg's Issues

Inference environment

Hi,
Thanks for the great work and for sharing the code.

  1. Is the reported performance measured on a model compiled to TensorRT, or on native PyTorch?
  2. Do you have a latency comparison (batch size of 1), rather than the throughput comparison where you dominate?

The outputs of the RepVGG model (training mode) and the converted RepVGG model (deploy mode) are not the same.

I created a training RepVGGNet_A0 by calling this interface:
rep_vgg_a0_training = create_RepVGG_A0()

And I created a deploy RepVGGNet_A0 from the training RepVGGNet_A0 by calling this interface:
rep_vgg_a0_deploy = repvgg_model_convert(rep_vgg_a0_training, create_RepVGG_A0)

For the same input tensor:
in_tensors = torch.rand([args.batch_size, args.in_channels, args.height, args.width])

I regretfully found that the outputs of rep_vgg_a0_training and rep_vgg_a0_deploy are not the same.
Anything wrong with my code?
How can I get a correct converted deploy model?

Looking forward to your kind reply!
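
A likely explanation (not confirmed by the author): if the training-mode model is left in train mode, its BatchNorm layers use per-batch statistics, so its outputs will not match the converted model even when the conversion itself is correct. A minimal check, reusing the names from this issue:

rep_vgg_a0_training.eval()   # BN must use running statistics, not batch statistics
rep_vgg_a0_deploy.eval()
with torch.no_grad():
    y_train = rep_vgg_a0_training(in_tensors)
    y_deploy = rep_vgg_a0_deploy(in_tensors)
print((y_train - y_deploy).abs().max())  # expected to be tiny float error, around 1e-5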

weights

Hello, thank you very much for the work done by you and your team. I noticed that the provided pre-trained models come in many versions, and I don't know what the different names mean. Could I add you on WeChat or QQ for advice, if convenient? Thank you.

deploy model and training model inference value check error

I used the provided code for a model-output equivalence test, but it failed.
Here is my code; is there any error in it?


from repvgg import repvgg_model_convert, create_RepVGG_A0
import torch
import time
import numpy as np

def model_equivalence(model_1,
                      model_2,
                      device,
                      rtol=1e-05,
                      atol=1e-08,
                      num_tests=100,
                      input_size=(1, 3, 32, 32)):

    model_1.to(device)
    model_2.to(device)

    for _ in range(num_tests):
        x = torch.rand(size=input_size).to(device)
        y1 = model_1(x).detach().cpu().numpy()
        y2 = model_2(x).detach().cpu().numpy()
        if not np.allclose(a=y1, b=y2, rtol=rtol, atol=atol, equal_nan=False):
            print("Model equivalence test sample failed: ")
            print(y1)
            print(y2)
            return False
    return True

def measure_inference_latency(model,
                              device,
                              input_size=(1, 3, 32, 32),
                              num_samples=100):

    model.to(device)
    model.eval()

    x = torch.rand(size=input_size).to(device)

    start_time = time.time()
    for _ in range(num_samples):
        _ = model(x)
    end_time = time.time()
    elapsed_time = end_time - start_time
    elapsed_time_ave = elapsed_time / num_samples

    return elapsed_time_ave

if __name__ == "__main__":
    RepVGG_A0 = create_RepVGG_A0(deploy=False)
    RepVGG_A0.load_state_dict(torch.load('RepVGG-A0-train.pth'))  # or train from scratch
    # do whatever you want with train_model
    RepVGG_A0_deploy = repvgg_model_convert(RepVGG_A0, create_RepVGG_A0, save_path='RepVGG_A0_deploy.pth')
    print(model_equivalence(RepVGG_A0, RepVGG_A0_deploy, torch.device("cpu:0"),
                            rtol=1e-03, atol=1e-06,
                            num_tests=100, input_size=(1, 3, 224, 224)))
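
A hedged observation about the test above: model_equivalence never calls eval() on either model, so the un-fused model's BatchNorm layers run in training mode and use batch statistics, which alone can explain the failure. A variant of the test with that fixed:

import numpy as np
import torch

def model_equivalence_eval(model_1, model_2, device,
                           rtol=1e-03, atol=1e-05,
                           num_tests=10, input_size=(1, 3, 224, 224)):
    # Same comparison as above, but with both models in inference mode.
    model_1.to(device).eval()
    model_2.to(device).eval()
    with torch.no_grad():
        for _ in range(num_tests):
            x = torch.rand(size=input_size, device=device)
            y1 = model_1(x).cpu().numpy()
            y2 = model_2(x).cpu().numpy()
            if not np.allclose(y1, y2, rtol=rtol, atol=atol):
                return False
    return True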


Ask for help: lr_scheduler has multiple periods? Maybe a bug in local parallel training

Dear DingXiaoH, thanks for your innovative paper.
Could you help me with this question? Thanks!
lr_scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=args.epochs * IMAGENET_TRAINSET_SIZE // args.batch_size // ngpus_per_node)

In my local multi-GPU test without distributed training, batch_size in the original code covers all GPUs, so when we use n GPUs the cosine LR goes through n period changes (0 to PI being one period).
Is such a multi-period cosine LR normal?

thanks in advance!!
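
A hedged sketch of the fix, assuming single-process (DataParallel-style) training where every optimizer step consumes the full global batch: T_max should then count all steps, without the per-GPU division used by the distributed script.

# IMAGENET_TRAINSET_SIZE, args, optimizer as in the training script above.
iters_per_epoch = IMAGENET_TRAINSET_SIZE // args.batch_size  # one step per global batch
lr_scheduler = CosineAnnealingLR(optimizer=optimizer,
                                 T_max=args.epochs * iters_per_epoch)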

Plug-in version implementation

Hi @DingXiaoH, this is a simple and intuitive implementation!!! I implemented a plug-in version of RepVGGBlock. I hope it helps you and others.

This plug-in version implements the following functions:

  1. The training model and the test model are separated;
  2. You can apply RepVGGBlock to other models;
  3. You can use RepVGGBlock and ACBlock together in training, in either order.

Framework

Implementation and test files are as follows:

Key implementation

The training and testing models are separated by the insert and fuse functions:

####### conv_helper
def insert_repvgg_block(model: nn.Module):
    items = list(model.named_children())
    idx = 0
    while idx < len(items):
        name, module = items[idx]
        if isinstance(module, nn.Conv2d) and module.kernel_size[0] > 1:
            # Replace the standard convolution with a RepVGGBlock
            in_channels = module.in_channels
            out_channels = module.out_channels
            kernel_size = module.kernel_size
            stride = module.stride
            padding = module.padding
            dilation = module.dilation
            groups = module.groups
            padding_mode = module.padding_mode

            repvgg_block = RepVGGBlock(in_channels,
                                       out_channels,
                                       kernel_size[0],
                                       stride[0],
                                       padding=padding[0],
                                       padding_mode=padding_mode,
                                       dilation=dilation,
                                       groups=groups)
            model.add_module(name, repvgg_block)
            # If the conv layer is followed by a BN layer, remove that BN layer
            # (see [About BN layer #35](https://github.com/DingXiaoH/ACNet/issues/35))
            if (idx + 1) < len(items) and isinstance(items[idx + 1][1], nn.BatchNorm2d):
                new_layer = nn.Identity()
                model.add_module(items[idx + 1][0], new_layer)
        else:
            insert_repvgg_block(module)
        idx += 1


def fuse_repvgg_block(model: nn.Module):
    for name, module in model.named_children():
        if isinstance(module, RepVGGBlock):
            # Replace the RepVGGBlock with a standard convolution
            kernel, bias = get_equivalent_kernel_bias(module.rbr_dense,
                                                      module.rbr_1x1,
                                                      module.rbr_identity,
                                                      module.in_channels,
                                                      module.groups,
                                                      module.padding)
            # Create a new standard convolution, assign the fused weight and bias, then re-insert it into the model
            fused_conv = nn.Conv2d(module.in_channels,
                                   module.out_channels,
                                   module.kernel_size,
                                   stride=module.stride,
                                   padding=module.padding,
                                   dilation=module.dilation,
                                   groups=module.groups,
                                   padding_mode=module.padding_mode,
                                   bias=True
                                   )
            fused_conv.weight = nn.Parameter(kernel.detach().cpu())
            fused_conv.bias = nn.Parameter(bias.detach().cpu())
            model.add_module(name, fused_conv)
        else:
            fuse_repvgg_block(module)

I modified the specific fusion function so that ACBlock and RepVGGBlock can be used in one training run, and so that the block can be inserted into other models with different conv kernel sizes.

################ repvgg_block.py
# -*- coding: utf-8 -*-

"""
@date: 2021/2/2 8:32 PM
@file: repvgg_block.py
@author: zj
@description: 
"""

import torch.nn as nn


def conv_bn(in_channels, out_channels, kernel_size, stride, padding, groups=1):
    result = nn.Sequential()
    result.add_module('conv', nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                                        kernel_size=kernel_size, stride=stride, padding=padding, groups=groups,
                                        bias=False))
    result.add_module('bn', nn.BatchNorm2d(num_features=out_channels))
    return result


class RepVGGBlock(nn.Module):

    def __init__(self, in_channels, out_channels, kernel_size,
                 stride=1, padding=0, dilation=1, groups=1, padding_mode='zeros'):
        super(RepVGGBlock, self).__init__()

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.padding_mode = padding_mode

        # assert kernel_size == 3                      # Commented out so that the block can be inserted into other models
        # assert padding == 1

        padding_11 = padding - kernel_size // 2

        self.rbr_identity = nn.BatchNorm2d(
            num_features=in_channels) if out_channels == in_channels and stride == 1 else None
        self.rbr_dense = conv_bn(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size,
                                 stride=stride, padding=padding, groups=groups)
        self.rbr_1x1 = conv_bn(in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=stride,
                               padding=padding_11, groups=groups)

        self._init_weights()

    def _init_weights(self, gamma=0.01):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, gamma)
                nn.init.constant_(m.bias, gamma)

    def forward(self, inputs):
        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)

        return self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out

    def repvgg_convert(self):
        kernel, bias = self.get_equivalent_kernel_bias()
        return kernel.detach().cpu().numpy(), bias.detach().cpu().numpy(),
############## repvgg_util.py
# -*- coding: utf-8 -*-

"""
@date: 2021/2/2 8:51 PM
@file: repvgg_util.py
@author: zj
@description: 
"""

import torch
import torch.nn as nn
import numpy as np


#   This func derives the equivalent kernel and bias in a DIFFERENTIABLE way.
#   You can get the equivalent kernel and bias at any time and do whatever you want,
#   for example, apply some penalties or constraints during training, just like you do to the other models.
#   May be useful for quantization or pruning.
def get_equivalent_kernel_bias(rbr_dense, rbr_1x1, rbr_identity, in_channels, groups, padding_11):
    kernel3x3, bias3x3 = _fuse_bn_tensor(rbr_dense, in_channels, groups)
    kernel1x1, bias1x1 = _fuse_bn_tensor(rbr_1x1, in_channels, groups)
    kernelid, biasid = _fuse_bn_tensor(rbr_identity, in_channels, groups)
    return kernel3x3 + _pad_1x1_to_3x3_tensor(kernel1x1, padding_11) + kernelid, bias3x3 + bias1x1 + biasid


def _pad_1x1_to_3x3_tensor(kernel1x1, padding_11=1):  # ---------------> padding the 1x1 kernel by padding_11 makes it match the 3x3 kernel
    if kernel1x1 is None:
        return 0
    else:
        # return torch.nn.functional.pad(kernel1x1, [1, 1, 1, 1])
        return torch.nn.functional.pad(kernel1x1, [padding_11] * 4)


def _fuse_bn_tensor(branch, in_channels, groups):
    if branch is None:
        return 0, 0
    if isinstance(branch, nn.Sequential):
        layer_list = list(branch)
        if len(layer_list) == 2 and isinstance(layer_list[1], nn.Identity):
            # conv/bn have already been fused inside the ACBlock
            return branch.conv.weight, branch.conv.bias
        kernel = branch.conv.weight
        running_mean = branch.bn.running_mean
        running_var = branch.bn.running_var
        gamma = branch.bn.weight
        beta = branch.bn.bias
        eps = branch.bn.eps
    else:
        assert isinstance(branch, nn.BatchNorm2d)
        input_dim = in_channels // groups
        kernel_value = np.zeros((in_channels, input_dim, 3, 3), dtype=np.float32)
        for i in range(in_channels):
            kernel_value[i, i % input_dim, 1, 1] = 1

        kernel = torch.from_numpy(kernel_value).to(branch.weight.device)
        running_mean = branch.running_mean
        running_var = branch.running_var
        gamma = branch.weight
        beta = branch.bias
        eps = branch.eps
    std = (running_var + eps).sqrt()
    t = (gamma / std).reshape(-1, 1, 1, 1)
    return kernel * t, beta - running_mean * gamma / std
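
For reference, the identity implemented by _fuse_bn_tensor, with BN running statistics \mu, \sigma^2, affine parameters \gamma, \beta, and \mathrm{std} = \sqrt{\sigma^2 + \epsilon}:

\mathrm{BN}(W * x) = \gamma \cdot \frac{W * x - \mu}{\mathrm{std}} + \beta = \left(\frac{\gamma}{\mathrm{std}}\,W\right) * x + \left(\beta - \frac{\gamma \mu}{\mathrm{std}}\right)

so the fused kernel is kernel * t with t = \gamma / \mathrm{std}, and the fused bias is beta - running_mean * gamma / std, exactly as returned above.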

About Test

I noticed that the accuracy-matching test in the repository can be further improved:

################## origin
print(((train_y - deploy_y) ** 2).sum())    # Will be around 1e-10
################## mine
print(torch.sqrt(torch.sum((train_outputs - eval_outputs) ** 2)))
print(torch.allclose(train_outputs, eval_outputs, atol=1e-8))
assert torch.allclose(train_outputs, eval_outputs, atol=1e-8)

how to use

You can create the model as usual, then insert ACBlock or RepVGGBlock, individually or together, in any order:

...
    if cfg.MODEL.CONV.ADD_BLOCKS is not None:
        assert isinstance(cfg.MODEL.CONV.ADD_BLOCKS, tuple)
        for add_block in cfg.MODEL.CONV.ADD_BLOCKS:
            if add_block == 'RepVGGBlock':
                insert_repvgg_block(model)
            if add_block == 'ACBlock':
                insert_acblock(model)
...

Then carry out normal training and save the model parameters as usual. If you want to fuse the ACBlocks, use fuse_acblock; to fuse the RepVGGBlocks, use fuse_repvgg_block. Note: the order of insertion must be the opposite of the order of fusion:

insert_acblock -> insert_repvgg_block .... fuse_repvgg_block -> fuse_acblock
or
insert_repvgg_block -> insert_acblock .... fuse_acblock -> fuse_repvgg_block

The complete implementation is available in ZJCV/ZCls.

Why not compare with MobileNetV2/V3?

Dear author:
Thanks for this insightful idea. It's really useful for deployment, since a plain CNN offers simplicity and high efficiency. I wonder why you did not conduct experiments on MobileNetV2/V3. I am eager to see whether MobileNetV2/V3 would benefit from the plain structure. Thank you again.

Model modification problem

Hi, when using the RepVGG-A0 pretrained model for a semantic segmentation task, I did the following:
1. Replaced the stage4 layer, because I need a one-channel output:
stage4 = nn.Sequential(nn.ReLU(),
                       nn.Conv2d(192, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
                       nn.ReLU(),
                       nn.Conv2d(96, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)))
model.stage4 = stage4
2. Removed gap and linear: nn.Sequential(*list(model.modules())[:-2])

But training raises an error:

Traceback (most recent call last):
File "/home/zgj/pycharmProject/competetion/kaggle/HuBMAP/1st_version/train.py", line 101, in
main()
File "/home/zgj/pycharmProject/competetion/kaggle/HuBMAP/1st_version/train.py", line 73, in main
predict = model(img) # ['out']
File "/home/zgj/anaconda3/envs/torch1.7.0py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.7.0py3.7/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/zgj/anaconda3/envs/torch1.7.0py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/pycharmProject/competetion/kaggle/HuBMAP/input/RepVGG/repvgg.py", line 145, in forward
out = self.linear(out)

File "/home/zgj/anaconda3/envs/torch1.7.0py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/zgj/anaconda3/envs/torch1.7.0py3.7/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "/home/zgj/anaconda3/envs/torch1.7.0py3.7/lib/python3.7/site-packages/torch/nn/functional.py", line 1690, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

How can I solve this?
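
A hedged diagnosis: the traceback shows self.linear is still being executed, because nn.Sequential(*list(model.modules())[:-2]) does not remove layers: modules() yields every nested submodule, and the original forward() keeps calling gap and linear anyway. A sketch of an alternative, assuming the stage0 through stage4, gap, and linear attributes of this repo's RepVGG class:

import torch.nn as nn

class RepVGGSeg(nn.Module):
    # Run only the convolutional stages plus the custom head, skipping gap/linear.
    def __init__(self, backbone, head):
        super().__init__()
        self.stages = nn.Sequential(backbone.stage0, backbone.stage1,
                                    backbone.stage2, backbone.stage3)
        self.head = head  # e.g. the replacement stage4 defined above

    def forward(self, x):
        return self.head(self.stages(x))

seg_model = RepVGGSeg(model, stage4)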

Distributed training: lr, batch_size and other parameter settings

Hi, the README gives the parameters "lr 0.1, 8 GPUs, global batch_size 256". Does "global batch_size" mean a batch of 256 per GPU, or 256 in total across the 8 GPUs?
I am doing distributed training on 4 servers (8 GPUs per server), currently with batch_size 1024 and lr 0.4, but the final accuracy is low, and I found the learning-rate decay is the problem. How should these parameters be set in distributed mode?
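
A hedged answer based on common practice (not an official recommendation): "global batch_size 256" normally means 256 summed over all GPUs, and the lr is scaled linearly with the global batch (Goyal et al., 2017); T_max must also count optimizer steps per process, which under DDP is dataset_size / global_batch per epoch. Hypothetical names below; adapt to the training script:

base_lr, base_batch = 0.1, 256
global_batch_size = 1024                        # 4 nodes x 8 GPUs x 32 per GPU
lr = base_lr * global_batch_size / base_batch   # = 0.4, linear scaling rule
steps_per_epoch = IMAGENET_TRAINSET_SIZE // global_batch_size
# CosineAnnealingLR(optimizer, T_max=args.epochs * steps_per_epoch)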

Even smaller models

Thanks for sharing this great model.

My question is: how can I make a model smaller than RepVGG-A0 by changing the blocks or the width multiplier?

Any ideas?
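
A hedged sketch, following the pattern of create_RepVGG_A0 in repvgg.py (A0 uses num_blocks=[2, 4, 14, 1] and width_multiplier=[0.75, 0.75, 0.75, 2.5]); shrinking either list yields a smaller network, untested here:

from repvgg import RepVGG

def create_RepVGG_tiny(deploy=False):
    # Fewer blocks per stage and narrower stages than A0.
    return RepVGG(num_blocks=[1, 2, 8, 1], num_classes=1000,
                  width_multiplier=[0.5, 0.5, 0.5, 2],
                  override_groups_map=None, deploy=deploy)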

More Examples

First of all, thanks for your great work. It would be wonderful if you could kindly provide more examples showing the capabilities of the proposed model.

mIoU drops when converting to int8 TensorRT

I use RepVGG-A2 as the segmentation backbone. After converting to a TensorRT engine, FP16 keeps the same mIoU as the PyTorch model, but when I convert to int8 TensorRT, mIoU drops by about 5%. Have you tried this?

Feature extraction

Hi, thanks for your great work

Can I use RepVGG as a feature extractor? And if it's possible, how can I do that with this source code?

Thanks
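
A hedged sketch (not an official API): the stage0 through stage4 submodules of this repo's RepVGG can be run directly, collecting the intermediate activations and skipping gap/linear:

import torch
from repvgg import create_RepVGG_A0

model = create_RepVGG_A0(deploy=True)
model.eval()

def extract_features(x):
    feats = []
    for stage in (model.stage0, model.stage1, model.stage2,
                  model.stage3, model.stage4):
        x = stage(x)
        feats.append(x)   # one feature map per stage, at decreasing resolution
    return feats

with torch.no_grad():
    features = extract_features(torch.rand(1, 3, 224, 224))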

speed test example

Thank you for your work. It would be helpful if you could provide examples with which we can reproduce the inference speeds of the RepVGG models reported in Table 4 of the paper.
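
Not the authors' benchmark, but a minimal GPU timing sketch: deploy mode, eval(), warm-up iterations, and torch.cuda.synchronize() around the timed loop are all needed for meaningful numbers:

import time
import torch
from repvgg import create_RepVGG_A0

model = create_RepVGG_A0(deploy=True).cuda().eval()
x = torch.rand(128, 3, 224, 224, device='cuda')

with torch.no_grad():
    for _ in range(10):                # warm-up
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start
print('examples/second:', 128 * 50 / elapsed)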

Can I train this code on Windows?

The following error occurred when I trained this code on Windows:
AttributeError: module 'torch.multiprocessing' has no attribute 'spawn'

So I wonder if it's a system issue. Thank you very much!!

Winograd conv speed

Only MULs are presented in the paper; have you done experiments on the speed of Winograd conv?

Large accuracy discrepancy at deployment

Thanks for your great work.
In my use, small models (under 10 MFLOPs) deploy with negligible accuracy loss, but with a large model (~2 GFLOPs) the accuracy no longer matches:
LOG:
deploy param: stage0.rbr_reparam.weight torch.Size([64, 1, 3, 3]) -0.048573527
deploy param: stage0.rbr_reparam.bias torch.Size([64]) 0.23182523
deploy param: stage1.0.rbr_reparam.weight torch.Size([128, 64, 3, 3]) -0.0054542203
deploy param: stage1.0.rbr_reparam.bias torch.Size([128]) 1.0140312
deploy param: stage1.1.rbr_reparam.weight torch.Size([128, 64, 3, 3]) 0.0006282824
deploy param: stage1.1.rbr_reparam.bias torch.Size([128]) 0.32761782
deploy param: stage1.2.rbr_reparam.weight torch.Size([128, 128, 3, 3]) 0.0023862773
deploy param: stage1.2.rbr_reparam.bias torch.Size([128]) 0.34976208
deploy param: stage1.3.rbr_reparam.weight torch.Size([128, 64, 3, 3]) -9.027165e-05
deploy param: stage1.3.rbr_reparam.bias torch.Size([128]) 0.0063683093
deploy param: stage2.0.rbr_reparam.weight torch.Size([256, 128, 3, 3]) -8.460902e-05
deploy param: stage2.0.rbr_reparam.bias torch.Size([256]) 0.11033552
deploy param: stage2.1.rbr_reparam.weight torch.Size([256, 128, 3, 3]) -0.00010023986
deploy param: stage2.1.rbr_reparam.bias torch.Size([256]) -0.15826604
deploy param: stage2.2.rbr_reparam.weight torch.Size([256, 256, 3, 3]) -5.3966836e-05
deploy param: stage2.2.rbr_reparam.bias torch.Size([256]) -0.15924689
deploy param: stage2.3.rbr_reparam.weight torch.Size([256, 128, 3, 3]) -6.7551824e-05
deploy param: stage2.3.rbr_reparam.bias torch.Size([256]) -0.37404576
deploy param: stage2.4.rbr_reparam.weight torch.Size([256, 256, 3, 3]) -0.00012947948
deploy param: stage2.4.rbr_reparam.bias torch.Size([256]) -0.6853457
deploy param: stage2.5.rbr_reparam.weight torch.Size([256, 128, 3, 3]) 7.473848e-05
deploy param: stage2.5.rbr_reparam.bias torch.Size([256]) -0.16874048
deploy param: stage3.0.rbr_reparam.weight torch.Size([512, 256, 3, 3]) -0.000433887
deploy param: stage3.0.rbr_reparam.bias torch.Size([512]) 0.18602118
deploy param: stage3.1.rbr_reparam.weight torch.Size([512, 256, 3, 3]) 0.00048246872
deploy param: stage3.1.rbr_reparam.bias torch.Size([512]) -0.7235512
deploy param: stage3.2.rbr_reparam.weight torch.Size([512, 512, 3, 3]) 0.00021061227
deploy param: stage3.2.rbr_reparam.bias torch.Size([512]) -0.5657553
deploy param: stage3.3.rbr_reparam.weight torch.Size([512, 256, 3, 3]) -0.00081703335
deploy param: stage3.3.rbr_reparam.bias torch.Size([512]) -0.37847003
deploy param: stage3.4.rbr_reparam.weight torch.Size([512, 512, 3, 3]) -0.00033185782
deploy param: stage3.4.rbr_reparam.bias torch.Size([512]) -0.57922906
deploy param: stage3.5.rbr_reparam.weight torch.Size([512, 256, 3, 3]) -0.0007206367
deploy param: stage3.5.rbr_reparam.bias torch.Size([512]) -0.56909364
deploy param: stage3.6.rbr_reparam.weight torch.Size([512, 512, 3, 3]) -0.0003344199
deploy param: stage3.6.rbr_reparam.bias torch.Size([512]) -0.5628111
deploy param: stage3.7.rbr_reparam.weight torch.Size([512, 256, 3, 3]) -0.00021987755
deploy param: stage3.7.rbr_reparam.bias torch.Size([512]) -0.34248477
deploy param: stage3.8.rbr_reparam.weight torch.Size([512, 512, 3, 3]) -0.00010127398
deploy param: stage3.8.rbr_reparam.bias torch.Size([512]) -0.5895205
deploy param: stage3.9.rbr_reparam.weight torch.Size([512, 256, 3, 3]) -0.0005824505
deploy param: stage3.9.rbr_reparam.bias torch.Size([512]) -0.37577158
deploy param: stage3.10.rbr_reparam.weight torch.Size([512, 512, 3, 3]) -0.00012262027
deploy param: stage3.10.rbr_reparam.bias torch.Size([512]) -0.6199002
deploy param: stage3.11.rbr_reparam.weight torch.Size([512, 256, 3, 3]) 1.503076e-06
deploy param: stage3.11.rbr_reparam.bias torch.Size([512]) -0.7054796
deploy param: stage3.12.rbr_reparam.weight torch.Size([512, 512, 3, 3]) 0.0006349176
deploy param: stage3.12.rbr_reparam.bias torch.Size([512]) -1.0350925
deploy param: stage3.13.rbr_reparam.weight torch.Size([512, 256, 3, 3]) 0.00037807773
deploy param: stage3.13.rbr_reparam.bias torch.Size([512]) -1.1399512
deploy param: stage3.14.rbr_reparam.weight torch.Size([512, 512, 3, 3]) 0.00025178236
deploy param: stage3.14.rbr_reparam.bias torch.Size([512]) -0.27695537
deploy param: stage3.15.rbr_reparam.weight torch.Size([512, 256, 3, 3]) 0.00074805244
deploy param: stage3.15.rbr_reparam.bias torch.Size([512]) -0.8776718
deploy param: stage4.0.rbr_reparam.weight torch.Size([1024, 512, 3, 3]) -0.00013951868
deploy param: stage4.0.rbr_reparam.bias torch.Size([1024]) 0.021552037
deploy param: linear.weight torch.Size([372, 1024]) 0.0051029953
deploy param: linear.bias torch.Size([372]) 0.17604762

Printing code:

    deploy_model = build_func(deploy=True,**kwargs)
    for name, param in deploy_model.named_parameters():
        print('deploy param: ', name, param.size(), np.mean(converted_weights[name]))
        param.data = torch.from_numpy(converted_weights[name]).float()

Reproduce accuracy

Thanks for your inspiring work.
I'm trying to reproduce the light A0 and midsize B1 models, but I only got 69.5% top-1 accuracy for A0.
B1 accuracy is also lower than reported, by about 1-2%.
I followed the 120-epoch cosine schedule, batch size 8*256.
Are there any other specific settings or tricks employed in the training pipeline?

Different outputs from train-model and deploy-model

After I converted the trained model into the inference-time structure, I tested the two models with the same input and got different outputs from the train model (RepVGG-X-train.pth) and the deploy model (RepVGG-X-deploy.pth).

Have you done that kind of comparison?
Lots of thanks~

Different outputs in segmentation with a RepVGG encoder

I tried to use a RepVGG backbone in U-Net. After the training process, the result is pretty good.
Then I converted the model with my code (based on your code):

model = smp.Unet(
    encoder_name="RepVGG-A2",        # choose encoder, e.g. mobilenet_v2 or efficientnet-b7
    classes=config_seg.NUM_CLASSES,  # model output channels (number of classes in your dataset)
    deploy=False
).cuda()
model.load_state_dict(pretrained_dict)

model_deploy = smp.Unet(
    encoder_name="RepVGG-A2",        # choose encoder, e.g. mobilenet_v2 or efficientnet-b7
    classes=config_seg.NUM_CLASSES,  # model output channels (number of classes in your dataset)
    deploy=True
).cuda()

all_weights = {}
for name, module in model.named_modules():
    if hasattr(module, 'repvgg_convert'):
        kernel, bias = module.repvgg_convert()
        print('>> name: ', name)
        all_weights[name + '.rbr_reparam.weight'] = kernel
        all_weights[name + '.rbr_reparam.bias'] = bias
        print('convert RepVGG block')
    else:
        for p_name, p_tensor in module.named_parameters():
            full_name = name + '.' + p_name
            print('>> not vgg block name: ', name, p_name, full_name)
            if full_name not in all_weights:
                all_weights[full_name] = p_tensor.detach().cpu().numpy()
        for p_name, p_tensor in module.named_buffers():
            full_name = name + '.' + p_name
            if full_name not in all_weights:
                all_weights[full_name] = p_tensor.cpu().numpy()

for name, param in model_deploy.named_parameters():
    print('deploy param: ', name, param.size(), np.mean(all_weights[name]))
    param.data = torch.from_numpy(all_weights[name]).float()
del model
model = model_deploy.cuda()

Deploy model files

Hello, could you please release the deploy model files?

I converted the train checkpoint into a deploy checkpoint, but found that the two models' outputs are inconsistent. I hope you can help~

bias=False?

I noticed the Conv2d bias in conv_bn() is set to False and wonder whether that is reasonable.

"If the Conv2d bias were True, and the final bias returned by _fuse_bn_tensor() took it into account, the accuracy would improve." Does that make sense?

Have you done some comparison?

Thank you!

Repeated output when training with multi-gpu

Hi DingXiaoH,
I encountered repeated outputs when training with multi-GPU (8 GPUs); the outputs are shown below.
(screenshot)
As you can see in the image, the program prints the training result of the same batch 8 times.
I use the command below to start the training.
python train.py -a RepVGG-A0 --dist-url 'tcp://127.0.0.1:23333' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --workers 32 imagenet
How can I solve this problem? Or does it not influence the result?

Thanks,
Ema1997.
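
A hedged explanation, based on the PyTorch ImageNet example that train.py follows: with --multiprocessing-distributed, each of the 8 spawned processes runs its own training loop and prints its own log line, so training is unaffected; gating the logging on the first rank removes the duplicates:

# Inside the training loop; args and ngpus_per_node as in train.py.
# progress.display(i) is the ImageNet-example logging helper.
if not args.multiprocessing_distributed or args.rank % ngpus_per_node == 0:
    progress.display(i)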

Accuracy when replacing ResNet-18 as the backbone of a TRN network

Many thanks for the authors' work and the open-source code.

I recently replaced ResNet-18 with RepVGG-B0 as the backbone for training and testing. With the same hyperparameters, ResNet-18 reaches 85% while RepVGG-B0 only reaches 70%, which puzzles me.
The overall model is TRN, used for multi-frame action recognition. The network structure is mainly:

  1. A CNN extracts features from multiple frames;
  2. The per-frame features are concatenated;
  3. An MLP classifies the concatenated features.

Below are the training settings; both backbones used the same parameters:
Optimizer: Adam
Learning rate: 1.0e-5
betas: [0.9, 0.99]
eps: 1.0e-8
weight_decay: 1.0e-4

LR schedule: ExponentialLR
gamma: 0.99

Epochs: 150
Batch size: 64
Input size: 96×96
Frames trained at a time: 5

The test metric is F1-score. I trained 5 models with each backbone, took each model's best score on the test set, and averaged over the 5 models.
Note that the RepVGG-based models were tested without the deploy conversion.

I believe RepVGG is very friendly for industrial deployment and hope to use this model, hence this issue.

.pth model to onnx model with an error

RuntimeError: Error(s) in loading state_dict for RepVGG:
Missing key(s) in state_dict: "stage0.rbr_reparam.weight", "stage0.rbr_reparam.bias", "stage1.0.rbr_reparam.weight", ...........
Unexpected key(s) in state_dict: "stage0.rbr_dense.conv.weight", "stage0.rbr_dense.bn.weight", "stage0.rbr_dense.bn.bias", "stage0.rbr_dense.bn.running_mean", "stage0.rbr_dense.bn.running_var", "stage0.rbr_dense.bn.num_batches_tracked", "stage0.rbr_1x1.conv.weight",

What's wrong with this? Looking forward to your reply.
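
A hedged reading of the error: the missing rbr_reparam keys mean a training checkpoint is being loaded into a deploy-mode model. Building the model with deploy=False, converting, and only then exporting should avoid it:

import torch
from repvgg import create_RepVGG_A0, repvgg_model_convert

model = create_RepVGG_A0(deploy=False)                     # matches the train checkpoint
model.load_state_dict(torch.load('RepVGG-A0-train.pth', map_location='cpu'))
deploy_model = repvgg_model_convert(model, create_RepVGG_A0)
deploy_model.eval()
torch.onnx.export(deploy_model, torch.rand(1, 3, 224, 224), 'RepVGG-A0.onnx')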

Is repvgg_model_convert() a little wrong? The result of the deploy version is not the same as in training

Hi
I tried to download the train version of RepVGG-A2 and convert it to the deploy version.

model_o = create_RepVGG_A2()
model_o.load_state_dict(torch.load('/home/forrest/pycharm/data/RepVGG-A2-train.pth'), strict=False)
# original downloaded weights

model_o_copy = model_o

model = repvgg_model_convert(model_o_copy, create_RepVGG_A2, '/home/forrest/pycharm/data/RepVGG-A2-deploytest.pth')

x = torch.from_numpy(np.array(Image.open('/home/forrest/Downloads/data/syj/Haze20/HazeClear-train-test/train/00002/002.jpg'))).float().unsqueeze(0).permute(0, 3, 1, 2) / 255.0
model2 = create_RepVGG_A2(deploy=True)
model2.load_state_dict(torch.load('/home/forrest/pycharm/data/RepVGG-A2-deploytest.pth'), strict=False)

out1 = model_o(x)
out2 = model2(x)
out3 = model(x)
print(torch.sum(torch.abs(out1 - out2)), torch.sum(torch.abs(out1 - out3)))

print(out1[0, :10], out2[0, :10], out3[0, :10])

The result is
tensor(1090.8134, grad_fn=<SumBackward0>) tensor(1090.8134, grad_fn=<SumBackward0>) tensor([-1.1406e+00, -7.5946e-01, -9.9028e-01, -1.6798e+00, -1.5024e+00, -8.5764e-01, -1.3238e+00, -1.4038e-03, -4.1698e-01, -8.1957e-01], grad_fn=<SliceBackward>) tensor([ 1.9279, -0.5468, 0.9310, 0.5799, 0.7641, 1.0118, 0.0925, -0.8435, -0.9261, -0.0287], grad_fn=<SliceBackward>) tensor([ 1.9279, -0.5468, 0.9310, 0.5799, 0.7641, 1.0118, 0.0925, -0.8435, -0.9261, -0.0287], grad_fn=<SliceBackward>)

It seems the output of the model produced by repvgg_model_convert() is not the same as the train version's.
I wonder why?
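
A hedged diagnosis: out2 equals out3, so the conversion is self-consistent; out1 differs most likely because model_o was never put in eval() mode (its BN layers use batch statistics), and strict=False can additionally hide mismatched keys. A repeat of the comparison with both fixed:

model_o.load_state_dict(torch.load('/home/forrest/pycharm/data/RepVGG-A2-train.pth'))  # strict=True by default
model_o.eval()
model2.eval()
with torch.no_grad():
    print(torch.sum(torch.abs(model_o(x) - model2(x))))  # expected to be near zero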

bug of convert.py

When I try to convert A1-train.pth I get:
return kernel * t, beta - running_mean * gamma / std
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Can you help me? Thank you!
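
A hedged workaround: the mix of cuda:0 and cpu tensors usually means the model and the checkpoint live on different devices during conversion; doing the conversion entirely on CPU sidesteps it. Sketch assuming the A1 builder from repvgg.py:

import torch
from repvgg import create_RepVGG_A1, repvgg_model_convert

model = create_RepVGG_A1(deploy=False)
model.load_state_dict(torch.load('RepVGG-A1-train.pth', map_location='cpu'))
model = model.cpu()    # keep every tensor on one device while fusing
deploy_model = repvgg_model_convert(model, create_RepVGG_A1)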

RepVGG in OD

Hello, I embedded the RepVGG module into a detection model for training. Without converting the model, I can predict results normally, but after I process the model with the method in convert.py, it can no longer predict correct results.

Why does the converted model become slower?

I used RepVGG-A0 in my task and converted the trained model with the whole_model_convert function, but at test time the trained model is much faster than the converted one: the trained model's test time is around 288 s, while the converted model's exceeds 400 s.

Can RepVGG block combine with SeparableConv2d?

Thanks for your great work. It helps me a lot. Now I want to speed up the 3x3 convolution even further.

My question is: can the RepVGG block be combined with SeparableConv2d? This is shown in the figure below.

(screenshot)
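
A hedged sketch of the idea in the figure, using the plug-in RepVGGBlock posted earlier in this thread: a depthwise RepVGG block (groups = in_channels) followed by a plain 1x1 pointwise conv mirrors SeparableConv2d; whether it trains equally well is untested here:

import torch.nn as nn

def separable_repvgg(in_channels, out_channels, stride=1):
    return nn.Sequential(
        # Depthwise 3x3 with re-parameterizable 1x1 and identity branches.
        RepVGGBlock(in_channels, in_channels, kernel_size=3,
                    stride=stride, padding=1, groups=in_channels),
        nn.ReLU(inplace=True),
        # Pointwise projection, as in SeparableConv2d.
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )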

The accuracy problem

I wonder what accuracy training RepVGG-A0 with your PyTorch script should achieve.

I tried to reproduce RepVGG-A0 with 0.1 label smoothing, but got an accuracy of 71.6%.


Style transfer

Hello,
How does RepVGG compare to using standard VGG in style transfer tasks (Perceptual Loss) where it's uncommon to use models such as ResNet?

Data loading time during training

Hi @DingXiaoH , thanks for the great work.
I have been running the training code in this repo recently. However, according to the log, the data loading time is unstable and slows down training a lot.
(screenshot)

My GPU utilization also fluctuates wildly during training, from 0% to 99%.
(screenshot)

Apparently, the bottleneck is the data loading of the ImageNet dataset. Do you have any practices or suggestions on how to accelerate the training?

Thank you.

RepVGG vs GENet

As far as I understand, your approach shares a similar idea with GPU-Efficient Networks. Have you done any comparison with them?

> Running a freshly initialized model is fine, but a model with loaded weights is not.

    x = torch.from_numpy(np.random.randn(1,*shape)).float()
    y = model(x)
    model_d = repvgg_model_convert(model,model_func,out_c=186*2,num_blocks=[4,6,16,1],in_c=1)
    y_d = model_d(x)
    print('diff abs: max {},\n**2:{}'.format(abs(y - y_d).max(),((y - y_d) ** 2).sum()))

Output:
diff abs: max 6.67572021484375e-06,
**2: 1.419987460948846e-09
This looks normal here, but after actual training the final export shows the large discrepancy posted earlier. I haven't figured out the details of convert, so I don't want to jump to conclusions.

I observed two phenomena when implementing RepVGG:

  1. Both the training-stage and test-stage models must be put in eval() before comparing accuracy, otherwise there will be a large discrepancy;
  2. When the weights are initialized as follows:
def init_weights(modules):
    for m in modules:
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)

there is a large accuracy misalignment, whereas the following initialization guarantees consistency:

    def _init_weights(self, gamma=0.01):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, gamma)
                nn.init.constant_(m.bias, gamma)

Here is the test code:

def test_regvgg():
    model = RepVGGRecognizer()
    model.eval()
    print(model)

    data = torch.randn(1, 3, 224, 224)
    insert_repvgg_block(model)
    model.eval()
    train_outputs = model(data)[KEY_OUTPUT]
    print(model)

    fuse_repvgg_block(model)
    model.eval()
    eval_outputs = model(data)[KEY_OUTPUT]
    print(model)

    print(torch.sqrt(torch.sum((train_outputs - eval_outputs) ** 2)))
    print(torch.allclose(train_outputs, eval_outputs, atol=1e-8))
    assert torch.allclose(train_outputs, eval_outputs, atol=1e-8)

Hope this helps.

Originally posted by @zjykzj in #23 (comment)

Error when converting a trained RepVGG-A0 model with PyTorch 1.4.0; how can I solve it?

RuntimeError: Error(s) in loading state_dict for RepVGG:
Missing key(s) in state_dict: "stage0.rbr_dense.conv.weight", "stage0.rbr_dense.bn.weight", "stage0.rbr_dense.bn.bias", "stage0.rbr_dense.bn.running_mean", "stage0.rbr_dense.bn.running_var", "stage0.rbr_1x1.conv.weight", "stage0.rbr_1x1.bn.weight", "stage0.rbr_1x1.bn.bias", "stage0.rbr_1x1.bn.running_mean", "stage0.rbr_1x1.bn.running_var", "stage1.0.rbr_dense.conv.weight", "stage1.0.rbr_dense.bn.weight", "stage1.0.rbr_dense.bn.bias", "stage1.0.rbr_dense.bn.running_mean", "stage1.0.rbr_dense.bn.running_var" .......
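
A hedged guess at the cause: checkpoints saved from DataParallel/DistributedDataParallel prefix every key with 'module.', which produces exactly this kind of missing-key list. Stripping the prefix before loading often resolves it:

import torch

ckpt = torch.load('RepVGG-A0-train.pth', map_location='cpu')
if 'state_dict' in ckpt:                     # some training scripts wrap the weights
    ckpt = ckpt['state_dict']
ckpt = {k.replace('module.', '', 1): v for k, v in ckpt.items()}
model.load_state_dict(ckpt)                  # model built with deploy=False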
