[CV_GAN] BeautyGAN: Instance-level Facial Makeup Transfer with Deep Generative Adversarial Network

Abstract

  • Facial makeup transfer : translating makeup style from reference makeup img to non-makeup one, preserving face identity
  • Instance-level transfer : more challenging than conventional Domain-level transfer tasks, especially without paired data
  • Makeup style : Local styles/cosmetics (ex. eye shadow, lipstick, foundation) → different from Global style (ex. painting)
  • BeautyGAN
    • Incorporating both Global Domain-level loss + Local Instance-level loss in dual in/output GAN
    • Extracting and translating local and delicate makeup information
    • Global Domain-level loss : ensured by Discriminators that distinguish generated imgs from domain's real samples
    • Local Instance-level loss : calculated by pixel-level histogram loss on separate local facial regions
    • Perceptual loss & Cycle Consistency loss : Generating high quality faces and preserving identity
    • Overall objective function : to learn translation on Instance-level through unsupervised adversarial learning
    • Extensive experiments : BeautyGAN could generate visually pleasant makeup faces and accurate transferring results
  • New makeup dataset (3834 high-resolution face imgs)

CCS Concepts

  • Computing methodologies → Computer vision tasks ; NN ; Unsupervised learning

Keywords

  • Facial makeup transfer ; GAN

1. Introduction

1) Why Need

: To help users try well-suited makeup style from photos without professional suggestions

  • Virtual Makeup application (Previous tools) : user's manual interaction required & only a certain number of fixed styles
  • Makeup transfer : efficient way to help users select the most suitable style

2) Existing automatic Makeup transfer (2 categories)

(1) Traditional image processing

  • Ex. image gradient editing, physics-based manipulation
  • Decompose imgs into several layers (ex. face structure, color, skin) →
  • → Transfer each layer after warping the reference makeup img to the non-makeup one

(2) DL based methods

  • Typically built upon DNNs
  • Several independent networks to deal with each cosmetic individually

Previous methods

  • treat makeup style as a simple combination of different components
  • → Overall output img looks unnatural, with apparent artifacts where the components are combined

3) Image-to-image translation : style-transfer

  • Existing end-to-end structures acting on the entire img can generate high-quality results
  • But, directly applying them to the facial makeup transfer task is still infeasible

4) Facial makeup transfer (2 main characteristics)

(1) Various makeup styles from face to face & Instance-level transfer required

  • Typical img-to-img translation methods (GAN) : mostly for Domain-level transfer
  • Ex. CycleGAN : img-to-img translation bw two collections (ex. horses and zebras)
    • Emphasize inter-domain differences while omitting intra-domain differences
    • Generate an average domain-level style that stays invariant given different reference imgs

(2) Makeup style = A Global style + Independent Local styles

  • Conventional style transfer : style = global painting manner
  • Makeup style : consists of various local cosmetics → delicate and elaborate
  • Difficult to extract makeup style as a whole while preserving particular traits of various cosmetics

5) Making New makeup dataset

  • Lack of training data
    • Released makeup dataset : too small to train big networks
    • Difficult to obtain a pair of well-aligned face imgs with different makeup styles
    • Supervised learning with paired data is implausible
  • So, making a new makeup dataset with 3834 imgs

6) BeautyGAN : A novel dual in/output GAN

  • Input : makeup and non-makeup face imgs
  • Output : transferred results
  • No additional pre-/post-processing
  • First, transfer non-makeup face to makeup domain with a couple of Discriminators
  • Instance-level transfer by pixel-level histogram loss on the basis of Domain-level transfer
  • Perceptual loss & Cycle Consistency loss : preserve face identity and eliminate artifacts
    • Cycle Consistency bw in/outputs : achieved with only one Generator
  • Makeup and Anti-makeup simultaneously in a single forward pass
  • No paired data is needed
  • Generated result imgs : natural-looking, visually pleasant without observable artifacts

7) Main Contributions

  • (1) Automatic makeup transfer with a dual input/output GAN : effective and high quality
  • (2) Instance-level style transfer : pixel-level histogram losses on different local facial regions
    • can be easily generalized to other img translation tasks (ex. head-shot portraits, img attribute transfer)
  • (3) New makeup dataset : 3834 imgs

2. Related works

2.1 Makeup Studies

(1) Makeup transfer frameworks based on Traditional methods

: Localized makeup transfer frameworks + Warping and structure preservation to synthesize after-makeup imgs
= Divide facial makeup into several parts and apply different methods to each facial part

  • [31] Facial makeup detector and remover framework based on locality-constrained dictionary learning
  • [20] Anti-Makeup : Adversarial net to generate non-makeup imgs for makeup-invariant face verification
  • [11] Digital face makeup : Decompose imgs into 3 layers & Transfer makeup info layer by layer
    • Limitation : smooths out facial details of source imgs
  • [19] Advanced Decomposition method (physics-based manipulation of intrinsic img layers)

(2) BeautyGAN

  • Realize makeup transfer and makeup removal simultaneously
  • Unified training process : considering relationships among cosmetics in different regions
  • End-to-end network : learns to adapt cosmetics to the fed-in source imgs → eliminates the need for post-processing

2.2 Style Transfer

  • Aim : To combine content and style from different imgs
  • [8] Generating a reconstruction img by minimizing content and style reconstruction loss
  • [9] Perceptual factors to control more information (ex. color, scale, spatial location) : High quality but Heavy computation
  • [13] Feed-forward network (for real-time style transfer and super-resolution): Less computation and Approximate quality

2.3 Generative Adversarial Networks

  • GAN : Generator + Discriminator => Generating visually realistic imgs
  • [17] Super Resolution GAN
  • [6] ExGANs : a type of cGAN that utilize exemplar information to solve personalized eye in-painting problem
  • [27] Improving the realism of synthetic imgs via adversarial training, so that models can be trained on them
  • [34] Generative visual manipulation on natural img manifold
    • Incorporating user interactions to present real-time img editing + GAN was leveraged to estimate img manifold

2.4 GAN for Image-to-Image Translation

  • Aim : To learn a mapping from source domain to target domain
  • [4, 12, 35] Promising works applying GAN to Image-to-Image Translation
  • [12] pix2pix : synthesize imgs from label maps, reconstruct objects from edge imgs (using paired imgs for training)
  • [22] CoGAN (Coupled GAN) : generators are coupled with weight-sharing constraints to learn a joint distribution
  • [35] CycleGAN, [14] DiscoGAN : Cycle Consistency loss to regularize key attributes bw inputs and translated imgs
  • StarGAN : mapping among multiple domains within one single generator

3. Our approach : BeautyGAN

  • Goal : Facial makeup transfer bw a reference makeup img and a source non-makeup img on instance-level
  • A : non-makeup img domain ⊂ R^(HxWx3)
  • B : makeup img domain ⊂ R^(HxWx3)
  • G : A x B → B x A : mapping bw domains A and B, learned simultaneously (x : Cartesian product)
  • Inputs : given 2 imgs : a source img I_src ∈ A & a reference img I_ref ∈ B
  • Outputs : an after-makeup img I_src^B ∈ B & an anti-makeup img I_ref^A ∈ A
    • (I_src^B, I_ref^A) = G(I_src, I_ref)
    • I_src^B : synthesizing makeup style of I_ref while preserving face identity of I_src
    • I_ref^A : realizing makeup removal from I_ref
  • Instance-level correspondence = Makeup style consistency bw I_src^B and I_ref
    • No paired data for training
    • Pixel-level Histogram loss acted on different cosmetics
      • Adversarial losses : to generate visually pleasant imgs and refine correlation among different cosmetics
      • Perceptual loss : to maintain face identity and structure -> transfer exact makeup to source img
      • Integrate all loss terms into one Full Objective function [3.1]

3.1 Full Objective

G*, D_A*, D_B* = arg min_G max_{D_A, D_B} L(G, D_A, D_B)

  • 1 generator G & 2 discriminators D_A, D_B → Minimax game
    • G : minimize the Adversarial loss
    • D_A, D_B : maximize the same Adversarial loss

Loss function (Adversarial loss) of D_A, D_B

  • D_A : aim to distinguish generated img I_ref^A from non-makeup real samples in set A
    L_D_A = E_{I_src ~ A}[ log D_A(I_src) ] + E[ log(1 - D_A(I_ref^A)) ]

  • D_B : aim to distinguish generated img I_src^B from makeup real samples in set B
    L_D_B = E_{I_ref ~ B}[ log D_B(I_ref) ] + E[ log(1 - D_B(I_src^B)) ]

Full Objective Loss function of G : 4 Loss terms

L_G = α L_adv + β L_cyc + γ L_per + L_makeup

# Combined loss
g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt
if self.checkpoint or self.direct:
    g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt + g_A_loss_his + g_B_loss_his

(1) L_adv : Adversarial loss for G
L_adv = E[ log(1 - D_A(I_ref^A)) ] + E[ log(1 - D_B(I_src^B)) ]   (minimized by G)

# GAN loss D_A(G_A(A))
fake_B = self.G_A(org_A)
pred_fake = self.D_A(fake_B)
g_A_loss_adv =  self.criterionGAN(pred_fake, True)
#g_loss_adv = self.get_G_loss(out)

# GAN loss D_B(G_B(B))
fake_A = self.G_B(ref_B)
pred_fake = self.D_B(fake_A)
g_B_loss_adv = self.criterionGAN(pred_fake, True)

(2) L_per : Perceptual loss
L_per = (1 / (C_l H_l W_l)) ( ||F_l(I_src) - F_l(I_src^B)||_2^2 + ||F_l(I_ref) - F_l(I_ref^A)||_2^2 )   (F_l : VGG feature maps, see Sec. 3.2)

(3) L_cyc : Cycle consistency loss
(I_ref^rec, I_src^rec) = G(I_ref^A, I_src^B)
L_cyc = E[ ||I_src^rec - I_src||_1 ] + E[ ||I_ref^rec - I_ref||_1 ]

(4) L_makeup : Makeup loss
L_makeup = λ_l L_lips + λ_s L_shadow + λ_f L_face

3.2 Domain-Level Makeup Transfer

  • Domain-level makeup transfer : foundation of Instance-level makeup transfer
  • Dual input/output architecture → simultaneously learning the mapping bw two domains(A,B) within just one Generator !
  • Output imgs : required to preserve face identities & background info as Input imgs
    • Perceptual loss -> face identities
    • Cycle consistency loss -> background info

Perceptual loss

  • Aim : to preserve face identities
  • How : Calculating differences bw high-level features extracted by a deep ConvNet (ImageNet-pretrained VGG16); a stand-in feature-extractor sketch follows the code below
  • F_l(x) : feature maps in l-th layer on VGG, F_l ∈ R^(C_l x H_l x W_l)
  • Perceptual loss bw input imgs(I_src, I_ref) and output imgs(I_src^B, I_ref^A) :
    L_per = (1 / (C_l H_l W_l)) ( ||F_l(I_src) - F_l(I_src^B)||_2^2 + ||F_l(I_ref) - F_l(I_ref^A)||_2^2 )
# identity loss
if self.lambda_idt > 0:
    # G should be identity if ref_B or org_A is fed
    idt_A1, idt_A2 = self.G(org_A, org_A)
    idt_B1, idt_B2 = self.G(ref_B, ref_B)
    loss_idt_A1 = self.criterionL1(idt_A1, org_A) * self.lambda_A * self.lambda_idt
    loss_idt_A2 = self.criterionL1(idt_A2, org_A) * self.lambda_A * self.lambda_idt
    loss_idt_B1 = self.criterionL1(idt_B1, ref_B) * self.lambda_B * self.lambda_idt
    loss_idt_B2 = self.criterionL1(idt_B2, ref_B) * self.lambda_B * self.lambda_idt
    # loss_idt
    loss_idt = (loss_idt_A1 + loss_idt_A2 + loss_idt_B1 + loss_idt_B2) * 0.5
else:
    loss_idt = 0

# vgg loss
vgg_org = self.vgg(org_A, self.content_layer)[0]
vgg_org = Variable(vgg_org.data).detach()
vgg_fake_A = self.vgg(fake_A, self.content_layer)[0]
g_loss_A_vgg = self.criterionL2(vgg_fake_A, vgg_org) * self.lambda_A * self.lambda_vgg

vgg_ref = self.vgg(ref_B, self.content_layer)[0]
vgg_ref = Variable(vgg_ref.data).detach()
vgg_fake_B = self.vgg(fake_B, self.content_layer)[0]
g_loss_B_vgg = self.criterionL2(vgg_fake_B, vgg_ref) * self.lambda_B * self.lambda_vgg

loss_rec = (g_loss_rec_A + g_loss_rec_B + g_loss_A_vgg + g_loss_B_vgg) * 0.5
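
The net.VGG module used above loads its own weight file (addings/vgg_conv.pth) and is not shown in this excerpt. A rough stand-in, assuming recent torchvision and that relu_4_1 corresponds to the first 19 layers of torchvision's VGG16 feature stack (VGGRelu41 is a hypothetical name, not the repo's class):

import torch.nn as nn
from torchvision import models

class VGGRelu41(nn.Module):
    """Hypothetical relu_4_1 feature extractor (not the repo's net.VGG)."""
    def __init__(self):
        super().__init__()
        features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.slice = nn.Sequential(*list(features.children())[:19])  # conv1_1 ... relu_4_1
        for p in self.slice.parameters():
            p.requires_grad = False          # feature extractor stays frozen

    def forward(self, x):
        # x : (N, 3, H, W) image batch
        return self.slice(x)

# usage sketch: perceptual term bw a source img and its made-up output
# vgg_extractor = VGGRelu41().eval()
# g_loss_vgg = nn.functional.mse_loss(vgg_extractor(fake_B), vgg_extractor(org_A).detach())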

Cycle consistency loss

  • Aim : to maintain background information
  • How : Passing output imgs back into G -> the regenerated imgs should reconstruct the original input imgs
    (I_ref^rec, I_src^rec) = G(I_ref^A, I_src^B)
    L_cyc = E[ ||I_src^rec - I_src||_1 ] + E[ ||I_ref^rec - I_ref||_1 ]
# Forward cycle loss
rec_A = self.G_B(fake_B)
g_loss_rec_A = self.criterionL1(rec_A, org_A) * self.lambda_A

# Backward cycle loss
rec_B = self.G_A(fake_A)
g_loss_rec_B = self.criterionL1(rec_B, ref_B) * self.lambda_B

3.3 Instance-Level Makeup Transfer

  • Instance-Level Makeup Transfer : Adding constraints on makeup style consistency
  • Facial makeup : visually recognized as color distributions → makeup transfer ≈ color changing
  • Histogram Matching (HM) : a straightforward Color Transfer method
  • Histogram Loss on pixel-level : restricting I_src^B = I_ref in makeup style

Histogram Loss

  • Inappropriate strategy : MSE loss directly on pixel-level histograms of two imgs → histogram bin counts are piecewise constant w.r.t. pixel values → Gradient = 0 → No optimization
  • Histogram matching strategy : Generating HM(x,y) first → MSE Loss → Backpropagation (a minimal sketch follows this list)
    • Goal : To calculate Histogram Loss on pixels bw original img x and reference img y
    • HM(x,y) : a GT remapping img = same color distribution as y & preserved content info as x
    • MSE Loss : bw HM(x,y) and x
    • Back-prop for optimization
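
A minimal sketch of this pixel-level histogram loss, assuming a simplified interface (the repo's criterionHis additionally takes index arguments and differs in details; histogram_matching_1d and HistogramLoss below are hypothetical names):

import torch
import torch.nn as nn

def histogram_matching_1d(src, ref):
    # Rank-based remapping HM(src, ref): the k-th smallest src pixel takes the value
    # of the proportionally k-th smallest ref pixel.
    _, src_order = torch.sort(src)
    ref_sorted, _ = torch.sort(ref)
    pos = torch.linspace(0, ref_sorted.numel() - 1, steps=src.numel(),
                         device=src.device).long()
    matched = torch.empty_like(src)
    matched[src_order] = ref_sorted[pos]
    return matched

class HistogramLoss(nn.Module):
    """Hypothetical, simplified stand-in for the repo's criterionHis."""
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, fake, ref, mask_fake, mask_ref):
        # fake, ref : (1, 3, H, W) imgs ; mask_fake, mask_ref : (1, 1, H, W) binary masks
        loss = fake.new_zeros(())
        for c in range(3):                        # per color channel
            f = fake[0, c][mask_fake[0, 0] > 0]   # generated pixels inside the region
            r = ref[0, c][mask_ref[0, 0] > 0]     # reference pixels inside the region
            if f.numel() == 0 or r.numel() == 0:
                continue
            target = histogram_matching_1d(f.detach(), r.detach())  # HM(x, y), no gradient
            loss = loss + self.mse(f, target)     # MSE bw x and HM(x, y), back-prop through f only
        return loss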

Face parsing

  • Inappropriate strategy : Histogram loss over the entire img
  • Face parsing strategy :
    • split makeup style into 3 important components (lipsticks, eye shadow, foundation)
    • apply localized histogram loss on each part
  • Reasons
    • pixels in background and hairs : no relationship with makeup → disturb correct color distribution
    • facial makeup is not merely a global style, but a collection of various styles in different cosmetic regions
  • Pre-trained Face parsing model : generating Face guidance mask M = FP(x) for each input img x
  • Face guidance mask M = FP(x) : denoting several facial locations (lips, eyes, skin, hairs, background, ...)
    • For each M, tracking different labels to produce 3 corresponding Binary masks
    • Binary masks (M_lip, M_eye, M_face) : representing for cosmetics spatiality
    • M_shadow : calculate two rectangle areas enclosing the eye shadows → exclude the eye regions, some hair, and eyebrow regions (see the sketch after the code below)
      • Why separated? No annotation for eye shadows on M (b/c before-makeup imgs have no eye shadows)
for self.i, (img_A, img_B, mask_A, mask_B) in enumerate(self.data_loader_train):
    # Convert tensor to variable
    # mask attribute: 0:background 1:face 2:left-eyebrown 3:right-eyebrown 4:left-eye 5: right-eye 6: nose 
    # 7: upper-lip 8: teeth 9: under-lip 10:hair 11: left-ear 12: right-ear 13: neck
    if self.checkpoint or self.direct:
        if self.lips==True:
            mask_A_lip = (mask_A==7).float() + (mask_A==9).float()
            mask_B_lip = (mask_B==7).float() + (mask_B==9).float()
            mask_A_lip, mask_B_lip, index_A_lip, index_B_lip = self.mask_preprocess(mask_A_lip, mask_B_lip)
        if self.skin==True:
            mask_A_skin = (mask_A==1).float() + (mask_A==6).float() + (mask_A==13).float()
            mask_B_skin = (mask_B==1).float() + (mask_B==6).float() + (mask_B==13).float()
            mask_A_skin, mask_B_skin, index_A_skin, index_B_skin = self.mask_preprocess(mask_A_skin, mask_B_skin)
        if self.eye==True:
            mask_A_eye_left = (mask_A==4).float()
            mask_A_eye_right = (mask_A==5).float()
            mask_B_eye_left = (mask_B==4).float()
            mask_B_eye_right = (mask_B==5).float()
            mask_A_face = (mask_A==1).float() + (mask_A==6).float()
            mask_B_face = (mask_B==1).float() + (mask_B==6).float()
            # avoid the situation that images with eye closed
            if not ((mask_A_eye_left>0).any() and (mask_B_eye_left>0).any() and \
                (mask_A_eye_right > 0).any() and (mask_B_eye_right > 0).any()):
                continue
            mask_A_eye_left, mask_A_eye_right = self.rebound_box(mask_A_eye_left, mask_A_eye_right, mask_A_face)
            mask_B_eye_left, mask_B_eye_right = self.rebound_box(mask_B_eye_left, mask_B_eye_right, mask_B_face)
            mask_A_eye_left, mask_B_eye_left, index_A_eye_left, index_B_eye_left = \
                self.mask_preprocess(mask_A_eye_left, mask_B_eye_left)
            mask_A_eye_right, mask_B_eye_right, index_A_eye_right, index_B_eye_right = \
                self.mask_preprocess(mask_A_eye_right, mask_B_eye_right)
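
The rebound_box helper called above is not included in this excerpt. A minimal sketch of the idea, assuming it turns each eye mask into an enclosing rectangle restricted to facial-skin pixels (eye_shadow_mask and the margin value are hypothetical; the repo's rebound_box processes both eyes at once):

import torch

def eye_shadow_mask(eye_mask, face_mask, margin=10):
    """Rectangle around one eye, standing in for the unannotated eye-shadow region."""
    # eye_mask, face_mask : (1, 1, H, W) binary tensors
    ys, xs = torch.nonzero(eye_mask[0, 0], as_tuple=True)
    if ys.numel() == 0:                       # eye not visible (e.g. closed)
        return torch.zeros_like(eye_mask)
    H, W = eye_mask.shape[-2:]
    y0, y1 = max(int(ys.min()) - margin, 0), min(int(ys.max()) + margin, H - 1)
    x0, x1 = max(int(xs.min()) - margin, 0), min(int(xs.max()) + margin, W - 1)
    rect = torch.zeros_like(eye_mask)
    rect[..., y0:y1 + 1, x0:x1 + 1] = 1.0
    # keep only skin pixels (face + nose labels above) and drop the eyeball itself,
    # which also excludes hair and eyebrow pixels as described in the notes
    return rect * face_mask * (1.0 - eye_mask)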

Makeup Loss

  • Overall Makeup Loss : integrated by 3 local Histogram Losses (lips, eye shadows, face regions)
    L_makeup = λ_l L_lips + λ_s L_shadow + λ_f L_face
# color_histogram loss
g_A_loss_his = 0
g_B_loss_his = 0
if self.checkpoint or self.direct:
    if self.lips==True:
        g_A_lip_loss_his = self.criterionHis(fake_A, ref_B, mask_A_lip, mask_B_lip, index_A_lip) * self.lambda_his_lip
        g_B_lip_loss_his = self.criterionHis(fake_B, org_A, mask_B_lip, mask_A_lip, index_B_lip) * self.lambda_his_lip
        g_A_loss_his += g_A_lip_loss_his
        g_B_loss_his += g_B_lip_loss_his
    if self.skin==True:
        g_A_skin_loss_his = self.criterionHis(fake_A, ref_B, mask_A_skin, mask_B_skin, index_A_skin) * self.lambda_his_skin_1
        g_B_skin_loss_his = self.criterionHis(fake_B, org_A, mask_B_skin, mask_A_skin, index_B_skin) * self.lambda_his_skin_2
        g_A_loss_his += g_A_skin_loss_his
        g_B_loss_his += g_B_skin_loss_his
    if self.eye==True:
        g_A_eye_left_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_left, mask_B_eye_left, index_A_eye_left) * self.lambda_his_eye
        g_B_eye_left_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_left, mask_A_eye_left, index_B_eye_left) * self.lambda_his_eye
        g_A_eye_right_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_right, mask_B_eye_right, index_A_eye_right) * self.lambda_his_eye
        g_B_eye_right_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_right, mask_A_eye_right, index_B_eye_right) * self.lambda_his_eye
        g_A_loss_his += g_A_eye_left_loss_his + g_A_eye_right_loss_his
        g_B_loss_his += g_B_eye_left_loss_his + g_B_eye_right_loss_his

4. Data Collection

  • Makeup Transfer(MT) dataset : Facial makeup dataset consisting of 3834 female imgs (1115 non-makeup + 2719 makeup)
    • Some variations in race, pose, expression, background clutter
    • Many makeup styles : smoky-eyes, flashy, Retro, Korean, Japanese, ...
    • More than 3000 subjects
    • Nude makeup imgs for Non-makeup category
  • How : Initial data are crawled from websites → Low-resolution imgs removed → Face alignment with 68 landmarks (preprocessing sketch below)
  • Spatial size : 256x256
  • Test set : randomly selected 100 non-makeup imgs + 250 makeup imgs
  • Training set and Validation set : split from the remaining imgs
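
The paper only states 68-landmark face alignment and a 256x256 crop; a sketch of one possible preprocessing pipeline (dlib and its shape_predictor_68_face_landmarks.dat model are assumed tools here, not necessarily what the authors used):

import random
import dlib
from PIL import Image

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def align_face(path, size=256):
    """Detect the face, fit 68 landmarks, and crop an aligned face chip."""
    img = dlib.load_rgb_image(path)
    dets = detector(img, 1)
    if len(dets) == 0:
        return None                          # skip imgs without a detectable face
    shape = predictor(img, dets[0])          # 68-point landmarks
    return Image.fromarray(dlib.get_face_chip(img, shape, size=size))

def split_dataset(non_makeup, makeup, n_test_non=100, n_test_makeup=250, seed=0):
    """Hold out 100 non-makeup + 250 makeup imgs as the test set, as in the notes above."""
    rng = random.Random(seed)
    rng.shuffle(non_makeup)
    rng.shuffle(makeup)
    test = non_makeup[:n_test_non] + makeup[:n_test_makeup]
    train_val = non_makeup[n_test_non:] + makeup[n_test_makeup:]
    return train_val, test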

5. Experiments

  • Network Architecture, Training setting, Performances, Component Analysis

5.1 Implementation Details

Network Architecture

(1) Generator G with 2 inputs and 2 outputs

  • Front : 2 separate input branches with convolutions
  • Middle : concatenate 2 branches and feed them into several residual blocks
  • End : Upsampling output feature maps by 2 individual branches of transposed convolutions
  • Branches don't share params within layers
  • Instance Normalization for G
class Generator(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    def __init__(self, conv_dim=64, repeat_num=6):
        super(Generator, self).__init__()

        layers = []
        layers.append(nn.Conv2d(3, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers.append(nn.ReLU(inplace=True))

        # Down-Sampling
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2

        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        layers.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.Tanh())
        self.main = nn.Sequential(*layers)

    def forward(self, x):
        out = self.main(x)
        return out

class Generator_makeup(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    # input 2 images and output 2 images as well
    def __init__(self, conv_dim=64, repeat_num=6, input_nc=6):
        super(Generator_makeup, self).__init__()

        layers = []
        layers.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers.append(nn.ReLU(inplace=True))

        # Down-Sampling
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2

        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        self.main = nn.Sequential(*layers)

        layers_1 = []
        layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_1.append(nn.Tanh())
        self.branch_1 = nn.Sequential(*layers_1)
        layers_2 = []
        layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_2.append(nn.Tanh())
        self.branch_2 = nn.Sequential(*layers_2)

    def forward(self, x, y):
        input_x = torch.cat((x, y), dim=1)
        out = self.main(input_x)
        out_A = self.branch_1(out)
        out_B = self.branch_2(out)
        return out_A, out_B


class Generator_branch(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    # input 2 images and output 2 images as well
    def __init__(self, conv_dim=64, repeat_num=6, input_nc=3):
        super(Generator_branch, self).__init__()

        # Branch input
        layers_branch = []
        layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        self.Branch_0 = nn.Sequential(*layers_branch)

        # Branch input
        layers_branch = []
        layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        self.Branch_1 = nn.Sequential(*layers_branch)

        # Down-Sampling, branch merge
        layers = []
        curr_dim = conv_dim*2
        layers.append(nn.Conv2d(curr_dim*2, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
        layers.append(nn.ReLU(inplace=True))
        curr_dim = curr_dim * 2
     
        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        self.main = nn.Sequential(*layers)

        layers_1 = []
        layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_1.append(nn.ReLU(inplace=True))
        layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_1.append(nn.ReLU(inplace=True))
        layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_1.append(nn.Tanh())
        self.branch_1 = nn.Sequential(*layers_1)
        layers_2 = []
        layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_2.append(nn.ReLU(inplace=True))
        layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_2.append(nn.ReLU(inplace=True))
        layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_2.append(nn.Tanh())
        self.branch_2 = nn.Sequential(*layers_2)

    def forward(self, x, y):
        input_x = self.Branch_0(x)
        input_y = self.Branch_1(y)
        input_fuse = torch.cat((input_x, input_y), dim=1)
        out = self.main(input_fuse)
        out_A = self.branch_1(out)
        out_B = self.branch_2(out)
        return out_A, out_B

(2) Discriminator D_A, D_B

  • Identical 70x70 PatchGANs : classify whether local overlapping img patches are real or fake
class Discriminator(nn.Module):
    """Discriminator. PatchGAN."""
    def __init__(self, image_size=128, conv_dim=64, repeat_num=3, norm='SN'):
        super(Discriminator, self).__init__()

        layers = []
        if norm=='SN':
            layers.append(SpectralNorm(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1)))
        else:
            layers.append(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1))
        layers.append(nn.LeakyReLU(0.01, inplace=True))

        curr_dim = conv_dim
        for i in range(1, repeat_num):
            if norm=='SN':
                layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1)))
            else:
                layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1))
            layers.append(nn.LeakyReLU(0.01, inplace=True))
            curr_dim = curr_dim * 2

        #k_size = int(image_size / np.power(2, repeat_num))
        if norm=='SN':
            layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1)))
        else:
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1))
        layers.append(nn.LeakyReLU(0.01, inplace=True))
        curr_dim = curr_dim *2

        self.main = nn.Sequential(*layers)
        if norm=='SN':
            self.conv1 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False))
        else:
            self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False)

        # conv1 remain the last square size, 256*256-->30*30
        #self.conv2 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=k_size, bias=False))
        #conv2 output a single number

    def forward(self, x):
        h = self.main(x)
        #out_real = self.conv1(h)
        out_makeup = self.conv1(h)
        #return out_real.squeeze(), out_makeup.squeeze()
        return out_makeup.squeeze()

Training Details

  • 2 Additional strategies to stabilize training and generate high quality imgs
  • (1) Replacing all negative log likelihood in Adversarial loss by least square loss
  • (2) Spectral Normalization : stably training Discriminators
    • Computationally light and easy to incorporate
    • Lipschitz constraint σ(W)=1 :
      σ(W) = max_{h : h ≠ 0} ||W h||_2 / ||h||_2 ,   W_SN = W / σ(W)
def l2normalize(v, eps=1e-12):
    return v / (v.norm() + eps)

class SpectralNorm(object):
    def __init__(self):
        self.name = "weight"
        #print(self.name)
        self.power_iterations = 1

    def compute_weight(self, module):
        u = getattr(module, self.name + "_u")
        v = getattr(module, self.name + "_v")
        w = getattr(module, self.name + "_bar")

        height = w.data.shape[0]
        for _ in range(self.power_iterations):
            v.data = l2normalize(torch.mv(torch.t(w.view(height,-1).data), u.data))
            u.data = l2normalize(torch.mv(w.view(height,-1).data, v.data))
        # sigma = torch.dot(u.data, torch.mv(w.view(height,-1).data, v.data))
        sigma = u.dot(w.view(height, -1).mv(v))
        return w / sigma.expand_as(w)

    @staticmethod
    def apply(module):
        name = "weight"
        fn = SpectralNorm()

        try:
            u = getattr(module, name + "_u")
            v = getattr(module, name + "_v")
            w = getattr(module, name + "_bar")
        except AttributeError:
            w = getattr(module, name)
            height = w.data.shape[0]
            width = w.view(height, -1).data.shape[1]
            u = Parameter(w.data.new(height).normal_(0, 1), requires_grad=False)
            v = Parameter(w.data.new(width).normal_(0, 1), requires_grad=False)
            w_bar = Parameter(w.data)

            #del module._parameters[name]

            module.register_parameter(name + "_u", u)
            module.register_parameter(name + "_v", v)
            module.register_parameter(name + "_bar", w_bar)

        # remove w from parameter list
        del module._parameters[name]

        setattr(module, name, fn.compute_weight(module))

        # recompute weight before every forward()
        module.register_forward_pre_hook(fn)

        return fn

    def remove(self, module):
        weight = self.compute_weight(module)
        delattr(module, self.name)
        del module._parameters[self.name + '_u']
        del module._parameters[self.name + '_v']
        del module._parameters[self.name + '_bar']
        module.register_parameter(self.name, Parameter(weight.data))

    def __call__(self, module, inputs):
        setattr(module, self.name, self.compute_weight(module))

def spectral_norm(module):
    SpectralNorm.apply(module)
    return module

def remove_spectral_norm(module):
    name = 'weight'
    for k, hook in module._forward_pre_hooks.items():
        if isinstance(hook, SpectralNorm) and hook.name == name:
            hook.remove(module)
            del module._forward_pre_hooks[k]
            return module

    raise ValueError("spectral_norm of '{}' not found in {}"
                     .format(name, module))
  • Face guidance masks : facial-region labels obtained from a PSPNet trained for face segmentation
  • relu_4_1 feature layer of VGG16 (pre-trained on ImageNet) : identity preserving
  • Parameters fixed all through training process : α=1, β=10, γ=0.005, λ_l=1, λ_s=1, λ_f=0.1
parser.add_argument('--lambda_cls', default='1', type=float, help='the lambda_cls weight') 
parser.add_argument('--lambda_rec', default='10', type=int, help='lambda_A and lambda_B')
parser.add_argument('--lambda_vgg', default='5e-3', type=float, help='the param of vgg loss')
parser.add_argument('--lambda_his', default='1', type=float, help='histogram loss on lips') 
parser.add_argument('--lambda_eye', default='1', type=float, help='histogram loss on eyes equals to lambda_his*lambda_eye') 
parser.add_argument('--lambda_skin_1', default='0.1', type=float, help='histogram loss on skin equals to lambda_his* lambda_skin') 
parser.add_argument('--lambda_skin_2', default='0.1', type=float, help='histogram loss on skin equals to lambda_his* lambda_skin') 
  • Training network from scratch using Adam (lr=0.0002, batch_size= 1)
parser.add_argument('--batch_size', default='1', type=int, help='batch_size')
parser.add_argument('--LR', default="2e-4", type=float, help='Learning rate')

5.2 Baselines

  • Digital Face Makeup : early makeup transfer work, applying traditional img processing method
  • DTN : SOTA makeup transfer work, proposing deep localized makeup transfer network
  • Deep Image Analogy : visual attribute transfer across two semantic-related imgs
    • to match features extracted from DNN
  • CycleGAN : unsupervised img-to-img translation work
    • For this comparison : the generator in CycleGAN is modified to take 2 branches as input, while all other parts are maintained
  • Style Transfer : training a feed-forward network for synthesizing style and content
    • non-makeup img as content & reference makeup img as style

5.3 Comparison Against Baselines

  • Qualitative evaluation

    • [11] : visible artifacts, mismatch problems around facial and eye contours, incorrect details are transferred (eye shadows)
    • [23] : alignment artifacts around eye and lip areas, incorrect details are transferred (foundation and eye shadows)
    • [13] Style transfer : grain-like artifacts (transfer global style → infeasible for delicate makeup transfer)
    • [35] CycleGAN : realistic imgs BUT makeup style are not consistent with references
    • [21] : similar makeup styles as references and natural results BUT also transfers other non-facial features from the references
      • Ex. background color from black to blue, hair color, pupil colors
      • lighter makeup styles than references (lipsticks, eye shadows, ...)
    • BeautyGAN keeps other makeup-irrelevant components (hairs, clothes, bg, ...) intact, as in the original non-makeup imgs
  • Quantitative comparison

    • User study with 84 volunteers to demonstrate BeautyGAN performs better than other baselines
    • Randomly choose 10 non-makeup test imgs + 20 makeup test imgs
    • 10x20 after-makeup results for each makeup transfer method
    • Comparison with [21] and [23]
    • 5 imgs (1 non-makeup, 1 makeup as ref, 3 randomly shuffled makeup transfer imgs generated from diff methods)
    • Rank of 3 generated imgs (based on quality + realism + makeup style similarity)
      • Rank 1 : the best makeup transfer performance -> BeautyGAN(61.84%)

5.4 Component Analysis of BeautyGAN

  • Ablation study to investigate the importance of each component in the overall objective function
  • Main Analysis : effect of the Perceptual loss term and the Makeup loss term
  • All settings are conducted with the Adversarial loss and Cycle consistency loss kept
  • [Table 2] : Settings / [Figure 6] : Results

(1) A : Remove L_per

  • Result : all fake imgs look like the two inputs warped and merged at the pixel level
  • ↔ Other experiments : identities of non-makeup faces are maintained
  • Perceptual loss : to preserve img identity

(2) B, C, D : L_makeup (L_face, L_shadow, L_lips) = 3 local histogram losses acting on diff cosmetic regions

  • B : Remove L_makeup → makeup style is not transferred
  • Makeup loss : essential for instance-level makeup transfer

6. Conclusion and Future work

  • A dual input/output BeautyGAN for Instance-level facial makeup transfer
  • 1 Generator : realizing makeup and anti-makeup simultaneously in a single forward pass
  • Pixel-level histogram loss : to constrain similarity of makeup style
  • Perceptual loss and Cycle consistency loss : to preserve identity
  • Experimental results : Significant performance gain over other existing approaches

Code

https://github.com/wtjiang98/BeautyGAN_pytorch

(1) train

def train_net():
    # enable cudnn
    cudnn.benchmark = True

    data_loaders = get_loader(dataset_config, config, mode="train")    # return train&test
    #get the solver
    if args.model == 'cycleGAN':
        solver = Solver_cycleGAN(data_loaders, config, dataset_config)
    elif args.model =='makeupGAN':
        solver = Solver_makeupGAN(data_loaders, config, dataset_config)
    else:
        print("model that not support")
        exit()
    solver.train()

(2) GANLoss

import torch
import torch.nn as nn
from torch.autograd import Variable

class GANLoss(nn.Module):
    def __init__(self, use_lsgan=True, target_real_label=1.0, target_fake_label=0.0,
                 tensor=torch.FloatTensor):
        super(GANLoss, self).__init__()
        self.real_label = target_real_label
        self.fake_label = target_fake_label
        self.real_label_var = None
        self.fake_label_var = None
        self.Tensor = tensor
        if use_lsgan:
            self.loss = nn.MSELoss()
        else:
            self.loss = nn.BCELoss()

    def get_target_tensor(self, input, target_is_real):
        target_tensor = None
        if target_is_real:
            create_label = ((self.real_label_var is None) or
                            (self.real_label_var.numel() != input.numel()))
            if create_label:
                real_tensor = self.Tensor(input.size()).fill_(self.real_label)
                self.real_label_var = Variable(real_tensor, requires_grad=False)
            target_tensor = self.real_label_var
        else:
            create_label = ((self.fake_label_var is None) or
                            (self.fake_label_var.numel() != input.numel()))
            if create_label:
                fake_tensor = self.Tensor(input.size()).fill_(self.fake_label)
                self.fake_label_var = Variable(fake_tensor, requires_grad=False)
            target_tensor = self.fake_label_var
        return target_tensor

    def __call__(self, input, target_is_real):
        target_tensor = self.get_target_tensor(input, target_is_real)
        return self.loss(input, target_tensor)

(3) cycleGAN

def build_model(self):
    # Define generators and discriminators
    self.G_A = net.Generator(self.g_conv_dim, self.g_repeat_num) 
    self.G_B = net.Generator(self.g_conv_dim, self.g_repeat_num)
    self.D_A = net.Discriminator(self.img_size, self.d_conv_dim, self.d_repeat_num)
    self.D_B = net.Discriminator(self.img_size, self.d_conv_dim, self.d_repeat_num)
    self.criterionL1 = torch.nn.L1Loss()
    self.criterionGAN = GANLoss(use_lsgan=True, tensor =torch.cuda.FloatTensor)

    # Optimizers
    self.g_optimizer = torch.optim.Adam(itertools.chain(self.G_A.parameters(), self.G_B.parameters()),
                                            self.g_lr, [self.beta1, self.beta2])
    self.d_A_optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.D_A.parameters()), self.d_lr, [self.beta1, self.beta2])
    self.d_B_optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.D_B.parameters()), self.d_lr, [self.beta1, self.beta2])

    self.G_A.apply(self.weights_init_xavier)
    self.D_A.apply(self.weights_init_xavier)
    self.G_B.apply(self.weights_init_xavier)
    self.D_B.apply(self.weights_init_xavier)

    # Print networks
    #  self.print_network(self.E, 'E')
    self.print_network(self.G_A, 'G_A')
    self.print_network(self.D_A, 'D_A')
    self.print_network(self.G_B, 'G_B')
    self.print_network(self.D_B, 'D_B')

    if torch.cuda.is_available():
        self.G_A.cuda()
        self.G_B.cuda()
        self.D_A.cuda()
        self.D_B.cuda()


def train(self):
    """Train StarGAN within a single dataset."""
    # The number of iterations per epoch
    self.iters_per_epoch = len(self.data_loader_train)
    # Start with trained model if exists
    g_lr = self.g_lr
    d_lr = self.d_lr
    if self.checkpoint:
        start = int(self.checkpoint.split('_')[0])
    else:
        start = 0
    # Start training
    self.start_time = time.time()
    for self.e in range(start, self.num_epochs):
        for self.i, (img_A, img_B, _, _) in enumerate(self.data_loader_train):
            # Convert tensor to variable
            org_A = self.to_var(img_A, requires_grad=False)
            ref_B = self.to_var(img_B, requires_grad=False)

            # ================== Train D ================== #
            # training D_A
            # Real
            out = self.D_A(ref_B)
            d_loss_real = self.criterionGAN(out, True)
            # Fake
            fake = self.G_A(org_A)
            fake = Variable(fake.data)
            fake = fake.detach()
            out = self.D_A(fake)
            #d_loss_fake = self.get_D_loss(out, "fake")
            d_loss_fake =  self.criterionGAN(out, False)
            
            # Backward + Optimize
            d_loss = (d_loss_real + d_loss_fake) * 0.5
            self.d_A_optimizer.zero_grad()
            d_loss.backward(retain_graph=True)
            self.d_A_optimizer.step()

            # Logging
            self.loss = {}
            self.loss['D-A-loss_real'] = d_loss_real.item()

            # training D_B
            # Real
            out = self.D_B(org_A)
            d_loss_real = self.criterionGAN(out, True)
            # Fake
            fake = self.G_B(ref_B)
            fake = Variable(fake.data)
            fake = fake.detach()
            out = self.D_B(fake)
            #d_loss_fake = self.get_D_loss(out, "fake")
            d_loss_fake =  self.criterionGAN(out, False)
            
            # Backward + Optimize
            d_loss = (d_loss_real + d_loss_fake) * 0.5
            self.d_B_optimizer.zero_grad()
            d_loss.backward(retain_graph=True)
            self.d_B_optimizer.step()

            # Logging
            self.loss['D-B-loss_real'] = d_loss_real.item()

            # ================== Train G ================== #
            if (self.i + 1) % self.ndis == 0:
                # adversarial loss, i.e. L_trans,v in the paper 

                # identity loss
                if self.lambda_idt > 0:
                    # G_A should be identity if ref_B is fed
                    idt_A = self.G_A(ref_B)
                    loss_idt_A = self.criterionL1(idt_A, ref_B) * self.lambda_B * self.lambda_idt
                    # G_B should be identity if org_A is fed
                    idt_B = self.G_B(org_A)
                    loss_idt_B = self.criterionL1(idt_B, org_A) * self.lambda_A * self.lambda_idt
                    g_loss_idt = loss_idt_A + loss_idt_B
                else:
                    g_loss_idt = 0
                    
                # GAN loss D_A(G_A(A))
                fake_B = self.G_A(org_A)
                pred_fake = self.D_A(fake_B)
                g_A_loss_adv =  self.criterionGAN(pred_fake, True)
                #g_loss_adv = self.get_G_loss(out)

                # GAN loss D_B(G_B(B))
                fake_A = self.G_B(ref_B)
                pred_fake = self.D_B(fake_A)
                g_B_loss_adv = self.criterionGAN(pred_fake, True)

                # Forward cycle loss
                rec_A = self.G_B(fake_B)
                g_loss_rec_A = self.criterionL1(rec_A, org_A) * self.lambda_A

                # Backward cycle loss
                rec_B = self.G_A(fake_A)
                g_loss_rec_B = self.criterionL1(rec_B, ref_B) * self.lambda_B

                # Combined loss
                g_loss = g_A_loss_adv + g_B_loss_adv + g_loss_rec_A + g_loss_rec_B + g_loss_idt
                
                self.g_optimizer.zero_grad()
                g_loss.backward(retain_graph=True)
                self.g_optimizer.step()

                # Logging
                self.loss['G-A-loss_adv'] = g_A_loss_adv.item()
                self.loss['G-B-loss_adv'] = g_B_loss_adv.item()
                self.loss['G-loss_org'] = g_loss_rec_A.item()
                self.loss['G-loss_ref'] = g_loss_rec_B.item()
                self.loss['G-loss_idt'] = g_loss_idt.item()

            # Print out log info
            if (self.i + 1) % self.log_step == 0:
                self.log_terminal()

            #plot the figures
            for key_now in self.loss.keys():
                plot_fig.plot(key_now, self.loss[key_now])

            #save the images
            if (self.i + 1) % self.vis_step == 0:
                print("Saving middle output...")
                self.vis_train([org_A, ref_B, fake_A, fake_B, rec_A, rec_B])
                self.vis_test()

            # Save model checkpoints
            if (self.i + 1) % self.snapshot_step == 0:
                self.save_models()

            if (self.i % 100 == 99):
                plot_fig.flush(self.task_name)

            plot_fig.tick()
        
        # Decay learning rate
        if (self.e+1) > (self.num_epochs - self.num_epochs_decay):
            g_lr -= (self.g_lr / float(self.num_epochs_decay))
            d_lr -= (self.d_lr / float(self.num_epochs_decay))
            self.update_lr(g_lr, d_lr)
            print('Decay learning rate to g_lr: {}, d_lr:{}.'.format(g_lr, d_lr))

(4) makeupGAN

def build_model(self):
    # Define generators and discriminators
    if self.whichG=='normal':
        self.G = net.Generator_makeup(self.g_conv_dim, self.g_repeat_num)
    if self.whichG=='branch':
        self.G = net.Generator_branch(self.g_conv_dim, self.g_repeat_num)
    for i in self.cls:
        setattr(self, "D_" + i, net.Discriminator(self.img_size, self.d_conv_dim, self.d_repeat_num, self.norm))

    self.criterionL1 = torch.nn.L1Loss()
    self.criterionL2 = torch.nn.MSELoss()
    self.criterionGAN = GANLoss(use_lsgan=True, tensor =torch.cuda.FloatTensor)
    self.vgg = net.VGG()
    self.vgg.load_state_dict(torch.load('addings/vgg_conv.pth'))
    # Optimizers
    self.g_optimizer = torch.optim.Adam(self.G.parameters(), self.g_lr, [self.beta1, self.beta2])
    for i in self.cls:
        setattr(self, "d_" + i + "_optimizer", \
                torch.optim.Adam(filter(lambda p: p.requires_grad, getattr(self, "D_" + i).parameters()), \
                                    self.d_lr, [self.beta1, self.beta2]))

    # Weights initialization
    self.G.apply(self.weights_init_xavier)
    for i in self.cls:
        getattr(self, "D_" + i).apply(self.weights_init_xavier)

    # Print networks
    self.print_network(self.G, 'G')
    for i in self.cls:
        self.print_network(getattr(self, "D_" + i), "D_" + i)

    if torch.cuda.is_available():
        self.G.cuda()
        self.vgg.cuda()
        for i in self.cls:
            getattr(self, "D_" + i).cuda()


def train(self):
    """Train StarGAN within a single dataset."""
    # The number of iterations per epoch
    self.iters_per_epoch = len(self.data_loader_train)
    # Start with trained model if exists
    cls_A = self.cls[0]
    cls_B = self.cls[1]
    g_lr = self.g_lr
    d_lr = self.d_lr
    if self.checkpoint:
        start = int(self.checkpoint.split('_')[0])
        self.vis_test()
    else:
        start = 0
    # Start training
    self.start_time = time.time()
    for self.e in range(start, self.num_epochs):
        for self.i, (img_A, img_B, mask_A, mask_B) in enumerate(self.data_loader_train):
            # Convert tensor to variable
            # mask attribute: 0:background 1:face 2:left-eyebrown 3:right-eyebrown 4:left-eye 5: right-eye 6: nose 
            # 7: upper-lip 8: teeth 9: under-lip 10:hair 11: left-ear 12: right-ear 13: neck
            if self.checkpoint or self.direct:
                if self.lips==True:
                    mask_A_lip = (mask_A==7).float() + (mask_A==9).float()
                    mask_B_lip = (mask_B==7).float() + (mask_B==9).float()
                    mask_A_lip, mask_B_lip, index_A_lip, index_B_lip = self.mask_preprocess(mask_A_lip, mask_B_lip)
                if self.skin==True:
                    mask_A_skin = (mask_A==1).float() + (mask_A==6).float() + (mask_A==13).float()
                    mask_B_skin = (mask_B==1).float() + (mask_B==6).float() + (mask_B==13).float()
                    mask_A_skin, mask_B_skin, index_A_skin, index_B_skin = self.mask_preprocess(mask_A_skin, mask_B_skin)
                if self.eye==True:
                    mask_A_eye_left = (mask_A==4).float()
                    mask_A_eye_right = (mask_A==5).float()
                    mask_B_eye_left = (mask_B==4).float()
                    mask_B_eye_right = (mask_B==5).float()
                    mask_A_face = (mask_A==1).float() + (mask_A==6).float()
                    mask_B_face = (mask_B==1).float() + (mask_B==6).float()
                    # avoid the situation that images with eye closed
                    if not ((mask_A_eye_left>0).any() and (mask_B_eye_left>0).any() and \
                        (mask_A_eye_right > 0).any() and (mask_B_eye_right > 0).any()):
                        continue
                    mask_A_eye_left, mask_A_eye_right = self.rebound_box(mask_A_eye_left, mask_A_eye_right, mask_A_face)
                    mask_B_eye_left, mask_B_eye_right = self.rebound_box(mask_B_eye_left, mask_B_eye_right, mask_B_face)
                    mask_A_eye_left, mask_B_eye_left, index_A_eye_left, index_B_eye_left = \
                        self.mask_preprocess(mask_A_eye_left, mask_B_eye_left)
                    mask_A_eye_right, mask_B_eye_right, index_A_eye_right, index_B_eye_right = \
                        self.mask_preprocess(mask_A_eye_right, mask_B_eye_right)

            org_A = self.to_var(img_A, requires_grad=False)
            ref_B = self.to_var(img_B, requires_grad=False)
            # ================== Train D ================== #
            # training D_A, D_A aims to distinguish class B
            # Real
            out = getattr(self, "D_" + cls_A)(ref_B)
            d_loss_real = self.criterionGAN(out, True)
            # Fake
            fake_A, fake_B = self.G(org_A, ref_B)
            fake_A = Variable(fake_A.data).detach()
            fake_B = Variable(fake_B.data).detach()
            out = getattr(self, "D_" + cls_A)(fake_A)
            #d_loss_fake = self.get_D_loss(out, "fake")
            d_loss_fake =  self.criterionGAN(out, False)
            
            # Backward + Optimize
            d_loss = (d_loss_real + d_loss_fake) * 0.5
            getattr(self, "d_" + cls_A + "_optimizer").zero_grad()
            d_loss.backward(retain_graph=True)
            getattr(self, "d_" + cls_A + "_optimizer").step()

            # Logging
            self.loss = {}
            self.loss['D-A-loss_real'] = d_loss_real.item()

            # training D_B, D_B aims to distinguish class A
            # Real
            out = getattr(self, "D_" + cls_B)(org_A)
            d_loss_real = self.criterionGAN(out, True)
            # Fake
            out = getattr(self, "D_" + cls_B)(fake_B)
            #d_loss_fake = self.get_D_loss(out, "fake")
            d_loss_fake =  self.criterionGAN(out, False)
            
            # Backward + Optimize
            d_loss = (d_loss_real + d_loss_fake) * 0.5
            getattr(self, "d_" + cls_B + "_optimizer").zero_grad()
            d_loss.backward(retain_graph=True)
            getattr(self, "d_" + cls_B + "_optimizer").step()

            # Logging
            self.loss['D-B-loss_real'] = d_loss_real.item()

            # ================== Train G ================== #
            if (self.i + 1) % self.ndis == 0:
                # adversarial loss, i.e. L_trans,v in the paper 

                # identity loss
                if self.lambda_idt > 0:
                    # G should be identity if ref_B or org_A is fed
                    idt_A1, idt_A2 = self.G(org_A, org_A)
                    idt_B1, idt_B2 = self.G(ref_B, ref_B)
                    loss_idt_A1 = self.criterionL1(idt_A1, org_A) * self.lambda_A * self.lambda_idt
                    loss_idt_A2 = self.criterionL1(idt_A2, org_A) * self.lambda_A * self.lambda_idt
                    loss_idt_B1 = self.criterionL1(idt_B1, ref_B) * self.lambda_B * self.lambda_idt
                    loss_idt_B2 = self.criterionL1(idt_B2, ref_B) * self.lambda_B * self.lambda_idt
                    # loss_idt
                    loss_idt = (loss_idt_A1 + loss_idt_A2 + loss_idt_B1 + loss_idt_B2) * 0.5
                else:
                    loss_idt = 0
                    
                # GAN loss D_A(G_A(A))
                # fake_A in class B, 
                fake_A, fake_B = self.G(org_A, ref_B)
                pred_fake = getattr(self, "D_" + cls_A)(fake_A)
                g_A_loss_adv = self.criterionGAN(pred_fake, True)
                #g_loss_adv = self.get_G_loss(out)
                # GAN loss D_B(G_B(B))
                pred_fake = getattr(self, "D_" + cls_B)(fake_B)
                g_B_loss_adv = self.criterionGAN(pred_fake, True)
                rec_B, rec_A = self.G(fake_B, fake_A)

                # color_histogram loss
                g_A_loss_his = 0
                g_B_loss_his = 0
                if self.checkpoint or self.direct:
                    if self.lips==True:
                        g_A_lip_loss_his = self.criterionHis(fake_A, ref_B, mask_A_lip, mask_B_lip, index_A_lip) * self.lambda_his_lip
                        g_B_lip_loss_his = self.criterionHis(fake_B, org_A, mask_B_lip, mask_A_lip, index_B_lip) * self.lambda_his_lip
                        g_A_loss_his += g_A_lip_loss_his
                        g_B_loss_his += g_B_lip_loss_his
                    if self.skin==True:
                        g_A_skin_loss_his = self.criterionHis(fake_A, ref_B, mask_A_skin, mask_B_skin, index_A_skin) * self.lambda_his_skin_1
                        g_B_skin_loss_his = self.criterionHis(fake_B, org_A, mask_B_skin, mask_A_skin, index_B_skin) * self.lambda_his_skin_2
                        g_A_loss_his += g_A_skin_loss_his
                        g_B_loss_his += g_B_skin_loss_his
                    if self.eye==True:
                        g_A_eye_left_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_left, mask_B_eye_left, index_A_eye_left) * self.lambda_his_eye
                        g_B_eye_left_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_left, mask_A_eye_left, index_B_eye_left) * self.lambda_his_eye
                        g_A_eye_right_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_right, mask_B_eye_right, index_A_eye_right) * self.lambda_his_eye
                        g_B_eye_right_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_right, mask_A_eye_right, index_B_eye_right) * self.lambda_his_eye
                        g_A_loss_his += g_A_eye_left_loss_his + g_A_eye_right_loss_his
                        g_B_loss_his += g_B_eye_left_loss_his + g_B_eye_right_loss_his

                # cycle loss
                g_loss_rec_A = self.criterionL1(rec_A, org_A) * self.lambda_A
                g_loss_rec_B = self.criterionL1(rec_B, ref_B) * self.lambda_B

                # vgg loss
                vgg_org = self.vgg(org_A, self.content_layer)[0]
                vgg_org = Variable(vgg_org.data).detach()
                vgg_fake_A = self.vgg(fake_A, self.content_layer)[0]
                g_loss_A_vgg = self.criterionL2(vgg_fake_A, vgg_org) * self.lambda_A * self.lambda_vgg
                
                vgg_ref = self.vgg(ref_B, self.content_layer)[0]
                vgg_ref = Variable(vgg_ref.data).detach()
                vgg_fake_B = self.vgg(fake_B, self.content_layer)[0]
                g_loss_B_vgg = self.criterionL2(vgg_fake_B, vgg_ref) * self.lambda_B * self.lambda_vgg
                
                loss_rec = (g_loss_rec_A + g_loss_rec_B + g_loss_A_vgg + g_loss_B_vgg) * 0.5
                
                # Combined loss
                g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt
                if self.checkpoint or self.direct:
                    g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt + g_A_loss_his + g_B_loss_his
                
                self.g_optimizer.zero_grad()
                g_loss.backward(retain_graph=True)
                self.g_optimizer.step()

                # Logging
                self.loss['G-A-loss-adv'] = g_A_loss_adv.item()
                self.loss['G-B-loss-adv'] = g_B_loss_adv.item()
                self.loss['G-loss-org'] = g_loss_rec_A.item()
                self.loss['G-loss-ref'] = g_loss_rec_B.item()
                self.loss['G-loss-idt'] = loss_idt.item()
                self.loss['G-loss-img-rec'] = (g_loss_rec_A + g_loss_rec_B).item()
                self.loss['G-loss-vgg-rec'] = (g_loss_A_vgg + g_loss_B_vgg).item()
                if self.direct:
                    self.loss['G-A-loss-his'] = g_A_loss_his.item()
                    self.loss['G-B-loss-his'] = g_B_loss_his.item()

            # Print out log info
            if (self.i + 1) % self.log_step == 0:
                self.log_terminal()

            #plot the figures
            for key_now in self.loss.keys():
                plot_fig.plot(key_now, self.loss[key_now])

            #save the images
            if (self.i + 1) % self.vis_step == 0:
                print("Saving middle output...")
                self.vis_train([org_A, ref_B, fake_A, fake_B, rec_A, rec_B])


            # Save model checkpoints
            if (self.i + 1) % self.snapshot_step == 0:
                self.save_models()

            if (self.i % 100 == 99):
                plot_fig.flush(self.task_name)

            plot_fig.tick()
        
        # Decay learning rate
        if (self.e+1) > (self.num_epochs - self.num_epochs_decay):
            g_lr -= (self.g_lr / float(self.num_epochs_decay))
            d_lr -= (self.d_lr / float(self.num_epochs_decay))
            self.update_lr(g_lr, d_lr)
            print('Decay learning rate to g_lr: {}, d_lr:{}.'.format(g_lr, d_lr))

(5) network

import torch
import torch.nn as nn
import torch.nn.functional as F

from ops.spectral_norm import spectral_norm as SpectralNorm

# Defines the GAN loss which uses either LSGAN or the regular GAN.
# When LSGAN is used, it is basically same as MSELoss,
# but it abstracts away the need to create the target label tensor
# that has the same size as the input
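# The class the comment above refers to is not included in this excerpt; below is a
# minimal hedged sketch of such a GANLoss (illustrative only, not the repository's exact code).
class GANLoss(nn.Module):
    def __init__(self, use_lsgan=True, target_real_label=1.0, target_fake_label=0.0):
        super(GANLoss, self).__init__()
        self.real_label = target_real_label
        self.fake_label = target_fake_label
        # LSGAN -> MSE on the discriminator outputs; vanilla GAN -> BCE with logits
        self.loss = nn.MSELoss() if use_lsgan else nn.BCEWithLogitsLoss()

    def forward(self, prediction, target_is_real):
        # build a target tensor shaped like the prediction
        value = self.real_label if target_is_real else self.fake_label
        target = torch.full_like(prediction, value)
        return self.loss(prediction, target)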

class ResidualBlock(nn.Module):
    """Residual Block."""
    def __init__(self, dim_in, dim_out):
        super(ResidualBlock, self).__init__()
        self.main = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.InstanceNorm2d(dim_out, affine=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.InstanceNorm2d(dim_out, affine=True))

    def forward(self, x):
        return x + self.main(x)


class Generator(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    def __init__(self, conv_dim=64, repeat_num=6):
        super(Generator, self).__init__()

        layers = []
        layers.append(nn.Conv2d(3, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers.append(nn.ReLU(inplace=True))

        # Down-Sampling
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2

        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        layers.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.Tanh())
        self.main = nn.Sequential(*layers)

    def forward(self, x):
        out = self.main(x)
        return out

class Generator_makeup(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    # input 2 images and output 2 images as well
    def __init__(self, conv_dim=64, repeat_num=6, input_nc=6):
        super(Generator_makeup, self).__init__()

        layers = []
        layers.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers.append(nn.ReLU(inplace=True))

        # Down-Sampling
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2

        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        self.main = nn.Sequential(*layers)

        layers_1 = []
        layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_1.append(nn.Tanh())
        self.branch_1 = nn.Sequential(*layers_1)
        layers_2 = []
        layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_2.append(nn.Tanh())
        self.branch_2 = nn.Sequential(*layers_2)

    def forward(self, x, y):
        input_x = torch.cat((x, y), dim=1)
        out = self.main(input_x)
        out_A = self.branch_1(out)
        out_B = self.branch_2(out)
        return out_A, out_B


class Generator_branch(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    # input 2 images and output 2 images as well
    def __init__(self, conv_dim=64, repeat_num=6, input_nc=3):
        super(Generator_branch, self).__init__()

        # Branch input
        layers_branch = []
        layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        self.Branch_0 = nn.Sequential(*layers_branch)

        # Branch input
        layers_branch = []
        layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        self.Branch_1 = nn.Sequential(*layers_branch)

        # Down-Sampling, branch merge
        layers = []
        curr_dim = conv_dim*2
        layers.append(nn.Conv2d(curr_dim*2, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
        layers.append(nn.ReLU(inplace=True))
        curr_dim = curr_dim * 2
     
        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))

        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2

        self.main = nn.Sequential(*layers)

        layers_1 = []
        layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_1.append(nn.ReLU(inplace=True))
        layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_1.append(nn.ReLU(inplace=True))
        layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_1.append(nn.Tanh())
        self.branch_1 = nn.Sequential(*layers_1)
        layers_2 = []
        layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_2.append(nn.ReLU(inplace=True))
        layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_2.append(nn.ReLU(inplace=True))
        layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_2.append(nn.Tanh())
        self.branch_2 = nn.Sequential(*layers_2)

    def forward(self, x, y):
        input_x = self.Branch_0(x)
        input_y = self.Branch_1(y)
        input_fuse = torch.cat((input_x, input_y), dim=1)
        out = self.main(input_fuse)
        out_A = self.branch_1(out)
        out_B = self.branch_2(out)
        return out_A, out_B

class Discriminator(nn.Module):
    """Discriminator. PatchGAN."""
    def __init__(self, image_size=128, conv_dim=64, repeat_num=3, norm='SN'):
        super(Discriminator, self).__init__()

        layers = []
        if norm=='SN':
            layers.append(SpectralNorm(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1)))
        else:
            layers.append(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1))
        layers.append(nn.LeakyReLU(0.01, inplace=True))

        curr_dim = conv_dim
        for i in range(1, repeat_num):
            if norm=='SN':
                layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1)))
            else:
                layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1))
            layers.append(nn.LeakyReLU(0.01, inplace=True))
            curr_dim = curr_dim * 2

        #k_size = int(image_size / np.power(2, repeat_num))
        if norm=='SN':
            layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1)))
        else:
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1))
        layers.append(nn.LeakyReLU(0.01, inplace=True))
        curr_dim = curr_dim *2

        self.main = nn.Sequential(*layers)
        if norm=='SN':
            self.conv1 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False))
        else:
            self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False)

        # conv1 remain the last square size, 256*256-->30*30
        #self.conv2 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=k_size, bias=False))
        #conv2 output a single number

    def forward(self, x):
        h = self.main(x)
        #out_real = self.conv1(h)
        out_makeup = self.conv1(h)
        #return out_real.squeeze(), out_makeup.squeeze()
        return out_makeup.squeeze()

class VGG(nn.Module):
    def __init__(self, pool='max'):
        super(VGG, self).__init__()
        # vgg modules
        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_4 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        if pool == 'max':
            self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)
        elif pool == 'avg':
            self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)
            self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)
            self.pool3 = nn.AvgPool2d(kernel_size=2, stride=2)
            self.pool4 = nn.AvgPool2d(kernel_size=2, stride=2)
            self.pool5 = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x, out_keys):
        out = {}
        out['r11'] = F.relu(self.conv1_1(x))
        out['r12'] = F.relu(self.conv1_2(out['r11']))
        out['p1'] = self.pool1(out['r12'])
        out['r21'] = F.relu(self.conv2_1(out['p1']))
        out['r22'] = F.relu(self.conv2_2(out['r21']))
        out['p2'] = self.pool2(out['r22'])
        out['r31'] = F.relu(self.conv3_1(out['p2']))
        out['r32'] = F.relu(self.conv3_2(out['r31']))
        out['r33'] = F.relu(self.conv3_3(out['r32']))
        out['r34'] = F.relu(self.conv3_4(out['r33']))
        out['p3'] = self.pool3(out['r34'])
        out['r41'] = F.relu(self.conv4_1(out['p3']))
        
        out['r42'] = F.relu(self.conv4_2(out['r41']))
        out['r43'] = F.relu(self.conv4_3(out['r42']))
        out['r44'] = F.relu(self.conv4_4(out['r43']))
        out['p4'] = self.pool4(out['r44'])
        out['r51'] = F.relu(self.conv5_1(out['p4']))
        out['r52'] = F.relu(self.conv5_2(out['r51']))
        out['r53'] = F.relu(self.conv5_3(out['r52']))
        out['r54'] = F.relu(self.conv5_4(out['r53']))
        out['p5'] = self.pool5(out['r54'])
        
        return [out[key] for key in out_keys]

Demo (test)

image

[CV_3D] PointConv: Deep Convolutional Networks on 3D Point Clouds

PointConv: Deep Convolutional Networks on 3D Point Clouds

Prior Research

  • PointNet : uses permutation-invariant max-pooling → misses the semantic features of local regions
  • PointNet++ : uses hierarchical Set Abstraction layers → considers local features, but still relies on PointNet inside each layer
  • A structure that captures the semantic features of local regions without loss is needed (PointConv)

Abstract

PointConv

  • Convolution kernel : nonlinear function of local coordinates of 3D points
    • Weight function learned with MLP
    • Density function through kernel density estimation
    • Translation-invariant & Permutation-invariant on any point set in 3D space
  • Deconvolution operator (PointDeconv) : propagating features (subsampled → original resol)

1. Introduction

  • (Indoor/Outdoor) Sensors : directly obtaining 3D data (depth info, surface normals) = important
  • CNNs for 2D : translation invariance → the same set of filters can be applied at all locations → fewer params, better generalization
  • 3D data (ex. pc) = a set of unordered 3D points (+additional features)
    • Regular lattice grid : not available → conventional CNNs are hard to apply directly
    • Volumetric grid : possible but sparse → CNNs are hard to apply at high resolution

PointConv : Convolution operation on 3D pc with Non-uniform sampling

  • Input : positions of pc
  • Goal : learn (approximate) the weight function with an MLP
    • Convolution operation = discrete approximation of a continuous convolution
    • weights in 3D space = (Lipschitz) continuous function of local point w.r.t. a reference point
    • continuous function : can be approximated by an MLP
  • Compensation : an Inverse density scale is applied to the learned weights to handle non-uniform sampling
    • Inverse density scale = re-weighting continuous function
    • = Monte Carlo approximation of continuous convolution
  • Improvement (Memory efficient version) : change the summation order
  • Results : translation-invariance (similar to 2D CNNs) & permutation-invariance (respecting point-cloud characteristics)

∴ 3 Contributions

  • PointConv : Density re-weighted convolution to fully approximate 3D continuous conv on any set of 3D points
  • Memory efficient version : changing the summation order → can scale up to the level of modern CNNs
  • PointDeconv : enables better segmentation

3. PointConv

  • PointConv : MC approximation of 3D continuous convolution
    • MLP to approximate weight function
    • Inverse density scale to re-weight

3.1 Convolution on 3D Point Clouds

1) Image vs Point Cloud

  • Images : 2D discrete functions (grid-shaped matrices)
    • relative positions bw different pixels : always fixed
    • discretized filter : summation of real-valued weight for each location within local region
  • Point Cloud : a set of 3D points (no fixed grid; arbitrary continuous positions)
    • Point = position (x, y, z) + additional features (ex. color, surface normal)
    • relative positions of different points : vary across local regions
    • discretized filters cannot be applied → ∴ a permutation-invariant Convolution is needed (PointConv)
      image

2) Operations

  • Conventional (2D) Convolution
    image

  • Continuous 3D Convolution
    image

    • $F$ : feature of a point in local region $G$ centered around point $p = (x,y,z)$
    • $W$ : continuous weight (kernel) function applied to $F$
    • $(\delta_x, \delta_y, \delta_z)$ : offset of a local point in region $G$ from the center point $p$
  • PointConv : the convolution operation carried over to point clouds (a Monte Carlo approximation over the sampled points rather than the full continuous integral; see the reconstructed formulas below)
    image

    • In practice, all that is available in local region $G$ is a set of sampled points of the pc
    • PC : very non-uniform sample from continuous $R^3$ space
    • $S$ : inverse density scale at any possible point in local region
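The convolution formulas shown as image placeholders in this subsection can be written out as follows (a reconstruction from the definitions of $F$, $W$, $S$, and $G$ given here; notation may differ slightly from the original figures):

$$Conv(W, F)_{xy} = \sum_{(\delta_x, \delta_y)} W(\delta_x, \delta_y)\, F(x+\delta_x,\, y+\delta_y)$$

$$Conv(W, F)_{xyz} = \iiint_{(\delta_x, \delta_y, \delta_z) \in G} W(\delta_x, \delta_y, \delta_z)\, F(x+\delta_x,\, y+\delta_y,\, z+\delta_z)\, d\delta_x\, d\delta_y\, d\delta_z$$

$$PointConv(S, W, F)_{xyz} = \sum_{(\delta_x, \delta_y, \delta_z) \in G} S(\delta_x, \delta_y, \delta_z)\, W(\delta_x, \delta_y, \delta_z)\, F(x+\delta_x,\, y+\delta_y,\, z+\delta_z)$$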

image

Why PointConv works well on continuous input

  • The continuous input pc is discretized, and local features are extracted with a discrete convolution
  • In a raster img, the relative positions of pixels are fixed
  • ∴ given relative positions as input, the same weights and density can be produced consistently over the entire img

3) PointConv

  • Main idea : To approximate continuous weight function $W$ by MLP & KDE(Kernelized density estimation)
  • $W$ (Weights of MLP in PointConv) : shared across all points to ensure permutation-invariance

[Code] Weight Network

class WeightNet(nn.Module):

    def __init__(self, in_channel, out_channel, hidden_unit = [8, 8]):
        super(WeightNet, self).__init__()

        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        if hidden_unit is None or len(hidden_unit) == 0:
            self.mlp_convs.append(nn.Conv2d(in_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
        else:
            self.mlp_convs.append(nn.Conv2d(in_channel, hidden_unit[0], 1))
            self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[0]))
            for i in range(1, len(hidden_unit)):
                self.mlp_convs.append(nn.Conv2d(hidden_unit[i - 1], hidden_unit[i], 1))
                self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[i]))
            self.mlp_convs.append(nn.Conv2d(hidden_unit[-1], out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
        
    def forward(self, localized_xyz):
        #xyz : BxCxKxN

        weights = localized_xyz
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            weights =  F.relu(bn(conv(weights)))

        return weights
  • $S$ (Inverse density Scale) : computed by estimating each point's density with KDE, then fed into an MLP for a 1D nonlinear transform
    • Why a nonlinear transform ? So that the network can adaptively decide whether to use the density estimates

[Code] KDE(Kernelized density estimation)

def compute_density(xyz, bandwidth):
    '''
    xyz: input points position data, [B, N, C]
    '''
    #import ipdb; ipdb.set_trace()
    B, N, C = xyz.shape
    sqrdists = square_distance(xyz, xyz)
    gaussion_density = torch.exp(- sqrdists / (2.0 * bandwidth * bandwidth)) / (2.5 * bandwidth)
    xyz_density = gaussion_density.mean(dim = -1)

    return xyz_density

[Code] Density Network

class DensityNet(nn.Module):
    def __init__(self, hidden_unit = [16, 8]):
        super(DensityNet, self).__init__()
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList() 

        self.mlp_convs.append(nn.Conv2d(1, hidden_unit[0], 1))
        self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[0]))
        for i in range(1, len(hidden_unit)):
            self.mlp_convs.append(nn.Conv2d(hidden_unit[i - 1], hidden_unit[i], 1))
            self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[i]))
        self.mlp_convs.append(nn.Conv2d(hidden_unit[-1], 1, 1))
        self.mlp_bns.append(nn.BatchNorm2d(1))

    def forward(self, density_scale):
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            density_scale =  bn(conv(density_scale))
            # apply sigmoid on the last layer, ReLU otherwise
            # (fixed: the original check `i == len(self.mlp_convs)` was never true inside enumerate)
            if i == len(self.mlp_convs) - 1:
                density_scale = torch.sigmoid(density_scale)
            else:
                density_scale = F.relu(density_scale)
        
        return density_scale
  • $C_{in}$, $C_{out}$ : # of channels for input feature and output feature

  • PointConv on K-point local region

    • Input feature $F_{in}$ = ( $K$ x $C_{in}$ ) dim vector
    • Input of Computing Weight part : $P_{local}$ = ( $K$ x 3 ) dim vector = (relative) 3D local positions of points
    • MLP (1x1 conv)
    • ➀ Output of Computing Weight part : $W$ = $K$ x ( $C_{in}$, $C_{out}$ ) dim vector
    • ➁ Inverse Density Scale : $S$ = ( $K$ x 1 ) dim vector → tiled to match the $K$ x ( $C_{in}$, $C_{out}$ ) dims
    • ➀ and ➁ are element-wise multiplied → after summation, Output feature $F_{out}$ = ( 1 x $C_{out}$ ) dim vector
  • Feature Encoding Modules

    • Purpose : To aggregate features in entire point set
    • Structure : hierarchical structure to combine detailed small region features → large abstract features
    • Key layers : sampling layer, grouping layer, PointConv layer ... similar to PointNet++
      • The PointConv layer is built from $S$ and $W$ → it replaces the PointNet layer inside the Set Abstraction block of PointNet++
      • ∴ better local representations can be aggregated!

image
image

[Code] Density Set Abstraction

class PointConvDensitySetAbstraction(nn.Module):
    def __init__(self, npoint, nsample, in_channel, mlp, bandwidth, group_all):
        super(PointConvDensitySetAbstraction, self).__init__()
        self.npoint = npoint
        self.nsample = nsample
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        last_channel = in_channel
        for out_channel in mlp:
            self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel

        self.weightnet = WeightNet(3, 16)
        self.linear = nn.Linear(16 * mlp[-1], mlp[-1])
        self.bn_linear = nn.BatchNorm1d(mlp[-1])
        self.densitynet = DensityNet()
        self.group_all = group_all
        self.bandwidth = bandwidth

    def forward(self, xyz, points):
        """
        Input:
            xyz: input points position data, [B, C, N]
            points: input points data, [B, D, N]
        Return:
            new_xyz: sampled points position data, [B, C, S]
            new_points_concat: sample points feature data, [B, D', S]
        """
        B = xyz.shape[0]
        N = xyz.shape[2]
        xyz = xyz.permute(0, 2, 1)
        if points is not None:
            points = points.permute(0, 2, 1)

        xyz_density = compute_density(xyz, self.bandwidth)
        inverse_density = 1.0 / xyz_density 

        if self.group_all:
            new_xyz, new_points, grouped_xyz_norm, grouped_density = sample_and_group_all(xyz, points, inverse_density.view(B, N, 1))
        else:
            new_xyz, new_points, grouped_xyz_norm, _, grouped_density = sample_and_group(self.npoint, self.nsample, xyz, points, inverse_density.view(B, N, 1))
        # new_xyz: sampled points position data, [B, npoint, C]
        # new_points: sampled points data, [B, npoint, nsample, C+D]
        new_points = new_points.permute(0, 3, 2, 1) # [B, C+D, nsample,npoint]
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            new_points =  F.relu(bn(conv(new_points)))

        inverse_max_density = grouped_density.max(dim = 2, keepdim=True)[0]
        density_scale = grouped_density / inverse_max_density
        density_scale = self.densitynet(density_scale.permute(0, 3, 2, 1))
        new_points = new_points * density_scale

        grouped_xyz = grouped_xyz_norm.permute(0, 3, 2, 1)
        weights = self.weightnet(grouped_xyz)     
        new_points = torch.matmul(input=new_points.permute(0, 3, 1, 2), other = weights.permute(0, 3, 2, 1)).view(B, self.npoint, -1)
        new_points = self.linear(new_points)
        new_points = self.bn_linear(new_points.permute(0, 2, 1))
        new_points = F.relu(new_points)
        new_xyz = new_xyz.permute(0, 2, 1)

        return new_xyz, new_points

[Code] PointConv for Classification

class PointConvDensityClsSsg(nn.Module):
    def __init__(self, num_classes = 40):
        super(PointConvDensityClsSsg, self).__init__()
        feature_dim = 3
        self.sa1 = PointConvDensitySetAbstraction(npoint=512, nsample=32, in_channel=feature_dim + 3, mlp=[64, 64, 128], bandwidth = 0.1, group_all=False)
        self.sa2 = PointConvDensitySetAbstraction(npoint=128, nsample=64, in_channel=128 + 3, mlp=[128, 128, 256], bandwidth = 0.2, group_all=False)
        self.sa3 = PointConvDensitySetAbstraction(npoint=1, nsample=None, in_channel=256 + 3, mlp=[256, 512, 1024], bandwidth = 0.4, group_all=True)
        self.fc1 = nn.Linear(1024, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.drop1 = nn.Dropout(0.7)
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.drop2 = nn.Dropout(0.7)
        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, xyz, feat):
        B, _, _ = xyz.shape
        l1_xyz, l1_points = self.sa1(xyz, feat)
        l2_xyz, l2_points = self.sa2(l1_xyz, l1_points)
        l3_xyz, l3_points = self.sa3(l2_xyz, l2_points)
        x = l3_points.view(B, 1024)
        x = self.drop1(F.relu(self.bn1(self.fc1(x))))
        x = self.drop2(F.relu(self.bn2(self.fc2(x))))
        x = self.fc3(x)
        x = F.log_softmax(x, -1)
        return x

3.2 Feature Propagation Using Deconvolution [Segmentation]

  • Segmentation : requires point-wise predictions (features must be propagated from the subsampled pc back to the denser original pc)
  • PointNet++ : proposed distance-based Interpolation → does not take full advantage of deconv
  • PointDeconv : composed of Interpolation + PointConv
    • Linear Interpolation from 3 nearest points : propagating coarse features from previous layers
    • Skip links : concatenating interpolated features
    • PointConv : applying PointConv on concatenated features to obtain final output
      image
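A minimal sketch of the distance-based interpolation step that precedes the PointConv in PointDeconv (hedged: this is a generic 3-nearest-neighbor inverse-distance interpolation written for illustration, not the repository's exact feature-propagation module; the function and variable names are made up):

import torch

def three_nn_interpolate(xyz_dense, xyz_sparse, feats_sparse):
    """Propagate features from a subsampled set back to a denser one by
    inverse-distance weighting of the 3 nearest sparse neighbors.
    xyz_dense: [B, N, 3], xyz_sparse: [B, S, 3], feats_sparse: [B, S, C] -> [B, N, C]
    """
    dists = torch.cdist(xyz_dense, xyz_sparse) ** 2             # [B, N, S] squared distances
    dists, idx = dists.topk(3, dim=-1, largest=False)           # 3 nearest sparse neighbors
    weights = 1.0 / (dists + 1e-8)
    weights = weights / weights.sum(dim=-1, keepdim=True)       # normalized weights [B, N, 3]
    B, N, _ = idx.shape
    batch_idx = torch.arange(B, device=idx.device).view(B, 1, 1).expand(-1, N, 3)
    neighbor_feats = feats_sparse[batch_idx, idx]                # [B, N, 3, C]
    return (weights.unsqueeze(-1) * neighbor_feats).sum(dim=2)

The interpolated features are then concatenated with the skip-linked features and passed through a PointConv layer, as described above.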

4. Efficient PointConv

  • Motivation : even though the MLP is shared across points, the weight $W$ produced by the MC-based weight function differs per point → high memory consumption
  • Implementation : Matrix multiplication & 2d 1x1 convolution
    • Since PointConv ends with a summation over all points, perform the summation over K first
    • → W : obtained by applying a 1x1 conv with the final-layer weight $H$ to the intermediate output $M$
    • Replacing $K$ x $C_{out}$ with $C_{mid}$ = efficient!
  • Advantages : GPU parallel computing, easy implementation → low memory consumption (about 1/64)
  • Generated weight filters : split into two parts (Intermediate output $M$ & Convolution kernel $H$) (see the reconstructed formulation below the figures)

image

image
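A hedged reconstruction of the memory-efficient formulation sketched in the figures above, with $M \in \mathbb{R}^{K \times C_{mid}}$ the intermediate MLP output and $H$ the final $1 \times 1$ conv kernel:

$$\tilde{F} = (S \odot F_{in})^{T} M \in \mathbb{R}^{C_{in} \times C_{mid}}, \qquad F_{out} = \mathrm{Conv}_{1 \times 1}(H, \tilde{F})$$

Here $S \odot F_{in}$ scales each of the $K$ input feature rows by its inverse density, so the summation over $K$ is absorbed into the matrix product before the final $1 \times 1$ convolution; this is what the `torch.matmul` followed by `self.linear` implements in the PointConvDensitySetAbstraction code above.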

5. Experiments

5.1 Classification on ModelNet40

  • Dataset : ModelNet40 (12,311 CAD models from 40 man-made object categories)
  • Using PointNet to sample 1024 points uniformly & compute normal vector from mesh models
  • Data augmentation : random rotating along z-axis, jittering by gaussian noise
  • Result : PointConv = SOTA among 3D input methods

5.2 ShapeNet Part Segmentation

  • Dataset : ShapeNet (16,881 shapes from 16 classes, 50 parts)
  • Goal : To assign a part category label to each point (fine-grained 3D recognition task)
  • Eval Metric : point IoU
  • Result : class avg mIoU 82.8%, instance avg mIoU 85.7% = on par with SOTA
    image

5.3 Semantic Scene Labeling(Segmentation)

  • Dataset : ScanNet (noisy dataset for realistic pc)
  • Goal : To predict semantic object labels on each 3D point given indoor scenes represented by pc
  • Train : uses random 3m x 1.5m x 1.5m cube samples
  • Eval : using sliding window over entire scan
  • Eval Metric : IoU, mIoU
  • Result : PointConv outperforms other methods
    image
    image

5.4 Classification on CIFAR-10

  • Dataset : CIFAR-10
    • each pixel as a 2D point with (x, y) + RGB features
    • pc scaled onto unit ball
  • Result : same learning capacity as 2D CNN
    image

6. Ablation Experiments and Visualization

6.1 The Structure of MLP

image

  • Dataset : 20 scene types for ScanNet (realistic 3D pc with RGB)
  • $C_{mid}$ : larger is not necessarily better for accuracy, but it does affect memory efficiency
  • The number of layers in the MLP has little effect on performance

6.2 Inverse Density Scale

  • Dataset : ScanNet
  • With density > without density → shows the effect of the Inverse Density Scale (IDS)
    • more effective in layers closer to input
    • FPS is used for sub-sampling → points in deeper layers are more uniformly distributed, so the effect of the density scale diminishes

6.3 Ablation Studies on ScanNet

  • Stride Size : smaller is better
  • RGB information : helps, but the gain is small
    image

6.4 Visualization

  • Some patterns in learned continuous filters
    image

[CV_3D] JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields

JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields

Abstract

  • Task : Semantic and Instance Segmentation of 3D point clouds
  • Methods
    • Multi-task pointwise network : predicting semantic classes of 3D points & embedding points into high-dim vectors → points of same object instance are represented by similar embeddings
    • Multi-value conditional random field : incorporating semantic and instance labels & formulating problem of semantic and instance segmentation as jointly optimising labels
  • Results : showing robustness / SOTA performance on semantic segmentation

1. Introduction

  • 3D scene understanding : hard challenges (ex. large-scale and noisy data processing)
  • Point-based representation
    • PC : more compact and intuitive representation of 3D data than multi-view of volumetric representations
    • Recent NN on PC : promising results across multiple tasks
  • Motivation
    • Semantic segmentation : identifying a class label or Object category for every 3D point in a scene
    • Instance segmentation : clustering scene into Object instances
    • Object categories and Object instances are mutually dependent → coupling semantic and instance segmentation into a single task!
  • Contributions
    • Multi-Task Pointwise Network (MT-PNet) : predicting object categories of 3D points & embedding 3D points into high-dim feature vectors(→clustering points into object instances)
    • Multi-Value Conditional Random Fields (MV-CRF) : joint optimisation of class labels and object instances by variational mean field technique
    • Experiments : joint optimisation > each individual task / SOTA performance on semantic segmentation

2. Related Work

Semantic Segmentation

  • Multi-view approach : using pretrained models on 2D domain and applying to 3D space => inconsistency
  • Volumetric approach : ex. octree (limiting convolution operations only on free-space voxels)
  • Point cloud approach : directly storing attributes of geometry of 3D scene via coordinates and normals of vertices
  • Conditional Random Fields (CRFs) : unary and binary potentials capturing characteristics of individual 3D points or meshes

Instance Segmentation

  • (1) Localizing object bboxes by Object detection → Finding a mask that separates fg and bg within each box
  • (2) Semantic segmentation + Proposing object instances

3. Method

  • At first, Scan entire pc by overlapping 3D windows
  • NN for predicting semantic class labels of vertices within window & embedding vertices into high-dim vectors

Multi-Task Pointwise Network (MT-PNet)

  • Purpose : predict object class for every 3D point in scene & embedding 3D point into high-dim vector
  • Same object instance : Pull <-> Different object instance : Push each other

Multi-Value Conditional Random Fields (MV-CRF)

  • Purpose : jointly performing semantic and instance segmentation by variational inference
  • Class labels and embeddings are fused into MV-CRF model

3.1. Multi-Task Pointwise Network (MT-PNet)

  • Input PC (N) → Feature map (N x D)
  • based on feed forward architecture of PointNet
  • Two branches : Predicting semantic labels for 3D points & Creating their pointwise Embeddings
  • Notation
    • $K$ : # of instance
    • $N_k$ : # of elements in k-th instance
    • $e_j$ : embedding of point $v_j$
    • $m_k$ : mean (centroid) of embeddings in k-th instance
  • Loss = $L_{prediction}$ + $L_{embedding}$
    • $L_{prediction}$ : CE
    • $L_{embedding}$ : $L_{pull} + L_{push} + 0.001*L_{reg}$
      • $L_{pull}$ : to attract embeddings towards centroids <-> $L_{push}$ : to keep centroids away from each other
      • $L_{reg}$ : to draw all centroids towards the origin
        image
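Written out (a reconstruction from the notation above and the DiscriminativeLoss code below, where $[\cdot]_+ = \max(0, \cdot)$ and $\delta_v$, $\delta_d$ are the pull/push margins):

$$L_{pull} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{j=1}^{N_k} \big[ \lVert m_k - e_j \rVert_2 - \delta_v \big]_+^2$$

$$L_{push} = \frac{1}{K(K-1)} \sum_{k=1}^{K} \sum_{k' \neq k} \big[ 2\delta_d - \lVert m_k - m_{k'} \rVert_2 \big]_+^2$$

$$L_{reg} = \frac{1}{K} \sum_{k=1}^{K} \lVert m_k \rVert_2$$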

[Code] Multi-Task Pointwise Network (MT-PNet)

class MTPNet(nn.Module):
    def __init__(self, input_channels, num_classes, embedding_size):
        super(MTPNet, self).__init__()
        self.num_classes = num_classes
        self.embedding_size = embedding_size
        self.input_channels = input_channels
        self.net = PointNet(self.input_channels)
        self.fc1 = nn.Conv1d(128, self.num_classes, 1)
        self.fc2 = nn.Conv1d(128, self.embedding_size, 1)

    def forward(self, x):
        x = self.net(x)
        logits = self.fc1(x)
        logits = logits.transpose(2, 1)
        logits = torch.log_softmax(logits, dim=-1)
        embedded = self.fc2(x)
        embedded = embedded.transpose(2, 1)
        return logits, embedded

[Code] Loss

class DiscriminativeLoss(nn.Module):
    def __init__(self, delta_d, delta_v,
                 alpha=1.0, beta=1.0, gamma=0.001,
                 reduction='mean'):
        # TODO: Respect the reduction rule
        super(DiscriminativeLoss, self).__init__()
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        # Set delta_d > 2 * delta_v
        self.delta_d = delta_d
        self.delta_v = delta_v

    def forward(self, embedded, masks, size):
        centroids = self._centroids(embedded, masks, size)
        L_v = self._variance(embedded, masks, centroids, size)
        L_d = self._distance(centroids, size)
        L_r = self._regularization(centroids, size)
        loss = self.alpha * L_v + self.beta * L_d + self.gamma * L_r
        return loss

    def _centroids(self, embedded, masks, size):
        batch_size = embedded.size(0)
        embedding_size = embedded.size(2)
        K = masks.size(2)
        x = embedded.unsqueeze(2).expand(-1, -1, K, -1)
        masks = masks.unsqueeze(3)
        x = x * masks
        centroids = []
        for i in range(batch_size):
            n = size[i]
            mu = x[i,:,:n].sum(0) / masks[i,:,:n].sum(0)
            if K > n:
                m = int(K - n)
                filled = torch.zeros(m, embedding_size)
                filled = filled.to(embedded.device)
                mu = torch.cat([mu, filled], dim=0)
            centroids.append(mu)
        centroids = torch.stack(centroids)
        return centroids

    def _variance(self, embedded, masks, centroids, size):
        batch_size = embedded.size(0)
        num_points = embedded.size(1)
        embedding_size = embedded.size(2)
        K = masks.size(2)
        # Convert input into the same size
        mu = centroids.unsqueeze(1).expand(-1, num_points, -1, -1)
        x = embedded.unsqueeze(2).expand(-1, -1, K, -1)
        # Calculate intra pull force
        var = torch.norm(x - mu, 2, dim=3)
        var = torch.clamp(var - self.delta_v, min=0.0) ** 2
        var = var * masks
        loss = 0.0
        for i in range(batch_size):
            n = size[i]
            loss += torch.sum(var[i,:,:n]) / torch.sum(masks[i,:,:n])
        loss /= batch_size
        return loss

    def _distance(self, centroids, size):
        batch_size = centroids.size(0)
        loss = 0.0
        for i in range(batch_size):
            n = size[i]
            if n <= 1: continue
            mu = centroids[i, :n, :]
            mu_a = mu.unsqueeze(1).expand(-1, n, -1)
            mu_b = mu_a.permute(1, 0, 2)
            diff = mu_a - mu_b
            norm = torch.norm(diff, 2, dim=2)
            margin = 2 * self.delta_d * (1.0 - torch.eye(n))
            margin = margin.to(centroids.device)
            distance = torch.sum(torch.clamp(margin - norm, min=0.0) ** 2) # hinge loss
            distance /= float(n * (n - 1))
            loss += distance
        loss /= batch_size
        return loss

    def _regularization(self, centroids, size):
        batch_size = centroids.size(0)
        loss = 0.0
        for i in range(batch_size):
            n = size[i]
            mu = centroids[i, :n, :]
            norm = torch.norm(mu, 2, dim=1)
            loss += torch.mean(norm)
        loss /= batch_size
        return loss

3.2. Multi-Value Conditional Random Fields (MV-CRF)

  • Conditional Random Fields (CRF)
    • Classical algorithm for Named Entity Recognition (NER) in NLP task
    • Softmax regression with potential function
    • Select several candidates → Choose the most appropriate label among them
  • Notation
    • $V$ : point cloud of 3D scene
    • $v_j$ : 3D vertex(point) in $V$ - represented by its 3D location $l_j = [x_j, y_j, z_j]$ & normal $n_j$ & color $c_j = [c_R, c_G, c_B]$
    • $e_j$ : embedding for each point $v_j$
    • $l_j^S$ : semantic label ---> $L^S$ : set of semantic labels of $V$
    • $l_j^I$ : instance label ---> $L^I$ : set of instance labels of $V$
  • Joint semantic-instance segmentation of point cloud $V$ by minimizing Energy function
  • MV-CRF : treating instance and semantic labels equally as unknown → optimizing together (minimizing E)
  • Energy function $E$ = ➀+➁+➂+➃+➄
    • Physical constraints (eg. surface smoothness, geometric proximity) & Semantic constraints (ex. shape consistency, object class and instances) in both Semantic and Instance labeling
      image
    • ➀ : Unary potential defined over semantic labels, given by the classification score of MT-PNet for $v_j$
    • ➁ : Pairwise potential for the same object class, given by the classification scores of both $v_j$ and $v_k$
    • ➂ : Unary potential defined over instance labels → PULL same-instance <-> PUSH different-instance embeddings
    • ➃ : Pairwise potential of instance labels → Geometric properties of surfaces in object instances
      • defined as a mixture of Gaussians of the locations, normals, and colors of vertices $v_j$ and $v_k$
    • ➄ : Couples the semantic-based potentials with the instance-based potentials → Consistency bw semantic and instance labels
      • defined from mutual information, via the frequency with which semantic label $s$ occurs among vertices whose instance label is $i$
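Putting ➀-➄ together, the energy has the general form below (a hedged reconstruction; the exact weights and conditioning used in the paper may differ):

$$E(L^S, L^I \mid V) = \sum_{j} \psi_u(l_j^S) + \sum_{j<k} \psi_p(l_j^S, l_k^S) + \sum_{j} \phi_u(l_j^I) + \sum_{j<k} \phi_p(l_j^I, l_k^I) + \sum_{j} \varphi(l_j^S, l_j^I)$$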

[Code] Dense CRF

/////////////////////////////////
/////  Pairwise Potentials  /////
/////////////////////////////////
void DenseCRF::addPairwiseEnergy (const MatrixXf & features, LabelCompatibility * function, KernelType kernel_type, NormalizationType normalization_type) {
	assert( features.cols() == N_ );
	addPairwiseEnergy( new PairwisePotential( features, function, kernel_type, normalization_type ) );
}
void DenseCRF::addPairwiseEnergy ( PairwisePotential* potential ){
	pairwise_.push_back( potential );
}
void DenseCRF2D::addPairwiseGaussian ( float sx, float sy, LabelCompatibility * function, KernelType kernel_type, NormalizationType normalization_type ) {
	MatrixXf feature( 2, N_ );
	for( int j=0; j<H_; j++ )
		for( int i=0; i<W_; i++ ){
			feature(0,j*W_+i) = i / sx;
			feature(1,j*W_+i) = j / sy;
		}
	addPairwiseEnergy( feature, function, kernel_type, normalization_type );
}
void DenseCRF2D::addPairwiseBilateral ( float sx, float sy, float sr, float sg, float sb, const unsigned char* im, LabelCompatibility * function, KernelType kernel_type, NormalizationType normalization_type ) {
	MatrixXf feature( 5, N_ );
	for( int j=0; j<H_; j++ )
		for( int i=0; i<W_; i++ ){
			feature(0,j*W_+i) = i / sx;
			feature(1,j*W_+i) = j / sy;
			feature(2,j*W_+i) = im[(i+j*W_)*3+0] / sr;
			feature(3,j*W_+i) = im[(i+j*W_)*3+1] / sg;
			feature(4,j*W_+i) = im[(i+j*W_)*3+2] / sb;
		}
	addPairwiseEnergy( feature, function, kernel_type, normalization_type );
}
//////////////////////////////
/////  Unary Potentials  /////
//////////////////////////////
void DenseCRF::setUnaryEnergy ( UnaryEnergy * unary ) {
	if( unary_ ) delete unary_;
	unary_ = unary;
}
void DenseCRF::setUnaryEnergy( const MatrixXf & unary ) {
	setUnaryEnergy( new ConstUnaryEnergy( unary ) );
}
void  DenseCRF::setUnaryEnergy( const MatrixXf & L, const MatrixXf & f ) {
	setUnaryEnergy( new LogisticUnaryEnergy( L, f ) );
}
/////////////////////////////////////
/////  Higher Order Potentials  /////
/////////////////////////////////////
void DenseCRF::addHigherOrderEnergy( const VectorXs & cliques, float weight ) {
	if( higher_order_ ) delete higher_order_;
	higher_order_ = new HigherOrderPotential( cliques, weight );
}
///////////////////////
/////  Inference  /////
///////////////////////
void expAndNormalize ( MatrixXf & out, const MatrixXf & in ) {
	out.resize( in.rows(), in.cols() );
	for( int i=0; i<out.cols(); i++ ){
		VectorXf b = in.col(i);
		b.array() -= b.maxCoeff();
		b = b.array().exp();
		out.col(i) = b / b.array().sum();
	}
}
void sumAndNormalize( MatrixXf & out, const MatrixXf & in, const MatrixXf & Q ) {
	out.resize( in.rows(), in.cols() );
	for( int i=0; i<in.cols(); i++ ){
		VectorXf b = in.col(i);
		VectorXf q = Q.col(i);
		out.col(i) = b.array().sum()*q - b;
	}
}
MatrixXf DenseCRF::inference ( int n_iterations ) const {
	MatrixXf Q( M_, N_ ), tmp1, unary( M_, N_ ), tmp2, tmp3;
	unary.fill(0);
	if( unary_ )
		unary = unary_->get();
	expAndNormalize( Q, -unary );

	VectorXi mask(N_);
	// for (int i = 0; i < N_; ++i)
	// 	mask[i] = (Q.col(i).maxCoeff() > 0.8f);

	for( int it=0; it<n_iterations; it++ ) {
		tmp1 = -unary;

		// Higher-order Potts model
		if( higher_order_ ) {
			higher_order_->apply( tmp3, Q, mask );
			tmp1 -= tmp3;
		}

		for( unsigned int k=0; k<pairwise_.size(); k++ ) {
			pairwise_[k]->apply( tmp2, Q );
			tmp1 -= tmp2;
		}
		expAndNormalize( Q, tmp1 );
	}
	return Q;
}
VectorXs DenseCRF::map ( int n_iterations ) const {
	// Run inference
	MatrixXf Q = inference( n_iterations );
	// Find the map
	return currentMap( Q );
}
///////////////////
/////  Debug  /////
///////////////////
VectorXf DenseCRF::unaryEnergy(const VectorXs & l) {
	assert( l.cols() == N_ );
	VectorXf r( N_ );
	r.fill(0.f);
	if( unary_ ) {
		MatrixXf unary = unary_->get();

		for( int i=0; i<N_; i++ )
			if ( 0 <= l[i] && l[i] < M_ )
				r[i] = unary( l[i], i );
	}
	return r;
}
VectorXf DenseCRF::pairwiseEnergy(const VectorXs & l, int term) {
	assert( l.cols() == N_ );
	VectorXf r( N_ );
	r.fill(0.f);

	if( term == -1 ) {
		for( unsigned int i=0; i<pairwise_.size(); i++ )
			r += pairwiseEnergy( l, i );
		return r;
	}

	MatrixXf Q( M_, N_ );
	// Build the current belief [binary assignment]
	for( int i=0; i<N_; i++ )
		for( int j=0; j<M_; j++ )
			Q(j,i) = (l[i] == j);
	pairwise_[ term ]->apply( Q, Q );
	for( int i=0; i<N_; i++ )
		if ( 0 <= l[i] && l[i] < M_ )
			r[i] =-0.5*Q(l[i],i );
		else
			r[i] = 0;
	return r;
}
MatrixXf DenseCRF::startInference() const{
	MatrixXf Q( M_, N_ );
	Q.fill(0);

	// Initialize using the unary energies
	if( unary_ )
		expAndNormalize( Q, -unary_->get() );
	return Q;
}
void DenseCRF::stepInference( MatrixXf & Q, MatrixXf & tmp1, MatrixXf & tmp2 ) const{
	tmp1.resize( Q.rows(), Q.cols() );
	tmp1.fill(0);
	if( unary_ )
		tmp1 -= unary_->get();

	// Add up all pairwise potentials
	for( unsigned int k=0; k<pairwise_.size(); k++ ) {
		pairwise_[k]->apply( tmp2, Q );
		tmp1 -= tmp2;
	}

	// Exponentiate and normalize
	expAndNormalize( Q, tmp1 );
}
VectorXs DenseCRF::currentMap( const MatrixXf & Q ) const{
	VectorXs r(Q.cols());
	// Find the map
	for( int i=0; i<N_; i++ ){
		int m;
		Q.col(i).maxCoeff( &m );
		r[i] = m;
	}
	return r;
}

// Compute the KL-divergence of a set of marginals
double DenseCRF::klDivergence( const MatrixXf & Q ) const {
	double kl = 0;
	// Add the entropy term
	for( int i=0; i<Q.cols(); i++ )
		for( int l=0; l<Q.rows(); l++ )
			kl += Q(l,i)*log(std::max( Q(l,i), 1e-20f) );
	// Add the unary term
	if( unary_ ) {
		MatrixXf unary = unary_->get();
		for( int i=0; i<Q.cols(); i++ )
			for( int l=0; l<Q.rows(); l++ )
				kl += unary(l,i)*Q(l,i);
	}

	// Add all pairwise terms
	MatrixXf tmp;
	for( unsigned int k=0; k<pairwise_.size(); k++ ) {
		pairwise_[k]->apply( tmp, Q );
		kl += (Q.array()*tmp.array()).sum();
	}
	return kl;
}

// Gradient computations
double DenseCRF::gradient( int n_iterations, const ObjectiveFunction & objective, VectorXf * unary_grad, VectorXf * lbl_cmp_grad, VectorXf * kernel_grad) const {
	// Run inference
	std::vector< MatrixXf > Q(n_iterations+1);
	MatrixXf tmp1, unary( M_, N_ ), tmp2;
	unary.fill(0);
	if( unary_ )
		unary = unary_->get();
	expAndNormalize( Q[0], -unary );
	for( int it=0; it<n_iterations; it++ ) {
		tmp1 = -unary;
		for( unsigned int k=0; k<pairwise_.size(); k++ ) {
			pairwise_[k]->apply( tmp2, Q[it] );
			tmp1 -= tmp2;
		}
		expAndNormalize( Q[it+1], tmp1 );
	}

	// Compute the objective value
	MatrixXf b( M_, N_ );
	double r = objective.evaluate( b, Q[n_iterations] );
	sumAndNormalize( b, b, Q[n_iterations] );

	// Compute the gradient
	if(unary_grad && unary_)
		*unary_grad = unary_->gradient( b );
	if( lbl_cmp_grad )
		*lbl_cmp_grad = 0*labelCompatibilityParameters();
	if( kernel_grad )
		*kernel_grad = 0*kernelParameters();

	for( int it=n_iterations-1; it>=0; it-- ) {
		// Do the inverse message passing
		tmp1.fill(0);
		int ip = 0, ik = 0;
		// Add up all pairwise potentials
		for( unsigned int k=0; k<pairwise_.size(); k++ ) {
			// Compute the pairwise gradient expression
			if( lbl_cmp_grad ) {
				VectorXf pg = pairwise_[k]->gradient( b, Q[it] );
				lbl_cmp_grad->segment( ip, pg.rows() ) += pg;
				ip += pg.rows();
			}
			// Compute the kernel gradient expression
			if( kernel_grad ) {
				VectorXf pg = pairwise_[k]->kernelGradient( b, Q[it] );
				kernel_grad->segment( ik, pg.rows() ) += pg;
				ik += pg.rows();
			}
			// Compute the new b
			pairwise_[k]->applyTranspose( tmp2, b );
			tmp1 += tmp2;
		}
		sumAndNormalize( b, tmp1.array()*Q[it].array(), Q[it] );

		// Add the gradient
		if(unary_grad && unary_)
			*unary_grad += unary_->gradient( b );
	}
	return r;
}
VectorXf DenseCRF::unaryParameters() const {
	if( unary_ )
		return unary_->parameters();
	return VectorXf();
}
void DenseCRF::setUnaryParameters( const VectorXf & v ) {
	if( unary_ )
		unary_->setParameters( v );
}
VectorXf DenseCRF::labelCompatibilityParameters() const {
	std::vector< VectorXf > terms;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		terms.push_back( pairwise_[k]->parameters() );
	int np=0;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		np += terms[k].rows();
	VectorXf r( np );
	for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
		r.segment( i, terms[k].rows() ) = terms[k];
		i += terms[k].rows();
	}
	return r;
}
void DenseCRF::setLabelCompatibilityParameters( const VectorXf & v ) {
	std::vector< int > n;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		n.push_back( pairwise_[k]->parameters().rows() );
	int np=0;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		np += n[k];

	for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
		pairwise_[k]->setParameters( v.segment( i, n[k] ) );
		i += n[k];
	}
}
VectorXf DenseCRF::kernelParameters() const {
	std::vector< VectorXf > terms;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		terms.push_back( pairwise_[k]->kernelParameters() );
	int np=0;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		np += terms[k].rows();
	VectorXf r( np );
	for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
		r.segment( i, terms[k].rows() ) = terms[k];
		i += terms[k].rows();
	}
	return r;
}
void DenseCRF::setKernelParameters( const VectorXf & v ) {
	std::vector< int > n;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		n.push_back( pairwise_[k]->kernelParameters().rows() );
	int np=0;
	for( unsigned int k=0; k<pairwise_.size(); k++ )
		np += n[k];

	for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
		pairwise_[k]->setKernelParameters( v.segment( i, n[k] ) );
		i += n[k];
	}
}

3.3. Variational Inference

  • Optimization problem : Minimizing $E$ = Maximizing posterior conditional $p$ (intractable with naive implementation)
  • Mean field Variational Inference to solve optimization problem

Code Implementation
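The C++ inference loop above boils down to the following mean-field update, shown here as a minimal NumPy sketch for clarity (a hypothetical re-statement, not the repository's implementation; `pairwise_terms` stands in for the Gaussian/bilateral message-passing kernels):

import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_inference(unary, pairwise_terms, n_iterations=10):
    """unary: [M, N] energies (M labels, N points);
    pairwise_terms: list of callables Q -> [M, N] message-passing outputs."""
    Q = softmax(-unary, axis=0)                  # initialize from the unary energies
    for _ in range(n_iterations):
        msg = -unary.copy()
        for apply_pairwise in pairwise_terms:    # subtract each pairwise message
            msg = msg - apply_pairwise(Q)
        Q = softmax(msg, axis=0)                 # exponentiate and normalize (expAndNormalize)
    return Q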

[CV_CNN] Deep Residual Learning for Image Recognition

Deep Residual Learning for Image Recognition

Abstract

  • Residual learning framework to ease training of networks that are substantially deeper
  • Residual networks are easier to optimize and can gain acc from increased depth
  • 152 layers ResNet (8x deeper than VGGNet) : deeper but still having lower complexity
  • Ensemble model : 3.57% top-5 error on ImageNet -> 1st place on ILSVRC 2015 classification task
  • Generalization performance on other recognition tasks (Object detection and Segmentation task)

1. Introduction

  • DNN for image classification (visual recognition task)
    • Integrating low/mid/high level features <- levels can be enriched by stacking layers
    • VGG, GoogLeNet : showed that Network depth is important
  • Possible problems of stacking many layers
    • Vanishing/exploding gradients problem : can be solved by normalized initialization (ex. He), SGD, ..
    • Overfitting (Variance ↑ + Bias ↓) : low train error but high test error
    • Degradation problem : with deeper models, train and test error both ↑
      • Error still decreases as training proceeds → not Vanishing gradients; train error itself is also higher (not only test error) → not Overfitting
      • Solution (not the only one) : Deep Residual Learning Network (ResNet)
  • Residual Learning
    • H(x) : Desired(Original) mapping
    • F(x) := H(x)-x : Residual mapping
    • Output = F(x)+x = H(x)
    • (Extreme assumption) If an identity mapping were optimal (H=x), residual to zero (F=0) is easier
  • Shortcut connections : +x
    • Their outputs are added to the outputs of stacked layers
    • Simply perform Identity mapping
      • No extra params and computational complexity
      • End-to-end by SGD with backprop, Easy implementation
  • Experiments
    • ImageNet -> ResNet is easy to optimize & deeper net gets higher accuracy
    • CIFAR-10 -> Similar phenomena are shown -> showing that generalization for other datasets
    • Generalization performance on other recognition tasks (Object detection and Segmentation task)

2. Related Work

Residual Representations

Shortcut Connections

  • Previous models : GoogLeNet, highway networks, ...
  • ResNet : Always learns residual functions (Identity shortcuts are always opened)

3. Deep Residual Learning

3.1. Residual Learning

  • Residual Learning
    • H(x) : Desired(Original) mapping
    • F(x) := H(x)-x : Residual mapping
    • Output = F(x)+x = H(x)
    • (Extreme assumption) If an identity mapping were optimal (H=x), residual to zero (F=0) is easier
      • Both H(x) and F(x) can approximate the desired functions, but F(x) is easier to train
    • (In real cases) Identity mappings are unlikely optimal, but reformulation helps to precondition problem
      • If the optimal function is closer to identity mapping than to zero mapping, it is easier to find perturbations with reference to an identity mapping than to learn a new function

3.2. Identity Mapping by Shortcuts

  • Definition of a building block :
    image
    • x and y : input and output vectors
    • F : Residual mapping to be learned
  • Operation F + x : performed by a shortcut connection and element-wise addition
  • Shortcut connections : +x
    • Their outputs are added to the outputs of stacked layers
    • Simply perform identity mapping
      • No extra params and computational complexity
      • End-to-end by SGD with backprop, Easy implementation
  • 2 types of shortcut connections
    • Dimensions of input(x) and output(F) must be equal -> identity mapping
    • If not(=when changing in/output channels) -> linear projection W_s to match dims
      image
  • F can represent multiple conv layers
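A minimal PyTorch sketch of this building block (hedged: close to the standard torchvision BasicBlock, shown only to illustrate y = F(x) + x; the optional `downsample` projection plays the role of W_s):

import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers with an identity (or projected) shortcut: y = F(x) + x."""
    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample   # 1x1 conv projection (option B) or None for identity

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # addition happens before the final ReLU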

3.3 Network Architectures

Plain Network

  • Plain baselines are mainly inspired by VGGNets
    • 3x3 filter size for all conv layers
    • # of filters is same for the same output feature map size
    • # of filters is doubled if the feature map size is halved to preserve time complexity per layer
    • Downsampling by stride = 2 of conv layers (No Pooling layers) to match in/output dim
  • Fewer filters (params) and lower complexity than VGGNets

Residual Network

  • Plain Network + Shortcut Connections
  • Black lines : Identity shortcuts can be directly used when in/output same dimension
  • Dotted lines : 2 options to match in/output dimensions
    • (A) Identity mapping with extra zero entries padded for increasing dimensions (No extra params)
    • (B) Projection shortcuts by 1x1 conv
    • Both (A) and (B), when shortcuts go across feature maps of two sizes, stride = 2

3.4. Implementation

(1) Training

  • Data Pre-processing
    • Image Rescale : with shorter side randomly sampled in [256, 480] for augmentation
    • Random crop 224 x 224
    • Random horizontal flip
    • Standard color augmentation
  • Train Details
    • Batch Normalization right after each conv and before activation
    • Weight initialization & Train all plain/residual nets from scratch
    • SGD with a mini-batch size : 256
      • Learning rate : 0.1 -> divided by 10 when the error plateaus; trained for up to 60 x 10^4 iterations
      • Weight decay : 0.0001
      • Momentum : 0.9
    • No Dropout

(2) Testing

  • Standard 10-crop testing for comparison studies
  • For best results, fully-convolutional form -> average scores at multiple scales

4. Experiments

4.1. ImageNet Classification

  • Dataset : ImageNet 2012 classification dataset (1000 classes / 1.28M train + 50K val + 100K test)
  • Eval both top-1 and top-5 error rates

Plain Networks

image

  • 18, 34, 50, 101, 152-layer Networks => kernel_size = 3
  • Ex) 18-layer conv2_x : conv1 -> BN1 -> ReLU -> conv2 -> BN2 --+) shortcut --> ReLU
  • Degradation problem : Deeper(34-layer) plain net has higher training error than shallower(18-layer) plain net
    image
    • No vanishing gradient (neither forward nor backward signals vanish)
    • May be exponentially low convergence rates

Residual Networks

  • 18-layer and 34-layer ResNet : same baseline arch with plain nets + a shortcut connection (to each pair of 3x3 filters)
  • (option A) Identity mapping for all shortcuts and Zero-padding for increasing dimensions
    • Deeper(34-layer) ResNet has lower training error than shallower(18-layer) ResNet -> Solving Degradation problem
    • 34-layer ResNet reduces top-1 error by 3.5% -> Effectiveness of residual learning on deep systems
    • 18-layer ResNet converges faster than 18-layer plain net -> ResNet eases optimization by faster convergence at early stage
      image
      image

Identity vs. Projection Shortcuts

  • (option A) All Identity mapping shortcuts and zero-padding are used for increasing dim
  • (option B) Projection shortcuts are used for increasing dim & Others are Identity mapping
  • (option C) All Projection shortcuts
    image
  • All 3 options are better than plain counterpart
  • B is slightly better than A : zero-padded dims have no residual learning
  • C (all projection) is marginally better, but the differences among the 3 options are very small
  • Identity shortcuts are preferred mainly so as not to increase the complexity of the Bottleneck architecture!

Deeper Bottleneck Architectures

image

  • Structure : A stack of 3 layers (1x1 → 3x3 → 1x1) instead of 2 layers
    • 1x1 conv Bottleneck layers for deeper nets (50+)
      • For reducing and then increasing(restoring) dimensions
      • For leaving the 3x3 layer as a bottleneck with smaller input/output dimensions
    • Identity shortcuts : parameter-free -> more efficient
  • Results
    • 152-layer ResNet still has lower complexity than VGGNet-16/19
    • Deeper(50/101/152-layer) ResNets are more accurate than shallower(34-layer) ResNet -> Solving degradation problem & great acc gains from increased depth
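A minimal sketch of the 1x1 -> 3x3 -> 1x1 bottleneck block described above (hedged: mirrors the torchvision-style Bottleneck with expansion 4, for illustration only):

import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore (x4), with an identity or projection shortcut."""
    expansion = 4

    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)                  # reduce dims
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)    # restore dims
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)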

Comparisons with SOTA

  • 152-layer ResNet single model outperforms all previous ensemble results
  • Ensemble 6 models of different depth : 3.57% top-5 error -> 1st place in ILSVRC 2015

4.2. CIFAR-10 and Analysis

  • Dataset : CIFAR-10 dataset (10 classes / 45K train + 5K val + 10K test)
  • Network architectures
    image
    • Network input : 32x32 imgs
    • Total (6n+2) stacked weighted layers
      • 1st layer : 3x3 conv layer
      • A stack of 6n 3x3 conv layers on feature maps of sizes {32, 16, 8} with 2n layers for each
      • GAP -> 10-way fc layer -> softmax
  • Shortcut Connections
    • connected to the pairs of 3x3 layers (totally 3n shortcuts)
    • (option A) All Identity shortcuts
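
A quick check of the 6n+2 depth formula above; the listed n values reproduce the CIFAR depths discussed later (e.g. n=18 → 110 layers, n=200 → 1202 layers).

# 1 initial 3x3 conv + 6n stacked 3x3 convs (three stages of 2n each) + 1 fc layer
for n in [3, 5, 7, 9, 18, 200]:
    print(n, 6 * n + 2)   # -> 20, 32, 44, 56, 110, 1202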

(1) Training

  • Data Pre-processing
    • Data augmentation : 4 pixels are padded on each side
    • Random crop 32x32 sampled from the padded img or horizontal flip
  • Train Details
    • Mini-batch size : 128 on 2 GPUs
      • Weight decay : 0.0001
      • Momentum : 0.9
      • Learning rate : 0.1 -> divided by 10 at 32k and 48k iterations, training terminated at 64k iterations
    • Weight initialization
    • Batch Normalization
    • No dropout

(2) Testing

  • Only eval the single view of the original 32x32 img

(3) Results

  • Similar to ImageNet cases
  • 110-layer ResNet (n=18)
    • initial lr = 0.01 to warm up -> go back to 0.1 and continue training
    • Converges well & fewer params than other deep networks (FitNet, Highway, etc.)

Analysis of Layer Responses

image

  • ResNets have generally smaller responses than plain counterparts
    • Residual functions might be generally closer to zero (F=0) than non-residual functions
  • Deeper ResNet has smaller magnitudes of responses

Exploring Over 1000 layers

  • 1202-layer ResNet (n=200)
    • No optimization difficulty & Training error < 0.1%
    • BUT, small dataset + overly deep network -> overfitting (worse test error than the 110-layer net)
    • No strong regularization (maxout/dropout) is used -> only the simple regularization implied by the deep and thin architecture

4.3 Object Detection on PASCAL and MS COCO

  • Good generalization performance on other recognition tasks (detection, localization, segmentation)

Code Review

[CV_Action Recognition] Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition

Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition (2s-AGCN)

  • Previous model (ST-GCN) : modeling human body skeleton as spatiotemporal graphs
    • Topology of graph is set manually and fixed over all layers and samples
  • Proposed model (2s-AGCN) : modeling both 1st and 2nd information simultaneously
    • Topology of graph is uniformly or individually learned in E2E
    • More informative by using hierarchical GCN and diverse samples
    • Result : flexibility & generality ↑ ⇒ better than SOTA

Paper Review

1. Introduction

  • Disadvantages of ST-GCN

    • (1) Topology is fixed over all layers ⇒ lacking flexibility to model multilevel semantic information
    • (2) Feature vector attached to each joint only contains 1st info (2D or 3D coordinates)
    • (3) Skeleton graph is heuristically predefined and represents only physical structure of body
      ⇒ hard to capture dependencies between physically distant joints such as the 'two hands'
    • (4) One fixed graph structure is not optimal for all samples of different actions
      ⇒ the connection strength between the hands and the head differs between 'touching head' and 'jumping up'
  • Contributions of 2s-AGCN

    • (1) Adaptively learn topology of graph for different layers and samples in E2E
    • (2) Feature vector pointing from source joint to target joint contains 2nd info (lengths and directions of bones)
      ⇒ 2nd info is formulated and combined with 1st info using two-stream framework
    • (3) SOTA on two large-scale datasets
  • Two types of graphs in 2s-AGCN (Data-driven method)

    • Global graph : for common pattern for all the data
    • Individual graph : for unique pattern for each data
    • Both are optimized individually for different layers

3. Graph Convolution Networks

3.1. Graph construction

  • Raw data in one frame : sequence of vectors (each vector = coordinates of joint)
  • A complete action consists of multiple frames; the number of frames differs across samples
  • Following structure of ST-GCN (spatiotemporal graph to model structured information) #36
image

3.2. Graph convolution

  • Configs
    • Multiple layers of ST-GCN to extract high-level features
    • GAP layer & Softmax classifier to predict action categories
  • Graph convolution operation on vertex $v_i$ in spatial dimension
image
    • $B_i$ : sampling area enclosed by curve (1-distance neighbor vertexes) → # of vertexes in $B_i$ is varied
    • Kernel size 3, $B_i$ divided into 3 subsets
      • $S_{i1}$ : vertex itself (red circle)
      • $S_{i2}$ : centripetal subset (green circle) ; closer to center of gravity
      • $S_{i3}$ : centrifugal subset (blue circle) ; farther from center of gravity
    • $Z_{ij}$ : cardinality of $S_{ik}$ for balancing contribution of each subset
    • $w$ : weighting function based on input → # of weight vectors is fixed
    • $l_i$ : mapping function

3.3. Implementation

  • [Spatial dimension] shape of feature map = $C$ x $T$ x $N$ tensor
image
    • $C$ : channels #, $T$ : temporal length, $N$ : vertexes #
    • $K_v$ : kernel size of spatial dimension (=3)
    • $A_k$ : adjacency matrix ( $N$ x $N$ ) → Whether there are connections bw two vertexes
    • $W_k$ : weight vector ( $C_{out}$ x $C_{in}$ x 1 x 1 )
    • $M_k$ : attention map (mask) for importance of each vertex ( $N$ x $N$ ) → Strength of connections
  • [Temporal dimension] neighbors for each vertex = fixed as 2 (two consecutive frames)
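
A hedged PyTorch sketch of the spatial graph convolution above: for each of the $K_v$ partitions, the input tensor (B, C, T, N) is aggregated along the vertex axis with $A_k$ ⊙ $M_k$ and then transformed by a 1x1 convolution $W_k$; normalization of $A_k$ is assumed to have been applied beforehand.

import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """f_out = sum_k W_k( f_in aggregated by (A_k * M_k) ), with each W_k a 1x1 conv."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer('A', A)                  # (K, N, N) pre-normalized adjacency matrices
        self.M = nn.Parameter(torch.ones_like(A))     # learnable attention mask M_k
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1) for _ in range(A.size(0))])

    def forward(self, x):                             # x: (B, C, T, N)
        out = 0
        for k, conv in enumerate(self.convs):
            Ak = self.A[k] * self.M[k]                # element-wise importance weighting
            xk = torch.einsum('bctv,vw->bctw', x, Ak) # gather features from connected vertexes
            out = out + conv(xk)
        return out

# e.g. A = torch.rand(3, 25, 25); y = SpatialGraphConv(3, 64, A)(torch.rand(8, 3, 300, 25))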

4. Two-Stream Adaptive Graph Convolutional Network

Adaptive Graph Convolutional Layer

image
  • Unique graph for different layers and samples (Flexibility)
  • 1x1 residual branch for matching channel dimensions (Stability)
  • Adaptive graph form
    • $A_k$ : original normalized adjacency matrix ( $N$ x $N$ ) → human body physical structure
    • $B_k$ : trainable data-driven adjacency matrix ( $N$ x $N$ ) → existence and strength of connections (attention)
    • $C_k$ : normalized embedded Gaussian function ( $θ$, $φ$ ) → similarity of two vertexes, equipped with softmax
      image
      image
image
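
A hedged sketch of how the adaptive adjacency of one partition can be formed: $B_k$ is a freely trainable N x N matrix, and $C_k$ is the softmax-normalized embedded-Gaussian similarity computed from the 1x1-conv embeddings $θ$ and $φ$; the embedding channel size is an assumption.

import torch
import torch.nn as nn

class AdaptiveAdjacency(nn.Module):
    """Builds A_k + B_k + C_k for one partition k of one layer."""
    def __init__(self, in_channels, embed_channels, A_k):
        super().__init__()
        self.register_buffer('A_k', A_k)                 # fixed physical graph (N x N)
        self.B_k = nn.Parameter(torch.zeros_like(A_k))   # data-driven, fully learnable
        self.theta = nn.Conv2d(in_channels, embed_channels, 1)
        self.phi = nn.Conv2d(in_channels, embed_channels, 1)

    def forward(self, x):                                # x: (B, C, T, N)
        B, C, T, N = x.shape
        q = self.theta(x).permute(0, 3, 1, 2).reshape(B, N, -1)   # (B, N, C_e*T)
        k = self.phi(x).reshape(B, -1, N)                         # (B, C_e*T, N)
        C_k = torch.softmax(torch.bmm(q, k), dim=-1)              # row-normalized similarity
        return self.A_k + self.B_k + C_k                          # broadcast to (B, N, N)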

Adaptive Graph Convolutional Block

image
  • Convs : spatial GCN
  • Convt : temporal GCN
  • BN (Batch Normalization), ReLU, Dropout(0.5)
  • Residual connection for each block

Adaptive Graph Convolutional Network

image
  • AGCN = stack of 9 basic blocks (output channels : 64, 64, 64, 128, 128, 128, 256, 256, 256)
  • BN at beginning to normalize input data
  • GAP at end to pool feature maps of different samples to same size
  • Softmax classifier to obtain final output prediction

Two-stream networks

image
  • J-stream (1st, Joint information)
  • B-stream (2nd, Bone information)
    • $v_2$ : Target joint (far away from center)
    • $v_1$ : Source joint (close to center)
    • $e_{v_1, v_2}$ = $v_2$ - $v_1$ = ( $x_2-x_1$, $y_2-y_1$, $z_2-z_1$ ) : Bone vector (length & direction information)
  • Steps
    • 1st. Calculate bone data based on joint data
    • 2nd. Feed each type of data into its own stream
    • 3rd. Fuse each softmax score and Predict final action
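
A small NumPy sketch of the two-stream preprocessing and late fusion described in the steps above; the bone_pairs edge list is hypothetical (it depends on the skeleton layout of the dataset).

import numpy as np

# joints: (T, N, 3) array of joint coordinates over T frames
# bone_pairs: hypothetical list of (source_idx, target_idx) skeleton edges
def joints_to_bones(joints, bone_pairs):
    bones = np.zeros_like(joints)
    for src, tgt in bone_pairs:
        bones[:, tgt] = joints[:, tgt] - joints[:, src]   # e_{v1,v2} = v2 - v1
    return bones

def fuse_scores(score_joint, score_bone):
    # sum the softmax scores of the two streams and pick the action class
    return np.argmax(score_joint + score_bone)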

5. Experiments

5.1. Datasets

5.2. Training details

5.3. Ablation Study

  • Block
  • Visualization
  • Two-stream framework
  • Comparison with SOTA
  • Conclusion

Code Review

[CV_Segmentation] Fully Convolutional Networks for Semantic Segmentation

Abstract

  • Convnet : Powerful visual models
  • Fully Convnet (FCN)
    • Input of arbitrary size -> Producing corresponding-sized Output => Spatially dense prediction
    • Use contemporary classification nets (AlexNet, VGGnet, GoogLeNet) into FCN
    • Transfer learned representations by fine-tuning for segmentation task
    • Skip arch : Semantic info (from deep, coarse layer) + Appearance info (from shallow, fine layer)
    • SOTA segmentation of PASCAL VOC, NYUDv2, SIFT Flow (inference in less than 1/5 sec for a typical img)

1. Introduction

(1) Prior methods

  • CNN : whole classification / local tasks (bbox object detection, part and key-point prediction, local correspondence)
  • CNN for Semantic segmentation : (Prior approach) labeling each pixel with the class of its enclosing object (dense per-pixel prediction, rather than one prediction for the whole img)
  • Patchwise training : lack efficiency of fully convolutional training, high computation
  • Pixelwise training : high computation (no pooling), no hierarchical features
  • Use of pre/post-processing (ex. superpixels, proposals, post-hoc refinement by random fields or local classifiers)
  • Applying small convnets without supervised pre-training

(2) Semantic segmentation ... 4.2 Combining what and where

  • Inherent tension (trade-off) bw Semantics (What / Global / Coarse) and Location (Where / Local / Fine)
  • => In FCN, the skip architecture alleviates this tension

(3) FCN

  • Architecture : Encoder CNN + Decoder (=upsampling) --> Segmentation task
    • Convolutionalization (FC->Conv) : Any size input → same size (spatial dimension) dense output
    • Upsampling layer (Deconv layer) : Pixelwise prediction and learning with subsampled pooling
    • Skip architecture : Semantic info from deep, coarse layer + Appearance info from shallow, fine layer
  • Training / Test
    • End-to-end training
    • Supervised pre-training by fine-tuned CNNs (AlexNet, VGG, GoogLeNet) into FCN
    • Performed whole img at a time by dense feedforward computation and backpropagation
  • No use of pre/post-processing
    image

2. Related work

Fully Convolutional Networks

  • FCN for Detection
    • Matan : extending convnet to any sized inputs (LeNet to recognize 1d strings)
    • Wolf and Platt : expanding convnet outputs to 2d maps (four corners of postal address blocks)
  • Ning : convnet for coarse multiclass Segmentation
  • He : feature extractor (proposals + spatial pooling) hybrid model -> no end-to-end
  • Sliding window detection, Semantic segmentation, image restoration

Dense prediction with convnets

  • Semantic segmentation, boundary prediction, hybrid convnet/nearest neighbor model, image restoration, depth estimation
  • Common machinery elements of the above approaches : patchwise training / post-processing / input shifting and output interlacing / multi-scale pyramid processing / saturating tanh nonlinearities / ensembles / ...

3. Fully convolutional networks

  • Each layer in a Convnet : h x w x d (h x w : spatial dim, d : feature or color channel dim)

    • Locations in higher layers correspond to the locations in the input img they are path-connected to = their Receptive fields
  • Translation invariance

    • Even when the position of the input shifts, the output takes the same value (invariant to location)
    • CNN basic components(convolution, pooling, activation function) operate on local input regions & depend only on relative spatial coordinates
      • x_ij : Data vector at location (i, j) in a particular layer
      • y_ij : Output vector at location (i, j) in a following layer
        • image
        • k : kernel size, s : stride or subsampling factor
        • f_ks : layer type (ex. convolution, average pooling, max pooling, activation function, etc)
      • Transformation rule of maintaining upper functional form
        image
  • CNN -> FCN

    • When receptive fields overlap significantly with conv filter (ex. stride=1), feedforward computation and backprop are more efficient with layer-by-layer over entire img instead of patch-by-patch (individual CNNs)
      image
    • Produce coarse output maps
    • -> Need to connect these coarse outputs back to pixels for pixelwise (dense) prediction (... BY Skip arch in FCN)

3.1 Adapting classifiers for dense prediction [Encoder] ... 4.1 From classifier to dense FCN

[Figure 2]
image

  • Typical CNN : Fixed sized inputs & Non-spatial outputs (no spatial coordinates)
    • Fixed sized inputs : FC layers require a fixed input size (a fixed number of neurons)
    • Non-spatial outputs (no spatial coordinates) : FC layers flatten the feature map to 1D
  • FCN : Any sized inputs & Spatial outputs (heatmap)
    • Computation is highly amortized over the overlapping regions of patches
      • Segmenting with FC-layer AlexNet requires fixed-size input patches
      • Segmenting with an FCN allows any input size, so patch-by-patch processing is unnecessary and inference is faster
    • Both backward + forward : Straightforward -> Computation efficiency of convolution
    • Output dimensions are reduced by Subsampling (e.g., adjusting the stride)
      • Subsampling to keep filters small (3x3) & computational requirements reasonable
      • Coarsen output of FCN
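
A hedged PyTorch sketch of the convolutionalization step: the weights of a fully connected layer that acted on a fixed 7x7 feature map are copied into an equivalent 7x7 convolution, so the same computation now slides over inputs of any size and yields a spatial heatmap (the 512x7x7 / 4096 sizes are VGG-style assumptions).

import torch
import torch.nn as nn

def fc_to_conv(fc, in_channels, spatial):
    """Copy an nn.Linear's weights into an equivalent convolution."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=spatial)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_channels, spatial, spatial)
    conv.bias.data = fc.bias.data
    return conv

# e.g. a VGG-style fc6: Linear(512*7*7 -> 4096) becomes Conv2d(512, 4096, kernel_size=7);
# applied to a larger feature map it now outputs a coarse spatial heatmap.
fc6 = nn.Linear(512 * 7 * 7, 4096)
conv6 = fc_to_conv(fc6, in_channels=512, spatial=7)
heatmap = conv6(torch.randn(1, 512, 14, 14))   # -> shape (1, 4096, 8, 8)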

3.2 Shift-and-stitch is filter rarefaction ( x )

image

  • coarse outputs -> dense prediction BY stitching output from shifted versions of input
  • trick for shift-and-stitch
    • setting lower layer input stride 1 -> upsampling its output by factor of input stride s
    • positions divisible by the stride receive the upsampled values, the rest are filled with 0
    • however, this alone does not give the same result as shift-and-stitch (the filters must also be modified accordingly)

3.3 Upsampling is backwards strided convolution ( O ) ... 4.2 Combining what and where

  • (Bilinear) Interpolation
    • linear mapping that depends only on relative positions -> fixed value
    • fill in unknown values from the known neighboring values
  • Backward strided convolution (Deconvolution)
    • Reverse forward and backward passes of convolution
    • Upsampling can be performed end-to-end learning by backprop from pixelwise loss
    • Deconv filter need not be fixed, can be learned!
    • A stack of Deconv and Activation func -> Nonlinear upsampling
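
A hedged sketch of the "deconv filters initialized to bilinear upsampling and then learned" idea (see also section 4.3): a standard bilinear kernel is constructed and copied into a learnable ConvTranspose2d; the per-class grouped layout and the 21 PASCAL classes are assumptions.

import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Per-channel 2D bilinear upsampling kernel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, 1, kernel_size, kernel_size), dtype=np.float32)
    weight[:, 0] = filt
    return torch.from_numpy(weight)

# 2x upsampling of a 21-channel score map; weights start as bilinear but remain learnable
upsample = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1,
                              groups=21, bias=False)
upsample.weight.data.copy_(bilinear_kernel(21, 4))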

3.4 Patchwise training is loss sampling ( x )

  • Sampling Patchwise training ( x )
    • can correct class imbalance and mitigate the spatial correlation of dense patches
    • BUT heavy overlap bw patches increases redundant computation
    • no faster or better convergence observed
  • FCN (Whole image training) ( O )
    • also can correct class imbalance by weighting the loss AND address spatial correlation
    • more effective and efficient
  • Experiment result
    • Sampling : No significant effect on convergence rate, but significantly more time due to large # of imgs per batch --> Unsampled, whole img training !

4. Segmentation Architecture

[Figure 3]
image

  • Skip architecture between layers to fuse coarse, semantic, local, appearance information
  • Investigation : PASCAL VOC 2011

4.1 From classifier to dense FCN

  • Used CNN models : AlexNet, VGG16, GoogLeNet
  • Discarding the final classifier layer and Converting all FC to Conv (1 x 1 x 21 for PASCAL)
  • Upsampling coarse outputs by deconvolution to dense outputs
  • Result : FCN-VGG16 (SOTA) >> FCN-GoogLeNet (similar classification acc with VGG16)
    image

4.2 Combining what and where

  • Fully convolutionalized classifiers : can be fine-tuned to segmentation
    • BUT the output is dissatisfyingly coarse (limiting the scale of detail in the upsampled output)
    • The more pooling layers the features pass through, the more fine information is lost -> upsampling from those features alone is poor (fine objects are especially hard to capture)
  • FCN + Skip Arch : combine final prediction layer + lower layers with finer strides
    • combination of coarse layers + fine layers -> global + local prediction
    • Combine feature maps from earlier layers (which have passed through fewer poolings), then upsample from the combination
      image

image
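
A hedged sketch of an FCN-16s-style skip fusion: the coarse final score map is 2x-upsampled by a learnable deconvolution and summed with a 1x1-conv score computed from pool4, then brought back to input resolution. For brevity the final 16x step uses plain bilinear interpolation here, whereas the paper learns another deconvolution; the 512/21 channel counts are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21
score_pool4 = nn.Conv2d(512, num_classes, 1)        # score the finer pool4 features
up2 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1, bias=False)

def fcn16s_head(score_final, pool4_feat):
    """score_final: (B, 21, H/32, W/32), pool4_feat: (B, 512, H/16, W/16)."""
    fused = up2(score_final) + score_pool4(pool4_feat)          # combine coarse + fine scores
    return F.interpolate(fused, scale_factor=16, mode='bilinear',
                         align_corners=False)                   # back to input resolution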

4.3 Experimental framework

  • Optimization
    • Optimizer : SGD
    • Momentum = 0.9
    • Weight decay = 5^-4 or 2^-4
    • Mini-batch size of 20 imgs
    • Fixed lr = 10^-3, 10^-4, 5^-5 for FCN-AlexNet, FCN-VGG16, FCN-GoogLeNet
    • Zero initialization class scoring layer
    • Dropout : same with original CNN
  • Fine-tuning
    • Fine-tune all layers by backprop through whole net
    • Fine-tune the output classifier layer alone : 70% of full fine-tuning performance
    • Scratch training : not feasible
  • Dense Prediction (Upsampling)
    • Upsampling by Deconvolution layers within the net
    • Final deconv layer : deconv filters are fixed to bilinear interpolation
    • Intermediate deconv layers : are initialized to bilinear upsampling and learned
  • Augmentation
    • Randomly mirroring (Horizontal Flip)
    • Jittering by translating up to 32 pixels (the coarsest scale of prediction)
    • Result : No noticeable improvement
  • More Training data
    • PASCAL VOC 2011 segmentation training set (labels for 1112 imgs) + 8498 labels by Hariharan
    • Result : improve FCN-VGG16 validation score by 3.4 points to 59.4 mean IU

5. Results

  • Metric : pixel accuracy, mean accuracy, mean IU, frequency weighted IU
  • FCN-8s on PASCAL VOC 2011 and 2012

[CV_3D] MVSNet: Depth Inference for Unstructured Multi-view Stereo

MVSNet: Depth Inference for Unstructured Multi-view Stereo

Paper Review

Abstract

  • MVSNet : E2E DL model for depth map inference from multi-view imgs
    • (1) Extract deep visual img features
    • (2) Build 3D Cost Volume upon reference camera frustum via differentiable homography warping
    • (3) Apply 3D conv to regularize and regress initial Depth Map → Refine with reference img
  • Handles an arbitrary number of N-view inputs by using a variance-based metric that maps multiple features into one cost feature
  • Experiments
    • Outperforms SOTA on the DTU dataset & faster in runtime → benchmarking
    • Ranks first on the T&T dataset without fine-tuning → strong generalization

Introduction

  • Multi-View Stereo (MVS) : estimating dense representation from overlapping imgs
  • Traditional methods
    • How : using hand-crafted similarity metrics & engineered regularizations
    • Limitation : dense matching intractable for global semantic information (ex. low-textured, specular, reflective region) → incomplete reconstruction
  • Learnable CNN-based methods for 2-view stereo matching
    • Resolves the global semantic information problem above
    • How : for 2-view stereo, image pairs can be rectified in advance, so horizontal pixel-wise disparity estimation is possible even without camera params
    • Limitation : in MVS, input imgs may come with arbitrary camera geometry, which makes such learning methods hard to apply
  • Learnable CNN-based methods for MVS recon
    • Rarely attempted, since the above limitation makes MVS a poor fit for CNNs
    • Ex. SurfaceNet using CVC (Color Voxel Cubes), LSM (Learned Stereo Machine)
    • Limitation : because they use volumetric representations over regular grids, the huge memory consumption of 3D volumes makes the networks hard to scale up (long runtimes, OR only synthetic objects at low volume resolution)
  • MVSNet
    • How : computing one depth map at a time (not the whole 3D scene at once)
    • Input : one reference img and several source imgs → to infer depth map for reference img
    • Key insight : Differentiable homography warping operation
      • to encode camera geometries implicitly to build 3D Cost Volumes from 2D img features
    • Next step : Multi-scale 3D conv
      • to regularize and regress initial Depth Map → Refine with reference img
    • Major differences
      • 3D Cost Volume is built upon camera frustum instead of regular Euclidean space
      • Decoupled MVS recon to smaller problems of per-view depth map estimation → large-scale recon possible!

Related work

MVS Reconstruction

(Categorized by output representation)

  • Direct Point Cloud recon : operates directly on 3D points → sequential propagation makes it hard to fully parallelize, long runtime
  • Volumetric recon : divides 3D space into a regular grid, then estimates whether each voxel is attached to the surface → space discretization error, high memory consumption
  • Depth map recon : splits the task into small per-view estimation problems that focus on only one reference img and a few source imgs, + easily fused into point-cloud or volumetric recon

Learned Stereo

DL models begin to replace traditional stereo methods!

  • Pair-wise patch matching
    • DL network to match two img patches
    • Learned features for stereo matching and semi-global matching(SGM) for post-processing
  • Cost regularization
    • SGMNet, CNN-CRF, GCNet
    • GCNet (SOTA) : E2E model that regularizes the cost volume with a 3D CNN and regresses the disparity

Learned MVS

Fewer attempts ...

  • Multi-patch similarity (new metric for MVS)
    • SurfaceNet : computes the cost volume with sophisticated voxel-wise view selection → regularizes it with a 3D CNN and infers the surface voxels
    • LSM : camera parameters are encoded as projections for the cost volume → a 3D CNN classifies whether each voxel belongs to the surface
    • But both are limited to small-scale recon because of the volumetric representation

MVSNet

image

(1) Image Feature Extraction

  • Goal : To extract deep features $F$ of the N input imgs $I$
  • 2D Network : 8-layer 2D CNN
    • layer = Conv + BN + ReLU except for last layer
    • layer 1,2 & 4,5 : extract higher-level representation
    • layer 3 & layer 6 : s=2 → divide feature towers into 3 scales (original input size, 1/2, 1/4)
  • Output : N 32-channel feature maps, downsized by 4 in each dimension
    • the original neighboring information of each remaining pixel is already encoded in the 32-channel pixel descriptor → no worry about losing useful context information for dense matching
  • Ablation study : recon quality is much better when dense matching is done on the extracted feature maps than on the original imgs
    image
class UniNetDS2(Network):
    """Simple UniNet, as described in the paper."""

    def setup(self):
        print ('2D with 32 filters')
        base_filter = 8
        (self.feed('data')
        .conv_bn(3, base_filter, 1, center=True, scale=True, name='conv0_0')
        .conv_bn(3, base_filter, 1, center=True, scale=True, name='conv0_1')
        .conv_bn(5, base_filter * 2, 2, center=True, scale=True, name='conv1_0')
        .conv_bn(3, base_filter * 2, 1, center=True, scale=True, name='conv1_1')
        .conv_bn(3, base_filter * 2, 1, center=True, scale=True, name='conv1_2')
        .conv_bn(5, base_filter * 4, 2, center=True, scale=True, name='conv2_0')
        .conv_bn(3, base_filter * 4, 1, center=True, scale=True, name='conv2_1')
        .conv(3, base_filter * 4, 1, biased=False, relu=False, name='conv2_2'))

### model.py -> def inference
    # image feature extraction    
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

(2) Cost Volume

  • Goal : To build 3D Cost Volume from extracted feature maps and input cameras
  • How : builds the cost volume upon the reference camera frustum instead of dividing space into a regular grid
  • Notations
    • $I_1$ : reference img → $F_1$ : reference feature map
    • $I_i$ (i=2~N) : source imgs → $F_i$ : feature map
    • ${K_i, R_i, t_i}$ (i=1~N) : camera intrinsics, rotations, translations
    • $n_1$ : principle axis of reference camera

Differentiable Homography

  • Warp all feature maps $F$ into N feature volumes $V$ (onto different fronto-parallel planes of the reference camera)
  • Coordinate mapping from warped $V_i(d)$ to $F_i$ at $d$ By planar transformation $x'$ ~ $H_i(d)*x$
    • ~ : projective equality
    • $H_i(d)$ : 3x3 Homography matrix bw i-th feature map $F_i$ and reference feature map $F_1$ at depth $d$
      image
  • ⇔ Classical plane-sweeping stereo + differentiable bilinear interpolation to sample pixels from the feature maps (not from the imgs)
  • Differentiable warping operation : connects the 2D feature extraction and the 3D regularization network → E2E depth map inference!
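
A minimal NumPy sketch of the plane-sweep homography $H_i(d)$ defined above; the exact sign/direction convention of the translation term is an assumption here and should be checked against the paper's equation.

import numpy as np

def plane_sweep_homography(K1, R1, t1, Ki, Ri, ti, n1, d):
    """Homography mapping reference-view pixels to view i for the fronto-parallel
    plane at depth d (sign conventions are assumptions, not verified against the paper)."""
    plane = np.eye(3) - np.outer(t1 - ti, n1) / d
    return Ki @ Ri @ plane @ R1.T @ np.linalg.inv(K1)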

Cost Metric : Variance-based Metric $M$

  • Notations
    • $W$(img width), $H$(img height), $D$(depth sample #), $F$(feature map channel #)
    • Feature volume size : $V$ = $W$/4 * $H$/4 * $D$ * $F$
    • $\overline{V_i}$ : Average volume of all feature volumes
  • Mapping : N feature volumes $V_i$ → one cost volume $C$
    image
  • Matching cost
    • Traditional MVS methods : pairwise costs bw refer img and all src imgs in heuristic way
    • MVSNet : all views contribute equally to matching cost & no preference to refer img
  • Mean vs Variance
    • Prior research using Mean operation : infer multi-patch similarity with additional pre- and post- CNN layers
    • MVSNet using Variance operation : measure multi-view feature difference explicitly
### model.py -> def inference
    # build cost volume by differentiable homography
    with tf.name_scope('cost_volume_homography'):
        depth_costs = []
        for d in range(depth_num):
            # compute cost (variation metric)
            ave_feature = ref_tower.get_output()
            ave_feature2 = tf.square(ref_tower.get_output())
            for view in range(0, FLAGS.view_num - 1):
                homography = tf.slice(view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
                warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
                ave_feature = ave_feature + warped_view_feature
                ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
            ave_feature = ave_feature / FLAGS.view_num
            ave_feature2 = ave_feature2 / FLAGS.view_num
            cost = ave_feature2 - tf.square(ave_feature)
            depth_costs.append(cost)
        cost_volume = tf.stack(depth_costs, axis=1)

Cost Volume Regularization

  • What : raw Cost volume $C$ → regulated Probability volume $P$
  • Why : $C$ is computed from raw img features and thus at risk of noise contamination → needs to be integrated with smoothness constraints
  • How : Multi-scale 3D CNN (4-scale network)
    • ≒ 3D Unet encoder-decoder structure (aggregating neighboring information from large receptive field)
    • +) To reduce computation, the number of channels is reduced (32→8) and the number of conv layers per scale is reduced (3→2)
  • Output : 1-channel volume → softmax operation along depth direction for probability normalization
  • Usages : per-pixel depth estimation, measuring estimation confidence
    => determining recon quality by probability distribution, outlier filtering
class RegNetUS0(Network):
    """network for regularizing 3D cost volume in a encoder-decoder style. Keeping original size."""

    def setup(self):
        print ('Shallow 3D UNet with 8 channel input')
        base_filter = 8
        (self.feed('data')
        .conv_bn(3, base_filter * 2, 2, center=True, scale=True, name='3dconv1_0')
        .conv_bn(3, base_filter * 4, 2, center=True, scale=True, name='3dconv2_0')
        .conv_bn(3, base_filter * 8, 2, center=True, scale=True, name='3dconv3_0'))

        (self.feed('data')
        .conv_bn(3, base_filter, 1, center=True, scale=True, name='3dconv0_1'))

        (self.feed('3dconv1_0')
        .conv_bn(3, base_filter * 2, 1, center=True, scale=True, name='3dconv1_1'))

        (self.feed('3dconv2_0')
        .conv_bn(3, base_filter * 4, 1, center=True, scale=True, name='3dconv2_1'))

        (self.feed('3dconv3_0')
        .conv_bn(3, base_filter * 8, 1, center=True, scale=True, name='3dconv3_1')
        .deconv_bn(3, base_filter * 4, 2, center=True, scale=True, name='3dconv4_0'))

        (self.feed('3dconv4_0', '3dconv2_1')
        .add(name='3dconv4_1')
        .deconv_bn(3, base_filter * 2, 2, center=True, scale=True, name='3dconv5_0'))

        (self.feed('3dconv5_0', '3dconv1_1')
        .add(name='3dconv5_1')
        .deconv_bn(3, base_filter, 2, center=True, scale=True, name='3dconv6_0'))

        (self.feed('3dconv6_0', '3dconv0_1')
        .add(name='3dconv6_1')
        .conv(3, 1, 1, biased=False, relu=False, name='3dconv6_2'))

(3) Depth Map

image
Initial Estimation

  • What : regulated Probability volume $P$ → inferred Depth map $D$
  • How : Expectation value along depth direction = Probability weighted sum over all depth hypothesis
    = Soft argmin → fully differentiable operation that approximates the argmax
    image
    • $P(d)$ : probability estimation for all pixels at depth $d$
    • $d$ : depth hypothesis uniformly sampled within [ $d_{min}$ , $d_{max}$ ]
  • Output : depth map (same size to 2D img feature maps = 1/4 size of input img)

Probability Map

  • Why (observation) : the multi-scale 3D CNN regularizes the probability toward a single mode, but for falsely matched pixels the distribution stays scattered and cannot concentrate at one peak
  • Definition : the quality of a depth estimate $\hat{d}$ = the probability that the GT depth lies within a small range around the estimate
  • How : Probability sum over 4 nearest depth hypothesis to measure estimation quality
  • Effect : better depth map filtering, outlier filtering
def get_propability_map(cv, depth_map, depth_start, depth_interval):
    """ get probability map from cost volume """

    def _repeat_(x, num_repeats):
        """ repeat each element num_repeats times """
        x = tf.reshape(x, [-1])
        ones = tf.ones((1, num_repeats), dtype='int32')
        x = tf.reshape(x, shape=(-1,1))
        x = tf.matmul(x, ones)
        return tf.reshape(x, [-1])

    shape = tf.shape(depth_map)
    batch_size = shape[0]
    height = shape[1]
    width = shape[2]
    depth = tf.shape(cv)[1]

    # byx coordinate, batched & flattened
    b_coordinates = tf.range(batch_size)
    y_coordinates = tf.range(height)
    x_coordinates = tf.range(width)
    b_coordinates, y_coordinates, x_coordinates = tf.meshgrid(b_coordinates, y_coordinates, x_coordinates)
    b_coordinates = _repeat_(b_coordinates, batch_size)
    y_coordinates = _repeat_(y_coordinates, batch_size)
    x_coordinates = _repeat_(x_coordinates, batch_size)

    # d coordinate (floored and ceiled), batched & flattened
    d_coordinates = tf.reshape((depth_map - depth_start) / depth_interval, [-1])
    d_coordinates_left0 = tf.clip_by_value(tf.cast(tf.floor(d_coordinates), 'int32'), 0, depth - 1)
    d_coordinates_left1 = tf.clip_by_value(d_coordinates_left0 - 1, 0, depth - 1)
    d_coordinates1_right0 = tf.clip_by_value(tf.cast(tf.ceil(d_coordinates), 'int32'), 0, depth - 1)
    d_coordinates1_right1 = tf.clip_by_value(d_coordinates1_right0 + 1, 0, depth - 1)

    # voxel coordinates
    voxel_coordinates_left0 = tf.stack(
        [b_coordinates, d_coordinates_left0, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_left1 = tf.stack(
        [b_coordinates, d_coordinates_left1, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_right0 = tf.stack(
        [b_coordinates, d_coordinates1_right0, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_right1 = tf.stack(
        [b_coordinates, d_coordinates1_right1, y_coordinates, x_coordinates], axis=1)

    # get probability image by gathering and interpolation
    prob_map_left0 = tf.gather_nd(cv, voxel_coordinates_left0)
    prob_map_left1 = tf.gather_nd(cv, voxel_coordinates_left1)
    prob_map_right0 = tf.gather_nd(cv, voxel_coordinates_right0)
    prob_map_right1 = tf.gather_nd(cv, voxel_coordinates_right1)
    prob_map = prob_map_left0 + prob_map_left1 + prob_map_right0 + prob_map_right1
    prob_map = tf.reshape(prob_map, [batch_size, height, width, 1])

    return prob_map


### model.py -> def inference
    # depth map by softArgmin
    with tf.name_scope('soft_arg_min'):
        # probability volume by soft max
        probability_volume = tf.nn.softmax(
            tf.scalar_mul(-1, filtered_cost_volume), axis=1, name='prob_volume')
        # depth image by soft argmin
        volume_shape = tf.shape(probability_volume)
        soft_2d = []
        for i in range(FLAGS.batch_size):
            soft_1d = tf.linspace(depth_start[i], depth_end[i], tf.cast(depth_num, tf.int32))
            soft_2d.append(soft_1d)
        soft_2d = tf.reshape(tf.stack(soft_2d, axis=0), [volume_shape[0], volume_shape[1], 1, 1])
        soft_4d = tf.tile(soft_2d, [1, 1, volume_shape[2], volume_shape[3]])
        estimated_depth_map = tf.reduce_sum(soft_4d * probability_volume, axis=1)
        estimated_depth_map = tf.expand_dims(estimated_depth_map, axis=3)

    # probability map
    prob_map = get_propability_map(probability_volume, estimated_depth_map, depth_start, depth_interval)

    return estimated_depth_map, prob_map # filtered_depth_map, probability_volume

Depth Map Refinement

  • Why : the large receptive field causes over-smoothing at reconstruction boundaries
  • How : the reference img contains boundary information, so it is used as guidance for refinement
    • MVSNet + Depth residual learning network
      • Pre-scale the initial depth magnitude to [0, 1] → scale back after refinement (to avoid bias toward a certain depth scale)
      • Input : the initial depth map & the resized reference img concatenated into a 4-channel input
      • → the depth residual is learned through three 32-channel 2D conv layers and one 1-channel conv layer
      • Last layer : no BN layer and no ReLU, so that negative residuals can be learned
class RefineNet(Network):
    """network for depth map refinement using original image."""

    def setup(self):

        (self.feed('color_image', 'depth_image')
        .concat(axis=3, name='concat_image'))

        (self.feed('concat_image')
        .conv_bn(3, 32, 1, name='refine_conv0')
        .conv_bn(3, 32, 1, name='refine_conv1')
        .conv_bn(3, 32, 1, name='refine_conv2')
        .conv(3, 1, 1, relu=False, name='refine_conv3'))

        (self.feed('refine_conv3', 'depth_image')
        .add(name='refined_depth_image'))

## model.py
def depth_refine(init_depth_map, image, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ refine depth image with the image """

    # normalization parameters
    depth_shape = tf.shape(init_depth_map)
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval
    depth_start_mat = tf.tile(tf.reshape(
        depth_start, [depth_shape[0], 1, 1, 1]), [1, depth_shape[1], depth_shape[2], 1])
    depth_end_mat = tf.tile(tf.reshape(
        depth_end, [depth_shape[0], 1, 1, 1]), [1, depth_shape[1], depth_shape[2], 1])
    depth_scale_mat = depth_end_mat - depth_start_mat

    # normalize depth map (to 0~1)
    init_norm_depth_map = tf.div(init_depth_map - depth_start_mat, depth_scale_mat)

    # resize normalized image to the same size of depth image
    resized_image = tf.image.resize_bilinear(image, [depth_shape[1], depth_shape[2]])

    # refinement network
    if is_master_gpu:
        norm_depth_tower = RefineNet({'color_image': resized_image, 'depth_image': init_norm_depth_map},
                                        is_training=True, reuse=False)
    else:
        norm_depth_tower = RefineNet({'color_image': resized_image, 'depth_image': init_norm_depth_map},
                                        is_training=True, reuse=True)
    norm_depth_map = norm_depth_tower.get_output()

    # denormalize depth map
    refined_depth_map = tf.multiply(norm_depth_map, depth_scale_mat) + depth_start_mat

    return refined_depth_map

Loss Function

image

  • Losses for both estimated (initial & refined) depth maps are considered
  • Mean absolute difference bw GT and Estimated depth map
  • Considering only pixels with valid GT depth map labels (Not whole img)
  • Notations
    • $p_{valide}$ : set of valid GT pixels
    • $d(p)$ : GT depth value of pixel $p$
    • $\hat{d_i}(p)$ : Initial depth estimation
    • $\hat{d_r}(p)$ : Refined depth map estimation
    • $λ$ = 1.0
def non_zero_mean_absolute_diff(y_true, y_pred, interval):
    """ non zero mean absolute loss for one batch """
    with tf.name_scope('MAE'):
        shape = tf.shape(y_pred)
        interval = tf.reshape(interval, [shape[0]])
        mask_true = tf.cast(tf.not_equal(y_true, 0.0), dtype='float32')
        denom = tf.reduce_sum(mask_true, axis=[1, 2, 3]) + 1e-7
        masked_abs_error = tf.abs(mask_true * (y_true - y_pred))            # 4D
        masked_mae = tf.reduce_sum(masked_abs_error, axis=[1, 2, 3])        # 1D
        masked_mae = tf.reduce_sum((masked_mae / interval) / denom)         # 1
    return masked_mae

def mvsnet_regression_loss(estimated_depth_image, depth_image, depth_interval):
    """ compute loss and accuracy """
    # non zero mean absolute loss
    masked_mae = non_zero_mean_absolute_diff(depth_image, estimated_depth_image, depth_interval)
    # less one accuracy
    less_one_accuracy = less_one_percentage(depth_image, estimated_depth_image, depth_interval)
    # less three accuracy
    less_three_accuracy = less_three_percentage(depth_image, estimated_depth_image, depth_interval)

    return masked_mae, less_one_accuracy, less_three_accuracy

Implementations

Training

Data Preparation

  • DTU dataset (GT point clouds with normal information) + generated GT depth maps
    • DTU dataset : large-scale MVS dataset containing 100+ scenes with different lighting conditions
    • Point cloud with normal information → Mesh by SPSR → Depth maps by rendering mesh to each viewpoint
      • SPSR(screened Poisson surface reconstruction) : depth-of-tree = 11 (to acquire high quality mesh result)
      • Mesh trimming-factor = 9.5 (to alleviate mesh artifacts)
  • 49 imgs with 7 different lighting conditions for each scan => Total # of training samples : 27097

View Selection

  • Training img : Reference img + 2 Source imgs
  • Downsize imgs in feature extraction → downsize img resolution from 1600x1200 to 800x600 for 3D regularization → crop a W=640, H=512 img patch from the center => since the img resolution changed, the input camera parameters were adjusted accordingly
  • Depth hypotheses are uniformly sampled from [425mm ~ 935mm] with 2mm resolution
  • Environment : TensorFlow, Tesla P100
  • 100,000 iterations

Post-processing

image

Depth Map Filter

  • Goal : To filter out outliers at background and occluded areas before converting depth value to dense point clouds
  • Criteria : Photometric consistency & Geometric consistency
    • Photometric consistency : measuring matching quality
      • (Experiment) Pixels with probability lower than 0.8 = Outliers
    • Geometric consistency : measuring depth consistency among multiple view
      • the reference pixel and the corresponding pixel in another view are projected and reprojected with their respective depths, and must satisfy a consistency condition
      • (Experiment) All depths should be at least 3-view consistent
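
A hedged NumPy sketch of the two filtering criteria above: the photometric mask keeps pixels whose probability exceeds 0.8, and a greatly simplified geometric mask keeps pixels whose depth agrees with at least 3 other views. It assumes the depths reprojected from the other views have already been resampled onto the reference grid, and the 1% relative-difference threshold is an assumption.

import numpy as np

def filter_depth(depth_ref, prob_ref, reprojected_depths,
                 prob_thresh=0.8, rel_thresh=0.01, min_views=3):
    """depth_ref, prob_ref: (H, W); reprojected_depths: (V, H, W) from V other views."""
    photo_mask = prob_ref > prob_thresh                              # photometric consistency
    rel_diff = np.abs(reprojected_depths - depth_ref) / np.maximum(depth_ref, 1e-8)
    geo_mask = (rel_diff < rel_thresh).sum(axis=0) >= min_views      # geometric consistency
    return photo_mask & geo_mask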

Depth Map Fusion

  • Goal : To integrate depth maps from different views to a unified pc representation
  • Visibility-based fusion → minimize depth occlusions, violations
  • In the filtering step, the visible views for each pixel are selected, and all reprojected depths are averaged → suppresses recon noise
  • The fused depth maps are reprojected into space to generate the 3D point cloud

Experiments

Benchmarking on DTU dataset

  • MVSNet outperforms all methods in both the completeness & overall quality with a significant margin
    image

Generalization on T&T dataset

  • Using MVSNet trained on DTU without any fine-tuning
    image

Ablations

  • View Number
  • Image Features
  • Cost Metric
  • Depth Refinement

Conclusion

  • MVSNet : an E2E DL network that takes unstructured imgs as input and estimates a depth map for the reference img
  • Core contribution of MVSNet : encoding camera parameters as a differentiable homography to build the cost volume upon the camera frustum → connecting 2D feature extraction and 3D cost regularization
  • Results : outperforms on DTU & efficient in speed / SOTA on T&T without fine-tuning → generalization ability

Code Review

## model.py
def get_propability_map(cv, depth_map, depth_start, depth_interval):
    """ get probability map from cost volume """

    def _repeat_(x, num_repeats):
        """ repeat each element num_repeats times """
        x = tf.reshape(x, [-1])
        ones = tf.ones((1, num_repeats), dtype='int32')
        x = tf.reshape(x, shape=(-1,1))
        x = tf.matmul(x, ones)
        return tf.reshape(x, [-1])

    shape = tf.shape(depth_map)
    batch_size = shape[0]
    height = shape[1]
    width = shape[2]
    depth = tf.shape(cv)[1]

    # byx coordinate, batched & flattened
    b_coordinates = tf.range(batch_size)
    y_coordinates = tf.range(height)
    x_coordinates = tf.range(width)
    b_coordinates, y_coordinates, x_coordinates = tf.meshgrid(b_coordinates, y_coordinates, x_coordinates)
    b_coordinates = _repeat_(b_coordinates, batch_size)
    y_coordinates = _repeat_(y_coordinates, batch_size)
    x_coordinates = _repeat_(x_coordinates, batch_size)

    # d coordinate (floored and ceiled), batched & flattened
    d_coordinates = tf.reshape((depth_map - depth_start) / depth_interval, [-1])
    d_coordinates_left0 = tf.clip_by_value(tf.cast(tf.floor(d_coordinates), 'int32'), 0, depth - 1)
    d_coordinates_left1 = tf.clip_by_value(d_coordinates_left0 - 1, 0, depth - 1)
    d_coordinates1_right0 = tf.clip_by_value(tf.cast(tf.ceil(d_coordinates), 'int32'), 0, depth - 1)
    d_coordinates1_right1 = tf.clip_by_value(d_coordinates1_right0 + 1, 0, depth - 1)

    # voxel coordinates
    voxel_coordinates_left0 = tf.stack(
        [b_coordinates, d_coordinates_left0, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_left1 = tf.stack(
        [b_coordinates, d_coordinates_left1, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_right0 = tf.stack(
        [b_coordinates, d_coordinates1_right0, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_right1 = tf.stack(
        [b_coordinates, d_coordinates1_right1, y_coordinates, x_coordinates], axis=1)

    # get probability image by gathering and interpolation
    prob_map_left0 = tf.gather_nd(cv, voxel_coordinates_left0)
    prob_map_left1 = tf.gather_nd(cv, voxel_coordinates_left1)
    prob_map_right0 = tf.gather_nd(cv, voxel_coordinates_right0)
    prob_map_right1 = tf.gather_nd(cv, voxel_coordinates_right1)
    prob_map = prob_map_left0 + prob_map_left1 + prob_map_right0 + prob_map_right1
    prob_map = tf.reshape(prob_map, [batch_size, height, width, 1])

    return prob_map

def inference(images, cams, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ infer depth image from multi-view images and cameras """

    # dynamic gpu params
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction    
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                        depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)

    # build cost volume by differentiable homography
    with tf.name_scope('cost_volume_homography'):
        depth_costs = []
        for d in range(depth_num):
            # compute cost (variation metric)
            ave_feature = ref_tower.get_output()
            ave_feature2 = tf.square(ref_tower.get_output())
            for view in range(0, FLAGS.view_num - 1):
                homography = tf.slice(view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
				# warped_view_feature = homography_warping(view_towers[view].get_output(), homography)
                warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
                ave_feature = ave_feature + warped_view_feature
                ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
            ave_feature = ave_feature / FLAGS.view_num
            ave_feature2 = ave_feature2 / FLAGS.view_num
            cost = ave_feature2 - tf.square(ave_feature)
            depth_costs.append(cost)
        cost_volume = tf.stack(depth_costs, axis=1)

    # filtered cost volume, size of (B, D, H, W, 1)
    if is_master_gpu:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=False)
    else:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=True)
    filtered_cost_volume = tf.squeeze(filtered_cost_volume_tower.get_output(), axis=-1)

    # depth map by softArgmin
    with tf.name_scope('soft_arg_min'):
        # probability volume by soft max
        probability_volume = tf.nn.softmax(
            tf.scalar_mul(-1, filtered_cost_volume), axis=1, name='prob_volume')
        # depth image by soft argmin
        volume_shape = tf.shape(probability_volume)
        soft_2d = []
        for i in range(FLAGS.batch_size):
            soft_1d = tf.linspace(depth_start[i], depth_end[i], tf.cast(depth_num, tf.int32))
            soft_2d.append(soft_1d)
        soft_2d = tf.reshape(tf.stack(soft_2d, axis=0), [volume_shape[0], volume_shape[1], 1, 1])
        soft_4d = tf.tile(soft_2d, [1, 1, volume_shape[2], volume_shape[3]])
        estimated_depth_map = tf.reduce_sum(soft_4d * probability_volume, axis=1)
        estimated_depth_map = tf.expand_dims(estimated_depth_map, axis=3)

    # probability map
    prob_map = get_propability_map(probability_volume, estimated_depth_map, depth_start, depth_interval)

    return estimated_depth_map, prob_map#, filtered_depth_map, probability_volume

def inference_mem(images, cams, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ infer depth image from multi-view images and cameras """

    # dynamic gpu params
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval
    feature_c = 32
    feature_h = FLAGS.max_h / 4
    feature_w = FLAGS.max_w / 4

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction    
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    ref_feature = ref_tower.get_output()
    ref_feature2 = tf.square(ref_feature)

    view_features = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_features.append(view_tower.get_output())
    view_features = tf.stack(view_features, axis=0)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                        depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)
    view_homographies = tf.stack(view_homographies, axis=0)

    # build cost volume by differentiable homography
    with tf.name_scope('cost_volume_homography'):
        depth_costs = []

        for d in range(depth_num):
            # compute cost (standard deviation feature)
            ave_feature = tf.Variable(tf.zeros(
                [FLAGS.batch_size, feature_h, feature_w, feature_c]),
                name='ave', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
            ave_feature2 = tf.Variable(tf.zeros(
                [FLAGS.batch_size, feature_h, feature_w, feature_c]),
                name='ave2', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
            ave_feature = tf.assign(ave_feature, ref_feature)
            ave_feature2 = tf.assign(ave_feature2, ref_feature2)

            def body(view, ave_feature, ave_feature2):
                """Loop body."""
                homography = tf.slice(view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
                # warped_view_feature = homography_warping(view_features[view], homography)
                warped_view_feature = tf_transform_homography(view_features[view], homography)
                ave_feature = tf.assign_add(ave_feature, warped_view_feature)
                ave_feature2 = tf.assign_add(ave_feature2, tf.square(warped_view_feature))
                view = tf.add(view, 1)
                return view, ave_feature, ave_feature2

            view = tf.constant(0)
            cond = lambda view, *_: tf.less(view, FLAGS.view_num - 1)
            _, ave_feature, ave_feature2 = tf.while_loop(
                cond, body, [view, ave_feature, ave_feature2], back_prop=False, parallel_iterations=1)

            ave_feature = tf.assign(ave_feature, tf.square(ave_feature) / (FLAGS.view_num * FLAGS.view_num))
            ave_feature2 = tf.assign(ave_feature2, ave_feature2 / FLAGS.view_num - ave_feature)
            depth_costs.append(ave_feature2)
        cost_volume = tf.stack(depth_costs, axis=1)

    # filtered cost volume, size of (B, D, H, W, 1)
    if is_master_gpu:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=False)
    else:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=True)
    filtered_cost_volume = tf.squeeze(filtered_cost_volume_tower.get_output(), axis=-1)

    # depth map by softArgmin
    with tf.name_scope('soft_arg_min'):
        # probability volume by soft max
        probability_volume = tf.nn.softmax(tf.scalar_mul(-1, filtered_cost_volume),
                                           axis=1, name='prob_volume')

        # depth image by soft argmin
        volume_shape = tf.shape(probability_volume)
        soft_2d = []
        for i in range(FLAGS.batch_size):
            soft_1d = tf.linspace(depth_start[i], depth_end[i], tf.cast(depth_num, tf.int32))
            soft_2d.append(soft_1d)
        soft_2d = tf.reshape(tf.stack(soft_2d, axis=0), [volume_shape[0], volume_shape[1], 1, 1])
        soft_4d = tf.tile(soft_2d, [1, 1, volume_shape[2], volume_shape[3]])
        estimated_depth_map = tf.reduce_sum(soft_4d * probability_volume, axis=1)
        estimated_depth_map = tf.expand_dims(estimated_depth_map, axis=3)

    # probability map
    prob_map = get_propability_map(probability_volume, estimated_depth_map, depth_start, depth_interval)

    # return filtered_depth_map, 
    return estimated_depth_map, prob_map


def inference_prob_recurrent(images, cams, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ infer disparity image from stereo images and cameras """

    # dynamic gpu params
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction    
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                        depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)

    gru1_filters = 16
    gru2_filters = 4
    gru3_filters = 2
    feature_shape = [FLAGS.batch_size, FLAGS.max_h/4, FLAGS.max_w/4, 32]
    gru_input_shape = [feature_shape[1], feature_shape[2]]
    state1 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru1_filters])
    state2 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru2_filters])
    state3 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru3_filters])
    conv_gru1 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru1_filters)
    conv_gru2 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru2_filters)
    conv_gru3 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru3_filters)

    exp_div = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], 1])
    soft_depth_map = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], 1])

    with tf.name_scope('cost_volume_homography'):

        # forward cost volume
        depth_costs = []
        for d in range(depth_num):

            # compute cost (variation metric)
            ave_feature = ref_tower.get_output()
            ave_feature2 = tf.square(ref_tower.get_output())

            for view in range(0, FLAGS.view_num - 1):
                homography = tf.slice(
                    view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
                # warped_view_feature = homography_warping(view_towers[view].get_output(), homography)
                warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
                ave_feature = ave_feature + warped_view_feature
                ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
            ave_feature = ave_feature / FLAGS.view_num
            ave_feature2 = ave_feature2 / FLAGS.view_num 
            cost = ave_feature2 - tf.square(ave_feature) 
            
            # gru
            reg_cost1, state1 = conv_gru1(-cost, state1, scope='conv_gru1')
            reg_cost2, state2 = conv_gru2(reg_cost1, state2, scope='conv_gru2')
            reg_cost3, state3 = conv_gru3(reg_cost2, state3, scope='conv_gru3')
            reg_cost = tf.layers.conv2d(
                reg_cost3, 1, 3, padding='same', reuse=tf.AUTO_REUSE, name='prob_conv')
            depth_costs.append(reg_cost)
            
        prob_volume = tf.stack(depth_costs, axis=1)
        prob_volume = tf.nn.softmax(prob_volume, axis=1, name='prob_volume')

    return prob_volume

def inference_winner_take_all(images, cams, depth_num, depth_start, depth_end, 
                              is_master_gpu=True, reg_type='GRU', inverse_depth=False):
    """ infer disparity image from stereo images and cameras """

    if not inverse_depth:
        depth_interval = (depth_end - depth_start) / (tf.cast(depth_num, tf.float32) - 1)

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction    
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        if inverse_depth:
            homographies = get_homographies_inv_depth(ref_cam, view_cam, depth_num=depth_num,
                                depth_start=depth_start, depth_end=depth_end)
        else:
            homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                            depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)

    # gru unit
    gru1_filters = 16
    gru2_filters = 4
    gru3_filters = 2
    feature_shape = [FLAGS.batch_size, FLAGS.max_h/4, FLAGS.max_w/4, 32]
    gru_input_shape = [feature_shape[1], feature_shape[2]]
    state1 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru1_filters])
    state2 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru2_filters])
    state3 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru3_filters])
    conv_gru1 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru1_filters)
    conv_gru2 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru2_filters)
    conv_gru3 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru3_filters)

    # initialize variables
    exp_sum = tf.Variable(tf.zeros(
        [FLAGS.batch_size, feature_shape[1], feature_shape[2], 1]),
        name='exp_sum', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
    depth_image = tf.Variable(tf.zeros(
        [FLAGS.batch_size, feature_shape[1], feature_shape[2], 1]),
        name='depth_image', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
    max_prob_image = tf.Variable(tf.zeros(
        [FLAGS.batch_size, feature_shape[1], feature_shape[2], 1]),
        name='max_prob_image', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
    init_map = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], 1])

    # define winner take all loop
    def body(depth_index, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre):
        """Loop body."""

        # calculate cost 
        ave_feature = ref_tower.get_output()
        ave_feature2 = tf.square(ref_tower.get_output())
        for view in range(0, FLAGS.view_num - 1):
            homographies = view_homographies[view]
            homographies = tf.transpose(homographies, perm=[1, 0, 2, 3])
            homography = homographies[depth_index]
            # warped_view_feature = homography_warping(view_towers[view].get_output(), homography)
            warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
            ave_feature = ave_feature + warped_view_feature
            ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
        ave_feature = ave_feature / FLAGS.view_num
        ave_feature2 = ave_feature2 / FLAGS.view_num
        cost = ave_feature2 - tf.square(ave_feature)
        cost.set_shape([FLAGS.batch_size, feature_shape[1], feature_shape[2], 32])

        # gru
        reg_cost1, state1 = conv_gru1(-cost, state1, scope='conv_gru1')
        reg_cost2, state2 = conv_gru2(reg_cost1, state2, scope='conv_gru2')
        reg_cost3, state3 = conv_gru3(reg_cost2, state3, scope='conv_gru3')
        reg_cost = tf.layers.conv2d(
            reg_cost3, 1, 3, padding='same', reuse=tf.AUTO_REUSE, name='prob_conv')
        prob = tf.exp(reg_cost)

        # index
        d_idx = tf.cast(depth_index, tf.float32) 
        if inverse_depth:
            inv_depth_start = tf.div(1.0, depth_start)
            inv_depth_end = tf.div(1.0, depth_end)
            inv_interval = (inv_depth_start - inv_depth_end) / (tf.cast(depth_num, 'float32') - 1)
            inv_depth = inv_depth_start - d_idx * inv_interval
            depth = tf.div(1.0, inv_depth)
        else:
            depth = depth_start + d_idx * depth_interval
        temp_depth_image = tf.reshape(depth, [FLAGS.batch_size, 1, 1, 1])
        temp_depth_image = tf.tile(
            temp_depth_image, [1, feature_shape[1], feature_shape[2], 1])

        # update the best
        update_flag_image = tf.cast(tf.less(max_prob_image, prob), dtype='float32')
        new_max_prob_image = update_flag_image * prob + (1 - update_flag_image) * max_prob_image
        new_depth_image = update_flag_image * temp_depth_image + (1 - update_flag_image) * depth_image
        max_prob_image = tf.assign(max_prob_image, new_max_prob_image)
        depth_image = tf.assign(depth_image, new_depth_image)

        # update counter
        exp_sum = tf.assign_add(exp_sum, prob)
        depth_index = tf.add(depth_index, incre)

        return depth_index, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre
    
    # run forward loop
    exp_sum = tf.assign(exp_sum, init_map)
    depth_image = tf.assign(depth_image, init_map)
    max_prob_image = tf.assign(max_prob_image, init_map)
    depth_index = tf.constant(0)
    incre = tf.constant(1)
    cond = lambda depth_index, *_: tf.less(depth_index, depth_num)
    _, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre = tf.while_loop(
        cond, body
        , [depth_index, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre]
        , back_prop=False, parallel_iterations=1)

    # get output
    forward_exp_sum = exp_sum + 1e-7
    forward_depth_map = depth_image
    return forward_depth_map, max_prob_image / forward_exp_sum

def depth_refine(init_depth_map, image, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ refine depth image with the image """

    # normalization parameters
    depth_shape = tf.shape(init_depth_map)
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval
    depth_start_mat = tf.tile(tf.reshape(
        depth_start, [depth_shape[0], 1, 1, 1]), [1, depth_shape[1], depth_shape[2], 1])
    depth_end_mat = tf.tile(tf.reshape(
        depth_end, [depth_shape[0], 1, 1, 1]), [1, depth_shape[1], depth_shape[2], 1])
    depth_scale_mat = depth_end_mat - depth_start_mat

    # normalize depth map (to 0~1)
    init_norm_depth_map = tf.div(init_depth_map - depth_start_mat, depth_scale_mat)

    # resize normalized image to the same size of depth image
    resized_image = tf.image.resize_bilinear(image, [depth_shape[1], depth_shape[2]])

    # refinement network
    if is_master_gpu:
        norm_depth_tower = RefineNet({'color_image': resized_image, 'depth_image': init_norm_depth_map},
                                        is_training=True, reuse=False)
    else:
        norm_depth_tower = RefineNet({'color_image': resized_image, 'depth_image': init_norm_depth_map},
                                        is_training=True, reuse=True)
    norm_depth_map = norm_depth_tower.get_output()

    # denormalize depth map
    refined_depth_map = tf.multiply(norm_depth_map, depth_scale_mat) + depth_start_mat

    return refined_depth_map

Reference

[CV_Pose Estimation] Deep High-Resolution Representation Learning for Human Pose Estimation

Deep High-Resolution Representation Learning for Human Pose Estimation

Basic

  • Trade off : Global information vs High-resolution(Original size)
    • Global information (Receptive field ↑) -> Low resolution -> Up-sampling ↑ -> Pixel-wise prediction ↓
    • Need : Learning both Global + Local Feature & Recovering High-resolution

image

1. Introduction

  • Most existing method

    • Recover high-resolution from low-resolution
    • By high-to-low resolution network connected in Series
    • ex) Hourglass, SimpleBaseline, Dilated conv
  • High-Resolution Net (HR-net)

    • Maintain high-resolution through Whole process
    • First stage : a high-resolution subnetwork --> Next stage : Gradually add high-to-low resolution subnetworks
    • Repeated Multi-scale fusions by Parallel multi-resolution subnetworks : high-resolution representations are strengthened with the help of same-depth low-resolution representations
    • Result : rich high-resolution representations -> more accurate and spatially precise heatmap
    • Dataset : COCO keypoint detection dataset, MPII Human Pose dataset, PoseTrack dataset

2. Related Work

  • Traditional solutions to single-pose estimation : probabilistic graphical model, pictorial structure model
  • Present mainstream methods by DNN : Regressing keypoint positions & Estimating keypoint Heatmaps
    • Regressing (x, y) : ex) (2013) DeepPose : Human Pose Estimation via Deep Neural Networks
    • Estimating Heatmap [loc = (x, y)] : ex) (2015) Efficient Object Localization Using Convolutional Networks
  • Most CNN for keypoint heatmap
    • consist of subnetwork similar to classification network
    • input --> a regressor estimating heatmaps
    • main body : high-to-low and low-to-high framework, augmented with multi-scale fusion + intermediate supervision

image

  • (a) Hourglass : symmetric low-to-high and high-to-low
  • (b) Cascade pyramid networks
  • (c) SimpleBaseline : Transposed conv for low-to-high
  • (d) Combination with Dilated conv

2.1. High-to-low and Low-to-high

  • Symmetric high-to-low and low-to-high
  • Heavy high-to-low (classification network = strided conv or pooling) and Light low-to-high (bilinear-upsampling or transposed conv)
  • Combination with Dilated conv
  • Bad for Small objects or Detailed spatial information -> Bad for Pixel-wise prediction
    • Serialization of network : Local, Global feature extraction and learning rely excessively on Up-sampling

2.2. Multi-scale fusion

  • (a), (b) : skip-connections bw same-resolution layers of h-t-l and l-t-h
  • (a) Hourglass : Feeding multi-resolution imgs separately into multiple networks and Aggregating output map
  • (b) Cascaded pyramid network : GlobalNet + RefineNet (right part for combining features)

2.3. Intermediate supervision

  • For helping deep networks training and improving heatmap estimation quality
  • ex) Hourglass, conv pose machine approach : intermediate heatmaps as (part of) input of remaining subnetwork

HR-net

image

  • High-to-low subnetworks in Parallel + Fusing multi-scale representations
  • No intermediate supervision
  • Result : superior in detection accuracy + efficient in computation complexity and params

3. Approach

  • Human pose estimation Task : detecting locations of K keypoints or parts from img I (W x H x 3)
  • SOTA methods : estimating K heatmaps of size W' x H', {H_1, H_2, ..., H_K}, H_k : location confidence of kth keypoint
  • HR-net : using CNN consisting 3 parts
    • Two strided conv decreasing resolution
    • Main body outputting feature maps with same resolution as its input feature maps
    • Regressor estimating heatmaps where keypoint positions are chosen and transformed to full resolution

3.1. Sequential multi-resolution subnetworks

  • Existing networks : connecting high-to-low resolution subnetworks in Series
  • Sequence of subnetworks + down-sample layer to halve resolution
  • N_sr : subnetwork (s : s-th stage, r : resolution index) -> resolution : 1/2^(r-1) of first subnetwork
    • ex) High-to-low network : N_11 -> N_22 -> N_33 -> N_44

3.2. Parallel multi-resolution subnetworks

  • ex) 4 Parallel sub-networks
    image

3.3. Repeated multi-scale fusion

image
image

  • Exchange units (Fusion) across parallel subnetworks
  • Input : X = {X_1, X_2, ..., X_s}
  • Output : Y = {Y_1, Y_2, ..., Y_s}, whose sizes are same to inputs
    • Each output is an aggregation of input maps : Y_k = ∑ a(X_i, k), i=1, ..., s
    • Extra output maps : Y_(s+1) = a(Y_s, s+1)
  • Function : a(X_i, k) : Up-sampling or Down-sampling X_i from resolution i to k (a minimal sketch of one exchange step follows below)
    • Down-sampling (halve) : strided 3x3 conv (Stride = 2, Padding = 1)
    • Up-sampling (double) : nearest neighbor upsampling followed by a 1x1 conv
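
A minimal PyTorch sketch of one such exchange step between a high-resolution and a low-resolution branch (channel counts and module names here are illustrative assumptions, not the official HRNet code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchExchange(nn.Module):
    """Fuse a high-resolution branch with a low-resolution branch (illustrative sketch)."""
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # down-sampling path: strided 3x3 conv (stride 2, padding 1) halves the resolution
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)
        # up-sampling path: nearest neighbor upsampling followed by a 1x1 conv to match channels
        self.up = nn.Conv2d(c_low, c_high, kernel_size=1)

    def forward(self, x_high, x_low):
        # Y_high = X_high + a(X_low -> high), Y_low = X_low + a(X_high -> low)
        y_high = x_high + self.up(F.interpolate(x_low, scale_factor=2, mode='nearest'))
        y_low = x_low + self.down(x_high)
        return y_high, y_low

# usage: x_high of shape (1, 32, 64, 48) and x_low of shape (1, 64, 32, 24)
y_high, y_low = TwoBranchExchange()(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))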

3.4. Heatmap estimation

  • Regressing heatmaps from high-resolution output by Last exchange unit
  • Loss function : MSE
    • GT heatmaps : 2D Gaussian with std = 1 pixel, centered on the GT location of each keypoint (a small target-generation sketch follows below)
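
A small sketch of how such a Gaussian target heatmap can be generated and compared with MSE (sizes and the keypoint location are made up; this is not the official training code):

import torch
import torch.nn.functional as F

def gaussian_heatmap(h, w, cx, cy, sigma=1.0):
    """2D Gaussian with std = sigma pixels, centered on the GT keypoint (cx, cy)."""
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

target = gaussian_heatmap(64, 48, cx=20, cy=30)      # GT heatmap for one keypoint
pred = torch.rand(64, 48)                            # stand-in for a predicted heatmap
loss = F.mse_loss(pred, target)                      # MSE loss between predicted and GT heatmaps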

3.5. Network instantiation

  • ResNet to distribute depth to each stage and # of channels to each resolution
  • Main body : HR-net : 4 stages with 4 parallel subnetworks
    • Resolution is gradually decreased (halved) -> Width (# of channels) is increased (doubled)
    • 1st stage : 4 Residual units
      • each unit is formed by a bottleneck with width 64, followed by one 3x3 conv reducing width of feature maps to C
    • 2, 3, 4th stages : 1, 4, 3 Exchange blocks -> Totally 8 Exchange blocks (-> 8 multi-scale fusions)
      • one Exchange block contains 4 Residual units (each unit contains two 3x3 convs per resolution) and one Exchange unit across resolutions
  • Experiments : HRNet-W32 (small net), HRNet-W48 (big net)
    • 32 and 48 : width C of the high-resolution subnetwork in the last 3 stages (a config sketch follows below)
    • HRNet-W32 : lower-resolution branch widths = 64, 128, 256 (high-resolution branch stays at 32)
    • HRNet-W48 : lower-resolution branch widths = 96, 192, 384 (high-resolution branch stays at 48)
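
A tiny sketch that spells out the branch widths per stage (the multipliers follow the W32/W48 convention above; the dictionary layout itself is just for illustration):

# branch widths (channels) per stage; C = 32 for HRNet-W32, C = 48 for HRNet-W48
def hrnet_branch_widths(C):
    return {
        'stage2': [C, 2 * C],                # e.g. [32, 64]
        'stage3': [C, 2 * C, 4 * C],         # e.g. [32, 64, 128]
        'stage4': [C, 2 * C, 4 * C, 8 * C],  # e.g. [32, 64, 128, 256]
    }

print(hrnet_branch_widths(32))   # HRNet-W32
print(hrnet_branch_widths(48))   # HRNet-W48 -> last stage [48, 96, 192, 384]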

4. Experiments

4.1. COCO Keypoint Detection

Dataset

  • COCO dataset : 200K imgs, 250K person instances labeled with 17 Keypoints
    • COCO train2017 dataset : 57K imgs + 150K person instances
    • COCO val2017 : 5K imgs
    • COCO test-dec2017 set : 20K imgs
    • [Annotation] 17 Keypoints : (x, y, v)
      • x, y : 2D img coordinate
      • v : visibility flag (0 : not labeled / 1 : labeled but not visible / 2 : labeled and visible)

Evaluation metric

  • Similarity Metric : OKS (Object Keypoint Similarity)
    image
    • d_i : Euclidean distance bw detected keypoint and GT
    • v_i : visibility flag of GT
    • s : object scale (diagonal length of bbox)
    • k_i : per-keypoint constant that controls falloff
    • OKS ranges from 0 (worst) to 1 (best) (a numeric sketch follows after this list)
  • Evaluation Metric : AP (Average Precision) : AP^50, AP^75, AP, AP^M, AP^L, AR
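
A numeric sketch of the OKS formula described above (numpy; the distances, scale, and per-keypoint constants are made-up example values):

import numpy as np

def oks(d, v, s, k):
    """d: distances to GT, v: GT visibility flags, s: object scale, k: per-keypoint constants."""
    e = np.exp(-(d ** 2) / (2 * (s ** 2) * (k ** 2)))
    labeled = v > 0                      # only labeled keypoints (v_i > 0) contribute
    return e[labeled].sum() / max(labeled.sum(), 1)

d = np.array([2.0, 5.0, 0.0])            # pixel distances for 3 example keypoints
v = np.array([2, 1, 0])                  # the third keypoint is not labeled
s = 50.0                                 # object scale
k = np.array([0.026, 0.079, 0.079])      # illustrative per-keypoint falloff constants
print(oks(d, v, s, k))                   # value in [0, 1]; higher = more similar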

Training

  • Fixed Human detection box img (h : w = 4 : 3) ... ex) 256 x 192 or 384 x 288
  • Data Augmentation : random rotation, random scale, flipping, half body data augmentation
  • Adam optimizer
  • lr scheduler : 1e-3 (base) -> 1e-4 (170th epochs) -> 1e-5 (200th epochs) -> (210 epochs)

Testing

  • Top-down : Detect person instances using a person detector --> Predict keypoints for each detected box
    • person detectors : same with SimpleBaseline model
  • Averaging heatmaps of original and flipped imgs
  • Predicted keypoint location : highest heat-value location, adjusted by a quarter offset toward the second-highest response

Results on validation set

image

  • [Red] AP : HRNet = 73.4 > Others
  • [Red] #Params, GFLOPs : HRNet > CPN model
  • [Red] #Params, GFLOPs : HRNet < SimpleBaseline model
  • [Blue] Pre-trained model for ImageNet classification is better : 1.0 points ↑
  • [Green] Width size ↑ (HRNet-W48) -> AP ↑ : 0.7, 0.5 ↑
  • [Orange] Input size ↑ (384 x 288) -> AP ↑ : 1.4, 1.2 ↑

Results on test-dev set

image

  • HR-net (Top-down) is better than Bottom-up methods
  • HRNet-W32 : 74.9 AP > Other Top-down methods
    • More efficient in model size (#Params) and computation complexity (GFLOPs)
  • HRNet-W48 : highest 75.5 AP > SimpleBaseline
  • +) Additional data from AI Challenger for training : best 77.0 AP

4.2. MPII Human Pose Estimation

Dataset

  • MPII Human Pose dataset (real-world / full-body pose) : 25K imgs with 40K subjects
    • 12K subjects for testing + 13K subjects for training

Training

  • Same to MS COCO, except that input size is cropped to 256 x 256

Testing

  • Same to MS COCO, except that using provided person boxes (instead of detected person boxes)
  • six-scale pyramid testing procedure

Evaluation metric

  • PCKh (head-normalized probability of correct keypoints) score -> PCKh@0.5 (α=0.5)
    • Joint is correct if it falls within α * ℓ pixels of the GT position (a short sketch of this check follows below)
    • α : constant
    • ℓ : head size that corresponds to 60% of the diagonal length of the GT head bbox
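
A short sketch of the PCKh@0.5 check described above (numpy; joint coordinates and head size are made up):

import numpy as np

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints whose prediction falls within alpha * head_size of the GT."""
    dist = np.linalg.norm(pred - gt, axis=1)      # per-joint Euclidean distance (pixels)
    return (dist <= alpha * head_size).mean()

pred = np.array([[100.0, 50.0], [40.0, 80.0]])    # predicted (x, y) for 2 joints
gt = np.array([[102.0, 49.0], [60.0, 90.0]])      # GT locations
print(pckh(pred, gt, head_size=30.0))             # -> 0.5 (first joint within 15 px, second not)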

Results on test set

image
image

4.3. Application to Pose Tracking

Dataset

  • PoseTrack (articulated tracking in video, provided by MPII Human Pose dataset) : 550 video seq with 66,374 frames
    • video seq are split into 292(train) + 50(val) + 208(test)
      • train : length ranges bw 41~151 frames / 30 frames from center of video are densely annotated
      • val/test : 65~298 frames / 30 frames around keyframe are densely annotated + afterwards every fourth frame is annotated

Evaluation metric

  • [1] Frame-wise Multi-person Pose Estimation : mAP (mean Average Precision)
  • [2] Multi-person Pose Tracking : MOTA (multi-object tracking accuracy)

Training

  • network : HRNet-W48 (pre-trained on COCO dataset) for single person pose estimation on PoseTrack2017 training set
  • Input : Person box extracted from annotated keypoints in training frames by extending bbox of all keypoints by 15%
  • Training setup, data aug : almost same as COCO except lr scheduler : 1e-4 -> 1e-5 (10th) -> 1e-6 (15th) -> (20 epochs)

Testing

  • 1) Person box Detection and Propagation
    • Same detector in SimpleBaseline
    • Propagating boxes into nearby frames by propagating predicted keypoints according to optical flow + NMS for removing redundant boxes
  • 2) Human Pose Estimation
    • Metric : OKS (Object Keypoint Similarity)
  • 3) Pose Association cross nearby frames
    • Greedy matching algorithm to compute correspondence bw keypoints in nearby frames

Results on PoseTrack2017 test set

image

  • HRNet-W48 : 74.9 mAP score, 57.9 MOTA score

4.4. Ablation Study

Repeated multi-scale fusion

  • (a) Without Intermediate Exchange (1 fusions)
  • (b) With only Across-stage Exchange (3 fusions)
  • (c) With both Across-stage and Within-stage Exchange (8 fusions) = HR-Net
  • All networks are trained from scratch
  • Result on COCO val set : More fusions lead to better performance (AP : c>b>a)
    image

Resolution maintenance

  • HRNet-W32 : 73.4 AP > Variant : 72.5 AP
  • Low-level features extracted from early stages over low-resolution subnetworks are less helpful
  • Simple high-resolution without low-resolution parallel subnetworks shows lower performance

Representation resolution

  • (1) Resolution ↑ -> AP ↑ = Keypoint heatmap prediction quality ↑
    image

  • (2) Input size

    • Performance(AP) Improvement for smaller input size (128 x 96) is bigger than larger input size (256 x 192)
    • Input size ↑ -> AP ↑
    • Intuition : Maintaining high resolution is important!
      image

5. Conclusion and Future Works

  • Maintaining high resolution through whole process without need of recovering
  • Fusing multi-resolution representations repeatedly
  • Result : reliable high-resolution representations

Code

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import csv
import os
import shutil

from PIL import Image
import torch
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision
import cv2
import numpy as np
import time


import _init_paths
import models
from config import cfg
from config import update_config
from core.function import get_final_preds
from utils.transforms import get_affine_transform

COCO_KEYPOINT_INDEXES = {
    0: 'nose',
    1: 'left_eye',
    2: 'right_eye',
    3: 'left_ear',
    4: 'right_ear',
    5: 'left_shoulder',
    6: 'right_shoulder',
    7: 'left_elbow',
    8: 'right_elbow',
    9: 'left_wrist',
    10: 'right_wrist',
    11: 'left_hip',
    12: 'right_hip',
    13: 'left_knee',
    14: 'right_knee',
    15: 'left_ankle',
    16: 'right_ankle'
}

COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

SKELETON = [
    [1,3],[1,0],[2,4],[2,0],[0,5],[0,6],[5,7],[7,9],[6,8],[8,10],[5,11],[6,12],[11,12],[11,13],[13,15],[12,14],[14,16]
]

CocoColors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0],
              [0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255],
              [170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]

NUM_KPTS = 17

CTX = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def draw_pose(keypoints,img):
    """draw the keypoints and the skeletons.
    :params keypoints: the shape should be equal to [17,2]
    :params img:
    """
    assert keypoints.shape == (NUM_KPTS,2)
    for i in range(len(SKELETON)):
        kpt_a, kpt_b = SKELETON[i][0], SKELETON[i][1]
        x_a, y_a = keypoints[kpt_a][0],keypoints[kpt_a][1]
        x_b, y_b = keypoints[kpt_b][0],keypoints[kpt_b][1] 
        cv2.circle(img, (int(x_a), int(y_a)), 6, CocoColors[i], -1)
        cv2.circle(img, (int(x_b), int(y_b)), 6, CocoColors[i], -1)
        cv2.line(img, (int(x_a), int(y_a)), (int(x_b), int(y_b)), CocoColors[i], 2)

def draw_bbox(box,img):
    """draw the detected bounding box on the image.
    :param img:
    """
    cv2.rectangle(img, box[0], box[1], color=(0, 255, 0),thickness=3)


def get_person_detection_boxes(model, img, threshold=0.5):
    pred = model(img)
    pred_classes = [COCO_INSTANCE_CATEGORY_NAMES[i]
                    for i in list(pred[0]['labels'].cpu().numpy())]  # Get the predicted class names
    pred_boxes = [[(i[0], i[1]), (i[2], i[3])]
                  for i in list(pred[0]['boxes'].detach().cpu().numpy())]  # Bounding boxes
    pred_score = list(pred[0]['scores'].detach().cpu().numpy())
    if not pred_score or max(pred_score)<threshold:
        return []
    # Get list of index with score greater than threshold
    pred_t = [pred_score.index(x) for x in pred_score if x > threshold][-1]
    pred_boxes = pred_boxes[:pred_t+1]
    pred_classes = pred_classes[:pred_t+1]

    person_boxes = []
    for idx, box in enumerate(pred_boxes):
        if pred_classes[idx] == 'person':
            person_boxes.append(box)

    return person_boxes


def get_pose_estimation_prediction(pose_model, image, center, scale):
    rotation = 0

    # pose estimation transformation
    trans = get_affine_transform(center, scale, rotation, cfg.MODEL.IMAGE_SIZE)
    model_input = cv2.warpAffine(
        image,
        trans,
        (int(cfg.MODEL.IMAGE_SIZE[0]), int(cfg.MODEL.IMAGE_SIZE[1])),
        flags=cv2.INTER_LINEAR)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # pose estimation inference
    model_input = transform(model_input).unsqueeze(0)
    # switch to evaluate mode
    pose_model.eval()
    with torch.no_grad():
        # compute output heatmap
        output = pose_model(model_input)
        preds, _ = get_final_preds(
            cfg,
            output.clone().cpu().numpy(),
            np.asarray([center]),
            np.asarray([scale]))

        return preds


def box_to_center_scale(box, model_image_width, model_image_height):
    """convert a box to center,scale information required for pose transformation
    Parameters
    ----------
    box : list of tuple
        list of length 2 with two tuples of floats representing
        bottom left and top right corner of a box
    model_image_width : int
    model_image_height : int

    Returns
    -------
    (numpy array, numpy array)
        Two numpy arrays, coordinates for the center of the box and the scale of the box
    """
    center = np.zeros((2), dtype=np.float32)

    bottom_left_corner = box[0]
    top_right_corner = box[1]
    box_width = top_right_corner[0]-bottom_left_corner[0]
    box_height = top_right_corner[1]-bottom_left_corner[1]
    bottom_left_x = bottom_left_corner[0]
    bottom_left_y = bottom_left_corner[1]
    center[0] = bottom_left_x + box_width * 0.5
    center[1] = bottom_left_y + box_height * 0.5

    aspect_ratio = model_image_width * 1.0 / model_image_height
    pixel_std = 200

    if box_width > aspect_ratio * box_height:
        box_height = box_width * 1.0 / aspect_ratio
    elif box_width < aspect_ratio * box_height:
        box_width = box_height * aspect_ratio
    scale = np.array(
        [box_width * 1.0 / pixel_std, box_height * 1.0 / pixel_std],
        dtype=np.float32)
    if center[0] != -1:
        scale = scale * 1.25

    return center, scale

def parse_args():
    parser = argparse.ArgumentParser(description='Train keypoints network')
    # general
    parser.add_argument('--cfg', type=str, default='demo/inference-config.yaml')
    parser.add_argument('--video', type=str)
    parser.add_argument('--webcam',action='store_true')
    parser.add_argument('--image',type=str)
    parser.add_argument('--write',action='store_true')
    parser.add_argument('--showFps',action='store_true')

    parser.add_argument('opts',
                        help='Modify config options using the command-line',
                        default=None,
                        nargs=argparse.REMAINDER)

    args = parser.parse_args()

    # args expected by supporting codebase  
    args.modelDir = ''
    args.logDir = ''
    args.dataDir = ''
    args.prevModelDir = ''
    return args


def main():
    # cudnn related setting
    cudnn.benchmark = cfg.CUDNN.BENCHMARK
    torch.backends.cudnn.deterministic = cfg.CUDNN.DETERMINISTIC
    torch.backends.cudnn.enabled = cfg.CUDNN.ENABLED

    args = parse_args()
    update_config(cfg, args)

    box_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    box_model.to(CTX)
    box_model.eval()

    pose_model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
        cfg, is_train=False
    )

    if cfg.TEST.MODEL_FILE:
        print('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
        pose_model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
    else:
        print('expected model defined in config at TEST.MODEL_FILE')

    pose_model = torch.nn.DataParallel(pose_model, device_ids=cfg.GPUS)
    pose_model.to(CTX)
    pose_model.eval()

    # Loading an video or an image or webcam 
    if args.webcam:
        vidcap = cv2.VideoCapture(0)
    elif args.video:
        vidcap = cv2.VideoCapture(args.video)
    elif args.image:
        image_bgr = cv2.imread(args.image)
    else:
        print('please use --video or --webcam or --image to define the input.')
        return 

    if args.webcam or args.video:
        if args.write:
            save_path = 'output.avi'
            fourcc = cv2.VideoWriter_fourcc(*'XVID')
            out = cv2.VideoWriter(save_path,fourcc, 24.0, (int(vidcap.get(3)),int(vidcap.get(4))))
        while True:
            ret, image_bgr = vidcap.read()
            if ret:
                last_time = time.time()
                image = image_bgr[:, :, [2, 1, 0]]

                input = []
                img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
                img_tensor = torch.from_numpy(img/255.).permute(2,0,1).float().to(CTX)
                input.append(img_tensor)

                # object detection box
                pred_boxes = get_person_detection_boxes(box_model, input, threshold=0.9)

                # pose estimation
                if len(pred_boxes) >= 1:
                    for box in pred_boxes:
                        center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
                        image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
                        pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
                        if len(pose_preds)>=1:
                            for kpt in pose_preds:
                                draw_pose(kpt,image_bgr) # draw the poses

                if args.showFps:
                    fps = 1/(time.time()-last_time)
                    img = cv2.putText(image_bgr, 'fps: '+ "%.2f"%(fps), (25, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)

                if args.write:
                    out.write(image_bgr)

                cv2.imshow('demo',image_bgr)
                if cv2.waitKey(1) & 0XFF==ord('q'):
                    break
            else:
                print('cannot load the video.')
                break

        cv2.destroyAllWindows()
        vidcap.release()
        if args.write:
            print('video has been saved as {}'.format(save_path))
            out.release()

    else:
        # estimate on the image
        last_time = time.time()
        image = image_bgr[:, :, [2, 1, 0]]

        input = []
        img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
        img_tensor = torch.from_numpy(img/255.).permute(2,0,1).float().to(CTX)
        input.append(img_tensor)

        # object detection box
        pred_boxes = get_person_detection_boxes(box_model, input, threshold=0.9)

        # pose estimation
        if len(pred_boxes) >= 1:
            for box in pred_boxes:
                center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
                image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
                pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
                if len(pose_preds)>=1:
                    for kpt in pose_preds:
                        draw_pose(kpt,image_bgr) # draw the poses
        
        if args.showFps:
            fps = 1/(time.time()-last_time)
            img = cv2.putText(image_bgr, 'fps: '+ "%.2f"%(fps), (25, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
        
        if args.write:
            save_path = 'output.jpg'
            cv2.imwrite(save_path,image_bgr)
            print('the result image has been saved as {}'.format(save_path))

        cv2.imshow('demo',image_bgr)
        if cv2.waitKey(0) & 0XFF==ord('q'):
            cv2.destroyAllWindows()
        
if __name__ == '__main__':
    main()

[CV_3D] VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

Paper Review

Abstract

  • Previous methods : hand-crafted features are built to feed LiDAR data into an RPN (Region Proposal Network)
  • VoxelNet : proposes an end-to-end network that unifies feature extraction and bbox prediction into a single stage
    • Divide the point cloud into equally spaced 3D voxels (Voxel Partition)
    • Build a voxel feature from the points inside each voxel via VFE layers
    • Aggregate local voxel features via 3D conv layers
    • Generate bboxes via an RPN

1. Introduction

1.1 Related Work

1.2 Contributions

  • End-to-end trainable deep network for pc-based 3D detection by VFE
  • Efficient implementation for sparse point structure and parallel processing on voxel grid (GPU)
  • SOTA results on KITTI benchmark (LiDAR-based car, pedestrian, cyclist detection)

2. VoxelNet

[ VoxelNet Architecture ]

image

1️⃣ Feature Learning Network

Voxel Partition

  • To subdivide(voxelize) 3D space into equally spaced voxels
  • 3D voxel grid : $[D', H', W']$ $(D'=D/v_D, H'=H/v_H, W'=W/v_W)$
    • $D, H, W$ : extents of the region containing the LiDAR points along the z-axis (up), y-axis (left), and x-axis (forward)
    • $v_D, v_H, v_W$ : size of a unit voxel along the z, y, x directions
    • In paper, $(v_D, v_H, v_W) = (0.4, 0.2, 0.2)$

Grouping

  • LiDAR data : sparse, and the number of points varies per voxel
  • Points falling inside the same voxel grid cell are assigned to the same voxel group → point group

Random Sampling

  • Set a maximum number of points $T$ per voxel and randomly sample $T$ points from any voxel containing more than $T$
  • A point cloud from the LiDAR sensor has ~100,000 points per frame
  • Purposes : computation ↓, point-density imbalance ↓ (sampling bias ↓), adds variation to training (a preprocessing sketch follows below)
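
A compact numpy sketch of voxel partition, grouping, and random sampling (the voxel sizes and T = 35 follow the paper; the point-cloud range and variable names are illustrative assumptions):

import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), T=35):
    """points: (N, 4) array of (x, y, z, reflectance) -> {voxel index: (<=T, 4) point group}."""
    coords = np.floor(points[:, :3] / np.array(voxel_size)).astype(np.int32)   # voxel index per point
    voxels = {}
    for idx, pt in zip(map(tuple, coords), points):
        voxels.setdefault(idx, []).append(pt)            # grouping: same cell -> same voxel group
    for idx, pts in voxels.items():
        pts = np.stack(pts)
        if len(pts) > T:                                  # random sampling when a voxel is too dense
            pts = pts[np.random.choice(len(pts), T, replace=False)]
        voxels[idx] = pts
    return voxels

points = np.random.rand(1000, 4) * [40.0, 40.0, 3.0, 1.0]   # fake LiDAR points (x, y, z, r)
print(len(voxelize(points)))                                # number of non-empty voxels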

Stacked Voxel Feature Encoding (VFE)

  • Fig. VFE Layer-1
    image
  • Non-empty voxel containing $t$ LiDAR points : $V = p_i = [x_i, y_i, z_i, r_i]^T\in\mathbb{R}^4$, $i=1...t$
    • $x_i, y_i, z_i$ : XYZ coordinates for $i$-th point
    • $r_i$ : received reflectance
  • Input feature set (Point-wise Input) : $V_{in} = \hat{p}_i= [x_i, y_i, z_i, r_i, x_i-v_x, y_i-v_y, z_i-v_z]^T\in\mathbb{R}^7$, $i=1...t$
    • relative offset of each point w.r.t. the centroid (= feature of each point)
    • $(v_x, v_y, v_z)$ : centroid of all points in $V$ = local mean
  • Point-wise Feature : result of passing the Point-wise Input through an FCN into feature space
    • FCN = linear layer + BN + ReLU
    • aggregating information from point features → encoding shape of surface within voxel
  • Locally Aggregated Feature
    • result of element-wise max-pooling over the Point-wise Features (= features of all points in the voxel)
  • Point-wise concatenated Feature : $f_i^{out}\in\mathbb{R}^{2m}$
    • result of concatenating the Point-wise Feature with the locally aggregated (voxel-wise) feature
  • Output feature set : $V_{out} = f_i^{out}$, $i=1...t$
    • = output of VFE Layer-1 → input of VFE Layer-2
  • Voxel-wise Feature
    • all non-empty voxels are encoded with the same (shared) FCN
    • stacking VFE layers lets the network learn the shape information of the points inside each voxel
    • final result obtained by passing the point-wise features from the $n(=2)$ VFE layers through an FCN and max-pooling
    • raw 3D points are hard to train a CNN on → split 3D space into voxels to get a CNN-friendly structure, compute a feature per voxel, and use it as input to the Convolutional Middle Layers (a small sketch of building the 7-dim input follows below)
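
A minimal sketch of building the 7-dim point-wise input $\hat{p}_i$ for a single voxel, matching the definition of $V_{in}$ above (numpy; variable names are illustrative):

import numpy as np

def vfe_input(voxel_points):
    """voxel_points: (t, 4) array of (x, y, z, r) in one voxel -> (t, 7) augmented input."""
    centroid = voxel_points[:, :3].mean(axis=0)              # (v_x, v_y, v_z): local mean of the voxel
    offsets = voxel_points[:, :3] - centroid                  # relative offset of each point to the centroid
    return np.concatenate([voxel_points, offsets], axis=1)    # [x, y, z, r, x - v_x, y - v_y, z - v_z]

pts = np.random.rand(10, 4)        # 10 fake points inside one voxel
print(vfe_input(pts).shape)        # (10, 7)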

[Code] Feature Learning Network

# Fully Connected Network
class FCN(nn.Module):

    def __init__(self,cin,cout):
        super(FCN, self).__init__()
        self.cout = cout
        self.linear = nn.Linear(cin, cout)
        self.bn = nn.BatchNorm1d(cout)

    def forward(self,x):
        # KK is the stacked k across batch
        kk, t, _ = x.shape
        x = self.linear(x.view(kk*t,-1))
        x = F.relu(self.bn(x))
        return x.view(kk,t,-1)

# Voxel Feature Encoding (VFE) Layer
class VFE(nn.Module):

    def __init__(self,cin,cout):
        super(VFE, self).__init__()
        assert cout % 2 == 0
        self.units = cout // 2
        self.fcn = FCN(cin,self.units)

    def forward(self, x, mask):
        # point-wise feature
        pwf = self.fcn(x)
        #locally aggregated feature
        laf = torch.max(pwf,1)[0].unsqueeze(1).repeat(1,cfg.T,1)
        # point-wise concat feature
        pwcf = torch.cat((pwf,laf),dim=2)
        # apply mask
        mask = mask.unsqueeze(2).repeat(1, 1, self.units * 2)
        pwcf = pwcf * mask.float()

        return pwcf

# Stacked Voxel Feature Encoding
class SVFE(nn.Module):

    def __init__(self):
        super(SVFE, self).__init__()
        self.vfe_1 = VFE(7,32)
        self.vfe_2 = VFE(32,128)
        self.fcn = FCN(128,128)
    def forward(self, x):
        mask = torch.ne(torch.max(x,2)[0], 0)
        x = self.vfe_1(x, mask)
        x = self.vfe_2(x, mask)
        x = self.fcn(x)
        # element-wise max pooling
        x = torch.max(x,1)[0]
        return x

Sparse Tensor Representation

  • pc ~ 100k points → over 90% of voxels are empty
  • represent non-empty voxel features as a sparse tensor (list form)
  • reduces memory usage & computation cost in backprop

2️⃣ Convolutional Middle Layers

  • Input : voxel-wise feature
  • CML = 3D CNN + BN + ReLU
  • In paper, 3 CMLs
  • aggregates voxel-wise features while enlarging the receptive field

[Code] Convolutional Middle Layer

# conv3d + bn + relu
class Conv3d(nn.Module):
    def __init__(self, in_channels, out_channels, k, s, p, batch_norm=True):
        super(Conv3d, self).__init__()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=k, stride=s, padding=p)
        if batch_norm:
            self.bn = nn.BatchNorm3d(out_channels)
        else:
            self.bn = None

    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)

        return F.relu(x, inplace=True)

# Convolutional Middle Layer
class CML(nn.Module):
    def __init__(self):
        super(CML, self).__init__()
        self.conv3d_1 = Conv3d(128, 64, 3, s=(2, 1, 1), p=(1, 1, 1))
        self.conv3d_2 = Conv3d(64, 64, 3, s=(1, 1, 1), p=(0, 1, 1))
        self.conv3d_3 = Conv3d(64, 64, 3, s=(2, 1, 1), p=(1, 1, 1))

    def forward(self, x):
        x = self.conv3d_1(x)
        x = self.conv3d_2(x)
        x = self.conv3d_3(x)
        return x

3️⃣ Region Proposal Network (RPN)

image

  • Input : BEV feature map obtained by reshaping the 64(channel) x 2(z) x 400(y) x 352(x) 4D feature map from the CML into a 128 x 400 x 352 3D tensor
  • Outputs : 2-dim Probability score map (class score) & 14-dim Regression map (bbox regression)
    • Probability score map (class score) : probability (0~1) that each anchor matches the class
    • Regression map (bbox regression) : regression outputs for the 7 bbox parameters
  • Layers : Conv2D(input channel #, output channel #, kernel size, stride size, padding size)
  • 3 fully-convolutional blocks
    • 1st layer of each block has stride 2 → downsamples the feature map by 1/2
    • features from each block are upsampled to the same size and concatenated
    • the final high-resolution feature map is passed through 1x1 conv heads → Class Probability score map & bbox Regression map

[Code] Region Proposal Network (RPN)

# conv2d + bn + relu
class Conv2d(nn.Module):
    def __init__(self,in_channels,out_channels,k,s,p, activation=True, batch_norm=True):
        super(Conv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels,out_channels,kernel_size=k,stride=s,padding=p)
        if batch_norm:
            self.bn = nn.BatchNorm2d(out_channels)
        else:
            self.bn = None
        self.activation = activation
    def forward(self,x):
        x = self.conv(x)
        if self.bn is not None:
            x=self.bn(x)
        if self.activation:
            return F.relu(x,inplace=True)
        else:
            return x

# Region Proposal Network
class RPN(nn.Module):
    def __init__(self):
        super(RPN, self).__init__()
        self.block_1 = [Conv2d(128, 128, 3, 2, 1)]
        self.block_1 += [Conv2d(128, 128, 3, 1, 1) for _ in range(3)]
        self.block_1 = nn.Sequential(*self.block_1)

        self.block_2 = [Conv2d(128, 128, 3, 2, 1)]
        self.block_2 += [Conv2d(128, 128, 3, 1, 1) for _ in range(5)]
        self.block_2 = nn.Sequential(*self.block_2)

        self.block_3 = [Conv2d(128, 256, 3, 2, 1)]
        self.block_3 += [nn.Conv2d(256, 256, 3, 1, 1) for _ in range(5)]
        self.block_3 = nn.Sequential(*self.block_3)

        self.deconv_1 = nn.Sequential(nn.ConvTranspose2d(256, 256, 4, 4, 0),nn.BatchNorm2d(256))
        self.deconv_2 = nn.Sequential(nn.ConvTranspose2d(128, 256, 2, 2, 0),nn.BatchNorm2d(256))
        self.deconv_3 = nn.Sequential(nn.ConvTranspose2d(128, 256, 1, 1, 0),nn.BatchNorm2d(256))

        self.score_head = Conv2d(768, cfg.anchors_per_position, 1, 1, 0, activation=False, batch_norm=False)
        self.reg_head = Conv2d(768, 7 * cfg.anchors_per_position, 1, 1, 0, activation=False, batch_norm=False)

    def forward(self,x):
        x = self.block_1(x)
        x_skip_1 = x
        x = self.block_2(x)
        x_skip_2 = x
        x = self.block_3(x)
        x_0 = self.deconv_1(x)
        x_1 = self.deconv_2(x_skip_2)
        x_2 = self.deconv_3(x_skip_1)
        x = torch.cat((x_0,x_1,x_2),1)
        return self.score_head(x),self.reg_head(x)

[ Loss Function ]

image
Total Loss = Normalized Classification Loss + Normalized Regression Loss
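
Written out, the paper's total objective takes the form below, where $\alpha$ and $\beta$ are positive balancing coefficients (they correspond to `self.alpha` and `self.beta` in the code that follows):

$L = \alpha \frac{1}{N_{pos}} \sum_i L_{cls}(p_i^{pos}, 1) + \beta \frac{1}{N_{neg}} \sum_j L_{cls}(p_j^{neg}, 0) + \frac{1}{N_{pos}} \sum_i L_{reg}(u_i, u_i^*)$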

(1) $L_{cls}$ : Classification Loss by BCE loss

  • $p_i^{pos}, p_j^{neg}$ : softmax output for positive and negative anchor
  • $a_i^{pos}$, $i=1...N_{pos}$ : set of positive anchors (pre-defined bbox)
    • anchors whose IoU with a GT bbox is larger than a threshold → target score ≈ 1

      In paper, Car : 0.65, Pedestrian & Cyclist : 0.5

  • $a_j^{neg}$, $j=1...N_{neg}$ : set of negative anchors
    • anchors whose IoU with every GT bbox is smaller than a threshold → target score ≈ 0
  • $(x_c^g, y_c^g, z_c^g, l^g, w^g, h^g, \theta^g)$ : 3D GT bbox
    • $x_c^g, y_c^g, z_c^g$ : center location = feature map location
    • $l^g, w^g, h^g$ : length, width, height of the box → differ per class

      In paper, Car : (3.9, 1.6, 1.56)

    • $\theta^g$ : yaw rotation around Z-axis (0~2𝝅)

      In paper, $\theta$ = 0, 𝝅/2 → 2 anchors per location → Outputs : 2-dim & 14-dim

(2) $L_{reg}$ : Regression Loss by SmoothL1 loss

  • $u_i\in\mathbb{R}^7$ : regression output
  • $u_i^*\in\mathbb{R}^7$ : GT for positive anchor
  • $u^*\in\mathbb{R}^7$ : residual vector (written out below)
    image
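
For reference, the seven residual targets relative to a matched anchor $(x_c^a, y_c^a, z_c^a, l^a, w^a, h^a, \theta^a)$ are defined in the paper as:

$\Delta x = \frac{x_c^g - x_c^a}{d^a},\quad \Delta y = \frac{y_c^g - y_c^a}{d^a},\quad \Delta z = \frac{z_c^g - z_c^a}{h^a},\quad \Delta l = \log\frac{l^g}{l^a},\quad \Delta w = \log\frac{w^g}{w^a},\quad \Delta h = \log\frac{h^g}{h^a},\quad \Delta\theta = \theta^g - \theta^a$

where $d^a = \sqrt{(l^a)^2 + (w^a)^2}$ is the diagonal of the anchor's base.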

[Code] Loss function

class VoxelLoss(nn.Module):
    def __init__(self, alpha, beta):
        super(VoxelLoss, self).__init__()
        self.smoothl1loss = nn.SmoothL1Loss(size_average=False)
        self.alpha = alpha
        self.beta = beta

    def forward(self, rm, psm, pos_equal_one, neg_equal_one, targets):

        p_pos = F.sigmoid(psm.permute(0,2,3,1))
        rm = rm.permute(0,2,3,1).contiguous()
        rm = rm.view(rm.size(0),rm.size(1),rm.size(2),-1,7)
        targets = targets.view(targets.size(0),targets.size(1),targets.size(2),-1,7)
        pos_equal_one_for_reg = pos_equal_one.unsqueeze(pos_equal_one.dim()).expand(-1,-1,-1,-1,7)

        rm_pos = rm * pos_equal_one_for_reg
        targets_pos = targets * pos_equal_one_for_reg

        cls_pos_loss = -pos_equal_one * torch.log(p_pos + 1e-6)
        cls_pos_loss = cls_pos_loss.sum() / (pos_equal_one.sum() + 1e-6)

        cls_neg_loss = -neg_equal_one * torch.log(1 - p_pos + 1e-6)
        cls_neg_loss = cls_neg_loss.sum() / (neg_equal_one.sum() + 1e-6)

        reg_loss = self.smoothl1loss(rm_pos, targets_pos)
        reg_loss = reg_loss / (pos_equal_one.sum() + 1e-6)
        conf_loss = self.alpha * cls_pos_loss + self.beta * cls_neg_loss
        return conf_loss, reg_loss

2.3 Efficient Implementation

image

  • $K$ : maximum number of non-empty voxels
  • $T$ : maximum number of points each voxel can hold

Steps

  • Initialize a ( $K$ x $1$ x $3$ )-dim Voxel Coordinate Buffer (VCB) and a ( $K$ x $T$ x $7$ )-dim Voxel Input Feature Buffer (VIFB)
  • Before feeding the sparse input point cloud into the Stacked VFE-layers, pack it into the dense VIFB, filling empty slots with 0 → enables parallel computation on GPU
    • Iterate over the points; if the voxel a point belongs to has not been initialized yet, add that voxel's coordinate to the VCB
    • & turn the point into a 7-dim vector and add it to that voxel's slot in the VIFB
  • After the Stacked VFE-layers, map the voxel-wise features back to a sparse tensor in 3D space using the VCB
  • The sparse tensor is fed into the convolutional middle layers and the RPN

[Code] Efficient VoxelNet

class VoxelNet(nn.Module):

    def __init__(self):
        super(VoxelNet, self).__init__()
        self.svfe = SVFE()
        self.cml = CML()
        self.rpn = RPN()

    def voxel_indexing(self, sparse_features, coords):
        dim = sparse_features.shape[-1]
        dense_feature = Variable(torch.zeros(dim, cfg.N, cfg.D, cfg.H, cfg.W).cuda())
        dense_feature[:, coords[:,0], coords[:,1], coords[:,2], coords[:,3]]= sparse_features
        return dense_feature.transpose(0, 1)

    def forward(self, voxel_features, voxel_coords):
        # feature learning network
        vwfs = self.svfe(voxel_features)
        vwfs = self.voxel_indexing(vwfs,voxel_coords)

        # convolutional middle network
        cml_out = self.cml(vwfs)

        # region proposal network
        # merge the depth and feature dim into one, output probability score map and regression map
        psm,rm = self.rpn(cml_out.view(cfg.N,-1,cfg.H, cfg.W))

        return psm, rm

3. Training Details

Data Augmentation

  • Less than 4000 training PC → Overfitting issue
  • 1) Perturbation (Rotation and Translation) to each GT bbox
    • Rotate around the bbox center by an angle sampled uniformly from [-π/10, π/10]
    • Translate along (x, y, z) by values sampled from a Gaussian distribution N(0, 1)
    • Collision test between boxes → revert the perturbation if a collision occurs
  • 2) Global Scaling
    • Scale all GT bboxes $b_i$ and the whole point cloud $M$ by a factor sampled uniformly from [0.95, 1.05]
    • Result : Robustness ↑ for detecting objects with various sizes and distances
  • 3) Global Rotation
    • Rotate all GT bboxes $b_i$ and the whole point cloud $M$ around the Z-axis at (0,0,0) by an angle sampled uniformly from [-π/4, π/4]
    • Result : rotating entire pc → simulating vehicle making a turn
    • Aug 1) perturbs individual bboxes, Aug 3) transforms the whole scene

[Code] Data Augmentation

def draw_polygon(img, box_corner, color = (255, 255, 255),thickness = 1):

    tup0 = (box_corner[0, 1],box_corner[0, 0])
    tup1 = (box_corner[1, 1],box_corner[1, 0])
    tup2 = (box_corner[2, 1],box_corner[2, 0])
    tup3 = (box_corner[3, 1],box_corner[3, 0])
    cv2.line(img, tup0, tup1, color, thickness, cv2.LINE_AA)
    cv2.line(img, tup1, tup2, color, thickness, cv2.LINE_AA)
    cv2.line(img, tup2, tup3, color, thickness, cv2.LINE_AA)
    cv2.line(img, tup3, tup0, color, thickness, cv2.LINE_AA)
    return img

def point_transform(points, tx, ty, tz, rx=0, ry=0, rz=0):
    # Input:
    #   points: (N, 3)
    #   rx/y/z: in radians
    # Output:
    #   points: (N, 3)
    N = points.shape[0]
    points = np.hstack([points, np.ones((N, 1))])
    mat1 = np.eye(4)
    mat1[3, 0:3] = tx, ty, tz
    points = np.matmul(points, mat1)
    if rx != 0:
        mat = np.zeros((4, 4))
        mat[0, 0] = 1
        mat[3, 3] = 1
        mat[1, 1] = np.cos(rx)
        mat[1, 2] = -np.sin(rx)
        mat[2, 1] = np.sin(rx)
        mat[2, 2] = np.cos(rx)
        points = np.matmul(points, mat)
    if ry != 0:
        mat = np.zeros((4, 4))
        mat[1, 1] = 1
        mat[3, 3] = 1
        mat[0, 0] = np.cos(ry)
        mat[0, 2] = np.sin(ry)
        mat[2, 0] = -np.sin(ry)
        mat[2, 2] = np.cos(ry)
        points = np.matmul(points, mat)
    if rz != 0:
        mat = np.zeros((4, 4))
        mat[2, 2] = 1
        mat[3, 3] = 1
        mat[0, 0] = np.cos(rz)
        mat[0, 1] = -np.sin(rz)
        mat[1, 0] = np.sin(rz)
        mat[1, 1] = np.cos(rz)
        points = np.matmul(points, mat)
    return points[:, 0:3]

def box_transform(boxes_corner, tx, ty, tz, r=0):
    # boxes_corner (N, 8, 3)
    for idx in range(len(boxes_corner)):
        boxes_corner[idx] = point_transform(boxes_corner[idx], tx, ty, tz, rz=r)
    return boxes_corner

def cal_iou2d(box1_corner, box2_corner):
    box1_corner = np.reshape(box1_corner, [4, 2])
    box2_corner = np.reshape(box2_corner, [4, 2])
    box1_corner = ((cfg.W, cfg.H)-(box1_corner - (cfg.xrange[0], cfg.yrange[0])) / (cfg.vw, cfg.vh)).astype(np.int32)
    box2_corner = ((cfg.W, cfg.H)-(box2_corner - (cfg.xrange[0], cfg.yrange[0])) / (cfg.vw, cfg.vh)).astype(np.int32)

    buf1 = np.zeros((cfg.H, cfg.W, 3))
    buf2 = np.zeros((cfg.H, cfg.W, 3))
    buf1 = cv2.fillConvexPoly(buf1, box1_corner, color=(1,1,1))[..., 0]
    buf2 = cv2.fillConvexPoly(buf2, box2_corner, color=(1,1,1))[..., 0]

    indiv = np.sum(np.absolute(buf1-buf2))
    share = np.sum((buf1 + buf2) == 2)
    if indiv == 0:
        return 0.0 # when target is out of bound
    return share / (indiv + share)

def aug_data(lidar, gt_box3d_corner):
    np.random.seed()

    choice = np.random.randint(1, 10)

    if choice >= 7:
        for idx in range(len(gt_box3d_corner)):
            # TODO: precisely gather the point
            is_collision = True
            _count = 0
            while is_collision and _count < 100:
                t_rz = np.random.uniform(-np.pi / 10, np.pi / 10)
                t_x = np.random.normal()
                t_y = np.random.normal()
                t_z = np.random.normal()

                # check collision
                tmp = box_transform(
                    gt_box3d_corner[[idx]], t_x, t_y, t_z, t_rz)
                is_collision = False
                for idy in range(idx):
                    iou = cal_iou2d(tmp[0,:4,:2],gt_box3d_corner[idy,:4,:2])
                    if iou > 0:
                        is_collision = True
                        _count += 1
                        break
            if not is_collision:
                box_corner = gt_box3d_corner[idx]
                minx = np.min(box_corner[:, 0])
                miny = np.min(box_corner[:, 1])
                minz = np.min(box_corner[:, 2])
                maxx = np.max(box_corner[:, 0])
                maxy = np.max(box_corner[:, 1])
                maxz = np.max(box_corner[:, 2])
                bound_x = np.logical_and(
                    lidar[:, 0] >= minx, lidar[:, 0] <= maxx)
                bound_y = np.logical_and(
                    lidar[:, 1] >= miny, lidar[:, 1] <= maxy)
                bound_z = np.logical_and(
                    lidar[:, 2] >= minz, lidar[:, 2] <= maxz)
                bound_box = np.logical_and(
                    np.logical_and(bound_x, bound_y), bound_z)
                lidar[bound_box, 0:3] = point_transform(
                    lidar[bound_box, 0:3], t_x, t_y, t_z, rz=t_rz)
                gt_box3d_corner[idx] = box_transform(
                    gt_box3d_corner[[idx]], t_x, t_y, t_z, t_rz)

        gt_box3d = gt_box3d_corner

    elif choice < 7 and choice >= 4:
        # global rotation
        angle = np.random.uniform(-np.pi / 4, np.pi / 4)
        lidar[:, 0:3] = point_transform(lidar[:, 0:3], 0, 0, 0, rz=angle)
        gt_box3d = box_transform(gt_box3d_corner, 0, 0, 0, r=angle)

    else:
        # global scaling
        factor = np.random.uniform(0.95, 1.05)
        lidar[:, 0:3] = lidar[:, 0:3] * factor
        gt_box3d = gt_box3d_corner * factor

    return lidar, gt_box3d

4. Experiments

Evaluation on KITTI benchmark dataset

image

  • VoxelNet outperforms all other methods for Car class
  • VoxelNet is more effective in capturing 3D shape information than hand-crafted (HC) features

Code Review

[CV_CNN] Very Deep Convolutional Networks for Large-Scale Image Recognition

Very Deep Convolutional Networks for Large-Scale Image Recognition

1. INTRODUCTION

  • Fix other parameters and increase 'depth' of the network + Use only 'small (3x3)' convolution filters in all layers
  • ILSVRC-2014 classification and localisation + other image recognition datasets

2. ConvNet Configurations

2.1. Architecture

  • Input data : 224 x 224 RGB image
  • 3 x 3 Conv (stride 1, padding 1) and 2 x 2 Maxpool (stride 2)
  • Activation function : ReLU
  • 3 FC layers (4096 - 4096 - 1000 channels)
  • Final : soft-max layer
  • No LRN (Local Response Normalization) except for one

2.2. Configurations

  • A~E : Differ only in 'depth'
  • Width of conv layer (the number of channels = feature map) : 64 -> 128 -> 256 -> 512
  • 3 x 3 conv layers have fewer parameters, but the total count is still large (because of the FC layers)

image

image

2.3. Discussion

  • A stack of three 3 x 3 conv layers has the same effective receptive field as one 7 x 7 conv layer
  • BUT more non-linearity (3 ReLUs) & fewer parameters ( 3(3^2C^2) < 7^2C^2 ; a worked example follows below )
  • 1 x 1 conv layer for additional non-linearity by ReLU (config C)
  • GoogLeNet(1st place of ILSVRC-2014) is more complex than VGGNet
    • Similarity : very deep ConvNets (22 layers) and Small conv filters(1x1, 3x3, 5x5)
    • Difference : spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation

image
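
As a quick worked example with C = 256 channels : three stacked 3 x 3 conv layers use 3 x (3^2 x 256^2) = 27 x 256^2 ≈ 1.77M weights, while a single 7 x 7 conv layer uses 7^2 x 256^2 = 49 x 256^2 ≈ 3.21M, i.e. roughly 1.8x more parameters for the same effective receptive field.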

3. Classification Framework

3.1. Training

  • Generally follows AlexNet (2012) except for input crops from multi-scale training images
  • Data Pre-processing
    • Image Rescale (Resize)
      • Single-scale training : fixed S = 256, S = 384
      • Multi-scale training : randomly sampling in [256, 512] (Fine-tuning with pre-trained S = 384)
    • Data Augmentation
      • Random crop 224 x 224
      • Random horizontal flipping
      • Random RGB color shift
      • Scale jittering (a transform sketch follows after this subsection)
      • Normalization : subtract mean RGB value computed on training dataset from each pixel
  • Train Details
    • Multinomial logistic regression Optimization
    • Mini-batch gradient descent based on backpropagation
      • Learning rate : 0.01
      • Momentum : 0.9
      • L2 weight decay : 0.0005
    • Batch size : 256
    • Dropout : 0.5 ratio for first 2 FC layers
    • Learning rate scheduler : decreased by a factor of 10 ( x 3 times) -> stopped at 370K iterations
    • Epoch : 74 (370K iterations)
    • Pre-initialization : train shallow config A first -> train deeper configs by initializing their first 4 conv and last 3 FC layers with A's layers & randomly initializing the intermediate layers from a zero-mean normal distribution with 10^-2 variance

image
image
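
A torchvision-style sketch of the multi-scale training preprocessing described above; the smallest-side range [256, 512], the 224 crop, and the horizontal flip follow the paper, while the color shift and the normalization constants are simplified stand-ins for the original pipeline:

import random
import torchvision.transforms as T

def vgg_train_transform():
    S = random.randint(256, 512)                 # scale jittering: random training scale S (re-sampled per call)
    return T.Compose([
        T.Resize(S),                             # isotropic rescale so the smallest side equals S
        T.RandomCrop(224),                       # random 224 x 224 crop
        T.RandomHorizontalFlip(),                # random horizontal flipping
        T.ColorJitter(0.1, 0.1, 0.1),            # rough stand-in for the paper's RGB color shift
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],  # mean subtraction (ImageNet statistics as a proxy)
                    std=[1.0, 1.0, 1.0]),        # the paper only subtracts the mean, so std stays at 1
    ])

In practice the transform would be rebuilt (or S re-sampled) for every image so that each sample sees a different scale.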

3.2. Testing

  • Data Pre-processing
    • Isotropic Rescaling to pre-defined smallest side Q (not necessarily equal to S)
    • Multi-crop evaluation + Dense evaluation
    • Data Augmentation : Horizontal flipping
  • Network Change
    • FC layers -> convolutional layers => Fully-Convolutional Net
      • First FC layer -> 7 x 7 conv layer
      • Last 2 FC layers -> 1 x 1 conv layers (for free input size) : applied to the whole (uncropped) img
    • Spatially average-pool the class score map at the end : to obtain a fixed-size vector of class scores (a conversion sketch follows below)
  • Averaging Soft-max class posteriors of original and flipped images -> Final scores
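
A PyTorch sketch of the FC-to-conv conversion used for dense evaluation; `fc1`, `fc2`, `fc3` stand for the three trained FC layers (4096 - 4096 - 1000) and are assumed to exist already:

import torch
import torch.nn as nn

def fc_to_conv(fc1, fc2, fc3):
    """Reuse trained FC weights as conv kernels so the net accepts any input size."""
    conv1 = nn.Conv2d(512, 4096, kernel_size=7)      # first FC layer -> 7 x 7 conv
    conv2 = nn.Conv2d(4096, 4096, kernel_size=1)     # second FC layer -> 1 x 1 conv
    conv3 = nn.Conv2d(4096, 1000, kernel_size=1)     # last FC layer -> 1 x 1 conv
    conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
    conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
    conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
    for conv, fc in zip((conv1, conv2, conv3), (fc1, fc2, fc3)):
        conv.bias.data = fc.bias.data
    return nn.Sequential(conv1, nn.ReLU(inplace=True), conv2, nn.ReLU(inplace=True), conv3)

# class score map over the whole (uncropped) image, then spatial average pooling -> fixed-size scores
# scores = fc_to_conv(fc1, fc2, fc3)(conv_features).mean(dim=(2, 3))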

4. Classification Experiments

  • Dataset : ILSVRC-2012 dataset (1000 classes / 1.3M train + 50K val + 100K test)
  • Use validation set as test set

4.1. Single-Scale Evaluation

  • Deeper networks give lower error; the error saturates at 19 layers (config E)
  • At the same depth, additional non-linearity helps, and capturing spatial context with 3 x 3 convs is better than extra 1 x 1 convs (D > C)
  • Deep net with Small filters is better than Shallow net with Large filters
  • Scale jittering : better than fixed S

image

4.2. Multi-Scale Evaluation

  • Better than Single-Scale Evaluation
  • fixed S : Q = {S-32, S, S+32}
  • Scale jittering on [256, 384, 512] : better than fixed S

image

4.3. Multi-Crop Evaluation

  • Multi-crop & Dense evaluation : complementary -> Combination is best

image

4.4. Convnet fusion

  • Combine the outputs of several models by averaging soft-max class posteriors -> improve performance
  • Multiple ConvNet fusion Results
    • ILSVRC submission : only the single-scale networks plus one multi-scale model D had been trained; an ensemble of 7 models => 7.3% test error
    • Post-submission : Ensemble of 2 best-performing multi-scale models (D and E) => 7.0% using dense eval, 6.8% using combined eval

4.5. Comparison with the state of the art

  • ILSVRC-2014 Classification 2nd place with 7.3% test error using an ensemble of 7 models
  • Decreased the error rate to 6.8% using an ensemble of 2 models
  • Single-net performance : VGG is the best

5. CONCLUSION

  • Representation 'depth' is beneficial for the classification accuracy
  • Generalizes well to a wide range of tasks and datasets (more complex recognition pipelines)

Code Review

1. model of VGG16

image

from keras.models import Sequential
from keras.layers.core import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD
import cv2, numpy as np

def VGG_16(weights_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))

    if weights_path:
        model.load_weights(weights_path)

    return model

2. Whole models

import torch
import torch.nn as nn

try:
    from torch.hub import load_state_dict_from_url
except ImportError:
    from torch.utils.model_zoo import load_url as load_state_dict_from_url

torch.manual_seed(0)

# Pretrained model weights
pretrained_model_urls = {
    'vgg11': 'https://download.pytorch.org/models/vgg11-bbd30ac9.pth',
    'vgg13': 'https://download.pytorch.org/models/vgg13-c768596a.pth',
    'vgg16': 'https://download.pytorch.org/models/vgg16-397923af.pth',
    'vgg19': 'https://download.pytorch.org/models/vgg19-dcbb9e9d.pth',
    'vgg11_bn': 'https://download.pytorch.org/models/vgg11_bn-6002323d.pth',
    'vgg13_bn': 'https://download.pytorch.org/models/vgg13_bn-abd245e5.pth',
    'vgg16_bn': 'https://download.pytorch.org/models/vgg16_bn-6c64b313.pth',
    'vgg19_bn': 'https://download.pytorch.org/models/vgg19_bn-c79401a0.pth',
}

# Model info
cfgs = {
    11: [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    13: [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    16: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    19: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
}


class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, num_classes)
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

def make_layers(cfg, batch_norm=False):
    layers = list()
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)


def vgg(depth, batch_norm, num_classes, pretrained):
    model = VGG(make_layers(cfgs[depth], batch_norm=batch_norm), num_classes, init_weights=True)
    arch = 'vgg' + str(depth)
    if batch_norm == True: arch += '_bn'

    if pretrained and (num_classes == 1000) and (arch in pretrained_model_urls):
        state_dict = load_state_dict_from_url(pretrained_model_urls[arch], progress=True)
        model.load_state_dict(state_dict)
    elif pretrained:
        raise ValueError('No pretrained model in vggnet {} model with class number {}'.format(depth, num_classes))

    return model
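
A minimal usage sketch for the vgg() builder above (my own example, not part of the repo): construct a randomly initialized VGG-16 with batch norm and run a dummy forward pass.

import torch

model = vgg(depth=16, batch_norm=True, num_classes=1000, pretrained=False)
dummy = torch.randn(2, 3, 224, 224)   # batch of 2 RGB images
logits = model(dummy)                 # shape: (2, 1000)
print(logits.shape)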

3. Train and Test

from model import *
from utils import *
import os

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class VGGNet():
    def __init__(self, depth=19, batch_norm=True, num_classes=1000, pretrained=False,
                 gpu_id=0, print_freq=10, epoch_print=10, epoch_save=50):

        self.depth = depth
        self.batch_norm = batch_norm
        self.num_classes = num_classes
        self.pretrained = pretrained
        self.gpu = gpu_id
        self.print_freq = print_freq
        self.epoch_print = epoch_print
        self.epoch_save = epoch_save

        torch.cuda.set_device(self.gpu)

        self.loss_function = nn.CrossEntropyLoss().cuda(self.gpu)

        if self.pretrained:
            print('=> Use pre-trained model with depth : {}, batch_norm : {}'.format(self.depth, self.batch_norm))
        else:
            print('=> Create model with depth : {}, batch_norm : {}'.format(self.depth, self.batch_norm))

        model = vgg(self.depth, self.batch_norm, self.num_classes, self.pretrained)
        self.model = model.cuda(self.gpu)

        self.train_losses = list()
        self.train_acc = list()
        self.test_losses = list()
        self.test_acc = list()


    def train(self, train_data, test_data, resume=False, save=False, start_epoch=0, epochs=74,
              lr=0.01, momentum=0.9, weight_decay=0.0005, milestones=False):
        # Model to Train Mode
        self.model.train()

        # Set Optimizer and Scheduler
        optimizer = optim.SGD(self.model.parameters(), lr, momentum=momentum, weight_decay=weight_decay)
        if milestones:
            scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1)
        else:
            scheduler = optim.lr_scheduler.MultiStepLR(optimizer, [epochs//2, epochs*3//4], gamma=0.1)

        # Optionally Resume from Checkpoint
        if resume:
            if os.path.isfile(resume):
                print('=> Load checkpoint from {}'.format(resume))
                loc = 'cuda:{}'.format(self.gpu)
                checkpoint = torch.load(resume, map_location=loc)

                self.model.load_state_dict(checkpoint['state_dict'])

                start_epoch = checkpoint['epoch']
                optimizer.load_state_dict(checkpoint['optimizer'])
                scheduler.load_state_dict(checkpoint['scheduler'])
                print('=> Loaded checkpoint from {} with epoch {}'.format(resume, checkpoint['epoch']))
            else:
                print('=> No checkpoint found at {}'.format(resume))

        # Train
        for epoch in range(start_epoch, epochs):
            if epoch % self.epoch_print == 0:
                print('Epoch {} Started...'.format(epoch+1))
            for i, (X, y) in enumerate(train_data):
                X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
                output = self.model(X)
                loss = self.loss_function(output, y)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if (i+1) % self.print_freq == 0:
                    train_acc = 100 * count(output, y) / y.size(0)
                    test_acc, test_loss = self.test(test_data)

                    self.train_losses.append(loss.item())
                    self.train_acc.append(train_acc)
                    self.test_losses.append(test_loss)
                    self.test_acc.append(test_acc)

                    self.model.train()

                    if epoch % self.epoch_print == 0:
                        print('Iteration : {} - Train Loss : {:.2f}, Test Loss : {:.2f}, '
                              'Train Acc : {:.2f}, Test Acc : {:.2f}'.format(i+1, loss.item(), test_loss,
                                                                             train_acc, test_acc))

            scheduler.step()
            if save and (epoch % self.epoch_save == 0):
                save_checkpoint(self.depth, self.batch_norm, self.num_classes, self.pretrained, epoch,
                                state={'epoch': epoch+1, 'state_dict': self.model.state_dict(),
                                       'optimizer': optimizer.state_dict(), 'scheduler': scheduler.state_dict()})


    def test(self, test_data):
        correct, total = 0, 0
        losses = list()

        # Model to Eval Mode
        self.model.eval()

        # Test
        with torch.no_grad():
            for i, (X, y) in enumerate(test_data):
                X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
                output = self.model(X)

                loss = self.loss_function(output, y)
                losses.append(loss.item())

                correct += count(output, y)
                total += y.size(0)

        return (100*correct/total, sum(losses)/len(losses))
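
A hedged usage sketch of the VGGNet wrapper above (my own example): it assumes a CUDA device, torchvision being installed, and the count / save_checkpoint helpers from utils; CIFAR-10 stands in for ImageNet here.

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_set = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
test_set = datasets.CIFAR10('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=4)

net = VGGNet(depth=16, batch_norm=True, num_classes=10, pretrained=False)
net.train(train_loader, test_loader, epochs=10, lr=0.01)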

[CV_3D] PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

Paper Review

1. Introduction

  • Previous research : To enable weight sharing and kernel optimization, point clouds (irregular format) are transformed into 3D voxel grids or collections of imgs before being fed into the network → Result : Quantization artifacts
  • PointNet
    • Input : Point clouds
      • Simple and unified structure → easy to learn
      • A set of points → Invariant to permutations & rigid motions
    • Output : class labels for entire input or per point segment/part labels for each point of input
    • Max pooling : single symmetric function
    • FC layers : (shape classification) to aggregate learnt optimal values into global descriptor or (shape segmentation) predict per point labels
    • Data-dependent STN : to canonicalize data before PointNet processes them
    • Can approximate any continuous set function
    • Summarizes the input point cloud into a sparse set of key points
    • Robust to small perturbation of input points (corruption by outliers or missing data)
  • Key contributions
    • Model Design : Deep network for unordered point sets in 3D
    • Tasks : 3D shape classification, shape part segmentation, scene semantic parsing
    • Analysis : Empirical and theoretical analysis on Stability and efficiency
    • Experiment : 3D features illustration computed by selected neurons in net

2. Related Work

Point Cloud Features

  • Previous methods : encode certain statistical properties → invariant to certain transformations
  • ex. intrinsic or extrinsic / local or global

DL on 3D Data

  • Volumetric CNN : 3D CNN on voxel grids → limited by data sparsity and the computation cost of 3D conv
  • FPNN, Vote3D : still struggle with large point clouds due to sparse volumes
  • Multiview CNN : render 3D point clouds or shapes into 2D imgs, then apply 2D conv
  • Spectral CNN : limited to manifold meshes, hard to extend to non-isometric shapes
  • Feature-based DNN : convert 3D data into a vector, extract shape features, then classify with fc layers

DL on Unordered Sets

  • Point cloud = unordered set of vectors vs. most works in DL : regular representations
  • read-process-write network with attention : sorting for generic sets and NLP → lacks geometry handling

3. Problem Statement

  • Each Point's channel of PC
    • (x, y, z) + extra feature channels (ex. color, normal, ..)
    • Implementation : (x, y, z) coordinate for simplicity
  • Object classification task
    • Input point cloud : directly sampled from a shape or pre-segmented from a scene point cloud
    • Output : k scores (k : number of candidate classes)
  • Semantic segmentation task
    • Input : a single object from part segmentation or a sub-volume of a 3D scene from object segmentation
    • Output : n x m scores (n : number of points, m : number of semantic sub-categories)

4. Deep Learning on Point Sets

4.1 Properties of Point Sets in R^n

  • Unordered : N 3D point sets → Network needs to be invariant to N! permutations
  • Interaction among points : meaningful local structures from nearby points
  • Invariance under transformations : category and segmentation outputs stay the same under transformations (ex. rotating, translating)

4.2 PointNet Architecture

image

Full network = Classification network + Segmentation network

[ 3 Key modules ]

❶ Max pooling layer : Symmetry Function for Unordered Input

  • Goal : To aggregate information from all points → make model invariant to input permutation (N!)
    image
  • Input : n vectors → Output : a new vector = [f_1, ..., f_K] (invariant to input order)
  • Key idea : To approximate general function $f$ defined on point set by symmetric function on transformed elements
  • Implementation : approximate $h$ by MLP & $g$ by single variable func + max pooling func
    image
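
A tiny sketch (not from the paper's code) illustrating why max pooling over per-point features yields permutation invariance: permuting the input points does not change the column-wise max.

import torch

torch.manual_seed(0)
h = torch.nn.Linear(3, 8)                 # per-point feature function h (stand-in for the shared MLP)
points = torch.randn(16, 3)               # 16 points, (x, y, z)

feat = h(points).max(dim=0).values        # g = max pooling over all points
perm = torch.randperm(16)
feat_perm = h(points[perm]).max(dim=0).values

print(torch.allclose(feat, feat_perm))    # True: the order of points does not matter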

❷ Local and Global Information Aggregation [Segmentation]

  • Max pooling output [f_1, ..., f_K] : only global information for Classification task
  • Goal : To get Local and Global information for Point Segmentation task
  • Implementation (Input) : Concatenating global feature (1024) + each of local point feature (64) → Extracting new per point feature (ex. per-point normals)
    image
    image

❸ T-Net : Joint Alignment Network

  • Goal : Invariant to transformations (ex. rigid transformation)
  • Implementation : Predicting affine transformation matrix by mini-net (T-net) → Applying this transformation to coordinates of input points
  • Result Check : if the semantic labeling stays the same after transformation, the model is invariant
  • Idea from STN (compute a transformation matrix for a canonical view, then multiply it with the input img to obtain an undistorted output img)
    image
  • T-net : composed by basic modules of point independent feature extraction + max pooling + FC layers
    image
  • Feature space Alignment : add another transformation matrix to align features from different input point clouds
    • Since the feature dimension is higher (64x64), add a Regularization loss term for optimization
    • Result : matrix close to orthogonal (does not lose input information) → more stable, better performance
      image

5. Experiment

5.1 Applications (3D recognition)

1) 3D Object Classification

  • Goal : To learn global point cloud feature
  • Dataset : ModelNet40 (12311 CAD models from 40 man-made object categories) → 75% Train + 25% Test
  • Input point cloud : 1024 points uniformly sampled from mesh faces → normalized into a unit sphere
  • Data augmentation : random rotation along the up-axis, jitter the position of each point with Gaussian noise
  • Result : with only fc and max pooling layers, fast inference speed and easy parallelization on CPU
    image

2) 3D Object Part Segmentation

  • Part Segmentation : Given a 3D scan or mesh model → assign an object part category label to each point or face
  • Dataset : ShapeNet part dataset (16881 shapes from 16 categories, annotated with 50 parts)
  • Idea : Part-point Classification
  • Evaluation metric : mIoU on points (shape's mIoU)
  • Result : 2.3% mean IoU improvement
    image
  • Robustness Test (simulated Kinect scans) : lose only 5.3% mIoU
    image

3) Semantic Segmentation in Scenes

  • Point labels : semantic object classes
  • Dataset : Stanford 3D semantic parsing dataset (3D scans in 6 areas including 271 rooms from 13 categories)
  • Point representation : 12-dim vector = 9-dim (XYZ, RGB, normalized location) + 3-dim (local point density, local curvature, normal)
  • Classifier : standard MLP
  • Result : smooth predictions, robustness to missing points and occlusions
    image
  • 3D Object Detection system
    image

5.2 Architecture Design Analysis

  • Dataset : ModelNet40 shape classification problem for comparisons

Comparison with Alternative Order-invariant Methods

  • 3 Approaches
    • MLP (unsorted / sorted input) : points as nx3 arrays
    • LSTM : points as a sequence
    • Symmetry operation : Attention sum, Average pooling, Max pooling
  • Result : Max pooling = Best performance (Acc 87.1%)
    image

Effectiveness of Input and Feature Transformations

  • Input & Feature Transformation STN + Regularization → Acc 2.1% ↑
    image

Robustness Test

  • Robust to various input corruptions
    • Model : Max pooling network / Input points : normalized into a unit sphere
    • Result : with 50% of points missing, Acc drops only 2.4% / 3.8% wrt furthest / random input sampling
  • Robust to outliers
    • Types : XYZ / XYZ+density
    • Result : Acc more than 80% even when 20% are outlier points
      image

5.3 Visualizing PointNet

  • Critical point sets $C_S$ and Upper-bound shapes $N_S$ for sample shapes $S$
    • Critical point sets $C_S$ : points that contribute to the max pooled feature (summarized skeleton of the shape)
    • Upper-bound shapes $N_S$ : largest possible point cloud that give global shape feature f(S)
  • Result : losing some non-critical points does not change $f(S)$ (Robustness)
    image

5.4 Time and Space Complexity Analysis

  • MVCNN, 3DCNN : heavy conv layer computation vs. PointNet : efficient, O(N) in the number of points
    image

Code Review

Dataloader

import torch
from torch.utils.data import Dataset
import numpy as np

class PointCloudDataset(Dataset):
    def __init__(self, npoints=1024):
        self.npoints = npoints
        ...
        
    def __getitem__(self, index):
        points = self.point_list[index]
        
        #randomly sample points
        choice = np.random.choice(points.shape[0], self.npoints, replace=True)
        points = points[choice, :]
        
        #normalize to unit sphere
        points = points - np.expand_dims(np.mean(points, axis=0), 0) #center
        dist = np.max(np.sqrt(np.sum(points**2, axis=1)), 0)
        points = points / dist #scale
        
        points = self.data_augmentation(points)
        
        label = self.label_list[index]
        
        return torch.from_numpy(points).float(), torch.tensor(label)
        
    def data_augmentation(self, points):
        theta = np.random.uniform(0, np.pi*2) #0~360
        rotation_matrix = np.array([[np.cos(theta), -np.sin(theta)],[np.sin(theta), np.cos(theta)]])
        points[:,[0,2]] = points[:,[0,2]].dot(rotation_matrix) # random rotation
        points += np.random.normal(0, 0.02, size=points.shape) # random jitter
        return points
  • Point Cloud : each sample has a different number of points. For batch training, the number of points per sample must match → set n_points and randomly sample that many points from each sample
  • The sampled points are normalized into a unit sphere
  • Data augmentation : random rotation around the y-axis, jittering with Gaussian noise

Main network

class PointNetCls(nn.Module):
    def __init__(self, num_classes=2):
        super(PointNetCls, self).__init__()

        self.tnet = TNet(dim=3)
        self.mlp1 = mlpblock(3, 64)

        self.tnet_feature = TNet(dim=64)

        self.mlp2 = nn.Sequential(
            mlpblock(64, 128),
            mlpblock(128, 1024, act_f=False)
        )

        self.mlp3 = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256, dropout_rate=0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        """
        :input size: (N, n_points, 3)
        :output size: (N, num_classes)
        """
        x = x.transpose(2, 1) #N, 3, n_points
        trans = self.tnet(x) #N, 3, 3
        x = torch.bmm(x.transpose(2, 1), trans).transpose(2, 1) #N, 3, n_points
        x = self.mlp1(x) #N, 64, n_points

        trans_feat = self.tnet_feature(x) #N, 64, 64
        x = torch.bmm(x.transpose(2, 1), trans_feat).transpose(2, 1) #N, 64, n_points

        x = self.mlp2(x) #N, 1024, n_points
        x = torch.max(x, 2, keepdim=False)[0] #N, 1024 (global feature)

        x = self.mlp3(x) #N, num_classes

        return x, trans_feat
  • (1) Compute a transformation matrix for the input points with T-Net → apply the transformation via matrix multiplication
  • (2) Shared mlp1 lifts the feature dim 3 → 64
  • (3) Apply another T-Net transformation (matrix multiplication) to the 64-dim features from shared mlp1
  • (4) Shared mlp2 lifts the feature dim 64 → 128 → 1024
  • (5) Extract a 1024-dim global feature vector by max pooling
  • (6) The last mlp3 performs classification

mlpblock, fcblock

def mlpblock(in_channels, out_channels, act_f=True):
    layers = [
        nn.Conv1d(in_channels, out_channels, 1),
        nn.BatchNorm1d(out_channels),
    ]
    if act_f:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def fcblock(in_channels, out_channels, dropout_rate=None):
    layers = [
        nn.Linear(in_channels, out_channels),
    ]
    if dropout_rate is not None:
        layers.append(nn.Dropout(p=dropout_rate))
    layers += [
        nn.BatchNorm1d(out_channels),
        nn.ReLU()
    ]
    return nn.Sequential(*layers)
  • Shared mlp : implemented as a 1D conv layer with kernel size=1

T-Net

class TNet(nn.Module):
    def __init__(self, dim=64):
        super(TNet, self).__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            mlpblock(dim, 64),
            mlpblock(64, 128),
            mlpblock(128, 1024)
        )
        self.fc = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256),
            nn.Linear(256, dim*dim)
        )
        
    def forward(self, x):
        x = self.mlp(x)
        x = torch.max(x, 2, keepdim=True)[0]
        x = x.view(-1, 1024)

        x = self.fc(x)

        idt = torch.eye(self.dim, dtype=torch.float32).flatten().unsqueeze(0).repeat(x.size()[0], 1)
        idt = idt.to(x.device)
        x = x + idt
        x = x.view(-1, self.dim, self.dim)
        return x
  • Computes the transformation matrix for mapping into a canonical space

Train

import torch
import torch.nn as nn

def feature_transform_regularizer(trans):
    D = trans.size()[1]
    I = torch.eye(D)[None, :, :]
    I = I.to(trans.device)
    loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2,1)) - I, dim=(1,2)))
    return loss
    
#sample data
points = torch.rand(5, 1024, 3)
target = torch.empty(5, dtype=torch.long).random_(10)

model = PointNetCls(num_classes=10)
loss_f = nn.CrossEntropyLoss()

pred, trans_feat = model(points)
loss = loss_f(pred, target)
loss += feature_transform_regularizer(trans_feat) * 0.001
  • Defines the regularization function for the feature transform
  • Loss : Cross entropy loss

Reference

[CV_3D] PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

Background

  • 3D sparse convolution : a convolution technique that can be applied efficiently to sparse 3D voxel data
  • Point Set Abstraction : the local feature encoding method for point sets from PointNet++
    • keypoint sampling → MLP → Max pooling → generate a feature vector
    • enables local feature encoding with a small number of points

3D detection methods

  • (1) Grid based
    • convert the irregular pc into regular 3D voxels or a 2D BEV map, then detect
    • Ex. 3D sparse conv → efficient / receptive field limited by kernel size
  • (2) Point based
    • detect by encoding features directly from the points, without conversion
    • Ex. PointNet Set Abstraction → flexible receptive field, accurate contextual information / pairwise distance computation, cost ↑
  • (3) PV-RCNN (Grid and Point based)
    • Stage 1 : 3D box Region Proposals with the Grid based method
    • Stage 2 : Location refinement with the Point based method
      image

PV-RCNN for Point Cloud Object Detection

image

1) 3D Voxel CNN for Efficient Feature Encoding and Proposal Generation

Proposal Generation

  • Convert the pc into voxels
  • → obtain an 8x downsampled feature volume via (3x3x3) 3D sparse conv
  • → convert it into a 2D BEV feature map
  • → generate 3D box proposals with 2 anchors (0º, 90º) per feature map pixel
  • → for each anchor, classify object presence & regress the box

Problems of RoI Pooling

  • Downsampling reduces the resolution by 8x → hard to recover the exact location of input objects (information loss)
  • Upsampling - Interpolation : too sparse / Set abstraction : enables robust refinement but computation cost ↑

Solution by PV-RCNN

  • Take grid points inside every box proposal → obtain multi-scale feature volumes for the grid points and apply Set abstraction
  • Aggregate the feature volumes into a small set of sampled Keypoints → the RoI grid points then generate their features from the keypoints

2) Voxel-to-Keypoint Scene Encoding via Voxel Set Abstraction

VSA (Voxel Set Abstraction) Module

  • Sample a fixed number of Keypoints from the whole pc with FPS
  • Collect the voxel features within a given radius ( $r_k$ ) to form a set
    • Set $r_k$ differently for each layer → flexible receptive field
  • Encode a multi-scale feature volume via Voxel Set Abstraction with PointNet blocks
    • $M$ : random sampling of T voxels, $G$ : MLP, $max$ : Max pooling
  • $f_i^{p}$ = $f_i^{pv}$ + $f_i^{raw}$ + $f_i^{bev}$
    • $f_i^{pv}$ : keypoint features gathered from each layer
    • $f_i^{raw}$ : Set abstraction result on the raw points (compensates for the quantization loss from voxelization)
    • $f_i^{bev}$ : Keypoint feature from the BEV map (wider receptive field)
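
A hedged sketch of farthest point sampling (FPS), which is used above to pick the keypoints; the function name and shapes are illustrative, not from the official PV-RCNN code.

import numpy as np

def farthest_point_sampling(points, n_keypoints):
    """points: (N, 3) array -> indices of n_keypoints spread-out points."""
    n = points.shape[0]
    selected = np.zeros(n_keypoints, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)                    # random start point
    for i in range(1, n_keypoints):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, (diff ** 2).sum(axis=1))  # distance to nearest selected point
        selected[i] = dist.argmax()                       # pick the farthest remaining point
    return selected

pc = np.random.rand(20000, 3)
keypoint_idx = farthest_point_sampling(pc, 2048)
keypoints = pc[keypoint_idx]                              # (2048, 3)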

PKW (Predicted Key point Weighting) Module
image

  • Function : add a PC segmentation network to compute a foreground confidence weight for each point
  • Implementation : multiply each keypoint feature by its foreground confidence
  • Effect : foreground feature vectors have a larger influence during refinement
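
A minimal sketch of the PKW idea (my own simplification, not the official code): a small head predicts a foreground score per keypoint and the keypoint features are re-weighted by it.

import torch
import torch.nn as nn

keypoint_feat = torch.randn(2, 2048, 128)                 # (batch, n_keypoints, feature dim)
foreground_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

score = torch.sigmoid(foreground_head(keypoint_feat))     # (2, 2048, 1) foreground confidence
weighted_feat = keypoint_feat * score                     # foreground keypoints dominate the refinement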

3) Keypoint-to-Grid RoI Feature Abstraction for Proposal Refinement

RoI-grid Pooling Module

  • Sample 6x6x6 grid points inside each 3D proposal
  • Encode the grid point features inside the RoI from the keypoints via Set Abstraction
  • Effect : more flexible receptive field, richer contextual information
  • +) Even keypoints outside the proposal boundary are encoded

image

  • Form keypoint sets at multiple radii around each grid point → sample T of them → MLP → Max pooling = Grid point feature

3D Proposal refinement and confidence prediction

  • Pass the grid point features through a 2-layer MLP to obtain a 256-dim RoI feature vector
  • Result : Confidence & Box refinement are predicted
  • $y_k$ : uses the IoU to judge which box proposal is better
  • $L_{iou}$ : used for the confidence prediction (CE loss)

Training losses

  • Total loss = Region proposal loss + Key point segmentation loss + Proposal refinement loss

[CV_FER] Facial Motion Prior Networks for Facial Expression Recognition

FMRN-FER : Facial Motion Prior Networks for Facial Expression Recognition

FMPN-FER Architecture

image

  • Facial-Motion Mask Generator (FMG)
    • Generate a facial mask to focus on facial muscle moving regions
    • Use avg differences bw neutral faces and expressive faces as training guidance (pseudo gt masks)
  • Prior Fusion Net (PFN)
    • Generated mask is applied to and fused with original input expressive face
  • Classification Net (CN)
    • Extract features and predict facial expression label (6 class)

Implementation Details

  • CN : Inception V3 (pretrained on ImageNet)
  • 5 landmarks are extracted, followed by face normalization
  • Image Transforms : Random crop from four corners or center & Random horizontal flip
  • Training (2 steps)
    1. Start by tuning only FMG for 300 epochs, using the Adam optimizer
      • LR (FMG : 1e−4) fixed until epoch 150, then linearly decayed to 0
    2. Jointly train the entire framework with λ1 = 10 and λ2 = 1
      • 200 epochs
      • LR (FMG : 1e−5, CN : 1e−4) linearly decayed from epoch 100
      • l_total = λ1 * l_G(MSE) + λ2 * l_C(CE) = 10 * l_G + l_C
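
A minimal sketch of the two-term objective above (helper names are placeholders, not from the authors' repo): a weighted sum of the FMG guidance loss (MSE to the pseudo gt mask) and the classification loss.

import torch.nn.functional as F

lambda1, lambda2 = 10.0, 1.0

def total_loss(pred_mask, gt_mask, logits, label):
    l_g = F.mse_loss(pred_mask, gt_mask)      # FMG guidance loss (MSE against the pseudo gt mask)
    l_c = F.cross_entropy(logits, label)      # expression classification loss (CE)
    return lambda1 * l_g + lambda2 * l_c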

Experimental Results

image

  • MMI Facial Expression Database
    • Labelled with 6 basic expressions (Disgust > Sadness > Happy > Fear > Surprise > Anger)
    • 3 peak frames around center of each labelled sequence are selected → Total : 624 expressive faces
    • 10-fold person-independent cross-validation experiments
    • Details for MMI

[CV_GAN] Generative Adversarial Nets

GAN : Generative Adversarial Nets
https://jeonggg119.tistory.com/37

Abstract

  • Estimating Generative models via an Adversarial process
  • Simultaneously training two models (minimax two-player game)
    • Generative model G : capturing the data distribution (→ recovering the training data distribution)
    • Discriminative model D : estimating probability that a sample came from training DB rather than G → equal to 1/2
  • G and D are defined by multilayer perceptrons & trained with backprop

1. Introduction

  • The promise of DL : to discover models that represent probability distributions over many kinds of data
  • The most striking success in DL : Discriminative models that map a high dimensional, rich sensory input to a class label
    • based on backprop and dropout
    • using piecewise linear units behaved gradient
  • Deep Generative model : less impact due to..
    • difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation
    • difficulty of leveraging benefits of piecewise linear units
  • GAN : training both models using only backprop and dropout & sampling from G using only forward prop
    • Generative model G : generating samples by passing random noise through a multilayer perceptron
    • Discriminative model D : also defined by a multilayer perceptron
    • No need for Markov chains or inference networks

2. Related work

  • RBMs(restricted Boltzmann machines), DBMs(deep Boltzmann machines) : undirected graphical models with latent variables
  • DBNs(Deep belief networks) : hybrid models containing a single undirected layer and several directed layers
  • Score matching, NCE(noise-contrastive estimation) : criteria that don't approximate or bound log-likelihood
  • GSN(generative stochastic network) : extending generalized DAE -> training G to draw samples from desired distribution

3. Adversarial nets

1) Adversarial modeling (G+D) based on MLPs

  • p_g : G's distribution
  • p_z(z) : Input noise random variables
  • G : differentiable function represented by MLP -> G(z) : mapping to data space -> output : fake img
  • D(x) : probability that x came from the train data rather than p_g from G -> output : single scalar

2) Two-player minimax game with value function V(G,D)

image

  • D : maximize probability of assigning correct label to Training examples & Samples from G
    • D(x)=1, D(G(z))=0
  • G : minimize log(1-D(G(z)))
    • D(G(z))=1
    • Implementation : train G to maximize log(D(G(z))) = stronger gradients early in learning (preventing saturations)
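
A hedged sketch of one alternating update of the minimax game above, assuming simple MLP G and D; shapes and hyperparameters are illustrative only, and the non-saturating G loss from the Implementation note is used.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784)                     # stand-in for a real minibatch
z = torch.randn(64, 100)

# D step: maximize log D(x) + log(1 - D(G(z)))
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# G step: maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))
loss_g = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()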

3) Theoretical Analysis

image

  • Training criterion allows one to recover data generating distribution as G and D are given enough capacity

image

  • [Algorithm 1] k steps of optimizing D and 1 step of optimizing G
    • D : being maintained near its optimal solution
    • G : changing slowly enough
  • Loss function for G : min log(1-D(G(z))) => in practice, max log(D(G(z))) for stronger gradients early in training
  • D is trained to discriminate samples from data, converging to D*(x)=P_d(x)/(P_d(x)+P_g(x))
    • When D reaches the optimal state for its objective, G is trained so that it also achieves its objective
  • ∴ P_g(x) = P_data(x) <=> D(G(z))=1/2

4. Theoretical Results

  • G implicitly defines P_g as distribution of the samples G(z) obtained when z~P_z
  • [Algorithm 1] to converge to a good estimator of P_data
  • Non-parametric : representing a model with infinite capacity by studying convergence in the space of probability density functions
  • Global optimum for p_g = p_data

4.1 Global Optimality of p_g = p_data

[Proposition 1]
image

  • Optimal D for any given G
  • For G fixed, optimal D is D*(x)=P_d(x)/(P_d(x)+P_g(x))

[Theorem 1]
image
image

  • Global minimum of C(G) = - log4 is achieved if and only if P_g=P_data

4.2 Convergence of Algorithm 1

image

[Proposition 2]

  • If G and D have enough capacity, and at each step of Algorithm 1,
  • D is allowed to reach optimum given G & P_g is updated to improve criterion → P_g = P_data
  • pf) V(G, D) = U(P_g, D) : convex function in P_g
  • Computing a gradient descent update for P_g at optimal D given G
  • With sufficiently small updates of P_g
  • Optimizing θ_g rather than P_g itself
  • Excellent performance of MLP in practice → reasonable model to use despite their lack of theoretical guarantees

5. Experiments

  • Datasets : MNIST, Toronto Face Database(TFD), CIFAR-10
  • G : ReLU + sigmoid activations / Dropout and other noise at intermediate layers / Noise as input to bottommost layer
  • D : Maxout activations / Dropout

[Table 1]
image

  • Estimation method : Gaussian Parzen window-based log-likelihood estimation for probability of test data

[Figure 2]
image

  • Rightmost column : nearest neighboring training sample → Model has not memorized training set
  • Samples are fair random draws (Not cherry-picked)
  • No Markov chain mixing is involved in the sampling process → Samples are uncorrelated

[Figure 3]
image

  • Linear Interpolation bw coordinates in z space of full model

6. Advantages and disadvantages

1) Disadvantages

  • No explicit representation of P_g(x)
  • D must be synchronized well with G during training (G must not be trained too much without updating D)
  • Otherwise, G collapses too many values of z to the same value of x and loses the diversity needed to model P_data

2) Advantages

(1) Computational Advantages

  • Markov chains are never needed / Only backprop is used / No Inference is needed
  • Wide variety of functions can be incorporated into model

(2) Statistical Advantages from G

  • Not being updated directly with data, but only with gradients flowing through D
  • (= Components of input are not copied directly into G's parameters)
  • Representing very sharp, even degenerating distributions

7. Conclusions and future work

  • conditional GAN p(x|c) : adding c as input to both G and D
  • Learned approximate inference : training auxiliary network to predict z given x
    • Similar to inference net trained by wake-sleep algorithm
    • Advantage : inference net trained for a fixed G after G has finished training
  • All conditionals GAN p(x_S | x_{∖S}) : S is a subset of the indices of x; trained as a family of conditional models that share params
    • To implement a stochastic extension of deterministic MP-DBM
  • Semi-supervised learning : when limited labeled data is available
  • Efficiency improvements : training accelerated by coordinating G and D or determining better distributions to sample z

[CV_3D] PointMLP: Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework

PointMLP: Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework

Paper Review

1. Introduction

image

  • Point Cloud : unordered, irregular set of points → sparseness and noise restrict performance
  • Prior Research : local geometric extractors using convolution, graph, or attention → memory overhead
  • PointMLP : DNN for PC using only residual feed-forward MLPs (No local geometric extractors)
    • +) lightweight local geometric affine module : to adaptively transform point feature in a local region
    • Result : SOTA classification performance on ModelNet40, real-world ScanObjectNN

2. Related Work

  • Two mainstreams of Point Cloud Analysis

    • Projecting PC to intermediate voxels or 2D imgs : fast, efficient BUT detail degradation by information loss
    • Directly processing PC : ex, PointNet, PointNet++ → PointMLP follows philosophy of PointNet++ but simpler
  • Local geometry exploration

    • Goal : How to generate better regional points representation ?
    • Prior Research : local geometric extractors using convolution, graph, or attention
      • Ex. PointConv, PAConv / EdgeConv, 3DGCN / PCT, Point Transformer
    • Limitation : minimal improvement, saturated performance
  • Deep Network Architecture

    • Prior Development : Image Processing Network (stacking learning layers) & DNN like ResNet
    • Deep MLP architecture : efficiency and generality
    • PointMLP : simple and powerful Deep Residual MLP for PC

3. Deep Residual MLP for Point Cloud

3.1 Point-based Methods

  • Motivation : to directly consume pc from beginning & avoid unnecessary rendering
  • Goal : to directly learn representation $f$ of point $P$ using NN
  • Limitations : computational complexity (prohibitive inference latency) & saturated performance gain
  • Ex) PointNet, PointNet++, Point Transformer, ...

PointNet++

  • Main idea : learning hierarchical features by stacking multiple learning stages
    • In each stage $s$, $N_s$ points are re-sampled by FPS
  • Formulation : $g_i = A(Φ (f_{i,j}) |j=1, ..., K)$
    • $A$ : aggregation function (max-pooling)
    • $Φ$ : local feature extraction function (MLP)
    • $f_{i,j}$ : $j$-th neighbor point feature of $i$-th sampled point
    • $K$ : number of neighbor points

3.2 PointMLP (feed-forward residual MLP)

image

  • Main idea : hierarchically aggregating local features extracted by MLPs (No local extractor)
  • Formulation : $g_i = Φ_{pos} ( A (Φ_{pre} (f_{i,j}), |j=1, ..., K))$
    • $Φ_{pre}$, $Φ_{pos}$ : residual point MLP blocks to extract local features

      In paper, 2 residual blocks in both $Φ_{pre}$, $Φ_{pos}$ / neighbors by KNN : $K$=24

      • $Φ_{pre}$ : to learn shared weights from a local region
      • $Φ_{pos}$ : to extract deep aggregated features
      • MLP = FC, normalization, activation layers
    • $A$ : aggregation function (max-pooling)
    • $MLP(x) + x$ : mapping function (a series of homogeneous residual MLP blocks)
    • Recursively repeating operation by $s$ stages → receptive field ↑

      In paper, $s$ = 4

  • Merits
    • MLP → permutation invariance
    • Residual connection → layers ↑ →deep feature representation
    • No sophisticated local extractors → efficient with highly optimized feed-forward MLPs

[Code] Mapping function $MLP(x) + x$

class ConvBNReLURes1D(nn.Module):
    def __init__(self, channel, kernel_size=1, groups=1, res_expansion=1.0, bias=True, activation='relu'):
        super(ConvBNReLURes1D, self).__init__()
        self.act = get_activation(activation)
        self.net1 = nn.Sequential(
            nn.Conv1d(in_channels=channel, out_channels=int(channel * res_expansion),
                      kernel_size=kernel_size, groups=groups, bias=bias),
            nn.BatchNorm1d(int(channel * res_expansion)),
            self.act
        )
        if groups > 1:
            self.net2 = nn.Sequential(
                nn.Conv1d(in_channels=int(channel * res_expansion), out_channels=channel,
                          kernel_size=kernel_size, groups=groups, bias=bias),
                nn.BatchNorm1d(channel),
                self.act,
                nn.Conv1d(in_channels=channel, out_channels=channel,
                          kernel_size=kernel_size, bias=bias),
                nn.BatchNorm1d(channel),
            )
        else:
            self.net2 = nn.Sequential(
                nn.Conv1d(in_channels=int(channel * res_expansion), out_channels=channel,
                          kernel_size=kernel_size, bias=bias),
                nn.BatchNorm1d(channel)
            )

    def forward(self, x):
        return self.act(self.net2(self.net1(x)) + x)

[Code] $Φ_{pre}$

  • To learn shared weights from a local region
class PreExtraction(nn.Module):
    def __init__(self, channels, out_channels,  blocks=1, groups=1, res_expansion=1, bias=True,
                 activation='relu', use_xyz=True):
        """
        input: [b,g,k,d]: output:[b,d,g]
        :param channels:
        :param blocks:
        """
        super(PreExtraction, self).__init__()
        in_channels = 3+2*channels if use_xyz else 2*channels
        self.transfer = ConvBNReLU1D(in_channels, out_channels, bias=bias, activation=activation)
        operation = []
        for _ in range(blocks):
            operation.append(
                ConvBNReLURes1D(out_channels, groups=groups, res_expansion=res_expansion,
                                bias=bias, activation=activation)
            )
        self.operation = nn.Sequential(*operation)

    def forward(self, x):
        b, n, s, d = x.size()  # torch.Size([32, 512, 32, 6])
        x = x.permute(0, 1, 3, 2)
        x = x.reshape(-1, d, s)
        x = self.transfer(x)
        batch_size, _, _ = x.size()
        x = self.operation(x)  # [b, d, k]
        x = F.adaptive_max_pool1d(x, 1).view(batch_size, -1)
        x = x.reshape(b, n, -1).permute(0, 2, 1)
        return x

[Code] $Φ_{pos}$

  • To extract deep aggregated features
class PosExtraction(nn.Module):
    def __init__(self, channels, blocks=1, groups=1, res_expansion=1, bias=True, activation='relu'):
        """
        input[b,d,g]; output[b,d,g]
        :param channels:
        :param blocks:
        """
        super(PosExtraction, self).__init__()
        operation = []
        for _ in range(blocks):
            operation.append(
                ConvBNReLURes1D(channels, groups=groups, res_expansion=res_expansion, bias=bias, activation=activation)
            )
        self.operation = nn.Sequential(*operation)

    def forward(self, x):  # [b, d, g]
        return self.operation(x)

3.3 Geometric Affine Module

  • Motivation

    • To increase the depth, one can increase the number of stages $s$ or the number of residual blocks, but a deeper MLP loses accuracy and stability (less robust)
    • pc = sparse, irregular in local regions → each local region would need a different extractor, but the shared residual MLP cannot provide that
  • Lightweight local geometric affine module

    • To transform local neighbor points to normal distribution while maintaining original geometric properties

    image

    • sigma ← compute the variance of the neighbors around the center point, divide by $k$ (neighbor #) x $n$ (point #) x $d$ (=3), then take the square root
    • alpha, beta : learnable parameters

[Code]

# Group points
idx = knn_point(self.kneighbors, xyz, new_xyz)
grouped_xyz = index_points(xyz, idx)  # [B, npoint, k, 3]
grouped_points = index_points(points, idx)  # [B, npoint, k, d]

# Calculate fi and sigma
mean = torch.mean(grouped_points, dim=2, keepdim=True)
std = torch.std((grouped_points - mean).reshape(B, -1), dim=-1, keepdim=True).unsqueeze(dim=-1).unsqueeze(dim=-1)

# Perform Normalization
grouped_points = (grouped_points - mean) / (std + 1e-5)
grouped_points = self.affine_alpha * grouped_points + self.affine_beta

3.4 Computational complexity and Elite version

  • Motivation : FC layers → huge parameters, computational complexity => How to improve efficiency?
  • Elite version
    • Bottleneck structure for mapping function $Φ_{pre}$, $Φ_{pos}$ (residual MLP blocks)
      • Reduce the intermediate FC layer channels (by 4x), then expand back to the original feature map size => fewer parameters
    • Fewer MLP blocks and a smaller embedding dimension
    • No grouped FC operation

[Code] pointMLP vs. pointMLP-elite

def pointMLP(num_classes=40, **kwargs) -> Model:
    return Model(points=1024, class_num=num_classes, embed_dim=64, groups=1, res_expansion=1.0,
                   activation="relu", bias=False, use_xyz=False, normalize="anchor",
                   dim_expansion=[2, 2, 2, 2], pre_blocks=[2, 2, 2, 2], pos_blocks=[2, 2, 2, 2],
                   k_neighbors=[24, 24, 24, 24], reducers=[2, 2, 2, 2], **kwargs)


def pointMLPElite(num_classes=40, **kwargs) -> Model:
    return Model(points=1024, class_num=num_classes, embed_dim=32, groups=1, res_expansion=0.25,
                   activation="relu", bias=False, use_xyz=False, normalize="anchor",
                   dim_expansion=[2, 2, 2, 1], pre_blocks=[1, 1, 2, 1], pos_blocks=[1, 1, 2, 1],
                   k_neighbors=[24,24,24,24], reducers=[2, 2, 2, 2], **kwargs)

4. Experiments

4.1 Shape Classification on ModelNet40

  • Dataset : ModelNet40 (meshed CAD models, 40 categories)
  • Metrics : mAcc(class-avg acc), OA(overall acc)
  • Train : 300 epoch, SGD
  • Results
    • OA ↑ : PointMLP (94.5%) > CurveNet (94.2%, previous SOTA); OA had been saturated around 94% for a long time
    • Inference speed ↑ : PointMLP-elite(176 samples/s) > PointMLP(112 samples/s) > CurveNet(15 samples/s)
      image

4.2 Shape Classification on ScanObjectNN

  • Dataset : ScanObjectNN (15000 real world objects, 15 classes) - background, noise, occlusions → hard

    hardest perturbed variant (PB_T50_RS)

  • Metrics : mAcc(class-avg acc), OA(overall acc)
  • Train : 200 epoch, batch size 32, SGD
  • Results
    • Significant improvement on mAcc and OA by fewer training epochs, no voting
    • Smallest gap bw mAcc and OA → No bias to a particular category (robustness)
      image

4.3 Ablation Studies

Network Depth

  • Variants : 24, 40, 56-layers PointMLP
  • Deeper is not always better → an appropriate depth exists (considering the tradeoff bw acc and stability)

    40-layers : best tradeoff (85.4% mAcc and 0.3 standard deviation)

  • Outperforms recent methods regardless of depth
    image

Geometric Affine Module : important component

  • Performance improvement : 3% ↑ for all variants
    • Reason1. mapping local input features to a normal distribution → easy train
    • Reason2. encoding local geometric information by channel-wise distance to local centroid and variance
  • Stability improvement (=better robustness)
    image

3D Loss landscape

  • (b) PointMLP : with residual connections, the loss landscape is flat rather than sharp → easier optimization
    image

4.4 Part Segmentation

  • Dataset : ShapeNetPart (16881 shapes, 16 classes, 50 part labels in total)
  • Results : predictions of PointMLP are close to GT
    image
    image

Code Review

[Code] PointMLP for Classification (ModelNet40)

class Model(nn.Module):
    def __init__(self, points=1024, class_num=40, embed_dim=64, groups=1, res_expansion=1.0,
                 activation="relu", bias=True, use_xyz=True, normalize="center",
                 dim_expansion=[2, 2, 2, 2], pre_blocks=[2, 2, 2, 2], pos_blocks=[2, 2, 2, 2],
                 k_neighbors=[32, 32, 32, 32], reducers=[2, 2, 2, 2], **kwargs):
        super(Model, self).__init__()
        self.stages = len(pre_blocks)
        self.class_num = class_num
        self.points = points
        self.embedding = ConvBNReLU1D(3, embed_dim, bias=bias, activation=activation)
        assert len(pre_blocks) == len(k_neighbors) == len(reducers) == len(pos_blocks) == len(dim_expansion), \
            "Please check stage number consistent for pre_blocks, pos_blocks k_neighbors, reducers."
        self.local_grouper_list = nn.ModuleList()
        self.pre_blocks_list = nn.ModuleList()
        self.pos_blocks_list = nn.ModuleList()
        last_channel = embed_dim
        anchor_points = self.points
        for i in range(len(pre_blocks)):
            out_channel = last_channel * dim_expansion[i]
            pre_block_num = pre_blocks[i]
            pos_block_num = pos_blocks[i]
            kneighbor = k_neighbors[i]
            reduce = reducers[i]
            anchor_points = anchor_points // reduce
            # append local_grouper_list
            local_grouper = LocalGrouper(last_channel, anchor_points, kneighbor, use_xyz, normalize)  # [b,g,k,d]
            self.local_grouper_list.append(local_grouper)
            # append pre_block_list
            pre_block_module = PreExtraction(last_channel, out_channel, pre_block_num, groups=groups,
                                             res_expansion=res_expansion,
                                             bias=bias, activation=activation, use_xyz=use_xyz)
            self.pre_blocks_list.append(pre_block_module)
            # append pos_block_list
            pos_block_module = PosExtraction(out_channel, pos_block_num, groups=groups,
                                             res_expansion=res_expansion, bias=bias, activation=activation)
            self.pos_blocks_list.append(pos_block_module)

            last_channel = out_channel

        self.act = get_activation(activation)
        self.classifier = nn.Sequential(
            nn.Linear(last_channel, 512),
            nn.BatchNorm1d(512),
            self.act,
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            self.act,
            nn.Dropout(0.5),
            nn.Linear(256, self.class_num)
        )

    def forward(self, x):
        xyz = x.permute(0, 2, 1)
        batch_size, _, _ = x.size()
        x = self.embedding(x)  # B,D,N
        for i in range(self.stages):
            # Give xyz[b, p, 3] and fea[b, p, d], return new_xyz[b, g, 3] and new_fea[b, g, k, d]
            xyz, x = self.local_grouper_list[i](xyz, x.permute(0, 2, 1))  # [b,g,3]  [b,g,k,d]
            x = self.pre_blocks_list[i](x)  # [b,d,g]
            x = self.pos_blocks_list[i](x)  # [b,d,g]

        x = F.adaptive_max_pool1d(x, 1).squeeze(dim=-1)
        x = self.classifier(x)
        return x

[CV_Segmentation] Multi-scale context aggregation by dilated convolutions

Multi-scale context aggregation by dilated convolutions

1. INTRODUCTION

  • Semantic segmentation requires combining pixel-level acc with multi-scale contextual reasoning
  • Structural differences between image classification and dense prediction

dense prediction : predicting a label for every pixel of the image

  • Repurposed classification networks : which components are truly necessary, and which reduce accuracy when operated densely?
  • Modern classification networks
    • Integrating multi-scale contextual information via successive pooling and subsampling → reduce resolution
    • BUT dense prediction needs full-resolution output
  • Demand of multi-scale reasoning and full-resolution
    • repeated up-convolutions : need severe intermediate downsampling → necessary?
    • combination predictions of multiple rescaled inputs : separated analysis of input → necessary?
  • Dilated convolutions : conv module designed for dense prediction (semantic segmentation)
    • multi-scale contextual information without losing resolution
    • plugged into existing architectures at any resolution
    • no pooling or subsampling
    • exponential expansion of receptive field without losing resolution or coverage
    • accuracy of sota semantic segmentation ↑

2. Dilated convolutions

image

  • Dilated convolution (*l) can apply same filter at different ranges using different dilation factors (l)
  • F_(i+1) = F_i (*2^i) k_i for i = 0,1,...,n-2
  • F : discrete functions, k : discrete 3x3 filters
  • Size of receptive field of each element in F_(i+1) = [ 2^(i+2) -1 ] X [ 2^(i+2) -1 ] : square of exponentially increasing size
    • (a) F_1 : 3x3, (b) F_2 : 7x7, (c) F_3 : 15x15 receptive field
    • non-red field = zero value
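
A small PyTorch sketch (not from the paper) of the exponential receptive-field growth: stacking 3x3 convs with dilations 1, 2, 4 preserves the resolution while a single output element sees a 15x15 input window.

import torch
import torch.nn as nn

# 3x3 convs with dilation 1, 2, 4 and matching padding: resolution is preserved
layers = nn.Sequential(
    nn.Conv2d(1, 1, 3, padding=1, dilation=1),   # receptive field 3x3
    nn.Conv2d(1, 1, 3, padding=2, dilation=2),   # receptive field 7x7
    nn.Conv2d(1, 1, 3, padding=4, dilation=4),   # receptive field 15x15
)
x = torch.randn(1, 1, 64, 64)
print(layers(x).shape)   # torch.Size([1, 1, 64, 64]) -- no loss of resolution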

3. Multi-scale context aggregation

image

[ Context module ]

  • Input, Output
    • C feature maps → C feature maps : can maintain resolution
    • Same form : can be plugged into any dense prediction architecture
  • Each layer has C channels
    • directly obtain dense per-class prediction
    • feature maps are not normalized, no loss is defined
  • Multiple layers that expose contextual information → increase acc

[ Basic Context module ]

  • 7 layers : 3x3xC conv with different dilation factors (1,1,2,4,8,16,1)
  • A final layer : 1x1xC conv → produce output of the module
  • Front end module output feature map : 64x64 resolution → stop expansion after layer 6
  • Identity Initialization : set all filters s.t each layer simply passes input directly to the next
  • Result : increase dense prediction acc both quantitatively and qualitatively & small # of parameters (total: 64C^2)
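
A hedged sketch of the basic context module described above (C channels, 3x3 convs with dilation factors 1,1,2,4,8,16,1, then a final 1x1 layer); padding choices are mine to keep the 64x64 resolution, and the identity initialization step is omitted.

import torch.nn as nn

def basic_context_module(C):
    dilations = [1, 1, 2, 4, 8, 16, 1]
    layers = []
    for d in dilations:
        layers += [nn.Conv2d(C, C, kernel_size=3, padding=d, dilation=d), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(C, C, kernel_size=1)]   # final 1x1xC layer producing the module output
    return nn.Sequential(*layers)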

4. Front End

[ Front End module ] : Backbone module of Context module

  • Input : reflection padded color image → Output : 64x64xC feature maps
  • remove the last 2 pooling and striding layers of VGG-16 → dilate the convolutions in all subsequent layers by a factor of 2 for each removed pooling layer (factor 4 in the final layers)
  • remove padding of intermediate feature maps

image

  • Training
    • Pascal VOC 2012 training set + subset of annotations of validation set
    • SGD, batch size = 14, lr = 10^-3, momentum = 0.9, iterations = 60K
  • Test result : front end is both simpler and more accurate
    image
    image

5. Experiments

  • Implementation : based on Caffe library
  • Dataset : Microsoft COCO with VOC-2012 categories
  • Training : 2 stage
    • 1st : VOC-2012 & COCO : SGD, batch size = 14, momentum = 0.9, iterations = 100K (lr = 10^-3) + 40K (lr = 10^-4)
    • 2nd : fine-tuned network on VOC-2012 only : iterations = 50K (lr = 10^-5)
  • Test result
    • Front-end module (alone) : 69.8% mean IoU on val set, 71.3% on test set
    • Attribution : high acc by removal of vestigial components for image classification

(1) Controlled evaluation of context aggregation

image
image

  • context module and structured prediction are synergistic → increase accuracy in each configuration
  • large context module increases acc by a larger margin

(2) Evaluation on the test set

image

  • large context module : significant boost in acc over front end
  • Context module + CRF-RNN = highest acc

CRF-RNN (Conditional Random Field RNN) : post-processing step to get more fine-grained segmentation results in end to end manner

6. Conclusion

  • Dilated convolution : dense prediction + increasing receptive field without losing resolution + increasing acc
  • Future arch : end-to-end -> removing the need for pre-training -> raw input, dense label at full resolution output

[CV_Pose Estimation] DeepPose: Human Pose Estimation via Deep Neural Networks

DeepPose: Human Pose Estimation via Deep Neural Networks

1. Introduction

Previous challenges (Limitations)

  • Localization of human joints using local detector

strong articulations, small visible joints, occlusions, need to capture context
modeling only a small subset of all interactions bw body parts

  • Holistic manner proposed but limited success in real-world problems

DNN (Deep Neural Networks)

  • visual classification tasks, object localization

Holistic human Pose estimation as DNN

  • Pose estimation <=> Joint regression (location of each joint is regressed)
  • Input : full img & 7-layered generic convolutional DNN
  • Capturing full context of each body joint
  • Simpler to formulate : no need to design whole feature representations, detectors for parts, interactions bw joints
  • Cascade of DNN-base pose predictors : increased precision of joint localization
  • SOTA or better than SOTA on 4 benchmarks

2. Related Work

  • Pictorial Structures (PSs) : distance transform trick
  • Tree-based pose models with simple binary potential
  • Richer part detectors : enriching representational power + maintaining tractability
  • Mixture models on full scale
  • Richer higher-order spatial relationships
  • Transfer joint locations, Nearest neighbor setup
  • Semi-global classifier for part config : linear -> less expressive representation (only arms)
  • Pose regression : 3D pose
  • CNNs with Neighborhood component analysis to regress : No cascade
  • NN-based pose embedding : contrastive loss

3. Deep Learning Model for Pose Estimation

  • Encoding locations of all k body joints in Pose vector

    image

    • x : Input Image data
    • k : # of body joints
    • y : GT pose vector (2k Dim)
    • y_i : x, y coordinates 2D vector of i-th joint (absolute img coordinates)
  • Normalized y_i wrt bounding box b

    image

    • b = (b_c, b_w, b_h)
    • b_c : center of b (2D)
    • b_w : width of b
    • b_h : height of b
  • Normalized Pose vector

    image
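
A tiny sketch of the normalization wrt the bounding box b described above (my own helper, following the definition: translate by the box center, then scale by the box width/height).

import numpy as np

def normalize_joints(y, b_c, b_w, b_h):
    """y: (k, 2) absolute joint coords -> coords normalized w.r.t. bounding box b."""
    return (y - b_c) / np.array([b_w, b_h])

def denormalize_joints(y_norm, b_c, b_w, b_h):
    """Map normalized joint coords back to absolute image coordinates."""
    return y_norm * np.array([b_w, b_h]) + b_c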

3.1 Pose Estimation as DNN-based Regression [Initial stage]

  • Architecture

    image

    image

    • x : Input Image data
    • φ : regression function based on conv DNN
      • Input : 220 x 220 img -> 55 x 55 (by stride = 4)
      • 7 layers (filter size : 11x11, 5x5, 3x3, 3x3, 3x3)
      • Pooling : applied after 3 layers
      • Total # of params : 40M
      • Generic DNN Arch -> Holistic modeling & all internal features can be shared
    • θ : parameters of model
    • y* : pose prediction vector (absolute img coordinates vector)
  • Loss function and Training

    image

    • L2 loss : minimize distance bw prediction and true pose vector
    • Using Normalized training set D_N
    • Optimization over individual joints (if not all joints are labeled, omit those terms)
    • Mini-batch size = 128, learning rate = 0.0005
    • Data Augmentation : random translated crop, left/right flip
    • DropOut regularization rate = 0.6

3.2 Cascade of Pose Regressors

  • Purpose : to solve limited capacity for detail (fixed input size) and achieve better precision

  • Same network Arch for all stages of cascade but Different learnable parameters

  • 1st stage : estimate an initial pose
    image

  • Subsequent stage : predict and refine displacement of joint locations y^s - y^(s-1)
    image

    • θ_s : learned network params
    • φ_i : pose displacement regressor
    • y_i : joint location
    • b_i : joint bbox
    • diam(y^s) : distance bw opposing joints on human torso
    • σ : scale parameter for diam(y^s)
  • Process

    • Using predicted joint locations to focus on relevant parts of img
    • Cropping sub-imgs around predicted joint location
    • Applying pose displacement regressor on sub-imgs
  • Result : higher resolution imgs -> finer features -> higher precision

  • Full augmented Training data

    • Data Augmentation : multiple normalizations
    • Using predictions from previous stage + simulated predictions (generated by randomly displacing GT)
      image

4. Empirical Evaluation

4.1 Setup

Datasets

  • Frames Labeled In Cinema (FLIC)
    • 4000 train img + 1000 test img from Hollywood movies
    • diverse poses and clothing
    • 10 upper body joints are labeled for each human
  • Leeds Sports Dataset (LSP)
    • 11000 train img + 1000 test img from sports activities
    • 150 pixel height for majority of people
    • 14 joints labeled for each person full body

Metrics

  • Percentage of Correct Parts (PCP) : detected if distance bw predicted and true limb joint is at most half of limb length -> hard to detect for shorter limbs, lower arms
  • Percentage of Detected Joints (PDJ) : varying degrees, detected if distance bw predicted and true limb joint is within certain fraction of torso diameter -> all joints are based on same distance threshold

Experimental Details

  • FLIC : Rough estimate of initial bbox by Face-based body detector
  • LSP : Full img as initial bbox
  • To measure the optimality of params, use the average PDJ at 0.2 across all joints
  • To improve generalization, Augment data by sampling 40 randomly translated crop boxes
  • Running time : 0.1s per img on a 12 core CPU
  • Training complexity is higher

4.2 Results and Discussion

image
image
image
image
image

5. Conclusion

  • First application of DNNs to human pose estimation
  • Capturing context and reasoning about pose in a holistic manner
  • A generic CNN designed for classification tasks can be applied to localization tasks

[CV_Action Recognition] Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition (ST-GCN)

  • Automatic Learning both spatial and temporal patterns from data
  • Greater expressive power & Stronger generalization capability

Paper Review

1. Introduction

  • Human Action Recognition (HAR)

    • Multiple modalities : Appearance, Depth, Optical-flows, "Body skeletons (Dynamic human skeletons)"
    • "Dynamic human skeletons" : represented by a time series of human joints
    • Limitation of previous works : hand-crafted parts or rules to analyze spatial patterns (not explicitly exploiting spatial relationships among joints) → Less expressive power & Difficult to be generalized
  • ST-GCN

    • Components
      • Node = Joint of human body
      • Two types of Edge (Spatial Edge & Temporal Edge)
    • 3 Contributions
      • (1) The first attempt to apply GNN for modeling dynamic skeletons for HAR task
      • (2) Designing convolutional kernels for skeleton modeling
      • (3) Superior performance on two large scale datasets

2. Related Work

Two streams of GNN

  • Spectral perspective : locality of graph convolution is considered in the form of spectral analysis
  • Spatial perspective : conv filters are applied directly on nodes and their neighbors (This work)

Skeleton-based Action Recognition

  • Skeleton : robust to illumination change and scene variation & easy to obtain with depth sensors or HPE algorithms
  • Hand-crafted feature based methods : manually designing features to capture dynamics of joint motion
  • DL based methods : modeling joints within body parts (explicitly assigned using domain knowledge)
  • ST-GCN : applying GCN to skeleton-based AR
    • Can learn part information implicitly by using locality of graph conv with temporal dynamics
    • No manual part assignment → easier to design to learn better action representations

3. ST-GCN

  • Human joints move in small local groups (body parts) → joint trajectories are restricted, which suits hierarchical representations
  • Motivation : hierarchical representation and locality are intrinsic properties of convolution → graph convolution is preferable to manual part assignment

3.1 Pipeline Overview

image
  • Skeleton based data : obtained from motion-capture devices or pose estimation algorithms from video
    • = Sequence of frames (each frame has a set of joints)
  • ST-GCN
    • Input : joint coordinate vectors on graph nodes
    • Multiple layers of ST-GCN → generating higher-level feature maps on graph
      • Graph with joints as nodes & connectivities in both body structures and time as edges
    • Output : classified action category by softmax classifier
    • Training : E2E with backprop

3.2 Skeleton Graph Construction

  • Previous work for skeleton based AR : concatenating coordinate vectors of all joints ⇒ a single feature vector per frame
  • ST-GCN : undirected graph $G$ = $(V, E)$ on a skeleton sequence with $N$ joints, $T$ frames ⇒ hierarchical representation
    • Node set : $V$ = { $v_{ti} | t=1, ..., T, i=1, ..., N$ } --> all the joints
    • Input Feature vector on a node : $F(v_{ti})$ --> coordinate vectors + estimation confidence
    • Edge set : $E$ = $E_S$ & $E_F$
      • Spatial edge : $E_S$ = { $(v_{ti}, v_{tj}) | (i, j) ∈ H$ }, $H$ : set of naturally connected joints
        = Intra-skeleton edge connecting joints within each frame (spatial connection)
      • Temporal edge : $E_F$ = { $(v_{ti}, v_{(t+1)i})$ }
        = Inter-frame edge connecting the same joint in consecutive frames (temporal connection)
    • 2 Steps
      • 1st) Joints within one frame are connected with edges by connectivity of body structure
      • 2nd) Each joint is connected to the same joint in the consecutive frame
    • Advantages : No manual part assignment → model can work on datasets with different number of joints

3.3 Spatial Graph Convolutional Neural Network

1st Step (on a single frame at time τ) = $N$ joint nodes $V_t$ along with edges $E_S(τ)$

  • Input feature map $f_{in}$ with channel $c$
  • Output value at spatial location $x$ : $f_{out}$
    image
  • Sampling function $p$ : $Z^2$ x $Z^2$$Z^2$
    • [Image domain] function that gathers the neighboring pixels around a given pixel
    • [Graph domain] function that gathers the neighboring nodes within distance $D$ from a given node : $B(v_{ti})$ ⊂ $V$
      • (In this paper) $D$ = 1 : 1-neighbor set of joint nodes (directly connected nodes)
      image
  • Weight function $w$ : $Z^2$ → $R^c$ ~ independent of input location $x$ → filter weight sharing possible
    • [2D conv] rigid grid → pixels within neighbor can have fixed spatial order
      • $w$ can be implemented by indexing a tensor of (c, K, K) dim according to spatial order
    • [Graph conv] no implicit arrangement → order is defined by graph labeling
      • simplified by partitioning neighbor set $B$ into a fixed number of $K$ subsets
      • $w$ can be implemented by indexing a tensor of (c, K) dim
        image

Spatial Graph Convolution

  • Normalizing term $Z$ : to balance contributions of different subsets to output
image image

2nd Step : Spatial "Temporal" Modeling

  • Purpose : Extending domain (Spatial graph ⇒ Spatial Temporal graph)
    • By adding temporally connected joint (connecting the same joints across consecutive frames)
  • Neighbor set $B(v_{ti})$ of a joint node $v_{ti}$
    • Γ : parameter gamma (temporal kernel size) to control temporal range to be included in neighbor graph
    image
  • Label map $l_{ST}$
    image

3.4 Partition Strategies

image
  • Methods to implement label map $l_{ST}$ (= to define neighbor nodes)
  • (a) Input skeleton frame
    • Red dashed circles : receptive fields of a filter with D=1
  • (b) Uni-labeling partitioning : K=1, $l_{ti}(v_{tj})$ = 0
    • All neighbor nodes has the same label
    • Suboptimal b/c local differential properties could be lost
  • (c) Distance partitioning : K=2, $l_{ti}(v_{tj})$ = $d(v_{tj}, v_{ti})$
    • Labeling according to each node's distance to the root node $v_{ti}$
    • Root node = 0, Other neighbor nodes = 1 (In this case, D=1)
  • (d) Spatial configuration partitioning : K=3, $l_{ti}(v_{tj})$ = 0 or 1 or 2 (a labeling sketch follows this list)
    • Labeling according to each distance to gravity center (black cross) compared with root node (green)
      • Root node itself : $l_{ti}$ = 0 if $r_j$ = $r_i$
      • Centripetal group : $l_{ti}$ = 1 if $r_j$ < $r_i$
        • = neighbor nodes closer to gravity center than root node
      • Centrifugal group : $l_{ti}$ = 2 if $r_j$ > $r_i$
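
A minimal sketch of the spatial configuration partitioning rule above (names and shapes are assumptions; only the labeling rule follows the paper):

import numpy as np

def spatial_configuration_labels(neighbors, joints, center, root):
    """Assign K=3 partition labels to the neighbors of a root joint.
    neighbors : indices of the neighbor joints (root included)
    joints    : (V, C) joint coordinates of one frame
    center    : coordinates of the skeleton's gravity center
    root      : index of the root joint"""
    r_root = np.linalg.norm(joints[root] - center)
    labels = {}
    for j in neighbors:
        r_j = np.linalg.norm(joints[j] - center)
        if j == root:
            labels[j] = 0      # root node itself
        elif r_j < r_root:
            labels[j] = 1      # centripetal group (closer to the gravity center)
        else:
            labels[j] = 2      # centrifugal group (farther from the gravity center)
    return labels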

3.5 Learnable edge importance weighting

  • Problem : a single joint can appear in multiple body parts, but its contributions should carry different importance
  • Solution : Adding learnable mask $M$ on every layer
    • Scales a node's feature contribution to its neighboring nodes based on the learned importance weight of each spatial graph edge
    • Effect : improved recognition performance, possible to have data dependent attention map

3.6. Implementation ⇒ Code comparison

  • Implementation Details

    • $A$ : Adjacency matrix representing intra-body connections on a single frame
    • $I$ : Identity matrix representing self-connections (the graph convolution uses a normalized $A + I$; see the sketch after this list)
      image
      image
      image
      image
      image
  • Network Architecture and Training
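
A minimal PyTorch sketch of the spatial graph convolution built on a normalized $A + I$. Class and function names are assumptions; the tensor shapes (N, C, T, V) and the partition-wise aggregation follow the paper's formulation.

import torch

def normalize_adjacency(A):
    """Column-normalize each partition's adjacency; A: (K, V, V), assumed to already include self-loops (A + I)."""
    D = A.sum(dim=1)                                          # (K, V) degree of each node
    D_inv = torch.where(D > 0, 1.0 / D, torch.zeros_like(D))
    return A * D_inv.unsqueeze(1)                             # scale column j of A[k] by 1 / D[k, j]

class SpatialGraphConv(torch.nn.Module):
    """1x1 convolution produces K groups of channels, one per partition subset,
    which are then aggregated with the K adjacency matrices."""
    def __init__(self, in_channels, out_channels, K):
        super().__init__()
        self.K = K
        self.conv = torch.nn.Conv2d(in_channels, out_channels * K, kernel_size=1)

    def forward(self, x, A):                  # x: (N, C, T, V), A: (K, V, V) normalized
        x = self.conv(x)                      # (N, K * out_channels, T, V)
        n, kc, t, v = x.size()
        x = x.view(n, self.K, kc // self.K, t, v)
        return torch.einsum('nkctv,kvw->nctw', x, A)

Stacking this spatial convolution with a temporal convolution over the T dimension gives one ST-GCN block.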

4. Experiments

4.1 Dataset & Evaluation Metrics

4.2 Ablation Study

4.3 Comparison with SOTAs

Code Review

[CV_3D] PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space

Paper Review

1. Introduction

  • PointNet : learning a spatial encoding of each point → (max-pooling) aggregating all point features into a global PC feature (no local features)
  • PointNet++ : processing a set of points sampled in metric space in a hierarchical fashion
    partitioning a set of points into overlapping local regions
    → extracting local features capturing fine geometric structures from small neighborhoods
    → grouping local features into larger unit and processing to produce higher level features

[ Two issues of the design of PointNet++ ]

1. How to generate overlapping partitioning of point set

  • Each partition : a neighborhood ball in Euclidean space
    • Centroid Location : selected by FPS (Farthest Point Sampling)
    • Scale : multiple scales are combined for both robustness and detail capture (random input dropout)

2. How to abstract sets of points or local features through a local feature learner (=PointNet)

  • PointNet : processing an unordered set of points for semantic feature extraction & robust to input data corruption
  • PointNet++ : applying PointNet recursively on a nested partitioning of input set

2. Problem Statement

  • $X = (M, d)$ : discrete metric space, metric = Euclidean space $R^n$
    • $M$ : set of points (density of $M$ is not uniform)
    • $d$ : distance metric
  • $f$ : set functions = classification or segmentation function
    • Input : $X$ (along with additional features for each point)
    • Output : information of semantic interest regarding $X$
    • classification function : to assign a label to $X$
    • segmentation function : to assign a per point label to each member of $M$

3. Method

3.1 Review of PointNet : A Universal Continuous Set Function Approximator

  • Point Cloud : a set of sparse points => efficient, but a permutation-invariant operation is required
  • PointNet : a single MAX pooling extracts the global feature of the PC, but local context is lost (lower segmentation performance)
    image
    • $f$ : permutation-invariant set function → arbitrarily approximate any continuous set function

3.2 Hierarchical Point Set Feature Learning (Set Abstraction)

  • PointNet++ : hierarchical grouping of points and progressively abstracting larger local regions
  • Set Abstraction level (3 layers) : converts the PC into a compressed PC containing the overall semantic information → extracts local features of the PC
  • Input : $N$ x ( $d$ + $C$ ) matrix ..... $N$ points with $d$-dim coordinates + $C$-dim point feature
  • Output : $N'$ x ( $d$ + $C'$ ) matrix ..... $N'$ subsampled points with $d$-dim coordinates + new $C'$-dim feature vectors

In Paper, $d$ = 3 → (x,y,z)

image

[ 3 layers ]

❶ Sampling layer

  • Sampling layer : Selecting a set of points from input points { ${x_1, x_2, ..., x_n}$ }
    ..... select $N'$ centroids among the $N$ input points (representative points = centers of local regions)
  • Farthest Point Sampling (FPS)
    • Centroid = iteratively pick the point most distant (in the metric/Euclidean sense) from the already selected points (see the sketch below)
    • Better coverage of the entire point set than Random Sampling
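
A minimal numpy sketch of farthest point sampling (the reference implementation uses a compiled CUDA op; this naive version is only for illustration):

import numpy as np

def farthest_point_sampling(xyz, n_centroids):
    """Iteratively pick the point farthest from the already-selected centroids.
    xyz : (N, 3) point coordinates; returns the indices of the chosen centroids."""
    N = xyz.shape[0]
    chosen = np.zeros(n_centroids, dtype=np.int64)
    chosen[0] = np.random.randint(N)              # random starting point
    dist = np.full(N, np.inf)                     # distance to the nearest chosen centroid
    for i in range(1, n_centroids):
        d = np.linalg.norm(xyz - xyz[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = np.argmax(dist)               # farthest point from all chosen so far
    return chosen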

❷ Grouping layer

  • Grouping layer : find the neighbor points of each centroid → group them into one local region point set
    • Input : a point set = $N$ x ( $d$ + $C$ ) & coordinates of a set of centroids = $N'$ x $d$
    • Output : local groups of point sets = $N'$ x $K$ x ( $d$ + $C$ ) ..... $K$ : # of neighbor points of centroid points
      $K$ : flexible number (differs per group) → the PointNet layer then extracts one fixed-length local region feature vector per group
  • Metric distances to define neighbor points
      1. KNN : the $K$ points closest to the centroid (fixed number of neighbor points)
      2. Ball query : the points within radius r of the centroid (fixed region scale) → more generalizable

    In Paper, using Ball query method

# From the PointNet++ reference implementation (TensorFlow 1.x); the custom ops below
# (farthest_point_sample, gather_point, query_ball_point, knn_point, group_point)
# are compiled ops shipped with the repo (its tf_sampling / tf_grouping modules).
import tensorflow as tf
from tf_sampling import farthest_point_sample, gather_point
from tf_grouping import query_ball_point, group_point, knn_point

def sample_and_group(npoint, radius, nsample, xyz, points, knn=False, use_xyz=True):
    new_xyz = gather_point(xyz, farthest_point_sample(npoint, xyz)) # (batch_size, npoint, 3)
    if knn:
        _,idx = knn_point(nsample, xyz, new_xyz)
    else:
        idx, pts_cnt = query_ball_point(radius, nsample, xyz, new_xyz)
    grouped_xyz = group_point(xyz, idx) # (batch_size, npoint, nsample, 3)
    grouped_xyz -= tf.tile(tf.expand_dims(new_xyz, 2), [1,1,nsample,1]) # translation normalization
    if points is not None:
        grouped_points = group_point(points, idx) # (batch_size, npoint, nsample, channel)
        if use_xyz:
            new_points = tf.concat([grouped_xyz, grouped_points], axis=-1) # (batch_size, npoint, nample, 3+channel)
        else:
            new_points = grouped_points
    else:
        new_points = grouped_xyz

    return new_xyz, new_points, idx, grouped_xyz

❸ PointNet layer

  • PointNet layer : encodes the point pattern of each local region → extracts one local feature vector per region
    • Input : $N'$ local regions of points with data size $N'$ x $K$ x ( $d$ + $C$ )
    • Output : $N'$ x ( $d$ + $C'$ )
  • Mini-PointNet = basic building block for local pattern learning
def pointnet_sa_module(xyz, points, npoint, radius, nsample, mlp, mlp2, group_all, is_training, bn_decay, scope, bn=True, pooling='max', knn=False, use_xyz=True, use_nchw=False):
    data_format = 'NCHW' if use_nchw else 'NHWC'
    with tf.variable_scope(scope) as sc:
        # Sample and Grouping
        if group_all:
            nsample = xyz.get_shape()[1].value
            new_xyz, new_points, idx, grouped_xyz = sample_and_group_all(xyz, points, use_xyz)
        else:
            new_xyz, new_points, idx, grouped_xyz = sample_and_group(npoint, radius, nsample, xyz, points, knn, use_xyz)

        # Point Feature Embedding
        if use_nchw: new_points = tf.transpose(new_points, [0,3,1,2])
        for i, num_out_channel in enumerate(mlp):
            new_points = tf_util.conv2d(new_points, num_out_channel, [1,1],
                                        padding='VALID', stride=[1,1],
                                        bn=bn, is_training=is_training,
                                        scope='conv%d'%(i), bn_decay=bn_decay,
                                        data_format=data_format) 
        if use_nchw: new_points = tf.transpose(new_points, [0,2,3,1])

        # Pooling in Local Regions
        if pooling=='max':
            new_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
        elif pooling=='avg':
            new_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
        elif pooling=='weighted_avg':
            with tf.variable_scope('weighted_avg'):
                dists = tf.norm(grouped_xyz,axis=-1,ord=2,keep_dims=True)
                exp_dists = tf.exp(-dists * 5)
                weights = exp_dists/tf.reduce_sum(exp_dists,axis=2,keep_dims=True) # (batch_size, npoint, nsample, 1)
                new_points *= weights # (batch_size, npoint, nsample, mlp[-1])
                new_points = tf.reduce_sum(new_points, axis=2, keep_dims=True)
        elif pooling=='max_and_avg':
            max_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
            avg_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
            new_points = tf.concat([avg_points, max_points], axis=-1)

        new_points = tf.squeeze(new_points, [2]) # (batch_size, npoints, mlp2[-1])
        return new_xyz, new_points, idx

3.3 Robust Feature Learning under Non-Uniform Sampling Density

  • Goal : address the difficulty of feature learning on point sets with non-uniform density (sparse ~ dense)
  • (1) Train on PCs sampled at various densities
  • (2) Density Adaptive layer : extract feature vectors from PCs at various scales and combine them

image

[ 2 Types of Density Adaptive layers ]

1. Multi-scale grouping (MSG)

  • Apply grouping multiple times at different scales → multiple point sets of different scales for each centroid
  • Concatenate the feature vectors extracted from each point set → multi-scale feature vector
  • Each point set undergoes random input dropout (down-sampling) → densities of various scales (various sparsity, varying uniformity)
  • Drawback : a local PointNet must be run for every centroid → computationally expensive, inefficient, time-consuming

2. Multi-resolution grouping (MRG)

  • Compensates for MSG's drawback; the method used in PointNet++
  • $L_i$ level features : concatenation of 2 feature vectors of different scales → multi-scale feature vector
  • Left vector : feature summarizing the features of each sub-region at the lower level $L_{i-1}$
  • Right vector : feature obtained by applying PointNet to all raw points in the local region $L_i$
  • Advantage : no feature extraction over large-scale neighborhoods at the lowest levels → more efficient

3.4 Point Feature Propagation for Set Segmentation

  • The set abstraction sampling layers reduce the PC size → the original size must be restored for the segmentation task
  • (1) Up-sampling : interpolate features from the previous points ( $N_{l-1}$ points ) with (1/distance)-weighted interpolation (a minimal sketch follows this list)
  • (2) Skip connection : concatenate the feature vectors from before down-sampling → supplement the information
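
A minimal numpy sketch of the (1/distance)-weighted interpolation step (function name and the k=3 neighbors are assumptions; the reference implementation uses compiled 3-NN interpolation ops):

import numpy as np

def inverse_distance_interpolate(xyz_dense, xyz_sparse, feat_sparse, k=3, eps=1e-8):
    """Propagate features from the N_l sparse points back to the N_{l-1} dense points
    by inverse-distance weighting over the k nearest sparse points."""
    out = np.zeros((xyz_dense.shape[0], feat_sparse.shape[1]), dtype=feat_sparse.dtype)
    for i, p in enumerate(xyz_dense):
        d = np.linalg.norm(xyz_sparse - p, axis=1)   # distances to all sparse points
        nn = np.argsort(d)[:k]                       # k nearest sparse points
        w = 1.0 / (d[nn] + eps)                      # inverse-distance weights
        w /= w.sum()
        out[i] = (w[:, None] * feat_sparse[nn]).sum(axis=0)
    return out

The interpolated features are then concatenated with the skip-connected features from the corresponding set abstraction level.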

4. Experiments

[ Dataset ]
image

4.1 Point Set Classification in Euclidean Metric Space

  • MNIST (2D Object)
    image

    • Input : 2D img coordinates converted into a 2D PC of digit pixel locations (default 512 points)
    • Result : lower error rate than PointNet on the digit classification task, and even better performance than CNN-based models
  • ModelNet40 (3D rigid Object)
    image

    • Input : 3D PC obtained by sampling the surface of CAD model 3D meshes (default 1024 points)
    • Face normals are used as additional point features ( $N$ = 5000 ) to boost performance
    • All points are normalized to be 0 mean and within a unit (r=1) ball
    • Model : 3-level hierarchical network + 3 FC layers
    • Result : outperforms MVCNN (SOTA model) on the 3D shape classification task
    • Ablation study of Density adaptive layer : models trained at multiple scales (MSG, MRG) = robust to the number of points (or density)
      image
# MSG classification network from the reference implementation (models/pointnet2_cls_msg.py);
# assumes the repo's pointnet_util and tf_util modules are on the path.
from pointnet_util import pointnet_sa_module, pointnet_sa_module_msg
import tensorflow as tf
import tf_util

def get_model(point_cloud, is_training, bn_decay=None):
    """ Classification PointNet, input is BxNx3, output Bx40 """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    end_points = {}

    l0_xyz = point_cloud
    l0_points = None

    # Set abstraction layers
    l1_xyz, l1_points = pointnet_sa_module_msg(l0_xyz, l0_points, 512, [0.1,0.2,0.4], [16,32,128], [[32,32,64], [64,64,128], [64,96,128]], is_training, bn_decay, scope='layer1', use_nchw=True)
    l2_xyz, l2_points = pointnet_sa_module_msg(l1_xyz, l1_points, 128, [0.2,0.4,0.8], [32,64,128], [[64,64,128], [128,128,256], [128,128,256]], is_training, bn_decay, scope='layer2')
    l3_xyz, l3_points, _ = pointnet_sa_module(l2_xyz, l2_points, npoint=None, radius=None, nsample=None, mlp=[256,512,1024], mlp2=None, group_all=True, is_training=is_training, bn_decay=bn_decay, scope='layer3')

    # Fully connected layers
    net = tf.reshape(l3_points, [batch_size, -1])
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training, scope='fc1', bn_decay=bn_decay)
    net = tf_util.dropout(net, keep_prob=0.4, is_training=is_training, scope='dp1')
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training, scope='fc2', bn_decay=bn_decay)
    net = tf_util.dropout(net, keep_prob=0.4, is_training=is_training, scope='dp2')
    net = tf_util.fully_connected(net, 40, activation_fn=None, scope='fc3')

    return net, end_points

4.2 Point Set Segmentation for Semantic Scene Labeling

  • ScanNet (3D Scene)
    image
    • Each point has a segmentation label indicating which object it belongs to
    • Result : higher segmentation performance => learning local features through the hierarchical structure is important for understanding scenes at various scales
    • Ablation study : training on reduced, non-uniformly sampled densities using the density adaptive layers → MRG outperforms SSG across various densities
      image

4.3 Point Set Classification in Non-Euclidean Metric Space

  • SHREC15 (3D non-rigid Object)
    image
    • SHREC15 dataset : 2D surfaces embedded in 3D space
    • Goal : To show generalizability of PointNet++ to non-Euclidean space
    • Requirement : knowledge of 'intrinsic structure'
    • [Fig.7] (a), (c) : different in pose -> same category
    • Geodesic distances along the surfaces induce a metric space

      Geodesic distance : the shortest path between the vertices in a graph

    • PointNet++ : constructing metric space induced by geodesic distance → extracting intrinsic point features in WKS, HKS, multi-scale Gaussian curvature → using these features as input → sampling and grouping points
    • Result : capturing multi-scale intrinsic structure not influenced by specific pose => effective, higher performance

4.4 Feature Visualization

  • Visualization of What has been learned by the 1st level kernels of hierarchical network
    image

Conclusion

Future works

  • To think how to accelerate inference speed of network for MSG and MRG layers by sharing more computation in each local regions
  • To find applications in higher dimensional metric spaces where CNN based method would be computationally unfeasible

[CV_CNN] Accelerating the Super-Resolution Convolutional Neural Network

Abstract

  • SRCNN : high computational cost -> no real-time performance (24fps) -> limited practical usage
  • FSRCNN : accelerated, compact hourglass-shape SRCNN for faster and better SR
    • 3 Main re-design aspects
      • deconv layer at the end : learning mapping directly from original low-resol img to high-resol img (No interpolation)
      • reformulation of mapping layer : shrinking input dim -> mapping -> expanding back
      • smaller filter sizes but more(deeper) mapping layers
    • Results : speed up of more than x40 with superior restoration quality
    • Additional aspects
      • Parameter settings for real-time performance on CPU
      • Transfer strategy for fast training and testing across Different upscaling factors

1. Introduction

1) Previous SR algorithms

  • Learning-based(or patch-based) methods
  • SRCNN : faster than upper methods but still slow speed (no real-time)

2) Inherent limitations of previous algorithms

(1) [Pre-processing step] Upsampling by Bicubic interpolation -> high computation complexity

  • n^2 times computation cost for n upscaling factor
  • Solution : learning directly from original LR img -> n^2 times faster

(2) [Costly non-linear mapping step] Input patches are projected on high-dim LR & HR feature space

  • parameter # ↑ -> accuracy ↑ but also running time ↑
  • Solution : shrinking network scale while keeping accuracy

3) FSRCNN : Solution for upper limitations

(1) Deconvolution layer to replace Bicubic interpolation

  • deconv layer at the end of network -> computational complexity ~ spatial size of original LR img
    • Better than interpolation kernel like in FCN / unpooling+conv
    • deconv layer consists of diverse automatically learned upsampling kernels -> generate final HR img

(2) Adding shrinking/expanding layer at the beginning/end of mapping layer separately

  • To restrict mapping in low-dim feature space

(3) Additional aspects

  • Decomposition a single mapping layer into several layers with fixed filter size 3x3
  • Overall shape : hourglass (symmetric : thick end and thin middle)

4) Achievements

  • Speed up of more than 40x (+ FSRCNN-s can run in real time with generic CPU)
  • Different upscaling factors
    • All conv layers except deconv can be shared
    • Training : only fine-tune deconv layer for another upscaling factor (no loss of mapping acc)
    • Testing : only do convolution operations once & upsampling img to different scales using corresponding deconv layer

5) Contributions

  • Formulate a compact hourglass-shape CNN structure for fast img SR by deconv -> E2E mapping with no pre-processing
  • Speed up 40x than SRCNN while keeping performance
  • Transfer conv layers for fast training and testing across different upscaling factors with no loss of quality

2. Related Work

DL for SR

  • After SRCNN was proposed for the SR task, many deeper structures followed
    • SRCNN : directly learning E2E mapping bw LR and HR img
    • Sparse-coding-based method : outperform SRCNN with small size model BUT hard to shrink with no loss of mapping acc
    • All these networks : required pre-processing with bicubic-upscaling
  • FSRCNN : only required a different deconv layer -> faster to upscale an img to different sizes

CNNs acceleration

  • High-level vision (Object detection, Image classification, ..) : many studies have been conducted to accelerate CNNs
    • Approximating existing well-trained models
  • Low-level vision (SR) : DL models for SR have no fully-connected layers, so the convolution filters dominate the cost
    • FSRCNN : Reformulating previous model -> better performance

3. Fast Super-Resolution by CNN

3.1 SRCNN

  • Aim : learning E2E mapping function F bw bicubic interpolated LR img Y and HR img X
  • Network : All conv layers -> output size = input size
  • Computation complexity
    • ~ S_HR(size of HR img)
    • middle layer : contributing most to params
  • Cost function : MSE

3 main parts (steps)

  • (1) Patch extraction and Representation
    • extracting patches from input and representing each patch as a high-dim feature vector
  • (2) Non-linear mapping
    • mapping feature vectors non-linearly to another set of feature vectors (HR features)
  • (3) Reconstruction
    • aggregating features to form the final output img

3.2 FSRCNN

  • Notations : Conv(f_i, n_i, c_i), DeConv(f_i, n_i, c_i), where f_i, n_i, c_i represent filter size, filter#, channel#
  • Activation function : PReLU
    • Aim : mainly to avoid the dead features caused by zero gradients in ReLU
    • different on coeff of negative part with ReLU
    • parameter a_i : fixed to be 0 for ReLU <-> learnable for PReLU (full use of all params for max capacity of net)
  • Overall structure : FSRCNN(d, s, m)
    • Conv(5,d,1) - PReLU - Conv(1,s,d) - PReLU - m x Conv(3,s,s) - PReLU - Conv(1,d,s) - PReLU - DeConv(9,1,d)
    • d : LR feature dimension, s : shrinking filters #, m : mapping depth governing performance and speed
    • Shape : hourglass (symmetric : thick end and thin middle)
  • Computational complexity : proportional to S_LR (size of the original LR img) rather than S_HR
  • Cost function : MSE

5 main parts (steps)

  • (1) Feature extraction : Conv(5, d, 1)
    • Similar to first part of SRCNN but Different on input img
    • Feature extraction on original LR input img (Y_s) without interpolation
    • SRCNN : Conv(9, n_1, 1) on upscaled img (Y) <-- most pixels in Y are interpolated from Y_s
    • FSRCNN : Conv(5, d, 1) on original img (Y_s) <-- 5x5 cover almost info of 9x9 patch in Y
      • f_1 = 5 : smaller filter with little information loss
      • n_1 = d : filter # <=> LR feature dimension # <<< 1st sensitive variable
  • (2) Shrinking : Conv(1, s, d)
    • SRCNN : Feature extraction → (No shrinking) → Mapping => mapping LR features high-dim directly to HR feature spaces
    • LR feature dim d is usually very large -> high computation complexity
    • FSRCNN : Feature extraction → Shrinking layer (1x1) → Mapping => reduce LR feature dim
      • f_2 = 1 : 1x1 filter to perform like a linear combination
      • n_2 = s << d : smaller filter number to reduce LR feature dim from d to s
      • Result : greatly reduce params #
  • (3) Non-linear Mapping : m x Conv(3, s, s)
    • the most important part for SR performance
    • the most influencing factors : width (filters # in a layer), depth (layers #)
    • SRCNN : single 5x5 layer (5x5 better than 1x1 layer)
    • FSRCNN : multiple 3x3 layers
      • f_3 = 3 : 3x3 layers (trade-off bw performance and net scale)
      • m : multiple layers to replace a single wide one <<< sensitive variable to determine mapping acc and complexity
      • n_3 = s : all mapping layers contain same number of filters
  • (4) Expanding : Conv(1, d, s)
    • Inverse process of Shrinking layer Conv(1, s, d)
    • Shrinking operation reduces # of LR feature dim
    • BUT if the HR img is restored directly from these low-dim features, the final restoration quality is poor
    • Expanding layer after mapping part to expand HR feature dim
      • f_4 = 1 : 1x1 filters to maintain consistency with shrinking layer
      • n_4 = d : filter # <=> LR feature dimension #
  • (5) Deconvolution : DeConv(9, 1, d)
    • Aim : upsampling and aggregating previous features
    • Deconvolution (Transposed Convolution) = Inverse operation of Convolution
      • Convolution : stride k → output is 1/k times of input
      • Exchange the position of input and output → output will be k times of input
      • Deconvolution : stride k = n (desired upscaling factor) → output is directly reconstructed HR img
    • f_5 = 9 : filter size of deconv <=> consistent with filter size of conv (first layer) of SRCNN
      • Reversed network = Downscaling operator (HR img → LR img)
      • [Fig3] patterns of learned deconv filters are very similar to first layer filters in SRCNN
    • Deconv layer learns Upsampling kernel for input feature maps (kernels are diverse and meaningful in [Fig3])
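
Putting the five parts above together, a minimal PyTorch sketch of FSRCNN(d, s, m). The original implementation is in Caffe, so this is an assumption-level reimplementation; d=56, s=12, m=4 is the paper's default setting.

import torch.nn as nn

class FSRCNN(nn.Module):
    """Hourglass-shaped FSRCNN(d, s, m): feature extraction -> shrinking ->
    m mapping layers -> expanding -> deconvolution for upscaling."""
    def __init__(self, scale=3, d=56, s=12, m=4):
        super().__init__()
        layers = [nn.Conv2d(1, d, 5, padding=2), nn.PReLU(d),   # (1) feature extraction: Conv(5, d, 1)
                  nn.Conv2d(d, s, 1), nn.PReLU(s)]              # (2) shrinking: Conv(1, s, d)
        for _ in range(m):                                      # (3) non-linear mapping: m x Conv(3, s, s)
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]             # (4) expanding: Conv(1, d, s)
        self.body = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(d, 1, 9, stride=scale, # (5) deconvolution: DeConv(9, 1, d)
                                         padding=4, output_padding=scale - 1)

    def forward(self, y_s):          # y_s : original LR img (no bicubic pre-upsampling)
        return self.deconv(self.body(y_s))

Switching the upscaling factor only requires replacing (fine-tuning) self.deconv, which mirrors the transfer strategy in Sec 3.4.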

3.3 Differences against SRCNN : From SRCNN to FSRCNN

  • Transform SRCNN to FSRCNN within three steps
  • (1) The last Conv layer => DeConv layer
    • The whole network will perform on original LR img & low computation complexity (~S_LR instead of S_HR)
    • Enlarging network scale but speed-up
    • Performance of Learned deconv kernels are better than a single bicubic kernel
  • (2) Single mapping layer => Shrinking layer + 4 mapping layers + Expanding layer
    • 5 more layers but params are decreased & acceleration is the most prominent
    • Depth is key factor for performance
  • (3) Smaller filter size, less filters + 4 'narrow' layers (deeper network) instead of a single 'wide' layer
    • final speedup & training network efficiently
  • Two birds with one stone : acceleration is NOT at the cost of performance degradation (FSRCNN outperforms SRCNN)

3.4 SR for Different Upscaling Factors

  • Transfer conv layers for fast training and testing across Different Upscaling Factors with no loss of quality
    • All conv layers except deconv can be shared (only the last deconv layer contains information of upscaling factor)
    • FSRCNN : almost conv filters are the same for different upscaling factors
      • SRCNN and SCN : conv filters differ a lot for different upscaling factors
    • Training : only fine-tune deconv layer for another upscaling factor (no loss of mapping acc)
    • Testing : only do convolution operations once & upsampling img to different scales using corresponding deconv layer

4. Experiments

4.1 Implementation Details

4.2 Investigation of Different Settings

4.3 Towards Real-Time SR with FSRCNN

4.4 Experiments for Different Upscaling Factors

4.5 Comparison with SOTAs

5. Conclusion

[CV_STN] Spatial Transformer Networks

Spatial Transformer Networks

0. Abstract

  • CNN : limited by lack of ability to be 'spatially invariant' to input data
  • STN : Spatially Transform data within Network without extra training supervision
    • can be inserted into conv architectures
    • invariant to translation, scale, rotation, generic warping, etc

    spatial invariance : recognizing an image as the same image even after it has been transformed

1. Introduction

  • CNN Limitation : local max-pooling(2 x 2) help but intermediate feature maps still not invariant to global transformation of input
    • Pooling layer (fixed and local receptive fields) : limited, pre-defined mechanism for dealing with variations
  • Spatial Transformer module : dynamic mechanism -> appropriate transformation for each input data on entire feature map (non-locally)
    • select most relevant regions (attention) & transform them to canonical pose
    • can be trained with standard backprop -> end-to-end training
  • STN (CNN + ST module) 3 Benefits
    • image classification : crop, scale-normalization -> simplify subsequent classification task -> great performance
    • co-localisation : localize different instances of the same but unknown class
    • spatial attention (select the most relevant region) : more flexible and trained with backprop without reinforcement learning

2. Related Work (prior work)

  • modeling transformation with NN
    • Hinton : 2D affine transformation -> generative model training
    • Tieleman : generative capsule models -> learn discriminative features for classification
  • transformation-invariant representation
    • Cohen & Welling : G-CNN
    • Scattering networks, Filter banks

    Filter Bank : an array of bandpass filters that separates the input signal into multiple components

  • attention and detection mechanism for feature selection
  • STN : invariant representation by manipulating the data itself (not by a specialized feature extractor)

3. Spatial Transformers

image

  • Spatial Transformer = Localisation net + Grid generator + Sampler
  • (1) The input feature map U goes into the Localisation net, which produces the transformation parameters θ
  • (2) θ goes into the Grid generator, which produces the sampling grid T_θ(G) with the sampling points specified
  • (3) The Sampler takes the input feature map U and the sampling grid T_θ(G) as inputs
  • (4) Applying the sampling points of T_θ(G) to U yields the output feature map V

3.1 Localisation Network

  • Regress θ automatically to improve overall accuracy
  • input : input feature map U (Width x Height x Channel)
  • output : transform parameter θ = f_loc(U)
    • θ : parameter matrix -> its shape depends on the transformation type (ex. affine : 6-dim)
  • f_loc( ) can take any form (ex. fc net or conv net) BUT should include final Regression layer to produce θ

3.2 Parameterized Sampling Grid (Grid generator)

image

  • Generate coordinate grid on input image corresponding to each pixel from output image

  • Regular Grid : G = {Gi} of pixels Gi = (xi^t, yi^t) <- output feature map grid

  • Grid generator : T_θ( ) can have any differentiable parameterized form

    • ex 1) 2D affine transformation (crop, translation, rotation, scale, skew) matrix -> by 6 params
    • ex 2) Attention (crop, translation, isotropic scaling) : more constrained(=low complexity) -> by 3 params
    • ex 3) plane projective (8 params), wise affine, thin plate spline
  • height and width normalized coordinates

    • (xi^s, yi^s) : source coordinates in the input feature map U
    • (xi^t, yi^t) : target coordinates in the output feature map V

image
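
Written out, the pointwise affine case of the grid transformation shown above is:

$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$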

3.3 Differentiable Image Sampling (Sampler)

image

image

  • Input : Apply set of sampling points T_θ(G) to input image U -> Define spatial location in the input

    • Unm^c : input value at (n,m) in channel c
  • Output : transformed output feature map V (Width x Height x Channel)

    • Vi^c : output value for pixel i at (xi^t, yi^t) in channel c
  • k( ) can take any sampling kernel as long as (sub-)gradients can be defined (ex. bilinear interpolation) for backprop

    • (1) Nearest integer : not-differentiable
      image
    • (2) Bilinear interpolation : sub-differentiable !
      image
      • Backprop is possible if it is differentiable w.r.t. U and G (the kernel is written out after this list)
      • Even if some points are non-differentiable, backprop can be done piecewise over the sub-intervals, so it is not a problem
        image
  • Spatial Consistency : Sampling is done identically for each channel -> every channel is transformed in identical way

    • The same sampling is naturally applied to the different channels of the same input
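
Written out, the bilinear sampling kernel referenced in the list above is:

$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)$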

3.4 Spatial Transformer Networks

  • Spatial Transformer module = Localisation net + Grid generator + Sampler
  • STN = CNN + ST module (at any point, in any number)
    • Empirically, placing the ST layer right before the CNN input is generally the most effective
  • Advantages
    • Fast & little overhead naively & even speedups in attentive models
    • Minimize overall cost during training -> little effect on speed
      • It is trained jointly with the model's other parameters during training, so it has little impact on speed
    • How to transform each sample is compressed in weights of localisation net during training
    • Possible to Downsample or Oversample feature map
    • Possible to have Multiple spatial transformers in CNN
      • At increasing depths of CNN -> more abstract representations
      • For localisation networks -> more informative representations to base predicted params
      • Parallel -> useful to focus on multiple objects or parts of interest individually
  • Limitation : the number of parallel transformers limits the number of modeled objects

4. Experiments

4.1 Distorted MNIST

image

[ Train ]

  • Transformation type : Rotation(R), Rotation-Translataion-Scale(RTS), Projective(P), Elastic warping(E)
  • Network type : FCN, CNN, ST-FCN, ST-CNN
  • Sampling (bilinear) : affine(Aff), projective(Proj), thin plate spline(TPS)
  • Identical condition : same # of params, same base structure, identical optim (backprop, SGD, scheduled lr decay, multinomial CE loss, three weight layers) / CNN includes 2 max-pooling

[ Result ]

  • Network type : (percent error) ST-CNN < ST-FCN < CNN < FCN
    • ST < non ST : ST enables network outperform
    • CNN < FCN : Max-pooling (more spatial invariance) & Convolutional layer itself (better local structure model)
    • CNN = ST-FCN for RTS : ST is alternative way for spatial invariance
  • Sampling : TPS is the best (elastically deform digits, reduce complexity, not overfit on simple data)
  • Transformation of inputs for all ST models : Standard upright posed digit = mean pose found in training data
  • https://www.youtube.com/watch?v=Ywv0Xi2-14Y&t=94s

4.2 Street View House Numbers (SVHN)

image

[ Train ]

  • Dataset : 1 and 5 digits house number in real world images (200K)
  • Pre-processing : 64 x 64 crop + additional loosely 128 x 128 crop
  • CNN : 11 hidden layers, 5 digit-independent softmax
  • ST-CNN : Single (f = 4 layers CNN, following input of baseline CNN) / Multi (f = 2 layer FC, before each first 4 CNN) -> affine transform, bilinear sampler
  • SGD, dropout, randomly initialized weights except for regression layers

[ Result ]

  • Best Accuracy : ST-CNN Multi for 64 x 64 images (3.6% error)
    • crop and rescale by focusing resolution and network capacity only on corresponding parts of digit
  • Computation Speed : ST-CNN is only 6% slower than CNN
    • ST-CNN requires only a single forward pass

4.3 Fine-Grained Classification

image

[ Train ]

  • Dataset : CUB-200-2011 birds dataset (6K train, 5.8K test, 200 species)
  • Baseline CNN : Inception + BN (pre-trained on ImageNet, fine-tuned on CUB)
  • ST-CNN : 2 or 4 parallel spatial transformers
  • 1 softmax layer, end-to-end backprop

[ Result ]

  • Best Accuracy : 4 x ST-CNN (84.1%) -> outperform baseline CNN (82.3%)
  • Pose detection (Attention) : head (red) + central part (green) without any additional supervision
  • Same performance even if resolution is downsampled (448px input -> 224px output)

5. Conclusion

  • can be dropped into a network, perform explicit spatial transformations
  • can do end-to-end without any change in loss function
  • gain accuracy across multiple task
  • regressed transformation parameters are available as output
  • Expectation : powerful in recurrent models, object reference frame, 3D transformation

Code

#### 1) Load Dataset ####

from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

plt.ion()  
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# train dataset
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST(root='.', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])), batch_size=64, shuffle=True, num_workers=4)
# test dataset
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST(root='.', train=False, transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])), batch_size=64, shuffle=True, num_workers=4)


#### 2) Compose STN ####

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

        # Localization-network
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True)
        )

        # Regressor for the 3 * 2 affine matrix
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32),
            nn.ReLU(True),
            nn.Linear(32, 3 * 2)
        )

        # Initialize the weights/bias with the identity transformation
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    # STN forward
    def stn(self, x):
        xs = self.localization(x)
        xs = xs.view(-1, 10 * 3 * 3)
        theta = self.fc_loc(xs)
        theta = theta.view(-1, 2, 3)

        grid = F.affine_grid(theta, x.size())
        x = F.grid_sample(x, grid)

        return x

    def forward(self, x):
        x = self.stn(x)

        # general forward pass
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net().to(device)


#### 3) Train and Test ####

optimizer = optim.SGD(model.parameters(), lr=0.01)

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()  # End-to-end training
        optimizer.step()
        if batch_idx % 500 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
        
def test():
    with torch.no_grad():
        model.eval()
        test_loss = 0
        correct = 0
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)

            test_loss += F.nll_loss(output, target, size_average=False).item()
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()

        test_loss /= len(test_loader.dataset)
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'
              .format(test_loss, correct, len(test_loader.dataset),
                      100. * correct / len(test_loader.dataset)))


#### 4) Visualization ####

def convert_image_np(inp):
    """Convert a Tensor to numpy image."""
    inp = inp.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    return inp

def visualize_stn():
    with torch.no_grad():
        data = next(iter(test_loader))[0].to(device)

        input_tensor = data.cpu()
        transformed_input_tensor = model.stn(data).cpu()

        in_grid = convert_image_np(
            torchvision.utils.make_grid(input_tensor))

        out_grid = convert_image_np(
            torchvision.utils.make_grid(transformed_input_tensor))

        f, axarr = plt.subplots(1, 2)
        axarr[0].imshow(in_grid)
        axarr[0].set_title('Dataset Images')

        axarr[1].imshow(out_grid)
        axarr[1].set_title('Transformed Images')

for epoch in range(1, 20 + 1):
    train(epoch)
    test()

visualize_stn()

plt.ioff()
plt.show()

image

[CV_Localization] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Abstract

  • visual explanations for decisions from CNNs -> transparent, explainable
  • use gradients flowing into final conv to localization map -> highlight important regions in image for predicting concept
  • applicable to many CNN tasks without architectural changes or re-training
  • Classification
    • lend insight into failure modes by reasonable explanations
    • outperform previous methods
    • robust to adversarial perturbations
    • more faithful to basic model
    • help generalization by dataset bias (for fair and bias-free outcomes)
  • Localization
    • image captioning, VQA
    • even non-attention based models
  • Human study
    • appropriate trust in prediction from deep networks
    • discern stronger vs weaker model even when identical prediction

1. Introduction

  • Transparent model to Explain why they predict what they predict
    • AI evolution : (VQA) Identify failure / (classification) Establish appropriate trust / (chess) Teach human how to make better decisions
  • Trade-off bw accuracy and interpretability (simplicity)
    • Classical model : interpretability ↑, accuracy ↓
    • Deep model : interpretability ↓, accuracy ↑ BY greater abstraction (layers↑) and integration (end-to-end training)
  • CAM vs Grad-CAM
    • CAM : constrained to model architecture (GAP -> fc)
    • Grad-CAM : deep models without altering architecture (no trade-off) => Generalization of CAM
  • Guided Grad-CAM : class-discriminative & high-resolution = good visual explanation
    • CAM, Grad-Cam : class-discriminative (localize)
    • Guided backprop, Deconv : high-resolution (detail)

2. Related Work

  • Visualizing CNNs
  • Assessing model trust
  • Aligning gradient-based importance
  • Weakly-supervised localization : training without bbox information

3. Grad-CAM

image

  • last conv layer : high-level semantics (class-specific) & detailed spatial information
  • gradient flowing into last conv -> assign importance values to each neuron for a particular decision of interest

① Class score (before softmax) : y^c (could be any differentiable activation)
② Gradients of y^c wrt feature map activations A^k via backprop : dy^c/dA^k
③ Global average pooling -> Importance weight of feature map k for target class c : a_k^c

image

④ Weighted combination of forward activation maps
⑤ Apply ReLU b/c only interested in features of positive influence
-> result : coarse heatmap of same size as conv feature maps

image
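
Written out, the two equations referenced in steps ③-⑤ above are:

$a_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}$ ,  $L_{Grad\text{-}CAM}^c = ReLU\Big( \sum_k a_k^c A^k \Big)$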

3.1 Grad-CAM generalizes CAM

(mathematical proof)

3.2 Guided Grad-CAM

image

  • Grad-CAM : pixel-space detail ↓ -> unclear why network predicts particular instance
  • Guided Backprop : suppress negative gradients and visualize gradients through ReLU -> capture pixel by neurons
  • Guided Grad-CAM : combination by element-wise mul -> both high-resolution & class-discriminative + less noisy than deconv

3.3 Counterfactual Explanations

image

  • Which regions most hinder the network's prediction? (negative importance)
  • The DL model makes its decision based on the foreground, not the background

4. Evaluating Localization Ability of Grad-CAM

4.1 Weakly-supervised Localization

image

  • Weakly-supervised Localization : training without bbox information
  • Given image -> Obtain class predictions -> Generate Grad-CAM maps for each predicted classes -> Binarize pixels with thresh of 15% of max intensity -> Draw bbox around single largest segment
  • Grad-CAM localization error < others
  • No change model structure or re-train -> No compromise on classification performance!

4.2 Weakly-supervised Segmentation

  • Semantic Segmentation : assign each pixel in image an object class -> expensive pixel-level annotation
  • Weakly-supervised Segmentation : segment object with image-level annotation -> cheap and easy to get data
  • SEC with CAM : sensitive to choice of weak localization seed -> SEC with Grad-CAM : (IoU : 44.6 -> 49.6)

4.3 Pointing Game

  • Why : To evaluate discriminativeness of visualization method for localizing objects
  • How : Extract maximally activated point on generated heatmap -> compare with target label -> Count # of Hit or Miss

    Acc = Hit # / (Hit # + Miss #) ... only measures Precision

  • For Recall, compute localization maps for top-5 class predictions -> evaluate them with additional option

    option : reject predictions below a threshold (absent from GT)

  • Result : Grad-CAM > c-MWP (70.58% > 60.30%)

5. Evaluating Visualizations

  • interpretability vs. faithfulness tradeoff

5.1 Class Discrimination

  • Dataset : PASCAL VOC 2007 - 2 annotated categories
  • CNN model : VGG-16, AlexNet
  • Method(Human Acc) : Deconv(53.33%), Guided backprop(44.44%), Deconv Grad-CAM(60.37%), Guided Grad-CAM(61.23%)

5.2 Trust

  • CNN model : VGG-16, AlexNet <- both models making same prediction as GT
  • Method: Guided backprop, Guided Grad-CAM
  • Evaluation : rating reliability of models relative to each other
  • Result : Guided backprop (VGG-16 : 1.00), Guided Grad-CAM (VGG-16 : 1.27) => VGG is more reliable than AlexNet
  • Grad-CAM can place trust in model that generalizes better than individual prediction explanations

5.3 Faithfulness vs Interpretability

  • Trade-off : More faithful, Less interpretable and vice versa
  • Grad-CAM are reasonably interpretable, so evaluate how faithful!
    • Faithfulness : ability to accurately explain function
    • Reference explanation with high local-faithfulness : correlation with Image occlusion maps
    • Result : Grad-CAM is more faithful than original model
  • Grad-CAM is more Faithful and more Interpretable

6. Diagnosing image classification CNNs with Grad-CAM

  • VGG-16 pretrained on imagenet

6.1 Analyzing failure models for VGG-16

image

  • Some failures are due to ambiguities inherent in ImageNet classification
  • Guided Grad-CAM has reasonable explanations for failure predictions

6.2 Effect of adversarial noise on VGG-16

image

  • Dataset : adversarial images for ImageNet-pretrained VGG-16
  • Result : despite the network being certain about the absence of each category, Grad-CAM still localizes it correctly -> fairly robust to adversarial noise

6.3 Identifying bias in dataset

image

  • Task : binary classification of doctor' vs 'nurse'
  • Biased model : misclassifying by gender stereotype (face / hairstyle) => good validation acc, but not good for generalization
  • Reduced biased model : generalization better (82% → 90%)
  • Insight: Grad-CAM can help detect and reduce bias in training datasets -> better generalization, fair and ethical outcomes

7. Textual Explanations with Grad-CAM

image

  • obtain neuron names for last conv layer -> sort and obtain top-5 and bottom-5 neurons -> use for text explanations
  • higher positive values of neuron importance => presence of concept increases in class score
  • important concepts are indicative of predicted class even for misclassification

8. Grad-CAM for Image Captioning and VQA

  • vision & language tasks

8.1 Image Captioning

image

  • finetuned VGG-16 for images, LSTM-based language model (no explicit attention mechanism)
  • compute gradient of log probability wrt units in last conv layer -> generate Grad-CAM visualizations
  • FCLN produces bboxes for regions of interest & an LSTM-based model generates the associated captions
  • DenseCap generates 5 captions per image with GT bbox
  • Then, Guided Grad-CAM localizes regions without trained with bbox annotations

8.2 Visual Question Answering

image

  • CNN for processing images & RNN language model for questions
  • image and question are fused to predict answer
  • Result : Grad-CAM via correlation with occlusion maps : 0.60+-0.038 -> high faithfulness

9. Conclusion

  • Grad-CAM (Gradient-weighted Class Activation Mapping) : class-discriminative localization technique for making any CNN model more transparent by visual explanations
  • Guided Grad-CAM : Both high resolution + class-discriminative -> interpretability + faithfulness
  • AI should be able to reason about its belief and actions for human to trust and use it!

Code Review

# Assumed imports for this Keras (TF1-style backend) implementation
import numpy as np
from keras import backend as K

def generate_gradcam(img_tensor, model, class_index, activation_layer):
    model_input = model.input

    # y_c : input to the model's final op (softmax, linear, ...) for class_index
    y_c = model.output[0, class_index]

    # A_k : output feature map of the chosen activation conv layer
    A_k = model.get_layer(activation_layer).output

    # For the given model input, compute the conv layer output (A_k)
    # and the gradient of y_c with respect to A_k
    get_output = K.function([model_input], [A_k, K.gradients(y_c, A_k)[0]])
    [conv_output, grad_val] = get_output([img_tensor])

    # Drop the batch dimension: (1, width, height, k) -> (width, height, k),
    # where width/height are those of the A_k feature map
    conv_output = conv_output[0]
    grad_val = grad_val[0]

    # Global average pooling over width/height of the gradients (the 1/Z sum)
    # -> importance weights a^c_k
    weights = np.mean(grad_val, axis=(0, 1))

    # Weighted combination of the conv feature maps (conv_output)
    # with the class-specific weights (a^c_k), summed over k

    # Initialize with the (width, height) of the feature map (conv_output)
    grad_cam = np.zeros(dtype=np.float32, shape=conv_output.shape[0:2])
    for k, w in enumerate(weights):
        grad_cam += w * conv_output[:, :, k]

    # Apply ReLU to the weighted combination
    grad_cam = np.maximum(grad_cam, 0)

    return grad_cam, weights
# TF2/Keras version using tf.GradientTape (assumes `import tensorflow as tf`)
import tensorflow as tf

def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
    # First, we create a model that maps the input image to the activations
    # of the last conv layer as well as the output predictions
    grad_model = tf.keras.models.Model(
        [model.inputs], [model.get_layer(last_conv_layer_name).output, model.output]
    )

    # Then, we compute the gradient of the top predicted class for our input image
    # with respect to the activations of the last conv layer
    with tf.GradientTape() as tape:
        last_conv_layer_output, preds = grad_model(img_array)
        if pred_index is None:
            pred_index = tf.argmax(preds[0])
        class_channel = preds[:, pred_index]

    # This is the gradient of the output neuron (top predicted or chosen)
    # with regard to the output feature map of the last conv layer
    grads = tape.gradient(class_channel, last_conv_layer_output)

    # This is a vector where each entry is the mean intensity of the gradient
    # over a specific feature map channel
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))

    # We multiply each channel in the feature map array
    # by "how important this channel is" with regard to the top predicted class
    # then sum all the channels to obtain the heatmap class activation
    last_conv_layer_output = last_conv_layer_output[0]
    heatmap = last_conv_layer_output @ pooled_grads[..., tf.newaxis]
    heatmap = tf.squeeze(heatmap)

    # For visualization purpose, we will also normalize the heatmap between 0 & 1
    heatmap = tf.maximum(heatmap, 0) / tf.math.reduce_max(heatmap)
    return heatmap.numpy()

[CV_Localization] Learning Deep Features for Discriminative Localization

Abstract

  • Global average pooling (GAP) : (previously) a regularizer for training -> (in CAM) yields a generic localizable deep representation

1. Introduction

  • CNN : good at classification and object detection, but the FC layers (flatten) cause the ability to localize objects to be lost
  • FCN (NIN), GoogLeNet : GAP as regularizer -> minimize # of params + maintain high performance
  • CAM : GAP for remarkable localization ability until final layer (deep features)

1.1 Related Work

localizing objects + identifying which regions of image are being used for discrimination

(1) Weakly-supervised object localization

  • Previous works : self-taught, multiple-instance learning, transferring mid-level image, multiple overlapping patches
    -> No end-to-end training & Multiple forward pass -> difficult to scale real-world datasets
  • GMP (Global Max Pooling) : limited to lying in boundary of object rather than full extent
  • CAM : End-to-end training & Single forward pass & GAP (full extent, all discriminative regions)

(2) Visualizing CNNs

  • Previous works : Deconvnet (patterns activate each unit) -> Incomplete (only analyzing conv layers, ignoring fc layers)
  • CAM : Removing fc layers -> able to understand whole network (end-to-end)
  • Previous works : Inverting deep features at different layers (inverting fc layers)-> But No highlight relative importance
  • CAM : Highlight which regions are important for discrimination

2. Class Activation Mapping

image

  • Class Activation Map for each particular category indicates discriminative regions to identify category
  • Class Activation Mapping : CNN -> GAP on last conv layer (feature maps) -> fc layer -> Softmax final output
    GAP : spatial average of each feature map at the last conv layer -> one value per channel; the fc layer assigns one weight per channel for each class (N weights for N channels)
    CAM : weighted sum of the N feature maps with these N weights -> one heat map for each class
  • Result : Projecting back weights of output on conv feature maps -> can identify importance of image regions

image

image
image
image

  • f_k(x,y) : activation map (feature map) of unit k in last conv layer at spatial location (x,y)
  • F_k(x,y) : result of GAP
  • S_c : input to softmax for class c
  • w_k^c : weight for class c -> importance of F_k for class c
  • M_c(x,y) : CAM for class c -> importance of activation at (x,y) leading to classification of image to class c
    CAM = weighted linear sum of visual patterns at different spatial locations -> Upsampling CAM to size of input !
  • P_c : output of softmax for class c
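
Written out, the quantities listed above are related by:

$F_k = \sum_{x,y} f_k(x,y)$ ,  $S_c = \sum_k w_k^c F_k$ ,  $M_c(x,y) = \sum_k w_k^c f_k(x,y)$ ,  $P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}$

so that $S_c = \sum_{x,y} M_c(x,y)$.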

Global average pooling (GAP) vs global max pooling (GMP)

  • GAP : consider all discriminative parts of an object -> identify extent of object
  • GMP : consider only highest parts of an object
  • Classification performance : similar / Localization performance : GAP > GMP

3. Weakly-supervised Object Localization

3.1 Setup

  • Dataset : ILSVRC 2014
  • CNN models : AlexNet, VGGnet, GoogLeNet (remove fc layers -> replace them with GAP)
    • Localization ability improved when last conv layer before GAP = high spatial resolution (mapping resolution)
    • So, remove some layers -> add new layers (3 x 3, stride 1, pad 1 with 1024 units) followed by GAP
  • Networks were fine-tuned on 1.3M training images of ILSVRC

3.2 Results

(1) Classification

image

  • GAP : Only small performance drop (1-2%) without fc layers -> Acceptable

(2) Localizaion

image
image

  • bbox selection strategy : simple thresholding (keep regions above 20% of the max CAM value -> draw a bbox around the largest connected segment)
  • [Table 2] GAP : not trained on a single annotated bbox but outperforms than others (NIN, Backprop)
  • [Table 3] Weakly vs Fully-supervised methods
    • bbox selection strategy (heuristics) : 2 bbox (one tight and one loose) from 1st and 2nd predicted classes + 1 loose bbox for top 3rd predicted class
    • weakly-supervised GoogLeNet-GAP (heuristics) ~= fully-supervised AlexNet
    • Same model -> still long way...

4. Deep Features for Generic Localization

image

  • Response from higher-level layers of CNN : effective generic features with SOTA on many image datasets
  • Response from GAP CNN : also perform well as generic features + highlight discriminative regions (without training)
    • GoogLeNet-GAP, GoogLeNet > AlexNet
    • GoogLeNet-GAP ~= GoogLeNet

4.1 Fine-grained Recognition

image

  • Dataset : CUB-200-2011 (200 bird species)
  • Accuracy : GoogLeNet-GAP on full image < on CAM-cropped image < with GT bbox

4.2 Pattern Discovery

  • To identify common elements or patterns such as text or high-level concepts

(1) Discovering informative objects in the scenes

image

  • Dataset : 10 scene categories from SUN dataset
  • top 6 objects that most frequently overlap with high activation regions are listed for two scene categories

(2) Concept localization in weakly labeled images

image

  • concept detector : localizes informative regions for concepts, even when the phrases are more abstract than object names

(3) Weakly supervised text detector

image

  • Dataset : 350 Google StreetView images containing text from SVT dataset
  • highlight text without using bbox annotations

(4) Interpreting visual question answering (VQA)

image

  • overall acc : 55.89%
  • highlight image regions relevant to predicted answers

5. Visualizing Class-Specific Units

image

  • Using GAP and the ranked softmax weight
  • CAM : Visualize most discriminative units (Class-Specific Units) for a given class
  • The combination of Class-Specific Units guides the CNN -> we can infer what the CNN actually learns!

6. Conclusion

  • CAM enables classification-trained CNNs with GAP to perform object localization without bbox annotations
  • CAM visualizes predicted class scores & highlights discriminative object parts
  • CAM generalizes to other visual recognition tasks

Code

import numpy as np
from keras import backend as K  # or tensorflow.keras.backend, depending on the Keras version used

def generate_cam(img_tensor, model, class_index, last_conv):
  
    model_input = model.input
    model_output = model.layers[-1].output

    # f_k(x, y) : output feature maps of the last conv layer
    f_k = model.get_layer(last_conv).output
    get_output = K.function([model_input], [f_k])
    [last_conv_output] = get_output([img_tensor])

    # the output includes the batch dimension, shape (1, width, height, k) -> reshape to (width, height, k)
    last_conv_output = last_conv_output[0]

    # from the weight matrix between the GAP layer and the softmax (dense) layer, take the column for class_index : class_weight_k (w^c_k)
    # ex) w^2_1, w^2_2, w^2_3, ..., w^2_k
    class_weight_k = model.layers[-1].get_weights()[0][:, class_index]


    # initialize the CAM with the (width, height) of the feature map (last_conv_output)
    cam = np.zeros(dtype=np.float32, shape=last_conv_output.shape[0:2])

    # weighted sum of the last conv feature maps (last_conv_output) and class_weight_k (w^c_k)
    for k, w in enumerate(class_weight_k):
        cam += w * last_conv_output[:, :, k]

    return cam
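
A hedged usage sketch for generate_cam above; model, img_tensor, class_index, and the layer name 'last_conv' are placeholders that depend on your own Keras model and preprocessing:

import cv2
import numpy as np

# img_tensor : preprocessed input of shape (1, H, W, 3); 'last_conv' : name of the last conv layer
cam = generate_cam(img_tensor, model, class_index=0, last_conv='last_conv')

# upsample the coarse CAM to the input resolution and normalize to [0, 1] for visualization
cam = cv2.resize(cam, (img_tensor.shape[2], img_tensor.shape[1]))
cam = np.maximum(cam, 0)
cam = cam / (cam.max() + 1e-8)
heatmap = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)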

[CV_3D] PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows

PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows

Introduction

Major roadblock in generating pc : complexity of space of point clouds

Meaning of words

  • Distribution = Invertible parameterized transformation of 3D points from prior distribution (ex. Gaussian)
  • Shape = Variable that parametrizes transformation
  • Category = distribution of this variable

PointFlow : Point cloud Generative model by learning distribution of distributions

  • Two-level hierarchy of distributions : distribution of shapes & distribution of points given a shape
  • Sampling points from prior Gaussian
  • Moving them according to parameterized transformation to new location in target shape
  • Parameterization : Continuous Normalizing Flows to model transformation
    • Invertibility → Sampling and Estimating probability density → Training models using variational inference
    • (maximize a variational lower bound on log-likelihood of training point clouds set)
  • Results : SOTA performance in point cloud generation & pc reconstruction, unsupervised feature learning

Related work

Deep learning for PC

  • PC discriminative tasks : classification, segmentation, critical point sampling, auto-encoding, single-view 3D reconstruction, stereo reconstruction, point cloud completion, ...
    • AE : training with heuristic loss functions that measure distance bw two point sets (ex. CD, EMD)
      • CD : can favor visually incorrect point clouds (insensitive to point-density differences)
      • EMD : slow to compute (approximation → biased or noisy gradients)
  • Problems of Previous models : Fixed number of points, Heuristic loss function
    • Drawbacks of treating pc as a fixed-dimensional matrix
      • Model is restricted to generate a fixed number of points
      • No Permutation invariance of point sets
    • Drawbacks of using heuristic loss function
      • Lack of probabilistic guarantee
      • Only learning distribution of points for each shape (Not distribution of shapes)
      • Ex. Sophisticated decoders : overcoming fixed number of points BUT still relying heuristic set distances
  • PointFlow : training E2E by maximizing variational lower bound on log-likelihood

Generative models

  • Generative models : GAN, VAE, Auto-regressive models, Flow-based models
  • Most deep generative models : learning distribution of fixed-dimensional variables
  • PointFlow : learning distribution of sets and generating new sets by using tighter lower bound on log-likelihood
    • with normalizing flow in modeling both reconstruction likelihood and prior

Overview

  • Goal : To learn distribution of shapes(=distributions of points)
    = To sample shapes and an arbitrary # of points from a shape
  • Continuous Normalizing Flow (CNF) = A vector field in 3D Euclidean space
    • To model distribution of points by transforming a generic prior
      (sample points from prior → move them according to vector field)
    • Invertible → move data points back to the prior → compute exact likelihood
    • parametrizing each continuous NF with a latent variable that represents shape
      ⇔ modeling distribution of shapes = modeling distribution of latent variable
  • Optimization : using variational lower bound on log-likelihood by inference network
    • Invertibility makes exact likelihood computation possible → Training the model E2E in a stable manner !

Model

Three Modules

  • $Q_Φ (z|X)$ : (permutation-invariant) Encoder to encode a point cloud into a shape representation $z$
  • $P_ψ (z)$ : (CNF) Prior over shape representation $z$
  • $P_θ (X|z)$ : (CNF) Decoder to model distribution of points given shape representation $z$

Flow-based point generation from shape representations

  • $log P_θ (X|z)$ : Reconstruction log-likelihood of a point set $X$ = Sum of log-likelihood of each point $x$
    image
  • $x$ : result of transforming some point $y(t_0)$ in prior distribution $P(y) = N(0,1)$ using CNF
    • $g_θ$ : continuous-time dynamics of flow $G_θ$ conditioned on $z$ → the inverse $G_θ^{-1} (x;z)$ is available
      image
  • $log P_θ (x|z)$ : log-likelihood of each point by using conditional extension of CNF
    image
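
In equation form (a restatement of the CNF change-of-variables behind the images above, up to notation):

  • $x = G_θ(y(t_0); z) = y(t_0) + \int_{t_0}^{t_1} g_θ(y(t), t; z) \, dt$, with $y(t_0) = G_θ^{-1}(x; z)$
  • $\log P_θ(x|z) = \log P(y(t_0)) - \int_{t_0}^{t_1} \mathrm{Tr}\!\left( \frac{\partial g_θ}{\partial y(t)} \right) dt$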

Flow-based prior over shape

Learnable Prior

  • Motivation : a simple Gaussian could be used as the prior, but a learnable prior is proposed to mitigate the performance drop VAEs suffer with an overly simple prior
  • How : using another CNF to parametrize a learnable prior
  • KL divergence term in ELBO function
    • $P_ψ(z)$ : prior distribution with learnable parameters $ψ$
    • $H$ : entropy
      image
  • $z$ : result of transforming some point $w(t_0)$ in simple Gaussian $P(w) = N(0,1)$ using CNF
    • $f_ψ$ : continuous-time dynamics of flow $F_ψ$ → the inverse $F_ψ^{-1} (z)$ is available
      image
  • $log P_ψ (z)$ : log probability of prior distribution
    image

[Training] Final training objective

  • Objective function
    image
  • Training encoder and decoder jointly to maximize a lower bound on log-likelihood
    image
  • Training whole network E2E by maximizing ELBO of all point sets in dataset
    image
  • Objective function = ① + ② + ③
    • Prior : encourage encoded shape representation to have high probability under prior
      • using reparameterization trick to enable a differentiable MC estimate
    • Reconstruction likelihood : estimating using MC sampling
    • Posterior Entropy : entropy of approximated posterior
      image
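
Up to notation, the ELBO being maximized can be written as:

  • $L(X; Φ, ψ, θ) = E_{Q_Φ(z|X)}[\log P_ψ(z) + \log P_θ(X|z)] + H[Q_Φ(z|X)]$
  • the three terms correspond to ① the prior, ② the reconstruction likelihood, and ③ the posterior entropy described above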

[Test] Sampling

image

  • (1) Sampling a shape representation $\widetilde{z}$ through $F_ψ$
  • (2) Generating a point given $\widetilde{z}$
    • How : Sampling a point $\widetilde{y}$ from $N(0,1)$ → Passing $\widetilde{y}$ through $G_θ$ conditioned on $\widetilde{z}$
    • Result : a point $\widetilde{x} = G_θ(\widetilde{y};\widetilde{z})$
  • Sampling a point cloud with size $\widetilde{M}$ by repeating (2) for $\widetilde{M}$ times
    image

Experiments

Eval metrics

  • Previous metrics to measure similarity bw point clouds (not used during training PointFlow)
    • Ex. Chamfer distance (CD), Earth mover's distance (EMD)
    • $X, Y$ : point clouds with the same # of points / $Φ$ : bijection bw $X, Y$
      image
  • Jensen-Shannon Divergence (JSD)
  • Coverage (COV)
  • Minimum matching distance (MMD)
  • 1-nearest neighbor accuracy (1-NNA)
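
The standard definitions of the two set distances (likely the content of the image above), with $Φ$ the bijection between $X$ and $Y$:

  • $CD(X, Y) = \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert_2^2 + \sum_{y \in Y} \min_{x \in X} \lVert x - y \rVert_2^2$
  • $EMD(X, Y) = \min_{Φ : X \to Y} \sum_{x \in X} \lVert x - Φ(x) \rVert_2$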

Generation

  • Previous pc generative models : raw-GAN, latent-GAN, PC-GAN
  • Dataset : 3 categories in ShapeNet (airplane, chair, car) → Normalized (zero-mean per axis, unit variance)
  • Training, Test : 2048 points for each shape
  • Models : # of parameters in total (full) or in generative pathways (gen)
  • Result : outperforming all baselines across all categories (1-NNA) & best score in most cases (other metrics)
    image
    image

Auto-Encoding

  • Goal : Reconstruction ability
  • Models : l-GAN (latent-GAN), AtlasNet (SOTA) vs PointFlow (flow-based AE)
  • Dataset : ShapeNet
  • Training : AE trained with only $L_{recon}$
  • Test : 4096 points per shape = 2048 input set + 2048 reference set
    • How? computing distance (CD or EMD) bw reconstructed input set and reference set
  • Result : best EMD score
    image
    image

Unsupervised representation learning

  • Goal : Representation learning ability
  • How : extract latent representations of AE trained in full ShapeNet → train linear SVM classifier on ModelNet10(40)
  • Dataset : ShapeNet & ModelNet10(40) → Normalized (zero-mean per axis, unit variance), Random-rotation along gravity axis
  • Problem of this benchmark : different encoders, different # of params, different pre-processing -> hard to compare fairly
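
A minimal sketch of this evaluation protocol (not the authors' script), assuming a trained PointFlow model with the encode method shown in the code below, and hypothetical train_pcs / test_pcs tensors with their labels:

import torch
from sklearn.svm import LinearSVC

# train_pcs / test_pcs : (num_shapes, num_points, 3) tensors; *_labels : class ids
with torch.no_grad():
    z_train = model.encode(train_pcs.cuda()).cpu().numpy()   # latent shape representations
    z_test = model.encode(test_pcs.cuda()).cpu().numpy()

clf = LinearSVC()                        # linear SVM on the frozen latent features
clf.fit(z_train, train_labels)
print('ModelNet accuracy:', clf.score(z_test, test_labels))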

Code Review

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# Repo-specific helpers (get_point_cnf, get_latent_cnf, standard_normal_logprob,
# truncated_normal, reduce_tensor) are assumed to be imported from the PointFlow codebase.

class Encoder(nn.Module):
    def __init__(self, zdim, input_dim=3, use_deterministic_encoder=False):
        super(Encoder, self).__init__()
        self.use_deterministic_encoder = use_deterministic_encoder
        self.zdim = zdim
        self.conv1 = nn.Conv1d(input_dim, 128, 1)
        self.conv2 = nn.Conv1d(128, 128, 1)
        self.conv3 = nn.Conv1d(128, 256, 1)
        self.conv4 = nn.Conv1d(256, 512, 1)
        self.bn1 = nn.BatchNorm1d(128)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(256)
        self.bn4 = nn.BatchNorm1d(512)

        if self.use_deterministic_encoder:
            self.fc1 = nn.Linear(512, 256)
            self.fc2 = nn.Linear(256, 128)
            self.fc_bn1 = nn.BatchNorm1d(256)
            self.fc_bn2 = nn.BatchNorm1d(128)
            self.fc3 = nn.Linear(128, zdim)
        else:
            # Mapping to the mean of the latent code z
            self.fc1_m = nn.Linear(512, 256)
            self.fc2_m = nn.Linear(256, 128)
            self.fc3_m = nn.Linear(128, zdim)
            self.fc_bn1_m = nn.BatchNorm1d(256)
            self.fc_bn2_m = nn.BatchNorm1d(128)

            # Mapping to the (log-)variance of the latent code z
            self.fc1_v = nn.Linear(512, 256)
            self.fc2_v = nn.Linear(256, 128)
            self.fc3_v = nn.Linear(128, zdim)
            self.fc_bn1_v = nn.BatchNorm1d(256)
            self.fc_bn2_v = nn.BatchNorm1d(128)

    def forward(self, x):
        x = x.transpose(1, 2)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.bn4(self.conv4(x))
        x = torch.max(x, 2, keepdim=True)[0]
        x = x.view(-1, 512)

        if self.use_deterministic_encoder:
            ms = F.relu(self.fc_bn1(self.fc1(x)))
            ms = F.relu(self.fc_bn2(self.fc2(ms)))
            ms = self.fc3(ms)
            m, v = ms, 0
        else:
            m = F.relu(self.fc_bn1_m(self.fc1_m(x)))
            m = F.relu(self.fc_bn2_m(self.fc2_m(m)))
            m = self.fc3_m(m)
            v = F.relu(self.fc_bn1_v(self.fc1_v(x)))
            v = F.relu(self.fc_bn2_v(self.fc2_v(v)))
            v = self.fc3_v(v)

        return m, v


# Model
class PointFlow(nn.Module):
    def __init__(self, args):
        super(PointFlow, self).__init__()
        self.input_dim = args.input_dim
        self.zdim = args.zdim
        self.use_latent_flow = args.use_latent_flow
        self.use_deterministic_encoder = args.use_deterministic_encoder
        self.prior_weight = args.prior_weight
        self.recon_weight = args.recon_weight
        self.entropy_weight = args.entropy_weight
        self.distributed = args.distributed
        self.truncate_std = None
        self.encoder = Encoder(
                zdim=args.zdim, input_dim=args.input_dim,
                use_deterministic_encoder=args.use_deterministic_encoder)
        self.point_cnf = get_point_cnf(args)
        self.latent_cnf = get_latent_cnf(args) if args.use_latent_flow else nn.Sequential()

    @staticmethod
    def sample_gaussian(size, truncate_std=None, gpu=None):
        y = torch.randn(*size).float()
        y = y if gpu is None else y.cuda(gpu)
        if truncate_std is not None:
            truncated_normal(y, mean=0, std=1, trunc_std=truncate_std)
        return y

    @staticmethod
    def reparameterize_gaussian(mean, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn(std.size()).to(mean)
        return mean + std * eps

    @staticmethod
    def gaussian_entropy(logvar):
        const = 0.5 * float(logvar.size(1)) * (1. + np.log(np.pi * 2))
        ent = 0.5 * logvar.sum(dim=1, keepdim=False) + const
        return ent

    def multi_gpu_wrapper(self, f):
        self.encoder = f(self.encoder)
        self.point_cnf = f(self.point_cnf)
        self.latent_cnf = f(self.latent_cnf)

    def make_optimizer(self, args):
        def _get_opt_(params):
            if args.optimizer == 'adam':
                optimizer = optim.Adam(params, lr=args.lr, betas=(args.beta1, args.beta2),
                                       weight_decay=args.weight_decay)
            elif args.optimizer == 'sgd':
                optimizer = torch.optim.SGD(params, lr=args.lr, momentum=args.momentum)
            else:
                assert 0, "args.optimizer should be either 'adam' or 'sgd'"
            return optimizer
        opt = _get_opt_(list(self.encoder.parameters()) + list(self.point_cnf.parameters())
                        + list(self.latent_cnf.parameters()))
        return opt

    def forward(self, x, opt, step, writer=None):
        opt.zero_grad()
        batch_size = x.size(0)
        num_points = x.size(1)
        z_mu, z_sigma = self.encoder(x)
        if self.use_deterministic_encoder:
            z = z_mu + 0 * z_sigma
        else:
            z = self.reparameterize_gaussian(z_mu, z_sigma)

        # Compute H[Q(z|X)]
        if self.use_deterministic_encoder:
            entropy = torch.zeros(batch_size).to(z)
        else:
            entropy = self.gaussian_entropy(z_sigma)

        # Compute the prior probability P(z)
        if self.use_latent_flow:
            w, delta_log_pw = self.latent_cnf(z, None, torch.zeros(batch_size, 1).to(z))
            log_pw = standard_normal_logprob(w).view(batch_size, -1).sum(1, keepdim=True)
            delta_log_pw = delta_log_pw.view(batch_size, 1)
            log_pz = log_pw - delta_log_pw
        else:
            log_pz = torch.zeros(batch_size, 1).to(z)

        # Compute the reconstruction likelihood P(X|z)
        z_new = z.view(*z.size())
        z_new = z_new + (log_pz * 0.).mean()
        y, delta_log_py = self.point_cnf(x, z_new, torch.zeros(batch_size, num_points, 1).to(x))
        log_py = standard_normal_logprob(y).view(batch_size, -1).sum(1, keepdim=True)
        delta_log_py = delta_log_py.view(batch_size, num_points, 1).sum(1)
        log_px = log_py - delta_log_py

        # Loss
        entropy_loss = -entropy.mean() * self.entropy_weight
        recon_loss = -log_px.mean() * self.recon_weight
        prior_loss = -log_pz.mean() * self.prior_weight
        loss = entropy_loss + prior_loss + recon_loss
        loss.backward()
        opt.step()

        # LOGGING (after the training)
        if self.distributed:
            entropy_log = reduce_tensor(entropy.mean())
            recon = reduce_tensor(-log_px.mean())
            prior = reduce_tensor(-log_pz.mean())
        else:
            entropy_log = entropy.mean()
            recon = -log_px.mean()
            prior = -log_pz.mean()

        recon_nats = recon / float(x.size(1) * x.size(2))
        prior_nats = prior / float(self.zdim)

        if writer is not None:
            writer.add_scalar('train/entropy', entropy_log, step)
            writer.add_scalar('train/prior', prior, step)
            writer.add_scalar('train/prior(nats)', prior_nats, step)
            writer.add_scalar('train/recon', recon, step)
            writer.add_scalar('train/recon(nats)', recon_nats, step)

        return {
            'entropy': entropy_log.cpu().detach().item()
            if not isinstance(entropy_log, float) else entropy_log,
            'prior_nats': prior_nats,
            'recon_nats': recon_nats,
        }

    def encode(self, x):
        z_mu, z_sigma = self.encoder(x)
        if self.use_deterministic_encoder:
            return z_mu
        else:
            return self.reparameterize_gaussian(z_mu, z_sigma)

    def decode(self, z, num_points, truncate_std=None):
        # transform points from the prior to a point cloud, conditioned on a shape code
        y = self.sample_gaussian((z.size(0), num_points, self.input_dim), truncate_std)
        x = self.point_cnf(y, z, reverse=True).view(*y.size())
        return y, x

    def sample(self, batch_size, num_points, truncate_std=None, truncate_std_latent=None, gpu=None):
        assert self.use_latent_flow, "Sampling requires `self.use_latent_flow` to be True."
        # Generate the shape code from the prior
        w = self.sample_gaussian((batch_size, self.zdim), truncate_std_latent, gpu=gpu)
        z = self.latent_cnf(w, None, reverse=True).view(*w.size())
        # Sample points conditioned on the shape code
        y = self.sample_gaussian((batch_size, num_points, self.input_dim), truncate_std, gpu=gpu)
        x = self.point_cnf(y, z, reverse=True).view(*y.size())
        return z, x

    def reconstruct(self, x, num_points=None, truncate_std=None):
        num_points = x.size(1) if num_points is None else num_points
        z = self.encode(x)
        _, x = self.decode(z, num_points, truncate_std)
        return x

[CV_Pose Estimation] Efficient Object Localization Using Convolutional Networks

Efficient Object Localization Using Convolutional Networks

Abstract

  • Efficient 'Position Refinement' model
    • trained to estimate joint offset location within a small region of img
    • trained in cascade within a SOTA ConvNet model to achieve improved acc
    • on FLIC dataset, MPII dataset

1. Introduction

  • Human-body part localization task ↑ BY ConvNet arch + larger datasets

  • (sota) ConvNet : internal strided-pooling layers

    • reduce spatial resolution
    • output : invariant to spatial location within pooling region
    • promote spatial invariance to local input transformation
    • pooling : prevent over-training + reducing computational complexity for classification
    • Trade-off : generalization performance ↑ <-> spatial localization accuracy ↓
  • (this paper) Proposed ConvNet : efficient localization of human joints in RGB imgs

    • high spatial accuracy + computational efficiency
    • begin by coarse body part localization -> output : low resolution, per-pixel heat-map
    • show likelihood of a joint occurring in each spatial location
    • Max-pooling for dimensionality reduction + improving invariance to noise and local img transformations
    • reuse hidden layer conv features from coarse heat-map regression model to improve localization accuracy

2. Related Work

  • Models using Hand-crafted features (edges, contours, HoG, color histograms) : poor generalization performance

    • Deformable Part Models (DPM)
    • Mixture of templates modeled using SVMs
    • Poselet + DPM model : spatial relationship of body parts
    • Armlets : semi-global classifier, good for real-world data, but only arms
    • Multi-modal model : holistic + local
  • ConvNets

    • formulate problem as a direct (continuous) regression
    • poorly in high-precision region
    • unnecessary learning complexity by mapping from input RGB img to XY location (over-training)
    • +) low-dimensional representation of input img, multi-resolution ConvNet arch, ...

3. Coarse Heat-Map Regression Model

  • Using Extension of Multi-resolution ConvNet model
  • For Sliding window detector with Overlapping contexts to produce Coarse heat-map output

3.1. Model Architecture

image

  • Input : RGB Gaussian pyramid of 3 levels (320 x 240 for FLIC, 256 x 256 for MPII)

    Figure 2 : only 2 levels for brevity

  • Output : Heat-map for each joint describing per-pixel likelihood for joint occurring in each output spatial location
  • 1st layer : LCN (Local Contrast Normalization) with same filter kernel in each 3 resolution banks -> out : LCN imgs
  • Next 7 stage (11 for MPII) multi-resolution ConvNet : Pooling -> heat-map output is at a lower resolution than input img
  • Last 4 stages (3 for MPII) : effectively a simulated FC network for the target input patch size

3.2. Spatial Dropout

  • Dropout : zeroing activation -> improving generalization by preventing activations from becoming strongly correlated
  • Additional Dropout layer before 1st 1x1 conv layer
  • Standard Dropout
    image
    • The network is fully convolutional and natural imgs (hence feature-map activations) are strongly spatially correlated
    • Result : over-training (Fail)
  • Spatial Dropout
    image
    • Feature-map = n_features x Height x Width
    • How : perform only n_features dropout trials + extend value across entire feature map
    • Result : adjacent pixels are either all 0 OR all active (good performance on FLIC)
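
A minimal PyTorch sketch of the idea (the paper itself is implemented in Torch7): nn.Dropout2d zeroes entire feature maps, so within a dropped map adjacent pixels are all 0, matching the behavior described above:

import torch
import torch.nn as nn

feat = torch.randn(8, 64, 32, 32)       # (batch, n_features, height, width)

std_drop = nn.Dropout(p=0.5)            # standard dropout : zeroes individual activations
spatial_drop = nn.Dropout2d(p=0.5)      # SpatialDropout : zeroes whole feature maps (channels)

std_out = std_drop(feat)                # independent per-pixel masks (correlated pixels survive)
spatial_out = spatial_drop(feat)        # per-channel masks : a dropped channel is zero everywhere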

3.3. Training and Data Augmentation

  • Loss : MSE
    image
    • H', H : Predicted and GT heat-map for joint
    • Target GT heat-map : 2D gaussian of constant variance (sigma = 1.5 pixels) centered at GT joint (x,y)
  • Data Augmentation : Random rotation, scaling, flipping -> Generalization
  • Case : multiple people in the image but only a single person annotated
    • How : Sliding-window + tree-structured MRF spatial model (approximate Torso position)
    • MRF Input : GT torso position + 14 predicted joints from ConvNet output = 15 joints locations
    • Result : selecting correct person for labeling
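
A small NumPy sketch of how such a target heat-map could be built (a 2D Gaussian with sigma = 1.5 px centered at the GT joint); the map size here is a placeholder:

import numpy as np

def make_target_heatmap(joint_xy, height, width, sigma=1.5):
    # 2D Gaussian of constant variance centered at the ground-truth joint (x, y)
    xs = np.arange(width)
    ys = np.arange(height)[:, None]
    x0, y0 = joint_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

heatmap = make_target_heatmap((30, 20), height=60, width=90)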

4. Fine Heat-Map Regression Model

  • Purpose : Recovering spatial accuracy lost due to pooling
  • How : Using additional ConvNet to refine localization result of coarse heat-map
  • Difference : Reusing existing conv features -> reducing # of params + acting as regularizer

4.1. Model Architecture

  • Full system Architecture
    image

    • Heat-map-based model for coarse localization
    • Module to sample and crop conv features at joint location (x, y)
    • Additional conv model for fine tuning
  • Joint Inference Steps

    1. FPROP (forward-propagate) through Coarse heat-map model
      • Infer all joint locations (x, y) from max value in each joint's heat-map
    2. Sample and Crop first 2 conv layers (for all resolution) at each coarse location (x, y)
      • output gradients from cropped img + output gradients of conv stages in coarse heat-map
        image
    3. FPROP through Fine heat-map model -> (△x, △y)
      • Fine heat-map model : Siamese network of 7 instances (14 for MPII)
    4. Add Position Refinement to coarse location -> Final location (x, y) for each joint
  • Fine heat-map model
    image
    image

    • Siamese network : Weights and biases of each module are shared
    • Sample location for each joint is different : Conv features don't share same Spatial context
    • So, conv sub-nets must be applied to each joint independently
    • But, parameter sharing is used to reduce the # of trainable params and prevent over-training
  • Last 1x1 Conv

    • No weight sharing
    • Input : each output of 7 sub-nets
    • Output : detailed-resolution heat-map
    • Purpose : Final detection for each joint

4.2. Joint Training

  • Before Joint Training : Pre-training Coarse heat-map model first
  • Holding params Coarse heat-map model Fixed + Training Fine heat-map model
  • Jointly Training both models by minimizing E3 = E1 + λE2 ..... (λ = 0.1)
    image
    • H', H : Predicted and GT Coarse heat-map for joint
      image
    • G', G : Predicted and GT Fine heat-map for joint
  • Regressing to a set of target heat-maps rather than directly minimizing the final (x, y) prediction
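
In equation form (restating the MSE objectives above, up to normalization constants):

  • $E_1 = \sum_j \lVert H'_j - H_j \rVert_2^2$ (coarse heat-map MSE)
  • $E_2 = \sum_j \lVert G'_j - G_j \rVert_2^2$ (fine heat-map MSE)
  • $E_3 = E_1 + λ E_2$, with $λ = 0.1$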

5. Results

  • Framework : Torch7

  • Dataset : FLIC(easy), MPII-Human-Pose(hard)

  • Pooling impact for coarse heat-map model : Pooling ↑ -> Detection performance(spatial precision) ↓
    image

  • Ambiguous GT labels : can be worse than expected variance in User-generated labels
    image

  • Cascaded model impact : better than Coarse model only
    image
    image

  • Greedily-trained cascade (Shared features)
    image

    • Coarse and Fine models are trained independently by adding additional conv layer
    • How : Training Fine model by using cropped input imgs as input
    • Result : regularizing effect of joint training : preventing over-training [F14(a)]
  • SpatialDropout : regularizing effect of dropout + reduction in strong heat-map outliers [F14(b)]
    image
    image
    image
    image

6. Conclusion

  • Localization tasks demand high degree of spatial precision
  • Cascaded architecture that combined Fine and Coarse conv networks -> SOTA on FLIC, MPII-human-pose
  • Spatial Precision + Computational benefits of Pooling

Code

Train

import os
import sys
import time
import argparse

import torch
import numpy as np
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchcontrib

from torchvision import transforms

from dataset.cub200 import CUB200Data
from dataset.mit67 import MIT67Data
from dataset.stanford_dog import SDog120Data
from dataset.caltech256 import Caltech257Data
from dataset.stanford_40 import Stanford40Data
from dataset.flower102 import Flower102Data

from model.fe_resnet import resnet18_dropout, resnet50_dropout, resnet101_dropout
from model.fe_mobilenet import mbnetv2_dropout

class MovingAverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f', momentum=0.9):
        self.name = name
        self.fmt = fmt
        self.momentum = momentum
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0

    def update(self, val, n=1):
        self.val = val
        self.avg = self.momentum*self.avg + (1-self.momentum)*val

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'
    
class CrossEntropyLabelSmooth(nn.Module):
    def __init__(self, num_classes, epsilon = 0.1):
        super(CrossEntropyLabelSmooth, self).__init__()
        self.num_classes = num_classes
        self.epsilon = epsilon
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, inputs, targets):
        log_probs = self.logsoftmax(inputs)
        targets = torch.zeros_like(log_probs).scatter_(1, targets.unsqueeze(1), 1)
        targets = (1 - self.epsilon) * targets + self.epsilon / self.num_classes
        loss = (-targets * log_probs).sum(1)
        return loss.mean()

def linear_l2(model):
    beta_loss = 0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            beta_loss += (m.weight).pow(2).sum()
            beta_loss += (m.bias).pow(2).sum()
    return 0.5*beta_loss*args.beta, beta_loss

def l2sp(model, reg):
    reg_loss = 0
    dist = 0
    for m in model.modules():
        if hasattr(m, 'weight') and hasattr(m, 'old_weight'):
            diff = (m.weight - m.old_weight).pow(2).sum()
            dist += diff
            reg_loss += diff 

        if hasattr(m, 'bias') and hasattr(m, 'old_bias'):
            diff = (m.bias - m.old_bias).pow(2).sum()
            dist += diff
            reg_loss += diff 

    if dist > 0:
        dist = dist.sqrt()
    
    loss = (reg * reg_loss)
    return loss, dist


def test(model, teacher, loader, loss=False):
    with torch.no_grad():
        model.eval()

        if loss:
            teacher.eval()

            ce = CrossEntropyLabelSmooth(loader.dataset.num_classes, args.label_smoothing).to('cuda')
            featloss = torch.nn.MSELoss(reduction='none')

        total_ce = 0
        total_feat_reg = np.zeros(len(reg_layers))
        total_l2sp_reg = 0
        total = 0
        top1 = 0

        total = 0
        top1 = 0
        for i, (batch, label) in enumerate(loader):
            batch, label = batch.to('cuda'), label.to('cuda')

            total += batch.size(0)
            out = model(batch)
            _, pred = out.max(dim=1)
            top1 += int(pred.eq(label).sum().item())

            if loss:
                total_ce += ce(out, label).item()
                if teacher is not None:
                    with torch.no_grad():
                        tout = teacher(batch)

                    for key in reg_layers:
                        src_x = reg_layers[key][0].out
                        tgt_x = reg_layers[key][1].out
                        tgt_channels = tgt_x.shape[1]

                        regloss = featloss(src_x[:,:tgt_channels,:,:], tgt_x.detach()).mean()

                        total_feat_reg[key] += regloss.item()

                _, unweighted = l2sp(model, 0)
                total_l2sp_reg += unweighted.item()

    return float(top1)/total*100, total_ce/(i+1), np.sum(total_feat_reg)/(i+1), total_l2sp_reg/(i+1), total_feat_reg/(i+1)

def train(model, train_loader, val_loader, iterations=9000, lr=1e-2, name='', l2sp_lmda=1e-2, teacher=None, reg_layers={}):
    model = model.to('cuda')

    if l2sp_lmda == 0:
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=args.momentum, weight_decay=args.weight_decay)
    else:
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=args.momentum, weight_decay=0)

    end_iter = iterations
    if args.swa:
        optimizer = torchcontrib.optim.SWA(optimizer, swa_start=args.swa_start, swa_freq=args.swa_freq)
        end_iter = args.swa_start
    if args.const_lr:
        scheduler = None
    else:
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, end_iter)

    teacher.eval()
    ce = CrossEntropyLabelSmooth(train_loader.dataset.num_classes, args.label_smoothing).to('cuda')
    featloss = torch.nn.MSELoss()


    batch_time = MovingAverageMeter('Time', ':6.3f')
    data_time = MovingAverageMeter('Data', ':6.3f')
    ce_loss_meter = MovingAverageMeter('CE Loss', ':6.3f')
    feat_loss_meter  = MovingAverageMeter('Feat. Loss', ':6.3f')
    l2sp_loss_meter  = MovingAverageMeter('L2SP Loss', ':6.3f')
    linear_loss_meter  = MovingAverageMeter('LinearL2 Loss', ':6.3f')
    total_loss_meter  = MovingAverageMeter('Total Loss', ':6.3f')
    top1_meter  = MovingAverageMeter('Acc@1', ':6.2f')

    dataloader_iterator = iter(train_loader)
    for i in range(iterations):
        if args.swa:
            if i >= int(args.swa_start) and (i-int(args.swa_start))%args.swa_freq == 0:
                scheduler = None
        model.train()
        optimizer.zero_grad()

        end = time.time()
        try:
            batch, label = next(dataloader_iterator)
        except:
            dataloader_iterator = iter(train_loader)
            batch, label = next(dataloader_iterator)
        batch, label = batch.to('cuda'), label.to('cuda')
        data_time.update(time.time() - end)

        out = model(batch)
        _, pred = out.max(dim=1)

        top1_meter.update(float(pred.eq(label).sum().item()) / label.shape[0] * 100.)

        loss = 0.
        loss += ce(out, label)

        ce_loss_meter.update(loss.item())

        with torch.no_grad():
            tout = teacher(batch)

        # Compute the feature distillation loss only when needed
        if args.feat_lmda > 0:
            regloss = 0
            for layer in args.feat_layers:
                key = int(layer)-1

                src_x = reg_layers[key][0].out
                tgt_x = reg_layers[key][1].out
                tgt_channels = tgt_x.shape[1]
                regloss += featloss(src_x[:,:tgt_channels,:,:], tgt_x.detach())

            regloss = args.feat_lmda * regloss
            loss += regloss
            feat_loss_meter.update(regloss.item())

        beta_loss, linear_norm = linear_l2(model)
        loss = loss + beta_loss 
        linear_loss_meter.update(beta_loss.item())

        if l2sp_lmda > 0:
            reg, _ = l2sp(model, l2sp_lmda)
            l2sp_loss_meter.update(reg.item())
            loss = loss + reg

        total_loss_meter.update(loss.item())

        loss.backward()
        optimizer.step()
        for param_group in optimizer.param_groups:
            current_lr = param_group['lr']
        if scheduler is not None:
            scheduler.step()

        batch_time.update(time.time() - end)

        if (i % args.print_freq == 0) or (i == iterations-1):
            progress = ProgressMeter(
                iterations,
                [batch_time, data_time, top1_meter, total_loss_meter, ce_loss_meter, feat_loss_meter, l2sp_loss_meter, linear_loss_meter],
                prefix="LR: {:6.5f}".format(current_lr))
            progress.display(i)

        if (i % args.test_interval == 0) or (i == iterations-1):
            test_top1, test_ce_loss, test_feat_loss, test_weight_loss, test_feat_layer_loss = test(model, teacher, val_loader, loss=True)
            train_top1, train_ce_loss, train_feat_loss, train_weight_loss, train_feat_layer_loss = test(model, teacher, train_loader, loss=True)
            print('Eval Train | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, train_top1, train_ce_loss, train_feat_loss, train_weight_loss))
            print('Eval Test | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, test_top1, test_ce_loss, test_feat_loss, test_weight_loss))
            if not args.no_save:
                if not os.path.exists('ckpt'):
                    os.makedirs('ckpt')
                torch.save({'state_dict': model.state_dict()}, 'ckpt/{}.pth'.format(name))

    if args.swa:
        optimizer.swap_swa_sgd()

        for m in model.modules():
            if hasattr(m, 'running_mean'):
                m.reset_running_stats()
                m.momentum = None
        with torch.no_grad():
            model.train()
            for x, y in train_loader:
                x = x.to('cuda')
                out = model(x)

        test_top1, test_ce_loss, test_feat_loss, test_weight_loss, test_feat_layer_loss = test(model, teacher, val_loader, loss=True)
        train_top1, train_ce_loss, train_feat_loss, train_weight_loss, train_feat_layer_loss = test(model, teacher, train_loader, loss=True)
        print('Eval Train | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, train_top1, train_ce_loss, train_feat_loss, train_weight_loss))
        print('Eval Test | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, test_top1, test_ce_loss, test_feat_loss, test_weight_loss))

        if not args.no_save:
            if not os.path.exists('ckpt'):
                os.makedirs('ckpt')
            torch.save({'state_dict': model.state_dict()}, 'ckpt/{}.pth'.format(name))

    return model

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--datapath", type=str, default='/data', help='path to the dataset')
    parser.add_argument("--dataset", type=str, default='CUB200Data', help='Target dataset. Currently support: \{SDog120Data, CUB200Data, Stanford40Data, MIT67Data, Flower102Data\}')
    parser.add_argument("--iterations", type=int, default=30000, help='Iterations to train')
    parser.add_argument("--print_freq", type=int, default=100, help='Frequency of printing training logs')
    parser.add_argument("--test_interval", type=int, default=1000, help='Frequency of testing')
    parser.add_argument("--name", type=str, default='test', help='Name for the checkpoint')
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=1e-2)
    parser.add_argument("--const_lr", action='store_true', default=False, help='Use constant learning rate')
    parser.add_argument("--weight_decay", type=float, default=0)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--beta", type=float, default=1e-2, help='The strength of the L2 regularization on the last linear layer')
    parser.add_argument("--dropout", type=float, default=0, help='Dropout rate for spatial dropout')
    parser.add_argument("--l2sp_lmda", type=float, default=0)
    parser.add_argument("--feat_lmda", type=float, default=0)
    parser.add_argument("--feat_layers", type=str, default='1234', help='Used for DELTA (which layers or stages to match), ResNets should be 1234 and MobileNetV2 should be 12345')
    parser.add_argument("--reinit", action='store_true', default=False, help='Reinitialize before training')
    parser.add_argument("--no_save", action='store_true', default=False, help='Do not save checkpoints')
    parser.add_argument("--swa", action='store_true', default=False, help='Use SWA')
    parser.add_argument("--swa_freq", type=int, default=500, help='Frequency of averaging models in SWA')
    parser.add_argument("--swa_start", type=int, default=0, help='Start SWA since which iterations')
    parser.add_argument("--label_smoothing", type=float, default=0)
    parser.add_argument("--checkpoint", type=str, default='', help='Load a previously trained checkpoint')
    parser.add_argument("--network", type=str, default='resnet18', help='Network architecture. Currently support: \{resnet18, resnet50, resnet101, mbnetv2\}')
    parser.add_argument("--tnetwork", type=str, default='resnet18', help='Network architecture. Currently support: \{resnet18, resnet50, resnet101, mbnetv2\}')
    parser.add_argument("--width_mult", type=float, default=1)
    parser.add_argument("--shot", type=int, default=-1, help='Number of training samples per class for the training dataset. -1 indicates using the full dataset.')
    parser.add_argument("--log", action='store_true', default=False, help='Redirect the output to log/args.name.log')
    args = parser.parse_args()
    return args

# Used to matching features
def record_act(self, input, output):
    self.out = output

def record_act_with_1x1(self, input, output):
    self.out = self[-1].dim_matching(output)

if __name__ == '__main__':
    args = get_args()

    if args.log:
        if not os.path.exists('log'):
            os.makedirs('log')
        sys.stdout = open('log/{}.log'.format(args.name), 'w')


    print(args)

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    # Used to make sure we sample the same image for few-shot scenarios
    seed = 98

    train_set = eval(args.dataset)(args.datapath, True, transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]), args.shot, seed, preload=False)

    test_set = eval(args.dataset)(args.datapath, False, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]), args.shot, seed, preload=False)

    train_loader = torch.utils.data.DataLoader(train_set,
        batch_size=args.batch_size, shuffle=True,
        num_workers=8, pin_memory=True)

    val_loader = train_loader

    test_loader = torch.utils.data.DataLoader(test_set,
        batch_size=args.batch_size, shuffle=False,
        num_workers=8, pin_memory=False)

    model = eval('{}_dropout'.format(args.network))(pretrained=True, dropout=args.dropout, width_mult=args.width_mult, num_classes=train_loader.dataset.num_classes).cuda()
    if args.checkpoint != '':
        checkpoint = torch.load(args.checkpoint)
        model.load_state_dict(checkpoint['state_dict'])

    # Pre-trained model
    teacher = eval('{}_dropout'.format(args.tnetwork))(pretrained=True, dropout=0, num_classes=train_loader.dataset.num_classes).cuda()

    if 'mbnetv2' in args.network:
        reg_layers = {0: [model.layer1], 1: [model.layer2], 2: [model.layer3], 3: [model.layer4], 4: [model.layer5]}
        model.layer1.register_forward_hook(record_act)
        model.layer2.register_forward_hook(record_act)
        model.layer3.register_forward_hook(record_act)
        model.layer4.register_forward_hook(record_act)
        model.layer5.register_forward_hook(record_act)
    else:
        reg_layers = {0: [model.layer1], 1: [model.layer2], 2: [model.layer3], 3: [model.layer4]}
        # if args.width_mult > 1:
        #     model.layer1.register_forward_hook(record_act_with_1x1)
        #     model.layer2.register_forward_hook(record_act_with_1x1)
        #     model.layer3.register_forward_hook(record_act_with_1x1)
        #     model.layer4.register_forward_hook(record_act_with_1x1)

        #     model.layer1[-1].dim_matching = torch.nn.Conv2d(model.layer1[-1].out_dim, int(model.layer1[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        #     model.layer2[-1].dim_matching = torch.nn.Conv2d(model.layer2[-1].out_dim, int(model.layer2[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        #     model.layer3[-1].dim_matching = torch.nn.Conv2d(model.layer3[-1].out_dim, int(model.layer3[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        #     model.layer4[-1].dim_matching = torch.nn.Conv2d(model.layer4[-1].out_dim, int(model.layer4[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        # else:
        #     model.layer1.register_forward_hook(record_act)
        #     model.layer2.register_forward_hook(record_act)
        #     model.layer3.register_forward_hook(record_act)
        #     model.layer4.register_forward_hook(record_act)

        model.layer1.register_forward_hook(record_act_with_1x1)
        model.layer2.register_forward_hook(record_act_with_1x1)
        model.layer3.register_forward_hook(record_act_with_1x1)
        model.layer4.register_forward_hook(record_act_with_1x1)

        model.layer1[-1].dim_matching = torch.nn.Conv2d(model.layer1[-1].out_dim, int(teacher.layer1[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        model.layer2[-1].dim_matching = torch.nn.Conv2d(model.layer2[-1].out_dim, int(teacher.layer2[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        model.layer3[-1].dim_matching = torch.nn.Conv2d(model.layer3[-1].out_dim, int(teacher.layer3[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
        model.layer4[-1].dim_matching = torch.nn.Conv2d(model.layer4[-1].out_dim, int(teacher.layer4[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()


    # Stored pre-trained weights for computing L2SP
    for m in model.modules():
        if hasattr(m, 'weight') and not hasattr(m, 'old_weight'):
            m.old_weight = m.weight.data.clone().detach()
            # all_weights = torch.cat([all_weights.reshape(-1), m.weight.data.abs().reshape(-1)], dim=0)
        if hasattr(m, 'bias') and not hasattr(m, 'old_bias') and m.bias is not None:
            m.old_bias = m.bias.data.clone().detach()

    if args.reinit:
        for m in model.modules():
            if type(m) in [nn.Linear, nn.BatchNorm2d, nn.Conv2d]:
                m.reset_parameters()

    reg_layers[0].append(teacher.layer1)
    teacher.layer1.register_forward_hook(record_act)
    reg_layers[1].append(teacher.layer2)
    teacher.layer2.register_forward_hook(record_act)
    reg_layers[2].append(teacher.layer3)
    teacher.layer3.register_forward_hook(record_act)
    reg_layers[3].append(teacher.layer4)
    teacher.layer4.register_forward_hook(record_act)

    if '5' in args.feat_layers:
        reg_layers[4].append(teacher.layer5)
        teacher.layer5.register_forward_hook(record_act)

    train(model, train_loader, test_loader, l2sp_lmda=args.l2sp_lmda, iterations=args.iterations, lr=args.lr, name='{}'.format(args.name), teacher=teacher, reg_layers=reg_layers)

Eval

import argparse
import torch
import time
import sys
import numpy as np
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchcontrib

from PIL import Image

from torchvision import transforms

from dataset.cub200 import CUB200Data
from dataset.mit67 import MIT67Data
from dataset.stanford_dog import SDog120Data
from dataset.caltech256 import Caltech257Data
from dataset.stanford_40 import Stanford40Data
from dataset.flower102 import Flower102Data

from advertorch.attacks import LinfPGDAttack

from model.fe_resnet import resnet18_dropout, resnet50_dropout, resnet101_dropout
from model.fe_mobilenet import mbnetv2_dropout
from model.fe_resnet import feresnet18, feresnet50, feresnet101
from model.fe_mobilenet import fembnetv2

def test(model, loader, adversary):
    model.eval()

    total_ce = 0
    total = 0
    top1 = 0

    total = 0
    top1_clean = 0
    top1_adv = 0
    adv_success = 0
    adv_trial = 0
    for i, (batch, label) in enumerate(loader):
        batch, label = batch.to('cuda'), label.to('cuda')

        total += batch.size(0)
        out_clean = model(batch)

        if 'mbnetv2' in args.network:
            y = torch.zeros(batch.shape[0], model.classifier[1].in_features).cuda()
        else:
            y = torch.zeros(batch.shape[0], model.fc.in_features).cuda()
        y[:,0] = args.m
        advbatch = adversary.perturb(batch, y)

        out_adv = model(advbatch)

        _, pred_clean = out_clean.max(dim=1)
        _, pred_adv = out_adv.max(dim=1)

        clean_correct = pred_clean.eq(label)
        adv_trial += int(clean_correct.sum().item())
        adv_success += int(pred_adv[clean_correct].eq(label[clean_correct]).sum().item())
        top1_clean += int(pred_clean.eq(label).sum().item())
        top1_adv += int(pred_adv.eq(label).sum().item())

        print('{}/{}...'.format(i+1, len(loader)))


    return float(top1_clean)/total*100, float(top1_adv)/total*100, float(adv_trial-adv_success) / adv_trial *100

def record_act(self, input, output):
    pass

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--datapath", type=str, default='/data', help='path to the dataset')
    parser.add_argument("--dataset", type=str, default='CUB200Data', help='Target dataset. Currently support: \{SDog120Data, CUB200Data, Stanford40Data, MIT67Data, Flower102Data\}')
    parser.add_argument("--name", type=str, default='test')
    parser.add_argument("--B", type=float, default=0.1, help='Attack budget')
    parser.add_argument("--m", type=float, default=1000, help='Hyper-parameter for task-agnostic attack')
    parser.add_argument("--pgd_iter", type=int, default=40)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--dropout", type=float, default=0)
    parser.add_argument("--checkpoint", type=str, default='')
    parser.add_argument("--network", type=str, default='resnet18', help='Network architecture. Currently support: \{resnet18, resnet50, resnet101, mbnetv2\}')
    args = parser.parse_args()
    return args

def myloss(yhat, y):
    return -((yhat[:,0]-y[:,0])**2 + 0.1*((yhat[:,1:]-y[:,1:])**2).mean(1)).mean()

if __name__ == '__main__':
    args = get_args()
    print(args)

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    seed = int(time.time())

    test_set = eval(args.dataset)(args.datapath, False, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]), -1, seed, preload=False)

    test_loader = torch.utils.data.DataLoader(test_set,
        batch_size=args.batch_size, shuffle=False,
        num_workers=8, pin_memory=False)

    transferred_model = eval('{}_dropout'.format(args.network))(pretrained=False, dropout=args.dropout, num_classes=test_loader.dataset.num_classes).cuda()
    checkpoint = torch.load(args.checkpoint)
    transferred_model.load_state_dict(checkpoint['state_dict'])

    pretrained_model = eval('fe{}'.format(args.network))(pretrained=True).cuda().eval()

    adversary = LinfPGDAttack(
            pretrained_model, loss_fn=myloss, eps=args.B,
            nb_iter=args.pgd_iter, eps_iter=0.01, rand_init=True, clip_min=-2.2, clip_max=2.2,
            targeted=False)

    clean_top1, adv_top1, adv_sr = test(transferred_model, test_loader, adversary)

    print('Clean Top-1: {:.2f} | Adv Top-1: {:.2f} | Attack Success Rate: {:.2f}'.format(clean_top1, adv_top1, adv_sr))

[CV_3D] PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation

PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation

Abstract

Task : Online Semantic Segmentation of Single-scan LiDAR point clouds

  • Assigning semantic label to each of points given input point cloud
  • Applications : fine-grained autonomous perception in self-driving systems
  • Challenges
    • near-real-time latency with limited hardware
    • uneven or long-tailed distribution of LiDAR points across space (sparse)
    • increasing number of fine-grained semantic classes
  • Previous methods (ex. KNN & Graph) : Low performance / Time-consuming

PolarNet

  • LiDAR-specific, nearest-neighbor-free segmentation algorithm
  • polar bird’s eye view : balancing points across grid cells in polar coordinate system
  • indirectly aligning segmentation network’s attention with long-tailed distribution of points along radial axis

1. Introduction

Background

  • Lag bw release of massive PC datasets and readiness of semantic segmentation labels
  • Challenge for human raters to provide point-wise labels
  • Demand for automatic and fast semantic segmentation solutions for LiDAR scans

Contributions

  • More suitable LiDAR scan representation : considering imbalanced spatial distribution of points
  • End-to-end PolarNet network : SOTA performance with low computational cost
  • Analysis on performance based on different backbone nets using polar grid compared to other representations

3. Approach

Problem Statement

  • $P_i$ : $i$-th point set containing $n_i$ LiDAR points => 4 features : (x, y, z, reflection)
  • $L_i$ : object labels for each point $p_j$ in $P_i$
  • Goal : Training segmentation model $f$ to minimize difference bw prediction $f(P_i)$ and label $L_i$

Overview of model

image

  • 1️⃣ Polar Quantization : Points → Grids
  • 2️⃣ Grid feature extraction → Polar Grid feature map
    • KNN-free PointNet to transform points to fixed-length representation
    • representation is assigned to its location in ring matrix
  • 3️⃣ Ring-segmentation-CNN
    • Input : ring matrix
    • Output : quantized prediction
  • 4️⃣ Decoding : Projecting prediction into 3D space

BEV Partition

image

  • Network : 2D detection network to detect objects in 3D point clouds → segmentation
    • Input : 2D top-down image (orthogonal projections)
    • Output : tensor of the same spatial shape, where each spatial location
      • encodes the class prediction for every voxel along the z-axis at that location
  • Motivation : to represent scene with top-down img to speed up down-stream CNNs for natural imgs
  • Operation
    • Initial BEV representation : to create top-down projections of PC
    • Variants of inital BEV : to encode each pixel in BEV with different heights, reflection, learned representations
  • Cartesian BEV Grid Partition
    • Quantize points in Cartesian coordinate system
    • Middle grid cells : densely concentrated ↔ Peripheral grid cells : totally empty
    • Uneven Partitioning → waste of computational power, limit of feature representiveness for center grid cells
    • Points with different labels might be assigned to single cell

Polar BEV

  • Motivation : to solve imbalance problem of Cartesian BEV
  • Operation
    • 1) Origin = Sensor's location → Calculate each point's azimuth and radius on XY plane
    • 2) Assign points to grid cells based on quantized azimuth and radius
  • Benefits : More evenly point distribution
    • Cells close to the sensor are smaller, so fewer points fall into each <=> the dense near-range region gets a finer grid representation
    • Lower Standard deviations <=> points are more evenly distributed
    • Less burden on predictors (Less misclassification)

Polar Grid

  • Learnable simplified PointNet $h$
    • Layers : max-pooling & BN & ReLU
    • capturing distribution of points in each grid with fixed-length representation
  • Feature in $i$, $j$-th grid cell
    • $fea_{i,j} = \max(h(p) \mid w_i < p_x < w_{i+1},\ l_j < p_y < l_{j+1})$
    • $w$, $l$ : quantization sizes
    • $p_x$, $p_y$ : locations of point $p$
    • quantization sizes and locations : Polar or Cartesian
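
A minimal NumPy sketch (not the authors' code) of the polar quantization described above: compute each point's radius and azimuth on the XY plane and bin them into a polar grid; the grid size follows the [480, 360] SemanticKITTI setting mentioned below, while the max range is a placeholder:

import numpy as np

def polar_grid_index(points, num_r=480, num_theta=360, max_r=50.0):
    # map (x, y) to (radius bin, azimuth bin) indices of a polar BEV grid
    x, y = points[:, 0], points[:, 1]
    r = np.sqrt(x ** 2 + y ** 2)                        # radius from the sensor (origin)
    theta = np.arctan2(y, x)                            # azimuth in [-pi, pi)
    r_idx = np.clip((r / max_r * num_r).astype(int), 0, num_r - 1)
    t_idx = ((theta + np.pi) / (2 * np.pi) * num_theta).astype(int) % num_theta
    return r_idx, t_idx

pts = np.random.uniform(-50, 50, size=(1000, 3))        # toy (x, y, z) points
r_idx, t_idx = polar_grid_index(pts)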

4. Experiments

Settings

Datasets

image

  • SemanticKITTI : point-level re-annotation of LiDAR part of KITTI / imbalanced and challenging / 19 class
  • A2D2 : autonomous driving dataset / using 5 asynchronous LiDAR sensors / 38 class segmentation annotation
  • Paris-Lille-3D : 3 aggregated pc built from continuous LiDAR scans of streets / 9 segmentation class

Voxelization

  • Cartesian BEV and Polar BEV grid spaces are chosen to include more than 99% of points for each scan on avg
  • Respective grid size setting : [480, 360, 32], [320, 320, 32], [320, 320, 32]

Baselines / Metric

  • Baselines : SqueezeSeg, PointNet, RandLA-Net, etc.
  • Metric : mIoU

Results

SemanticKITTI Segmentation

image

A2D2 Segmentation

image

Paris-Lille-3D Segmentation

image

ETC

Projection Methods

image

Augmenting LiDAR Segmentation

image

  • RC : Ring Convolution
  • 9F : 2 Cartesian coordinates + 3 residual distances from center + 1 reflection + 3 Polar coordinate

mIOU vs. Distance to Sensor

image

Code Implementation

[Code] BEV

import torch
import torch.nn as nn
import torch.nn.functional as F
# The building blocks not shown here (down, up, outconv, and the DropBlock module) are
# assumed to be defined elsewhere in the PolarNet repository.

class BEV_Unet(nn.Module):

    def __init__(self,n_class,n_height,dilation = 1,group_conv=False,input_batch_norm = False,dropout = 0.,circular_padding = False, dropblock = True, use_vis_fea=False):
        super(BEV_Unet, self).__init__()
        self.n_class = n_class
        self.n_height = n_height
        if use_vis_fea:
            self.network = UNet(n_class*n_height,2*n_height,dilation,group_conv,input_batch_norm,dropout,circular_padding,dropblock)
        else:
            self.network = UNet(n_class*n_height,n_height,dilation,group_conv,input_batch_norm,dropout,circular_padding,dropblock)

    def forward(self, x):
        x = self.network(x)
        
        x = x.permute(0,2,3,1)
        new_shape = list(x.size())[:3] + [self.n_height,self.n_class]
        x = x.view(new_shape)
        x = x.permute(0,4,1,2,3)
        return x
    
class UNet(nn.Module):
    def __init__(self, n_class,n_height,dilation,group_conv,input_batch_norm, dropout,circular_padding,dropblock):
        super(UNet, self).__init__()
        self.inc = inconv(n_height, 64, dilation, input_batch_norm, circular_padding)
        self.down1 = down(64, 128, dilation, group_conv, circular_padding)
        self.down2 = down(128, 256, dilation, group_conv, circular_padding)
        self.down3 = down(256, 512, dilation, group_conv, circular_padding)
        self.down4 = down(512, 512, dilation, group_conv, circular_padding)
        self.up1 = up(1024, 256, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
        self.up2 = up(512, 128, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
        self.up3 = up(256, 64, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
        self.up4 = up(128, 64, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
        self.dropout = nn.Dropout(p=0. if dropblock else dropout)
        self.outc = outconv(64, n_class)

    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        x = self.outc(self.dropout(x))
        return x

class double_conv(nn.Module):
    '''(conv => BN => ReLU) * 2'''
    def __init__(self, in_ch, out_ch,group_conv,dilation=1):
        super(double_conv, self).__init__()
        if group_conv:
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1,groups = min(out_ch,in_ch)),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1,groups = out_ch),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True)
            )
        else:
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True)
            )

    def forward(self, x):
        x = self.conv(x)
        return x

class double_conv_circular(nn.Module):
    '''(conv => BN => ReLU) * 2'''
    def __init__(self, in_ch, out_ch,group_conv,dilation=1):
        super(double_conv_circular, self).__init__()
        if group_conv:
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=(1,0),groups = min(out_ch,in_ch)),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True)
            )
            self.conv2 = nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, padding=(1,0),groups = out_ch),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True)
            )
        else:
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=(1,0)),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True)
            )
            self.conv2 = nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, padding=(1,0)),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True)
            )

    def forward(self, x):
        #add circular padding
        x = F.pad(x,(1,1,0,0),mode = 'circular')
        x = self.conv1(x)
        x = F.pad(x,(1,1,0,0),mode = 'circular')
        x = self.conv2(x)
        return x

class inconv(nn.Module):
    def __init__(self, in_ch, out_ch, dilation, input_batch_norm, circular_padding):
        super(inconv, self).__init__()
        if input_batch_norm:
            if circular_padding:
                self.conv = nn.Sequential(
                    nn.BatchNorm2d(in_ch),
                    double_conv_circular(in_ch, out_ch,group_conv = False,dilation = dilation)
                )
            else:
                self.conv = nn.Sequential(
                    nn.BatchNorm2d(in_ch),
                    double_conv(in_ch, out_ch,group_conv = False,dilation = dilation)
                )
        else:
            if circular_padding:
                self.conv = double_conv_circular(in_ch, out_ch,group_conv = False,dilation = dilation)
            else:
                self.conv = double_conv(in_ch, out_ch,group_conv = False,dilation = dilation)

    def forward(self, x):
        x = self.conv(x)
        return x

class down(nn.Module):
    def __init__(self, in_ch, out_ch, dilation, group_conv, circular_padding):
        super(down, self).__init__()
        if circular_padding:
            self.mpconv = nn.Sequential(
                nn.MaxPool2d(2),
                double_conv_circular(in_ch, out_ch,group_conv = group_conv,dilation = dilation)
            )
        else:
            self.mpconv = nn.Sequential(
                nn.MaxPool2d(2),
                double_conv(in_ch, out_ch,group_conv = group_conv,dilation = dilation)
            )                

    def forward(self, x):
        x = self.mpconv(x)
        return x

class up(nn.Module):
    def __init__(self, in_ch, out_ch, circular_padding, bilinear=True, group_conv=False, use_dropblock = False, drop_p = 0.5):
        super(up, self).__init__()

        #  it would be a nice idea if the upsampling could be learned too,
        #  but my machine does not have enough memory to handle all those weights
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        elif group_conv:
            self.up = nn.ConvTranspose2d(in_ch//2, in_ch//2, 2, stride=2,groups = in_ch//2)
        else:
            self.up = nn.ConvTranspose2d(in_ch//2, in_ch//2, 2, stride=2)

        if circular_padding:
            self.conv = double_conv_circular(in_ch, out_ch,group_conv = group_conv)
        else:
            self.conv = double_conv(in_ch, out_ch,group_conv = group_conv)

        self.use_dropblock = use_dropblock
        if self.use_dropblock:
            self.dropblock = DropBlock2D(block_size=7, drop_prob=drop_p)

    def forward(self, x1, x2):
        x1 = self.up(x1)
        
        # input is CHW
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, (diffX // 2, diffX - diffX//2,
                        diffY // 2, diffY - diffY//2))
        
        # for padding issues, see 
        # https://github.com/HaiyongJiang/U-Net-Pytorch-Unstructured-Buggy/commit/0e854509c2cea854e247a9c615f175f76fbb2e3a
        # https://github.com/xiaopeng-liao/Pytorch-UNet/commit/8ebac70e633bac59fc22bb5195e513d5832fb3bd

        x = torch.cat([x2, x1], dim=1)
        x = self.conv(x)
        if self.use_dropblock:
            x = self.dropblock(x)
        return x

class outconv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(outconv, self).__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        x = self.conv(x)
        return x
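
A quick, hypothetical shape check of the BEV head above in isolation, using a random input on a grid that is downscaled for speed; the class count, height-bin count and grid resolution are illustrative assumptions.

import torch

net = BEV_Unet(n_class=20, n_height=32, circular_padding=True, dropblock=False)
fake_bev = torch.randn(1, 32, 160, 128)       # (batch, n_height, radial bins, azimuth bins)
with torch.no_grad():
    out = net(fake_bev)
print(out.shape)                               # expected: (1, 20, 160, 128, 32)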

[Code] PointNet + BEV

# imports required by the code below
import multiprocessing

import numba as nb
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_scatter

class ptBEVnet(nn.Module):
    
    def __init__(self, BEV_net, grid_size, pt_model = 'pointnet', fea_dim = 3, pt_pooling = 'max', kernal_size = 3,
                 out_pt_fea_dim = 64, max_pt_per_encode = 64, cluster_num = 4, pt_selection = 'farthest', fea_compre = None):
        super(ptBEVnet, self).__init__()
        assert pt_pooling in ['max']
        assert pt_selection in ['random','farthest']
        
        if pt_model == 'pointnet':
            self.PPmodel = nn.Sequential(
                nn.BatchNorm1d(fea_dim),
                
                nn.Linear(fea_dim, 64),
                nn.BatchNorm1d(64),
                nn.ReLU(inplace=True),
                
                nn.Linear(64, 128),
                nn.BatchNorm1d(128),
                nn.ReLU(inplace=True),
                
                nn.Linear(128, 256),
                nn.BatchNorm1d(256),
                nn.ReLU(inplace=True),
                
                nn.Linear(256, out_pt_fea_dim)
            )
        
        self.pt_model = pt_model
        self.BEV_model = BEV_net
        self.pt_pooling = pt_pooling
        self.max_pt = max_pt_per_encode
        self.pt_selection = pt_selection
        self.fea_compre = fea_compre
        self.grid_size = grid_size
        
        # NN stuff
        if kernal_size != 1:
            if self.pt_pooling == 'max':
                self.local_pool_op = torch.nn.MaxPool2d(kernal_size, stride=1, padding=(kernal_size-1)//2, dilation=1)
            else: raise NotImplementedError
        else: self.local_pool_op = None
        
        # parametric pooling        
        if self.pt_pooling == 'max':
            self.pool_dim = out_pt_fea_dim
        
        # point feature compression
        if self.fea_compre is not None:
            self.fea_compression = nn.Sequential(
                    nn.Linear(self.pool_dim, self.fea_compre),
                    nn.ReLU())
            self.pt_fea_dim = self.fea_compre
        else:
            self.pt_fea_dim = self.pool_dim
        
    def forward(self, pt_fea, xy_ind, voxel_fea=None):
        cur_dev = pt_fea[0].get_device()
        
        # concate everything
        cat_pt_ind = []
        for i_batch in range(len(xy_ind)):
            cat_pt_ind.append(F.pad(xy_ind[i_batch],(1,0),'constant',value = i_batch))

        cat_pt_fea = torch.cat(pt_fea,dim = 0)
        cat_pt_ind = torch.cat(cat_pt_ind,dim = 0)
        pt_num = cat_pt_ind.shape[0]

        # shuffle the data
        shuffled_ind = torch.randperm(pt_num,device = cur_dev)
        cat_pt_fea = cat_pt_fea[shuffled_ind,:]
        cat_pt_ind = cat_pt_ind[shuffled_ind,:]
        
        # unique xy grid index
        unq, unq_inv, unq_cnt = torch.unique(cat_pt_ind,return_inverse=True, return_counts=True, dim=0)
        unq = unq.type(torch.int64)
        
        # subsample pts
        if self.pt_selection == 'random':
            grp_ind = grp_range_torch(unq_cnt,cur_dev)[torch.argsort(torch.argsort(unq_inv))]
            remain_ind = grp_ind < self.max_pt
        elif self.pt_selection == 'farthest':
            unq_ind = np.split(np.argsort(unq_inv.detach().cpu().numpy()), np.cumsum(unq_cnt.detach().cpu().numpy()[:-1]))
            remain_ind = np.zeros((pt_num,),dtype = np.bool_)
            np_cat_fea = cat_pt_fea.detach().cpu().numpy()[:,:3]
            pool_in = []
            for i_inds in unq_ind:
                if len(i_inds) > self.max_pt:
                    pool_in.append((np_cat_fea[i_inds,:],self.max_pt))
            if len(pool_in) > 0:
                pool = multiprocessing.Pool(multiprocessing.cpu_count())
                FPS_results = pool.starmap(parallel_FPS, pool_in)
                pool.close()
                pool.join()
            count = 0
            for i_inds in unq_ind:
                if len(i_inds) <= self.max_pt:
                    remain_ind[i_inds] = True
                else:
                    remain_ind[i_inds[FPS_results[count]]] = True
                    count += 1
            
        cat_pt_fea = cat_pt_fea[remain_ind,:]
        cat_pt_ind = cat_pt_ind[remain_ind,:]
        unq_inv = unq_inv[remain_ind]
        unq_cnt = torch.clamp(unq_cnt,max=self.max_pt)
        
        # process feature
        if self.pt_model == 'pointnet':
            processed_cat_pt_fea = self.PPmodel(cat_pt_fea)
        
        if self.pt_pooling == 'max':
            pooled_data = torch_scatter.scatter_max(processed_cat_pt_fea, unq_inv, dim=0)[0]
        else: raise NotImplementedError
        
        if self.fea_compre:
            processed_pooled_data = self.fea_compression(pooled_data)
        else:
            processed_pooled_data = pooled_data
        
        # stuff pooled data into 4D tensor
        out_data_dim = [len(pt_fea),self.grid_size[0],self.grid_size[1],self.pt_fea_dim]
        out_data = torch.zeros(out_data_dim, dtype=torch.float32).to(cur_dev)
        out_data[unq[:,0],unq[:,1],unq[:,2],:] = processed_pooled_data
        out_data = out_data.permute(0,3,1,2)
        if self.local_pool_op != None:
            out_data = self.local_pool_op(out_data)
        if voxel_fea is not None:
            out_data = torch.cat((out_data, voxel_fea), 1)
        
        # run through network
        net_return_data = self.BEV_model(out_data)
        
        return net_return_data
    
# grp_range_torch: given group sizes `a`, return for every element its within-group rank
# (0, 1, 2, ... restarting at each group); used to randomly keep at most max_pt points per cell
def grp_range_torch(a,dev):
    idx = torch.cumsum(a,0)
    id_arr = torch.ones(idx[-1],dtype = torch.int64,device=dev)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return torch.cumsum(id_arr,0)

def parallel_FPS(np_cat_fea,K):
    # thin wrapper so the numba-compiled FPS can be dispatched via multiprocessing.Pool.starmap
    return nb_greedy_FPS(np_cat_fea,K)

# greedy farthest point sampling (numba-compiled): returns a boolean mask selecting K points
# of `xyz` that are maximally spread out, starting from point 0
@nb.jit('b1[:](f4[:,:],i4)',nopython=True,cache=True)
def nb_greedy_FPS(xyz,K):
    start_element = 0
    sample_num = xyz.shape[0]
    sum_vec = np.zeros((sample_num,1),dtype = np.float32)
    xyz_sq = xyz**2
    for j in range(sample_num):
        sum_vec[j,0] = np.sum(xyz_sq[j,:])
    pairwise_distance = sum_vec + np.transpose(sum_vec) - 2*np.dot(xyz, np.transpose(xyz))
    
    candidates_ind = np.zeros((sample_num,),dtype = np.bool_)
    candidates_ind[start_element] = True
    remain_ind = np.ones((sample_num,),dtype = np.bool_)
    remain_ind[start_element] = False
    all_ind = np.arange(sample_num)
    
    for i in range(1,K):
        if i == 1:
            min_remain_pt_dis = pairwise_distance[:,start_element]
            min_remain_pt_dis = min_remain_pt_dis[remain_ind]
        else:
            cur_dis = pairwise_distance[remain_ind,:]
            cur_dis = cur_dis[:,candidates_ind]
            min_remain_pt_dis = np.zeros((cur_dis.shape[0],),dtype = np.float32)
            for j in range(cur_dis.shape[0]):
                min_remain_pt_dis[j] = np.min(cur_dis[j,:])
        next_ind_in_remain = np.argmax(min_remain_pt_dis)
        next_ind = all_ind[remain_ind][next_ind_in_remain]
        candidates_ind[next_ind] = True
        remain_ind[next_ind] = False
        
    return candidates_ind
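
A hypothetical end-to-end shape check combining the two modules above. The forward pass calls tensor.get_device(), so CUDA tensors are assumed; the grid size, feature widths and class count are illustrative, chosen only so the shapes line up (fea_compre is set to the number of height bins so the compressed point features match the U-Net's input channels).

import torch

grid_size = [160, 128, 32]                        # (radial, azimuth, height) bins, downscaled for the example
bev_head = BEV_Unet(n_class=20, n_height=32, circular_padding=True, dropblock=False).cuda()
model = ptBEVnet(bev_head, grid_size=grid_size, fea_dim=9, out_pt_fea_dim=512,
                 max_pt_per_encode=256, pt_selection='random', fea_compre=32).cuda()

n_pts = 5000
pt_fea = [torch.randn(n_pts, 9).cuda()]           # per-point 9-D features (one scan in the batch)
xy_ind = [torch.stack([torch.randint(0, 160, (n_pts,)),
                       torch.randint(0, 128, (n_pts,))], dim=1).cuda()]

with torch.no_grad():
    logits = model(pt_fea, xy_ind)
print(logits.shape)                                # expected: (1, 20, 160, 128, 32)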
