[CV_GAN] BeautyGAN: Instance-level Facial Makeup Transfer with Deep Generative Adversarial Network
Abstract
- Facial makeup transfer : translating makeup style from reference makeup img to non-makeup one, preserving face identity
- Instance-level transfer : more challenging than conventional Domain-level transfer tasks, especially without paired data
- Makeup style : Local styles/cosmetics (ex. eye shadow, lipstick, foundation) → different from Global style (ex. painting)
- BeautyGAN
- Incorporating both Global Domain-level loss + Local Instance-level loss in dual in/output GAN
- Extracting and translating local, delicate makeup information
- Global Domain-level loss : ensured by Discriminators that distinguish generated imgs from domain's real samples
- Local Instance-level loss : calculated by pixel-level histogram loss on separate local facial regions
- Perceptual loss & Cycle Consistency loss : Generating high quality faces and preserving identity
- Overall objective function : to learn translation on Instance-level through unsupervised adversarial learning
- Extensive experiments : BeautyGAN generates visually pleasant makeup faces and accurate transfer results
- New makeup dataset (3834 high-resolution face imgs)
CCS Concepts
- Computing methodologies → Computer Vision tasks ; Neural networks ; Unsupervised learning
Keywords
- Facial makeup transfer ; GAN
1. Introduction
1) Why Need
: To help users try well-suited makeup style from photos without professional suggestions
- Virtual Makeup application (Previous tools) : user's manual interaction required & only a certain number of fixed styles
- Makeup transfer : efficient way to help users select the most suitable style
2) Existing automatic Makeup transfer (2 categories)
(1) Traditional image processing
- Ex. image gradient editing, physics-based manipulation
- Decompose imgs into several layers (ex. face structure, color, skin)
- → Warp the reference makeup img to the non-makeup one, then transfer each layer
(2) DL based methods
- Typically built upon DNNs
- several independent networks to deal with each cosmetic individually
Previous methods
- treat makeup style as a simple combination of different components
- → Overall output img looks unnatural with apparent artifacts at combining places
3) Image-to-image translation : style-transfer
- Existing end-to-end structures that act on the entire img can generate high quality results
- But, applying them directly to the Facial makeup transfer task is still infeasible
4) Facial makeup transfer (2 main characteristics)
(1) Various makeup styles from face to face & Instance-level transfer required
- Typical img-to-img translation methods (GAN) : mostly for Domain-level transfer
- Ex. CycleGAN : img-to-img translation bw two collections (ex. horses and zebras)
- Emphasize inter-domain differences while omitting intra-domain differences
- Generate an average domain-level style that stays the same whichever reference img is given
(2) Makeup style = A Global style + Independent Local styles
- Conventional style transfer : style = global painting manner
- Makeup style : consists of various local cosmetics → delicate and elaborate
- Difficult to extract makeup style as a whole while preserving particular traits of various cosmetics
5) Making New makeup dataset
- Lack of training data
- Released makeup datasets : too small to train big networks
- Difficult to obtain a pair of well-aligned face imgs with different makeup styles
- Supervised learning with paired data is implausible
- So, making a new makeup dataset with 3834 imgs
6) BeautyGAN : A novel dual in/output GAN
- Input : makeup and non-makeup face imgs
- Output : transferred results
- No additional pre-/post-processing
- First, transfer non-makeup face to makeup domain with a couple of Discriminators
- Instance-level transfer by pixel-level histogram loss on the basis of Domain-level transfer
- Perceptual loss & Cycle Consistency loss : preserve face identity and eliminate artifacts
- Cycle Consistency bw in/outputs : achieved with only one Generator
- Makeup and Anti-makeup simultaneously in a single forward pass
- No paired data is needed
- Generated result imgs : natural-looking, visually pleasant without observable artifacts
7) Main Contributions
- (1) Automatic makeup transfer with a dual input/output GAN : effective and high quality
- (2) Instance-level style transfer : pixel-level histogram losses on different local facial regions
- can be easily generalized to other img translation tasks (ex. head-shot portraits, img attribute transfer)
- (3) New makeup dataset : 3834 imgs
2. Related works
2.1 Makeup Studies
(1) Makeup transfer frameworks based on Traditional methods
: Localized makeup transfer framework in DL + warping and structure preservation to synthesize after-makeup imgs
= Divides facial makeup into several parts and applies a different method to each facial part
- [31] Facial makeup detector and remover framework based on locality-constrained dictionary learning
- [20] Anti-Makeup : Adversarial net to generate non-makeup imgs for makeup-invariant face verification
- [11] Digital face makeup : Decompose imgs into 3 layers & Transfer makeup info layer by layer
- Result : smooth facial details of source imgs
- [19] Advanced Decomposition method (physics-based manipulation of intrinsic img layers)
(2) BeautyGAN
- Realize makeup transfer and makeup removal simultaneously
- Unified training process : considering relationships among cosmetics in different regions
- End-to-end network : learning adaptation of cosmetics fed in source imgs → eliminating need of post-processing
2.2 Style Transfer
- Aim : To combine content and style from different imgs
- [8] Generating a reconstruction img by minimizing content and style reconstruction loss
- [9] Perceptual factors to control more information (ex. color, scale, spatial location) : High quality but Heavy computation
- [13] Feed-forward network (for real-time style transfer and super-resolution): Less computation and Approximate quality
2.3 Generative Adversarial Networks
- GAN : Generator + Discriminator => Generating visually realistic imgs
- [17] Super Resolution GAN
- [6] ExGANs : a type of cGAN that utilize exemplar information to solve personalized eye in-painting problem
- [27] Improving the realism of synthetic imgs so that models can be trained on them
- [34] Generative visual manipulation on natural img manifold
- Incorporating user interactions to present real-time img editing + GAN was leveraged to estimate img manifold
2.4 GAN for Image-to-Image Translation
- Aim : To learn a mapping from source domain to target domain
- [4, 12, 35] Promising works applying GAN to Image-to-Image Translation
- [12] pix2pix : synthesize imgs from label maps → reconstruct objects from edge imgs (using paired imgs for training)
- [22] CoGAN (Coupled GAN) : generators are bound by weight-sharing constraints to learn a joint distribution
- [35] CycleGAN, [14] DiscoGAN : Cycle Consistency loss to regularize key attributes bw inputs and translated imgs
- [4] StarGAN : mapping among multiple domains within one single generator
3. Our approach : BeautyGAN
- Goal : Facial makeup transfer bw a reference makeup img and a source non-makeup img on instance-level
- A : non-makeup img domain ⊂ R^(HxWx3)
- B : makeup img domain ⊂ R^(HxWx3)
- G : AxB → BxA : Mapping bw A, B domains (x is Cartesian product) - simultaneously learning
- Inputs : given 2 imgs : a source img I_src ∈ A & a reference img I_ref ∈ B
- Outputs : an after-makeup img I_src^B ∈ B & an anti-makeup img I_ref^A ∈ A
- (I_src^B, I_ref^A) = G(I_src, I_ref)
- I_src^B : synthesizing makeup style of I_ref while preserving face identity of I_src
- I_ref^A : realizing makeup removal from I_ref
- Instance-level correspondence = Makeup style consistency bw I_src^B and I_ref
- No paired data for training
- Pixel-level Histogram loss acted on different cosmetics
- Adversarial losses : to generate visually pleasant imgs and refine correlation among different cosmetics
- Perceptual loss : to maintain face identity and structure -> transfer exact makeup to source img
- Integrate all loss terms into one Full Objective function [3.1]
3.1 Full Objective
- 1 generator G & 2 discriminators D_A, D_B → Minimax game
- G : minimize Adversarial loss
- D_A, D_B : maximize same Adversarial loss
Loss function (Adversarial loss) of D_A, D_B
- D_A : aims to distinguish generated img I_ref^A from real non-makeup samples in set A
- D_B : aims to distinguish generated img I_src^B from real makeup samples in set B
Full Objective Loss function of G : 4 Loss terms
# Combined loss
g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt
if self.checkpoint or self.direct:
    g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt + g_A_loss_his + g_B_loss_his
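Using the weight names listed under Training Details (α, β, γ, λ_l, λ_s, λ_f), the four loss terms can plausibly be combined as below. This is a reconstruction from these notes, not a quotation of the paper; the pairing of each weight with its term is an assumption, inferred from the matching defaults of `lambda_cls`, `lambda_rec` and `lambda_vgg`.

```latex
\begin{aligned}
L_G &= \alpha L_{adv} + \beta L_{cyc} + \gamma L_{per} + L_{makeup}, \\
L_{makeup} &= \lambda_l L_{lips} + \lambda_s L_{shadow} + \lambda_f L_{face}.
\end{aligned}
```

The code excerpt groups things differently: the VGG (perceptual) term is folded into `loss_rec`, and an extra identity term `loss_idt` is added that has no counterpart in this formula.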
(1) L_adv : Adversarial loss for G
# GAN loss D_A(G_A(A))
fake_B = self.G_A(org_A)
pred_fake = self.D_A(fake_B)
g_A_loss_adv = self.criterionGAN(pred_fake, True)
#g_loss_adv = self.get_G_loss(out)
# GAN loss D_B(G_B(B))
fake_A = self.G_B(ref_B)
pred_fake = self.D_B(fake_A)
g_B_loss_adv = self.criterionGAN(pred_fake, True)
(3) L_cyc : Cycle consistency loss
3.2 Domain-Level Makeup Transfer
- Domain-level makeup transfer : foundation of Instance-level makeup transfer
- Dual input/output architecture → simultaneously learning the mapping bw two domains(A,B) within just one Generator !
- Output imgs : required to preserve the face identities & background info of the Input imgs
- Perceptual loss -> face identities
- Cycle consistency loss -> background info
Perceptual loss
- Aim : to preserve face identities
- How : Calculating differences bw high-level features extracted by a deep conv net (VGG16 pretrained on ImageNet)
- F_l(x) : feature maps in l-th layer on VGG, F_l ∈ R^(C_l x H_l x W_l)
- Perceptual loss bw input imgs(I_src, I_ref) and output imgs(I_src^B, I_ref^A) :
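Written out with the definitions above, and assuming a squared difference averaged over the C_l x H_l x W_l feature entries (the normalisation is an assumption; it is not stated in these notes), the loss would read:

```latex
L_{per} = \frac{1}{C_l H_l W_l} \sum_{i,j,k}
\Big[ \big(F_l(I_{src}) - F_l(I_{src}^{B})\big)_{ijk}^{2}
    + \big(F_l(I_{ref}) - F_l(I_{ref}^{A})\big)_{ijk}^{2} \Big]
```

The squared difference matches the `criterionL2` call on VGG features in the code excerpt that follows.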
# identity loss
if self.lambda_idt > 0:
    # G should be identity if ref_B or org_A is fed
    idt_A1, idt_A2 = self.G(org_A, org_A)
    idt_B1, idt_B2 = self.G(ref_B, ref_B)
    loss_idt_A1 = self.criterionL1(idt_A1, org_A) * self.lambda_A * self.lambda_idt
    loss_idt_A2 = self.criterionL1(idt_A2, org_A) * self.lambda_A * self.lambda_idt
    loss_idt_B1 = self.criterionL1(idt_B1, ref_B) * self.lambda_B * self.lambda_idt
    loss_idt_B2 = self.criterionL1(idt_B2, ref_B) * self.lambda_B * self.lambda_idt
    # loss_idt
    loss_idt = (loss_idt_A1 + loss_idt_A2 + loss_idt_B1 + loss_idt_B2) * 0.5
else:
    loss_idt = 0
# vgg loss
vgg_org = self.vgg(org_A, self.content_layer)[0]
vgg_org = Variable(vgg_org.data).detach()
vgg_fake_A = self.vgg(fake_A, self.content_layer)[0]
g_loss_A_vgg = self.criterionL2(vgg_fake_A, vgg_org) * self.lambda_A * self.lambda_vgg
vgg_ref = self.vgg(ref_B, self.content_layer)[0]
vgg_ref = Variable(vgg_ref.data).detach()
vgg_fake_B = self.vgg(fake_B, self.content_layer)[0]
g_loss_B_vgg = self.criterionL2(vgg_fake_B, vgg_ref) * self.lambda_B * self.lambda_vgg
loss_rec = (g_loss_rec_A + g_loss_rec_B + g_loss_A_vgg + g_loss_B_vgg) * 0.5
Cycle consistency loss
- Aim : to maintain background information
- How : Feeding the output imgs back into G should reconstruct the original input imgs
# Forward cycle loss
rec_A = self.G_B(fake_B)
g_loss_rec_A = self.criterionL1(rec_A, org_A) * self.lambda_A
# Backward cycle loss
rec_B = self.G_A(fake_A)
g_loss_rec_B = self.criterionL1(rec_B, ref_B) * self.lambda_B
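For the single dual-input generator G : AxB → BxA of Section 3, the same idea can be written as follows. This is a sketch under two assumptions: the distance is L1 (as in `criterionL1` above), and the two outputs are swapped before being fed back so that the non-makeup img comes first.

```latex
\begin{aligned}
(I_{src}^{B},\, I_{ref}^{A}) &= G(I_{src},\, I_{ref}), \\
(I_{ref}^{rec},\, I_{src}^{rec}) &= G(I_{ref}^{A},\, I_{src}^{B}), \\
L_{cyc} &= \mathbb{E}\big[\lVert I_{src}^{rec} - I_{src} \rVert_1\big]
         + \mathbb{E}\big[\lVert I_{ref}^{rec} - I_{ref} \rVert_1\big].
\end{aligned}
```

The excerpt above uses two separate generators `G_A` and `G_B` (the CycleGAN-style solver), so it computes the two reconstruction terms in two calls instead of one.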
3.3 Instance-Level Makeup Transfer
- Instance-Level Makeup Transfer : Adding constraints on makeup style consistency
- Facial makeup : visually recognized as color distributions = color changing
- Histogram Matching (HM) : a straightforward Color Transfer method
- Histogram Loss on pixel-level : constraining I_src^B to match I_ref in makeup style
Histogram Loss
- Inappropriate strategy : MSE loss directly on pixel-level histograms of two imgs → histogram counting is an indicator function, so Gradient=0 → No Optimization
- Histogram matching strategy : Generating HM(x,y) first → MSE Loss → Backpropagation
- Goal : To calculate Histogram Loss on pixels bw original img x and reference img y
- HM(x,y) : a ground-truth remapped img = same color distribution as y & content info of x preserved
- MSE Loss : bw HM(x,y) and x
- Back-prop for optimization
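The strategy above can be sketched in NumPy for a single colour channel. This is an illustration, not the repository's `criterionHis`: the function names `histogram_match` and `histogram_loss` are hypothetical, and the inputs are assumed to be 1-D uint8 arrays holding the pixels of one masked region.

```python
import numpy as np

def histogram_match(x, y, bins=256):
    """Remap the levels of x so that its histogram follows that of y.

    x, y: 1-D uint8 arrays (pixels of one channel inside a region mask).
    Pixel order in x is kept, so the content layout is preserved.
    """
    x_cdf = np.cumsum(np.bincount(x, minlength=bins)) / x.size
    y_cdf = np.cumsum(np.bincount(y, minlength=bins)) / y.size
    # For each source level, take the first reference level whose CDF
    # reaches the source CDF value.
    lut = np.searchsorted(y_cdf, x_cdf, side="left").clip(0, bins - 1)
    return lut[x].astype(np.uint8)

def histogram_loss(x, y):
    """MSE between x and its histogram-matched version HM(x, y).

    HM(x, y) is treated as a fixed target; in a training framework it
    would be detached so that gradients flow only through x.
    """
    target = histogram_match(x, y).astype(np.float64)
    return float(np.mean((x.astype(np.float64) - target) ** 2))
```

Because the target is a remapped copy of x rather than a histogram, the MSE has a non-zero gradient with respect to every pixel of x, which is the point of generating HM(x,y) first.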
Face parsing
- Inappropriate strategy : Histogram loss over the entire img
- Face parsing strategy :
- split makeup style into 3 important components (lipsticks, eye shadow, foundation)
- apply localized histogram loss on each part
- Reasons
- pixels in background and hair : no relationship with makeup → disturb correct color distribution
- facial makeup is not a single global style, but a collection of various styles in different cosmetic regions
- Pre-trained Face parsing model : generating Face guidance mask M = FP(x) for each input img x
- Face guidance mask M = FP(x) : denoting several facial locations (lips, eyes, skin, hairs, background, ...)
- For each M, tracking different labels to produce 3 corresponding Binary masks
- Binary masks (M_lip, M_shadow, M_face) : representing the spatial extent of each cosmetic
- M_shadow : calculate two rectangular areas enclosing eye shadows → exclude eye regions, some hair, eyebrow regions
- Why separated? No annotation for eye shadows on M (b/c before-makeup imgs have no eye shadows)
for self.i, (img_A, img_B, mask_A, mask_B) in enumerate(self.data_loader_train):
    # Convert tensor to variable
    # mask attribute: 0:background 1:face 2:left-eyebrow 3:right-eyebrow 4:left-eye 5:right-eye 6:nose
    # 7:upper-lip 8:teeth 9:under-lip 10:hair 11:left-ear 12:right-ear 13:neck
    if self.checkpoint or self.direct:
        if self.lips==True:
            mask_A_lip = (mask_A==7).float() + (mask_A==9).float()
            mask_B_lip = (mask_B==7).float() + (mask_B==9).float()
            mask_A_lip, mask_B_lip, index_A_lip, index_B_lip = self.mask_preprocess(mask_A_lip, mask_B_lip)
        if self.skin==True:
            mask_A_skin = (mask_A==1).float() + (mask_A==6).float() + (mask_A==13).float()
            mask_B_skin = (mask_B==1).float() + (mask_B==6).float() + (mask_B==13).float()
            mask_A_skin, mask_B_skin, index_A_skin, index_B_skin = self.mask_preprocess(mask_A_skin, mask_B_skin)
        if self.eye==True:
            mask_A_eye_left = (mask_A==4).float()
            mask_A_eye_right = (mask_A==5).float()
            mask_B_eye_left = (mask_B==4).float()
            mask_B_eye_right = (mask_B==5).float()
            mask_A_face = (mask_A==1).float() + (mask_A==6).float()
            mask_B_face = (mask_B==1).float() + (mask_B==6).float()
            # skip image pairs in which an eye is closed (no eye pixels in the mask)
            if not ((mask_A_eye_left>0).any() and (mask_B_eye_left>0).any() and \
                    (mask_A_eye_right > 0).any() and (mask_B_eye_right > 0).any()):
                continue
            mask_A_eye_left, mask_A_eye_right = self.rebound_box(mask_A_eye_left, mask_A_eye_right, mask_A_face)
            mask_B_eye_left, mask_B_eye_right = self.rebound_box(mask_B_eye_left, mask_B_eye_right, mask_B_face)
            mask_A_eye_left, mask_B_eye_left, index_A_eye_left, index_B_eye_left = \
                self.mask_preprocess(mask_A_eye_left, mask_B_eye_left)
            mask_A_eye_right, mask_B_eye_right, index_A_eye_right, index_B_eye_right = \
                self.mask_preprocess(mask_A_eye_right, mask_B_eye_right)
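A small NumPy sketch of how such binary masks can be derived from a parsing map. The label ids follow the "mask attribute" comment in the excerpt above; `binary_mask` and `eye_shadow_mask` are hypothetical helpers, and `eye_shadow_mask` is only a guess at what `rebound_box` does (a rectangle around the eye, restricted to face skin), not its actual implementation.

```python
import numpy as np

LIP_LABELS = (7, 9)        # upper-lip, under-lip
FACE_LABELS = (1, 6, 13)   # face, nose, neck (the "skin" branch above)

def binary_mask(parsing, labels):
    """1.0 where the parsing map carries one of the given labels."""
    return np.isin(parsing, labels).astype(np.float32)

def eye_shadow_mask(parsing, eye_label, margin=2):
    """Rectangle around one eye, keeping only face/nose pixels.

    Restricting to labels 1 and 6 drops the eye itself as well as any
    eyebrow or hair pixels that fall inside the rectangle.
    """
    eye = parsing == eye_label
    if not eye.any():                      # closed or missing eye
        return np.zeros(parsing.shape, dtype=np.float32)
    rows, cols = np.nonzero(eye)
    top = max(int(rows.min()) - margin, 0)
    bottom = min(int(rows.max()) + margin, parsing.shape[0] - 1)
    left = max(int(cols.min()) - margin, 0)
    right = min(int(cols.max()) + margin, parsing.shape[1] - 1)
    box = np.zeros(parsing.shape, dtype=bool)
    box[top:bottom + 1, left:right + 1] = True
    face = np.isin(parsing, (1, 6))
    return (box & face & ~eye).astype(np.float32)
```

The `margin` value is arbitrary here; a real implementation would scale the rectangle with the eye size.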
Makeup Loss
# color_histogram loss
g_A_loss_his = 0
g_B_loss_his = 0
if self.checkpoint or self.direct:
    if self.lips==True:
        g_A_lip_loss_his = self.criterionHis(fake_A, ref_B, mask_A_lip, mask_B_lip, index_A_lip) * self.lambda_his_lip
        g_B_lip_loss_his = self.criterionHis(fake_B, org_A, mask_B_lip, mask_A_lip, index_B_lip) * self.lambda_his_lip
        g_A_loss_his += g_A_lip_loss_his
        g_B_loss_his += g_B_lip_loss_his
    if self.skin==True:
        g_A_skin_loss_his = self.criterionHis(fake_A, ref_B, mask_A_skin, mask_B_skin, index_A_skin) * self.lambda_his_skin_1
        g_B_skin_loss_his = self.criterionHis(fake_B, org_A, mask_B_skin, mask_A_skin, index_B_skin) * self.lambda_his_skin_2
        g_A_loss_his += g_A_skin_loss_his
        g_B_loss_his += g_B_skin_loss_his
    if self.eye==True:
        g_A_eye_left_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_left, mask_B_eye_left, index_A_eye_left) * self.lambda_his_eye
        g_B_eye_left_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_left, mask_A_eye_left, index_B_eye_left) * self.lambda_his_eye
        g_A_eye_right_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_right, mask_B_eye_right, index_A_eye_right) * self.lambda_his_eye
        g_B_eye_right_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_right, mask_A_eye_right, index_B_eye_right) * self.lambda_his_eye
        g_A_loss_his += g_A_eye_left_loss_his + g_A_eye_right_loss_his
        g_B_loss_his += g_B_eye_left_loss_his + g_B_eye_right_loss_his
4. Data Collection
- Makeup Transfer(MT) dataset : Facial makeup dataset consisting of 3834 female imgs (1115 non-makeup + 2719 makeup)
- Some variations in race, pose, expression, background clutter
- Many makeup styles : smoky-eyes, flashy, Retro, Korean, Japanese, ...
- More than 3000 subjects
- Nude makeup imgs are classified into the Non-makeup category
- How : Initial data are crawled from websites → Low resolution imgs removed → Face alignment with 68 landmarks
- Spatial size : 256x256
- Test set : randomly selected 100 non-makeup imgs + 250 makeup imgs
- Training set and Validation set : separated on remaining imgs
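The split sizes follow directly from the counts above. The sketch below only illustrates the arithmetic and a reproducible random selection; the integer ids, the seeds and the helper `split_ids` are hypothetical, since the notes do not say how the test imgs were drawn beyond "randomly".

```python
import random

N_NON_MAKEUP, N_MAKEUP = 1115, 2719          # dataset sizes from the notes
TEST_NON_MAKEUP, TEST_MAKEUP = 100, 250      # test-set sizes from the notes

def split_ids(n_total, n_test, seed):
    """Return (test_ids, remaining_ids) for images numbered 0..n_total-1."""
    rng = random.Random(seed)
    test = set(rng.sample(range(n_total), n_test))
    rest = [i for i in range(n_total) if i not in test]
    return sorted(test), rest

test_a, rest_a = split_ids(N_NON_MAKEUP, TEST_NON_MAKEUP, seed=0)
test_b, rest_b = split_ids(N_MAKEUP, TEST_MAKEUP, seed=1)

# Remaining images are what gets divided into training and validation sets.
print(len(rest_a), len(rest_b))
```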
5. Experiments
- Network Architecture, Training setting, Performances, Component Analysis
5.1 Implementation Details
Network Architecture
(1) Generator G with 2 inputs and 2 outputs
- Front : 2 separate input branches with convolutions
- Middle : concatenate 2 branches and feed them into several residual blocks
- End : Upsampling output feature maps by 2 individual branches of transposed convolutions
- Branches don't share params within layers
- Instance Normalization for G
class Generator(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    def __init__(self, conv_dim=64, repeat_num=6):
        super(Generator, self).__init__()
        layers = []
        layers.append(nn.Conv2d(3, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers.append(nn.ReLU(inplace=True))
        # Down-Sampling
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2
        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))
        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2
        layers.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.Tanh())
        self.main = nn.Sequential(*layers)
    def forward(self, x):
        out = self.main(x)
        return out
class Generator_makeup(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    # input 2 images and output 2 images as well
    def __init__(self, conv_dim=64, repeat_num=6, input_nc=6):
        super(Generator_makeup, self).__init__()
        layers = []
        layers.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers.append(nn.ReLU(inplace=True))
        # Down-Sampling
        curr_dim = conv_dim
        for i in range(2):
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim * 2
        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))
        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2
        self.main = nn.Sequential(*layers)
        layers_1 = []
        layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_1.append(nn.Tanh())
        self.branch_1 = nn.Sequential(*layers_1)
        layers_2 = []
        layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_2.append(nn.Tanh())
        self.branch_2 = nn.Sequential(*layers_2)
    def forward(self, x, y):
        input_x = torch.cat((x, y), dim=1)
        out = self.main(input_x)
        out_A = self.branch_1(out)
        out_B = self.branch_2(out)
        return out_A, out_B
class Generator_branch(nn.Module):
    """Generator. Encoder-Decoder Architecture."""
    # input 2 images and output 2 images as well
    def __init__(self, conv_dim=64, repeat_num=6, input_nc=3):
        super(Generator_branch, self).__init__()
        # Branch input
        layers_branch = []
        layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        self.Branch_0 = nn.Sequential(*layers_branch)
        # Branch input
        layers_branch = []
        layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
        layers_branch.append(nn.ReLU(inplace=True))
        self.Branch_1 = nn.Sequential(*layers_branch)
        # Down-Sampling, branch merge
        layers = []
        curr_dim = conv_dim*2
        layers.append(nn.Conv2d(curr_dim*2, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
        layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
        layers.append(nn.ReLU(inplace=True))
        curr_dim = curr_dim * 2
        # Bottleneck
        for i in range(repeat_num):
            layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))
        # Up-Sampling
        for i in range(2):
            layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
            layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
            layers.append(nn.ReLU(inplace=True))
            curr_dim = curr_dim // 2
        self.main = nn.Sequential(*layers)
        layers_1 = []
        layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_1.append(nn.ReLU(inplace=True))
        layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_1.append(nn.ReLU(inplace=True))
        layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_1.append(nn.Tanh())
        self.branch_1 = nn.Sequential(*layers_1)
        layers_2 = []
        layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_2.append(nn.ReLU(inplace=True))
        layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
        layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
        layers_2.append(nn.ReLU(inplace=True))
        layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
        layers_2.append(nn.Tanh())
        self.branch_2 = nn.Sequential(*layers_2)
    def forward(self, x, y):
        input_x = self.Branch_0(x)
        input_y = self.Branch_1(y)
        input_fuse = torch.cat((input_x, input_y), dim=1)
        out = self.main(input_fuse)
        out_A = self.branch_1(out)
        out_B = self.branch_2(out)
        return out_A, out_B
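To check that `Generator_branch` matches the front / middle / end description, the channel counts and spatial sizes can be traced with the standard convolution size formulas. This is a sketch in plain Python, assuming 256x256 inputs (the dataset's spatial size) and assuming `ResidualBlock`, which is not listed here, preserves shape; `generator_branch_trace` is a hypothetical helper.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of nn.Conv2d along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Output size of nn.ConvTranspose2d (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel

def generator_branch_trace(size=256, conv_dim=64):
    """List (stage, channels, spatial size) through Generator_branch."""
    trace = []
    # Front: each of the two inputs goes through its own branch.
    size = conv_out(size, 7, 1, 3)
    trace.append(("branch conv7", conv_dim, size))
    size = conv_out(size, 4, 2, 1)
    trace.append(("branch down", conv_dim * 2, size))
    # Middle: concatenating the two branches doubles the channels.
    channels = conv_dim * 2 * 2
    trace.append(("concat", channels, size))
    size = conv_out(size, 4, 2, 1)
    trace.append(("merged down + residual blocks", channels, size))
    # End: two transposed convolutions, then per-output 3x3/3x3/7x7 heads.
    for _ in range(2):
        channels //= 2
        size = deconv_out(size, 4, 2, 1)
        trace.append(("up", channels, size))
    size = conv_out(conv_out(conv_out(size, 3, 1, 1), 3, 1, 1), 7, 1, 3)
    trace.append(("output head + tanh", 3, size))
    return trace

for stage in generator_branch_trace():
    print(stage)
```

Both output heads end at 3 channels and the input resolution, so the two generated imgs have the same shape as the two inputs.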
(2) Discriminator D_A, D_B
- Identical 70x70 PatchGANs : classify local overlapping img patches to be real or fake
class Discriminator(nn.Module):
    """Discriminator. PatchGAN."""
    # NOTE: SpectralNorm(...) is called on a module here, so it is presumably the
    # spectral_norm(module) wrapper imported under this name (assumption).
    def __init__(self, image_size=128, conv_dim=64, repeat_num=3, norm='SN'):
        super(Discriminator, self).__init__()
        layers = []
        if norm=='SN':
            layers.append(SpectralNorm(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1)))
        else:
            layers.append(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1))
        layers.append(nn.LeakyReLU(0.01, inplace=True))
        curr_dim = conv_dim
        for i in range(1, repeat_num):
            if norm=='SN':
                layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1)))
            else:
                layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1))
            layers.append(nn.LeakyReLU(0.01, inplace=True))
            curr_dim = curr_dim * 2
        #k_size = int(image_size / np.power(2, repeat_num))
        if norm=='SN':
            layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1)))
        else:
            layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1))
        layers.append(nn.LeakyReLU(0.01, inplace=True))
        curr_dim = curr_dim *2
        self.main = nn.Sequential(*layers)
        if norm=='SN':
            self.conv1 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False))
        else:
            self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False)
        # conv1 keeps a square patch map: 256*256 --> 30*30
        #self.conv2 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=k_size, bias=False))
        #conv2 output a single number
    def forward(self, x):
        h = self.main(x)
        #out_real = self.conv1(h)
        out_makeup = self.conv1(h)
        #return out_real.squeeze(), out_makeup.squeeze()
        return out_makeup.squeeze()
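Both numbers quoted for the discriminator, the 70x70 patch and the 30x30 output map, can be checked from the layer list alone. The sketch below assumes the default `repeat_num=3`, which gives three stride-2 convolutions followed by two stride-1 convolutions, all with kernel 4 and padding 1; the helper names are hypothetical.

```python
# (kernel, stride, padding) of each conv in Discriminator with repeat_num=3
DISC_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

def output_size(size, layers):
    """Spatial size of the patch map for a square input of the given size."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size

def receptive_field(layers):
    """Input patch width seen by one output unit, walking back from the output."""
    rf = 1
    for kernel, stride, _ in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

print(output_size(256, DISC_LAYERS))   # side of the real/fake score map
print(receptive_field(DISC_LAYERS))    # side of the patch each score judges
```

Each entry of the 30x30 map is therefore a real/fake score for one overlapping 70x70 patch of the 256x256 input.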
Training Details
- 2 Additional strategies to stabilize training and generate high quality imgs
- (1) Replacing all negative log likelihood in Adversarial loss by least square loss
- (2) Spectral Normalization : stably training Discriminators
def l2normalize(v, eps=1e-12):
    return v / (v.norm() + eps)
class SpectralNorm(object):
    def __init__(self):
        self.name = "weight"
        #print(self.name)
        self.power_iterations = 1
    def compute_weight(self, module):
        u = getattr(module, self.name + "_u")
        v = getattr(module, self.name + "_v")
        w = getattr(module, self.name + "_bar")
        height = w.data.shape[0]
        for _ in range(self.power_iterations):
            v.data = l2normalize(torch.mv(torch.t(w.view(height,-1).data), u.data))
            u.data = l2normalize(torch.mv(w.view(height,-1).data, v.data))
        # sigma = torch.dot(u.data, torch.mv(w.view(height,-1).data, v.data))
        sigma = u.dot(w.view(height, -1).mv(v))
        return w / sigma.expand_as(w)
    @staticmethod
    def apply(module):
        name = "weight"
        fn = SpectralNorm()
        try:
            u = getattr(module, name + "_u")
            v = getattr(module, name + "_v")
            w = getattr(module, name + "_bar")
        except AttributeError:
            w = getattr(module, name)
            height = w.data.shape[0]
            width = w.view(height, -1).data.shape[1]
            u = Parameter(w.data.new(height).normal_(0, 1), requires_grad=False)
            v = Parameter(w.data.new(width).normal_(0, 1), requires_grad=False)
            w_bar = Parameter(w.data)
            #del module._parameters[name]
            module.register_parameter(name + "_u", u)
            module.register_parameter(name + "_v", v)
            module.register_parameter(name + "_bar", w_bar)
        # remove w from parameter list
        del module._parameters[name]
        setattr(module, name, fn.compute_weight(module))
        # recompute weight before every forward()
        module.register_forward_pre_hook(fn)
        return fn
    def remove(self, module):
        weight = self.compute_weight(module)
        delattr(module, self.name)
        del module._parameters[self.name + '_u']
        del module._parameters[self.name + '_v']
        del module._parameters[self.name + '_bar']
        module.register_parameter(self.name, Parameter(weight.data))
    def __call__(self, module, inputs):
        setattr(module, self.name, self.compute_weight(module))
def spectral_norm(module):
    SpectralNorm.apply(module)
    return module
def remove_spectral_norm(module):
    name = 'weight'
    for k, hook in module._forward_pre_hooks.items():
        if isinstance(hook, SpectralNorm) and hook.name == name:
            hook.remove(module)
            del module._forward_pre_hooks[k]
            return module
    raise ValueError("spectral_norm of '{}' not found in {}"
                     .format(name, module))
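The core of `compute_weight` is a power iteration that estimates the largest singular value of the flattened weight matrix. A NumPy sketch of just that step, with a hypothetical function name `spectral_normalize` and a tiny hand-made matrix whose largest singular value is known to be 3:

```python
import numpy as np

def l2normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def spectral_normalize(w, u, power_iterations=1):
    """Divide w by a power-iteration estimate of its largest singular value.

    w: weight reshaped to (height, width); u: running left vector (height,).
    Returns the normalized weight, the updated u, and the estimate sigma.
    """
    for _ in range(power_iterations):
        v = l2normalize(w.T @ u)
        u = l2normalize(w @ v)
    sigma = u @ (w @ v)
    return w / sigma, u, sigma

w = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
u0 = np.array([1.0, 1.0])

_, _, sigma_one = spectral_normalize(w, u0, power_iterations=1)
w_sn, _, sigma_many = spectral_normalize(w, u0, power_iterations=50)
print(sigma_one, sigma_many)
```

With a single iteration per call, as in the class above, the estimate is rough at first; it improves over training because `u` is stored on the module and reused at the next forward pass.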
- Masks : labels on different facial regions are annotated by a PSPNet trained for face segmentation
- relu_4_1 feature layer of VGG16(pre-trained on ImageNet) : Identity preserving
- Parameters fixed all through training process : α=1, β=10, γ=0.005, λ_l=1, λ_s=1, λ_f=0.1
parser.add_argument('--lambda_cls', default='1', type=float, help='the lambda_cls weight')
parser.add_argument('--lambda_rec', default='10', type=int, help='lambda_A and lambda_B')
parser.add_argument('--lambda_vgg', default='5e-3', type=float, help='the param of vgg loss')
parser.add_argument('--lambda_his', default='1', type=float, help='histogram loss on lips')
parser.add_argument('--lambda_eye', default='1', type=float, help='histogram loss on eyes equals to lambda_his*lambda_eye')
parser.add_argument('--lambda_skin_1', default='0.1', type=float, help='histogram loss on skin equals to lambda_his* lambda_skin')
parser.add_argument('--lambda_skin_2', default='0.1', type=float, help='histogram loss on skin equals to lambda_his* lambda_skin')
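According to the help strings, the eye and skin weights are applied on top of `lambda_his`. A sketch of how the effective per-region weights would then come out with the defaults, assuming the solver simply multiplies them (the mapping to `self.lambda_his_lip`, `self.lambda_his_eye` and so on is not shown in these notes; `effective_histogram_weights` is a hypothetical helper):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--lambda_his', default=1.0, type=float)
parser.add_argument('--lambda_eye', default=1.0, type=float)
parser.add_argument('--lambda_skin_1', default=0.1, type=float)
parser.add_argument('--lambda_skin_2', default=0.1, type=float)
args = parser.parse_args([])   # defaults only, no command-line input

def effective_histogram_weights(args):
    """Per-region histogram-loss weights implied by the help strings."""
    return {
        "lips": args.lambda_his,
        "eye": args.lambda_his * args.lambda_eye,
        "skin_1": args.lambda_his * args.lambda_skin_1,
        "skin_2": args.lambda_his * args.lambda_skin_2,
    }

weights = effective_histogram_weights(args)
print(weights)
```

Under that assumption the defaults reproduce λ_l=1, λ_s=1, λ_f=0.1 from the parameter list above.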
- Training network from scratch using Adam (lr=0.0002, batch_size= 1)
parser.add_argument('--batch_size', default='1', type=int, help='batch_size')
parser.add_argument('--LR', default="2e-4", type=float, help='Learning rate')
5.2 Baselines
- Digital Face Makeup : early makeup transfer work, applying traditional img processing method
- DTN : SOTA makeup transfer work, proposing deep localized makeup transfer network
- Deep Image Analogy : visual attribute transfer across two semantic-related imgs
- to match features extracted from DNN
- CycleGAN : unsupervised img-to-img translation work
- BeautyGAN : modify generator in CycleGAN with 2 branches as input, but maintain all others
- Style Transfer : training a feed-forward network for synthesizing style and content
- non-makeup img as content & reference makeup img as style
5.3 Comparison Against Baselines
- Qualitative evaluation
- [11] : visible artifacts, mismatch problem around facial and eyes contour, incorrect details are transferred (eye shadows)
- [23] : alignment artifacts around eye areas and lips area, incorrect details are transferred (foundation and eye shadows)
- [13] Style transfer : grain-like artifacts (transfer global style → infeasible for delicate makeup transfer)
- [35] CycleGAN : realistic imgs BUT makeup style are not consistent with references
- [21] : similar makeup styles as references and natural results BUT also transfers non-facial features from references
- Ex. background color from black to blue, hair color, pupil colors
- lighter makeup styles than references (lipsticks, eye shadows, ...)
- BeautyGAN keeps other makeup-irrelevant components intact as in the original non-makeup imgs (hair, clothes, bg, ...)
- Quantitative comparison
- User study with 84 volunteers to demonstrate that BeautyGAN performs better than other baselines
- Randomly choose 10 non-makeup test imgs + 20 makeup test imgs
- 10x20 after-makeup results for each makeup transfer method
- Comparison with [21] and [23]
- 5 imgs (1 non-makeup, 1 makeup as ref, 3 randomly shuffled makeup transfer imgs generated from diff methods)
- Rank of 3 generated imgs (based on quality + realism + makeup style similarity)
5.4 Component Analysis of BeautyGAN
- Ablation study to investigate importance of each component in overall objective function
- Main Analysis : Effect of Perceptual loss term, Makeup loss term
- Adversarial loss and Cycle consistency loss are kept in all settings
- [Table 2] : Settings / [Figure 6] : Results
(1) A : Remove L_per
- Result : all generated imgs look like the two inputs warped and merged at pixel level
- ↔ Other experiments : identities of non-makeup faces are maintained
- Perceptual loss : to preserve img identity
(2) B, C, D : L_make (L_face, L_shadow, L_lips) = 3 local histogram loss acted on diff cosmetic regions
- B : Remove L_make → makeup style is not transferred
- Makeup loss : needed for instance-level makeup transfer
6. Conclusion and Future work
- A dual input/output BeautyGAN for Instance-level facial makeup transfer
- 1 Generator : realizing makeup and anti-makeup simultaneously in a single forward pass
- Pixel-level histogram loss : to constrain similarity of makeup style
- Perceptual loss and Cycle consistency loss : to preserve identity
- Experimental results : Significant performance gain over other existing approaches
Code
https://github.com/wtjiang98/BeautyGAN_pytorch
(1) train
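import torch.backends.cudnn as cudnn

# args, config, dataset_config, get_loader and the Solver_* classes come from the repository
def train_net():
    # Let cuDNN benchmark and pick the fastest convolution algorithms
    cudnn.benchmark = True
    # get_loader returns both the train and the test loader
    data_loaders = get_loader(dataset_config, config, mode="train")
    # Pick the solver that matches the requested model
    if args.model == 'cycleGAN':
        solver = Solver_cycleGAN(data_loaders, config, dataset_config)
    elif args.model == 'makeupGAN':
        solver = Solver_makeupGAN(data_loaders, config, dataset_config)
    else:
        print("Unsupported model: {}".format(args.model))
        exit()
    solver.train()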
(2) GANLoss
import torch
import torch.nn as nn
from torch.autograd import Variable


class GANLoss(nn.Module):
    """GAN objective (LSGAN or regular GAN) that builds the target label tensor internally."""

    def __init__(self, use_lsgan=True, target_real_label=1.0, target_fake_label=0.0,
                 tensor=torch.FloatTensor):
        super(GANLoss, self).__init__()
        self.real_label = target_real_label
        self.fake_label = target_fake_label
        # Cached label tensors, rebuilt only when the input size changes
        self.real_label_var = None
        self.fake_label_var = None
        self.Tensor = tensor
        if use_lsgan:
            self.loss = nn.MSELoss()  # least-squares GAN
        else:
            self.loss = nn.BCELoss()  # regular GAN; expects probabilities in [0, 1]

    def get_target_tensor(self, input, target_is_real):
        target_tensor = None
        if target_is_real:
            create_label = ((self.real_label_var is None) or
                            (self.real_label_var.numel() != input.numel()))
            if create_label:
                real_tensor = self.Tensor(input.size()).fill_(self.real_label)
                self.real_label_var = Variable(real_tensor, requires_grad=False)
            target_tensor = self.real_label_var
        else:
            create_label = ((self.fake_label_var is None) or
                            (self.fake_label_var.numel() != input.numel()))
            if create_label:
                fake_tensor = self.Tensor(input.size()).fill_(self.fake_label)
                self.fake_label_var = Variable(fake_tensor, requires_grad=False)
            target_tensor = self.fake_label_var
        return target_tensor

    def __call__(self, input, target_is_real):
        target_tensor = self.get_target_tensor(input, target_is_real)
        return self.loss(input, target_tensor)
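To make the two branches of `GANLoss` concrete, the following dependency-free sketch computes the least-squares (MSE) and binary cross-entropy objectives on a flat list of discriminator outputs. `lsgan_loss` and `bce_loss` are hypothetical helper names and the patch scores are made-up numbers; the sketch only mirrors the arithmetic, not the tensor caching of the class above.

```python
import math

def lsgan_loss(scores, target_is_real, real_label=1.0, fake_label=0.0):
    """Mean squared error against a constant target, as nn.MSELoss would compute."""
    target = real_label if target_is_real else fake_label
    return sum((s - target) ** 2 for s in scores) / len(scores)

def bce_loss(probs, target_is_real):
    """Binary cross-entropy; inputs must already be probabilities in (0, 1)."""
    if target_is_real:
        return -sum(math.log(p) for p in probs) / len(probs)
    return -sum(math.log(1.0 - p) for p in probs) / len(probs)

real_scores = [0.9, 0.8, 1.0, 0.7]   # D output on a real image, one value per patch
fake_scores = [0.2, 0.1, 0.0, 0.3]   # D output on a generated image

# Discriminator objective: average of the real half and the fake half
d_loss = 0.5 * (lsgan_loss(real_scores, True) + lsgan_loss(fake_scores, False))
# Generator objective: push D to score the fake as real
g_loss = lsgan_loss(fake_scores, True)
print(d_loss, g_loss)
```

The same `0.5 * (real + fake)` averaging appears in the discriminator updates of both solvers below.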
(3) cycleGAN
def build_model(self):
# Define generators and discriminators
self.G_A = net.Generator(self.g_conv_dim, self.g_repeat_num)
self.G_B = net.Generator(self.g_conv_dim, self.g_repeat_num)
self.D_A = net.Discriminator(self.img_size, self.d_conv_dim, self.d_repeat_num)
self.D_B = net.Discriminator(self.img_size, self.d_conv_dim, self.d_repeat_num)
self.criterionL1 = torch.nn.L1Loss()
self.criterionGAN = GANLoss(use_lsgan=True, tensor =torch.cuda.FloatTensor)
# Optimizers
self.g_optimizer = torch.optim.Adam(itertools.chain(self.G_A.parameters(), self.G_B.parameters()),
self.g_lr, [self.beta1, self.beta2])
self.d_A_optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.D_A.parameters()), self.d_lr, [self.beta1, self.beta2])
self.d_B_optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.D_B.parameters()), self.d_lr, [self.beta1, self.beta2])
self.G_A.apply(self.weights_init_xavier)
self.D_A.apply(self.weights_init_xavier)
self.G_B.apply(self.weights_init_xavier)
self.D_B.apply(self.weights_init_xavier)
# Print networks
# self.print_network(self.E, 'E')
self.print_network(self.G_A, 'G_A')
self.print_network(self.D_A, 'D_A')
self.print_network(self.G_B, 'G_B')
self.print_network(self.D_B, 'D_B')
if torch.cuda.is_available():
self.G_A.cuda()
self.G_B.cuda()
self.D_A.cuda()
self.D_B.cuda()
def train(self):
"""Train CycleGAN within a single dataset."""
# The number of iterations per epoch
self.iters_per_epoch = len(self.data_loader_train)
# Start with trained model if exists
g_lr = self.g_lr
d_lr = self.d_lr
if self.checkpoint:
start = int(self.checkpoint.split('_')[0])
else:
start = 0
# Start training
self.start_time = time.time()
for self.e in range(start, self.num_epochs):
for self.i, (img_A, img_B, _, _) in enumerate(self.data_loader_train):
# Convert tensor to variable
org_A = self.to_var(img_A, requires_grad=False)
ref_B = self.to_var(img_B, requires_grad=False)
# ================== Train D ================== #
# training D_A
# Real
out = self.D_A(ref_B)
d_loss_real = self.criterionGAN(out, True)
# Fake
fake = self.G_A(org_A)
fake = Variable(fake.data)
fake = fake.detach()
out = self.D_A(fake)
#d_loss_fake = self.get_D_loss(out, "fake")
d_loss_fake = self.criterionGAN(out, False)
# Backward + Optimize
d_loss = (d_loss_real + d_loss_fake) * 0.5
self.d_A_optimizer.zero_grad()
d_loss.backward(retain_graph=True)
self.d_A_optimizer.step()
# Logging
self.loss = {}
self.loss['D-A-loss_real'] = d_loss_real.item()
# training D_B
# Real
out = self.D_B(org_A)
d_loss_real = self.criterionGAN(out, True)
# Fake
fake = self.G_B(ref_B)
fake = Variable(fake.data)
fake = fake.detach()
out = self.D_B(fake)
#d_loss_fake = self.get_D_loss(out, "fake")
d_loss_fake = self.criterionGAN(out, False)
# Backward + Optimize
d_loss = (d_loss_real + d_loss_fake) * 0.5
self.d_B_optimizer.zero_grad()
d_loss.backward(retain_graph=True)
self.d_B_optimizer.step()
# Logging
self.loss['D-B-loss_real'] = d_loss_real.item()
# ================== Train G ================== #
if (self.i + 1) % self.ndis == 0:
# adversarial loss, i.e. L_trans,v in the paper
# identity loss
if self.lambda_idt > 0:
# G_A should be identity if ref_B is fed
idt_A = self.G_A(ref_B)
loss_idt_A = self.criterionL1(idt_A, ref_B) * self.lambda_B * self.lambda_idt
# G_B should be identity if org_A is fed
idt_B = self.G_B(org_A)
loss_idt_B = self.criterionL1(idt_B, org_A) * self.lambda_A * self.lambda_idt
g_loss_idt = loss_idt_A + loss_idt_B
else:
g_loss_idt = 0
# GAN loss D_A(G_A(A))
fake_B = self.G_A(org_A)
pred_fake = self.D_A(fake_B)
g_A_loss_adv = self.criterionGAN(pred_fake, True)
#g_loss_adv = self.get_G_loss(out)
# GAN loss D_B(G_B(B))
fake_A = self.G_B(ref_B)
pred_fake = self.D_B(fake_A)
g_B_loss_adv = self.criterionGAN(pred_fake, True)
# Forward cycle loss
rec_A = self.G_B(fake_B)
g_loss_rec_A = self.criterionL1(rec_A, org_A) * self.lambda_A
# Backward cycle loss
rec_B = self.G_A(fake_A)
g_loss_rec_B = self.criterionL1(rec_B, ref_B) * self.lambda_B
# Combined loss
g_loss = g_A_loss_adv + g_B_loss_adv + g_loss_rec_A + g_loss_rec_B + g_loss_idt
self.g_optimizer.zero_grad()
g_loss.backward(retain_graph=True)
self.g_optimizer.step()
# Logging
self.loss['G-A-loss_adv'] = g_A_loss_adv.item()
self.loss['G-B-loss_adv'] = g_B_loss_adv.item()
self.loss['G-loss_org'] = g_loss_rec_A.item()
self.loss['G-loss_ref'] = g_loss_rec_B.item()
# float() also covers the plain 0 assigned when lambda_idt == 0
self.loss['G-loss_idt'] = float(g_loss_idt)
# Print out log info
if (self.i + 1) % self.log_step == 0:
self.log_terminal()
#plot the figures
for key_now in self.loss.keys():
plot_fig.plot(key_now, self.loss[key_now])
#save the images
if (self.i + 1) % self.vis_step == 0:
print("Saving middle output...")
self.vis_train([org_A, ref_B, fake_A, fake_B, rec_A, rec_B])
self.vis_test()
# Save model checkpoints
if (self.i + 1) % self.snapshot_step == 0:
self.save_models()
if (self.i % 100 == 99):
plot_fig.flush(self.task_name)
plot_fig.tick()
# Decay learning rate
if (self.e+1) > (self.num_epochs - self.num_epochs_decay):
g_lr -= (self.g_lr / float(self.num_epochs_decay))
d_lr -= (self.d_lr / float(self.num_epochs_decay))
self.update_lr(g_lr, d_lr)
print('Decay learning rate to g_lr: {}, d_lr:{}.'.format(g_lr, d_lr))
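The decay step at the end of the training loop above is linear: once the last `num_epochs_decay` epochs begin, a fixed fraction of the base rate is subtracted after every epoch. The sketch below replays that arithmetic outside the solver; `linear_decay_schedule` is a hypothetical helper and the hyperparameter values are illustrative assumptions, not the repository's defaults.

```python
def linear_decay_schedule(base_lr, num_epochs, num_epochs_decay):
    """Learning rate in effect after each epoch, mirroring the decay condition above."""
    lr = base_lr
    schedule = []
    for epoch in range(num_epochs):
        if (epoch + 1) > (num_epochs - num_epochs_decay):
            lr -= base_lr / float(num_epochs_decay)
        schedule.append(lr)
    return schedule

lrs = linear_decay_schedule(base_lr=2e-4, num_epochs=10, num_epochs_decay=5)
for epoch, lr in enumerate(lrs, start=1):
    print("epoch {:2d}: lr = {:.6f}".format(epoch, lr))
```

With these values the rate stays at the base value for the first five epochs and reaches approximately zero after the last one.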
(4) makeupGAN
def build_model(self):
# Define generators and discriminators
if self.whichG=='normal':
self.G = net.Generator_makeup(self.g_conv_dim, self.g_repeat_num)
if self.whichG=='branch':
self.G = net.Generator_branch(self.g_conv_dim, self.g_repeat_num)
for i in self.cls:
setattr(self, "D_" + i, net.Discriminator(self.img_size, self.d_conv_dim, self.d_repeat_num, self.norm))
self.criterionL1 = torch.nn.L1Loss()
self.criterionL2 = torch.nn.MSELoss()
self.criterionGAN = GANLoss(use_lsgan=True, tensor =torch.cuda.FloatTensor)
self.vgg = net.VGG()
self.vgg.load_state_dict(torch.load('addings/vgg_conv.pth'))
# Optimizers
self.g_optimizer = torch.optim.Adam(self.G.parameters(), self.g_lr, [self.beta1, self.beta2])
for i in self.cls:
setattr(self, "d_" + i + "_optimizer", \
torch.optim.Adam(filter(lambda p: p.requires_grad, getattr(self, "D_" + i).parameters()), \
self.d_lr, [self.beta1, self.beta2]))
# Weights initialization
self.G.apply(self.weights_init_xavier)
for i in self.cls:
getattr(self, "D_" + i).apply(self.weights_init_xavier)
# Print networks
self.print_network(self.G, 'G')
for i in self.cls:
self.print_network(getattr(self, "D_" + i), "D_" + i)
if torch.cuda.is_available():
self.G.cuda()
self.vgg.cuda()
for i in self.cls:
getattr(self, "D_" + i).cuda()
def train(self):
"""Train BeautyGAN (makeupGAN) within a single dataset."""
# The number of iterations per epoch
self.iters_per_epoch = len(self.data_loader_train)
# Start with trained model if exists
cls_A = self.cls[0]
cls_B = self.cls[1]
g_lr = self.g_lr
d_lr = self.d_lr
if self.checkpoint:
start = int(self.checkpoint.split('_')[0])
self.vis_test()
else:
start = 0
# Start training
self.start_time = time.time()
for self.e in range(start, self.num_epochs):
for self.i, (img_A, img_B, mask_A, mask_B) in enumerate(self.data_loader_train):
# Convert tensor to variable
# mask attribute: 0: background 1: face 2: left-eyebrow 3: right-eyebrow 4: left-eye 5: right-eye 6: nose
# 7: upper-lip 8: teeth 9: under-lip 10: hair 11: left-ear 12: right-ear 13: neck
if self.checkpoint or self.direct:
if self.lips==True:
mask_A_lip = (mask_A==7).float() + (mask_A==9).float()
mask_B_lip = (mask_B==7).float() + (mask_B==9).float()
mask_A_lip, mask_B_lip, index_A_lip, index_B_lip = self.mask_preprocess(mask_A_lip, mask_B_lip)
if self.skin==True:
mask_A_skin = (mask_A==1).float() + (mask_A==6).float() + (mask_A==13).float()
mask_B_skin = (mask_B==1).float() + (mask_B==6).float() + (mask_B==13).float()
mask_A_skin, mask_B_skin, index_A_skin, index_B_skin = self.mask_preprocess(mask_A_skin, mask_B_skin)
if self.eye==True:
mask_A_eye_left = (mask_A==4).float()
mask_A_eye_right = (mask_A==5).float()
mask_B_eye_left = (mask_B==4).float()
mask_B_eye_right = (mask_B==5).float()
mask_A_face = (mask_A==1).float() + (mask_A==6).float()
mask_B_face = (mask_B==1).float() + (mask_B==6).float()
# avoid the situation that images with eye closed
if not ((mask_A_eye_left>0).any() and (mask_B_eye_left>0).any() and \
(mask_A_eye_right > 0).any() and (mask_B_eye_right > 0).any()):
continue
mask_A_eye_left, mask_A_eye_right = self.rebound_box(mask_A_eye_left, mask_A_eye_right, mask_A_face)
mask_B_eye_left, mask_B_eye_right = self.rebound_box(mask_B_eye_left, mask_B_eye_right, mask_B_face)
mask_A_eye_left, mask_B_eye_left, index_A_eye_left, index_B_eye_left = \
self.mask_preprocess(mask_A_eye_left, mask_B_eye_left)
mask_A_eye_right, mask_B_eye_right, index_A_eye_right, index_B_eye_right = \
self.mask_preprocess(mask_A_eye_right, mask_B_eye_right)
org_A = self.to_var(img_A, requires_grad=False)
ref_B = self.to_var(img_B, requires_grad=False)
# ================== Train D ================== #
# training D_A, D_A aims to distinguish class B
# Real
out = getattr(self, "D_" + cls_A)(ref_B)
d_loss_real = self.criterionGAN(out, True)
# Fake
fake_A, fake_B = self.G(org_A, ref_B)
fake_A = Variable(fake_A.data).detach()
fake_B = Variable(fake_B.data).detach()
out = getattr(self, "D_" + cls_A)(fake_A)
#d_loss_fake = self.get_D_loss(out, "fake")
d_loss_fake = self.criterionGAN(out, False)
# Backward + Optimize
d_loss = (d_loss_real + d_loss_fake) * 0.5
getattr(self, "d_" + cls_A + "_optimizer").zero_grad()
d_loss.backward(retain_graph=True)
getattr(self, "d_" + cls_A + "_optimizer").step()
# Logging
self.loss = {}
self.loss['D-A-loss_real'] = d_loss_real.item()
# training D_B, D_B aims to distinguish class A
# Real
out = getattr(self, "D_" + cls_B)(org_A)
d_loss_real = self.criterionGAN(out, True)
# Fake
out = getattr(self, "D_" + cls_B)(fake_B)
#d_loss_fake = self.get_D_loss(out, "fake")
d_loss_fake = self.criterionGAN(out, False)
# Backward + Optimize
d_loss = (d_loss_real + d_loss_fake) * 0.5
getattr(self, "d_" + cls_B + "_optimizer").zero_grad()
d_loss.backward(retain_graph=True)
getattr(self, "d_" + cls_B + "_optimizer").step()
# Logging
self.loss['D-B-loss_real'] = d_loss_real.item()
# ================== Train G ================== #
if (self.i + 1) % self.ndis == 0:
# adversarial loss, i.e. L_trans,v in the paper
# identity loss
if self.lambda_idt > 0:
# G should be identity if ref_B or org_A is fed
idt_A1, idt_A2 = self.G(org_A, org_A)
idt_B1, idt_B2 = self.G(ref_B, ref_B)
loss_idt_A1 = self.criterionL1(idt_A1, org_A) * self.lambda_A * self.lambda_idt
loss_idt_A2 = self.criterionL1(idt_A2, org_A) * self.lambda_A * self.lambda_idt
loss_idt_B1 = self.criterionL1(idt_B1, ref_B) * self.lambda_B * self.lambda_idt
loss_idt_B2 = self.criterionL1(idt_B2, ref_B) * self.lambda_B * self.lambda_idt
# loss_idt
loss_idt = (loss_idt_A1 + loss_idt_A2 + loss_idt_B1 + loss_idt_B2) * 0.5
else:
loss_idt = 0
# GAN loss D_A(G_A(A))
# fake_A in class B,
fake_A, fake_B = self.G(org_A, ref_B)
pred_fake = getattr(self, "D_" + cls_A)(fake_A)
g_A_loss_adv = self.criterionGAN(pred_fake, True)
#g_loss_adv = self.get_G_loss(out)
# GAN loss D_B(G_B(B))
pred_fake = getattr(self, "D_" + cls_B)(fake_B)
g_B_loss_adv = self.criterionGAN(pred_fake, True)
rec_B, rec_A = self.G(fake_B, fake_A)
# color_histogram loss
g_A_loss_his = 0
g_B_loss_his = 0
if self.checkpoint or self.direct:
if self.lips==True:
g_A_lip_loss_his = self.criterionHis(fake_A, ref_B, mask_A_lip, mask_B_lip, index_A_lip) * self.lambda_his_lip
g_B_lip_loss_his = self.criterionHis(fake_B, org_A, mask_B_lip, mask_A_lip, index_B_lip) * self.lambda_his_lip
g_A_loss_his += g_A_lip_loss_his
g_B_loss_his += g_B_lip_loss_his
if self.skin==True:
g_A_skin_loss_his = self.criterionHis(fake_A, ref_B, mask_A_skin, mask_B_skin, index_A_skin) * self.lambda_his_skin_1
g_B_skin_loss_his = self.criterionHis(fake_B, org_A, mask_B_skin, mask_A_skin, index_B_skin) * self.lambda_his_skin_2
g_A_loss_his += g_A_skin_loss_his
g_B_loss_his += g_B_skin_loss_his
if self.eye==True:
g_A_eye_left_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_left, mask_B_eye_left, index_A_eye_left) * self.lambda_his_eye
g_B_eye_left_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_left, mask_A_eye_left, index_B_eye_left) * self.lambda_his_eye
g_A_eye_right_loss_his = self.criterionHis(fake_A, ref_B, mask_A_eye_right, mask_B_eye_right, index_A_eye_right) * self.lambda_his_eye
g_B_eye_right_loss_his = self.criterionHis(fake_B, org_A, mask_B_eye_right, mask_A_eye_right, index_B_eye_right) * self.lambda_his_eye
g_A_loss_his += g_A_eye_left_loss_his + g_A_eye_right_loss_his
g_B_loss_his += g_B_eye_left_loss_his + g_B_eye_right_loss_his
# cycle loss
g_loss_rec_A = self.criterionL1(rec_A, org_A) * self.lambda_A
g_loss_rec_B = self.criterionL1(rec_B, ref_B) * self.lambda_B
# vgg loss
vgg_org = self.vgg(org_A, self.content_layer)[0]
vgg_org = Variable(vgg_org.data).detach()
vgg_fake_A = self.vgg(fake_A, self.content_layer)[0]
g_loss_A_vgg = self.criterionL2(vgg_fake_A, vgg_org) * self.lambda_A * self.lambda_vgg
vgg_ref = self.vgg(ref_B, self.content_layer)[0]
vgg_ref = Variable(vgg_ref.data).detach()
vgg_fake_B = self.vgg(fake_B, self.content_layer)[0]
g_loss_B_vgg = self.criterionL2(vgg_fake_B, vgg_ref) * self.lambda_B * self.lambda_vgg
loss_rec = (g_loss_rec_A + g_loss_rec_B + g_loss_A_vgg + g_loss_B_vgg) * 0.5
# Combined loss
g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt
if self.checkpoint or self.direct:
g_loss = g_A_loss_adv + g_B_loss_adv + loss_rec + loss_idt + g_A_loss_his + g_B_loss_his
self.g_optimizer.zero_grad()
g_loss.backward(retain_graph=True)
self.g_optimizer.step()
# Logging
self.loss['G-A-loss-adv'] = g_A_loss_adv.item()
self.loss['G-B-loss-adv'] = g_B_loss_adv.item()
self.loss['G-loss-org'] = g_loss_rec_A.item()
self.loss['G-loss-ref'] = g_loss_rec_B.item()
# float() also covers the plain 0 assigned when lambda_idt == 0
self.loss['G-loss-idt'] = float(loss_idt)
self.loss['G-loss-img-rec'] = (g_loss_rec_A + g_loss_rec_B).item()
self.loss['G-loss-vgg-rec'] = (g_loss_A_vgg + g_loss_B_vgg).item()
if self.direct:
# float() also covers the plain 0 kept when no region (lips/skin/eye) is enabled
self.loss['G-A-loss-his'] = float(g_A_loss_his)
self.loss['G-B-loss-his'] = float(g_B_loss_his)
# Print out log info
if (self.i + 1) % self.log_step == 0:
self.log_terminal()
#plot the figures
for key_now in self.loss.keys():
plot_fig.plot(key_now, self.loss[key_now])
#save the images
if (self.i + 1) % self.vis_step == 0:
print("Saving middle output...")
self.vis_train([org_A, ref_B, fake_A, fake_B, rec_A, rec_B])
# Save model checkpoints
if (self.i + 1) % self.snapshot_step == 0:
self.save_models()
if (self.i % 100 == 99):
plot_fig.flush(self.task_name)
plot_fig.tick()
# Decay learning rate
if (self.e+1) > (self.num_epochs - self.num_epochs_decay):
g_lr -= (self.g_lr / float(self.num_epochs_decay))
d_lr -= (self.d_lr / float(self.num_epochs_decay))
self.update_lr(g_lr, d_lr)
print('Decay learning rate to g_lr: {}, d_lr:{}.'.format(g_lr, d_lr))
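The mask handling at the top of the makeupGAN loop turns a face-parsing label map into binary region masks, one per cosmetic region. The sketch below shows the idea on plain nested lists. The label ids follow the mask comment in the loop; `region_mask` and `masked_pixels` are hypothetical helpers and the toy arrays are made up. It does not reproduce `mask_preprocess` or `rebound_box` from the repository.

```python
# Label ids taken from the mask comment in the training loop
LIP_LABELS = {7, 9}        # upper lip, under lip
SKIN_LABELS = {1, 6, 13}   # face, nose, neck

def region_mask(label_map, labels):
    """Return a 0.0/1.0 mask that is 1.0 wherever the pixel label is in `labels`."""
    return [[1.0 if v in labels else 0.0 for v in row] for row in label_map]

def masked_pixels(image, mask):
    """Collect the values of a single-channel image selected by the mask."""
    return [p for img_row, m_row in zip(image, mask)
            for p, m in zip(img_row, m_row) if m > 0]

# Toy 3x4 parsing map and a matching single-channel image
parsing = [[0, 1, 1, 0],
           [1, 7, 7, 1],
           [1, 9, 9, 13]]
image = [[10, 20, 30, 40],
         [50, 60, 70, 80],
         [90, 100, 110, 120]]

lip_mask = region_mask(parsing, LIP_LABELS)
lip_pixels = masked_pixels(image, lip_mask)
skin_pixels = masked_pixels(image, region_mask(parsing, SKIN_LABELS))
print(lip_mask)
print(lip_pixels)
print(skin_pixels)
```

Each region's pixels can then be compared against the same region in the reference image, which is what keeps the histogram loss local.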
(5) network
import torch
import torch.nn as nn
import torch.nn.functional as F
from ops.spectral_norm import spectral_norm as SpectralNorm
# Defines the GAN loss which uses either LSGAN or the regular GAN.
# When LSGAN is used, it is basically same as MSELoss,
# but it abstracts away the need to create the target label tensor
# that has the same size as the input
class ResidualBlock(nn.Module):
"""Residual Block."""
def __init__(self, dim_in, dim_out):
super(ResidualBlock, self).__init__()
self.main = nn.Sequential(
nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=1, padding=1, bias=False),
nn.InstanceNorm2d(dim_out, affine=True),
nn.ReLU(inplace=True),
nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=1, padding=1, bias=False),
nn.InstanceNorm2d(dim_out, affine=True))
def forward(self, x):
return x + self.main(x)
class Generator(nn.Module):
"""Generator. Encoder-Decoder Architecture."""
def __init__(self, conv_dim=64, repeat_num=6):
super(Generator, self).__init__()
layers = []
layers.append(nn.Conv2d(3, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
layers.append(nn.ReLU(inplace=True))
# Down-Sampling
curr_dim = conv_dim
for i in range(2):
layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
layers.append(nn.ReLU(inplace=True))
curr_dim = curr_dim * 2
# Bottleneck
for i in range(repeat_num):
layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))
# Up-Sampling
for i in range(2):
layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
layers.append(nn.ReLU(inplace=True))
curr_dim = curr_dim // 2
layers.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
layers.append(nn.Tanh())
self.main = nn.Sequential(*layers)
def forward(self, x):
out = self.main(x)
return out
class Generator_makeup(nn.Module):
"""Generator. Encoder-Decoder Architecture."""
# input 2 images and output 2 images as well
def __init__(self, conv_dim=64, repeat_num=6, input_nc=6):
super(Generator_makeup, self).__init__()
layers = []
layers.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
layers.append(nn.InstanceNorm2d(conv_dim, affine=True))
layers.append(nn.ReLU(inplace=True))
# Down-Sampling
curr_dim = conv_dim
for i in range(2):
layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
layers.append(nn.ReLU(inplace=True))
curr_dim = curr_dim * 2
# Bottleneck
for i in range(repeat_num):
layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))
# Up-Sampling
for i in range(2):
layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
layers.append(nn.ReLU(inplace=True))
curr_dim = curr_dim // 2
self.main = nn.Sequential(*layers)
layers_1 = []
layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
layers_1.append(nn.Tanh())
self.branch_1 = nn.Sequential(*layers_1)
layers_2 = []
layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
layers_2.append(nn.Tanh())
self.branch_2 = nn.Sequential(*layers_2)
def forward(self, x, y):
input_x = torch.cat((x, y), dim=1)
out = self.main(input_x)
out_A = self.branch_1(out)
out_B = self.branch_2(out)
return out_A, out_B
class Generator_branch(nn.Module):
"""Generator. Encoder-Decoder Architecture."""
# input 2 images and output 2 images as well
def __init__(self, conv_dim=64, repeat_num=6, input_nc=3):
super(Generator_branch, self).__init__()
# Branch input
layers_branch = []
layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
layers_branch.append(nn.ReLU(inplace=True))
layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
layers_branch.append(nn.ReLU(inplace=True))
self.Branch_0 = nn.Sequential(*layers_branch)
# Branch input
layers_branch = []
layers_branch.append(nn.Conv2d(input_nc, conv_dim, kernel_size=7, stride=1, padding=3, bias=False))
layers_branch.append(nn.InstanceNorm2d(conv_dim, affine=True))
layers_branch.append(nn.ReLU(inplace=True))
layers_branch.append(nn.Conv2d(conv_dim, conv_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
layers_branch.append(nn.InstanceNorm2d(conv_dim*2, affine=True))
layers_branch.append(nn.ReLU(inplace=True))
self.Branch_1 = nn.Sequential(*layers_branch)
# Down-Sampling, branch merge
layers = []
curr_dim = conv_dim*2
layers.append(nn.Conv2d(curr_dim*2, curr_dim*2, kernel_size=4, stride=2, padding=1, bias=False))
layers.append(nn.InstanceNorm2d(curr_dim*2, affine=True))
layers.append(nn.ReLU(inplace=True))
curr_dim = curr_dim * 2
# Bottleneck
for i in range(repeat_num):
layers.append(ResidualBlock(dim_in=curr_dim, dim_out=curr_dim))
# Up-Sampling
for i in range(2):
layers.append(nn.ConvTranspose2d(curr_dim, curr_dim//2, kernel_size=4, stride=2, padding=1, bias=False))
layers.append(nn.InstanceNorm2d(curr_dim//2, affine=True))
layers.append(nn.ReLU(inplace=True))
curr_dim = curr_dim // 2
self.main = nn.Sequential(*layers)
layers_1 = []
layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
layers_1.append(nn.ReLU(inplace=True))
layers_1.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
layers_1.append(nn.InstanceNorm2d(curr_dim, affine=True))
layers_1.append(nn.ReLU(inplace=True))
layers_1.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
layers_1.append(nn.Tanh())
self.branch_1 = nn.Sequential(*layers_1)
layers_2 = []
layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
layers_2.append(nn.ReLU(inplace=True))
layers_2.append(nn.Conv2d(curr_dim, curr_dim, kernel_size=3, stride=1, padding=1, bias=False))
layers_2.append(nn.InstanceNorm2d(curr_dim, affine=True))
layers_2.append(nn.ReLU(inplace=True))
layers_2.append(nn.Conv2d(curr_dim, 3, kernel_size=7, stride=1, padding=3, bias=False))
layers_2.append(nn.Tanh())
self.branch_2 = nn.Sequential(*layers_2)
def forward(self, x, y):
input_x = self.Branch_0(x)
input_y = self.Branch_1(y)
input_fuse = torch.cat((input_x, input_y), dim=1)
out = self.main(input_fuse)
out_A = self.branch_1(out)
out_B = self.branch_2(out)
return out_A, out_B
class Discriminator(nn.Module):
"""Discriminator. PatchGAN."""
def __init__(self, image_size=128, conv_dim=64, repeat_num=3, norm='SN'):
super(Discriminator, self).__init__()
layers = []
if norm=='SN':
layers.append(SpectralNorm(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1)))
else:
layers.append(nn.Conv2d(3, conv_dim, kernel_size=4, stride=2, padding=1))
layers.append(nn.LeakyReLU(0.01, inplace=True))
curr_dim = conv_dim
for i in range(1, repeat_num):
if norm=='SN':
layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1)))
else:
layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=2, padding=1))
layers.append(nn.LeakyReLU(0.01, inplace=True))
curr_dim = curr_dim * 2
#k_size = int(image_size / np.power(2, repeat_num))
if norm=='SN':
layers.append(SpectralNorm(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1)))
else:
layers.append(nn.Conv2d(curr_dim, curr_dim*2, kernel_size=4, stride=1, padding=1))
layers.append(nn.LeakyReLU(0.01, inplace=True))
curr_dim = curr_dim *2
self.main = nn.Sequential(*layers)
if norm=='SN':
self.conv1 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False))
else:
self.conv1 = nn.Conv2d(curr_dim, 1, kernel_size=4, stride=1, padding=1, bias=False)
# conv1 keeps a square patch map: a 256*256 input gives a 30*30 output
#self.conv2 = SpectralNorm(nn.Conv2d(curr_dim, 1, kernel_size=k_size, bias=False))
#conv2 output a single number
def forward(self, x):
h = self.main(x)
#out_real = self.conv1(h)
out_makeup = self.conv1(h)
#return out_real.squeeze(), out_makeup.squeeze()
return out_makeup.squeeze()
class VGG(nn.Module):
def __init__(self, pool='max'):
super(VGG, self).__init__()
# vgg modules
self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
self.conv3_4 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
self.conv4_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
self.conv5_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
if pool == 'max':
self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)
self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)
elif pool == 'avg':
self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)
self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)
self.pool3 = nn.AvgPool2d(kernel_size=2, stride=2)
self.pool4 = nn.AvgPool2d(kernel_size=2, stride=2)
self.pool5 = nn.AvgPool2d(kernel_size=2, stride=2)
def forward(self, x, out_keys):
out = {}
out['r11'] = F.relu(self.conv1_1(x))
out['r12'] = F.relu(self.conv1_2(out['r11']))
out['p1'] = self.pool1(out['r12'])
out['r21'] = F.relu(self.conv2_1(out['p1']))
out['r22'] = F.relu(self.conv2_2(out['r21']))
out['p2'] = self.pool2(out['r22'])
out['r31'] = F.relu(self.conv3_1(out['p2']))
out['r32'] = F.relu(self.conv3_2(out['r31']))
out['r33'] = F.relu(self.conv3_3(out['r32']))
out['r34'] = F.relu(self.conv3_4(out['r33']))
out['p3'] = self.pool3(out['r34'])
out['r41'] = F.relu(self.conv4_1(out['p3']))
out['r42'] = F.relu(self.conv4_2(out['r41']))
out['r43'] = F.relu(self.conv4_3(out['r42']))
out['r44'] = F.relu(self.conv4_4(out['r43']))
out['p4'] = self.pool4(out['r44'])
out['r51'] = F.relu(self.conv5_1(out['p4']))
out['r52'] = F.relu(self.conv5_2(out['r51']))
out['r53'] = F.relu(self.conv5_3(out['r52']))
out['r54'] = F.relu(self.conv5_4(out['r53']))
out['p5'] = self.pool5(out['r54'])
return [out[key] for key in out_keys]
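The comment in the `Discriminator` about 256*256 inputs giving a 30*30 map can be checked with the standard convolution output-size formula. The sketch below walks through the layer stack as listed above (kernel 4, padding 1 everywhere); `conv_out` and `patchgan_output_size` are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def patchgan_output_size(image_size, repeat_num=3):
    size = image_size
    # First conv plus (repeat_num - 1) further convs, all with stride 2
    for _ in range(repeat_num):
        size = conv_out(size, kernel=4, stride=2, padding=1)
    # One stride-1 conv at the end of `main`, then the stride-1 `conv1` head
    size = conv_out(size, kernel=4, stride=1, padding=1)
    size = conv_out(size, kernel=4, stride=1, padding=1)
    return size

print(patchgan_output_size(256))
print(patchgan_output_size(128))
```

Each value of the resulting map scores one receptive-field patch of the input, which is what makes the discriminator a PatchGAN.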
Demo (test)
[CV_3D] PointConv: Deep Convolutional Networks on 3D Point Clouds
PointConv: Deep Convolutional Networks on 3D Point Clouds
Prior Research
- PointNet : uses permutation-invariant max-pooling → misses semantic features of local regions
- PointNet++ : uses hierarchical Set Abstraction layers → captures local features, but still relies on PointNet internally
- Need an architecture that captures semantic features of local regions without loss (PointConv)
Abstract
PointConv
- Convolution kernel : nonlinear function of local coordinates of 3D points
- Weight function learned with MLP
- Density function through kernel density estimation
- Translation-invariant & Permutation-invariant on any point set in 3D space
- Deconvolution operator (PointDeconv) : propagating features (subsampled → original resol)
1. Introduction
- (Indoor/Outdoor) Sensors : directly obtaining 3D data (depth info, surface normals) = important
- CNNs for 2D : translation invariance → the same set of filters can be used at all locations → # of params ↓, generalization ↑
- 3D data (ex. pc) = a set of unordered 3D points (+additional features)
- Regular lattice grid : not available → conventional CNNs are hard to apply
- Volumetric grid : possible but sparse → CNNs are hard to run at high resolution
PointConv : Convolution operation on 3D pc with Non-uniform sampling
- Input : positions of pc
- Goal : learn (approximate) the weight function with an MLP
- Convolution operation = discrete approximation of a continuous convolution
- weights in 3D space = (Lipschitz) continuous function of local point coordinates w.r.t. a reference point
- continuous function : can be approximated by an MLP
- Compensation : apply an inverse density scale to the learned weights to account for non-uniform sampling
- Inverse density scale = re-weighting continuous function
- = Monte Carlo approximation of continuous convolution
- Improvement (memory-efficient version) : change the summation order
- Results : translation invariance (as in 2D CNNs) & permutation invariance (respecting the unordered nature of pc)
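A one-dimensional toy case may help show why the inverse density scale is needed for a Monte Carlo approximation under non-uniform sampling. The sketch below integrates f(x) = x over [0, 1] from samples that crowd near 0; the sampling scheme, the known density and the name `estimate_integral` are assumptions made for illustration and are not part of PointConv itself.

```python
import math

def estimate_integral(samples, density=None):
    """Average f(x) = x over the samples, optionally re-weighted by 1 / density(x)."""
    if density is None:
        return sum(samples) / len(samples)
    return sum(x / density(x) for x in samples) / len(samples)

# Non-uniform samples on (0, 1): x = u^2 with u evenly spread crowds points near 0.
# Their density is p(x) = 1 / (2 * sqrt(x)).
n = 1000
samples = [((i + 0.5) / n) ** 2 for i in range(n)]

naive = estimate_integral(samples)
reweighted = estimate_integral(samples, density=lambda x: 1.0 / (2.0 * math.sqrt(x)))
print(naive, reweighted)   # the true integral of x over [0, 1] is 0.5
```

The plain average is biased towards the densely sampled area, while dividing each term by the local density recovers the integral. PointConv estimates that density with KDE because it is not known in closed form for a real point cloud.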
∴ 3 Contributions
- PointConv : Density re-weighted convolution to fully approximate 3D continuous conv on any set of 3D points
- Memory-efficient version : changing the summation order → can scale up to the level of modern CNNs
- PointDeconv : enables better segmentation
3. PointConv
- PointConv : MC approximation of 3D continuous convolution
- MLP to approximate weight function
- → Inverse density scale to re-weight
3.1 Convolution on 3D Point Clouds
1) Image vs Point Cloud
- Images : 2D discrete functions (grid-shaped matrices)
- relative positions between different pixels : always fixed
- discretized filter : summation of real-valued weight for each location within local region
- Point Cloud : a set of 3D points (no fixed grid; coordinates take arbitrary continuous values)
2) Operations
- Conventional (2D) Convolution
- Continuous 3D Convolution
- $F$ : feature of a point in local region $G$ centered around point $p = (x,y,z)$
- $W$ : continuous kernel applied to $F$
- $(\delta_x, \delta_y, \delta_z)$ : offset of a local point in region $G$ from the target point $p$
- PointConv : entire convolution operation for PC (not full approx)
- In practice, only sampled points (the pc) are available from local region $G$
- PC : very non-uniform sample from continuous $R^3$ space
- $S$ : inverse density scale at any possible point in local region
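For reference, the continuous 3D convolution and PointConv listed above can be written out as follows. This is a reconstruction in the notation of the bullets ($F$, $W$, $S$, $G$), meant as a reading aid; the formula images of the original notes are not available here.

```latex
% Continuous 3D convolution at point p = (x, y, z)
\mathrm{Conv}(W, F)_{xyz}
  = \iiint_{(\delta_x, \delta_y, \delta_z) \in G}
    W(\delta_x, \delta_y, \delta_z)\,
    F(x + \delta_x,\, y + \delta_y,\, z + \delta_z)\,
    d\delta_x\, d\delta_y\, d\delta_z

% PointConv: sum over the sampled points of G, re-weighted by the inverse density scale S
\mathrm{PointConv}(S, W, F)_{xyz}
  = \sum_{(\delta_x, \delta_y, \delta_z) \in G}
    S(\delta_x, \delta_y, \delta_z)\,
    W(\delta_x, \delta_y, \delta_z)\,
    F(x + \delta_x,\, y + \delta_y,\, z + \delta_z)
```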
Why PointConv works well for continuous input
- Discretize the continuous input PC and extract local features with a discrete convolution
- In a raster img, relative positions are fixed
- ∴ given relative positions as input, the same weight and density can be produced across the whole img
3) PointConv
- Main idea : To approximate continuous weight function $W$ by MLP and inverse density scale $S$ by KDE (Kernelized density estimation)
- $W$ (Weights of MLP in PointConv) : shared across all points to keep permutation invariance
[Code] Weight Network
import torch.nn as nn
import torch.nn.functional as F


class WeightNet(nn.Module):
    """MLP (1x1 convs) that maps relative coordinates to convolution weights."""

    def __init__(self, in_channel, out_channel, hidden_unit = [8, 8]):
        super(WeightNet, self).__init__()
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        if hidden_unit is None or len(hidden_unit) == 0:
            # No hidden layer: a single 1x1 conv from input to output channels
            self.mlp_convs.append(nn.Conv2d(in_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
        else:
            self.mlp_convs.append(nn.Conv2d(in_channel, hidden_unit[0], 1))
            self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[0]))
            for i in range(1, len(hidden_unit)):
                self.mlp_convs.append(nn.Conv2d(hidden_unit[i - 1], hidden_unit[i], 1))
                self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[i]))
            self.mlp_convs.append(nn.Conv2d(hidden_unit[-1], out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))

    def forward(self, localized_xyz):
        # localized_xyz : BxCxKxN
        weights = localized_xyz
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            weights = F.relu(bn(conv(weights)))
        return weights
- $S$ (Inverse density Scale) : estimate the density of each point with KDE, then feed it to an MLP that applies a 1D nonlinear transform
- Why nonlinear transform ? So that the network can adaptively decide whether to use the density estimates
[Code] KDE(Kernelized density estimation)
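def compute_density(xyz, bandwidth):
    '''
    xyz: input points position data, [B, N, C]
    returns: Gaussian KDE density of every point, [B, N]
    '''
    B, N, C = xyz.shape
    # square_distance: pairwise squared distances, a helper defined elsewhere in the PointConv code
    sqrdists = square_distance(xyz, xyz)
    # Gaussian kernel; the constant 2.5 is roughly sqrt(2 * pi)
    gaussian_density = torch.exp(- sqrdists / (2.0 * bandwidth * bandwidth)) / (2.5 * bandwidth)
    xyz_density = gaussian_density.mean(dim = -1)
    return xyz_density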
[Code] Density Network
class DensityNet(nn.Module):
    """MLP that applies a 1D nonlinear transform to the inverse density scale."""
    def __init__(self, hidden_unit = [16, 8]):
        super(DensityNet, self).__init__()
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        self.mlp_convs.append(nn.Conv2d(1, hidden_unit[0], 1))
        self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[0]))
        for i in range(1, len(hidden_unit)):
            self.mlp_convs.append(nn.Conv2d(hidden_unit[i - 1], hidden_unit[i], 1))
            self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[i]))
        self.mlp_convs.append(nn.Conv2d(hidden_unit[-1], 1, 1))
        self.mlp_bns.append(nn.BatchNorm2d(1))

    def forward(self, density_scale):
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            density_scale = bn(conv(density_scale))
            # Sigmoid on the last layer, ReLU on all earlier layers
            if i == len(self.mlp_convs) - 1:
                density_scale = torch.sigmoid(density_scale)
            else:
                density_scale = F.relu(density_scale)
        return density_scale
- $C_{in}$, $C_{out}$ : # of channels for input feature and output feature
- PointConv on K-point local region
- Input feature $F_{in}$ = ($K$ x $C_{in}$) dim vector
- Input of Computing Weight part : $P_{local}$ = ($K$ x 3) dim vector = (relative) 3D local positions of points
- MLP (1x1 conv)
- ➀ Output of Computing Weight part : $W$ = $K$ x ($C_{in}$, $C_{out}$) dim vector
- ➁ Inverse Density Scale : $S$ = ($K$ x 1) dim vector → tiled to match the $K$ x ($C_{in}$, $C_{out}$) dim vector
- Element-wise product of ➀ and ➁, applied to $F_{in}$ → summation → Output feature $F_{out}$ = (1 x $C_{out}$) dim vector
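The shape bookkeeping for PointConv on a K-point local region can be sketched in plain Python. This is a minimal illustration using nested lists instead of tensors; `pointconv_local` and its argument names are hypothetical, and the inverse density scale is multiplied into the input features before the weighted sum, which is numerically the same as tiling it.

```python
def pointconv_local(f_in, weights, inv_density):
    """PointConv on one local region.

    f_in:        K x C_in input features
    weights:     K x C_in x C_out weights produced by the weight MLP
    inv_density: K inverse density scales
    returns:     C_out output feature
    """
    k = len(f_in)
    c_in = len(f_in[0])
    c_out = len(weights[0][0])
    f_out = [0.0] * c_out
    for p in range(k):
        for ci in range(c_in):
            # Density-scaled input feature of point p, channel ci
            scaled = inv_density[p] * f_in[p][ci]
            for co in range(c_out):
                f_out[co] += scaled * weights[p][ci][co]
    return f_out


# Toy region: K = 2 points, C_in = 2, C_out = 1
f_in = [[1.0, 2.0], [3.0, 4.0]]
weights = [[[1.0], [0.5]], [[2.0], [0.0]]]
inv_density = [1.0, 0.5]
print(pointconv_local(f_in, weights, inv_density))  # [5.0]
```

The triple loop makes the cost visible: every point carries its own $C_{in}$ x $C_{out}$ weight block, which is the memory issue that Section 4 addresses.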
- Feature Encoding Modules
- Purpose : To aggregate features in entire point set
- Structure : hierarchical structure to combine detailed small region features → large abstract features
- Key layers : sampling layer, grouping layer, PointConv layer (similar to PointNet++)
- The PointConv layer is built from $S$ and $W$ → replaces the PointNet layer in the Set Abstraction Block of PointNet++
- ∴ can aggregate better local representations!
[Code] Density Set Abstraction
class PointConvDensitySetAbstraction(nn.Module):
    def __init__(self, npoint, nsample, in_channel, mlp, bandwidth, group_all):
        super(PointConvDensitySetAbstraction, self).__init__()
        self.npoint = npoint
        self.nsample = nsample
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        last_channel = in_channel
        for out_channel in mlp:
            self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel
        self.weightnet = WeightNet(3, 16)
        self.linear = nn.Linear(16 * mlp[-1], mlp[-1])
        self.bn_linear = nn.BatchNorm1d(mlp[-1])
        self.densitynet = DensityNet()
        self.group_all = group_all
        self.bandwidth = bandwidth

    def forward(self, xyz, points):
        """
        Input:
            xyz: input points position data, [B, C, N]
            points: input points data, [B, D, N]
        Return:
            new_xyz: sampled points position data, [B, C, S]
            new_points_concat: sample points feature data, [B, D', S]
        """
        B = xyz.shape[0]
        N = xyz.shape[2]
        xyz = xyz.permute(0, 2, 1)
        if points is not None:
            points = points.permute(0, 2, 1)
        # KDE density of every point, inverted to get the density scale
        xyz_density = compute_density(xyz, self.bandwidth)
        inverse_density = 1.0 / xyz_density
        if self.group_all:
            new_xyz, new_points, grouped_xyz_norm, grouped_density = sample_and_group_all(xyz, points, inverse_density.view(B, N, 1))
        else:
            new_xyz, new_points, grouped_xyz_norm, _, grouped_density = sample_and_group(self.npoint, self.nsample, xyz, points, inverse_density.view(B, N, 1))
        # new_xyz: sampled points position data, [B, npoint, C]
        # new_points: sampled points data, [B, npoint, nsample, C+D]
        new_points = new_points.permute(0, 3, 2, 1) # [B, C+D, nsample,npoint]
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            new_points = F.relu(bn(conv(new_points)))
        # Normalize the inverse density within each group, then apply DensityNet
        inverse_max_density = grouped_density.max(dim = 2, keepdim=True)[0]
        density_scale = grouped_density / inverse_max_density
        density_scale = self.densitynet(density_scale.permute(0, 3, 2, 1))
        new_points = new_points * density_scale
        # Weights from relative positions, then sum over the nsample neighbors via matmul
        grouped_xyz = grouped_xyz_norm.permute(0, 3, 2, 1)
        weights = self.weightnet(grouped_xyz)
        new_points = torch.matmul(input=new_points.permute(0, 3, 1, 2), other = weights.permute(0, 3, 2, 1)).view(B, self.npoint, -1)
        new_points = self.linear(new_points)
        new_points = self.bn_linear(new_points.permute(0, 2, 1))
        new_points = F.relu(new_points)
        new_xyz = new_xyz.permute(0, 2, 1)
        return new_xyz, new_points
[Code] PointConv for Classification
class PointConvDensityClsSsg(nn.Module):
    def __init__(self, num_classes = 40):
        super(PointConvDensityClsSsg, self).__init__()
        feature_dim = 3
        self.sa1 = PointConvDensitySetAbstraction(npoint=512, nsample=32, in_channel=feature_dim + 3, mlp=[64, 64, 128], bandwidth = 0.1, group_all=False)
        self.sa2 = PointConvDensitySetAbstraction(npoint=128, nsample=64, in_channel=128 + 3, mlp=[128, 128, 256], bandwidth = 0.2, group_all=False)
        self.sa3 = PointConvDensitySetAbstraction(npoint=1, nsample=None, in_channel=256 + 3, mlp=[256, 512, 1024], bandwidth = 0.4, group_all=True)
        self.fc1 = nn.Linear(1024, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.drop1 = nn.Dropout(0.7)
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.drop2 = nn.Dropout(0.7)
        self.fc3 = nn.Linear(256, num_classes)

    def forward(self, xyz, feat):
        B, _, _ = xyz.shape
        # Three set abstraction levels: 512 points -> 128 points -> 1 global feature
        l1_xyz, l1_points = self.sa1(xyz, feat)
        l2_xyz, l2_points = self.sa2(l1_xyz, l1_points)
        l3_xyz, l3_points = self.sa3(l2_xyz, l2_points)
        x = l3_points.view(B, 1024)
        x = self.drop1(F.relu(self.bn1(self.fc1(x))))
        x = self.drop2(F.relu(self.bn2(self.fc2(x))))
        x = self.fc3(x)
        x = F.log_softmax(x, -1)
        return x
3.2 Feature Propagation Using Deconvolution [Segmentation]
- Segmentation : requires point-wise prediction (features are propagated from the subsampled pc to a denser pc to obtain features for all input points)
- PointNet++ : proposes distance-based Interpolation → does not take full advantage of deconv
- PointDeconv : composed of Interpolation + PointConv
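The interpolation half of PointDeconv can be sketched as follows. This is a hedged illustration, not the authors' code: `interpolate_features` is a hypothetical helper, and the choice of `k=3` neighbours with inverse squared-distance weights is an assumption based on the usual distance-based interpolation setup.

```python
import math


def interpolate_features(dense_xyz, sparse_xyz, sparse_feats, k=3, eps=1e-8):
    """Propagate features from a subsampled cloud to a denser cloud.

    Each dense point receives a weighted average of the features of its
    k nearest sparse points, weighted by inverse squared distance.
    """
    out = []
    for p in dense_xyz:
        # (distance, index) of the k nearest sparse points
        nearest = sorted(
            (math.dist(p, q), idx) for idx, q in enumerate(sparse_xyz)
        )[:k]
        w = [1.0 / (d * d + eps) for d, _ in nearest]
        total = sum(w)
        feat = [0.0] * len(sparse_feats[0])
        for wi, (_, idx) in zip(w, nearest):
            for c, value in enumerate(sparse_feats[idx]):
                feat[c] += wi / total * value
        out.append(feat)
    return out


sparse_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sparse_feats = [[0.0], [10.0]]
dense_xyz = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
print(interpolate_features(dense_xyz, sparse_xyz, sparse_feats))
```

In PointDeconv the interpolated features would then pass through a PointConv layer, which is the part plain interpolation lacks.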
4. Efficient PointConv
- Motivation : even though the MLP is shared across points, the weights $W$ produced by the MC-based weight function differ for every point → high memory consumption
- Implementation : Matrix multiplication & 2D 1x1 convolution
- PointConv ends with a summation over all points, so perform the summation over $K$ first
- → $W$ : the result of applying a 1x1 conv with the last-layer weights $H$ to the intermediate output $M$
- ∴ $K$ x $C_{out}$ is replaced by $C_{mid}$ = efficient!
- Advantage : parallel computing on GPU, easy implementation → low memory consumption (1/64)
- Generated weight filters : split into two parts (Intermediate output $M$ & Convolution kernel $H$)
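The reordering behind Efficient PointConv can be checked numerically. The sketch below uses plain Python lists and hypothetical names (`naive_pointconv`, `efficient_pointconv`); it assumes the per-point weights factor as $W = M \cdot H$ and omits the density scale, which could be folded into `f_in` beforehand.

```python
import random


def naive_pointconv(f_in, m, h):
    """f_in: K x C_in, m: K x C_mid, h: C_in x C_mid x C_out."""
    k, c_in, c_mid, c_out = len(f_in), len(f_in[0]), len(m[0]), len(h[0][0])
    out = [0.0] * c_out
    for p in range(k):
        # Materialise the per-point filter W[p] (C_in x C_out): the memory-hungry step
        w_p = [[sum(m[p][cm] * h[ci][cm][co] for cm in range(c_mid))
                for co in range(c_out)] for ci in range(c_in)]
        for ci in range(c_in):
            for co in range(c_out):
                out[co] += w_p[ci][co] * f_in[p][ci]
    return out


def efficient_pointconv(f_in, m, h):
    k, c_in, c_mid, c_out = len(f_in), len(f_in[0]), len(m[0]), len(h[0][0])
    # Sum over the K points first: e is only C_in x C_mid
    e = [[sum(f_in[p][ci] * m[p][cm] for p in range(k))
          for cm in range(c_mid)] for ci in range(c_in)]
    # Apply the shared kernel H once (the 1x1 conv / linear layer)
    return [sum(e[ci][cm] * h[ci][cm][co]
                for ci in range(c_in) for cm in range(c_mid))
            for co in range(c_out)]


rng = random.Random(0)
K, C_IN, C_MID, C_OUT = 8, 4, 3, 5
f_in = [[rng.random() for _ in range(C_IN)] for _ in range(K)]
m = [[rng.random() for _ in range(C_MID)] for _ in range(K)]
h = [[[rng.random() for _ in range(C_OUT)] for _ in range(C_MID)] for _ in range(C_IN)]
a, b = naive_pointconv(f_in, m, h), efficient_pointconv(f_in, m, h)
print(max(abs(x - y) for x, y in zip(a, b)))  # difference is at floating-point noise level
```

The same pattern appears in the set abstraction code of Section 3: `torch.matmul` sums over the neighbours against the 16-channel `WeightNet` output, and `self.linear` plays the role of $H$.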
5. Experiments
5.1 Classification on ModelNet40
- Dataset : ModelNet40 (12,311 CAD models from 40 man-made object categories)
- Using PointNet to sample 1024 points uniformly & compute normal vector from mesh models
- Data augmentation : random rotating along z-axis, jittering by gaussian noise
- Result : PointConv = SOTA among 3D input methods
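The augmentation listed for ModelNet40 (random rotation about the z-axis plus Gaussian jitter) might look like this in plain Python. The function name `augment` and the jitter scale `sigma=0.01` are assumptions for illustration; the notes do not give the actual value.

```python
import math
import random


def augment(points, sigma=0.01, rng=random):
    """Rotate a point cloud by a random angle about z, then jitter every coordinate."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x, y, z in points:
        # Rotation about z leaves the z coordinate unchanged
        xr, yr = c * x - s * y, s * x + c * y
        out.append((xr + rng.gauss(0.0, sigma),
                    yr + rng.gauss(0.0, sigma),
                    z + rng.gauss(0.0, sigma)))
    return out


cloud = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0)]
print(augment(cloud, rng=random.Random(0)))
```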
5.2 ShapeNet Part Segmentation
- Dataset : ShapeNet (16,881 shapes from 16 classes, 50 parts)
- Goal : To assign a part category label to each point (fine-grained 3D recognition task)
- Eval Metric : point IoU
- Result : class avg mIoU 82.8%, instance avg mIoU 85.7% = on par with SOTA
5.3 Semantic Scene Labeling(Segmentation)
- Dataset : ScanNet (noisy dataset for realistic pc)
- Goal : To predict semantic object labels on each 3D point given indoor scenes represented by pc
- Train : using random 3m x 1.5m x 1.5m cube samples
- Eval : using sliding window over entire scan
- Eval Metric : IoU, mIoU
- Result : PointConv outperforms other methods
5.4 Classification on CIFAR-10
- Dataset : CIFAR-10
- each pixel as a 2D point with (x, y) + RGB features
- pc scaled onto unit ball
- Result : same learning capacity as 2D CNN
6. Ablation Experiments and Visualization
6.1 The Structure of MLP
- Dataset : 20 scene types for ScanNet (realistic 3D pc with RGB)
- $C_{mid}$ : a larger value does not necessarily improve performance, but it does affect memory efficiency
- The number of MLP layers has little effect on performance
6.2 Inverse Density Scale
- Dataset : ScanNet
- Density > No Density → Effect of IDS
- more effective in layers closer to input
- FPS (farthest point sampling) for sub-sampling → points in deeper layers are more uniformly distributed, so the density scale has less effect
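A minimal sketch of FPS helps show why deeper layers end up more uniform: each new sample is the point farthest from everything chosen so far. The helper name `farthest_point_sampling` and the fixed `start` index are assumptions, and the sketch expects `n_samples` to be at most the number of points.

```python
import math


def farthest_point_sampling(points, n_samples, start=0):
    """Return indices of n_samples points chosen by farthest point sampling."""
    selected = [start]
    # Distance from every point to its nearest already-selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# A dense cluster near 0, a small cluster near 5, and a lone point at 10
pts = [(0.0, 0, 0), (0.1, 0, 0), (0.2, 0, 0), (5.0, 0, 0), (5.1, 0, 0), (10.0, 0, 0)]
print(farthest_point_sampling(pts, 3))  # [0, 5, 3]
```

The dense cluster contributes only one sample, so the subsampled cloud is far more uniform than the input.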
6.3 Ablation Studies on ScanNet
6.4 Visualization
[CV_3D] JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields
JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields
Abstract
- Task : Semantic and Instance Segmentation of 3D point clouds
- Methods
- Multi-task pointwise network : predicting semantic classes of 3D points & embedding points into high-dim vectors → points of same object instance are represented by similar embeddings
- Multi-value conditional random field : incorporating semantic and instance labels & formulating problem of semantic and instance segmentation as jointly optimising labels
- Results : showing robustness / SOTA performance on semantic segmentation
1. Introduction
- 3D scene understanding : hard challenges (ex. large-scale and noisy data processing)
- Point-based representation
- PC : more compact and intuitive representation of 3D data than multi-view or volumetric representations
- Recent NN on PC : promising results across multiple tasks
- Motivation
- Semantic segmentation : identifying a class label or Object category for every 3D point in a scene
- Instance segmentation : clustering scene into Object instances
- Object categories and Object instances are mutually dependent → coupling semantic and instance segmentation into a single task!
- Contributions
- Multi-Task Pointwise Network (MT-PNet) : predicting object categories of 3D points & embedding 3D points into high-dim feature vectors(→clustering points into object instances)
- Multi-Value Conditional Random Fields (MV-CRF) : joint optimisation of class labels and object instances by variational mean field technique
- Experiments : joint optimisation > each individual task / SOTA performance on semantic segmentation
2. Related Work
Semantic Segmentation
- Multi-view approach : using pretrained models on 2D domain and applying to 3D space => inconsistency
- Volumetric approach : ex. octree (limiting convolution operations only on free-space voxels)
- Point cloud approach : directly storing attributes of geometry of 3D scene via coordinates and normals of vertices
- Conditional Random Fields (CRFs) : unary and binary potentials capturing characteristics of individual 3D points or meshes
Instance Segmentation
- (1) Localizing object bboxes by Object detection → Finding a mask that separates fg and bg within each box
- (2) Semantic segmentation + Proposing object instances
3. Method
- First, scan the entire pc with overlapping 3D windows
- NN for predicting semantic class labels of vertices within window & embedding vertices into high-dim vectors
Multi-Task Pointwise Network (MT-PNet)
- Purpose : predict object class for every 3D point in scene & embedding 3D point into high-dim vector
- Same object instance : Pull <-> Different object instance : Push each other
Multi-Value Conditional Random Fields (MV-CRF)
- Purpose : jointly performing semantic and instance segmentation by variational inference
- Class labels and embeddings are fused into MV-CRF model
3.1. Multi-Task Pointwise Network (MT-PNet)
- Input PC (N) → Feature map (N x D)
- based on feed forward architecture of PointNet
- Two branches : Predicting semantic labels for 3D points & Creating their pointwise Embeddings
- Notation
- $K$ : # of instances
- $N_k$ : # of elements in k-th instance
- $e_j$ : embedding of point $v_j$
- $m_k$ : mean (centroid) of embeddings in k-th instance
- Loss = $L_{prediction}$ + $L_{embedding}$
- $L_{prediction}$ : CE (cross-entropy)
- $L_{embedding}$ : $L_{pull} + L_{push} + 0.001 \cdot L_{reg}$
- $L_{pull}$ : to attract embeddings towards centroids <-> $L_{push}$ : to keep centroids away from each other
- $L_{reg}$ : to draw all centroids towards the origin
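One common way to write the three embedding terms with the notation above is given here. The margins $\delta_v$ and $\delta_d$ are named after `delta_v` and `delta_d` in the loss code of this section; the exact normalisation is an assumption (the `_variance` method in that code divides by the total point count rather than averaging per instance).

```latex
% [x]_+ = max(0, x)
L_{pull} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{j=1}^{N_k}
    \left[ \lVert m_k - e_j \rVert_2 - \delta_v \right]_+^2

L_{push} = \frac{1}{K(K-1)} \sum_{k=1}^{K} \sum_{k' \neq k}
    \left[ 2\delta_d - \lVert m_k - m_{k'} \rVert_2 \right]_+^2

L_{reg} = \frac{1}{K} \sum_{k=1}^{K} \lVert m_k \rVert_2
```

Embeddings closer than $\delta_v$ to their centroid and centroids farther apart than $2\delta_d$ incur no penalty, so the loss only acts on violations of the two margins.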
[Code] Multi-Task Pointwise Network (MT-PNet)
import torch
import torch.nn as nn

class MTPNet(nn.Module):
    def __init__(self, input_channels, num_classes, embedding_size):
        super(MTPNet, self).__init__()
        self.num_classes = num_classes
        self.embedding_size = embedding_size
        self.input_channels = input_channels
        # PointNet backbone (defined elsewhere) producing 128-dim per-point features
        self.net = PointNet(self.input_channels)
        # Branch 1: semantic class logits, Branch 2: pointwise embeddings
        self.fc1 = nn.Conv1d(128, self.num_classes, 1)
        self.fc2 = nn.Conv1d(128, self.embedding_size, 1)

    def forward(self, x):
        x = self.net(x)
        logits = self.fc1(x)
        logits = logits.transpose(2, 1)
        logits = torch.log_softmax(logits, dim=-1)
        embedded = self.fc2(x)
        embedded = embedded.transpose(2, 1)
        return logits, embedded
[Code] Loss
class DiscriminativeLoss(nn.Module):
    def __init__(self, delta_d, delta_v,
                 alpha=1.0, beta=1.0, gamma=0.001,
                 reduction='mean'):
        # TODO: Respect the reduction rule
        super(DiscriminativeLoss, self).__init__()
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        # Set delta_d > 2 * delta_v
        self.delta_d = delta_d
        self.delta_v = delta_v

    def forward(self, embedded, masks, size):
        centroids = self._centroids(embedded, masks, size)
        L_v = self._variance(embedded, masks, centroids, size)   # pull term
        L_d = self._distance(centroids, size)                    # push term
        L_r = self._regularization(centroids, size)              # regularization term
        loss = self.alpha * L_v + self.beta * L_d + self.gamma * L_r
        return loss

    def _centroids(self, embedded, masks, size):
        batch_size = embedded.size(0)
        embedding_size = embedded.size(2)
        K = masks.size(2)
        x = embedded.unsqueeze(2).expand(-1, -1, K, -1)
        masks = masks.unsqueeze(3)
        x = x * masks
        centroids = []
        for i in range(batch_size):
            n = size[i]
            mu = x[i,:,:n].sum(0) / masks[i,:,:n].sum(0)
            # Pad with zero centroids up to K instances
            if K > n:
                m = int(K - n)
                filled = torch.zeros(m, embedding_size)
                filled = filled.to(embedded.device)
                mu = torch.cat([mu, filled], dim=0)
            centroids.append(mu)
        centroids = torch.stack(centroids)
        return centroids

    def _variance(self, embedded, masks, centroids, size):
        batch_size = embedded.size(0)
        num_points = embedded.size(1)
        embedding_size = embedded.size(2)
        K = masks.size(2)
        # Convert input into the same size
        mu = centroids.unsqueeze(1).expand(-1, num_points, -1, -1)
        x = embedded.unsqueeze(2).expand(-1, -1, K, -1)
        # Calculate intra pull force
        var = torch.norm(x - mu, 2, dim=3)
        var = torch.clamp(var - self.delta_v, min=0.0) ** 2
        var = var * masks
        loss = 0.0
        for i in range(batch_size):
            n = size[i]
            loss += torch.sum(var[i,:,:n]) / torch.sum(masks[i,:,:n])
        loss /= batch_size
        return loss

    def _distance(self, centroids, size):
        batch_size = centroids.size(0)
        loss = 0.0
        for i in range(batch_size):
            n = size[i]
            if n <= 1: continue
            mu = centroids[i, :n, :]
            mu_a = mu.unsqueeze(1).expand(-1, n, -1)
            mu_b = mu_a.permute(1, 0, 2)
            diff = mu_a - mu_b
            norm = torch.norm(diff, 2, dim=2)
            # Zero margin on the diagonal so a centroid is not pushed from itself
            margin = 2 * self.delta_d * (1.0 - torch.eye(n))
            margin = margin.to(centroids.device)
            distance = torch.sum(torch.clamp(margin - norm, min=0.0) ** 2) # hinge loss
            distance /= float(n * (n - 1))
            loss += distance
        loss /= batch_size
        return loss

    def _regularization(self, centroids, size):
        batch_size = centroids.size(0)
        loss = 0.0
        for i in range(batch_size):
            n = size[i]
            mu = centroids[i, :n, :]
            norm = torch.norm(mu, 2, dim=1)
            loss += torch.mean(norm)
        loss /= batch_size
        return loss
3.2. Multi-Value Conditional Random Fields (MV-CRF)
- Conditional Random Fields (CRF)
- Classical algorithm for Named Entity Recognition (NER) in NLP tasks
- Softmax regression with potential function
- Select several candidates → Choose the most appropriate label among them
- Notation
- $V$ : point cloud of 3D scene
- $v_j$ : 3D vertex (point) in $V$, represented by its 3D location $l_j = [x_j, y_j, z_j]$ & normal $n_j$ & color $c_j = [c_R, c_G, c_B]$
- $e_j$ : embedding for each point $v_j$
- $I_j^S$ : semantic label → $L^S$ : set of semantic labels of $V$
- $I_j^I$ : instance label → $L^I$ : set of instance labels of $V$
- Joint semantic-instance segmentation of point cloud $V$ by minimizing Energy function
- MV-CRF : treating instance and semantic labels equally as unknown → optimizing together (minimizing $E$)
- Energy function $E$ = ➀+➁+➂+➃+➄
- Physical constraints (ex. surface smoothness, geometric proximity) & Semantic constraints (ex. shape consistency, object class and instances) in both Semantic and Instance labeling
- ➀ : Unary potential defined over semantic labels, given by the MT-PNet classification score of $v_j$
- ➁ : Pairwise potential for same object class, given by the classification scores of both $v_j$ and $v_k$
- ➂ : Unary potential defined over instance labels → PULL same instance <-> PUSH different instance embeddings
- ➃ : Pairwise potential of instance labels → Geometric properties of surfaces in object instances
- defined as a mixture of Gaussians of locations, normals, color of vertices $v_j$ and $v_k$
- ➄ : semantic-based potentials with instance-based potentials → Consistency between semantic and instance labels
- defined based on mutual information, using the frequency with which semantic label $s$ occurs in vertices whose instance label is $i$
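The statistic behind term ➄ is the frequency with which each semantic label occurs inside each instance. The sketch below only computes that frequency table; it is not the paper's potential, and `semantic_frequency_per_instance` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def semantic_frequency_per_instance(semantic_labels, instance_labels):
    """Return freq[i][s]: fraction of vertices in instance i that carry semantic label s."""
    counts = defaultdict(Counter)
    for s, i in zip(semantic_labels, instance_labels):
        counts[i][s] += 1
    freq = {}
    for i, counter in counts.items():
        total = sum(counter.values())
        freq[i] = {s: n / total for s, n in counter.items()}
    return freq


semantic = ["chair", "chair", "table", "chair"]
instance = [0, 0, 0, 1]
print(semantic_frequency_per_instance(semantic, instance))
```

An instance whose vertices mostly share one semantic label yields a peaked distribution, which is the kind of consistency the joint term rewards.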
[Code] Dense CRF
/////////////////////////////////
///// Pairwise Potentials /////
/////////////////////////////////
void DenseCRF::addPairwiseEnergy (const MatrixXf & features, LabelCompatibility * function, KernelType kernel_type, NormalizationType normalization_type) {
assert( features.cols() == N_ );
addPairwiseEnergy( new PairwisePotential( features, function, kernel_type, normalization_type ) );
}
void DenseCRF::addPairwiseEnergy ( PairwisePotential* potential ){
pairwise_.push_back( potential );
}
void DenseCRF2D::addPairwiseGaussian ( float sx, float sy, LabelCompatibility * function, KernelType kernel_type, NormalizationType normalization_type ) {
MatrixXf feature( 2, N_ );
for( int j=0; j<H_; j++ )
for( int i=0; i<W_; i++ ){
feature(0,j*W_+i) = i / sx;
feature(1,j*W_+i) = j / sy;
}
addPairwiseEnergy( feature, function, kernel_type, normalization_type );
}
void DenseCRF2D::addPairwiseBilateral ( float sx, float sy, float sr, float sg, float sb, const unsigned char* im, LabelCompatibility * function, KernelType kernel_type, NormalizationType normalization_type ) {
MatrixXf feature( 5, N_ );
for( int j=0; j<H_; j++ )
for( int i=0; i<W_; i++ ){
feature(0,j*W_+i) = i / sx;
feature(1,j*W_+i) = j / sy;
feature(2,j*W_+i) = im[(i+j*W_)*3+0] / sr;
feature(3,j*W_+i) = im[(i+j*W_)*3+1] / sg;
feature(4,j*W_+i) = im[(i+j*W_)*3+2] / sb;
}
addPairwiseEnergy( feature, function, kernel_type, normalization_type );
}
//////////////////////////////
///// Unary Potentials /////
//////////////////////////////
void DenseCRF::setUnaryEnergy ( UnaryEnergy * unary ) {
if( unary_ ) delete unary_;
unary_ = unary;
}
void DenseCRF::setUnaryEnergy( const MatrixXf & unary ) {
setUnaryEnergy( new ConstUnaryEnergy( unary ) );
}
void DenseCRF::setUnaryEnergy( const MatrixXf & L, const MatrixXf & f ) {
setUnaryEnergy( new LogisticUnaryEnergy( L, f ) );
}
/////////////////////////////////////
///// Higher Order Potentials /////
/////////////////////////////////////
void DenseCRF::addHigherOrderEnergy( const VectorXs & cliques, float weight ) {
if( higher_order_ ) delete higher_order_;
higher_order_ = new HigherOrderPotential( cliques, weight );
}
///////////////////////
///// Inference /////
///////////////////////
void expAndNormalize ( MatrixXf & out, const MatrixXf & in ) {
out.resize( in.rows(), in.cols() );
for( int i=0; i<out.cols(); i++ ){
VectorXf b = in.col(i);
b.array() -= b.maxCoeff();
b = b.array().exp();
out.col(i) = b / b.array().sum();
}
}
void sumAndNormalize( MatrixXf & out, const MatrixXf & in, const MatrixXf & Q ) {
out.resize( in.rows(), in.cols() );
for( int i=0; i<in.cols(); i++ ){
VectorXf b = in.col(i);
VectorXf q = Q.col(i);
out.col(i) = b.array().sum()*q - b;
}
}
MatrixXf DenseCRF::inference ( int n_iterations ) const {
MatrixXf Q( M_, N_ ), tmp1, unary( M_, N_ ), tmp2, tmp3;
unary.fill(0);
if( unary_ )
unary = unary_->get();
expAndNormalize( Q, -unary );
VectorXi mask(N_);
// for (int i = 0; i < N_; ++i)
// mask[i] = (Q.col(i).maxCoeff() > 0.8f);
for( int it=0; it<n_iterations; it++ ) {
tmp1 = -unary;
// Higher-order Potts model
if( higher_order_ ) {
higher_order_->apply( tmp3, Q, mask );
tmp1 -= tmp3;
}
for( unsigned int k=0; k<pairwise_.size(); k++ ) {
pairwise_[k]->apply( tmp2, Q );
tmp1 -= tmp2;
}
expAndNormalize( Q, tmp1 );
}
return Q;
}
VectorXs DenseCRF::map ( int n_iterations ) const {
// Run inference
MatrixXf Q = inference( n_iterations );
// Find the map
return currentMap( Q );
}
///////////////////
///// Debug /////
///////////////////
VectorXf DenseCRF::unaryEnergy(const VectorXs & l) {
assert( l.cols() == N_ );
VectorXf r( N_ );
r.fill(0.f);
if( unary_ ) {
MatrixXf unary = unary_->get();
for( int i=0; i<N_; i++ )
if ( 0 <= l[i] && l[i] < M_ )
r[i] = unary( l[i], i );
}
return r;
}
VectorXf DenseCRF::pairwiseEnergy(const VectorXs & l, int term) {
assert( l.cols() == N_ );
VectorXf r( N_ );
r.fill(0.f);
if( term == -1 ) {
for( unsigned int i=0; i<pairwise_.size(); i++ )
r += pairwiseEnergy( l, i );
return r;
}
MatrixXf Q( M_, N_ );
// Build the current belief [binary assignment]
for( int i=0; i<N_; i++ )
for( int j=0; j<M_; j++ )
Q(j,i) = (l[i] == j);
pairwise_[ term ]->apply( Q, Q );
for( int i=0; i<N_; i++ )
if ( 0 <= l[i] && l[i] < M_ )
r[i] =-0.5*Q(l[i],i );
else
r[i] = 0;
return r;
}
MatrixXf DenseCRF::startInference() const{
MatrixXf Q( M_, N_ );
Q.fill(0);
// Initialize using the unary energies
if( unary_ )
expAndNormalize( Q, -unary_->get() );
return Q;
}
void DenseCRF::stepInference( MatrixXf & Q, MatrixXf & tmp1, MatrixXf & tmp2 ) const{
tmp1.resize( Q.rows(), Q.cols() );
tmp1.fill(0);
if( unary_ )
tmp1 -= unary_->get();
// Add up all pairwise potentials
for( unsigned int k=0; k<pairwise_.size(); k++ ) {
pairwise_[k]->apply( tmp2, Q );
tmp1 -= tmp2;
}
// Exponentiate and normalize
expAndNormalize( Q, tmp1 );
}
VectorXs DenseCRF::currentMap( const MatrixXf & Q ) const{
VectorXs r(Q.cols());
// Find the map
for( int i=0; i<N_; i++ ){
int m;
Q.col(i).maxCoeff( &m );
r[i] = m;
}
return r;
}
// Compute the KL-divergence of a set of marginals
double DenseCRF::klDivergence( const MatrixXf & Q ) const {
double kl = 0;
// Add the entropy term
for( int i=0; i<Q.cols(); i++ )
for( int l=0; l<Q.rows(); l++ )
kl += Q(l,i)*log(std::max( Q(l,i), 1e-20f) );
// Add the unary term
if( unary_ ) {
MatrixXf unary = unary_->get();
for( int i=0; i<Q.cols(); i++ )
for( int l=0; l<Q.rows(); l++ )
kl += unary(l,i)*Q(l,i);
}
// Add all pairwise terms
MatrixXf tmp;
for( unsigned int k=0; k<pairwise_.size(); k++ ) {
pairwise_[k]->apply( tmp, Q );
kl += (Q.array()*tmp.array()).sum();
}
return kl;
}
// Gradient computations
double DenseCRF::gradient( int n_iterations, const ObjectiveFunction & objective, VectorXf * unary_grad, VectorXf * lbl_cmp_grad, VectorXf * kernel_grad) const {
// Run inference
std::vector< MatrixXf > Q(n_iterations+1);
MatrixXf tmp1, unary( M_, N_ ), tmp2;
unary.fill(0);
if( unary_ )
unary = unary_->get();
expAndNormalize( Q[0], -unary );
for( int it=0; it<n_iterations; it++ ) {
tmp1 = -unary;
for( unsigned int k=0; k<pairwise_.size(); k++ ) {
pairwise_[k]->apply( tmp2, Q[it] );
tmp1 -= tmp2;
}
expAndNormalize( Q[it+1], tmp1 );
}
// Compute the objective value
MatrixXf b( M_, N_ );
double r = objective.evaluate( b, Q[n_iterations] );
sumAndNormalize( b, b, Q[n_iterations] );
// Compute the gradient
if(unary_grad && unary_)
*unary_grad = unary_->gradient( b );
if( lbl_cmp_grad )
*lbl_cmp_grad = 0*labelCompatibilityParameters();
if( kernel_grad )
*kernel_grad = 0*kernelParameters();
for( int it=n_iterations-1; it>=0; it-- ) {
// Do the inverse message passing
tmp1.fill(0);
int ip = 0, ik = 0;
// Add up all pairwise potentials
for( unsigned int k=0; k<pairwise_.size(); k++ ) {
// Compute the pairwise gradient expression
if( lbl_cmp_grad ) {
VectorXf pg = pairwise_[k]->gradient( b, Q[it] );
lbl_cmp_grad->segment( ip, pg.rows() ) += pg;
ip += pg.rows();
}
// Compute the kernel gradient expression
if( kernel_grad ) {
VectorXf pg = pairwise_[k]->kernelGradient( b, Q[it] );
kernel_grad->segment( ik, pg.rows() ) += pg;
ik += pg.rows();
}
// Compute the new b
pairwise_[k]->applyTranspose( tmp2, b );
tmp1 += tmp2;
}
sumAndNormalize( b, tmp1.array()*Q[it].array(), Q[it] );
// Add the gradient
if(unary_grad && unary_)
*unary_grad += unary_->gradient( b );
}
return r;
}
VectorXf DenseCRF::unaryParameters() const {
if( unary_ )
return unary_->parameters();
return VectorXf();
}
void DenseCRF::setUnaryParameters( const VectorXf & v ) {
if( unary_ )
unary_->setParameters( v );
}
VectorXf DenseCRF::labelCompatibilityParameters() const {
std::vector< VectorXf > terms;
for( unsigned int k=0; k<pairwise_.size(); k++ )
terms.push_back( pairwise_[k]->parameters() );
int np=0;
for( unsigned int k=0; k<pairwise_.size(); k++ )
np += terms[k].rows();
VectorXf r( np );
for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
r.segment( i, terms[k].rows() ) = terms[k];
i += terms[k].rows();
}
return r;
}
void DenseCRF::setLabelCompatibilityParameters( const VectorXf & v ) {
std::vector< int > n;
for( unsigned int k=0; k<pairwise_.size(); k++ )
n.push_back( pairwise_[k]->parameters().rows() );
int np=0;
for( unsigned int k=0; k<pairwise_.size(); k++ )
np += n[k];
for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
pairwise_[k]->setParameters( v.segment( i, n[k] ) );
i += n[k];
}
}
VectorXf DenseCRF::kernelParameters() const {
std::vector< VectorXf > terms;
for( unsigned int k=0; k<pairwise_.size(); k++ )
terms.push_back( pairwise_[k]->kernelParameters() );
int np=0;
for( unsigned int k=0; k<pairwise_.size(); k++ )
np += terms[k].rows();
VectorXf r( np );
for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
r.segment( i, terms[k].rows() ) = terms[k];
i += terms[k].rows();
}
return r;
}
void DenseCRF::setKernelParameters( const VectorXf & v ) {
std::vector< int > n;
for( unsigned int k=0; k<pairwise_.size(); k++ )
n.push_back( pairwise_[k]->kernelParameters().rows() );
int np=0;
for( unsigned int k=0; k<pairwise_.size(); k++ )
np += n[k];
for( unsigned int k=0,i=0; k<pairwise_.size(); k++ ) {
pairwise_[k]->setKernelParameters( v.segment( i, n[k] ) );
i += n[k];
}
}
3.3. Variational Inference
- Optimization problem : Minimizing $E$ = Maximizing posterior conditional $p$ (intractable with naive implementation)
- Mean field Variational Inference to solve optimization problem
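The mean field update follows the same pattern as `DenseCRF::inference` in Section 3.2: start from the unaries, subtract the pairwise messages, then exponentiate and normalize. Below is a plain-Python sketch on a tiny fully connected graph with a Potts pairwise term. It is a generic illustration, not the MV-CRF itself; `mean_field`, `normalize_exp`, and the toy costs are assumptions.

```python
import math


def normalize_exp(scores):
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]


def mean_field(unary, affinity, n_iters=10):
    """unary[i][l]: cost of label l at node i; affinity[i][j]: pairwise weight >= 0."""
    n, n_labels = len(unary), len(unary[0])
    # Initialize the marginals Q from the unary costs
    q = [normalize_exp([-u for u in unary[i]]) for i in range(n)]
    for _ in range(n_iters):
        new_q = []
        for i in range(n):
            scores = []
            for l in range(n_labels):
                # Potts message: expected weight of the other nodes that disagree with l
                penalty = sum(affinity[i][j] * (1.0 - q[j][l])
                              for j in range(n) if j != i)
                scores.append(-unary[i][l] - penalty)
            new_q.append(normalize_exp(scores))
        q = new_q  # synchronous update of all nodes
    return q


# Node 1 weakly prefers label 1, but both neighbours strongly prefer label 0
unary = [[0.0, 4.0], [1.0, 0.8], [0.0, 4.0]]
affinity = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
q = mean_field(unary, affinity)
print([max(range(2), key=lambda l: row[l]) for row in q])  # [0, 0, 0]
```

The pairwise term pulls node 1 over to the label of its neighbours, which is the smoothing effect the CRF adds on top of the network's per-point predictions.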
Code Implementation
[CV_CNN] Deep Residual Learning for Image Recognition
Deep Residual Learning for Image Recognition
Abstract
- Residual learning framework to ease training of networks that are substantially deeper
- Residual networks are easier to optimize and can gain acc from increased depth
- 152 layers ResNet (8x deeper than VGGNet) : deeper but still having lower complexity
- Ensemble model : 3.57% top-5 error on ImageNet -> 1st place on ILSVRC 2015 classification task
- Generalization performance on other recognition tasks (Object detection and Segmentation task)
1. Introduction
- DNN for image classification (visual recognition task)
- Integrating low/mid/high level features <- levels can be enriched by stacking layers
- VGG, GoogLeNet : showed that Network depth is important
- Possible problems of stacking many layers
- Vanishing/exploding gradients problem : largely addressed by normalized initialization (ex. He) and intermediate normalization layers (ex. BN), which let deep nets start converging under SGD
- Overfitting (Variance ↑ + Bias ↓) : low train error but high test error
- Degradation problem : the deeper the model, the higher both train and test error
- Error still decreases during training, so it is not vanishing gradients; train error is also higher, so it is not overfitting
- Solution (not the only one) : Deep Residual Learning Network (ResNet)
- Residual Learning
- H(x) : Desired(Original) mapping
- F(x) := H(x)-x : Residual mapping
- Output = F(x)+x = H(x)
- (Extreme assumption) If an identity mapping were optimal (H(x)=x), pushing the residual to zero (F=0) is easier than fitting the identity with a stack of nonlinear layers
- Shortcut connections : +x
- Their outputs are added to the outputs of stacked layers
- Simply perform Identity mapping
- No extra params and computational complexity
- End-to-end by SGD with backprop, Easy implementation
- Experiments
- ImageNet -> ResNet is easy to optimize & deeper net gets higher accuracy
- CIFAR-10 -> Similar phenomena are shown -> the effect generalizes to other datasets
- Generalization performance on other recognition tasks (Object detection and Segmentation task)
2. Related Work
Residual Representations
Shortcut Connections
- Previous models : GoogLeNet, highway networks, ...
- ResNet : Always learns residual functions (Identity shortcuts are always opened)
3. Deep Residual Learning
3.1. Residual Learning
- Residual Learning
- H(x) : Desired(Original) mapping
- F(x) := H(x)-x : Residual mapping
- Output = F(x)+x = H(x)
- (Extreme assumption) If an identity mapping were optimal (H(x)=x), pushing the residual to zero (F=0) is easier than fitting the identity with a stack of nonlinear layers
- Both H(x) and F(x) can approximate the desired functions, but F(x) is easier to train
- (In real cases) Identity mappings are unlikely optimal, but reformulation helps to precondition problem
- If the optimal function is closer to identity mapping than to zero mapping, it is easier to find perturbations with reference to an identity mapping than to learn a new function
3.2. Identity Mapping by Shortcuts
- Definition of a building block : $y = F(x, \{W_i\}) + x$
- x and y : input and output vectors
- F : Residual mapping to be learned
- Operation F + x : performed by a shortcut connection and element-wise addition
- Shortcut connections : +x
- Their outputs are added to the outputs of stacked layers
- Simply perform identity mapping
- No extra params and computational complexity
- End-to-end by SGD with backprop, Easy implementation
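- 2 types of shortcut connections : identity ($y = F(x, \{W_i\}) + x$) when in/output dimensions match, projection ($y = F(x, \{W_i\}) + W_s x$) when they differ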
- F can represent multiple conv layers
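A toy version of the building block shows why a zero residual recovers the identity. The sketch uses plain lists, a two-layer $F$ without bias or batch normalization, and omits the final ReLU after the addition; all names (`residual_block`, `linear`, `relu`) are illustrative assumptions.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(w, v):
    """Matrix-vector product with w given as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    # Residual mapping F(x) = W2 * relu(W1 * x)
    f = linear(w2, relu(linear(w1, x)))
    # Identity shortcut: element-wise addition, no extra parameters
    return [fi + xi for fi, xi in zip(f, x)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zeros, zeros))  # [1.0, -2.0]: F = 0 gives the identity mapping
```

With all weights at zero the block passes `x` through unchanged, so extra layers can at worst start from an identity mapping instead of having to learn it.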
3.3 Network Architectures
Plain Network
- Plain baselines are mainly inspired by VGGNets
- 3x3 filter size for all conv layers
- # of filters is same for the same output feature map size
- # of filters is doubled if the feature map size is halved to preserve time complexity per layer
- Downsampling by stride = 2 of conv layers (No Pooling layers) to match in/output dim
- Fewer filters (params) and lower complexity than VGGNets
Residual Network
- Plain Network + Shortcut Connections
- Black lines : Identity shortcuts can be directly used when in/output same dimension
- Dotted lines : 2 options to match in/output dimensions
- (A) Identity mapping with extra zero entries padded for increasing dimensions (No extra params)
- (B) Projection shortcuts by 1x1 conv
- For both (A) and (B), shortcuts that go across feature maps of two sizes are performed with stride = 2
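Option (A) can be sketched on a nested-list feature map: subsample spatially with stride 2, then append all-zero channels until the channel count matches. The helper `zero_pad_shortcut` and the `[C][H][W]` layout are assumptions for illustration.

```python
def zero_pad_shortcut(x, out_channels, stride=2):
    """Parameter-free shortcut: strided identity plus zero-padded extra channels.

    x: feature map as nested lists with layout [C][H][W]
    """
    # Strided identity: keep every stride-th row and column
    sub = [[row[::stride] for row in channel[::stride]] for channel in x]
    h, w = len(sub[0]), len(sub[0][0])
    extra = out_channels - len(sub)
    padding = [[[0.0] * w for _ in range(h)] for _ in range(extra)]
    return sub + padding


x = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]  # 1 x 4 x 4
y = zero_pad_shortcut(x, out_channels=2)
print(y)  # [[[0.0, 2.0], [8.0, 10.0]], [[0.0, 0.0], [0.0, 0.0]]]
```

The padded channels carry no signal from the input, which matches the note in Section 4.1 that zero-padded dimensions have no residual learning.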
3.4. Implementation
(1) Training
- Data Pre-processing
- Image Rescale : with shorter side randomly sampled in [256, 480] for augmentation
- Random crop 224 x 224
- Random horizontal flip
- Standard color augmentation
- Train Details
- Batch Normalization right after each conv and before activation
- Weight initialization & Train all plain/residual nets from scratch
- SGD with a mini-batch size : 256
- Learning rate : starts at 0.1, divided by 10 when the error plateaus; trained for up to 60 x 10^4 iterations
- Weight decay : 0.0001
- Momentum : 0.9
- No Dropout
(2) Testing
- Standard 10-crop testing for comparison studies
- For best results, fully-convolutional form -> average scores at multiple scales
4. Experiments
4.1. ImageNet Classification
- Dataset : ImageNet 2012 classification dataset (1000 classes / 1.28M train + 50K val + 100K test)
- Eval both top-1 and top-5 error rates
Plain Networks
- 18, 34, 50, 101, 152-layer Networks => kernel_size = 3
- Ex) 18-layer conv2_x : conv1 -> BN1 -> ReLU -> conv2 -> BN2 --+) shortcut --> ReLU
- Degradation problem : Deeper(34-layer) plain net has higher training error than shallower(18-layer) plain net
- Unlikely to be caused by vanishing gradients (with BN, neither forward nor backward signals vanish)
- Conjecture : deep plain nets may have exponentially low convergence rates
Residual Networks
- 18-layer and 34-layer ResNet : same baseline arch with plain nets + a shortcut connection (to each pair of 3x3 filters)
- (option A) Identity mapping for all shortcuts and Zero-padding for increasing dimensions
- Deeper(34-layer) ResNet has lower training error than shallower(18-layer) ResNet -> Solving Degradation problem
- 34-layer ResNet reduces top-1 error by 3.5% compared to the 34-layer plain net -> Effectiveness of residual learning on deep systems
- 18-layer ResNet converges faster than 18-layer plain net -> ResNet eases optimization by faster convergence at early stage
Identity vs. Projection Shortcuts
- (option A) All Identity mapping shortcuts and zero-padding are used for increasing dim
- (option B) Projection shortcuts are used for increasing dim & Others are Identity mapping
- (option C) All Projection shortcuts
- All 3 options are better than plain counterpart
- B is slightly better than A : zero-padded dims have no residual learning
- C (All Projection) is marginally better than B, but differences among the 3 options are very small -> projection shortcuts are not essential
- Identity shortcuts are mainly used for not increasing complexity of Bottleneck architecture!
Deeper Bottleneck Architectures
- Structure : A stack of 3 layers (1x1 → 3x3 → 1x1) instead of 2 layers
- 1x1 conv Bottleneck layers for deeper nets (50+)
- For reducing and then increasing(restoring) dimensions
- For leaving the 3x3 layer as a bottleneck with smaller in/output dimensions
- Identity shortcuts : parameter-free -> more efficient
- Results
- 152-layer ResNet still has lower complexity than VGGNet-16/19
- Deeper(50/101/152-layer) ResNets are more accurate than shallower(34-layer) ResNet -> Solving degradation problem & great acc gains from increased depth
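A quick weight count illustrates why the bottleneck design is economical. This is a sketch under stated assumptions: biases and BN parameters are ignored, and the channel widths (64-d basic block, 256-d bottleneck with a 64-d inner layer) are chosen for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


# Basic block: two 3x3 convs on 64 channels.
basic_64 = 2 * conv_params(3, 64, 64)

# Bottleneck block: 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 restore (64 -> 256).
bottleneck_256 = (conv_params(1, 256, 64)
                  + conv_params(3, 64, 64)
                  + conv_params(1, 64, 256))

# Two 3x3 convs applied directly on 256 channels, for comparison.
basic_256 = 2 * conv_params(3, 256, 256)

print(basic_64, bottleneck_256, basic_256)
```

Under these assumptions the 256-d bottleneck block has roughly the same number of weights as the 64-d basic block, and far fewer than two 3x3 layers on 256 channels. A projection shortcut on the 256-d ends would add further weights, which is why identity shortcuts matter here.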
Comparisons with SOTA
- 152-layer ResNet single model outperforms all previous ensemble results
- Ensemble 6 models of different depth : 3.57% top-5 error -> 1st place in ILSVRC 2015
4.2. CIFAR-10 and Analysis
- Dataset : CIFAR-10 dataset (10 classes / 45K train + 5K val + 10K test)
- Network architectures
- Network input : 32x32 imgs
- Total (6n+2) stacked weighted layers
- 1st layer : 3x3 conv layer
- A stack of 6n 3x3 conv layers on feature maps of sizes {32, 16, 8} with 2n layers for each
- GAP -> 10-way fc layer -> softmax
- Shortcut Connections
- connected to the pairs of 3x3 layers (totally 3n shortcuts)
- (option A) All Identity shortcuts
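The depth formula above can be checked with a couple of lines (function names are hypothetical): 1 initial conv + 6n conv layers + 1 fc layer gives 6n+2 weighted layers, and each pair of 3x3 layers gets one shortcut.

```python
def cifar_resnet_depth(n):
    """Weighted layers: first 3x3 conv + 6n stacked 3x3 convs + final fc."""
    return 6 * n + 2


def cifar_resnet_shortcuts(n):
    """One shortcut per pair of 3x3 layers: 6n / 2 = 3n."""
    return 3 * n


depths = {n: cifar_resnet_depth(n) for n in (3, 5, 7, 9, 18, 200)}
```

For example, n=18 gives the 110-layer net and n=200 the 1202-layer net.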
(1) Training
- Data Pre-processing
- Data augmentation : 4 pixels are padded on each side
- Random 32x32 crop sampled from the padded img or its horizontal flip
- Train Details
- Mini-batch size : 128 on 2 GPUs
- Weight decay : 0.0001
- Momentum : 0.9
- Learning rate : 0.1 -> divided by 10 at 32k and 48k iterations ; training terminated at 64k iterations
- Weight initialization
- Batch Normalization
- No dropout
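The step schedule can be written as a small function. This is a sketch assuming, as in the ResNet paper, that the learning rate is divided by 10 at 32k and 48k iterations and that training terminates at 64k; `cifar_lr` is a hypothetical name.

```python
def cifar_lr(iteration, base_lr=0.1, milestones=(32000, 48000), total=64000):
    """Piecewise-constant learning rate: divide by 10 at each milestone."""
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training run")
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr / (10 ** drops)


schedule = [cifar_lr(i) for i in (0, 31999, 32000, 48000, 63999)]
```

The warm-up used for the 110-layer net (starting at 0.01 before returning to 0.1) is not modelled here.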
(2) Testing
- Only eval the single view of the original 32x32 img
(3) Results
- Similar to ImageNet cases
- 110-layer ResNet (n=18)
- initial lr = 0.01 to warm up -> go back to 0.1 and continue training
- Converges well & Fewer params than other deep networks (FitNet, Highway, ....)
Analysis of Layer Responses
- ResNets have generally smaller responses than plain counterparts
- Residual functions might be generally closer to zero (F=0) than non-residual functions
- Deeper ResNet has smaller magnitudes of responses
Exploring Over 1000 layers
- 1202-layer ResNet (n=200)
- No optimization difficulty & Training error < 0.1%
- BUT : small dataset + unnecessarily deep network -> overfitting (test error worse than the 110-layer ResNet)
- Using no strong regularization(maxout/dropout) -> Just simple regularization via deep and thin arch.
4.3 Object Detection on PASCAL and MS COCO
- Good generalization performance on other recognition tasks(detection, localization, segmentation)
Code Review
[CV_Action Recognition] Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition
Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition (2s-AGCN)
- Previous model (ST-GCN) : modeling human body skeleton as spatiotemporal graphs
- Topology of graph is set manually and fixed over all layers and samples
- Proposed model (2s-AGCN) : modeling both 1st-order (joint) and 2nd-order (bone) information simultaneously
- Topology of graph is uniformly or individually learned in E2E
- More informative by using hierarchical GCN and diverse samples
- Result : flexibility & generality ↑ ⇒ better than SOTA
Paper Review
1. Introduction
- Disadvantages of ST-GCN
- (1) Topology is fixed over all layers ⇒ lacking flexibility to model multilevel semantic information
- (2) Feature vector attached to each joint only contains 1st-order info (2D or 3D coordinates)
- (3) Skeleton graph is heuristically predefined and represents only physical structure of body
⇒ Hard to capture dependencies between joints that are far apart (e.g. the two hands)
- (4) One fixed graph structure is not optimal for all samples of different actions
⇒ Connection strength between hands and head differs between 'touching head' and 'jumping up'
- Contributions of 2s-AGCN
- (1) Adaptively learn topology of graph for different layers and samples in E2E
- (2) Feature vector pointing from source joint to target joint contains 2nd-order info (lengths and directions of bones)
⇒ 2nd-order info is formulated and combined with 1st-order info using two-stream framework
- (3) SOTA on two large-scale datasets
- Two types of graphs in 2s-AGCN (Data-driven method)
- Global graph : for common pattern for all the data
- Individual graph : for unique pattern for each data
- Both are optimized individually for different layers
3. Graph Convolution Networks
3.1. Graph construction
- Raw data in one frame : sequence of vectors (each vector = coordinates of joint)
- A complete action consists of multiple frames ; the number of frames differs across samples
- Following structure of ST-GCN (spatiotemporal graph to model structured information) #36
![image](https://private-user-images.githubusercontent.com/83633885/254138092-92f17c53-4878-465e-9705-733936fc37ec.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxMzgwOTItOTJmMTdjNTMtNDg3OC00NjVlLTk3MDUtNzMzOTM2ZmMzN2VjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTExZWEzZWU0YWRkODNkOTBjODAwNjYzOTU1N2MwZDgwYTMzYTIxNTJiZjczZTcxYzkzN2MzNzU0OGViYjM0NjMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.E9GdU80Rb0HZEYZug9ohpZc3XvNJr6rUhQJfNxU8L4Q)
3.2. Graph convolution
- Configs
- Multiple layers of ST-GCN to extract high-level features
- GAP layer & Softmax classifier to predict action categories
- Graph convolution operation on vertex $v_i$ in spatial dimension
![image](https://private-user-images.githubusercontent.com/83633885/254139813-820b1ccd-4682-4d49-9996-e6fb3e1ce35a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxMzk4MTMtODIwYjFjY2QtNDY4Mi00ZDQ5LTk5OTYtZTZmYjNlMWNlMzVhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQ0OWQ5MjhiZWI3MDkyYjI5ZjI1MWM3MmM5N2M2NzdkNzExOTlmMDk4MjYwZTJkMDUyZTU2YWZhMDc0NTI2ZjEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.R6y2ntedUPsl51yM46o4qvlytxl0NT5BXsgj4bF5fGU)
- $B_i$ : sampling area enclosed by curve (1-distance neighbor vertexes) → # of vertexes in $B_i$ varies
- Kernel size 3 : $B_i$ divided into 3 subsets
- $S_{i1}$ : vertex itself (red circle)
- $S_{i2}$ : centripetal subset (green circle) ; closer to center of gravity
- $S_{i3}$ : centrifugal subset (blue circle) ; farther from center of gravity
- $Z_{ij}$ : cardinality of $S_{ik}$ for balancing contribution of each subset
- $w$ : weighting function based on input → # of weight vectors is fixed
- $l_i$ : mapping function (maps each vertex in $B_i$ to one of the fixed weight vectors)
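The three-subset partition can be sketched as below. Assumptions: each joint has a precomputed distance to the center of gravity, and a neighbor that ties with the root is placed in the centrifugal subset (the tie rule is a choice made for this sketch only). `partition_neighbors` and `subset_weights` are hypothetical names.

```python
def partition_neighbors(root, neighbors, dist_to_center):
    """Split the sampling area B_i of `root` into subsets S_i1, S_i2, S_i3."""
    subsets = {1: [root], 2: [], 3: []}
    for j in neighbors:
        if dist_to_center[j] < dist_to_center[root]:
            subsets[2].append(j)   # centripetal: closer to the center of gravity
        else:
            subsets[3].append(j)   # centrifugal: farther away (ties land here by assumption)
    return subsets


def subset_weights(subsets):
    """Normalisation 1 / Z: inverse cardinality of each non-empty subset."""
    return {k: 1.0 / len(v) for k, v in subsets.items() if v}


dist = {0: 0.0, 1: 1.0, 2: 2.0, 3: 2.5}
parts = partition_neighbors(root=1, neighbors=[0, 2, 3], dist_to_center=dist)
weights = subset_weights(parts)
```

Dividing by the subset cardinality keeps a subset with many vertexes from dominating the sum.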
3.3. Implementation
- [Spatial dimension] shape of feature map = $C$ x $T$ x $N$ tensor
![image](https://private-user-images.githubusercontent.com/83633885/254140726-ef892215-892d-464f-bf56-1bf503208d34.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxNDA3MjYtZWY4OTIyMTUtODkyZC00NjRmLWJmNTYtMWJmNTAzMjA4ZDM0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTVjZjQ2MThiNjI4OWQ1ZDAwYTUzNzFmNDQyODc0ZDU2YzJkMjk3Mzc0NjIzZTA5MGE5OTVlNjdmMzU1NGNkNDUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.H2uYh1lP3TK_ONXgj2IvM0vsvKVdAAnzZzHZs3cN82w)
- $C$ : # of channels, $T$ : temporal length, $N$ : # of vertexes
- $K_v$ : kernel size of spatial dimension (=3)
- $A_k$ : adjacency matrix ($N$ x $N$) → whether there are connections bw two vertexes
- $W_k$ : weight vector ($C_{out}$ x $C_{in}$ x 1 x 1)
- $M_k$ : attention map (mask) for importance of each vertex ($N$ x $N$) → strength of connections
- [Temporal dimension] neighbors for each vertex = fixed as 2 (two consecutive frames)
4. Two-Stream Adaptive Graph Convolutional Network
Adaptive Graph Convolutional Layer
![image](https://private-user-images.githubusercontent.com/83633885/254158794-99801cfa-dee0-4627-a469-895836503038.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxNTg3OTQtOTk4MDFjZmEtZGVlMC00NjI3LWE0NjktODk1ODM2NTAzMDM4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWJlNWE3ZjAxNGM3YjVhN2Y3MDY0YjZmNWQ1YjIxNDQzNjFiMjBiNWE2NDVlOTI2ZjFiM2M2MTcwNzA2NGUzNzYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.-tYwbLIx5_jBdBfL1E_G_eiZ7to3AX0jEOCEMdiStqU)
- Unique graph for different layers and samples (Flexibility)
- 1x1 Residual branch for matching channel dimension (Stability)
- Adaptive graph form
- $A_k$ : original normalized adjacency matrix ($N$ x $N$) → human body physical structure
- $B_k$ : trainable data-driven adjacency matrix ($N$ x $N$) → existence and strength of connections (attention)
- $C_k$ : normalized embedded Gaussian function ($θ$, $φ$) → similarity of two vertexes, normalized with softmax
![image](https://private-user-images.githubusercontent.com/83633885/254142253-3e641792-50f8-4308-a231-5e16fa52347e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxNDIyNTMtM2U2NDE3OTItNTBmOC00MzA4LWEyMzEtNWUxNmZhNTIzNDdlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI5NzI4YTlhNGE1NGEyMTg1MmI3YWQ4N2Y5M2Q1MGU3NGIwNTI1M2U3YTA0ZWQxNjIxZDMxZDU5ODA0M2RiYWQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.F0EVAcCecXOhRi2Je4Ewsbo2scj2-xveGxnFnI9QGbc)
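A minimal sketch of how the three graph terms could be combined, assuming the embeddings $θ$ and $φ$ have already been applied (each vertex is a short feature vector), that similarity is a dot product, and that softmax is taken row-wise. The normalisation axis and all function names are assumptions of this sketch, not taken from the authors' code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_graph(theta, phi):
    """C: N x N matrix of softmax-normalised dot products between embedded vertexes."""
    return [softmax([sum(a * b for a, b in zip(t_i, p_j)) for p_j in phi])
            for t_i in theta]


def adaptive_adjacency(A, B, C):
    """Element-wise sum A + B + C used in place of the fixed adjacency."""
    return [[a + b + c for a, b, c in zip(ra, rb, rc)]
            for ra, rb, rc in zip(A, B, C)]


theta = [[1.0, 0.0], [0.0, 1.0]]    # embedded features of 2 vertexes
phi = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.0, 1.0], [1.0, 0.0]]        # fixed physical connection
B = [[0.0, 0.0], [0.0, 0.0]]        # learnable graph at its initial value (assumed zero)
C = similarity_graph(theta, phi)
adj = adaptive_adjacency(A, B, C)
```

Because B and C are added rather than multiplied, connections that are absent in A can still receive a non-zero weight.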
Adaptive Graph Convolutional Block
![image](https://private-user-images.githubusercontent.com/83633885/254142344-a30a6451-0e1c-42d9-908a-a75ca894a320.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxNDIzNDQtYTMwYTY0NTEtMGUxYy00MmQ5LTkwOGEtYTc1Y2E4OTRhMzIwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTdkYWNmYThiN2VkZmVlMmU2MTg1YzNjNDg4NzcwNzc3Njg4YTkwZWU1ZTM4MmU0MjQ5OGEyNzU5ODFjYTY4N2QmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.eTT_I2uK2cfIaaqM1pALHyBwDTxL-GXxBgAwcB3_TsA)
- Convs : spatial GCN
- Convt : temporal GCN
- BN (Batch Normalization), ReLU, Dropout(0.5)
- Residual connection for each block
Adaptive Graph Convolutional Network
![image](https://private-user-images.githubusercontent.com/83633885/254142424-fc219b3c-a94d-4b2b-af3c-c9c6318a8102.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxNDI0MjQtZmMyMTliM2MtYTk0ZC00YjJiLWFmM2MtYzljNjMxOGE4MTAyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ5YTM4MzJiZGZkMWIwMGVhMGVkN2NkODFhYzM1ZmMxOGUyZjhlYWIyZGI3ZWUwZjQ4YzA1ZWJkNThkOGQ0OGMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.g2Ds0Dp_zwGcl2XVpBrD7Ho-ma2WbVGUXSw80CThSjU)
- AGCN = stack of 9 basic blocks (output channels : 64, 64, 64, 128, 128, 128, 256, 256, 256)
- BN at beginning to normalize input data
- GAP at end to pool feature maps of different samples to same size
- Softmax classifier to obtain final output prediction
Two-stream networks
![image](https://private-user-images.githubusercontent.com/83633885/254142520-19565110-e759-4c52-be55-ec715573aff7.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTYyMTM3NjYsIm5iZiI6MTcxNjIxMzQ2NiwicGF0aCI6Ii84MzYzMzg4NS8yNTQxNDI1MjAtMTk1NjUxMTAtZTc1OS00YzUyLWJlNTUtZWM3MTU1NzNhZmY3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIwVDEzNTc0NlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTMxZjU5NDYwYzk2NWVlNTMyMmVmM2FiOGE3M2I4YmE4Y2Y5YWUzODkwZTMwZWQxOTgxNDM0NmMyMjg0NWI2MmEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.XPaUIxv9jKgAtjC9iVqKwNt0ZC2k3cZO5kfnZHtIyU4)
- J-stream (1st, Joint information)
- B-stream (2nd, Bone information)
- $v_1$ : Source joint (close to center)
- $v_2$ : Target joint (far away from center)
- $e_{v_1, v_2}$ = $v_2$ - $v_1$ = ($x_2-x_1$, $y_2-y_1$, $z_2-z_1$) : Bone vector (length & direction information)
- Steps
- 1st. Calculate bone data based on joint data
- 2nd. Feed joint data and bone data into the J-stream and B-stream, respectively
- 3rd. Fuse the softmax scores of the two streams and predict the final action
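The bone computation and score fusion can be sketched as follows. Joints are assumed to be given as 3D coordinates keyed by joint id, bones as (source, target) pairs, and fusion as a plain sum of the two softmax score vectors. Function names are hypothetical.

```python
def bone_vectors(joints, bone_pairs):
    """Bone vector = target joint - source joint, per (source, target) pair."""
    return {(s, t): tuple(b - a for a, b in zip(joints[s], joints[t]))
            for s, t in bone_pairs}


def fuse_scores(joint_scores, bone_scores):
    """Add the softmax scores of J-stream and B-stream, then take the argmax."""
    fused = [j + b for j, b in zip(joint_scores, bone_scores)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return predicted, fused


joints = {1: (0.0, 0.0, 0.0), 2: (1.0, 2.0, 3.0)}
bones = bone_vectors(joints, [(1, 2)])
label, fused = fuse_scores([0.6, 0.4], [0.2, 0.8])
```

In this toy example the J-stream alone would pick class 0, while the fused score picks class 1.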
5. Experiments
5.1. Datasets
5.2. Training details
5.3. Ablation Study
- Block
- Visualization
- Two-stream framework
- Comparison with SOTA
- Conclusion
Code Review
[CV_Segmentation] Fully Convolutional Networks for Semantic Segmentation
Abstract
- Convnet : Powerful visual models
- Fully Convnet (FCN)
- Input of arbitrary size -> Producing corresponding-sized Output => Spatially dense prediction
- Use contemporary classification nets (AlexNet, VGGnet, GoogLeNet) into FCN
- Transfer learned representations by fine-tuning for segmentation task
- Skip arch : Semantic info (from deep, coarse layer) + Appearance info (from shallow, fine layer)
- SOTA segmentation of PASCAL VOC, NYUDv2, SIFT Flow (inference less than 1/5 sec for an img)
1. Introduction
(1) Prior methods
- CNN : whole classification / local tasks (bbox object detection, part and key-point prediction, local correspondence)
- CNN for Semantic segmentation : (Prior approach) labeling each pixel with class of enclosing object (dense prediction for every pixel rather than one prediction for the whole image)
- Patchwise training : lack efficiency of fully convolutional training, high computation
- Pixelwise training : high computation (pooling X), hierarchical feature X
- Use of pre/post-processing (ex. superpixels, proposals, post-hoc refinement by random fields or local classifiers)
- Applying small convnets without supervised pre-training
(2) Semantic segmentation ... 4.2 Combining what and where
- inherent tension(trade-off) bw Semantics (What / Global / Coarse) + Location (Where / Local / Fine)
- => FCN alleviates this problem with the Skip architecture
(3) FCN
- Architecture : Encoder CNN + Decoder (=upsampling) --> Segmentation task
- Convolutionalization (FC->Conv) : Any size input → same size (spatial dimension) dense output
- Upsampling layer (Deconv layer) : Pixelwise prediction and learning with subsampled pooling
- Skip architecture : Semantic info from deep, coarse layer + Appearance info from shallow, fine layer
- Training / Test
- End-to-end training
- Supervised pre-training by fine-tuned CNNs (AlexNet, VGG, GoogLeNet) into FCN
- Performed whole img at a time by dense feedforward computation and backpropagation
- No use of pre/post-processing
2. Related work
Fully Convolutional Networks
- FCN for Detection
- Matan : extending convnet to any sized inputs (LeNet to recognize 1d strings)
- Wolf and Platt : expanding convnet outputs to 2d (four corners of postal address blocks)
- Ning : convnet for coarse multiclass Segmentation
- He : feature extractor (proposals + spatial pooling) hybrid model -> no end-to-end
- Sliding window detection, Semantic segmentation, image restoration
Dense prediction with convnets
- Semantic segmentation, boundary prediction, hybrid convnet/nearest neighbor model, image restoration, depth estimation
- Common machinery elements of the above approaches : patchwise training / post-processing / input shifting and output interlacing / multi-scale pyramid processing / saturating tanh nonlinearities / ensemble / ...
3. Fully convolutional networks
- Each layer in a Convnet : h x w x d (h x w : spatial dim, d : feature or color channel dim)
- Locations in the input img of 1st layer <---> Corresponding Locations of higher layers = Receptive field
- Translation invariance
- Output stays the same even when the position of the input changes (invariant to location)
- CNN basic components(convolution, pooling, activation function) operate on local input regions & depend only on relative spatial coordinates
- CNN -> FCN
- When receptive fields overlap significantly with conv filter (ex. stride=1), feedforward computation and backprop are more efficient with layer-by-layer over entire img instead of patch-by-patch (individual CNNs)
- Produce coarse output maps
- -> Need to connect these coarse outputs back to pixels for pixelwise (dense) prediction (... BY Skip arch in FCN)
3.1 Adapting classifiers for dense prediction [Encoder] ... 4.1 From classifier to dense FCN
- Typical CNN : Fixed sized inputs & Non-spatial outputs (no spatial coordinates)
- Fixed sized inputs : FC layers require a fixed input size (# of neurons)
- Non-spatial outputs (no spatial coordinates) : FC layers flatten the feature map to 1D
- FCN : Any sized inputs & Spatial outputs (heatmap)
- Computation is highly amortized over the overlapping regions of patches
- Segmentation with AlexNet (with FC layers) requires fixed-size input patches
- Segmentation with FCN accepts any input size, so no patch-by-patch processing is needed -> faster
- Both backward + forward : Straightforward -> Computation efficiency of convolution
- Output dimensions are reduced by Subsampling (ex. adjusting stride)
- Subsampling to keep filters small (3x3) & computational requirements reasonable
- Coarsen output of FCN
3.2 Shift-and-stitch is filter rarefaction ( x )
- coarse outputs -> dense prediction BY stitching output from shifted versions of input
- trick for shift-and-stitch
- setting lower layer input stride 1 -> upsampling its output by factor of input stride s
- Rarefied filter : keep the original filter value where the indices are divisible by the stride s, 0 otherwise
- however, not same result as shift-and-stitch
3.3 Upsampling is backwards strided convolution ( O ) ... 4.2 Combining what and where
- (Bilinear) Interpolation
- linear mapping that depends only on relative positions -> fixed value
- Fill in unknown values based on known values
- Backward strided convolution (Deconvolution)
- Reverse forward and backward passes of convolution
- Upsampling can be performed end-to-end learning by backprop from pixelwise loss
- Deconv filter need not be fixed, can be learned!
- A stack of Deconv and Activation func -> Nonlinear upsampling
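A deconvolution filter that reproduces bilinear interpolation can be built as below. This is a sketch of a commonly used construction for an integer upsampling factor, not necessarily the exact initialisation used by the authors; `bilinear_kernel` is a hypothetical name.

```python
def bilinear_kernel(factor):
    """2D bilinear upsampling kernel for an integer upsampling factor.

    The kernel is the outer product of a 1D triangle filter and is meant to be
    used as the weights of a deconvolution with stride = factor.
    """
    size = 2 * factor - factor % 2
    # Center of the triangle: an integer tap for odd sizes, between taps for even sizes.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    tri = [1 - abs(i - center) / factor for i in range(size)]
    return [[a * b for b in tri] for a in tri]


k2 = bilinear_kernel(2)   # 4 x 4 kernel for 2x upsampling
```

Such a kernel can either stay fixed (pure interpolation) or serve as the starting point of a learned deconvolution layer.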
3.4 Patchwise training is loss sampling ( x )
- Sampling Patchwise training ( x )
- can correct class imbalance BUT reduces spatial correlation of dense patches
- Drawback : heavily overlapping patches add redundant computation
- No faster or better convergence observed
- FCN (Whole image training) ( O )
- also can correct class imbalance by weighting the loss AND address spatial correlation
- more effective and efficient
- Experiment result
- Sampling : No significant effect on convergence rate, but significantly more time due to large # of imgs per batch --> Unsampled, whole img training !
4. Segmentation Architecture
- Skip architecture between layers to fuse coarse, semantic information with local, appearance information
- Investigation : PASCAL VOC 2011
4.1 From classifier to dense FCN
- Used CNN models : AlexNet, VGG16, GoogLeNet
- Discarding the final classifier layer, converting all FC layers to Conv, and appending a 1x1 conv with channel dim 21 to predict PASCAL class scores (incl. background)
- Upsampling coarse outputs by deconvolution to dense outputs
- Result : FCN-VGG16 (SOTA) >> FCN-GoogLeNet (similar classification acc with VGG16)
4.2 Combining what and where
- Fully convolutionalized classifiers : can be fine-tuned to segmentation
- BUT the output is dissatisfyingly coarse (limits the scale of detail in the upsampled output)
- The more pooling layers the features pass through, the more information is lost -> upsampling from them is inaccurate (fine objects are especially hard to capture)
- FCN + Skip Arch : combine final prediction layer + lower layers with finer strides
4.3 Experimental framework
- Optimization
- Optimizer : SGD
- Momentum = 0.9
- Weight decay = 5 x 10^-4 or 2 x 10^-4
- Mini-batch size of 20 imgs
- Fixed lr = 10^-3, 10^-4, 5 x 10^-5 for FCN-AlexNet, FCN-VGG16, FCN-GoogLeNet
- Zero initialization class scoring layer
- Dropout : same with original CNN
- Fine-tuning
- Fine-tune all layers by backprop through whole net
- Fine-tune the output classifier layer alone : 70% of full fine-tuning performance
- Scratch training : not feasible
- Dense Prediction (Upsampling)
- Upsampling by Deconvolution layers within the net
- Final deconv layer : deconv filters are fixed to bilinear interpolation
- Intermediate deconv layers : are initialized to bilinear upsampling and learned
- Augmentation
- Randomly mirroring (Horizontal Flip)
- Jittering by translating up to 32 pixels (the coarsest scale of prediction)
- Result : No noticeable improvement
- More Training data
- PASCAL VOC 2011 segmentation training set (labels for 1112 imgs) + 8498 labels by Hariharan
- Result : improve FCN-VGG16 validation score by 3.4 points to 59.4 mean IU
5. Results
- Metric : pixel accuracy, mean accuracy, mean IU, frequency weighted IU
- FCN-8s on PASCAL VOC 2011 and 2012
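The four metrics can be computed from a confusion matrix as sketched below, where `conf[i][j]` is the number of pixels of class i predicted as class j. The sketch assumes every class occurs at least once in the ground truth (otherwise a division by zero occurs); `segmentation_metrics` is a hypothetical name.

```python
def segmentation_metrics(conf):
    """Return pixel accuracy, mean accuracy, mean IU and frequency weighted IU."""
    n_cl = len(conf)
    t = [sum(row) for row in conf]                                       # pixels of class i
    pred = [sum(conf[i][j] for i in range(n_cl)) for j in range(n_cl)]   # pixels predicted as j
    diag = [conf[i][i] for i in range(n_cl)]                             # correctly labelled pixels
    total = sum(t)
    iu = [d / (ti + pi - d) for d, ti, pi in zip(diag, t, pred)]
    return {
        "pixel_acc": sum(diag) / total,
        "mean_acc": sum(d / ti for d, ti in zip(diag, t)) / n_cl,
        "mean_iu": sum(iu) / n_cl,
        "fw_iu": sum(ti * u for ti, u in zip(t, iu)) / total,
    }


metrics = segmentation_metrics([[3, 1],
                                [1, 5]])
```

Mean IU penalises both false positives and false negatives per class, which is why it is stricter than pixel accuracy on imbalanced data.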
[CV_3D] MVSNet: Depth Inference for Unstructured Multi-view Stereo
MVSNet: Depth Inference for Unstructured Multi-view Stereo
Paper Review
Abstract
- MVSNet : E2E DL model for depth map inference from multi-view imgs
- (1) Extract deep visual img features
- (2) Build 3D Cost Volume upon reference camera frustum via differentiable homography warping
- (3) Apply 3D conv to regularize and regress initial Depth Map → Refine with reference img
- Variance-based metric that maps multiple features into one cost feature → can handle arbitrary N-view inputs
- Experiments
- DTU dataset : outperforms SOTA & faster in runtime → benchmarking
- T&T (Tanks and Temples) dataset : ranks first without fine-tuning → strong generalization
Introduction
- Multi-View Stereo (MVS) : estimating dense representation from overlapping imgs
- Traditional methods
- How : using hand-crafted similarity metrics & engineered regularizations
- Limitation : dense matching intractable for global semantic information (ex. low-textured, specular, reflective region) → incomplete reconstruction
- Learnable CNN-based methods for 2-view stereo matching
- Addresses the global semantic information problem
- How : in the 2-view case, image pairs can be rectified beforehand, so horizontal pixel-wise disparity can be estimated without camera params
- Limitation : in MVS, input imgs may have arbitrary camera geometry, which makes learning-based methods hard to apply
- Learnable CNN-based methods for MVS recon
- Rarely attempted, because the limitation above makes MVS a poor fit for CNNs
- Ex. SurfaceNet using CVC (Color Voxel Cubes), LSM (Learned Stereo Machine)
- Limitation : volumetric representation of regular grids → huge memory consumption of 3D volumes makes it hard to scale up the network (long time required OR only for synthetic objects in low volume resolution)
- MVSNet
- How : computing one depth map at each time (not whole 3D scene at once)
- Input : one reference img and several source imgs → to infer depth map for reference img
- Key insight : Differentiable homography warping operation
- to encode camera geometries implicitly to build 3D Cost Volumes from 2D img features
- Next step : Multi-scale 3D conv
- to regularize and regress initial Depth Map → Refine with reference img
- Major differences
- 3D Cost Volume is built upon camera frustum instead of regular Euclidean space
- Decoupled MVS recon to smaller problems of per-view depth map estimation → large-scale recon possible!
Related work
MVS Reconstruction
(Categorized by output representation)
- Direct Point Cloud recon : operates directly on 3D points → sequential propagation makes it hard to fully parallelize and slow
- Volumetric recon : divides 3D space into regular grids, then estimates whether each voxel adheres to the surface → space discretization error, high memory consumption
- Depth map recon : decouples recon into small per-view estimation problems that focus on only one reference img and a few source imgs + easily fused into PC or Volumetric recon
Learned Stereo
DL models started to replace traditional stereo methods!
- Pair-wise patch matching
- DL network to match two img patches
- Learned features for stereo matching and semi-global matching(SGM) for post-processing
- Cost regularization
- SGMNet, CNN-CRF, GCNet
- GCNet (SOTA) : E2E model that regularizes the cost volume with 3D CNN and regresses disparity
Learned MVS
Fewer attempts ...
- Multi-patch similarity (new metric for MVS)
- SurfaceNet : computes cost volume with sophisticated voxel-wise view selection → regularizes with 3D CNN and infers surface voxels
- LSM : camera parameters are encoded as projection for cost volume → 3D CNN classifies whether a voxel belongs to the surface
- But, both methods are limited to small-scale recon due to the volumetric representation
MVSNet
(1) Image Feature Extraction
- Goal : To extract deep features $F$ of the N input imgs $I$
- 2D Network : 8-layer 2D CNN
- layer = Conv + BN + ReLU except for last layer
- layer 1,2 & 4,5 : extract higher-level representation
- layer 3 & layer 6 : s=2 → divide feature towers into 3 scales (original input size, 1/2, 1/4)
- Output : N 32-channel feature maps downsized by 4 in each dim
- Original neighboring information of each remaining pixel is already encoded in the 32-channel pixel descriptor → dense matching does not lose useful context information
- Ablation study : dense matching on extracted feature maps gives much better recon quality than dense matching on original imgs
class UniNetDS2(Network):
    """Simple UniNet, as described in the paper."""
    def setup(self):
        print('2D with 32 filters')
        base_filter = 8
        (self.feed('data')
         .conv_bn(3, base_filter, 1, center=True, scale=True, name='conv0_0')
         .conv_bn(3, base_filter, 1, center=True, scale=True, name='conv0_1')
         .conv_bn(5, base_filter * 2, 2, center=True, scale=True, name='conv1_0')
         .conv_bn(3, base_filter * 2, 1, center=True, scale=True, name='conv1_1')
         .conv_bn(3, base_filter * 2, 1, center=True, scale=True, name='conv1_2')
         .conv_bn(5, base_filter * 4, 2, center=True, scale=True, name='conv2_0')
         .conv_bn(3, base_filter * 4, 1, center=True, scale=True, name='conv2_1')
         .conv(3, base_filter * 4, 1, biased=False, relu=False, name='conv2_2'))

### model.py -> def inference
# image feature extraction
if is_master_gpu:
    ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
else:
    ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
view_towers = []
for view in range(1, FLAGS.view_num):
    view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
    view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
    view_towers.append(view_tower)
(2) Cost Volume
- Goal : To build 3D Cost Volume from extracted feature maps and input cameras
- How : build the cost volume upon the reference camera frustum instead of dividing the space into a regular grid
- Notations
- $I_1$ : reference img → $F_1$ : reference feature map
- $I_i$ (i=2~N) : source imgs → $F_i$ : feature map
- $\lbrace K_i, R_i, t_i \rbrace$ (i=1~N) : camera intrinsics, rotations, translations
- $n_1$ : principal axis of reference camera
- Differentiable Homography
- Warping all feature maps $F$ → N feature volumes $V$ (by different fronto-parallel planes of reference camera)
- Coordinate mapping from warped $V_i(d)$ to $F_i$ at depth $d$ by planar transformation $x'$ ~ $H_i(d) \cdot x$
- ~ : projective equality
- $H_i(d)$ : 3x3 Homography matrix bw i-th feature map $F_i$ and reference feature map $F_1$ at depth $d$
- ⇔ Classical plane sweeping stereo + Differentiable bilinear interpolation to sample pixels from feature maps (not from imgs)
- Differentiable Warping operation : connects 2D feature extraction and 3D regularization network → E2E depth map inference !
Cost Metric : Variance-based Metric
- Notations
- $W$ (img width), $H$ (img height), $D$ (depth sample #), $F$ (feature map channel #)
- Feature volume size : $V$ = $W$/4 * $H$/4 * $D$ * $F$
- $\overline{V_i}$ : Average volume of all feature volumes
- Mapping : N feature volumes $V_i$ → 1 cost volume $C$
- Matching cost
- Traditional MVS methods : pairwise costs bw refer img and all src imgs in heuristic way
- MVSNet : all views contribute equally to matching cost & no preference to refer img
- Mean vs Variance
- Prior research using Mean operation : infer multi-patch similarity with additional pre- and post- CNN layers
- MVSNet using Variance operation : measure multi-view feature difference explicitly
### model.py -> def inference
# build cost volume by differentiable homography
with tf.name_scope('cost_volume_homography'):
    depth_costs = []
    for d in range(depth_num):
        # compute cost (variance metric): accumulate sum and sum of squares over views
        ave_feature = ref_tower.get_output()
        ave_feature2 = tf.square(ref_tower.get_output())
        for view in range(0, FLAGS.view_num - 1):
            homography = tf.slice(view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
            homography = tf.squeeze(homography, axis=1)
            warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
            ave_feature = ave_feature + warped_view_feature
            ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
        ave_feature = ave_feature / FLAGS.view_num
        ave_feature2 = ave_feature2 / FLAGS.view_num
        # variance = E[V^2] - (E[V])^2
        cost = ave_feature2 - tf.square(ave_feature)
        depth_costs.append(cost)
    cost_volume = tf.stack(depth_costs, axis=1)
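The snippet above accumulates a running sum and a running sum of squares instead of first computing the mean. A small pure-Python check of why this works, for the N per-view values at a single (pixel, depth, channel) location; function names are hypothetical:

```python
def variance_cost(features):
    """Variance metric: mean squared deviation from the average over N views."""
    n = len(features)
    mean = sum(features) / n
    return sum((f - mean) ** 2 for f in features) / n


def variance_cost_streaming(features):
    """Same quantity via E[V^2] - (E[V])^2, using running sums only."""
    n = len(features)
    s = sum(features)
    s2 = sum(f * f for f in features)
    return s2 / n - (s / n) ** 2


views = [1.0, 2.0, 4.0]   # reference feature followed by two warped source features
direct = variance_cost(views)
streaming = variance_cost_streaming(views)
```

The streaming form needs only two accumulators per location, so it works for any number of views N without storing all warped feature volumes. A cost of 0 means all views agree at that depth hypothesis.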
Cost Volume Regularization
- What : raw Cost volume $C$ → regularized Probability volume $P$
- Why : $C$ is computed from img features, so it may be noise-contaminated → needs to be combined with smoothness constraints
- How : Multi-scale 3D CNN (4-scale network)
- ≒ 3D UNet encoder-decoder structure (aggregating neighboring information from large receptive field)
- +) To reduce computation : # of channels reduced (32→8) and # of conv layers per scale reduced (3→2)
- Output : 1-channel volume → softmax operation along depth direction for probability normalization
- Usages : per-pixel depth estimation, measuring estimation confidence
=> determining recon quality by probability distribution, outlier filtering
class RegNetUS0(Network):
    """network for regularizing 3D cost volume in an encoder-decoder style. Keeping original size."""
    def setup(self):
        print('Shallow 3D UNet with 8 channel input')
        base_filter = 8
        # encoder: three stride-2 3D convs (downsampling path)
        (self.feed('data')
         .conv_bn(3, base_filter * 2, 2, center=True, scale=True, name='3dconv1_0')
         .conv_bn(3, base_filter * 4, 2, center=True, scale=True, name='3dconv2_0')
         .conv_bn(3, base_filter * 8, 2, center=True, scale=True, name='3dconv3_0'))
        # one stride-1 3D conv per scale (used as skip connections)
        (self.feed('data')
         .conv_bn(3, base_filter, 1, center=True, scale=True, name='3dconv0_1'))
        (self.feed('3dconv1_0')
         .conv_bn(3, base_filter * 2, 1, center=True, scale=True, name='3dconv1_1'))
        (self.feed('3dconv2_0')
         .conv_bn(3, base_filter * 4, 1, center=True, scale=True, name='3dconv2_1'))
        # decoder: deconv + additive skip connection at each scale
        (self.feed('3dconv3_0')
         .conv_bn(3, base_filter * 8, 1, center=True, scale=True, name='3dconv3_1')
         .deconv_bn(3, base_filter * 4, 2, center=True, scale=True, name='3dconv4_0'))
        (self.feed('3dconv4_0', '3dconv2_1')
         .add(name='3dconv4_1')
         .deconv_bn(3, base_filter * 2, 2, center=True, scale=True, name='3dconv5_0'))
        (self.feed('3dconv5_0', '3dconv1_1')
         .add(name='3dconv5_1')
         .deconv_bn(3, base_filter, 2, center=True, scale=True, name='3dconv6_0'))
        # final 1-channel output volume (no bias, no ReLU)
        (self.feed('3dconv6_0', '3dconv0_1')
         .add(name='3dconv6_1')
         .conv(3, 1, 1, biased=False, relu=False, name='3dconv6_2'))
(3) Depth Map
- What : regularized Probability volume $P$ → inferred Depth map $D$
- How : Expectation value along depth direction = Probability-weighted sum over all depth hypotheses
- = Soft argmin → fully differentiable operation that approximates the argmax result
- $D = \sum_{d=d_{min}}^{d_{max}} d \times P(d)$
- $P(d)$ : probability estimation for all pixels at depth $d$
- $d$ : depth hypothesis uniformly sampled within [$d_{min}$, $d_{max}$]
- Output : depth map (same size as 2D img feature maps = 1/4 size of input img)
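The expectation can be sketched for a single pixel in plain Python. The function names are illustrative; as in the repository excerpt further down, the softmax is taken over the negated regularized cost so that a low cost maps to a high probability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_argmin_depth(costs, depth_start, depth_interval):
    """Expected depth at one pixel.

    costs: regularized cost for each depth hypothesis (lower = better match).
    Returns (expected_depth, probabilities along the depth direction).
    """
    probs = softmax([-c for c in costs])
    depths = [depth_start + i * depth_interval for i in range(len(costs))]
    expected_depth = sum(d * p for d, p in zip(depths, probs))
    return expected_depth, probs
```

Unlike a hard argmax, the result varies smoothly with the costs and can fall between two hypotheses, which is what gives sub-interval depth estimates.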
Probability Map
- Why (Observation) : The multi-scale 3D CNN can regularize the probability toward a single-modal distribution, but for falsely matched pixels the distribution is scattered and cannot concentrate on one peak
- Definition : Quality of depth estimation $\hat{d}$ = probability that the GT depth lies within a small range around the estimation
- How : Probability sum over the 4 nearest depth hypotheses to measure estimation quality
- → Effect : better depth map filtering, outlier filtering
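A per-pixel sketch of the 4-nearest-hypotheses sum, written in plain Python. It mirrors the index logic of `get_propability_map` (floor, floor − 1, ceil, ceil + 1, each clipped to the valid range); the function name and list-based input are assumptions for illustration.

```python
import math


def estimation_confidence(probs, depth, depth_start, depth_interval):
    """Sum the probabilities of the 4 depth hypotheses nearest to `depth`.

    probs: probabilities along the depth direction at one pixel.
    depth: estimated depth at that pixel (e.g. from soft argmin).
    """
    last = len(probs) - 1

    def clip(index):
        return max(0, min(last, index))

    position = (depth - depth_start) / depth_interval  # fractional hypothesis index
    left0 = clip(math.floor(position))
    left1 = clip(left0 - 1)
    right0 = clip(math.ceil(position))
    right1 = clip(right0 + 1)
    return probs[left0] + probs[left1] + probs[right0] + probs[right1]
```

A peaked distribution yields a value near 1, a scattered one a much lower value. As in the excerpt, when the fractional index is an exact integer, floor and ceil coincide and that bin is counted twice.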
def get_propability_map(cv, depth_map, depth_start, depth_interval):
    """ get probability map from cost volume """

    def _repeat_(x, num_repeats):
        """ repeat each element num_repeats times """
        x = tf.reshape(x, [-1])
        ones = tf.ones((1, num_repeats), dtype='int32')
        x = tf.reshape(x, shape=(-1, 1))
        x = tf.matmul(x, ones)
        return tf.reshape(x, [-1])

    shape = tf.shape(depth_map)
    batch_size = shape[0]
    height = shape[1]
    width = shape[2]
    depth = tf.shape(cv)[1]

    # byx coordinate, batched & flattened
    b_coordinates = tf.range(batch_size)
    y_coordinates = tf.range(height)
    x_coordinates = tf.range(width)
    b_coordinates, y_coordinates, x_coordinates = tf.meshgrid(b_coordinates, y_coordinates, x_coordinates)
    b_coordinates = _repeat_(b_coordinates, batch_size)
    y_coordinates = _repeat_(y_coordinates, batch_size)
    x_coordinates = _repeat_(x_coordinates, batch_size)

    # d coordinate (floored and ceiled), batched & flattened
    d_coordinates = tf.reshape((depth_map - depth_start) / depth_interval, [-1])
    d_coordinates_left0 = tf.clip_by_value(tf.cast(tf.floor(d_coordinates), 'int32'), 0, depth - 1)
    d_coordinates_left1 = tf.clip_by_value(d_coordinates_left0 - 1, 0, depth - 1)
    d_coordinates1_right0 = tf.clip_by_value(tf.cast(tf.ceil(d_coordinates), 'int32'), 0, depth - 1)
    d_coordinates1_right1 = tf.clip_by_value(d_coordinates1_right0 + 1, 0, depth - 1)

    # voxel coordinates
    voxel_coordinates_left0 = tf.stack(
        [b_coordinates, d_coordinates_left0, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_left1 = tf.stack(
        [b_coordinates, d_coordinates_left1, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_right0 = tf.stack(
        [b_coordinates, d_coordinates1_right0, y_coordinates, x_coordinates], axis=1)
    voxel_coordinates_right1 = tf.stack(
        [b_coordinates, d_coordinates1_right1, y_coordinates, x_coordinates], axis=1)

    # get probability image by gathering the 4 nearest hypotheses and summing them
    prob_map_left0 = tf.gather_nd(cv, voxel_coordinates_left0)
    prob_map_left1 = tf.gather_nd(cv, voxel_coordinates_left1)
    prob_map_right0 = tf.gather_nd(cv, voxel_coordinates_right0)
    prob_map_right1 = tf.gather_nd(cv, voxel_coordinates_right1)
    prob_map = prob_map_left0 + prob_map_left1 + prob_map_right0 + prob_map_right1
    prob_map = tf.reshape(prob_map, [batch_size, height, width, 1])

    return prob_map
### model.py -> def inference
# depth map by softArgmin
with tf.name_scope('soft_arg_min'):
    # probability volume by soft max (cost is negated: low cost → high probability)
    probability_volume = tf.nn.softmax(
        tf.scalar_mul(-1, filtered_cost_volume), axis=1, name='prob_volume')
    # depth image by soft argmin
    volume_shape = tf.shape(probability_volume)
    soft_2d = []
    for i in range(FLAGS.batch_size):
        soft_1d = tf.linspace(depth_start[i], depth_end[i], tf.cast(depth_num, tf.int32))
        soft_2d.append(soft_1d)
    soft_2d = tf.reshape(tf.stack(soft_2d, axis=0), [volume_shape[0], volume_shape[1], 1, 1])
    soft_4d = tf.tile(soft_2d, [1, 1, volume_shape[2], volume_shape[3]])
    estimated_depth_map = tf.reduce_sum(soft_4d * probability_volume, axis=1)
    estimated_depth_map = tf.expand_dims(estimated_depth_map, axis=3)

# probability map
prob_map = get_propability_map(probability_volume, estimated_depth_map, depth_start, depth_interval)

return estimated_depth_map, prob_map  # filtered_depth_map, probability_volume
Depth Map Refinement
- Why : Reconstruction boundaries may be oversmoothed because of the large receptive field
- How : The reference img contains boundary information, so it is used as guidance for refinement
- MVSNet + Depth residual learning network
- Pre-scaling of initial depth magnitude to [0, 1] → converted back after refinement (prevents bias toward a certain depth scale)
- Input : Initial depth map & resized reference img, concatenated as a 4-channel input
- → Passed through three 32-channel 2D conv layers and one 1-channel conv layer to learn the depth residual
- Last layer : No BN layer and no ReLU, so that negative residuals can be learned
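The pre-scaling and residual steps can be sketched for one pixel as follows. This is only an illustration in plain Python: `residual_fn` is a hypothetical stand-in for the learned refinement network, and the function name is an assumption.

```python
def refine_depth(init_depth, depth_start, depth_end, residual_fn):
    """Normalize depth to [0, 1], add a predicted residual, then scale back.

    residual_fn: callable taking the normalized depth and returning a
    (possibly negative) residual in normalized units. It stands in for
    the refinement network, which also sees the reference image.
    """
    depth_scale = depth_end - depth_start
    norm_depth = (init_depth - depth_start) / depth_scale
    refined_norm_depth = norm_depth + residual_fn(norm_depth)
    return refined_norm_depth * depth_scale + depth_start
```

Working in normalized units keeps the residual magnitude comparable across scenes with different depth ranges, which is the stated reason for the pre-scaling.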
class RefineNet(Network):
    """network for depth map refinement using original image."""
    def setup(self):
        # 3-channel image + 1-channel depth → 4-channel input
        (self.feed('color_image', 'depth_image')
         .concat(axis=3, name='concat_image'))
        (self.feed('concat_image')
         .conv_bn(3, 32, 1, name='refine_conv0')
         .conv_bn(3, 32, 1, name='refine_conv1')
         .conv_bn(3, 32, 1, name='refine_conv2')
         .conv(3, 1, 1, relu=False, name='refine_conv3'))
        # add the learned residual back onto the initial depth
        (self.feed('refine_conv3', 'depth_image')
         .add(name='refined_depth_image'))

## model.py
def depth_refine(init_depth_map, image, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ refine depth image with the image """
    # normalization parameters
    depth_shape = tf.shape(init_depth_map)
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval
    depth_start_mat = tf.tile(tf.reshape(
        depth_start, [depth_shape[0], 1, 1, 1]), [1, depth_shape[1], depth_shape[2], 1])
    depth_end_mat = tf.tile(tf.reshape(
        depth_end, [depth_shape[0], 1, 1, 1]), [1, depth_shape[1], depth_shape[2], 1])
    depth_scale_mat = depth_end_mat - depth_start_mat

    # normalize depth map (to 0~1)
    init_norm_depth_map = tf.div(init_depth_map - depth_start_mat, depth_scale_mat)

    # resize normalized image to the same size of depth image
    resized_image = tf.image.resize_bilinear(image, [depth_shape[1], depth_shape[2]])

    # refinement network
    if is_master_gpu:
        norm_depth_tower = RefineNet({'color_image': resized_image, 'depth_image': init_norm_depth_map},
                                     is_training=True, reuse=False)
    else:
        norm_depth_tower = RefineNet({'color_image': resized_image, 'depth_image': init_norm_depth_map},
                                     is_training=True, reuse=True)
    norm_depth_map = norm_depth_tower.get_output()

    # denormalize depth map
    refined_depth_map = tf.multiply(norm_depth_map, depth_scale_mat) + depth_start_mat

    return refined_depth_map
Loss Function
- Losses for both the initial and the refined depth maps are considered
- Mean absolute difference between GT and estimated depth map
- Only pixels with valid GT depth labels are considered (not the whole img)
- $Loss = \sum_{p \in p_{valid}} \|d(p) - \hat{d_i}(p)\|_1 + \lambda \cdot \|d(p) - \hat{d_r}(p)\|_1$
- Notations
- $p_{valid}$ : set of valid GT pixels
- $d(p)$ : GT depth value of pixel $p$
- $\hat{d_i}(p)$ : Initial depth estimation
- $\hat{d_r}(p)$ : Refined depth map estimation
- $\lambda$ = 1.0
def non_zero_mean_absolute_diff(y_true, y_pred, interval):
    """ non zero mean absolute loss for one batch """
    with tf.name_scope('MAE'):
        shape = tf.shape(y_pred)
        interval = tf.reshape(interval, [shape[0]])
        # pixels whose GT depth is 0 are treated as invalid and masked out
        mask_true = tf.cast(tf.not_equal(y_true, 0.0), dtype='float32')
        denom = tf.reduce_sum(mask_true, axis=[1, 2, 3]) + 1e-7
        masked_abs_error = tf.abs(mask_true * (y_true - y_pred))        # 4D
        masked_mae = tf.reduce_sum(masked_abs_error, axis=[1, 2, 3])    # 1D
        masked_mae = tf.reduce_sum((masked_mae / interval) / denom)     # 1
    return masked_mae

def mvsnet_regression_loss(estimated_depth_image, depth_image, depth_interval):
    """ compute loss and accuracy """
    # non zero mean absolute loss
    masked_mae = non_zero_mean_absolute_diff(depth_image, estimated_depth_image, depth_interval)
    # less one accuracy
    less_one_accuracy = less_one_percentage(depth_image, estimated_depth_image, depth_interval)
    # less three accuracy
    less_three_accuracy = less_three_percentage(depth_image, estimated_depth_image, depth_interval)

    return masked_mae, less_one_accuracy, less_three_accuracy
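A minimal plain-Python sketch of the combined loss on a flat list of pixels. It assumes, as in the excerpt above, that a GT depth of 0 marks an invalid pixel and that the error is averaged over valid pixels; the excerpt additionally divides by the depth interval, which is left out here. Function names are illustrative.

```python
def masked_mae(gt_depths, est_depths):
    """Mean absolute error over pixels with a valid (non-zero) GT depth."""
    valid_pairs = [(g, e) for g, e in zip(gt_depths, est_depths) if g != 0.0]
    if not valid_pairs:
        return 0.0
    return sum(abs(g - e) for g, e in valid_pairs) / len(valid_pairs)


def total_loss(gt_depths, initial_depths, refined_depths, lam=1.0):
    """Loss on the initial depth map plus lam times the loss on the refined one."""
    return masked_mae(gt_depths, initial_depths) + lam * masked_mae(gt_depths, refined_depths)
```

For example, with GT `[0, 500, 600]`, the first pixel is ignored no matter how far off the estimates are there.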
Implementations
Training
Data Preparation
- DTU dataset (GT point clouds with normal information) + generated GT depth maps
- DTU dataset : large-scale MVS dataset containing 100+ scenes with different lighting conditions
- Point cloud with normal information → Mesh by SPSR → Depth maps by rendering mesh to each viewpoint
- SPSR(screened Poisson surface reconstruction) : depth-of-tree = 11 (to acquire high quality mesh result)
- Mesh trimming-factor = 9.5 (to alleviate mesh artifacts)
- 49 imgs with 7 different lighting conditions for each scan => Total # of training samples : 27097
View Selection
- Training img : Reference img + 2 Source imgs
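- Imgs are downsized in feature extraction and again by the 4-scale encoder-decoder in 3D regularization, so the input size must be divisible by 32 → Downsize img resolution 1600x1200 to 800x600 → Crop img patch with W=640, H=512 from center => since the img resolution changed, the input camera parameters are changed accordingly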
- Depth hypotheses are uniformly sampled from [425mm ~ 935mm] with 2mm resolution
- Environment : TensorFlow, Tesla P100
- 100,000 iterations
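Two pieces of arithmetic from the training setup can be checked with a short sketch: the number of depth hypotheses implied by the sampling range, and how pinhole intrinsics change under a resize followed by a center crop. The function names are hypothetical, and the intrinsics update assumes a simple convention in which focal lengths and the principal point scale linearly with the image and shift by the crop offset.

```python
def num_depth_hypotheses(d_min, d_max, resolution):
    """Number of uniformly spaced depth samples in [d_min, d_max], endpoints included."""
    return int(round((d_max - d_min) / resolution)) + 1


def scale_and_center_crop_intrinsics(fx, fy, cx, cy, scale, src_size, crop_size):
    """Update pinhole intrinsics after resizing by `scale` and center-cropping.

    src_size: (width, height) before resizing; crop_size: (width, height) after cropping.
    """
    src_w, src_h = src_size
    crop_w, crop_h = crop_size
    # resizing scales focal lengths and principal point alike
    fx, fy, cx, cy = fx * scale, fy * scale, cx * scale, cy * scale
    # a center crop shifts the principal point by the removed margin
    offset_x = (src_w * scale - crop_w) / 2.0
    offset_y = (src_h * scale - crop_h) / 2.0
    return fx, fy, cx - offset_x, cy - offset_y
```

With the range [425mm, 935mm] at 2mm resolution this gives 256 hypotheses, and a principal point at the center of the 1600x1200 img stays at the center of the 640x512 crop.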
Post-processing
Depth Map Filter
- Goal : To filter out outliers in background and occluded areas before converting depth values to dense point clouds
- Criteria : Photometric consistency & Geometric consistency
- Photometric consistency : measuring matching quality
- (Experiment) Pixels with probability lower than 0.8 = Outliers
- Geometric consistency : measuring depth consistency among multiple views
- Project a reference pixel $p_1$ to another view through its depth $d_1$, then reproject it back to the reference view through that view's depth estimate; the reprojected coordinate $p_{reproj}$ and depth $d_{reproj}$ must satisfy $|p_{reproj} - p_1| < 1$ and $|d_{reproj} - d_1| / d_1 < 0.01$
- (Experiment) All depths should be at least 3-view consistent
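The two criteria can be combined into a per-pixel keep/discard decision. The sketch below is plain Python and assumes the reprojection itself has already been done elsewhere; the function names are hypothetical, the 1-pixel and 1% thresholds are assumed defaults, and counting `min_views` over the other views (reference excluded) is also an assumption.

```python
import math


def is_two_view_consistent(p_ref, d_ref, p_reproj, d_reproj,
                           max_pixel_err=1.0, max_rel_depth_err=0.01):
    """Check reprojection error and relative depth error against thresholds."""
    pixel_err = math.hypot(p_reproj[0] - p_ref[0], p_reproj[1] - p_ref[1])
    rel_depth_err = abs(d_reproj - d_ref) / d_ref
    return pixel_err < max_pixel_err and rel_depth_err < max_rel_depth_err


def keep_depth(probability, p_ref, d_ref, reprojections, min_prob=0.8, min_views=3):
    """Keep a reference-view depth only if it passes both consistency checks.

    reprojections: list of (p_reproj, d_reproj) pairs, one per other view.
    """
    if probability < min_prob:  # photometric consistency
        return False
    consistent_views = sum(
        1 for p_reproj, d_reproj in reprojections
        if is_two_view_consistent(p_ref, d_ref, p_reproj, d_reproj))
    return consistent_views >= min_views  # geometric consistency
```

A pixel with high probability but too few agreeing views is still discarded, which is what removes confident but occluded estimates.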
Depth Map Fusion
- Goal : To integrate depth maps from different views into a unified point cloud representation
- Visibility-based fusion → minimizes depth occlusions and violations
- The visible views for each pixel are selected in the filtering step, and all reprojected depths are averaged → suppresses recon noise
- The fused depth maps are reprojected into 3D space to generate the point cloud
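A simplified per-pixel sketch of the two fusion steps: averaging the depths from visible views, then lifting the fused depth to a 3D point. It is plain Python under stated assumptions: the source-view depths are already reprojected into the reference view, visibility flags come from the filtering step, the camera is a zero-skew pinhole, and the function names are illustrative. The returned point is in reference-camera coordinates; a world-space point would additionally need the camera extrinsics.

```python
def fuse_depth(ref_depth, reprojected_depths, visible):
    """Average the reference depth with the reprojected depths of visible views."""
    depths = [ref_depth] + [d for d, is_visible in zip(reprojected_depths, visible) if is_visible]
    return sum(depths) / len(depths)


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return x, y, depth
```

Views flagged as not visible (for example an occluded view reporting a much larger depth) do not pull the average away from the consistent estimates.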
Experiments
Benchmarking on DTU dataset
Generalization on T&T dataset
Ablations
- View Number
- Image Features
- Cost Metric
- Depth Refinement
Conclusion
- MVSNet : E2E DL network that takes unstructured imgs as input and estimates the depth map for the reference img
- Core contribution of MVSNet : To encode camera parameters as differentiable homography to build the cost volume upon the camera frustum → connects 2D feature extraction and 3D cost regularization
- Results : Outperforms previous methods on DTU & efficient in speed / SOTA on T&T without fine-tuning → generalization ability
Code Review
## model.py
def inference(images, cams, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ infer depth image from multi-view images and cameras """

    # dynamic gpu params
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                        depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)

    # build cost volume by differentiable homography
    with tf.name_scope('cost_volume_homography'):
        depth_costs = []
        for d in range(depth_num):
            # compute cost (variance metric)
            ave_feature = ref_tower.get_output()
            ave_feature2 = tf.square(ref_tower.get_output())
            for view in range(0, FLAGS.view_num - 1):
                homography = tf.slice(view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
                # warped_view_feature = homography_warping(view_towers[view].get_output(), homography)
                warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
                ave_feature = ave_feature + warped_view_feature
                ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
            ave_feature = ave_feature / FLAGS.view_num
            ave_feature2 = ave_feature2 / FLAGS.view_num
            cost = ave_feature2 - tf.square(ave_feature)
            depth_costs.append(cost)
        cost_volume = tf.stack(depth_costs, axis=1)

    # filtered cost volume, size of (B, D, H, W, 1)
    if is_master_gpu:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=False)
    else:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=True)
    filtered_cost_volume = tf.squeeze(filtered_cost_volume_tower.get_output(), axis=-1)

    # depth map by softArgmin
    with tf.name_scope('soft_arg_min'):
        # probability volume by soft max
        probability_volume = tf.nn.softmax(
            tf.scalar_mul(-1, filtered_cost_volume), axis=1, name='prob_volume')
        # depth image by soft argmin
        volume_shape = tf.shape(probability_volume)
        soft_2d = []
        for i in range(FLAGS.batch_size):
            soft_1d = tf.linspace(depth_start[i], depth_end[i], tf.cast(depth_num, tf.int32))
            soft_2d.append(soft_1d)
        soft_2d = tf.reshape(tf.stack(soft_2d, axis=0), [volume_shape[0], volume_shape[1], 1, 1])
        soft_4d = tf.tile(soft_2d, [1, 1, volume_shape[2], volume_shape[3]])
        estimated_depth_map = tf.reduce_sum(soft_4d * probability_volume, axis=1)
        estimated_depth_map = tf.expand_dims(estimated_depth_map, axis=3)

    # probability map
    prob_map = get_propability_map(probability_volume, estimated_depth_map, depth_start, depth_interval)

    return estimated_depth_map, prob_map  # , filtered_depth_map, probability_volume
def inference_mem(images, cams, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ infer depth image from multi-view images and cameras """

    # dynamic gpu params
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval
    feature_c = 32
    feature_h = FLAGS.max_h / 4
    feature_w = FLAGS.max_w / 4

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    ref_feature = ref_tower.get_output()
    ref_feature2 = tf.square(ref_feature)

    view_features = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_features.append(view_tower.get_output())
    view_features = tf.stack(view_features, axis=0)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                        depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)
    view_homographies = tf.stack(view_homographies, axis=0)

    # build cost volume by differentiable homography
    with tf.name_scope('cost_volume_homography'):
        depth_costs = []
        for d in range(depth_num):
            # compute cost (standard deviation feature)
            ave_feature = tf.Variable(tf.zeros(
                [FLAGS.batch_size, feature_h, feature_w, feature_c]),
                name='ave', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
            ave_feature2 = tf.Variable(tf.zeros(
                [FLAGS.batch_size, feature_h, feature_w, feature_c]),
                name='ave2', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
            ave_feature = tf.assign(ave_feature, ref_feature)
            ave_feature2 = tf.assign(ave_feature2, ref_feature2)

            def body(view, ave_feature, ave_feature2):
                """Loop body."""
                homography = tf.slice(view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
                # warped_view_feature = homography_warping(view_features[view], homography)
                warped_view_feature = tf_transform_homography(view_features[view], homography)
                ave_feature = tf.assign_add(ave_feature, warped_view_feature)
                ave_feature2 = tf.assign_add(ave_feature2, tf.square(warped_view_feature))
                view = tf.add(view, 1)
                return view, ave_feature, ave_feature2

            view = tf.constant(0)
            cond = lambda view, *_: tf.less(view, FLAGS.view_num - 1)
            _, ave_feature, ave_feature2 = tf.while_loop(
                cond, body, [view, ave_feature, ave_feature2], back_prop=False, parallel_iterations=1)

            # ave_feature now holds E[x]^2, ave_feature2 holds E[x^2] - E[x]^2
            ave_feature = tf.assign(ave_feature, tf.square(ave_feature) / (FLAGS.view_num * FLAGS.view_num))
            ave_feature2 = tf.assign(ave_feature2, ave_feature2 / FLAGS.view_num - ave_feature)
            depth_costs.append(ave_feature2)
        cost_volume = tf.stack(depth_costs, axis=1)

    # filtered cost volume, size of (B, D, H, W, 1)
    if is_master_gpu:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=False)
    else:
        filtered_cost_volume_tower = RegNetUS0({'data': cost_volume}, is_training=True, reuse=True)
    filtered_cost_volume = tf.squeeze(filtered_cost_volume_tower.get_output(), axis=-1)

    # depth map by softArgmin
    with tf.name_scope('soft_arg_min'):
        # probability volume by soft max
        probability_volume = tf.nn.softmax(tf.scalar_mul(-1, filtered_cost_volume),
                                           axis=1, name='prob_volume')
        # depth image by soft argmin
        volume_shape = tf.shape(probability_volume)
        soft_2d = []
        for i in range(FLAGS.batch_size):
            soft_1d = tf.linspace(depth_start[i], depth_end[i], tf.cast(depth_num, tf.int32))
            soft_2d.append(soft_1d)
        soft_2d = tf.reshape(tf.stack(soft_2d, axis=0), [volume_shape[0], volume_shape[1], 1, 1])
        soft_4d = tf.tile(soft_2d, [1, 1, volume_shape[2], volume_shape[3]])
        estimated_depth_map = tf.reduce_sum(soft_4d * probability_volume, axis=1)
        estimated_depth_map = tf.expand_dims(estimated_depth_map, axis=3)

    # probability map
    prob_map = get_propability_map(probability_volume, estimated_depth_map, depth_start, depth_interval)

    # return filtered_depth_map,
    return estimated_depth_map, prob_map
def inference_prob_recurrent(images, cams, depth_num, depth_start, depth_interval, is_master_gpu=True):
    """ infer disparity image from stereo images and cameras """

    # dynamic gpu params
    depth_end = depth_start + (tf.cast(depth_num, tf.float32) - 1) * depth_interval

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                        depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)

    # gru unit
    gru1_filters = 16
    gru2_filters = 4
    gru3_filters = 2
    feature_shape = [FLAGS.batch_size, FLAGS.max_h/4, FLAGS.max_w/4, 32]
    gru_input_shape = [feature_shape[1], feature_shape[2]]
    state1 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru1_filters])
    state2 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru2_filters])
    state3 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru3_filters])
    conv_gru1 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru1_filters)
    conv_gru2 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru2_filters)
    conv_gru3 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru3_filters)

    exp_div = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], 1])
    soft_depth_map = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], 1])

    with tf.name_scope('cost_volume_homography'):
        # forward cost volume
        depth_costs = []
        for d in range(depth_num):
            # compute cost (variance metric)
            ave_feature = ref_tower.get_output()
            ave_feature2 = tf.square(ref_tower.get_output())
            for view in range(0, FLAGS.view_num - 1):
                homography = tf.slice(
                    view_homographies[view], begin=[0, d, 0, 0], size=[-1, 1, 3, 3])
                homography = tf.squeeze(homography, axis=1)
                # warped_view_feature = homography_warping(view_towers[view].get_output(), homography)
                warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
                ave_feature = ave_feature + warped_view_feature
                ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
            ave_feature = ave_feature / FLAGS.view_num
            ave_feature2 = ave_feature2 / FLAGS.view_num
            cost = ave_feature2 - tf.square(ave_feature)

            # gru
            reg_cost1, state1 = conv_gru1(-cost, state1, scope='conv_gru1')
            reg_cost2, state2 = conv_gru2(reg_cost1, state2, scope='conv_gru2')
            reg_cost3, state3 = conv_gru3(reg_cost2, state3, scope='conv_gru3')
            reg_cost = tf.layers.conv2d(
                reg_cost3, 1, 3, padding='same', reuse=tf.AUTO_REUSE, name='prob_conv')
            depth_costs.append(reg_cost)

        prob_volume = tf.stack(depth_costs, axis=1)
        prob_volume = tf.nn.softmax(prob_volume, axis=1, name='prob_volume')

    return prob_volume
def inference_winner_take_all(images, cams, depth_num, depth_start, depth_end,
                              is_master_gpu=True, reg_type='GRU', inverse_depth=False):
    """ infer disparity image from stereo images and cameras """

    if not inverse_depth:
        depth_interval = (depth_end - depth_start) / (tf.cast(depth_num, tf.float32) - 1)

    # reference image
    ref_image = tf.squeeze(tf.slice(images, [0, 0, 0, 0, 0], [-1, 1, -1, -1, 3]), axis=1)
    ref_cam = tf.squeeze(tf.slice(cams, [0, 0, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)

    # image feature extraction
    if is_master_gpu:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=False)
    else:
        ref_tower = UNetDS2GN({'data': ref_image}, is_training=True, reuse=True)
    view_towers = []
    for view in range(1, FLAGS.view_num):
        view_image = tf.squeeze(tf.slice(images, [0, view, 0, 0, 0], [-1, 1, -1, -1, -1]), axis=1)
        view_tower = UNetDS2GN({'data': view_image}, is_training=True, reuse=True)
        view_towers.append(view_tower)

    # get all homographies
    view_homographies = []
    for view in range(1, FLAGS.view_num):
        view_cam = tf.squeeze(tf.slice(cams, [0, view, 0, 0, 0], [-1, 1, 2, 4, 4]), axis=1)
        if inverse_depth:
            homographies = get_homographies_inv_depth(ref_cam, view_cam, depth_num=depth_num,
                                                      depth_start=depth_start, depth_end=depth_end)
        else:
            homographies = get_homographies(ref_cam, view_cam, depth_num=depth_num,
                                            depth_start=depth_start, depth_interval=depth_interval)
        view_homographies.append(homographies)

    # gru unit
    gru1_filters = 16
    gru2_filters = 4
    gru3_filters = 2
    feature_shape = [FLAGS.batch_size, FLAGS.max_h/4, FLAGS.max_w/4, 32]
    gru_input_shape = [feature_shape[1], feature_shape[2]]
    state1 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru1_filters])
    state2 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru2_filters])
    state3 = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], gru3_filters])
    conv_gru1 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru1_filters)
    conv_gru2 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru2_filters)
    conv_gru3 = ConvGRUCell(shape=gru_input_shape, kernel=[3, 3], filters=gru3_filters)

    # initialize variables
    exp_sum = tf.Variable(tf.zeros(
        [FLAGS.batch_size, feature_shape[1], feature_shape[2], 1]),
        name='exp_sum', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
    depth_image = tf.Variable(tf.zeros(
        [FLAGS.batch_size, feature_shape[1], feature_shape[2], 1]),
        name='depth_image', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
    max_prob_image = tf.Variable(tf.zeros(
        [FLAGS.batch_size, feature_shape[1], feature_shape[2], 1]),
        name='max_prob_image', trainable=False, collections=[tf.GraphKeys.LOCAL_VARIABLES])
    init_map = tf.zeros([FLAGS.batch_size, feature_shape[1], feature_shape[2], 1])

    # define winner take all loop
    def body(depth_index, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre):
        """Loop body."""

        # calculate cost
        ave_feature = ref_tower.get_output()
        ave_feature2 = tf.square(ref_tower.get_output())
        for view in range(0, FLAGS.view_num - 1):
            homographies = view_homographies[view]
            homographies = tf.transpose(homographies, perm=[1, 0, 2, 3])
            homography = homographies[depth_index]
            # warped_view_feature = homography_warping(view_towers[view].get_output(), homography)
            warped_view_feature = tf_transform_homography(view_towers[view].get_output(), homography)
            ave_feature = ave_feature + warped_view_feature
            ave_feature2 = ave_feature2 + tf.square(warped_view_feature)
        ave_feature = ave_feature / FLAGS.view_num
        ave_feature2 = ave_feature2 / FLAGS.view_num
        cost = ave_feature2 - tf.square(ave_feature)
        cost.set_shape([FLAGS.batch_size, feature_shape[1], feature_shape[2], 32])

        # gru
        reg_cost1, state1 = conv_gru1(-cost, state1, scope='conv_gru1')
        reg_cost2, state2 = conv_gru2(reg_cost1, state2, scope='conv_gru2')
        reg_cost3, state3 = conv_gru3(reg_cost2, state3, scope='conv_gru3')
        reg_cost = tf.layers.conv2d(
            reg_cost3, 1, 3, padding='same', reuse=tf.AUTO_REUSE, name='prob_conv')
        prob = tf.exp(reg_cost)

        # index
        d_idx = tf.cast(depth_index, tf.float32)
        if inverse_depth:
            inv_depth_start = tf.div(1.0, depth_start)
            inv_depth_end = tf.div(1.0, depth_end)
            inv_interval = (inv_depth_start - inv_depth_end) / (tf.cast(depth_num, 'float32') - 1)
            inv_depth = inv_depth_start - d_idx * inv_interval
            depth = tf.div(1.0, inv_depth)
        else:
            depth = depth_start + d_idx * depth_interval
        temp_depth_image = tf.reshape(depth, [FLAGS.batch_size, 1, 1, 1])
        temp_depth_image = tf.tile(
            temp_depth_image, [1, feature_shape[1], feature_shape[2], 1])

        # update the best
        update_flag_image = tf.cast(tf.less(max_prob_image, prob), dtype='float32')
        new_max_prob_image = update_flag_image * prob + (1 - update_flag_image) * max_prob_image
        new_depth_image = update_flag_image * temp_depth_image + (1 - update_flag_image) * depth_image
        max_prob_image = tf.assign(max_prob_image, new_max_prob_image)
        depth_image = tf.assign(depth_image, new_depth_image)

        # update counter
        exp_sum = tf.assign_add(exp_sum, prob)
        depth_index = tf.add(depth_index, incre)

        return depth_index, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre

    # run forward loop
    exp_sum = tf.assign(exp_sum, init_map)
    depth_image = tf.assign(depth_image, init_map)
    max_prob_image = tf.assign(max_prob_image, init_map)
    depth_index = tf.constant(0)
    incre = tf.constant(1)
    cond = lambda depth_index, *_: tf.less(depth_index, depth_num)
    _, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre = tf.while_loop(
        cond, body,
        [depth_index, state1, state2, state3, depth_image, max_prob_image, exp_sum, incre],
        back_prop=False, parallel_iterations=1)

    # get output
    forward_exp_sum = exp_sum + 1e-7
    forward_depth_map = depth_image
    return forward_depth_map, max_prob_image / forward_exp_sum
Reference
[CV_Pose Estimation] Deep High-Resolution Representation Learning for Human Pose Estimation
Deep High-Resolution Representation Learning for Human Pose Estimation
Basic
- Trade off : Global information vs High-resolution(Original size)
- Global information (Receptive field ↑) -> Low resolution -> Up-sampling ↑ -> Pixel-wise prediction ↓
- Need : Learning both Global + Local Feature & Recovering High-resolution
1. Introduction
- Most existing methods
- Recover high-resolution from low-resolution
- By high-to-low resolution network connected in Series
- ex) Hourglass, SimpleBaseline, Dilated conv
- High-Resolution Net (HR-net)
- Maintain high-resolution through Whole process
- First stage : a high-resolution subnetwork --> Next stage : Gradually add high-to-low resolution subnetworks
- Repeated Multi-scale fusions across Parallel multi-resolution subnetworks : each resolution repeatedly receives information from the other parallel representations
- Result : rich high-resolution representations -> more accurate and spatially precise heatmap
- Dataset : COCO keypoint detection dataset, MPII Human Pose dataset, PoseTrack dataset
2. Related Work
- Traditional solutions to single-pose estimation : probabilistic graphical model, pictorial structure model
- Present mainstream methods by DNN : Regressing keypoint positions & Estimating keypoint Heatmaps
- Regressing (x, y) : ex) (2013) DeepPose : Human Pose Estimation via Deep Neural Networks
- Estimating Heatmap [loc = (x, y)] : ex) (2015) Efficient Object Localization Using Convolutional Networks
- Most CNN for keypoint heatmap
- consist of subnetwork similar to classification network
- input --> a regressor estimating heatmaps
- main body : high-to-low and low-to-high framework, augmented with multi-scale fusion + intermediate supervision
- (a) Hourglass : symmetric low-to-high and high-to-low
- (b) Cascade pyramid networks
- (c) SimpleBaseline : Transposed conv for low-to-high
- (d) Combination with Dilated conv
2.1. High-to-low and Low-to-high
- Symmetric high-to-low and low-to-high
- Heavy high-to-low (classification network = strided conv or pooling) and Light low-to-high (bilinear-upsampling or transposed conv)
- Combination with Dilated conv
- Bad for Small object or Detail spatial information -> Bad for Pixel-wise prediction
- Serialization of network : Local, Global feature extraction and learning rely excessively on Up-sampling
2.2. Multi-scale fusion
- (a), (b) : skip-connections bw same-resolution layers of h-t-l and l-t-h
- (a) Hourglass : Feeding multi-resolution imgs separately into multiple networks and Aggregating output map
- (b) Cascaded pyramid network : GlobalNet + RefineNet (right part, for combining features)
2.3. Intermediate supervision
- For helping deep networks training and improving heatmap estimation quality
- ex) Hourglass, conv pose machine approach : intermediate heatmaps as (part of) input of remaining subnetwork
HR-net
- High-to-low subnetworks in Parallel + Fusing multi-scale representations
- No intermediate supervision
- Result : superior in detection accuracy + efficient in computation complexity and params
3. Approach
- Human pose estimation Task : detecting locations of K keypoints or parts from img I (W x H x 3)
- SOTA methods : estimating K heatmaps of size W' x H', {H_1, H_2, ..., H_K}, H_k : location confidence of kth keypoint
- HR-net : using CNN consisting 3 parts
- Two strided conv decreasing resolution
- Main body outputting feature maps with same resolution as its input feature maps
- Regressor estimating heatmaps where keypoint positions are chosen and transformed to full resolution
3.1. Sequential multi-resolution subnetworks
- Existing networks : connecting high-to-low resolution subnetworks in Series
- Sequence of subnetworks + down-sample layer to halve resolution
- N_sr : subnetwork (s : s-th stage, r : resolution index) -> resolution : 1/2^(r-1) of first subnetwork
- ex) High-to-low network : N_11 -> N_22 -> N_33 -> N_44
3.2. Parallel multi-resolution subnetworks
3.3. Repeated multi-scale fusion
- Exchange units (Fusion) across parallel subnetworks
- Input : X = {X_1, X_2, ..., X_s}
- Output : Y = {Y_1, Y_2, ..., Y_s}, whose sizes are same to inputs
- Each output is an aggregation of input maps : Y_k = ∑ a(X_i, k), i=1, ..., s
- Extra output maps : Y_(s+1) = a(Y_s, s+1)
- Function : a(X_i, k) : Up-sampling or Down-sampling X_i from resolution i to k
- Down-sampling(halve) : strided 3x3 conv (Stride = 2, Padding = 1)
- Up-sampling(double) : simple nearest neighbor sampling following a 1x1 conv
3.4. Heatmap estimation
- Regressing heatmaps from high-resolution output by Last exchange unit
- Loss function : MSE
- GT heatmaps : 2D Gaussian with standard deviation of 1 pixel, centered on the GT location of each keypoint
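A minimal sketch of the heatmap target and MSE loss described above, in plain Python. The function names `gaussian_heatmap` and `mse`, the unnormalized Gaussian (peak value 1), and the 48 x 64 heatmap size are assumptions for illustration, not taken from the HRNet code.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Return a height x width grid holding an unnormalized 2D Gaussian centered at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two equally sized 2D grids."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
            count += 1
    return total / count


# GT heatmap for one keypoint at (x=20, y=30) on a 48 x 64 grid (assumed size)
gt = gaussian_heatmap(width=48, height=64, cx=20, cy=30, sigma=1.0)
zero_pred = [[0.0] * 48 for _ in range(64)]
loss = mse(zero_pred, gt)  # loss of an all-zero prediction against the GT heatmap
```

In training, one such heatmap would be built per keypoint and the loss averaged over all K heatmaps.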
3.5. Network instantiation
- ResNet to distribute depth to each stage and # of channels to each resolution
- Main body : HR-net : 4 stages with 4 parallel subnetworks
- Resolution is gradually decreased (halved) -> Width (# of channels) is increased (doubled)
- 1st stage : 4 Residual units
- each unit is formed by a bottleneck with width 64, followed by one 3x3 conv reducing width of feature maps to C
- 2, 3, 4th stages : 1, 4, 3 Exchange blocks -> Totally 8 Exchange blocks (-> 8 multi-scale fusions)
- one Exchange block contains 4 Residual units (each unit contains two 3x3 conv in each resolution) and an Exchange unit across resolutions
- Experiments : HRNet-W32 (small net), HRNet-W48 (big net)
- 32 and 48 : widths(C) of high-resolution subnetworks in last 3 stages
- HRNet-W32 = 64, 128, 256, 32, 32, 32
- HRNet-W48 = 96, 192, 384, 48, 48, 48
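The resolution/width rule above (resolution halves and width doubles per branch) can be sketched as a small helper. The name `branch_layout` and the `stem_stride=4` default (from the two stride-2 convs before the main body) are assumptions for illustration.

```python
def branch_layout(base_width, input_hw, num_branches=4, stem_stride=4):
    """List channels and feature-map size for each parallel branch.

    Branch r (1-based) has 1/2^(r-1) of the first branch's resolution
    and 2^(r-1) times its width.
    """
    h, w = input_hw
    layout = []
    for r in range(1, num_branches + 1):
        factor = 2 ** (r - 1)
        layout.append({
            "branch": r,
            "channels": base_width * factor,
            "height": h // (stem_stride * factor),
            "width": w // (stem_stride * factor),
        })
    return layout


# HRNet-W32 on a 256 x 192 input (height x width)
w32 = branch_layout(32, (256, 192))
# HRNet-W48 on the same input
w48 = branch_layout(48, (256, 192))
```

Under these assumptions the W32 branches carry 32, 64, 128, 256 channels and the W48 branches 48, 96, 192, 384.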
4. Experiments
4.1. COCO Keypoint Detection
Dataset
- COCO dataset : 200K imgs, 250K person instances labeled with 17 Keypoints
- COCO train2017 dataset : 57K imgs + 150K person instances
- COCO val2017 : 5K imgs
- COCO test-dev2017 set : 20K imgs
- [Annotation] 17 Keypoints : (x, y, z)
- x, y : (x,y), 2D img coordinate
- z : visibility flag (0 : not labeled / 1 : labeled but not visible / 2 : labeled and visible)
Evaluation metric
- Similarity Metric : OKS (Object Keypoint Similarity)
- d_i : Euclidean distance bw detected keypoint and GT
- v_i : visibility flag of GT
- s : object scale (diagonal length of bbox)
- k_i : per-keypoint constant that controls falloff
- OKS = 0(Worst) ~ 1(Best)
- Evaluation Metric : AP (Average Precision) : AP^50, AP^75, AP, AP^M, AP^L, AR
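A minimal sketch of OKS using the symbols listed above (d_i, v_i, s, k_i). The function name `oks` and the example constants are assumptions for illustration; the official COCO evaluation code should be used for real numbers.

```python
import math


def oks(pred, gt, visibility, scale, kappas):
    """Object Keypoint Similarity for one person.

    pred, gt   : lists of (x, y) keypoints
    visibility : GT visibility flags v_i (keypoints with v_i = 0 are ignored)
    scale      : object scale s
    kappas     : per-keypoint falloff constants k_i
    """
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visibility, kappas):
        if v <= 0:
            continue  # unlabeled keypoint does not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2.0 * scale ** 2 * k ** 2))
        den += 1
    return num / den if den else 0.0


gt_kpts = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred_kpts = [(13.0, 14.0), (20.0, 20.0), (99.0, 99.0)]  # last keypoint is unlabeled in GT
vis = [2, 1, 0]
kappas = [0.1, 0.1, 0.1]  # illustrative values, not the COCO constants
score = oks(pred_kpts, gt_kpts, vis, scale=50.0, kappas=kappas)
```

OKS plays the role that IoU plays in object detection: AP^50 and AP^75 count a prediction as correct when OKS is at least 0.50 or 0.75.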
Training
- Fixed Human detection box img (h : w = 4 : 3) ... ex) 256 x 192 or 384 x 288
- Data Augmentation : random rotation, random scale, flipping, half body data augmentation
- Adam optimizer
- lr scheduler : 1e-3 (base) -> 1e-4 (at 170th epoch) -> 1e-5 (at 200th epoch) -> training ends at 210 epochs
Testing
- Top-down : Detect person instance using person detector --> Predict detection keypoints
- person detectors : same with SimpleBaseline model
- Averaging heatmaps of original and flipped imgs
- Predicted keypoint location : Highest heatvalue location with a quarter offset
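The quarter-offset decoding can be sketched as follows. This is a simplified plain-Python version; the function name `decode_keypoint` and the choice to pick the shift direction from the sign of the neighboring-pixel difference are assumptions for illustration.

```python
def decode_keypoint(heatmap):
    """Return (x, y, peak) for one heatmap given as a 2D list [height][width].

    The peak location is shifted by a quarter pixel toward the higher
    neighbor along each axis.
    """
    height, width = len(heatmap), len(heatmap[0])
    best, px, py = heatmap[0][0], 0, 0
    for y in range(height):
        for x in range(width):
            if heatmap[y][x] > best:
                best, px, py = heatmap[y][x], x, y

    x_out, y_out = float(px), float(py)
    if 0 < px < width - 1:
        diff = heatmap[py][px + 1] - heatmap[py][px - 1]
        x_out += 0.25 * ((diff > 0) - (diff < 0))  # sign(diff) * 0.25
    if 0 < py < height - 1:
        diff = heatmap[py + 1][px] - heatmap[py - 1][px]
        y_out += 0.25 * ((diff > 0) - (diff < 0))
    return x_out, y_out, best


hm = [[0.0] * 5 for _ in range(5)]
hm[2][2] = 1.0  # peak
hm[2][3] = 0.6  # right neighbor is higher than the left one
hm[2][1] = 0.2
hm[1][2] = 0.5  # upper neighbor is higher than the lower one
hm[3][2] = 0.1
decoded = decode_keypoint(hm)
```

The decoded coordinates are in heatmap space and still have to be mapped back to the original image using the person box center and scale.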
Results on validation set
- [Red] AP : HRNet = 73.4 > Others
- [Red] #Params, GFLOPs : HRNet > CPN model
- [Red] #Params, GFLOPs : HRNet < SimpleBaseline model
- [Blue] Pre-trained model for ImageNet classification is better : 1.0 points ↑
- [Green] Width size ↑ (HRNet-W48) -> AP ↑ : 0.7, 0.5 ↑
- [Orange] Input size ↑ (384 x 288) -> AP ↑ : 1.4, 1.2 ↑
Results on test-dev set
- HR-net (Top-down) is better than Bottom-up methods
- HRNet-W32 : 74.9 AP > Other Top-down methods
- More efficient in model size (#Params) and computation complexity (GFLOPs)
- HRNet-W48 : highest 75.5 AP > SimpleBaseline
- +) Additional data from AI Challenger for training : best 77.0 AP
4.2. MPII Human Pose Estimation
Dataset
- MPII Human Pose dataset (real-world / full-body pose) : 25K imgs with 40K subjects
- 12K subjects for testing + remaining subjects for training
Training
- Same to MS COCO, except that input size is cropped to 256 x 256
Testing
- Same to MS COCO, except that using provided person boxes (instead of detected person boxes)
- six-scale pyramid testing procedure
Evaluation metric
- PCKh (head-normalized probability of correct keypoint) score -> PCKh@0.5 (α=0.5)
- Joint is correct if it falls within α * ℓ pixels of GT position
- α : constant
- ℓ : head size that corresponds to 60% of diagonal length of GT head bbox
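A minimal sketch of the PCKh rule above, assuming head boxes given as (x1, y1, x2, y2). The function name `pckh` and the sample coordinates are hypothetical.

```python
import math


def pckh(pred, gt, head_boxes, alpha=0.5, head_factor=0.6):
    """Fraction of joints within alpha * head_size of the GT position.

    pred, gt   : per-person lists of (x, y) joints
    head_boxes : per-person GT head box (x1, y1, x2, y2)
    head_size  : head_factor (60%) of the head box diagonal
    """
    correct, total = 0, 0
    for joints_p, joints_g, (x1, y1, x2, y2) in zip(pred, gt, head_boxes):
        head_size = head_factor * math.hypot(x2 - x1, y2 - y1)
        for (px, py), (gx, gy) in zip(joints_p, joints_g):
            total += 1
            if math.hypot(px - gx, py - gy) <= alpha * head_size:
                correct += 1
    return correct / total if total else 0.0


# Head box diagonal = 50 px -> head size = 30 px -> threshold = 15 px at alpha = 0.5
gt_joints = [[(100.0, 100.0), (200.0, 200.0)]]
pred_joints = [[(110.0, 100.0), (200.0, 220.0)]]  # errors of 10 px and 20 px
score = pckh(pred_joints, gt_joints, head_boxes=[(0.0, 0.0, 30.0, 40.0)])
```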
Results on test set
- HRNet-W32 : model size (#Params = 28.5M) ↓, computation complexity (GFLOPs = 9.5) ↓, 92.3 PCKh@0.5 ↑
- HRNet-W48 : same result 92.3 PCKh@0.5
4.3. Application to Pose Tracking
Dataset
- PoseTrack (articulated tracking in video provided by MPII Human Pose dataset) : 550 video seq with 66,374 frames
- video seq are split into 292(train) + 50(val) + 208(test)
- train : length ranges bw 41~151 frames / 30 frames from center of video are densely annotated
- val/test : 65~298 frames / 30 frames around keyframe are densely annotated + afterwards every fourth frame is annotated
Evaluation metric
- [1] Frame-wise Multi-person Pose Estimation : mAP (mean Average Precision)
- [2] Multi-person Pose Tracking : MOTA (multi-object tracking accuracy)
Training
- network : HRNet-W48 (pre-trained on COCO dataset) for single person pose estimation on PoseTrack2017 training set
- Input : Person box extracted from annotated keypoints in training frames by extending bbox of all keypoints by 15%
- Training setup, data aug : almost same as COCO except lr scheduler : 1e-4 -> 1e-5 (at 10th epoch) -> 1e-6 (at 15th epoch) -> training ends at 20 epochs
Testing
- 1) Person box Detection and Propagation
- Same detector in SimpleBaseline
- Propagating box into nearby frames by propagating predicted keypoints according to optical flows + NMS for removing redundant boxes
- 2) Human Pose Estimation
- Metric : OKS (Object Keypoint Similarity)
- 3) Pose Association across nearby frames
- Greedy matching algorithm to compute correspondence bw keypoints in nearby frames
Results on PoseTrack2017 test set
- HRNet-W48 : 74.9 mAP score, 57.9 MOTA score
4.4. Ablation Study
Repeated multi-scale fusion
- (a) Without Intermediate Exchange (1 fusion)
- (b) With only Across-stage Exchange (3 fusions)
- (c) With both Across-stage and Within-stage Exchange (8 fusions) = HR-Net
- All networks are trained from scratch
- Result on COCO val set : More fusions lead to better performance (AP : c>b>a)
Resolution maintenance
- HRNet-W32 : 73.4 AP > Variant : 72.5 AP
- Low-level features extracted from early stages over low-resolution subnetworks are less helpful
- Simple high-resolution without low-resolution parallel subnetworks shows lower performance
Representation resolution
5. Conclusion and Future Works
- Maintaining high resolution through whole process without need of recovering
- Fusing multi-resolution representations repeatedly
- Result : reliable high-resolution representations
Code
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import csv
import os
import shutil
from PIL import Image
import torch
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision
import cv2
import numpy as np
import time
import _init_paths
import models
from config import cfg
from config import update_config
from core.function import get_final_preds
from utils.transforms import get_affine_transform
COCO_KEYPOINT_INDEXES = {
0: 'nose',
1: 'left_eye',
2: 'right_eye',
3: 'left_ear',
4: 'right_ear',
5: 'left_shoulder',
6: 'right_shoulder',
7: 'left_elbow',
8: 'right_elbow',
9: 'left_wrist',
10: 'right_wrist',
11: 'left_hip',
12: 'right_hip',
13: 'left_knee',
14: 'right_knee',
15: 'left_ankle',
16: 'right_ankle'
}
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
SKELETON = [
[1,3],[1,0],[2,4],[2,0],[0,5],[0,6],[5,7],[7,9],[6,8],[8,10],[5,11],[6,12],[11,12],[11,13],[13,15],[12,14],[14,16]
]
CocoColors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0],
[0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255],
[170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]
NUM_KPTS = 17
CTX = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
def draw_pose(keypoints, img):
    """draw the keypoints and the skeletons.
    :params keypoints: the shape should be equal to [17,2]
    :params img:
    """
    assert keypoints.shape == (NUM_KPTS, 2)
    for i in range(len(SKELETON)):
        kpt_a, kpt_b = SKELETON[i][0], SKELETON[i][1]
        x_a, y_a = keypoints[kpt_a][0], keypoints[kpt_a][1]
        x_b, y_b = keypoints[kpt_b][0], keypoints[kpt_b][1]
        cv2.circle(img, (int(x_a), int(y_a)), 6, CocoColors[i], -1)
        cv2.circle(img, (int(x_b), int(y_b)), 6, CocoColors[i], -1)
        cv2.line(img, (int(x_a), int(y_a)), (int(x_b), int(y_b)), CocoColors[i], 2)


def draw_bbox(box, img):
    """draw the detected bounding box on the image.
    :param img:
    """
    cv2.rectangle(img, box[0], box[1], color=(0, 255, 0), thickness=3)
def get_person_detection_boxes(model, img, threshold=0.5):
    pred = model(img)
    pred_classes = [COCO_INSTANCE_CATEGORY_NAMES[i]
                    for i in list(pred[0]['labels'].cpu().numpy())]  # Predicted class names
    pred_boxes = [[(i[0], i[1]), (i[2], i[3])]
                  for i in list(pred[0]['boxes'].detach().cpu().numpy())]  # Bounding boxes
    pred_score = list(pred[0]['scores'].detach().cpu().numpy())  # Prediction scores
    if not pred_score or max(pred_score) < threshold:
        return []
    # Index of the last detection with score greater than threshold
    # (relies on the detector returning scores in descending order)
    pred_t = [pred_score.index(x) for x in pred_score if x > threshold][-1]
    pred_boxes = pred_boxes[:pred_t+1]
    pred_classes = pred_classes[:pred_t+1]

    # Keep only the boxes classified as 'person'
    person_boxes = []
    for idx, box in enumerate(pred_boxes):
        if pred_classes[idx] == 'person':
            person_boxes.append(box)

    return person_boxes
def get_pose_estimation_prediction(pose_model, image, center, scale):
    rotation = 0

    # pose estimation transformation
    trans = get_affine_transform(center, scale, rotation, cfg.MODEL.IMAGE_SIZE)
    model_input = cv2.warpAffine(
        image,
        trans,
        (int(cfg.MODEL.IMAGE_SIZE[0]), int(cfg.MODEL.IMAGE_SIZE[1])),
        flags=cv2.INTER_LINEAR)
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # pose estimation inference
    model_input = transform(model_input).unsqueeze(0)
    # switch to evaluate mode
    pose_model.eval()
    with torch.no_grad():
        # compute output heatmap
        output = pose_model(model_input)
        preds, _ = get_final_preds(
            cfg,
            output.clone().cpu().numpy(),
            np.asarray([center]),
            np.asarray([scale]))

        return preds
def box_to_center_scale(box, model_image_width, model_image_height):
    """convert a box to center,scale information required for pose transformation
    Parameters
    ----------
    box : list of tuple
        list of length 2 with two tuples of floats representing
        bottom left and top right corner of a box
    model_image_width : int
    model_image_height : int

    Returns
    -------
    (numpy array, numpy array)
        Two numpy arrays, coordinates for the center of the box and the scale of the box
    """
    center = np.zeros((2), dtype=np.float32)

    bottom_left_corner = box[0]
    top_right_corner = box[1]
    box_width = top_right_corner[0]-bottom_left_corner[0]
    box_height = top_right_corner[1]-bottom_left_corner[1]
    bottom_left_x = bottom_left_corner[0]
    bottom_left_y = bottom_left_corner[1]
    center[0] = bottom_left_x + box_width * 0.5
    center[1] = bottom_left_y + box_height * 0.5

    # pad the shorter side so the box matches the model's aspect ratio
    aspect_ratio = model_image_width * 1.0 / model_image_height
    pixel_std = 200

    if box_width > aspect_ratio * box_height:
        box_height = box_width * 1.0 / aspect_ratio
    elif box_width < aspect_ratio * box_height:
        box_width = box_height * aspect_ratio
    scale = np.array(
        [box_width * 1.0 / pixel_std, box_height * 1.0 / pixel_std],
        dtype=np.float32)
    if center[0] != -1:
        scale = scale * 1.25

    return center, scale
def parse_args():
    parser = argparse.ArgumentParser(description='Train keypoints network')
    # general
    parser.add_argument('--cfg', type=str, default='demo/inference-config.yaml')
    parser.add_argument('--video', type=str)
    parser.add_argument('--webcam', action='store_true')
    parser.add_argument('--image', type=str)
    parser.add_argument('--write', action='store_true')
    parser.add_argument('--showFps', action='store_true')

    parser.add_argument('opts',
                        help='Modify config options using the command-line',
                        default=None,
                        nargs=argparse.REMAINDER)

    args = parser.parse_args()

    # args expected by supporting codebase
    args.modelDir = ''
    args.logDir = ''
    args.dataDir = ''
    args.prevModelDir = ''
    return args
def main():
    # cudnn related setting
    cudnn.benchmark = cfg.CUDNN.BENCHMARK
    torch.backends.cudnn.deterministic = cfg.CUDNN.DETERMINISTIC
    torch.backends.cudnn.enabled = cfg.CUDNN.ENABLED

    args = parse_args()
    update_config(cfg, args)

    box_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    box_model.to(CTX)
    box_model.eval()

    pose_model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
        cfg, is_train=False
    )

    if cfg.TEST.MODEL_FILE:
        print('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
        pose_model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
    else:
        print('expected model defined in config at TEST.MODEL_FILE')

    pose_model = torch.nn.DataParallel(pose_model, device_ids=cfg.GPUS)
    pose_model.to(CTX)
    pose_model.eval()

    # Loading a video or an image or webcam
    if args.webcam:
        vidcap = cv2.VideoCapture(0)
    elif args.video:
        vidcap = cv2.VideoCapture(args.video)
    elif args.image:
        image_bgr = cv2.imread(args.image)
    else:
        print('please use --video or --webcam or --image to define the input.')
        return

    if args.webcam or args.video:
        if args.write:
            save_path = 'output.avi'
            fourcc = cv2.VideoWriter_fourcc(*'XVID')
            out = cv2.VideoWriter(save_path, fourcc, 24.0, (int(vidcap.get(3)), int(vidcap.get(4))))
        while True:
            ret, image_bgr = vidcap.read()
            if ret:
                last_time = time.time()
                image = image_bgr[:, :, [2, 1, 0]]

                input = []
                img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
                img_tensor = torch.from_numpy(img/255.).permute(2,0,1).float().to(CTX)
                input.append(img_tensor)

                # object detection box
                pred_boxes = get_person_detection_boxes(box_model, input, threshold=0.9)

                # pose estimation
                if len(pred_boxes) >= 1:
                    for box in pred_boxes:
                        center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
                        image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
                        pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
                        if len(pose_preds) >= 1:
                            for kpt in pose_preds:
                                draw_pose(kpt, image_bgr)  # draw the poses

                if args.showFps:
                    fps = 1/(time.time()-last_time)
                    img = cv2.putText(image_bgr, 'fps: '+ "%.2f"%(fps), (25, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)

                if args.write:
                    out.write(image_bgr)

                cv2.imshow('demo', image_bgr)
                if cv2.waitKey(1) & 0XFF == ord('q'):
                    break
            else:
                print('cannot load the video.')
                break

        cv2.destroyAllWindows()
        vidcap.release()
        if args.write:
            print('video has been saved as {}'.format(save_path))
            out.release()

    else:
        # estimate on the image
        last_time = time.time()
        image = image_bgr[:, :, [2, 1, 0]]

        input = []
        img = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
        img_tensor = torch.from_numpy(img/255.).permute(2,0,1).float().to(CTX)
        input.append(img_tensor)

        # object detection box
        pred_boxes = get_person_detection_boxes(box_model, input, threshold=0.9)

        # pose estimation
        if len(pred_boxes) >= 1:
            for box in pred_boxes:
                center, scale = box_to_center_scale(box, cfg.MODEL.IMAGE_SIZE[0], cfg.MODEL.IMAGE_SIZE[1])
                image_pose = image.copy() if cfg.DATASET.COLOR_RGB else image_bgr.copy()
                pose_preds = get_pose_estimation_prediction(pose_model, image_pose, center, scale)
                if len(pose_preds) >= 1:
                    for kpt in pose_preds:
                        draw_pose(kpt, image_bgr)  # draw the poses

        if args.showFps:
            fps = 1/(time.time()-last_time)
            img = cv2.putText(image_bgr, 'fps: '+ "%.2f"%(fps), (25, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)

        if args.write:
            save_path = 'output.jpg'
            cv2.imwrite(save_path, image_bgr)
            print('the result image has been saved as {}'.format(save_path))

        cv2.imshow('demo', image_bgr)
        if cv2.waitKey(0) & 0XFF == ord('q'):
            cv2.destroyAllWindows()


if __name__ == '__main__':
    main()
[CV_3D] VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
Paper Review
Abstract
- Previous methods : build hand-crafted features in order to feed LiDAR data into an RPN (Region Proposal Network)
- VoxelNet : proposes an end-to-end network that unifies feature extraction and bbox prediction into a single stage
- Divide the PC data into equally spaced 3D voxels (Voxel Partition)
- → VFE layers build a voxel feature from the points inside each voxel
- → 3D conv layers aggregate the local voxel features
- → RPN generates the bboxes
1. Introduction
1.1 Related Work
1.2 Contributions
- End-to-end trainable deep network for pc-based 3D detection by VFE
- Efficient implementation for sparse point structure and parallel processing on voxel grid (GPU)
- SOTA results on KITTI benchmark (LiDAR-based car, pedestrian, cyclist detection)
2. VoxelNet
[ VoxelNet Architecture ]
1️⃣ Feature Learning Network
Voxel Partition
- To subdivide(voxelize) 3D space into equally spaced voxels
- 3D voxel grid : $[D', H', W']$ $(D'=D/v_D, H'=H/v_H, W'=W/v_W)$
- $D, H, W$ : extent of the region containing LiDAR points along the z-axis (up), y-axis (left), x-axis (forward)
- $v_D, v_H, v_W$ : size of a unit voxel along the z, y, x directions
- In paper, $(v_D, v_H, v_W) = (0.4, 0.2, 0.2)$
Grouping
- LiDAR data : sparse, and the number of points differs from voxel to voxel
- Points that fall inside the same voxel grid cell are assigned to the same voxel group → point group
Random Sampling
- Set a max number of points $T$ per voxel; for voxels holding more than $T$ points, sample $T$ of them
- A PC obtained from a LiDAR sensor has about 100,000 points per frame
- Purposes : computation ↓, point density imbalance ↓ (sampling bias ↓), variation to training
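The three steps above (Voxel Partition, Grouping, Random Sampling) can be sketched in plain Python. The function name `voxelize`, the dict-based grouping, and the tiny sample point cloud are assumptions for illustration; points outside the detection range are assumed to be filtered beforehand.

```python
import random
from collections import defaultdict


def voxelize(points, voxel_size, range_min, max_points, seed=0):
    """Group points by voxel and keep at most max_points (T) per voxel.

    points     : list of (x, y, z, r)
    voxel_size : (v_D, v_H, v_W) = voxel edge length along z, y, x
    range_min  : (x_min, y_min, z_min) corner of the partitioned region
    Returns a dict {(d, h, w): [points...]} keyed by voxel index.
    """
    v_d, v_h, v_w = voxel_size
    x0, y0, z0 = range_min
    groups = defaultdict(list)
    for p in points:
        x, y, z = p[0], p[1], p[2]
        idx = (int((z - z0) // v_d), int((y - y0) // v_h), int((x - x0) // v_w))
        groups[idx].append(p)  # grouping: same index -> same voxel

    rng = random.Random(seed)
    return {
        idx: (rng.sample(pts, max_points) if len(pts) > max_points else pts)
        for idx, pts in groups.items()
    }


cloud = [
    (0.05, 0.05, 0.1, 0.5),
    (0.10, 0.10, 0.3, 0.2),
    (0.15, 0.12, 0.2, 0.9),  # three points share voxel (0, 0, 0)
    (0.50, 0.05, 0.1, 0.3),  # falls into voxel (0, 0, 2)
]
voxels = voxelize(cloud, voxel_size=(0.4, 0.2, 0.2), range_min=(0.0, 0.0, 0.0), max_points=2)
```

Only non-empty voxels appear as keys, which is also why a sparse representation is natural here.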
Stacked Voxel Feature Encoding (VFE)
- Fig. VFE Layer-1
- Non-empty voxel containing $t$ LiDAR points : $V = \{p_i = [x_i, y_i, z_i, r_i]^T\in\mathbb{R}^4\}_{i=1...t}$
- $x_i, y_i, z_i$ : XYZ coordinates of the $i$-th point
- $r_i$ : received reflectance
- Input feature set (Point-wise Input) : $V_{in} = \{\hat{p}_i= [x_i, y_i, z_i, r_i, x_i-v_x, y_i-v_y, z_i-v_z]^T\in\mathbb{R}^7\}_{i=1...t}$
- Relative offset of each point w.r.t. the centroid (= feature of each point)
- $(v_x, v_y, v_z)$ : centroid of all points in $V$ = local mean
- Point-wise Feature : result of passing the Point-wise Input through the FCN into feature space
- FCN = linear layer + BN + ReLU
- aggregating information from point features → encoding shape of surface within voxel
- Locally Aggregated Feature
- Result of element-wise max-pooling over the Point-wise Features (= features of all points in the voxel)
- Point-wise concatenated Feature : $f_i^{out}\in\mathbb{R}^{2m}$
- Concatenation of the Point-wise Feature and the Locally Aggregated Feature
- Output feature set : $V_{out} = \{f_i^{out}\}_{i=1...t}$
- = Point-wise Feature-1 → input of VFE Layer-2
- Voxel-wise Feature
- All non-empty voxels are encoded by the same FCN
- Stacking VFE layers lets the network learn shape information of the points inside a voxel
- Final result obtained by passing the point-wise features after $n(=2)$ VFE layers through an FCN and max-pooling
- A CNN is hard to train on raw 3D points alone → partition 3D space into voxels to get a CNN-friendly structure, compute a feature for each voxel, and use it as input to the Convolutional Middle Layers
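A minimal sketch of the data flow inside one VFE layer, without the learned FCN: build the 7-dim point-wise input, max-pool element-wise, then concatenate. The function names are hypothetical, and the pooling is applied directly to the raw 7-dim inputs only to show the operation; in VFE it acts on the FCN outputs.

```python
def augment_voxel(points):
    """points: non-empty list of (x, y, z, r) -> list of 7-dim point-wise inputs."""
    t = len(points)
    vx = sum(p[0] for p in points) / t  # centroid (local mean) of the voxel
    vy = sum(p[1] for p in points) / t
    vz = sum(p[2] for p in points) / t
    return [(x, y, z, r, x - vx, y - vy, z - vz) for (x, y, z, r) in points]


def elementwise_max(features):
    """Element-wise max over all points in the voxel (locally aggregated feature)."""
    return tuple(max(col) for col in zip(*features))


def concat_with_aggregate(features):
    """Append the aggregated feature to every point-wise feature (m -> 2m dims)."""
    agg = elementwise_max(features)
    return [f + agg for f in features]


voxel_points = [(0.0, 0.0, 0.0, 0.5), (2.0, 4.0, 6.0, 0.1)]
feats = augment_voxel(voxel_points)   # 7-dim per point
agg = elementwise_max(feats)          # 7-dim per voxel
out = concat_with_aggregate(feats)    # 14-dim per point
```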
[Code] Feature Learning Network
import torch
import torch.nn as nn
import torch.nn.functional as F
# `cfg` is assumed to be the project's config object (provides T, N, D, H, W, anchors_per_position)


# Fully Connected Network
class FCN(nn.Module):
    def __init__(self, cin, cout):
        super(FCN, self).__init__()
        self.cout = cout
        self.linear = nn.Linear(cin, cout)
        self.bn = nn.BatchNorm1d(cout)

    def forward(self, x):
        # kk is the number of voxels stacked across the batch, t the points per voxel
        kk, t, _ = x.shape
        x = self.linear(x.view(kk*t, -1))
        x = F.relu(self.bn(x))
        return x.view(kk, t, -1)


# Voxel Feature Encoding (VFE) Layer
class VFE(nn.Module):
    def __init__(self, cin, cout):
        super(VFE, self).__init__()
        assert cout % 2 == 0
        self.units = cout // 2
        self.fcn = FCN(cin, self.units)

    def forward(self, x, mask):
        # point-wise feature
        pwf = self.fcn(x)
        # locally aggregated feature
        laf = torch.max(pwf, 1)[0].unsqueeze(1).repeat(1, cfg.T, 1)
        # point-wise concat feature
        pwcf = torch.cat((pwf, laf), dim=2)
        # apply mask (zero out the padded point slots)
        mask = mask.unsqueeze(2).repeat(1, 1, self.units * 2)
        pwcf = pwcf * mask.float()
        return pwcf


# Stacked Voxel Feature Encoding
class SVFE(nn.Module):
    def __init__(self):
        super(SVFE, self).__init__()
        self.vfe_1 = VFE(7, 32)
        self.vfe_2 = VFE(32, 128)
        self.fcn = FCN(128, 128)

    def forward(self, x):
        # a point slot is valid if its feature vector is not all-zero padding
        mask = torch.ne(torch.max(x, 2)[0], 0)
        x = self.vfe_1(x, mask)
        x = self.vfe_2(x, mask)
        x = self.fcn(x)
        # element-wise max pooling
        x = torch.max(x, 1)[0]
        return x
Sparse Tensor Representation
- pc ~ 100k points → more than 90% of voxels are empty
- Represent the non-empty voxel features as a sparse tensor (list form)
- memory usage & computation cost ↓ during backprop
2️⃣ Convolutional Middle Layers
- Input : voxel-wise feature
- CML = 3D CNN + BN + ReLU
- In paper, 3 CML
- Aggregate voxel-wise features while enlarging the receptive field
[Code] Convolutional Middle Layer
# conv3d + bn + relu
class Conv3d(nn.Module):
    def __init__(self, in_channels, out_channels, k, s, p, batch_norm=True):
        super(Conv3d, self).__init__()
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=k, stride=s, padding=p)
        if batch_norm:
            self.bn = nn.BatchNorm3d(out_channels)
        else:
            self.bn = None

    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        return F.relu(x, inplace=True)


# Convolutional Middle Layer
class CML(nn.Module):
    def __init__(self):
        super(CML, self).__init__()
        self.conv3d_1 = Conv3d(128, 64, 3, s=(2, 1, 1), p=(1, 1, 1))
        self.conv3d_2 = Conv3d(64, 64, 3, s=(1, 1, 1), p=(0, 1, 1))
        self.conv3d_3 = Conv3d(64, 64, 3, s=(2, 1, 1), p=(1, 1, 1))

    def forward(self, x):
        x = self.conv3d_1(x)
        x = self.conv3d_2(x)
        x = self.conv3d_3(x)
        return x
3️⃣ Region Proposal Network (RPN)
- Input : BEV feature map obtained by reshaping the 4D feature map from the CML, of shape 64 (channel) x 2 (z) x 400 (y) x 352 (x), into a 3D tensor of shape 128 x 400 x 352
- Outputs : 2-dim Probability score map (class score) & 14-dim Regression map (bbox regression)
- Probability score map (class score) : probability (between 0 and 1) that each anchor belongs to the class
- Regression map (bbox regression) : regression results for the 7 bbox parameters
- Layers : Conv2D(input channel #, output channel #, kernel size, stride size, padding size)
- 3 fully convolutional blocks
- 1st layer of each block = stride 2 → downsamples the feature map size by 1/2
- Features coming out of each block are upsampled to the same size and concatenated
- The final high-resolution feature map passes through Conv2D layers → Class Probability score map & bbox Regression map
[Code] Region Proposal Network (RPN)
# conv2d + bn + relu
class Conv2d(nn.Module):
    def __init__(self, in_channels, out_channels, k, s, p, activation=True, batch_norm=True):
        super(Conv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=k, stride=s, padding=p)
        if batch_norm:
            self.bn = nn.BatchNorm2d(out_channels)
        else:
            self.bn = None
        self.activation = activation

    def forward(self, x):
        x = self.conv(x)
        if self.bn is not None:
            x = self.bn(x)
        if self.activation:
            return F.relu(x, inplace=True)
        else:
            return x


# Region Proposal Network
class RPN(nn.Module):
    def __init__(self):
        super(RPN, self).__init__()
        # each block: one stride-2 conv (downsample by 2) followed by stride-1 convs
        self.block_1 = [Conv2d(128, 128, 3, 2, 1)]
        self.block_1 += [Conv2d(128, 128, 3, 1, 1) for _ in range(3)]
        self.block_1 = nn.Sequential(*self.block_1)

        self.block_2 = [Conv2d(128, 128, 3, 2, 1)]
        self.block_2 += [Conv2d(128, 128, 3, 1, 1) for _ in range(5)]
        self.block_2 = nn.Sequential(*self.block_2)

        self.block_3 = [Conv2d(128, 256, 3, 2, 1)]
        # use the Conv2d wrapper (conv + BN + ReLU) here too, consistent with block_1 and block_2
        self.block_3 += [Conv2d(256, 256, 3, 1, 1) for _ in range(5)]
        self.block_3 = nn.Sequential(*self.block_3)

        # upsample every block output to the resolution of block_1's output
        self.deconv_1 = nn.Sequential(nn.ConvTranspose2d(256, 256, 4, 4, 0), nn.BatchNorm2d(256))
        self.deconv_2 = nn.Sequential(nn.ConvTranspose2d(128, 256, 2, 2, 0), nn.BatchNorm2d(256))
        self.deconv_3 = nn.Sequential(nn.ConvTranspose2d(128, 256, 1, 1, 0), nn.BatchNorm2d(256))

        self.score_head = Conv2d(768, cfg.anchors_per_position, 1, 1, 0, activation=False, batch_norm=False)
        self.reg_head = Conv2d(768, 7 * cfg.anchors_per_position, 1, 1, 0, activation=False, batch_norm=False)

    def forward(self, x):
        x = self.block_1(x)
        x_skip_1 = x
        x = self.block_2(x)
        x_skip_2 = x
        x = self.block_3(x)
        x_0 = self.deconv_1(x)
        x_1 = self.deconv_2(x_skip_2)
        x_2 = self.deconv_3(x_skip_1)
        x = torch.cat((x_0, x_1, x_2), 1)
        return self.score_head(x), self.reg_head(x)
[ Loss Function ]
Total Loss = Normalized Classification Loss + Normalized Regression Loss
(1)
- $p_i^{pos}, p_j^{neg}$ : softmax output for positive and negative anchors
- $a_i^{pos}$, $i=1...N_{pos}$ : set of positive anchors (pre-defined bbox)
- Anchors whose IoU with a GT bbox is above a threshold → score ≈ 1
- In paper, Car : 0.6, Pedestrian & Cyclist : 0.5
- $a_j^{neg}$, $j=1...N_{neg}$ : set of negative anchors
- Anchors whose IoU with the GT bboxes is below a threshold → score ≈ 0
- $(x_c^g, y_c^g, z_c^g, l^g, w^g, h^g, \theta^g)$ : 3D GT bbox
- $x_c^g, y_c^g, z_c^g$ : center location = feature map location
- $l^g, w^g, h^g$ : length, width, height of box → differ per class
- In paper, Car : (3.9, 1.6, 1.56)
- $\theta^g$ : yaw rotation around Z-axis (0~2𝝅)
- In paper, $\theta$ = 0, 𝝅/2 → 2 anchors → Outputs : 2-dim & 14-dim
(2)
- $u_i\in\mathbb{R}^7$ : regression output
- $u_i^*\in\mathbb{R}^7$ : GT for positive anchor
- $u^*\in\mathbb{R}^7$ : residual vector
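For reference, the two expressions these symbols belong to can be written out as below, following the VoxelNet paper as best recalled. The superscript $a$ (anchor box parameters) and the helper $d^a$ are notation not defined in the notes above; $L_{cls}$ is binary cross entropy and $L_{reg}$ is SmoothL1.

```latex
% Residual vector u^* between a GT box (g) and its matched positive anchor (a)
\begin{aligned}
\Delta x &= \frac{x_c^g - x_c^a}{d^a}, \quad
\Delta y = \frac{y_c^g - y_c^a}{d^a}, \quad
\Delta z = \frac{z_c^g - z_c^a}{h^a}, \\
\Delta l &= \log\frac{l^g}{l^a}, \quad
\Delta w = \log\frac{w^g}{w^a}, \quad
\Delta h = \log\frac{h^g}{h^a}, \quad
\Delta\theta = \theta^g - \theta^a, \\
d^a &= \sqrt{(l^a)^2 + (w^a)^2}
\end{aligned}

% Total loss: normalized classification loss + normalized regression loss
L = \alpha \frac{1}{N_{pos}} \sum_i L_{cls}(p_i^{pos}, 1)
  + \beta \frac{1}{N_{neg}} \sum_j L_{cls}(p_j^{neg}, 0)
  + \frac{1}{N_{pos}} \sum_i L_{reg}(u_i, u_i^*)
```

Here $u^* = (\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h, \Delta\theta)$, and $\alpha$, $\beta$ balance the positive and negative classification terms.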
[Code] Loss function
class VoxelLoss(nn.Module):
    def __init__(self, alpha, beta):
        super(VoxelLoss, self).__init__()
        # reduction='sum' replaces the deprecated size_average=False
        self.smoothl1loss = nn.SmoothL1Loss(reduction='sum')
        self.alpha = alpha
        self.beta = beta

    def forward(self, rm, psm, pos_equal_one, neg_equal_one, targets):
        # probability score map -> (N, H, W, anchors)
        p_pos = torch.sigmoid(psm.permute(0, 2, 3, 1))
        # regression map -> (N, H, W, anchors, 7)
        rm = rm.permute(0, 2, 3, 1).contiguous()
        rm = rm.view(rm.size(0), rm.size(1), rm.size(2), -1, 7)
        targets = targets.view(targets.size(0), targets.size(1), targets.size(2), -1, 7)
        pos_equal_one_for_reg = pos_equal_one.unsqueeze(pos_equal_one.dim()).expand(-1, -1, -1, -1, 7)

        # regression is only evaluated on positive anchors
        rm_pos = rm * pos_equal_one_for_reg
        targets_pos = targets * pos_equal_one_for_reg

        # classification loss, normalized by the number of positive / negative anchors
        cls_pos_loss = -pos_equal_one * torch.log(p_pos + 1e-6)
        cls_pos_loss = cls_pos_loss.sum() / (pos_equal_one.sum() + 1e-6)

        cls_neg_loss = -neg_equal_one * torch.log(1 - p_pos + 1e-6)
        cls_neg_loss = cls_neg_loss.sum() / (neg_equal_one.sum() + 1e-6)

        # regression loss, normalized by the number of positive anchors
        reg_loss = self.smoothl1loss(rm_pos, targets_pos)
        reg_loss = reg_loss / (pos_equal_one.sum() + 1e-6)
        conf_loss = self.alpha * cls_pos_loss + self.beta * cls_neg_loss
        return conf_loss, reg_loss
2.3 Efficient Implementation
- $K$ : maximum number of non-empty voxels
- $T$ : maximum number of points per voxel
Steps
- Initialize a ($K$ x $T$ x $1$)-dim Voxel Coordinate Buffer (VCB) and a ($K$ x $T$ x $7$)-dim Voxel Input Feature Buffer (VIFB)
- Before feeding the sparse input PC to the stacked VFE layers, pass it through the VIFB to turn it into a dense form & fill empty slots with 0 → enables GPU parallel computation
- Iterate over the points; if the voxel containing the current point has not been initialized yet, add the voxel's coordinate to the VCB
- & convert the point into a 7-dim vector and add it to the VIFB at that voxel's position
- Map the voxel-wise features produced by the stacked VFE layers to a sparse tensor in 3D space using the VCB
- The sparse tensor is fed to the middle conv layers and the RPN
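The buffer-filling steps above can be sketched in plain Python. This is a dependency-free illustration under stated assumptions, not the repository's implementation: the function name `build_voxel_buffers`, the `(x, y, z, reflectance)` point layout, and the choice to compute the centroid over the points actually kept in a voxel are all assumptions made for the example.

```python
def build_voxel_buffers(points, voxel_size, max_voxels, max_points):
    """Group points into dense, zero-padded voxel buffers.

    points: list of (x, y, z, reflectance) tuples (assumed layout).
    Returns (coords, features): coords[k] is the integer voxel index of
    slot k; features[k] holds max_points rows of 7 values each.
    """
    slot_of = {}   # voxel index -> slot in the buffers
    coords = []    # voxel coordinate buffer
    members = []   # raw points collected per voxel
    for p in points:
        key = tuple(int(p[i] // voxel_size[i]) for i in range(3))
        if key not in slot_of:
            if len(coords) == max_voxels:
                continue  # buffer full: ignore points that open a new voxel
            slot_of[key] = len(coords)
            coords.append(key)
            members.append([])
        bucket = members[slot_of[key]]
        if len(bucket) < max_points:  # keep at most T points per voxel
            bucket.append(p)

    features = []
    for bucket in members:
        n = len(bucket)
        cx, cy, cz = (sum(p[i] for p in bucket) / n for i in range(3))
        # 7-dim vector: raw point plus offset to the voxel centroid
        rows = [[x, y, z, r, x - cx, y - cy, z - cz] for x, y, z, r in bucket]
        rows += [[0.0] * 7 for _ in range(max_points - n)]  # zero padding
        features.append(rows)
    return coords, features

pts = [(0.1, 0.1, 0.1, 0.5), (0.3, 0.1, 0.1, 0.7), (1.2, 0.1, 0.1, 0.9)]
coords, feats = build_voxel_buffers(pts, (1.0, 1.0, 1.0), max_voxels=4, max_points=2)
```

Every voxel ends up with the same number of rows, so the feature buffer can be stacked into one dense tensor, while `coords` keeps the information needed to scatter the voxel-wise features back into 3D space.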
[Code] Efficient VoxelNet
import torch
import torch.nn as nn

# SVFE, CML, RPN and cfg are defined elsewhere in the referenced repository
class VoxelNet(nn.Module):
    def __init__(self):
        super(VoxelNet, self).__init__()
        self.svfe = SVFE()
        self.cml = CML()
        self.rpn = RPN()

    def voxel_indexing(self, sparse_features, coords):
        # Scatter voxel-wise features into a dense (dim, N, D, H, W) tensor; empty voxels stay 0
        dim = sparse_features.shape[-1]
        dense_feature = torch.zeros(dim, cfg.N, cfg.D, cfg.H, cfg.W, device=sparse_features.device)
        dense_feature[:, coords[:,0], coords[:,1], coords[:,2], coords[:,3]]= sparse_features
        return dense_feature.transpose(0, 1)

    def forward(self, voxel_features, voxel_coords):
        # feature learning network
        vwfs = self.svfe(voxel_features)
        vwfs = self.voxel_indexing(vwfs,voxel_coords)
        # convolutional middle network
        cml_out = self.cml(vwfs)
        # region proposal network
        # merge the depth and feature dim into one, output probability score map and regression map
        psm,rm = self.rpn(cml_out.view(cfg.N,-1,cfg.H, cfg.W))
        return psm, rm
3. Training Details
Data Augmentation
- Less than 4000 training PC → Overfitting issue
- 1) Perturbation (Rotation and Translation) to each GT bbox
- Rotate around the bbox center by an angle sampled from a uniform distribution over [-π/10, π/10]
- Translate along each of (x, y, z) by a value sampled from a Gaussian distribution with mean 0 and std 1
- Collision test bw two boxes → revert to the original if a collision occurs
- 2) Global Scaling
- Scale all GT bboxes $b_i$ and the whole PC $M$ by a factor sampled from a uniform distribution over [0.95, 1.05]
- Result : Robustness ↑ for detecting objects with various sizes and distances
- 3) Global Rotation
- Rotate all GT bboxes $b_i$ and the whole PC $M$ around the Z-axis about (0,0,0) by an angle sampled from a uniform distribution over [-π/4, π/4]
- Result : rotating entire pc → simulating vehicle making a turn
- 1 : individual bbox, 3 : whole scene
[Code] Data Augmentation
import cv2
import numpy as np

def draw_polygon(img, box_corner, color = (255, 255, 255),thickness = 1):
    # Corners are stored as (row, col); cv2 expects (x, y) = (col, row)
    tup0 = (box_corner[0, 1],box_corner[0, 0])
    tup1 = (box_corner[1, 1],box_corner[1, 0])
    tup2 = (box_corner[2, 1],box_corner[2, 0])
    tup3 = (box_corner[3, 1],box_corner[3, 0])
    cv2.line(img, tup0, tup1, color, thickness, cv2.LINE_AA)
    cv2.line(img, tup1, tup2, color, thickness, cv2.LINE_AA)
    cv2.line(img, tup2, tup3, color, thickness, cv2.LINE_AA)
    cv2.line(img, tup3, tup0, color, thickness, cv2.LINE_AA)
    return img
def point_transform(points, tx, ty, tz, rx=0, ry=0, rz=0):
    # Input:
    #   points: (N, 3)
    #   rx/y/z: in radians
    # Output:
    #   points: (N, 3)
    N = points.shape[0]
    # Homogeneous coordinates: translate first, then rotate about each axis
    points = np.hstack([points, np.ones((N, 1))])
    mat1 = np.eye(4)
    mat1[3, 0:3] = tx, ty, tz
    points = np.matmul(points, mat1)
    if rx != 0:
        mat = np.zeros((4, 4))
        mat[0, 0] = 1
        mat[3, 3] = 1
        mat[1, 1] = np.cos(rx)
        mat[1, 2] = -np.sin(rx)
        mat[2, 1] = np.sin(rx)
        mat[2, 2] = np.cos(rx)
        points = np.matmul(points, mat)
    if ry != 0:
        mat = np.zeros((4, 4))
        mat[1, 1] = 1
        mat[3, 3] = 1
        mat[0, 0] = np.cos(ry)
        mat[0, 2] = np.sin(ry)
        mat[2, 0] = -np.sin(ry)
        mat[2, 2] = np.cos(ry)
        points = np.matmul(points, mat)
    if rz != 0:
        mat = np.zeros((4, 4))
        mat[2, 2] = 1
        mat[3, 3] = 1
        mat[0, 0] = np.cos(rz)
        mat[0, 1] = -np.sin(rz)
        mat[1, 0] = np.sin(rz)
        mat[1, 1] = np.cos(rz)
        points = np.matmul(points, mat)
    return points[:, 0:3]
def box_transform(boxes_corner, tx, ty, tz, r=0):
    # boxes_corner (N, 8, 3)
    for idx in range(len(boxes_corner)):
        boxes_corner[idx] = point_transform(boxes_corner[idx], tx, ty, tz, rz=r)
    return boxes_corner
def cal_iou2d(box1_corner, box2_corner):
    box1_corner = np.reshape(box1_corner, [4, 2])
    box2_corner = np.reshape(box2_corner, [4, 2])
    # Map metric corners to BEV grid cells
    box1_corner = ((cfg.W, cfg.H)-(box1_corner - (cfg.xrange[0], cfg.yrange[0])) / (cfg.vw, cfg.vh)).astype(np.int32)
    box2_corner = ((cfg.W, cfg.H)-(box2_corner - (cfg.xrange[0], cfg.yrange[0])) / (cfg.vw, cfg.vh)).astype(np.int32)
    # Rasterize both boxes and compare the filled masks
    buf1 = np.zeros((cfg.H, cfg.W, 3))
    buf2 = np.zeros((cfg.H, cfg.W, 3))
    buf1 = cv2.fillConvexPoly(buf1, box1_corner, color=(1,1,1))[..., 0]
    buf2 = cv2.fillConvexPoly(buf2, box2_corner, color=(1,1,1))[..., 0]
    indiv = np.sum(np.absolute(buf1-buf2))
    share = np.sum((buf1 + buf2) == 2)
    if indiv == 0:
        return 0.0 # when target is out of bound
    return share / (indiv + share)
def aug_data(lidar, gt_box3d_corner):
    np.random.seed()
    choice = np.random.randint(1, 10)
    if choice >= 7:
        # per-box perturbation
        for idx in range(len(gt_box3d_corner)):
            # TODO: precisely gather the point
            is_collision = True
            _count = 0
            while is_collision and _count < 100:
                t_rz = np.random.uniform(-np.pi / 10, np.pi / 10)
                t_x = np.random.normal()
                t_y = np.random.normal()
                t_z = np.random.normal()
                # check collision
                tmp = box_transform(
                    gt_box3d_corner[[idx]], t_x, t_y, t_z, t_rz)
                is_collision = False
                for idy in range(idx):
                    iou = cal_iou2d(tmp[0,:4,:2],gt_box3d_corner[idy,:4,:2])
                    if iou > 0:
                        is_collision = True
                        _count += 1
                        break
            if not is_collision:
                # move the points inside the box together with the box
                box_corner = gt_box3d_corner[idx]
                minx = np.min(box_corner[:, 0])
                miny = np.min(box_corner[:, 1])
                minz = np.min(box_corner[:, 2])
                maxx = np.max(box_corner[:, 0])
                maxy = np.max(box_corner[:, 1])
                maxz = np.max(box_corner[:, 2])
                bound_x = np.logical_and(
                    lidar[:, 0] >= minx, lidar[:, 0] <= maxx)
                bound_y = np.logical_and(
                    lidar[:, 1] >= miny, lidar[:, 1] <= maxy)
                bound_z = np.logical_and(
                    lidar[:, 2] >= minz, lidar[:, 2] <= maxz)
                bound_box = np.logical_and(
                    np.logical_and(bound_x, bound_y), bound_z)
                lidar[bound_box, 0:3] = point_transform(
                    lidar[bound_box, 0:3], t_x, t_y, t_z, rz=t_rz)
                gt_box3d_corner[idx] = box_transform(
                    gt_box3d_corner[[idx]], t_x, t_y, t_z, t_rz)
        gt_box3d = gt_box3d_corner
    elif choice < 7 and choice >= 4:
        # global rotation
        angle = np.random.uniform(-np.pi / 4, np.pi / 4)
        lidar[:, 0:3] = point_transform(lidar[:, 0:3], 0, 0, 0, rz=angle)
        gt_box3d = box_transform(gt_box3d_corner, 0, 0, 0, r=angle)
    else:
        # global scaling
        factor = np.random.uniform(0.95, 1.05)
        lidar[:, 0:3] = lidar[:, 0:3] * factor
        gt_box3d = gt_box3d_corner * factor
    return lidar, gt_box3d
4. Experiments
Evaluation on KITTI benchmark dataset
- VoxelNet outperforms all other methods for Car class
- VoxelNet is more effective in capturing 3D shape information than HC features
Code Review
- reference : https://github.com/skyhehe123/VoxelNet-pytorch
[CV_CNN] Very Deep Convolutional Networks for Large-Scale Image Recognition
Very Deep Convolutional Networks for Large-Scale Image Recognition
1. INTRODUCTION
- Fix other parameters and increase 'depth' of the network + Use only 'small (3x3)' convolution filters in all layers
- ILSVRC-2014 classification and localisation + other image recognition datasets
2. ConvNet Configurations
2.1. Architecture
- Input data : 224 x 224 RGB image
- 3 x 3 Conv (stride 1, padding 1) and 2 x 2 Maxpool (stride 2)
- Activation function : ReLU
- 3 FC layers (4096 - 4096 - 1000 channels)
- Final : soft-max layer
- No LRN (Local Response Normalization) except for one configuration (A-LRN)
2.2. Configurations
- A~E : Differ only in 'depth'
- Width of conv layer (the number of channels = feature map) : 64 -> 128 -> 256 -> 512
- 3 x 3 conv keeps the number of conv parameters small, but the total is still large (because of the FC layers)
2.3. Discussion
- A stack of three 3 x 3 conv layers has the same effective receptive field as one 7 x 7 conv layer
- BUT more non-linearity ( 3 ReLUs instead of 1 ) & fewer parameters ( $3(3^2C^2) = 27C^2 < 7^2C^2 = 49C^2$ for $C$ channels )
- 1 x 1 conv layer for additional non-linearity by ReLU (config C)
- GoogLeNet(1st place of ILSVRC-2014) is more complex than VGGNet
- Similarity : very deep ConvNets (22 layers) and Small conv filters(1x1, 3x3, 5x5)
- Difference : spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation
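The parameter and receptive-field comparison between three stacked 3 x 3 conv layers and one 7 x 7 conv layer can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names are hypothetical, biases are ignored, and both input and output are assumed to have $C$ channels, as in the inequality above.

```python
def conv_stack_params(kernel_size, num_layers, channels):
    """Weights in a stack of conv layers with C input and C output channels."""
    return num_layers * (kernel_size ** 2) * channels ** 2

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of stacked stride-1 convs: each layer adds (k - 1)."""
    return 1 + num_layers * (kernel_size - 1)

C = 512
three_3x3 = conv_stack_params(3, 3, C)  # 27 * C^2
one_7x7 = conv_stack_params(7, 1, C)    # 49 * C^2
```

Both options see a 7 x 7 window, but the stacked version uses 27/49 of the weights and inserts two extra ReLUs.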
3. Classification Framework
3.1. Training
- Generally follows AlexNet (2012) except for input crops from multi-scale training images
- Data Pre-processing
- Image Rescale (Resize)
- Single-scale training : fixed S = 256, S = 384
- Multi-scale training : randomly sampling in [256, 512] (Fine-tuning with pre-trained S = 384)
- Data Augmentation
- Random crop 224 x 224
- Random horizontal flipping
- Random RGB color shift
- Scale jittering
- Normalization : subtract mean RGB value computed on training dataset from each pixel
- Train Details
- Multinomial logistic regression Optimization
- Mini-batch gradient descent based on backpropagation
- Learning rate : 0.01
- Momentum : 0.9
- L2 weight decay : 0.0005
- Batch size : 256
- Dropout : 0.5 ratio for first 2 FC layers
- Learning rate scheduler : decreased by a factor of 10 ( x 3 times) -> stopped at 370K iterations
- Epoch : 74 (370K iterations)
- Pre-initialization : Train shallow config A -> Train deeper configs by initializing the first 4 conv and last 3 FC layers with the layers of A & randomly initializing the intermediate layers by sampling from a normal distribution with zero mean and $10^{-2}$ variance
3.2. Testing
- Data Pre-processing
- Isotropic Rescaling to pre-defined smallest side Q (not necessarily equal to S)
- Multi-crop evaluation + Dense evaluation
- Data Augmentation : Horizontal flipping
- Network Change
- FC layers -> convolutional layers => Fully-Convolutional Net
- First FC layer -> 7 x 7 conv layer
- Last 2 FC layers -> 1 x 1 conv layers (for free input size) : applied to the whole (uncropped) img
- Spatially average the class score map at the end : to obtain a fixed-size vector of class scores
- Averaging Soft-max class posteriors of original and flipped images -> Final scores
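The final test-time step, averaging the soft-max posteriors of the original and the flipped image, can be sketched as follows. The logits are made-up numbers and the helper names are hypothetical; in practice they would come from two forward passes of the network.

```python
import math

def softmax(logits):
    """Numerically stable soft-max over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_posteriors(logits_per_view):
    """Average soft-max class posteriors over several views of one image."""
    probs = [softmax(view) for view in logits_per_view]
    n = len(probs)
    return [sum(col) / n for col in zip(*probs)]

original_logits = [2.0, 1.0, 0.1]  # illustrative scores for 3 classes
flipped_logits = [1.5, 1.2, 0.3]
final = average_posteriors([original_logits, flipped_logits])
predicted = max(range(len(final)), key=final.__getitem__)
```

The same `average_posteriors` helper also covers the ConvNet fusion case, where the views are the outputs of different models rather than of different flips.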
4. Classification Experiments
- Dataset : ILSVRC-2012 dataset (1000 classes / 1.3M train + 50K val + 100K test)
- Use validation set as test set
4.1. Single-Scale Evaluation
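- The deeper the network, the lower the error + Error saturates at 19 layers
- Additional non-linearity helps (C > B), but at the same depth 3 x 3 conv beats 1 x 1 conv (D > C) : capturing spatial context also matters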
- Deep net with Small filters is better than Shallow net with Large filters
- Scale jittering : better than fixed S
4.2. Multi-Scale Evaluation
- Better than Single-Scale Evaluation
- fixed S : Q = {S-32, S, S+32}
- Scale jittering (S in [256, 512]) evaluated on Q = {256, 384, 512} : better than fixed S
4.3. Multi-Crop Evaluation
- Multi-crop & Dense evaluation : complementary -> Combination is best
4.4. Convnet fusion
- Combine the outputs of several models by averaging soft-max class posteriors -> improve performance
- Multiple ConvNet fusion Results
- ILSVRC submission : Only the single-scale networks and one multi-scale model D had been trained -> Ensemble of 7 models => 7.3% test error
- Post-submission : Ensemble of 2 best-performing multi-scale models (D and E) => 7.0% using dense eval, 6.8% using combined eval
4.5. Comparison with the state of the art
- ILSVRC-2014 Classification 2nd place with 7.3% test error using an ensemble of 7 models
- Decreased the error rate to 6.8% using an ensemble of 2 models
- Single-net performance : VGG is the best
5. CONCLUSION
- Representation 'depth' is beneficial for the classification accuracy
- Generalizes well to a wide range of tasks and datasets (more complex recognition pipelines)
Code Review
1. model of VGG16
from keras.models import Sequential
from keras.layers.core import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.optimizers import SGD
import cv2, numpy as np
# Written against the Keras 1.x API (Convolution2D(filters, rows, cols), channels-first input)
def VGG_16(weights_path=None):
    model = Sequential()
    # Block 1
    model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    # Block 2
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    # Block 3
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    # Block 4
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    # Block 5
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    # Classifier: 4096 - 4096 - 1000
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))
    if weights_path:
        model.load_weights(weights_path)
    return model
2. Whole models
import torch
import torch.nn as nn
try:
    from torch.hub import load_state_dict_from_url
except ImportError:
    from torch.utils.model_zoo import load_url as load_state_dict_from_url
torch.manual_seed(0)
# Pretrained model weights
pretrained_model_urls = {
'vgg11': 'https://download.pytorch.org/models/vgg11-bbd30ac9.pth',
'vgg13': 'https://download.pytorch.org/models/vgg13-c768596a.pth',
'vgg16': 'https://download.pytorch.org/models/vgg16-397923af.pth',
'vgg19': 'https://download.pytorch.org/models/vgg19-dcbb9e9d.pth',
'vgg11_bn': 'https://download.pytorch.org/models/vgg11_bn-6002323d.pth',
'vgg13_bn': 'https://download.pytorch.org/models/vgg13_bn-abd245e5.pth',
'vgg16_bn': 'https://download.pytorch.org/models/vgg16_bn-6c64b313.pth',
'vgg19_bn': 'https://download.pytorch.org/models/vgg19_bn-c79401a0.pth',
}
# Model info
cfgs = {
11: [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
13: [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
16: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
19: [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
}
class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        # Pool to 7 x 7 so the classifier works for any input resolution
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, num_classes)
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)
def make_layers(cfg, batch_norm=False):
    # Build the conv feature extractor: ints are 3 x 3 conv widths, 'M' is a 2 x 2 max pool
    layers = list()
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)
def vgg(depth, batch_norm, num_classes, pretrained):
    model = VGG(make_layers(cfgs[depth], batch_norm=batch_norm), num_classes, init_weights=True)
    arch = 'vgg' + str(depth)
    if batch_norm:
        arch += '_bn'
    # Pretrained weights are only available for the 1000-class ImageNet models
    if pretrained and (num_classes == 1000) and (arch in pretrained_model_urls):
        state_dict = load_state_dict_from_url(pretrained_model_urls[arch], progress=True)
        model.load_state_dict(state_dict)
    elif pretrained:
        raise ValueError('No pretrained model in vggnet {} model with class number {}'.format(depth, num_classes))
    return model
3. Train and Test
from model import *
from utils import *
import os
import torch
import torch.nn as nn
import torch.optim as optim
torch.manual_seed(0)
class VGGNet():
    def __init__(self, depth=19, batch_norm=True, num_classes=1000, pretrained=False,
                 gpu_id=0, print_freq=10, epoch_print=10, epoch_save=50):
        self.depth = depth
        self.batch_norm = batch_norm
        self.num_classes = num_classes
        self.pretrained = pretrained
        self.gpu = gpu_id
        self.print_freq = print_freq
        self.epoch_print = epoch_print
        self.epoch_save = epoch_save
        torch.cuda.set_device(self.gpu)
        self.loss_function = nn.CrossEntropyLoss().cuda(self.gpu)
        if self.pretrained:
            print('=> Use pre-trained model with depth : {}, batch_norm : {}'.format(self.depth, self.batch_norm))
        else:
            print('=> Create model with depth : {}, batch_norm : {}'.format(self.depth, self.batch_norm))
        model = vgg(self.depth, self.batch_norm, self.num_classes, self.pretrained)
        self.model = model.cuda(self.gpu)
        # Histories recorded every print_freq iterations
        self.train_losses = list()
        self.train_acc = list()
        self.test_losses = list()
        self.test_acc = list()
    def train(self, train_data, test_data, resume=False, save=False, start_epoch=0, epochs=74,
              lr=0.01, momentum=0.9, weight_decay=0.0005, milestones=False):
        # Model to Train Mode
        self.model.train()
        # Set Optimizer and Scheduler
        optimizer = optim.SGD(self.model.parameters(), lr, momentum=momentum, weight_decay=weight_decay)
        if milestones:
            scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1)
        else:
            scheduler = optim.lr_scheduler.MultiStepLR(optimizer, [epochs//2, epochs*3//4], gamma=0.1)
        # Optionally Resume from Checkpoint
        if resume:
            if os.path.isfile(resume):
                print('=> Load checkpoint from {}'.format(resume))
                loc = 'cuda:{}'.format(self.gpu)
                checkpoint = torch.load(resume, map_location=loc)
                self.model.load_state_dict(checkpoint['state_dict'])
                start_epoch = checkpoint['epoch']
                optimizer.load_state_dict(checkpoint['optimizer'])
                scheduler.load_state_dict(checkpoint['scheduler'])
                print('=> Loaded checkpoint from {} with epoch {}'.format(resume, checkpoint['epoch']))
            else:
                print('=> No checkpoint found at {}'.format(resume))
        # Train
        for epoch in range(start_epoch, epochs):
            if epoch % self.epoch_print == 0:
                print('Epoch {} Started...'.format(epoch+1))
            for i, (X, y) in enumerate(train_data):
                X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
                output = self.model(X)
                loss = self.loss_function(output, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if (i+1) % self.print_freq == 0:
                    train_acc = 100 * count(output, y) / y.size(0)
                    test_acc, test_loss = self.test(test_data)
                    self.train_losses.append(loss.item())
                    self.train_acc.append(train_acc)
                    self.test_losses.append(test_loss)
                    self.test_acc.append(test_acc)
                    # test() switches to eval mode, so switch back
                    self.model.train()
                    if epoch % self.epoch_print == 0:
                        print('Iteration : {} - Train Loss : {:.2f}, Test Loss : {:.2f}, '
                              'Train Acc : {:.2f}, Test Acc : {:.2f}'.format(i+1, loss.item(), test_loss,
                                                                             train_acc, test_acc))
            scheduler.step()
            if save and (epoch % self.epoch_save == 0):
                # Save the scheduler's state_dict so that resume can call load_state_dict on it
                save_checkpoint(self.depth, self.batch_norm, self.num_classes, self.pretrained, epoch,
                                state={'epoch': epoch+1, 'state_dict':self.model.state_dict(),
                                       'optimizer':optimizer.state_dict(), 'scheduler':scheduler.state_dict()})
    def test(self, test_data):
        correct, total = 0, 0
        losses = list()
        # Model to Eval Mode
        self.model.eval()
        # Test
        with torch.no_grad():
            for i, (X, y) in enumerate(test_data):
                X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
                output = self.model(X)
                loss = self.loss_function(output, y)
                losses.append(loss.item())
                correct += count(output, y)
                total += y.size(0)
        return (100*correct/total, sum(losses)/len(losses))
[CV_3D] PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
Paper Review
1. Introduction
- Previous research : For weight sharing and kernel optimization, the irregularly formatted point cloud is transformed into a 3D voxel grid or collections of imgs before being fed to the network → Result : Quantization artifacts
- PointNet
- Input : Point clouds
- Simple and unified architecture → easy to train
- A set of points → Invariant to permutations & rigid motions
- Output : class labels for entire input or per point segment/part labels for each point of input
- Max pooling : single symmetric function
- FC layers : (shape classification) to aggregate learnt optimal values into global descriptor or (shape segmentation) predict per point labels
- Data-dependent STN : to canonicalize data before PointNet process them
- Can approximate any continuous set function
- Summarizes the input point cloud by a sparse set of key points
- Robust to small perturbation of input points (corruption by outliers or missing data)
- Key contributions
- Model Design : Deep network for unordered point sets in 3D
- Tasks : 3D shape classification, shape part segmentation, scene semantic parsing
- Analysis : Empirical and theoretical analysis on Stability and efficiency
- Experiment : 3D features illustration computed by selected neurons in net
2. Related Work
Point Cloud Features
- Previous method : encodes certain statistical properties of points → invariant to certain transformations
- ex. intrinsic or extrinsic / local or global
DL on 3D Data
- Volumetric CNN : 3D CNN → constrained by data sparsity and the computation cost of 3D conv
- FPNN, Vote3D : still operate on sparse volumes → hard to process large point clouds
- Multiview CNN : render 3D point cloud or shapes into 2D imgs, then apply 2D conv
- Spectral CNN : constrained to manifold meshes, hard to extend to non-isometric shapes
- Feature-based DNN : convert 3D data into a vector by extracting shape features, then classify with FC layers
DL on Unordered Sets
- Point cloud = unordered set of vectors .. VS .. Most works in DL : regular representations
- read-process-write network with attention : sorting for generic sets and NLP → lacks the role of geometry
3. Problem Statement
- Each Point's channel of PC
- (x, y, z) + extra feature channels (ex. color, normal, ..)
- Implementation : (x, y, z) coordinate for simplicity
- Object classification task
- Input point cloud : directly sampled from a shape or pre-segmented from a scene point cloud
- Output : k scores (k : candidate class 수)
- Semantic segmentation task
- Input : a single object for part region segmentation, or a sub-volume of a 3D scene for object region segmentation
- Output : n x m scores (n : point 수, m : semantic sub-category 수)
4. Deep Learning on Point Sets
4.1 Properties of Point Sets in R^n
- Unordered : N 3D point sets → Network needs to be invariant to N! permutations
- Interaction among points : meaningful local structures from nearby points
- Invariance under transformations : category and segmentation outputs stay the same after transformations (ex. rotating, translating)
4.2 PointNet Architecture
Full network = Classification network + Segmentation network
[ 3 Key modules ]
❶ Max pooling layer : Symmetry Function for Unordered Input
- Goal : To aggregate information from all points → make model invariant to input permutation (N!)
- Input : n vectors → Output : a new vector = [f_1, ..., f_K] (invariant to input order)
- Key idea : To approximate general function $f$ defined on point set by symmetric function on transformed elements : $f(\{x_1, ..., x_n\}) \approx g(h(x_1), ..., h(x_n))$
- Implementation : approximate $h$ by MLP & $g$ by single variable func + max pooling func
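A tiny experiment makes the permutation invariance tangible. The sketch below uses a fixed hand-written per-point map `h` as a stand-in for the shared MLP (its weights are arbitrary and purely illustrative) and an element-wise max as the symmetric function; shuffling the input order leaves the global feature unchanged.

```python
import random

def h(point):
    """Per-point feature map (stand-in for the shared MLP, fixed weights)."""
    x, y, z = point
    return [max(0.0, x + y), max(0.0, y - z), max(0.0, z - x), max(0.0, 0.5 * x + z)]

def global_feature(points):
    """Element-wise max over per-point features: a symmetric function."""
    feats = [h(p) for p in points]
    return [max(col) for col in zip(*feats)]

random.seed(0)
cloud = [tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(16)]
shuffled = cloud[:]
random.shuffle(shuffled)

same = global_feature(cloud) == global_feature(shuffled)
```

Replacing the max with an order-dependent operation (for example concatenating the per-point features) would break this equality, which is why the aggregation step has to be symmetric.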
❷ Local and Global Information Aggregation [Segmentation]
- Max pooling output [f_1, ..., f_K] : only global information for Classification task
- Goal : To get Local and Global information for Point Segmentation task
- Implementation (Input) : Concatenating global feature (1024) + each of local point feature (64) → Extracting new per point feature (ex. per-point normals)
❸ T-Net : Joint Alignment Network
- Goal : Invariant to transformations (ex. rigid transformation)
- Implementation : Predicting affine transformation matrix by mini-net (T-net) → Applying this transformation to coordinates of input points
- Result Check : invariant if the semantic labeling stays the same
- Idea from STN (computes a transformation matrix towards a canonical img, then multiplies it with the original input img to get an undistorted output img)
- T-net : composed by basic modules of point independent feature extraction + max pooling + FC layers
- Feature space Alignment : add another transformation matrix to align features from different input point clouds
5. Experiment
5.1 Applications (3D recognition)
1) 3D Object Classification
- Goal : To learn global point cloud feature
- Dataset : ModelNet40 (12311 CAD models from 40 man-made object categories) → 75% Train + 25% Test
- Input point cloud : 1024 points uniformly sampled from mesh faces → normalized into a unit sphere
- Data augmentation : random rotation along the up-axis, jittering the position of each point by Gaussian noise
- Result : uses only FC layers and max pooling → fast inference speed, easily parallelized on CPU
2) 3D Object Part Segmentation
- Part Segmentation : Given 3D scan or mesh model → Point labels = object part category label to each point of face
- Dataset : ShapeNet part dataset (16881 shapes from 16 categories, annotated with 50 parts)
- Idea : Part-point Classification
- Evaluation metric : mIoU on points (shape's mIoU)
- Result : 2.3% mean IoU improvement
- Robustness Test (simulated Kinect scans) : lose only 5.3% mIoU
3) Semantic Segmentation in Scenes
- Point labels : semantic object classes
- Dataset : Stanford 3D semantic parsing dataset (3D scans in 6 areas including 271 rooms from 13 categories)
- Point representation : 12-dim vector = 9-dim of XYZ, RGB, normalized location + 3-dim of local point density, local curvature, normal
- Classifier : standard MLP
- Result : smooth predictions, robustness to missing points and occlusions
- 3D Object Detection system
5.2 Architecture Design Analysis
- Dataset : ModelNet40 shape classification problem for comparisons
Comparison with Alternative Order-invariant Methods
- 3 Approaches
- MLP (unsorted / sorted input) : points as nx3 arrays
- LSTM : points as a sequence
- Symmetry operation : Attention sum, Average pooling, Max pooling
- Result : Max pooling = Best performance (Acc 87.1%)
Effectiveness of Input and Feature Transformations
Robustness Test
- Robust to various input corruptions
- Model : Max pooling network / Input points : normalized into a unit sphere
- Result : with 50% of the points missing, Acc drops by only 2.4% and 3.8% w.r.t. furthest and random input sampling
- Robust to outliers
5.3 Visualizing PointNet
- Critical point sets $C_S$ and Upper-bound shapes $N_S$ for sample shapes $S$
- Critical point sets $C_S$ : points that contribute to the max pooled feature (summarized skeleton of the shape)
- Upper-bound shapes $N_S$ : largest possible point cloud that gives the same global shape feature $f(S)$
- Result : losing some non-critical points does not change $f(S)$ (Robustness)
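The critical point set can be read directly off the max pooling step: a point is critical if it attains the maximum in at least one feature dimension. The sketch below assumes the per-point features (the outputs of the shared MLP) are already given as a small hand-made table; the helper names are hypothetical.

```python
def max_pool(per_point_features):
    """Global feature: element-wise max over all points."""
    return [max(col) for col in zip(*per_point_features)]

def critical_point_indices(per_point_features):
    """Indices of points that attain the max in at least one feature dimension."""
    critical = set()
    for column in zip(*per_point_features):
        critical.add(column.index(max(column)))
    return sorted(critical)

# Rows are points, columns are feature dimensions (illustrative values).
feats = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.3, 0.1],  # never the maximum: a non-critical point
    [0.1, 0.2, 0.7],
]
crit = critical_point_indices(feats)
kept = [feats[i] for i in crit]
```

Dropping the non-critical third point leaves the max pooled feature identical, which is the robustness property stated above.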
5.4 Time and Space Complexity Analysis
Code Review
Dataloader
import torch
from torch.utils.data import Dataset
import numpy as np

class PointCloudDataset(Dataset):
    def __init__(self, npoints=1024):
        self.npoints = npoints
        ...

    def __getitem__(self, index):
        points = self.point_list[index]
        #randomly sample points
        choice = np.random.choice(points.shape[0], self.npoints, replace=True)
        points = points[choice, :]
        #normalize to unit sphere
        points = points - np.expand_dims(np.mean(points, axis=0), 0) #center
        dist = np.max(np.sqrt(np.sum(points**2, axis=1)), 0)
        points = points / dist #scale
        points = self.data_augmentation(points)
        label = self.label_list[index]
        return torch.from_numpy(points).float(), torch.tensor(label)

    def data_augmentation(self, points):
        theta = np.random.uniform(0, np.pi*2) #0~360
        rotation_matrix = np.array([[np.cos(theta), -np.sin(theta)],[np.sin(theta), np.cos(theta)]])
        points[:,[0,2]] = points[:,[0,2]].dot(rotation_matrix) # random rotation
        points += np.random.normal(0, 0.02, size=points.shape) # random jitter
        return points
- Point Cloud : the number of points differs per sample. For batch training the point count must be equalized → set n_points and randomly sample that many points from each sample
- The sampled points are normalized into a unit sphere
- Data augmentation : random rotation around the y-axis, jittering with Gaussian noise
Main network
class PointNetCls(nn.Module):
    def __init__(self, num_classes=2):
        super(PointNetCls, self).__init__()
        self.tnet = TNet(dim=3)
        self.mlp1 = mlpblock(3, 64)
        self.tnet_feature = TNet(dim=64)
        self.mlp2 = nn.Sequential(
            mlpblock(64, 128),
            mlpblock(128, 1024, act_f=False)
        )
        self.mlp3 = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256, dropout_rate=0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        """
        :input size: (N, n_points, 3)
        :output size: (N, num_classes)
        """
        x = x.transpose(2, 1) #N, 3, n_points
        trans = self.tnet(x) #N, 3, 3
        x = torch.bmm(x.transpose(2, 1), trans).transpose(2, 1) #N, 3, n_points
        x = self.mlp1(x) #N, 64, n_points
        trans_feat = self.tnet_feature(x) #N, 64, 64
        x = torch.bmm(x.transpose(2, 1), trans_feat).transpose(2, 1) #N, 64, n_points
        x = self.mlp2(x) #N, 1024, n_points
        x = torch.max(x, 2, keepdim=False)[0] #N, 1024 (global feature)
        x = self.mlp3(x) #N, num_classes
        return x, trans_feat
- (1) Compute a transformation matrix for the input features with T-Net → apply the transformation by matrix multiplication
- (2) Shared mlp1 : feature dim 3 → 64
- (3) Apply T-Net and matrix multiplication to transform the 64-dim output of shared mlp1
- (4) Shared mlp2 : feature dim 64 → 128 → 1024
- (5) Extract a 1024-dim vector by Max pooling
- (6) Last mlp3 performs classification
mlpblock, fcblock
def mlpblock(in_channels, out_channels, act_f=True):
    # Shared MLP: a 1D conv with kernel size 1 applies the same weights to every point
    layers = [
        nn.Conv1d(in_channels, out_channels, 1),
        nn.BatchNorm1d(out_channels),
    ]
    if act_f:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def fcblock(in_channels, out_channels, dropout_rate=None):
    layers = [
        nn.Linear(in_channels, out_channels),
    ]
    if dropout_rate is not None:
        layers.append(nn.Dropout(p=dropout_rate))
    layers += [
        nn.BatchNorm1d(out_channels),
        nn.ReLU()
    ]
    return nn.Sequential(*layers)
- Shared mlp : implemented as a 1D conv layer with kernel size=1
T-Net
class TNet(nn.Module):
    def __init__(self, dim=64):
        super(TNet, self).__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            mlpblock(dim, 64),
            mlpblock(64, 128),
            mlpblock(128, 1024)
        )
        self.fc = nn.Sequential(
            fcblock(1024, 512),
            fcblock(512, 256),
            nn.Linear(256, dim*dim)
        )

    def forward(self, x):
        x = self.mlp(x)
        x = torch.max(x, 2, keepdim=True)[0]
        x = x.view(-1, 1024)
        x = self.fc(x)
        # Add the identity so the predicted transform starts close to a no-op
        idt = torch.eye(self.dim, dtype=torch.float32).flatten().unsqueeze(0).repeat(x.size()[0], 1)
        idt = idt.to(x.device)
        x = x + idt
        x = x.view(-1, self.dim, self.dim)
        return x
- Computes the transformation matrix that maps the input to a canonical space
Train
import torch
import torch.nn as nn

def feature_transform_regularizer(trans):
    # Penalize the deviation of the feature transform from an orthogonal matrix: ||A A^T - I||
    D = trans.size()[1]
    I = torch.eye(D)[None, :, :]
    I = I.to(trans.device)
    loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2,1)) - I, dim=(1,2)))
    return loss

#sample data
points = torch.rand(5, 1024, 3)
target = torch.empty(5, dtype=torch.long).random_(10)
model = PointNetCls(num_classes=10)
loss_f = nn.CrossEntropyLoss()
pred, trans_feat = model(points)
loss = loss_f(pred, target)
loss += feature_transform_regularizer(trans_feat) * 0.001
- Defines the regularization function for the feature transform
- Loss : Cross entropy loss
Reference
[CV_3D] PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
Background
- 3D sparse convolution : convolution technique that can be applied efficiently to sparse 3D voxel data
- Point Set Abstraction : PointNet++'s method for encoding local features of a point set
- keypoint sampling → MLP → Max pooling → produces a feature vector
- Can encode local features with a small number of points
3D detection methods
- (1) Grid based
- Converts the irregular pc into regular 3D voxels or a 2D BEV map for detection
- Ex. 3D sparse conv → efficient / receptive field is limited by the kernel size
- (2) Point based
- Encodes features directly from the points, without conversion, for detection
- Ex. PointNet Set Abstraction → flexible receptive field, accurate contextual information / pairwise distance computation → cost ↑
- (3) PV-RCNN (Grid and Point based)
PV-RCNN for Point Cloud Object Detection
1) 3D Voxel CNN for Efficient Feature Encoding and Proposal Generation
Proposal Generation
- Convert the pc into voxels
- → Obtain an 8x downsampled feature volume through (3x3x3) 3D sparse conv
- → Convert it into a 2D BEV feature map
- → Generate 3D box proposals with 2 anchors (0º, 90º) per feature map pixel
- → For each anchor, objectness classification & box regression
Problems of RoI Pooling
- Downsampling lowers the resolution by 8x → hard to locate the input object precisely (information loss)
- Upsampling - Interpolation : sparse / Set abstraction : enables robust refinement but computation cost ↑
Solution by PV-RCNN
- Take the grid points inside every box proposal → obtain multi-scale feature volumes for the grid points, apply Set abstraction
- Aggregate the feature volumes into the sampled Keypoints → RoI grid features are generated from the keypoints
2) Voxel-to-Keypoint Scene Encoding via Voxel Set Abstraction
VSA (Voxel Set Abstraction) Module
- Sample a fixed number of Keypoints from the whole pc with FPS
- Gather the voxel features that fall within a fixed radius ($r_k$) into a set
- $r_k$ is set differently for each layer → flexible receptive field
- Encode multi-scale feature volumes through the Voxel Set Abstraction of a PointNet block
- $M$ : random sampling of T voxels, $G$ : MLP, $max$ : Max pooling
- $f_i^{p} = [f_i^{pv}, f_i^{raw}, f_i^{bev}]$ (concatenation)
- $f_i^{pv}$ : keypoint features collected from every layer
- $f_i^{raw}$ : Set abstraction result on the raw data (compensates for the quantization loss caused by voxelization)
- $f_i^{bev}$ : Keypoint feature taken from the BEV map (wider receptive field)
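The FPS step that selects the keypoints can be sketched in plain Python. This is a minimal illustration rather than PV-RCNN's implementation; the function name `farthest_point_sampling`, the `start_index` argument, and the toy point cloud are assumptions made for the example.

```python
import math

def farthest_point_sampling(points, num_keypoints, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from the already chosen set.

    points: list of (x, y, z) tuples; returns the indices of the chosen keypoints.
    start_index is an arbitrary choice made for this sketch.
    """
    if not 0 < num_keypoints <= len(points):
        raise ValueError("num_keypoints must be in [1, len(points)]")
    chosen = [start_index]
    # Distance from every point to its nearest chosen keypoint so far.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(chosen) < num_keypoints:
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return chosen

# Toy cloud: two nearly coincident points near the origin and three spread-out points.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 4.0, 0.0), (5.0, 4.0, 0.0)]
keypoints = farthest_point_sampling(cloud, 3)
```

The near-duplicate point at index 1 is never selected, which is the coverage property that makes FPS a common choice for keypoint sampling.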
PKW (Predicted Key point Weighting) Module
- Function : adds a PC segmentation network to compute a foreground confidence weight for each point
- Implementation : the keypoint feature is multiplied by the foreground confidence
- Effect : foreground feature vectors have more influence during refinement
3) Keypoint-to-Grid RoI Feature Abstraction for Proposal Refinement
RoI-grid Pooling Module
- Sample 6x6x6 grid points inside each 3D proposal
- Encode the keypoint features into the grid points of the RoI through SA
- Effect : more flexible receptive field, contextual information
- +) Keypoints outside the boundary are encoded as well
- For each grid point, build keypoint sets at several distances → sample T of them → MLP → Max pooling = Grid point feature
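The 6x6x6 grid sampling can be illustrated as follows. This is a simplified sketch: it assumes an axis-aligned box and ignores the heading angle of the proposal, and the function name `roi_grid_points` and the example box are hypothetical.

```python
def roi_grid_points(center, size, grid_size=6):
    """Return grid_size**3 points at the cell centres of an axis-aligned 3D box.

    center: (cx, cy, cz), size: (length, width, height).
    Box rotation (heading) is ignored here to keep the sketch short.
    """
    points = []
    for ix in range(grid_size):
        for iy in range(grid_size):
            for iz in range(grid_size):
                idx = (ix, iy, iz)
                # (i + 0.5) / grid_size runs over cell centres in (0, 1); shift to (-0.5, 0.5).
                points.append(tuple(
                    c + ((i + 0.5) / grid_size - 0.5) * s
                    for c, s, i in zip(center, size, idx)
                ))
    return points

grid = roi_grid_points(center=(10.0, 2.0, -1.0), size=(4.0, 2.0, 1.5))
```

Each of the 216 grid points would then gather nearby keypoint features through set abstraction, as described in the list above.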
3D Proposal refinement and confidence prediction
- Pass the grid point features through a 2-layer MLP to produce a 256-dim RoI feature vector
- Result : Confidence & Box refinement are computed
- $y_k$ : uses IoU to judge which box proposal is better
- $L_{iou}$ : used when computing the confidence (CE loss)
Training losses
- Total loss = Region proposal loss + Key point segmentation loss + Proposal refinement loss
[CV_FER] Facial Motion Prior Networks for Facial Expression Recognition
FMPN-FER : Facial Motion Prior Networks for Facial Expression Recognition
FMPN-FER Architecture
- Facial-Motion Mask Generator (FMG)
- Generate a facial mask to focus on facial muscle moving regions
- Use avg differences bw neutral faces and expressive faces as training guidance (pseudo gt masks)
- Prior Fusion Net (PFN)
- Generated mask is applied to and fused with original input expressive face
- Classification Net (CN)
- Extract features and predict facial expression label (6 class)
Implementation Details
- CN : Inception V3 (pretrained on ImageNet)
- 5 landmarks are extracted, followed by face normalization
- Image Transforms : Random crop from four corners or center & Random horizontal flip
- Training (2 steps)
- Step 1 : tune only FMG for 300 epochs, using Adam optimizer
- LR decays linearly from epoch 150 (FMG : 1e-4 to 0)
- Step 2 : jointly train the entire framework for 200 epochs with λ1 = 10 and λ2 = 1
- LR decays linearly from epoch 100 (FMG : 1e-5, CN : 1e-4)
- l_total = λ1 * l_G(MSE) + λ2 * l_C(CE) = 10 * l_G + l_C
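A small sketch of the FMG learning-rate schedule and the weighted loss from the notes above. It assumes the rate stays constant until the decay start and then falls linearly to 0 at the last epoch; the function names `linear_decay_lr` and `total_loss` are hypothetical and not taken from the authors' code.

```python
def linear_decay_lr(epoch, base_lr, total_epochs, decay_start):
    """Constant base_lr until decay_start, then linear decay reaching 0 at total_epochs."""
    if epoch < decay_start:
        return base_lr
    remaining = total_epochs - epoch
    return base_lr * max(remaining, 0) / (total_epochs - decay_start)

def total_loss(loss_g, loss_c, lambda1=10.0, lambda2=1.0):
    """l_total = lambda1 * l_G + lambda2 * l_C with the weights from the notes."""
    return lambda1 * loss_g + lambda2 * loss_c

# FMG-only stage: 300 epochs, base rate 1e-4, decay assumed to start at epoch 150.
fmg_schedule = [linear_decay_lr(e, 1e-4, 300, 150) for e in (0, 150, 225, 300)]
```

With these settings the rate is still 1e-4 at epoch 150, half of that at epoch 225, and 0 at epoch 300.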
Experimental Results
- MMI Facial Expression Database
- Labelled with 6 basic expressions (Disgust > Sadness > Happy > Fear > Surprise > Anger)
- 3 peak frames around center of each labelled sequence are selected → Total : 624 expressive faces
- 10-fold person-independent cross-validation experiments
- Details for MMI
[CV_GAN] Generative Adversarial Nets
GAN : Generative Adversarial Nets
https://jeonggg119.tistory.com/37
Abstract
- Estimating Generative models via an Adversarial process
- Simultaneously training two models (minimax two-player game)
- Generative model G : capturing data distribution → recovering training data distribution
- Discriminative model D : estimating probability that a sample came from training DB rather than G → equal to 1/2
- G and D are defined by multilayer perceptrons & trained with backprop
1. Introduction
- The promise of DL : to discover models that represent probability distributions over many kinds of data
- The most striking success in DL : Discriminative models that map a high dimensional, rich sensory input to a class label
- based on backprop and dropout
- using piecewise linear units with well-behaved gradients
- Deep Generative model : less impact due to..
- difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation
- difficulty of leveraging benefits of piecewise linear units
- GAN : training both models using only backprop and dropout & sampling from G using only forward prop
- Generative model G : generating samples by passing random noise through a multilayer perceptron
- Discriminative model D : also defined by a multilayer perceptron
- No need for Markov chains or inference networks
2. Related work
- RBMs(restricted Boltzmann machines), DBMs(deep Boltzmann machines) : undirected graphical models with latent variables
- DBNs(Deep belief networks) : hybrid models containing a single undirected layer and several directed layers
- Score matching, NCE(noise-contrastive estimation) : criteria that don't approximate or bound log-likelihood
- GSN(generative stochastic network) : extending generalized DAE -> training G to draw samples from desired distribution
3. Adversarial nets
1) Adversarial modeling (G+D) based on MLPs
- p_g : G's distribution
- p_z(z) : Input noise random variables
- G : differentiable function represented by MLP -> G(z) : mapping to data space -> output : fake img
- D(x) : probability that x came from the train data rather than p_g from G -> output : single scalar
2) Two-player minimax game with value function V(G,D)
- D : maximize probability of assigning correct label to Training examples & Samples from G
- D(x)=1, D(G(z))=0
- G : minimize log(1-D(G(z)))
- D(G(z))=1
- Implementation : train G to maximize log(D(G(z))) = stronger gradients early in learning (preventing saturations)
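Why maximizing log(D(G(z))) gives stronger gradients early on can be checked numerically. The sketch below assumes D(G(z)) = sigmoid(a), where a is the discriminator's logit for a generated sample; the function names are made up for the illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): the signal G gets when minimising log(1 - D(G(z)))."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a): the signal G gets when maximising log D(G(z))."""
    return 1.0 - sigmoid(a)

# Early in training D rejects fakes confidently, i.e. the logit a is very negative.
early_logit = -6.0
print(abs(grad_saturating(early_logit)), abs(grad_non_saturating(early_logit)))
```

When D(G(z)) is close to 0, the first gradient nearly vanishes while the second stays close to 1, which is the saturation issue the alternative objective avoids.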
3) Theoretical Analysis
- Training criterion allows one to recover data generating distribution as G and D are given enough capacity
- [Algorithm 1] k steps of optimizing D and 1 step of optimizing G
- D : being maintained near its optimal solution
- G : changing slowly enough
- Loss function for G : min log(1-D(G(z))) => max log(D(G(z))) for stronger gradients early in training
- D is trained to discriminate samples from data, converging to D*(x)=P_d(x)/(P_d(x)+P_g(x))
- With D at the optimal state of its objective, G is trained to reach its own objective
- ∴ P_g(x) = P_data(x) <=> D(G(z))=1/2
4. Theoretical Results
- G implicitly defines P_g as distribution of the samples G(z) obtained when z~P_z
- [Algorithm 1] to converge to a good estimator of P_data
- Non-parametric : representing a model with infinite capacity by studying convergence in space of probability density func
- Global optimum for p_g = p_data
4.1 Global Optimality of p_g = p_data
- Optimal D for any given G
- For G fixed, optimal D is D*(x)=P_d(x)/(P_d(x)+P_g(x))
- Global minimum of C(G) = - log4 is achieved if and only if P_g=P_data
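The two statements of Section 4.1 can be verified on a small discrete example. This is only a numerical sanity check under the assumption of a finite sample space with p_data(x) + p_g(x) > 0 everywhere; the function names and the toy distributions are illustrative.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for every outcome x of a discrete distribution."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def criterion(p_data, p_g):
    """C(G) = E_{x~p_data}[log D*(x)] + E_{x~p_g}[log(1 - D*(x))]."""
    d_star = optimal_discriminator(p_data, p_g)
    value = 0.0
    for pd, pg, d in zip(p_data, p_g, d_star):
        if pd > 0:
            value += pd * math.log(d)
        if pg > 0:
            value += pg * math.log(1.0 - d)
    return value

p_data = [0.5, 0.3, 0.2]
matched = criterion(p_data, p_data)              # p_g = p_data
mismatched = criterion(p_data, [0.2, 0.3, 0.5])  # p_g differs from p_data
```

When p_g = p_data, D* is 1/2 everywhere and C(G) equals -log 4; any mismatch gives a strictly larger value.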
4.2 Convergence of Algorithm 1
[Proposition 2]
- If G and D have enough capacity, and at each step of Algorithm 1,
- D is allowed to reach optimum given G & P_g is updated to improve criterion → P_g = P_data
- pf) V(G, D) = U(P_g, D) : convex function in P_g
- Computing a gradient descent update for P_g at optimal D given G
- With sufficiently small updates of P_g
- Optimizing θ_g rather than P_g itself
- Excellent performance of MLP in practice → reasonable model to use despite their lack of theoretical guarantees
5. Experiments
- Datasets : MNIST, Toronto Face Database(TFD), CIFAR-10
- G : ReLU + sigmoid activations / Dropout and other noise at intermediate layers / Noise as input to bottommost layer
- D : Maxout activations / Dropout
- Estimation method : Gaussian Parzen window-based log-likelihood estimation for probability of test data
- Rightmost column : nearest neighboring training sample → Model has not memorized training set
- Samples are fair random draws (Not cherry-picked)
- Sampling does not depend on Markov chain mixing → Samples are uncorrelated
- Linear Interpolation bw coordinates in z space of full model
6. Advantages and disadvantages
1) Disadvantages
- No explicit representation of P_g(x)
- D must be synchronized well with G during training (G must not be trained too much without updating D)
- Otherwise G collapses too many values of z to the same value of x and lacks the diversity needed to model P_data
2) Advantages
(1) Computational Advantages
- Markov chains are never needed / Only backprop is used / No Inference is needed
- Wide variety of functions can be incorporated into model
(2) Statistical Advantages from G
- Not being updated directly with data, but only with gradients flowing through D
- (= Components of input are not copied directly into G's parameters)
- Representing very sharp, even degenerate distributions
7. Conclusions and future work
- conditional GAN p(x|c) : adding c as input to both G and D
- Learned approximate inference : training auxiliary network to predict z given x
- Similar to inference net trained by wake-sleep algorithm
- Advantage : inference net trained for a fixed G after G has finished training
- All conditionals model p(x_S | x_{not S}) : S is a subset of indices of x; obtained by training a family of conditional models that share params
- To implement a stochastic extension of deterministic MP-DBM
- Semi-supervised learning : when limited labeled data is available
- Efficiency improvements : training accelerated by coordinating G and D or determining better distributions to sample z
[CV_3D] PointMLP: Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework
PointMLP: Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework
Paper Review
1. Introduction
- Point Cloud : unordered, irregular set of points → sparseness and noise restrict performance
- Prior Research : local geometric extractors using convolution, graph, or attention → memory overhead
- PointMLP : DNN for PC using only residual feed-forward MLPs (No local geometric extractors)
- +) lightweight local geometric affine module : to adaptively transform point feature in a local region
- Result : SOTA classification performance on ModelNet40, real-world ScanObjectNN
2. Related Work
- Two mainstreams of Point Cloud Analysis
- Projecting PC to intermediate voxels or 2D imgs : fast, efficient BUT detail degradation by information loss
- Directly processing PC : e.g. PointNet, PointNet++ → PointMLP follows the philosophy of PointNet++ but is simpler
- Local geometry exploration
- Goal : How to generate better regional points representation ?
- Prior Research : local geometric extractors using convolution, graph, or attention
- Ex. PointConv, PAConv / EdgeConv, 3DGCN / PCT, Point Transformer
- Limitation : minimal improvement, saturated performance
-
Deep Network Architecture
- Prior Development : Image Processing Network (stacking learning layers) & DNN like ResNet
- Deep MLP architecture : efficiency and generality
- PointMLP : simple and powerful Deep Residual MLP for PC
3. Deep Residual MLP for Point Cloud
3.1 Point-based Methods
- Motivation : to directly consume pc from beginning & avoid unnecessary rendering
- Goal : to directly learn representation $f$ of point $P$ using NN
- Limitations : computational complexity (prohibitive inference latency) & saturated performance gain
- Ex) PointNet, PointNet++, Point Transformer, ...
PointNet++
- Main idea : learning hierarchical features by stacking multiple learning stages
- In each stage $s$, $N_s$ points are re-sampled by FPS
- Formulation : $g_i = A(Φ(f_{i,j}) \mid j=1, ..., K)$
- $A$ : aggregation function (max-pooling)
- $Φ$ : local feature extraction function (MLP)
- $f_{i,j}$ : $j$-th neighbor point feature of $i$-th sampled point
- $K$ : number of neighbor points
3.2 PointMLP (feed-forward residual MLP)
- Main idea : hierarchically aggregating local features extracted by MLPs (No local extractor)
- Formulation : $g_i = Φ_{pos}(A(Φ_{pre}(f_{i,j}) \mid j=1, ..., K))$
- $Φ_{pre}$, $Φ_{pos}$ : residual point MLP blocks to extract local features
- In paper, 2 residual blocks in both $Φ_{pre}$ and $Φ_{pos}$ / neighbors by KNN : $K$ = 24
- $Φ_{pre}$ : to learn shared weights from a local region
- $Φ_{pos}$ : to extract deep aggregated features
- MLP = FC, normalization, activation layers
- $A$ : aggregation function (max-pooling)
- $MLP(x) + x$ : mapping function (a series of homogeneous residual MLP blocks)
- Recursively repeating the operation for $s$ stages → receptive field ↑
- In paper, $s$ = 4
- Merits
- MLP → permutation invariance
- Residual connection → layers ↑ → deep feature representation
- No sophisticated local extractors → efficient with highly optimized feed-forward MLPs
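The permutation invariance listed under Merits follows from applying the same MLP to every neighbor and then max-pooling. A minimal pure-Python sketch, with a single FC + ReLU layer standing in for Φ and made-up toy weights (none of these names come from the PointMLP code):

```python
def shared_mlp(feature, weights, bias):
    """One FC + ReLU layer applied to a single neighbor feature (same weights for all neighbors)."""
    return [
        max(0.0, sum(w * f for w, f in zip(row, feature)) + b)
        for row, b in zip(weights, bias)
    ]

def aggregate(neighbors, weights, bias):
    """g_i = A(Phi(f_{i,j}) | j = 1..K) with A = channel-wise max-pooling."""
    transformed = [shared_mlp(f, weights, bias) for f in neighbors]
    return [max(channel) for channel in zip(*transformed)]

# Toy weights and neighbor features, made up for this sketch.
weights = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
bias = [0.1, -0.2]
neighbors = [[0.2, 0.1, 0.4], [1.0, 0.3, -0.5], [-0.7, 0.9, 0.0]]

g_original = aggregate(neighbors, weights, bias)
g_shuffled = aggregate(list(reversed(neighbors)), weights, bias)
```

Reordering the neighbors leaves the aggregated feature unchanged, because the maximum over a set does not depend on the order of its elements.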
[Code] Mapping function $MLP(x) + x$
import torch.nn as nn

# get_activation is a helper defined in the referenced repository.
class ConvBNReLURes1D(nn.Module):
    def __init__(self, channel, kernel_size=1, groups=1, res_expansion=1.0, bias=True, activation='relu'):
        super(ConvBNReLURes1D, self).__init__()
        self.act = get_activation(activation)
        # First half of the block: channel -> channel * res_expansion
        self.net1 = nn.Sequential(
            nn.Conv1d(in_channels=channel, out_channels=int(channel * res_expansion),
                      kernel_size=kernel_size, groups=groups, bias=bias),
            nn.BatchNorm1d(int(channel * res_expansion)),
            self.act
        )
        # Second half: back to the original channel count so the residual sum is valid
        if groups > 1:
            self.net2 = nn.Sequential(
                nn.Conv1d(in_channels=int(channel * res_expansion), out_channels=channel,
                          kernel_size=kernel_size, groups=groups, bias=bias),
                nn.BatchNorm1d(channel),
                self.act,
                nn.Conv1d(in_channels=channel, out_channels=channel,
                          kernel_size=kernel_size, bias=bias),
                nn.BatchNorm1d(channel),
            )
        else:
            self.net2 = nn.Sequential(
                nn.Conv1d(in_channels=int(channel * res_expansion), out_channels=channel,
                          kernel_size=kernel_size, bias=bias),
                nn.BatchNorm1d(channel)
            )

    def forward(self, x):
        # MLP(x) + x followed by the activation
        return self.act(self.net2(self.net1(x)) + x)
[Code] $Φ_{pre}$
- To learn shared weights from a local region
class PreExtraction(nn.Module):
    def __init__(self, channels, out_channels, blocks=1, groups=1, res_expansion=1, bias=True,
                 activation='relu', use_xyz=True):
        """
        input: [b,g,k,d]: output:[b,d,g]
        :param channels:
        :param blocks:
        """
        super(PreExtraction, self).__init__()
        in_channels = 3+2*channels if use_xyz else 2*channels
        self.transfer = ConvBNReLU1D(in_channels, out_channels, bias=bias, activation=activation)
        operation = []
        for _ in range(blocks):
            operation.append(
                ConvBNReLURes1D(out_channels, groups=groups, res_expansion=res_expansion,
                                bias=bias, activation=activation)
            )
        self.operation = nn.Sequential(*operation)

    def forward(self, x):
        b, n, s, d = x.size()  # torch.Size([32, 512, 32, 6])
        x = x.permute(0, 1, 3, 2)
        x = x.reshape(-1, d, s)
        x = self.transfer(x)
        batch_size, _, _ = x.size()
        x = self.operation(x)  # [b, d, k]
        x = F.adaptive_max_pool1d(x, 1).view(batch_size, -1)
        x = x.reshape(b, n, -1).permute(0, 2, 1)
        return x
[Code] $Φ_{pos}$
- To extract deep aggregated features
class PosExtraction(nn.Module):
    def __init__(self, channels, blocks=1, groups=1, res_expansion=1, bias=True, activation='relu'):
        """
        input[b,d,g]; output[b,d,g]
        :param channels:
        :param blocks:
        """
        super(PosExtraction, self).__init__()
        operation = []
        for _ in range(blocks):
            operation.append(
                ConvBNReLURes1D(channels, groups=groups, res_expansion=res_expansion, bias=bias, activation=activation)
            )
        self.operation = nn.Sequential(*operation)

    def forward(self, x):  # [b, d, g]
        return self.operation(x)
3.3 Geometric Affine Module
- Motivation
- Depth can be increased with more stages $s$ or more residual blocks, but a deeper MLP loses accuracy and stability (less robust)
- pc = sparse, irregular in local regions → each local region would need a different extractor, which a shared residual MLP cannot provide
- Lightweight local geometric affine module
- To transform local neighbor points to normal distribution while maintaining original geometric properties
- sigma : take the variance with respect to the center point, divide by the product of $k$ (neighbor #), $n$ (point #), $d$ (= 3), then take the square root, i.e. $σ = \sqrt{\frac{1}{k \times n \times d} \sum_{i=1}^{n} \sum_{j=1}^{k} (f_{i,j} - f_i)^2}$
- alpha, beta : learnable parameters
[Code]
# Group points
idx = knn_point(self.kneighbors, xyz, new_xyz)
grouped_xyz = index_points(xyz, idx) # [B, npoint, k, 3]
grouped_points = index_points(points, idx) # [B, npoint, k, d]
# Calculate fi and sigma
mean = torch.mean(grouped_points, dim=2, keepdim=True)
std = torch.std((grouped_points - mean).reshape(B, -1), dim=-1, keepdim=True).unsqueeze(dim=-1).unsqueeze(dim=-1)
# Perform Normalization
grouped_points = (grouped_points - mean) / (std + 1e-5)
grouped_points = self.affine_alpha * grouped_points + self.affine_beta
3.4 Computational complexity and Elite version
- Motivation : FC layers → huge parameters, computational complexity => How to improve efficiency?
- Elite version
- ➀ Bottleneck structure for mapping function $Φ_{pre}$, $Φ_{pos}$ (residual MLP blocks)
- Intermediate FC layer channel # ↓ (by a factor of 4), then ↑ back to the original feature map size => parameters ↓
- ➁ MLP blocks, Embedding dimension # ↓
- ➂ Grouped FC operation : not used (X)
[Code] pointMLP vs. pointMLP-elite
def pointMLP(num_classes=40, **kwargs) -> Model:
    return Model(points=1024, class_num=num_classes, embed_dim=64, groups=1, res_expansion=1.0,
                 activation="relu", bias=False, use_xyz=False, normalize="anchor",
                 dim_expansion=[2, 2, 2, 2], pre_blocks=[2, 2, 2, 2], pos_blocks=[2, 2, 2, 2],
                 k_neighbors=[24, 24, 24, 24], reducers=[2, 2, 2, 2], **kwargs)

def pointMLPElite(num_classes=40, **kwargs) -> Model:
    return Model(points=1024, class_num=num_classes, embed_dim=32, groups=1, res_expansion=0.25,
                 activation="relu", bias=False, use_xyz=False, normalize="anchor",
                 dim_expansion=[2, 2, 2, 1], pre_blocks=[1, 1, 2, 1], pos_blocks=[1, 1, 2, 1],
                 k_neighbors=[24, 24, 24, 24], reducers=[2, 2, 2, 2], **kwargs)
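The effect of the bottleneck (`res_expansion=0.25`) on a single residual block can be estimated by counting learnable parameters. The helper below is a back-of-the-envelope sketch that mirrors the `groups=1` branch of `ConvBNReLURes1D` with kernel size 1; the function name and the channel width of 128 are chosen only for illustration.

```python
def residual_block_params(channel, res_expansion, bias=False):
    """Learnable parameters of one residual block: two kernel-size-1 Conv1d + BatchNorm1d pairs.

    BatchNorm1d contributes a weight and a bias per channel; running statistics are not counted.
    """
    hidden = int(channel * res_expansion)
    conv1 = channel * hidden + (hidden if bias else 0)
    conv2 = hidden * channel + (channel if bias else 0)
    batch_norms = 2 * hidden + 2 * channel
    return conv1 + conv2 + batch_norms

full = residual_block_params(channel=128, res_expansion=1.0)    # 33280
elite = residual_block_params(channel=128, res_expansion=0.25)  # 8512
```

At equal width the bottleneck cuts the block to roughly a quarter of its parameters; the elite configuration additionally halves `embed_dim` and uses fewer blocks per stage.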
4. Experiments
4.1 Shape Classification on ModelNet40
- Dataset : ModelNet40 (meshed CAD models, 40 categories)
- Metrics : mAcc(class-avg acc), OA(overall acc)
- Train : 300 epoch, SGD
- Results
4.2 Shape Classification on ScanObjectNN
- Dataset : ScanObjectNN (15000 real world objects, 15 classes) - background, noise, occlusions → hard
- Variant : hardest perturbed variant (PB_T50_RS)
- Metrics : mAcc(class-avg acc), OA(overall acc)
- Train : 200 epoch, batch size 32, SGD
- Results
4.3 Ablation Studies
✅ Network Depth
- Variants : 24, 40, 56-layers PointMLP
- Deeper is not always better → an appropriate depth exists (consider the tradeoff bw acc and stability)
- 40-layers : best tradeoff (85.4% mAcc and 0.3 standard deviation)
- Outperforms recent methods regardless of depth
✅ Geometric Affine Module : important component
- Performance improvement : 3% ↑ for all variants
- Reason1. mapping local input features to a normal distribution → easy train
- Reason2. encoding local geometric information by channel-wise distance to local centroid and variance
- Stability improvement (=better robustness)
✅ 3D Loss landscape
4.4 Part Segmentation
- Dataset : ShapeNetPart (16881 shapes, 16 classes, 50 parts labels in total)
- Results : predictions of PointMLP are close to GT
Code Review
[Code] PointMLP for Classification (ModelNet40)
class Model(nn.Module):
    def __init__(self, points=1024, class_num=40, embed_dim=64, groups=1, res_expansion=1.0,
                 activation="relu", bias=True, use_xyz=True, normalize="center",
                 dim_expansion=[2, 2, 2, 2], pre_blocks=[2, 2, 2, 2], pos_blocks=[2, 2, 2, 2],
                 k_neighbors=[32, 32, 32, 32], reducers=[2, 2, 2, 2], **kwargs):
        super(Model, self).__init__()
        self.stages = len(pre_blocks)
        self.class_num = class_num
        self.points = points
        self.embedding = ConvBNReLU1D(3, embed_dim, bias=bias, activation=activation)
        assert len(pre_blocks) == len(k_neighbors) == len(reducers) == len(pos_blocks) == len(dim_expansion), \
            "Please check stage number consistent for pre_blocks, pos_blocks k_neighbors, reducers."
        self.local_grouper_list = nn.ModuleList()
        self.pre_blocks_list = nn.ModuleList()
        self.pos_blocks_list = nn.ModuleList()
        last_channel = embed_dim
        anchor_points = self.points
        for i in range(len(pre_blocks)):
            out_channel = last_channel * dim_expansion[i]
            pre_block_num = pre_blocks[i]
            pos_block_num = pos_blocks[i]
            kneighbor = k_neighbors[i]
            reduce = reducers[i]
            anchor_points = anchor_points // reduce
            # append local_grouper_list
            local_grouper = LocalGrouper(last_channel, anchor_points, kneighbor, use_xyz, normalize)  # [b,g,k,d]
            self.local_grouper_list.append(local_grouper)
            # append pre_block_list
            pre_block_module = PreExtraction(last_channel, out_channel, pre_block_num, groups=groups,
                                             res_expansion=res_expansion,
                                             bias=bias, activation=activation, use_xyz=use_xyz)
            self.pre_blocks_list.append(pre_block_module)
            # append pos_block_list
            pos_block_module = PosExtraction(out_channel, pos_block_num, groups=groups,
                                             res_expansion=res_expansion, bias=bias, activation=activation)
            self.pos_blocks_list.append(pos_block_module)
            last_channel = out_channel
        self.act = get_activation(activation)
        self.classifier = nn.Sequential(
            nn.Linear(last_channel, 512),
            nn.BatchNorm1d(512),
            self.act,
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            self.act,
            nn.Dropout(0.5),
            nn.Linear(256, self.class_num)
        )

    def forward(self, x):
        xyz = x.permute(0, 2, 1)
        batch_size, _, _ = x.size()
        x = self.embedding(x)  # B,D,N
        for i in range(self.stages):
            # Give xyz[b, p, 3] and fea[b, p, d], return new_xyz[b, g, 3] and new_fea[b, g, k, d]
            xyz, x = self.local_grouper_list[i](xyz, x.permute(0, 2, 1))  # [b,g,3]  [b,g,k,d]
            x = self.pre_blocks_list[i](x)  # [b,d,g]
            x = self.pos_blocks_list[i](x)  # [b,d,g]
        x = F.adaptive_max_pool1d(x, 1).squeeze(dim=-1)
        x = self.classifier(x)
        return x
- Reference : https://github.com/ma-xu/pointMLP-pytorch
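To see how the stage loop in `Model.__init__` shapes the network, the anchor-point count and channel width after each stage can be traced without PyTorch. The helper `stage_shapes` is a hypothetical utility that only replays the integer arithmetic of that loop.

```python
def stage_shapes(points=1024, embed_dim=64, dim_expansion=(2, 2, 2, 2), reducers=(2, 2, 2, 2)):
    """Return (anchor_points, channels) after each stage, following the loop in Model.__init__."""
    shapes = []
    anchor_points, channels = points, embed_dim
    for expansion, reduce in zip(dim_expansion, reducers):
        anchor_points //= reduce
        channels *= expansion
        shapes.append((anchor_points, channels))
    return shapes

point_mlp = stage_shapes()
point_mlp_elite = stage_shapes(embed_dim=32, dim_expansion=(2, 2, 2, 1))
```

With the `pointMLP` settings the classifier receives a 1024-dim feature pooled over 64 anchor points; with the `pointMLPElite` settings the final width is 256.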
[CV_Segmentation] Multi-scale context aggregation by dilated convolutions
Multi-scale context aggregation by dilated convolutions
1. INTRODUCTION
- Semantic segmentation requires combining pixel-level acc with multi-scale contextual reasoning
- Structural differences between image classification and dense prediction
- dense prediction : predicting a label for every pixel of the image
- Repurposed networks : necessary? reduced accuracy when operated densely?
- Modern classification networks
- Integrating multi-scale contextual information via successive pooling and subsampling → reduce resolution
- BUT dense prediction needs full-resolution output
- Demand of multi-scale reasoning and full-resolution
- repeated up-convolutions : need severe intermediate downsampling → necessary?
- combination predictions of multiple rescaled inputs : separated analysis of input → necessary?
- Dilated convolutions : conv module designed for dense prediction (semantic segmentation)
- multi-scale contextual information without losing resolution
- plugged into existing architectures at any resolution
- no pooling or subsampling
- exponential expansion of receptive field without losing resolution or coverage
- accuracy of sota semantic segmentation ↑
2. Dilated convolutions
- Dilated convolution (*l) can apply same filter at different ranges using different dilation factors (l)
- F_(i+1) = F_i (*2^i) k_i for i = 0,1,...,n-2
- F : discrete functions, k : discrete 3x3 filters
- Size of receptive field of each element in F_(i+1) = [ 2^(i+2) -1 ] X [ 2^(i+2) -1 ] : square of exponentially increasing size
- (a) F_1 : 3x3, (b) F_2 : 7x7, (c) F_3 : 15x15 receptive field
- non-red field = zero value
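The exponential growth of the receptive field can be reproduced with a 1-D analogue. The sketch below is an illustration only: it uses the correlation form of the operator with zero padding and an all-ones 3-tap kernel, and the function names are assumptions.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-size 1-D dilated convolution (correlation form) with zero padding.

    Kernel taps are spaced `dilation` samples apart; the kernel length must be odd.
    """
    half = len(kernel) // 2
    output = []
    for p in range(len(signal)):
        total = 0.0
        for t, weight in enumerate(kernel):
            q = p + (t - half) * dilation
            if 0 <= q < len(signal):
                total += weight * signal[q]
        output.append(total)
    return output

def receptive_field_sizes(num_layers):
    """Side length of the receptive field of F_{i+1}: 2^(i+2) - 1 for i = 0..num_layers-1."""
    return [2 ** (i + 2) - 1 for i in range(num_layers)]

# An impulse pushed through 3-tap kernels with dilations 1, 2, 4 spreads over 15 samples.
impulse = [0.0] * 31
impulse[15] = 1.0
response = impulse
for i in range(3):
    response = dilated_conv1d(response, [1.0, 1.0, 1.0], dilation=2 ** i)
support = sum(1 for value in response if value != 0.0)
```

The support of the impulse response (15 samples) matches the 15x15 receptive field of F_3, while each layer still uses only three weights per dimension.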
3. Multi-scale context aggregation
[ Context module ]
- Input, Output
- C feature maps → C feature maps : can maintain resolution
- Same form : can be plugged into any dense prediction architecture
- Each layer has C channels
- directly obtain dense per-class prediction
- feature maps are not normalized, no loss is defined
- Multiple layers that expose contextual information → increase acc
[ Basic Context module ]
- 7 layers : 3x3xC conv with different dilation factors (1,1,2,4,8,16,1)
- A final layer : 1x1xC conv → produce output of the module
- Front end module output feature map : 64x64 resolution → stop expansion after layer 6
- Identity Initialization : set all filters s.t each layer simply passes input directly to the next
- Result : increase dense prediction acc both quantitatively and qualitatively & small # of parameters (total: 64C^2)
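The receptive field after each 3x3 layer of the basic context module follows directly from the dilation factors. The helper name below is hypothetical; it relies only on the fact that a kernel of size k with dilation l widens the field by (k - 1) * l.

```python
def context_module_receptive_fields(dilations, kernel_size=3):
    """Receptive field side length after each layer of a stack of dilated convolutions.

    Each layer adds (kernel_size - 1) * dilation to the side length.
    """
    sizes, current = [], 1
    for dilation in dilations:
        current += (kernel_size - 1) * dilation
        sizes.append(current)
    return sizes

basic_context = context_module_receptive_fields([1, 1, 2, 4, 8, 16, 1])
```

After layer 6 the field is 65x65, which already spans a 64x64 feature map; this is consistent with the note that the expansion stops after layer 6.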
4. Front End
[ Front End module ] : Backbone module of Context module
- Input : reflection padded color image → Output : 64x64xC feature maps
- remove last 2 pooling and striding layers of VGG-16 → convolutions in all subsequent layers are dilated by a factor of 2 for each removed pooling layer
- remove padding of intermediate feature maps
- Training
- Pascal VOC 2012 training set + subset of annotations of validation set
- SGD, batch size = 14, lr = 10^-3, momentum = 0.9, iterations = 60K
- Test result : front end is both simpler and more accurate
5. Experiments
- Implementation : based on Caffe library
- Dataset : Microsoft COCO with VOC-2012 categories
- Training : 2 stage
- 1st : VOC-2012 & COCO : SGD, batch size = 14, momentum = 0.9, iterations = 100K (lr = 10^-3) + 40K (lr = 10^-4)
- 2nd : fine-tuned network on VOC-2012 only : iterations = 50K (lr = 10^-5)
- Test result
- Front-end module (alone) : 69.8% mean IoU on val set, 71.3% on test set
- Attribution : high acc by removal of vestigial components for image classification
(1) Controlled evaluation of context aggregation
- context module and structured prediction are synergistic → increase accuracy in each configuration
- large context module increases acc by a larger margin
(2) Evaluation on the test set
- large context module : significant boost in acc over front end
- Context module + CRF-RNN = highest acc
CRF-RNN (Conditional Random Field RNN) : post-processing step to get more fine-grained segmentation results in end to end manner
6. Conclusion
- Dilated convolution : dense prediction + increasing receptive field without losing resolution + increasing acc
- Future arch : end-to-end -> removing the need for pre-training -> raw input, dense label at full resolution output
[CV_Pose Estimation] DeepPose: Human Pose Estimation via Deep Neural Networks
DeepPose: Human Pose Estimation via Deep Neural Networks
1. Introduction
Previous challenges (Limitations)
- Localization of human joints using local detector
strong articulations, small visible joints, occlusions, need to capture context
modeling only a small subset of all interactions bw body parts
- Holistic manner proposed but limited success in real-world problems
DNN (Deep Neural Networks)
- visual classification tasks, object localization
Holistic human Pose estimation as DNN
- Pose estimation <=> Joint regression (location of each joint is regressed)
- Input : full img & 7-layered generic convolutional DNN
- Capturing full context of each body joint
- Simpler to formulate : no need to design whole feature representations, detectors for parts, interactions bw joints
- Cascade of DNN-based pose predictors : increased precision of joint localization
- SOTA or better than SOTA on 4 benchmarks
2. Related Work
- Pictorial Structures (PSs) : distance transform trick
- Tree-based pose models with simple binary potential
- Richer part detectors : enriching representational power + maintaining tractability
- Mixture models on full scale
- Richer higher-order spatial relationships
- Transfer joint locations, Nearest neighbor setup
- Semi-global classifier for part config : linear -> less expressive representation (only arms)
- Pose regression : 3D pose
- CNNs with Neighborhood component analysis to regress : No cascade
- NN-based pose embedding : contrastive loss
3. Deep Learning Model for Pose Estimation
-
Encoding locations of all k body joints in Pose vector
- x : Input Image data
- k : # of body joints
- y : GT pose vector (2k Dim)
- y_i : x, y coordinates 2D vector of i-th joint (absolute img coordinates)
- Normalized y_i wrt bounding box b
- b = (b_c, b_w, b_h)
- b_c : center of b (2D)
- b_w : width of b
- b_h : height of b
- Normalized Pose vector
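The normalization with respect to the box b = (b_c, b_w, b_h) translates each joint by the box center and scales it by the box width and height. A minimal sketch with toy numbers; the function names and the tuple layout of `box` are assumptions made for the example.

```python
def normalize_joint(joint, box):
    """N(y_i; b): translate by the box centre, then scale by the box width and height."""
    (cx, cy), width, height = box
    return ((joint[0] - cx) / width, (joint[1] - cy) / height)

def denormalize_joint(normalized, box):
    """Inverse mapping, back to absolute image coordinates."""
    (cx, cy), width, height = box
    return (normalized[0] * width + cx, normalized[1] * height + cy)

def normalize_pose(pose, box):
    """Apply N(.; b) to every joint of a pose given as a list of (x, y) pairs."""
    return [normalize_joint(joint, box) for joint in pose]

box = ((110.0, 100.0), 80.0, 200.0)  # b = (b_c, b_w, b_h), toy numbers
pose = [(110.0, 100.0), (150.0, 200.0), (70.0, 0.0)]
normalized_pose = normalize_pose(pose, box)
```

The box center maps to (0, 0) and the box corners to (±0.5, ±0.5); the inverse mapping recovers absolute image coordinates from a prediction.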
3.1 Pose Estimation as DNN-based Regression [Initial stage]
-
Architecture
- x : Input Image data
- φ : regression function based on conv DNN
- Input : 220 x 220 img -> 55 x 55 (by stride = 4)
- 7 layers (filter size : 11x11, 5x5, 3x3, 3x3, 3x3)
- Pooling : applied after 3 layers
- Total # of params : 40M
- Generic DNN Arch -> Holistic modeling & all internal features can be shared
- θ : parameters of model
- y* : pose prediction vector (absolute img coordinates vector)
-
Loss function and Training
- L2 loss : minimize distance bw prediction and true pose vector
- Using Normalized training set D_N
- Optimization over individual joints (if not all joints are labeled, omit that terms)
- Mini-batch size = 128, learning rate = 0.0005
- Data Augmentation : random translated crop, left/right flip
- DropOut regularization rate = 0.6
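The L2 objective with omitted terms for unlabeled joints can be written as a short function. Representing a missing label as `None` is a choice made for this sketch, and the joint coordinates are toy values in normalized space.

```python
def pose_l2_loss(predicted, target):
    """Sum of squared distances over labelled joints; joints whose target is None are skipped."""
    loss = 0.0
    for pred, true in zip(predicted, target):
        if true is None:
            continue  # unlabelled joint: its term is omitted from the objective
        loss += (pred[0] - true[0]) ** 2 + (pred[1] - true[1]) ** 2
    return loss

predicted = [(0.1, 0.2), (0.4, -0.3), (0.0, 0.0)]
target = [(0.0, 0.2), None, (0.3, 0.4)]
loss = pose_l2_loss(predicted, target)
```

Only the first and third joints contribute here, so the prediction for the unlabeled second joint has no effect on the loss.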
3.2 Cascade of Pose Regressors
-
Purpose : to solve limited capacity for detail (fixed input size) and achieve better precision
-
Same network Arch for all stages of cascade but Different learnable parameters
-
Subsequent stage : predict and refine displacement of joint locations y^s - y^(s-1)
- θ_s : learned network params
- φ_i : pose displacement regressor
- y_i : joint location
- b_i : joint bbox
- diam(y^s) : distance bw opposing joints on human torso
- σ : scale parameter for diam(y^s)
-
Process
- Using predicted joint locations to focus on relevant parts of img
- Cropping sub-imgs around predicted joint location
- Applying pose displacement regressor on sub-imgs
- Result : higher resolution imgs -> finer features -> higher precision
-
Full augmented Training data
4. Empirical Evaluation
4.1 Setup
Datasets
- Frames Labeled In Cinema (FLIC)
- 4000 train img + 1000 test img from Hollywood movies
- diverse poses and clothing
- 10 upper body joints are labeled for each human
- Leeds Sports Dataset (LSP)
- 11000 train img + 1000 test img from sports activities
- 150 pixel height for majority of people
- 14 joints labeled for each person full body
Metrics
- Percentage of Correct Parts (PCP) : detected if distance bw predicted and true limb joint is at most half of limb length -> hard to detect for shorter limbs, lower arms
- Percentage of Detected Joints (PDJ) : varying degrees, detected if distance bw predicted and true limb joint is within certain fraction of torso diameter -> all joints are based on same distance threshold
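PDJ as described above can be computed in a few lines. The sketch assumes a single person with a known torso diameter and uses made-up coordinates; the function name `pdj` is hypothetical.

```python
import math

def pdj(predicted, target, torso_diameter, fraction=0.2):
    """Share of joints whose prediction lies within fraction * torso_diameter of the ground truth."""
    threshold = fraction * torso_diameter
    detected = sum(
        1 for pred, true in zip(predicted, target)
        if math.dist(pred, true) <= threshold
    )
    return detected / len(target)

predicted = [(10.0, 10.0), (52.0, 40.0), (90.0, 95.0), (30.0, 70.0)]
target = [(12.0, 10.0), (50.0, 41.0), (70.0, 95.0), (30.0, 70.0)]
score = pdj(predicted, target, torso_diameter=50.0)
```

With a torso diameter of 50 px the threshold at 0.2 is 10 px, so three of the four joints count as detected. Because the threshold depends on the torso and not on the limb, short limbs are not penalized the way they are under PCP.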
Experimental Details
- FLIC : Rough estimate of initial bbox by Face-based body detector
- LSP : Full img as initial bbox
- To measure optimality of params, Use Average over PDJ at 0.2 across all joints
- To improve generalization, Augment data by sampling 40 randomly translated crop boxes
- Running time : 0.1s per img on a 12 core CPU
- Training complexity is higher
4.2 Results and Discussion
5. Conclusion
- First application of DNNs to human pose estimation
- Capturing context and reasoning about pose in a holistic manner
- Generic CNN for classification tasks can be applied to localization tasks
[CV_Action Recognition] Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition (ST-GCN)
- Automatic Learning both spatial and temporal patterns from data
- Greater expressive power & Stronger generalization capability
Paper Review
1. Introduction
-
Human Action Recognition (HAR)
- Multiple modalities : Appearance, Depth, Optical-flows, "Body skeletons (Dynamic human skeletons)"
- "Dynamic human skeletons" : represented by a time series of human joints
- Limitation of previous works : hand-crafted parts or rules to analyze spatial patterns (not explicitly exploiting spatial relationships among joints) → Less expressive power & Difficult to be generalized
-
ST-GCN
- Components
- Node = Joint of human body
- Two types of Edge (Spatial Edge & Temporal Edge)
- 3 Contributions
- (1) The first attempt to apply GNN for modeling dynamic skeletons for HAR task
- (2) Designing convolutional kernels for skeleton modeling
- (3) Superior performance on two large scale datasets
2. Related Work
Two streams of GNN
- Spectral perspective : locality of graph convolution is considered in the form of spectral analysis
- Spatial perspective : conv filters are applied directly on nodes and their neighbors (This work)
Skeleton-based Action Recognition
- Skeleton : robust to illumination change and scene variation & easy to obtain by depth sensors or HPE algorithms
- Hand-crafted feature based methods : manually designing features to capture dynamics of joint motion
- DL based methods : modeling joints within body parts (explicitly assigned using domain knowledge)
- ST-GCN : applying GCN to skeleton-based AR
- Can learn part information implicitly by using locality of graph conv with temporal dynamics
- No manual part assignment → easier to design to learn better action representations
3. ST-GCN
- Human joints move in small local groups (body parts) → restrict joint trajectories for hierarchical representations
- Motivation : For hierarchical representations and locality, CNN (intrinsic property) is better than manual assignment
3.1 Pipeline Overview
![image](https://private-user-images.githubusercontent.com/83633885/252132123-376236b7-f612-4f6a-a604-dc0089031359.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTUzNjgyMDgsIm5iZiI6MTcxNTM2NzkwOCwicGF0aCI6Ii84MzYzMzg4NS8yNTIxMzIxMjMtMzc2MjM2YjctZjYxMi00ZjZhLWE2MDQtZGMwMDg5MDMxMzU5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTEwVDE5MDUwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWRkYzEyMzMwN2RjMmU3ODVmMTVhNDZiNWI1MjZlMTMwM2Y2MjE4OGM0YjE3ZmNkYjEzMjhlZDdmOThkMmQxZWMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.XborEic50Iq6FD1ug8SHon6CBLcvGCRSLkp_5ul-92M)
- Skeleton based data : obtained from motion-capture devices or pose estimation algorithms from video
- = Sequence of frames (each frame has a set of joints)
- ST-GCN
- Input : joint coordinate vectors on graph nodes
- Multiple layers of ST-GCN → generating higher-level feature maps on graph
- Graph with joints as nodes & connectivities in both body structures and time as edges
- Output : classified action category by softmax classifier
- Training : E2E with backprop
3.2 Skeleton Graph Construction
- Previous work for skeleton based AR : concatenating coordinate vectors of all joints ⇒ a single feature vector per frame
- ST-GCN : undirected graph $G = (V, E)$ on a skeleton sequence with $N$ joints and $T$ frames ⇒ hierarchical representation
  - Node set : $V = \{v_{ti} \mid t = 1, ..., T,\ i = 1, ..., N\}$ --> all the joints in the sequence
  - Input feature vector on a node : $F(v_{ti})$ --> coordinate vector + estimation confidence
  - Edge set : $E = E_S \cup E_F$
    - Spatial edge : $E_S = \{v_{ti}v_{tj} \mid (i, j) \in H\}$, $H$ : set of connected joints
      = Intra-skeleton edge connecting joints within each frame (spatial connection)
    - Temporal edge : $E_F = \{v_{ti}v_{(t+1)i}\}$
      = Inter-frame edge connecting the same joint in consecutive frames (temporal connection)
- 2 Steps
  - 1st) Joints within one frame are connected with edges according to the connectivity of the body structure
  - 2nd) Each joint is connected to the same joint in the consecutive frame
- Advantages : No manual part assignment → model can work on datasets with different numbers of joints
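The two construction steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_skeleton_graph` and the 3-joint chain are hypothetical, and `bones` plays the role of the connected-joint set $H$.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]  # (frame index t, joint index i)
Edge = Tuple[Node, Node]


def build_skeleton_graph(
    num_frames: int, num_joints: int, bones: List[Tuple[int, int]]
) -> Tuple[List[Node], Set[Edge], Set[Edge]]:
    """Return (V, E_S, E_F) for a skeleton sequence."""
    # V: every joint of every frame is a node.
    nodes = [(t, i) for t in range(num_frames) for i in range(num_joints)]
    # E_S: intra-frame edges following the body structure (step 1).
    spatial = {((t, i), (t, j)) for t in range(num_frames) for (i, j) in bones}
    # E_F: inter-frame edges linking the same joint in consecutive frames (step 2).
    temporal = {
        ((t, i), (t + 1, i))
        for t in range(num_frames - 1)
        for i in range(num_joints)
    }
    return nodes, spatial, temporal


# Hypothetical 3-joint chain (0-1-2) observed over 4 frames.
V, E_S, E_F = build_skeleton_graph(num_frames=4, num_joints=3, bones=[(0, 1), (1, 2)])
print(len(V), len(E_S), len(E_F))
```

Because only `bones` encodes the body layout, the same function works for skeletons with any number of joints, which is the advantage noted in the list.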
3.3 Spatial Graph Convolutional Neural Network
1st Step (on a single frame at time τ)
- Input feature map $f_{in}$ with $c$ channels
- Output value at spatial location $x$ : $f_{out}(x)$
- Sampling function $p$ : $Z^2 \times Z^2 \to Z^2$
  - [Image domain] function that enumerates the pixels surrounding a given pixel
  - [Graph domain] function that enumerates the nodes within distance $D$ of a given node : $B(v_{ti}) \to V$
    - (In this paper) $D$ = 1 : 1-neighbor set of joint nodes (directly connected nodes)
- Weight function $w$ : $Z^2 \to R^c$ ~ independent of the input location $x$ → filter weights can be shared
  - [2D conv] rigid grid → pixels within a neighborhood have a fixed spatial order
    - $w$ can be implemented by indexing a tensor of (c, K, K) dim according to the spatial order
  - [Graph conv] no implicit arrangement → order is defined by a graph labeling
    - simplified by partitioning the neighbor set $B$ into a fixed number of $K$ subsets
    - $w$ can be implemented by indexing a tensor of (c, K) dim
Spatial Graph Convolution
- Normalizing term $Z$ : to balance the contributions of different subsets to the output
![image](https://private-user-images.githubusercontent.com/83633885/252206421-0f39d01a-c1cc-4ce7-80a7-35122e6eeb76.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTUzNjgyMDgsIm5iZiI6MTcxNTM2NzkwOCwicGF0aCI6Ii84MzYzMzg4NS8yNTIyMDY0MjEtMGYzOWQwMWEtYzFjYy00Y2U3LTgwYTctMzUxMjJlNmVlYjc2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTEwVDE5MDUwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTVmZWIzMmVlYWJhMzI3NWYzMTg1NTRiYWVkMzAxMDQyYmNiYWUzZTFhMDk2YmVhMDE1ZWJiZGU1ZmI0ZTA2ZDEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.UKsIqiFCGiet-tMMmEpIsNSONVwOewIUliBfMudHp7c)
![image](https://private-user-images.githubusercontent.com/83633885/252206587-4e24a2b0-7c03-4fab-8a2e-0ada806afa9d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTUzNjgyMDgsIm5iZiI6MTcxNTM2NzkwOCwicGF0aCI6Ii84MzYzMzg4NS8yNTIyMDY1ODctNGUyNGEyYjAtN2MwMy00ZmFiLThhMmUtMGFkYTgwNmFmYTlkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTEwVDE5MDUwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY3MWJjNTEyYzdiZDhhOTY4OGIyMmU4YTJjZGJkNzI4MGU2YzJlMjE2YWVlMjkxZTU4ZDA3ZGM2Nzg0N2UwYmEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.vSqj7gaSsNr2ZnORN_Z0XxVcVTkyzYuf4lPJ84kMXQ0)
2nd Step : Spatial "Temporal" Modeling
- Purpose : Extending the domain (Spatial graph ⇒ Spatial Temporal graph)
  - By adding temporally connected joints to the neighborhood (the same joint across consecutive frames)
- Neighbor set $B(v_{ti})$ of a joint node $v_{ti}$
  - $\Gamma$ : temporal kernel size, controlling the temporal range included in the neighbor graph
- Label map $l_{ST}$
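As a hedged illustration of the label map: as far as I recall the paper's formulation, the temporal neighborhood keeps frames $q$ with $|q - t| \le \lfloor \Gamma/2 \rfloor$, and the label is $l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + (q - t + \lfloor \Gamma/2 \rfloor) \times K$. The helper name `spatial_temporal_label` below is hypothetical.

```python
def spatial_temporal_label(spatial_label: int, q: int, t: int, gamma: int, K: int) -> int:
    """Combine a single-frame label in [0, K) with a temporal offset.

    Assumed formula: l_ST = l_ti + (q - t + floor(gamma / 2)) * K
    """
    half = gamma // 2
    if abs(q - t) > half:
        raise ValueError("frame q lies outside the temporal window around t")
    return spatial_label + (q - t + half) * K


# With gamma = 3 frames and K = 3 spatial subsets, every (frame offset, subset)
# pair receives its own label, so the kernel holds gamma * K weight vectors.
labels = sorted(
    spatial_temporal_label(s, q, t=5, gamma=3, K=3)
    for q in (4, 5, 6)
    for s in range(3)
)
print(labels)
```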
3.4 Partition Strategies
![image](https://private-user-images.githubusercontent.com/83633885/252219504-6c6e5ccc-534e-4d33-bb6b-9e6e7813a07d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTUzNjgyMDgsIm5iZiI6MTcxNTM2NzkwOCwicGF0aCI6Ii84MzYzMzg4NS8yNTIyMTk1MDQtNmM2ZTVjY2MtNTM0ZS00ZDMzLWJiNmItOWU2ZTc4MTNhMDdkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTEwVDE5MDUwOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg4ZWFhMjgyYTc0ZTI5ZDI2MzcxNWFiYjFkM2ExNGZiZTMzYWQzNDczNmFiOGUyMzI3MDA3MTVhOGQ3YTdhNzgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.g693lVCurDMkqr7FIyFBgvD5mGj-d5KdD6RF6qnSnAg)
- Methods to implement the label map $l_{ST}$ (= how neighbor nodes are divided into subsets)
- (a) Input skeleton frame
  - Red dashed circles : receptive fields of a filter with D=1
- (b) Uni-labeling partitioning : K=1, $l_{ti}(v_{tj}) = 0$
  - All neighbor nodes have the same label
  - Suboptimal b/c local differential properties could be lost
- (c) Distance partitioning : K=2, $l_{ti}(v_{tj}) = d(v_{tj}, v_{ti})$
  - Labeling according to each node's distance to the root node $v_{ti}$
  - Root node = 0, Other neighbor nodes = 1 (In this case, D=1)
- (d) Spatial configuration partitioning : K=3, $l_{ti}(v_{tj})$ = 0 or 1 or 2
  - Labeling according to each node's distance to the gravity center (black cross) compared with that of the root node (green)
  - Root node itself : $l_{ti} = 0$ if $r_j = r_i$
  - Centripetal group : $l_{ti} = 1$ if $r_j < r_i$
    - = neighbor nodes closer to the gravity center than the root node
  - Centrifugal group : $l_{ti} = 2$ if $r_j > r_i$
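The three strategies can be compared on a toy skeleton. The sketch below is illustrative only: `partition_labels`, the strategy names, and the 3-joint layout are assumptions, and the gravity center is taken as the mean coordinate of all joints in the frame.

```python
import numpy as np


def partition_labels(coords, neighbors, root, strategy):
    """Label the D=1 neighbor set of `root` (the root itself included)."""
    members = [root] + list(neighbors[root])
    if strategy == "uniform":        # (b) K = 1
        return {j: 0 for j in members}
    if strategy == "distance":       # (c) K = 2
        return {j: (0 if j == root else 1) for j in members}
    if strategy == "spatial":        # (d) K = 3
        center = coords.mean(axis=0)                   # gravity center of the frame
        r = np.linalg.norm(coords - center, axis=1)    # distance of each joint to it
        labels = {}
        for j in members:
            if j == root or np.isclose(r[j], r[root]):
                labels[j] = 0                          # root (or same distance)
            elif r[j] < r[root]:
                labels[j] = 1                          # centripetal
            else:
                labels[j] = 2                          # centrifugal
        return labels
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical chain 0-1-2 lying on a line.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
for name in ("uniform", "distance", "spatial"):
    print(name, partition_labels(coords, neighbors, root=1, strategy=name))
```

Joint 1 is the closest joint to the gravity center here, so both of its neighbors land in the centrifugal group under strategy (d).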
3.5 Learnable edge importance weighting
- Problem : one joint can appear in multiple body parts, but it should have a different importance in each
- Solution : Adding a learnable mask $M$ on every layer
  - Scales the contribution of a node's feature to its neighboring nodes based on the learned importance weight of each spatial graph edge
- Effect : improved recognition performance, possible to have data dependent attention map
3.6. Implementation ⇒ Code comparison
- Implementation Details
  - $A$ : Adjacency matrix representing intra-body connections on a single frame
  - $I$ : Identity matrix representing self-connections
- Network Architecture and Training
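A minimal NumPy sketch of how $A$, $I$ and the mask $M$ can fit together in the single-subset (uni-labeling) case, assuming the symmetric normalization $\Lambda^{-1/2}(A + I)\Lambda^{-1/2}$ with $\Lambda^{ii} = \sum_j (A^{ij} + I^{ij})$. The function names and toy shapes are hypothetical, and a real layer would learn `W` and `M` by backprop.

```python
import numpy as np


def normalized_adjacency(A):
    """Return Lambda^{-1/2} (A + I) Lambda^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])        # add self-connections
    deg = A_hat.sum(axis=1)               # diagonal of Lambda
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(f_in, A, W, M=None):
    """f_in: (N, C_in) joint features, W: (C_in, C_out), M: optional (N, N) edge mask."""
    A_norm = normalized_adjacency(A)
    if M is not None:
        A_norm = A_norm * M               # element-wise edge importance weighting
    return A_norm @ f_in @ W


# Hypothetical 3-joint chain 0-1-2 with 2 input channels and 1 output channel.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
f_in = np.ones((3, 2))
W = np.ones((2, 1))
out = graph_conv(f_in, A, W, M=np.ones((3, 3)))
print(out.shape)
```

With $K > 1$ subsets, $A + I$ would be split into one matrix per subset and the per-subset results summed; that extension is omitted here.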
4. Experiments
4.1 Dataset & Evaluation Metrics
4.2 Ablation Study
4.3 Comparison with SOTAs
Code Review
[CV_3D] PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Paper Review
1. Introduction
- PointNet : learning a spatial encoding of each point → (max-pooling) aggregating all point features into a global PC signature (no local features)
- PointNet++ : processing a set of points sampled in metric space in a hierarchical fashion
partitioning a set of points into overlapping local regions
→ extracting local features capturing fine geometric structures from small neighborhoods
→ grouping local features into larger unit and processing to produce higher level features
[ Two issues of the design of PointNet++ ]
1. How to generate overlapping partitioning of point set
- Each partition : a neighborhood ball in Euclidean space
- Centroid Location : selected by FPS (Farthest Point Sampling)
- Scale : combined Multiple scales for both robustness and detail capture (Random input dropout)
2. How to abstract sets of points or local features through a local feature learner (=PointNet)
- PointNet : processing an unordered set of points for semantic feature extraction & robust to input data corruption
- PointNet++ : applying PointNet recursively on a nested partitioning of input set
2. Problem Statement
- $X = (M, d)$ : discrete metric space whose metric is inherited from Euclidean space $R^n$
  - $M$ : set of points (density of $M$ is not uniform)
  - $d$ : distance metric
- $f$ : set functions = classification or segmentation function
  - Input : $X$ (along with additional features for each point)
  - Output : information of semantic interest regarding $X$
    - classification function : assigns a label to $X$
    - segmentation function : assigns a per-point label to each member of $M$
3. Method
3.1 Review of PointNet : A Universal Continuous Set Function Approximator
- Point Cloud : a set of sparse points => efficient, but operations must be permutation-invariant
- PointNet : single MAX pooling → extracts the global feature of the PC, but loses local context (segmentation performance ↓)
  - $f$ : permutation-invariant set function → can arbitrarily approximate any continuous set function
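The permutation invariance comes from the symmetric MAX pooling, i.e. the form $f(x_1, ..., x_n) = \gamma(\max_i h(x_i))$. The NumPy sketch below uses random, untrained weights for $h$ and $\gamma$ (both hypothetical single linear layers), only to show that shuffling the points leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(3, 16))   # hypothetical per-point layer h
W_g = rng.normal(size=(16, 4))   # hypothetical global layer gamma


def pointnet_like(points):
    """points: (n, 3) -> global descriptor of shape (4,)."""
    per_point = np.maximum(points @ W_h, 0.0)   # h applied to every point independently
    global_feat = per_point.max(axis=0)         # symmetric MAX pooling over the set
    return global_feat @ W_g                    # gamma on the pooled feature


cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(100)]
print(np.allclose(pointnet_like(cloud), pointnet_like(shuffled)))
```

The same pooling is also why local context is lost: every point contributes to one global vector, with no notion of neighborhoods.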
3.2 Hierarchical Point Set Feature Learning (Set Abstraction)
- PointNet++ : hierarchical grouping of points and progressively abstracting larger local regions
- Set Abstraction level (3 layers) : converts the input into a compressed PC that carries overall semantic information → extracts local features of the PC
  - Input : $N \times (d + C)$ matrix ..... $N$ points with $d$-dim coordinates + $C$-dim point features
  - Output : $N' \times (d + C')$ matrix ..... $N'$ subsampled points with $d$-dim coordinates + new $C'$-dim feature vectors
  - In the paper, $d$ = 3 → (x, y, z)
[ 3 layers ]
❶ Sampling layer
- Sampling layer : selects $N'$ centroids from the $N$ input points $\{x_1, x_2, ..., x_N\}$ (representative points that become the centers of local regions)
- Farthest Point Sampling (FPS)
  - Each new centroid = the point most distant, in metric (Euclidean) distance, from the centroids already selected
  - Better coverage of the entire point set than Random Sampling
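FPS can be written in a few lines of NumPy. This is a simple O(N·N') sketch for illustration, not the CUDA op used by the PointNet++ repository; the function name and the `start` argument are assumptions.

```python
import numpy as np


def farthest_point_sampling(points, n_samples, start=0):
    """Return indices of n_samples centroids chosen by iterative farthest point sampling."""
    selected = [start]
    # Distance from every point to its nearest already-selected centroid.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_dist))            # farthest from the current centroid set
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)


# Toy 1-D example: three clustered points and one outlier.
pts = np.array([[0.0], [1.0], [2.0], [10.0]])
print(farthest_point_sampling(pts, n_samples=3))
```

The outlier at 10 is picked second, which illustrates why FPS covers the whole point set better than random sampling for the same number of centroids.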
❷ Grouping layer
- Grouping layer : finds the neighbor points of each centroid → groups them into one local region point set
  - Input : a point set = $N \times (d + C)$ & coordinates of a set of centroids = $N' \times d$
  - Output : local groups of point sets = $N' \times K \times (d + C)$ ..... $K$ : # of neighbor points of each centroid
    - $K$ : flexible # (differs from group to group) → the PointNet layer extracts one fixed-length local region feature vector per group
- Metric distances to define neighbor points
  - KNN : the $K$ points nearest to the centroid (fixed number of neighbor points)
  - Ball query : the points within radius r of the centroid (fixed region scale) → more generalizable
  - In the paper, the Ball query method is used
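The difference between the two neighborhood definitions can be seen in a small NumPy sketch. Function names and the toy data are hypothetical; the repository's `query_ball_point` works on batches and returns a fixed `nsample` indices per centroid, which this simplified version does not reproduce.

```python
import numpy as np


def ball_query(points, centroid, radius, max_samples):
    """Indices of up to max_samples points within `radius` of the centroid."""
    dist = np.linalg.norm(points - centroid, axis=1)
    return np.flatnonzero(dist <= radius)[:max_samples]


def knn_query(points, centroid, k):
    """Indices of the k nearest points, however far away they are."""
    dist = np.linalg.norm(points - centroid, axis=1)
    return np.argsort(dist)[:k]


pts = np.array([[0.0], [0.5], [1.0], [5.0]])
dense_center = np.array([0.0])
sparse_center = np.array([5.0])

print(ball_query(pts, dense_center, radius=1.2, max_samples=8))   # variable group size
print(ball_query(pts, sparse_center, radius=1.2, max_samples=8))  # stays local
print(knn_query(pts, sparse_center, k=2))                         # reaches a far point
```

Around the isolated point, KNN has to pull in a neighbor 4 units away, while the ball query keeps a fixed region scale.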
```python
import tensorflow as tf  # TF 1.x API

# gather_point, farthest_point_sample, knn_point, query_ball_point and group_point
# are the custom TF ops shipped with the PointNet++ repository.
def sample_and_group(npoint, radius, nsample, xyz, points, knn=False, use_xyz=True):
    new_xyz = gather_point(xyz, farthest_point_sample(npoint, xyz)) # (batch_size, npoint, 3)
    if knn:
        _,idx = knn_point(nsample, xyz, new_xyz)
    else:
        idx, pts_cnt = query_ball_point(radius, nsample, xyz, new_xyz)
    grouped_xyz = group_point(xyz, idx) # (batch_size, npoint, nsample, 3)
    grouped_xyz -= tf.tile(tf.expand_dims(new_xyz, 2), [1,1,nsample,1]) # translation normalization
    if points is not None:
        grouped_points = group_point(points, idx) # (batch_size, npoint, nsample, channel)
        if use_xyz:
            new_points = tf.concat([grouped_xyz, grouped_points], axis=-1) # (batch_size, npoint, nsample, 3+channel)
        else:
            new_points = grouped_points
    else:
        new_points = grouped_xyz
    return new_xyz, new_points, idx, grouped_xyz
```
❸ PointNet layer
- PointNet layer : captures (encodes) the point pattern of each local region → extracts one local feature vector per region
  - Input : $N'$ local regions of points with data size $N' \times K \times (d + C)$
  - Output : $N' \times (d + C')$
- Mini-PointNet = basic building block for local pattern learning
```python
def pointnet_sa_module(xyz, points, npoint, radius, nsample, mlp, mlp2, group_all, is_training, bn_decay, scope, bn=True, pooling='max', knn=False, use_xyz=True, use_nchw=False):
    data_format = 'NCHW' if use_nchw else 'NHWC'
    with tf.variable_scope(scope) as sc:
        # Sample and Grouping
        if group_all:
            nsample = xyz.get_shape()[1].value
            new_xyz, new_points, idx, grouped_xyz = sample_and_group_all(xyz, points, use_xyz)
        else:
            new_xyz, new_points, idx, grouped_xyz = sample_and_group(npoint, radius, nsample, xyz, points, knn, use_xyz)
        # Point Feature Embedding
        if use_nchw: new_points = tf.transpose(new_points, [0,3,1,2])
        for i, num_out_channel in enumerate(mlp):
            new_points = tf_util.conv2d(new_points, num_out_channel, [1,1],
                                        padding='VALID', stride=[1,1],
                                        bn=bn, is_training=is_training,
                                        scope='conv%d'%(i), bn_decay=bn_decay,
                                        data_format=data_format)
        if use_nchw: new_points = tf.transpose(new_points, [0,2,3,1])
        # Pooling in Local Regions
        if pooling=='max':
            new_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
        elif pooling=='avg':
            new_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
        elif pooling=='weighted_avg':
            with tf.variable_scope('weighted_avg'):
                dists = tf.norm(grouped_xyz,axis=-1,ord=2,keep_dims=True)
                exp_dists = tf.exp(-dists * 5)
                weights = exp_dists/tf.reduce_sum(exp_dists,axis=2,keep_dims=True) # (batch_size, npoint, nsample, 1)
                new_points *= weights # (batch_size, npoint, nsample, mlp[-1])
                new_points = tf.reduce_sum(new_points, axis=2, keep_dims=True)
        elif pooling=='max_and_avg':
            max_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
            avg_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
            new_points = tf.concat([avg_points, max_points], axis=-1)
        new_points = tf.squeeze(new_points, [2]) # (batch_size, npoints, mlp2[-1])
        return new_xyz, new_points, idx
```
3.3 Robust Feature Learning under Non-Uniform Sampling Density
- Goal : resolving the difficulty of learning features from point sets with non-uniform density (sparse ~ dense)
- (1) Training on PCs sampled at various densities
- (2) Density Adaptive layer : extracting feature vectors from PCs at various scales and combining them
[ 2 Types of Density Adaptive layers ]
1. Multi-scale grouping (MSG)
- Applies grouping several times at different scales → generates point sets at several scales for each centroid
- Concatenates the feature vectors extracted from each point set into a multi-scale feature vector
- Each point set goes through random input dropout (down-sampling) → densities at various scales (various sparsity, varying uniformity)
- Drawback : a local PointNet has to be run for every centroid → computationally expensive, inefficient, time-consuming
2. Multi-resolution grouping (MRG) ★ (needs deeper understanding)
- Compensates for the drawback of MSG; the method used in PointNet++
- $L_i$ level features : 2 feature vectors from different scales are concatenated into a multi-scale feature vector
  - Left vector : feature that summarizes the features of each sub-region at the lower level $L_{i-1}$
  - Right vector : feature obtained by running a PointNet on all raw points in the local region of $L_i$
- Advantage : no feature extraction needed in large scale neighborhoods at the lowest levels → more efficient
3.4 Point Feature Propagation for Set Segmentation
- The Sampling layer of Set Abstraction shrinks the PC → the original size is restored for the segmentation task
- (1) Up-sampling : features for the points of the previous level ($N_{l-1}$ points) are obtained by Interpolation weighted by (1/distance)
- (2) Skip connection : concat the feature vectors from before down-sampling → supplements the information
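A NumPy sketch of the two propagation steps, assuming inverse-distance weights $w_i = 1 / d(x, x_i)^p$ over the `k` nearest coarse points. The defaults `k=3`, `p=2`, the `eps` guard, and the function names are assumptions for illustration, not values stated in these notes.

```python
import numpy as np


def interpolate_features(dense_xyz, sparse_xyz, sparse_feat, k=3, p=2, eps=1e-8):
    """Propagate features from N_l coarse points to N_{l-1} finer points."""
    # Pairwise distances: (N_dense, N_sparse)
    dist = np.linalg.norm(dense_xyz[:, None, :] - sparse_xyz[None, :, :], axis=-1)
    nn_idx = np.argsort(dist, axis=1)[:, :k]                 # k nearest coarse points
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)
    w = 1.0 / (nn_dist ** p + eps)                           # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)                     # normalize per dense point
    return (sparse_feat[nn_idx] * w[..., None]).sum(axis=1)  # (N_dense, C)


def propagate(dense_xyz, dense_feat, sparse_xyz, sparse_feat, k=3):
    """Up-sample, then concatenate the skip-connection features."""
    up = interpolate_features(dense_xyz, sparse_xyz, sparse_feat, k=k)
    return np.concatenate([up, dense_feat], axis=-1)


sparse_xyz = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
sparse_feat = np.array([[0.0], [10.0]])
dense_xyz = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
dense_feat = np.array([[1.0], [2.0]])
print(propagate(dense_xyz, dense_feat, sparse_xyz, sparse_feat, k=2))
```

In the full network the concatenated features would then pass through a shared per-point MLP, which is left out here.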
4. Experiments
4.1 Point Set Classification in Euclidean Metric Space
- MNIST (2D Object)
  - Input : 2D img coordinates are converted to a 2D PC of digit pixel locations (default 512 points)
  - Result : lower error rate than PointNet on the digit classification task, and better performance than CNN-based models as well
- ModelNet40 (3D rigid Object)
  - Input : surfaces of CAD model 3D meshes are sampled and converted to a 3D PC (default 1024 points)
  - Face normals are used as additional point features ($N$ = 5000) to boost performance
  - All points are normalized to be 0 mean and within a unit (r=1) ball
  - Model : 3-level hierarchical network + 3 FC layers
  - Result : better performance than MVCNN (SOTA model) on the 3D shape classification task
- Ablation study of Density adaptive layer : models trained with multiple scales (MSG, MRG) = robust to the # of points (or density)
```python
def get_model(point_cloud, is_training, bn_decay=None):
    """ Classification PointNet, input is BxNx3, output Bx40 """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    end_points = {}
    l0_xyz = point_cloud
    l0_points = None
    # Set abstraction layers
    l1_xyz, l1_points = pointnet_sa_module_msg(l0_xyz, l0_points, 512, [0.1,0.2,0.4], [16,32,128], [[32,32,64], [64,64,128], [64,96,128]], is_training, bn_decay, scope='layer1', use_nchw=True)
    l2_xyz, l2_points = pointnet_sa_module_msg(l1_xyz, l1_points, 128, [0.2,0.4,0.8], [32,64,128], [[64,64,128], [128,128,256], [128,128,256]], is_training, bn_decay, scope='layer2')
    l3_xyz, l3_points, _ = pointnet_sa_module(l2_xyz, l2_points, npoint=None, radius=None, nsample=None, mlp=[256,512,1024], mlp2=None, group_all=True, is_training=is_training, bn_decay=bn_decay, scope='layer3')
    # Fully connected layers
    net = tf.reshape(l3_points, [batch_size, -1])
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training, scope='fc1', bn_decay=bn_decay)
    net = tf_util.dropout(net, keep_prob=0.4, is_training=is_training, scope='dp1')
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training, scope='fc2', bn_decay=bn_decay)
    net = tf_util.dropout(net, keep_prob=0.4, is_training=is_training, scope='dp2')
    net = tf_util.fully_connected(net, 40, activation_fn=None, scope='fc3')
    return net, end_points
```
4.2 Point Set Segmentation for Semantic Scene Labeling
4.3 Point Set Classification in Non-Euclidean Metric Space
- SHREC15 (3D non-rigid Object)
- SHREC15 dataset : 2D surfaces embedded in 3D space
- Goal : To show generalizability of PointNet++ to non-Euclidean space
- Requirement : knowledge of 'intrinsic structure'
- [Fig.7] (a), (c) : different in pose -> same category
- Geodesic distances along the surfaces induce a metric space
Geodesic distance : the shortest path between the vertices in a graph
- PointNet++ : constructing metric space induced by geodesic distance → extracting intrinsic point features in WKS, HKS, multi-scale Gaussian curvature → using these features as input → sampling and grouping points
- Result : captures multi-scale intrinsic structure that is not influenced by a specific pose => effective, performance ↑
4.4 Feature Visualization
Conclusion
Future works
- To study how to accelerate the inference speed of the network, especially for MSG and MRG layers, by sharing more computation in each local region
- To find applications in higher dimensional metric spaces where CNN based methods would be computationally infeasible
[CV_CNN] Accelerating the Super-Resolution Convolutional Neural Network
Abstract
- SRCNN : high computational cost -> no real-time performance (24 fps) -> limited practical usage
- FSRCNN : accelerated, compact hourglass-shape SRCNN for faster and better SR
- 3 Main re-design aspects
- deconv layer at the end : learning mapping directly from original low-resol img to high-resol img (No interpolation)
- reformulation of mapping layer : shrinking input dim -> mapping -> expanding back
- smaller filter sizes but more(deeper) mapping layers
- Results : speed up of more than x40 with superior restoration quality
- Additional aspects
- Parameter settings for real-time performance on CPU
- Transfer strategy for fast training and testing across Different upscaling factors
1. Introduction
1) Previous SR algorithms
- Learning-based(or patch-based) methods
- SRCNN : faster than the methods above but still too slow (not real-time)
2) Inherent limitations of previous algorithms
(1) [Pre-processing step] Upsampling by Bicubic interpolation -> high computation complexity
- n^2 times computation cost for n upscaling factor
- Solution : learning directly from original LR img -> n^2 times faster
(2) [Costly non-linear mapping step] Input patches are projected on high-dim LR & HR feature space
- parameter # ↑ -> accuracy ↑ but also running time ↑
- Solution : shrinking network scale while keeping accuracy
3) FSRCNN : Solution for the above limitations
(1) Deconvolution layer to replace Bicubic interpolation
- deconv layer at the end of network -> computational complexity ~ spatial size of original LR img
- Better than interpolation kernel like in FCN / unpooling+conv
- deconv layer consists of diverse automatically learned upsampling kernels -> generate final HR img
(2) Adding shrinking/expanding layer at the beginning/end of mapping layer separately
- To restrict mapping in low-dim feature space
(3) Additional aspects
- Decomposing a single wide mapping layer into several layers with a fixed filter size of 3x3
- Overall shape : hourglass (symmetric : thick at both ends and thin in the middle)
4) Achievements
- Speed up of more than 40x (+ FSRCNN-s can run in real time with generic CPU)
- Different upscaling factors
- All conv layers except deconv can be shared
- Training : only fine-tune deconv layer for another upscaling factor (no loss of mapping acc)
- Testing : only do convolution operations once & upsampling img to different scales using corresponding deconv layer
5) Contributions
- Formulate a compact hourglass-shape CNN structure for fast img SR by deconv -> E2E mapping with no pre-processing
- Speed-up of more than 40x over SRCNN while keeping performance
- Transfer conv layers for fast training and testing across different upscaling factors with no loss of quality
2. Related Work
DL for SR
- After SRCNN was proposed for the SR task, many deeper structures followed
- SRCNN : directly learning E2E mapping bw LR and HR img
- Sparse-coding-based method : outperform SRCNN with small size model BUT hard to shrink with no loss of mapping acc
- All these networks : required pre-processing with bicubic-upscaling
- FSRCNN : only required a different deconv layer -> faster to upscale an img to different sizes
CNNs acceleration
- High-level vision (Object detection, Image classification, ..) : many studies have been carried out to speed up CNNs
- Approximating existing well-trained models
- Low-level vision (SR) : DL models for SR have no fully-connected layers, so the convolution filters are what matters
- FSRCNN : Reformulating previous model -> better performance
3. Fast Super-Resolution by CNN
3.1 SRCNN
- Aim : learning E2E mapping function F bw bicubic interpolated LR img Y and HR img X
- Network : All conv layers -> output size = input size
- Computation complexity
- ~ S_HR(size of HR img)
- middle layer : contributing most to params
- Cost function : MSE
3 main parts (steps)
- (1) Patch extraction and Representation
- extracting patches from input and representing each patch as a high-dim feature vector
- (2) Non-linear mapping
- mapping feature vectors non-linearly to another set of feature vectors (HR features)
- (3) Reconstruction
- aggregating features to form the final output img
3.2 FSRCNN
- Notations : Conv(f_i, n_i, c_i), DeConv(f_i, n_i, c_i), where f_i, n_i, c_i represent filter size, filter#, channel#
- Activation function : PReLU
- Aim : mainly to avoid the dead features caused by zero gradients in ReLU
- differs from ReLU in the coefficient of the negative part
- parameter a_i : fixed to 0 for ReLU <-> learnable for PReLU (full use of all params for max capacity of net)
- Overall structure : FSRCNN(d, s, m)
- Conv(5,d,1) - PReLU - Conv(1,s,d) - PReLU - m x Conv(3,s,s) - PReLU - Conv(1,d,s) - PReLU - DeConv(9,1,d)
- d : LR feature dimension, s : shrinking filters #, m : mapping depth governing performance and speed
- Shape : hourglass (symmetric : thick end and thin middle)
- Computational complexity : ~~~
- Cost function : MSE
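To make the `Conv(f_i, n_i, c_i)` notation concrete, the weight count of FSRCNN(d, s, m) can be tallied layer by layer. This is a hedged sketch: biases and PReLU coefficients are ignored, the helper names are hypothetical, and the setting (d=56, s=12, m=4) is used only as an example configuration.

```python
def conv_params(f: int, n: int, c: int) -> int:
    """Weights of Conv(f, n, c) or DeConv(f, n, c): n filters of size f x f x c."""
    return f * f * n * c


def fsrcnn_params(d: int, s: int, m: int) -> int:
    """Weights of Conv(5,d,1) - Conv(1,s,d) - m x Conv(3,s,s) - Conv(1,d,s) - DeConv(9,1,d)."""
    return (
        conv_params(5, d, 1)        # feature extraction
        + conv_params(1, s, d)      # shrinking
        + m * conv_params(3, s, s)  # non-linear mapping
        + conv_params(1, d, s)      # expanding
        + conv_params(9, 1, d)      # deconvolution
    )


print(fsrcnn_params(d=56, s=12, m=4))  # 12464
```

The mapping part costs only `m * 9 * s * s` weights because it operates in the shrunken s-dimensional space, which is the point of the hourglass shape.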
5 main parts (steps)
- (1) Feature extraction : Conv(5, d, 1)
- Similar to first part of SRCNN but Different on input img
- Feature extraction on original LR input img (Y_s) without interpolation
- SRCNN : Conv(9, n_1, 1) on upscaled img (Y) <-- most pixels in Y are interpolated from Y_s
- FSRCNN : Conv(5, d, 1) on original img (Y_s) <-- a 5x5 patch in Y_s covers almost all the information of a 9x9 patch in Y
- f_1 = 5 : smaller filter with little information loss
- n_1 = d : filter # <=> LR feature dimension # <<< 1st sensitive variable
- (2) Shrinking : Conv(1, s, d)
- SRCNN : Feature extraction → (No shrinking) → Mapping => maps high-dim LR features directly to the HR feature space
- LR feature dim d is usually very large -> high computation complexity
- FSRCNN : Feature extraction → Shrinking layer (1x1) → Mapping => reduce LR feature dim
- f_2 = 1 : 1x1 filter to perform like a linear combination
- n_2 = s << d : smaller filter number to reduce LR feature dim from d to s
- Result : greatly reduce params #
- (3) Non-linear Mapping : m x Conv(3, s, s)
- the most important part for SR performance
- the most influencing factors : width (filters # in a layer), depth (layers #)
- SRCNN : single 5x5 layer (5x5 better than 1x1 layer)
- FSRCNN : multiple 3x3 layers
- f_3 = 3 : 3x3 layers (trade-off bw performance and net scale)
- m : multiple layers to replace a single wide one <<< sensitive variable to determine mapping acc and complexity
- n_3 = s : all mapping layers contain same number of filters
- (4) Expanding : Conv(1, d, s)
- Inverse process of Shrinking layer Conv(1, s, d)
- Shrinking operation reduces # of LR feature dim
- BUT if the HR img is generated directly from these low-dim features, the final restoration quality is poor
- Expanding layer after mapping part to expand HR feature dim
- f_4 = 1 : 1x1 filters to maintain consistency with shrinking layer
- n_4 = d : filter # <=> LR feature dimension #
- (5) Deconvolution : DeConv(9, 1, d)
- Aim : upsampling and aggregating previous features
- Deconvolution (Transposed Convolution) : can be regarded as the inverse operation of Convolution
- Convolution : stride k → output size is 1/k of the input size
- Exchange the positions of input and output → output size is k times the input size
- Deconvolution : stride k = n (desired upscaling factor) → output is directly reconstructed HR img
- f_5 = 9 : filter size of deconv <=> consistent with filter size of conv (first layer) of SRCNN
- Reversed network = Downscaling operator (HR img → LR img)
- [Fig3] patterns of learned deconv filters are very similar to first layer filters in SRCNN
- Deconv layer learns Upsampling kernel for input feature maps (kernels are diverse and meaningful in [Fig3])
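The size relationship between a strided convolution and its transposed counterpart can be checked with the standard output-size formulas. The padding of 4 and the `output_padding` of `stride - 1` below are assumptions chosen so that a 9x9 deconv yields exactly `stride` times the input size; they are not values taken from these notes.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a strided convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out_size(size: int, kernel: int, stride: int, padding: int,
                    output_padding: int = 0) -> int:
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


lr_size, factor = 32, 3
hr_size = deconv_out_size(lr_size, kernel=9, stride=factor, padding=4,
                          output_padding=factor - 1)
print(hr_size)                                                   # 96 = 32 * 3
print(conv_out_size(hr_size, kernel=9, stride=factor, padding=4))  # back to 32
```

Only the stride (and the matching `output_padding`) changes with the upscaling factor, which fits the observation that the deconv layer is the only factor-specific part of the network.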
3.3 Differences against SRCNN : From SRCNN to FSRCNN
- Transform SRCNN to FSRCNN within three steps
- (1) The last Conv layer => DeConv layer
- The whole network will perform on original LR img & low computation complexity (~S_LR instead of S_HR)
- Enlarging network scale but speed-up
- Performance of Learned deconv kernels are better than a single bicubic kernel
- (2) Single mapping layer => Shrinking layer + 4 mapping layers + Expanding layer
- 5 more layers but params are decreased & acceleration is the most prominent
- Depth is key factor for performance
- (3) Smaller filter size, less filters + 4 'narrow' layers (deeper network) instead of a single 'wide' layer
- final speedup & training network efficiently
- Two birds with one stone!! Acceleration is NOT at the cost of performance degradation (FSRCNN outperforms SRCNN)
3.4 SR for Different Upscaling Factors
- Transfer conv layers for fast training and testing across Different Upscaling Factors with no loss of quality
- All conv layers except deconv can be shared (only the last deconv layer contains information of upscaling factor)
- FSRCNN : almost all conv filters are the same for different upscaling factors
- SRCNN and SCN : conv filters differ a lot for different upscaling factors
- Training : only fine-tune deconv layer for another upscaling factor (no loss of mapping acc)
- Testing : only do convolution operations once & upsampling img to different scales using corresponding deconv layer
4. Experiments
4.1 Implementation Details
4.2 Investigation of Different Settings
4.3 Towards Real-Time SR with FSRCNN
4.4 Experiments for Different Upscaling Factors
4.5 Comparison with SOTAs
5. Conclusion
[CV_STN] Spatial Transformer Networks
Spatial Transformer Networks
0. Abstract
- CNN : limited by lack of ability to be 'spatially invariant' to input data
- STN : Spatially Transform data within Network without extra training supervision
- can be inserted into conv architectures
- invariant to translation, scale, rotation, generic warping, etc
spatial invariance : recognizing an image as the same image even after it has been transformed
1. Introduction
- CNN Limitation : local max-pooling(2 x 2) help but intermediate feature maps still not invariant to global transformation of input
- Pooling layer (fixed and local receptive fields) : limited, pre-defined mechanism for dealing with variations
- Spatial Transformer module : dynamic mechanism -> appropriate transformation for each input data on entire feature map (non-locally)
- select most relevant regions (attention) & transform them to canonical pose
- can be trained with standard backprop -> end-to-end training
- STN (CNN + ST module) 3 Benefits
- image classification : crop, scale-normalization -> simplify subsequent classification task -> great performance
- co-localisation : localize different instances of the same but unknown class
- spatial attention (select most relevant region) : more flexible and trained with backprop, without reinforcement learning
2. Related Work (prior work)
- modeling transformation with NN
- Hinton : 2D affine transformation -> generative model training
- Tieleman : generative capsule models -> learn discriminative features for classification
- transformation-invariant representation
- Cohen & Welling : G-CNN
- Scattering networks, Filter banks
Filter Bank : an array of bandpass filters that separates the input signal into multiple components
- attention and detection mechanism for feature selection
- STN : invariant representation by manipulating the data ! (not the feature extractor)
3. Spatial Transformers
- Spatial Transformer = Localisation net + Grid generator + Sampler
- (1) The input feature map U is fed into the Localisation net, which outputs the transformation parameters θ
- (2) θ is fed into the Grid generator, which produces the Sampling grid T_θ(G) that specifies the sampling points
- (3) The Sampler takes the input feature map U and the Sampling grid T_θ(G) as inputs
- (4) Since the Sampling grid T_θ(G) marks the sampling points, applying it to U yields the output feature map V
3.1 Localisation Network
- Regress θ automatically to improve overall accuracy
- input : input feature map U (Width x Height x Channel)
- output : transform parameter θ = f_loc(U)
- θ : parameter matrix -> its shape depends on the transformation type (ex. affine : 6d)
- f_loc( ) can take any form (ex. fc net or conv net) BUT should include final Regression layer to produce θ
3.2 Parameterized Sampling Grid (Grid generator)
- Generate coordinate grid on input image corresponding to each pixel from output image
- Regular Grid : G = {Gi} of pixels Gi = (xi^t, yi^t) <- output feature map grid
- Grid generator : T_θ( ) can have any differentiable parameterized form
- ex 1) 2D affine transformation (crop, translation, rotation, scale, skew) matrix -> by 6 params
- ex 2) Attention (crop, translation, isotropic scaling) : more constrained(=low complexity) -> by 3 params
- ex 3) plane projective (8 params), piecewise affine, thin plate spline
- height and width normalized coordinates
- (xi^s, yi^s) : source coordinates in the input feature map U
- (xi^t, yi^t) : target coordinates in the output feature map V
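The grid generator described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function name `affine_grid` and the example values of `s`, `tx`, `ty` are assumptions, and coordinates are taken to be normalized to [-1, 1].

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map every target pixel (x_t, y_t) of the output grid to a source
    coordinate (x_s, y_s) in the input, using a 2x3 affine matrix theta.
    Coordinates are normalized to [-1, 1] along height and width."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w), indexing="ij")
    # Homogeneous target coordinates, shape (3, out_h * out_w)
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    source = theta @ target                    # shape (2, out_h * out_w)
    return source.reshape(2, out_h, out_w)     # [0] = x_s, [1] = y_s

# Identity transform: source grid equals target grid
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

# Attention-style transform (3 params): isotropic scale s plus translation
# (s, tx, ty are illustrative values, not taken from the paper)
s, tx, ty = 0.5, 0.25, 0.0
attention = np.array([[s, 0.0, tx],
                      [0.0, s, ty]])

grid_id = affine_grid(identity, 4, 4)
grid_att = affine_grid(attention, 4, 4)
```

With `attention`, the sampled x range shrinks to [-0.25, 0.75], i.e. the output looks at a shifted crop of the input.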
3.3 Differentiable Image Sampling (Sampler)
- Input : Apply set of sampling points T_θ(G) to input image U -> Define spatial location in the input
- Unm^c : input value at (n,m) in channel c
- Output : transformed output feature map V (Width x Height x Channel)
- Vi^c : output value for pixel i at (xi^t, yi^t) in channel c
- k( ) can take any sampling kernel as long as (sub-)gradients can be defined (ex. bilinear interpolation) for backprop
- Spatial Consistency : Sampling is done identically for each channel -> every channel is transformed in identical way
- i.e. different channels of the same input naturally share the same sampling
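As a rough sketch of the sampler with a bilinear kernel, the NumPy function below samples a feature map at normalized source coordinates. The name `bilinear_sample`, the (H, W, C) array layout, and the zero padding outside the image are assumptions made for this example.

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample input feature map U of shape (H, W, C) at normalized source
    coordinates x_s, y_s (each of shape (H_out, W_out), values in [-1, 1]).
    Returns V of shape (H_out, W_out, C); outside the image counts as zero."""
    H, W, C = U.shape
    # Normalized [-1, 1] -> pixel coordinates [0, W-1] and [0, H-1]
    x = (x_s + 1.0) * (W - 1) / 2.0
    y = (y_s + 1.0) * (H - 1) / 2.0
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    V = np.zeros(x.shape + (C,), dtype=float)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            # Bilinear kernel: max(0, 1 - |x - m|) * max(0, 1 - |y - n|)
            w = np.maximum(0, 1 - np.abs(x - xi)) * np.maximum(0, 1 - np.abs(y - yi))
            inside = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            xi_c, yi_c = np.clip(xi, 0, W - 1), np.clip(yi, 0, H - 1)
            # The same weights are used for every channel (spatial consistency)
            V += (w * inside)[..., None] * U[yi_c, xi_c]
    return V

U = np.arange(2 * 3 * 1, dtype=float).reshape(2, 3, 1)
ys, xs = np.meshgrid(np.linspace(-1, 1, 2), np.linspace(-1, 1, 3), indexing="ij")
V = bilinear_sample(U, xs, ys)                                    # identity grid
center = bilinear_sample(U, np.array([[0.0]]), np.array([[0.0]]))  # image centre
```

Because the kernel is piecewise linear in `x` and `y`, (sub-)gradients with respect to the sampling coordinates exist, which is what allows backprop through to θ.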
3.4 Spatial Transformer Networks
- Spatial Transformer module = Localisation net + Grid generator + Sampler
- STN = CNN + ST module (at any point, in any number)
- In evaluation, placing the ST layer right before the CNN input was generally the most effective
- Advantages
- Fast : very little time overhead when used naively & even speedups in attentive models
- Minimize overall cost during training -> little effect on speed
- Trained jointly with the other parameters of the model, so it has almost no effect on training speed
- How to transform each sample is compressed in weights of localisation net during training
- Possible to Downsample or Oversample feature map
- Possible to have Multiple spatial transformers in CNN
- At increasing depths of CNN -> more abstract representations
- For localisation networks -> more informative representations to base predicted params
- Parallel -> useful to focus on multiple objects or parts of interest individually
- Limitation : the number of parallel transformers limits the number of modeled objects
4. Experiments
4.1 Distorted MNIST
[ Train ]
- Transformation type : Rotation(R), Rotation-Translation-Scale(RTS), Projective(P), Elastic warping(E)
- Network type : FCN, CNN, ST-FCN, ST-CNN
- Sampling (bilinear) : affine(Aff), projective(Proj), thin plate spline(TPS)
- Identical condition : same # of params, same base structure, identical optim (backprop, SGD, scheduled lr decay, multinomial CE loss, three weight layers) / CNN includes 2 max-pooling
[ Result ]
- Network type : (percent error) ST-CNN < ST-FCN < CNN < FCN
- ST < non ST : ST enables network outperform
- CNN < FCN : Max-pooling (more spatial invariance) & Convolutional layer itself (better local structure model)
- CNN = ST-FCN for RTS : ST is alternative way for spatial invariance
- Sampling : TPS is the best (elastically deform digits, reduce complexity, not overfit on simple data)
- Transformation of inputs for all ST models : Standard upright posed digit = mean pose found in training data
- https://www.youtube.com/watch?v=Ywv0Xi2-14Y&t=94s
4.2 Street View House Numbers (SVHN)
[ Train ]
- Dataset : house numbers with between 1 and 5 digits in real world images (200K)
- Pre-processing : 64 x 64 crop + additional loosely 128 x 128 crop
- CNN : 11 hidden layers, 5 digit-independent softmax
- ST-CNN : Single (f = 4 layers CNN, following input of baseline CNN) / Multi (f = 2 layer FC, before each first 4 CNN) -> affine transform, bilinear sampler
- SGD, dropout, randomly initialized weights except for regression layers
[ Result ]
- Best Accuracy : ST-CNN Multi for 64 x 64 images (3.6% error)
- crop and rescale by focusing resolution and network capacity only on corresponding parts of digit
- Computation Speed : ST-CNN is only 6% slower than CNN
- ST-CNN requires only a single forward pass
4.3 Fine-Grained Classification
[ Train ]
- Dataset : CUB-200-2011 birds dataset (6K train, 5.8K test, 200 species)
- Baseline CNN : Inception + BN (pre-trained on ImageNet, fine-tuned on CUB)
- ST-CNN : 2 or 4 parallel spatial transformers
- 1 softmax layer, end-to-end backprop
[ Result ]
- Best Accuracy : 4 x ST-CNN (84.1%) -> outperform baseline CNN (82.3%)
- Pose detection (Attention) : head (red) + central part (green) without any additional supervision
- Same performance even if resolution is downsampled (448px input -> 224px output)
5. Conclusion
- can be dropped into a network, perform explicit spatial transformations
- can do end-to-end without any change in loss function
- gain accuracy across multiple task
- regressed transformation parameters are available as output
- Expectation : powerful in recurrent models, object reference frame, 3D transformation
Code
#### 1) Load Dataset ####
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np
plt.ion()
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# train dataset
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST(root='.', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])), batch_size=64, shuffle=True, num_workers=4)
# test dataset
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST(root='.', train=False, transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])), batch_size=64, shuffle=True, num_workers=4)
#### 2) Compose STN ####
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        # Localization-network
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5),
            nn.MaxPool2d(2, stride=2),
            nn.ReLU(True)
        )
        # Regressor that predicts the 3 * 2 affine matrix
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32),
            nn.ReLU(True),
            nn.Linear(32, 3 * 2)
        )
        # Initialize weights/bias so that the initial output is the identity transformation
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    # STN forward
    def stn(self, x):
        xs = self.localization(x)       # localisation net features
        xs = xs.view(-1, 10 * 3 * 3)
        theta = self.fc_loc(xs)         # regress theta
        theta = theta.view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size())   # grid generator
        x = F.grid_sample(x, grid)              # sampler (bilinear by default)
        return x

    def forward(self, x):
        x = self.stn(x)
        # general forward pass
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net().to(device)
#### 3) Train and Test ####
optimizer = optim.SGD(model.parameters(), lr=0.01)
def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()  # End-to-end training
        optimizer.step()
        if batch_idx % 500 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test():
    with torch.no_grad():
        model.eval()
        test_loss = 0
        correct = 0
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up the batch loss (reduction='sum' replaces the deprecated size_average=False)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # index of the max log-probability
            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
        test_loss /= len(test_loader.dataset)
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'
              .format(test_loss, correct, len(test_loader.dataset),
                      100. * correct / len(test_loader.dataset)))
#### 4) Visualization ####
def convert_image_np(inp):
    """Convert a Tensor to numpy image."""
    inp = inp.numpy().transpose((1, 2, 0))
    # Undo transforms.Normalize((0.1307,), (0.3081,)) applied to MNIST above
    mean = np.array([0.1307])
    std = np.array([0.3081])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    return inp

def visualize_stn():
    with torch.no_grad():
        # Take one batch of test data
        data = next(iter(test_loader))[0].to(device)
        input_tensor = data.cpu()
        transformed_input_tensor = model.stn(data).cpu()
        in_grid = convert_image_np(
            torchvision.utils.make_grid(input_tensor))
        out_grid = convert_image_np(
            torchvision.utils.make_grid(transformed_input_tensor))
        # Plot the inputs and the STN outputs side by side
        f, axarr = plt.subplots(1, 2)
        axarr[0].imshow(in_grid)
        axarr[0].set_title('Dataset Images')
        axarr[1].imshow(out_grid)
        axarr[1].set_title('Transformed Images')

for epoch in range(1, 20 + 1):
    train(epoch)
    test()
visualize_stn()
plt.ioff()
plt.show()
[CV_Localization] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Abstract
- visual explanations for decisions from CNN-> transparent, explainable
- use gradients flowing into final conv to localization map -> highlight important regions in image for predicting concept
- applicable to many CNN tasks without architectural changes or re-training
- Classification
- lend insight into failure modes by reasonable explanations
- outperform previous methods
- robust to adversarial perturbations
- more faithful to basic model
- help generalization by dataset bias (for fair and bias-free outcomes)
- Localization
- image captioning, VQA
- even non-attention based models
- Human study
- appropriate trust in prediction from deep networks
- discern stronger vs weaker model even when identical prediction
1. Introduction
- Transparent model to Explain why they predict what they predict
- AI evolution : (VQA) Identify failure / (classification) Establish appropriate trust / (chess) Teach human how to make better decisions
- Trade-off bw accuracy and interpretability (simplicity)
- Classical model : interpretability ↑, accuracy ↓
- Deep model : interpretability ↓, accuracy ↑ BY greater abstraction (layers↑) and integration (end-to-end training)
- CAM vs Grad-CAM
- CAM : constrained to model architecture (GAP -> fc)
- Grad-CAM : deep models without altering architecture (no trade-off) => Generalization of CAM
- Guided Grad-CAM : class-discriminative & high-resolution = good visual explanation
- CAM, Grad-Cam : class-discriminative (localize)
- Guided backprop, Deconv : high-resolution (detail)
2. Related Work
- Visualizing CNNs
- Assessing model trust
- Aligning gradient-based importance
- Weakly-supervised localization : training without bbox information
3. Grad-CAM
- last conv layer : high-level semantics (class-specific) & detailed spatial information
- gradient flowing into last conv -> assign importance values to each neuron for a particular decision of interest
① Class score (before softmax) : y^c (could be any differentiable activation)
② Gradients of y^c wrt feature map activations A^k via backprop : dy^c/dA^k
③ Global average pooling -> Importance weight of feature map k for target class c : a_k^c
④ Weighted combination of forward activation maps
⑤ Apply ReLU b/c only interested in features of positive influence
-> result : coarse heatmap of same size as conv feature maps
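Steps ② to ⑤ can be written compactly as follows, where $Z$ is the number of spatial locations in the feature map and $A^k_{ij}$ is the activation of feature map $k$ at location $(i, j)$ (index names are chosen here for illustration):

```latex
\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big( \sum_k \alpha_k^c \, A^k \Big)
```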
3.1 Grad-CAM generalizes CAM
(mathematical proof)
3.2 Guided Grad-CAM
- Grad-CAM : pixel-space detail ↓ -> unclear why network predicts particular instance
- Guided Backprop : suppress negative gradients and visualize gradients through ReLU -> capture pixel by neurons
- Guided Grad-CAM : combination by element-wise mul -> both high-resolution & class-discriminative + less noisy than deconv
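A minimal sketch of the element-wise fusion, assuming the Guided Backprop output and the Grad-CAM heatmap are already available as NumPy arrays. The function name is hypothetical, and nearest-neighbour upsampling is used only to keep the example dependency-free (the paper upsamples with bilinear interpolation).

```python
import numpy as np

def guided_grad_cam(guided_backprop, grad_cam):
    """guided_backprop: (H, W, 3) pixel-space gradients.
    grad_cam: (h, w) coarse heatmap, with H and W integer multiples of h and w."""
    H, W, _ = guided_backprop.shape
    h, w = grad_cam.shape
    # Nearest-neighbour upsampling of the coarse map to input resolution
    # (assumption for this sketch; bilinear interpolation is the usual choice)
    upsampled = np.kron(grad_cam, np.ones((H // h, W // w)))
    # Element-wise product, broadcast over the colour channels
    return guided_backprop * upsampled[..., None]

gb = np.ones((4, 4, 3))
cam = np.array([[0.0, 1.0],
                [2.0, 3.0]])
fused = guided_grad_cam(gb, cam)
```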
3.3 Counterfactual Explanations
- Which regions interfere most with the prediction?
- DL model : decides based on the foreground, not the background
4. Evaluating Localization Ability of Grad-CAM
4.1 Weakly-supervised Localization
- Weakly-supervised Localization : training without bbox information
- Given image -> Obtain class predictions -> Generate Grad-CAM maps for each predicted class -> Binarize pixels with thresh of 15% of max intensity -> Draw bbox around single largest segment
- Grad-CAM localization error < others
- No change to model structure and no re-training -> No compromise on classification performance!
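The binarize-then-box procedure above might look like the sketch below. The function name and the use of 4-connectivity to define a "segment" are assumptions made for this example; the heatmap is a toy array.

```python
import numpy as np
from collections import deque

def bbox_from_heatmap(heatmap, rel_thresh=0.15):
    """Binarize at rel_thresh * max and return the bounding box
    (x_min, y_min, x_max, y_max) of the largest connected segment."""
    mask = heatmap > rel_thresh * heatmap.max()
    H, W = mask.shape
    seen = np.zeros((H, W), dtype=bool)
    best = []
    for r in range(H):
        for c in range(W):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first search over 4-connected neighbours (assumption)
            comp, queue = [], deque([(r, c)])
            seen[r, c] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    if not best:            # empty heatmap: nothing to localize
        return None
    ys, xs = zip(*best)
    return min(xs), min(ys), max(xs), max(ys)

heat = np.zeros((6, 6))
heat[1:4, 2:5] = 1.0   # large blob
heat[5, 0] = 0.5       # small isolated response
box = bbox_from_heatmap(heat)
```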
4.2 Weakly-supervised Segmentation
- Semantic Segmentation : assign each pixel in image an object class -> expensive pixel-level annotation
- Weakly-supervised Segmentation : segment object with image-level annotation -> cheap and easy to get data
- SEC with CAM : sensitive to choice of weak localization seed -> SEC with Grad-CAM : (IoU : 44.6 -> 49.6)
4.3 Pointing Game
- Why : To evaluate discriminativeness of visualization method for localizing objects
- How : Extract maximally activated point on generated heatmap -> compare with target label -> Count # of Hit or Miss
Acc = Hit # / (Hit # + Miss #) ... only measure Precision
- For Recall, compute localization maps for top-5 class predictions -> evaluate them with additional option
option : reject predictions below a threshold (absent from GT)
- Result : Grad-CAM > c-MWP (70.58% > 60.30%)
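The hit/miss counting of the pointing game can be sketched as follows, assuming each ground-truth annotation is given as a boolean mask of the target object (the function name and the toy data are illustrative):

```python
import numpy as np

def pointing_game_accuracy(heatmaps, gt_masks):
    """Acc = #Hit / (#Hit + #Miss), where a hit means the maximally
    activated point of the heatmap falls inside the ground-truth mask."""
    hits = misses = 0
    for heatmap, mask in zip(heatmaps, gt_masks):
        y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        if mask[y, x]:
            hits += 1
        else:
            misses += 1
    return hits / (hits + misses)

mask = np.zeros((4, 4), dtype=bool)
mask[0:2, 0:2] = True                       # object occupies the top-left corner
hit_map = np.zeros((4, 4)); hit_map[1, 1] = 1.0    # peak inside the object
miss_map = np.zeros((4, 4)); miss_map[3, 3] = 1.0  # peak outside the object
acc = pointing_game_accuracy([hit_map, miss_map], [mask, mask])
```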
5. Evaluating Visualizations
- interpretability vs. faithfulness tradeoff
5.1 Class Discrimination
- Dataset : PASCAL VOC 2007 - 2 annotated categories
- CNN model : VGG-16, AlexNet
- Method(Human Acc) : Deconv(53.33%), Guided backprop(44.44%), Deconv Grad-CAM(60.37%), Guided Grad-CAM(61.23%)
5.2 Trust
- CNN model : VGG-16, AlexNet <- both models making same prediction as GT
- Method: Guided backprop, Guided Grad-CAM
- Evaluation : rating reliability of models relative to each other
- Result : Guided backprop (VGG-16 : 1.00), Guided Grad-CAM (VGG-16 : 1.27) => VGG is more reliable than AlexNet
- Grad-CAM helps users place trust in the model that generalizes better, based only on individual prediction explanations
5.3 Faithfulness vs Interpretability
- Trade-off : More faithful, Less interpretable and vice versa
- Grad-CAM are reasonably interpretable, so evaluate how faithful!
- Faithfulness : ability to accurately explain function
- Reference explanation with high local-faithfulness : correlation with Image occlusion maps
- Result : Grad-CAM is more faithful to the original model than the other methods
- Grad-CAM is more Faithful and more Interpretable
6. Diagnosing image classification CNNs with Grad-CAM
- VGG-16 pretrained on imagenet
6.1 Analyzing failure modes for VGG-16
- Some failures are due to ambiguities inherent in ImageNet classification
- Guided Grad-CAM has reasonable explanations for failure predictions
6.2 Effect of adversarial noise on VGG-16
- Dataset : adversarial images for ImageNet-pretrained VGG-16
- Result : Despite network being certain about absence of each category, correctly localize! -> fairly robust to adversarial noise
6.3 Identifying bias in dataset
- Task : binary classification of 'doctor' vs 'nurse'
- Biased model : misclassifying by gender stereotype (face / hairstyle) => good validation acc, but not good for generalization
- Reduced biased model : generalization better (82% → 90%)
- Insight: Grad-CAM can help detect and reduce bias in training datasets -> better generalization, fair and ethical outcome
7. Textual Explanations with Grad-CAM
- obtain neuron names for last conv layer -> sort and obtain top-5 and bottom-5 neurons -> use for text explanations
- higher positive values of neuron importance => presence of concept increases in class score
- important concepts are indicative of predicted class even for misclassification
8. Grad-CAM for Image Captioning and VQA
- vision & language tasks
8.1 Image Captioning
- finetuned VGG-16 for images, LSTM-based language model (no explicit attention mechanism)
- compute gradient of log probability wrt units in last conv layer -> generate Grad-CAM visualizations
- FCLN produces bbox for RoI & LSTM-based model generates associated captions
- DenseCap generates 5 captions per image with GT bbox
- Then, Guided Grad-CAM localizes regions without trained with bbox annotations
8.2 Visual Question Answering
- CNN for processing images & RNN language model for questions
- image and question are fused to predict answer
- Result : Grad-CAM via correlation with occlusion maps : 0.60 ± 0.038 -> high faithfulness
9. Conclusion
- Grad-CAM (Gradient-weighted Class Activation Mapping) : class-discriminative localization technique for making any CNN model more transparent by visual explanations
- Guided Grad-CAM : Both high resolution + class-discriminative -> interpretability + faithfulness
- AI should be able to reason about its belief and actions for human to trust and use it!
Code Review
# Imports this function relies on. K.gradients / K.function in this form
# require graph-mode (TF1-style) Keras.
import numpy as np
from keras import backend as K

def generate_gradcam(img_tensor, model, class_index, activation_layer):
    model_input = model.input
    # y_c : score for class_index, intended to be the input to the final layer op
    # (softmax, linear, ...); note that model.output is used here
    y_c = model.output[0, class_index]
    # A_k : output feature map of the activation conv layer
    A_k = model.get_layer(activation_layer).output
    # For the model input, compute
    # the output of the activation conv layer (A_k) and
    # the gradient of y_c with respect to A_k
    get_output = K.function([model_input], [A_k, K.gradients(y_c, A_k)[0]])
    [conv_output, grad_val] = get_output([img_tensor])
    # The batch dimension is included, so the shape is (1, width, height, k);
    # reshape to (width, height, k)
    # Here width and height are those of the A_k feature map of the activation conv layer
    conv_output = conv_output[0]
    grad_val = grad_val[0]
    # Global average pooling
    # Average the gradient over width and height (1/Z) to get the weights (a^c_k)
    weights = np.mean(grad_val, axis=(0, 1))
    # Weighted combination of the conv feature maps (conv_output)
    # with the weights (a^c_k) of class_index, matched per channel k
    # Initialize with the (width, height) of the feature map (conv_output)
    grad_cam = np.zeros(dtype=np.float32, shape=conv_output.shape[0:2])
    for k, w in enumerate(weights):
        grad_cam += w * conv_output[:, :, k]
    # Apply ReLU to the weighted combination
    grad_cam = np.maximum(grad_cam, 0)
    return grad_cam, weights
import tensorflow as tf

def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
    # First, we create a model that maps the input image to the activations
    # of the last conv layer as well as the output predictions
    grad_model = tf.keras.models.Model(
        [model.inputs], [model.get_layer(last_conv_layer_name).output, model.output]
    )
    # Then, we compute the gradient of the top predicted class for our input image
    # with respect to the activations of the last conv layer
    with tf.GradientTape() as tape:
        last_conv_layer_output, preds = grad_model(img_array)
        if pred_index is None:
            pred_index = tf.argmax(preds[0])
        class_channel = preds[:, pred_index]
    # This is the gradient of the output neuron (top predicted or chosen)
    # with regard to the output feature map of the last conv layer
    grads = tape.gradient(class_channel, last_conv_layer_output)
    # This is a vector where each entry is the mean intensity of the gradient
    # over a specific feature map channel
    pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
    # We multiply each channel in the feature map array
    # by "how important this channel is" with regard to the top predicted class
    # then sum all the channels to obtain the heatmap class activation
    last_conv_layer_output = last_conv_layer_output[0]
    heatmap = last_conv_layer_output @ pooled_grads[..., tf.newaxis]
    heatmap = tf.squeeze(heatmap)
    # For visualization purpose, we will also normalize the heatmap between 0 & 1
    heatmap = tf.maximum(heatmap, 0) / tf.math.reduce_max(heatmap)
    return heatmap.numpy()
[CV_Localization] Learning Deep Features for Discriminative Localization
Abstract
- Global average pooling (GAP) : (previously) Regularizing training -> (CAM) generic localizable deep representation
1. Introduction
- CNN : classification, object detection Good but FC layer (flatten)-> ability to localize objects is lost
- FCN (NIN), GoogLeNet : GAP as regularizer -> minimize # of params + maintain high performance
- CAM : GAP for remarkable localization ability until final layer (deep features)
1.1 Related Work
localizing objects + identifying which regions of image are being used for discrimination
(1) Weakly-supervised object localization
- Previous works : self-taught, multiple-instance learning, transferring mid-level image, multiple overlapping patches
-> No end-to-end training & Multiple forward pass -> difficult to scale to real-world datasets
- GMP (Global Max Pooling) : localization limited to a point lying on the boundary of the object rather than its full extent
- CAM : End-to-end training & Single forward pass & GAP (full extent, all discriminative regions)
(2) Visualizing CNNs
- Previous works : Deconvnet (patterns activate each unit) -> Incomplete (only analyzing conv layers, ignoring fc layers)
- CAM : Removing fc layers -> able to understand whole network (end-to-end)
- Previous works : Inverting deep features at different layers (inverting fc layers)-> But No highlight relative importance
- CAM : Highlight which regions are important for discrimination
2. Class Activation Mapping
- Class Activation Map for each particular category indicates discriminative regions to identify category
- Class Activation Mapping : CNN -> GAP on last conv layer (feature maps) -> fc layer -> Softmax final output
GAP : spatial average of feature map at last conv layer -> one value for each channel, each with one fc weight per class (total : N weights for N channels)
CAM : Weighted sum of the N feature maps with the N weights -> one heat map for each class
- Result : Projecting back weights of output on conv feature maps -> can identify importance of image regions
- f_k(x,y) : activation map (feature map) of unit k in last conv layer at spatial location (x,y)
- F_k : result of GAP on f_k(x,y) (sum over all spatial locations)
- S_c : input to softmax for class c
- w_k^c : weight for class c -> importance of F_k for class c
- M_c(x,y) : CAM for class c -> importance of activation at (x,y) leading to classification of image to class c
CAM = weighted linear sum of visual patterns at different spatial locations -> Upsampling CAM to size of input !
- P_c : output of softmax for class c
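Using the symbols defined above, the relations between the class score, the class activation map and the softmax output can be written as:

```latex
F_k = \sum_{x,y} f_k(x,y), \qquad
S_c = \sum_k w_k^c F_k = \sum_{x,y} \sum_k w_k^c f_k(x,y)

M_c(x,y) = \sum_k w_k^c f_k(x,y), \qquad
S_c = \sum_{x,y} M_c(x,y), \qquad
P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}
```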
Global average pooling (GAP) vs global max pooling (GMP)
- GAP : consider all discriminative parts of an object -> identify extent of object
- GMP : consider only highest parts of an object
- Classification performance : similar / Localization performance : GAP > GMP
3. Weakly-supervised Object Localization
3.1 Setup
- Dataset : ILSVRC 2014
- CNN models : AlexNet, VGGnet, GoogLeNet (remove fc layers -> replace them with GAP)
- Localization ability improved when last conv layer before GAP = high spatial resolution (mapping resolution)
- So, remove some layers -> add new layers (3 x 3, stride 1, pad 1 with 1024 units) followed by GAP
- Networks were fine-tuned on 1.3M training images of ILSVRC
3.2 Results
(1) Classification
- GAP : Only small performance drop (1-2%) without fc layers -> Acceptable
(2) Localization
- bbox selection strategy : Simple thresholding technique (segment regions above 20% of the CAM max value -> bbox covering the largest connected component)
- [Table 2] GAP : not trained on a single annotated bbox but outperforms others (NIN, Backprop)
- [Table 3] Weakly vs Fully-supervised methods
- bbox selection strategy (heuristics) : 2 bbox (one tight and one loose) from 1st and 2nd predicted classes + 1 loose bbox for top 3rd predicted class
- weakly-supervised GoogLeNet-GAP (heuristics) ~= fully-supervised AlexNet
- Same model -> still long way...
4. Deep Features for Generic Localization
- Response from higher-level layers of CNN : effective generic features with SOTA on many image datasets
- Response from GAP CNN : also perform well as generic features + highlight discriminative regions (without training)
- GoogLeNet-GAP, GoogLeNet > AlexNet
- GoogLeNet-GAP ~= GoogLeNet
4.1 Fine-grained Recognition
- Dataset : CUB-200-2011 (200 bird species)
- Methods : GoogLeNet-GAP on full image < crop < bbox
4.2 Pattern Discovery
- To identify common elements or patterns such as text or high-level concepts
(1) Discovering informative objects in the scenes
- Dataset : 10 scene categories from SUN dataset
- top 6 objects that most frequently overlap with high activation regions for two scene
(2) Concept localization in weakly labeled images
- concept detector : localize informative regions for concepts, even phrases are more abstract than object names
(3) Weakly supervised text detector
- Dataset : 350 Google StreetView images containing text from SVT dataset
- highlight text without using bbox annotations
(4) Interpreting visual question answering (VQA)
- overall acc : 55.89%
- highlight image regions relevant to predicted answers
5. Visualizing Class-Specific Units
- Using GAP and the ranked softmax weight
- CAM : Visualize most discriminative units (Class-Specific Units) for a given class
- Combination of Class-Specific Units guides CNN -> we can infer CNN actually learn!
6. Conclusion
- CAM enables classification-trained CNNs with GAP to perform object localization without bbox annotations
- CAM visualizes predicted class scores & highlights discriminative object parts
- CAM generalizes to other visual recognition tasks
Code
# Imports this function relies on (K.function in this form assumes graph-mode Keras)
import numpy as np
from keras import backend as K

def generate_cam(img_tensor, model, class_index, last_conv):
    model_input = model.input
    model_output = model.layers[-1].output
    # f_k(x, y) : output feature map of the last conv layer
    f_k = model.get_layer(last_conv).output
    get_output = K.function([model_input], [f_k])
    [last_conv_output] = get_output([img_tensor])
    # The batch dimension is included, so the shape is (1, width, height, k); reshape to (width, height, k)
    last_conv_output = last_conv_output[0]
    # class_weight_k (w^c_k) : column for class_index in the weight matrix between the GAP layer and the softmax (dense) layer
    # ex) w^2_1, w^2_2, w^2_3, ..., w^2_k
    class_weight_k = model.layers[-1].get_weights()[0][:, class_index]
    # Initialize with the (width, height) of the feature map (last_conv_output)
    cam = np.zeros(dtype=np.float32, shape=last_conv_output.shape[0:2])
    # Weighted sum of the last conv feature maps (last_conv_output) with class_weight_k (w^c_k)
    for k, w in enumerate(class_weight_k):
        cam += w * last_conv_output[:, :, k]
    return cam
[CV_3D] PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows
PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows
Introduction
Major roadblock in generating pc : complexity of space of point clouds
Meaning of words
- Distribution = Invertible parameterized transformation of 3D points from prior distribution (ex. Gaussian)
- Shape = Variable that parametrizes transformation
- Category = distribution of this variable
PointFlow : Point cloud Generative model by learning distribution of distributions
- Two-level hierarchy of distributions : distribution of shapes & distribution of points given a shape
- Sampling points from prior Gaussian
- → Moving them according to parameterized transformation to new location in target shape
- Parameterization : Continuous Normalizing Flows to model transformation
- Invertibility → Sampling and Estimating probability density → Training models using variational inference
- (maximize a variational lower bound on log-likelihood of training point clouds set)
- Results : SOTA performance in point cloud generation & pc reconstruction, unsupervised feature learning
Related work
Deep learning for PC
- PC discriminative tasks : classification, segmentation, critical point sampling, auto-encoding, single-view 3D reconstruction, stereo reconstruction, point cloud completion, ...
- AE : training with heuristic loss functions that measure distance bw two point sets (ex. CD, EMD)
- CD : incorrect point clouds
- EMD : slow to compute (approximation → biased or noisy gradients)
- Problems of Previous models : Fixed number of points, Heuristic loss function
- Drawbacks of treating pc as a fixed-dimensional matrix
- Model is restricted to generate a fixed number of points
- No Permutation invariance of point sets
- Drawbacks of using heuristic loss function
- Lack of probabilistic guarantee
- Only learning distribution of points for each shape (Not distribution of shapes)
- Ex. Sophisticated decoders : overcoming fixed number of points BUT still relying heuristic set distances
- PointFlow : training E2E by maximizing variational lower bound on log-likelihood
Generative models
- Generative models : GAN, VAE, Auto-regressive models, Flow-based models
- Most deep generative models : learning distribution of fixed-dimensional variables
- PointFlow : learning distribution of sets and generating new sets by using tighter lower bound on log-likelihood
- with normalizing flow in modeling both reconstruction likelihood and prior
Overview
- Goal : To learn distribution of shapes(=distributions of points)
= To sample shapes and an arbitrary # of points from a shape
- Continuous Normalizing Flow (CNF) = A vector field in 3D Euclidean space
- To model distribution of points by transforming a generic prior
(sample points from prior → move them according to vector field)
- Invertible → move data points back to prior → compute exact likelihood
- parametrizing each continuous NF with a latent variable that represents shape
⇔ modeling distribution of shapes = modeling distribution of latent variable
- Optimization : using variational lower bound on log-likelihood by inference network
- Invertibility makes likelihood computation possible → Training model E2E in stable manner !
Model
Three Modules
- $Q_Φ (z|X)$ : (permutation-invariant) Encoder to encode a point cloud into a shape representation $z$
- $P_ψ (z)$ : (CNF) Prior over shape representation $z$
- $P_θ (X|z)$ : (CNF) Decoder to model distribution of points given shape representation $z$
Flow-based point generation from shape representations
- $log P_θ (X|z)$ : Reconstruction log-likelihood of a point set $X$ = Sum of log-likelihood of each point $x$
- $x$ : result of transforming some point $y(t_0)$ in prior distribution $P(y) = N(0,1)$ using CNF
- $g_θ$ : continuous-time dynamics of flow $G_θ$ conditioned on $z$ → the inverse $G_θ^{-1} (x;z)$ is available
- $log P_θ (x|z)$ : log-likelihood of each point by using conditional extension of CNF
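The quantities above fit together as follows. This is written in the standard continuous normalizing flow form with the symbols from these notes; the integration limits $t_0$, $t_1$ follow that convention.

```latex
x = G_\theta\big(y(t_0); z\big) = y(t_0) + \int_{t_0}^{t_1} g_\theta\big(y(t), t, z\big)\, dt,
\qquad y(t_0) \sim P(y)

\log P_\theta(x \mid z) = \log P\big(G_\theta^{-1}(x; z)\big)
  - \int_{t_0}^{t_1} \mathrm{Tr}\!\left(\frac{\partial g_\theta}{\partial y(t)}\right) dt

\log P_\theta(X \mid z) = \sum_{x \in X} \log P_\theta(x \mid z)
```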
Flow-based prior over shape
Learnable Prior
- Motivation : A simple Gaussian could be used as the prior, but a learnable prior is proposed to mitigate the resulting drop in VAE performance
- How : using another CNF to parametrize a learnable prior
- KL divergence term in ELBO function
- $P_ψ(z)$ : prior distribution with learnable parameters $ψ$
- $H$ : entropy
- $z$ : result of transforming some point $w(t_0)$ in simple Gaussian $P(w) = N(0,1)$ using CNF
- $f_ψ$ : continuous-time dynamics of flow $F_ψ$ → the inverse $F_ψ^{-1} (z)$ is available
- $log P_ψ (z)$ : log probability of prior distribution
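With the symbols defined above, the KL term and the learnable prior can be written as below, following the same continuous normalizing flow form as the decoder:

```latex
D_{KL}\big(Q_\Phi(z \mid X) \,\|\, P_\psi(z)\big)
  = -\,\mathbb{E}_{Q_\Phi(z \mid X)}\big[\log P_\psi(z)\big] - H\big[Q_\Phi(z \mid X)\big]

z = F_\psi\big(w(t_0)\big) = w(t_0) + \int_{t_0}^{t_1} f_\psi\big(w(t), t\big)\, dt,
\qquad w(t_0) \sim P(w)

\log P_\psi(z) = \log P\big(F_\psi^{-1}(z)\big)
  - \int_{t_0}^{t_1} \mathrm{Tr}\!\left(\frac{\partial f_\psi}{\partial w(t)}\right) dt
```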
[Training] Final training objective
- Objective function
- Training encoder and decoder jointly to maximize a lower bound on log-likelihood
- Training whole network E2E by maximizing ELBO of all point sets in dataset
- Objective function = ① + ② + ③
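Written out, the ELBO has three terms: a prior term, a reconstruction term and an entropy term. That these correspond to ①, ②, ③ in this order is an assumption, since the numbered equation is not reproduced in these notes.

```latex
\mathcal{L}(X; \Phi, \psi, \theta)
  = \underbrace{\mathbb{E}_{Q_\Phi(z \mid X)}\big[\log P_\psi(z)\big]}_{\text{prior}}
  + \underbrace{\mathbb{E}_{Q_\Phi(z \mid X)}\big[\log P_\theta(X \mid z)\big]}_{\text{reconstruction}}
  + \underbrace{H\big[Q_\Phi(z \mid X)\big]}_{\text{entropy}}
```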
[Test] Sampling
- (1) Sampling a shape representation $\widetilde{z}$ through $F_ψ$
- (2) Generating a point given $\widetilde{z}$
- How : Sampling a point $\widetilde{y}$ from $N(0,1)$ → Passing $\widetilde{y}$ through $G_θ$ conditioned on $\widetilde{z}$
- Result : a point $\widetilde{x}$ = $G_θ(\widetilde{y};\widetilde{z})$
- Sampling a point cloud with size $\widetilde{M}$ by repeating (2) for $\widetilde{M}$ times
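The two-level sampling procedure can be illustrated with toy stand-ins for the flows. The functions `F_psi` and `G_theta` below are hypothetical invertible affine maps, not real CNFs, and `Z_DIM` is an arbitrary latent size; only the control flow (sample a shape, then sample any number of points for it) mirrors the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM = 8  # hypothetical latent size, chosen for illustration

def F_psi(w):
    """Toy stand-in for the prior flow F_psi (a real model integrates an ODE)."""
    return 2.0 * w + 1.0

def G_theta(y, z):
    """Toy stand-in for the decoder flow G_theta conditioned on z:
    a per-axis scale and shift derived from z (positive scale, so invertible)."""
    scale = 1.0 + 0.5 * np.tanh(z[:3])
    shift = z[3:6]
    return y * scale + shift

def sample_point_cloud(num_points):
    w = rng.standard_normal(Z_DIM)            # w ~ N(0, I)
    z = F_psi(w)                              # (1) shape representation
    y = rng.standard_normal((num_points, 3))  # (2) points from the Gaussian prior
    return G_theta(y, z)                      # move points onto the shape

cloud_small = sample_point_cloud(16)
cloud_large = sample_point_cloud(2048)
```

Because points are sampled independently given the shape representation, the number of points is a free choice at sampling time rather than fixed by the architecture.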
Experiments
Eval metrics
- Previous metrics to measure similarity bw point clouds (not used during training PointFlow)
- Ex. Chamfer distance (CD), Earth mover's distance (EMD)
- $X, Y$ : point clouds with the same # of points / $Φ$ : bijection bw $X, Y$
- Jensen-Shannon Divergence (JSD)
- Coverage (COV)
- Minimum matching distance (MMD)
- 1-nearest neighbor accuracy (1-NNA)
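A small NumPy sketch of three of these metrics, using the common definitions: CD as the sum of squared nearest-neighbour distances in both directions, COV as the fraction of reference clouds that are the nearest neighbour of at least one generated cloud, and MMD as the mean distance from each reference cloud to its closest generated cloud. Function names and the toy data are illustrative.

```python
import numpy as np

def chamfer_distance(X, Y):
    """X: (N, 3), Y: (M, 3). Sum of squared nearest-neighbour distances
    from X to Y and from Y to X."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def coverage_and_mmd(generated, reference, dist=chamfer_distance):
    """generated, reference: lists of point clouds."""
    D = np.array([[dist(g, r) for r in reference] for g in generated])
    # COV: fraction of reference clouds matched by some generated cloud
    matched = np.unique(D.argmin(axis=1))
    cov = len(matched) / len(reference)
    # MMD: for each reference cloud, distance to its closest generated cloud
    mmd = D.min(axis=0).mean()
    return cov, mmd

X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Y = X + np.array([0.0, 0.0, 1.0])       # same shape shifted by 1 along z
cd_xy = chamfer_distance(X, Y)
cov, mmd = coverage_and_mmd([X, X], [X, Y])
```

In the toy example both generated clouds match the first reference cloud only, so coverage is low even though one reference cloud is reproduced exactly.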
Generation
- Previous pc generative models : raw-GAN, latent-GAN, PC-GAN
- Dataset : 3 categories in ShapeNet (airplane, chair, car) → Normalized (zero-mean per axis, unit variance)
- Training, Test : 2048 points for each shape
- Models : # of parameters in total (full) or in generative pathways (gen)
- Result : outperforming all baselines across all categories (1-NNA) & best score in most cases (other metrics)
Auto-Encoding
- Goal : Reconstruction ability
- Models : l-GAN (latent-GAN), AtlasNet(SOTA) vs PointFlow (flow-based AE)
- Dataset : ShapeNet
- Training : AE trained with only $L_{recon}$
- Test : 4096 points per shape = 2048 input set + 2048 reference set
- How? computing distance (CD or EMD) bw reconstructed input set and reference set
- Result : best EMD score
Unsupervised representation learning
- Goal : Representation learning ability
- How : extract latent representations of AE trained in full ShapeNet → train linear SVM classifier on ModelNet10(40)
- Dataset : ShapeNet & ModelNet10(40) → Normalized (zero-mean per axis, unit variance), Random-rotation along gravity axis
- Problem of task : different encoder, different # of params, different pre-processing --> hard to compare
Code Review
- Reference : https://github.com/stevenygd/PointFlow
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim


class Encoder(nn.Module):
    def __init__(self, zdim, input_dim=3, use_deterministic_encoder=False):
        super(Encoder, self).__init__()
        self.use_deterministic_encoder = use_deterministic_encoder
        self.zdim = zdim
        # Per-point feature extractor: 1x1 convolutions applied to every point
        self.conv1 = nn.Conv1d(input_dim, 128, 1)
        self.conv2 = nn.Conv1d(128, 128, 1)
        self.conv3 = nn.Conv1d(128, 256, 1)
        self.conv4 = nn.Conv1d(256, 512, 1)
        self.bn1 = nn.BatchNorm1d(128)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(256)
        self.bn4 = nn.BatchNorm1d(512)
        if self.use_deterministic_encoder:
            self.fc1 = nn.Linear(512, 256)
            self.fc2 = nn.Linear(256, 128)
            self.fc_bn1 = nn.BatchNorm1d(256)
            self.fc_bn2 = nn.BatchNorm1d(128)
            self.fc3 = nn.Linear(128, zdim)
        else:
            # Mapping to [c], cmean
            self.fc1_m = nn.Linear(512, 256)
            self.fc2_m = nn.Linear(256, 128)
            self.fc3_m = nn.Linear(128, zdim)
            self.fc_bn1_m = nn.BatchNorm1d(256)
            self.fc_bn2_m = nn.BatchNorm1d(128)
            # Mapping to [c], log-variance (consumed as `logvar` by PointFlow)
            self.fc1_v = nn.Linear(512, 256)
            self.fc2_v = nn.Linear(256, 128)
            self.fc3_v = nn.Linear(128, zdim)
            self.fc_bn1_v = nn.BatchNorm1d(256)
            self.fc_bn2_v = nn.BatchNorm1d(128)

    def forward(self, x):
        x = x.transpose(1, 2)  # (B, N, input_dim) -> (B, input_dim, N)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.bn4(self.conv4(x))
        x = torch.max(x, 2, keepdim=True)[0]  # max-pool over points (order invariant)
        x = x.view(-1, 512)
        if self.use_deterministic_encoder:
            ms = F.relu(self.fc_bn1(self.fc1(x)))
            ms = F.relu(self.fc_bn2(self.fc2(ms)))
            ms = self.fc3(ms)
            m, v = ms, 0
        else:
            m = F.relu(self.fc_bn1_m(self.fc1_m(x)))
            m = F.relu(self.fc_bn2_m(self.fc2_m(m)))
            m = self.fc3_m(m)
            v = F.relu(self.fc_bn1_v(self.fc1_v(x)))
            v = F.relu(self.fc_bn2_v(self.fc2_v(v)))
            v = self.fc3_v(v)
        return m, v
# Model
# get_point_cnf, get_latent_cnf, truncated_normal, standard_normal_logprob and
# reduce_tensor are helpers from the referenced repository (not shown here).
class PointFlow(nn.Module):
    def __init__(self, args):
        super(PointFlow, self).__init__()
        self.input_dim = args.input_dim
        self.zdim = args.zdim
        self.use_latent_flow = args.use_latent_flow
        self.use_deterministic_encoder = args.use_deterministic_encoder
        self.prior_weight = args.prior_weight
        self.recon_weight = args.recon_weight
        self.entropy_weight = args.entropy_weight
        self.distributed = args.distributed
        self.truncate_std = None
        self.encoder = Encoder(
            zdim=args.zdim, input_dim=args.input_dim,
            use_deterministic_encoder=args.use_deterministic_encoder)
        self.point_cnf = get_point_cnf(args)
        self.latent_cnf = get_latent_cnf(args) if args.use_latent_flow else nn.Sequential()

    @staticmethod
    def sample_gaussian(size, truncate_std=None, gpu=None):
        y = torch.randn(*size).float()
        y = y if gpu is None else y.cuda(gpu)
        if truncate_std is not None:
            truncated_normal(y, mean=0, std=1, trunc_std=truncate_std)
        return y

    @staticmethod
    def reparameterize_gaussian(mean, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn(std.size()).to(mean)
        return mean + std * eps

    @staticmethod
    def gaussian_entropy(logvar):
        const = 0.5 * float(logvar.size(1)) * (1. + np.log(np.pi * 2))
        ent = 0.5 * logvar.sum(dim=1, keepdim=False) + const
        return ent

    def multi_gpu_wrapper(self, f):
        self.encoder = f(self.encoder)
        self.point_cnf = f(self.point_cnf)
        self.latent_cnf = f(self.latent_cnf)

    def make_optimizer(self, args):
        def _get_opt_(params):
            if args.optimizer == 'adam':
                optimizer = optim.Adam(params, lr=args.lr, betas=(args.beta1, args.beta2),
                                       weight_decay=args.weight_decay)
            elif args.optimizer == 'sgd':
                optimizer = torch.optim.SGD(params, lr=args.lr, momentum=args.momentum)
            else:
                assert 0, "args.optimizer should be either 'adam' or 'sgd'"
            return optimizer
        opt = _get_opt_(list(self.encoder.parameters()) + list(self.point_cnf.parameters())
                        + list(list(self.latent_cnf.parameters())))
        return opt

    def forward(self, x, opt, step, writer=None):
        opt.zero_grad()
        batch_size = x.size(0)
        num_points = x.size(1)
        # Despite its name, z_sigma is used as the log-variance of Q(z|X)
        z_mu, z_sigma = self.encoder(x)
        if self.use_deterministic_encoder:
            z = z_mu + 0 * z_sigma
        else:
            z = self.reparameterize_gaussian(z_mu, z_sigma)

        # Compute H[Q(z|X)]
        if self.use_deterministic_encoder:
            entropy = torch.zeros(batch_size).to(z)
        else:
            entropy = self.gaussian_entropy(z_sigma)

        # Compute the prior probability P(z)
        if self.use_latent_flow:
            w, delta_log_pw = self.latent_cnf(z, None, torch.zeros(batch_size, 1).to(z))
            log_pw = standard_normal_logprob(w).view(batch_size, -1).sum(1, keepdim=True)
            delta_log_pw = delta_log_pw.view(batch_size, 1)
            log_pz = log_pw - delta_log_pw
        else:
            log_pz = torch.zeros(batch_size, 1).to(z)

        # Compute the reconstruction likelihood P(X|z)
        z_new = z.view(*z.size())
        # Adds zero to z; this only ties log_pz into the graph of the point CNF input
        z_new = z_new + (log_pz * 0.).mean()
        y, delta_log_py = self.point_cnf(x, z_new, torch.zeros(batch_size, num_points, 1).to(x))
        log_py = standard_normal_logprob(y).view(batch_size, -1).sum(1, keepdim=True)
        delta_log_py = delta_log_py.view(batch_size, num_points, 1).sum(1)
        log_px = log_py - delta_log_py

        # Loss
        entropy_loss = -entropy.mean() * self.entropy_weight
        recon_loss = -log_px.mean() * self.recon_weight
        prior_loss = -log_pz.mean() * self.prior_weight
        loss = entropy_loss + prior_loss + recon_loss
        loss.backward()
        opt.step()

        # LOGGING (after the training)
        if self.distributed:
            entropy_log = reduce_tensor(entropy.mean())
            recon = reduce_tensor(-log_px.mean())
            prior = reduce_tensor(-log_pz.mean())
        else:
            entropy_log = entropy.mean()
            recon = -log_px.mean()
            prior = -log_pz.mean()
        recon_nats = recon / float(x.size(1) * x.size(2))
        prior_nats = prior / float(self.zdim)
        if writer is not None:
            writer.add_scalar('train/entropy', entropy_log, step)
            writer.add_scalar('train/prior', prior, step)
            writer.add_scalar('train/prior(nats)', prior_nats, step)
            writer.add_scalar('train/recon', recon, step)
            writer.add_scalar('train/recon(nats)', recon_nats, step)
        return {
            'entropy': entropy_log.cpu().detach().item()
            if not isinstance(entropy_log, float) else entropy_log,
            'prior_nats': prior_nats,
            'recon_nats': recon_nats,
        }

    def encode(self, x):
        z_mu, z_sigma = self.encoder(x)
        if self.use_deterministic_encoder:
            return z_mu
        else:
            return self.reparameterize_gaussian(z_mu, z_sigma)

    def decode(self, z, num_points, truncate_std=None):
        # transform points from the prior to a point cloud, conditioned on a shape code
        y = self.sample_gaussian((z.size(0), num_points, self.input_dim), truncate_std)
        x = self.point_cnf(y, z, reverse=True).view(*y.size())
        return y, x

    def sample(self, batch_size, num_points, truncate_std=None, truncate_std_latent=None, gpu=None):
        assert self.use_latent_flow, "Sampling requires `self.use_latent_flow` to be True."
        # Generate the shape code from the prior
        w = self.sample_gaussian((batch_size, self.zdim), truncate_std_latent, gpu=gpu)
        z = self.latent_cnf(w, None, reverse=True).view(*w.size())
        # Sample points conditioned on the shape code
        y = self.sample_gaussian((batch_size, num_points, self.input_dim), truncate_std, gpu=gpu)
        x = self.point_cnf(y, z, reverse=True).view(*y.size())
        return z, x

    def reconstruct(self, x, num_points=None, truncate_std=None):
        num_points = x.size(1) if num_points is None else num_points
        z = self.encode(x)
        _, x = self.decode(z, num_points, truncate_std)
        return x
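The two Gaussian helpers in the model (`reparameterize_gaussian`, `gaussian_entropy`) are easy to check by hand. Below is a minimal pure-Python sketch of the same formulas for one diagonal Gaussian, written on plain lists instead of tensors; the function names are hypothetical and the block is not part of the referenced repository.

```python
import math
import random


def gaussian_entropy(logvar):
    """Entropy of a diagonal Gaussian: 0.5 * sum(logvar) + 0.5 * d * (1 + log(2*pi))."""
    d = len(logvar)
    return 0.5 * sum(logvar) + 0.5 * d * (1.0 + math.log(2.0 * math.pi))


def reparameterize(mean, logvar, rng):
    """Sample z = mean + std * eps with eps ~ N(0, I) and std = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


rng = random.Random(0)
unit_entropy = gaussian_entropy([0.0])                  # entropy of N(0, 1) in nats
z = reparameterize([1.0, -2.0], [-100.0, -100.0], rng)  # near-zero variance: z is close to the mean
```

Writing the sample as a deterministic function of `mean`, `logvar`, and external noise is what lets gradients flow back into the encoder during training.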
[CV_Pose Estimation] Efficient Object Localization Using Convolutional Networks
Efficient Object Localization Using Convolutional Networks
Abstract
- Efficient 'Position Refinement' model
- trained to estimate joint offset location within a small region of img
- trained in cascade with a SOTA ConvNet model to achieve improved accuracy
- on FLIC dataset, MPII dataset
1. Introduction
- Human-body part localization task ↑ BY ConvNet arch + larger datasets
- (sota) ConvNet : internal strided-pooling layers
- reduce spatial resolution
- output : invariant to spatial location within pooling region
- promote spatial invariance to local input transformation
- pooling : prevent over-training + reducing computational complexity for classification
- Trade-off : generalization performance ↑ <-> spatial localization accuracy ↓
- (this paper) LCN : ConvNet for efficient localization of human joints in RGB imgs
- high spatial accuracy + computational efficiency
- begin by coarse body part localization -> output : low resolution, per-pixel heat-map
- show likelihood of a joint occurring in each spatial location
- Max-pooling for dimensionality reduction + improving invariance to noise and local img transformations
- reuse hidden layer conv features from coarse heat-map regression model to improve localization accuracy
2. Related Work
- Models using Hand-crafted features (edges, contours, HoG, color histograms) : poor generalization performance
- Deformable Part Models (DPM)
- Mixture of templates modeled using SVMs
- Poselet + DPM model : spatial relationship of body parts
- Armlets : semi-global classifier, good for real-world data, but only arms
- Multi-modal model : holistic + local
- ConvNets
- formulate problem as a direct (continuous) regression
- perform poorly in the high-precision region
- unnecessary learning complexity by mapping from input RGB img to XY location (over-training)
- +) low-dimensional representation of input img, multi-resolution ConvNet arch, ...
3. Coarse Heat-Map Regression Model
- Using Extension of Multi-resolution ConvNet model
- For Sliding window detector with Overlapping contexts to produce Coarse heat-map output
3.1. Model Architecture
- Input : RGB Gaussian pyramid of 3 levels (320 x 240 for FLIC, 256 x 256 for MPII)
Figure 2 : only 2 levels for brevity
- Output : Heat-map for each joint describing per-pixel likelihood for joint occurring in each output spatial location
- 1st layer : LCN (Local Contrast Normalization) with same filter kernel in each 3 resolution banks -> out : LCN imgs
- Next 7 stage (11 for MPII) multi-resolution ConvNet : Pooling -> heat-map output is at a lower resolution than input img
- Last 4 stage (3 for MPII) : effectively simulated FC network for target input patch size
3.2. Spatial Dropout
- Dropout : zeroing activation -> improving generalization by preventing activations from becoming strongly correlated
- Additional Dropout layer before 1st 1x1 conv layer
- Standard Dropout
- Network is fully conv (1d conv) & natural imgs (so, feature map activations) are strongly correlated
- Result : over-training (Fail)
- Spatial Dropout
- Feature-map = n_features x Height x Width
- How : perform only n_features dropout trials + extend value across entire feature map
- Result : adjacent pixels are either all 0 OR all active (good performance on FLIC)
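A minimal sketch of the Spatial Dropout rule described above, on nested Python lists: one Bernoulli trial per feature map, with the result broadcast over the whole H x W map. The function name is hypothetical, and rescaling kept maps by 1/(1-p) ("inverted dropout") is an assumption borrowed from common framework implementations, not something stated in these notes.

```python
import random


def spatial_dropout(feature_maps, p, rng):
    """feature_maps: list of n_features maps, each a list of H rows of W values.

    Drops whole feature maps with probability p (requires 0 <= p < 1).
    """
    out = []
    for fmap in feature_maps:
        keep = rng.random() >= p  # a single trial decides the entire map
        scale = 1.0 / (1.0 - p) if keep else 0.0
        out.append([[v * scale for v in row] for row in fmap])
    return out


rng = random.Random(0)
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(8)]
dropped = spatial_dropout(maps, 0.5, rng)
```

Standard dropout would instead run one trial per pixel, which hardly regularizes when neighboring activations are strongly correlated.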
3.3. Training and Data Augmentation
- Loss : MSE
- H', H : Predicted and GT heat-map for joint
- Target GT heat-map : 2D gaussian of constant variance (sigma = 1.5 pixels) centered at GT joint (x,y)
- Data Augmentation : Random rotation, scaling, flipping -> Generalization
- Multiple people contained but Single person annotated case
- How : Sliding-window + tree-structured MRF spatial model (approximate Torso position)
- MRF Input : GT torso position + 14 predicted joints from ConvNet output = 15 joints locations
- Result : selecting correct person for labeling
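The target construction and loss above can be sketched as follows. The notes only state a 2D Gaussian with sigma = 1.5 pixels and an MSE loss; normalizing the peak to 1 and averaging over pixels (rather than summing) are assumptions, and the function names are hypothetical.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.5):
    """Target heat-map: unnormalized 2D Gaussian centered at the GT joint (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def heatmap_mse(pred, target):
    """Mean squared error between a predicted heat-map H' and a target heat-map H."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n


target = gaussian_heatmap(8, 10, cx=3, cy=5)
blank = [[0.0] * 10 for _ in range(8)]
loss_blank = heatmap_mse(blank, target)  # an all-zero prediction is penalized
```

In the full model the same loss would be accumulated over all joints, one heat-map per joint.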
4. Fine Heat-Map Regression Model
- Purpose : Recovering spatial accuracy lost due to pooling
- How : Using additional ConvNet to refine localization result of coarse heat-map
- Difference : Reusing existing conv features -> reducing # of params + acting as regularizer
4.1. Model Architecture
- Heat-map-based model for coarse localization
- Module to sample and crop conv features at joint location (x, y)
- Additional conv model for fine tuning
- Joint Inference Steps
- FPROP (forward-propagate) through Coarse heat-map model
- Infer all joint locations (x, y) from max value in each joint's heat-map
- Sample and Crop first 2 conv layers (for all resolution) at each coarse location (x, y)
- FPROP through Fine heat-map model -> (△x, △y)
- Fine heat-map model : Siamese network of 7 instances (14 for MPII)
- Add Position Refinement to coarse location -> Final location (x, y) for each joint
- Siamese network : Weights and biases of each module are shared
- Sample location for each joint is different : Conv features don't share same Spatial context
- So, conv sub-nets must be applied to each joint independently
- But, parameters are shared to reduce # of params and prevent over-training
- Last 1x1 Conv
- No weight sharing
- Input : each output of 7 sub-nets
- Output : detailed-resolution heat-map
- Purpose : Final detection for each joint
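The joint inference steps in 4.1 boil down to "argmax on the coarse heat-map, then add a predicted offset". A minimal sketch for a single joint is given below; `predict_offset` is a hypothetical stand-in for the fine heat-map model, and `stride` (the ratio between input and coarse heat-map resolution) is an assumed parameter.

```python
def argmax_2d(heatmap):
    """Return (x, y) of the largest value in a heat-map given as a list of rows."""
    _, y, x = max((v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y


def refine_joint(coarse_heatmap, predict_offset, stride):
    """Coarse location from the heat-map peak, refined by a predicted (dx, dy)."""
    x_c, y_c = argmax_2d(coarse_heatmap)
    # Map the low-resolution peak back to input-image coordinates
    x0, y0 = x_c * stride, y_c * stride
    # The fine model would look at conv features cropped around (x0, y0)
    dx, dy = predict_offset(x0, y0)
    return x0 + dx, y0 + dy


coarse = [[0.0, 0.1], [0.9, 0.2]]
final_xy = refine_joint(coarse, lambda x, y: (1, -2), stride=4)
```

The refinement only has to cover the small window that pooling made ambiguous, which is why a lightweight fine model is enough.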
4.2. Joint Training
- Before Joint Training : Pre-training Coarse heat-map model first
- Holding params Coarse heat-map model Fixed + Training Fine heat-map model
- Jointly Training both models by minimizing E3 = E1 + λE2 ..... (λ = 0.1)
- Regression to set of target heat-maps for minimizing final (x, y) prediction
5. Results
- Framework : Torch7
- Dataset : FLIC(easy), MPII-Human-Pose(hard)
- Pooling impact for coarse heat-map model : Pooling ↑ -> Detection performance(spatial precision) ↓
- Ambiguous GT labels : can be worse than expected variance in User-generated labels
- Greedily-trained cascade (Shared features)
- Coarse and Fine models are trained independently by adding additional conv layer
- How : Training Fine model by using cropped input imgs as input
- Result : regularizing effect of joint training : preventing over-training [F14(a)]
- SpatialDropout : regularizing effect of dropout + reduction in strong heat-map outliers [F14(b)]
6. Conclusion
- Localization tasks demand high degree of spatial precision
- Cascaded architecture that combined Fine and Coarse conv networks -> SOTA on FLIC, MPII-human-pose
- Spatial Precision + Computational benefits of Pooling
Code
Train
import os
import sys
import time
import argparse
import torch
import numpy as np
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchcontrib
from torchvision import transforms
from dataset.cub200 import CUB200Data
from dataset.mit67 import MIT67Data
from dataset.stanford_dog import SDog120Data
from dataset.caltech256 import Caltech257Data
from dataset.stanford_40 import Stanford40Data
from dataset.flower102 import Flower102Data
from model.fe_resnet import resnet18_dropout, resnet50_dropout, resnet101_dropout
from model.fe_mobilenet import mbnetv2_dropout
class MovingAverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f', momentum=0.9):
        self.name = name
        self.fmt = fmt
        self.momentum = momentum
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0

    def update(self, val, n=1):
        self.val = val
        self.avg = self.momentum*self.avg + (1-self.momentum)*val

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


class CrossEntropyLabelSmooth(nn.Module):
    def __init__(self, num_classes, epsilon = 0.1):
        super(CrossEntropyLabelSmooth, self).__init__()
        self.num_classes = num_classes
        self.epsilon = epsilon
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, inputs, targets):
        log_probs = self.logsoftmax(inputs)
        targets = torch.zeros_like(log_probs).scatter_(1, targets.unsqueeze(1), 1)
        targets = (1 - self.epsilon) * targets + self.epsilon / self.num_classes
        loss = (-targets * log_probs).sum(1)
        return loss.mean()


def linear_l2(model):
    beta_loss = 0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            beta_loss += (m.weight).pow(2).sum()
            beta_loss += (m.bias).pow(2).sum()
    return 0.5*beta_loss*args.beta, beta_loss


def l2sp(model, reg):
    reg_loss = 0
    dist = 0
    for m in model.modules():
        if hasattr(m, 'weight') and hasattr(m, 'old_weight'):
            diff = (m.weight - m.old_weight).pow(2).sum()
            dist += diff
            reg_loss += diff
        if hasattr(m, 'bias') and hasattr(m, 'old_bias'):
            diff = (m.bias - m.old_bias).pow(2).sum()
            dist += diff
            reg_loss += diff
    if dist > 0:
        dist = dist.sqrt()
    loss = (reg * reg_loss)
    return loss, dist
def test(model, teacher, loader, loss=False):
    # The loss accumulators are only created when loss=True, and the return statement
    # uses them, so as written this function has to be called with loss=True.
    # `args` and `reg_layers` are read from module scope.
    with torch.no_grad():
        model.eval()
        if loss:
            teacher.eval()
            ce = CrossEntropyLabelSmooth(loader.dataset.num_classes, args.label_smoothing).to('cuda')
            featloss = torch.nn.MSELoss(reduction='none')
            total_ce = 0
            total_feat_reg = np.zeros(len(reg_layers))
            total_l2sp_reg = 0
        total = 0
        top1 = 0
        for i, (batch, label) in enumerate(loader):
            batch, label = batch.to('cuda'), label.to('cuda')
            total += batch.size(0)
            out = model(batch)
            _, pred = out.max(dim=1)
            top1 += int(pred.eq(label).sum().item())
            if loss:
                total_ce += ce(out, label).item()
                if teacher is not None:
                    with torch.no_grad():
                        tout = teacher(batch)
                    for key in reg_layers:
                        src_x = reg_layers[key][0].out
                        tgt_x = reg_layers[key][1].out
                        tgt_channels = tgt_x.shape[1]
                        regloss = featloss(src_x[:,:tgt_channels,:,:], tgt_x.detach()).mean()
                        total_feat_reg[key] += regloss.item()
                _, unweighted = l2sp(model, 0)
                total_l2sp_reg += unweighted.item()
    return float(top1)/total*100, total_ce/(i+1), np.sum(total_feat_reg)/(i+1), total_l2sp_reg/(i+1), total_feat_reg/(i+1)
def train(model, train_loader, val_loader, iterations=9000, lr=1e-2, name='', l2sp_lmda=1e-2, teacher=None, reg_layers={}):
    model = model.to('cuda')
    # Plain weight decay is only used when L2-SP is disabled
    if l2sp_lmda == 0:
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=args.momentum, weight_decay=args.weight_decay)
    else:
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=args.momentum, weight_decay=0)
    end_iter = iterations
    if args.swa:
        optimizer = torchcontrib.optim.SWA(optimizer, swa_start=args.swa_start, swa_freq=args.swa_freq)
        end_iter = args.swa_start
    if args.const_lr:
        scheduler = None
    else:
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, end_iter)

    teacher.eval()
    ce = CrossEntropyLabelSmooth(train_loader.dataset.num_classes, args.label_smoothing).to('cuda')
    featloss = torch.nn.MSELoss()

    batch_time = MovingAverageMeter('Time', ':6.3f')
    data_time = MovingAverageMeter('Data', ':6.3f')
    ce_loss_meter = MovingAverageMeter('CE Loss', ':6.3f')
    feat_loss_meter = MovingAverageMeter('Feat. Loss', ':6.3f')
    l2sp_loss_meter = MovingAverageMeter('L2SP Loss', ':6.3f')
    linear_loss_meter = MovingAverageMeter('LinearL2 Loss', ':6.3f')
    total_loss_meter = MovingAverageMeter('Total Loss', ':6.3f')
    top1_meter = MovingAverageMeter('Acc@1', ':6.2f')

    dataloader_iterator = iter(train_loader)
    for i in range(iterations):
        if args.swa:
            if i >= int(args.swa_start) and (i-int(args.swa_start))%args.swa_freq == 0:
                scheduler = None
        model.train()
        optimizer.zero_grad()

        end = time.time()
        try:
            batch, label = next(dataloader_iterator)
        except StopIteration:
            # Loader exhausted: restart it so training is driven by iterations, not epochs
            dataloader_iterator = iter(train_loader)
            batch, label = next(dataloader_iterator)
        batch, label = batch.to('cuda'), label.to('cuda')
        data_time.update(time.time() - end)

        out = model(batch)
        _, pred = out.max(dim=1)
        top1_meter.update(float(pred.eq(label).sum().item()) / label.shape[0] * 100.)

        loss = 0.
        loss += ce(out, label)
        ce_loss_meter.update(loss.item())

        # Forward pass of the teacher fills the `.out` activations via forward hooks
        with torch.no_grad():
            tout = teacher(batch)

        # Compute the feature distillation loss only when needed
        if args.feat_lmda > 0:
            regloss = 0
            for layer in args.feat_layers:
                key = int(layer)-1
                src_x = reg_layers[key][0].out
                tgt_x = reg_layers[key][1].out
                tgt_channels = tgt_x.shape[1]
                regloss += featloss(src_x[:,:tgt_channels,:,:], tgt_x.detach())
            regloss = args.feat_lmda * regloss
            loss += regloss
            feat_loss_meter.update(regloss.item())

        beta_loss, linear_norm = linear_l2(model)
        loss = loss + beta_loss
        linear_loss_meter.update(beta_loss.item())

        if l2sp_lmda > 0:
            reg, _ = l2sp(model, l2sp_lmda)
            l2sp_loss_meter.update(reg.item())
            loss = loss + reg

        total_loss_meter.update(loss.item())
        loss.backward()
        optimizer.step()
        for param_group in optimizer.param_groups:
            current_lr = param_group['lr']
        if scheduler is not None:
            scheduler.step()
        batch_time.update(time.time() - end)

        if (i % args.print_freq == 0) or (i == iterations-1):
            progress = ProgressMeter(
                iterations,
                [batch_time, data_time, top1_meter, total_loss_meter, ce_loss_meter, feat_loss_meter, l2sp_loss_meter, linear_loss_meter],
                prefix="LR: {:6.5f}".format(current_lr))
            progress.display(i)

        if (i % args.test_interval == 0) or (i == iterations-1):
            test_top1, test_ce_loss, test_feat_loss, test_weight_loss, test_feat_layer_loss = test(model, teacher, val_loader, loss=True)
            train_top1, train_ce_loss, train_feat_loss, train_weight_loss, train_feat_layer_loss = test(model, teacher, train_loader, loss=True)
            print('Eval Train | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, train_top1, train_ce_loss, train_feat_loss, train_weight_loss))
            print('Eval Test | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, test_top1, test_ce_loss, test_feat_loss, test_weight_loss))
            if not args.no_save:
                if not os.path.exists('ckpt'):
                    os.makedirs('ckpt')
                torch.save({'state_dict': model.state_dict()}, 'ckpt/{}.pth'.format(name))

    if args.swa:
        optimizer.swap_swa_sgd()
        # Recompute BatchNorm statistics for the averaged weights
        for m in model.modules():
            if hasattr(m, 'running_mean'):
                m.reset_running_stats()
                m.momentum = None
        with torch.no_grad():
            model.train()
            for x, y in train_loader:
                x = x.to('cuda')
                out = model(x)
        test_top1, test_ce_loss, test_feat_loss, test_weight_loss, test_feat_layer_loss = test(model, teacher, val_loader, loss=True)
        train_top1, train_ce_loss, train_feat_loss, train_weight_loss, train_feat_layer_loss = test(model, teacher, train_loader, loss=True)
        print('Eval Train | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, train_top1, train_ce_loss, train_feat_loss, train_weight_loss))
        print('Eval Test | Iteration {}/{} | Top-1: {:.2f} | CE Loss: {:.3f} | Feat Reg Loss: {:.6f} | L2SP Reg Loss: {:.3f}'.format(i+1, iterations, test_top1, test_ce_loss, test_feat_loss, test_weight_loss))
        if not args.no_save:
            if not os.path.exists('ckpt'):
                os.makedirs('ckpt')
            torch.save({'state_dict': model.state_dict()}, 'ckpt/{}.pth'.format(name))
    return model
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--datapath", type=str, default='/data', help='path to the dataset')
parser.add_argument("--dataset", type=str, default='CUB200Data', help='Target dataset. Currently support: \{SDog120Data, CUB200Data, Stanford40Data, MIT67Data, Flower102Data\}')
parser.add_argument("--iterations", type=int, default=30000, help='Iterations to train')
parser.add_argument("--print_freq", type=int, default=100, help='Frequency of printing training logs')
parser.add_argument("--test_interval", type=int, default=1000, help='Frequency of testing')
parser.add_argument("--name", type=str, default='test', help='Name for the checkpoint')
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=1e-2)
parser.add_argument("--const_lr", action='store_true', default=False, help='Use constant learning rate')
parser.add_argument("--weight_decay", type=float, default=0)
parser.add_argument("--momentum", type=float, default=0.9)
parser.add_argument("--beta", type=float, default=1e-2, help='The strength of the L2 regularization on the last linear layer')
parser.add_argument("--dropout", type=float, default=0, help='Dropout rate for spatial dropout')
parser.add_argument("--l2sp_lmda", type=float, default=0)
parser.add_argument("--feat_lmda", type=float, default=0)
parser.add_argument("--feat_layers", type=str, default='1234', help='Used for DELTA (which layers or stages to match), ResNets should be 1234 and MobileNetV2 should be 12345')
parser.add_argument("--reinit", action='store_true', default=False, help='Reinitialize before training')
parser.add_argument("--no_save", action='store_true', default=False, help='Do not save checkpoints')
parser.add_argument("--swa", action='store_true', default=False, help='Use SWA')
parser.add_argument("--swa_freq", type=int, default=500, help='Frequency of averaging models in SWA')
parser.add_argument("--swa_start", type=int, default=0, help='Start SWA since which iterations')
parser.add_argument("--label_smoothing", type=float, default=0)
parser.add_argument("--checkpoint", type=str, default='', help='Load a previously trained checkpoint')
parser.add_argument("--network", type=str, default='resnet18', help='Network architecture. Currently support: \{resnet18, resnet50, resnet101, mbnetv2\}')
parser.add_argument("--tnetwork", type=str, default='resnet18', help='Network architecture. Currently support: \{resnet18, resnet50, resnet101, mbnetv2\}')
parser.add_argument("--width_mult", type=float, default=1)
parser.add_argument("--shot", type=int, default=-1, help='Number of training samples per class for the training dataset. -1 indicates using the full dataset.')
parser.add_argument("--log", action='store_true', default=False, help='Redirect the output to log/args.name.log')
args = parser.parse_args()
return args
# Forward hooks that record layer activations for feature matching
def record_act(self, input, output):
self.out = output
def record_act_with_1x1(self, input, output):
self.out = self[-1].dim_matching(output)
if __name__ == '__main__':
args = get_args()
if args.log:
if not os.path.exists('log'):
os.makedirs('log')
sys.stdout = open('log/{}.log'.format(args.name), 'w')
print(args)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
# Used to make sure we sample the same image for few-shot scenarios
seed = 98
train_set = eval(args.dataset)(args.datapath, True, transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
normalize,
]), args.shot, seed, preload=False)
test_set = eval(args.dataset)(args.datapath, False, transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
normalize,
]), args.shot, seed, preload=False)
train_loader = torch.utils.data.DataLoader(train_set,
batch_size=args.batch_size, shuffle=True,
num_workers=8, pin_memory=True)
val_loader = train_loader
test_loader = torch.utils.data.DataLoader(test_set,
batch_size=args.batch_size, shuffle=False,
num_workers=8, pin_memory=False)
model = eval('{}_dropout'.format(args.network))(pretrained=True, dropout=args.dropout, width_mult=args.width_mult, num_classes=train_loader.dataset.num_classes).cuda()
if args.checkpoint != '':
checkpoint = torch.load(args.checkpoint)
model.load_state_dict(checkpoint['state_dict'])
# Pre-trained model
teacher = eval('{}_dropout'.format(args.tnetwork))(pretrained=True, dropout=0, num_classes=train_loader.dataset.num_classes).cuda()
if 'mbnetv2' in args.network:
reg_layers = {0: [model.layer1], 1: [model.layer2], 2: [model.layer3], 3: [model.layer4], 4: [model.layer5]}
model.layer1.register_forward_hook(record_act)
model.layer2.register_forward_hook(record_act)
model.layer3.register_forward_hook(record_act)
model.layer4.register_forward_hook(record_act)
model.layer5.register_forward_hook(record_act)
else:
reg_layers = {0: [model.layer1], 1: [model.layer2], 2: [model.layer3], 3: [model.layer4]}
# if args.width_mult > 1:
# model.layer1.register_forward_hook(record_act_with_1x1)
# model.layer2.register_forward_hook(record_act_with_1x1)
# model.layer3.register_forward_hook(record_act_with_1x1)
# model.layer4.register_forward_hook(record_act_with_1x1)
# model.layer1[-1].dim_matching = torch.nn.Conv2d(model.layer1[-1].out_dim, int(model.layer1[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
# model.layer2[-1].dim_matching = torch.nn.Conv2d(model.layer2[-1].out_dim, int(model.layer2[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
# model.layer3[-1].dim_matching = torch.nn.Conv2d(model.layer3[-1].out_dim, int(model.layer3[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
# model.layer4[-1].dim_matching = torch.nn.Conv2d(model.layer4[-1].out_dim, int(model.layer4[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
# else:
# model.layer1.register_forward_hook(record_act)
# model.layer2.register_forward_hook(record_act)
# model.layer3.register_forward_hook(record_act)
# model.layer4.register_forward_hook(record_act)
model.layer1.register_forward_hook(record_act_with_1x1)
model.layer2.register_forward_hook(record_act_with_1x1)
model.layer3.register_forward_hook(record_act_with_1x1)
model.layer4.register_forward_hook(record_act_with_1x1)
model.layer1[-1].dim_matching = torch.nn.Conv2d(model.layer1[-1].out_dim, int(teacher.layer1[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
model.layer2[-1].dim_matching = torch.nn.Conv2d(model.layer2[-1].out_dim, int(teacher.layer2[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
model.layer3[-1].dim_matching = torch.nn.Conv2d(model.layer3[-1].out_dim, int(teacher.layer3[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
model.layer4[-1].dim_matching = torch.nn.Conv2d(model.layer4[-1].out_dim, int(teacher.layer4[-1].out_dim/args.width_mult), kernel_size=1, bias=False).cuda()
# Stored pre-trained weights for computing L2SP
for m in model.modules():
if hasattr(m, 'weight') and not hasattr(m, 'old_weight'):
m.old_weight = m.weight.data.clone().detach()
# all_weights = torch.cat([all_weights.reshape(-1), m.weight.data.abs().reshape(-1)], dim=0)
if hasattr(m, 'bias') and not hasattr(m, 'old_bias') and m.bias is not None:
m.old_bias = m.bias.data.clone().detach()
if args.reinit:
for m in model.modules():
if type(m) in [nn.Linear, nn.BatchNorm2d, nn.Conv2d]:
m.reset_parameters()
reg_layers[0].append(teacher.layer1)
teacher.layer1.register_forward_hook(record_act)
reg_layers[1].append(teacher.layer2)
teacher.layer2.register_forward_hook(record_act)
reg_layers[2].append(teacher.layer3)
teacher.layer3.register_forward_hook(record_act)
reg_layers[3].append(teacher.layer4)
teacher.layer4.register_forward_hook(record_act)
if '5' in args.feat_layers:
reg_layers[4].append(teacher.layer5)
teacher.layer5.register_forward_hook(record_act)
train(model, train_loader, test_loader, l2sp_lmda=args.l2sp_lmda, iterations=args.iterations, lr=args.lr, name='{}'.format(args.name), teacher=teacher, reg_layers=reg_layers)
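The `CrossEntropyLabelSmooth` loss in the training script above mixes the one-hot target with a uniform distribution. A minimal single-sample sketch in pure Python, mirroring the same arithmetic without tensors (function names are hypothetical), makes the formula easy to verify by hand.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross entropy against (1 - eps) * one_hot(target) + eps / num_classes."""
    k = len(logits)
    log_probs = log_softmax(logits)
    soft = [(1.0 - epsilon) * (1.0 if i == target else 0.0) + epsilon / k for i in range(k)]
    return -sum(t * lp for t, lp in zip(soft, log_probs))


uniform_loss = smoothed_cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
plain_loss = smoothed_cross_entropy([2.0, 0.5, -1.0], target=0, epsilon=0.0)
```

With `epsilon=0` the loss reduces to ordinary cross entropy, which matches the script's default of `--label_smoothing 0`.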
Eval
import argparse
import torch
import time
import sys
import numpy as np
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchcontrib
from PIL import Image
from torchvision import transforms
from dataset.cub200 import CUB200Data
from dataset.mit67 import MIT67Data
from dataset.stanford_dog import SDog120Data
from dataset.caltech256 import Caltech257Data
from dataset.stanford_40 import Stanford40Data
from dataset.flower102 import Flower102Data
from advertorch.attacks import LinfPGDAttack
from model.fe_resnet import resnet18_dropout, resnet50_dropout, resnet101_dropout
from model.fe_mobilenet import mbnetv2_dropout
from model.fe_resnet import feresnet18, feresnet50, feresnet101
from model.fe_mobilenet import fembnetv2
def test(model, loader, adversary):
model.eval()
total = 0
top1_clean = 0
top1_adv = 0
adv_success = 0
adv_trial = 0
for i, (batch, label) in enumerate(loader):
batch, label = batch.to('cuda'), label.to('cuda')
total += batch.size(0)
out_clean = model(batch)
if 'mbnetv2' in args.network:
y = torch.zeros(batch.shape[0], model.classifier[1].in_features).cuda()
else:
y = torch.zeros(batch.shape[0], model.fc.in_features).cuda()
y[:,0] = args.m
advbatch = adversary.perturb(batch, y)
out_adv = model(advbatch)
_, pred_clean = out_clean.max(dim=1)
_, pred_adv = out_adv.max(dim=1)
clean_correct = pred_clean.eq(label)
adv_trial += int(clean_correct.sum().item())
adv_success += int(pred_adv[clean_correct].eq(label[clean_correct]).sum().item())
top1_clean += int(pred_clean.eq(label).sum().item())
top1_adv += int(pred_adv.eq(label).sum().item())
print('{}/{}...'.format(i+1, len(loader)))
return float(top1_clean)/total*100, float(top1_adv)/total*100, float(adv_trial-adv_success) / adv_trial *100
def record_act(self, input, output):
pass
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--datapath", type=str, default='/data', help='path to the dataset')
parser.add_argument("--dataset", type=str, default='CUB200Data', help='Target dataset. Currently support: {SDog120Data, CUB200Data, Stanford40Data, MIT67Data, Flower102Data}')
parser.add_argument("--name", type=str, default='test')
parser.add_argument("--B", type=float, default=0.1, help='Attack budget')
parser.add_argument("--m", type=float, default=1000, help='Hyper-parameter for task-agnostic attack')
parser.add_argument("--pgd_iter", type=int, default=40)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--dropout", type=float, default=0)
parser.add_argument("--checkpoint", type=str, default='')
parser.add_argument("--network", type=str, default='resnet18', help='Network architecture. Currently support: {resnet18, resnet50, resnet101, mbnetv2}')
args = parser.parse_args()
return args
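def myloss(yhat, y):
    # Negative squared error between features: the first feature counts fully,
    # the mean over the remaining features is weighted by 0.1.
    # With targeted=False the PGD attack ascends this loss, i.e. it pulls yhat towards y.
    return -((yhat[:,0]-y[:,0])**2 + 0.1*((yhat[:,1:]-y[:,1:])**2).mean(1)).mean()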
if __name__ == '__main__':
args = get_args()
print(args)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
seed = int(time.time())
test_set = eval(args.dataset)(args.datapath, False, transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
normalize,
]), -1, seed, preload=False)
test_loader = torch.utils.data.DataLoader(test_set,
batch_size=args.batch_size, shuffle=False,
num_workers=8, pin_memory=False)
transferred_model = eval('{}_dropout'.format(args.network))(pretrained=False, dropout=args.dropout, num_classes=test_loader.dataset.num_classes).cuda()
checkpoint = torch.load(args.checkpoint)
transferred_model.load_state_dict(checkpoint['state_dict'])
pretrained_model = eval('fe{}'.format(args.network))(pretrained=True).cuda().eval()
adversary = LinfPGDAttack(
pretrained_model, loss_fn=myloss, eps=args.B,
nb_iter=args.pgd_iter, eps_iter=0.01, rand_init=True, clip_min=-2.2, clip_max=2.2,
targeted=False)
clean_top1, adv_top1, adv_sr = test(transferred_model, test_loader, adversary)
print('Clean Top-1: {:.2f} | Adv Top-1: {:.2f} | Attack Success Rate: {:.2f}'.format(clean_top1, adv_top1, adv_sr))
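The three numbers printed by the evaluation script can be reproduced on toy predictions. The sketch below is a plain-Python illustration, not part of the original script: `attack_metrics` and its arguments are hypothetical names, and returning 0.0 when no sample is classified correctly is an assumption added here, since the return statement of `test` would divide by zero in that case.

```python
def attack_metrics(labels, pred_clean, pred_adv):
    """Return (clean top-1 %, adversarial top-1 %, attack success rate %)."""
    total = len(labels)
    clean_correct = [p == y for p, y in zip(pred_clean, labels)]
    adv_correct = [p == y for p, y in zip(pred_adv, labels)]
    # Only samples classified correctly before the attack count as attack trials.
    trials = sum(clean_correct)
    # Trials that are still classified correctly after the attack.
    survived = sum(c and a for c, a in zip(clean_correct, adv_correct))
    clean_top1 = 100.0 * sum(clean_correct) / total
    adv_top1 = 100.0 * sum(adv_correct) / total
    # Assumption: report 0.0 when there are no trials instead of dividing by zero.
    success_rate = 100.0 * (trials - survived) / trials if trials else 0.0
    return clean_top1, adv_top1, success_rate


labels = [0, 1, 2, 3]
pred_clean = [0, 1, 2, 0]   # 3 of 4 correct on clean inputs
pred_adv = [0, 2, 1, 3]     # 2 of the 3 clean-correct samples are flipped
clean_top1, adv_top1, success_rate = attack_metrics(labels, pred_clean, pred_adv)
print(clean_top1, adv_top1, round(success_rate, 2))
```

Note that the attack success rate is measured only over samples that were correct before the attack, so it can be high even when the adversarial top-1 accuracy looks moderate.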
[CV_3D] PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation
PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation
Abstract
Task : Online Semantic Segmentation of Single-scan LiDAR point clouds
- Assigning a semantic label to each point of the input point cloud
- Applications : fine-grained autonomous perception in self-driving systems
- Challenges
- near-real-time latency with limited hardware
- uneven or long-tailed distribution of LiDAR points across space (sparse)
- increasing number of fine-grained semantic classes
- Previous methods (ex. KNN & Graph) : Low performance / Time-consuming
PolarNet
- LiDAR-specific, nearest-neighbor-free segmentation algorithm
- polar bird’s eye view : balancing points across grid cells in polar coordinate system
- indirectly aligning segmentation network’s attention with long-tailed distribution of points along radial axis
1. Introduction
Background
- Lag between release of massive point clouds (PC) and readiness of semantic segmentation labels
- Challenge for human raters to provide point-wise labels
- Demand for automatic and fast semantic segmentation solutions for LiDAR scans
Contributions
- More suitable LiDAR scan representation : considering imbalanced spatial distribution of points
- End-to-end PolarNet network : SOTA performance with low computational cost
- Analysis on performance based on different backbone nets using polar grid compared to other representations
3. Approach
Problem Statement
- $P_i$ : $i$-th point set containing $n_i$ LiDAR points => 4 features per point : (x, y, z, reflection)
- $L_i$ : object labels for each point $p_j$ in $P_i$
- Goal : Training segmentation model $f$ to minimize the difference between prediction $f(P_i)$ and label $L_i$
Overview of model
- 1️⃣ Polar Quantization : Points → Grids
- 2️⃣ Grid feature extraction → Polar Grid feature map
- KNN-free PointNet to transform points to fixed-length representation
- representation is assigned to its location in ring matrix
- 3️⃣ Ring-segmentation-CNN
- Input : ring matrix
- Output : quantized prediction
- 4️⃣ Decoding : Projecting prediction into 3D space
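A minimal sketch of the decoding step, assuming each point keeps the (radius, azimuth, height) index of the cell it was quantized into. The names `decode_point_labels`, `point_cells` and `grid_pred` are hypothetical; the only point is that every point inherits the class predicted for its cell.

```python
def decode_point_labels(point_cells, grid_pred):
    """Look up the predicted class of each point's cell.

    point_cells: list of (ir, ia, iz) integer cell indices, one per point.
    grid_pred:   nested list indexed as grid_pred[ir][ia][iz] -> class id.
    """
    return [grid_pred[ir][ia][iz] for ir, ia, iz in point_cells]


# Toy grid: 2 radial bins x 2 azimuth bins x 1 height bin.
grid_pred = [[[0], [1]],
             [[2], [3]]]
point_cells = [(0, 0, 0), (1, 1, 0), (1, 1, 0), (0, 1, 0)]
point_labels = decode_point_labels(point_cells, grid_pred)
print(point_labels)  # [0, 3, 3, 1]
```

Two points that share a cell always receive the same label, which is why cells containing points of different classes put an upper bound on the achievable accuracy.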
BEV Partition
- Network : 2D detection network to detect objects in 3D point clouds → segmentation
- Input : 2D top-down image (orthogonal projections)
- Output : tensor with the same spatial shape as the input
- each spatial location encodes the class prediction for every voxel along the z-axis at that location
- Motivation : to represent the scene as a top-down img, which speeds up down-stream CNNs originally designed for natural imgs
- Operation
- Initial BEV representation : to create top-down projections of PC
- Variants of initial BEV : to encode each pixel in BEV with different heights, reflection, learned representations
- Cartesian BEV Grid Partition
- Quantize points in Cartesian coordinate system
- Middle grid cells : densely concentrated ↔ Peripheral grid cells : totally empty
- Uneven Partitioning → waste of computational power, limit of feature representativeness for center grid cells
- Points with different labels might be assigned to single cell
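The imbalance of a Cartesian partition can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `cartesian_cell`, the 100 m x 100 m extent, the 10 x 10 grid, and the synthetic scan (rings whose radius doubles), which is not real LiDAR data.

```python
import math
from collections import Counter


def cartesian_cell(x, y, extent=50.0, n_cells=10):
    """Map (x, y) to an integer (ix, iy) cell of an n_cells x n_cells grid
    covering [-extent, extent) on both axes; out-of-range points are clamped."""
    size = 2.0 * extent / n_cells
    ix = min(n_cells - 1, max(0, int((x + extent) // size)))
    iy = min(n_cells - 1, max(0, int((y + extent) // size)))
    return ix, iy


# Synthetic scan: 36 points on each of five rings whose radius doubles,
# so most points sit close to the sensor at the origin.
points = [(r * math.cos(2 * math.pi * k / 36), r * math.sin(2 * math.pi * k / 36))
          for r in (2.0, 4.0, 8.0, 16.0, 32.0) for k in range(36)]
occupancy = Counter(cartesian_cell(x, y) for x, y in points)
print(len(occupancy), "of", 10 * 10, "cells are occupied")
print("max points in one cell:", max(occupancy.values()))
```

With this toy scan the cells around the origin collect many points while the outer cells stay empty, which is the waste described above.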
Polar BEV
- Motivation : to solve imbalance problem of Cartesian BEV
- Operation
- 1) Origin = Sensor's location → Calculate each point's azimuth and radius on XY plane
- 2) Assign points to grid cells based on quantized azimuth and radius
- Benefits : More even point distribution
- Fewer points per cell close to sensor <=> representation of densely occupied regions is finer
- Lower standard deviation of points per cell <=> points are more evenly distributed
- Less burden on predictors (Less misclassification)
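The two operations above can be sketched in a few lines. This is a hedged illustration, not the PolarSeg implementation: `polar_cell`, `max_radius`, the bin counts, and the convention of shifting the azimuth by pi before binning are all assumptions.

```python
import math


def polar_cell(x, y, max_radius=50.0, n_radius=10, n_azimuth=36):
    """Map (x, y), with the sensor at the origin, to (radius_bin, azimuth_bin)."""
    radius = math.hypot(x, y)    # distance to the sensor on the XY plane
    azimuth = math.atan2(y, x)   # angle in [-pi, pi]
    # Points beyond max_radius are clamped into the outermost ring.
    ir = min(n_radius - 1, int(radius / max_radius * n_radius))
    # Shift the angle to [0, 2*pi] and wrap the upper end back to bin 0.
    ia = int((azimuth + math.pi) / (2 * math.pi) * n_azimuth) % n_azimuth
    return ir, ia


for point in [(1.0, 1.0), (30.0, -30.0), (100.0, 100.0)]:
    print(point, "->", polar_cell(*point))
```

Because every ring is split into the same number of azimuth bins, cells near the sensor cover a smaller area than cells far away, which matches the denser point distribution close to the sensor.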
Polar Grid
- Learnable simplified PointNet $h$
- Layers : max-pooling & BN & ReLU
- capturing distribution of points in each grid cell with fixed-length representation
- Feature in $i$,$j$-th grid cell : $fea_{i,j} = MAX(h(p) \mid w_i < p_x < w_{i+1},\ l_j < p_y < l_{j+1})$
- $w$, $l$ : quantization sizes
- $p_x$, $p_y$ : locations of point $p$
- quantization sizes and locations : Polar or Cartesian
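The per-cell feature $fea_{i,j}$ is an element-wise maximum over the features of all points that fall into cell $(i, j)$. A minimal sketch, assuming the per-point features $h(p)$ have already been computed; `grid_max_pool`, `point_features` and `point_cells` are hypothetical names.

```python
def grid_max_pool(point_features, point_cells):
    """Element-wise max of the feature vectors of all points sharing a cell.

    point_features: list of equal-length feature vectors, one per point (h(p)).
    point_cells:    list of hashable cell indices, e.g. (i, j), one per point.
    Returns a dict mapping each occupied cell to its pooled feature vector.
    """
    pooled = {}
    for feat, cell in zip(point_features, point_cells):
        if cell in pooled:
            pooled[cell] = [max(a, b) for a, b in zip(pooled[cell], feat)]
        else:
            pooled[cell] = list(feat)
    return pooled


point_features = [[1.0, 5.0], [3.0, 2.0], [0.0, 0.5]]
point_cells = [(0, 0), (0, 0), (1, 2)]
pooled = grid_max_pool(point_features, point_cells)
print(pooled)  # {(0, 0): [3.0, 5.0], (1, 2): [0.0, 0.5]}
```

The result has a fixed length per cell regardless of how many points the cell contains, and empty cells simply do not appear in the dictionary.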
4. Experiments
Settings
Datasets
- SemanticKITTI : point-level re-annotation of LiDAR part of KITTI / imbalanced and challenging / 19 classes
- A2D2 : autonomous driving dataset / using 5 asynchronous LiDAR sensors / 38-class segmentation annotation
- Paris-Lille-3D : 3 aggregated point clouds built from continuous LiDAR scans of streets / 9 segmentation classes
Voxelization
- Cartesian BEV grid spaces → Polar BEV : to include more than 99% of points for each scan on avg
- Respective grid size setting : [480, 360, 32], [320, 320, 32], [320, 320, 32]
Baselines / Metric
- Baselines : SqueezeSeg, PointNet, RandLA
- Metric : mIoU
Results
SemanticKITTI Segmentation
A2D2 Segmentation
Paris-Lille-3D Segmentation
ETC
Projection Methods
Augmenting LiDAR Segmentation
- RC : Ring Convolution
- 9F : 9 input features = 2 Cartesian coordinates + 3 residual distances from center + 1 reflection + 3 Polar coordinates
mIOU vs. Distance to Sensor
Code Implementation
- Reference : https://github.com/edwardzhou130/PolarSeg
[Code] BEV
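import torch
import torch.nn as nn
import torch.nn.functional as F
from dropblock import DropBlock2D
class BEV_Unet(nn.Module):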
def __init__(self,n_class,n_height,dilation = 1,group_conv=False,input_batch_norm = False,dropout = 0.,circular_padding = False, dropblock = True, use_vis_fea=False):
super(BEV_Unet, self).__init__()
self.n_class = n_class
self.n_height = n_height
if use_vis_fea:
self.network = UNet(n_class*n_height,2*n_height,dilation,group_conv,input_batch_norm,dropout,circular_padding,dropblock)
else:
self.network = UNet(n_class*n_height,n_height,dilation,group_conv,input_batch_norm,dropout,circular_padding,dropblock)
def forward(self, x):
x = self.network(x)
x = x.permute(0,2,3,1)
new_shape = list(x.size())[:3] + [self.n_height,self.n_class]
x = x.view(new_shape)
x = x.permute(0,4,1,2,3)
return x
class UNet(nn.Module):
def __init__(self, n_class,n_height,dilation,group_conv,input_batch_norm, dropout,circular_padding,dropblock):
super(UNet, self).__init__()
self.inc = inconv(n_height, 64, dilation, input_batch_norm, circular_padding)
self.down1 = down(64, 128, dilation, group_conv, circular_padding)
self.down2 = down(128, 256, dilation, group_conv, circular_padding)
self.down3 = down(256, 512, dilation, group_conv, circular_padding)
self.down4 = down(512, 512, dilation, group_conv, circular_padding)
self.up1 = up(1024, 256, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
self.up2 = up(512, 128, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
self.up3 = up(256, 64, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
self.up4 = up(128, 64, circular_padding, group_conv = group_conv, use_dropblock=dropblock, drop_p=dropout)
self.dropout = nn.Dropout(p=0. if dropblock else dropout)
self.outc = outconv(64, n_class)
def forward(self, x):
x1 = self.inc(x)
x2 = self.down1(x1)
x3 = self.down2(x2)
x4 = self.down3(x3)
x5 = self.down4(x4)
x = self.up1(x5, x4)
x = self.up2(x, x3)
x = self.up3(x, x2)
x = self.up4(x, x1)
x = self.outc(self.dropout(x))
return x
class double_conv(nn.Module):
'''(conv => BN => LeakyReLU) * 2'''
def __init__(self, in_ch, out_ch,group_conv,dilation=1):
super(double_conv, self).__init__()
if group_conv:
self.conv = nn.Sequential(
nn.Conv2d(in_ch, out_ch, 3, padding=1,groups = min(out_ch,in_ch)),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True),
nn.Conv2d(out_ch, out_ch, 3, padding=1,groups = out_ch),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True)
)
else:
self.conv = nn.Sequential(
nn.Conv2d(in_ch, out_ch, 3, padding=1),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True),
nn.Conv2d(out_ch, out_ch, 3, padding=1),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True)
)
def forward(self, x):
x = self.conv(x)
return x
class double_conv_circular(nn.Module):
'''(conv => BN => LeakyReLU) * 2, with circular padding along the last dimension'''
def __init__(self, in_ch, out_ch,group_conv,dilation=1):
super(double_conv_circular, self).__init__()
if group_conv:
self.conv1 = nn.Sequential(
nn.Conv2d(in_ch, out_ch, 3, padding=(1,0),groups = min(out_ch,in_ch)),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True)
)
self.conv2 = nn.Sequential(
nn.Conv2d(out_ch, out_ch, 3, padding=(1,0),groups = out_ch),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True)
)
else:
self.conv1 = nn.Sequential(
nn.Conv2d(in_ch, out_ch, 3, padding=(1,0)),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True)
)
self.conv2 = nn.Sequential(
nn.Conv2d(out_ch, out_ch, 3, padding=(1,0)),
nn.BatchNorm2d(out_ch),
nn.LeakyReLU(inplace=True)
)
def forward(self, x):
#add circular padding
x = F.pad(x,(1,1,0,0),mode = 'circular')
x = self.conv1(x)
x = F.pad(x,(1,1,0,0),mode = 'circular')
x = self.conv2(x)
return x
class inconv(nn.Module):
def __init__(self, in_ch, out_ch, dilation, input_batch_norm, circular_padding):
super(inconv, self).__init__()
if input_batch_norm:
if circular_padding:
self.conv = nn.Sequential(
nn.BatchNorm2d(in_ch),
double_conv_circular(in_ch, out_ch,group_conv = False,dilation = dilation)
)
else:
self.conv = nn.Sequential(
nn.BatchNorm2d(in_ch),
double_conv(in_ch, out_ch,group_conv = False,dilation = dilation)
)
else:
if circular_padding:
self.conv = double_conv_circular(in_ch, out_ch,group_conv = False,dilation = dilation)
else:
self.conv = double_conv(in_ch, out_ch,group_conv = False,dilation = dilation)
def forward(self, x):
x = self.conv(x)
return x
class down(nn.Module):
def __init__(self, in_ch, out_ch, dilation, group_conv, circular_padding):
super(down, self).__init__()
if circular_padding:
self.mpconv = nn.Sequential(
nn.MaxPool2d(2),
double_conv_circular(in_ch, out_ch,group_conv = group_conv,dilation = dilation)
)
else:
self.mpconv = nn.Sequential(
nn.MaxPool2d(2),
double_conv(in_ch, out_ch,group_conv = group_conv,dilation = dilation)
)
def forward(self, x):
x = self.mpconv(x)
return x
class up(nn.Module):
def __init__(self, in_ch, out_ch, circular_padding, bilinear=True, group_conv=False, use_dropblock = False, drop_p = 0.5):
super(up, self).__init__()
# would be a nice idea if the upsampling could be learned too,
# but my machine do not have enough memory to handle all those weights
if bilinear:
self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
elif group_conv:
self.up = nn.ConvTranspose2d(in_ch//2, in_ch//2, 2, stride=2,groups = in_ch//2)
else:
self.up = nn.ConvTranspose2d(in_ch//2, in_ch//2, 2, stride=2)
if circular_padding:
self.conv = double_conv_circular(in_ch, out_ch,group_conv = group_conv)
else:
self.conv = double_conv(in_ch, out_ch,group_conv = group_conv)
self.use_dropblock = use_dropblock
if self.use_dropblock:
self.dropblock = DropBlock2D(block_size=7, drop_prob=drop_p)
def forward(self, x1, x2):
x1 = self.up(x1)
# input is BCHW
diffY = x2.size()[2] - x1.size()[2]
diffX = x2.size()[3] - x1.size()[3]
x1 = F.pad(x1, (diffX // 2, diffX - diffX//2,
diffY // 2, diffY - diffY//2))
# for padding issues, see
# https://github.com/HaiyongJiang/U-Net-Pytorch-Unstructured-Buggy/commit/0e854509c2cea854e247a9c615f175f76fbb2e3a
# https://github.com/xiaopeng-liao/Pytorch-UNet/commit/8ebac70e633bac59fc22bb5195e513d5832fb3bd
x = torch.cat([x2, x1], dim=1)
x = self.conv(x)
if self.use_dropblock:
x = self.dropblock(x)
return x
class outconv(nn.Module):
def __init__(self, in_ch, out_ch):
super(outconv, self).__init__()
self.conv = nn.Conv2d(in_ch, out_ch, 1)
def forward(self, x):
x = self.conv(x)
return x
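`double_conv_circular` pads the last dimension with `mode = 'circular'` before each convolution, so the first and last columns of the grid are treated as neighbours. The sketch below shows the same idea on a single row in plain Python. It is an illustration only: `circular_pad` and `conv1d_valid` are hypothetical helpers, and reading the last dimension as the azimuth axis is an assumption about the tensor layout.

```python
def circular_pad(row, pad=1):
    """Wrap `pad` values from each end of `row` around to the opposite end."""
    return row[-pad:] + row + row[:pad]


def conv1d_valid(row, kernel):
    """Plain 1D cross-correlation without any implicit padding."""
    k = len(kernel)
    return [sum(row[i + j] * kernel[j] for j in range(k))
            for i in range(len(row) - k + 1)]


ring = [1.0, 0.0, 0.0, 0.0]          # one row of cells around the sensor
padded = circular_pad(ring)          # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
out = conv1d_valid(padded, [1.0, 1.0, 1.0])
print(out)                           # [1.0, 1.0, 0.0, 1.0]
```

The output has the same length as the input row, and the last cell responds to the value in the first cell, which zero padding would not allow.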
[Code] PointNet + BEV
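import multiprocessing
import numba as nb
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_scatter
class ptBEVnet(nn.Module):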
def __init__(self, BEV_net, grid_size, pt_model = 'pointnet', fea_dim = 3, pt_pooling = 'max', kernal_size = 3,
out_pt_fea_dim = 64, max_pt_per_encode = 64, cluster_num = 4, pt_selection = 'farthest', fea_compre = None):
super(ptBEVnet, self).__init__()
assert pt_pooling in ['max']
assert pt_selection in ['random','farthest']
if pt_model == 'pointnet':
self.PPmodel = nn.Sequential(
nn.BatchNorm1d(fea_dim),
nn.Linear(fea_dim, 64),
nn.BatchNorm1d(64),
nn.ReLU(inplace=True),
nn.Linear(64, 128),
nn.BatchNorm1d(128),
nn.ReLU(inplace=True),
nn.Linear(128, 256),
nn.BatchNorm1d(256),
nn.ReLU(inplace=True),
nn.Linear(256, out_pt_fea_dim)
)
self.pt_model = pt_model
self.BEV_model = BEV_net
self.pt_pooling = pt_pooling
self.max_pt = max_pt_per_encode
self.pt_selection = pt_selection
self.fea_compre = fea_compre
self.grid_size = grid_size
# NN stuff
if kernal_size != 1:
if self.pt_pooling == 'max':
self.local_pool_op = torch.nn.MaxPool2d(kernal_size, stride=1, padding=(kernal_size-1)//2, dilation=1)
else: raise NotImplementedError
else: self.local_pool_op = None
# parametric pooling
if self.pt_pooling == 'max':
self.pool_dim = out_pt_fea_dim
# point feature compression
if self.fea_compre is not None:
self.fea_compression = nn.Sequential(
nn.Linear(self.pool_dim, self.fea_compre),
nn.ReLU())
self.pt_fea_dim = self.fea_compre
else:
self.pt_fea_dim = self.pool_dim
def forward(self, pt_fea, xy_ind, voxel_fea=None):
cur_dev = pt_fea[0].get_device()
# concate everything
cat_pt_ind = []
for i_batch in range(len(xy_ind)):
cat_pt_ind.append(F.pad(xy_ind[i_batch],(1,0),'constant',value = i_batch))
cat_pt_fea = torch.cat(pt_fea,dim = 0)
cat_pt_ind = torch.cat(cat_pt_ind,dim = 0)
pt_num = cat_pt_ind.shape[0]
# shuffle the data
shuffled_ind = torch.randperm(pt_num,device = cur_dev)
cat_pt_fea = cat_pt_fea[shuffled_ind,:]
cat_pt_ind = cat_pt_ind[shuffled_ind,:]
# unique xy grid index
unq, unq_inv, unq_cnt = torch.unique(cat_pt_ind,return_inverse=True, return_counts=True, dim=0)
unq = unq.type(torch.int64)
# subsample pts
if self.pt_selection == 'random':
grp_ind = grp_range_torch(unq_cnt,cur_dev)[torch.argsort(torch.argsort(unq_inv))]
remain_ind = grp_ind < self.max_pt
elif self.pt_selection == 'farthest':
unq_ind = np.split(np.argsort(unq_inv.detach().cpu().numpy()), np.cumsum(unq_cnt.detach().cpu().numpy()[:-1]))
remain_ind = np.zeros((pt_num,),dtype = np.bool_)
np_cat_fea = cat_pt_fea.detach().cpu().numpy()[:,:3]
pool_in = []
for i_inds in unq_ind:
if len(i_inds) > self.max_pt:
pool_in.append((np_cat_fea[i_inds,:],self.max_pt))
if len(pool_in) > 0:
pool = multiprocessing.Pool(multiprocessing.cpu_count())
FPS_results = pool.starmap(parallel_FPS, pool_in)
pool.close()
pool.join()
count = 0
for i_inds in unq_ind:
if len(i_inds) <= self.max_pt:
remain_ind[i_inds] = True
else:
remain_ind[i_inds[FPS_results[count]]] = True
count += 1
cat_pt_fea = cat_pt_fea[remain_ind,:]
cat_pt_ind = cat_pt_ind[remain_ind,:]
unq_inv = unq_inv[remain_ind]
unq_cnt = torch.clamp(unq_cnt,max=self.max_pt)
# process feature
if self.pt_model == 'pointnet':
processed_cat_pt_fea = self.PPmodel(cat_pt_fea)
if self.pt_pooling == 'max':
pooled_data = torch_scatter.scatter_max(processed_cat_pt_fea, unq_inv, dim=0)[0]
else: raise NotImplementedError
if self.fea_compre:
processed_pooled_data = self.fea_compression(pooled_data)
else:
processed_pooled_data = pooled_data
# stuff pooled data into 4D tensor
out_data_dim = [len(pt_fea),self.grid_size[0],self.grid_size[1],self.pt_fea_dim]
out_data = torch.zeros(out_data_dim, dtype=torch.float32).to(cur_dev)
out_data[unq[:,0],unq[:,1],unq[:,2],:] = processed_pooled_data
out_data = out_data.permute(0,3,1,2)
if self.local_pool_op is not None:
out_data = self.local_pool_op(out_data)
if voxel_fea is not None:
out_data = torch.cat((out_data, voxel_fea), 1)
# run through network
net_return_data = self.BEV_model(out_data)
return net_return_data
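def grp_range_torch(a, dev):
    # For group sizes a = [n0, n1, ...], return range(n0), range(n1), ...
    # concatenated into one int64 tensor on device `dev`.
    idx = torch.cumsum(a, 0)
    id_arr = torch.ones(idx[-1], dtype=torch.int64, device=dev)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    return torch.cumsum(id_arr, 0)
def parallel_FPS(np_cat_fea, K):
    # Plain-Python wrapper around the numba-compiled function; used as the target of Pool.starmap.
    return nb_greedy_FPS(np_cat_fea, K)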
@nb.jit('b1[:](f4[:,:],i4)',nopython=True,cache=True)
def nb_greedy_FPS(xyz,K):
start_element = 0
sample_num = xyz.shape[0]
sum_vec = np.zeros((sample_num,1),dtype = np.float32)
xyz_sq = xyz**2
for j in range(sample_num):
sum_vec[j,0] = np.sum(xyz_sq[j,:])
pairwise_distance = sum_vec + np.transpose(sum_vec) - 2*np.dot(xyz, np.transpose(xyz))
candidates_ind = np.zeros((sample_num,),dtype = np.bool_)
candidates_ind[start_element] = True
remain_ind = np.ones((sample_num,),dtype = np.bool_)
remain_ind[start_element] = False
all_ind = np.arange(sample_num)
for i in range(1,K):
if i == 1:
min_remain_pt_dis = pairwise_distance[:,start_element]
min_remain_pt_dis = min_remain_pt_dis[remain_ind]
else:
cur_dis = pairwise_distance[remain_ind,:]
cur_dis = cur_dis[:,candidates_ind]
min_remain_pt_dis = np.zeros((cur_dis.shape[0],),dtype = np.float32)
for j in range(cur_dis.shape[0]):
min_remain_pt_dis[j] = np.min(cur_dis[j,:])
next_ind_in_remain = np.argmax(min_remain_pt_dis)
next_ind = all_ind[remain_ind][next_ind_in_remain]
candidates_ind[next_ind] = True
remain_ind[next_ind] = False
return candidates_ind
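For readers without numba, the greedy farthest point sampling idea behind `nb_greedy_FPS` can be sketched in plain Python. This is a simplified illustration rather than a drop-in replacement: `greedy_fps` is a hypothetical name, it returns the chosen indices in selection order instead of a boolean mask, it keeps a running minimum distance instead of a full pairwise distance matrix, and it assumes at least one point and `k >= 1`.

```python
def greedy_fps(points, k):
    """Pick k point indices, each time taking the point farthest from those already chosen."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    k = min(k, len(points))
    chosen = [0]  # start from the first point, as the numba version does
    # min_dist[i] = squared distance from point i to its nearest chosen point
    min_dist = [sq_dist(p, points[0]) for p in points]
    while len(chosen) < k:
        remaining = [i for i in range(len(points)) if i not in chosen]
        nxt = max(remaining, key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
selected = greedy_fps(points, 3)
print(selected)  # [0, 2, 3]
```

The point at index 1 is skipped because it lies close to the starting point, so the sample spreads over the cell instead of clustering.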