dongwoo-im / short-paper-reivew

short-paper-reivew's Introduction

About Me

B.S. in Systems Management Engineering (= Industrial Engineering), Sungkyunkwan University

Awards and Honors

2024 SNUH Medical Image AI Challenge 2023 (Pathology data) 3rd Place
2023 Hyundai Motor AI Competition 2023 Excellent Award (2nd ~ 4th Place)
2023 ETRI Fashion-how Season 4 Encouragement Award (4th Place)
2023 Elice AI Edu Hackathon Development of Educational Products based on Generative AI Excellent Award (3rd Place)
2022 Naver CLOVA AI RUSH 2022 Review Image Score Classification Award (3rd Place)
2022 Naver Connect Boostcamp AI Tech 3rd A Clear Picture without Fine Dust Impressive Project Top3

AI Contest

Dacon Gold / Kaggle Expert
- 2024 Kaggle USPTO - Explainable AI for Patent Professionals top 13% Bronze Medal
- 2024 Kaggle NeurIPS 2024 - Predict New Medicines with BELKA top 5% Silver Medal
- 2024 Dacon Domain Specific QA Handling Questions and Answers about Plastered top 3%
- 2024 Dacon Solving Jigsaw Puzzle Puzzle Image AI Competition top 3%
- 2023 Dacon Visual Question Answering Image Question Answering 5/102
- 2022 Dacon Image & Tabular Classification Breast Cancer Metastasis Prediction top 11%
- 2022 Dacon Image & Text Classification Tourism PoI Category Classification top 7%
- 2022 Dacon Super Resolution (x4 Upscaling) Open Source Competition (AI YANGJAE HUB) top 13%
- 2022 Dacon Image & Tabular Classification Plant Disease Classification considering Environmental Information top 13%
Others
- 2024 LG Aimers 4th B2B Sales Opportunity Forecasting top 2%
- 2024 SNUH Medical Image AI Challenge 2023 Image & Tabular Regression Melanoma Recurrence Prediction 3/12
- 2023 K-water AI Competition 3rd Object Detection Fish Detection 5/69
- 2023 ETRI Fashion-how Season 4 (Sub-task1) Image Classification Fashion Image Multi-label Classification 2/19
- 2023 ETRI Fashion-how Season 4 (Sub-task2) Image Classification Fashion Image Color Imbalanced Classification 2/16
- 2023 ETRI Fashion-how Season 4 (Sub-task4) Reranking Reranking Coordi Set (with zero-shot item) Using Dialog 4/12
- 2022 CLOVA AI RUSH 2022 (Round 2) Image Classification Review Image Score Classification 3/15
- 2022 CLOVA AI RUSH 2022 (Round 1) Image Classification Face Age Group Classification 13/37
- 2022 Boostcamp AI Tech 3rd Semantic Segmentation Recycle Trash Segmentation 2/19
- 2022 Boostcamp AI Tech 3rd Text Detection OCR Text Detection 6/19
- 2022 Boostcamp AI Tech 3rd Object Detection Recycle Trash Detection 9/19
- 2022 Boostcamp AI Tech 3rd Image Classification Mask Wearing Status Classification 34/48

short-paper-reivew's Issues

I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

github : https://github.com/facebookresearch/ijepa

1. Introduction

Invariance-based (contrastive) pre-training relies on hand-crafted, view-based augmentations, so it ends up suited to specific downstream tasks (e.g., image classification).

Masked pre-training is not affected by these augmentations, so it extends easily to other modalities (data2vec),
but the reconstruction decoder appears to push it toward representations with lower-level semantics. (MAE, MSN)

I-JEPA can extract representations with high-level semantic information without extra prior knowledge. How?

abstract prediction target : I-JEPA predicts target representations.
multi-block masking strategy

2. Background

Each architecture, described from the image perspective

(a) JEA : invariance-based learning (ex. contrastive learning)
(b) GA : reconstruction-based learning (ex. masked image modeling)
- the (learnable) mask and position tokens correspond to z
- that is, in this case z carries less information than y, which is said to prevent representation collapse.
(c) JEPA : I-JEPA
- unlike JEA, it does not extract invariant representations; it predicts conditioned on an additional z
- unlike GA, the loss function is applied in embedding space rather than input space.
- the x-encoder and y-encoder have an asymmetric design

3. Method

Given a context block, the model predicts the representation of each target block; the target encoder is ultimately used for evaluation.
The context encoder, predictor, and target encoder are all ViTs
- the predictor is a narrow ViT, but its number of self-attention heads is matched to the encoder
- no [CLS] token is used (average-pooled representation on the last layer or the last 4 layers)
I-JEPA differs from MAE in that it is a non-generative model and prediction is performed at the representation level.

Targets

target scale = (0.15, 0.2)
Targets may overlap one another, and the authors use 4 targets.
Masking is applied to the output of the target encoder.

The figure here is from data2vec, which resembles I-JEPA in that it predicts the teacher's representation.

Context

context scale = (0.85, 1.0)
So that each prediction is meaningful, regions of the context block that overlap a target block are removed.
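
The target and context sampling described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the 14x14 patch grid (224 / 16), the aspect-ratio ranges, and the function names `sample_block` and `sample_masks` are assumptions made for the example; only the scale ranges, the 4 targets, and the overlap removal come from the notes.

```python
import math
import random

def sample_block(grid, scale, aspect, rng):
    """Sample a rectangular block of patch coordinates on a grid x grid map."""
    area = rng.uniform(*scale) * grid * grid      # block area in patches
    ratio = rng.uniform(*aspect)                  # height / width (assumed range)
    h = min(grid, max(1, round(math.sqrt(area * ratio))))
    w = min(grid, max(1, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

def sample_masks(grid=14, num_targets=4, seed=0):
    """Return (context, targets) as sets of patch coordinates."""
    rng = random.Random(seed)
    # target scale = (0.15, 0.2); targets are allowed to overlap each other
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    # context scale = (0.85, 1.0); a square block is assumed here
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    for t in targets:
        context -= t                              # drop patches that overlap a target
    return context, targets
```

Because the overlap is subtracted after sampling, the context that the encoder actually sees is usually much smaller than its nominal 0.85 to 1.0 scale.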

Prediction

A single predictor predicts all 4 targets.

Loss

An L2 distance loss is applied to patch-level representations.
The context encoder and predictor are updated by this loss, while the target encoder is updated by EMA.
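
A minimal sketch of these two pieces, using plain Python lists in place of tensors. The momentum value 0.996 and the helper names `l2_loss` and `ema_update` are assumptions for illustration; the notes only state that the loss is an L2 distance on patch-level representations and that the target encoder follows an EMA.

```python
def l2_loss(pred, target):
    """Squared L2 distance per patch, averaged over patches."""
    total = 0.0
    for p_vec, t_vec in zip(pred, target):
        total += sum((p - t) ** 2 for p, t in zip(p_vec, t_vec))
    return total / len(pred)

def ema_update(target_params, context_params, momentum=0.996):
    """Move target-encoder weights toward the context-encoder weights.

    No gradient flows into the target encoder; it only tracks this average.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]
```

In a training loop, `l2_loss` would be backpropagated through the predictor and context encoder, and `ema_update` would be called once per step afterwards.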

Evaluation

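Overall, the I-JEPA models reported tend to be larger, yet their compute cost is lower.
- no hand-crafted augmentation is involved
- one iteration takes slightly longer than MAE, but the authors claim 5x faster convergence
- this may also be an effect of torch autocast with bfloat16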

left : linear evaluation on ImageNet
right : low-shot semi-supervised learning on ImageNet-1%
- the higher of linear probing and partial fine-tuning is reported
- (similar recipe to Semi-ViT)

left : linear probing transfer on image classification
right : linear probing transfer on low-level task (object counting, depth prediction)
- in the ablations, increasing the model size reportedly did not improve low-level task performance.

Predictor Visualization

Does the predictor predict an appropriate latent at each mask-token position (i.e., is prediction conditioned on z possible in the JEPA architecture)?
This is visualized with Meta's RCDM framework. (Columns 3, 4, 5, and 6 are results from different seeds.)

The figure is from Representation-Conditioned Diffusion Model (RCDM),
which trains a diffusion-based decoder conditioned on a pretrained SSL representation and uses it for visualization.

Ablations

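When pixels are used as the target, as in MAE, performance degrades severely.
The authors conjecture that targeting representations enables abstract prediction that does not depend on pixel detail, and that this was the key factor.

The authors conjecture that removing the context regions that overlap the targets was also a key factor.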

HyperFields: Towards Zero-Shot Generation of NeRFs from Text

project page : https://threedle.github.io/hyperfields/
github : https://github.com/threedle/hyperfields

HyperFields

generalized text-conditioned NeRF synthesis
Performance on unseen prompts is decent, and even when fine-tuning is needed it converges quickly.

dynamic hypernetwork

learn a smooth mapping from text token embeddings to the space of NeRFs
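Unlike a typical hypernetwork, it has a progressive and dynamic structure (Figure 3)
The authors judged training with batch size 1 to be impractical, so the teacher is trained with batch size 3 and the student with batch size 2
- they conjecture that this is what improved unseen-prompt performance
The text encoder is BERT. (They also tried CLIP and T5; BERT was reportedly marginally better.)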

NeRF distillation training

distill scenes encoded in individual NeRFs into one dynamic hypernetwork
The individual NeRFs are trained with the SDS loss (like DreamFusion), whereas the hypernetwork is trained by NeRF distillation. (Figure 2)

Understanding Masked Autoencoders via Hierarchical Latent Variable Models

arxiv : https://arxiv.org/abs/2306.04898

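The related work section is worth reading.

This paper analyzes MAE theoretically from a hierarchical latent variable perspective. (This kind of modeling already existed in causal discovery, generation, and other areas.)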

Causal
- Latent Hierarchical Causal Structure Discovery with Rank Constraints
- Identification of Linear Non-Gaussian Latent Hierarchical Structure
Generation
- BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling
- NVAE: A Deep Hierarchical Variational Autoencoder

Contribution

That MAE can be described as a hierarchical data-generating process provides a theoretical basis for MAE being able to model hierarchy. (ConvNeXt-v2's FCMAE comes to mind.)
It explains how MAE's key hyperparameters, masking ratio and patch size, affect the learned representation.
This is validated with experiments showing that representation learning works poorly when the masking ratio is very small or very large.

The process is assumed to be a DAG. (inverted)

Notation

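$x_m$ denotes the masked tokens and $x_{m^c}$ the visible tokens.
The middle node in the first row, c, denotes a subset of the representation Z, and c is assumed to be minimal.
$s$ corresponds to additional information such as positional embeddings and the [MASK] token.

The proof is in Appendix B.

The main point seems to be that encoding and decoding can each be separated through the shared information $c$, and remain invertible even when separated.

In the end, the blue z, which is involved in both the masked and the visible tokens, can be interpreted as the representation that is actually learned.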

How does MAE work?

MAE provably recovers high-level representations from low-level features like pixels

How does masking influence the learned representation?

MAE under different masking intensities learns representations of different abstraction levels
Learning high-level representations is very hard with extreme masking

Is current MAE optimal for representation learning?

Learning high-level representations can be challenging for random masking
The authors argue that MIM's weaker linear-probing performance relative to CL stems from random masking.

SSIM/FSIM evaluate high-level representations, while PSNR/(-)MSE evaluate low-level representations.
With a very high masking ratio of 0.9, the results show that low-level representations are learned.
Notably, following the paper Exploring Long-Sequence Masked Autoencoders, the ViT patch_size is fixed at 8 and masking is applied in 1x1, 2x2, and 4x4 blocks of patches.
- the authors describe this as decoupling the ViT patch from the masking patch.
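
The decoupling can be illustrated with a small sketch in which the ViT patch grid stays fixed while the masking unit is a block of patches. The 28x28 grid (a 224-pixel image with patch_size 8) and the function name `blockwise_mask` are assumptions for this example, not details taken from the paper's code.

```python
import random

def blockwise_mask(grid=28, block=2, mask_ratio=0.75, seed=0):
    """Return a grid x grid 0/1 mask whose masked units are block x block patches."""
    assert grid % block == 0
    rng = random.Random(seed)
    coarse = grid // block                         # number of masking units per side
    units = [(r, c) for r in range(coarse) for c in range(coarse)]
    masked_units = set(rng.sample(units, round(mask_ratio * len(units))))
    # Every ViT patch inherits the decision of the masking unit it falls into.
    return [[1 if (r // block, c // block) in masked_units else 0
             for c in range(grid)] for r in range(grid)]
```

With `block=1`, `2`, or `4` the effective masking patch covers 8, 16, or 32 pixels per side, while the encoder always sees 8-pixel tokens.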

With a mask ratio of 0.9, attention on object-related tokens is strongly activated.

The higher the mask ratio and the larger the patch size, the better the t-SNE visualization appears.

(left) Table 1 : performance is best with a moderate mask ratio (0.75) and a larger patch size (~32).
(left) Table 3 : a moderate mask ratio (0.75) also performs well on detection & segmentation.
(right) Table 4 : when processed the original MAE way, a moderate patch size (16) performs best.
(right) Table 5 : the same holds for detection & segmentation.

This suggests that decoupling the ViT patch from the mask patch may be quite important in MAE.

HQ-SAM: Segment Anything in High Quality

github : https://github.com/SysCV/SAM-HQ

[Introduction]

Limitations of SAM

coarse mask boundaries (ex. thin object)
- SA-1B, on which SAM was trained, consists of generated masks, so mask quality on complex structures is poor

HQ-SAM : adds roughly 0.5% extra parameters relative to the original SAM (SAM is frozen)

learnable HQ-Output token
global-local feature fusion

HQSeg-44K : training data (6 datasets + fine-grained masks)

[Related work]

high-quality segmentation

Post-segmentation refinement methods such as CRF and region growing struggle to exploit high-level semantics.
So the image encoder features and mask decoder features are fused to reflect high-level semantics (extending the original 64x64 up to 256x256),
and an HQ-Output Token that measures the difference between SAM's output token and the GT mask is added to improve mask quality.
That is, it follows the original SAM process while aiming to improve segmentation performance on the target (tiny objects).

[Method]

SAM

In each attention layer of the decoder, point embeddings are added to the tokens and position embeddings to the image (on q, k)
That is, position-aware two-way attention is performed between the image embedding and the token embedding

HQ-SAM

high-quality output token
- time/data efficient, since only a small number of parameters are added compared to the existing approach
- SAM is frozen, which prevents overfitting
global-local fusion for high-quality features
- feature 3개 사용
  - early layer : local (64x64)
  - final layer : global (64x64)
  - mask feature (256x256)
- convolution on 3 features
training and inference
- sample mixed types of prompts
- add random Gaussian noise in the boundary of GT mask

The attention map of the HQ-Output Token is more detailed.

[Experiments]

Because HQ-SAM targets SAM's weak boundaries, boundary metrics are added (those prefixed with B are boundary metrics)

SAM (baseline)
- Performance is notably low on the DIS and ThinObject datasets, presumably because the training data of those datasets is included in HQSeg-44K. (According to Table 16 in the supplementary material, a considerable gap remains even when that training data is excluded.)
Using SAM's mask decoder feature
- SAM + HQ-Output Token (X Output Token) : presumably an experiment that replaces the Output Token with the HQ-Output Token
- SAM + HQ-Output Token (Boundary Loss) : presumably an experiment that applies the mask loss only to the boundary region
Using Our HQ-Feature
- unsurprisingly, using the HQ-Feature performs better than using the SAM feature.

Interestingly, performance drops sharply when the decoder mask feature is not fused
- presumably because HQ-SAM relies on SAM's result while predicting the boundary mask that SAM fails to predict

Training the whole SAM : overfitting
Finetune SAM's decoder / post-refinement : overfitting on COCO
HQ-SAM : meaningful improvement compared with the "Finetune SAM's output token" experiment

It also outperforms SAM on a variety of other tasks:

Results on the SGinW Benchmark
Zero-Shot Open-world Segmentation
Zero-Shot Segmentation on High-resolution BIG Dataset
Zero-shot Instance Segmentation on COCO and LVIS
Point-based Interactive Segmentation Comparison
Zero-shot High-quality Video Instance Segmentation

It is also much more robust to noise.

DDPM: Denoising Diffusion Probabilistic Models

github : https://github.com/hojonathanho/diffusion

Abstract

weighted variational bound (connection between DPM and denoising score matching with Langevin dynamics)
progressive lossy decompression scheme (= generalization of autoregressive decoding)

1. Introduction

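Diffusion models can generate high-quality images
- through a parameterization, the paper shows that diffusion models are equivalent to denoising score matching
However, their log-likelihood is not competitive with other types of generative models.
- this is explained from the lossy-compression view of information theory (the interpretation being that log-likelihood is poor because generation happens at high distortion),
- and the sampling procedure of diffusion models is argued to be interpretable as progressive decoding (similar to autoregressive decoding)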

2. Background

3. Diffusion models and denoising autoencoders

3.1 Forward process and $L_T$

The forward process variances $\beta_t$ are fixed to constants, so $L_T$ has no learnable parameters.

3.2. Reverse process and $L_{1:T-1}$

[First] $\Sigma_\theta(x_t, t)$

The reverse process variance $\Sigma_\theta(x_t, t)$ is set to $\sigma_t^2 I$, an untrained time-dependent constant.
The following two choices reportedly gave similar experimental results, and each is optimal for a different case.

$\sigma_t^2 = \beta_t$ : optimal for $x_0 \sim \mathcal{N}(0, I)$
$\sigma_t^2 = \tilde\beta_t = \frac {1-\bar\alpha_{t-1}} {1-\bar\alpha_t} \beta_t$ : optimal for $x_0$ deterministically set to one point
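
To see how close the two variance choices are in practice, they can be computed side by side. This sketch assumes a linear schedule with T = 1000 steps running from $10^{-4}$ to $0.02$; the helper names are hypothetical.

```python
def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced forward-process variances (assumed schedule)."""
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

def variance_choices(betas):
    """Return (beta_t, beta_tilde_t) as lists indexed by t - 1."""
    alpha_bar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b                    # alpha_bar_t = prod_s (1 - beta_s)
        alpha_bar.append(prod)
    tilde = []
    for t, b in enumerate(betas):
        abar_prev = alpha_bar[t - 1] if t > 0 else 1.0   # alpha_bar_0 = 1
        tilde.append((1.0 - abar_prev) / (1.0 - alpha_bar[t]) * b)
    return betas, tilde
```

The second choice is never larger than the first and is exactly zero at t = 1; for large t the two nearly coincide, which fits the observation that both give similar results.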

[Second] $\mu_\theta(x_t,t)$

Express $L_{t-1}$ using Equations 1 and 6.

Here we use $x_t$ reparameterized in terms of $x_0$ and $\epsilon$ from Equation 4.

Again, moving the constant $C$ in Equation 8 aside,

substitute $\tilde \mu_t$ from Equation 7 and simplify.

Based on Equation 10, $\mu_\theta$ is then reparameterized. Here $\epsilon_\theta$ denotes the $\epsilon$ predicted from $x_t$.

That is, given the $\Sigma_\theta(x_t, t)$ and $\mu_\theta(x_t,t)$ obtained above, sampling $x_{t-1}$ with the reverse process $p_\theta$ is as follows.
If $\epsilon_\theta$ is viewed as a learned gradient of the data density, the authors argue that this expression resembles Langevin dynamics.

Also, applying both reparameterizations to Equation 10 yields Equation 12,
which the authors argue can be seen as the variational bound obtained from Equation 11 (which resembles Langevin dynamics).
Furthermore, since training uses an objective resembling denoising score matching, the view is that this can be regarded as variational inference.

In summary, the reverse process can be learned with a mean approximator $\mu_\theta$ that predicts $\tilde \mu_t$ (Equation 8),
and with the proposed reparameterization the reverse process can also be learned by predicting $\epsilon$. (Equation 12)
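
For reference, the numbered equations used in this derivation can be written as follows. They are restated from memory of the DDPM paper with $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, so the numbering should be checked against the paper.

$$
\begin{aligned}
\text{(4)}\quad & q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right),
\qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon \\
\text{(7)}\quad & \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t \\
\text{(8)}\quad & L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,\left\lVert \tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t) \right\rVert^2\right] + C \\
\text{(11)}\quad & \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) \\
\text{sampling}\quad & x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I) \\
\text{(12)}\quad & \mathbb{E}_{x_0,\epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\,\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right) \right\rVert^2\right]
\end{aligned}
$$

Substituting (7) and (11) into (8) and writing $x_0$ in terms of $x_t$ and $\epsilon$ via (4) is what turns the mean-matching loss into the $\epsilon$-matching loss (12).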

3.3. Data scaling, reverse process decoder, and $L_0$

Image pixels range over 0 ~ 255 but are linearly scaled to [-1, 1].
Taking this into account, the $L_0$ term acts as a reverse process decoder for computing a discrete log-likelihood.

It is a product of per-pixel Gaussian densities over an image of dimension D,
and the integration bounds depend on the value of the input pixel.

The widening of the integration range by 1/255 appears to follow the paper below.
Section C.5, Discretized Logistic Likelihood, of Improved Variational Inference with Inverse Autoregressive Flow

With this design, discrete data can reportedly be generated under the variational bound without adding extra noise or computing the Jacobian of the scaling operation,
and a more powerful decoder such as a conditional autoregressive model is suggested as future work.
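
The per-pixel term of this decoder can be sketched with the standard normal CDF. The function names are hypothetical and a single pixel with scalar mean `mu` and standard deviation `sigma` is assumed; the bin edges follow the description above (half-width 1/255, open-ended at the two extremes).

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def scale_pixel(k):
    """Map an integer pixel value in 0..255 linearly to [-1, 1]."""
    return k / 127.5 - 1.0

def discrete_pixel_prob(k, mu, sigma):
    """Probability mass that N(mu, sigma^2) assigns to the bin of pixel value k."""
    x = scale_pixel(k)
    # The top bin extends to +infinity and the bottom bin to -infinity.
    upper = 1.0 if k == 255 else normal_cdf((x + 1.0 / 255.0 - mu) / sigma)
    lower = 0.0 if k == 0 else normal_cdf((x - 1.0 / 255.0 - mu) / sigma)
    return upper - lower
```

Since neighbouring bins share an edge, the 256 probabilities sum to one, which is what makes the result a proper discrete likelihood.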

3.4. Simplified training objective

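Finally, simplifying Equation 12 makes implementation easier and reportedly even improves sample quality.

First, the case t = 1 corresponds to $L_0$; compared with Equation 13, $\sigma_1^2$ is ignored.
The cases t > 1 correspond to an unweighted version of Equation 12, a form that is said to resemble the denoising score matching model of NCSN.

In particular, the original weighted ELBO places high weight on small t; lowering that weight lets the model focus on large t, which is harder to learn.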
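
A minimal sketch of one Monte Carlo sample of this simplified objective, in pure Python. The linear schedule with T = 1000, the per-dimension mean, and the names `alpha_bars`, `simple_loss`, and `eps_model` are assumptions for illustration; `eps_model` stands in for the network $\epsilon_\theta$.

```python
import math
import random

def alpha_bars(T=1000, beta_1=1e-4, beta_T=0.02):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        out.append(prod)
    return out

def simple_loss(x0, eps_model, abar, rng):
    """One sample of the unweighted epsilon-prediction loss for data point x0."""
    t = rng.randrange(len(abar))                   # uniform timestep (0-based index)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]        # target noise
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    x_t = [a * x + s * e for x, e in zip(x0, eps)]  # forward-process sample
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)
```

Note that no time-dependent weight multiplies the squared error, which is the whole difference from the weighted bound.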

4. Experiments

Set $\beta_1 = 10^{-4}$, $\beta_T = 0.02$. This makes $L_T \approx 10^{-5}$ bits/dim
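
The order of magnitude of $L_T$ can be checked numerically. The sketch below assumes a linear schedule with T = 1000 and evaluates the closed-form KL divergence between $q(x_T \mid x_0)$ and $\mathcal{N}(0, 1)$ for a single dimension; `x0_sq` is the squared (scaled) pixel value and the function name is hypothetical.

```python
import math

def prior_kl_bits_per_dim(x0_sq=1.0, T=1000, beta_1=1e-4, beta_T=0.02):
    """KL( q(x_T | x_0) || N(0, 1) ) for one dimension, in bits."""
    abar = 1.0
    for i in range(T):
        abar *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
    # q(x_T | x_0) = N(sqrt(abar) * x0, 1 - abar); KL to N(0, 1) in nats:
    var = 1.0 - abar
    kl_nats = 0.5 * (var + abar * x0_sq - 1.0 - math.log(var))
    return kl_nats / math.log(2.0)
```

With `x0_sq=1.0`, the largest value possible for data in [-1, 1], the result is on the order of $10^{-5}$ bits, so almost no information about $x_0$ survives to $x_T$.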

left : performance on CIFAR10 is competitive, and there is almost no train-test overfitting.
right : $\epsilon$ prediction performs well compared with the baseline approach, and the simplified version brings an even larger improvement.

4.3. Progressive coding

Analyzed from the lossy-compression side of information theory,

rate : reverse process $L_{1:T-1}$ -> 1.78 bits/dim
distortion : reverse process decoder $L_0$ -> 1.97 bits/dim

That is, the reverse process decoder holds more information than the reverse process,
and the authors stress that high-quality CIFAR10 images can be generated even with such high distortion.
(I do wonder whether the data scale affects the calculation above.)

progressive lossy compression

Going further, the behavior at each time step is examined on the basis of rate-distortion theory.
Sampling here uses the baseline method,
and $x_0$ is estimated in a way similar to the equation that reparameterizes $x_t$ in terms of $x_0$ and $\epsilon$.
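
The estimate amounts to inverting the forward reparameterization with the predicted noise in place of the true noise. A minimal sketch, with hypothetical helper names and plain lists in place of image tensors:

```python
import math

def predict_x0(x_t, eps_pred, abar_t):
    """Invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for x0."""
    a, s = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [(x - s * e) / a for x, e in zip(x_t, eps_pred)]

def rmse(a, b):
    """Root mean squared error, used as the distortion measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```

When the predicted noise equals the true noise the estimate recovers $x_0$ exactly, so the distortion measures how far the prediction is from that ideal at each step.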

At low rate the distortion falls quickly, and as the rate becomes high the distortion decreases slowly.

distortion : rmse between estimated $\hat{x_0}$ and real $x_0$
rate : cumulative number of bits
x axis : T - t

progressive generation

Connection to autoregressive decoding

Let us rewrite Equation 5 in another form, without $x_0$.
If T is regarded as the number of pixels, the view is that this amounts to training an autoregressive model over pixels.

4.4. Interpolation

TODO

Image

VLM

Mamba

Gen Image

Video

Efficient LLM

LLM

ETC

Good Article

SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions

Project : https://syncdiffusion.github.io/

When diffusion models are used for montage generation such as image outpainting, it is reportedly hard to keep the whole image coherent.

SyncDiffusion claims to solve this by applying guidance at the latent level so that each window stays coherent with an anchor window. (The noisy image is updated by gradient descent so that the LPIPS score difference decreases.)

MultiDiffusion, which appears in Algorithm 1, is an earlier paper on the montage generation task. It denoises the large image and then uses the results of denoising small images with a pretrained diffusion model as guidance, which makes large-image generation possible. (It uses a pixel-distance-based loss named the FTD loss.)

In any case, line 7 is the expression from that paper that lets the noise predicted in each small image spread into the global space

The target scenario is Text-Guided Panorama Generation. (pretrained model = Stable Diffusion 2.0)

To generate a 512 x 3072 image (the latent is 64 x 384),
each window is 512 x 512 with a stride of 128. (the latent stride is 16)
That is, this scenario needs 21 windows in total (center window = anchor)
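
The window count follows directly from the sliding-window arithmetic, (384 - 64) / 16 + 1 = 21 in latent space. A small sketch, where `window_starts` is a hypothetical helper:

```python
def window_starts(length, window, stride):
    """Left edges of sliding windows of the given size along one axis."""
    return list(range(0, length - window + 1, stride))

# Latent space: width 384, window 64, stride 16.
latent_starts = window_starts(length=384, window=64, stride=16)
# The middle window is taken as the anchor.
anchor_index = len(latent_starts) // 2
```

The same count results in pixel space, since (3072 - 512) / 128 + 1 is also 21.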

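w is the gradient descent weight, and its weight decay is reportedly set to 0.95

Evaluation metrics

Coherence : split into non-overlapping windows, then average the LPIPS / Style loss values over each pair
- Intra-LPIPS
- Intra-Style-L
Fidelity
- Mean-GIQA : split into non-overlapping windows, then compute the GIQA metric between the anchor window and the remaining windows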
Fidelity & Diversity
- FID
- KID
Compatibility with Prompt
- Mean-CLIP-S

Diversity appears to be slightly sacrificed to improve coherence. (compared with MultiDiffusion)

Based on the user study, the generation quality can probably be considered better

It is also said to be applicable to layout-guided image generation and 360-degree panorama generation.

Adjusting w lets one trade off coherence against diversity.

LightGlue: Local Feature Matching at Light Speed

github : https://github.com/cvg/LightGlue

wip

QueryOTR: Outpainting by Queries

github : https://github.com/Kaiseem/QueryOTR

QueryOTR = Query Outpainting TRansformer

CNNs cannot capture long-range dependencies, so they are ill-suited to outpainting. -> ViT

The image outpainting task is formulated as patch-wise seq2seq autoregression

hybrid ViT-based encoder-decoder framework (MAE based Generator)

Pipeline : pretrained encoder - QEM - decoder - PSM
QEM = query expansion module
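- the existing tokens are used as keys and values in decoding.
- the queries are the expanded patches, obtained by passing random noise through a residual block
PSM = patch smoothing module
- averaging is performed to reduce the discrepancy between the original image and the expanded region
  - the original MAE uses a linear mapping to convert tokens to pixel level,
  - whereas QueryOTR uses a ConvTranspose2D module, which enables a mapping that considers multiple tokens jointly.

QEM : convergence is faster and performance improves. (noise = sampling) (DC = deform conv)
PSM : both the generated results and the metrics improve. (per-patch norm : from MAE ?)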

objective

patch-wise reconstruction loss = MAE recon loss
- only the recon loss is used during the warmup stage
perceptual loss : (multi-scale) VGG-19 network pretrained on ImageNet
adversarial loss : (multi-scale) (CNN) PatchGAN discriminator (least squared loss -> hinge loss)
- discriminator regularization : DiffAugment + Spectral normalization

Without the pretrained encoder the performance gap is not large, but convergence speed reportedly differs

x1, x2, x3 are not generated in one shot but iteratively.

The version above evaluates the other models following the QueryOTR protocol.
That is, QueryOTR performs well in generation quality on the outer region, which is the essence of outpainting.

Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning

github : https://github.com/bair-climate-initiative/scale-mae

[Motivation]

In the satellite domain image scales vary, so distance in the image differs from real-world distance
With scale-unaware training, no amount of training guarantees generalization to unseen cases

[Main factor]

GSDPE (Ground Sample Distance Positional Encoding) : enables understanding of position and scale
Laplacian-pyramid decoder : enables learning multi-scale representations
The authors claim to be the first among MAE variants to be scale-aware and to use a Laplacian pyramid

[Main Figure]

[GSDPE]

A g/G term is added to the original PE, which makes it scale-aware
It is injected twice: before the MAE encoder and at the demask step
- TODO: look up the related insights and ablations
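
One way to read the g/G term is as a rescaling of the position before the usual sinusoidal encoding. The sketch below is written under that assumption: `g` is the image's ground sample distance, `G` a reference GSD (taken as 1.0 here), and the function name and 1D setting are hypothetical simplifications.

```python
import math

def gsd_positional_encoding(pos, dim, g, G=1.0):
    """Sinusoidal encoding of position `pos`, scaled by the GSD ratio g / G."""
    scaled = pos * g / G                       # coarser imagery stretches positions
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000.0 ** (i / dim))    # same frequencies as the standard PE
        enc.append(math.sin(scaled * freq))
        enc.append(math.cos(scaled * freq))
    return enc[:dim]
```

With `g == G` this reduces to the standard encoding, and doubling `g` makes patch 3 receive the code that patch 6 would get at the reference scale, so the encoding tracks ground distance rather than pixel distance.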

[Image]

$I_{hr}$ = initial higher resolution image : random crop (448x448) from original image
$I$ = input image : $I_{hr}$ downsample to 224x224
high-freq GT : $I_{hr}$ downsample to 56x56 and upsample to 448x448 and subtract from $I_{hr}$
- capture object edges, roads, and building outlines
low-freq GT : $I_{hr}$ downsample to 14x14 and upsample to 224x224
- capture color gradients and landscapes
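
The two ground-truth targets can be sketched on a toy image. Average pooling for downsampling and nearest-neighbour upsampling are assumptions (the notes do not say which interpolation is used), and the helper names are hypothetical; only the resize factors follow the sizes listed above.

```python
def downsample(img, f):
    """Average-pool a square 2D list by an integer factor f."""
    n = len(img) // f
    return [[sum(img[r * f + i][c * f + j] for i in range(f) for j in range(f)) / (f * f)
             for c in range(n)] for r in range(n)]

def upsample(img, f):
    """Nearest-neighbour upsample a square 2D list by an integer factor f."""
    n = len(img) * f
    return [[img[r // f][c // f] for c in range(n)] for r in range(n)]

def frequency_targets(i_hr):
    """Return (high_freq, low_freq) targets for a square image i_hr."""
    # High-frequency: down by 8 and back up (448 -> 56 -> 448), then subtract.
    blurred = upsample(downsample(i_hr, 8), 8)
    high = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(i_hr, blurred)]
    # Low-frequency: down by 32, then up by 16 (448 -> 14 -> 224).
    low = upsample(downsample(i_hr, 32), 16)
    return high, low
```

The high-frequency target is a residual, so it is zero wherever the image is locally flat, while the low-frequency target keeps only coarse structure at half the original side length.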

[Decoder]

decoding : standard MAE decoder (8 layers -> 3 layers)
upsampling : upsample x2 and x4, passed to laplacian blocks
reconstruction : laplacian blocks (feature mapping, upsample, reconstruction) with L1 loss and L2 loss
