
ipvit's Introduction

Hi there 👋

  • 🔭 My research interests are in robust visual perception: understanding and explaining AI behavior through adversarial machine learning, temporal perception, representation learning (self-supervision, self-distillation, self-critique), and exploring the role of large language models (LLMs) in building visual AI systems.
  • 🌱 You are welcome to explore my research work, along with the code provided below. Seven of the papers were accepted as Orals/Spotlights at ICLR, NeurIPS, AAAI, CVPR, BMVC, and ACCV.
  • 📫 How to reach me: [email protected]
  • ⚡ Fun fact: I am really into fitness and have been thinking of joining the gym for quite some time now 😄

🌱 Repositories

| Topic | Application | Paper | Repo | Venue |
|---|---|---|---|---|
| Vision-Language Learning | Composed Video Retrieval | Composed Video Retrieval via Enriched Context and Discriminative Embeddings | composed-video-retrieval | CVPR'24 |
| Self-supervision | Multi-Spectral Satellite Imagery | Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery | satmae_pp | CVPR'24 |
| Vision-Language Learning | Video Grounding | Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | CVPR'24 |
| Vision-Language Learning | Language-Driven VLM for Remote Sensing | GeoChat: Grounded Large Vision-Language Model for Remote Sensing | GeoChat | CVPR'24 |
| Vision-Language Learning | Leveraging LLMs to Generate Complex Scenes (Zero-Shot) | LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | llmblueprint | ICLR'24 |
| Self-supervision | Self-structural Alignment of Foundational Models (Zero-Shot) | Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment | S3A | AAAI'24-Oral |
| Vision-Language Learning | Test-Time Alignment of Foundational Models (Zero-Shot) | Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization | PromptAlign | NeurIPS'23 |
| Vision-Language Learning | Regulating Foundational Models | Self-regulating Prompts: Foundational Model Adaptation without Forgetting | PromptSRC | ICCV'23 |
| Network Engineering | Video Recognition | Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition | Video-FocalNets | ICCV'23 |
| Vision-Language Learning | Face Anti-spoofing | FLIP: Cross-domain Face Anti-spoofing with Language Guidance | FLIP | ICCV'23 |
| 3D Medical Segmentation | Adversarial Training | Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation | VAFA | MICCAI'23 |
| Vision-Language Learning | Facial Privacy | CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search | Clip2Protect | CVPR'23 |
| Vision-Language Learning | Video Recognition (Zero-Shot) | Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting | Vita-CLIP | CVPR'23 |
| Prompt learning | Image Recognition (Category Discovery) | PromptCAL for Generalized Novel Category Discovery | PromptCAL | CVPR'23 |
| Prompt learning | Adversarial Attack | Boosting Adversarial Transferability using Dynamic Cues | DCViT-AT | ICLR'23 |
| Self-supervision | Video Recognition | Self-Supervised Video Transformer | SVT | CVPR'22-Oral |
| Contrastive learning | Adversarial Defense | Stylized Adversarial Training | SAT | IEEE TPAMI'22 |
| Self-supervision | Adversarial Attack | Adversarial Pixel Restoration as a Pretext Task for Transferable Perturbations | ARP | BMVC'22-Oral |
| Self-supervision | Image Recognition | How to Train Vision Transformer on Small-scale Datasets? | VSSD | BMVC'22 |
| Self-distillation | Image Recognition (Domain Generalization) | Self-Distilled Vision Transformer for Domain Generalization | SDViT | ACCV'22-Oral |
| Attention Analysis | Understanding Vision Transformers | Intriguing Properties of Vision Transformers | IPViT | NeurIPS'21-Spotlight |
| Self-ensemble | Adversarial Attack | On Improving Adversarial Transferability of Vision Transformers | ATViT | ICLR'22-Spotlight |
| Distribution matching | Adversarial Attack | On Generating Transferable Targeted Perturbations | TTP | ICCV'21 |
| Contrastive learning | Image Recognition | Orthogonal Projection Loss | OPL | ICCV'21 |
| Self-supervision | Adversarial Defense | A Self-supervised Approach for Adversarial Robustness | NRP | CVPR'20-Oral |
| Relativistic optimization | Adversarial Attack | Cross-Domain Transferability of Adversarial Perturbations | CDA | NeurIPS'19 |
| Gradient Smoothing | Adversarial Defense | Local Gradients Smoothing: Defense Against Localized Adversarial Attacks | LGS | WACV'19 |

ipvit's People

Contributors

cgarbin, kahnchana, muzammal-naseer


ipvit's Issues

Amazing work, but can it work on DETR?

The ViT family shows strong robustness to random patch dropping and to the domain shift problem. The thing is, I'm working on object detection these days. DETR is an end-to-end object detection method that adopts the Transformer encoder-decoder, but the backbone I use is ResNet-50, and it still exhibits the properties your paper mentions.
I want to ask two questions:
(1) Do these intriguing properties come from the encoder/decoder part?
(2) What is the difference between distribution shift and domain shift? (I saw "distribution shift" for the first time in your paper.)

Question about links of pretrained models

Hi! First of all, thanks to the authors for the exciting work!
I noticed that the checkpoint link for the pretrained 'deit_tiny_distilled_patch16_224' in vit_models/deit.py is different from that of the shape-biased model DeiT-T-SIN (distilled) given in README.md.
I thought deit_tiny_distilled_patch16_224 had the same definition as DeiT-T-SIN (distilled). Do they differ in model architecture or training procedure?

Segmentation maps from ViTs

Hi @Muzammal-Naseer and @cgarbin and @kahnchana

I've been reading the paper and looking into the code too, but I was not able to find the code related to the visualization of figures like Figure 4 (attention maps) and Figure 8 (segmentation maps from ViTs). Could you please give me a clue?

Thanks very much!
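For reference, the kind of attention-based segmentation visualization asked about here can be sketched by thresholding the last-layer CLS-token attention so that the patches holding most of the attention mass form a foreground mask. This is an illustrative sketch, not the repository's actual plotting code; the function name, grid size, and keep-ratio are assumptions.

```python
import numpy as np

def cls_attention_to_mask(attn, patch_grid=14, keep_ratio=0.6):
    """Coarse foreground mask from last-layer CLS attention (illustrative).

    attn: (num_heads, num_tokens, num_tokens) attention, token 0 = CLS.
    Returns a (patch_grid, patch_grid) boolean mask covering the patches
    that hold the top `keep_ratio` share of CLS attention mass.
    """
    # CLS-token attention to each patch token, averaged over heads.
    cls_attn = attn[:, 0, 1:].mean(axis=0)              # (num_patches,)
    order = np.argsort(cls_attn)[::-1]                  # patches by attention
    cum = np.cumsum(cls_attn[order]) / cls_attn.sum()   # cumulative mass
    keep = order[: int(np.searchsorted(cum, keep_ratio)) + 1]
    mask = np.zeros(patch_grid * patch_grid, dtype=bool)
    mask[keep] = True
    return mask.reshape(patch_grid, patch_grid)
```

Upsampling this 14x14 mask to the image resolution and overlaying it on the input gives a segmentation-style visualization similar in spirit to the paper's figures.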

Exciting work!

Can you explain why Transformers show such good occlusion robustness compared to CNNs?
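For context, the occlusion robustness in question is measured with a random PatchDrop protocol: a fraction of non-overlapping image patches is masked out and accuracy is re-measured. A minimal sketch of that perturbation (the function name and parameters are illustrative, not the repository's API):

```python
import numpy as np

def random_patch_drop(img, patch=16, drop_frac=0.5, rng=None):
    """Zero out a random fraction of non-overlapping patches (illustrative).

    img: (C, H, W) array with H and W divisible by `patch`.
    """
    rng = np.random.default_rng(rng)
    c, h, w = img.shape
    gh, gw = h // patch, w // patch
    n_drop = int(round(drop_frac * gh * gw))
    out = img.copy()
    # Pick patch indices without replacement and black them out.
    for idx in rng.choice(gh * gw, size=n_drop, replace=False):
        r, col = divmod(int(idx), gw)
        out[:, r * patch:(r + 1) * patch, col * patch:(col + 1) * patch] = 0.0
    return out
```

Comparing a model's accuracy on `img` versus `random_patch_drop(img)` over a test set is the kind of occlusion stress test the paper reports.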

Attention maps DINO Patchdrop

Hi, thanks for the amazing paper.

My question is about which patches are dropped from the image with the DINO model. In the code, evaluate.py sets head_number = 1 on line 132. I want to understand why this number was chosen (the other parameters used to index the attention maps seem to make sense). Wouldn't averaging the attention maps across heads give better segmentation?

Thanks,

Ravi
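The two options being compared, indexing a single attention head versus averaging across heads, can be sketched as follows (an illustrative helper, not the repository's code; token 0 is assumed to be CLS):

```python
import numpy as np

def patch_saliency(attn, head=None):
    """CLS-token attention over patch tokens (illustrative).

    attn: (num_heads, num_tokens, num_tokens) last-layer attention.
    head: if given, use that single head (as evaluate.py does with
          head_number); otherwise average across all heads.
    """
    cls_attn = attn[:, 0, 1:]          # (num_heads, num_patches)
    if head is not None:
        return cls_attn[head]          # single-head saliency
    return cls_attn.mean(axis=0)       # head-averaged saliency
```

Ranking patches by either saliency vector determines which patches are kept or dropped; the question above is essentially whether the averaged variant yields cleaner foreground/background separation than a single head.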

Two questions on your paper

Hi. This is heonjin.

Firstly, big thanks to you for your well-written and precise paper!
I have two questions on your paper.

  1. Please take a look at Figure 9.
    In the 'no positional encoding' experiment, there is a peak at shuffle size 196 for "DeiT-T-no-pos". Why is there a peak? I also wonder why there is a decrease from shuffle size 0 to 64 for "DeiT-T-no-pos".

  2. In Figure 14, on the Aircraft (few-shot) and Flower (few-shot) datasets, the CNN performs better than DeiT. Could you explain why?

Thanks in advance.
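For context, the shuffle-size experiment in Figure 9 permutes the spatial positions of non-overlapping image patches before classification. A minimal sketch of that perturbation, assuming 16x16 patches on a 224x224 image (function name and parameters are illustrative):

```python
import numpy as np

def shuffle_patches(img, grid=14, rng=None):
    """Randomly permute the positions of non-overlapping patches (illustrative).

    img: (C, H, W) array with H and W divisible by `grid`.
    grid=14 shuffles all 196 patches of a 224x224 image (16x16 patches).
    """
    rng = np.random.default_rng(rng)
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    # Collect patches in raster order, then write them back permuted.
    patches = [img[:, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw]
               for r in range(grid) for col in range(grid)]
    perm = rng.permutation(len(patches))
    out = np.empty_like(img)
    for dst, src in enumerate(perm):
        r, col = divmod(dst, grid)
        out[:, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw] = patches[src]
    return out
```

Smaller shuffle sizes in the figure correspond to coarser grids (fewer, larger patches being permuted), so shuffle size 196 destroys spatial structure most aggressively.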

Hi

May I ask if you can open-source the code for distillation and training the shape-biased models?
