Transformer-in-Vision

A collection of recent Transformer-based computer vision (CV) works. Comments and contributions are welcome!

Updated regularly.

Resources

Surveys:

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey, [Paper]
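All of the works listed in this repository build on the same scaled dot-product attention primitive. As background, here is a minimal NumPy sketch of that operation (illustrative only, not taken from any specific paper above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 query tokens attending over 6 key/value tokens, dim 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Vision transformers differ mainly in what the tokens are (image patches, point clouds, video clips, detection queries) and in how this attention is restricted or factorized for efficiency.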

Recent Papers

  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper], [Project]

  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]

  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]

  • (arXiv 2021.04) Handwriting Transformers, [Paper]

  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper]

  • (arXiv 2021.04) Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation, [Paper]

  • (arXiv 2021.04) Compressing Visual-linguistic Model via Knowledge Distillation, [Paper]

  • (arXiv 2021.04) When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes, [Paper]

  • (arXiv 2021.04) Variational Transformer Networks for Layout Generation, [Paper]

  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]

  • (arXiv 2021.04) Fourier Image Transformer, [Paper]

  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]

  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]

  • (arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]

  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.04) VisQA: X-raying Vision and Language Reasoning in Transformers, [Paper]

  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]

  • (arXiv 2021.04) Language-based Video Editing via Multi-Modal Multi-Level Transformer, [Paper]

  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, [Paper]

  • (arXiv 2021.04) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

  • (arXiv 2021.04) Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, [Paper], [Project]

  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]

  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]

  • (arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]

  • (arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]

  • (arXiv 2021.03) Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning, [Paper], [Code]

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]

  • (arXiv 2021.03) Scene-Intuitive Agent for Remote Embodied Visual Grounding, [Paper]

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper]

  • (arXiv 2021.03) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]

  • (arXiv 2021.03) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]

  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]

  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]

  • (arXiv 2021.03) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]

  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]

  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]

  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]

  • (arXiv 2021.03) Paying Attention to Multiscale Feature Maps in Multimodal Image Matching, [Paper]

  • (arXiv 2021.03) Learning Multi-Scene Absolute Pose Regression with Transformers, [Paper]

  • (arXiv 2021.03) Hopper: Multi-hop Transformer for Spatiotemporal Reasoning, [Paper], [Code]

  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]

  • (arXiv 2021.03) AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting, [Paper], [Code]

  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]

  • (arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.03) On the Sentence Embeddings from Pre-trained Language Models, [Paper]

  • (arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]

  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]

  • (arXiv 2021.03) Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [Paper]

  • (arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]

  • (arXiv 2021.03) Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]

  • (arXiv 2021.03) Causal Attention for Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2021.03) Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, [Paper]

  • (arXiv 2021.03) WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, [Paper]

  • (arXiv 2021.03) Attention is not all you need: pure attention loses rank doubly exponentially with depth, [Paper]

  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]

  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]

  • (arXiv 2021.03) Perceiver: General Perception with Iterative Attention, [Paper]

  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]

  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]

  • (arXiv 2021.03) OmniNet: Omnidirectional Representations from Transformers, [Paper]

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]

  • (arXiv 2021.02) Evolving Attention with Residual Convolutions, [Paper]

  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]

  • (arXiv 2021.02) SparseBERT: Rethinking the Importance Analysis in Self-attention, [Paper]

  • (arXiv 2021.02) Investigating the Limitations of Transformers with Simple Arithmetic Tasks, [Paper], [Code]

  • (arXiv 2021.02) Do Transformer Modifications Transfer Across Implementations and Applications? [Paper]

  • (arXiv 2021.02) Do We Really Need Explicit Position Encodings for Vision Transformers? [Paper], [Code]

  • (arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]

  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]

  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]

  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]

  • (arXiv 2021.02) Centroid Transformer: Learning to Abstract with Attention, [Paper]

  • (arXiv 2021.02) Linear Transformers Are Secretly Fast Weight Memory Systems, [Paper]

  • (arXiv 2021.02) Position Information in Transformers: An Overview, [Paper]

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Project], [Code]

  • (arXiv 2021.02) Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, [Paper]

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]

  • (arXiv 2021.02) End-to-end Audio-Visual Speech Recognition with Conformers, [Paper]

  • (arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]

  • (arXiv 2021.02) Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]

  • (arXiv 2021.02) Video Transformer Network, [Paper]

  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]

  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]

  • (arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]

  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]

  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]

  • (arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]

  • (arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]

  • (arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]

  • (arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]

  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]

  • (arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]

  • (arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]

  • (arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]

  • (arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]

  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]

  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]

  • (arXiv 2021.01) Addressing Some Limitations of Transformers with Feedback Memory, [Paper]

  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]

  • (arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]

  • (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]

  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]

  • (arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]

  • (arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]

  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]

  • (arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]

  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]

  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]

  • (arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]

  • (arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]

  • (arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]

  • (arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]

  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper]

  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]

  • (arXiv 2020.12) Point Transformer, [Paper]

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]

  • (arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]

  • (arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]

  • (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]

  • (arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

  • (arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]

  • (arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]

  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]

  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper](https://arxiv.org/pdf/2011.14027)

  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]

  • (arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]

  • (arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]

  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]

  • (arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]

  • (ICLR'21) IOT: Instance-wise Layer Reordering for Transformer Structures, [Paper], [Code]

  • (ICLR'21) UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers, [Paper], [Code]

  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]

  • (ICLR'21) LambdaNetworks: Modeling Long-Range Interactions without Attention, [Paper], [Code]

  • (ICLR'21) Support-set Bottlenecks for Video-Text Representation Learning, [Paper]

  • (ICLR'21) Colorization Transformer, [Paper], [Code]

  • (ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]

  • (ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]

  • (CVPR'20) PaStaNet: Toward Human Activity Knowledge Engine, [Paper], [Code]

  • (CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]

  • (CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]

  • (ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]

  • (EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
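Many of the image models above (e.g. "An Image is Worth 16x16 Words") start by cutting the input image into fixed-size patches and flattening each patch into a token. A minimal NumPy sketch of that patch-tokenization step (the function name `image_to_patch_tokens` is illustrative, not from any listed paper):

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patches and
    flatten each patch into one token vector, ViT-style."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    gh, gw = H // patch, W // patch                   # patch grid dimensions
    tokens = (image.reshape(gh, patch, gw, patch, C)
                   .transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, C)
                   .reshape(gh * gw, patch * patch * C))
    return tokens

img = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(img)
print(tokens.shape)  # (196, 768)
```

For a 224×224 RGB image with 16×16 patches this yields 196 tokens of dimension 768, matching the ViT-Base token layout; in a real model a learned linear projection and position embeddings would follow before the transformer encoder.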

TODO

Contributors

  • dirtyharrylyl
