Transformer-in-Vision

A collection of recent Transformer-based computer vision (CV) works. Comments and contributions are welcome!

Updated regularly.

Resources

Surveys:

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey, [Paper]
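All of the works listed in this repository build on the same scaled dot-product attention primitive. As background, here is a minimal NumPy sketch of that operation (illustrative only, not taken from any specific paper above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 query tokens attending over 6 key/value tokens, dim 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Vision transformers differ mainly in what the tokens are (image patches, point clouds, video clips, detection queries) and in how this attention is restricted or factorized for efficiency.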

Recent Papers

  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper], [Project]

  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]

  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]

  • (arXiv 2021.04) Handwriting Transformers, [Paper]

  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper]

  • (arXiv 2021.04) Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation, [Paper]

  • (arXiv 2021.04) Compressing Visual-linguistic Model via Knowledge Distillation, [Paper]

  • (arXiv 2021.04) When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes, [Paper]

  • (arXiv 2021.04) Variational Transformer Networks for Layout Generation, [Paper]

  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]

  • (arXiv 2021.04) Fourier Image Transformer, [Paper]

  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]

  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]

  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]

  • (arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]

  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.04) VisQA: X-raying Vision and Language Reasoning in Transformers, [Paper]

  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]

  • (arXiv 2021.04) Language-based Video Editing via Multi-Modal Multi-Level Transformer, [Paper]

  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, [Paper]

  • (arXiv 2021.04) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

  • (arXiv 2021.04) Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, [Paper], [Project]

  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]

  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]

  • (arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]

  • (arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]

  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]

  • (arXiv 2021.03) Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning, [Paper], [Code]

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]

  • (arXiv 2021.03) Scene-Intuitive Agent for Remote Embodied Visual Grounding, [Paper]

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper]

  • (arXiv 2021.03) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]

  • (arXiv 2021.03) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]

  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]

  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]

  • (arXiv 2021.03) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]

  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]

  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]

  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]

  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]

  • (arXiv 2021.03) Paying Attention to Multiscale Feature Maps in Multimodal Image Matching, [Paper]

  • (arXiv 2021.03) Learning Multi-Scene Absolute Pose Regression with Transformers, [Paper]

  • (arXiv 2021.03) Hopper: Multi-hop Transformer for Spatiotemporal Reasoning, [Paper], [Code]

  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]

  • (arXiv 2021.03) AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting, [Paper], [Code]

  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]

  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]

  • (arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.03) On the Sentence Embeddings from Pre-trained Language Models, [Paper]

  • (arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]

  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]

  • (arXiv 2021.03) Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [Paper]

  • (arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]

  • (arXiv 2021.03) Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]

  • (arXiv 2021.03) Causal Attention for Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2021.03) Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, [Paper]

  • (arXiv 2021.03) WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, [Paper]

  • (arXiv 2021.03) Attention is not all you need: pure attention loses rank doubly exponentially with depth, [Paper]

  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]

  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]

  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]

  • (arXiv 2021.03) Perceiver: General Perception with Iterative Attention, [Paper]

  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]

  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]

  • (arXiv 2021.03) OmniNet: Omnidirectional Representations from Transformers, [Paper]

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]

  • (arXiv 2021.02) Evolving Attention with Residual Convolutions, [Paper]

  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]

  • (arXiv 2021.02) SparseBERT: Rethinking the Importance Analysis in Self-attention, [Paper]

  • (arXiv 2021.02) Investigating the Limitations of Transformers with Simple Arithmetic Tasks, [Paper], [Code]

  • (arXiv 2021.02) Do Transformer Modifications Transfer Across Implementations and Applications? [Paper]

  • (arXiv 2021.02) Do We Really Need Explicit Position Encodings for Vision Transformers? [Paper], [Code]

  • (arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]

  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]

  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]

  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]

  • (arXiv 2021.02) Centroid Transformer: Learning to Abstract with Attention, [Paper]

  • (arXiv 2021.02) Linear Transformers Are Secretly Fast Weight Memory Systems, [Paper]

  • (arXiv 2021.02) Position Information in Transformers: An Overview, [Paper]

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Project], [Code]

  • (arXiv 2021.02) Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, [Paper]

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]

  • (arXiv 2021.02) End-to-end Audio-Visual Speech Recognition with Conformers, [Paper]

  • (arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]

  • (arXiv 2021.02) Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]

  • (arXiv 2021.02) Video Transformer Network, [Paper]

  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]

  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]

  • (arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]

  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]

  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]

  • (arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]

  • (arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]

  • (arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]

  • (arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]

  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]

  • (arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]

  • (arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]

  • (arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]

  • (arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]

  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]

  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]

  • (arXiv 2021.01) Addressing Some Limitations of Transformers with Feedback Memory, [Paper]

  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]

  • (arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]

  • (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]

  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]

  • (arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]

  • (arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]

  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]

  • (arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]

  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]

  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]

  • (arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]

  • (arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]

  • (arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]

  • (arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]

  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper]

  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]

  • (arXiv 2020.12) Point Transformer, [Paper]

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]

  • (arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]

  • (arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]

  • (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]

  • (arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

  • (arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]

  • (arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]

  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]

  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper](https://arxiv.org/pdf/2011.14027)

  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]

  • (arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]

  • (arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]

  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]

  • (arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]

  • (ICLR'21) IOT: Instance-wise Layer Reordering for Transformer Structures, [Paper], [Code]

  • (ICLR'21) UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers, [Paper], [Code]

  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]

  • (ICLR'21) LambdaNetworks: Modeling Long-Range Interactions without Attention, [Paper], [Code]

  • (ICLR'21) Support-set Bottlenecks for Video-Text Representation Learning, [Paper]

  • (ICLR'21) Colorization Transformer, [Paper], [Code]

  • (ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]

  • (ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]

  • (CVPR'20) PaStaNet: Toward Human Activity Knowledge Engine, [Paper], [Code]

  • (CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]

  • (CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]

  • (ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]

  • (EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
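Many of the image models above (e.g. "An Image is Worth 16x16 Words") start by cutting the input image into fixed-size patches and flattening each patch into a token. A minimal NumPy sketch of that patch-tokenization step (the function name `image_to_patch_tokens` is illustrative, not from any listed paper):

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patches and
    flatten each patch into one token vector, ViT-style."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    gh, gw = H // patch, W // patch                   # patch grid dimensions
    tokens = (image.reshape(gh, patch, gw, patch, C)
                   .transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, C)
                   .reshape(gh * gw, patch * patch * C))
    return tokens

img = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(img)
print(tokens.shape)  # (196, 768)
```

For a 224×224 RGB image with 16×16 patches this yields 196 tokens of dimension 768, matching the ViT-Base token layout; in a real model a learned linear projection and position embeddings would follow before the transformer encoder.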

TODO

Contributors

  • dirtyharrylyl
