| No. | Model Name | Title | Links | Pub. | Organization | Release Time |
|---|---|---|---|---|---|---|
| 1 | TimeSformer | Is Space-Time Attention All You Need for Video Understanding? | paper code | arXiv | Facebook AI | 24 Feb 2021 |
| 2 | VTN | Video Transformer Network | paper | arXiv | Theator | 1 Feb 2021 |
| 3 | ViViT | ViViT: A Video Vision Transformer | paper | arXiv | Google AI | 29 Mar 2021 |
| 4 | VideoGPT | VideoGPT: Video Generation using VQ-VAE and Transformers | paper code | arXiv | UC Berkeley | 20 Apr 2021 |
| 5 | VIMPAC | VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning | paper code | arXiv | UNC | 21 Jun 2021 |
| 6 | - | Self-supervised Video Representation Learning by Context and Motion Decoupling | paper | CVPR 2021 | Alibaba | 2 Apr 2021 |
| 7 | VideoLightFormer | VideoLightFormer: Lightweight Action Recognition using Transformers | paper | arXiv | University of Sheffield | 1 Jul 2021 |
| 8 | Video Swin Transformer | Video Swin Transformer | paper code | arXiv | MSRA | 24 Jun 2021 |
| 9 | ST Swin | Long-Short Temporal Contrastive Learning of Video Transformers | paper | arXiv | Facebook AI | 17 Jun 2021 |
| 10 | X-ViT | Space-time Mixing Attention for Video Transformer | paper | arXiv | Samsung AI Cambridge | 11 Jun 2021 |
| 11 | OCVT | Generative Video Transformer: Can Objects be the Words? | paper | ICML 2021 | Rutgers University | 20 Jul 2021 |
| 12 | - | An Image is Worth 16x16 Words, What is a Video Worth? | paper code | arXiv | Alibaba | 27 May 2021 |
| 13 | SCT | Shifted Chunk Transformer for Spatio-Temporal Representational Learning | paper | arXiv | Kuaishou Technology | 26 Aug 2021 |
| 14 | - | Evaluating Transformers for Lightweight Action Recognition | paper | arXiv | University of Sheffield | 18 Nov 2021 |
| 15 | DualFormer | DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition | paper | arXiv | Sea AI Lab | 9 Dec 2021 |
| 16 | BEVT | BEVT: BERT Pretraining of Video Transformers | paper | arXiv | Shanghai Key Lab of Intelligent Information Processing | 2 Dec 2021 |
| 17 | - | Efficient Video Transformers with Spatial-Temporal Token Selection | paper | arXiv | Shanghai Key Lab of Intelligent Information Processing | 23 Nov 2021 |
| 18 | - | Lite Vision Transformer with Enhanced Self-Attention | paper code | arXiv | Johns Hopkins University | 20 Dec 2021 |
| 19 | MViT | Multiscale Vision Transformers | paper code | ICCV 2021 | Facebook AI | 22 Apr 2021 |
| 20 | Uniformer | Uniformer: Unified Transformer For Efficient Spatiotemporal Representation Learning | paper code | arXiv | Chinese Academy of Sciences | 12 Jan 2022 |
| 21 | MaskFeat | Masked Feature Prediction for Self-Supervised Visual Pre-Training | paper | arXiv | Facebook AI | 16 Dec 2021 |
| 22 | MTV | Multiview Transformers for Video Recognition | paper | arXiv | Google Research | 20 Jan 2022 |
| 23 | MeMViT | MeMViT : Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | paper | arXiv | Facebook AI Research | 20 Jan 2022 |