Transformers in AI Vision

Author: Stu Feeser

Introduction: In recent years, the field of artificial intelligence (AI) has witnessed a paradigm shift, with transformer models revolutionizing the way we approach language processing tasks. However, the transformative impact of these models is not limited to the realm of text and speech. Today, we turn our attention to another frontier—computer vision—and explore how transformer models are reshaping our approach to visual understanding, marking a significant leap from traditional models to cutting-edge advancements.

The Rise of Transformers in Computer Vision

Transformers, initially designed for natural language processing (NLP), have found a new domain to conquer: computer vision. This transition mirrors the evolution from text to imagery, where models must “understand” and “interpret” visual data. The essence of transformer models, their self-attention mechanism, is now being applied to “see” and “interpret” images in ways previously unimaginable. This new application is called the Vision Transformer (ViT).

NLP to ViT: Key Modifications

  1. Image Patching:

    • In NLP: A sentence is tokenized into words or subwords, which are then embedded into vectors.
    • In ViT: An image is divided into fixed-size patches (e.g., 16x16 pixels), and each patch is treated as a “token.” These patches are flattened and linearly projected into embeddings of a specified dimension, similar to word embeddings in NLP (a code sketch of this pipeline, from patches through the class token, follows this list).
  2. Positional Encodings:

    • In NLP: Transformers use positional encodings to retain the order of words, as the self-attention mechanism does not inherently process sequential data in order.
    • In ViT: Similar positional encodings are added to the patch embeddings to preserve the positional information of each patch in the image, ensuring the model can recognize patterns related to the spatial arrangement of patches.
  3. Self-Attention Mechanism:

    • Adaptation: The self-attention mechanism remains largely unchanged but is applied to the patch embeddings instead of word embeddings. This allows the model to weigh the importance of each patch relative to others, enabling it to focus on more “informative” parts of the image for a given task.
  4. Class Token:

    • In ViT: A special token, known as the “class token,” is prepended to the sequence of patch embeddings. This learnable token, typically initialized to zeros or small random values, is fed into the transformer encoder along with the patch embeddings. As the encoder runs, the self-attention mechanism, which operates on all patches and the class token collectively rather than sequentially, repeatedly updates the class token’s state, allowing it to integrate and summarize information from across the entire image. By the end of this process, the class token holds a comprehensive summary of the image and can be used for tasks such as image classification.
  5. Advantages:

    • Adapting transformers to computer vision through these modifications allows ViT to leverage the global context of an image, making it powerful for tasks that require understanding the image as a whole. Unlike CNNs, whose convolutional filters see only local regions at a time and build up global context gradually through deeper layers, ViT can attend to the entire image from its very first layer, leading to potentially more nuanced interpretations.
  6. Challenges:

    • The main challenge is computational cost. Processing images as sequences of patches requires significant computational resources, especially for high-resolution images, because self-attention scales quadratically with the number of patches. This has led to ongoing research into more efficient transformer architectures for computer vision, aiming to reduce the computational demands while maintaining or improving performance.
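
To make these modifications concrete, here is a minimal PyTorch sketch of the patch-embedding, positional-encoding, class-token, and self-attention steps described above. The sizes (224x224 input, 16x16 patches, 768-dimensional embeddings) follow the ViT-Base configuration, but the module itself, `MiniViTEmbed`, is an illustrative simplification rather than a faithful reimplementation of any published model.

```python
import torch
import torch.nn as nn

class MiniViTEmbed(nn.Module):
    """Illustrative ViT front end: patchify, embed, add class token and positions, self-attend."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768, depth=2, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2   # 224/16 = 14 per side -> 196 patches
        # A strided convolution is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # learnable positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, dim): one embedding per patch "token"
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token -> (B, 197, dim)
        x = x + self.pos_embed               # add positional encodings
        x = self.encoder(x)                  # self-attention over all patches plus the class token
        return x[:, 0]                       # the class token now summarizes the whole image

summary = MiniViTEmbed()(torch.randn(2, 3, 224, 224))   # -> shape (2, 768)
```

A real ViT would use many more encoder layers, add dropout, and place a classification head on top of the returned class-token state.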

Traditional vs. Transformer-Based Models in Vision

Traditional Approaches: For years, convolutional neural networks (CNNs) were the gold standard in computer vision, excelling in tasks like image classification, object detection, and more. These models rely on convolutional layers to process visual information, learning hierarchical feature representations.

Enter Vision Transformers (ViT): Vision Transformers represent a significant departure from CNNs, treating an image as a sequence of patches rather than a grid of pixels to be convolved. This approach allows ViTs to leverage the self-attention mechanism, enabling the model to weigh the importance of different parts of an image in relation to each other, much as transformer models evaluate word relationships in a sentence. ViT was introduced in “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Alexey Dosovitskiy and colleagues at Google in October 2020. The paper:

  • Demonstrated that transformers could be directly applied to sequences of image patches for image classification tasks, achieving competitive results on standard benchmarks such as ImageNet.
  • Introduced the concept of splitting an image into fixed-size patches, treating each patch as a token similar to words in NLP, and applying a transformer model to these tokens.
  • Showed that Vision Transformers scale effectively with the size of the model and the dataset, highlighting their potential to leverage large amounts of data for improved performance.
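
For readers who want to try this recipe without training anything, the sketch below loads a pretrained ViT checkpoint through the Hugging Face transformers library and classifies a single image. The checkpoint name and the image file name are assumptions for illustration; any ImageNet-pretrained ViT checkpoint should work the same way.

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# Assumes the Hugging Face `transformers` library is installed and this public checkpoint is reachable.
checkpoint = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint)

image = Image.open("street_scene.jpg").convert("RGB")   # illustrative file name
inputs = processor(images=image, return_tensors="pt")   # resizes, normalizes, returns pixel_values
with torch.no_grad():
    logits = model(**inputs).logits                     # (1, 1000) ImageNet class scores
print(model.config.id2label[logits.argmax(-1).item()])  # predicted label
```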

CNN vs. ViT: Which is best?

  • Richness of Understanding: ViTs have indeed demonstrated a remarkable ability to capture global dependencies within an image, thanks to their self-attention mechanism. This can lead to a richer understanding of the image as a whole, which is particularly beneficial for complex scenes where contextual understanding is crucial.

  • Versatility and Adaptability: Transformer models, including ViTs, are inherently more adaptable to a range of tasks without needing significant architectural changes. This adaptability stems from their architecture, which does not inherently assume any specific input structure (unlike CNNs, which are specifically designed to process grid-like data such as images). This makes transformers a powerful tool for tasks beyond those CNNs are typically used for, including NLP tasks and even multimodal tasks that involve both text and images.

  • Multifunctionality: The ability of a single transformer architecture to be applied to multiple visual tasks—ranging from image classification and object detection to segmentation and captioning—can indeed provide efficiency in terms of development and deployment. This multifunctionality can lead to cost savings, especially in scenarios requiring the development of models for multiple tasks.

  • Cost of Training: While it’s true that the versatility of ViTs might yield more “bang for your buck” in terms of the breadth of applications, it’s important to note that the training costs for ViTs can be significant, especially for large models and datasets. The computational requirements for training ViTs are often higher than those for CNNs due to the complexity of the self-attention mechanism. However, the investment might be justified by the broad applicability and potential performance gains of ViTs in complex scenarios.

  • CNNs Are Not One-Trick Ponies: It’s essential to recognize that CNNs have been and continue to be incredibly successful in a wide range of computer vision tasks. They are particularly well-suited for tasks where local spatial hierarchies in images are important. CNNs are also more computationally efficient for certain tasks and remain a preferred choice for many applications. The ongoing development in CNN architectures also aims to incorporate lessons learned from transformers, such as global context and adaptability.

Key Applications of Vision Transformers

  1. Enhanced Object Detection: ViT’s ability to process entire images in patches allows for more accurate object detection, even in cluttered scenes. Unlike traditional methods that might miss obscured objects, ViTs provide a comprehensive view, making sense of complex visual scenes with remarkable precision.

  2. Road Condition Monitoring: DETR (DEtection TRansformer) uses the transformer’s capability to analyze images as a whole, enabling precise localization and identification of road damage. This holistic view surpasses traditional models, offering a detailed understanding of road conditions in real time (a minimal DETR inference sketch follows this list).

  3. Pedestrian Detection: In crowded urban environments, DETR outperforms older technologies by effectively handling overlapping figures and complex scenes. Its ability to “think” about the entire scene allows for more accurate pedestrian counting and detection.

  4. Vehicle Classification and License Plate Recognition: While traditional models focus on specific regions, transformers provide a global perspective. This global understanding enhances vehicle classification accuracy and maintains the effectiveness of OCR (Optical Character Recognition) for License Plate Recognition (LPR), showcasing the versatility of transformers in both broad and specific tasks.
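
As a concrete illustration of the DETR examples above, the sketch below runs a pretrained DETR checkpoint from the Hugging Face transformers library on a single image and prints the detected objects. The checkpoint name, confidence threshold, and image file name are illustrative choices rather than part of any of the systems described in this list.

```python
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

# Assumes the Hugging Face `transformers` library is installed and this public checkpoint is reachable.
checkpoint = "facebook/detr-resnet-50"
processor = DetrImageProcessor.from_pretrained(checkpoint)
model = DetrForObjectDetection.from_pretrained(checkpoint)

image = Image.open("crosswalk.jpg").convert("RGB")      # illustrative file name
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above the threshold, rescaled to the original image size (height, width).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.9,
                                                  target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2),
          [round(v, 1) for v in box.tolist()])
```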

Why the Shift Matters

The shift towards transformer-based models in computer vision signifies a move towards more intelligent, adaptable AI systems. These models’ ability to interpret the visual world mimics human cognitive processes more closely, leading to AI that can better understand and interact with its environment. This evolution promises significant advancements in autonomous vehicles, surveillance, environmental monitoring, and beyond, marking a new era of AI-driven innovation.

What other transformer-based options exist?

Swin Transformer: Introduced by researchers at Microsoft in 2021, the Swin Transformer adds a hierarchical structure that allows for more efficient processing of images at different resolutions. It uses shifted windows to limit the self-attention computation to local windows while still allowing cross-window connections. This design improves the model’s scalability and efficiency, making it suitable for a wider range of tasks, such as object detection and semantic segmentation, where many objects must be located and recognized within the overall image.
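
The core trick, restricting attention to local windows and then shifting those windows between blocks, comes down to a few tensor reshapes. The PyTorch sketch below uses illustrative sizes (a 56x56 grid of 96-dimensional tokens and 7x7 windows) and shows only the partitioning and shifting steps, not a full Swin block.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) token grid into non-overlapping window_size x window_size windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows * B, tokens_per_window, C); self-attention is computed within each window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

feat = torch.randn(1, 56, 56, 96)                 # tokens arranged on a 56x56 spatial grid
windows = window_partition(feat, window_size=7)   # (64, 49, 96): 64 windows of 49 tokens each
# In the next block the grid is cyclically shifted so that neighboring windows exchange information.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=7)
```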

Data-efficient Image Transformers (DeiT): DeiT, also introduced in 2021, focuses on making transformers more data-efficient, enabling them to perform well when trained on ImageNet alone rather than the far larger proprietary datasets used to pre-train the original ViT. DeiT introduces a distillation token and leverages knowledge distillation during training, allowing the model to learn from both the labeled data and the soft labels provided by a pre-trained teacher model.
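
A sketch of the distillation objective may make this clearer: the class token is supervised by the ground-truth labels, while the distillation token is pushed toward the teacher’s softened output distribution. The loss weighting and temperature below are illustrative defaults, not the exact values from the DeiT paper.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(cls_logits, dist_logits, teacher_logits, labels,
                           temperature=3.0, alpha=0.5):
    """Blend supervised cross-entropy (class token) with KL divergence to the teacher (distillation token)."""
    ce = F.cross_entropy(cls_logits, labels)                      # learn from labeled data
    kd = F.kl_div(F.log_softmax(dist_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2       # learn from the teacher's soft labels
    return (1 - alpha) * ce + alpha * kd

# Illustrative shapes: batch of 8 images, 1000 classes.
cls_logits, dist_logits = torch.randn(8, 1000), torch.randn(8, 1000)
teacher_logits, labels = torch.randn(8, 1000), torch.randint(0, 1000, (8,))
loss = soft_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```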

Convolutional vision Transformers (CvT): CvT combines elements of both CNNs and transformers by integrating convolutional operations into the transformer architecture. Introduced in 2021, this approach aims to combine the representational efficiency of convolutional layers with the global processing capabilities of transformers, offering improvements in both performance and computational efficiency.
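
In code, the change is small: the non-overlapping linear patch projection of a plain ViT is replaced by an overlapping strided convolution. The layer sizes below (a 7x7 kernel with stride 4 producing 64-dimensional tokens) are illustrative, roughly in the spirit of an early CvT stage, not an exact reproduction.

```python
import torch
import torch.nn as nn

# CvT-style convolutional token embedding: overlapping patches share local structure before attention.
conv_embed = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=4, padding=2)
tokens = conv_embed(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)   # (1, 3136, 64) tokens
```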

Multiscale Vision Transformers (MViT): MViT, proposed in 2021, introduces a multiscale architecture that processes input images at multiple resolutions, allowing the model to capture a richer set of features from fine to coarse levels. This approach is particularly beneficial for tasks that require understanding both detailed textures and global structures, such as video recognition and semantic segmentation.

LeViT: Introduced in 2021, LeViT focuses on improving the speed and efficiency of Vision Transformers for real-time applications. It blends a convolution-inspired, multi-stage design with attention and is optimized for fast inference, including on CPUs, making it suitable for deployment in resource-constrained environments.