Transformers in AI Vision

Author: Stu Feeser


Intro: A New Wave in Smart Tech

Lately, the world of AI (artificial intelligence) has seen some cool changes, especially with something called transformer models. These models have been a big deal for understanding and generating language. But now, they’re also helping computers “see” and understand pictures better than ever before. We’re going to dive into how these transformer models are changing the game in something called computer vision.

Transformers Take on Computer Vision

Originally, transformers were all about handling language tasks, like how Siri understands your questions. But now, they’re stepping into the realm of computer vision. That means they’re getting good at looking at images and figuring out what’s what, just like they do with words. This new approach is called the Vision Transformer (ViT), and it’s pretty groundbreaking.

From Words to Pictures: How ViTs Work

  1. Turning Images into Data:

    • For language, transformers break down sentences into words or pieces of words, turning them into a form the computer can understand.
    • ViTs do something similar with pictures. They cut the image into little squares (patches), treat each one like a piece of data, and then work their magic to understand the image.
  2. Keeping Track of Position:

    • Just like in language, where the order of words matters, the position of each image patch is important for ViTs to get the whole picture right.
  3. Paying Attention to Details:

    • ViTs can focus on each part of the image, deciding which bits are important for understanding the whole scene.
  4. Summarizing the Image:

    • There’s a special piece of data called the “class token” that ViTs use to gather all the information from the image and come up with a summary of what’s in the picture.
  5. Why It’s Cool:

    • Unlike older models that just look at parts of an image one by one, ViTs can consider the entire picture all at once. This means they can get a better sense of the image as a whole.
  6. The Challenge:

    • The tricky part is that looking at images this way needs a lot of computing power: the cost of self-attention grows with the square of the number of patches, so really detailed (high-resolution) pictures get expensive fast. That’s why researchers are working on making ViTs smarter and less demanding on resources.
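The steps above can be sketched in a few lines of NumPy. This is a toy walkthrough with random, untrained weights, purely to show the shapes involved: the sizes (a 32x32 grayscale image, 8x8 patches, 16-dim embeddings, 10 classes) are made up for illustration and don't come from any real ViT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: cut a 32x32 "image" into 8x8 patches and flatten each one.
image = rng.random((32, 32))
patch = 8
patches = (image.reshape(4, patch, 4, patch)
                .transpose(0, 2, 1, 3)
                .reshape(16, patch * patch))   # 16 patches, 64 values each

# Project each flattened patch to a 16-dim embedding (a "token").
d = 16
W_embed = rng.random((patch * patch, d))
tokens = patches @ W_embed                     # shape (16, 16)

# Step 4 setup: prepend the "class token" that will summarize the image.
cls = rng.random((1, d))
tokens = np.vstack([cls, tokens])              # shape (17, 16)

# Step 2: add a positional embedding so patch order isn't lost.
tokens = tokens + rng.random((17, d))

# Step 3: one round of self-attention -- every token can look at every
# other token, which is what lets ViTs see the whole image at once.
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Wq, Wk, Wv = (rng.random((d, d)) for _ in range(3))
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(q @ k.T / np.sqrt(d))           # (17, 17): all-to-all mixing
tokens = attn @ v

# Steps 4-5: the class token's output row is the image summary;
# a final linear layer turns it into class scores.
W_head = rng.random((d, 10))
scores = tokens[0] @ W_head                    # 10 hypothetical classes
print(scores.shape)                            # (10,)
```

A real ViT stacks many attention layers, uses multiple attention heads, and learns all of these weight matrices from data, but the data flow is exactly this: patches in, class-token summary out.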

The Old vs. The New

  • The Old Way: Before ViTs, the go-to models for computer vision were CNNs (convolutional neural networks). They were great at recognizing objects in pictures because they scan images with small sliding filters, processing one local neighborhood of pixels at a time.

  • The New Way: ViTs are changing things up by treating images more like a sequence of data points (like text), which lets them “see” the big picture more effectively.
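One way to make the old-vs-new contrast concrete: in a single CNN layer, each output value depends only on a small neighborhood of inputs, while in a single attention layer every output token mixes in every input token. Here's a toy NumPy comparison on a 1-D "image" of 9 tokens (all sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 9                       # 9 tokens/pixels in a toy 1-D "image"
x = rng.random((n, 4))      # 4 features each

# CNN-style: a 3-wide sliding filter -- output i mixes only inputs i-1..i+1.
conv_mix = np.zeros((n, n))
for i in range(n):
    for j in range(max(0, i - 1), min(n, i + 2)):
        conv_mix[i, j] = 1.0

# Attention-style: softmax similarity -- output i mixes ALL inputs.
scores = x @ x.T
attn_mix = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print((conv_mix[4] != 0).sum())   # 3 -- a convolution sees only its neighbors
print((attn_mix[4] != 0).sum())   # 9 -- attention sees the whole sequence
```

A CNN can still build up a big-picture view, but only by stacking many layers; attention gets global context in one step, which is the trade the "New Way" is making.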

ViTs vs. CNNs: Which Is Better?

  • Understanding Images: ViTs can get a fuller understanding of images because they look at the whole scene all at once, not just piece by piece.

  • Flexibility: ViTs are super flexible and can be used for different kinds of tasks, from figuring out what’s in a picture to spotting specific objects or even describing a scene.

  • Doing More with Less: Even though training ViTs can be pricey because they need lots of data and power, their ability to handle many tasks might make them more cost-effective in the long run.
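The flexibility point can be made concrete: once a ViT encoder has produced one token per patch (plus the class token), different tasks just attach different "heads" that read different tokens. A hypothetical sketch with random stand-in values (all names and sizes here are made up, not from a real model):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend this is the output of a shared ViT encoder:
# 1 class token + 16 patch tokens, 16 features each.
encoder_out = rng.random((17, 16))

# Task A, image classification: read only the class token.
W_cls = rng.random((16, 10))
class_scores = encoder_out[0] @ W_cls      # one score per class

# Task B, dense prediction (e.g., per-patch labels): read the patch tokens.
W_dense = rng.random((16, 5))
patch_labels = encoder_out[1:] @ W_dense   # one prediction per patch

print(class_scores.shape)   # (10,)
print(patch_labels.shape)   # (16, 5)
```

The expensive part (the encoder) is trained once and shared; swapping tasks only means swapping a small head, which is where the long-run cost-effectiveness comes from.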

ViTs in Action

ViTs are great for tasks like finding objects in busy scenes, monitoring road conditions, counting people in crowds, and more. They can look at an entire scene and make sense of it, which is super useful for things like self-driving cars or keeping an eye on environmental changes.

Why This Matters

The move towards transformer-based models like ViTs is a big deal because it means AI can understand both what it sees and what it reads, making it smarter and more useful in our everyday lives. It’s like giving AI a better set of eyes and a brain that can really understand the world around it.

What’s Next?

Besides ViTs, there are other cool transformer models like the Swin Transformer, which computes attention inside small shifting windows so it can handle large images more efficiently, and others that are designed to be faster or to work better for specific tasks. All these advancements are making AI better at seeing and understanding our world.


And there you have it! A glimpse into how transformers are not just about language anymore but are also making waves in the visual world.