Building AI magic

Author: Stu Feeser


Before I continue, it’s time for the scientist within me to emerge. I never was one to just accept a black box without knowing what is inside. In the case of a transformer model, I found the truth of what is going on to be liberating, destroying all the vendor FUD (Fear, Uncertainty, and Doubt) that emerges when technology appears as magic to the uninitiated. In this blog, you and I are going to have a heart-to-heart talk with the Wizard of Oz and pay no attention to the theatrics. Let’s start with the discovery that started the river of money.

What Started the AI River of Money

The paper that most famously reported the effect of (1) training data size, (2) parameter count, and (3) compute power on AI success, particularly noting the smooth, power-law relationship between these three factors and model performance, is “Scaling Laws for Neural Language Models” by Jared Kaplan and others at OpenAI. Published in early 2020, this work systematically shows that scaling up models improves performance in a predictable manner. That finding is borne out every day. So, let’s review each of these factors next.
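
To make “predictable” concrete, here is a minimal sketch of the kind of power-law curve the paper fits: as the parameter count grows, the loss (the model’s error) falls smoothly. The constant and exponent below are placeholder values chosen for illustration, not the figures fitted in the paper.

```python
# Illustrative power-law scaling: loss falls smoothly and predictably as the
# parameter count grows. The constants are placeholders, not the paper's values.
def predicted_loss(n_params, n_critical=1e13, alpha=0.076):
    return (n_critical / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 7e10):
    print(f"{n:12.0e} parameters -> predicted loss {predicted_loss(n):.2f}")
```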

Reagent #1: Training Data Size

In literature, a magic potion is made from reagents (ingredients). The most expensive reagent is the training data. When more and more HIGHLY CURATED training data is used to train a model, performance improves in that same predictable way. The emphasis on “highly curated” is deliberate; it is the qualifier that makes the claim true. Perhaps a better way to say this is, GARBAGE IN, GARBAGE OUT. You will need MASSIVE amounts of curated data if you want the magic to happen.
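
Curation itself is mundane but essential work. Here is a minimal, hypothetical sketch of the kind of filtering pass a curation pipeline might run; real pipelines go much further, with near-duplicate detection, PII and toxicity filtering, and quality scoring.

```python
def curate(raw_documents):
    """Toy curation pass: drop empties, tiny fragments, and exact duplicates.

    Real training pipelines are far more elaborate, but the principle is the
    same: garbage in, garbage out.
    """
    seen = set()
    kept = []
    for doc in raw_documents:
        text = doc.strip()
        if len(text) < 50:      # too short to be useful
            continue
        if text in seen:        # exact duplicate
            continue
        seen.add(text)
        kept.append(text)
    return kept
```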

Reagent #2: Parameter Count

In a neural net, each neuron is connected to upstream neurons, just like in a living brain. Each upstream connection carries a weight and counts as one parameter. So, think of this as the “wiring” of the neural net. Effectively, the more wires (parameters), the more intelligence the AI system will yield. In my opinion, the “magic” happens at around 70 billion parameters: the AI seems to take on a human-like ability to process prompts.
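
To see what “one parameter per connection” means in practice, here is a minimal PyTorch sketch that counts the weights in a tiny network. The layer sizes are arbitrary, chosen only for illustration.

```python
import torch.nn as nn

# A tiny fully connected network: every connection between a neuron and an
# upstream neuron is one weight, and every weight (plus each bias) is a parameter.
tiny_net = nn.Sequential(
    nn.Linear(512, 1024),   # 512 * 1024 weights + 1,024 biases
    nn.ReLU(),
    nn.Linear(1024, 512),   # 1,024 * 512 weights + 512 biases
)

total = sum(p.numel() for p in tiny_net.parameters())
print(f"{total:,} parameters")   # about 1.05 million "wires" in this toy net
```

A 70-billion-parameter model is the same idea, just with vastly wider and deeper layers.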

Reagent #3: Compute Power

Rather than a reagent, this is more like the cauldron where reagents 1 and 2 are mixed. Increasing compute power yields better AI results, and in the same predictable way: the larger the cauldron, the more powerful the AI magic becomes. The “cauldron” is usually built from one of two AI cards, the NVIDIA A100-80GB GPU or its big brother, the H100-80GB GPU. The bigger card has more tensor cores and processes data faster while doing more work per watt. The A100 costs about $16k, and the H100 at least twice that. At Alta3 Research, we currently use only A100s: their processing power is more than half that of the H100, but the price is less than half. Either of these two GPUs is an excellent choice, and both are known as high-end AI GPUs. Be aware that it can take thousands of these GPUs to train an AI model.
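
How big does the cauldron need to be? A common back-of-the-envelope rule from the scaling-laws literature is that training costs roughly 6 floating-point operations per parameter per training token. The sketch below applies that rule with assumed numbers: 70 billion parameters, 2 trillion training tokens, and an assumed sustained throughput per GPU; all three figures are illustrative, not vendor specifications.

```python
# Back-of-the-envelope training cost: compute ~= 6 * parameters * tokens (FLOPs).
params = 70e9              # a 70-billion-parameter model
tokens = 2e12              # assume 2 trillion training tokens
flops_needed = 6 * params * tokens

# Assume each GPU sustains ~300 trillion useful FLOP/s (illustrative figure,
# not an official A100 or H100 specification).
sustained_flops_per_gpu = 300e12
gpu_count = 2000

seconds = flops_needed / (sustained_flops_per_gpu * gpu_count)
print(f"~{seconds / 86400:.0f} days on {gpu_count} GPUs")   # a bit over two weeks
```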

Reagent #1 + Reagent #2 + Reagent #3 = The AI Model

When the three reagents are mixed over a period of weeks or months, depending on the compute power available, the result is an AI model. The size of the model remains the same after training as it was before training. During training, the model’s parameters (specifically its weights) were tweaked over and over until the model had absorbed all the training data. This means that petabytes of data are now distilled into gigabytes of weights. It does not end there: after compression (for example, quantizing the weights to fewer bits), a 70-billion-parameter model trained on petabytes of data can require less than 80 GB of storage.
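
The “less than 80 GB” figure is just arithmetic on bytes per weight. The sketch below works it out for a 70-billion-parameter model at a few common numeric precisions; which precision a given deployment actually uses is a separate engineering decision.

```python
# Storage needed for 70 billion weights at different numeric precisions.
params = 70e9
bytes_per_weight = {
    "fp32 (full precision)": 4,
    "fp16 / bf16 (half precision)": 2,
    "int8 (8-bit quantized)": 1,
    "4-bit quantized": 0.5,
}

for label, nbytes in bytes_per_weight.items():
    print(f"{label:<30} {params * nbytes / 1e9:>6,.0f} GB")
```

At 8 bits per weight, the whole model fits in about 70 GB, comfortably under the 80 GB of memory on a single A100 or H100.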

The Learning Process: Backpropagation

At the core of training transformer models is a process known as backpropagation, which can take days, weeks, or months. This algorithm tweaks model parameters based on the error of the model’s predictions, learning over time to make more accurate assessments. Backpropagation is essential for the model to improve through exposure to vast amounts of data, refining its parameters to better understand and generate language. It is an iterative process that works best when thousands of GPUs are available to handle the processing. Each parameter is typically stored as a 32-bit floating-point value, so a 70-billion-parameter model holds 70 billion 32-bit floating-point weights.

With this many parameters, the computation has to be massively parallel, much like the way the human brain operates, using its neurons to process information simultaneously. This parallel processing capability allows transformer models to handle complex tasks, analyze large datasets, and learn from the intricacies of language patterns in a way that mirrors human cognition on a grand scale. The efficiency and speed of backpropagation, enhanced by advanced hardware, make it possible for these models to evolve rapidly, turning raw data into meaningful insights and increasingly sophisticated responses.
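
Stripped of its scale, a single backpropagation step is short enough to fit on a slide. The PyTorch sketch below runs one forward pass, measures the error, propagates it backward, and nudges the weights; a real training run simply repeats this loop billions of times across thousands of GPUs. The model and data here are toy stand-ins.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model and one batch of training data.
model = nn.Linear(128, 10)
inputs = torch.randn(32, 128)            # a batch of 32 examples
targets = torch.randint(0, 10, (32,))    # the "right answers" for that batch

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One backpropagation step:
predictions = model(inputs)              # forward pass: make a guess
loss = loss_fn(predictions, targets)     # measure how wrong the guess was
loss.backward()                          # backward pass: assign blame to every weight
optimizer.step()                         # nudge each weight to reduce the error
optimizer.zero_grad()                    # clear gradients before the next step
```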

The Thinking Process: Inference

Inference is the application phase where the trained model uses its learned parameters to make predictions or generate text. This process involves taking an input sequence of tokens (words, characters, or other data points), processing it through the model’s layers, and producing an output sequence. The selection of each subsequent token is based on the model’s learned probabilities, aiming to create the most likely continuation of the input sequence.
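
The token-by-token loop itself is easy to sketch once the heavy lifting (the model’s probability estimates) is abstracted away. In the sketch below, next_token_probabilities is a hypothetical stand-in for a real trained model; the surrounding loop is the actual shape of greedy decoding, where the single most likely token is chosen at every step.

```python
import numpy as np

VOCAB_SIZE = 50_000

def next_token_probabilities(token_ids):
    """Hypothetical stand-in for a trained model: returns a probability for
    every token in the vocabulary, given the sequence so far."""
    rng = np.random.default_rng(seed=sum(token_ids))
    logits = rng.normal(size=VOCAB_SIZE)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(prompt_ids, max_new_tokens=20):
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(token_ids)
        next_id = int(np.argmax(probs))   # greedy: pick the most likely token
        token_ids.append(next_id)
    return token_ids

print(generate([101, 2023, 2003], max_new_tokens=5))
```

Real systems usually sample from the distribution rather than always taking the top token, which is what the “temperature” setting controls.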

RAG

Retrieval-Augmented Generation (RAG) introduces an additional layer of sophistication to transformer models. Often, humans feed an AI model poorly written prompts, which of course produce poor AI responses. By preprocessing the human prompt with an external knowledge-retrieval step, that weak prompt is converted into a robust one that is far more likely to yield a great AI response. RAG models can pull in information from vast databases, making AI responses more accurate and informative.
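
A minimal RAG loop has two moving parts: a retriever that finds relevant passages and a prompt builder that stitches them in front of the user’s question. The sketch below fakes the embedding step with a hypothetical embed function and a three-line knowledge base; a real system would use a trained embedding model and a vector database.

```python
import numpy as np

def embed(text):
    """Hypothetical embedding function. A real system would call a trained
    embedding model here; this stand-in just hashes words into a vector."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

knowledge_base = [
    "The A100 GPU has 80 GB of memory.",
    "Backpropagation adjusts weights based on prediction error.",
    "RAG retrieves documents to enrich a prompt before inference.",
]

def build_rag_prompt(question, top_k=2):
    q_vec = embed(question)
    ranked = sorted(knowledge_base,
                    key=lambda doc: float(np.dot(embed(doc), q_vec)),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Use the following context to answer.\n{context}\n\nQuestion: {question}"

print(build_rag_prompt("How does backpropagation work?"))
```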

Explainable AI: Beyond the Fluff

Explainable AI seeks to make the decision-making processes of AI models transparent and understandable to humans. In the context of transformers, explainable AI efforts focus on understanding how models arrive at particular outputs, which parts of the data influenced certain decisions, and how the selection of the next token (word or character) in a sequence is determined. The study of “selecting the next token wisely” is the next step in making machines think like humans. For instance, have you ever stopped while telling a story because you realized you were going down the wrong path? You pause only for a moment, contemplating where you went wrong, realign your thinking, and then continue the story with a reorganized thought path. Normal AI inference cannot do this! But if we employ explainable AI, the algorithm can spot that the next word (token) is the beginning of a hallucination, a copyright infringement, or some other undesirable outcome, and steer away from it by “choosing that next word carefully.”
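
There is no single agreed-upon recipe for this yet, but one crude, illustrative signal is the model’s own uncertainty about its next token: a nearly flat probability distribution (high entropy) suggests the model is guessing. The sketch below flags such tokens; it illustrates the idea and is not an established explainable-AI technique.

```python
import numpy as np

def flag_uncertain_token(probs, entropy_threshold=4.0):
    """Flag a next-token choice when the distribution is nearly flat.

    High entropy means no single token stands out, which is one crude hint
    that the model may be wandering off the path. The threshold is arbitrary.
    """
    entropy = -np.sum(probs * np.log2(probs + 1e-12))
    return entropy > entropy_threshold

confident = np.zeros(50_000); confident[42] = 1.0    # one obvious next token
unsure = np.full(50_000, 1 / 50_000)                 # anybody's guess

print(flag_uncertain_token(confident))   # False: the model knows what it wants
print(flag_uncertain_token(unsure))      # True: time to choose more carefully
```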

Conclusion

The blog series will now assume you are familiar with the following terms:

  • Training Data - Curated data to be used during backpropagation.
  • Parameters - The number of weights (wires).
  • GPU Compute - NVIDIA A100 and H100.
  • Backpropagation - Training a model to learn something new.
  • Inference - A model in the process of thinking.
  • Retrieval-Augmented Generation (RAG) - Enriching a prompt with retrieved knowledge before inference.
  • Explainable AI (XAI) - Making the model’s next-token choices transparent.