If you have read any in-depth article about AI in the past few years, you have probably encountered the word "transformer." Not the shape-shifting robots — the neural network architecture that powers virtually every major AI system today, from ChatGPT to Claude to Gemini to Llama.

The transformer was introduced in a 2017 research paper with the memorable title "Attention Is All You Need." In the years since, it has become the dominant architecture in AI, replacing older approaches and enabling the breakthroughs that put AI on the front page of every newspaper.

This chapter explains what transformers are and why they matter, without requiring you to understand linear algebra or calculus. If you have followed the previous chapters, you have everything you need.

What Came Before: The Limits of Reading One Word at a Time

To appreciate why transformers were a breakthrough, you need to understand the approach they replaced.

Before transformers, the dominant architecture for processing language was the Recurrent Neural Network, or RNN. The key feature of an RNN was that it processed text sequentially — one word (or token) at a time, in order, from left to right.

Imagine reading a book by looking at one word at a time through a tiny window, sliding the window forward word by word. As you move forward, you try to keep a running summary of everything you have read so far in your head. That running summary is your "hidden state," and it is how RNNs kept track of context.
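For readers comfortable with a little code, the running-summary idea can be sketched in a few lines of Python. This is a toy illustration, not a real RNN: the "summary" here just keeps the three most recent tokens, standing in for the learned update a real network would apply.

```python
# Toy sketch of RNN-style processing: one token at a time, left to right,
# folding each token into a running "hidden state" summary.
# A real RNN applies a learned neural update; this stand-in just keeps
# the most recent tokens, which makes the forgetting problem visible.

def toy_rnn(tokens):
    hidden_state = []                      # the running summary (starts empty)
    for token in tokens:                   # strictly sequential: one step per token
        hidden_state = (hidden_state + [token])[-3:]  # keep only recent info
        # older tokens fall out of the summary -- the "forgetting problem"
    return hidden_state

summary = toy_rnn(["The", "cat", "sat", "on", "the", "mat"])
print(summary)  # the earliest words are gone by the end
```

Notice also that each loop step needs the result of the previous one, which is exactly why this style of processing cannot be parallelized.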

This approach had two fundamental problems.

The forgetting problem. As the running summary was updated with each new word, information about earlier words gradually faded. By the time an RNN had processed a long paragraph, it had often lost track of important details from the beginning. Various clever modifications (called LSTMs and GRUs, if you want to look them up) helped with this, but they never fully solved it. It was like trying to remember the first chapter of a book while reading the last one — some information inevitably gets lost.

The speed problem. Because RNNs processed words one at a time, they could not take advantage of parallel processing. Modern GPUs contain thousands of processing cores that can work simultaneously, but an RNN forced them to sit idle while it plodded through the text sequentially. Processing a long document was slow because each word had to wait for all the previous words to be processed first.

These two problems — forgetting long-range information and processing text slowly — put a hard ceiling on what RNN-based systems could achieve. They could handle simple tasks like sentiment analysis, but generating coherent long-form text or understanding complex documents was beyond their reach.

The Big Idea: Attention

The transformer architecture solved both problems with a mechanism called self-attention. This is the core innovation, and understanding it intuitively is the key to understanding why transformers dominate AI.

Here is the basic idea. Instead of processing text one word at a time and maintaining a running summary, a transformer looks at all the words in a passage simultaneously and figures out which words are most relevant to each other.

Think about how you actually understand a sentence. When you read "The cat sat on the mat because it was tired," you instantly connect "it" to "cat" — not "mat." You do this not by processing words sequentially, but by considering the relationships between all the words at once. You attend to the relevant connections.

That is what self-attention does, and it is why the paper was titled "Attention Is All You Need." The model learns to ask, for each word in the input, "Which other words in this passage are most important for understanding this particular word?"
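The scoring-and-weighting step at the heart of attention can be shown with a toy example. The relevance scores below are hand-picked for illustration; a real transformer computes them from learned vector representations of the words. The only real machinery shown is the softmax, the standard way of turning raw scores into weights that sum to 1.

```python
import math

# Toy self-attention for one word: score how relevant every word in the
# passage is to "it", then convert the scores into weights that sum to 1.
# The scores are invented for illustration; a real model learns them.

words  = ["The", "cat", "sat", "because", "it", "was", "tired"]
scores = [0.1,   3.0,   0.5,   0.2,       0.0,  0.1,   1.0]   # relevance to "it"

exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]   # attention weights (a softmax)

for word, w in zip(words, weights):
    print(f"{word:8s} {w:.2f}")
# "cat" receives by far the largest weight: the model links "it" to "cat".
```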

A Concrete Example

Let us walk through a more detailed example. Consider the sentence: "The bank approved the loan because the applicant had excellent credit."

For a human, the word "bank" is immediately understood as a financial institution, not a riverbank. How? Because of the words "loan," "applicant," and "credit" elsewhere in the sentence. You do not need to process the sentence left to right to figure this out. You see the whole sentence, and the relationships within it click into place.

A transformer does something analogous. When processing the word "bank," the self-attention mechanism assigns high attention weights to "loan," "approved," "applicant," and "credit." These connections help the model understand that "bank" refers to a financial institution in this context. The attention mechanism has learned, through training, which relationships between words tend to be important for understanding meaning.

Now consider a different sentence: "She walked along the bank of the river, watching the ducks." The same word "bank" now gets high attention weights from "river," "walked," and "ducks," leading the model to understand it as a riverbank.

The model does not have two definitions of "bank" stored in a dictionary. Instead, it uses the attention mechanism to understand each word in the context of all the other words around it. This contextual understanding is one of the most powerful aspects of the transformer architecture.

Attention Across Long Distances

Self-attention also solves the forgetting problem that plagued RNNs. Because the mechanism considers relationships between all words simultaneously, it does not matter whether the relevant context is two words away or two thousand words away. The model can connect a pronoun in the last paragraph to the noun it refers to in the first paragraph, or understand that a conclusion at the end of an argument relates to a premise stated at the beginning.

This is why transformers can handle much longer documents than RNNs could. The context window sizes we discussed in the previous chapter — 100,000 tokens, even 1 million tokens — are only possible because of the transformer's ability to maintain connections across long stretches of text.

Processing in Parallel: The Speed Revolution

The second breakthrough of the transformer is that it can process all the input tokens in parallel rather than sequentially.

Remember the RNN's problem: it had to process word 1 before word 2, word 2 before word 3, and so on. A 1,000-word input required 1,000 sequential steps, regardless of how many processors you had.

A transformer processes all 1,000 words simultaneously. Each word's attention calculation — figuring out which other words are relevant — can happen at the same time for every word. This maps perfectly onto GPUs, which excel at performing thousands of operations in parallel.

This parallelism is what made it feasible to train models on trillions of tokens. Training an RNN on that much data would have taken years. Training a transformer on the same data, using thousands of GPUs working in parallel, takes weeks or months. Without this speedup, the scale of modern AI would simply not be possible.
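The contrast between the two processing styles can be made concrete in code. In this sketch, the "work" done per token is trivially simple; what matters is the shape of the dependency: in the RNN-style loop, each step needs the previous step's result, while in the transformer-style pass, every token's computation is independent and could run simultaneously on parallel hardware.

```python
# Sequential vs. parallel processing, in miniature.

def sequential(tokens):
    # RNN-style: step i cannot start until step i-1 has finished,
    # because each step depends on the state left by the previous one.
    state = 0
    results = []
    for t in tokens:
        state = state + len(t)       # depends on the previous state
        results.append(state)
    return results

def parallel(tokens):
    # Transformer-style: each token's computation is independent of the
    # others, so all of them could run at once on a GPU's many cores.
    return [len(t) for t in tokens]  # no cross-step dependency

tokens = ["the", "bank", "approved", "the", "loan"]
print(sequential(tokens))
print(parallel(tokens))
```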

Layers: Building Understanding Step by Step

A transformer does not just apply attention once. It applies it many times, in a stack of layers. A modern frontier model might have 80 or more layers.

Each layer refines the model's understanding of the text. Think of it like multiple rounds of analysis:

  • Early layers tend to capture basic patterns: grammar, syntax, common phrases, word relationships.
  • Middle layers build higher-level understanding: the topic of the text, the roles of different entities, logical relationships.
  • Later layers capture the most abstract and complex patterns: overall argument structure, tone, implied meaning, the kind of response that would be appropriate.

Each layer takes the output of the previous layer and applies its own attention and processing, gradually building a richer and more nuanced representation of the text. By the time the input has passed through all the layers, the model has developed a deep, multi-faceted understanding that goes far beyond what any single round of attention could capture.

This layered processing is why larger models (with more layers and more parameters per layer) tend to be more capable. More layers mean more rounds of refinement, which means the model can capture subtler patterns and more complex relationships.
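The stacking itself is simple to picture in code. In this sketch, each "layer" just wraps a label around the previous layer's output, standing in for the attention and processing a real transformer layer performs; the point is the repeated refinement, with each layer consuming the layer below it.

```python
# Sketch of layered processing: each layer takes the previous layer's
# output and refines it further. The "refinement" here is a trivial
# stand-in for a real transformer layer's attention and processing.

def layer(representation, depth):
    return f"layer{depth}({representation})"

def transformer_stack(text, num_layers=4):
    representation = text
    for depth in range(1, num_layers + 1):
        representation = layer(representation, depth)
    return representation

print(transformer_stack("input"))
# layer4(layer3(layer2(layer1(input))))
```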

Encoders and Decoders

The original transformer paper described an architecture with two main components: an encoder and a decoder. Understanding the difference is helpful because different AI systems use different parts.

The encoder processes input text and builds a rich internal representation of it. It uses self-attention to understand the relationships between all words in the input, producing a deep understanding of the text's meaning, structure, and content.

The decoder generates output text, one token at a time. It also uses self-attention, but with an important constraint: when generating each token, it can only attend to the tokens that came before it (because the tokens after it have not been generated yet). It can also attend to the encoder's representation of the input.

In practice, different AI systems use different configurations:

  • Encoder-only models (like BERT, used in search and classification) are good at understanding text but do not generate new text. They are used behind the scenes in search engines and text analysis systems.
  • Decoder-only models (like GPT, Claude, and Llama) are what you interact with as chatbots. They generate text by predicting one token at a time, attending to everything that came before.
  • Encoder-decoder models (like the original transformer, and Google's T5) use both components and are often used for tasks like translation, where you need to understand an input fully and then generate a complete output.

The chatbots you interact with daily are decoder-only models. They read your prompt (which goes through the decoder, attending to all previous tokens) and generate a response one token at a time, with each new token informed by everything that came before it.
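The decoder's constraint of only attending backward is usually implemented as a "causal mask": a table saying which positions each position is allowed to look at. A minimal sketch:

```python
# Which tokens may each position attend to in a decoder?
# Position i may attend only to positions 0..i -- the tokens that
# already exist when token i is being generated.

def causal_mask(n):
    # mask[i][j] is True when position i is allowed to attend to position j
    return [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])
# Row 0 sees only itself; row 3 (the newest token) sees everything before it.
```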

Multi-Head Attention: Looking at Text from Multiple Angles

One refinement of the attention mechanism that is worth understanding is multi-head attention. Instead of having a single attention mechanism, transformers use multiple attention "heads" that each learn to focus on different types of relationships.

Think of it like having multiple analysts read the same document. One analyst might focus on who is doing what to whom (agent-action relationships). Another might focus on temporal relationships (what happened before what). A third might focus on causal relationships (what caused what). Each analyst captures different aspects of the text, and their combined insights provide a richer understanding than any single reading could.

In a transformer, different attention heads learn to capture different linguistic patterns — some track grammatical relationships, others track semantic similarity, others track positional patterns, and so on. The model combines all of these perspectives to build its understanding.

Modern models use dozens or even over a hundred attention heads per layer, each providing its own perspective on the relationships within the text.
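Continuing the multiple-analysts picture, here is a toy sketch of two heads attending from the word "bank" in the riverbank sentence. The heads and their scores are invented for illustration (real heads learn theirs during training); what the sketch shows is the combining step, where each head's weights contribute to the overall picture.

```python
import math

# Multi-head attention in miniature: two hypothetical "heads" score the
# same words differently, and their attention weights are averaged.
# The heads and scores are invented; real heads are learned from data.

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["She", "walked", "along", "the", "bank", "of", "the", "river"]

head_scores = {
    "grammar head": [0.0, 2.0, 1.0, 0.5, 0.0, 0.5, 0.0, 0.0],  # nearby verbs
    "meaning head": [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0],  # "river"
}

combined = [0.0] * len(words)
for name, scores in head_scores.items():
    weights = softmax(scores)
    combined = [c + w / len(head_scores) for c, w in zip(combined, weights)]

for word, w in zip(words, combined):
    print(f"{word:8s} {w:.2f}")
```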

Position Matters: How Transformers Know Word Order

There is a subtle problem with the attention mechanism as described so far. If the model looks at all words simultaneously rather than sequentially, how does it know the order of the words? The sentences "Dog bites man" and "Man bites dog" contain the same words, but they mean very different things.

Transformers solve this with positional encodings — additional information added to each token that tells the model where in the sequence that token appears. The first token gets one positional signal, the second gets a different one, and so on.

Modern transformers use a technique called Rotary Position Embeddings (RoPE) or similar methods that encode relative positions — the model knows not just that a word is at position 47, but that it is three positions after another word. This relative awareness helps the model understand grammatical relationships that depend on word order.
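To make positional signals less abstract, here is a sketch of the original 2017 paper's scheme (simpler than RoPE): each position gets a distinctive pattern of sine and cosine values, which is added to that position's token representation. The dimension of 8 is chosen for readability; real models use much larger ones.

```python
import math

# The original transformer's sinusoidal positional encoding: each
# position produces a distinct pattern of sine/cosine values, so the
# model can recover word order even though it sees all words at once.
# (This is the simpler 2017 design; modern models often use relative
# schemes such as RoPE instead.)

def positional_encoding(position, dim=8):
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))   # even slots get sines
        encoding.append(math.cos(angle))   # odd slots get cosines
    return encoding

print(positional_encoding(0))  # position 0: all sines 0, all cosines 1
print(positional_encoding(1))  # position 1: a different pattern
```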

This might seem like a minor detail, but it was one of the key engineering challenges in designing the transformer. Without positional information, the model would treat every sentence as a bag of words with no order, which would make language understanding impossible.

Why Transformers Won

The transformer's dominance was not inevitable. When the "Attention Is All You Need" paper was published in 2017, it was one of many proposed architectures. But several factors converged to make it the clear winner:

  1. Scalability. Transformers scale beautifully with more data, more parameters, and more compute. The emergent abilities we discussed in Chapter 1 — where bigger models develop qualitatively new capabilities — appear more reliably in transformers than in other architectures.

  2. Parallelism. The ability to process tokens in parallel meant transformers could take full advantage of GPU hardware, making training faster and enabling the massive scale that produces the best results.

  3. Generality. The same transformer architecture works for text, code, images, audio, and video with relatively minor modifications. This generality meant that advances in one domain (like text) could be quickly applied to others.

  4. Community momentum. As transformers produced increasingly impressive results, more researchers focused on improving them, creating a virtuous cycle of innovation and improvement.

Today, virtually every frontier AI system is built on the transformer architecture. When you use any major AI chatbot, you are using a transformer.

The Limits of Transformers

Transformers are not perfect. Their biggest computational challenge is that the self-attention mechanism's cost grows with the square of the input length. Double the context window, and the attention computation takes four times as long. This is why very large context windows — even though they are now possible — are computationally expensive.

Researchers are actively working on more efficient attention mechanisms. Approaches like sparse attention (only attending to a subset of tokens), linear attention (reducing the mathematical complexity), and hybrid architectures (combining transformers with other approaches) are all being explored.

Whether transformers will remain dominant forever, or eventually be replaced by something better, is an open question. Some researchers believe the fundamental architecture will persist for years to come, while others are exploring fundamentally different approaches. History suggests that no architecture reigns forever, but for now, the transformer is king.

Why This Matters

You do not need to understand the math behind transformers to use AI effectively. But understanding the architecture at this level gives you genuine insight into why AI systems behave the way they do.

When a chatbot misunderstands a reference to something you mentioned many messages ago, you now understand that attention over very long contexts is computationally challenging. When you see claims about a new model being "more efficient," you can appreciate that efficiency improvements in the attention mechanism have direct practical benefits. When you read about hybrid architectures or transformer alternatives, you have the context to understand what is being proposed and why it might matter.

The transformer turned AI from a collection of narrow, specialized tools into a general-purpose technology. Understanding it — even at this intuitive level — is understanding the engine that drives the AI revolution.

See This in the News

While transformers dominate the AI landscape, researchers continue exploring alternative and hybrid architectures that might overcome some of the transformer's limitations. This article looks at one such effort combining transformer and RNN approaches:

AI2 OLMo: Hybrid Transformer-RNN Open Model