You type a question into a chatbot. A few seconds later, words start appearing on your screen, flowing out smoothly as if someone is typing in real time. The response is coherent, relevant, and sometimes remarkably insightful.

But what is actually happening behind the scenes? How does the model go from your question to its answer? And why does it sometimes produce confident-sounding nonsense?

This chapter walks through the text generation process — called inference — step by step. Understanding it will help you make sense of both the impressive capabilities and the frustrating limitations of every AI chatbot you use.

The Inference Process: One Token at a Time

Here is something that surprises many people: despite all the sophisticated architecture we discussed in the previous chapter, a chatbot generates its response one token at a time.

When you send a message, the model processes your entire input in parallel (thanks to the transformer architecture). But then it generates its response sequentially — first token, then second token, then third, and so on. Each token is chosen based on everything that came before it: your original message plus all the tokens the model has already generated.

This is why you see text appear gradually rather than all at once. The model is not typing for dramatic effect. It is genuinely deciding each word as it goes, streaming each token to your screen the moment it is generated.

The process works like this:

  1. Your message (plus any conversation history and system instructions) is processed through the transformer's layers.
  2. At the end of this processing, the model produces a probability distribution over its entire vocabulary — a score for every possible next token.
  3. One token is selected from this distribution (we will discuss how shortly).
  4. That token is added to the sequence, and the model runs another forward pass to generate the next token.
  5. This repeats until the model generates a special "end of response" token, or until it hits a maximum length limit.

Each of these steps involves a forward pass through the transformer network. For a response that is 500 tokens long, the model performs 500 forward passes. (Production systems cache intermediate results for earlier tokens, so each pass only computes what is new, but every output token still requires a trip through all of the model's layers.) This is why generating text is computationally expensive and why output tokens typically cost more than input tokens: each output token requires its own computation.
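
The five steps above can be sketched in a few lines of Python. Everything here is illustrative: `toy_forward` is a stand-in for the billions-of-parameters network, and a real sampler would also apply the temperature and filtering adjustments discussed below.

```python
import random

def sample(probs):
    """Pick one token id from a {token_id: probability} distribution."""
    # Temperature, top-k, and top-p adjustments would be applied here.
    ids, weights = zip(*probs.items())
    return random.choices(ids, weights=weights)[0]

def generate(forward, prompt_tokens, max_new_tokens=500, eos_token=0):
    """Autoregressive loop: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = forward(tokens)        # distribution over the whole vocabulary
        next_token = sample(probs)
        tokens.append(next_token)      # fed back in for the next step
        if next_token == eos_token:    # special "end of response" token
            break
    return tokens

# Toy "model" for illustration: predicts token 7 once, then the end token.
def toy_forward(tokens):
    return {0: 1.0} if tokens[-1] == 7 else {7: 1.0}

generate(toy_forward, [1, 2, 3])   # → [1, 2, 3, 7, 0]
```

The loop stops as soon as the end-of-response token appears, which is why responses vary in length even with the same maximum limit.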

Temperature: Turning the Creativity Dial

When the model produces its probability distribution for the next token, it does not simply pick the highest-probability token every time. If it did, the same prompt would always produce the exact same response, word for word. The output would be deterministic and often repetitive.

Instead, the model's behavior is controlled by a parameter called temperature. This is one of the most important settings in text generation, and understanding it will help you get better results from AI tools.

Low temperature (close to 0) makes the model more deterministic and focused. It strongly favors the highest-probability tokens, producing responses that are more predictable, consistent, and conservative. At temperature 0, the model always picks the most probable next token.

High temperature (closer to 1 or above) makes the model more random and creative. It is more willing to choose lower-probability tokens, producing responses that are more varied, surprising, and sometimes more creative — but also more likely to go off track or produce nonsensical text.

Think of temperature as a dial that reshapes the odds of a weighted draw. At a low setting, the draw almost always lands on the statistically most common result. At a high setting, unusual combinations appear far more frequently.
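
Concretely, temperature divides the model's raw scores (called logits) before they are converted into probabilities. A minimal sketch with made-up numbers for a three-token vocabulary:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; temperature rescales them first."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                 # raw scores for three candidate tokens

cold = softmax(logits, temperature=0.2)  # ≈ [0.993, 0.007, 0.0001]: near-greedy
hot = softmax(logits, temperature=2.0)   # ≈ [0.50, 0.30, 0.19]: much flatter
```

Dividing by a small temperature exaggerates the gaps between scores, so the top token dominates; dividing by a large temperature shrinks the gaps, giving lower-ranked tokens a real chance of being picked.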

In practice, most chatbot conversations use a moderate temperature — high enough to produce natural-sounding, varied text, but low enough to keep responses coherent and on-topic. When you use a chatbot for creative writing, a higher temperature might produce more interesting results. When you use it for factual questions or code generation, a lower temperature tends to produce more accurate and reliable output.

Some AI platforms let you adjust the temperature directly. Even when they do not, understanding this concept helps explain why you sometimes get different answers to the same question, and why asking the model to "be more creative" or "be more precise" can change the style of its responses.

Sampling Strategies: Top-k and Top-p

Temperature is not the only knob that controls how tokens are selected. Two other important techniques — top-k sampling and top-p sampling — shape the generation process in complementary ways.

Top-k Sampling

Top-k sampling restricts the model's choices to the k most probable tokens at each step. If k is set to 50, the model only considers the 50 most likely next tokens, regardless of how many other tokens exist in the vocabulary.

This prevents the model from ever choosing extremely unlikely tokens. Even at high temperature, a low top-k value ensures that the model stays within a reasonable range of likely options. Without top-k, a high temperature might occasionally cause the model to select a completely random and nonsensical token, derailing the entire response.
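
A sketch of the filtering step, using a toy four-token vocabulary (the words and probabilities are invented for illustration):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

probs = {"the": 0.50, "a": 0.30, "banana": 0.15, "xylophone": 0.05}
top_k_filter(probs, k=2)   # → {"the": 0.625, "a": 0.375}
```

The unlikely tokens are removed entirely, and the survivors' probabilities are rescaled so they still sum to 1 before sampling.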

Top-p (Nucleus) Sampling

Top-p sampling, also called nucleus sampling, takes a different approach. Instead of fixing the number of tokens to consider, it considers the smallest set of tokens whose combined probabilities add up to at least p.

If top-p is set to 0.9, the model considers the most probable tokens until their cumulative probability reaches 90%, then ignores everything else. The advantage of this approach is that it adapts to the situation. When the model is highly confident about the next word (say, completing the phrase "the United States of"), the top-p set might include only one or two tokens. When the model is less certain (say, generating the next word in a creative story), the set might include hundreds of tokens.

This adaptive behavior makes top-p sampling very popular in practice. Most modern AI systems use some combination of temperature and top-p to balance creativity with coherence.
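
The adaptive behavior is easy to see in a sketch. The two toy distributions below are invented, but they mirror the confident and uncertain cases described above:

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose total mass reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

# Confident context ("the United States of ..."): one token carries the mass.
confident = {"America": 0.97, "Angola": 0.02, "Albania": 0.01}
# Uncertain context (a story opening): the mass is spread out.
uncertain = {"forest": 0.35, "castle": 0.30, "dragon": 0.20,
             "river": 0.10, "spoon": 0.05}

len(top_p_filter(confident, p=0.9))   # → 1 token survives
len(top_p_filter(uncertain, p=0.9))   # → 4 tokens survive
```

With the same p, the filter keeps one token when the model is sure and most of the candidates when it is not, which is exactly the adaptivity that makes nucleus sampling popular.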

Why This Matters to You

You do not need to remember the details of these sampling strategies. But understanding that text generation involves controlled randomness explains several things you may have noticed:

  • Why you get different responses to the same question. The randomness in the sampling process means that the model may choose different tokens each time, leading the response down different paths.
  • Why regenerating a response sometimes gives a much better answer. A different random path through the token choices can produce a significantly different result.
  • Why AI output sometimes feels formulaic. At low temperature with conservative sampling, the model gravitates toward the most common patterns in its training data.

Hallucination: When Pattern Matching Goes Wrong

One of the most discussed limitations of language models is hallucination — the tendency to generate text that sounds confident and plausible but is factually wrong, internally inconsistent, or entirely fabricated.

Understanding how text generation works helps explain why hallucination happens. The model is not looking up facts in a database. It is generating text based on patterns it learned during training. When those patterns are strong and well-supported — "The capital of France is Paris" — the model reliably produces accurate information. But when the patterns are weaker, ambiguous, or the model is asked about something at the edge of its training data, it can produce confident-sounding fabrications.

Here are some common hallucination scenarios:

Fabricated details. Ask a model about a real person, and it might invent publications they never wrote, awards they never received, or events they never participated in. The model has learned the pattern of how biographical information is typically presented, and it fills in details that sound right even when they are wrong.

Confident errors. The model might state an incorrect fact with complete confidence. It does not hedge because it has no reliable internal mechanism for distinguishing what it "knows" well from what it is guessing about. The patterns that produce hedging language ("I'm not sure, but...") are separate from the patterns that produce factual claims, and they do not always correlate with actual uncertainty.

Plausible but wrong reasoning. Ask the model to solve a logic problem, and it might produce a step-by-step solution that reads perfectly but arrives at the wrong answer. Each step follows the pattern of what reasoning looks like, but a logical error somewhere in the chain goes undetected because the model is predicting plausible next tokens, not verifying logical validity.

Citation fabrication. Ask for references, and the model might generate academic citations that look perfect — correct formatting, plausible author names, reasonable-sounding journal titles — but point to papers that do not exist. It has learned the pattern of what citations look like but cannot verify whether specific citations are real.

Why Hallucination Is Hard to Fix

Hallucination is not a simple bug that can be patched. It is a fundamental consequence of how language models work. The model generates text by predicting probable next tokens based on patterns in training data. When the correct answer is also the most probable answer, things work well. But when the most probable-sounding answer is not the correct one, the model has no reliable way to tell the difference.

Researchers have made significant progress in reducing hallucination through better training techniques, retrieval-augmented generation (connecting the model to external knowledge sources it can verify against), and training models to express uncertainty. But hallucination remains an active challenge, and users should always verify important factual claims from AI models.

The Illusion of Understanding

When you have a conversation with a modern chatbot, it genuinely feels like talking to someone who understands you. The model tracks the conversation, responds to nuance, asks clarifying questions, and produces responses that demonstrate apparent comprehension.

This raises a philosophical question that researchers, philosophers, and the general public actively debate: does the model actually understand, or is it an incredibly sophisticated pattern matcher that produces the illusion of understanding?

The honest answer is that this question does not have a settled resolution, and it may partly be a question about what we mean by "understanding" in the first place.

What we can say with confidence is that the model does not understand language the way you do. You have a lifetime of physical experience in the world. You know what it feels like to be cold, what it means to be hungry, what it is like to lose someone you care about. When you read the word "heartbreak," you connect it to a web of lived experience, emotions, and memories.

The model connects "heartbreak" to statistical patterns about how the word is used — what words tend to appear near it, what contexts it appears in, what kind of sentences follow it. This is a fundamentally different kind of connection, even if the outputs look similar.

On the other hand, the patterns the model has learned are far more sophisticated than simple word associations. Modern models can solve novel problems, make inferences that require combining information from different domains, and generate creative solutions that were not present in their training data. Whether this constitutes "understanding" or "mere pattern matching" may say more about our definitions than about the models themselves.

For practical purposes, the key takeaway is this: treat AI outputs as the work of a very capable but imperfect assistant that does not have lived experience, does not have common sense grounded in physical reality, and can be confidently wrong. Verify important claims. Use AI as a tool to augment your thinking, not a replacement for it.

System Prompts: The Hidden Instructions

When you use a chatbot like ChatGPT or Claude, your conversation does not start from nothing. Before you type your first message, the model has already received a set of instructions called a system prompt.

The system prompt is written by the company that built the chatbot, and it shapes the model's behavior in important ways. It typically includes instructions like:

  • What persona to adopt ("You are a helpful, harmless, and honest AI assistant")
  • What topics to avoid or handle carefully
  • How to format responses
  • What to do when asked about sensitive topics
  • How to handle requests that might be harmful
  • What its limitations are and when to acknowledge them

You usually cannot see the system prompt, but its effects are visible in every interaction. This is why different chatbots have different personalities even when they use similar underlying models. The system prompt is like a job description and set of guidelines that the model follows during the conversation.

How System Prompts Work Technically

From the model's perspective, the system prompt is just text that appears at the beginning of its context window. Models are typically trained to give system instructions extra weight, but nothing in the architecture makes those instructions special or inviolable: they are tokens like any other, processed through the same attention mechanism.

This is why jailbreaking attempts (trying to override the system prompt with clever user messages) sometimes work. The model weighs all the text in its context window when generating responses, and a sufficiently persuasive user message can sometimes override system prompt instructions. AI companies invest significant effort in making their models more robust against such attempts, but it remains an ongoing challenge.

It also means that system prompts have limits. They can guide the model's behavior, but they cannot give the model new knowledge or capabilities it did not learn during training. A system prompt that says "you are an expert in quantum physics" will shape the model's tone and confidence level, but it will not improve the model's actual knowledge of quantum physics beyond what it learned during training.
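
A sketch of how a chat interface might assemble the text the model actually sees. The bracketed role markers are illustrative stand-ins; real systems use model-specific special tokens for the same purpose:

```python
def build_context(system_prompt, history, user_message):
    """Assemble the full input text: system prompt, history, new message."""
    parts = [f"[SYSTEM] {system_prompt}"]
    for role, text in history:
        parts.append(f"[{role.upper()}] {text}")
    parts.append(f"[USER] {user_message}")
    parts.append("[ASSISTANT]")   # the model continues from this point
    return "\n".join(parts)

context = build_context(
    "You are a helpful assistant. Respond concisely.",
    [("user", "Hi!"), ("assistant", "Hello! How can I help?")],
    "Explain temperature in one sentence.",
)
```

Everything, including the hidden instructions, ends up in one flat token sequence, which is why system prompts guide behavior rather than enforce it.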

Custom System Prompts

Many AI platforms now allow users to write their own system prompts through features like custom instructions or custom GPTs. This is a powerful tool for tailoring the model's behavior to your needs. You might set a system prompt that says "Always respond in bullet points" or "Assume I am a software engineer and skip basic explanations" or "Focus on practical advice rather than theory."

Understanding that this is just text prepended to your conversation helps you write better custom instructions. Be clear, specific, and direct, just as you would be when instructing a human assistant.

Putting It All Together: From Question to Answer

Let us trace the complete path from your question to the chatbot's answer:

  1. You type your message. This becomes a sequence of tokens.

  2. Your message is combined with the system prompt and conversation history. All of this text forms the input to the model.

  3. The entire input is processed through the transformer's layers. The self-attention mechanism connects relevant pieces of text across the entire input, building a rich understanding of what you are asking and what kind of response would be appropriate.

  4. The model generates the first token of its response. It produces probabilities for every possible token and selects one based on the temperature and sampling settings.

  5. The selected token is appended to the input, and the process repeats. Each new token is generated based on everything that came before — your input, the system prompt, the conversation history, and all the response tokens generated so far.

  6. This continues until the model generates a stop token or hits a length limit. The complete response is then delivered to you.

The entire process typically takes a few seconds for a moderate-length response, with each token taking on the order of tens of milliseconds to generate. The streaming effect you see, with text appearing gradually, is the model generating tokens in real time.
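
The sampling pieces from earlier in the chapter compose into a single selection step (step 4 above). A minimal sketch; the function name and the example scores are illustrative:

```python
import math
import random

def select_next_token(logits, temperature=0.7, top_p=0.9):
    """One selection step: temperature scaling, softmax, nucleus cut, sample."""
    exps = {t: math.exp(score / temperature) for t, score in logits.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    kept, cumulative = {}, 0.0   # nucleus cut: keep mass up to top_p
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, weights = zip(*kept.items())
    return random.choices(tokens, weights=weights)[0]

# When one continuation dominates, the choice is effectively deterministic.
select_next_token({"Paris": 10.0, "London": 1.0, "banana": -5.0})   # → "Paris"
```

When the scores are closer together, the nucleus keeps several candidates and the final draw is genuinely random, which is where the variation between regenerated responses comes from.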

Why Responses Are Not Instant

Given that modern computers can perform billions of operations per second, you might wonder why generating a response takes several seconds rather than milliseconds. The answer lies in the sheer scale of the computation.

Each token requires a forward pass through the entire transformer network. For a frontier model with hundreds of billions of parameters, that means performing hundreds of billions of mathematical operations — per token. A 500-token response requires 500 of these passes.

Additionally, the model must load its parameters from memory for each computation, and the sheer size of the model (hundreds of gigabytes for the largest models) creates a memory bandwidth bottleneck. Even with specialized hardware, moving that much data around takes time.
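
Some back-of-envelope arithmetic shows why bandwidth, not raw compute, often sets the speed limit. The numbers below are illustrative assumptions, not the specifications of any real model or chip:

```python
# Back-of-envelope arithmetic with illustrative, assumed numbers.
params = 400e9               # a hypothetical 400-billion-parameter model
bytes_per_param = 2          # 16-bit weights
model_size_gb = params * bytes_per_param / 1e9     # 800 GB of weights

bandwidth_gb_per_s = 3000    # ~3 TB/s, roughly one high-end accelerator
# Every generated token must stream all the weights through the compute
# units, so memory bandwidth alone caps the single-request generation rate:
tokens_per_second = bandwidth_gb_per_s / model_size_gb   # 3.75 tokens/s
seconds_for_500_tokens = 500 / tokens_per_second         # ≈ 133 seconds
```

Real deployments reach interactive speeds by sharding the model across many accelerators and batching many requests together, but the underlying constraint stands: more parameters means more data to move for every single token.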

This is why there is an active race to make inference faster and cheaper. Techniques like quantization (reducing the precision of the model's numbers), speculative decoding (guessing multiple tokens ahead and verifying them in parallel), and smaller, distilled models (which approximate the performance of larger models with fewer parameters) are all aimed at making the generation process faster and more affordable.

Key Takeaways

Understanding how chatbots generate text demystifies both their impressive capabilities and their limitations:

  • Text is generated one token at a time, with each token influenced by everything that came before it.
  • Temperature and sampling control the balance between predictability and creativity.
  • Hallucination is a fundamental feature of the generation process, not a bug that will be easily fixed.
  • The apparent understanding is sophisticated pattern matching, which is remarkably effective but fundamentally different from human comprehension.
  • System prompts shape behavior but are just text in the context window, not magical constraints.

The next time you interact with a chatbot, you will have a much clearer mental model of what is happening behind the screen. The words appearing are not retrieved from a database, not copied from the internet, and not produced by a conscious entity. They are the output of an enormously complex but ultimately mechanical process — trillions of mathematical operations, guided by patterns learned from trillions of words of text, producing one token at a time.

That this mechanical process produces output that can be insightful, creative, and genuinely useful is one of the most remarkable achievements in the history of technology. That it can also be confidently wrong, subtly biased, and fundamentally lacking in true understanding is a reminder that we are still in the early chapters of the AI story.

See This in the News

The text generation process described in this chapter is what makes AI coding assistants possible. When a model generates code, it is using the same token-by-token process, guided by patterns learned from millions of code repositories. See how the latest models apply this capability to complex agentic coding tasks:

Anthropic Claude Opus 4.6: Agentic Coding