If you have spent any time using AI chatbots or reading about them, you have probably encountered the word "token." Maybe you noticed that pricing pages list costs "per token," or you hit a limit and were told your conversation had "too many tokens." Perhaps you saw a product boast about supporting "1 million tokens of context."
Tokens are one of those concepts that seem technical but are actually straightforward once explained. And understanding them will help you use AI tools more effectively, understand why they cost what they do, and make sense of the spec sheets that AI companies publish.
Tokens Are Not Words
The first thing to understand is that AI models do not read words the way you do. They break text down into smaller pieces called tokens. A token is a chunk of text — sometimes a whole word, sometimes part of a word, sometimes just a single character or a punctuation mark.
Here are some examples to make this concrete:
- The word "hello" is typically one token.
- The word "uncomfortable" might be broken into three tokens: "un," "comfort," and "able."
- The word "AI" is one token.
- A common word like "the" is one token.
- An unusual or technical word like "defenestration" might be broken into several tokens: "def," "en," "est," "ration."
- A number like "2025" might be one or two tokens depending on the model.
- Spaces and punctuation are often included as part of adjacent tokens.
As a rough rule of thumb, one token is approximately three-quarters of a word in English. So 1,000 tokens is roughly 750 words. But this ratio varies by language — some languages use more tokens per word than English, and others use fewer.
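That rule of thumb can be turned into a quick back-of-the-envelope estimator. This is only an approximation for English text, not a real tokenizer; actual counts come from the model's own tokenizer and will differ from this.

```python
def estimate_tokens(text: str, tokens_per_word: float = 4 / 3) -> int:
    """Rough token estimate using the ~0.75 words-per-token rule of thumb."""
    word_count = len(text.split())
    return round(word_count * tokens_per_word)

# A 9-word sentence comes out to roughly 12 tokens:
print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 12
```

Going the other way, 750 words estimates to about 1,000 tokens, matching the ratio above.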
Why Tokenization Works This Way
You might wonder why models do not just work with whole words. There are several practical reasons.
First, there are too many words. If a model needed a separate entry for every word it might encounter, its vocabulary would need to include every word in every language, plus every name, technical term, slang word, and typo. That vocabulary would be enormous and unwieldy.
Instead, models use a fixed vocabulary of tokens — typically somewhere between 30,000 and 100,000 tokens. This vocabulary includes common words as whole tokens ("the," "and," "is"), common word pieces ("ing," "tion," "un"), individual characters, and combinations that appear frequently in the training data.
By breaking uncommon words into smaller pieces, the model can represent any word — even one it has never seen before — by combining tokens from its vocabulary. The word "cryptocurrency" might be broken into "crypt" and "ocurrency" or "crypto" and "currency," depending on the specific tokenizer. Either way, the model can handle it, even if "cryptocurrency" was not common enough to earn its own dedicated token.
Second, this approach handles multiple languages naturally. A Chinese character might be one or two tokens. An Arabic word might be several tokens. The model does not need a separate system for each language — the same tokenization approach works across all of them.
How Tokenization Is Decided
The specific way text is broken into tokens is determined by a tokenizer, which is created before training begins. The most common approach is called Byte Pair Encoding (BPE), which works roughly like this:
1. Start by treating every individual character as its own token.
2. Look through all the training data and find the two tokens that appear next to each other most frequently.
3. Merge those two tokens into a single new token.
4. Repeat steps 2 and 3 thousands of times until you reach your desired vocabulary size.
The result is a vocabulary where common words and phrases are represented by single tokens (because they got merged early in the process), while rare words are built from multiple smaller tokens.
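The merge loop described above can be sketched in a few lines of Python. This is a toy version that learns merges from a handful of words rather than a real training corpus, but the mechanics are the same: count adjacent pairs, merge the most frequent one, repeat.

```python
from collections import Counter

def bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a tiny corpus of words."""
    # Begin with every word as a sequence of single-character tokens.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent token pair occurs across the corpus.
        pairs = Counter()
        for tokens in corpus:
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged = best[0] + best[1]
        for i, tokens in enumerate(corpus):
            out, j = [], 0
            while j < len(tokens):
                if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(tokens[j])
                    j += 1
            corpus[i] = out
    return merges

# On this tiny corpus, "l"+"o" merge first, then "lo"+"w":
print(bpe_merges(["low", "lower", "lowest", "low"], 2))  # [('l', 'o'), ('lo', 'w')]
```

Notice how the shared stem "low" quickly becomes a single token, just as common words and word pieces do in real vocabularies.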
You do not need to remember these details, but it helps explain why tokenization can sometimes seem arbitrary. The boundaries between tokens are determined by statistical patterns in the training data, not by linguistic rules. The model does not know that "un" is a prefix meaning "not" — it just knows that "un" appears frequently enough to be its own token.
Context Windows: The Model's Working Memory
Now that you understand tokens, we can talk about one of the most important practical concepts in AI: the context window.
A context window is the maximum amount of text — measured in tokens — that a model can consider at one time. Think of it as the model's working memory. Everything the model needs to know to generate its next response must fit within this window: your question, the conversation history, any documents you have pasted in, and the model's own previous responses.
How Context Windows Have Grown
The growth of context windows over the past few years has been dramatic:
- Early models (2020): Context windows of about 2,000 tokens — roughly 1,500 words, or about three pages of text.
- GPT-3.5 (2022): 4,096 tokens — about 3,000 words.
- GPT-4 (2023): 8,192 tokens in the standard version, 32,768 in the extended version.
- Claude and competitors (2024): 100,000 to 200,000 tokens — enough for a short novel.
- Current frontier models (2025-2026): Some models now support 1 million tokens or more — equivalent to several full-length books.
This expansion has been transformative. With a small context window, you could ask a simple question. With a large one, you can paste an entire legal contract and ask the model to summarize it, upload a full codebase and ask it to find a bug, or have a long, detailed conversation without the model losing track of what was discussed earlier.
What Happens at the Edges
When a conversation exceeds the context window, something has to give. Different systems handle this differently:
- Some simply refuse to continue, telling you the conversation is too long.
- Some silently drop the oldest messages from the conversation, keeping only the most recent ones. This means the model may "forget" things you discussed earlier.
- Some use compression techniques to summarize older parts of the conversation, preserving the gist while reducing the token count.
This is why you might notice that in very long conversations, an AI assistant seems to forget something you told it earlier. It is not being careless — the earlier message may have been pushed out of the context window.
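The "drop the oldest messages" strategy can be sketched like this. The `count_tokens` function here is a word-count stand-in; a real system would use the provider's tokenizer.

```python
def trim_to_window(messages: list[str], max_tokens: int, count_tokens) -> list[str]:
    """Drop the oldest messages until the conversation fits the context window."""
    trimmed = list(messages)
    while trimmed and sum(count_tokens(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)  # the oldest message is "forgotten" first
    return trimmed

# Two long early messages plus a short recent one, with a 120-token budget:
history = ["first message " * 50, "second message " * 50, "latest question"]
kept = trim_to_window(history, max_tokens=120,
                      count_tokens=lambda s: len(s.split()))
# The first message no longer fits and is silently dropped.
```

This is exactly why an assistant can contradict something from early in a long chat: from the model's point of view, that message no longer exists.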
Understanding this can help you use AI tools more effectively. If you are working on a complex task with a lot of context, it can be better to start a new conversation with all the relevant information pasted in fresh, rather than relying on a long conversation history that might be getting truncated.
How Tokens Affect Pricing
If you use an AI API (a way for software to send requests to an AI model), you pay per token. And the pricing reveals something interesting: input tokens (what you send to the model) and output tokens (what the model generates) are usually priced differently, with output tokens costing more — often two to five times as much.
Why? Because generating each output token requires the model to do a full forward pass through its neural network. Input tokens are processed more efficiently because they can be handled in parallel. Each output token, by contrast, is generated one at a time, with each new token depending on all the ones that came before it.
Here is a practical example of how this plays out. Suppose you paste a 20-page document (about 10,000 tokens) into a chatbot and ask for a one-paragraph summary (about 100 tokens). You are paying for 10,000 input tokens and 100 output tokens. If you instead ask the model to rewrite the entire document in a different style, you might be paying for 10,000 input tokens and 10,000 output tokens — a much more expensive request.
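That arithmetic can be made concrete with a small sketch. The prices below are hypothetical placeholders chosen to show the shape of the calculation, not any provider's actual rates.

```python
# Hypothetical prices in dollars per million tokens (not real rates).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00  # output priced 5x input, within the 2-5x range above

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API request given per-million-token prices."""
    return (input_tokens / 1_000_000 * INPUT_PRICE
            + output_tokens / 1_000_000 * OUTPUT_PRICE)

summary = request_cost(10_000, 100)      # summarize: $0.0315
rewrite = request_cost(10_000, 10_000)   # full rewrite: $0.18
```

At these rates the rewrite costs nearly six times as much as the summary, even though the input is identical.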
For individual users on subscription plans (like ChatGPT Plus or Claude Pro), these costs are bundled into your monthly fee. But the per-token economics still affect your experience. Heavy users may hit usage caps, and certain features (like processing very long documents) may be limited because they consume so many tokens.
Why Free Tiers Differ
The economics of tokens directly explain why free tiers of different AI chatbots offer different limits. Each response a model generates costs the provider real money — for the compute required to process your input tokens and generate output tokens. Companies must balance how generous their free tier can be against the cost of serving millions of users.
This is why free tiers typically impose limits on the number of messages per day, the length of responses, or the size of documents you can upload. They may also default to smaller, cheaper models rather than the most capable (and most expensive) frontier models.
Practical Implications for Users
Understanding tokens helps you be a more effective AI user in several ways.
Writing Better Prompts
Since every token counts toward the context window, being clear and concise in your prompts is not just good communication — it is practical efficiency. A prompt that rambles for 500 words before getting to the point wastes tokens that could be used for the model's response or for additional context.
That said, do not be so terse that you leave out important information. The model can only work with what is in the context window. If you need it to account for specific constraints, preferences, or background information, include them. The goal is to be informative without being redundant.
Working with Long Documents
If you need the model to work with a long document, you have options:
- Paste the whole thing if it fits in the context window. Modern models with 100K+ token windows can handle most individual documents.
- Break it into sections if it does not fit. Process each section separately, then ask the model to synthesize the results.
- Summarize first, then work with the summary if you need to free up context window space for a detailed conversation about the document.
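The "break it into sections" option can be sketched as a simple paragraph-aligned splitter. Again, the token counter is a word-count stand-in for a real tokenizer, and the paragraph boundary (a blank line) is an assumption about the document's formatting.

```python
def chunk_document(text: str, max_tokens: int, count_tokens) -> list[str]:
    """Split a document into paragraph-aligned chunks that each fit the budget."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        candidate = current + [para]
        if current and count_tokens("\n\n".join(candidate)) > max_tokens:
            # Adding this paragraph would overflow the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Five 10-word paragraphs with a 25-token budget yield three chunks:
doc = "\n\n".join(["word " * 10] * 5)
chunks = chunk_document(doc, 25, lambda s: len(s.split()))
```

Splitting on paragraph boundaries keeps each chunk coherent, which matters when you ask the model to process each piece and then synthesize the results.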
Understanding Model Behavior in Long Conversations
In long conversations, models may start to lose track of details mentioned early on. This is not a bug — it is a consequence of how context windows work. If you notice the model contradicting something it said earlier or forgetting a constraint you set up, it may help to restate the key information.
Some users develop a habit of periodically summarizing the current state of a conversation and pasting it back in, essentially giving the model a refreshed context. This can be particularly useful for complex, multi-session projects.
Tokens in Different Languages
One important equity issue is that tokens are not created equal across languages. Because tokenizers are typically trained on data that skews heavily toward English, English text tends to be tokenized efficiently — common English words get their own tokens. Text in other languages, particularly those with non-Latin scripts, often requires more tokens to represent the same amount of information.
This means that speakers of some languages effectively get smaller context windows and pay more per word than English speakers. A sentence in Thai or Amharic might use two to three times as many tokens as an equivalent English sentence. This is an active area of improvement, with newer tokenizers designed to be more equitable across languages, but it remains a real disparity.
Tokens and Code
Programming code is tokenized differently from natural language, and this matters if you use AI for coding tasks. Code contains a lot of special characters (brackets, semicolons, operators), short variable names, and indentation. Some of these are tokenized efficiently, while others are not.
Python code, for example, uses significant whitespace for indentation. Each level of indentation consumes tokens, so a deeply nested function might use a surprising number of tokens just for its indentation. Formats like JSON and XML, which are verbose by nature, consume tokens rapidly.
Understanding this can help you structure your interactions with AI coding assistants. If you are hitting token limits, you might provide just the relevant function rather than an entire file, or ask the model to generate code in a less verbose format.
The Future of Context Windows
The trend toward larger context windows shows no sign of slowing down. Researchers are developing techniques to make models more efficient at processing long contexts, and the hardware improvements we discussed in the previous chapter are making it feasible to handle ever-larger windows.
Larger context windows open up new use cases. A model that can hold an entire codebase in context can find bugs that span multiple files. A model that can hold an entire book in context can answer questions that require synthesizing information from different chapters. A model that can hold months of email correspondence can help you prepare for a meeting by understanding the full history of a negotiation.
But larger windows also bring challenges. Models do not always pay equal attention to all parts of a long context. Research has shown that some models are better at using information at the beginning and end of their context window, with a "lost in the middle" effect where information buried in the middle of a long context is more likely to be overlooked. Addressing this is an active area of research.
Key Takeaways
Tokens and context windows might seem like technical implementation details, but they shape every interaction you have with an AI model. They determine how much information the model can work with, how much it costs, and how well it performs on different tasks.
The key points to remember:
- Tokens are chunks of text, roughly three-quarters of a word in English.
- Context windows are measured in tokens and represent the model's working memory.
- Everything — your input, the conversation history, and the model's response — must fit in the context window.
- Output tokens cost more than input tokens.
- Context windows have grown from thousands to millions of tokens in just a few years.
- Not all languages are tokenized equally, with non-English text often requiring more tokens.
In the next chapter, we will look at the architecture that makes all of this possible: the transformer.
See This in the News
Token limits and context windows directly affect the free tiers of the AI chatbots millions of people use every day. Understanding tokens helps you evaluate which service gives you the best value: