Beyond Text: AI Learns to See, Hear, and Create

For years, the AI models making headlines were text-only systems. You typed words in, and words came back out. But humans do not experience the world through text alone. We see images, hear sounds, watch videos, and write code that makes machines do things. The next frontier of AI is multimodal models — systems that can work with multiple types of information at once.

This chapter explores how AI is expanding beyond text into vision, audio, video, and code. We will look at how these capabilities work under the hood, why combining them matters, and where the technology is headed.

What "Multimodal" Actually Means

A "modality" is a type, or mode, of information. Text is one modality. Images are another. Audio, video, and code are others. A "multimodal" AI system is one that can understand and generate content across multiple modalities.

Think about how you use your smartphone. You might take a photo of a restaurant menu and ask a friend what looks good. You might send a voice message instead of typing. You might watch a video tutorial and then try to replicate what you saw. In each case, you are naturally working across multiple modalities — shifting between visual, auditory, and textual information without even thinking about it.

Early AI systems could not do this. A text model could only handle text. An image model could only handle images. You needed separate, specialized systems for each type of information, and they could not talk to each other.

Modern multimodal models change this fundamentally. A single model can look at an image, read text, listen to audio, and respond in whatever modality makes sense. This might seem like a small technical improvement, but it opens up entirely new categories of applications.

How AI Processes Images

To understand how a model "sees" an image, it helps to know that digital images are just grids of numbers. Each pixel in a photograph has numerical values representing its color — typically three numbers for red, green, and blue intensity. A standard smartphone photo might contain millions of pixels, each with these three values. So an image is, at its core, a massive spreadsheet of numbers.
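The "spreadsheet of numbers" idea can be made concrete with a toy example. Here a tiny two-by-two "image" is built out of (red, green, blue) triples, and the arithmetic shows how many raw numbers a real 12-megapixel photo contains (the 4000 x 3000 resolution is an illustrative figure for such a photo):

```python
# A digital image is just a grid of numbers. Here is a tiny 2x2 "image"
# where each pixel holds three values: (red, green, blue), each 0-255.
tiny_image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],  # bottom row: a blue pixel, a white pixel
]

height = len(tiny_image)
width = len(tiny_image[0])
channels = len(tiny_image[0][0])
print(height, width, channels)  # 2 2 3

# A 12-megapixel phone photo (about 4000 x 3000 pixels) stores this
# many individual color values:
values_in_photo = 4000 * 3000 * 3
print(f"{values_in_photo:,}")  # 36,000,000
```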

Early computer vision systems used specialized neural networks called convolutional neural networks, or CNNs. These networks were designed to scan across an image in small patches, identifying basic features like edges and corners, then combining those features into more complex patterns like shapes, textures, and eventually whole objects.

Modern multimodal models take a different approach. They use a technique called a vision transformer, which borrows ideas from the transformer architecture that powers text models. Instead of scanning an image patch by patch, the model:

  1. Divides the image into a grid of patches — think of cutting a photo into small squares, like tiles on a mosaic.
  2. Converts each patch into a numerical representation called an embedding — essentially a list of numbers that captures what that patch "means."
  3. Processes all the patches together, allowing the model to understand relationships between different parts of the image — like noticing that the thing in the upper left appears to be a hat sitting on top of the thing in the center, which appears to be a person.
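Step 1 of the recipe above, cutting the image into patches, can be sketched in a few lines. A toy four-by-four grayscale grid stands in for a real photo, and the patches are read left to right, top to bottom; real vision transformers use larger patches (14x14 or 16x16 pixels is common), but the mechanics are the same:

```python
def patchify(image, patch_size):
    """Split a grid of pixel values into square patches, reading left to
    right, top to bottom -- step 1 of the vision-transformer recipe."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

# A toy 4x4 grayscale "image" cut into 2x2 patches -> 4 patches of 4 pixels.
image = [[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11],
         [12, 13, 14, 15]]
patches = patchify(image, 2)
print(len(patches))  # 4
print(patches[0])    # [[0, 1], [4, 5]] -- the top-left patch
```

In a real model, each of these patches would then be flattened and mapped to an embedding vector (step 2) before the transformer processes them all together (step 3).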

The crucial innovation is that these visual embeddings live in the same mathematical space as text embeddings. This means the model can naturally connect what it sees in an image with the language it uses to describe things. When you upload a photo and ask "What is this?" the model maps the image into the same representation space as words, allowing it to reason about both simultaneously.
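"Living in the same mathematical space" means that similarity between an image and a piece of text can be measured directly, typically with cosine similarity between their embedding vectors. The sketch below uses tiny invented 4-dimensional embeddings purely for illustration; real models use hundreds or thousands of dimensions learned from data:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means 'pointing the
    same way', i.e. the embeddings represent similar meanings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
image_of_cat  = [0.9, 0.1, 0.3, 0.0]  # embedding of a cat photo
text_cat      = [0.8, 0.2, 0.3, 0.1]  # embedding of the word "cat"
text_airplane = [0.0, 0.9, 0.1, 0.8]  # embedding of the word "airplane"

sim_match = cosine_similarity(image_of_cat, text_cat)
sim_mismatch = cosine_similarity(image_of_cat, text_airplane)
print(sim_match > sim_mismatch)  # True: the photo sits near the right word
```

This shared-space trick is what lets a model answer "What is this?" about a photo: the image lands near the words that describe it.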

What Models Can Actually Do With Images

The practical capabilities are impressive and expanding rapidly:

Object recognition and description. Upload a photo and the model can describe what it sees in detail — identifying objects, people, settings, text within the image, and even the mood or style of the photograph.

Document understanding. Models can read and interpret complex documents like charts, graphs, receipts, handwritten notes, and technical diagrams. This is enormously useful in business settings where information is locked in visual formats.

Visual question answering. You can have a conversation about an image. "What kind of tree is this?" "Is this rash something I should worry about?" "What architectural style is this building?" The model can answer questions that require both seeing the image and knowing about the world.

Image generation. Some multimodal models can also create images from text descriptions. You describe what you want, and the model generates a picture. This uses different techniques — typically diffusion models — but the multimodal framing means the same system can both understand and create images.

Speech-to-Text and Text-to-Speech

Audio is another modality where AI has made remarkable progress. The two key capabilities are turning spoken words into text (speech-to-text, or STT) and turning text into spoken words (text-to-speech, or TTS).

Speech-to-Text

If you have ever used dictation on your phone or talked to a virtual assistant, you have used speech-to-text AI. But the current generation is dramatically better than what existed even a few years ago.

Modern speech-to-text models, like OpenAI's Whisper, can transcribe audio with accuracy approaching that of professional human transcriptionists. They handle accents, background noise, multiple speakers, and technical jargon far better than older systems.

How do they work? At a high level, the audio waveform — the squiggly line you see in audio editing software — gets converted into a visual representation called a spectrogram. A spectrogram shows how the frequencies in the audio change over time. It literally looks like an image, which means the model can apply techniques similar to those it uses for vision: break the spectrogram into patches, convert them to embeddings, and process them with a transformer.
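The waveform-to-spectrogram step can be sketched with a toy short-time Fourier transform: slice the waveform into windows and measure how strong each frequency is in each slice. This is a deliberately slow textbook version; real systems use optimized FFTs and mel-scaled filter banks, but the output has the same shape — time along one axis, frequency along the other:

```python
import cmath
import math

def spectrogram(samples, window_size):
    """Toy short-time Fourier transform: for each window of the waveform,
    compute the magnitude of each frequency component (a DFT bin)."""
    frames = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        magnitudes = []
        for k in range(window_size // 2):  # one bin per frequency
            coeff = sum(window[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                        for n in range(window_size))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames  # time on one axis, frequency on the other -- an "image"

# A pure tone whose period is exactly 16 samples, analyzed in 16-sample
# windows: all the energy should land in frequency bin 1.
samples = [math.sin(2 * math.pi * t / 16) for t in range(64)]
spec = spectrogram(samples, 16)
print(len(spec), len(spec[0]))  # 4 time slices x 8 frequency bins
loudest_bin = max(range(8), key=lambda k: spec[0][k])
print(loudest_bin)  # 1 -- the bin matching the tone's pitch
```

Stack enough of these frequency columns side by side and you get the picture-like grid that the model then patches and embeds, just as it would a photograph.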

The model learns patterns like: this particular combination of frequencies at this timing corresponds to the sound "th," and this combination corresponds to "ee," and together they make the word "the." But it operates at a much more sophisticated level than simple sound matching, using context to disambiguate words that sound alike (like "there," "their," and "they're").

Text-to-Speech

Going the other direction, modern text-to-speech systems generate remarkably natural-sounding voices. The robotic, stilted speech of early TTS systems has been replaced by AI voices that carry natural rhythm, emotion, and even personality.

These systems work by training on large datasets of human speech. The model learns the patterns of how humans speak — not just pronunciation, but pacing, emphasis, breath pauses, and the subtle variations that make speech sound natural rather than mechanical. Some systems can even clone a specific person's voice from just a few seconds of sample audio, which raises both exciting possibilities and serious ethical concerns.

The combination of STT and TTS with language models creates a powerful loop: a user speaks, the system transcribes their speech, processes it with a language model, generates a response, and speaks it back. This is how the most natural voice assistants work today.
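The speak-transcribe-respond-speak loop can be sketched as three stages wired in sequence. All three functions below are stand-ins invented for illustration: in a real system, transcribe() would call a speech-to-text model such as a Whisper-class model, respond() a language model, and synthesize() a text-to-speech model.

```python
def transcribe(audio):
    """Stand-in STT: pretend we decoded the waveform into text."""
    return audio["spoken_words"]

def respond(text):
    """Stand-in language model: return a canned reply."""
    return f"You asked: '{text}'. Here is my answer."

def synthesize(text):
    """Stand-in TTS: package text as 'audio' to play back."""
    return {"spoken_words": text}

def voice_assistant_turn(user_audio):
    text_in = transcribe(user_audio)   # speech -> text
    text_out = respond(text_in)        # text -> text
    return synthesize(text_out)        # text -> speech

reply = voice_assistant_turn({"spoken_words": "What's the weather?"})
print(reply["spoken_words"])
```

The most natural-feeling assistants stream these stages so the reply begins speaking before the full response is generated, but the pipeline shape is the same.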

Code Generation: AI Learns to Program

One of the most transformative multimodal capabilities is code generation — the ability of AI to write, understand, and debug computer programs. While code is technically text, it operates under very different rules than natural language, which is why it deserves its own discussion.

How AI Writes Code

Code generation models are trained on vast repositories of programming code, including open-source projects, documentation, and coding forums. They learn patterns like:

  • "When someone defines a function that takes a list of numbers, they often want to do something like sort, filter, or calculate a sum."
  • "After opening a database connection, you typically need to close it when you are done."
  • "This error message usually means this particular thing went wrong."

When you describe what you want a program to do in plain English, the model translates your intent into the appropriate programming syntax. It is similar to how a human translator converts between languages, but the target "language" is Python, JavaScript, or another programming language.

Practical Code Capabilities

Writing code from descriptions. You can say "Write a function that takes a list of email addresses and returns only the ones from Gmail accounts," and the model will produce working code. For straightforward tasks, this code is often correct on the first try.
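For the email example above, the output typically looks something like the following. This version is a hand-written illustration of the kind of code a model produces, not actual model output:

```python
def gmail_only(addresses):
    """Return only the addresses that belong to Gmail accounts."""
    return [addr for addr in addresses
            if addr.strip().lower().endswith("@gmail.com")]

emails = ["ana@gmail.com", "bo@example.org", "Cy@GMAIL.com"]
print(gmail_only(emails))  # ['ana@gmail.com', 'Cy@GMAIL.com']
```

Note the small touches — trimming whitespace, ignoring letter case — that a competent model tends to include because its training data is full of code that handles such edge cases.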

Explaining existing code. Upload a confusing piece of code and ask "What does this do?" The model can walk you through it line by line, translating programmer syntax into plain English.

Debugging. Paste an error message along with your code, and the model can often identify the bug and suggest a fix. This is like having an experienced programmer look over your shoulder.

Converting between languages. The model can take code written in one programming language and rewrite it in another, preserving the same functionality. This is useful when companies need to modernize old systems.

The impact on the software industry has been significant. Studies suggest that AI code assistants can improve programmer productivity by 30 to 50 percent on routine tasks, though the benefit varies greatly depending on the complexity of the work.

Video Understanding

Video adds another layer of complexity because it combines visual information with temporal information — things change over time. Understanding a video requires not just recognizing what is in each frame, but tracking how objects move, how scenes change, and how events unfold over time.

How AI Processes Video

At the simplest level, a video is a sequence of images (frames) displayed rapidly to create the illusion of motion. Early approaches to video understanding simply processed each frame independently and tried to stitch the results together. Modern approaches are more sophisticated:

Temporal attention. The model learns to track objects and actions across frames, understanding that the red car in frame 100 is the same red car in frame 200, even though it has moved and the camera angle has changed.

Event recognition. Beyond identifying objects, the model can recognize actions and events — a person throwing a ball, a car turning left, someone opening a door. This requires understanding sequences of motion, not just static snapshots.

Scene understanding. The model can comprehend the overall narrative of a video clip — this is a cooking demonstration, this is a sports highlight, this is a news broadcast — and extract relevant information accordingly.
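Before any of this analysis happens, a practical system must decide which frames to look at, since video is far too dense to process every frame. A minimal sketch of that sampling step, with illustrative numbers (many systems sample around one frame per second, though the rate varies by model and task):

```python
def sample_frames(num_frames, fps, every_seconds):
    """Pick a sparse, evenly spaced subset of frame indices to send to the
    model, rather than processing all frames."""
    step = int(fps * every_seconds)
    return list(range(0, num_frames, step))

# A 10-second clip at 30 frames/second, sampled once per second:
sampled = sample_frames(300, 30, 1)
print(len(sampled))     # 10 frames instead of 300
print(sampled[:4])      # [0, 30, 60, 90]
```

The temporal-attention machinery described above then connects what it sees across these sampled frames — recognizing, for example, that the same red car appears in several of them.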

AI Video Generation

Perhaps even more remarkable is the ability of AI to generate video from text descriptions. Systems like those developed by major AI labs can produce short video clips that look increasingly realistic. You describe a scene — "a golden retriever playing in autumn leaves in slow motion" — and the model generates a video of it.

This technology works through a combination of diffusion models (which gradually refine random noise into coherent images) extended across the time dimension. The model must ensure not just that each frame looks realistic, but that the motion between frames is smooth and physically plausible.
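The "gradually refine random noise" mechanic can be shown with a deliberately simplified toy. Here the clean signal is given directly, so only the refinement loop is on display; in a real diffusion model, a trained neural network predicts each denoising step, and it must do so consistently across every frame of the video:

```python
import random

def toy_denoise(noisy, target, strength):
    """One denoising step: nudge each value a fraction of the way toward
    the clean signal. Real diffusion models learn this step from data."""
    return [n + strength * (t - n) for n, t in zip(noisy, target)]

random.seed(0)
target = [0.0, 1.0, 0.0, -1.0]                   # the "clean" values we want
frame = [random.uniform(-2, 2) for _ in target]  # start from pure noise

for _ in range(20):                              # gradually refine the noise
    frame = toy_denoise(frame, target, 0.3)

error = max(abs(f - t) for f, t in zip(frame, target))
print(error < 0.01)  # True -- the noise has converged to the clean signal
```

Each pass removes a little more noise, which is why generation takes many steps, and why extending this across dozens of frames at once makes video so much more expensive than still images.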

The quality of AI-generated video has improved at a stunning pace. What looked obviously fake two years ago now requires careful inspection to distinguish from real footage in many cases, though longer videos and complex interactions still reveal limitations.

The Convergence of Modalities

The most important trend in multimodal AI is convergence — the movement toward single unified models that handle all modalities natively rather than bolting separate systems together.

Early multimodal systems were essentially multiple specialized models wearing a trench coat. You had a vision model, a language model, and an audio model, with some glue code connecting them. Information had to be translated between systems, and meaning was often lost in translation.

Modern architectures are moving toward true native multimodality, where the model processes all types of information in a unified way from the ground up. This means:

Better cross-modal reasoning. A natively multimodal model can understand that the emotion expressed in someone's voice, the expression on their face in a photo, and the words they wrote in a text message all relate to the same situation. It can reason across these modalities in ways that separate systems cannot.

More natural interaction. Humans naturally mix modalities in communication. We point at things while talking, sketch diagrams while explaining, and reference images while writing. Natively multimodal AI can keep up with this natural human behavior.

Emergent capabilities. When models can process multiple modalities together, they sometimes develop capabilities that were not explicitly trained for. A model trained on images and text might develop an ability to read handwriting that neither a pure vision model nor a pure text model could manage alone.

Real-World Applications

Multimodal AI is not a laboratory curiosity. It is already being deployed in practical applications that affect everyday life.

Healthcare

Doctors can upload medical images — X-rays, MRI scans, skin photographs — and get AI-assisted analysis. The model can identify potential abnormalities, compare against known patterns, and generate reports. This does not replace the doctor's judgment, but it provides a valuable second opinion, especially in settings where specialist doctors are not available.

Education

Students can photograph a math problem from their textbook, and a multimodal model can recognize the equation, solve it, and explain each step. They can upload a diagram from a biology class and ask questions about it. They can even speak their questions aloud and get audio responses, making AI tutoring more accessible.

Accessibility

Multimodal AI is a game-changer for people with disabilities. Visually impaired users can take a photo of their surroundings and get a detailed audio description. Deaf users can get real-time transcription of conversations. People with motor impairments can control applications through voice commands with much greater accuracy than before.

Content Creation

Creators can generate images for blog posts, convert podcast episodes to text articles, create video summaries of long documents, and produce content in multiple formats from a single source. This dramatically reduces the time and cost of content production.

Business Operations

Companies use multimodal AI to process invoices (understanding both the visual layout and the text), analyze customer feedback across formats (written reviews, voice calls, video testimonials), and automate quality control in manufacturing by inspecting products visually.

Limitations and Challenges

Despite the rapid progress, multimodal AI has significant limitations worth understanding.

Hallucinations carry across modalities. Just as text models sometimes make up facts, multimodal models can misidentify objects in images, transcribe audio incorrectly, or generate images that do not match the description. The errors can be harder to catch when spread across multiple modalities.

Context windows are still limited. Processing images and video consumes enormous amounts of computational resources. A single high-resolution image uses as many tokens as several pages of text. Video is even more expensive. This means there are practical limits on how much multimodal content a model can process at once.
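The token cost of an image follows directly from the patch idea earlier in the chapter: roughly one token per patch. The arithmetic below assumes a 16x16-pixel patch size, which is a common choice but varies by model, as does the exact accounting:

```python
import math

def image_token_estimate(width, height, patch_size=16):
    """Rough token cost of an image for a vision transformer: one token
    per patch. Patch size and exact accounting vary by model."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)

tokens = image_token_estimate(1024, 1024)
print(tokens)  # 4096 tokens for one 1024x1024 image
# A page of English text is on the order of 500 tokens, so a single
# high-resolution image can cost as much as several pages of text.
```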

Cultural and representation biases. Models trained primarily on English-language data and Western imagery may perform poorly on content from other cultures. A model might struggle to identify foods from South Asian cuisine or misinterpret gestures that carry different meanings in different cultures.

Deepfakes and misinformation. As image and video generation become more realistic, the potential for creating convincing fake content grows. This raises serious concerns about misinformation, fraud, and privacy, and it is driving research into detection methods and watermarking techniques.

What Comes Next

The trajectory of multimodal AI points toward systems that interact with the world in increasingly human-like ways. We are moving toward models that can:

  • Participate in video calls and understand both what is being said and what is being shown
  • Navigate physical environments by processing camera feeds and making decisions in real time
  • Create rich, multi-format content — articles with custom illustrations, presentations with narration, videos with appropriate background music — from simple descriptions

The practical implication for everyday users is that interacting with AI will feel increasingly natural. Instead of carefully crafting text prompts, you will be able to communicate with AI the same way you communicate with another person — by talking, showing, pointing, and mixing whatever combination of modalities gets your point across most effectively.


See This in the News

AI video generation is advancing at a remarkable pace, with new models producing increasingly realistic results. For a look at the cutting edge, read ByteDance Seedance 2: AI Video on AIWire.