Chapter 7 • 9 min read • Last reviewed: June 2026

Future Frontiers & Physical AI

We are entering a new era of artificial intelligence. The frontier is no longer just about making models bigger. Researchers are expanding models into the physical world, training them to process multiple senses natively, and shifting from one-shot answers toward agents that can take action over time.

Native Multimodality

Early multimodal systems were "stitched" together. For example, to let an AI "see" an image, engineers would run an image-captioning model to generate a text description and then feed that text to the LLM. This pipeline was lossy: a text caption cannot capture the precise spatial layout of a room, the emotional expression on a face, or the specific pitch of a sound.

Modern state-of-the-art models are increasingly Natively Multimodal. They are built with a unified architecture or tightly integrated model system. Text, pixels, audio waveforms, video frames, and tool outputs are converted into a shared mathematical language (embeddings) and routed through models that can reason across them.

Because every modality shares one representation, the model can combine evidence from all of them at once. A natively multimodal model can watch a video, pick up on the speaker's sarcastic tone, read the slides in the background, and produce a unified analysis in near real time, catching nuances that stitched systems miss.
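The idea of a shared embedding space can be sketched in a few lines. This is a toy illustration, not how any particular model is built: the feature sizes, the random projection matrices, and the names make_projection, project, and embed_sequence are all assumptions made for the example. Real systems learn these projections during training.

```python
import random

EMBED_DIM = 4  # shared embedding width (illustrative assumption)

def make_projection(in_dim, out_dim, seed):
    """Build a random out_dim x in_dim matrix; real models learn these weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, features):
    """Multiply a matrix by a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]

# One hypothetical projection per modality, each with its own input size.
PROJECTIONS = {
    "text": make_projection(3, EMBED_DIM, seed=0),
    "image": make_projection(5, EMBED_DIM, seed=1),
    "audio": make_projection(2, EMBED_DIM, seed=2),
}

def embed_sequence(items):
    """Map (modality, raw_features) pairs into one sequence of same-width vectors."""
    return [project(PROJECTIONS[modality], features) for modality, features in items]

sequence = embed_sequence([
    ("text", [1.0, 0.0, 0.0]),
    ("image", [0.2, 0.4, 0.1, 0.9, 0.5]),
    ("audio", [0.3, 0.7]),
])
print(len(sequence), [len(vector) for vector in sequence])
```

The point of the sketch is only the shape of the data: inputs of different sizes end up as vectors of the same width in one sequence, so a single downstream model can attend across all of them.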

The "Data Wall" and Synthetic Data

For roughly a decade, AI progress was fueled largely by feeding models more data. However, the industry is approaching a Data Wall: LLMs have already been trained on a large share of the high-quality, publicly available human-written text on the internet.

To continue training, researchers are turning to Synthetic Data—data generated by AI models to train other AI models.

The Promise and Danger of Synthetic Data

If models train on unverified synthetic data, they risk Model Collapse: a phenomenon where errors, biases, and linguistic quirks compound over successive generations. Rare patterns in the original data tend to disappear first, and the model's outputs become progressively narrower and less faithful to reality.
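A toy simulation can make the mechanism concrete. In this sketch the "model" is nothing more than the empirical token frequencies of its training data, and each generation is trained only on samples from the previous one. The corpus, the sample size, and the function name resample_generations are assumptions for illustration; real collapse involves far richer models, but the loss of rare items follows the same logic.

```python
import random
from collections import Counter

def resample_generations(corpus, generations, sample_size, seed=0):
    """Repeatedly 'retrain' on samples drawn from the previous generation.

    Returns the number of distinct tokens that survive at each generation.
    """
    rng = random.Random(seed)
    support_sizes = [len(set(corpus))]
    current = list(corpus)
    for _ in range(generations):
        counts = Counter(current)
        tokens = list(counts)
        weights = [counts[token] for token in tokens]
        # The next generation sees only what the previous one produced.
        current = rng.choices(tokens, weights=weights, k=sample_size)
        support_sizes.append(len(set(current)))
    return support_sizes

corpus = ["the"] * 50 + ["cat"] * 30 + ["sat"] * 15 + ["quietly"] * 4 + ["thrice"] * 1
sizes = resample_generations(corpus, generations=20, sample_size=100)
print(sizes)
```

Once a rare token fails to appear in one generation's sample, no later generation can recover it, so the vocabulary can only shrink or stay the same. That one-way loss is the simplest picture of model collapse.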

To prevent model collapse, engineers use Verified Synthesis: using external environments to validate AI-generated data before it is used for training. For example:

An AI generates code, which is then compiled and run against tests to verify that it works. Only code that passes is used for training.
An AI solves a math problem. The solution is validated using formal math verifiers.
An AI reasons about physical properties. The scenario is run through a physics engine to make sure it follows real-world laws.
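The code-verification case above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the candidate strings stand in for model output, the function name "solution" and the helpers passes_tests and filter_verified are hypothetical, and a real pipeline would run untrusted code in a sandbox rather than calling exec directly.

```python
def passes_tests(source, test_cases, func_name="solution"):
    """Execute candidate source and check it against (args, expected) pairs."""
    namespace = {}
    try:
        # WARNING: exec on untrusted code is unsafe; sandbox it in practice.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

def filter_verified(candidates, test_cases):
    """Keep only the candidates that pass every test."""
    return [source for source in candidates if passes_tests(source, test_cases)]

candidates = [
    "def solution(a, b):\n    return a + b\n",  # correct
    "def solution(a, b):\n    return a - b\n",  # runs, but wrong answer
    "def solution(a, b)\n    return a + b\n",   # does not parse
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

verified = filter_verified(candidates, tests)
print(len(verified))
```

Only the first candidate survives. The verifier is only as strong as its tests: a candidate that passes weak tests while still being wrong would slip into the training set, which is why test quality matters as much as the filter itself.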

Robotics and Physical Grounding

For AI to truly understand the world, it must interact with it. By combining multimodal LLMs with robotic control systems, researchers have developed Vision-Language-Action (VLA) models such as Google's RT-2 and Gemini Robotics.

A VLA model doesn't just output text; it outputs actions for a robot's joints and grippers. When you tell a VLA-enabled robot arm, "Pick up the yellow banana and put it in the basket," the model processes the camera feed (pixels), matches the words to the objects in view, and emits low-level actions that the robot's controller turns into motor commands. The language model acts as the robot's planning layer, giving it common-sense reasoning and adaptability to new environments without custom programming for every object.
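One way a language model can emit actions is to treat them as tokens: each continuous action value is discretized into a fixed number of bins, and the model predicts bin indices the same way it predicts words. The sketch below illustrates that idea only. The bin count of 256, the value range of -1 to 1, and the function names encode_action and decode_action are assumptions for the example, not a description of any specific robot stack.

```python
NUM_BINS = 256  # assumed number of bins per action dimension

def encode_action(values, low=-1.0, high=1.0):
    """Discretize continuous action values into integer tokens in [0, NUM_BINS)."""
    tokens = []
    for value in values:
        clipped = min(max(value, low), high)
        scaled = (clipped - low) / (high - low)  # now in [0, 1]
        tokens.append(min(int(scaled * NUM_BINS), NUM_BINS - 1))
    return tokens

def decode_action(tokens, low=-1.0, high=1.0):
    """Map tokens back to the center of their bins."""
    width = (high - low) / NUM_BINS
    return [low + (token + 0.5) * width for token in tokens]

action = [0.0, -1.0, 1.0, 0.25]  # e.g. normalized joint or gripper commands
tokens = encode_action(action)
recovered = decode_action(tokens)
print(tokens)  # [128, 0, 255, 160]
```

Decoding returns the center of each bin, so the round trip loses at most half a bin width per dimension. More bins give finer control at the cost of a larger action vocabulary.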

The Next Paradigm: Test-Time Compute

Pre-training scaling (adding more parameters, data, and compute during training) is no longer the only axis of progress. The newer axis is Test-Time Compute: scaling the compute a model spends at inference time.

Instead of forcing a model to answer within a fraction of a second, test-time compute lets the model spend extra compute planning, checking, searching, or coordinating tools. This is why frontier model releases increasingly emphasize agentic coding, computer use, document work, and scientific workflows rather than only chat benchmark scores. The practical question is becoming: how much thought should the system buy for this task?
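A simple form of test-time compute is to sample several answers and keep the most common one, so a larger sample budget buys more reliability. The sketch below assumes a hypothetical noisy_solver that stands in for a stochastic model call and is deliberately wrong on every third attempt; the function names and the failure pattern are illustrative, and majority voting is only one of several strategies (others use a verifier or search).

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def answer_with_budget(sample_fn, question, num_samples):
    """Spend num_samples model calls, then aggregate by majority vote."""
    answers = [sample_fn(question, attempt) for attempt in range(num_samples)]
    return majority_vote(answers)

def noisy_solver(question, attempt):
    """Hypothetical stand-in for a model: off by one on every third attempt."""
    a, b = question
    return a + b + (1 if attempt % 3 == 0 else 0)

fast = answer_with_budget(noisy_solver, (2, 3), num_samples=1)
careful = answer_with_budget(noisy_solver, (2, 3), num_samples=5)
print(fast, careful)  # 6 5
```

With a budget of one sample the system returns the wrong answer, 6; with five samples the correct answer, 5, wins the vote three to two. The cost is five times the inference compute, which is exactly the trade-off the question above asks teams to weigh.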

What Is Relevant Now

The frontier is becoming less about one chatbot box and more about systems that coordinate perception, memory, tools, and verification. The most relevant product questions are practical:

Can the model read the actual modality the user cares about, or is information lost in conversion?
Can synthetic data be checked by compilers, tests, simulators, formal verifiers, humans, or trusted datasets?
Can a robot or agent fail safely when perception is uncertain?
Is extra test-time compute buying real accuracy, or just slower answers?
Can the team observe and evaluate the full workflow rather than only the final response?

These questions tie the chapter together: multimodality, robotics, and reasoning are not separate trends. They are ways of giving AI systems better inputs, better actions, and better checks.

Sources

RT-2: New model translates vision and language into action — Google DeepMind
Gemini Robotics: Bringing AI into the Physical World — Google DeepMind
AI models collapse when trained on recursively generated data — Nature
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Snell et al.
Learning to Reason with LLMs — OpenAI