Chapter 7 • 9 min read • Last reviewed: May 2026

Future Frontiers & Physical AI

We are entering a new era of AI. The boundaries of what is possible are no longer just about making models bigger. Researchers are now focused on extending models into the physical world, training them to process multiple sensory modalities natively, and tackling the looming threat of the "data wall."

Native Multimodality

Early multimodal systems were "stitched" together. For example, to let an AI "see" an image, engineers would run an image-captioning model to generate a text description, and then feed that text to the LLM. This was very lossy: a text description cannot capture the precise spatial layout of a room, the emotional expression on a face, or the specific pitch of a sound.

Modern frontier models (such as Gemini) are Natively Multimodal. They are built around a single neural architecture. Text, pixels, audio waveforms, and video frames are converted into a shared mathematical representation (embeddings) and fed into the same Transformer network.

This allows the model to reason across modalities simultaneously. A native multimodal model can watch a video, listen to the speaker's sarcasm, read the slides in the background, and output a unified analysis in real time, catching nuances that stitched systems miss entirely.
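The shared-embedding idea can be sketched in a few lines. This is a toy illustration, not how any particular model is built: the width `D_MODEL`, the per-modality input sizes, and the random matrices standing in for learned encoders are all assumptions made for the example.

```python
import random

D_MODEL = 8  # hypothetical width of the shared embedding space

def make_projection(in_dim, out_dim, seed):
    """Random linear map standing in for a learned modality encoder."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, features):
    """Multiply an (out_dim x in_dim) matrix by a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]

# One projection per modality; the input sizes are illustrative assumptions.
TEXT_DIM, PATCH_DIM, AUDIO_DIM = 4, 12, 6
PROJECTIONS = {
    "text": make_projection(TEXT_DIM, D_MODEL, seed=0),
    "image": make_projection(PATCH_DIM, D_MODEL, seed=1),
    "audio": make_projection(AUDIO_DIM, D_MODEL, seed=2),
}

def build_sequence(inputs):
    """Map (modality, features) pairs to one sequence of same-width vectors."""
    return [project(PROJECTIONS[modality], features) for modality, features in inputs]

# A text token, an image patch, and an audio frame end up in one sequence.
sequence = build_sequence([
    ("text", [0.1] * TEXT_DIM),
    ("image", [0.5] * PATCH_DIM),
    ("audio", [0.2] * AUDIO_DIM),
])
```

Because every vector in `sequence` has the same width, a single Transformer could attend across all of them, which is the property that stitched pipelines lack.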

The "Data Wall" and Synthetic Data

For a decade, AI progress was fueled by feeding models more data. However, the industry is hitting a Data Wall: LLMs have already consumed almost all high-quality, publicly available human-written text on the internet.

To continue training, researchers are turning to Synthetic Data: data generated by AI models to train other AI models.

The Promise and Danger of Synthetic Data

If models train on unverified synthetic data, they risk Model Collapse, a phenomenon where errors, biases, and odd linguistic quirks compound over generations, causing the model's outputs to degrade and drift away from reality.

To prevent this, engineers use Verified Synthesis: using external environments to validate the AI's data. For example:

  • An AI generates code, which is then compiled and run against tests to verify it works. Only code that passes is used for training.
  • An AI solves a math problem. The solution is validated using formal math verifiers.
  • An AI reasons about physical properties. The scenario is run through a physics engine to make sure it follows real-world laws.
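The code-verification case above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the candidate strings are hard-coded stand-ins for model output, `passes_checks` is a hypothetical helper, and a real pipeline would execute untrusted code in a sandbox rather than with a bare `exec`.

```python
def passes_checks(source, func_name, cases):
    """Run candidate source and check it against known input/output pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # assumption: a real system would sandbox this
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all fail verification.
        return False

# Hard-coded strings standing in for model-generated samples.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",   # runs, but the logic is wrong
    "def add(a, b)\n    return a + b\n",    # does not parse
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Only candidates that pass every check are kept as training data.
verified = [src for src in candidates if passes_checks(src, "add", cases)]
```

Here only the first candidate survives. The same filter shape applies to the math and physics cases, with a formal verifier or a simulator taking the place of the test cases.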

Robotics and Physical Grounding

For AI to truly understand the world, it must interact with it. By combining multimodal LLMs with robotic control systems, researchers have developed Vision-Language-Action (VLA) models (such as Google's RT-2).

A VLA model doesn't just output text; it outputs motion commands for a robot's arm and gripper. When you tell a VLA-enabled robot arm: "Pick up the yellow banana and put it in the basket," the model processes the camera feed (pixels), matches the words to the objects it sees, works out the required movement, and emits the actions that drive the robot. The LLM acts as the robot's brain, giving it common-sense reasoning and the ability to adapt to new objects and environments without task-specific programming.
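One way a language model can emit motor commands is to discretize each action dimension into integer tokens that sit alongside ordinary text tokens. The sketch below shows that encoding and decoding step only. The bin count, the normalized range, and the seven-dimensional action layout are assumptions chosen for illustration, not a specification of any particular robot or model.

```python
NUM_BINS = 256          # assumed number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range for each dimension

def to_token(value):
    """Clip a continuous action value and map it to an integer bin."""
    clipped = max(LOW, min(HIGH, value))
    return round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1))

def from_token(token):
    """Map an integer bin back to a continuous action value."""
    return LOW + token / (NUM_BINS - 1) * (HIGH - LOW)

# Hypothetical action: end-effector deltas (x, y, z, roll, pitch, yaw) plus gripper.
action = [0.10, -0.25, 0.0, 0.0, 0.5, -1.0, 1.0]

tokens = [to_token(v) for v in action]        # what the model would emit
decoded = [from_token(t) for t in tokens]     # what the robot controller receives
```

The round trip loses at most half a bin width per dimension, which is the trade-off for letting the same network that produces words also produce actions.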

The Next Paradigm: Test-Time Compute

Pre-training scaling (adding more parameters, data, and GPUs during training) is starting to show diminishing returns. The new axis of scaling is Test-Time Compute (scaling at inference time).

Instead of forcing a model to answer within a fraction of a second, test-time compute allows the model to spend seconds or minutes thinking. By running a search procedure such as Monte Carlo Tree Search (MCTS), similar to how chess engines evaluate lines of play before moving, or by generating and revising long chains of reasoning, models can work through highly complex scientific, mathematical, and coding problems, opening up a new dimension of machine intelligence.
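The core trade of more inference compute for better answers can be shown with something much simpler than MCTS: sample several candidate answers and keep one that a verifier accepts. The sketch below is that simpler technique, not MCTS itself. The `propose` function is a random stand-in for a model, and the toy problem (find x with x * x == 1369) is an assumption chosen so that verification is exact and cheap.

```python
import random

def propose(rng):
    """Stand-in for one sampled model answer: a guess for x in x * x == 1369."""
    return rng.randint(1, 100)

def verify(x):
    """Cheap exact check of a candidate answer."""
    return x * x == 1369

def solve(budget, seed=0):
    """Sample up to `budget` candidates and return the first verified one.

    Returns (answer, attempts_used); answer is None if the budget runs out.
    """
    rng = random.Random(seed)
    for attempt in range(1, budget + 1):
        candidate = propose(rng)
        if verify(candidate):
            return candidate, attempt
    return None, budget

# A larger budget means more "thinking" and a higher chance of success.
answer, attempts = solve(budget=10_000)
```

A single guess succeeds only 1% of the time here, while a large budget makes success all but certain. Search methods such as MCTS spend the same extra compute more cleverly by exploring promising partial solutions instead of sampling independently.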
