Vision-Language-Action models (VLAs) are the dominant paradigm for robot AI in 2025-2026. A VLA is a single neural network that simultaneously processes three types of information: images from the robot's cameras (vision), natural language task descriptions or commands (language), and the robot's proprioceptive state (joint angles, velocities), and produces continuous joint actions as output. This end-to-end formulation is a departure from the classical robotics pipeline, in which separate modules handle perception, planning, and control.
The conceptual breakthrough behind VLAs is the realization that large-scale pre-training on internet data — images and text — produces representations that generalize to robot manipulation with relatively modest robot-specific fine-tuning. Language models trained on trillions of text tokens have implicit knowledge of physics, affordances, and task structure. By connecting this knowledge to visual grounding and action generation, VLAs produce policies that understand instructions like "pick up the red cup and place it to the right of the bowl" without being explicitly programmed for each possible object and configuration.
Architecture
Modern VLAs build on vision-language models (VLMs) such as PaLI, Flamingo, or LLaMA-3 Vision by adding an action head — a module that converts the VLM's latent representation into robot joint targets. The action head is the critical differentiating component across architectures:
Regression heads: Simple MLP layers that output joint positions directly. Fast and deterministic, but struggle to represent multimodal action distributions — situations where multiple equally valid actions exist.
Diffusion action heads: Generate actions through an iterative denoising process, capable of representing complex action distributions. Used in π0 (Physical Intelligence); produces high-quality actions but originally required 10-50 denoising steps (significant latency).
Flow matching heads: Generate actions through continuous normalizing flows with fewer steps (2-4) than diffusion. Used in π0.2; achieves comparable quality to diffusion at a fraction of the latency. Current state of the art.
Tokenized action heads: Discretize action space into tokens and use autoregressive generation. Used in Google's RT-2 and OpenVLA. Simple to implement using standard language model architecture, but discretization introduces quantization error that limits fine manipulation precision.
Key VLA Models and Benchmarks
RT-2 (Google DeepMind, 2023): First widely reproduced VLA at scale. 55B parameter PaLI-X backbone with tokenized action head. Demonstrated emergent chain-of-thought reasoning for robot manipulation and first meaningful zero-shot generalization to novel objects and instructions.
OpenVLA (Stanford, 2024): Open-source VLA using LLaMA-2 7B backbone with action tokenization. 7B parameters versus RT-2's 55B, with approximately 65% of RT-2 performance. Important for research reproducibility; weights publicly released.
π0 (Physical Intelligence, 2024): Flow matching action head on PaliGemma vision-language backbone. First VLA designed explicitly for cross-embodiment deployment. 3B parameters.
π0.2 (Physical Intelligence, 2026): Updated π0 with improved cross-embodiment pre-training dataset (10,000+ robot-hours across 12 platforms) and faster flow matching. Achieves 94% zero-shot task success across 12 robot platforms.
Helix (Figure AI, 2025): Proprietary VLA for Figure 02/03. Integrated into robot firmware. Architecture not fully disclosed, but Figure has described a transformer backbone with proprioceptive state fusion and action diffusion head.
Limitations
VLAs fail predictably in several categories:
- Long-horizon tasks: Current VLAs plan reactively rather than hierarchically. A 30-step task requires each intermediate action to succeed; failure recovery from mid-task errors is an active research problem.
- Precise placement: Quantization error in tokenized action heads limits placement precision to approximately ±5mm — adequate for pick-and-place but insufficient for precision assembly.
- Novel objects: Generalization degrades significantly for objects very different from training data. Reflective surfaces, transparent objects, and deformable materials (fabric, cables) remain challenging.
- Force-contact tasks: VLAs primarily reason about visual observations. Tasks where the correct action depends on tactile feedback (tightening a bolt to correct torque, feeling a surface finish) are not naturally handled.
FAQ
What's the difference between a VLA and a large language model (LLM)? An LLM takes text input and produces text output. A VLA takes image + text input and produces robot joint actions as output. VLAs typically use a pre-trained VLM (which has LLM-like language understanding) as their backbone, but add a robot-specific action head and are fine-tuned on robot demonstration data. The language understanding capability from the VLM pre-training is what enables VLAs to follow natural language instructions.
Can VLAs run in real time on a humanoid robot? Yes, with current hardware. A 3B parameter VLA like π0 can run inference in 40-100ms on an NVIDIA Orin-class SoC (the compute platform used in most commercial humanoids). This supports a 10-25Hz control loop, sufficient for manipulation tasks. Larger models (55B+) require offboard compute or dedicated inference accelerators.
Do VLAs replace traditional robot programming? For many manipulation tasks, VLAs reduce or eliminate the need for traditional task-specific programming. However, safety-critical behaviors (emergency stops, joint limit enforcement, collision detection) are still implemented in traditional software layers below the VLA. VLAs operate as a high-level policy; hard safety constraints are enforced by lower-level controllers that the VLA cannot override.