The AI powering humanoid robots has converged around a handful of paradigms. Vision-Language-Action (VLA) models — which combine visual perception, language reasoning, and motor control in a single neural network — have emerged as the dominant architecture, adopted by Tesla (Optimus), Figure AI (Helix), and Google DeepMind (RT-2). Foundation models like Physical Intelligence's pi0 and NVIDIA's GR00T aim to create universal robot brains that work across different body types. Reinforcement learning with sim-to-real transfer remains the standard for locomotion. Below is every major company mapped to its AI approach, plus the foundation models and key concepts shaping the field.
| COMPANY | ROBOT | AI APPROACH | KEY MODEL | TRAINING METHOD | COMPUTE |
|---|---|---|---|---|---|
| Tesla | Optimus | End-to-end neural net | FSD-derived VLA | Real-world + sim | Custom Dojo + NVIDIA H100 |
| Figure AI | Figure 03 | VLA (Helix) | Helix (custom) | Imitation learning + RL | NVIDIA GPUs |
| 1X Technologies | NEO | World model | 1X World Model | Video prediction + RL | NVIDIA |
| Physical Intelligence | Multiple | Foundation model | pi0 / pi0-FAST | Cross-embodiment | TPU + GPU |
| Boston Dynamics | Atlas | Model predictive control + ML | Hybrid classical/learned | Optimization + learning | Custom |
| UBTECH | Walker S2 | Reinforcement learning | Custom RL stack | Sim-to-real | NVIDIA Jetson |
| Xpeng | Iron | VLA/VLT | EV-derived vision | Transfer from autonomous driving | NVIDIA Orin |
| NEURA | 4NE-1 | Cognitive architecture | MAiRA system | Multimodal perception | NVIDIA |
| Agility | Digit | Reinforcement learning | Custom RL | Sim-to-real (Isaac Sim) | NVIDIA |
| Unitree | H1/G1 | RL locomotion | PPO-based | Sim-to-real | NVIDIA Jetson |
| Skild AI | Multiple | General-purpose model | Skild Brain | Cross-embodiment training | NVIDIA |
| Google DeepMind | Multiple | RT-X family | RT-2, RT-X | Large-scale multi-robot | TPU |
Tesla Optimus uses a Vision-Language-Action architecture directly derived from its Full Self-Driving system. The end-to-end neural network processes camera input and outputs motor commands for 28+ degrees of freedom. Tesla leverages its growing Optimus Gen 3 fleet (~1,000+ units, targeting 50K-100K by the end of 2026) for continuous data collection, supplemented by simulation. Training runs on Tesla's custom Dojo supercomputer and on NVIDIA H100 GPU clusters.
Figure AI developed Helix, a proprietary Vision-Language-Action model purpose-built for humanoid manipulation. Helix combines visual perception, language understanding, and dexterous motor control in a unified architecture. Training uses imitation learning from human teleoperation data combined with reinforcement learning for refinement. Figure also partnered with OpenAI for conversational capabilities layered on top of Helix.
1X Technologies takes a world-model approach — their system learns to predict future states of the environment, then uses those predictions for planning and decision-making. The 1X World Model is trained on video data to build an internal physics simulator, enabling the robot to "imagine" the consequences of actions before executing them. This approach is combined with reinforcement learning for motor policy optimization.
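The planning loop described above can be sketched in a few lines. This is a toy illustration of the world-model idea, not 1X's actual system: a hand-written point-mass transition function stands in for the learned video-prediction model, and random-shooting search stands in for whatever planner 1X actually uses.

```python
import numpy as np

def dynamics(state, action):
    """Imagined transition (toy stand-in for a learned world model):
    position advances by velocity, velocity changes by the action."""
    pos, vel = state
    return np.array([pos + vel, vel + action])

def plan(state, goal, horizon=5, n_candidates=64, seed=0):
    """Sample candidate action sequences, roll each out in 'imagination',
    and return the first action of the best-scoring sequence."""
    rng = np.random.default_rng(seed)
    best_cost, best_seq = np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        s = state.copy()
        for a in seq:                 # predict consequences before acting
            s = dynamics(s, a)
        cost = abs(s[0] - goal)       # distance to goal after the rollout
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]                # execute only the first action

state = np.array([0.0, 0.0])
action = plan(state, goal=3.0)
```

Only the first planned action is executed before replanning, which is how prediction-based planners typically stay robust to model error.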
Physical Intelligence built pi0 (and its faster variant pi0-FAST), a cross-embodiment foundation model trained on data from multiple robot types — arms, quadrupeds, and humanoids. pi0 can generate control policies for robot bodies it has never seen before by abstracting over embodiment differences. The model is trained on Google TPUs and NVIDIA GPUs using large, diverse robot datasets, and represents one of the most ambitious attempts at a universal robot brain.
Boston Dynamics uses a hybrid approach combining classical model predictive control (MPC) with learned components. Atlas's whole-body control relies on optimization-based planners that solve for joint trajectories in real-time, enhanced by machine learning for perception and adaptive behavior. This approach provides strong safety guarantees and predictable dynamics that pure neural network approaches struggle to match.
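A core building block of optimization-based whole-body control is the finite-horizon linear-quadratic regulator, which solves for control gains by backward recursion. The sketch below uses a toy double-integrator model (position/velocity with a 0.1 s timestep) purely for illustration; Atlas's actual dynamics model and cost structure are far richer and are not public.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # toy dynamics: x' = Ax + Bu
B = np.array([[0.0], [0.1]])
Q = np.diag([10.0, 1.0])                  # penalize state error
R = np.array([[0.1]])                     # penalize control effort

def lqr_gains(A, B, Q, R, horizon=50):
    """Backward Riccati recursion: returns one gain matrix per timestep."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]                    # earliest-time gain first

x = np.array([[1.0], [0.0]])              # start 1 m from target, at rest
for K in lqr_gains(A, B, Q, R):
    x = A @ x - B @ (K @ x)               # apply u = -Kx at each step
```

Solving this kind of optimization at every control tick is what gives MPC-style controllers their predictable dynamics: the trajectory is re-derived from an explicit model rather than sampled from a learned policy.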
UBTECH trains Walker S2 primarily through reinforcement learning with sim-to-real transfer. Locomotion and manipulation policies are learned in simulation using domain randomization to bridge the reality gap. The custom RL stack runs on NVIDIA Jetson edge compute for onboard inference. UBTECH has deployed 600+ humanoid units commercially, providing real-world feedback to improve training.
Xpeng's Iron humanoid directly leverages the company's autonomous driving AI stack, transferring VLA and VLT (Vision-Language-Transformer) models from its EV lineup. The 82 DoF robot uses EV-derived visual perception trained on Xpeng's driving dataset, adapted for humanoid manipulation and navigation. Onboard processing runs on NVIDIA Orin SoCs, making Iron a rare example of direct EV-to-humanoid AI transfer.
NEURA Robotics developed MAiRA, a cognitive AI architecture that integrates multimodal perception (vision, language, touch, proprioception) into a unified reasoning system. Unlike pure end-to-end approaches, MAiRA maintains explicit representations of the environment and supports structured reasoning about tasks. The system is designed for safe human-robot collaboration in industrial settings.
Agility Robotics trains Digit's locomotion and manipulation policies using reinforcement learning in NVIDIA Isaac Sim. The sim-to-real pipeline uses extensive domain randomization — varying friction, mass, lighting, and sensor noise — to produce policies that transfer robustly to hardware. Agility operates a dedicated Digit factory (RoboFab) and has commercial deployments at Amazon and GXO warehouses.
Unitree trains its H1 and G1 humanoids using Proximal Policy Optimization (PPO), a reinforcement learning algorithm, in simulation. The PPO-based policies handle locomotion including walking, running, and dynamic balancing. Unitree has demonstrated some of the fastest humanoid running speeds using this approach. Onboard inference runs on NVIDIA Jetson edge compute. The company is known for making humanoid hardware accessible at lower price points.
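The heart of PPO is its clipped surrogate objective, which limits how far each update can move the policy. The sketch below evaluates that objective on made-up numbers; in a real locomotion pipeline the probability ratios come from a policy network and the advantages from estimators like GAE.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: L = -mean(min(r*A, clip(r, 1-eps, 1+eps)*A)).
    Clipping removes the incentive to push the ratio outside [1-eps, 1+eps]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# Illustrative batch: ratios pi_new(a|s)/pi_old(a|s) and advantage estimates.
ratios = np.array([0.9, 1.0, 1.5, 0.5])
advs = np.array([1.0, -1.0, 2.0, -2.0])
loss = ppo_clip_loss(ratios, advs)   # -0.175 for this batch
```

Note how the third sample (ratio 1.5, positive advantage) is clipped to 1.2 — the update cannot exploit a large policy shift, which is what keeps PPO stable enough for long sim-to-real training runs.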
Skild AI is building "Skild Brain," a general-purpose foundation model for robots. Like Physical Intelligence's pi0, Skild Brain is trained across multiple robot embodiments to learn generalizable control policies. The model aims to be a universal robot intelligence layer that different hardware manufacturers can deploy on their platforms. Backed by $1.83B in funding at a $14B valuation (SoftBank-led Series C, Jan 2026), Skild represents a bet that robotics AI will follow the same foundation-model trajectory as language AI.
Google DeepMind pioneered the VLA paradigm with RT-2 (Robotic Transformer 2), which showed that pre-training on web-scale vision-language data dramatically improves robot manipulation. RT-X extended this to cross-embodiment: trained on data from 22 different robot types across 21 institutions, it demonstrated positive transfer between robot platforms. RT-2 and RT-X run on Google TPUs and established the blueprint that many humanoid companies now follow.
| MODEL | CREATOR | TYPE | KEY FEATURE |
|---|---|---|---|
| pi0 | Physical Intelligence | Cross-embodiment | Works across different robot bodies |
| Helix | Figure AI | VLA | Vision-language-action for manipulation |
| RT-2 / RT-X | Google DeepMind | VLA | Web-scale vision-language transfer |
| Skild Brain | Skild AI | General-purpose | Single model for diverse robots |
| GR00T | NVIDIA | Foundation model | Isaac Sim ecosystem integration |
| 1X World Model | 1X Technologies | World model | Predicts future states for planning |
Vision-Language-Action (VLA) models combine visual understanding with language reasoning to generate robot actions. They are built on vision-language model architectures (like GPT-4V or Gemini) with an added action output layer, and are the dominant paradigm in humanoid AI as of 2026.
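Schematically, a VLA forward pass fuses vision and language tokens in a shared embedding space and decodes joint commands. Everything below is an illustrative toy — the dimensions, the single self-attention layer, and the random weights are assumptions, not any company's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                          # shared embedding width (toy)
W_vis = rng.normal(size=(768, d)) * 0.02        # vision-patch projection
W_txt = rng.normal(size=(512, d)) * 0.02        # text-token projection
W_act = rng.normal(size=(d, 28)) * 0.02         # action head: 28 joint targets

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vla_step(image_patches, text_tokens):
    """image_patches: (P, 768), text_tokens: (T, 512) -> (28,) joint commands."""
    tokens = np.vstack([image_patches @ W_vis, text_tokens @ W_txt])
    attn = softmax(tokens @ tokens.T / np.sqrt(d))   # one self-attention mix
    fused = attn @ tokens
    return np.tanh(fused.mean(axis=0) @ W_act)       # pooled features -> actions

actions = vla_step(rng.normal(size=(16, 768)), rng.normal(size=(4, 512)))
```

The key architectural point survives the simplification: vision and language land in one token sequence, so the same attention mechanism that grounds an instruction also selects what to do with the hands.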
Sim-to-real transfer trains robot policies in physics simulators (NVIDIA Isaac Sim, MuJoCo, PyBullet) and then deploys them on physical hardware. Domain randomization — varying friction, mass, lighting, and noise — helps bridge the "reality gap." Nearly every humanoid company uses this approach for locomotion training.
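Domain randomization usually amounts to resampling physical and sensor parameters at the start of every training episode so the policy cannot overfit to one simulator configuration. The parameter names and ranges below are illustrative guesses, not any specific company's pipeline:

```python
import random

def randomized_sim_params(rng):
    """Sample a fresh set of simulator parameters for one training episode.
    Ranges are illustrative; real pipelines tune them per robot."""
    return {
        "friction":     rng.uniform(0.4, 1.2),    # ground contact friction
        "mass_scale":   rng.uniform(0.8, 1.2),    # +/-20% on link masses
        "motor_delay":  rng.uniform(0.00, 0.03),  # actuation latency (s)
        "sensor_noise": rng.gauss(0.0, 0.01),     # IMU noise offset
        "push_force":   rng.uniform(0.0, 50.0),   # random perturbation (N)
    }

rng = random.Random(42)
episodes = [randomized_sim_params(rng) for _ in range(1000)]
```

A policy that stays upright across all 1,000 of these sampled worlds is far more likely to tolerate the one configuration it was never trained on: reality.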
Imitation learning acquires robot behaviors from human demonstrations, typically collected via teleoperation (a human controls the robot remotely while wearing motion capture equipment). Figure AI, Tesla, and others use imitation learning extensively for manipulation skill acquisition.
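The simplest form of imitation learning is behavior cloning: regress the robot's actions directly onto the demonstrated ones. The sketch below uses a linear policy and synthetic demonstrations in place of a real network and motion-capture logs:

```python
import numpy as np

rng = np.random.default_rng(0)
true_W = rng.normal(size=(8, 4))       # hidden "expert" state->action mapping
states = rng.normal(size=(500, 8))     # demo observations (teleoperation logs)
actions = states @ true_W              # demo actions from the human operator

# Fit a linear policy by gradient descent on mean squared error.
W = np.zeros((8, 4))
lr = 0.01
for _ in range(2000):
    pred = states @ W
    grad = states.T @ (pred - actions) / len(states)
    W -= lr * grad

mse = float(((states @ W - actions) ** 2).mean())
```

Behavior cloning alone inherits the demonstrator's coverage gaps, which is why companies like Figure pair it with reinforcement learning for refinement.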
Cross-embodiment training produces a single AI model that works across different robot body types — arms, quadrupeds, humanoids. Physical Intelligence (pi0), Skild AI (Skild Brain), and Google DeepMind (RT-X) are leading this approach. The goal is a universal robot brain that any hardware can use.
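One common pattern behind cross-embodiment models is to pad every robot's action space to a shared maximum and condition the policy on an embodiment code, so one set of weights serves arms, quadrupeds, and humanoids. The dimensions and the one-layer "policy" below are illustrative assumptions, not pi0's or Skild Brain's design:

```python
import numpy as np

MAX_DOF = 32
EMBODIMENTS = {"arm": 7, "quadruped": 12, "humanoid": 28}  # DoF per body

rng = np.random.default_rng(0)
embed = {name: rng.normal(size=16) for name in EMBODIMENTS}  # embodiment codes
W = rng.normal(size=(64 + 16, MAX_DOF)) * 0.05               # shared weights

def act(obs, embodiment):
    """obs: (64,) fused perception features -> actions for this body type.
    The shared policy predicts all MAX_DOF outputs; we keep this body's slice."""
    x = np.concatenate([obs, embed[embodiment]])
    full = np.tanh(x @ W)
    return full[:EMBODIMENTS[embodiment]]

arm_cmd = act(rng.normal(size=64), "arm")            # 7 joint commands
humanoid_cmd = act(rng.normal(size=64), "humanoid")  # 28 joint commands
```

The same forward pass serves both bodies; only the embodiment code and the output mask change, which is what lets gradients from one robot's data improve behavior on another's.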
Foundation models are large pre-trained AI models adapted for robotics tasks. Just as GPT was pre-trained on internet text, robot foundation models are pre-trained on diverse robot data (and often internet vision-language data too), then fine-tuned for specific tasks. NVIDIA GR00T, pi0, and RT-2 are prominent examples.
The AI powering humanoid robots is converging on two paradigms. For manipulation and high-level reasoning, Vision-Language-Action (VLA) models are becoming the standard — Tesla, Figure AI, Google DeepMind, and Xpeng all use VLA architectures that leverage web-scale pre-training. For locomotion, reinforcement learning with sim-to-real transfer remains dominant, with NVIDIA Isaac Sim as the de facto training environment. The most ambitious trend is cross-embodiment foundation models (Physical Intelligence pi0, Skild Brain, NVIDIA GR00T) that aim to be universal robot brains working across any body type. NVIDIA is the clear infrastructure winner, supplying training compute (H100/B200 GPUs), simulation (Isaac Sim), onboard compute (Jetson Thor), and foundation models (GR00T) to the majority of the industry. The next 12-18 months will determine whether the foundation model approach — one model for all robots — wins out over company-specific VLAs optimized for individual platforms.