TECHNICAL INTELLIGENCE // AI & MACHINE LEARNING IN HUMANOID ROBOTICS

What AI Powers Humanoid Robots? Foundation Models & VLAs Explained

The AI powering humanoid robots has converged around a handful of paradigms. Vision-Language-Action (VLA) models — which combine visual perception, language reasoning, and motor control in a single neural network — have emerged as the dominant architecture, adopted by Tesla (Optimus), Figure AI (Helix), and Google DeepMind (RT-2). Foundation models like Physical Intelligence's pi0 and NVIDIA's GR00T aim to create universal robot brains that work across different body types. Reinforcement learning with sim-to-real transfer remains the standard for locomotion. Below is every major company mapped to its AI approach, plus the foundation models and key concepts shaping the field.

AI approaches tracked: 12
Foundation models: 6+
Top compute platform: NVIDIA
Dominant methods: VLA / sim-to-real
Last updated: April 2026

AI APPROACHES BY COMPANY

| Company | Robot | AI approach | Key model | Training method | Compute |
| --- | --- | --- | --- | --- | --- |
| Tesla | Optimus | End-to-end neural net | FSD-derived VLA | Real-world + sim | Custom Dojo + NVIDIA H100 |
| Figure AI | Figure 03 | VLA (Helix) | Helix (custom) | Imitation learning + RL | NVIDIA GPUs |
| 1X Technologies | NEO | World model | 1X World Model | Video prediction + RL | NVIDIA |
| Physical Intelligence | Multiple | Foundation model | pi0 / pi0-FAST | Cross-embodiment | TPU + GPU |
| Boston Dynamics | Atlas | Model predictive control + ML | Hybrid classical/learned | Optimization + learning | Custom |
| UBTECH | Walker S2 | Reinforcement learning | Custom RL stack | Sim-to-real | NVIDIA Jetson |
| Xpeng | Iron | VLA/VLT | EV-derived vision | Transfer from autonomous driving | NVIDIA Orin |
| NEURA | 4NE-1 | Cognitive architecture | MAiRA system | Multimodal perception | NVIDIA |
| Agility | Digit | Reinforcement learning | Custom RL | Sim-to-real (Isaac Sim) | NVIDIA |
| Unitree | H1/G1 | RL locomotion | PPO-based | Sim-to-real | NVIDIA Jetson |
| Skild AI | Multiple | General-purpose model | Skild Brain | Cross-embodiment training | NVIDIA |
| Google DeepMind | Multiple | RT-X family | RT-2, RT-X | Large-scale multi-robot | TPU |

AI APPROACH DEEP DIVES

Tesla · Optimus
Model: FSD-derived VLA · VLA
End-to-end neural net · Real-world + sim · Custom Dojo + NVIDIA H100

Tesla Optimus uses a Vision-Language-Action architecture directly derived from its Full Self-Driving system. The end-to-end neural network processes camera input and outputs motor commands for 28+ degrees of freedom. Tesla leverages its growing Optimus Gen 3 fleet (~1,000+ units, targeting 50K-100K by end 2026) for continuous data collection, supplemented by simulation. Training runs on Tesla's custom Dojo supercomputer and NVIDIA H100 GPU clusters.

Figure AI · Figure 03
Model: Helix (custom) · VLA
VLA (Helix) · Imitation learning + RL · NVIDIA GPUs

Figure AI developed Helix, a proprietary Vision-Language-Action model purpose-built for humanoid manipulation. Helix combines visual perception, language understanding, and dexterous motor control in a unified architecture. Training uses imitation learning from human teleoperation data combined with reinforcement learning for refinement. Figure initially partnered with OpenAI for conversational capabilities, but ended that collaboration in early 2025 to bring its language and reasoning stack fully in-house with Helix.

1X Technologies · NEO
Model: 1X World Model · Foundation
World model · Video prediction + RL · NVIDIA

1X Technologies takes a world-model approach — its system learns to predict future states of the environment, then uses those predictions for planning and decision-making. The 1X World Model is trained on video data to build an internal physics simulator, enabling the robot to "imagine" the consequences of actions before executing them. This approach is combined with reinforcement learning for motor policy optimization.
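The predict-then-plan loop can be sketched in a few lines. This is an illustrative toy, not 1X's implementation: `world_model` below stands in for a learned video-prediction network, and the planner is simple random shooting over imagined rollouts.

```python
import numpy as np

# Toy stand-in for a learned dynamics model: predicts the next state
# from (state, action). A real world model would be a neural network
# trained on video; this linear rule is purely illustrative.
def world_model(state, action):
    return state + 0.1 * action

def imagine_rollout(model, state, actions):
    """Roll the model forward over a candidate action sequence."""
    for a in actions:
        state = model(state, a)
    return state  # imagined final state

def plan(model, state, goal, horizon=5, candidates=256, seed=0):
    """Random-shooting planner: sample action sequences, score each
    imagined outcome against the goal, return the best first action."""
    rng = np.random.default_rng(seed)
    seqs = rng.uniform(-1.0, 1.0, size=(candidates, horizon, state.shape[0]))
    costs = [np.linalg.norm(imagine_rollout(model, state, seq) - goal)
             for seq in seqs]
    return seqs[int(np.argmin(costs))][0]

state = np.zeros(3)
action = plan(world_model, state, goal=np.full(3, 0.2))
```

The key property is that candidate actions are evaluated entirely inside the learned model — nothing is executed on hardware until the best first action is chosen, after which the plan is recomputed.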

Physical Intelligence · Multiple
Model: pi0 / pi0-FAST · Foundation
Foundation model · Cross-embodiment · TPU + GPU

Physical Intelligence built pi0 (and its faster variant pi0-FAST), a cross-embodiment foundation model trained on data from multiple robot types — arms, quadrupeds, and humanoids. pi0 can generate control policies for robot bodies it has never seen before by abstracting over embodiment differences. The model is trained on Google TPUs and NVIDIA GPUs using large-scale, diverse robot datasets, and it represents one of the most ambitious attempts at a universal robot brain.

Boston Dynamics · Atlas
Model: Hybrid classical/learned · Hybrid
Model predictive control + ML · Optimization + learning · Custom

Boston Dynamics uses a hybrid approach combining classical model predictive control (MPC) with learned components. Atlas's whole-body control relies on optimization-based planners that solve for joint trajectories in real-time, enhanced by machine learning for perception and adaptive behavior. This approach provides strong safety guarantees and predictable dynamics that pure neural network approaches struggle to match.
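The receding-horizon idea behind MPC fits in a few lines. The toy below regulates a 1-D double integrator with a brute-force search over candidate inputs; Atlas's real controller solves a much richer whole-body optimization in real time, so treat this only as the shape of the loop, with all constants chosen for illustration.

```python
import numpy as np

DT = 0.1  # control period (s)

def step(state, u):
    """Double-integrator plant: state = [position, velocity], input u = accel."""
    pos, vel = state
    return np.array([pos + vel * DT, vel + u * DT])

def mpc_control(state, target, horizon=10, u_grid=np.linspace(-2.0, 2.0, 41)):
    """Receding-horizon control: simulate each candidate (constant) input
    over the horizon with the model, score the predicted trajectory on
    position error, speed, and effort, and return the best input."""
    best_u, best_cost = 0.0, np.inf
    for u in u_grid:
        s, cost = state.copy(), 0.0
        for _ in range(horizon):
            s = step(s, u)
            cost += (s[0] - target) ** 2 + 0.1 * s[1] ** 2 + 0.01 * u ** 2
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

# Closed loop: apply only the first action, then re-plan from the new state.
state = np.array([0.0, 0.0])
for _ in range(100):
    state = step(state, mpc_control(state, target=1.0))
```

The safety appeal mentioned above comes from this structure: every command is the first step of a full predicted trajectory whose cost (and, in real controllers, constraints) has been checked against an explicit model.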

UBTECH · Walker S2
Model: Custom RL stack · RL
Reinforcement learning · Sim-to-real · NVIDIA Jetson

UBTECH trains Walker S2 primarily through reinforcement learning with sim-to-real transfer. Locomotion and manipulation policies are learned in simulation using domain randomization to bridge the reality gap. The custom RL stack runs on NVIDIA Jetson edge compute for onboard inference. UBTECH has deployed 600+ humanoid units commercially, providing real-world feedback to improve training.

Xpeng · Iron
Model: EV-derived vision · VLA
VLA/VLT · Transfer from autonomous driving · NVIDIA Orin

Xpeng's Iron humanoid directly leverages the company's autonomous driving AI stack, transferring VLA and VLT (Vision-Language-Transformer) models from its EV lineup. The 82 DoF robot uses EV-derived visual perception trained on Xpeng's driving dataset, adapted for humanoid manipulation and navigation. Iron runs on NVIDIA Orin SoCs for onboard processing, making it a rare example of direct EV-to-humanoid AI transfer.

NEURA · 4NE-1
Model: MAiRA system · Hybrid
Cognitive architecture · Multimodal perception · NVIDIA

NEURA Robotics developed MAiRA, a cognitive AI architecture that integrates multimodal perception (vision, language, touch, proprioception) into a unified reasoning system. Unlike pure end-to-end approaches, MAiRA maintains explicit representations of the environment and supports structured reasoning about tasks. The system is designed for safe human-robot collaboration in industrial settings.

Agility · Digit
Model: Custom RL · RL
Reinforcement learning · Sim-to-real (Isaac Sim) · NVIDIA

Agility Robotics trains Digit's locomotion and manipulation policies using reinforcement learning in NVIDIA Isaac Sim. The sim-to-real pipeline uses extensive domain randomization — varying friction, mass, lighting, and sensor noise — to produce policies that transfer robustly to hardware. Agility operates a dedicated Digit factory (RoboFab) and has commercial deployments at Amazon and GXO warehouses.

Unitree · H1/G1
Model: PPO-based · RL
RL locomotion · Sim-to-real · NVIDIA Jetson

Unitree trains its H1 and G1 humanoids using Proximal Policy Optimization (PPO), a reinforcement learning algorithm, in simulation. The PPO-based policies handle locomotion including walking, running, and dynamic balancing. Unitree has demonstrated some of the fastest humanoid running speeds using this approach. Onboard inference runs on NVIDIA Jetson edge compute. The company is known for making humanoid hardware accessible at lower price points.
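PPO's defining piece is its clipped surrogate objective, which keeps each policy update close to the policy that collected the data — the property that makes it stable enough for locomotion training. Unitree's actual training code is not public; below is just the standard objective itself in NumPy.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip=0.2):
    """PPO clipped surrogate objective (a quantity to *maximize*).
    ratio = pi_new(a|s) / pi_old(a|s); taking the min of the clipped
    and unclipped terms removes any incentive to move the policy more
    than `clip` away from the data-collecting policy in one update."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Unchanged policy: the objective reduces to the mean advantage.
same = ppo_clip_loss(np.zeros(3), np.zeros(3), np.array([1.0, 2.0, 3.0]))

# A huge policy jump on a positive-advantage action is capped at 1.2x.
capped = ppo_clip_loss(np.array([10.0]), np.array([0.0]), np.array([1.0]))
```

In a full training loop this objective is maximized by gradient ascent over minibatches of simulated rollouts, alongside a value-function loss and an entropy bonus.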

Skild AI · Multiple
Model: Skild Brain · Foundation
General-purpose model · Cross-embodiment training · NVIDIA

Skild AI is building "Skild Brain," a general-purpose foundation model for robots. Like Physical Intelligence's pi0, Skild Brain is trained across multiple robot embodiments to learn generalizable control policies. The model aims to be a universal robot intelligence layer that different hardware manufacturers can deploy on their platforms. Backed by $1.83B in funding at a $14B valuation (SoftBank-led Series C, Jan 2026), Skild represents a bet that robotics AI will follow the same foundation-model trajectory as language AI.

Google DeepMind · Multiple
Model: RT-2 / RT-X · VLA
RT-X family · Large-scale multi-robot · TPU

Google DeepMind pioneered the VLA paradigm with RT-2 (Robotic Transformer 2), which showed that pre-training on web-scale vision-language data dramatically improves robot manipulation. RT-X extended this to cross-embodiment: trained on data from 22 different robot types across 21 institutions, it demonstrated positive transfer between robot platforms. RT-2 and RT-X run on Google TPUs and established the blueprint that many humanoid companies now follow.

KEY FOUNDATION MODELS FOR HUMANOID ROBOTS

| Model | Creator | Type | Key feature |
| --- | --- | --- | --- |
| pi0 | Physical Intelligence | Cross-embodiment | Works across different robot bodies |
| Helix | Figure AI | VLA | Vision-language-action for manipulation |
| RT-2 / RT-X | Google DeepMind | VLA | Web-scale vision-language transfer |
| Skild Brain | Skild AI | General-purpose | Single model for diverse robots |
| GR00T | NVIDIA | Foundation model | Isaac Sim ecosystem integration |
| 1X World Model | 1X Technologies | World model | Predicts future states for planning |

KEY CONCEPTS IN HUMANOID ROBOT AI

VLA (Vision-Language-Action)

Models that combine visual understanding with language reasoning to generate robot actions. Built on vision-language model architectures (like GPT-4V or Gemini) with an added action output layer. The dominant paradigm in humanoid AI as of 2026.
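The data flow of a VLA can be sketched with plain arrays. Everything below is a toy: random weights stand in for pretrained parameters, the dimensions are invented, and a mean-pool stands in for the transformer layers a real VLA (RT-2, Helix) would use.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_JOINT = 64, 28          # embedding width; joint count is illustrative
PATCH_DIM = 48               # each pseudo-patch: 16 pixels x 3 channels

# Random weights stand in for pretrained parameters.
W_vis = rng.normal(0.0, 0.02, size=(PATCH_DIM, D))   # patch -> token
vocab = rng.normal(0.0, 0.02, size=(1000, D))        # language token table
W_act = rng.normal(0.0, 0.02, size=(D, N_JOINT))     # token -> joint command

def vla_forward(image, token_ids):
    """VLA flow: tokenize the image, embed the language command, fuse
    both into one sequence, and decode an action. The mean-pool + tanh
    is a stand-in for the real transformer layers."""
    vis_tokens = image.reshape(-1, PATCH_DIM) @ W_vis  # 16 pseudo-patches
    lang_tokens = vocab[token_ids]
    seq = np.concatenate([vis_tokens, lang_tokens])    # one fused sequence
    pooled = np.tanh(seq.mean(axis=0))                 # transformer stand-in
    return pooled @ W_act                              # one command per joint

image = rng.random((16, 16, 3))                        # toy camera frame
command = np.array([5, 17, 42])                        # toy token IDs
actions = vla_forward(image, command)
```

The architectural point survives the simplifications: vision and language live in the same token space, so the action head can condition on both at once.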

Sim-to-Real

Training robot policies in physics simulators (NVIDIA Isaac Sim, MuJoCo, PyBullet) then transferring to physical hardware. Domain randomization — varying friction, mass, lighting, and noise — helps bridge the "reality gap." Used by nearly every humanoid company for locomotion training.
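Domain randomization itself is simple to sketch: resample the simulator's physics parameters at the start of every episode. The parameter names and ranges below are made up for illustration; a real pipeline would push these values into Isaac Sim, MuJoCo, or PyBullet.

```python
import random

# Sketch of per-episode domain randomization. Names and ranges are
# illustrative, not taken from any specific robot or simulator.
def sample_domain(rng):
    return {
        "friction":     rng.uniform(0.4, 1.2),    # ground contact friction
        "mass_scale":   rng.uniform(0.8, 1.2),    # multiplier on link masses
        "motor_delay":  rng.randint(0, 3),        # actuation latency, in steps
        "sensor_noise": rng.uniform(0.0, 0.02),   # observation noise std dev
    }

rng = random.Random(0)
# One randomized physics configuration per training episode: the policy
# never sees the same simulator twice, so it cannot overfit one physics
# setup — and the real robot becomes just one more sample from the range.
domains = [sample_domain(rng) for _ in range(1000)]
```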

Imitation Learning

Learning robot behaviors from human demonstrations, typically collected via teleoperation (a human controls the robot remotely while wearing motion capture equipment). Figure AI, Tesla, and others use imitation learning extensively for manipulation skill acquisition.
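In its simplest form (behavior cloning), imitation learning is supervised regression from observed states to demonstrated actions. A minimal sketch with a synthetic "teleoperation" dataset and a linear policy — real systems use deep networks and far richer observations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a teleoperation dataset: states the robot saw,
# and the actions the human operator commanded in those states.
states  = rng.normal(size=(500, 12))           # 12-D observations
true_W  = rng.normal(size=(12, 6))             # demonstrator's (hidden) policy
actions = states @ true_W                      # demonstrated 6-D actions

# Behavior cloning: fit a policy to (state, action) pairs by regression.
W = np.zeros((12, 6))
for _ in range(200):                           # gradient descent on MSE
    grad = states.T @ (states @ W - actions) / len(states)
    W -= 0.05 * grad

mse = np.mean((states @ W - actions) ** 2)     # imitation error
```

The RL refinement step the deep dives mention exists because cloning alone inherits the demonstrator's mistakes and drifts in states the demonstrations never covered.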

Cross-Embodiment

Training a single AI model that works across different robot body types — arms, quadrupeds, humanoids. Physical Intelligence (pi0), Skild AI (Skild Brain), and Google DeepMind (RT-X) are leading this approach. The goal is a universal robot brain that any hardware can use.
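One common trick for handling bodies with different action dimensionalities is a shared policy head sized for the largest body, conditioned on an embodiment ID and masked per body. The embodiments, DoF counts, and random weights below are illustrative, not taken from pi0 or Skild Brain.

```python
import numpy as np

# Illustrative embodiments and DoF counts (typical values, not vendor specs).
EMBODIMENTS = {"arm": 7, "quadruped": 12, "humanoid": 28}
MAX_DOF = max(EMBODIMENTS.values())
OBS_DIM = 32

rng = np.random.default_rng(0)
# One shared policy head for every body type; random weights stand in
# for a network trained on pooled cross-embodiment data.
W = rng.normal(0.0, 0.02, size=(OBS_DIM + len(EMBODIMENTS), MAX_DOF))

def act(obs, embodiment):
    """Condition the shared policy on a one-hot embodiment ID and mask
    the output down to the body's actual action dimensionality."""
    ids = list(EMBODIMENTS)
    one_hot = np.eye(len(ids))[ids.index(embodiment)]
    out = np.concatenate([obs, one_hot]) @ W   # always MAX_DOF outputs
    return out[: EMBODIMENTS[embodiment]]      # keep only this body's DoF

obs = rng.normal(size=OBS_DIM)
```

Because every body's data updates the same weights, skills learned on one platform can transfer to another — the positive-transfer effect RT-X demonstrated across 22 robot types.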

Foundation Models

Large pre-trained AI models adapted for robotics tasks. Just as GPT was pre-trained on internet text, robot foundation models are pre-trained on diverse robot data (and often internet vision-language data too) then fine-tuned for specific tasks. NVIDIA GR00T, pi0, and RT-2 are prominent examples.

BOTTOM LINE

The AI powering humanoid robots is converging on two paradigms. For manipulation and high-level reasoning, Vision-Language-Action (VLA) models are becoming the standard — Tesla, Figure AI, Google DeepMind, and Xpeng all use VLA architectures that leverage web-scale pre-training. For locomotion, reinforcement learning with sim-to-real transfer remains dominant, with NVIDIA Isaac Sim as the de facto training environment. The most ambitious trend is cross-embodiment foundation models (Physical Intelligence pi0, Skild Brain, NVIDIA GR00T) that aim to be universal robot brains working across any body type. NVIDIA is the clear infrastructure winner, supplying training compute (H100/B200 GPUs), simulation (Isaac Sim), onboard compute (Jetson Thor), and foundation models (GR00T) to the majority of the industry. The next 12-18 months will determine whether the foundation model approach — one model for all robots — wins out over company-specific VLAs optimized for individual platforms.


RELATED INTELLIGENCE

Every Humanoid Robot Company: Complete List
Humanoid Robot Stocks: Every Public Company
Who Is Buying Humanoid Robots?