The AI powering humanoid robots has converged around a handful of paradigms. Vision-Language-Action (VLA) models — which combine visual perception, language reasoning, and motor control in a single neural network — have emerged as the dominant architecture, adopted by Tesla (Optimus), Figure AI (Helix), and Google DeepMind (RT-2). Foundation models like Physical Intelligence's pi0 and NVIDIA's GR00T aim to create universal robot brains that work across different body types. Reinforcement learning with sim-to-real transfer remains the standard for locomotion. Below is every major company mapped to its AI approach, plus the foundation models and key concepts shaping the field.
| COMPANY | ROBOT | AI APPROACH | KEY MODEL | TRAINING METHOD | COMPUTE |
|---|---|---|---|---|---|
| Tesla | Optimus | End-to-end neural net | FSD-derived VLA | Real-world + sim | Custom Dojo + NVIDIA H100 |
| Figure AI | Figure 03 | VLA (Helix) | Helix (custom) | Imitation learning + RL | NVIDIA GPUs |
| 1X Technologies | NEO | World model | 1X World Model | Video prediction + RL | NVIDIA |
| Physical Intelligence | Multiple | Foundation model | pi0 / pi0-FAST | Cross-embodiment | TPU + GPU |
| Boston Dynamics | Atlas | Model predictive control + ML | Hybrid classical/learned | Optimization + learning | Custom |
| UBTECH | Walker S2 | Reinforcement learning | Custom RL stack | Sim-to-real | NVIDIA Jetson |
| Xpeng | Iron | VLA/VLT | EV-derived vision | Transfer from autonomous driving | NVIDIA Orin |
| NEURA | 4NE-1 | Cognitive architecture | MAiRA system | Multimodal perception | NVIDIA |
| Agility | Digit | Reinforcement learning | Custom RL | Sim-to-real (Isaac Sim) | NVIDIA |
| Unitree | H1/G1 | RL locomotion | PPO-based | Sim-to-real | NVIDIA Jetson |
| Skild AI | Multiple | General-purpose model | Skild Brain | Cross-embodiment training | NVIDIA |
| Google DeepMind | Multiple | RT-X family | RT-2, RT-X | Large-scale multi-robot | TPU |
Tesla Optimus uses a Vision-Language-Action architecture directly derived from its Full Self-Driving system. The end-to-end neural network processes camera input and outputs motor commands for 28+ degrees of freedom. Tesla leverages its growing Optimus Gen 3 fleet (~1,000+ units, targeting 50K-100K by the end of 2026) for continuous data collection, supplemented by simulation. Training runs on Tesla's custom Dojo supercomputer and on NVIDIA H100 GPU clusters.
Figure AI developed Helix, a proprietary Vision-Language-Action model purpose-built for humanoid manipulation. Helix combines visual perception, language understanding, and dexterous motor control in a unified architecture. Training uses imitation learning from human teleoperation data combined with reinforcement learning for refinement. Figure also partnered with OpenAI for conversational capabilities layered on top of Helix.
1X Technologies takes a world-model approach — their system learns to predict future states of the environment, then uses those predictions for planning and decision-making. The 1X World Model is trained on video data to build an internal physics simulator, enabling the robot to "imagine" the consequences of actions before executing them. This approach is combined with reinforcement learning for motor policy optimization.
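The planning loop described above can be sketched in a few lines. This is a toy illustration of the world-model idea, not 1X's actual system: a hand-written point-mass transition function stands in for the learned video-prediction model, and random-shooting search stands in for whatever planner 1X actually uses.

```python
import numpy as np

def dynamics(state, action):
    """Imagined transition (toy stand-in for a learned world model):
    position advances by velocity, velocity changes by the action."""
    pos, vel = state
    return np.array([pos + vel, vel + action])

def plan(state, goal, horizon=5, n_candidates=64, seed=0):
    """Sample candidate action sequences, roll each out in 'imagination',
    and return the first action of the best-scoring sequence."""
    rng = np.random.default_rng(seed)
    best_cost, best_seq = np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        s = state.copy()
        for a in seq:                 # predict consequences before acting
            s = dynamics(s, a)
        cost = abs(s[0] - goal)       # distance to goal after the rollout
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]                # execute only the first action

state = np.array([0.0, 0.0])
action = plan(state, goal=3.0)
```

Only the first planned action is executed before replanning, which is how prediction-based planners typically stay robust to model error.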
Physical Intelligence built pi0 (and its faster variant pi0-FAST), a cross-embodiment foundation model trained on data from multiple robot types — arms, quadrupeds, and humanoids. pi0 can generate control policies for robot bodies it has never seen before by abstracting over embodiment differences. The model is trained on Google TPUs and NVIDIA GPUs using large, diverse robot datasets, and represents one of the most ambitious attempts at a universal robot brain.
Boston Dynamics uses a hybrid approach combining classical model predictive control (MPC) with learned components. Atlas's whole-body control relies on optimization-based planners that solve for joint trajectories in real-time, enhanced by machine learning for perception and adaptive behavior. This approach provides strong safety guarantees and predictable dynamics that pure neural network approaches struggle to match.
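A core building block of optimization-based whole-body control is the finite-horizon linear-quadratic regulator, which solves for control gains by backward recursion. The sketch below uses a toy double-integrator model (position/velocity with a 0.1 s timestep) purely for illustration; Atlas's actual dynamics model and cost structure are far richer and are not public.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # toy dynamics: x' = Ax + Bu
B = np.array([[0.0], [0.1]])
Q = np.diag([10.0, 1.0])                  # penalize state error
R = np.array([[0.1]])                     # penalize control effort

def lqr_gains(A, B, Q, R, horizon=50):
    """Backward Riccati recursion: returns one gain matrix per timestep."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]                    # earliest-time gain first

x = np.array([[1.0], [0.0]])              # start 1 m from target, at rest
for K in lqr_gains(A, B, Q, R):
    x = A @ x - B @ (K @ x)               # apply u = -Kx at each step
```

Solving this kind of optimization at every control tick is what gives MPC-style controllers their predictable dynamics: the trajectory is re-derived from an explicit model rather than sampled from a learned policy.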
UBTECH trains Walker S2 primarily through reinforcement learning with sim-to-real transfer. Locomotion and manipulation policies are learned in simulation using domain randomization to bridge the reality gap. The custom RL stack runs on NVIDIA Jetson edge compute for onboard inference. UBTECH has deployed 600+ humanoid units commercially, providing real-world feedback to improve training.
Xpeng's Iron humanoid directly leverages the company's autonomous driving AI stack, transferring VLA and VLT (Vision-Language-Transformer) models from its EV lineup. The 82 DoF robot uses EV-derived visual perception trained on Xpeng's driving dataset, adapted for humanoid manipulation and navigation. Onboard processing runs on NVIDIA Orin SoCs, making Iron a rare example of direct EV-to-humanoid AI transfer.
NEURA Robotics developed MAiRA, a cognitive AI architecture that integrates multimodal perception (vision, language, touch, proprioception) into a unified reasoning system. Unlike pure end-to-end approaches, MAiRA maintains explicit representations of the environment and supports structured reasoning about tasks. The system is designed for safe human-robot collaboration in industrial settings.
Agility Robotics trains Digit's locomotion and manipulation policies using reinforcement learning in NVIDIA Isaac Sim. The sim-to-real pipeline uses extensive domain randomization — varying friction, mass, lighting, and sensor noise — to produce policies that transfer robustly to hardware. Agility operates a dedicated Digit factory (RoboFab) and has commercial deployments at Amazon and GXO warehouses.
Unitree trains its H1 and G1 humanoids using Proximal Policy Optimization (PPO), a reinforcement learning algorithm, in simulation. The PPO-based policies handle locomotion including walking, running, and dynamic balancing. Unitree has demonstrated some of the fastest humanoid running speeds using this approach. Onboard inference runs on NVIDIA Jetson edge compute. The company is known for making humanoid hardware accessible at lower price points.
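The heart of PPO is its clipped surrogate objective, which limits how far each update can move the policy. The sketch below evaluates that objective on made-up numbers; in a real locomotion pipeline the probability ratios come from a policy network and the advantages from estimators like GAE.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: L = -mean(min(r*A, clip(r, 1-eps, 1+eps)*A)).
    Clipping removes the incentive to push the ratio outside [1-eps, 1+eps]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# Illustrative batch: ratios pi_new(a|s)/pi_old(a|s) and advantage estimates.
ratios = np.array([0.9, 1.0, 1.5, 0.5])
advs = np.array([1.0, -1.0, 2.0, -2.0])
loss = ppo_clip_loss(ratios, advs)   # -0.175 for this batch
```

Note how the third sample (ratio 1.5, positive advantage) is clipped to 1.2 — the update cannot exploit a large policy shift, which is what keeps PPO stable enough for long sim-to-real training runs.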
Skild AI is building "Skild Brain," a general-purpose foundation model for robots. Like Physical Intelligence's pi0, Skild Brain is trained across multiple robot embodiments to learn generalizable control policies. The model aims to be a universal robot intelligence layer that different hardware manufacturers can deploy on their platforms. Backed by $1.83B in funding at a $14B valuation (SoftBank-led Series C, Jan 2026), Skild represents a bet that robotics AI will follow the same foundation-model trajectory as language AI.
Google DeepMind pioneered the VLA paradigm with RT-2 (Robotic Transformer 2), which showed that pre-training on web-scale vision-language data dramatically improves robot manipulation. RT-X extended this to cross-embodiment: trained on data from 22 different robot types across 21 institutions, it demonstrated positive transfer between robot platforms. RT-2 and RT-X run on Google TPUs and established the blueprint that many humanoid companies now follow.
| MODEL | CREATOR | TYPE | KEY FEATURE |
|---|---|---|---|
| pi0 | Physical Intelligence | Cross-embodiment | Works across different robot bodies |
| Helix | Figure AI | VLA | Vision-language-action for manipulation |
| RT-2 / RT-X | Google DeepMind | VLA | Web-scale vision-language transfer |
| Skild Brain | Skild AI | General-purpose | Single model for diverse robots |
| GR00T | NVIDIA | Foundation model | Isaac Sim ecosystem integration |
| 1X World Model | 1X Technologies | World model | Predicts future states for planning |
Vision-Language-Action (VLA) models combine visual understanding with language reasoning to generate robot actions. They are built on vision-language model architectures (like GPT-4V or Gemini) with an added action output layer, and are the dominant paradigm in humanoid AI as of 2026.
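Schematically, a VLA forward pass fuses vision and language tokens in a shared embedding space and decodes joint commands. Everything below is an illustrative toy — the dimensions, the single self-attention layer, and the random weights are assumptions, not any company's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                          # shared embedding width (toy)
W_vis = rng.normal(size=(768, d)) * 0.02        # vision-patch projection
W_txt = rng.normal(size=(512, d)) * 0.02        # text-token projection
W_act = rng.normal(size=(d, 28)) * 0.02         # action head: 28 joint targets

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vla_step(image_patches, text_tokens):
    """image_patches: (P, 768), text_tokens: (T, 512) -> (28,) joint commands."""
    tokens = np.vstack([image_patches @ W_vis, text_tokens @ W_txt])
    attn = softmax(tokens @ tokens.T / np.sqrt(d))   # one self-attention mix
    fused = attn @ tokens
    return np.tanh(fused.mean(axis=0) @ W_act)       # pooled features -> actions

actions = vla_step(rng.normal(size=(16, 768)), rng.normal(size=(4, 512)))
```

The key architectural point survives the simplification: vision and language land in one token sequence, so the same attention mechanism that grounds an instruction also selects what to do with the hands.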
Sim-to-real transfer trains robot policies in physics simulators (NVIDIA Isaac Sim, MuJoCo, PyBullet) and then deploys them on physical hardware. Domain randomization — varying friction, mass, lighting, and noise — helps bridge the "reality gap." Nearly every humanoid company uses this approach for locomotion training.
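Domain randomization usually amounts to resampling physical and sensor parameters at the start of every training episode so the policy cannot overfit to one simulator configuration. The parameter names and ranges below are illustrative guesses, not any specific company's pipeline:

```python
import random

def randomized_sim_params(rng):
    """Sample a fresh set of simulator parameters for one training episode.
    Ranges are illustrative; real pipelines tune them per robot."""
    return {
        "friction":     rng.uniform(0.4, 1.2),    # ground contact friction
        "mass_scale":   rng.uniform(0.8, 1.2),    # +/-20% on link masses
        "motor_delay":  rng.uniform(0.00, 0.03),  # actuation latency (s)
        "sensor_noise": rng.gauss(0.0, 0.01),     # IMU noise offset
        "push_force":   rng.uniform(0.0, 50.0),   # random perturbation (N)
    }

rng = random.Random(42)
episodes = [randomized_sim_params(rng) for _ in range(1000)]
```

A policy that stays upright across all 1,000 of these sampled worlds is far more likely to tolerate the one configuration it was never trained on: reality.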
Imitation learning acquires robot behaviors from human demonstrations, typically collected via teleoperation (a human controls the robot remotely while wearing motion capture equipment). Figure AI, Tesla, and others use imitation learning extensively for manipulation skill acquisition.
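The simplest form of imitation learning is behavior cloning: regress the robot's actions directly onto the demonstrated ones. The sketch below uses a linear policy and synthetic demonstrations in place of a real network and motion-capture logs:

```python
import numpy as np

rng = np.random.default_rng(0)
true_W = rng.normal(size=(8, 4))       # hidden "expert" state->action mapping
states = rng.normal(size=(500, 8))     # demo observations (teleoperation logs)
actions = states @ true_W              # demo actions from the human operator

# Fit a linear policy by gradient descent on mean squared error.
W = np.zeros((8, 4))
lr = 0.01
for _ in range(2000):
    pred = states @ W
    grad = states.T @ (pred - actions) / len(states)
    W -= lr * grad

mse = float(((states @ W - actions) ** 2).mean())
```

Behavior cloning alone inherits the demonstrator's coverage gaps, which is why companies like Figure pair it with reinforcement learning for refinement.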
Cross-embodiment training produces a single AI model that works across different robot body types — arms, quadrupeds, humanoids. Physical Intelligence (pi0), Skild AI (Skild Brain), and Google DeepMind (RT-X) are leading this approach. The goal is a universal robot brain that any hardware can use.
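One common pattern behind cross-embodiment models is to pad every robot's action space to a shared maximum and condition the policy on an embodiment code, so one set of weights serves arms, quadrupeds, and humanoids. The dimensions and the one-layer "policy" below are illustrative assumptions, not pi0's or Skild Brain's design:

```python
import numpy as np

MAX_DOF = 32
EMBODIMENTS = {"arm": 7, "quadruped": 12, "humanoid": 28}  # DoF per body

rng = np.random.default_rng(0)
embed = {name: rng.normal(size=16) for name in EMBODIMENTS}  # embodiment codes
W = rng.normal(size=(64 + 16, MAX_DOF)) * 0.05               # shared weights

def act(obs, embodiment):
    """obs: (64,) fused perception features -> actions for this body type.
    The shared policy predicts all MAX_DOF outputs; we keep this body's slice."""
    x = np.concatenate([obs, embed[embodiment]])
    full = np.tanh(x @ W)
    return full[:EMBODIMENTS[embodiment]]

arm_cmd = act(rng.normal(size=64), "arm")            # 7 joint commands
humanoid_cmd = act(rng.normal(size=64), "humanoid")  # 28 joint commands
```

The same forward pass serves both bodies; only the embodiment code and the output mask change, which is what lets gradients from one robot's data improve behavior on another's.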
Foundation models are large pre-trained AI models adapted for robotics tasks. Just as GPT was pre-trained on internet text, robot foundation models are pre-trained on diverse robot data (and often internet vision-language data too), then fine-tuned for specific tasks. NVIDIA GR00T, pi0, and RT-2 are prominent examples.
The AI powering humanoid robots is converging on two paradigms. For manipulation and high-level reasoning, Vision-Language-Action (VLA) models are becoming the standard — Tesla, Figure AI, Google DeepMind, and Xpeng all use VLA architectures that leverage web-scale pre-training. For locomotion, reinforcement learning with sim-to-real transfer remains dominant, with NVIDIA Isaac Sim as the de facto training environment. The most ambitious trend is cross-embodiment foundation models (Physical Intelligence pi0, Skild Brain, NVIDIA GR00T) that aim to be universal robot brains working across any body type. NVIDIA is the clear infrastructure winner, supplying training compute (H100/B200 GPUs), simulation (Isaac Sim), onboard compute (Jetson Thor), and foundation models (GR00T) to the majority of the industry. The next 12-18 months will determine whether the foundation model approach — one model for all robots — wins out over company-specific VLAs optimized for individual platforms.