Key academic papers shaping the development of humanoid robots: locomotion, manipulation, sim-to-real transfer, vision-language-action (VLA) models, and tactile sensing.
A staged training approach that sidesteps the pitfalls of directly mixing human and robot data. Ψ₀ first pre-trains a VLM backbone on 800 hours of egocentric human manipulation video, then post-trains a flow-based action expert on just 30 hours of high-quality humanoid robot data. The complete ecosystem — training pipelines, model weights, and inference engines — is fully open-sourced.
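The split between a pretrained VLM backbone and a small flow-based action expert can be made concrete with the conditional flow-matching objective commonly used for such action heads. Below is a minimal sketch assuming frozen stage-1 VLM features and a linear-interpolation probability path; all module names and dimensions are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Small network predicting the velocity field that transports noise to actions."""
    def __init__(self, ctx_dim=512, act_dim=26, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + act_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, ctx, noisy_action, t):
        return self.net(torch.cat([ctx, noisy_action, t], dim=-1))

def flow_matching_loss(expert, vlm_ctx, action):
    """Linear-path conditional flow matching: x_t = (1 - t) * noise + t * action,
    with target velocity (action - noise)."""
    noise = torch.randn_like(action)
    t = torch.rand(action.shape[0], 1)
    x_t = (1 - t) * noise + t * action
    v_pred = expert(vlm_ctx, x_t, t)
    return ((v_pred - (action - noise)) ** 2).mean()

expert = ActionExpert()
vlm_ctx = torch.randn(8, 512)   # features from the frozen, stage-1 pretrained VLM
action = torch.randn(8, 26)     # hypothetical 26-DoF humanoid action targets
flow_matching_loss(expert, vlm_ctx, action).backward()
```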
XHugWBC trains a single policy that generalizes whole-body locomotion and manipulation across diverse humanoid hardware without robot-specific retraining. Key innovations include physics-consistent morphological randomization and semantically aligned observation and action spaces across embodiments. Validated across 12 simulated and 7 real-world humanoid platforms.
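A minimal sketch of what physics-consistent morphological randomization can look like: when a link's mass and length are rescaled, its inertia is rescaled as m * l^2 rather than sampled independently, so the randomized body stays physically plausible. The dataclass and ranges below are illustrative assumptions, not XHugWBC's API.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Link:
    mass: float      # kg
    length: float    # m
    inertia: float   # kg*m^2, about the joint axis

def randomize_link(link: Link, mass_range=(0.8, 1.2), len_range=(0.9, 1.1)) -> Link:
    m_scale = random.uniform(*mass_range)
    l_scale = random.uniform(*len_range)
    # Inertia follows the sampled mass and length instead of being randomized
    # on its own: I' = I * m_scale * l_scale^2.
    return replace(
        link,
        mass=link.mass * m_scale,
        length=link.length * l_scale,
        inertia=link.inertia * m_scale * l_scale ** 2,
    )

shin = Link(mass=2.1, length=0.38, inertia=0.025)
print(randomize_link(shin))
```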
A unified latent VLA framework for simultaneous locomotion and manipulation. The model learns from large quantities of action-free egocentric video paired with a loco-manipulation RL policy — dramatically reducing the cost of training data collection. Validated on the AgiBot X2 humanoid.
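Latent VLA frameworks of this kind typically mine pseudo-actions from action-free video by quantizing the transition between consecutive frames into a discrete latent; whether this paper uses exactly this mechanism is an assumption. A VQ-style sketch:

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Labels frame transitions with discrete latents that stand in for actions."""
    def __init__(self, feat_dim=256, codebook_size=32):
        super().__init__()
        self.encoder = nn.Linear(2 * feat_dim, feat_dim)   # (frame_t, frame_t+1) -> z
        self.codebook = nn.Embedding(codebook_size, feat_dim)
        self.decoder = nn.Linear(2 * feat_dim, feat_dim)   # (frame_t, z_q) -> frame_t+1

    def forward(self, f_t, f_next):
        z = self.encoder(torch.cat([f_t, f_next], dim=-1))
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)   # nearest code
        z_q = self.codebook(idx)
        vq_loss = ((z_q - z.detach()) ** 2).mean() + 0.25 * ((z - z_q.detach()) ** 2).mean()
        z_st = z + (z_q - z).detach()                      # straight-through gradient
        pred_next = self.decoder(torch.cat([f_t, z_st], dim=-1))
        recon = ((pred_next - f_next) ** 2).mean()
        return recon + vq_loss, idx    # idx serves as the pseudo-action label

model = LatentActionModel()
f_t, f_next = torch.randn(16, 256), torch.randn(16, 256)  # consecutive frame features
loss, pseudo_actions = model(f_t, f_next)
loss.backward()
```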
GR00T N1 is a 2.2B-parameter open foundation model built on a dual-system architecture — an Eagle-2 VLM for environmental understanding and a diffusion transformer for real-time motor generation. Trained on a heterogeneous mix of real-robot trajectories, human videos, and synthetic data. Fully open-sourced on GitHub and HuggingFace.
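The dual-system design is essentially a two-rate control loop: the slow VLM refreshes a scene embedding at a low frequency while the fast action head consumes the cached embedding every control tick. A toy sketch with stand-in modules and illustrative rates, not GR00T N1's actual components:

```python
class SlowVLM:
    """Stand-in for the Eagle-2 VLM (System 2): expensive, runs at a low rate."""
    def encode(self, image, instruction):
        return {"instruction": instruction}          # placeholder scene/language embedding

class FastActionHead:
    """Stand-in for the diffusion-transformer head (System 1): runs every tick."""
    def act(self, ctx, proprio):
        return [0.0] * 26                            # placeholder motor command

vlm, head = SlowVLM(), FastActionHead()
ctx = None

for step in range(300):                              # ~3 s of control at 100 Hz
    if step % 10 == 0:                               # refresh System 2 at ~10 Hz
        ctx = vlm.encode(image=None, instruction="pick up the cup")
    action = head.act(ctx, proprio=[0.0] * 26)       # System 1 acts at 100 Hz
```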
TWIST retargets human motion capture data to a humanoid robot to generate reference clips, then trains a single unified whole-body controller combining RL and behavior cloning. The controller handles whole-body manipulation, legged manipulation, locomotion, and expressive movement with one network. Fully open-sourced including datasets, training code, and checkpoints.
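The RL-plus-behavior-cloning combination can be read as a single policy loss with two terms: a policy-gradient term on collected rollouts and a BC term pulling actions toward the retargeted references. The surrogate and weighting below are assumptions, not TWIST's exact formulation:

```python
import torch

def rl_plus_bc_loss(log_probs, advantages, policy_actions, reference_actions, bc_weight=0.5):
    """Policy-gradient surrogate plus a BC term toward retargeted reference actions."""
    rl_loss = -(log_probs * advantages).mean()                      # REINFORCE-style term
    bc_loss = ((policy_actions - reference_actions) ** 2).mean()    # imitation term
    return rl_loss + bc_weight * bc_loss

log_probs = torch.randn(64, requires_grad=True)           # log pi(a|s) on rollouts
advantages = torch.randn(64)                              # estimated advantages
policy_actions = torch.randn(64, 29, requires_grad=True)  # sampled whole-body actions
reference_actions = torch.randn(64, 29)                   # retargeted mocap references
rl_plus_bc_loss(log_probs, advantages, policy_actions, reference_actions).backward()
```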
A practical sim-to-real RL recipe for training vision-based dexterous manipulation on humanoids with multi-fingered hands — without relying on human demonstrations. Components include automated real-to-sim tuning, contact-based reward formulation, divide-and-conquer policy distillation, and modality-specific augmentation to close the perceptual sim-to-real gap.
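As one concrete instance of a contact-based reward: rewarding the number of distinct fingertips in contact with the object gives dense shaping signal before any grasp succeeds. The thresholds and weights below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def contact_reward(fingertip_forces: np.ndarray, force_thresh: float = 0.5,
                   per_contact: float = 0.1, max_contacts: int = 4) -> float:
    """fingertip_forces: per-fingertip contact-force magnitudes from the simulator."""
    n_contacts = int((fingertip_forces > force_thresh).sum())
    return per_contact * min(n_contacts, max_contacts)

print(contact_reward(np.array([0.9, 0.0, 1.2, 0.7, 0.1])))  # 3 contacts -> reward 0.3
```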
A system enabling humanoid robots to shadow and imitate human motions in real time using egocentric video, achieving robust whole-body control and skill transfer.
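The shadowing pipeline reduces to a per-frame loop: estimate the human's pose from the egocentric frame, retarget it to robot joint targets, and hand those to the whole-body controller. Everything below is a placeholder for the paper's actual pose estimator and controller interfaces:

```python
import numpy as np

def estimate_human_pose(frame: np.ndarray) -> np.ndarray:
    """Placeholder pose estimator returning, e.g., SMPL-style joint angles."""
    return np.zeros(23 * 3)

def retarget(human_pose: np.ndarray, joint_map: np.ndarray) -> np.ndarray:
    """Linear human-to-robot joint retargeting (a deliberate simplification)."""
    return joint_map @ human_pose

joint_map = np.random.randn(19, 23 * 3) * 0.01             # hypothetical 19-DoF humanoid
for frame in (np.zeros((224, 224, 3)) for _ in range(5)):  # stand-in egocentric stream
    q_target = retarget(estimate_human_pose(frame), joint_map)
    # robot.set_joint_targets(q_target)  # handed to the whole-body controller each frame
```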
OpenVLA is a 7B-parameter open-source VLA model trained on 970k robot demonstrations from the Open X-Embodiment dataset, outperforming much larger closed models such as RT-2-X on generalist manipulation benchmarks.
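Running inference follows the HuggingFace AutoClasses pattern from the project's README; the checkpoint id and `unnorm_key` below mirror their quickstart example and should be checked against the repo before use:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.png")   # current camera frame (hypothetical file)
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# `action` is a 7-DoF end-effector delta (xyz, rpy, gripper) for the chosen setup.
```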
GR-2 leverages internet-scale video pretraining to build a generalist manipulation policy that generalizes across robot morphologies and task types.
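The recipe is two-phase: pre-train a sequence model on next-frame video prediction over web video, then fine-tune the same trunk on robot data with an added action head. The sketch below is a minimal stand-in under those assumptions, not GR-2's actual tokenizer or architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2
)
frame_head = nn.Linear(256, 1024)   # logits over a hypothetical 1024-token frame vocabulary
action_head = nn.Linear(256, 7)     # hypothetical 7-DoF action regression

clips = torch.randn(4, 32, 256)     # a batch of tokenized video clips (embeddings)
h = trunk(clips)

# Phase 1: next-frame-token prediction on action-free web video.
target_ids = torch.randint(0, 1024, (4, 31))
video_loss = F.cross_entropy(frame_head(h[:, :-1]).reshape(-1, 1024), target_ids.reshape(-1))

# Phase 2: on robot trajectories, keep the video loss and add an action loss.
actions = torch.randn(4, 32, 7)
(video_loss + ((action_head(h) - actions) ** 2).mean()).backward()
```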
Training legged robots to perform parkour maneuvers (wall-running, gap jumping, flipping) using a hierarchical RL framework in Isaac Gym.
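A hierarchical setup here means a high-level policy selecting among low-level skill policies based on terrain observations. The skill set and network sizes below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

SKILLS = ["run", "jump_gap", "wall_run", "flip"]

class HighLevel(nn.Module):
    """Selects a skill index from a terrain observation."""
    def __init__(self, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, len(SKILLS)))
    def forward(self, terrain_obs):
        return self.net(terrain_obs).argmax(dim=-1)

class SkillPolicy(nn.Module):
    """Low-level policy mapping proprioception to joint targets."""
    def __init__(self, obs_dim=48, act_dim=12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, act_dim))
    def forward(self, proprio):
        return self.net(proprio)

high = HighLevel()
skills = nn.ModuleList(SkillPolicy() for _ in SKILLS)

terrain_obs, proprio = torch.randn(16, 64), torch.randn(16, 48)   # 16 parallel envs
idx = high(terrain_obs)
actions = torch.stack([skills[int(i)](proprio[e]) for e, i in enumerate(idx)])
```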