Research Hub
Key academic papers shaping the development of humanoid robots — locomotion, manipulation, sim-to-real transfer, VLA models, and tactile sensing.
Ψ₀: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
A staged training approach that sidesteps the pitfalls of directly mixing human and robot data. Ψ₀ first pre-trains on 800 hours of egocentric human manipulation video, then post-trains a flow-based action expert on just 30 hours of humanoid robot data. The complete ecosystem — training pipelines, model weights, and inference engines — is fully open-sourced.
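For intuition about what a flow-based action expert involves, here is a minimal conditional flow-matching sketch in PyTorch. The dimensions, MLP velocity field, and random training data are hypothetical stand-ins, not Ψ₀'s architecture.

```python
# Minimal conditional flow matching for an action head. All shapes and the
# architecture are illustrative stand-ins, not the paper's model.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 64, 23  # hypothetical observation/action dimensions

class VelocityField(nn.Module):
    """Predicts the flow velocity v(a_t, t | obs) carrying noise to actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACT_DIM + OBS_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, ACT_DIM),
        )

    def forward(self, a_t, t, obs):
        return self.net(torch.cat([a_t, obs, t], dim=-1))

model = VelocityField()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def fm_loss(actions, obs):
    """Rectified-flow objective: regress the straight-line velocity a1 - a0."""
    a0 = torch.randn_like(actions)             # noise endpoint
    t = torch.rand(actions.shape[0], 1)        # random interpolation time
    a_t = (1 - t) * a0 + t * actions           # linear interpolant
    return ((model(a_t, t, obs) - (actions - a0)) ** 2).mean()

# One toy optimization step on random data.
loss = fm_loss(torch.randn(32, ACT_DIM), torch.randn(32, OBS_DIM))
opt.zero_grad(); loss.backward(); opt.step()
```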
HumDex: Humanoid Dexterous Manipulation Made Easy
A portable teleoperation framework for dexterous humanoid manipulation using IMU-based motion tracking. Introduces a learning-based retargeting method for hand control and a two-phase training approach: pre-training on human motion data, then fine-tuning on robot data to bridge the embodiment gap. The full system is open-sourced.
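To picture learning-based retargeting, the sketch below trains a small MLP to map human hand keypoints to robot hand joint commands; the keypoint count, hand DoFs, and paired training data are invented for illustration and do not reflect HumDex's actual design.

```python
# Toy retargeting network: human hand keypoints in, robot hand joint commands
# out, supervised by targets from an offline optimizer (all data invented).
import torch
import torch.nn as nn

KEYPOINTS, JOINTS = 21, 12                     # e.g. hand keypoints -> hand DoFs
retarget = nn.Sequential(
    nn.Linear(KEYPOINTS * 3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, JOINTS), nn.Tanh(),         # normalized commands in [-1, 1]
)
opt = torch.optim.Adam(retarget.parameters(), lr=3e-4)

keypoints = torch.randn(256, KEYPOINTS * 3)    # stand-in human hand poses
targets = torch.rand(256, JOINTS) * 2 - 1      # stand-in optimizer-derived joints
for _ in range(100):
    loss = nn.functional.mse_loss(retarget(keypoints), targets)
    opt.zero_grad(); loss.backward(); opt.step()
```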
ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video
ZeroWBC eliminates the need for robot teleoperation data by fine-tuning a Vision-Language Model to predict human motions from egocentric video and text instructions. A tracking policy adapts predicted motions to the robot's joints for whole-body control. Tested on the Unitree G1 humanoid across diverse motion categories including sitting and kicking.
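The two-stage decomposition can be pictured as a simple data-flow pipeline. The stub below only shows the interfaces, with hypothetical shapes and names throughout; the real system uses a fine-tuned VLM and an RL-trained tracking policy, neither reproduced here.

```python
# Illustrative two-stage data flow only: both stages are stubs.
import numpy as np

def predict_human_motion(egocentric_frames: np.ndarray, instruction: str) -> np.ndarray:
    """Stage 1 stand-in: video + text -> a human motion sequence."""
    horizon, human_joints = 16, 24
    return np.zeros((horizon, human_joints, 3))

def track_motion(human_motion: np.ndarray, robot_state: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: reference motion + proprioception -> joint targets."""
    robot_dofs = 29  # e.g. a Unitree G1-class humanoid
    return np.zeros((human_motion.shape[0], robot_dofs))

frames = np.zeros((8, 224, 224, 3))                      # egocentric clip
reference = predict_human_motion(frames, "sit down on the chair")
targets = track_motion(reference, robot_state=np.zeros(29))
```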
PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching
PhysiFlow proposes a "multi-brain" VLA framework that combines semantic understanding with physics-aware whole-body coordination. It uses latent flow matching to bridge high-level vision-language intent with low-level motor execution, improving inference efficiency while maintaining physical plausibility for full-body humanoid coordination.
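One reason latent flow matching helps inference efficiency is that integration happens in a compact latent space with a handful of Euler steps. A hedged sketch of the sampling side, where the dimensions, step count, and unconditioned velocity field are all illustrative (conditioning on the vision-language intent is omitted):

```python
# Sampling from a latent flow model: a few Euler steps in a small latent, then
# one decode to a whole-body action chunk. All shapes are stand-ins.
import torch
import torch.nn as nn

LATENT, ACT, HORIZON = 32, 29, 8
velocity = nn.Sequential(nn.Linear(LATENT + 1, 128), nn.SiLU(),
                         nn.Linear(128, LATENT))
decoder = nn.Linear(LATENT, ACT * HORIZON)

@torch.no_grad()
def sample_actions(n_steps: int = 5) -> torch.Tensor:
    z = torch.randn(1, LATENT)                 # start from noise in latent space
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1, 1), i * dt)
        z = z + dt * velocity(torch.cat([z, t], dim=-1))  # Euler step along flow
    return decoder(z).view(HORIZON, ACT)       # decode once into an action chunk

actions = sample_actions()
```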
ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
ULTRA presents a unified multimodal controller for humanoid whole-body loco-manipulation that handles varied inputs, from motion-capture data to imperfect egocentric vision. A physics-driven neural retargeting algorithm compresses skills into latent representations, enabling autonomous goal-directed execution without reference motions at test time. Evaluated on the Unitree G1.
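To make "compressing skills into latent representations" concrete, here is a toy autoencoder over motion clips; ULTRA's physics-driven retargeting is far richer, and every shape below is a stand-in.

```python
# Toy autoencoder: each reference trajectory compresses to a skill code; at
# test time a goal-conditioned prior could emit codes without any reference.
import torch
import torch.nn as nn

HORIZON, DOF, SKILL = 32, 29, 16
encoder = nn.Sequential(nn.Flatten(), nn.Linear(HORIZON * DOF, 256), nn.ELU(),
                        nn.Linear(256, SKILL))
decoder = nn.Sequential(nn.Linear(SKILL, 256), nn.ELU(),
                        nn.Linear(256, HORIZON * DOF))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

motions = torch.randn(64, HORIZON, DOF)        # stand-in reference trajectories
z = encoder(motions)                           # compress each clip to a skill code
recon = decoder(z).view(64, HORIZON, DOF)
loss = nn.functional.mse_loss(recon, motions)
opt.zero_grad(); loss.backward(); opt.step()
```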
SPARK: Skeleton-Parameter Aligned Retargeting on Humanoid Robots with Kinodynamic Trajectory Optimization
A two-stage pipeline (accepted to ICRA 2026) for converting human motion-capture data into physically feasible humanoid reference trajectories. Human motion is first aligned to the target robot's skeletal parameters; a three-stage kinodynamic trajectory optimization then produces dynamically consistent motion references that generalize across different humanoid platforms.
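A minimal single-joint version of kinodynamic trajectory optimization can be written as a soft-constrained least-squares problem: fit a trajectory to the retargeted reference while penalizing velocity and acceleration limit violations. The limits, weights, and SciPy solver below are illustrative choices, not SPARK's formulation.

```python
# Single-joint toy kinodynamic trajectory optimization (illustrative only).
import numpy as np
from scipy.optimize import minimize

dt, T = 0.02, 50
reference = np.sin(np.linspace(0, np.pi, T))   # stand-in retargeted joint motion
v_max, a_max = 2.0, 10.0                       # hypothetical robot limits

def cost(q: np.ndarray) -> float:
    v = np.diff(q) / dt                        # finite-difference velocity
    a = np.diff(v) / dt                        # finite-difference acceleration
    track = np.sum((q - reference) ** 2)
    limits = np.sum(np.maximum(np.abs(v) - v_max, 0.0) ** 2) \
           + np.sum(np.maximum(np.abs(a) - a_max, 0.0) ** 2)
    return track + 10.0 * limits               # soft-constrained least squares

feasible = minimize(cost, reference.copy(), method="L-BFGS-B").x
```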
HuMI: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations
HuMI enables learning diverse humanoid whole-body manipulation tasks without any physical robot during data collection. A portable wearable captures full-body human motion, feeding a hierarchical learning pipeline that translates human motions into dexterous humanoid skills. Tested across five tasks: kneeling, squatting, tossing, walking, and bimanual manipulation.
RPL: Learning Robust Humanoid Perceptive Locomotion on Challenging Terrains
A two-stage training framework for multi-directional humanoid locomotion on complex terrain. Stage one trains terrain-specific expert policies using privileged height map observations; stage two distills these into a single transformer policy driven by multiple depth cameras. A custom simulation tool achieves 5× faster depth rendering than prior alternatives.
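The distillation stage follows the familiar teacher-student recipe. In the sketch below a small CNN stands in for the paper's transformer student, and the privileged height-map size is an arbitrary choice.

```python
# Teacher-student distillation: the student regresses onto actions produced by
# a privileged expert. Architectures and sizes are stand-ins.
import torch
import torch.nn as nn

ACT = 12
teacher = nn.Sequential(nn.Linear(187, 256), nn.ELU(), nn.Linear(256, ACT))
student = nn.Sequential(                       # consumes depth, not height maps
    nn.Conv2d(1, 16, 5, stride=3), nn.ELU(),
    nn.Conv2d(16, 32, 5, stride=3), nn.ELU(),
    nn.Flatten(), nn.Linear(32 * 6 * 6, ACT),  # 64x64 input -> 6x6 feature map
)
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

height_map = torch.randn(16, 187)              # privileged observation (sim only)
depth = torch.randn(16, 1, 64, 64)             # deployable observation
with torch.no_grad():
    target = teacher(height_map)               # expert action labels
loss = nn.functional.mse_loss(student(depth), target)
opt.zero_grad(); loss.backward(); opt.step()
```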
UniForce: A Unified Latent Force Model for Robot Manipulation with Diverse Tactile Sensors
UniForce addresses the tactile sensor heterogeneity problem by learning a shared latent force representation across diverse sensor types (GelSight, TacTip, uSkin). It jointly models inverse and forward dynamics, constrained by force equilibrium and image reconstruction losses. The universal encoder enables zero-shot cross-sensor transfer for force-aware manipulation.
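The core idea of a shared latent force space can be sketched with per-sensor encoders feeding common force and reconstruction heads. Sensor names follow the paper, but every shape, architecture, and loss weight below is invented, and the force-equilibrium constraint is omitted.

```python
# Per-sensor encoders into one latent, shared force and reconstruction heads.
import torch
import torch.nn as nn

LATENT = 64
encoders = nn.ModuleDict({                     # heterogeneous raw formats
    "gelsight": nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, LATENT)),  # image
    "tactip":   nn.Sequential(nn.Flatten(), nn.Linear(127 * 2, LATENT)),  # pin offsets
    "uskin":    nn.Sequential(nn.Flatten(), nn.Linear(16 * 3, LATENT)),   # taxel forces
})
force_head = nn.Linear(LATENT, 3)              # shared: latent -> 3D contact force
recon_head = nn.Linear(LATENT, 32 * 32)        # image reconstruction regularizer

img = torch.randn(8, 1, 32, 32)                # stand-in GelSight-style frame
z = encoders["gelsight"](img)
force_loss = nn.functional.mse_loss(force_head(z), torch.zeros(8, 3))
recon_loss = nn.functional.mse_loss(recon_head(z), img.flatten(1))
(force_loss + 0.1 * recon_loss).backward()     # zero-force labels are placeholders
```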
DemoBot: Efficient Learning of Bimanual Manipulation from a Single Human Video
DemoBot enables a dual-arm, multi-finger robot to learn complex bimanual manipulation from a single unannotated RGB-D video demonstration. Structured motion trajectories are extracted from the video; an RL pipeline built on three innovations (temporal-segment RL, success-gated resets, and an event-driven reward curriculum) then refines those motions through contact-rich simulation before real deployment.
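One plausible reading of success-gated resets, sketched schematically below: training only advances into the next temporal segment after the current one succeeds consistently. Segment names, thresholds, and the rollout stub are all hypothetical, not the paper's exact mechanism.

```python
# Schematic temporal-segment training with a success gate (illustrative only).
import random

segments = ["reach", "grasp", "lift", "place"]   # hypothetical task segments

def rollout(segment: str) -> bool:
    """Stand-in for an RL training rollout; returns segment success."""
    return random.random() < 0.6

active, streak = 0, 0                            # trained segment, success run
for episode in range(500):
    if rollout(segments[active]):
        streak += 1
    else:
        streak = 0                               # gate demands consistent success
    if streak >= 10 and active < len(segments) - 1:
        active, streak = active + 1, 0           # unlock resets into next segment
```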
Scalable and General Whole-Body Control for Cross-Humanoid Locomotion (XHugWBC)
XHugWBC trains a single policy that generalizes whole-body locomotion and manipulation across diverse humanoid hardware without robot-specific retraining. Key innovations include physics-consistent morphological randomization and semantically aligned observation/action spaces. Validated across 12 simulated and 7 real-world humanoid platforms.
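Morphological randomization can be pictured as per-episode sampling of robot parameters with physically consistent scaling. The ranges and parameters below are invented for illustration; a real implementation would regenerate the simulator's robot description (URDF/MJCF) from the sampled values.

```python
# Per-episode morphology sampling with physically consistent scaling.
import random

def sample_morphology() -> dict:
    scale = random.uniform(0.8, 1.2)               # overall size scale
    return {
        "leg_length": 0.40 * scale,                # meters
        "arm_length": 0.30 * scale,
        "torso_mass": 12.0 * scale ** 3,           # mass ~ length^3 keeps density
        "dof_count": random.choice([23, 27, 29]),  # varied humanoid layouts
    }

morphology = sample_morphology()                   # resample every episode
```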
WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control
A unified latent VLA framework for simultaneous locomotion and manipulation. The model learns from action-free egocentric video paired with a loco-manipulation RL policy, dramatically reducing training-data cost. Validated on the AgiBot X2 humanoid on tasks requiring navigation and bimanual manipulation across large spaces. Accepted to ICLR 2026.
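A common recipe for learning from action-free video is a latent inverse/forward model pair, sketched below with hypothetical feature dimensions; WholeBodyVLA's actual latent-action formulation may differ.

```python
# Latent actions from action-free video: an inverse model embeds consecutive
# frames into a latent "action" that a forward model must use to predict the
# next frame. No robot action labels are needed.
import torch
import torch.nn as nn

FEAT, LATENT = 128, 16
inverse = nn.Sequential(nn.Linear(2 * FEAT, 128), nn.ELU(), nn.Linear(128, LATENT))
dynamics = nn.Sequential(nn.Linear(FEAT + LATENT, 128), nn.ELU(), nn.Linear(128, FEAT))
opt = torch.optim.Adam([*inverse.parameters(), *dynamics.parameters()], lr=1e-3)

f_t, f_next = torch.randn(64, FEAT), torch.randn(64, FEAT)  # video frame features
z = inverse(torch.cat([f_t, f_next], dim=-1))               # latent action label
pred = dynamics(torch.cat([f_t, z], dim=-1))
loss = nn.functional.mse_loss(pred, f_next)
opt.zero_grad(); loss.backward(); opt.step()
```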
TWIST: Teleoperated Whole-Body Imitation System
TWIST retargets human motion-capture data to a humanoid to generate reference clips, then trains a single unified whole-body controller combining RL and behavior cloning. One network handles whole-body manipulation, legged manipulation, locomotion, and expressive movement. Fully open-sourced including datasets, training code, and checkpoints.
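Combining RL with behavior cloning often reduces to a weighted sum of the two objectives. The Gaussian policy, REINFORCE-style surrogate, and coefficient below are schematic stand-ins, not TWIST's training setup.

```python
# Schematic mix of an RL term and a BC term in one objective.
import torch
import torch.nn as nn

OBS, ACT = 48, 29
policy = nn.Sequential(nn.Linear(OBS, 256), nn.ELU(), nn.Linear(256, ACT))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(32, OBS)
expert_act = torch.randn(32, ACT)     # actions from retargeted mocap references
advantage = torch.randn(32)           # stand-in for a critic's advantage estimates

dist = torch.distributions.Normal(policy(obs), 1.0)
rl_loss = -(dist.log_prob(dist.sample()).sum(-1) * advantage).mean()  # RL term
bc_loss = -dist.log_prob(expert_act).sum(-1).mean()                   # BC term
loss = rl_loss + 0.5 * bc_loss
opt.zero_grad(); loss.backward(); opt.step()
```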
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a 2.2B-parameter open foundation model built on a dual-system architecture — an Eagle-2 VLM for environmental understanding and a diffusion transformer for real-time motor generation. Trained on real-robot trajectories, human videos, and synthetic data. Fully open-sourced on GitHub and HuggingFace.
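The dual-system idea can be sketched as two loops at different rates: a slow vision-language module refreshes a context embedding while a fast action head runs every control tick. Modules, rates, and dimensions below are illustrative stand-ins (N1's fast system is a diffusion transformer, not a linear layer).

```python
# Two loops at different rates: slow context refresh, fast action generation.
import torch
import torch.nn as nn

EMBED, ACT = 256, 29
slow_vlm = nn.Linear(3 * 224 * 224, EMBED)      # stand-in for the Eagle-2 VLM
fast_head = nn.Linear(EMBED + 64, ACT)          # stand-in for the action model

context = torch.zeros(1, EMBED)
for tick in range(120):                         # one second at a 120 Hz loop, say
    if tick % 12 == 0:                          # slow system refreshes at ~10 Hz
        frame = torch.randn(1, 3 * 224 * 224)   # flattened camera image stand-in
        context = slow_vlm(frame)
    proprio = torch.randn(1, 64)                # fast system runs every tick
    action = fast_head(torch.cat([context, proprio], dim=-1))
```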
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
A practical sim-to-real RL recipe for training vision-based dexterous manipulation on humanoids with multi-fingered hands without relying on demonstrations. Components include automated real-to-sim tuning, contact-based reward formulation, divide-and-conquer policy distillation, and modality-specific augmentation to close the perceptual sim-to-real gap.
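A contact-based reward typically pays out dense shaping for fingertip contact and gates the task reward on a stable grasp. The thresholds and state layout below are invented for illustration, not the paper's reward terms.

```python
# Invented contact-based reward: dense shaping for fingertip contact, with the
# lift bonus gated on a multi-finger grasp.
import numpy as np

def contact_reward(state: dict) -> float:
    forces = state["fingertip_contact_forces"]    # (n_fingers,) normal forces, N
    n_contact = int(np.sum(forces > 0.5))         # fingers pressing > 0.5 N
    r_contact = 0.1 * n_contact                   # dense contact shaping term
    lifted = state["obj_height"] > 0.05           # object raised 5 cm
    r_lift = 1.0 if (n_contact >= 2 and lifted) else 0.0
    return r_contact + r_lift                     # lift only pays with a grasp

r = contact_reward({"fingertip_contact_forces": np.array([0.8, 0.7, 0.0, 0.1]),
                    "obj_height": 0.08})          # -> 0.2 + 1.0
```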
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA is a 7B-parameter open-source VLA model trained on 970k robot demonstrations, achieving state-of-the-art performance on manipulation benchmarks. The open weights and training code established a community baseline for vision-language-action research across diverse robot platforms.
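Loading the model follows the standard HuggingFace pattern shown in the OpenVLA README; the snippet below is reproduced from memory of that README, so check the repo for the current API. The blank PIL image is a placeholder for a real camera frame.

```python
# OpenVLA inference following the README pattern (verify against the repo).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.new("RGB", (224, 224))  # placeholder; use the robot's camera frame
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# unnorm_key selects the dataset statistics used to un-normalize actions.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```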