Can Code-as-Policy Match Vision-Language-Action Models for Robot Control?
Researchers have released CaP-X, an open-access benchmarking framework that directly compares Code-as-Policy approaches against data-intensive vision-language-action (VLA) models for robot manipulation tasks. The framework introduces CaP-Gym, an interactive environment where agents control robots by synthesizing and executing programs rather than learning from massive datasets.
The timing is critical as humanoid robotics companies like Figure AI and Physical Intelligence (π) have invested heavily in VLA architectures that require millions of training examples. Code-as-Policy offers a fundamentally different approach: instead of learning patterns from data, agents generate executable code that directly controls robot actuators and sensors.
CaP-X addresses a significant gap in robotics research. While VLA models demonstrate impressive performance on trained tasks, they struggle with zero-shot generalization to novel scenarios. Code-as-Policy agents, by contrast, can theoretically handle any manipulation task that can be programmed, but their real-world effectiveness has been poorly quantified until now.
The framework's release comes as the humanoid robotics industry grapples with the computational costs of training foundation models. A single VLA model can require thousands of GPU hours and millions of demonstration examples, while Code-as-Policy approaches need only computational resources for program synthesis.
Framework Architecture and Testing Environment
CaP-X centers on CaP-Gym, a standardized simulation environment built for systematic evaluation of code-generating agents. The gym provides robots with standard manipulation primitives — grasp, move, rotate, release — while tracking success rates, execution time, and code quality metrics across different task complexities.
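To make the setup concrete, here is a minimal sketch of what a CaP-Gym-style primitive interface could look like. All class, method, and metric names below are illustrative assumptions, not the actual CaP-X API; the point is only that a generated policy is an ordinary program over a small primitive vocabulary, with every call logged for evaluation.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeMetrics:
    # Hypothetical metric container; CaP-X may track different fields.
    primitive_calls: list = field(default_factory=list)

class ManipulationEnv:
    """Toy stand-in for a gym exposing grasp/move/rotate/release."""

    def __init__(self):
        self.metrics = EpisodeMetrics()
        self.holding = None  # name of the currently grasped object, if any

    def _log(self, name, **kwargs):
        # Every primitive invocation is recorded for later analysis.
        self.metrics.primitive_calls.append((name, kwargs))

    def grasp(self, obj):
        self._log("grasp", obj=obj)
        self.holding = obj

    def move(self, x, y, z):
        self._log("move", x=x, y=y, z=z)

    def rotate(self, yaw):
        self._log("rotate", yaw=yaw)

    def release(self):
        self._log("release")
        self.holding = None

# A synthesized policy is just a program composed from those primitives:
def pick_and_place(env, obj_name, approach, target):
    env.move(*approach)
    env.grasp(obj_name)
    env.move(*target)
    env.release()

env = ManipulationEnv()
pick_and_place(env, "block", (0.1, 0.2, 0.3), (0.4, 0.2, 0.3))
print(len(env.metrics.primitive_calls))  # 4 primitive invocations logged
```

Logging at the primitive boundary, rather than inside the controller, is what makes code-generating agents and learned policies comparable under one metric scheme.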
Unlike previous ad-hoc evaluations of Code-as-Policy systems, CaP-X establishes reproducible benchmarks. The framework tests agents on hierarchical task structures: simple pick-and-place operations, multi-object manipulation sequences, and complex assembly tasks requiring precise force control.
The researchers designed the environment to expose key weaknesses in both approaches. VLA models excel at tasks similar to their training data but fail catastrophically on novel object geometries or unexpected environmental conditions. Code-as-Policy agents handle novel scenarios better but struggle with tasks requiring fine-grained sensorimotor coordination that's difficult to encode algorithmically.
Initial results from the framework reveal that Code-as-Policy agents achieve a 73% success rate on novel manipulation tasks, compared with 45% for state-of-the-art VLA models tested outside their training distribution. However, VLA models outperform code-based approaches on tasks requiring dexterous manipulation, achieving 89% success versus 62% for programmed controllers.
Industry Implications for Humanoid Development
The CaP-X framework arrives as humanoid robotics companies face mounting pressure to demonstrate real-world capabilities beyond controlled demonstrations. Current VLA approaches require extensive data collection in each new environment — a significant bottleneck for companies deploying humanoids across diverse industrial settings.
Code-as-Policy offers potential advantages for rapid deployment scenarios. A humanoid equipped with robust program synthesis capabilities could theoretically adapt to new warehouse layouts, manufacturing processes, or household environments without additional training data. This flexibility could prove crucial for companies like Agility Robotics deploying Digit robots across multiple Amazon fulfillment centers.
However, the framework also exposes fundamental limitations. Code-as-Policy struggles with tasks requiring learned sensorimotor skills — precisely the capabilities that make humanoids valuable for complex manipulation tasks. The approach works well for structured industrial environments but may prove insufficient for the unstructured scenarios that justify humanoids' higher costs versus specialized automation.
The research suggests hybrid approaches may emerge as the optimal solution. Several companies are already exploring architectures that combine VLA models for perception and skill learning with Code-as-Policy systems for high-level task planning and novel scenario adaptation.
Technical Challenges and Future Directions
CaP-X reveals several technical hurdles that both approaches must overcome. Code-as-Policy agents frequently generate syntactically correct but physically impossible commands — attempting to grasp objects through obstacles or exceeding joint limits. The framework tracks these failure modes systematically, providing crucial data for improving program synthesis models.
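The kind of feasibility screening this implies can be sketched as a pre-execution validator that labels each generated command before it reaches the robot. The joint limits, workspace radius, and command schema below are assumed values for illustration, not figures from the framework.

```python
import math

# Assumed 6-DOF arm: symmetric joint limits (radians) and reach (metres).
JOINT_LIMITS = [(-2.9, 2.9)] * 6
WORKSPACE_RADIUS = 0.85

def violates_joint_limits(joint_angles):
    return any(not (lo <= q <= hi)
               for q, (lo, hi) in zip(joint_angles, JOINT_LIMITS))

def out_of_reach(x, y, z):
    return math.sqrt(x * x + y * y + z * z) > WORKSPACE_RADIUS

def classify_command(cmd):
    """Return a failure-mode label for logging, or None if feasible."""
    if cmd["type"] == "joint_move" and violates_joint_limits(cmd["q"]):
        return "joint_limit_violation"
    if cmd["type"] == "cartesian_move" and out_of_reach(*cmd["target"]):
        return "unreachable_target"
    return None

failures = [classify_command(c) for c in [
    {"type": "joint_move", "q": [0.0, 3.5, 0.0, 0.0, 0.0, 0.0]},
    {"type": "cartesian_move", "target": (0.2, 0.1, 0.4)},
]]
print(failures)  # ['joint_limit_violation', None]
```

Screening of this sort catches only kinematic infeasibility; obstruction checks (grasping through obstacles) would additionally need collision geometry, which is where physics-aware synthesis becomes harder.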
VLA models face different but equally significant challenges. The framework demonstrates that these models exhibit brittle generalization, performing well on interpolation tasks but failing on extrapolation scenarios. A VLA model trained on cylindrical objects may completely fail when presented with cubic ones, despite the similarity in manipulation strategies required.
The researchers propose several improvements enabled by their benchmarking framework. Code-as-Policy agents could incorporate physics-aware program synthesis, generating only feasible motion sequences. VLA models could benefit from systematic evaluation protocols that expose generalization failures during development rather than deployment.
For humanoid robotics companies, CaP-X provides a standardized methodology for comparing different control approaches. Rather than relying on carefully curated demonstrations, companies can now evaluate their systems against consistent benchmarks that reveal real-world performance characteristics.
Key Takeaways
- CaP-X framework enables systematic comparison of Code-as-Policy versus VLA approaches for robot manipulation
- Code-as-Policy agents achieve 73% success on novel tasks versus 45% for VLA models outside training distribution
- VLA models excel at dexterous manipulation (89% success) compared to programmed approaches (62%)
- Framework reveals critical generalization failures in both approaches that weren't apparent in previous evaluations
- Hybrid architectures combining both approaches may emerge as optimal solution for humanoid robotics deployment
- Open-access framework provides standardized benchmarks for industry evaluation of control systems
Frequently Asked Questions
What makes CaP-X different from previous robot control evaluations?
CaP-X provides the first systematic framework for directly comparing Code-as-Policy and VLA approaches using standardized benchmarks. Previous evaluations relied on cherry-picked demonstrations or incomparable test scenarios, making it impossible to assess relative strengths objectively.
How do Code-as-Policy agents handle tasks they weren't explicitly programmed for?
Code-as-Policy agents use large language models to generate executable programs for novel tasks. Instead of learning from demonstration data, they synthesize new code by combining primitive operations and adapting existing program templates to new scenarios.
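A stripped-down version of that synthesis loop can be sketched as follows. A real system would call an LLM where `mock_planner` stands; the template, task strings, and primitive vocabulary here are illustrative assumptions, but the structure (plan, validate against known primitives, then return an executable step list) reflects the approach described above.

```python
PRIMITIVES = {"grasp", "move", "rotate", "release"}

def mock_planner(task):
    """Stand-in for an LLM: maps a task description to primitive steps."""
    if "stack" in task:
        return [("move", "above block_a"), ("grasp", "block_a"),
                ("move", "above block_b"), ("release", None)]
    raise ValueError(f"no template for task: {task!r}")

def synthesize_program(task):
    steps = mock_planner(task)
    # Reject any step outside the primitive vocabulary before execution,
    # so hallucinated operations never reach the robot.
    for op, _arg in steps:
        if op not in PRIMITIVES:
            raise ValueError(f"unknown primitive: {op}")
    return steps

program = synthesize_program("stack block_a on block_b")
print([op for op, _ in program])  # ['move', 'grasp', 'move', 'release']
```

The validation pass is the key design choice: constraining generated code to a closed primitive set is what lets these agents generalize compositionally without executing arbitrary, unvetted output.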
Why do VLA models struggle with generalization despite massive training datasets?
VLA models learn statistical patterns from training data, making them excellent at interpolating between known examples but poor at extrapolating to genuinely novel scenarios. They lack the compositional reasoning capabilities that allow Code-as-Policy systems to combine known operations in new ways.
Which approach is better for commercial humanoid deployment?
The framework suggests neither approach alone is optimal. Code-as-Policy offers better adaptability for new environments, while VLA models excel at complex manipulation skills. Hybrid architectures that combine both approaches may prove most effective for commercial deployment.
How does CaP-X impact sim-to-real transfer research?
CaP-X standardizes evaluation protocols for sim-to-real transfer, allowing researchers to systematically measure how well different approaches transfer from simulation to physical robots. This addresses a major gap in current robotics research methodology.