SimHum unifies transferable actions from simulation with realistic visuals from human data to achieve data-efficient and generalizable robotic manipulation. In this video, we demonstrate how our co-training framework leverages the inherent complementarity of these sources. Empirically, SimHum outperforms the baseline by up to 35% under the same data budget, and achieves relative improvements of up to 7.1× compared to the real-only baseline in unseen environments.
Abstract
Synthetic simulation data and real-world human data provide scalable alternatives that circumvent the prohibitive cost of robot data collection. However, these sources suffer from the sim-to-real visual gap (a) and the human-to-robot embodiment gap (b), respectively, which limit the policy's generalization to real-world scenarios. In this work, we identify a natural yet underexplored complementarity between these sources: simulation offers the robot actions that human data lacks, while human data provides the real-world observations that simulation struggles to render. Motivated by this insight, we present SimHum, a co-training framework that simultaneously extracts kinematic priors from simulated robot actions and visual priors from real-world human observations (c). Building on these two complementary priors, we achieve data-efficient and generalizable robotic manipulation in real-world tasks. Empirically, SimHum outperforms the baseline by up to 35% under the same data collection budget, and achieves a 62.5% average success rate, showing relative improvements of up to 7.1× compared to the real-only baseline (d).
Data Collection
The Complementarity of Heterogeneous Data
Real-world robot data is costly. We overcome this by leveraging the natural complementarity between two scalable data sources: (1) Simulation provides Robot Action but suffers from the sim-to-real visual gap due to inevitable rendering discrepancies in simulators; (2) Human Data offers Real-world Observation but suffers from the embodiment gap due to the kinematic mismatch with robot grippers. By unifying them, we get the best of both worlds.
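To make this complementarity concrete, below is a minimal sketch of what a sample from each source contains. The Sample dataclass and its field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Sample:
    image: np.ndarray                   # visual observation (H, W, 3)
    robot_action: Optional[np.ndarray]  # end-effector action, if available
    real_pixels: bool                   # real-world vs. rendered visuals

# Simulation: robot actions come for free, but the pixels are rendered (sim-to-real visual gap).
sim_sample = Sample(image=np.zeros((224, 224, 3)), robot_action=np.zeros(7), real_pixels=False)

# Human data: real-world pixels, but no robot action (human-to-robot embodiment gap).
human_sample = Sample(image=np.zeros((224, 224, 3)), robot_action=None, real_pixels=True)

# Real robot data: provides both, but is costly to collect at scale.
real_sample = Sample(image=np.zeros((224, 224, 3)), robot_action=np.zeros(7), real_pixels=True)
```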
Human Data Collection System
Our human data collection system consists of hardware devices, a real-time GUI, and synchronized recording. The system ensures high-quality data capture with frame-level synchronization between GUI telemetry and scene observations.
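As a rough illustration of the frame-level synchronization step, here is a minimal sketch that matches each camera frame to the nearest GUI telemetry packet by timestamp. The function name, skew threshold, and stream layout are assumptions, not the system's actual recording format.

```python
import numpy as np

def synchronize(telemetry_ts, camera_ts, max_skew=0.02):
    """For each camera frame, keep the nearest telemetry packet within max_skew seconds."""
    telemetry_ts = np.asarray(telemetry_ts)
    pairs = []
    for i, t in enumerate(camera_ts):
        j = int(np.argmin(np.abs(telemetry_ts - t)))  # nearest telemetry timestamp
        if abs(telemetry_ts[j] - t) <= max_skew:
            pairs.append((i, j))  # (camera frame index, telemetry packet index)
    return pairs

# Example: a 30 Hz camera stream matched against 100 Hz telemetry.
camera_ts = np.arange(0.0, 1.0, 1 / 30)
telemetry_ts = np.arange(0.0, 1.0, 1 / 100)
print(synchronize(telemetry_ts, camera_ts)[:5])
```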
Collected Data Visualization
Explore our multi-source demonstration dataset captured across diverse environments. Switch between the tabs below to view episodes for different manipulation tasks.
Simulation Data
Human Data
Real Robot Data
Approach
We employ a Two-Stage Training Paradigm to train our Modular Policy Architecture. In the Sim-and-Human Pre-training stage (a), we leverage Modular Action Encoders/Decoders to extract transferable kinematic priors from simulation, and Domain-specific Vision Adaptors to extract visual priors from human data. Subsequently, for Real-robot Fine-tuning (b), we restructure the policy by selectively retaining the compatible components—specifically the Real-world Vision Adaptor and Robot Encoder/Decoder—to achieve data-efficient generalization.
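For readers who prefer code, the following is a minimal PyTorch sketch of the modular composition described above: domain-specific vision adaptors feed a shared trunk, and per-embodiment action decoders sit on top. All module names, dimensions, and the use of simple linear layers are illustrative assumptions, not the authors' released implementation (action encoders are omitted for brevity).

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    def __init__(self, obs_dim=512, act_dims=None):
        super().__init__()
        act_dims = act_dims or {"robot": 7, "human": 21}  # assumed action dimensions
        # Domain-specific vision adaptors map each visual domain into a shared feature space.
        self.vision_adaptors = nn.ModuleDict({
            "sim": nn.Linear(obs_dim, 256),
            "real": nn.Linear(obs_dim, 256),
        })
        # Shared trunk carries the transferable representation.
        self.trunk = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
        # Modular action decoders, one per embodiment (robot gripper vs. human hand).
        self.action_decoders = nn.ModuleDict({k: nn.Linear(256, d) for k, d in act_dims.items()})

    def forward(self, obs_feat, domain, embodiment):
        z = self.trunk(self.vision_adaptors[domain](obs_feat))
        return self.action_decoders[embodiment](z)

policy = ModularPolicy()

# Stage (a), Sim-and-Human Pre-training: simulation pairs rendered visuals with robot
# actions, while human data pairs real-world visuals with human actions.
sim_out = policy(torch.randn(8, 512), domain="sim", embodiment="robot")
human_out = policy(torch.randn(8, 512), domain="real", embodiment="human")

# Stage (b), Real-robot Fine-tuning: keep only the compatible modules, i.e. the
# real-world vision adaptor and the robot action decoder.
real_out = policy(torch.randn(8, 512), domain="real", embodiment="robot")
```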
SimHum in Real-World Scenarios
We evaluate SimHum under In-Distribution (ID) and Out-of-Distribution (OOD) settings. While baselines degrade significantly in OOD scenarios (featuring unseen background textures, distractors, and extreme lighting), SimHum maintains robust performance, achieving relative improvements of up to 7.1× compared to the real-only baseline in these challenging unseen environments.
In-Distribution (ID)
Out-of-Distribution (OOD)
Decoupling the Effects: Why Both Sources Matter
Human Data → Visual Generalization
Human data provides diverse visual priors. As shown in the ablation study, excluding background diversity (F_bg) causes a sharp performance drop (-60%), highlighting its critical role in handling real-world visual perturbations.
Simulation → Spatial Robustness
Simulation data provides dense kinematic coverage. The heatmap demonstrates that SimHum generalizes significantly better to unseen object positions (outer grid), improving performance by +36.7% compared to human-only pre-training.
Data Efficiency and Performance Scalability
Data Efficiency
SimHum is significantly more data-efficient: it matches the baseline's performance with 20× less real-world data (only 8 real demonstrations vs. 160 for the baseline).
Performance Scalability
Performance scales consistently with the size of the pre-training datasets, suggesting continued improvements with larger simulation and human datasets.
Baseline Failure Cases in OOD Settings
We visualize typical failure cases of baseline methods in Out-of-Distribution (OOD) scenarios. Please switch between the tabs below to view different manipulation tasks.
Real-only
Trained solely on limited real-world data, it tends to overfit to spurious visual correlations, leading to poor generalization.
HumReal
Pre-trained on human data and then fine-tuned on limited real-world data, it struggles with the embodiment gap caused by the kinematic mismatch between the human hand and the robot gripper.
SimReal
Pre-trained on simulation and then fine-tuned on limited real-world data, it is hindered by the visual gap, as simulated rendering does not perfectly align with the real world.
BibTeX