Sim-and-Human Co-training for Data-Efficient and Generalizable Robotic Manipulation

SimHum unifies transferable actions from simulation with realistic visuals from human data to achieve data-efficient and generalizable robotic manipulation. In this video, we demonstrate how our co-training framework leverages the inherent complementarity of these sources. Empirically, SimHum outperforms the baseline by up to 35% under the same data budget and achieves relative improvements of up to 7.1× over the real-only baseline in unseen environments.

Abstract

Teaser Image

Synthetic simulation data and real-world human data provide scalable alternatives that circumvent the prohibitive costs of robot data collection. However, these sources suffer from the sim-to-real visual gap (a) and the human-to-robot embodiment gap (b), respectively, which limit the policy's generalization to real-world scenarios. In this work, we identify a natural yet underexplored complementarity between these sources: simulation offers the robot actions that human data lacks, while human data provides the real-world observations that simulation struggles to render. Motivated by this insight, we present SimHum, a co-training framework that simultaneously extracts kinematic priors from simulated robot actions and visual priors from real-world human observations (c). Building on these two complementary priors, we achieve data-efficient and generalizable robotic manipulation in real-world tasks. Empirically, SimHum outperforms the baseline by up to 35% under the same data collection budget, and achieves a 62.5% average success rate, a relative improvement of up to 7.1× over the real-only baseline (d).

Data Collection

The Complementarity of Heterogeneous Data

Real-world robot data is costly. We overcome this by leveraging the natural complementarity between two scalable data sources: (1) Simulation provides Robot Action but suffers from the sim-to-real visual gap due to inevitable rendering discrepancies in simulators; (2) Human Data offers Real-world Observation but suffers from the embodiment gap due to the kinematic mismatch between human hands and robot grippers. By unifying them, we get the best of both worlds.


Simulation Data

Robot Action
Visual - sim-to-real visual gap
Low Cost - 2000 episodes in total
Kinematic Consistency

Simulation and Real Robot employ identical robot URDFs, ensuring that the learned action priors are directly transferable.

✓ Same Robot URDF ✓ Transferable Action

Real Robot Data

Robot Action
Real-world Observation
High Cost - 320 episodes in total
Visual Alignment

Human data and real-robot data are captured with the same camera model from an identical viewpoint, guaranteeing visual alignment with real-world deployment.

✓ Same Camera ✓ Aligned Viewpoint

Human Data

Human Hand - embodiment gap
Real-world Observation
Low Cost - 2000 episodes in total
Consistent Task Definitions - All three sources collect data for an identical set of tasks to capture task-relevant manipulation priors
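
To make this organization concrete, a co-training pool can be pictured as episodes that share a task label but differ in visual domain and acting embodiment. The sketch below is a hypothetical schema for such a pool; the field names, task label, and Python layout are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Episode:
    """One demonstration, tagged by where its visuals and actions come from."""
    task: str                                  # shared task definition across all three sources
    domain: Literal["sim", "human", "real"]    # visual domain of the observations
    embodiment: Literal["robot", "hand"]       # actor that produced the actions
    observations: List                         # camera frames (same camera/viewpoint for human and real)
    actions: List                              # robot joint/gripper targets, or hand poses for human data

# Illustrative co-training pool mirroring the counts above:
# ~2,000 sim episodes, ~2,000 human episodes, ~320 real robot episodes,
# all covering the same task set so the learned priors stay task-relevant.
pool = [
    Episode("example_task", "sim",   "robot", observations=[], actions=[]),
    Episode("example_task", "human", "hand",  observations=[], actions=[]),
    Episode("example_task", "real",  "robot", observations=[], actions=[]),
]
```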

Human Data Collection System

Our human data collection system consists of hardware devices, a real-time GUI, and synchronized recording capabilities. The system ensures high-quality data capture with frame-level synchronization between GUI telemetry and scene observations.
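
As a rough illustration of the frame-level synchronization described above, the sketch below pairs each GUI telemetry sample with the nearest camera frame by timestamp; the function name, data layout, and skew tolerance are assumptions for illustration rather than the actual recording pipeline.

```python
import bisect

def synchronize(telemetry, frames, max_skew=0.02):
    """Pair each GUI telemetry sample with the nearest camera frame in time.

    telemetry: list of (timestamp_sec, payload) tuples from the GUI, sorted by time.
    frames:    list of (timestamp_sec, image) tuples from the camera, sorted by time.
    max_skew:  drop pairs whose timestamps differ by more than this many seconds.
    """
    frame_times = [t for t, _ in frames]
    pairs = []
    for t, payload in telemetry:
        i = bisect.bisect_left(frame_times, t)
        # Candidate frames: the one just before and the one just after this sample.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frames)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(frame_times[k] - t))
        if abs(frame_times[j] - t) <= max_skew:
            pairs.append((payload, frames[j][1]))
    return pairs
```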

Hardware Setup

Collected Data Visualization

Explore our multi-source demonstration dataset captured across diverse environments. Switch between the tabs below to view episodes for different manipulation tasks.

Simulation Data

Human Data

Real Robot Data


Approach

Teaser Image

We employ a Two-Stage Training Paradigm to train our Modular Policy Architecture. In the Sim-and-Human Pre-training stage (a), we leverage Modular Action Encoders/Decoders to extract transferable kinematic priors from simulation, and Domain-specific Vision Adaptors to extract visual priors from human data. Subsequently, for Real-robot Fine-tuning (b), we restructure the policy by selectively retaining the compatible components—specifically the Real-world Vision Adaptor and Robot Encoder/Decoder—to achieve data-efficient generalization.
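
A minimal sketch (PyTorch-style, assuming a transformer backbone and simple linear adaptors) of how the modular components and the two-stage component selection could fit together; the module names, dimensions, and deletion strategy are illustrative assumptions rather than the exact SimHum implementation.

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    """Illustrative modular policy: domain-specific vision adaptors and
    embodiment-specific action encoders/decoders around a shared backbone."""

    def __init__(self, obs_dim=512, feat_dim=256, robot_dim=7, hand_dim=48):
        super().__init__()
        # Domain-specific vision adaptors (simulation / human / real-world visuals).
        self.vision_adaptors = nn.ModuleDict({
            d: nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
            for d in ("sim", "human", "real")
        })
        # Embodiment-specific action encoders/decoders (robot gripper vs. human hand).
        dims = {"robot": robot_dim, "hand": hand_dim}
        self.action_encoders = nn.ModuleDict(
            {e: nn.Linear(d, feat_dim) for e, d in dims.items()})
        self.action_decoders = nn.ModuleDict(
            {e: nn.Linear(feat_dim, d) for e, d in dims.items()})
        # Shared backbone, assumed here to carry the kinematic and visual priors.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obs_feat, prev_actions, domain, embodiment):
        # obs_feat: (B, T, obs_dim); prev_actions: (B, T, action dim of `embodiment`).
        tokens = torch.cat([self.vision_adaptors[domain](obs_feat),
                            self.action_encoders[embodiment](prev_actions)], dim=1)
        h = self.backbone(tokens)
        return self.action_decoders[embodiment](h[:, -1])  # predict the next action

# Stage (a), Sim-and-Human Pre-training: mix batches such as
# policy(obs, acts, domain="sim", embodiment="robot") and
# policy(obs, acts, domain="human", embodiment="hand").
policy = ModularPolicy()

# Stage (b), Real-robot Fine-tuning: retain only the compatible components
# (real-world vision adaptor, robot action encoder/decoder), discard the rest,
# then fine-tune on the limited real robot data.
del policy.vision_adaptors["sim"], policy.vision_adaptors["human"]
del policy.action_encoders["hand"], policy.action_decoders["hand"]
```

The deletion step is only one plausible reading of "selectively retaining the compatible components"; the shared backbone is assumed to persist across both stages so that the priors learned in stage (a) transfer to stage (b).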

SimHum in Real-World Scenarios

We evaluate SimHum under In-Distribution (ID) and Out-of-Distribution (OOD) settings. While baselines degrade significantly in OOD scenarios (featuring unseen background textures, distractors, and extreme lighting), SimHum maintains robust performance, achieving relative improvements of up to 7.1× over the real-only baseline in these challenging unseen environments.

All videos show autonomous policy rollouts at 1× speed.

In-Distribution (ID)

Base Scene
Complex Scene 1
Complex Scene 2
Complex Scene 3

Out-of-Distribution (OOD)

Extreme Complex Scene 1
Extreme Complex Scene 2

Decoupling the Effects: Why Both Sources Matter


Human Data Visual Generalization

Human Data Analysis:

Diversity Factor       SimHum (Full)   SimHum (w/o Factor)
Background Diversity   80.0% ± 8.9     20.0% ± 8.9
Distractor Diversity   75.0% ± 10.7    35.0% ± 9.7
Lighting Diversity     75.0% ± 11.2    50.0% ± 9.7
Object Diversity       80.0% ± 11.2    50.0% ± 8.9

Human data provides diverse visual priors. As shown in the ablation study, excluding background diversity (F_bg) causes a sharp 60-point performance drop (80.0% → 20.0%), highlighting its critical role in handling real-world visual perturbations.

Simulation Spatial Robustness

Sim Data Analysis

Simulation data provides dense kinematic coverage. The heatmap demonstrates that SimHum generalizes significantly better to unseen object positions (outer grid), improving performance by +36.7% compared to human-only pre-training.

Data Efficiency and Performance Scalability


Data Efficiency

Data Efficiency Chart:

Budget            Real Only     HumReal       SimReal       SimHum (Ours)
2 hours of data   15.0% ± 8.0   n/a           n/a           30.0% ± 10.3
4 hours of data   15.0% ± 8.0   n/a           n/a           40.0% ± 11.0
8 hours of data   25.0% ± 9.7   20.0% ± 8.9   30.0% ± 10.3  70.0% ± 10.3

SimHum is significantly more data-efficient: it matches the baseline's performance with 20× less real-world data (comparable results with only 8 real demos versus 160 demos for the real-only baseline).

Performance Scalability

Data Scaling Chart:

Real Demos   SimHum (Ours)   Real Only
8            55.0% ± 11.1    38.3% ± 10.9
19           60.0% ± 11.0    43.3% ± 11.1
40           65.0% ± 10.7    45.0% ± 11.1
80           83.3% ± 8.3     48.3% ± 11.2
160          91.7% ± 6.2     58.3% ± 11.0

Our performance scales consistently with the size of pre-training datasets, suggesting continued improvements with larger simulation and human datasets. This demonstrates the scalability of our approach.

Baseline Failure Cases in OOD Settings

We visualize typical failure cases of baseline methods in Out-of-Distribution (OOD) scenarios. Please switch between the tabs below to view different manipulation tasks.

Real-only

Trained solely on limited real-world data, it tends to overfit to spurious visual correlations, leading to poor generalization.

HumReal

Pre-trained on human data and then fine-tuned on limited real-world data, it struggles with the embodiment gap due to the kinematic mismatch between human hands and robot grippers.

SimReal

Pre-trained on simulation and then fine-tuned on limited real-world data, it is hindered by the visual gap, as simulated rendering does not perfectly align with the real world.

All videos show autonomous policy rollouts at 1× speed.

Real Only

Final Score: 0/2

HumReal

Final Score: 1/2

SimReal

Final Score: 1/2

BibTeX