Sim-and-Human Co-training for Data-Efficient and Generalizable Robotic Manipulation

SimHum unifies transferable actions from simulation with realistic visuals from human data to achieve data-efficient and generalizable robotic manipulation. In this video, we demonstrate how our co-training framework leverages the inherent complementarity of these sources. Empirically, SimHum outperforms the baseline by up to 35% under the same data budget and achieves relative improvements of up to 7.1× over the real-only baseline in unseen environments.

Abstract

Teaser Image

Synthetic simulation data and real-world human data provide scalable alternatives that circumvent the prohibitive costs of robot data collection. However, these sources suffer from the sim-to-real visual gap (a) and the human-to-robot embodiment gap (b), respectively, which limit the policy's generalization to real-world scenarios. In this work, we identify a natural yet underexplored complementarity between these sources: simulation offers the robot actions that human data lacks, while human data provides the real-world observations that simulation struggles to render. Motivated by this insight, we present SimHum, a co-training framework that simultaneously extracts kinematic priors from simulated robot actions and visual priors from real-world human observations (c). Building on these two complementary priors, we achieve data-efficient and generalizable robotic manipulation in real-world tasks. Empirically, SimHum outperforms the baseline by up to 35% under the same data collection budget, and achieves a 62.5% average success rate, a relative improvement of up to 7.1× over the real-only baseline (d).

Data Collection

The Complementarity of Heterogeneous Data

Real-world robot data is costly. We overcome this by leveraging the natural complementarity between two scalable data sources: (1) Simulation provides Robot Action but suffers from the sim-to-real visual gap due to inevitable rendering discrepancies in simulators; (2) Human Data offers Real-world Observation but suffers from the embodiment gap due to the kinematic mismatch between human hands and robot grippers. By unifying them, we get the best of both worlds.


Simulation Data

Robot Action
Visual - sim-to-real visual gap
Low Cost - 2000 episodes in total
Kinematic Consistency

Simulation and Real Robot employ identical robot URDFs, ensuring that the learned action priors are directly transferable.

✓ Same Robot URDF ✓ Transferable Action

Real Robot Data

Robot Action
Real-world Observation
High Cost - 320 episodes in total
Visual Alignment

Human data and real-robot data are captured with the same camera model from an identical viewpoint, guaranteeing visual alignment with real-world deployment.

✓ Same Camera ✓ Aligned Viewpoint

Human Data

Human Hand - embodiment gap
Real-world Observation
Low Cost - 2000 episodes in total
Consistent Task Definitions - All three sources collect data for an identical set of tasks to capture task-relevant manipulation priors
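
To make this organization concrete, a co-training pool can be pictured as episodes that share a task label but differ in visual domain and acting embodiment. The sketch below is a hypothetical schema for such a pool; the field names, task label, and Python layout are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Episode:
    """One demonstration, tagged by where its visuals and actions come from."""
    task: str                                  # shared task definition across all three sources
    domain: Literal["sim", "human", "real"]    # visual domain of the observations
    embodiment: Literal["robot", "hand"]       # actor that produced the actions
    observations: List                         # camera frames (same camera/viewpoint for human and real)
    actions: List                              # robot joint/gripper targets, or hand poses for human data

# Illustrative co-training pool mirroring the counts above:
# ~2,000 sim episodes, ~2,000 human episodes, ~320 real robot episodes,
# all covering the same task set so the learned priors stay task-relevant.
pool = [
    Episode("example_task", "sim",   "robot", observations=[], actions=[]),
    Episode("example_task", "human", "hand",  observations=[], actions=[]),
    Episode("example_task", "real",  "robot", observations=[], actions=[]),
]
```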

Human Data Collection System

Our human data collection system consists of hardware devices, a real-time GUI, and synchronized recording capabilities. The system ensures high-quality data capture with frame-level synchronization between GUI telemetry and scene observations.
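
As a rough illustration of the frame-level synchronization described above, the sketch below pairs each GUI telemetry sample with the nearest camera frame by timestamp; the function name, data layout, and skew tolerance are assumptions for illustration rather than the actual recording pipeline.

```python
import bisect

def synchronize(telemetry, frames, max_skew=0.02):
    """Pair each GUI telemetry sample with the nearest camera frame in time.

    telemetry: list of (timestamp_sec, payload) tuples from the GUI, sorted by time.
    frames:    list of (timestamp_sec, image) tuples from the camera, sorted by time.
    max_skew:  drop pairs whose timestamps differ by more than this many seconds.
    """
    frame_times = [t for t, _ in frames]
    pairs = []
    for t, payload in telemetry:
        i = bisect.bisect_left(frame_times, t)
        # Candidate frames: the one just before and the one just after this sample.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frames)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(frame_times[k] - t))
        if abs(frame_times[j] - t) <= max_skew:
            pairs.append((payload, frames[j][1]))
    return pairs
```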

Hardware Setup

Collected Data Visualization

Explore our multi-source demonstration dataset captured across diverse environments. Switch between the tabs below to view episodes for different manipulation tasks.

Simulation Data

Human Data

Real Robot Data


Approach

Teaser Image

We employ a Two-Stage Training Paradigm to train our Modular Policy Architecture. In the Sim-and-Human Pre-training stage (a), we leverage Modular Action Encoders/Decoders to extract transferable kinematic priors from simulation, and Domain-specific Vision Adaptors to extract visual priors from human data. Subsequently, for Real-robot Fine-tuning (b), we restructure the policy by selectively retaining the compatible components—specifically the Real-world Vision Adaptor and Robot Encoder/Decoder—to achieve data-efficient generalization.
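
A minimal sketch (PyTorch-style, assuming a transformer backbone and simple linear adaptors) of how the modular components and the two-stage component selection could fit together; the module names, dimensions, and deletion strategy are illustrative assumptions rather than the exact SimHum implementation.

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    """Illustrative modular policy: domain-specific vision adaptors and
    embodiment-specific action encoders/decoders around a shared backbone."""

    def __init__(self, obs_dim=512, feat_dim=256, robot_dim=7, hand_dim=48):
        super().__init__()
        # Domain-specific vision adaptors (simulation / human / real-world visuals).
        self.vision_adaptors = nn.ModuleDict({
            d: nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
            for d in ("sim", "human", "real")
        })
        # Embodiment-specific action encoders/decoders (robot gripper vs. human hand).
        dims = {"robot": robot_dim, "hand": hand_dim}
        self.action_encoders = nn.ModuleDict(
            {e: nn.Linear(d, feat_dim) for e, d in dims.items()})
        self.action_decoders = nn.ModuleDict(
            {e: nn.Linear(feat_dim, d) for e, d in dims.items()})
        # Shared backbone, assumed here to carry the kinematic and visual priors.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obs_feat, prev_actions, domain, embodiment):
        # obs_feat: (B, T, obs_dim); prev_actions: (B, T, action dim of `embodiment`).
        tokens = torch.cat([self.vision_adaptors[domain](obs_feat),
                            self.action_encoders[embodiment](prev_actions)], dim=1)
        h = self.backbone(tokens)
        return self.action_decoders[embodiment](h[:, -1])  # predict the next action

# Stage (a), Sim-and-Human Pre-training: mix batches such as
# policy(obs, acts, domain="sim", embodiment="robot") and
# policy(obs, acts, domain="human", embodiment="hand").
policy = ModularPolicy()

# Stage (b), Real-robot Fine-tuning: retain only the compatible components
# (real-world vision adaptor, robot action encoder/decoder), discard the rest,
# then fine-tune on the limited real robot data.
del policy.vision_adaptors["sim"], policy.vision_adaptors["human"]
del policy.action_encoders["hand"], policy.action_decoders["hand"]
```

The deletion step is only one plausible reading of "selectively retaining the compatible components"; the shared backbone is assumed to persist across both stages so that the priors learned in stage (a) transfer to stage (b).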

SimHum in Real-World Scenarios

We evaluate SimHum under In-Distribution (ID) and Out-of-Distribution (OOD) settings. While baselines degrade significantly in OOD scenarios (featuring unseen background textures, distractors, and extreme lighting), SimHum maintains robust performance, achieving relative improvements of up to 7.1× over the real-only baseline in these challenging unseen environments.

All videos show autonomous policy rollouts at 1× speed.

In-Distribution (ID)

Base Scene
Complex Scene 1
Complex Scene 2
Complex Scene 3

Out-of-Distribution (OOD)

Extreme Complex Scene 1
Extreme Complex Scene 2

Decoupling the Effects: Why Both Sources Matter


Human Data Visual Generalization

Human Data Analysis:

Diversity Factor       SimHum (Full)   SimHum (w/o Factor)
Background Diversity   80.0% ± 8.9     20.0% ± 8.9
Distractor Diversity   75.0% ± 10.7    35.0% ± 9.7
Lighting Diversity     75.0% ± 11.2    50.0% ± 9.7
Object Diversity       80.0% ± 11.2    50.0% ± 8.9

Human data provides diverse visual priors. As shown in the ablation study, excluding background diversity (F_bg) causes a sharp 60-point performance drop (80.0% → 20.0%), highlighting its critical role in handling real-world visual perturbations.

Simulation Spatial Robustness

Sim Data Analysis

Simulation data provides dense kinematic coverage. The heatmap demonstrates that SimHum generalizes significantly better to unseen object positions (outer grid), improving performance by +36.7% compared to human-only pre-training.

Data Efficiency and Performance Scalability


Data Efficiency

Data Efficiency Chart:

Budget            Real Only     HumReal       SimReal       SimHum (Ours)
2 hours of data   15.0% ± 8.0   n/a           n/a           30.0% ± 10.3
4 hours of data   15.0% ± 8.0   n/a           n/a           40.0% ± 11.0
8 hours of data   25.0% ± 9.7   20.0% ± 8.9   30.0% ± 10.3  70.0% ± 10.3

SimHum is significantly more data-efficient: it matches the baseline's performance with 20× less real-world data (comparable results with only 8 real demos versus 160 demos for the real-only baseline).

Performance Scalability

Data Scaling Chart:

Real Demos   SimHum (Ours)   Real Only
8            55.0% ± 11.1    38.3% ± 10.9
19           60.0% ± 11.0    43.3% ± 11.1
40           65.0% ± 10.7    45.0% ± 11.1
80           83.3% ± 8.3     48.3% ± 11.2
160          91.7% ± 6.2     58.3% ± 11.0

Our performance scales consistently with the size of pre-training datasets, suggesting continued improvements with larger simulation and human datasets. This demonstrates the scalability of our approach.

Baseline Failure Cases in OOD Settings

We visualize typical failure cases of baseline methods in Out-of-Distribution (OOD) scenarios. Please switch between the tabs below to view different manipulation tasks.

Real-only

Trained solely on limited real-world data, it tends to overfit to spurious visual correlations, leading to poor generalization.

HumReal

Pre-trained on human data and then fine-tuned on limited real-world data, it struggles with the embodiment gap due to the kinematic mismatch between human hands and robot grippers.

SimReal

Pre-trained on simulation and then fine-tuned on limited real-world data, it is hindered by the visual gap, as simulated rendering does not perfectly align with the real world.

All videos show autonomous policy rollouts at 1× speed.

Real Only

Final Score: 0/2

HumReal

Final Score: 1/2

SimReal

Final Score: 1/2

BibTeX