EasyInsert: A Data-Efficient and Generalizable Insertion Policy

Guanghe Li¹,²,³,⁴*, Junming Zhao¹,²,³,⁵*, Shengjie Wang¹,²,³, Yang Gao¹,²,³†

* Equal contribution † Corresponding author

¹Tsinghua University, ²Shanghai AI Laboratory, ³Shanghai Qi Zhi Institute, ⁴Jilin University, ⁵Fudan University

Overview

We propose EasyInsert, a framework built on the human intuition that the relative pose (delta pose) between plug and socket is sufficient for successful insertion. It uses efficient, automated real-world data collection with minimal human labor to train a generalizable model for relative pose prediction. During execution, EasyInsert follows a coarse-to-fine procedure based on the predicted delta pose and successfully performs a variety of insertion tasks. It demonstrates strong zero-shot generalization to unseen objects in cluttered environments and handles significant initial pose deviations, while maintaining high sample efficiency and requiring little human effort. In real-world experiments, with just 5 hours of training data, EasyInsert achieves over 90% success in zero-shot insertion for 13 out of 15 unseen objects, including challenging ones such as Type-C, HDMI, and Ethernet cables. Furthermore, with only one human demonstration and 4 minutes of automatically collected data for fine-tuning, it reaches a success rate of over 90% on all 15 objects.
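To make the notion of a delta pose concrete, the following is a minimal sketch, not the authors' implementation. It assumes a planar simplification in which a pose is `(x, y, theta)` rather than a full 6-DoF pose. The helper names (`compose`, `inverse`, `delta_pose`, `sample_labeled_offset`) and the sampling ranges are hypothetical. It illustrates one plausible way automated collection could yield labels for free: perturb the gripper away from a known inserted pose, and the relative pose back to that target is known by construction.

```python
import math
import random


def compose(a, b):
    """Compose planar poses (x, y, theta): apply pose b in the frame of pose a."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)


def inverse(p):
    """Inverse of a planar pose, so that compose(p, inverse(p)) is the identity."""
    x, y, t = p
    return (-(math.cos(t) * x + math.sin(t) * y),
            math.sin(t) * x - math.cos(t) * y,
            -t)


def delta_pose(current, target):
    """Relative pose d such that compose(current, d) equals target."""
    return compose(inverse(current), target)


def sample_labeled_offset(target, rng, max_shift=0.05, max_turn=0.3):
    """Perturb a known target pose; return (perturbed pose, delta-pose label).

    The ranges (5 cm, 0.3 rad) are illustrative assumptions, not values
    taken from the paper.
    """
    offset = (rng.uniform(-max_shift, max_shift),
              rng.uniform(-max_shift, max_shift),
              rng.uniform(-max_turn, max_turn))
    perturbed = compose(target, offset)
    return perturbed, delta_pose(perturbed, target)


rng = random.Random(0)
inserted_pose = (0.40, 0.00, 0.10)  # hypothetical pose at which the plug is seated
start_pose, label = sample_labeled_offset(inserted_pose, rng)
recovered = compose(start_pose, label)  # applying the label returns to the target
```

In this sketch, each sampled `start_pose` would be paired with the camera images observed there, and `label` would serve as the regression target for the pose predictor.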

EasyInsert Framework

Overview of our method. (1) Left: the data collection module constructs the training dataset with 80% automated and 20% manual data collection; manual collection focuses on fine-grained interactions around the socket area, while automated collection scales up data over a larger spatial range. (2) Middle: a generalist policy, pretrained on the collected data, predicts the relative pose between plug and socket directly from visual inputs. For tasks requiring higher precision, the same data collection module can be reused to perform one-shot fine-tuning on the target objects. (3) Right: motivated by human insertion behavior, we design a similar coarse-to-fine execution process for the robot.
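The coarse-to-fine idea in step (3) can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' controller. The learned policy is replaced by a hypothetical `predict_delta` that returns a deliberately imperfect (scaled) translation-only delta, and the thresholds (`coarse_threshold`, `fine_step`, `tolerance`) are made-up values. Far from the socket, the loop applies the full predicted delta; once the prediction says the plug is close, it re-predicts after every small, bounded step.

```python
import math


def predict_delta(current, target, bias=0.9):
    """Hypothetical stand-in for the learned policy.

    Returns the true translation delta scaled by `bias`, mimicking a
    prediction that is useful but not exact.
    """
    return [bias * (t - c) for c, t in zip(current, target)]


def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def coarse_to_fine(start, target, coarse_threshold=0.01, fine_step=0.002,
                   tolerance=0.0005, max_steps=100):
    """Move toward the target using large steps when far, small steps when near."""
    position = list(start)
    phases = []
    for _ in range(max_steps):
        delta = predict_delta(position, target)
        distance = norm(delta)  # predicted, not true, remaining distance
        if distance < tolerance:
            break
        if distance > coarse_threshold:
            scale = 1.0  # coarse phase: apply the whole predicted delta
            phases.append("coarse")
        else:
            scale = min(1.0, fine_step / distance)  # fine phase: bounded step
            phases.append("fine")
        position = [p + scale * d for p, d in zip(position, delta)]
    return position, phases


socket_position = [0.0, 0.0, 0.0]
final_position, phases = coarse_to_fine([0.05, -0.03, 0.08], socket_position)
final_error = norm([t - p for p, t in zip(final_position, socket_position)])
```

Re-querying the predictor after every fine step is the design choice that matters here: errors in an early, long-range prediction are corrected by later predictions made closer to the socket.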

Evaluation Videos

We trained EasyInsert on five categories of objects and evaluated it on 15 unseen insertion tasks. The robotic system consists of a 7-DoF Franka Emika Panda arm with two wrist-mounted Intel RealSense 405 RGB cameras for visual perception. Experimental results show that EasyInsert generalizes well across diverse objects, spatial configurations, and environmental conditions, while remaining robust to perturbations.

Zero-Shot Videos

AutoMate-01129

AutoMate-00417

AutoMate-00320

AutoMate-01041

AutoMate-00681

Key

HDMI

Type-C

Ethernet

Doughnut

Trapezoid

Rectangle

Rectangle-Thin

Stick

Round-1

Round-2

Zero-Shot (Human Perturbation)

Type-C

Rectangle

Ethernet

AutoMate-00681

Zero-Shot (Extreme Pose Deviation)

Type-C

Rectangle

HDMI

AutoMate-00681

Discussions

Main Results

Trained on only 5 hours of data, EasyInsert exhibited strong zero-shot generalization across the 15 unseen objects. As shown below, the method achieved success rates of 90% or higher on 13 of the 15 tasks, with only two cases showing a lower success rate of 80%. Remarkably, even though it was tested in a much more cluttered environment and with much greater initial pose deviation, EasyInsert's zero-shot success rate exceeds AutoMate's in-domain results, highlighting its strong generalization capability.

A Long Uncut Video

Here, we present an unedited video in which EasyInsert successfully performs consecutive object insertions in a zero-shot setting under human perturbation. This demonstrates EasyInsert's strong generalization capability and its robustness to external disturbances.

Citation

If you find our work helpful, please cite us:

@article{li2025easyinsert,
  title={EasyInsert: A Data-Efficient and Generalizable Insertion Policy},
  author={Li, Guanghe and Zhao, Junming and Wang, Shengjie and Gao, Yang},
  journal={arXiv preprint arXiv:2505.16187},
  year={2025}
}

Thank you!