We propose EasyInsert, a framework that leverages the human intuition that the relative pose (delta pose) between plug and socket is sufficient for successful insertion. It employs efficient, automated real-world data collection with minimal human labor to train a generalizable model for relative pose prediction. During execution, EasyInsert follows a coarse-to-fine procedure based on the predicted delta pose and successfully performs a variety of insertion tasks. EasyInsert demonstrates strong zero-shot generalization to unseen objects in cluttered environments, handling significant initial pose deviations while maintaining high sample efficiency and requiring little human effort. In real-world experiments, with just 5 hours of training data, EasyInsert achieves over 90% success in zero-shot insertion for 13 out of 15 unseen objects, including challenging ones such as Type-C, HDMI, and Ethernet cables. Furthermore, with only one human demonstration and 4 minutes of automatically collected data for fine-tuning, it reaches a success rate of over 90% on all 15 objects.
Overview of our method. (1) Left: the data collection module constructs the training dataset from 80% automated and 20% manual data collection; manual collection focuses on fine-grained interactions around the socket area, while automated collection scales up data over a larger spatial range. (2) Middle: a generalist policy, pretrained on the collected data, predicts the relative pose between plug and socket directly from visual inputs. For tasks requiring higher precision, the same data collection module can be reused to perform one-shot fine-tuning on the target objects. (3) Right: motivated by human insertion behavior, we design a similar coarse-to-fine execution process for the robot.
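The coarse-to-fine execution described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, thresholds, and step sizes are hypothetical, the pose is reduced to a 3-D position (orientation omitted for brevity), and `predict_delta` stands in for the learned model that maps camera images to a delta pose. The idea shown is only the control structure: take one large move while the predicted delta is large, then switch to small, clipped closed-loop corrections near the socket.

```python
import math
from typing import Callable, List, Tuple

Vec = List[float]  # hypothetical pose representation: [x, y, z] in metres


def delta_norm(delta: Vec) -> float:
    """Euclidean length of a delta vector."""
    return math.sqrt(sum(d * d for d in delta))


def clip_step(delta: Vec, max_norm: float) -> Vec:
    """Scale delta down so that its length does not exceed max_norm."""
    n = delta_norm(delta)
    if n == 0.0 or n <= max_norm:
        return list(delta)
    scale = max_norm / n
    return [d * scale for d in delta]


def coarse_to_fine_insert(
    pose: Vec,
    predict_delta: Callable[[Vec], Vec],  # stand-in for the learned predictor
    coarse_thresh: float = 0.02,  # assumed: above this, take one large move
    fine_step: float = 0.005,     # assumed: maximum step in the fine stage
    fine_tol: float = 0.001,      # assumed: predicted delta small enough to insert
    max_iters: int = 50,
) -> Tuple[Vec, bool]:
    """Move toward the socket using predicted deltas; return (pose, aligned)."""
    pose = list(pose)
    for _ in range(max_iters):
        delta = predict_delta(pose)
        err = delta_norm(delta)
        if err <= fine_tol:
            return pose, True  # aligned: a final insertion motion would follow
        if err > coarse_thresh:
            step = list(delta)  # coarse stage: apply the whole predicted delta
        else:
            step = clip_step(delta, fine_step)  # fine stage: small correction
        pose = [p + s for p, s in zip(pose, step)]
    return pose, False


if __name__ == "__main__":
    goal = [0.0, 0.0, 0.0]

    def toy_predictor(pose: Vec) -> Vec:
        # Imperfect toy predictor that underestimates the true offset by 10%.
        return [0.9 * (g - p) for g, p in zip(goal, pose)]

    final_pose, aligned = coarse_to_fine_insert([0.10, -0.05, 0.08], toy_predictor)
    print(aligned, final_pose)
```

Re-querying the predictor after every step is what lets the fine stage absorb prediction error from the coarse move; on a real robot each query would use fresh wrist-camera images.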
We trained EasyInsert on five categories of objects and evaluated it on 15 unseen insertion tasks. The robotic system consists of a 7-DoF Franka Emika Panda arm with two wrist-mounted Intel RealSense 405 RGB cameras for visual perception. Experimental results demonstrate that EasyInsert generalizes well across diverse objects, spatial configurations, and environmental conditions, while remaining robust to perturbations.
Panels: AutoMate-01129, AutoMate-00417, AutoMate-00320, AutoMate-01041, AutoMate-00681, Key, HDMI, Type-C, Ethernet, Doughnut, Trapezoid, Rectangle, Rectangle-Thin, Stick, Round-1, Round-2.
Panels: Type-C, Rectangle, Ethernet, AutoMate-00681.
Panels: Type-C, Rectangle, HDMI, AutoMate-00681.
Trained on only 5 hours of data, EasyInsert exhibited strong zero-shot generalization across all 15 unseen objects. As shown below, the method achieved success rates of 90% or higher on 13 of the 15 tasks, with the remaining two at 80%. Remarkably, although it was tested in a much more cluttered environment and with much greater initial pose deviation, EasyInsert's zero-shot success rate exceeds AutoMate's in-domain results, highlighting its strong generalization capability.
Here, we present an unedited video in which EasyInsert successfully performs consecutive object insertions in a zero-shot setting under human perturbation. This demonstrates EasyInsert's strong generalization capability and robustness against external disturbances.
@article{li2025easyinsert,
title={EasyInsert: A Data-Efficient and Generalizable Insertion Policy},
author={Li, Guanghe and Zhao, Junming and Wang, Shengjie and Gao, Yang},
journal={arXiv preprint arXiv:2505.16187},
year={2025}
}