Bimanual manipulation is a fundamental robotic skill that requires continuous, precise coordination between two arms. While imitation learning (IL) is the dominant paradigm for acquiring this capability, existing approaches, whether robot-centric or object-centric, often overlook the dynamic geometric relationship among the two arms and the manipulated object. This omission frequently leads to inter-arm collisions, unstable grasps, and degraded performance on complex tasks. To address this, we explicitly model the Robot–Object Triadic Interaction (RoTri) in bimanual systems, encoding the relative 6D poses between the two arms and the object to capture their spatial triadic relationship and establish continuous triangular geometric constraints. Building on this representation, we introduce RoTri-Diff, a diffusion-based imitation learning framework that combines RoTri constraints with robot keyposes and object motion in a hierarchical diffusion process, enabling the generation of stable, coordinated trajectories and robust execution across different modes of bimanual manipulation. Extensive experiments show that our approach outperforms state-of-the-art baselines by 10.2% on 11 representative RLBench2 tasks and achieves stable performance on 4 challenging real-world bimanual tasks.
We present RoTri-Diff, a diffusion-based framework for bimanual imitation learning centered on Robot–Object Triadic Interaction (RoTri). By explicitly modeling and leveraging the relative 6D poses between the two end-effectors and the manipulated object, it achieves stable performance on bimanual tasks that require fine-grained coordination.
Overview of RoTri-Diff. (a) Visual Perception and RoTri Modeling: Extracting the initial object point cloud $F_0$, 3D semantic features $S_t$, and the initial RoTri representation $R_0$ from multi-view observations. (b) Imitation Learning Guidance Signals: Three complementary signals used for supervision: Keyposes, Object Pointflow, and the RoTri Relationship. (c) Hierarchical Diffusion Model: The model concurrently predicts object pointflow and autoregressively predicts a future RoTri segment. These predictions then serve as dynamic conditions to guide the denoising and generation of keyposes and continuous actions within a synergistic attention module.
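As a concrete illustration of the triadic encoding described above, the sketch below computes the three pairwise relative 6D poses (as SE(3) homogeneous transforms) among the left end-effector, right end-effector, and object. This is a minimal sketch under our own assumptions: the function names, the choice of 4x4 transform representation, and the dictionary layout are illustrative, not the paper's implementation.

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in frame a: T_a^{-1} @ T_b."""
    return np.linalg.inv(T_a) @ T_b

def rotri(T_left, T_right, T_obj):
    """Hypothetical RoTri-style encoding: the three pairwise relative poses
    (left->right, left->object, right->object) that form the triangular
    geometric constraint among the two arms and the object."""
    return {
        "left_right": relative_pose(T_left, T_right),
        "left_obj": relative_pose(T_left, T_obj),
        "right_obj": relative_pose(T_right, T_obj),
    }

# Example: two grippers 1 m apart, object midway between them and 0.5 m forward.
I = np.eye(3)
rel = rotri(se3(I, [0.0, 0.0, 0.0]),
            se3(I, [1.0, 0.0, 0.0]),
            se3(I, [0.5, 0.5, 0.0]))
```

Tracking these three relative transforms over a trajectory (rather than each absolute pose alone) is what lets a policy supervise the triangle's shape directly, e.g. penalizing drift in `left_right` that would signal an impending inter-arm collision or an unstable two-handed grasp.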