Teleoperation can be a powerful method, not only for performing complex tasks, but also for collecting on-robot data. This data is essential for robot learning from demonstrations, as teleoperation provides accurate, precise examples and natural, smooth trajectories for imitation learning. These allow the learned policies to generalize to new environments, configurations, and tasks.
Thanks to large-scale, real-robot data, learning-based robotic manipulation has advanced to a new level in the past few years, but that doesn’t mean it’s without limitations. Currently, there are two major components in most teleoperation systems: actuation and perception.
For actuation, many engineers use joint copying to puppeteer the robot, providing high control bandwidth and precision. However, this requires the operator and the robot to be physically in the same location, ruling out remote control. Each robot also needs to be coupled with specific teleoperation hardware.
In addition, these systems are not yet able to operate multi-finger dexterous hands.
The most straightforward way to handle perception is to observe the robot's task space with the operator's own eyes from a third-person or first-person view. Such an approach will inevitably result in part of the scene being occluded during teleoperation. The operator also cannot ensure that the collected demonstration has captured the visual observations needed for policy learning.
On top of that, for fine-grained manipulation tasks, it's difficult for the teleoperator to look closely and intuitively at the object during manipulation. Displaying a static third-person camera view using passthrough in a virtual reality (VR) headset can result in similar challenges.
A team of researchers from the Massachusetts Institute of Technology and the University of California, San Diego, said it hopes to achieve a new level of intuitiveness and ease of use in teleoperation systems, ensuring high-quality, diverse, and scalable data. To do this, the team has proposed an immersive teleoperation system called Open-TeleVision.
How does Open-TeleVision work?
The MIT and UC San Diego team said Open-TeleVision allows operators to actively perceive the robot’s surroundings in a stereoscopic manner. Open-TeleVision is a general framework that allows users to perform teleoperation with high precision. It applies to different VR devices on different robots and manipulators and is open-source.
The system mirrors the operator’s arm and hand movements on the robot. The team said this creates an immersive experience, as if the operator’s mind is transmitted to a robot embodiment.
The researchers tested the system with two humanoid robots: the Unitree H1, which has multi-finger hands, and the Fourier GR1, which has parallel-jaw grippers.
To validate Open-TeleVision, the team started by capturing the human operator’s hand poses and re-targeting them to control the robot’s hands or grippers. It relied on inverse kinematics to convert the operator’s hand-root position into a target position for the robot arm’s end effector.
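To illustrate the kind of mapping involved, the sketch below uses made-up frame transforms, link lengths, and function names (this is not the authors' code): it retargets the operator's hand-root position into the robot's base frame and runs a generic damped-least-squares inverse-kinematics step on a toy two-link arm to chase that target.

```python
# Minimal retargeting + IK sketch (illustrative only). The real system solves
# IK against the humanoid arm's full kinematic model; here a planar 2-link
# arm stands in so the loop stays short and runnable.
import numpy as np

# Hypothetical fixed transform from the headset frame to the robot base frame.
R_HEAD_TO_BASE = np.eye(3)
T_HEAD_TO_BASE = np.array([0.0, 0.0, 0.3])

def retarget_hand_root(p_hand_in_headset: np.ndarray) -> np.ndarray:
    """Map the operator's hand-root position into the robot base frame."""
    return R_HEAD_TO_BASE @ p_hand_in_headset + T_HEAD_TO_BASE

# Toy planar 2-link arm standing in for the humanoid arm (lengths in meters).
L1, L2 = 0.3, 0.25

def forward_kinematics(q: np.ndarray) -> np.ndarray:
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q: np.ndarray) -> np.ndarray:
    j11 = -L1 * np.sin(q[0]) - L2 * np.sin(q[0] + q[1])
    j12 = -L2 * np.sin(q[0] + q[1])
    j21 = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    j22 = L2 * np.cos(q[0] + q[1])
    return np.array([[j11, j12], [j21, j22]])

def ik_step(q: np.ndarray, target: np.ndarray, damping: float = 1e-2) -> np.ndarray:
    """One damped-least-squares update toward the retargeted end-effector target."""
    err = target - forward_kinematics(q)
    J = jacobian(q)
    dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
    return q + dq

# Usage: track a retargeted wrist position over a few iterations.
q = np.array([0.1, 0.2])
target_xy = retarget_hand_root(np.array([0.35, 0.15, 0.0]))[:2]
for _ in range(50):
    q = ik_step(q, target_xy)
```

The structure of the loop is the point: retarget the operator's pose, compute the end-effector error, and update the joint command at each control step.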
The team tested the effectiveness of the system by collecting data and training imitation-learning policies on four long-horizon precise tasks. These included can sorting, can insertion, folding, and unloading.
More dexterous robotic manipulation offers benefits
The researchers said their major contribution to enabling fine-grained manipulation comes from perception. Open-TeleVision combines a VR system with active visual feedback.
To do this, the team used a single active stereo RGB camera placed on the robot’s head. The camera is mounted on an actuated mount with two or three degrees of freedom, mimicking human head movement to observe a large workspace.
During teleoperation, the camera moves along with the operator’s head, streaming real-time, egocentric 3D observations to the VR device. The human operator can see what the robot sees. The researchers said this first-person active sensing brings benefits for both teleoperation and policy learning.
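To make the head-tracking step concrete, here is a minimal sketch, assuming a two-degree-of-freedom yaw/pitch neck with illustrative joint limits (none of these names come from the project), of how a headset orientation could be turned into clamped neck commands so the stereo camera follows the operator's gaze.

```python
# Assumed mapping from headset orientation to robot neck targets (not the
# project's API). Joint limits and the quaternion convention are illustrative.
import numpy as np

NECK_LIMITS = {"yaw": (-1.0, 1.0), "pitch": (-0.6, 0.6)}  # radians, assumed

def headset_to_neck_command(headset_quat: np.ndarray) -> dict:
    """Convert a headset orientation quaternion (w, x, y, z) into clamped
    yaw/pitch targets for the robot's neck actuators."""
    w, x, y, z = headset_quat
    # Standard quaternion-to-Euler extraction for yaw (about z) and pitch (about y).
    yaw = np.arctan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    pitch = np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0))
    return {
        "yaw": float(np.clip(yaw, *NECK_LIMITS["yaw"])),
        "pitch": float(np.clip(pitch, *NECK_LIMITS["pitch"])),
    }

# Usage: an operator looking slightly left and down.
print(headset_to_neck_command(np.array([0.98, 0.0, -0.1, 0.17])))
```

Clamping to the neck's joint limits keeps the camera motion physically feasible even when the operator turns their head farther than the robot can.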
For teleoperation, the system provides a more intuitive way for users to explore a broader view by moving the robot’s head, allowing them to attend to the regions that matter for detailed interactions. For imitation learning, the policy learns to actively move the robot’s head toward manipulation-relevant regions. Active sensing also reduces the number of pixels to process, enabling smooth, real-time, and precise closed-loop control.
In addition, the MIT and UC San Diego researchers highlighted the perceptual benefits of streaming stereoscopic video of the robot’s view to the operator’s eyes. This gives the operator a better spatial understanding, which is crucial for completing tasks, they said.
The team also showed how training with stereo image frames can improve the performance of the policy.
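One simple way to exploit stereo in a learned policy is to stack the left and right frames along the channel dimension before encoding them, so the network can pick up the same disparity cue that gives the operator depth perception. The sketch below shows that idea with a deliberately tiny network and an arbitrary action dimension; it is not the authors' architecture.

```python
# Illustrative stereo-input policy (assumed architecture and action size).
import torch
import torch.nn as nn

class StereoPolicy(nn.Module):
    def __init__(self, action_dim: int = 26):  # action_dim is a placeholder
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=5, stride=2), nn.ReLU(),  # 6 = two RGB frames
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        x = torch.cat([left, right], dim=1)  # stack stereo views channel-wise
        return self.head(self.encoder(x))

# Usage with dummy 224x224 stereo frames.
policy = StereoPolicy()
actions = policy(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```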
A key benefit of the system is that it enables an operator to remotely control robots via the Internet. One of the authors, MIT’s Ge Yang on the East Coast, was able to teleoperate the H1 robot at UC San Diego on the West Coast.