NVIDIA's RVT can learn new tasks after just 10 demos

Listen to this article

NVIDIA Robotics Research has announced new work that combines text prompts, video input, and simulation to more efficiently teach robots how to perform manipulation tasks, like opening drawers, dispensing soap, or stacking blocks, in real life.

Generally, methods of 3D object manipulation perform better when they build an explicit 3D representation rather than only relying on camera images. NVIDIA wanted to find a method of doing that came with less computing costs and was easier to scale than explicit 3D representations like voxels. To do so, the company used a type of neural network called a multi-view transformer to create virtual views from the camera input.

The team’s multi-view transformer, Robotic View Transformer (RVT), is both scalable and accurate. RVT takes camera images and task language descriptions as inputs and predicts the gripper pose action. In simulations, NVIDIA’s research team found that just one RVT model can work well across 18 RLBench tasks with 249 task variations.

The model can perform a variety of manipulation tasks in the real world with around 10 demonstrations per task. The team trained a single RVT model from real-world data and an RVT model from RLBench simulation data. In both settings, the single-trained RVT model was used to evaluate the performance on all tasks.

The Team found that RVT had a 26% higher relative success rate than existing state-of-the-art models. RVT isn’t just more successful than other models, it can also learn faster than traditional models. NVIDIA’s model trains 36 times faster than PerAct, an end-to-end behavior-cloning agent that can learn a single-conditioned policy for 18 RLBench tasks with 249 unique variations, and achieves 2.3 times the inference speed of PerAct.

While RVT was able to outperform similar models, it does come with some limitations that NVIDIA would like to look into further. For example, the team explored various view options for RVT and landed on an option that worked well across tasks, but in the future, the team would like to better optimize view specification using learned data.

RVT, and explicit voxel-based methods, also require extrinsics to be calibrated from the camera to the robot base, and in the future, the team would like to explore extensions that remove this constraint.