A research project led by USC computer science student Sumedh A. Sontakke wants to open the door for robots to be caregivers for aging populations. The team claims the RoboCLIP algorithm, developed with help from Professor Erdem Biyik and Professor Laurent Itti, allows robots to perform new tasks after just one demonstration.
The team claimed that RoboCLIP needs only one video or textual demonstration of a task to perform it two to three times better than other imitation learning (IL) models.
“To me, the most impressive thing about RoboCLIP is being able to make our robots do something based on only one video demonstration or one language description,” said Biyik, a roboticist who joined USC Viterbi’s Thomas Lord Department of Computer Science in August 2023 and leads the Learning and Interactive Robot Autonomy Lab (Lira Lab).
The project started two years ago when Sontakke realized how much data is needed to have robots perform basic household tasks.
“I started thinking about household tasks like opening doors and cabinets,” Sontakke said. “I didn’t like how much data I needed to collect before I could get the robot to successfully do the task I cared about. I wanted to avoid that, and that’s where this project came from.”
How does RoboCLIP work?
Most IL models learn how to complete tasks by trial and error. The robot attempts the task over and over and receives a reward only when it finally completes it. While this can be effective, it takes massive amounts of time, data, and human supervision before the robot can successfully perform a new task.
“The large amount of data currently required to get a robot to successfully do the task you want it to do is not feasible in the real world, where you want robots that can learn quickly with few demonstrations,” Sontakke said in a release.
RoboCLIP differs from typical IL models in that it builds on the latest advances in generative AI and video-language models (VLMs). These systems are pre-trained on large amounts of video and textual demonstrations, according to Biyik.
The researchers claimed RoboCLIP works well out of the box on household tasks, such as opening and closing drawers or cabinets.
“The key innovation here is using the VLM to critically ‘observe’ simulations of the virtual robot babbling around while trying to perform the task, until at some point it starts getting it right – at that point, the VLM will recognize that progress and reward the virtual robot to keep trying in this direction,” Itti said.
According to Itti, the VLM can tell the robot is getting closer to success when the textual description it generates while observing the robot moves closer to what the user wants.
“This new kind of closed-loop interaction is very exciting to me and will likely have many more future applications in other domains,” Itti said.
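The reward idea Itti describes can be illustrated with a minimal sketch. This is not the team's implementation. It assumes that both the user's request and the robot's attempt have already been turned into numeric embedding vectors by some pretrained video-language model, and it scores the attempt by cosine similarity. The function names and the three-number vectors below are hypothetical, chosen only to show the comparison.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def episode_reward(attempt_embedding: Sequence[float],
                   goal_embedding: Sequence[float]) -> float:
    """Hypothetical reward: how closely the attempt matches the goal."""
    return cosine_similarity(attempt_embedding, goal_embedding)


# Made-up embeddings; a real system would obtain these from a pretrained model.
goal = [1.0, 0.0, 0.0]           # stands in for the demonstration or description
early_attempt = [0.0, 1.0, 0.0]  # robot "babbling", unrelated to the goal
later_attempt = [0.8, 0.6, 0.0]  # robot starting to get the task right

print("early reward:", episode_reward(early_attempt, goal))
print("later reward:", episode_reward(later_attempt, goal))
```

In this toy setup the later attempt earns a higher reward than the early one, which is the signal that would encourage the virtual robot to keep trying in that direction.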
Sontakke hopes that the program could someday help robots care for aging populations, or lead to other applications that could help anyone. The team says that future research will be necessary before the system is ready to take on the real world.
The paper, titled “RoboCLIP: One Demonstration is Enough to Learn Robot Policies,” was presented by Sontakke at the 37th Conference on Neural Information Processing Systems (NeurIPS), Dec. 10-16 in New Orleans.
Collaborating with Sontakke, Biyik and Itti on the RoboCLIP paper were two USC Viterbi graduates, Sebastien M.R. Arnold, now at Google Research, and Karl Pertsch, now at UC Berkeley and Stanford University. Jesse Zhang, a fourth-year Ph.D. candidate in computer science at USC Viterbi, also worked on the RoboCLIP project.