A team of researchers at the University of Washington has developed robotic, shape-changing smart speakers that can deploy themselves to divide rooms into speech zones and track the positions of individual speakers.
Using deep learning algorithms, the system lets users mute certain areas or separate simultaneous conversations, even if two people near each other have similar voices. The robots, each about an inch in diameter, can deploy from and return to a charging station on their own, much like a Roomba.
Unlike previous research on robot swarms, which has required overhead or on-device cameras, projectors, or special surfaces, the UW team’s system can accurately distribute a robot swarm using only sound.
The team’s prototype is made up of seven small robots that spread themselves across tables of various sizes. As they move from their charger, each robot emits a high-frequency sound. The robots use this frequency and other sensors to avoid obstacles and move without falling off the table.
The swarm’s automatic deployment capabilities allow the robots to place themselves with maximum accuracy, permitting greater sound control than if a person had placed them by hand. The robots disperse themselves as far from each other as possible, since greater distances make it easier to differentiate and locate the people speaking.
“If I have one microphone a foot away from me, and another microphone two feet away, my voice will arrive at the microphone that’s a foot away first. If someone else is closer to the microphone that’s two feet away, their voice will arrive there first,” co-lead author Tuochao Chen, a UW doctoral student in the Allen School, said. “We developed neural networks that use these time-delayed signals to separate what each person is saying and track their positions in a space. So you can have four people having two conversations and isolate any of the four voices and locate each of the voices in a room.”
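Chen's example can be worked through numerically. The sketch below is not the team's neural network; it only computes how much later a voice reaches the farther of two microphones. It assumes a speed of sound of 343 meters per second (dry air at roughly 20 °C), and the function names are made up for illustration.

```python
# Illustrative arithmetic only; not the UW team's method.
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C
FEET_TO_METERS = 0.3048     # exact conversion factor


def arrival_time_s(distance_ft: float) -> float:
    """Time for sound to travel distance_ft feet, in seconds."""
    return distance_ft * FEET_TO_METERS / SPEED_OF_SOUND_M_S


def arrival_delay_s(near_ft: float, far_ft: float) -> float:
    """How much later the voice reaches the far microphone than the near one."""
    return arrival_time_s(far_ft) - arrival_time_s(near_ft)


if __name__ == "__main__":
    # Microphones one foot and two feet from the talker, as in the quote.
    delay = arrival_delay_s(1.0, 2.0)
    print(f"Delay between microphones: {delay * 1e3:.3f} ms")
```

Under these assumptions the gap is a little under a millisecond, which is the kind of small time offset the quote describes the networks using.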
Testing the swarms
The UW team tested the robots in offices, living rooms, and kitchens with groups of three to five people speaking. Across all of these environments, the system was able to discern different voices within 1.6 feet (50 centimeters) of each other 90% of the time, without prior information about the number of speakers.
The system was able to process three seconds of audio in 1.82 seconds on average, which is fast enough for live streaming, but still too slow for real-time communications like video calls.
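The arithmetic behind that distinction can be sketched as follows. The two timing figures come from the test results above. The assumption that audio is handled in whole three-second chunks, each processed only after it has been fully recorded, is ours for illustration and is not confirmed by the team.

```python
# Illustrative arithmetic; the chunked-processing model is an assumption.
AUDIO_CHUNK_S = 3.0   # length of audio processed at once (from the article)
PROCESSING_S = 1.82   # average time to process that audio (from the article)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time per second of audio; below 1.0 keeps pace with a stream."""
    return processing_s / audio_s


def worst_case_latency_s(processing_s: float, audio_s: float) -> float:
    """Delay for the first sound in a chunk, assuming the whole chunk
    must be recorded before processing starts (hypothetical model)."""
    return audio_s + processing_s


if __name__ == "__main__":
    rtf = real_time_factor(PROCESSING_S, AUDIO_CHUNK_S)
    latency = worst_case_latency_s(PROCESSING_S, AUDIO_CHUNK_S)
    print(f"Real-time factor: {rtf:.2f}")
    print(f"Worst-case delay under the chunked assumption: {latency:.2f} s")
```

A factor below 1.0 means the system never falls behind a continuous stream, but under this model a listener could still hear speech several seconds late, which is tolerable for a live stream and disruptive in a two-way call.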
As this technology continues to progress, the team says that acoustic swarms could be deployed in smart homes to better differentiate people talking with smart speakers. For example, only people sitting on a couch, in an “active zone,” could be allowed to control a TV by voice.
The team plans to eventually make microphone robots that can move around rooms, instead of just being limited to tables. They’re also investigating whether the speakers can emit sounds that would allow for real-world mute and active zones, so people in different parts of a room can hear different audio.