Sensing technologies have improved steadily, but the ability of robots to make decisions in real time based on what they perceive still has a long way to go to equal or surpass human capabilities. Researchers from Microsoft Corp., Carnegie Mellon University, and Oregon State University have been collaborating to improve perception-action loops.
As members of Team Explorer, they are participating in the Defense Advanced Research Projects Agency’s Subterranean (DARPA SubT) Challenge. The competition is designed to develop technologies that could aid first responders in hazardous environments. Team Explorer won first place in the Tunnel Circuit in September 2019 and second place in the February 2020 Urban Circuit.
In a blog post, the research team explained how it has created machine learning systems to enable robots or drones to make decisions based on camera data. It includes Rogerio Bonatti, a Ph.D. student at Carnegie Mellon University (CMU), and Sebastian Scherer, an associate research professor at CMU. The team also includes Ratnesh Madaan, a research software development engineer for Business AI; Vibhav Vineet, a senior researcher; and Ashish Kapoor, a partner research manager, all at Microsoft.
“The [perception-action loop] system is trained via simulations and learns to independently navigate challenging environments and conditions in [the] real world, including unseen situations,” the researchers wrote. “We wanted to push current technology to get closer to a human’s ability to interpret environmental cues, adapt to difficult conditions, and operate autonomously.”
Building a drone racing model
“In first-person view (FPV) drone racing, expert pilots can plan and control a quadrotor with high agility using a noisy monocular camera feed, without compromising safety,” said the researchers. “We attempted to mimic this ability with our framework, and tested it with an autonomous drone on a racing task.”
The team trained a neural network on data from an RGB camera, mapping visual information to control actions for the drone. It broke the task into two modules: a perception module that compresses each incoming camera frame into a low-dimensional representation, and a control policy that turns that representation into commands for the quadrotor.
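At run time, the two modules compose into a single loop: each camera frame is encoded into a compact latent vector, and the control policy converts that vector into a velocity command. The sketch below shows one way such a loop could look against AirSim’s Python API; the two networks are untrained placeholders standing in for the team’s trained models, and the frame preprocessing, latent size, and command rate are illustrative assumptions rather than details from the project.

```python
# Illustrative perception-action loop against AirSim's Python API. The two networks are
# untrained placeholders for the perception encoder and control policy described in the
# article; latent size, preprocessing, and command rate are arbitrary choices.
import airsim
import numpy as np
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(10))                 # frame -> 10-D latent
policy = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 4))   # latent -> (vx, vy, vz, yaw_rate)

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

for _ in range(400):  # roughly 20 seconds of flight at a ~20 Hz command rate
    # Perception input: one uncompressed RGB frame from the front-facing camera.
    response = client.simGetImages(
        [airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)])[0]
    frame = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    frame = frame.reshape(response.height, response.width, -1)[:, :, :3].copy()
    image = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0

    # Perception module compresses the frame; control policy maps the latent to a command.
    with torch.no_grad():
        vx, vy, vz, yaw_rate = policy(encoder(image)).squeeze(0).tolist()

    client.moveByVelocityAsync(
        vx, vy, vz, duration=0.05,
        drivetrain=airsim.DrivetrainType.MaxDegreeOfFreedom,
        yaw_mode=airsim.YawMode(is_rate=True, yaw_or_rate=yaw_rate)).join()
```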
The models had to account for discrepancies between the simulation and the real world, such as differences in lighting. The researchers used Microsoft’s AirSim simulator and a cross-modal variational autoencoder (CM-VAE) framework, combining raw unlabeled camera images with labeled examples of the relative poses of racing gates in the drone’s coordinate frame.
“The system naturally incorporated both labeled and unlabeled data modalities into the training process of the latent variable,” they said. “Imitation learning was then used to train a deep control policy that mapped latent variables into velocity commands for the quadrotor.”
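A minimal sketch can make the cross-modal idea concrete: one encoder maps each frame to a small latent vector, and two decoders read that latent back out, one reconstructing the image (usable with unlabeled frames) and one regressing the relative gate pose (usable when labels exist). The fully connected networks and dimensions below are illustrative assumptions kept small for brevity; they are not the architecture the team published.

```python
# Minimal cross-modal VAE sketch: a shared latent space trained with an image-reconstruction
# decoder (unlabeled data) and a gate-pose decoder (labeled data). All sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM = 64 * 64 * 3   # flattened 64x64 RGB frame (illustrative)
LATENT = 10             # low-dimensional representation later consumed by the control policy

class CMVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, LATENT)
        self.to_logvar = nn.Linear(256, LATENT)
        # Decoder 1: reconstruct the input frame (works for unlabeled images).
        self.image_decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                           nn.Linear(256, IMG_DIM), nn.Sigmoid())
        # Decoder 2: regress the next gate's relative pose (works when labels are available).
        self.gate_decoder = nn.Linear(LATENT, 4)   # e.g. distance plus three angles

    def encode(self, frames):
        h = self.encoder(frames)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return z, mu, logvar

def cmvae_loss(model, frames, gate_poses=None):
    """KL and image-reconstruction terms always apply; the pose term only when labels exist."""
    z, mu, logvar = model.encode(frames)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = kl + F.mse_loss(model.image_decoder(z), frames)
    if gate_poses is not None:
        loss = loss + F.mse_loss(model.gate_decoder(z), gate_poses)
    return loss
```

Batches of simulator frames that carry gate-pose labels and batches of frames without labels can both pass through a loss of this form, with or without the pose argument, which is how both modalities end up shaping the same latent space.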
By abstracting video frames to a lower-dimensional representation, the team was able to train a deep control policy with imitation learning while still providing enough information for the drone to navigate through obstacles.
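In a behavior-cloning setup like the one described, the trained encoder is frozen and a small policy network learns to map its latent vectors to the velocity commands an expert produced at the same moments. The sketch below uses random tensors as stand-in demonstration data; the variable names and dimensions are hypothetical, not taken from the team’s dataset.

```python
# Behavior-cloning sketch: train a small policy to map CM-VAE latent vectors to the expert's
# velocity commands. The demonstration tensors here are random placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 10
policy = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(),
                       nn.Linear(64, 4))          # outputs (vx, vy, vz, yaw_rate)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder demonstrations: latent codes from the frozen perception encoder, paired with
# the expert's velocity command at the same timestep.
expert_latents = torch.randn(1024, LATENT)
expert_commands = torch.randn(1024, 4)

for epoch in range(20):
    for i in range(0, len(expert_latents), 64):
        z = expert_latents[i:i + 64]
        target = expert_commands[i:i + 64]
        loss = F.mse_loss(policy(z), target)      # imitation loss: match the expert's commands
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the policy consumes only the low-dimensional latent rather than raw pixels, it is insulated from appearance details such as lighting or background, which is the property the researchers credit for the transfer from simulation to the physical drone.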
Testing the perception-action loop
The researchers tested their perception-action loop system on a drone racing track with different course layouts. While they reported that the “performance of standard architectures dropped significantly,” their CM-VAE was able to approximate gate distances despite being trained purely on simulated data.
The control framework even worked indoors, with stripes painted on the floor matching the gate color, and in snow. “Despite the intense visual distractions from background conditions, the drone was still able to complete the courses by employing our cross-modal perception module,” the team wrote.
“By separating the perception-action loop into two modules and incorporating multiple data modalities into the perception training phase, we can avoid overfitting our networks to non-relevant characteristics of the incoming data,” the team added. The combination of abstracted sensor data and simulation-trained models could lead to better real-world performance.
However, the researchers found that “an unexpected result we came across during our experiments is that combining unlabeled real-world data with the labeled simulated data for training the representation models did not increase overall performance. Using simulation-only data worked better. We suspect that this drop in performance occurs because only simulated data was used in the control learning phase with imitation learning.”
Using unlabeled data
A recent trend in artificial intelligence and robotics development is to reduce the amount of labeled data needed to train autonomous systems. Microsoft’s work with Team Explorer is an example of how separating perception from control policies can lead to more robust perception-action loops.
The researchers at Microsoft, CMU, and Oregon State concluded that combining multiple data streams in the CM-VAE led to better generalization and recognition of objects, but more work remains to be done on using adversarial techniques to bring simulated data closer to real images.
The use of unlabeled data and simulation could have multiple applications for autonomous systems, the team noted. These include detecting people’s faces for search-and-rescue operations, drone inspections, and robotic piece picking.