Context is king: How Avride uses cloud VLMs as a safety net for delivery robots

Avride has integrated vision-language models into its delivery robots. Source: Avride

Avride Inc. has built its delivery robots for a high level of autonomy. Every day, hundreds of them navigate busy city streets, processing complex sensor data locally on their onboard compute units. Our sidewalk robots run with minimal human involvement, reliably handling standard urban maneuvers, pedestrians, and traffic lights on their own.

However, efficiently managing the mechanics of navigation – even in challenging conditions like narrow pathways or bad weather – is only one part of the equation. Ensuring a robot behaves appropriately in unusual, sensitive, or high-stakes real-world environments requires a different kind of intelligence.

To add a proactive layer of environmental awareness, we have integrated heavy, cloud-based vision-language models (VLMs) into our system as an automated “VLM watcher.”

From object detection to holistic scene understanding

Avride’s onboard perception stack is already highly capable. Using a combination of onboard sensors and local neural networks, our delivery robots are designed to detect surrounding agents, including cyclists, children, wheelchairs, and emergency vehicles.

However, while our onboard models can identify these individual elements, certain real-world scenarios require a much deeper layer of contextual understanding.

Consider how a scenario unfolds on a city street. Encountering a police officer or a firefighter on the sidewalk might hint that something unusual is happening, but basic object detection isn’t enough to grasp the full picture.

For instance, distinguishing a police officer walking home after a shift from an active, sensitive crime scene is a highly non-trivial task. It requires a holistic understanding of how multiple elements interact within the frame – interpreting the scene as a whole scenario rather than a mere checklist of detected objects.

We want to significantly reduce the likelihood of our delivery robots accidentally entering an active emergency area, crossing a live crime scene, or rolling into unmapped roadwork where fresh, wet cement looks just like a standard grey sidewalk. While onboard models capture the primary entities needed to navigate, a heavy foundation model in the cloud excels at this holistic interpretation, instantly piecing together the deep semantic context of the entire situation.

How it works: VLMs as cloud guardians

It is important to clarify: we do not use VLMs to drive the robot. Using a heavy cloud model to steer in real time would introduce latency and connectivity dependencies that compromise safety. Instead, the VLM acts as an automated “early warning system” for our remote assistance team.

Data ingestion: While driving autonomously, the robot transmits a snapshot from its cameras to the cloud once every few seconds. To protect public privacy, all visual data is automatically anonymized right on the robot – with faces and license plates blurred locally – before it ever leaves the onboard compute.
Context evaluation: In the cloud, the VLM watcher processes the feed of snapshots, translating the visual data into a semantic description of what is happening on the street. We guide the model using a detailed prompt that defines exactly what types of unusual, sensitive, or complex situations to look for. The VLM evaluates each scene against these instructions and assigns it the matching high-stakes tags.
Human-in-the-loop: If the model assigns a critical tag, the system immediately alerts our remote assistance team. An assistant can then review the live feed to make sure the robot behaves appropriately, yields to emergency workers, and stays clear of restricted zones.
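
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Avride's implementation: the tag names, the Snapshot and Alert types, and the tagger and notify callables are all hypothetical stand-ins for the real VLM call and the remote-assistance alerting channel.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical tag names; the article does not list the real ones.
CRITICAL_TAGS = frozenset({"emergency_scene", "crime_scene", "unmapped_roadwork"})


@dataclass
class Snapshot:
    robot_id: str
    timestamp_s: float
    image: bytes             # camera frame, blurred on the robot before upload
    anonymized: bool = True  # set by the (hypothetical) onboard blurring step


@dataclass
class Alert:
    robot_id: str
    timestamp_s: float
    tags: List[str]


def evaluate_snapshot(
    snapshot: Snapshot,
    tagger: Callable[[bytes], Iterable[str]],
    notify: Callable[[Alert], None],
) -> Optional[Alert]:
    """Tag one snapshot and alert remote assistance if a critical tag appears."""
    if not snapshot.anonymized:
        # Refuse frames that skipped onboard anonymization.
        raise ValueError("snapshot must be anonymized before cloud processing")
    critical = sorted(set(tagger(snapshot.image)) & CRITICAL_TAGS)
    if not critical:
        return None  # nothing unusual: the robot keeps driving on its own
    alert = Alert(snapshot.robot_id, snapshot.timestamp_s, critical)
    notify(alert)
    return alert
```

Note that nothing in this sketch returns a driving command. The only output is an alert for a human, which mirrors the point that the VLM watches the scene but never steers the robot.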

Because the AI landscape evolves at a breakneck pace, we don’t tie our infrastructure to a single provider. We treat this cloud layer as an open, plug-and-play architecture – continuously experimenting, testing, and benchmarking the latest state-of-the-art models to ensure we are always using the most accurate semantic interpreter available.
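
One way to picture a provider-agnostic layer like this is a small shared interface plus a benchmark over labeled scenes. The sketch below is an assumption for illustration only: the SceneTagger protocol, the exact-match metric, and the function names are hypothetical, and the article does not describe how Avride actually scores candidate models.

```python
from typing import Dict, Protocol, Sequence, Set, Tuple


class SceneTagger(Protocol):
    """Hypothetical interface that every candidate VLM backend would implement."""

    def tag(self, image: bytes) -> Set[str]:
        ...


LabeledScene = Tuple[bytes, Set[str]]  # (anonymized image, expected tags)


def exact_match_rate(tagger: SceneTagger, labeled: Sequence[LabeledScene]) -> float:
    """Fraction of scenes where the predicted tag set equals the expected one."""
    if not labeled:
        raise ValueError("need at least one labeled scene")
    hits = sum(1 for image, expected in labeled if tagger.tag(image) == expected)
    return hits / len(labeled)


def pick_best(candidates: Dict[str, SceneTagger], labeled: Sequence[LabeledScene]) -> str:
    """Return the name of the candidate with the highest exact-match rate."""
    return max(candidates, key=lambda name: exact_match_rate(candidates[name], labeled))
```

Because callers depend only on the tag method, swapping one provider for another would mean writing a new adapter class rather than touching the alerting code. A production benchmark would likely weigh missed critical tags more heavily than false alarms; exact match is used here only to keep the example short.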

A view from the robot’s cameras shows autonomy with an extra safety layer: The robot autonomously yields to first responders moving a gurney. Simultaneously, the cloud VLM watcher flags the unusual context, bringing a remote assistant in to monitor the scene. Source: Avride

The evolution from data mining to live operations

The integration of live VLMs into Avride’s daily operations is a natural evolution of our internal engineering tools.

Storing and processing every single minute of video from hundreds of robots operating every day is incredibly expensive and unnecessary. We don’t want to save everything; we only want to preserve data that genuinely helps us improve our technology and maintain safety.

Historically, we used this exact 5-second live-stream analysis pipeline as a data-filtering tool. Cloud VLMs monitored the incoming streams in real time to automatically mine for rare, valuable scenarios – like specific animal interactions or complex infrastructure – that we could securely save as pre-anonymized data for further labeling and training.

As the pipeline proved to be exceptionally accurate at spotting unique real-world context live, it became a logical next step to extend this tool into live operations. If the system was already capable of identifying unique contexts in real time, it could just as effectively be used to trigger live human oversight.

We integrated this data-mining infrastructure directly into our production pipeline, creating a seamless bridge between cutting-edge AI and human assistance.
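
The reuse described here amounts to sending one set of VLM tags to two consumers. A minimal sketch, assuming hypothetical tag names and a hypothetical route function (neither comes from Avride), might look like this:

```python
from dataclasses import dataclass
from typing import Set

# Hypothetical tag sets; the article names only example scenario types.
MINING_TAGS = frozenset({"animal_interaction", "complex_infrastructure"})
OPS_TAGS = frozenset({"emergency_scene", "crime_scene", "unmapped_roadwork"})


@dataclass
class Routing:
    save_for_training: bool        # original use: keep rare scenes for labeling
    alert_remote_assistance: bool  # newer use: trigger live human oversight


def route(tags: Set[str]) -> Routing:
    """Decide what to do with one tagged snapshot."""
    return Routing(
        save_for_training=bool(tags & MINING_TAGS),
        alert_remote_assistance=bool(tags & OPS_TAGS),
    )
```

A snapshot that matches neither set is neither stored nor escalated, which fits the goal of not saving everything.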

The road ahead: Bringing VLMs to the edge

Operating these heavy models in the cloud is an incredibly effective solution for today, but it is just the beginning. As VLMs become more compact through optimization techniques, and as next-generation onboard robotics hardware grows more powerful, our ultimate goal is clear.

Eventually, this deep semantic layer will migrate from the cloud directly onto the robot’s onboard compute. This will allow our robots to achieve an even deeper level of autonomous decision-making entirely on the edge, completely independent of network connectivity.

Until then, our cloud-to-remote-assistance safety net ensures that Avride delivery robots remain polite, responsible, and aware citizens on the sidewalk.

About the author

Roman Nefedov is the head of autonomous delivery at Avride, where he holds end-to-end responsibility for the autonomous delivery product, overseeing both overall business operations and software development. Nefedov previously led the company’s delivery robot engineering division, building on over a decade and a half of expertise in the technology sector.

Throughout his career, he has focused on leading large-scale engineering teams and driving the development of smart devices and consumer IoT products.