The Robot Report

Context is king: How Avride uses cloud VLMs as a safety net for delivery robots

By Roman Nefedov | July 4, 2026

Avride has integrated vision-language models into its delivery robots. Source: Avride

Avride Inc. has built its delivery robots for a high level of autonomy. Every day, hundreds of them navigate busy city streets on their own, processing complex sensor data locally on their onboard compute units. Our sidewalk robots run with minimal human involvement, reliably handling standard urban maneuvers, pedestrians, and traffic lights.

However, efficiently managing the mechanics of navigation – even in challenging conditions like narrow pathways or bad weather – is only one part of the equation. Ensuring a robot behaves appropriately in unusual, sensitive, or high-stakes real-world environments requires a different kind of intelligence.

To add a proactive layer of environmental awareness, we have integrated heavy, cloud-based vision-language models (VLMs) into our system as an automated “VLM-watcher.”

From object detection to holistic scene understanding

Avride’s onboard perception stack is already highly capable. Using a combination of onboard sensors and local neural networks, our delivery robots are designed to detect surrounding agents, including cyclists, children, wheelchairs, and emergency vehicles.

However, while our onboard models can identify these individual elements, certain real-world scenarios require a much deeper layer of contextual understanding.

Consider how a scenario unfolds on a city street. Encountering a police officer or a firefighter on the sidewalk might hint that something unusual is happening, but basic object detection isn’t enough to grasp the full picture.

For instance, distinguishing a police officer walking home after a shift from an active, sensitive crime scene is a highly non-trivial task. It requires a holistic understanding of how multiple elements interact within the frame – interpreting the scene as a whole scenario rather than a mere checklist of detected objects.

We want to significantly reduce the likelihood of our delivery robots accidentally entering an active emergency area, crossing a live crime scene, or rolling into unmapped roadwork where fresh, wet cement looks just like a standard grey sidewalk. While onboard models capture the primary entities needed to navigate, a heavy foundation model in the cloud excels at this holistic interpretation, instantly piecing together the deep semantic context of the entire situation.


How it works: VLMs as cloud guardians

It is important to clarify: we do not use VLMs to drive the robot. Using a heavy cloud model to steer in real time would introduce latency and connectivity dependencies that compromise safety. Instead, the VLM acts as an automated “early warning system” for our remote assistance team.

  • Data ingestion: While driving autonomously, the robot transmits a snapshot from its cameras to the cloud once every few seconds. To protect public privacy, all visual data is automatically anonymized right on the robot – with faces and license plates blurred locally – before it ever leaves the onboard compute.
  • Context evaluation: In the cloud, the VLM watcher processes the feed of snapshots, translating the visual data into a semantic description of what is happening on the street. We guide the model using a detailed prompt that defines exactly what types of unusual, sensitive, or complex situations to look for. The VLM evaluates each scene against these instructions and assigns it the matching high-stakes tags.
  • Human-in-the-loop: If the model flags a critical situational tag, it immediately alerts our remote assistance team. An assistant can then review the live feed to ensure the robot behaves appropriately, yields to emergency workers, or stays clear of restricted zones.
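
The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Avride's code: the tag names, the `Snapshot` and `Alert` types, and the `fake_tagger` stand-in for the cloud VLM call are all hypothetical, and the snapshot is assumed to have been anonymized on the robot before it reaches this function.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical tag names; the article does not publish the real tag set.
CRITICAL_TAGS: FrozenSet[str] = frozenset(
    {"emergency_scene", "crime_scene", "unmapped_roadwork"}
)


@dataclass(frozen=True)
class Snapshot:
    robot_id: str
    timestamp_s: float
    image: bytes  # assumed to be blurred/anonymized on the robot already


@dataclass(frozen=True)
class Alert:
    robot_id: str
    timestamp_s: float
    tags: FrozenSet[str]


# Any callable that maps a snapshot to scene tags (e.g. a wrapper around a VLM).
TagFn = Callable[[Snapshot], Iterable[str]]


def watch(snapshots: Iterable[Snapshot], tag_scene: TagFn) -> List[Alert]:
    """Tag each snapshot and raise an alert only when a critical tag appears."""
    alerts: List[Alert] = []
    for snap in snapshots:
        critical = frozenset(tag_scene(snap)) & CRITICAL_TAGS
        if critical:
            # In a real system this would notify the remote assistance team.
            alerts.append(Alert(snap.robot_id, snap.timestamp_s, critical))
    return alerts


def fake_tagger(snap: Snapshot) -> List[str]:
    """Stand-in for the cloud VLM: keys off a marker in the fake image bytes."""
    if b"gurney" in snap.image:
        return ["emergency_scene", "pedestrians"]
    return ["pedestrians"]
```

Note that `watch` only produces alerts for a human to review; nothing in it feeds back into steering, which mirrors the separation described above.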

Because the AI landscape evolves at a breakneck pace, we don’t tie our infrastructure to a single provider. We treat this cloud layer as an open, plug-and-play architecture – continuously experimenting, testing, and benchmarking the latest state-of-the-art models to ensure we are always using the most accurate semantic interpreter available.
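
One common way to keep such a layer provider-agnostic is to hide every model behind the same small interface and score the candidates on a shared labeled set. The sketch below assumes exactly that; the `SceneInterpreter` protocol, the exact-match scoring rule, and both stub models are hypothetical and say nothing about which providers or metrics Avride actually uses.

```python
from typing import Dict, Protocol, Sequence, Set, Tuple


class SceneInterpreter(Protocol):
    """Hypothetical interface that every candidate model wrapper implements."""

    def tag(self, image: bytes) -> Set[str]: ...


LabeledSet = Sequence[Tuple[bytes, Set[str]]]


def accuracy(model: SceneInterpreter, labeled: LabeledSet) -> float:
    """Fraction of images whose predicted tag set exactly matches the label."""
    if not labeled:
        return 0.0
    hits = sum(1 for image, expected in labeled if model.tag(image) == expected)
    return hits / len(labeled)


def pick_best(candidates: Dict[str, SceneInterpreter], labeled: LabeledSet) -> str:
    """Return the name of the candidate with the highest accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], labeled))


class AlwaysClear:
    """Stub model that never flags anything."""

    def tag(self, image: bytes) -> Set[str]:
        return set()


class KeywordStub:
    """Stub model that flags a scene when the fake image bytes mention a gurney."""

    def tag(self, image: bytes) -> Set[str]:
        return {"emergency_scene"} if b"gurney" in image else set()
```

Because callers depend only on `tag`, swapping in a newer model means adding one wrapper class and rerunning the benchmark, with no change to the alerting code.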

A view from the robot’s cameras shows autonomy with an extra safety layer: The robot autonomously yields to first responders moving a gurney. Simultaneously, the cloud VLM watcher flags the unusual context, bringing a remote assistant in to monitor the scene. Source: Avride

The evolution from data mining to live operations

The integration of live VLMs into Avride’s daily operations is a natural evolution of our internal engineering tools.

Storing and processing every single minute of video from hundreds of robots operating every day is incredibly expensive and unnecessary. We don’t want to save everything; we only want to preserve data that genuinely helps us improve our technology and maintain safety.

Historically, we used this exact 5-second live-stream analysis pipeline as a data-filtering tool. Cloud VLMs monitored the incoming streams in real time to automatically mine for rare, valuable scenarios – like specific animal interactions or complex infrastructure – that we could securely save as pre-anonymized data for further labeling and training.

As the pipeline proved to be exceptionally accurate at spotting unique real-world context live, it became a logical next step to extend this tool into live operations. If the system was already capable of identifying unique contexts in real time, it could just as effectively be used to trigger live human oversight.

We integrated this data-mining infrastructure directly into our production pipeline, creating a seamless bridge between cutting-edge AI and human assistance.

The road ahead: Bringing VLMs to the edge

Operating these heavy models in the cloud is an incredibly effective solution for today, but it is just the beginning. As VLMs become more compact through optimization techniques, and as next-generation onboard robotics hardware grows more powerful, our ultimate goal is clear.

Eventually, this deep semantic layer will migrate from the cloud directly onto the robot’s onboard compute. This will allow our robots to achieve an even deeper level of autonomous decision-making entirely on the edge, completely independent of network connectivity.

Until then, our cloud-to-remote-assistance safety net ensures that Avride delivery robots remain polite, responsible, and aware citizens on the sidewalk.

About the author

Roman Nefedov is the head of autonomous delivery at Avride, where he holds end-to-end responsibility for the autonomous delivery product, overseeing both overall business operations and software development. Nefedov previously led the company’s delivery robot engineering division, building on over a decade and a half of expertise in the technology sector.

Throughout his career, he has focused on leading large-scale engineering teams and driving the development of smart devices and consumer IoT products.

Copyright © 2026 WTWH Media LLC. All Rights Reserved. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media