A man in a lab coat operates a control panel near a robotic arm and laptop displaying a 3D model.
Fine-Tuning Vision Language Models (VLM) for Agile Robotics

By: Giorgio Manganini, Giulia Vilone, Mark Langtry, Prashanth Viswanath, Jim Gibbons, Paul Heraty

Sep 24, 2025

How ADI is reducing data dependency to accelerate model development

We all expected robots to be everywhere around us by now. Yet the cost of teaching robots how to perform basic tasks is often higher than the price of the hardware itself. In fact, the total cost of ownership (TCO) for a robot is often 50% to 100% higher than its sticker price once you factor in integration, operator training, maintenance, cybersecurity, and liability insurance.1

Want to teach your robot a new skill or improve how it works? You must invest in a new configuration. This limitation has slowed robot adoption, especially in industries with a high product mix and low volumes. That’s why you frequently see robots making car bodies rather than custom interiors.

Robots that understand the world like humans require more than just sensors and code; they require the ability to interpret and reason across different modalities. ADI is exploring how vision language models (VLMs) can bridge that gap. The goal is to enable robots to comprehend their environments more effectively. Ideally, we can also help them act intelligently, even with limited training data.

Meeting the Data Challenge in Robotics

The first step in any artificial intelligence (AI) training involves data. Computer vision tasks in robotics often suffer from a lack of sample training data. Collecting and labeling new large datasets is slow and expensive.

VLMs, pre-trained on vast internet-scale datasets, offer a compelling solution. They provide strong zero-shot performance on previously unseen tasks and the ability to process complex, multimodal inputs.
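
To make the zero-shot idea concrete, here is a minimal sketch of asking an off-the-shelf VLM about an image with no task-specific training, using an open CLIP checkpoint through Hugging Face transformers. The model name, image file, and candidate captions are illustrative assumptions, not the models or data ADI used.

    # Minimal zero-shot check of an object state with an open CLIP model.
    # Model, image, and captions are illustrative, not ADI's configuration.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("workcell_frame.jpg")   # a camera frame from the robot cell
    captions = ["the gripper is empty", "the gripper is holding a part"]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # similarity of the image to each caption
    probs = logits.softmax(dim=-1)[0]
    print(dict(zip(captions, probs.tolist())))

Because the candidate states are just text, covering a new object or condition means editing a string rather than collecting and labeling a new dataset.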

In its recent work, ADI’s Limerick, Ireland-based team integrated VLMs into a robotic framework to:

  • Recognize object states
  • Interpret spatial environments
  • Validate internal states before task execution

This approach enables robots to reason contextually using visual and textual cues, improving their flexibility and adaptability across tasks. It also reduces reliance on rigid, pre-programmed logic, making the systems more resilient to environmental variability.
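
As an illustration of this kind of contextual check, the short sketch below asks a visual question answering model to validate a precondition before a task runs. The BLIP VQA checkpoint, image path, and question wording are assumptions for the example, not components of ADI’s framework.

    # Illustrative VQA-style precondition check before executing a task.
    # Model choice, image, and question wording are assumptions.
    from PIL import Image
    from transformers import BlipProcessor, BlipForQuestionAnswering

    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

    image = Image.open("bin_view.jpg")
    question = "Is the bin in front of the robot empty?"

    inputs = processor(image, question, return_tensors="pt")
    output_ids = model.generate(**inputs)
    answer = processor.decode(output_ids[0], skip_special_tokens=True)
    print(answer)   # e.g. "no" -> skip the pick and alert the operator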

Fine-Tuning Without the Overhead

Another challenge with large models is adapting them to specific applications. To tailor general-purpose VLMs to the target robotic applications, the team used parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA). This method freezes the base model and trains only a small number of additional parameters. The result? Accuracy comparable to full fine-tuning, with drastically reduced computational and storage costs.
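
For readers unfamiliar with LoRA, the sketch below shows how adapters are typically attached to a pre-trained model with the Hugging Face peft library. The base checkpoint, target modules, and rank are placeholder assumptions; the article does not disclose ADI’s exact configuration.

    # Sketch of parameter-efficient fine-tuning with LoRA via the peft library.
    # Base model, target modules, and hyperparameters are placeholders.
    from transformers import AutoModelForVision2Seq
    from peft import LoraConfig, get_peft_model

    base = AutoModelForVision2Seq.from_pretrained("Salesforce/blip2-opt-2.7b")

    lora_cfg = LoraConfig(
        r=8,                                   # rank of the low-rank update matrices
        lora_alpha=16,                         # scaling applied to the adapter update
        target_modules=["q_proj", "v_proj"],   # attention projections to adapt
        lora_dropout=0.05,
    )

    model = get_peft_model(base, lora_cfg)     # base weights stay frozen
    model.print_trainable_parameters()         # typically well under 1% of all parameters

Training then proceeds with a standard loop or Trainer, but only the small adapter matrices receive gradient updates, which is what keeps the compute and storage costs low.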

The team trained on 182 images and tested on 102, validating PEFT’s effectiveness in low-data settings. This efficiency makes VLMs more accessible to robotics teams with limited computational resources or data.

PEFT also supports faster experimentation. By minimizing the tuning footprint, the team was able to iterate quickly on different model architectures and task configurations, identifying optimal solutions without incurring significant training costs.

Key Insights from Real-World Testing

While the potential of VLMs in robotics is clear, real-world implementation surfaced several challenges:

  • Spatial reasoning: Models struggled with understanding object relationships, causality, and physical interactions in cluttered or dynamic scenes.
  • Prompt sensitivity: As with many language models, outputs varied significantly with subtle changes in phrasing, requiring careful prompt engineering.
  • Integration complexity: Customizing the models required a deep understanding of their architecture, from adding task-specific cues to managing multimodal alignment.

To address these issues, the team employed richer textual inputs and prompt engineering techniques that guided the model toward more reliable outputs. For example, rewording a question or adding background information improved both object detection and state classification accuracy.
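
The snippet below illustrates the flavor of that prompt engineering: a terse question versus a reworded one that adds background and constrains the answer. The phrasings and the ask() helper are hypothetical, included only to show the pattern, not the prompts ADI used.

    # Hypothetical prompt variants; ask() stands in for whatever VLM inference
    # call the pipeline exposes and is not a real API.
    terse_prompt = "What do you see?"

    engineered_prompt = (
        "The image shows the fixture of a pick-and-place robot in an electronics lab. "
        "Is the target part seated in the fixture? Answer only 'yes' or 'no'."
    )

    # for prompt in (terse_prompt, engineered_prompt):
    #     print(prompt, "->", ask(image, prompt))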

Another key finding was the importance of human-in-the-loop (HITL) evaluation. During live user interactions, the team observed how models responded to nuanced queries, helping it identify gaps in robustness and refine its approach iteratively.

Building Toward Natural Human-Robot Interaction

VLMs aren’t just about better vision; they’re a step toward more natural, interactive, and general-purpose robotics. They also handle vision-and-language tasks such as captioning, visual question answering (VQA), and reasoning, which, in line with explainable AI (XAI) principles, makes them well suited to both perception and communication.

The team used VLMs to create robotic agents with three key benefits:

  1. Greater adaptability to new scenarios and dynamic environments
  2. Reduced dependency on extensive labeled datasets or handcrafted rules
  3. Improved communication through natural language

The team applied these principles to enhance the user experience in human-robot interactions. Having a robot explain its planning and actions in human language helps foster trust and improve usability. Operators can give flexible commands, receive the robot’s feedback in natural language, and adjust its plan before the robot acts, as sketched below.
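
A minimal sketch of that interaction loop might look like the following, where the robot states its plan in natural language and waits for the operator before acting. The plan_task() and execute() methods are hypothetical placeholders, not ADI’s interfaces.

    # Hypothetical human-in-the-loop confirmation step: the robot explains its
    # plan in natural language and lets the operator approve or correct it.
    def confirm_and_execute(robot, command: str) -> None:
        plan = robot.plan_task(command)          # e.g. a list of step descriptions
        print("Proposed plan:")
        for i, step in enumerate(plan, start=1):
            print(f"  {i}. {step}")

        reply = input("Execute this plan? [y/N or type a correction] ").strip()
        if reply.lower() == "y":
            robot.execute(plan)
        elif reply:
            confirm_and_execute(robot, f"{command}. Correction: {reply}")
        else:
            print("Plan discarded.")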

Looking Ahead

The team’s work represents a meaningful step toward scalable, data-efficient robotic policies.

As VLMs continue to evolve, the team sees growing potential in:

  • Deployment to edge devices using compressed, optimized models
  • Clearer safety assessments using benchmark-driven evaluations
  • Combining VLMs with reinforcement learning and sensor fusion for richer contextual understanding

Another potential future direction is the exploration of vision language action (VLA) models, which integrate action, planning, and execution, enabling robots to decide and execute physical actions based on visual perception and natural language understanding. VLAs directly close the “see-understand-act” loop in a single system, extending the generalization capability of LLMs/VLMs for acting in the real world.

ADI has already explored some of these models and made a first attempt to run them on a real robot at its Catalyst™ hub in Limerick.

Conclusion

VLMs will play a foundational role in the next generation of robotics, where systems learn from fewer examples, interact more intuitively, and generalize across tasks with human-like fluency. You may even have a chance to tell them how to do things your way.


Sources:

1 “How much do robots cost? 2025 price breakdown.” Standard Bots, August 2025.