Vision-language models are fundamentally changing how humans and robots work together in manufacturing environments, according to a new survey published in Frontiers of Engineering Management. The research, conducted by a team from The Hong Kong Polytechnic University and KTH Royal Institute of Technology, provides the first comprehensive mapping of how these AI systems are reshaping human-robot collaboration in smart manufacturing.
The survey, which analyzed 109 studies published between 2020 and 2024, examines how vision-language models (VLMs) enable robots to process images and language simultaneously. This dual-modality capability allows robots to interpret complex scenes, follow spoken or written instructions, and generate multi-step plans, capabilities that traditional rule-based systems could not achieve. The full study is available at https://doi.org/10.1007/s42524-025-4136-9.
In task planning applications, VLMs help robots interpret human commands, analyze real-time scenes, break down multi-step instructions, and generate executable action sequences. Systems built on architectures like CLIP, GPT-4V, BERT, and ResNet have achieved success rates above 90% in collaborative assembly and tabletop manipulation tasks. This represents a significant improvement over conventional robots that have been constrained by brittle programming, limited perception, and minimal understanding of human intent.
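As a simplified illustration of the visual-language grounding step such planners rely on (this is not code from the survey), the sketch below uses the openly available CLIP model to score a camera view of a shared workbench against a few candidate action phrases; the image file name and the action list are hypothetical.

```python
# Minimal sketch: grounding a worker instruction against candidate actions
# with CLIP via Hugging Face transformers. "workbench.jpg" and the action
# phrases are hypothetical placeholders, not the survey's implementation.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate action descriptions the planner could choose between.
candidate_steps = [
    "pick up the hex screwdriver",
    "hold the housing steady for the worker",
    "hand over the torque wrench",
]

image = Image.open("workbench.jpg")  # current camera view of the shared workspace
inputs = processor(text=candidate_steps, images=image,
                   return_tensors="pt", padding=True)

# CLIP scores each text candidate against the image; a full planner would pass
# the best-matching step on to a language model to expand into motion primitives.
logits = model(**inputs).logits_per_image
probs = logits.softmax(dim=1).squeeze()
for step, p in zip(candidate_steps, probs.tolist()):
    print(f"{p:.2f}  {step}")
```

In practice this scoring step would sit inside a loop that re-observes the scene after each executed action, which is how the surveyed systems keep plans aligned with a changing workspace.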
For navigation, VLMs allow robots to translate natural-language goals into movement, mapping visual cues to spatial decisions. These models can follow detailed step-by-step instructions or reason from higher-level intent, enabling robust autonomy in domestic, industrial, and other embodied settings. In manipulation tasks, VLMs help robots recognize objects, evaluate affordances, and adjust to human motion, capabilities that are essential for safety-critical collaboration on factory floors.
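A minimal sketch of what mapping visual cues to spatial decisions can look like in code, assuming a CLIP-style scorer and one camera view per candidate heading (the file names and the pick_heading helper are illustrative assumptions, not the survey's method):

```python
# Minimal sketch of language-conditioned waypoint selection: score several
# candidate headings, each with its own camera view, against a natural-language
# goal and steer toward the best match. File names and model choice are
# illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def pick_heading(goal: str, views: dict[str, Image.Image]) -> str:
    """Return the heading whose camera view best matches the language goal."""
    headings = list(views)
    inputs = processor(text=[goal],
                       images=[views[h] for h in headings],
                       return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image.squeeze(1)  # one score per view
    return headings[int(scores.argmax())]


# Hypothetical camera captures taken while the robot pauses at a junction.
views = {h: Image.open(f"{h}.jpg") for h in ("left", "forward", "right")}
print(pick_heading("go to the parts shelf next to the CNC machine", views))
```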
The survey also highlights emerging work in multimodal skill transfer, where robots learn directly from visual-language demonstrations rather than labor-intensive coding. This approach could dramatically reduce the time and expertise required to program industrial robots for new tasks. The authors emphasize that VLMs mark a turning point for industrial robotics because they enable a shift from scripted automation to contextual understanding.
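To make the skill-transfer idea concrete, the following behavior-cloning sketch trains a small policy on demonstration tuples of image embedding, instruction embedding, and recorded action; the embedding dimension, action size, and randomly generated stand-in data are assumptions for illustration, not the approach of any specific surveyed system.

```python
# Minimal behavior-cloning sketch for visual-language skill transfer:
# a small policy maps precomputed (image, instruction) embeddings from
# demonstrations to end-effector commands. Dimensions and the random "demo"
# tensors are placeholders for a real recorded dataset.
import torch
from torch import nn

EMB_DIM, ACT_DIM = 512, 7          # e.g. CLIP-sized embeddings; 6-DoF delta + gripper

policy = nn.Sequential(
    nn.Linear(2 * EMB_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACT_DIM),
)
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Stand-ins for demonstration data: image/instruction embeddings and the
# human-demonstrated actions recorded at each timestep.
img_emb = torch.randn(1024, EMB_DIM)
txt_emb = torch.randn(1024, EMB_DIM)
actions = torch.randn(1024, ACT_DIM)

for epoch in range(10):
    pred = policy(torch.cat([img_emb, txt_emb], dim=-1))
    loss = nn.functional.mse_loss(pred, actions)   # imitate the demonstrations
    optim.zero_grad()
    loss.backward()
    optim.step()
```

The appeal noted in the survey is that collecting such demonstrations requires showing and telling rather than writing robot programs, which is what shortens deployment time for new tasks.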
"Robots equipped with VLMs can comprehend both what they see and what they are told," the researchers explain, highlighting that this dual-modality reasoning makes interaction more intuitive and safer for human workers. They envision VLM-enabled robots becoming central to future smart factories—capable of adjusting to changing tasks, assisting workers in assembly, retrieving tools, managing logistics, conducting equipment inspections, and coordinating multi-robot systems.
However, the authors caution that achieving large-scale deployment will require addressing challenges in model efficiency, robustness, and data collection. Developing industrial-grade multimodal benchmarks for reliable evaluation will also be crucial for practical implementation. As VLMs mature, robots could learn new procedures from video-and-language demonstrations, reason through long-horizon plans, and collaborate fluidly with humans without extensive reprogramming.
The research concludes that breakthroughs in efficient VLM architectures, high-quality multimodal datasets, and dependable real-time processing will be key to unlocking their full industrial impact. These advances could usher in a new era of safe, adaptive, and human-centric manufacturing in which machines function as flexible collaborators rather than scripted tools.


