MAViS: A Multi-Agent Approach for Training-Free Referring Video Object Segmentation
Journal
IEEE Transactions on Consumer Electronics
Start Page
1
ISSN
0098-3063
1558-4127
Date Issued
2026-01-16
Author(s)
Peng, Tai
Abstract
In this paper, we introduce a simple but effective training-free pipeline for handling the task of text-to-video object segmentation. Our approach leverages open-source Multimodal Large Language Models (MLLMs) for segmenting objects in videos based on language descriptions. We design three multimodal reasoning agents that decompose the task into semantic, temporal, and spatial reasoning stages: a Video Summarization Agent to provide concise semantic context, a Keyframe Selection Agent employing a Binary-Logit Frame Scoring mechanism to identify informative frames, and an Object Grounding Agent predicting bounding boxes for the described objects. Finally, by providing high-quality prompts to a semantic-free segmentation tool, our method effectively handles spatiotemporal variations and reduces segmentation errors. Extensive experiments show that our training-free method significantly reduces resource requirements while achieving comparable or even better performance than supervised fine-tuning approaches.
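Illustrative note: the abstract names a Binary-Logit Frame Scoring mechanism but does not define it. One plausible reading, assumed here and not confirmed by the record, is that the MLLM is asked a yes/no question per frame and the frame's score is the two-way softmax over the "yes" and "no" token logits. The minimal sketch below follows that assumption. The function names, the logit inputs, and the top-k selection rule are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def binary_logit_score(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over (yes, no) logits, i.e. sigmoid(yes - no).

    ASSUMPTION: this scoring rule is a guess at what "binary-logit"
    means; the source abstract does not specify the formula.
    """
    d = yes_logit - no_logit
    # Numerically stable sigmoid: never exponentiate a large positive value.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def select_keyframes(
    frame_logits: Sequence[Tuple[float, float]], k: int
) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order.

    frame_logits holds one hypothetical (yes_logit, no_logit) pair per frame.
    """
    scores = [binary_logit_score(y, n) for y, n in frame_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Made-up logits for four frames, for illustration only.
    logits = [(0.0, 1.0), (2.0, 0.0), (1.0, 1.0), (3.0, -1.0)]
    print(select_keyframes(logits, k=2))
```

Under this reading, the selected frame indices would be handed to the Object Grounding Agent, whose bounding boxes then serve as prompts for the segmentation tool.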
Subjects
multi-agent system
multimodal large language models
reasoning segmentation
referring video object segmentation
segment anything
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Type
journal article
