Title: MAViS: A Multi-Agent Approach for Training-Free Referring Video Object Segmentation
Authors: Peng, Tai; Chen, Chu-Song
Item type: journal article
Dates: 2026-03-19; 2026-01-16
ISSN: 0098-3063
DOI: 10.1109/tce.2025.3650288
Scopus ID: 2-s2.0-105027990731
Scopus record: https://www.scopus.com/record/display.uri?eid=2-s2.0-105027990731&origin=resultslist
Repository: https://scholars.lib.ntu.edu.tw/handle/123456789/736439
Keywords: multi-agent system; multimodal large language models; reasoning segmentation; referring video object segmentation; segment anything

Abstract: In this paper, we introduce a simple but effective training-free pipeline for the task of text-to-video object segmentation. Our approach leverages open-source Multimodal Large Language Models (MLLMs) to segment objects in videos based on language descriptions. We design three multimodal reasoning agents that decompose the task into semantic, temporal, and spatial reasoning stages: a Video Summarization Agent that provides concise semantic context, a Keyframe Selection Agent that employs a Binary-Logit Frame Scoring mechanism to identify informative frames, and an Object Grounding Agent that predicts bounding boxes for the described objects. Finally, by providing high-quality prompts to a semantic-free segmentation tool, our method effectively handles spatiotemporal variations and reduces segmentation errors. Extensive experiments show that our training-free method significantly reduces resource requirements while achieving comparable or even better performance than supervised fine-tuning approaches.
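The three-stage agent pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the function names, the stand-in frame scores, and the fixed bounding box are hypothetical placeholders, not the paper's implementation, and real MLLM calls plus a semantic-free segmenter (e.g. a SAM-style tool) would replace the stub logic.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A bounding-box prompt for a semantic-free segmenter."""
    x1: int
    y1: int
    x2: int
    y2: int

def video_summarization_agent(frames, query):
    # Stage 1 (semantic): would prompt an MLLM for concise context
    # describing the clip relative to the language query.
    return f"summary of {len(frames)} frames for query '{query}'"

def keyframe_selection_agent(frames, summary, query, top_k=2):
    # Stage 2 (temporal): Binary-Logit Frame Scoring would ask the MLLM a
    # yes/no question per frame ("is the described object visible?") and
    # rank frames by the logit/probability of "yes". A stand-in score that
    # peaks at the middle frame is used here instead.
    mid = len(frames) // 2
    scores = [(i, 1.0 / (1 + abs(i - mid))) for i in range(len(frames))]
    scores.sort(key=lambda s: s[1], reverse=True)
    return [i for i, _ in scores[:top_k]]

def object_grounding_agent(frames, keyframes, query):
    # Stage 3 (spatial): would prompt an MLLM to localize the described
    # object on each selected keyframe; a fixed box stands in here.
    return {k: Box(10, 10, 50, 50) for k in keyframes}

def mavis_pipeline(frames, query):
    summary = video_summarization_agent(frames, query)
    keyframes = keyframe_selection_agent(frames, summary, query)
    boxes = object_grounding_agent(frames, keyframes, query)
    # The boxes would then serve as prompts to a semantic-free
    # segmentation tool that produces and propagates masks; this sketch
    # just returns the intermediate prompts.
    return {"summary": summary, "keyframes": keyframes, "boxes": boxes}

result = mavis_pipeline(frames=list(range(8)), query="the dog on the left")
print(sorted(result["keyframes"]))  # the two highest-scoring frames
```

The decomposition mirrors the abstract's division of labor: each agent handles one reasoning axis (semantic, temporal, spatial), and only the final grounding output is handed to the segmenter.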