Data-Efficient 3D Visual Grounding via Order-Aware Referring
Part Of
Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
Start Page
3107
End Page
3117
ISBN (of the container)
979-8-3315-1083-1
DOI (of the container)
10.1109/WACV61041.2025.00307
Date Issued
2025-02-26
Author(s)
Abstract
3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. Previous works usually require significant amounts of data pairing point clouds with their descriptions to exploit the corresponding complicated verbo-visual relations. In our work, we introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. Vigor leverages a large language model (LLM) to produce a desirable referential order from the input description for 3D visual grounding. With the proposed stacked object-referring blocks, the predicted anchor objects in this order allow one to locate the target object progressively, without supervision on the identities of anchor objects or the exact relations between anchor and target objects. We also present an order-aware warm-up training strategy, which augments referential orders for pre-training the visual grounding framework, allowing us to better capture the complex verbo-visual relations and benefiting the desired data-efficient learning scheme. Experimental results on the NR3D and ScanRefer datasets demonstrate our superiority in low-resource scenarios. In particular, Vigor surpasses current state-of-the-art frameworks by 9.3% and 7.6% grounding accuracy under the 1% and 10% data settings on the NR3D dataset, respectively.
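A minimal conceptual sketch of the order-aware referring idea summarized above, not the authors' implementation: an LLM-derived referential order is followed anchor by anchor, and each step grounds the next object relative to the previously grounded one. All names here (Scene, referential_order, refer_step, ground) and the toy proximity scoring are illustrative assumptions; in Vigor the object-referring blocks are learned modules operating on point cloud features.

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Scene:
    # object id -> (x, y) position; a toy stand-in for point cloud object features
    objects: Dict[str, Tuple[float, float]]

def referential_order(description: str) -> List[str]:
    # In Vigor, an LLM parses the description into an ordered chain of
    # anchors ending at the target; hard-coded here for one example.
    # "the chair next to the desk by the window" -> window, desk, chair
    return ["window", "desk", "chair"]

def refer_step(scene: Scene,
               anchor: Optional[Tuple[float, float]],
               name: str) -> Tuple[str, Tuple[float, float]]:
    # One "object-referring block": pick the candidate of the given class
    # closest to the previously grounded anchor (a crude stand-in for the
    # learned verbo-visual attention described in the paper).
    candidates = [(oid, pos) for oid, pos in scene.objects.items()
                  if oid.startswith(name)]
    if anchor is None:
        return candidates[0]
    return min(candidates,
               key=lambda c: (c[1][0] - anchor[0]) ** 2
                           + (c[1][1] - anchor[1]) ** 2)

def ground(scene: Scene, description: str) -> str:
    # Progressively locate each object in the referential order; the last
    # grounded object is the target. No per-anchor supervision is assumed.
    anchor_pos: Optional[Tuple[float, float]] = None
    target_id = ""
    for name in referential_order(description):
        target_id, anchor_pos = refer_step(scene, anchor_pos, name)
    return target_id

scene = Scene(objects={
    "window_0": (0.0, 0.0),
    "desk_0": (1.0, 0.5),
    "desk_1": (9.0, 9.0),
    "chair_0": (1.5, 0.8),  # next to desk_0, which is by the window
    "chair_1": (8.5, 9.2),  # next to desk_1, far from the window
})
print(ground(scene, "the chair next to the desk by the window"))  # -> chair_0

Running the sketch resolves the chain window_0 -> desk_0 -> chair_0, illustrating how grounding each anchor in order disambiguates between the two chairs without ever labeling the anchors themselves.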
Event(s)
2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
Publisher
IEEE
Type
conference paper
