Title: Enhancing Real-Time Semantic Segmentation with Textual Knowledge of Pre-Trained Vision-Language Model: A Lightweight Approach
Authors: Lin, Chia-Yi; Chen, Jun-Cheng; Wu, Ja-Ling
Type: conference paper
Date Issued: 2023-01-01
Date Available: 2024-01-18
ISBN: 9798350300673
DOI: 10.1109/APSIPAASC58517.2023.10317179
Scopus ID: 2-s2.0-85180010437
Scopus URL: https://api.elsevier.com/content/abstract/scopus_id/85180010437
URI: https://scholars.lib.ntu.edu.tw/handle/123456789/638618

Abstract:
In this paper, we present a lightweight method for enhancing real-time semantic segmentation models by leveraging the power of pre-trained vision-language models. Our approach incorporates the CLIP text encoder, which provides rich semantic embeddings for text labels, and effectively distills this textual knowledge into the segmentation model. The proposed framework integrates the image and text embeddings, enabling alignment of visual and textual information. In addition, we introduce learnable prompt embeddings to capture class-specific information and enhance the semantic understanding of the model. To ensure efficient learning, we devise a two-stage training procedure in which the segmentation backbone learns from fixed text embeddings in the first stage and the prompt embeddings are optimized in the second stage. Extensive experiments and ablation studies demonstrate that our method significantly improves the performance of a real-time semantic segmentation model.
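The abstract's core mechanism, scoring each pixel's feature against class text embeddings so the segmentation model learns in the CLIP text-embedding space, can be sketched as follows. This is a minimal illustration under assumed details: the module name TextGuidedSegHead, the 1x1 projection, the temperature value, and all dimensions are hypothetical choices, not the authors' actual architecture.

```python
# Minimal PyTorch sketch of aligning per-pixel features with class text
# embeddings, as the abstract describes. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedSegHead(nn.Module):
    """Scores per-pixel features against class text embeddings."""
    def __init__(self, feat_dim: int, text_dim: int, num_classes: int):
        super().__init__()
        # Project backbone features into the text-embedding space.
        self.proj = nn.Conv2d(feat_dim, text_dim, kernel_size=1)
        # Stage 1: frozen embeddings (e.g., taken from the CLIP text encoder).
        # Stage 2: set requires_grad=True so they act as learnable prompts.
        self.text_emb = nn.Parameter(
            torch.randn(num_classes, text_dim), requires_grad=False)
        # Temperature scaling for the cosine-similarity logits (assumed value).
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = F.normalize(self.proj(feats), dim=1)    # (B, D, H, W), unit-norm
        t = F.normalize(self.text_emb, dim=1)       # (C, D), unit-norm
        # Cosine similarity between every pixel and every class embedding.
        return torch.einsum("bdhw,cd->bchw", x, t) * self.logit_scale

# Usage with dummy backbone features: the resulting (B, C, H, W) logits can
# be trained with an ordinary per-pixel cross-entropy loss.
head = TextGuidedSegHead(feat_dim=256, text_dim=512, num_classes=19)
feats = torch.randn(2, 256, 64, 128)
logits = head(feats)                                # (2, 19, 64, 128)
labels = torch.randint(0, 19, (2, 64, 128))
loss = F.cross_entropy(logits, labels)
```

Under this reading, the two-stage procedure amounts to flipping requires_grad on the text side: the backbone first adapts to fixed CLIP text embeddings, then the prompt embeddings are fine-tuned while the alignment objective stays the same.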