Title: XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding
Authors: Hsu, Chan-Jan; Lee, Hung-Yi; Tsao, Yu
Date Issued: 1-Jan-2022
Volume: 2
Source Publication: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Abstract: Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers into pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU. After training with a small number of extra adapting steps and fine-tuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) benchmark, the Situations With Adversarial Generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.
URI: https://scholars.lib.ntu.edu.tw/handle/123456789/629449
ISBN: 9781955917223
ISSN: 0736-587X
Appears in Collections: Department of Electrical Engineering
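The abstract only gives the high-level idea of distilling a cross-modal teacher into a language-only student. As a rough, hedged sketch of what such an adapting step could look like, the code below pulls a student encoder's hidden states toward a frozen teacher encoder's hidden states with a mean-squared-error loss. All module names, dimensions, and the choice of MSE here are illustrative assumptions and are not taken from the XDBERT paper's actual objective or implementation.

```python
# Hypothetical sketch of distilling a frozen (cross-modal) teacher into a
# language-only student encoder. Modules, sizes, and the MSE objective are
# assumptions for illustration, not the XDBERT paper's method.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a transformer text encoder; a real setup would load
    pretrained checkpoints for both teacher and student."""
    def __init__(self, vocab_size=30522, hidden=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, input_ids):
        # Returns hidden states of shape (batch, seq_len, hidden).
        return self.encoder(self.embed(input_ids))

teacher = TinyEncoder()  # frozen teacher (text-side states of a cross-modal model)
student = TinyEncoder()  # language-only student to be adapted
for p in teacher.parameters():
    p.requires_grad_(False)

proj = nn.Linear(256, 256)  # maps the student space into the teacher space
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-4
)
mse = nn.MSELoss()

# One "adapting step": align student hidden states with the teacher's.
input_ids = torch.randint(0, 30522, (8, 32))  # dummy token-id batch
with torch.no_grad():
    teacher_states = teacher(input_ids)
student_states = proj(student(input_ids))
loss = mse(student_states, teacher_states)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"distillation loss: {loss.item():.4f}")
```

After a small number of such adapting steps, the student would then be fine-tuned on downstream NLU tasks (e.g., GLUE or SWAG) in the usual way.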
Items in this institutional repository are protected by copyright, with all rights reserved, unless their copyright terms are otherwise specified.