Authors: Jiang, Yiquan; Liao, Kengte; Lin, Shou-De; Qiao, Hongming; Yu, Kefeng; Yang, Chengwei; Chen, Yinqi
Dates: 2023-12-25 (accessioned); 2023-12-25 (available); 2024-01-01 (issued)
ISBN: 978-981-99-8144-1
ISSN: 1865-0929
Handle: https://scholars.lib.ntu.edu.tw/handle/123456789/638124

Abstract: Measuring object similarity remains a persistent challenge in data science. In e-commerce retail, identifying substitutable and similar products relies on such similarity measures. Leveraging multimodal knowledge acquired from real-world experience, humans can recognize similar products from their titles alone, even when the titles differ substantially in wording. Motivated by this intuition, we propose a self-supervised mechanism that extracts strong prior knowledge from product image-title pairs. This mechanism enhances the encoder's capacity to learn product representations in a multimodal framework, so that the similarity between products is reflected by the distance between their respective representations. Additionally, we introduce a novel attention regularization that effectively directs attention toward product category-related signals. The proposed model is widely applicable, as it can also be employed in unimodal tasks where only free-text inputs are available. To validate our approach, we evaluate our model on two key tasks, product similarity matching and retrieval, using a real-world dataset of thousands of diverse products. Experimental results demonstrate that multimodal learning significantly enhances language understanding capabilities in the e-commerce domain.
Moreover, our approach outperforms strong unimodal baselines and recently proposed multimodal methods, further validating its effectiveness.

Keywords: Multimodal Learning | Product Similarity | Self-Supervised Learning
SDGs: SDG9
Title: Self-supervised Multimodal Representation Learning for Product Identification and Retrieval
Type: conference paper
DOI: 10.1007/978-981-99-8145-8_44
Scopus: 2-s2.0-85178563632 (https://api.elsevier.com/content/abstract/scopus_id/85178563632)
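The abstract's retrieval setup, in which product similarity is read off as the distance between learned representations, can be sketched as follows. This is a minimal illustration only: the toy vectors stand in for encoder outputs the paper would produce, and cosine similarity is one common choice of measure, not necessarily the one used in the paper.

```python
import math

def cosine_similarity(u, v):
    # Similarity between two product representation vectors.
    # (Toy stand-in for the paper's learned multimodal embeddings.)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # guard against degenerate zero vectors
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, catalog):
    # catalog maps product_id -> embedding; returns ids, most similar first.
    return sorted(
        catalog,
        key=lambda pid: cosine_similarity(query_vec, catalog[pid]),
        reverse=True,
    )

# Hypothetical usage with placeholder embeddings:
catalog = {"mug_red": [0.9, 0.1], "mug_blue": [0.8, 0.2], "sofa": [0.0, 1.0]}
ranked = rank_by_similarity([0.85, 0.15], catalog)  # mugs rank above the sofa
```

For retrieval over thousands of products, as in the paper's evaluation, the same idea would typically be served by an approximate nearest-neighbor index rather than an exhaustive sort.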