Self-supervised Multimodal Representation Learning for Product Identification and Retrieval
Journal
Communications in Computer and Information Science
Journal Volume
1965 CCIS
ISBN
9789819981441
Date Issued
2024-01-01
Author(s)
Abstract
Measuring object similarity remains a persistent challenge in data science. In e-commerce retail, identifying substitutable and similar products requires reliable similarity measures. Drawing on multimodal knowledge acquired from real-world experience, humans can recognize similar products from their titles alone, even when the titles differ substantially in wording. Motivated by this intuition, we propose a self-supervised mechanism that extracts strong prior knowledge from product image-title pairs, enhancing the encoder’s capacity to learn product representations in a multimodal framework. The similarity between products is then reflected by the distance between their respective representations. We further introduce a novel attention regularization that directs attention toward product category-related signals. The proposed model is widely applicable, as it can also be employed in unimodal tasks where only free-text input is available. We evaluate our model on two tasks, product similarity matching and retrieval, using a real-world dataset of thousands of diverse products. Experimental results demonstrate that multimodal learning significantly enhances language understanding in the e-commerce domain, and that our approach outperforms both strong unimodal baselines and recently proposed multimodal methods.
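The retrieval step described above, scoring product similarity by the distance between learned representations, can be sketched as follows. This is a minimal illustration only: the embeddings and product names are hypothetical placeholders, and the paper's actual encoder (which would produce these vectors) is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_similar(query_emb, catalog):
    """Rank catalog products (name -> embedding) by similarity to the query."""
    scored = [(name, cosine_similarity(query_emb, emb))
              for name, emb in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-d embeddings for illustration; a real encoder would
# output much higher-dimensional vectors from image-title pairs.
catalog = {
    "wireless mouse": [0.90, 0.10, 0.20],
    "bluetooth mouse": [0.85, 0.15, 0.25],
    "desk lamp": [0.10, 0.90, 0.30],
}
ranking = retrieve_similar([0.88, 0.12, 0.22], catalog)
# The two mouse products rank above the lamp, mirroring how nearby
# representations indicate substitutable products.
```

In practice, large catalogs would use an approximate nearest-neighbor index rather than this exhaustive scan, but the similarity criterion is the same.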
Subjects
Multimodal Learning | Product Similarity | Self-Supervised Learning
SDGs
Type
conference paper
