Abstract: In recent years, with the significant evolution of multi-modal large models, many recommender researchers have realized the potential of multi-modal information for user interest modeling. In industry, a widely used modeling architecture is a cascading paradigm: (1) a multi-modal model is first pre-trained to provide omnipotent representations for downstream services; (2) the downstream recommendation model then takes the multi-modal representations as additional input to fit real user-item behaviours. Although this paradigm achieves remarkable improvements, two problems still limit model performance: (1) Representation Unmatching: the pre-trained multi-modal model is always supervised by classic NLP/CV tasks, while recommendation models are supervised by real user-item interactions. As a result, the goals of the two fundamentally different tasks are relatively separate, and their representations lack a consistent objective; (2) Representation Unlearning: the generated multi-modal representations are always stored in a cache and serve as extra fixed input to the recommendation model, so they cannot be updated by the recommendation model's gradients, which makes them less friendly to downstream training. Motivated by these two challenges in downstream usage, we introduce a quantitative multi-modal framework to customize specialized and trainable multi-modal information for different downstream models.

What problem does this paper attempt to address?

This paper attempts to solve two main problems encountered when using multi-modal information in recommendation systems:

1. **Representation Unmatching**:
   - **Problem Description**: Pretrained multi-modal models are usually supervised through classic natural language processing (NLP) or computer vision (CV) tasks, while recommendation models are supervised by real user-item interaction data. The goals of these two types of tasks are relatively independent, resulting in a lack of consistency in their representations.
   - **Solution**: The paper proposes an "Item Alignment Mechanism", which fine-tunes the pretrained multi-modal model so that the generated representations can better reflect the real-world business user-item interaction patterns.
2. **Representation Unlearning**:
   - **Problem Description**: The generated multi-modal representations are usually stored in the cache and used as fixed input features of the recommendation model, and cannot be updated by the gradients of the recommendation model, which limits the training effect of downstream tasks.
   - **Solution**: The paper introduces a "Quantitative Code Mechanism", which converts the aligned multi-modal representations into learnable code IDs for end-to-end training in the recommendation model.

### Specific Methods

1. **Item Alignment Mechanism**:
   - **Implementation Method**: Build an alignment model with pure multi-modal representation input, and use the existing retrieval model knowledge to supervise it to reflect the real-world business characteristics.
   - **Specific Steps**:
     - Generate a high-quality item-pair dataset \( D \).
     - Use the batch contrastive loss function \( L_{\text{align}} \) to perform alignment training on multi-modal representations.
2. **Quantitative Code Mechanism**:
   - **Implementation Method**: Design two heuristic quantization mechanisms, namely vector quantization (VQ) and residual quantization (RQ), to convert the aligned multi-modal representations into learnable code IDs.
   - **Specific Steps**:
     - **VQ Code**: Directly use the aligned representations of all items as the codebook and perform quantization through Top-K nearest-neighbor search.
     - **RQ Code**: Use the heuristic K-means algorithm to generate a multi-layer codebook and recursively quantize the representations.

### Experimental Verification

The paper conducted detailed offline and online experiments in Kuaishou's shopping and advertising services to verify the effectiveness of QARM. The experimental results show that QARM significantly improves the performance of the recommendation system on multiple evaluation metrics, specifically:

- **Offline Experiments**:
  - In the advertising service, QARM improves AUC, UAUC, and GAUC by 0.18%, 0.29%, and 0.25% respectively.
  - In the shopping service, QARM improves AUC, UAUC, and GAUC by 0.23%, 0.56%, and 0.50% respectively.
- **Online Experiments**:
  - In the advertising service, QARM increases revenue by 9.704% in the cold-start scenario and by 3.147% in other scenarios.
  - In the shopping service, QARM improves the number of orders, GMV, CTR, and CVR by 1.396%, 2.296%, 0.478%, and 0.903% respectively.

### Summary

By introducing the item alignment mechanism and the quantitative code mechanism, the paper effectively solves the problems of representation unmatching and representation unlearning of multi-modal information in the recommendation system, and significantly improves the performance of the recommendation system. These methods have been widely used in multiple services of Kuaishou, supporting the recommendation needs of 400 million active users per day.
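
As a rough illustration of the batch contrastive loss \( L_{\text{align}} \) named under the Item Alignment Mechanism, the sketch below computes an InfoNCE-style loss over a batch of item pairs drawn from \( D \). The summary does not give the exact form of the loss, so the use of in-batch negatives, cosine similarity, the `temperature` value, and the names `item_a_emb` and `item_b_emb` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def batch_contrastive_loss(item_a_emb, item_b_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of item pairs (assumed form of L_align).

    Row i of item_a_emb is paired with row i of item_b_emb; every other row
    of item_b_emb in the batch acts as a negative for row i.
    The temperature default is a hypothetical value, not taken from the paper.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    a = item_a_emb / np.linalg.norm(item_a_emb, axis=1, keepdims=True)
    b = item_b_emb / np.linalg.norm(item_b_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = a @ b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    noisy_copy = x + 0.01 * rng.normal(size=x.shape)
    print("aligned pairs :", batch_contrastive_loss(x, noisy_copy))
    print("shuffled pairs:", batch_contrastive_loss(x, x[::-1]))
```

Under these assumptions, a batch whose pairs are already close in embedding space yields a lower loss than the same batch with pairs mismatched, which is the signal the alignment training would minimize.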
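
The two quantization steps described under the Quantitative Code Mechanism can be sketched in the same spirit. For the VQ code, the aligned item representations themselves serve as the codebook and each item receives the indices of its Top-K nearest neighbours. For the RQ code, K-means is run layer by layer on the residuals. The cosine similarity metric, the plain Lloyd-style K-means, and all parameter values (`top_k`, `n_layers`, `n_clusters`) are illustrative assumptions; the summary does not specify them.

```python
import numpy as np


def vq_codes(query, codebook, top_k=3):
    """Indices of the top_k codebook rows most similar to each query row.

    Cosine similarity is an assumed metric for this sketch.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sim = q @ c.T
    # Sort by descending similarity and keep the first top_k columns.
    return np.argsort(-sim, axis=1)[:, :top_k]


def nearest(x, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of x."""
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def kmeans(x, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd-style K-means; centroids start from random data points."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest(x, centroids)
        for k in range(n_clusters):
            members = x[assign == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return centroids


def rq_codes(x, n_layers=3, n_clusters=4, seed=0):
    """Recursively quantize x: one K-means codebook per layer of residuals."""
    residual = np.asarray(x, dtype=float).copy()
    codebooks, codes = [], []
    for layer in range(n_layers):
        centroids = kmeans(residual, n_clusters, seed=seed + layer)
        assign = nearest(residual, centroids)
        codebooks.append(centroids)
        codes.append(assign)
        # The next layer quantizes whatever this layer failed to explain.
        residual = residual - centroids[assign]
    return np.stack(codes, axis=1), codebooks


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    items = rng.normal(size=(32, 8))  # stand-in for aligned representations
    print("VQ codes of item 0:", vq_codes(items[:1], items, top_k=3)[0])
    codes, _ = rq_codes(items)
    print("RQ codes of item 0:", codes[0])
```

In a downstream model, each code ID would then index a trainable embedding table, which is what would let the multi-modal signal receive gradients from the recommendation loss.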

QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou