Combo-Attention Network for Baidu Video Advertising

Tan Yu,Yi Yang,Yi Li,Xiaodong Chen,Mingming Sun,Ping Li
DOI: https://doi.org/10.1145/3394486.3403297
2020-08-20
Abstract:With the progress of communication technology and the popularity of the smart phone, videos grow to be the largest medium. Since videos can grab a customer's attention quickly and leave a big impression, video ads can gain more trust than traditional ads. Thus advertisers start to pour more resources into making creative video ads to built the connections with potential customers. Baidu, as the leading search engine company in China, receives billions of search queries per day. In this paper, we introduce a technique used in Baidu video advertising for feeding relevant video ads according to the user's query. Note that, retrieving relevant videos using the text query is a cross-modal problem. Due to the modal gap, the text-to-video search is more challenging than well exploited text-to-text search and image-to-image search. To tackle this challenge, we propose a Combo-Attention Network (CAN) and launch it in Baidu video advertising. In the proposed CAN model, we represent a video as a set of bounding boxes features and represent a sentence as a set of words features, and formulate the sentence-to-video search as a set-to-set matching problem. The proposed CAN is built upon the proposed combo-attention module, which exploits cross-modal attentions besides self attentions to effectively capture the relevance between words and bounding boxes. To testify the effectiveness of the proposed CAN offline, we built a Daily700K dataset collected from HaoKan APP. The systematic experiments on Daily700K as well as a public dataset, VATEX, demonstrate the effectiveness of our CAN. After launching the proposed CAN in Baidu's dynamic video advertising (DVA), we achieve a $5.47%$ increase in Conversion Rate (CVR) and a $11.69%$ increase in advertisement impression rate.
What problem does this paper attempt to address?