Neighbor Does Matter: Curriculum Global Positive-Negative Sampling for Vision-Language Pre-training
Bin Huang, Feng He, Qi Wang, Hong Chen, Guohao Li, Zhifan Feng, Xin Wang, Wenwu Zhu
DOI: https://doi.org/10.1145/3664647.3681502
2024-01-01
Abstract: Sampling strategies have been widely adopted in Vision-Language Pre-training (VLP) and have achieved great success recently. However, the sampling strategies adopted by current VLP works are limited in two ways: i) they focus only on negative sampling, ignoring the importance of more informative positive samples; ii) they sample at the local in-batch level, which may lead to sub-optimal results. To tackle these problems, we propose a curriculum-based Global Positive-Negative Sampling (GPN-S) framework for vision-language pre-training, which conducts both positive and negative sampling at the global level, grounded on the notion of neighborhood relationships. Additionally, we incorporate curriculum learning into our sampling strategy, progressively increasing the complexity of samples as training progresses. Specifically, the proposed GPN-S framework uses positive sampling to bring semantically equivalent samples closer and negative sampling to push challenging negative samples farther away. We consider the two jointly from a global perspective rather than within a local mini-batch, which provides more informative and diverse samples. We evaluate the proposed GPN-S framework on several common downstream tasks, and the results demonstrate significant performance improvements over existing models.
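The abstract does not give the sampling procedure in detail, so the following is only an illustrative sketch of the general idea, not the authors' GPN-S algorithm. It assumes a global bank of L2-normalized embeddings covering the whole dataset, treats the nearest neighbors of an anchor as extra positives, and draws negatives from a window over the remaining similarity ranking that slides from easy (dissimilar) to hard (similar) as training proceeds. The function names, the linear pacing schedule, and the parameters `num_pos` and `num_neg` are all hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def curriculum_fraction(step, total_steps):
    """Hypothetical linear pacing: 0.0 at the start of training, 1.0 at the end."""
    return min(1.0, max(0.0, step / max(1, total_steps)))


def sample_global_pos_neg(anchor_idx, bank, step, total_steps, num_pos=2, num_neg=4):
    """Pick positives and negatives for one anchor from a global embedding bank.

    Assumptions (not taken from the paper): positives are the anchor's nearest
    neighbors in the bank; negatives come from a window over the remaining
    ranking that moves toward more similar (harder) items as `step` grows.
    """
    n = bank.shape[0]
    if n - 1 - num_pos < num_neg:
        raise ValueError("bank is too small for the requested sample counts")

    # Cosine similarity of every bank entry to the anchor (rows are unit length).
    sims = bank @ bank[anchor_idx]
    sims[anchor_idx] = -np.inf  # the anchor itself sorts last and is dropped

    # Indices sorted from most to least similar, without the anchor.
    order = np.argsort(-sims)[:-1]

    positives = order[:num_pos]   # global nearest neighbors
    candidates = order[num_pos:]  # remaining items, hardest first

    # Slide the negative window from the easy end to the hard end.
    frac = curriculum_fraction(step, total_steps)
    max_start = len(candidates) - num_neg
    start = int(round((1.0 - frac) * max_start))
    negatives = candidates[start:start + num_neg]
    return positives, negatives


# Toy bank of 50 random 16-dimensional embeddings standing in for a dataset.
rng = np.random.default_rng(0)
bank = l2_normalize(rng.normal(size=(50, 16)))

early_pos, early_neg = sample_global_pos_neg(3, bank, step=0, total_steps=100)
late_pos, late_neg = sample_global_pos_neg(3, bank, step=100, total_steps=100)
```

In this sketch the positives stay fixed while the negatives become more similar to the anchor over time. A practical system would likely use an approximate nearest-neighbor index and refresh the bank periodically, which is omitted here.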