Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, Yejin Choi
DOI: https://doi.org/10.48550/arXiv.2304.06939
2023-04-14
Computer Vision and Pattern Recognition
Abstract: In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning by interleaving independent supervised (image, text) examples, but also supports more complex prompts that involve interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. mmc4 spans everyday topics such as cooking, travel, and technology. A manual inspection of a random sample of documents shows that a vast majority (90%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (78%). After filtering NSFW images, ads, etc., the corpus contains 103M documents containing 585M images interleaved with 43B English tokens.
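To make the image placement step concrete, here is a minimal sketch of linear assignment over an image-sentence similarity matrix. It is an illustration, not the authors' implementation. The function name `assign_images_to_sentences` and the toy similarity values are hypothetical; in the paper's setting the similarities would come from CLIP features. The sketch assumes each sentence receives at most one image and that a document has at least as many sentences as images, and it solves the assignment by exhaustive search, which is only practical for tiny inputs (a real pipeline would use a polynomial-time solver such as the Hungarian algorithm).

```python
from itertools import permutations


def assign_images_to_sentences(sim):
    """Return {image_index: sentence_index} maximizing total similarity.

    sim[i][j] is the similarity between image i and sentence j.
    Assumption for this sketch: each sentence gets at most one image.
    """
    n_images = len(sim)
    n_sentences = len(sim[0]) if sim else 0
    if n_images > n_sentences:
        raise ValueError("this sketch assumes at least as many sentences as images")

    best_score, best_cols = float("-inf"), ()
    # Try every way of giving the images distinct sentences (exhaustive search).
    for cols in permutations(range(n_sentences), n_images):
        score = sum(sim[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    return dict(enumerate(best_cols))


# Toy similarities (made up for illustration): 2 images, 3 sentences.
toy_sim = [
    [0.90, 0.80, 0.10],  # image 0
    [0.85, 0.20, 0.30],  # image 1
]
assignment = assign_images_to_sentences(toy_sim)
print(assignment)
```

On this toy input, picking the best sentence for each image independently would place both images on sentence 0. The joint assignment instead moves image 0 to sentence 1 and keeps image 1 on sentence 0, because that pairing has the highest total similarity.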