Commonly Uncommon: Semantic Sparsity in Situation Recognition

Mark Yatskar,Vicente Ordonez,Luke Zettlemoyer,Ali Farhadi
DOI: https://doi.org/10.1109/cvpr.2017.671
2017-07-01
Abstract:Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of pro ducing structured summaries of what is happening in im ages, including activities, objects and the roles objects play within the activity. For this problem, we find empirically that most substructures required for prediction are rare, and current state-of-the-art model performance dramati cally decreases if even one such rare substructure exists in the target output. We avoid many such errors by (1) introducing a novel tensor composition function that learns to share examples across substructures more effectively and (2) se mantically augmenting our training data with automatically gathered examples of rarely observed outputs using web data. When integrated within a complete CRF-based structured prediction model, the tensor-based approach outper forms existing state of the art by a relative improvement of 2.11% and 4.40% on top-5 verb and noun-role accuracy, re spectively. Adding 5 million images with our semantic aug mentation techniques gives further relative improvements of 6.23% and 9.57% on top-5 verb and noun-role accuracy.
What problem does this paper attempt to address?