EAT: an Enhancer for Aesthetics-Oriented Transformers

Shuai He, Anlong Ming, Shuntian Zheng, Haobin Zhong, Huadong Ma
DOI: https://doi.org/10.1145/3581783.3611881
2023-01-01
Abstract: Transformers have shown great potential in various vision tasks, but none has surpassed the best CNN model on image aesthetics assessment (IAA) tasks. IAA is a challenging task in multimedia systems that requires attention to both foreground and background, as well as robustness to noisy and redundant labels. The global and dense attention mechanism of Transformers, designed for saliency-oriented tasks, may miss important aesthetic information in the background, increase computational cost, and slow convergence on IAA tasks. To address these issues, we propose an Enhancer for Aesthetics-Oriented Transformers (EAT). EAT uses a deformable, sparse, and data-dependent attention mechanism that learns where to focus and how to refine attention through offsets. EAT also guides the offsets to balance attention between foreground and background according to dedicated rules. Our EAT-enhanced Transformers outperform previous methods on four representative datasets with fewer training epochs. Code is available at https://github.com/woshidandan/Image-Aesthetics-Assessment
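The idea of sparse attention refined by offsets can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general pattern of attending to a few sampled points instead of every position. All names here (`bilinear_sample`, `sparse_deformable_attention`, `ref_points`, `offsets`) are hypothetical, the offsets are passed in as fixed numbers rather than predicted by a network, and the query/key/value projections and the foreground/background guidance rules are omitted.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample an H x W x C nested-list feature map at a fractional (y, x).

    Coordinates are clamped to the map, then the four surrounding
    grid cells are blended with bilinear weights.
    """
    H, W = len(feat), len(feat[0])
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return [
        (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
        for a, b, c, d in zip(
            feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1]
        )
    ]


def sparse_deformable_attention(query, feat, ref_points, offsets):
    """Attend from one query vector to K sampled points only.

    ref_points: K reference positions (y, x) on the feature map.
    offsets:    K shifts (dy, dx); assumed given here, whereas a
                deformable attention module would learn to predict them.
    Returns the attended feature vector and the K attention weights.
    """
    C = len(query)
    # Shift each reference point by its offset and sample the feature there.
    sampled = [
        bilinear_sample(feat, ry + dy, rx + dx)
        for (ry, rx), (dy, dx) in zip(ref_points, offsets)
    ]
    # Scaled dot-product scores against the K sampled features only.
    scores = [
        sum(q * k for q, k in zip(query, s)) / math.sqrt(C) for s in sampled
    ]
    # Numerically stable softmax over the K points.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * s[c] for w, s in zip(weights, sampled)) for c in range(C)]
    return output, weights


if __name__ == "__main__":
    # Toy 4 x 4 map whose feature at (y, x) is simply [y, x].
    feat = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
    out, w = sparse_deformable_attention(
        query=[1.0, 0.0],
        feat=feat,
        ref_points=[(1, 1), (2, 2)],
        offsets=[(0.5, 0.0), (-0.5, 0.5)],
    )
    print(out, w)
```

In this sketch the cost of one query scales with the number of sampled points K rather than with the full H x W grid, which is the sense in which sparse attention can reduce computation relative to dense attention.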