Large Vision Models: How Transformer-based Models excelled over Traditional Deep Learning Architectures in Video Processing

Farah Aymen, Andreas Pester, Hanin Monir
DOI: https://doi.org/10.1109/AIRC61399.2024.10672087
2024-04-22
Abstract: Large vision models (LVMs), particularly vision transformers (ViTs), stand at the forefront of computer vision advancements, demonstrating exceptional capabilities in processing and understanding visual data at a large scale. These models, with their deep learning frameworks and extensive parameter spaces, excel in tasks from object detection to complex scene comprehension, surpassing traditional models like CNNs and GANs. This paper explores the progression of LVMs, emphasizing the advantages of ViTs in video summarization and prediction. It highlights the limitations of CNNs, including their vulnerability to adversarial attacks and difficulties with minor image variations, and commends ViTs for their effective handling of long-range dependencies through self-attention mechanisms. The paper also examines LVM applications in both supervised and unsupervised video summarization, and introduces multimodal approaches that integrate visual, textual, and audio data, underlining the superiority of ViTs in a variety of computer vision tasks due to their advanced learning capabilities.
Computer Science