ViTCA-Net: a framework for disease detection in video capsule endoscopy images using a vision transformer and convolutional neural network with a specific attention mechanism

Yassine Oukdach, Zakaria Kerkaou, Mohamed El Ansari, Lahcen Koutti, Ahmed Fouad El Ouafdi, Thomas De Lange
DOI: https://doi.org/10.1007/s11042-023-18039-1
IF: 2.577
2024-01-13
Multimedia Tools and Applications
Abstract: Video capsule endoscopy (VCE) is a non-invasive procedure for examining the human bowel. A VCE examination generates thousands of images from different parts of the gastrointestinal tract. Because reviewing these images is tedious and time-consuming for doctors, automated diagnosis of digestive diseases from VCE images is highly desirable. Most existing studies rely on CNN-based methods, which are not efficient enough at learning invariant global features in VCE images. This paper therefore presents a new framework that combines the learning of global and local features from VCE images. The proposed method uses a specific attention mechanism within a convolutional neural network to extract local features, while a vision transformer captures global features. The local and global features are then fused for final classification. Extensive experiments on the public Kvasir Capsule Endoscopy dataset show a promising accuracy of 97%. These results highlight the model's capabilities and compare favorably with state-of-the-art methods. Additionally, the proposed system achieved a recall of 85% on an unseen dataset, demonstrating robust generalization.
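The two-branch design described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the local and global features are fused, so the concatenation followed by a single linear layer used here is an assumption, not the authors' method. The feature sizes, the class count, the random stand-in features, and the function names (`softmax`, `fuse_and_classify`) are likewise hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(local_feats, global_feats, weights, bias):
    """Late fusion by concatenation (an assumed fusion rule), then one
    linear layer and a softmax to obtain class probabilities."""
    fused = list(local_feats) + list(global_feats)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


random.seed(0)
NUM_CLASSES = 3  # hypothetical; not the number of classes used in the paper

# Random stand-ins for the outputs of the two branches:
local_feats = [random.gauss(0, 1) for _ in range(4)]   # CNN + attention branch
global_feats = [random.gauss(0, 1) for _ in range(6)]  # vision transformer branch

# One weight row per class, sized to the concatenated feature vector (4 + 6).
weights = [[random.gauss(0, 0.1) for _ in range(10)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(local_feats, global_feats, weights, bias)
```

In a real model the two feature vectors would come from trained networks and the classifier weights would be learned jointly; the sketch only shows how the two representations could feed one classification head.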
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering