Unified deep learning model for multitask representation and transfer learning: image classification, object detection, and image captioning
Leta Yobsan Bayisa, Weidong Wang, Qingxian Wang, Chiagoziem C. Ukwuoma, Hirpesa Kebede Gutema, Ahmed Endris, Turi Abu
DOI: https://doi.org/10.1007/s13042-024-02177-5
2024-05-18
International Journal of Machine Learning and Cybernetics
Abstract: Deep learning has demonstrated impressive performance in computer vision tasks such as object detection, image classification, and image captioning. Although most models excel at a single vision or language task, designing one architecture that balances task specialization, performance, and adaptability across diverse tasks remains challenging. Integrating vision and language effectively requires combining text embeddings with visual representations so that the dependencies between the subareas can be captured across multiple tasks. This paper proposes a single architecture that handles various computer vision tasks and can be fine-tuned for other specific vision and language tasks. The proposed model employs a modified DenseNet201 as a feature extractor (network backbone), an encoder-decoder architecture, and a task-specific head for inference. To tackle overfitting and improve precision, enhanced data augmentation and normalization techniques are employed. The model's robustness is evaluated on more than five datasets across different tasks: image classification, object detection, image captioning, and adversarial attack and defense. The experimental results demonstrate performance competitive with other works on CIFAR-10, CIFAR-100, Flickr8, Flickr30, Caltech10, and other task-specific datasets such as OCT and BreakHis. The proposed model is flexible and easy to adapt to new tasks, and it can be extended to other vision and language tasks through fine-tuning with task-specific input indices.
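The shared-backbone, task-specific-head pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps stand in for the modified DenseNet201 backbone and the trained heads, the encoder-decoder stage is omitted, and every name here (`MultiTaskModel`, `make_linear`, `add_head`, the task keys, and the output sizes) is a hypothetical choice made only for illustration.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Callable[[Vector], Vector]:
    """Return a random linear map as a closure (toy stand-in for a trained layer)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class MultiTaskModel:
    """One shared feature extractor feeding several task-specific heads."""

    def __init__(self, in_dim: int, feat_dim: int, head_dims: Dict[str, int], seed: int = 0):
        self._rng = random.Random(seed)
        self.feat_dim = feat_dim
        # Assumption: a single linear map plays the role of the backbone.
        self.backbone = make_linear(in_dim, feat_dim, self._rng)
        self.heads = {
            task: make_linear(feat_dim, out_dim, self._rng)
            for task, out_dim in head_dims.items()
        }

    def features(self, x: Vector) -> Vector:
        """Shared representation: backbone output followed by a ReLU."""
        return [max(0.0, v) for v in self.backbone(x)]

    def forward(self, x: Vector, task: str) -> Vector:
        """Route the shared features to the head selected by the task key."""
        if task not in self.heads:
            raise KeyError(f"no head registered for task {task!r}")
        return self.heads[task](self.features(x))

    def add_head(self, task: str, out_dim: int) -> None:
        """Attach a head for a new task; the backbone is left untouched."""
        self.heads[task] = make_linear(self.feat_dim, out_dim, self._rng)


# Hypothetical output sizes: 10 class scores, 4 box coordinates, 50 vocabulary scores.
model = MultiTaskModel(
    in_dim=8,
    feat_dim=16,
    head_dims={"classification": 10, "detection": 4, "captioning": 50},
)
x = [0.1 * i for i in range(8)]
class_scores = model.forward(x, "classification")
box = model.forward(x, "detection")
```

In this sketch, adapting to a new task amounts to calling `add_head` with a new task key, which mirrors the abstract's claim about extension through task-specific indices only loosely; an actual system would also fine-tune the new head, and possibly the backbone, on task data.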
computer science, artificial intelligence