Abstract:This article investigates whether higher classification accuracy can always be achieved by utilizing a pre-trained deep learning model as the feature extractor in the Bag-of-Deep-Visual-Words (BoDVW) classification model, as opposed to directly using the new classification layer of the pre-trained model for classification. Considering the multiple factors related to the feature extractor -such as model architecture, fine-tuning strategy, number of training samples, feature extraction method, and feature encoding method—we investigate these factors through experiments and then provide detailed answers to the question. In our experiments, we use five feature encoding methods: hard-voting, soft-voting, locally constrained linear coding, super vector coding, and fisher vector (FV). We also employ two popular feature extraction methods: one (denoted as Ext-DFs(CP)) uses a convolutional or non-global pooling layer, and another (denoted as Ext-DFs(FC)) uses a fully-connected or global pooling layer. Three pre-trained models—VGGNet-16, ResNext-50(32×4d), and Swin-B—are utilized as feature extractors. Experimental results on six datasets (15-Scenes, TF-Flowers, MIT Indoor-67, COVID-19 CXR, NWPU-RESISC45, and Caltech-101) reveal that compared to using the pre-trained model with only the new classification layer re-trained for classification, employing it as the feature extractor in the BoDVW model improves the accuracy in 35 out of 36 experiments when using FV. With Ext-DFs(CP), the accuracy increases by 0.13% to 8.43% (averaged at 3.11%), and with Ext-DFs(FC), it increases by 1.06% to 14.63% (averaged at 5.66%). Furthermore, when all layers of the pre-trained model are fine-tuned and used as the feature extractor, the results vary depending on the methods used. If FV and Ext-DFs(FC) are used, the accuracy increases by 0.21% to 5.65% (averaged at 1.58%) in 14 out of 18 experiments. Our results suggest that while using a pre-trained deep learning model as the feature extractor does not always improve classification accuracy, it holds great potential as an accuracy improvement technique.

An evaluation of pre-trained models for feature extraction in image classification

Performance Evaluation of Deep Learning Classification Network for Image Features

Can using a pre-trained deep learning model as the feature extractor in the bag-of-deep-visual-words model always improve image classification accuracy?

Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval

One size does not fit all in evaluating model selection scores for image classification

Towards Inadequately Pre-trained Models in Transfer Learning

Examining the Use of Neural Networks for Feature Extraction: A Comparative Analysis using Deep Learning, Support Vector Machines, and K-Nearest Neighbor Classifiers

А classification of text as images using neural networks pre-trained on the imagenet

A Comparative Study of Deep Learning Classification Methods on a Small Environmental Microorganism Image Dataset (EMDS-6): From Convolutional Neural Networks to Visual Transformers

Pre-trained CNNs as Feature-Extraction Modules for Image Captioning

A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis

Disease Classification and Impact of Pretrained Deep Convolution Neural Networks on Diverse Medical Imaging Datasets across Imaging Modalities

What makes ImageNet good for transfer learning?

Are Natural Domain Foundation Models Useful for Medical Image Classification?

Image recognition in depth: comparative study of CNN and Pre-trained VGG16 architecture for classification tasks

Do Better ImageNet Models Transfer Better... for Image Recommendation?

Pretrained ViTs Yield Versatile Representations For Medical Images

Will Transfer Learning Enhance ImageNet Classification Accuracy Using ImageNet-Pretrained Models?

Large-scale pretraining on pathological images for fine-tuning of small pathological benchmarks

Plant recognition by AI: Deep neural nets, transformers, and kNN in deep embeddings