Classification modeling and recognition for cross modal and multi-label biomedical image
Yuhai Yu, Hongfei Lin, Jiana Meng, Hai Guo, Zhehuan Zhao
DOI: https://doi.org/10.11834/jig.170556
2018-01-01
Journal of Image and Graphics
Abstract: Objective The amount of biomedical literature in electronic format has increased considerably with the development of the Internet. PubMed comprises more than 27 million citations for biomedical literature, with links to full-text content from PubMed Central and publisher web sites. The figures in these biomedical studies can be retrieved through tools along with the full text. However, the lack of associated metadata, apart from the captions, hinders the fulfillment of the richer information requirements of biomedical researchers and educators. The modality of a figure is an extremely useful type of metadata. Therefore, biomedical modality classification is an important primary step that can help users access the biomedical images they require and further improve the performance of literature retrieval systems. Many images in the biomedical literature (more than 40%) are compound figures that include several subfigures with various biomedical modalities, such as computerized tomography, X-ray, or genetic biomedical illustrations. The subfigures in one compound figure may describe one medical problem from several views and have strong semantic correlation with each other. Thus, these figures are valuable to biomedical research and education. The standard approach to modality recognition from biomedical compound figures first detects whether a figure is compound. If it is compound, then a figure separation algorithm is invoked to split it into its constituent subfigures. Then, a multi-class classifier is used to predict the modality of each subfigure. Nevertheless, figure separation algorithms are not perfect, and the errors in figure separation propagate to the multi-class model for modality classification. Recently, some multi-label learning models have used pre-trained convolutional neural networks to extract high-level features for recognizing the image modalities in compound figures. These deep learning methods learn more expressive representations of image
data. However, limited samples with high variability and an imbalanced label distribution in the training data may hinder convolutional neural networks from disentangling the factors of variation. A new cross-modal multi-label classification model using convolutional neural networks based on hybrid transfer learning is presented to learn biomedical modality information from a compound figure without separating it into subfigures.

Method An end-to-end training and multi-label classification method, which does not require additional classifiers, is proposed. Two convolutional neural networks are built to learn the components of an image not from individually separated subfigures that each represent one modality, but from labeled compound figures and their captions. The proposed cross-modal model learns general domain features from large-scale natural images and more specific biomedical domain features from the simple figures and their captions in the biomedical literature, leveraging techniques of heterogeneous and homogeneous transfer learning. Specifically, the proposed visual convolutional neural network (CNN) is pre-trained on a large auxiliary dataset, which contains approximately 1.2 million labeled training images of 1000 classes. Then, the top layer of the deep CNN is trained from scratch on single-label simple biomedical figures to achieve homogeneous transfer learning. The key point of such transfer learning is fine-tuning the pre-trained deep visual models on the current multi-label compound figure dataset. The architecture of the deep visual models must be changed slightly before they can be fine-tuned on the current dataset. On the textual side, the weights of the embedding layer are initialized with word vectors, which are pre-trained on captions extracted from 300 000 biomedical articles in PubMed, and are updated while training the networks. Similar to the homogeneous transfer learning strategy of the visual model, the proposed textual convolutional
neural networks are first pre-trained on the captions of the simple biomedical figures. Then, the pre-trained textual model is fine-tuned on the current multi-label compound figures to capture more biomedical features. Finally, the cross-modal multi-label learning model combines the outputs of the visual and textual models to predict labels using a multi-stage fusion strategy.

Result The proposed cross-modal multi-label classification model based on hybrid transfer learning is evaluated on the dataset of the multi-label classification task in ImageCLEF2016. Following the evaluation criteria of the benchmark, the approach is assessed by multi-label Hamming Loss and Macro F1 Score. The two comparative models learn multi-label information only from visual content. They pre-train AlexNet on large-scale natural images. Then, DeCAF features are extracted from the pre-trained AlexNet and fed into an SVM classifier with a linear kernel. One comparative model predicts modalities by the highest SVM score, and the other predicts by the highest posterior probability. By introducing the homogeneous transfer learning technique, the visual model achieves 33.9% lower Hamming Loss and 100.3% higher Macro F1 Score, and the textual model also improves performance on the two metrics. Thus, the proposed cross-modal model achieves a Hamming Loss of 0.0157, similar to that of the state-of-the-art model, and a 52.5% higher Macro F1 Score, which increases from 0.320 to 0.488.

Conclusion A new method to extract biomedical modalities from compound figures is proposed. The proposed models obtain more competitive results than the other methods reported in the literature. The proposed cross-modal model exhibits acceptable generalization capability and could achieve higher performance. The results imply that the homogeneous transfer learning method can help deep convolutional neural networks (DCNNs) capture a larger number of biomedical domain features and improve the performance
of multi-label classification. The proposed cross-modal model addresses the problems of overfitting and imbalanced datasets and effectively recognizes modalities from biomedical compound figures based on visual content and textual information. In the future, building DCNNs and training networks with new techniques could further improve the proposed method.
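
The two metrics named in the Result section can be illustrated with a minimal Python sketch. It is not the benchmark's official scorer: the function names, the toy label matrices, and the convention of scoring a label with no true and no predicted positives as 0 are all assumptions made here for illustration. The last helper only checks the arithmetic behind the reported relative gain in Macro F1 Score (0.320 to 0.488).

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]  # rows = figures, columns = modality labels (0/1)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (figure, label) cells where prediction and truth disagree."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of the per-label F1 scores.

    Assumption: a label with no true and no predicted positives scores 0.
    """
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Hypothetical toy data: 4 compound figures, 3 modality labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(hamming_loss(y_true, y_pred))            # 3 wrong cells out of 12
print(round(macro_f1(y_true, y_pred), 4))      # mean of 1.0, 0.5 and 2/3
print(round(relative_change(0.320, 0.488), 3)) # reported Macro F1 gain
```

Lower Hamming Loss is better and higher Macro F1 Score is better. Because Macro F1 Score averages over labels without weighting by frequency, rare modalities count as much as common ones, which is why it is informative under the imbalanced label distribution mentioned in the abstract.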