Abstract: Background and objective: Medical imaging techniques are widely employed in disease diagnosis and treatment. A readily available medical report can help an expert assess a patient's health, and a radiologist preparing a final report can benefit from a system that automatically translates medical images into radiological reports. Previous attempts at automatic medical report generation applied image captioning algorithms without taking domain-specific visual and textual content into account, which raises questions about the credibility of the generated reports. Methods: In this work, a novel Adaptive Multilevel Multi-Attention (AMLMA) approach is proposed that uses domain-specific visual-textual knowledge to generate a thorough and believable radiological report for any view of a human chest X-ray image. The proposed approach builds on the encoder-decoder framework and incorporates multiple adaptive attention mechanisms. A convolutional neural network (CNN) with a residual attention module (RAM) is shown to be a strong visual encoder for multi-label abnormality detection. Multilevel visual features (local and global) are extracted from the proposed visual encoder to retrieve regional-level and abstract-level radiology-based semantic information. Word2Vec and FastText word embeddings are trained on medical reports to acquire radiological knowledge and are then used as textual encoders, whose output is fed to a Bi-directional Long Short-Term Memory (Bi-LSTM) network that learns the relationships between medical terms in radiological reports. AMLMA employs a weighted multilevel association of adaptive visual-semantic attention and visual-based linguistic attention mechanisms. This association of adaptive attention serves as the decoder and produces significant improvements in the report generation task.
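To make the idea of balancing visual and linguistic information more concrete, the following is a minimal sketch of soft attention over regional features followed by a gated blend with a linguistic context vector. It is a generic illustration under stated assumptions, not the AMLMA formulation: the dot-product scoring, the sigmoid gate, the helper names (`attend`, `adaptive_mix`), and the toy inputs are all hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product soft attention over a list of feature vectors.

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, features))
               for i in range(dim)]
    return weights, context


def adaptive_mix(visual_context, linguistic_context, gate_logit):
    """Blend two context vectors with a sigmoid gate (hypothetical helper).

    A gate near 1 favours the linguistic context; near 0, the visual one.
    """
    beta = 1.0 / (1.0 + math.exp(-gate_logit))
    return [beta * l + (1.0 - beta) * v
            for v, l in zip(visual_context, linguistic_context)]


# Toy inputs (made up for illustration, not taken from the paper).
local_features = [[1.0, 0.0], [0.0, 1.0]]  # two "regional" feature vectors
decoder_state = [1.0, 0.0]                 # stands in for the decoder query
linguistic_context = [0.0, 0.0]            # stands in for the language signal

weights, visual_context = attend(decoder_state, local_features)
mixed = adaptive_mix(visual_context, linguistic_context, gate_logit=0.0)
print(weights, mixed)
```

In a trained model the gate value would be predicted from the decoder state at each step, so that words tied to image content draw on the visual context while function words lean on the language signal. Here it is fixed at 0 only to keep the sketch small.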
Results: The proposed approach is evaluated on the publicly available Indiana University chest X-ray (IU-CXR) dataset. The CNN with RAM shows a significant improvement in recall (0.4423), precision (0.1803), and F1-score (0.2551) for predicting multiple abnormalities in an X-ray image. Language generation metrics for the proposed variants were computed using the COCO-caption evaluation application programming interface (API). The AMLMA model with trained embeddings generates convincing radiology reports and outperforms state-of-the-art (SOTA) approaches, with scores of 0.172 for BLEU-4, 0.247 for METEOR, 0.376 for ROUGE-L, and 0.381 for CIDEr. In addition, a new "Unique Index" (UI) statistic is introduced to highlight the model's ability to generate unique reports. Conclusion: The overall architecture aids in understanding various X-ray image views and in generating relevant normal and abnormal radiographic statements. The proposed model emphasizes multi-level visual-textual knowledge with an adaptive attention mechanism to balance visual and linguistic information when generating an admissible radiology report.
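As a quick sanity check on the classification figures above, the F1-score can be recomputed as the harmonic mean of the reported precision and recall. The sketch below does only that; the function name `f1_score` is a hypothetical helper. The harmonic mean of the two aggregate values comes out near 0.256, slightly above the reported 0.2551. One plausible explanation, stated here as an assumption and not something the abstract confirms, is that the reported F1 was averaged per label or per sample, not derived from the averaged precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract for the CNN with RAM.
reported_precision = 0.1803
reported_recall = 0.4423
reported_f1 = 0.2551

harmonic = f1_score(reported_precision, reported_recall)
gap = harmonic - reported_f1

print(f"harmonic mean of reported P and R: {harmonic:.4f}")
print(f"difference from reported F1:       {gap:+.4f}")
```

A gap of this size is typical when per-label scores are averaged, because the mean of per-label F1 values generally differs from the F1 of the mean precision and recall.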

Multi-modal transformer architecture for medical image analysis and automated report generation

VMEKNet: Visual Memory and External Knowledge Based Network for Medical Report Generation

Vision Transformer and Language Model Based Radiology Report Generation

Toward an enhanced automatic medical report generator based on large transformer models

Automated Radiographic Report Generation Purely on Transformer: A Multicriteria Supervised Approach

Automatic Radiology Report Generator Using Transformer With Contrast-Based Image Enhancement

Radiology Report Generation Using Transformers Conditioned with Non-imaging Data

Clinical Context-aware Radiology Report Generation from Medical Images using Transformers

Deep neural models for automated multi-task diagnostic scan management—quality enhancement, view classification and report generation

MATNet: Exploiting Multi-Modal Features for Radiology Report Generation

MMTN: Multi-Modal Memory Transformer Network for Image-Report Consistent Medical Report Generation

Generating Radiology Reports via Memory-driven Transformer

Automatic Generation of Chest X-ray Reports Using a Transformer-based Deep Learning Model

IIHT: Medical Report Generation with Image-to-Indicator Hierarchical Transformer

METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens

Understanding transfer learning for chest radiograph clinical report generation with modified transformer architectures

Translating medical image to radiological report: Adaptive multilevel multi-attention approach

Auxiliary signal-guided knowledge encoder-decoder for medical report generation

Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification

AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation