Abstract: Advances in computer vision and natural language processing are key to modern healthcare systems and their applications. Nonetheless, the two fields have largely been researched and used as separate technical entities, without integrating the predictive knowledge that emerges when they are combined. Such integration would benefit clinical and medical problems, which are inherently multimodal: they involve several distinct forms of data, such as images and text. Recent advances in machine learning have brought these fields closer through the notion of meta-transformers. At the core of this synergy are models that can process and relate information from multiple modalities: the raw input data from each modality are mapped into a shared token space, allowing an encoder to extract high-level semantic features of the input. Nevertheless, automatically identifying arguments in a clinical or medical text and finding their multimodal relationships remains challenging, because it relies not only on relevancy measures (e.g., how close the text is to other modalities, such as an image) but also on the evidence supporting that relevancy. Basing relevancy on evidence is normal in medicine, as medical practice is evidence-based. In this article, we experiment with a variety of fine-tuned medical meta-transformers, such as PubmedCLIP, CLIPMD, BiomedCLIP-PubMedBERT, and BioCLIP, to see which one provides evidence-based, relevant multimodal information. Our experimentation uses the TTi-Eval open-source platform to accommodate multimodal data embeddings. This platform simplifies the integration and evaluation of different meta-transformer models, as well as of a variety of datasets for testing and fine-tuning.
Additionally, we conduct experiments to test how relevant a multimodal prediction is to the published medical literature, especially literature indexed in PubMed. Our experiments revealed that the BiomedCLIP-PubMedBERT model provides more reliable evidence-based relevance than the other models, based on randomized samples from the ROCO V2 dataset or other multimodal datasets such as MedCat. In the next stage of this research, we are extending the winning evidence-based multimodal learning model with components that enable medical practitioners to use it to predict answers to clinical questions, based on a sound medical questioning protocol such as PICO and on standardized medical terminologies such as UMLS.
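
The relevancy measure mentioned in the abstract (how close a text is to an image in a shared embedding space) can be illustrated with a small sketch. It assumes that relevancy is scored by cosine similarity, which is a common choice for CLIP-style models but is not stated in the abstract. The vectors are hand-made toy values, not outputs of PubmedCLIP, BiomedCLIP-PubMedBERT, or any other model named here, and the function and variable names are hypothetical. The sketch covers only the relevancy score; it does not model the evidence that supports that relevancy.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_texts_for_image(
    image_embedding: Sequence[float],
    text_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every candidate text against one image, most similar first."""
    scored = [
        (text_id, cosine_similarity(image_embedding, embedding))
        for text_id, embedding in text_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumption: a real pipeline would obtain these from an
# image encoder and a text encoder that share one embedding space).
image_vec = [1.0, 0.0, 1.0]
candidates = {
    "caption_a": [1.0, 0.0, 1.0],  # same direction as the image vector
    "caption_b": [0.0, 1.0, 0.0],  # orthogonal to the image vector
    "caption_c": [1.0, 1.0, 0.0],  # partially aligned
}

ranking = rank_texts_for_image(image_vec, candidates)
for text_id, score in ranking:
    print(f"{text_id}: {score:.3f}")
```

In this toy setting the caption pointing in the same direction as the image vector ranks first and the orthogonal one ranks last. A similarity score of this kind captures closeness only, which is why the abstract treats supporting evidence as a separate requirement.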

Using Meta-Transformers for Multimodal Clinical Decision Support and Evidence-Based Medicine