Bora: Biomedical Generalist Video Generation Model

Weixiang Sun, Xiaocao You, Ruizhe Zheng, Zhengqing Yuan, Xiang Li, Lifang He, Quanzheng Li, Lichao Sun

2024-07-16

Abstract: Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for medical AI development. Diffusion models can now generate realistic images from text prompts, while recent advancements have demonstrated their ability to create diverse, high-quality videos. However, these models often struggle with generating accurate representations of medical procedures and detailed anatomical structures. This paper introduces Bora, the first spatio-temporal diffusion probabilistic model designed for text-guided biomedical video generation. Bora leverages Transformer architecture and is pre-trained on general-purpose video generation tasks. It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus, which includes paired text-video data from various biomedical fields. To the best of our knowledge, this is the first attempt to establish such a comprehensive annotated biomedical video dataset. Bora is capable of generating high-quality video data across four distinct biomedical domains, adhering to medical expert standards and demonstrating consistency and diversity. This generalist video generative model holds significant potential for enhancing medical consultation and decision-making, particularly in resource-limited settings. Additionally, Bora could pave the way for immersive medical training and procedure planning. Extensive experiments on distinct medical modalities such as endoscopy, ultrasound, MRI, and cell tracking validate the effectiveness of our model in understanding biomedical instructions and its superior performance across subjects compared to state-of-the-art generation models.

Computer Vision and Pattern Recognition, Image and Video Processing

What problem does this paper attempt to address?

The paper addresses the difficulty that existing generative models have in accurately representing medical procedures and detailed anatomical structures when generating medical videos. Although diffusion models can generate realistic images from text prompts, and recent work has shown that they can also create diverse, high-quality videos, their performance in the medical domain still falls short. The paper therefore introduces Bora, a spatio-temporal diffusion probabilistic model designed specifically for text-guided biomedical video generation. Bora is built on the Transformer architecture, pre-trained on general-purpose video generation tasks, and then fine-tuned through model alignment and instruction tuning on a newly established medical video corpus, so that it generates high-quality video that meets medical expert standards and shows consistency and diversity. This generalist video generation model has considerable potential for enhancing medical consultation and decision-making, especially in resource-limited settings, and could also pave the way for immersive medical training and procedure planning.

Endora: Video Generation Models as Endoscopy Simulators

Artificial Intelligence for Biomedical Video Generation

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

Annotated Biomedical Video Generation using Denoising Diffusion Probabilistic Models and Flow Fields

The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

SurGen: Text-Guided Diffusion Model for Surgical Video Generation

Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

How Far is Video Generation from World Model: A Physical Law Perspective

Application of transformer architectures in generative video modeling for neurosurgical education

MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis

Text-to-video generative artificial intelligence: sora in neurosurgery

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Open-Sora Plan: Open-Source Large Video Generation Model

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Interactive Generation of Laparoscopic Videos with Diffusion Models

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys

Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval