Abstract: In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, necessitating distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a singular model instance, thereby facilitating a more unified generative process. Additionally, we develop a high-quality image caption dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source MLLM to enhance visual text generation capabilities further. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable performance to modality-specific fine-tuning results with only a 2% increase in parameters and shows an average improvement of 2.5% in visual text comprehension tasks and 4.0% in visual text generation tasks. Our work delineates the viability of an integrated approach to multimodal generation within the visual text domain, setting a foundation for subsequent inquiries. Code is available at https://github.com/bytedance/TextHarmony.
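
The aggregation idea behind Slide-LoRA can be illustrated with a minimal sketch. The abstract states only that modality-specific and modality-agnostic LoRA experts are dynamically aggregated; everything else here is an assumption made for illustration: the class name `SlideLoRALinearSketch`, the use of exactly one text expert, one image expert, and one shared expert, the scalar sigmoid gate, and the random initialization. It is not the released implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank update: delta(x) = B (A x), with A of shape (rank, d_in), B of shape (d_out, rank)."""

    def __init__(self, d_in, d_out, rank, rng):
        # Assumption: LoRA commonly initializes B to zeros; random values are
        # used here only so that the experts differ in this toy example.
        self.A = rand_matrix(rank, d_in, rng)
        self.B = rand_matrix(d_out, rank, rng)

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class SlideLoRALinearSketch:
    """Hypothetical linear layer mixing modality-specific and shared LoRA experts."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)          # frozen base weight
        self.text_expert = LoRAExpert(d_in, d_out, rank, rng)    # modality-specific
        self.image_expert = LoRAExpert(d_in, d_out, rank, rng)   # modality-specific
        self.shared_expert = LoRAExpert(d_in, d_out, rank, rng)  # modality-agnostic
        # Assumed gate: one weight vector followed by a sigmoid.
        self.gate_w = [rng.uniform(-0.1, 0.1) for _ in range(d_in)]

    def gate(self, x):
        """Return a weight in (0, 1): near 1 favors the text expert, near 0 the image expert."""
        z = sum(w * xi for w, xi in zip(self.gate_w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def forward(self, x):
        g = self.gate(x)
        base = matvec(self.W, x)
        shared = self.shared_expert(x)
        text = self.text_expert(x)
        image = self.image_expert(x)
        # The shared expert is always applied; the two specific experts are blended by the gate.
        return [b + s + g * t + (1.0 - g) * i
                for b, s, t, i in zip(base, shared, text, image)]


layer = SlideLoRALinearSketch(d_in=4, d_out=3, rank=2)
x = [1.0, 0.5, -0.5, 2.0]
y = layer.forward(x)
print(len(y), layer.gate(x))
```

In this sketch the base weight stays shared across modalities, and only the low-rank experts and the gate add parameters, which is the property that keeps the overhead of such a design small.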

Enhancing Consistency with the Fusion of Paralleled Decoders for Text Generation

A Simple, Fast Diverse Decoding Algorithm for Neural Generation

A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation

Incorporating Consistency Verification into Neural Data-to-Document Generation

Consistency and Coherency Enhanced Story Generation

Generating Long and Coherent Text with Multi-Level Generative Adversarial Networks

PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model

Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence

Enhancing Text Generation with Cooperative Training

TILGAN: transformer-based implicit latent GAN for diverse and coherent text generation

Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition

Optimizing Multi-feature Dependent Story Generation Model

DMF-GAN: Deep Multimodal Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Harmonizing Visual Text Comprehension and Generation

Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Cascaded Text Generation with Markov Transformers

A Semantically Consistent and Syntactically Variational Encoder-Decoder Framework for Paraphrase Generation

DE-GAN: Text-to-image Synthesis with Dual and Efficient Fusion Model

Fuse It More Deeply! A Variational Transformer with Layer-Wise Latent Variable Inference for Text Generation