PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
2024-11-24
Abstract: Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at https://github.com/0606zt/PanoLlama.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two main problems in Panoramic Image Generation (PIG):

1. **Multilevel-Coherence Challenge**: The goal is to keep a panorama coherent both in its low-level features (such as color, texture, and edges) and in its high-level features (such as layout, structure, and semantics). Diffusion models struggle both to define this multilevel coherence and to balance its levels against each other.
2. **Implementation Complexity**: Diffusion models require complex algorithm designs to coordinate the denoising paths between different image patches, which affects the stability and scalability of the system.

To solve these problems, the paper proposes the PanoLlama framework, which redefines panoramic image generation as a next-token prediction task. Specifically:

- **New Paradigm**: By leveraging pre-trained autoregressive (AR) models, PanoLlama generates high-quality panoramas in a training-free manner, avoiding the multilevel-coherence and implementation-complexity problems of diffusion-based approaches.
- **Speed Up**: PanoLlama does not need to perform time-consuming denoising iterations and optimization processes, which significantly increases generation speed.
- **Versatile Applications**: Besides text-to-panorama generation, PanoLlama also supports multi-scale, multi-layout, and multi-guidance generation, giving it greater flexibility and adaptability.

In summary, by introducing autoregressive models and sequence-generation methods, PanoLlama addresses the limitations of existing diffusion models in panoramic image generation and provides a more efficient and flexible solution.
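To make the idea of crop-wise, training-free expansion more concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a pre-trained AR model can be treated as a function from a token context to the next token, and that this model only handles a fixed-width window of the token grid. The names `toy_predict_next`, `expand_token_grid`, `VOCAB_SIZE`, and the `stride` parameter are all hypothetical and introduced only for illustration; the real PanoLlama expansion strategy may differ in how it orders and conditions tokens.

```python
from typing import Callable, List, Optional

VOCAB_SIZE = 1024  # hypothetical codebook size, for illustration only


def toy_predict_next(context: List[int]) -> int:
    """Hypothetical stand-in for a pre-trained AR model.

    A real model would sample from a learned distribution; this toy
    version just hashes the context so the example is deterministic.
    """
    return (sum(context) * 31 + len(context) * 7 + 1) % VOCAB_SIZE


def expand_token_grid(
    height: int,
    total_width: int,
    window_width: int,
    stride: int,
    predict_next: Callable[[List[int]], int],
) -> List[List[int]]:
    """Fill a height x total_width token grid crop by crop.

    Each crop spans at most `window_width` columns. Tokens already
    produced by an earlier crop are reused as context; only the missing
    tokens are predicted, in raster order within the crop.
    """
    if height <= 0 or total_width <= 0:
        raise ValueError("height and total_width must be positive")
    if not 0 < stride <= window_width:
        raise ValueError("stride must satisfy 0 < stride <= window_width")

    grid: List[List[Optional[int]]] = [
        [None] * total_width for _ in range(height)
    ]
    start = 0
    while True:
        end = min(start + window_width, total_width)
        context: List[int] = []  # context is restricted to the current crop
        for row in range(height):
            for col in range(start, end):
                if grid[row][col] is None:
                    grid[row][col] = predict_next(context)
                context.append(grid[row][col])
        if end == total_width:
            break
        start += stride  # slide the crop; overlap is window_width - stride
    return grid  # every cell has been filled at this point


if __name__ == "__main__":
    panorama = expand_token_grid(4, 20, 8, 4, toy_predict_next)
    for token_row in panorama:
        print(token_row)
```

In this sketch the context passed to the model never exceeds one crop, so the target width can grow without retraining; the overlap between consecutive crops is what carries information across crop boundaries. Decoding the token grid back to pixels is out of scope here.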