Mamba-ST: State Space Model for Efficient Style Transfer

Filippo Botti, Alex Ergasti, Leonardo Rossi, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

2024-09-16

Abstract: The goal of style transfer is, given a content image and a style source, to generate a new image that preserves the content but adopts the artistic representation of the style source. Most state-of-the-art architectures use transformers or diffusion-based models to perform this task, despite their heavy computational burden. In particular, transformers use self- and cross-attention layers, which have a large memory footprint, while diffusion models require long inference times. To overcome these limitations, this paper explores a novel design of Mamba, an emergent State-Space Model (SSM), called Mamba-ST, to perform style transfer. To do so, we adapt the Mamba linear equation to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output, while drastically reducing memory usage and time complexity. We modify Mamba's inner equations so that they accept inputs from, and combine, two separate data streams. To the best of our knowledge, this is the first attempt to adapt the equations of SSMs to a vision task like style transfer without requiring any other module such as cross-attention or custom normalization layers. An extensive set of experiments demonstrates the superiority and efficiency of our method in performing style transfer compared to transformers and diffusion models. Results show improved quality in terms of both ArtFID and FID metrics. Code is available at https://github.com/FilippoBotti/MambaST.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the high memory consumption and long inference time of existing style transfer methods based on transformers and diffusion models. The authors propose a new architecture, Mamba-ST, which performs style transfer by modifying the internal equations of the Mamba State Space Model (SSM), reducing memory usage and inference time while maintaining high quality. Specifically:

1. **Background and Motivation**: Current state-of-the-art style transfer methods mostly rely on transformers or diffusion models. Transformers are effective but consume a large amount of memory; diffusion models can generate high-quality images but have a long inference time.
2. **Solution**: The authors propose Mamba-ST, a new architecture based on Mamba. By modifying Mamba's internal matrices, it simulates the function of cross-attention layers and so integrates style and content information directly, without additional modules such as Adaptive Layer Normalization (AdaLN).
3. **Experimental Results**: Mamba-ST improves on existing transformer and diffusion models in terms of the ArtFID and FID metrics, and also shows advantages in inference time and memory usage.

In summary, the paper aims to improve both the efficiency and the quality of style transfer through a lightweight approach.
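To make the idea of an SSM that combines two data streams more concrete, the following is a minimal sketch of a selective state-space recurrence driven by two token sequences. It is not the authors' implementation. The routing is an assumption made by analogy with cross-attention (queries from content, keys and values from style): here `B_t`, the step size `delta`, and the input come from the style token, while `C_t` comes from the content token. The function name `two_stream_scan`, the projection names `W_B`, `W_C`, `w_delta`, and all sizes are hypothetical.

```python
import math
import random


def softplus(x):
    # Smooth positive map, used to keep the step size delta > 0.
    return math.log1p(math.exp(x))


def matvec(W, v):
    # Dense matrix-vector product on plain Python lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def two_stream_scan(content, style, A, W_B, W_C, w_delta):
    """Selective SSM recurrence driven by two token sequences.

    content, style: lists of L tokens, each a list of D floats.
    A: D x N matrix of negative decay rates (one row per channel).
    W_B, W_C: N x D projections; w_delta: D weights for a scalar step size.
    Assumption (not from the paper): B_t, delta and the input x_t come from
    the style token, while C_t comes from the content token.
    """
    D, N = len(A), len(A[0])
    h = [[0.0] * N for _ in range(D)]      # hidden state, fixed size D x N
    outputs = []
    for c_t, s_t in zip(content, style):
        B_t = matvec(W_B, s_t)             # style decides what is written
        C_t = matvec(W_C, c_t)             # content decides what is read
        delta = softplus(sum(w * s for w, s in zip(w_delta, s_t)))
        y_t = []
        for d in range(D):
            for n in range(N):
                decay = math.exp(delta * A[d][n])   # discretised state matrix
                h[d][n] = decay * h[d][n] + delta * B_t[n] * s_t[d]
            y_t.append(sum(C_t[n] * h[d][n] for n in range(N)))
        outputs.append(y_t)
    return outputs


def random_matrix(rows, cols, rng, scale=0.5):
    # Hypothetical stand-in for learned weights or token embeddings.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
L, D, N = 6, 4, 3
content = random_matrix(L, D, rng)
style = random_matrix(L, D, rng)
A = [[-(n + 1.0) for n in range(N)] for _ in range(D)]
W_B, W_C = random_matrix(N, D, rng), random_matrix(N, D, rng)
w_delta = [rng.uniform(-0.5, 0.5) for _ in range(D)]

stylised = two_stream_scan(content, style, A, W_B, W_C, w_delta)
print(len(stylised), len(stylised[0]))  # prints "6 4"
```

The scan makes a single pass over the sequence and keeps a hidden state of fixed size `D x N`, so the work grows linearly with the number of tokens and no `L x L` attention map is ever stored. This is the property the summary above refers to when it contrasts SSMs with cross-attention layers.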

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

StyleMamba : State Space Model for Efficient Text-driven Image Style Transfer

ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank

Correlation-based and Content-Enhanced Network for Video Style Transfer

Learning Structure-Aware Transformations for Arbitrary Image Style Transfer

Diverse Image Style Transfer Via Invertible Cross-Space Mapping

Preserving Structural Consistency in Arbitrary Artist and Artwork Style Transfer

DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

Two Birds, One Stone: A Unified Framework for Joint Learning of Image and Video Style Transfers

A non-definitive auto-transfer mechanism for arbitrary style transfers

LCCStyle: Arbitrary Style Transfer with Low Computational Complexity

Any-to-Any Style Transfer: Making Picasso and Da Vinci Collaborate

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

Style Mixer: Semantic-aware Multi-Style Transfer Network

Name Your Style: An Arbitrary Artist-aware Image Style Transfer

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

DiffStyler: Diffusion-based Localized Image Style Transfer