MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling

Xin Xu, Zhouchen Lin
DOI: https://doi.org/10.3233/faia240593
2024-01-01
Abstract: Sequence modeling is a critical task in domains such as natural language processing, speech recognition, and time series analysis. Existing models still struggle to capture long-range dependencies and to model sequences efficiently. This paper proposes MixCon, a novel hybrid sequence modeling architecture, to address these challenges. The MixCon (Mixture of Conba) architecture combines an attention-based Transformer layer, a Conba layer, and a Mixture of Experts (MoE) module. We apply this idea to the design of the attention mechanism, achieving significant improvements in computational efficiency. Additionally, the MixCon architecture integrates feedback and adaptive control mechanisms inspired by control theory, providing a new perspective on and approach to sequence modeling. Experimental results demonstrate MixCon’s superior throughput: it outperforms Mixtral by 4.5 times and Jamba by 1.5 times when processing long sequences of up to 128K tokens on a single A800 80GB GPU. Moreover, MixCon achieves top-tier scores on academic benchmarks, including 87.9% on HellaSwag and 83.4% on WinoGrande, demonstrating its capability on complex sequence modeling tasks.