STAR: Synthesis of Tailored Architectures

Armin W. Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, Michael Poli
2024-11-27
Abstract: Iterative improvement of model architectures is fundamental to deep learning: Transformers first enabled scaling, and recent advances in model hybridization have pushed the quality-efficiency frontier. However, optimizing architectures remains challenging and expensive. Current automated or manual approaches fall short, largely due to limited progress in the design of search spaces and due to the simplicity of resulting patterns and heuristics. In this work, we propose a new approach for the synthesis of tailored architectures (STAR). Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.
Machine Learning, Artificial Intelligence, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper addresses the limitations of existing approaches to model architecture optimization, in particular weaknesses in search-space design and in the architecture patterns those approaches produce. Current automated and manual methods face the following challenges when optimizing deep learning architectures:

1. **Limited search space**: Progress in search-space design has been limited, so the resulting architecture patterns and heuristics are relatively simple and miss a wider range of potential improvements.
2. **High optimization cost**: Optimizing model architectures is both challenging and expensive, especially when several quality and efficiency metrics must be considered at once.
3. **Lack of a unified framework**: Current methods do not provide a unified framework that significantly improves quality and efficiency across domains and objectives.

To address these problems, the authors propose STAR (Synthesis of Tailored Architectures), a method for synthesizing tailored model architectures. Its main features are:

- **A new search space based on linear input-varying (LIV) systems**: STAR uses the theory of LIV systems to define its building blocks, and the search space supports a hierarchical numerical encoding into architecture genomes. The LIV framework can describe most of the computational units used in deep learning, such as attention, linear recurrences, and convolutions.
- **Gradient-free evolutionary optimization**: Architecture genomes are automatically refined and recombined with gradient-free evolutionary algorithms to optimize multiple model quality and efficiency metrics.
- **Diverse computational units and interconnection patterns**: STAR optimizes large populations of new architectures that draw on diverse computational units and interconnection patterns, improving over highly optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.
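
To make the LIV idea concrete, the following sketch writes single-head causal self-attention as an input-dependent linear operator applied to the values: the mixing matrix is computed from the input, and the output is linear in the values once that matrix is fixed. This is an illustrative NumPy example, not the paper's implementation; the function names (`attention_operator`, `liv_attention`), the weight shapes, and the random inputs are all assumptions made for the example.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_operator(x, w_q, w_k):
    """Build the (seq, seq) causal mixing matrix T(x) from the input itself."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future positions so each token only mixes past and current tokens.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)


def liv_attention(x, w_q, w_k, w_v):
    """y = T(x) @ (x @ w_v): linear in the values, with T depending on x."""
    t = attention_operator(x, w_q, w_k)
    return t @ (x @ w_v)


# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, dim = 4, 8
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
y = liv_attention(x, w_q, w_k, w_v)
```

Other operators mentioned above, such as convolutions and linear recurrences, fit the same pattern with a differently structured `T`; how STAR parameterizes and encodes these choices is specified in the paper, not in this sketch.
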
### Specific problem description

1. **Limitations of existing methods**:
   - Automated methods are constrained by predefined search spaces and struggle to deliver significant quality and efficiency improvements across multiple domains and objectives.
   - Manual design has produced some results, but it is confined to relatively basic design patterns and requires substantial resources, expertise, and time.
2. **Optimization objectives**:
   - Improve quality (e.g., perplexity).
   - Reduce the number of parameters.
   - Decrease the inference cache size (KV cache and fixed-state cache).
3. **Solutions**:
   - By combining LIV theory with evolutionary algorithms, STAR designs a search space that is both comprehensive and well-conditioned.
   - By optimizing LIV architectures at different levels, STAR can reduce parameter count and cache size while maintaining performance.

### Experimental verification

To verify the effectiveness of STAR, the authors conducted multiple experiments, including:

- Evaluating STAR-optimized architectures on autoregressive language modeling tasks.
- Comparing the performance of different evolutionary algorithms (such as the firefly algorithm, a genetic algorithm, and NSGA-2) when optimizing STAR genomes.
- Exploring optimization at different scales (for example, reduced depth or width) to lower computational cost.

The experimental results show that STAR-optimized architectures outperform existing Transformers and hybrid models on multiple evaluation metrics, demonstrating STAR's potential for improving model quality and efficiency.
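
Because the objectives above (quality, parameter count, cache size) trade off against one another, multi-objective evolutionary methods such as NSGA-2 rank candidates by Pareto dominance rather than by a single score. The sketch below shows only that dominance check and the resulting non-dominated set. It is a minimal illustration under stated assumptions: the `Candidate` type, its field names, and all numbers are made up for the example and are not results from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # All objectives are "lower is better". Values used below are invented.
    name: str
    perplexity: float
    params_m: float
    cache_mb: float


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    a_obj, b_obj = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(a_obj, b_obj))
    strictly_better = any(x < y for x, y in zip(a_obj, b_obj))
    return no_worse and strictly_better


def pareto_front(population: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in population
        if not any(dominates(other, c) for other in population)
    ]


population = [
    Candidate("a", perplexity=10.0, params_m=125.0, cache_mb=64.0),
    Candidate("b", perplexity=9.8, params_m=120.0, cache_mb=32.0),
    Candidate("c", perplexity=9.5, params_m=130.0, cache_mb=48.0),
    Candidate("d", perplexity=10.5, params_m=110.0, cache_mb=16.0),
]
front = pareto_front(population)
print([c.name for c in front])  # → ['b', 'c', 'd']
```

Here `a` is dropped because `b` is better on all three objectives, while `b`, `c`, and `d` each win on at least one objective and so remain on the front. A full NSGA-2 loop would add mutation, recombination, and crowding-distance selection on top of this ranking.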