Abstract: Iterative improvement of model architectures is fundamental to deep learning: Transformers first enabled scaling, and recent advances in model hybridization have pushed the quality-efficiency frontier. However, optimizing architectures remains challenging and expensive. Current automated or manual approaches fall short, largely due to limited progress in the design of search spaces and due to the simplicity of resulting patterns and heuristics. In this work, we propose a new approach for the synthesis of tailored architectures (STAR). Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.

What problem does this paper attempt to address?

The paper addresses the limitations of existing approaches to model architecture optimization, in particular weak search-space design and the simplicity of the patterns those search spaces produce. Current automated and manual methods face the following challenges when optimizing deep learning architectures:

1. **Limited search space**: Progress in search-space design has been limited, so the resulting architecture patterns and heuristics are relatively simple and cover only a narrow range of potential improvements.
2. **High optimization cost**: Optimizing model architectures is both challenging and costly, especially when multiple quality and efficiency metrics must be considered together.
3. **Lack of a unified framework**: Current methods do not provide a unified framework that significantly improves quality and efficiency across domains and objectives.

To solve these problems, the authors propose STAR (Synthesis of Tailored Architectures), a method for synthesizing tailored model architectures. Its main features are:

- **A new search space based on linear input-varying (LIV) systems**: STAR uses LIV theory to design building blocks and supports a hierarchical numerical encoding of architectures into genomes. The LIV framework can describe most of the computational units used in deep learning, such as attention mechanisms, linear recurrences, and convolutions.
- **Automated optimization with evolutionary algorithms**: Architecture genomes are automatically refined and recombined by gradient-free evolutionary algorithms to optimize multiple model quality and efficiency metrics.
- **Diverse computational units and interconnection patterns**: STAR optimizes large populations of new architectures that draw on diverse computational units and interconnection patterns, outperforming highly optimized Transformers and striped hybrid models in autoregressive language modeling.

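
To make the LIV idea above concrete, the following minimal NumPy sketch writes two familiar units in the form y = T(x) · u, where T(x) is a token-mixing matrix that may depend on the input. This is an illustration under assumptions, not the paper's implementation: the function names (`attention_operator`, `conv_operator`), the shapes, and the random weights are all hypothetical. Causal softmax attention yields an operator that changes with the input, while a causal convolution yields a fixed Toeplitz operator, which can be read as the input-independent special case.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_operator(x, w_q, w_k):
    """Return the (seq, seq) mixing matrix T(x) of causal softmax attention.

    The matrix is computed from the input x, so the operator is input-varying.
    """
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    causal = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(causal, scores, -np.inf)  # block attention to future tokens
    return softmax(scores, axis=-1)


def conv_operator(seq_len, kernel):
    """Return the Toeplitz matrix of a causal 1-D convolution.

    This operator does not depend on the input at all.
    """
    t = np.zeros((seq_len, seq_len))
    for lag, weight in enumerate(kernel):
        t += weight * np.eye(seq_len, k=-lag)  # entry (i, i - lag) gets `weight`
    return t


# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, dim = 6, 4
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

t_attn = attention_operator(x, w_q, w_k)
y_attn = t_attn @ (x @ w_v)  # T(x) applied linearly to the value projection

t_conv = conv_operator(seq_len, kernel=[0.5, 0.3, 0.2])
y_conv = t_conv @ x          # fixed T applied linearly to the input
```

In both cases the output is a linear map applied to a projection of the input; what differs is how the map itself is produced. That shared structure is what lets a single search space cover units as different as attention and convolution.
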
### Specific problem description

1. **Limitations of existing methods**:
   - Automated methods are constrained by predefined search spaces and struggle to deliver significant quality and efficiency improvements across multiple domains and objectives.
   - Manual design has produced results, but it is limited to relatively basic design patterns and requires substantial resources, expertise, and time.
2. **Optimization objectives**:
   - Improve quality (e.g., lower perplexity).
   - Reduce the number of parameters.
   - Decrease the inference cache size (KV cache and fixed-state cache).
3. **Solutions**:
   - By combining LIV theory with evolutionary algorithms, STAR designs a search space that is both comprehensive and well-conditioned.
   - By optimizing LIV architectures at different levels, STAR can reduce the number of parameters and the cache size while maintaining quality.

### Experimental verification

To verify the effectiveness of STAR, the authors conducted several experiments, including:

- Evaluating STAR-optimized architectures on autoregressive language modeling tasks.
- Comparing the performance of different evolutionary algorithms (such as the firefly algorithm, a genetic algorithm, and NSGA-2) when optimizing STAR genomes.
- Exploring different optimization scales (such as reduced depth or width) to reduce computational cost.

The experimental results show that STAR-optimized architectures outperform existing Transformers and hybrid models on multiple evaluation metrics, demonstrating the method's potential for improving model quality and efficiency.
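
Optimizing quality, parameter count, and cache size at once means there is usually no single best architecture, only a set of trade-offs. Multi-objective evolutionary algorithms such as NSGA-2 rank candidates by Pareto dominance, and the sketch below shows just that dominance filter in plain Python. It is not the paper's procedure and not a full NSGA-2 (there is no front ranking beyond the first front, no crowding distance, no mutation or recombination). The `Candidate` type, the architecture names, and all numbers are made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record of one evaluated architecture (all objectives minimized)."""
    name: str
    perplexity: float  # quality: lower is better
    params_m: float    # parameter count in millions
    cache_mb: float    # inference cache size in megabytes


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    objs_a, objs_b = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(objs_a, objs_b))
    strictly_better = any(x < y for x, y in zip(objs_a, objs_b))
    return no_worse and strictly_better


def pareto_front(population: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in population
        if not any(dominates(other, c) for other in population)
    ]


# Illustrative numbers only; these are not results from the paper.
population = [
    Candidate("arch_a", perplexity=20.0, params_m=125, cache_mb=100),
    Candidate("arch_b", perplexity=19.4, params_m=150, cache_mb=60),
    Candidate("arch_c", perplexity=19.6, params_m=110, cache_mb=10),
    Candidate("arch_d", perplexity=19.5, params_m=130, cache_mb=70),
    Candidate("arch_e", perplexity=21.0, params_m=140, cache_mb=120),
]

front = pareto_front(population)
print([c.name for c in front])  # ['arch_b', 'arch_c', 'arch_d']
```

Here `arch_a` and `arch_e` are discarded because another candidate matches or beats them on every objective. The three survivors each win on at least one objective against every other candidate, so choosing among them depends on which trade-off matters for the deployment at hand.
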

STAR: Synthesis of Tailored Architectures
