Abstract: Transformers have shown remarkable performance; however, designing their architecture is a time-consuming process that demands expertise and trial and error. It is therefore worthwhile to investigate efficient methods for automatically searching for high-performance Transformers via Transformer Architecture Search (TAS). To improve search efficiency, training-free proxy-based methods have been widely adopted in Neural Architecture Search (NAS). However, these proxies have been found to generalize poorly to Transformer search spaces, as confirmed by several studies and our own experiments. This paper presents an efficient scheme for TAS called TRansformer Architecture search with ZerO-cost pRoxy guided evolution (T-Razor). First, through theoretical analysis, we find that the synaptic diversity of multi-head self-attention (MSA) and the synaptic saliency of the multi-layer perceptron (MLP) are correlated with the performance of the corresponding Transformers. These properties motivate us to introduce the ranks of synaptic diversity and saliency, denoted DSS++, for evaluating and ranking Transformers. DSS++ incorporates correlation information among sampled Transformers to provide unified scores for both synaptic diversity and synaptic saliency. We then propose a block-wise evolution search guided by DSS++ to find optimal Transformers. DSS++ determines the positions for mutation and crossover, enhancing the exploration ability. Experimental results demonstrate that T-Razor performs competitively against state-of-the-art manually or automatically designed Transformer architectures across four popular Transformer search spaces. Notably, T-Razor improves search efficiency across different Transformer search spaces, e.g., reducing the required GPU days from more than 24 to less than 0.4, and outperforms existing zero-cost approaches. We also apply T-Razor to the BERT search space and find that the searched Transformers achieve competitive GLUE results on several Natural Language Processing (NLP) datasets. This work provides insights into training-free TAS, revealing the usefulness of evaluating Transformers based on the properties of their different blocks.
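
The idea of scoring candidates by the ranks of two proxies, rather than by their raw values, can be illustrated with a minimal Python sketch. This is not the DSS++ formulation from the paper: the function names `rank_positions` and `combined_rank_score`, the plain sum of ranks, and the toy proxy values are all assumptions made for illustration. The sketch only shows why ranking within a sampled population puts two proxies with very different scales on a common footing.

```python
def rank_positions(scores):
    """Return the rank of each score within the population (0 = lowest).

    Ties are broken by original index, because sorted() is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def combined_rank_score(diversity, saliency):
    """Hypothetical unified score: sum of the two per-proxy ranks.

    Assumption: higher raw proxy values are better for both proxies.
    """
    d_ranks = rank_positions(diversity)
    s_ranks = rank_positions(saliency)
    return [d + s for d, s in zip(d_ranks, s_ranks)]


# Toy proxy values for three sampled candidates (made-up numbers).
diversity = [0.2, 1.5, 0.9]     # e.g. a diversity proxy measured on MSA blocks
saliency = [10.0, 30.0, 20.0]   # e.g. a saliency proxy measured on MLP blocks

scores = combined_rank_score(diversity, saliency)
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

Because only the ordering within the sampled population matters, the two proxies contribute equally here regardless of their raw magnitudes; how the paper weights or correlates the two ranks is not specified in the abstract.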

Training-Free Transformer Architecture Search With Zero-Cost Proxy Guided Evolution