Abstract: This paper investigates the under-explored area of low-rank weight training for large-scale Conformer-based speech recognition models from scratch. Our study demonstrates the viability of this training paradigm for such models, yielding several notable findings. Firstly, we discover that applying a low-rank structure exclusively to the attention modules can unexpectedly enhance performance, even with a significant rank reduction of 12%. In contrast, feed-forward layers present greater challenges, as they begin to exhibit performance degradation with a moderate 50% rank reduction. Furthermore, we find that both initialization and layer-wise rank assignment play critical roles in successful low-rank training. Specifically, employing SVD initialization and linear layer-wise rank mapping significantly boosts the efficacy of low-rank weight training. Building on these insights, we introduce the Low-Rank Speech Model from Scratch (LR-SMS), an approach that achieves performance parity with full-rank training while delivering substantial reductions in parameter count (by at least 2x) and training time speedups (by 1.3x for ASR and 1.15x for AVSR).
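
To make the parameter arithmetic behind the abstract concrete, here is a short worked count. It assumes the standard two-factor form of a low-rank layer; the symbols $W$, $U$, $V$, $m$, $n$, $d$, and $r$ are illustrative and are not taken from the paper, whose exact factorization is not stated here.

$$
W \in \mathbb{R}^{m \times n} \;\longrightarrow\; W \approx U V, \qquad U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{r \times n}
$$

$$
\text{full-rank parameters} = mn, \qquad \text{low-rank parameters} = r(m + n), \qquad \text{compression ratio} = \frac{mn}{r(m+n)}
$$

For a square layer with $m = n = d$, the ratio is $d / (2r)$. Under this assumption the factored layer only saves parameters when $r < d/2$, and a 2x reduction for that layer requires $r \le d/4$.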

What problem does this paper attempt to address?

The core problem this paper addresses is: **how to train large-scale Conformer-based speech recognition models from scratch with low-rank weights, so that parameter count and training time drop substantially without sacrificing performance**.

### Specific problem description:

1. **Challenges in applying low-rank structures**: Previous studies have shown that directly applying low-rank structures when training large-scale neural networks from scratch usually impairs model performance. A method is therefore needed that makes low-rank training effective without hurting performance.
2. **Differences in how layers tolerate low-rank structures**: Different model layers (such as the multi-head self-attention module and the feed-forward network layer) tolerate low-rank structures differently. For example, applying a low-rank structure only to the attention modules can improve performance, while feed-forward layers begin to degrade at a moderate 50% rank reduction. A strategy is therefore needed to choose the low-rank configuration of each layer type.
3. **The importance of initialization and layer-wise rank assignment**: Appropriate initialization (such as SVD initialization) and layer-wise rank assignment (such as linear layer-wise rank mapping) are crucial for successful low-rank training. These techniques significantly improve its effectiveness.
4. **The practical effects of low-rank training**: By introducing the Low-Rank Speech Model from Scratch (LR-SMS) method, the authors aim to show that low-rank training can reduce the number of parameters by at least 2x while maintaining performance comparable to full-rank training, and speed up training (1.3x for ASR tasks and 1.15x for AVSR tasks).

### Summary:

This paper systematically studies and verifies the feasibility of low-rank training from scratch for large-scale automatic speech recognition (ASR) and audio-visual speech recognition (AVSR) models. In particular, it examines how techniques such as SVD initialization and layer-wise rank assignment can preserve performance while improving efficiency.
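
The two techniques named above, SVD initialization and linear layer-wise rank mapping, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `svd_init` and `linear_layer_ranks`, the square-root split of singular values between the two factors, and the direction of the rank schedule (here decreasing with depth in the usage example) are all hypothetical choices that the text above does not specify.

```python
import numpy as np


def svd_init(weight, rank):
    """Factor `weight` (out x in) into left (out x rank) and right (rank x in).

    The product left @ right is the truncated-SVD rank-`rank` approximation
    of `weight`. Splitting the singular values evenly (square root on each
    side) is an assumption made for this sketch.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    left = u[:, :rank] * sqrt_s            # scale each kept column of U
    right = sqrt_s[:, None] * vt[:rank]    # scale each kept row of V^T
    return left, right


def linear_layer_ranks(num_layers, first_rank, last_rank):
    """Hypothetical linear schedule: interpolate the rank across layers."""
    ranks = np.linspace(first_rank, last_rank, num_layers)
    return [max(1, int(round(r))) for r in ranks]


# Usage: initialize a full-rank matrix as usual, then factor it per layer.
rng = np.random.default_rng(0)
ranks = linear_layer_ranks(num_layers=4, first_rank=64, last_rank=16)
factors = []
for r in ranks:
    full = rng.standard_normal((256, 256)) / np.sqrt(256)
    factors.append(svd_init(full, r))
```

During training, each layer would then compute `x @ right.T @ left.T` instead of `x @ weight.T`, so only the two smaller factors are stored and updated.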

Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models

Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Sparse Low-rank Adaptation of Pre-trained Language Models

Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices

LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Optimizing Data Usage for Low-Resource Speech Recognition

Sparsely Shared LoRA on Whisper for Child Speech Recognition

Learning on LoRAs: GL-Equivariant Processing of Low-Rank Weight Spaces for Large Finetuned Models

Exploring Effective Data Utilization for Low-Resource Speech Recognition

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

SBoRA: Low-Rank Adaptation with Regional Weight Updates

Speech Recognition Rescoring with Large Speech-Text Foundation Models

Low-Rank Plus Diagonal Adaptation For Deep Neural Networks