PALM: Few-Shot Prompt Learning for Audio Language Models

Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki
2024-09-30
Abstract: Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/
Sound, Artificial Intelligence, Audio and Speech Processing
### What problem does this paper attempt to address?

This paper aims to improve the performance of Audio Language Models (ALMs) on zero-shot and few-shot audio recognition tasks. Specifically, the authors focus on how better prompt learning techniques can raise both the accuracy and the efficiency of audio recognition. The main objectives of the research are:

1. **Improve the performance of zero-shot audio recognition**: Zero-shot audio recognition works by matching audio waveform features against the text prompt features of each class. Performance is sensitive to the choice of hand-crafted text prompts, which makes results unstable. The authors therefore seek an automated way to optimize these prompts.
2. **Introduce a new prompt learning method**: The authors propose PALM (Prompt Learning in Audio Language Models), which improves training efficiency by optimizing the feature space of the text encoder rather than its input space. As a result, the loss gradient does not need to flow through the text encoder, which reduces the computational cost.
3. **Verify adaptability**: The researchers evaluate whether prompt learning techniques developed for Vision-Language Models (VLMs) carry over to audio language models, and demonstrate the potential of these techniques in audio recognition tasks.
4. **Establish a benchmark**: The authors conduct experiments on 11 audio recognition datasets covering a variety of speech-processing tasks. By comparing against three baselines (ZERO-SHOT, COOP, COCOOP) in a few-shot setup, they show the effectiveness and computational efficiency of PALM.
5. **Reduce computational resource consumption**: PALM is on par with or outperforms the other approaches in accuracy while being computationally less demanding, which matters for large-scale audio processing in practical applications.
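The idea behind objectives 1 and 2 can be illustrated with a small NumPy sketch. It is not the authors' implementation: random arrays stand in for the outputs of the audio and text encoders, and the learnable per-class offset added to the frozen text features is an assumed, simplified form of feature-space prompt learning rather than PALM's published formulation. All function and variable names here are hypothetical. The sketch first scores audio features against class text features by cosine similarity (the zero-shot case), then learns the offsets from a few labelled examples without touching the text encoder.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def class_probs(audio_feats, class_feats, scale=10.0):
    """Softmax over scaled cosine similarities (audio rows vs. class rows)."""
    logits = scale * normalize(audio_feats) @ normalize(class_feats).T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return float(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())

def fit_feature_offsets(audio_feats, labels, text_feats,
                        steps=300, lr=0.2, scale=10.0):
    """Learn one offset per class in text-feature space by gradient descent.

    text_feats stays frozen: it would be computed once by the text encoder,
    so no gradient has to flow back through that encoder.
    (Assumed simplification, not the exact PALM update.)
    """
    offsets = np.zeros_like(text_feats)  # the only trainable parameters
    a_n = normalize(audio_feats)
    one_hot = np.eye(text_feats.shape[0])[labels]
    for _ in range(steps):
        c = text_feats + offsets
        norms = np.linalg.norm(c, axis=1, keepdims=True)
        c_n = c / norms
        probs = class_probs(audio_feats, c, scale)
        # Gradient of the mean cross-entropy w.r.t. the normalized class features.
        g = scale * (probs - one_hot).T @ a_n / len(labels)
        # Back-propagate through the L2 normalization of each class feature.
        grad = (g - (g * c_n).sum(axis=1, keepdims=True) * c_n) / norms
        offsets -= lr * grad
    return offsets

# Synthetic stand-ins for encoder outputs (assumption: 4 classes, 8 shots each).
rng = np.random.default_rng(0)
num_classes, dim, shots = 4, 16, 8
prototypes = rng.normal(size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), shots)
audio_feats = prototypes[labels] + 0.3 * rng.normal(size=(len(labels), dim))
text_feats = rng.normal(size=(num_classes, dim))  # frozen "hand-crafted prompt" features
frozen_copy = text_feats.copy()

zero_shot_loss = cross_entropy(class_probs(audio_feats, text_feats), labels)
offsets = fit_feature_offsets(audio_feats, labels, text_feats)
few_shot_loss = cross_entropy(class_probs(audio_feats, text_feats + offsets), labels)
print(f"zero-shot loss {zero_shot_loss:.3f}, few-shot loss {few_shot_loss:.3f}")
```

Because the offsets live in the output space of the text encoder, each training step only needs the cached text features. An input-space method, by contrast, would have to run and differentiate through the text encoder at every step.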
### Summary

In short, this paper addresses the performance bottleneck of audio language models in zero-shot and few-shot audio recognition by introducing prompt learning techniques that also reduce computational cost. In experiments on 11 datasets, the proposed method, PALM, is on par with or outperforms the compared baselines while being less computationally demanding.