Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Fabian Ritter-Gutierrez, Ming To Chuang, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Eunjung Yeo, Kalvin Chang, Chung-Ming Chien, Kwanghee Choi, Cheng-Hsiu Hsieh, Yi-Cheng Lin, Chee-En Yu, I-Hsiang Chiu, Heitor R. Guimarães, Jionghao Han, Tzu-Quan Lin, Tzu-Yuan Lin, Homu Chang, Ting-Wu Chang, Chun Wei Chen, Shou-Jen Chen, Yu-Hua Chen, Hsi-Chun Cheng, Kunal Dhawan, Jia-Lin Fang, Shi-Xin Fang, Kuan-Yu Fang Chiang, Chi An Fu, Hsien-Fu Hsiao, Ching Yu Hsu, Shao-Syuan Huang, Lee Chen Wei, Hsi-Che Lin, Hsuan-Hao Lin, Hsuan-Ting Lin, Jian-Ren Lin, Ting-Chun Liu, Li-Chun Lu, Tsung-Min Pai, Ankita Pasad, Shih-Yun Shan Kuan, Suwon Shon, Yuxun Tang, Yun-Shao Tsai, Jui-Chiang Wei, Tzu-Chieh Wei, Chengxi Wu, Dien-Ruei Wu, Chao-Han Huck Yang, Chieh-Chi Yang, Jia Qi Yip, Shao-Xiang Yuan, Vahid Noroozi, Zhehuai Chen, Haibin Wu, Karen Livescu, David Harwath, Shinji Watanabe, Hung-yi Lee

2024-11-08

Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline.

Computation and Language, Audio and Speech Processing

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

SUPERB-SG: Enhanced Speech Processing Universal PERformance Benchmark for Semantic and Generative Capabilities

A Large-Scale Evaluation of Speech Foundation Models

SUPERB: Speech Understanding and PERformance Benchmark

SUPERB: Speech Processing Universal PERformance Benchmark

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Roadmap towards Superhuman Speech Understanding using Large Language Models

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation