Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Daniel Lin, Tzu-hsien Huang, W. Tseng, Godic Lee, Da-Rong Liu, Zili Huang, Annie Dong, Shang-Wen Li, Shinji Watanabe, Abdel-rahman Mohamed, Hung-yi Lee

Abstract: Pre-training a network on large volumes of unlabeled data with self-supervised learning, then fine-tuning it for multiple downstream tasks, has proven vital for advancing research in natural language representation learning. However, the speech processing community lacks a similar setup that systematically measures the quality of learned representations across a wide range of downstream speech applications. To bridge this gap, we introduce the Speech Understanding and PERformance Benchmark (SUPERB). SUPERB is a leaderboard that benchmarks learned speech representations on ten speech processing tasks. We present a complete framework for learning and evaluating specialized prediction heads for each task given the pre-trained speech representations. Our results on many publicly available self-supervised models demonstrate that they generalize to multiple speech tasks with limited supervision and minimal architecture changes. All materials are open-sourced and reproducible in the s3prl toolkit to facilitate future research in speech representation learning.

SUPERB: Speech Understanding and PERformance Benchmark