Abstract: Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with automatic speech recognition (ASR) abilities simply by concatenating the audio tokens, computed with an audio encoder, with the text tokens, achieving state-of-the-art results. In contrast, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to generate the response auto-regressively. Llama-AVSR requires only a small number of trainable parameters, since only the modality-specific projectors and LoRA modules are trained, whereas the multimodal encoders and the LLM are kept frozen. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results for the tasks of ASR and AVSR, with a WER of 0.81% and 0.77%, respectively. To bolster our results, we investigate the key factors that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained encoders and LLM, the efficient integration of LoRA modules, and the optimal performance-efficiency trade-off obtained via modality-aware compression rates.
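The results above are reported as word error rate (WER). As a minimal sketch of how that metric is commonly computed, the snippet below takes the word-level edit distance between a reference transcript and a hypothesis and divides it by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; the abstract does not describe the authors' scoring script or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.2%}")
```

Under this definition, a WER of 0.81% corresponds to fewer than one word error per hundred reference words.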

Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

TinyLlama: An Open-Source Small Language Model

Mini Minds: Exploring Bebeshka and Zlata Baby Models

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures

CUTE: Measuring LLMs' Understanding of Their Tokens

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?

Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models

LLäMmlein: Compact and Competitive German-Only Language Models from Scratch

Large Language Models Demonstrate the Potential of Statistical Learning in Language

Small Language Models: Survey, Measurements, and Insights

LAMAL: LAnguage Modeling is All You Need for Lifelong Language Learning

Evaluating Neural Language Models as Cognitive Models of Language Acquisition

BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models

Large language model programs

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning

TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese

From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition

Large Language Models Are Strong Audio-Visual Speech Recognition Learners