Audio parsing and rapid speaker adaptation in speech recognition for spoken document retrieval

Bowen Zhou, John H. Hansen
2003-01-01
Abstract: The focus of this thesis is to address a number of research issues in developing an effective large vocabulary continuous speech recognition (LVCSR) based on-line spoken document retrieval system. Within this framework, the primary thesis contributions fall into the following distinct yet related areas. The first thesis contribution addresses the problem of efficient audio stream parsing. Here, an extension to the previously proposed Bayesian Information Criterion (BIC) based algorithm is formulated as T²-BIC, by integrating Hotelling's T²-statistic into BIC. Using the proposed algorithm, a significant computational speed improvement is demonstrated with superior parsing performance. Second, novel rapid model adaptation techniques, entitled Eigenspace Mapping, represent a primary contribution of this thesis. The idea of Eigenspace Mapping is to construct discriminative acoustic models for the test speaker by preserving the dominant discriminating power from the baseline model along the test speaker's first primary eigendirections. The adaptation process is accomplished through a linear transformation in the model space. Based on this key idea, a number of algorithms can be formulated, such as EigMap and extensions using different objective functions, including the Structural Maximum Likelihood Eigenspace Mapping. Unsupervised adaptation experiments show that the proposed algorithms are effective using very limited amounts of adaptation data. Furthermore, the proposed algorithms are highly additive to other traditional methods such as MLLR by bringing additional discrimination information. Finally, the last contribution focuses on an experimental on-line spoken document retrieval system, SpeechFind, which is designed and implemented by incorporating state-of-the-art LVCSR and information retrieval (IR) technologies.
In addition to system development efforts, contributions have been made to enhance the quality of automatic transcripts and several methods such as query and document expansions have been developed to overcome the issue of IR over corrupted transcripts. Collectively, the contributions made in these three related areas have resulted in an effective and integrated on-line spoken document retrieval system. Moreover, the proposed audio parsing and novel rapid speaker adaptation algorithms have helped advance the state of the art in robust speech recognition.
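For readers unfamiliar with the two statistics named in the audio parsing contribution, their standard textbook forms are sketched below. This is background only: the abstract does not state how the thesis combines them, and the symbols used here (segment sizes N_1 and N_2 with N = N_1 + N_2, feature dimension d, penalty weight \lambda, segment means \mu_1 and \mu_2, and covariances \Sigma, \Sigma_1, \Sigma_2) are assumed notation, not taken from the thesis.

```latex
% Standard Gaussian BIC test for a candidate change point splitting a
% window of N frames into segments of N_1 and N_2 frames (N = N_1 + N_2).
% \Sigma is the covariance of the whole window; \Sigma_1 and \Sigma_2
% are the covariances of the two segments. A change is hypothesized
% when \Delta\mathrm{BIC} > 0.
\Delta\mathrm{BIC}
  = \frac{N}{2}\log|\Sigma|
  - \frac{N_1}{2}\log|\Sigma_1|
  - \frac{N_2}{2}\log|\Sigma_2|
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N

% Standard two-sample Hotelling T^2 statistic, with \Sigma taken here
% as the pooled (common) covariance of the two segments.
T^2 = \frac{N_1 N_2}{N_1 + N_2}
      \,(\mu_1 - \mu_2)^{\top}\,\Sigma^{-1}\,(\mu_1 - \mu_2)
```

The T² statistic needs only segment means and a single shared covariance per candidate point, whereas the BIC test needs two additional covariance determinants. That difference is one plausible reading of the computational speed improvement claimed above, though the abstract itself does not spell out the mechanism.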