Abstract: The language modeling approach centers on estimating an accurate model, which depends on choosing appropriate language models and smoothing techniques. In this thesis, we propose a novel context-sensitive semantic smoothing method referred to as the topic signature language model. It extracts explicit topic signatures from a document and then statistically maps them onto individual words in the vocabulary. To support the new language model, we developed two automated algorithms to extract multiword phrases and ontological concepts, respectively, and an EM-based algorithm to learn semantic mapping knowledge from co-occurrence data. The topic signature language model is applied to three tasks: information retrieval, text classification, and text clustering. Evaluations on news collections and biomedical literature demonstrate the effectiveness of the topic signature language model. In the information retrieval experiments, the topic signature language model consistently outperforms the baseline two-stage language model as well as the context-insensitive semantic smoothing method in all configurations. It also beats the state-of-the-art Okapi models in all configurations. In the text classification experiments, when the number of training documents is small, the Bayesian classifier with semantic smoothing outperforms not only the classifiers with background smoothing and Laplace smoothing but also the active learning classifiers and SVM classifiers. In the clustering experiments, regardless of the size of the dataset, the model-based k-means with semantic smoothing performs significantly better than the model-based k-means with either background smoothing or Laplace smoothing. It is also more effective than spherical k-means.
In addition, we show empirically that, within the framework of topic signature language models, the semantic knowledge learned from one collection can be effectively applied to other collections. In this thesis, we also compare three types of topic signatures (i.e., words, multiword phrases, and ontological concepts) with respect to their effectiveness and efficiency for semantic smoothing. In general, extracting multiword phrases and ontological concepts is more expensive than extracting individual words, but semantic mapping based on multiword phrases and ontological concepts is more effective in handling data sparsity than mapping based on individual words.
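
To make the idea of learning a semantic mapping from co-occurrence data more concrete, the following is a minimal sketch, not the thesis's actual implementation. It assumes that the words co-occurring with one topic signature are generated by a two-component mixture of a signature-specific word distribution and a collection background distribution, and it uses EM to recover the signature-specific part. It then assumes the smoothed document model is a linear interpolation of a simple document model and the signature-mapped model. The mixture form, the function names (`estimate_topic_model`, `smoothed_prob`), and the parameters `alpha` and `lam` are all illustrative assumptions that do not appear in the abstract.

```python
from collections import Counter


def estimate_topic_model(cooccur_counts, background, alpha=0.5, iterations=50):
    """Estimate p(word | topic signature) with EM (illustrative sketch).

    cooccur_counts: Counter mapping word -> count in documents that contain
        the topic signature.
    background: dict mapping word -> p(word | collection); assumed to be
        positive for every word in cooccur_counts.
    alpha: assumed fixed weight of the background component.
    """
    words = list(cooccur_counts)
    total = sum(cooccur_counts.values())
    # Start from the maximum-likelihood estimate over the co-occurrence counts.
    p_topic = {w: cooccur_counts[w] / total for w in words}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the
        # signature-specific component rather than from the background.
        resp = {}
        for w in words:
            topic_part = (1 - alpha) * p_topic[w]
            resp[w] = topic_part / (topic_part + alpha * background[w])
        # M-step: re-estimate the signature model from the weighted counts.
        weighted = {w: cooccur_counts[w] * resp[w] for w in words}
        norm = sum(weighted.values())
        p_topic = {w: weighted[w] / norm for w in words}
    return p_topic


def smoothed_prob(word, doc_model, signature_weights, topic_models, lam=0.3):
    """Interpolate a simple document model with the signature-mapped model.

    signature_weights: dict mapping signature -> p(signature | document).
    topic_models: dict mapping signature -> {word: p(word | signature)}.
    lam: assumed interpolation weight of the semantic mapping component.
    """
    mapped = sum(
        weight * topic_models[sig].get(word, 0.0)
        for sig, weight in signature_weights.items()
    )
    return (1 - lam) * doc_model.get(word, 0.0) + lam * mapped
```

In this sketch, a common word such as "the" has a high background probability, so EM shifts its mass away from the signature model and toward content words. A word that never appears in a document can still receive nonzero probability through the signatures found in that document, which is the intuition behind using the mapping to handle data sparsity.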

Semantics-based language models for information retrieval and text mining
