Utilizing Sentence Similarity and Question Type Similarity to Response to Similar Questions in Knowledge-Sharing Community
Palakorn Achananuparp, Xiaohua Hu, Xiaohua Zhou, Xiaodan Zhang
2012-01-01
Abstract: Reusing the information redundancy in question-answer pairs is one alternative approach to building question answering (QA) systems. If the same question has already been asked by other users, the QA system responds to it using the answer associated with the redundant question. Nevertheless, the task of identifying similar questions is not trivial. Traditional text similarity measures are neither effective nor efficient in distinguishing the similarity of sentence-level text. Document similarity techniques are not effective because sentences are short and share very little word overlap. Furthermore, the similarity and relevance of sentences can be characterized at different levels, which differs from the standard notion of topicality used in document retrieval. In this paper, we focus on the problem of identifying questions that express the same information need. The main goal is to match questions with their paraphrases. To achieve this, we propose a hybrid question similarity approach that combines semantic, syntactic, and question type similarity. Semantic and syntactic information is measured by taking into account word similarity, word order, and part-of-speech information. Information about question types is derived from a Support Vector Machine classifier. The experimental results show that our approach is highly effective in detecting redundant questions.
1. INTRODUCTION
For many years, knowledge-sharing community sites such as Yahoo! Answers have been accepting a large number of questions from millions of users. Given the current volume of questions and answers in their archives, it is plausible that a newly submitted question has already been asked by other users. However, finding such similar questions is ineffective due to the inherent limitations of current search engines.
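To make the word-overlap limitation concrete, the following is a minimal sketch rather than the measure used in this paper. It scores question pairs with Jaccard similarity over word sets, chosen here only as a simple stand-in for overlap-based matching. The three example questions and the helper names `tokenize` and `jaccard` are assumptions introduced for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical questions, not taken from the paper's data set.
q1 = "How can I lose weight quickly?"
q2 = "What is the fastest way to shed pounds?"  # paraphrase of q1
q3 = "How can I gain weight quickly?"           # opposite information need

paraphrase_score = jaccard(q1, q2)  # no shared words
lexical_score = jaccard(q1, q3)     # five of seven distinct words shared

print(f"paraphrase pair: {paraphrase_score:.3f}")
print(f"lexical pair:    {lexical_score:.3f}")
```

Under these assumptions the paraphrase pair shares no words at all, while the pair with opposite information needs shares most of its words, so a purely overlap-based score ranks them in the wrong order.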
Standard text retrieval approaches that compute the similarity of document-level text are neither effective nor efficient for matching natural language questions. First, document similarity techniques are fundamentally based on the degree of word overlap. This notion works well for distinguishing similar documents, since they are likely to contain a sufficient number of words in common. Questions, on the other hand, are relatively short and often share very few words. Furthermore, due to the generative power of natural language, the same question can be expressed in various ways. Hence, most questions are likely to receive a low similarity score from document similarity measures. The notion of topical relevance, which is central to the standard information retrieval systems, …