Abstract: Reusing the redundancy of information in question-answer pairs is one alternative approach to question answering (QA). If the same question has already been asked by other users, the QA system responds to it using the answer associated with the redundant question. Nevertheless, identifying similar questions is not a trivial task. Traditional text similarity measures are neither effective nor efficient at distinguishing the similarity of sentence-level text. Document similarity techniques are not effective because sentences are short and share very little word overlap. Furthermore, the similarity and relevance of sentences can be characterized at different levels, which differs from the standard notion of topicality used in document retrieval. In this paper, we focus on the problem of identifying questions that express the same information need. The main goal is to match questions with their paraphrases. To achieve this, we propose a hybrid question similarity approach that combines semantic, syntactic, and question type similarity. Semantic and syntactic information is measured by taking into account word similarity, word order, and part-of-speech information. Information about question types is derived from a Support Vector Machine classifier. The experimental results show that our approach is highly effective in detecting redundant questions.

1. INTRODUCTION

For many years, knowledge-sharing community sites, such as Yahoo! Answers, have been accepting a large number of questions from millions of users. Given the current magnitude of questions and answers in their archives, it is plausible that a newly submitted question has already been asked by other users. However, finding such similar questions remains ineffective because of the inherent limitations of current search engines.
Standard text retrieval approaches that compute similarity over document-level text are neither effective nor efficient for matching natural language questions. First, the fundamental principle of document similarity techniques is the degree of word overlap. This notion works well for distinguishing similar documents, since they are likely to contain a sufficient number of words in common. Questions, on the other hand, are relatively short and often share very few words. Furthermore, due to the generative power of natural language, the same question can be expressed in various ways. Hence, most questions are likely to receive a low similarity score from document similarity measures. The notion of topical relevance, which is central to the standard information retrieval systems, …
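
The word-overlap limitation described above can be illustrated with a minimal sketch. It is not the method proposed in the paper: Jaccard overlap is used here only as a simple stand-in for an overlap-based document similarity measure, and the helper names (tokens, jaccard) and the three example questions are assumptions made for illustration.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap: number of shared words divided by number of distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty questions: define the overlap as zero
    return len(ta & tb) / len(ta | tb)


# Hypothetical questions, chosen only to illustrate the point.
q1 = "How can I lose weight quickly?"
q2 = "What is the fastest way to shed pounds?"  # paraphrase of q1, no shared words
q3 = "How can I lose my accent quickly?"        # different need, many shared words

paraphrase_score = jaccard(q1, q2)
unrelated_score = jaccard(q1, q3)

print(f"paraphrase pair: {paraphrase_score:.3f}")
print(f"unrelated pair:  {unrelated_score:.3f}")
```

In this sketch the paraphrase pair shares no words and scores lower than the pair that asks about a different information need, which is the behavior that motivates adding semantic, syntactic, and question type information on top of plain word overlap.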

Utilizing Sentence Similarity and Question Type Similarity to Response to Similar Questions in Knowledge-Sharing Community
