Abstract: Identifying semantically identical questions on question-and-answer social media platforms like Quora is essential for ensuring the quality and quantity of content presented to users based on the intent of the question, thereby enriching the overall user experience. Detecting duplicate questions is a challenging problem because natural language is very expressive, and a single intent can be conveyed using different words, phrases, and sentence structures. Machine learning and deep learning methods are known to achieve better results than traditional natural language processing techniques in identifying similar texts. In this paper, taking Quora as our case study, we explored and applied different machine learning and deep learning techniques to the task of identifying duplicate questions in Quora's dataset. By using feature engineering, feature importance techniques, and experiments with seven selected machine learning classifiers, we demonstrated that our models outperformed previous studies on this task. An XGBoost model with character-level term frequency and inverse document frequency (TF-IDF) features is our best machine learning model, and it also outperformed a few of the deep learning baseline models. We applied deep learning techniques to build four different multi-layer deep neural networks consisting of GloVe embeddings, Long Short-Term Memory, convolution, max pooling, dense, batch normalization, and activation layers, and model merging. Our deep learning models achieved better accuracy than our machine learning models. Three out of four proposed architectures outperformed the accuracy reported in previous machine learning and deep learning research, two out of four models outperformed the accuracy of a previous deep learning study on Quora's question pair dataset, and our best model achieved an accuracy of 85.82%, which is close to Quora's state-of-the-art accuracy.
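To make the character-level TF-IDF idea concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. It builds character trigram TF-IDF vectors for a few made-up questions and compares them with cosine similarity. The function names, the trigram size, the smoothed IDF formula, and the example questions are all assumptions chosen for illustration. The classifier stage (XGBoost in the paper) is omitted, and a similarity score like this would be only one possible input feature to it.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    total = len(docs)
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        # Smoothed IDF (an assumption of this sketch) so that n-grams
        # present in every document still get a nonzero weight.
        vectors.append({
            gram: (tf / length) * (math.log((1 + total) / (1 + df[gram])) + 1)
            for gram, tf in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical example questions, not taken from the Quora dataset.
questions = [
    "How can I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "Why is the sky blue?",
]
v1, v2, v3 = tfidf_vectors(questions)
print("paraphrase pair:", round(cosine(v1, v2), 3))
print("unrelated pair: ", round(cosine(v1, v3), 3))
```

In this toy setting the paraphrased pair scores higher than the unrelated pair because it shares many character trigrams (for example those inside "learn python"). Character-level features of this kind tolerate small spelling and inflection differences that word-level features would miss.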
