Abstract: The rise of social media platforms has changed how consumers interact with retailers and express opinions on products and services. Online retailers in particular need to monitor customer sentiment in real time to make informed decisions about their offerings and improve customer satisfaction. However, efficiently analysing large volumes of unstructured social media text in real time poses a significant challenge. This research aimed to develop a scalable, real-time sentiment analysis system tailored for online retailers, using Reddit as the data source. The system comprises three main components: a data extraction and streaming pipeline, a sentiment analysis model, and a web application with real-time analytics. To address the data extraction challenge, a job-queue-based system was implemented using Node.js, BullMQ, and Redis to create and manage campaigns for streaming data from Reddit. The data was streamed through Kafka, a distributed streaming platform, to enable efficient real-time processing. The sentiment analysis model was developed using a Naive Bayes classifier after experimenting with other machine learning and deep learning techniques. The model's performance was evaluated using standard classification metrics. It achieved an accuracy of 0.6737, meaning that approximately 67.37 per cent of the sentiments in the test data were classified correctly, and an F1 score of 0.7894. The Area Under the Curve (AUC) on the test data was 0.5468, a value that, while above the 0.5 chance level, suggests room for further refinement in the model's ability to discriminate between classes. The integration of Data Version Control (DVC) provided a mechanism for fine-tuning the model to the specific data requirements of individual tenants.
Taken together, these results support the feasibility of a Naive Bayes classifier for real-time sentiment analysis in the retail context and provide a baseline for future research aimed at improving both the accuracy and efficiency of sentiment classification. The project’s evaluation covered the performance of the sentiment analysis model, the efficiency of the Kafka streaming and real-time Spark pre-processing pipeline, and the backend infrastructure, including the job queuing system and the WebSocket implementation. Evaluation techniques included graphs and comparisons with the literature. In conclusion, this project demonstrated the feasibility of a scalable, real-time sentiment analysis system for online retailers using Reddit data. The system has the potential to help retailers better understand customer opinions and make data-driven decisions. Future work could include exploring alternative data sources, experimenting with more advanced sentiment analysis techniques, and enhancing the web application’s user interface and analytics capabilities.
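The reported metrics (accuracy, F1 score, and AUC) measure different things, and a relatively high F1 score can coexist with an AUC near chance when the classes are imbalanced. The following is a minimal sketch of how the three metrics are computed, not the evaluation code used in this study. The labels and scores are a small hypothetical toy set chosen for illustration, and the function names are assumptions.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float], positive: int = 1) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 7 positive and 3 negative examples.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# Hypothetical classifier scores; every score exceeds the 0.5 threshold,
# so the classifier labels every example as positive.
scores = [0.90, 0.80, 0.70, 0.60, 0.60, 0.55, 0.52, 0.85, 0.65, 0.53]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)   # 7 of 10 correct
f1 = f1_score(y_true, y_pred)    # precision 0.7, recall 1.0
roc_auc = auc(y_true, scores)    # 10 of 21 positive/negative pairs ranked correctly

print(f"accuracy={acc:.4f} f1={f1:.4f} auc={roc_auc:.4f}")
```

In this toy case the classifier labels everything as positive, which yields a reasonable accuracy and F1 score on imbalanced data even though its scores rank positives above negatives no better than chance. Whether a similar effect contributes to the figures reported above cannot be determined from the abstract alone.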
