Abstract: Social media platforms face an ongoing challenge in combating the proliferation of social bots, automated accounts that are also known to distort public opinion and support the spread of disinformation. Over the years, social bots have evolved greatly, often becoming indistinguishable from real users, and more recently, families of bots have been identified that are powered by Large Language Models to produce content for posting. We suggest an idea to classify social users as bots or not using genetic similarity algorithms. These algorithms provide an adaptive method for analyzing user behavior, allowing for the continuous evolution of detection criteria in response to the ever-changing tactics of social bots. Our proposal involves an initial clustering of social users into distinct macro species based on the similarities of their timelines. Macro species are then classified as either bot or genuine based on genetic characteristics. The preliminary idea we present, once fully developed, will allow existing detection applications based on timeline equality alone to be extended to detect bots. By incorporating new metrics, our approach will systematically classify non-trivial accounts into appropriate categories, effectively peeling back layers to reveal non-obvious species.
What problem does this paper attempt to address?
This paper addresses the detection of social bots on social media platforms. As social bots continue to evolve, their behavior is becoming increasingly difficult to distinguish from that of real users, especially when their content is generated by large language models (LLMs). These bots are not only numerous but are also often used to manipulate public opinion and spread false information, with serious negative effects on society.
To meet this challenge, the authors propose a method based on genetic similarity to classify and identify social bots. The core idea is to encode users' online behavior patterns as "digital DNA", cluster users into different "species", and then decide which species are bot accounts and which are real-user accounts according to the genetic characteristics of those species.
### Overview of Main Problems and Solutions:
1. **Problem Background**:
- Social media platforms face a proliferation of social bots.
- These bots behave in complex ways and are difficult to distinguish from real users, especially the recently emerged bot families driven by large language models.
- Traditional detection methods based on timeline equality are no longer sufficient to deal with these advanced bots.
2. **Proposed Solutions**:
- **Digital DNA Encoding**: Encode users' online behaviors into a character sequence, where each character represents a specific operation (such as tweeting, retweeting, replying, etc.).
- **Species Clustering**: Cluster users into different macro species using the Longest Common Substring (LCS) algorithm. The length of the LCS shared by a group of users measures how similar their behaviors are.
- **Preliminary Classification**: Identify groups of users with similar behaviors from the significant change points of the LCS length, and provisionally divide them into a suspected bot group (gSpamBot) and a real-user group (gGenuine).
- **Genetic Similarity Classification**: For unlabeled species, apply a custom genetic similarity measure. A similarity score is computed with a sequence alignment algorithm and combined with factors such as group size to decide whether each species consists of bots or real users.
3. **Expected Effects**:
- Provide an adaptable method for analyzing user behavior that can cope with the ever-changing strategies of social bots.
- Extend existing detection applications based on timeline equality with new metrics, so that non-trivial accounts are classified more systematically and hidden bot species are revealed.
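
The genetic similarity classification step in the list above relies on a sequence alignment score. The summary does not specify which alignment algorithm or scoring scheme the authors use, so the following is only a minimal sketch that assumes a Needleman-Wunsch global alignment with hypothetical scores (match +1, mismatch -1, gap -1). The function name `global_alignment_score` is illustrative, and the combination with group size is not attempted here.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Score of the best global alignment of two digital DNA strings.

    Assumption: Needleman-Wunsch with simple linear gap penalties; the
    scoring values are placeholders, not taken from the paper.
    """
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ATTC", "ATTC"))  # identical sequences
print(global_alignment_score("ATTC", "ATC"))   # one gap needed
```

A higher score indicates that two behavior sequences align more closely; how such a score would be thresholded or weighted is left to the full method.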
### Formula Representation:
- The alphabet \( B \) used in digital DNA encoding, with the action each character stands for, is defined as:
\[
B = \{A, T, C\}, \qquad A \leftrightarrow \text{ordinary tweet}, \quad T \leftrightarrow \text{retweet}, \quad C \leftrightarrow \text{reply}
\]
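
As an illustration of this encoding, the sketch below maps a chronologically ordered list of actions to a digital DNA string over the alphabet A, T, C. The action labels (`"tweet"`, `"retweet"`, `"reply"`) and the function name `encode_timeline` are assumptions made for the example, not names from the paper.

```python
from typing import Iterable

# Alphabet from the summary: A = ordinary tweet, T = retweet, C = reply.
# The dictionary keys are hypothetical action labels.
ACTION_TO_BASE = {"tweet": "A", "retweet": "T", "reply": "C"}


def encode_timeline(actions: Iterable[str]) -> str:
    """Turn an ordered sequence of action labels into a digital DNA string."""
    return "".join(ACTION_TO_BASE[action] for action in actions)


timeline = ["tweet", "retweet", "retweet", "reply", "tweet"]
print(encode_timeline(timeline))  # ATTCA
```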
- The length of the Longest Common Substring (LCS) is used to measure the similarity of users' behaviors:
\[
LCS(s_1, s_2, \ldots, s_k)
\]
where \( s_i \) represents the digital DNA sequence of the \( i \)-th user.
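
To make the LCS of several sequences concrete, here is a small brute-force sketch. It tries every substring of the shortest sequence, longest first, and returns the first one found in all sequences. This is for illustration only and assumes short inputs; practical systems would more likely use suffix-tree-based methods. The function name `longest_common_substring` is hypothetical.

```python
from typing import Sequence


def longest_common_substring(seqs: Sequence[str]) -> str:
    """Return one longest substring shared by every sequence in seqs."""
    if not seqs:
        return ""
    # Any common substring must occur in the shortest sequence.
    shortest = min(seqs, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in seqs):
                return candidate
    return ""


users = ["ATTCATTA", "CATTCAT", "TTCAA"]
lcs = longest_common_substring(users)
print(lcs, len(lcs))  # TTCA 4
```

A long LCS across many accounts means those accounts performed the same actions in the same order, which is the kind of behavioral similarity the clustering step relies on.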
With this method, the researchers aim to detect and classify social bots more effectively and thereby help keep social media platforms healthy.