American cultural regions mapped through the lexical analysis of social media

Thomas Louf, Bruno Gonçalves, Jose J. Ramasco, David Sanchez, Jack Grieve
DOI: https://doi.org/10.1057/s41599-023-01611-3
2023-04-18
Abstract: Cultural areas represent a useful concept that cross-fertilizes diverse fields in social sciences. Knowledge of how humans organize and relate their ideas and behavior within a society helps to understand their actions and attitudes towards different issues. However, the selection of common traits that shape a cultural area is somewhat arbitrary. What is needed is a method that can leverage the massive amounts of data coming online, especially through social media, to identify cultural regions without ad-hoc assumptions, biases or prejudices. This work takes a crucial step in this direction by introducing a method to infer cultural regions based on the automatic analysis of large datasets from microblogging posts. The approach presented here is based on the principle that cultural affiliation can be inferred from the topics that people discuss among themselves. Specifically, regional variations in written discourse are measured in American social media. From the frequency distributions of content words in geotagged Tweets, the regional hotspots of words' usage are found, and from there, principal components of regional variation are derived. Through a hierarchical clustering of the data in this lower-dimensional space, this method yields clear cultural areas and the topics of discussion that define them. It uncovers a manifest North-South separation, which is primarily influenced by the African American culture, and further contiguous (East-West) and non-contiguous divisions that provide a comprehensive picture of today's cultural areas in the US.
Computation and Language, Computers and Society, Social and Information Networks, Physics and Society
What problem does this paper attempt to address?
The problem this paper addresses is how to objectively identify cultural regions in the United States from language data on social media. Specifically, the authors develop a method that uses large-scale geotagged social media data (from Twitter) to automatically analyze the distribution of discussion topics and thereby infer the cultural characteristics of, and differences between, regions.

### Main problems of the paper

1. **How to objectively define cultural regions**:
   - Traditionally, cultural regions have been delineated from subjectively selected cultural factors (such as politics, religion, and the economy), and the way these factors are combined is also subjective.
   - This paper proposes an automated method based on social media data that avoids ad hoc assumptions and human biases, and infers cultural regions from the topics people discuss day to day.
2. **How to extract meaningful cultural information from massive data**:
   - The authors used more than 3.3 billion geotagged tweets, covering the United States from 2015 to 2021.
   - They extracted and visualized the characteristics of cultural regions by computing word frequencies, Getis-Ord z-scores (used to identify geographical hotspots), and principal component analysis (PCA).
3. **How to verify the stability and representativeness of cultural regions**:
   - The study finds that the United States can be divided into five major cultural regions, each with its own distinctive discussion topics.
   - These regions have remained relatively stable in recent years, which supports their validity.

### Method overview

1. **Data collection and pre-processing**:
   - Collected 3.3 billion geotagged tweets and cleaned them, including removing anomalous users and non-English tweets.
2. **Measuring regional variation**:
   - Calculated the relative frequency of each word in each county and used Getis-Ord z-scores to identify geographical hotspots of word use.
3. **Principal component analysis (PCA)**:
   - Performed PCA on the z-scores of all words to extract the main dimensions of regional variation.
4. **Cluster analysis**:
   - Applied hierarchical clustering to the PCA results, which yielded five major cultural regions.

### Main findings

- The United States can be divided into five major cultural regions, each with its own distinctive topics.
- The division reflects not only a North-South separation but also East-West and other non-contiguous regional divisions.
- The formation of cultural regions is influenced by multiple factors, including geographical location, population density, and dialect.

This method gives researchers an objective, data-driven delineation of cultural regions and a new perspective on the complexity and diversity of American society.
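
The regional-variation step in the method overview (relative word frequencies followed by Getis-Ord z-scores) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy counts, and the binary "adjacent regions plus the region itself" weights are all assumptions made for illustration, and the summary does not state which spatial weighting the paper actually uses. The score computed is the standard Getis-Ord Gi* statistic, which is high where a word is over-used in a region and its neighbours.

```python
import math


def relative_frequencies(counts):
    """Turn {region: {word: count}} into {region: {word: share of that region's tokens}}."""
    rel = {}
    for region, word_counts in counts.items():
        total = sum(word_counts.values())
        rel[region] = {w: c / total for w, c in word_counts.items()} if total else {}
    return rel


def getis_ord_gi_star(values, weights):
    """Return one Gi* z-score per region.

    values[i] is the word's relative frequency in region i.
    weights[i][j] is the spatial weight between regions i and j; the diagonal
    is non-zero, so each region counts as its own neighbour (the "star" variant).
    """
    n = len(values)
    if n < 2:
        return [0.0] * n
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    scores = []
    for row in weights:
        w_sum = sum(row)
        w_sq_sum = sum(w * w for w in row)
        numerator = sum(w * v for w, v in zip(row, values)) - mean * w_sum
        spread = (n * w_sq_sum - w_sum ** 2) / (n - 1)
        denominator = std * math.sqrt(max(spread, 0.0))
        # A zero denominator means no variation in values or weights: no hotspot signal.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Toy data (hypothetical): five regions in a row, one word of interest.
regions = ["A", "B", "C", "D", "E"]
counts = {
    "A": {"yall": 9, "other": 1},
    "B": {"yall": 8, "other": 2},
    "C": {"yall": 1, "other": 9},
    "D": {"yall": 1, "other": 9},
    "E": {"yall": 1, "other": 9},
}
rel = relative_frequencies(counts)
values = [rel[r].get("yall", 0.0) for r in regions]

# Assumed weights: 1 for a region itself and its immediate neighbours, else 0.
weights = [
    [1 if abs(i - j) <= 1 else 0 for j in range(len(regions))]
    for i in range(len(regions))
]

scores = getis_ord_gi_star(values, weights)
for region, score in zip(regions, scores):
    print(f"{region}: {score:+.2f}")
```

In this toy case region A comes out as a hotspot (positive score) and region E as a coldspot (negative score). Repeating the calculation for every word gives a regions-by-words matrix of z-scores, which is the kind of input the PCA and hierarchical clustering steps would then operate on.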