Open e-commerce 1.0, five years of crowdsourced U.S. Amazon purchase histories with user demographics

Alex Berke,Dan Calacci,Robert Mahari,Takahiro Yabe,Kent Larson,Sandy Pentland
DOI: https://doi.org/10.1038/s41597-024-03329-6
2024-05-14
Scientific Data
Abstract: This is a first-of-its-kind dataset containing detailed purchase histories from 5027 U.S. Amazon.com consumers, spanning 2018 through 2022, with more than 1.8 million purchases. Consumer spending data are customarily collected through government surveys to produce public datasets and statistics, which serve public agencies and researchers. Companies now collect similar data through consumers' use of digital platforms at rates superseding data collection by public agencies. We published this dataset in an effort towards democratizing access to rich data sources routinely used by companies. The data were crowdsourced through an online survey and shared with participants' informed consent. Data columns include order date, product code, title, price, quantity, and shipping address state. Each purchase history is linked to survey data with information about participants' demographics, lifestyle, and health. We validate the dataset by showing expenditure correlates with public Amazon sales data (Pearson r = 0.978, p < 0.001) and conduct analyses of specific product categories, demonstrating expected seasonal trends and strong relationships to other public datasets.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper aims to democratize consumer behavior research by releasing a first-of-its-kind dataset containing the detailed purchase histories of 5,027 U.S. Amazon consumers from 2018 through 2022. The dataset includes more than 1.8 million purchases and links each purchase history to survey data on participants' demographics, lifestyle, and health. The paper shows how such data can supplement or replace traditional government survey data, particularly as government survey response rates decline, to improve research on consumer spending patterns. Specifically, the paper focuses on the following aspects:
1. **Data representativeness and validation**: The dataset is validated by showing that expenditure correlates with publicly available Amazon sales data (Pearson correlation coefficient \(r = 0.978\), \(p < 0.001\)). Analyses of specific product categories also show the expected consumption trends and seasonal variation.
2. **Research on consumer behavior**: The paper explores differences in consumer behavior across groups defined by gender, age, income level, and other characteristics. For example, the average expenditure of high-income groups increased significantly during the holiday season (Q4), and the average expenditure of female users was higher than that of male users after the start of the COVID-19 pandemic (the second quarter of 2020).
3. **Addressing the limitations of traditional data collection**: Response rates to government surveys are declining, including key U.S. Census Bureau surveys such as the Current Population Survey (CPS) and the Consumer Expenditure Survey (CES), whose response rates have decreased by 19% and 15% respectively. The paper proposes that transaction data from e-commerce platforms can supplement traditional data and improve the accuracy of economic indicators.
4. **Promoting future research**: By releasing this dataset, the paper hopes to stimulate new research in fields such as consumer behavior, socioeconomic status, and public health, especially on how big data techniques can improve existing research methods.
In summary, the core objective of this paper is to improve academic and public understanding of consumer behavior by opening a high-quality dataset of e-commerce purchase histories, while exploring new data collection and analysis methods that address the challenges facing traditional data collection.
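The validation step described above, correlating dataset expenditure with public Amazon sales data, can be illustrated with a minimal Python sketch. It is not the authors' code. Several things here are assumptions made for illustration: the row layout `(order date, unit price, quantity)`, spend computed as price times quantity, aggregation by calendar quarter, and the helper names `quarterly_spend` and `pearson_r`. All numbers are synthetic toy values, not values from the dataset or from Amazon's reports.

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def quarter_key(d):
    """Map a date to a (year, quarter) tuple, e.g. 2021-05-20 -> (2021, 2)."""
    return (d.year, (d.month - 1) // 3 + 1)


def quarterly_spend(purchases):
    """Sum unit_price * quantity per (year, quarter).

    `purchases` is an iterable of (order_date, unit_price, quantity) rows.
    This row layout is a hypothetical simplification of the dataset's columns.
    """
    totals = defaultdict(float)
    for order_date, unit_price, quantity in purchases:
        totals[quarter_key(order_date)] += unit_price * quantity
    return dict(totals)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Synthetic toy rows: (order date, unit price in USD, quantity). Not real data.
toy_purchases = [
    (date(2021, 1, 15), 10.0, 1),
    (date(2021, 2, 3), 5.0, 2),
    (date(2021, 5, 20), 30.0, 1),
    (date(2021, 8, 9), 12.5, 2),
    (date(2021, 11, 26), 40.0, 1),
    (date(2021, 12, 14), 10.0, 2),
]

# Synthetic stand-in for a public quarterly sales series (arbitrary units).
toy_reference = {(2021, 1): 2.0, (2021, 2): 3.1, (2021, 3): 2.4, (2021, 4): 6.2}

spend = quarterly_spend(toy_purchases)

# Compare only the quarters present in both series, in chronological order.
quarters = sorted(spend.keys() & toy_reference.keys())
r = pearson_r(
    [spend[q] for q in quarters],
    [toy_reference[q] for q in quarters],
)
print(f"Pearson r over {len(quarters)} quarters: {r:.3f}")
```

The correlation is computed only over periods present in both series, so a gap in either source does not silently misalign the comparison. With real data, a significance test (the reported \(p < 0.001\)) would be computed alongside \(r\); this sketch omits it.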