Using the Wayback Machine to mine websites in the social sciences: A methodological resource
Sanjay K. Arora, Yin Li, Jan Youtie, Philip Shapira
DOI: https://doi.org/10.1002/asi.23503
2015-05-05
Journal of the Association for Information Science and Technology
Abstract: Websites offer an unobtrusive data source for developing and analyzing information about various types of social science phenomena. In this paper, we provide a methodological resource for social scientists looking to expand their toolkit using unstructured web‐based text, and in particular, with the Wayback Machine, to access historical website data. After providing a literature review of existing research that uses the Wayback Machine, we put forward a step‐by‐step description of how the analyst can design a research project using archived websites. We draw on the example of a project that analyzes indicators of innovation activities and strategies in 300 U.S. small‐ and medium‐sized enterprises in green goods industries. We present six steps to access historical Wayback website data: (a) sampling, (b) organizing and defining the boundaries of the web crawl, (c) crawling, (d) website variable operationalization, (e) integration with other data sources, and (f) analysis. Although our examples draw on specific types of firms in green goods industries, the method can be generalized to other areas of research. In discussing the limitations and benefits of using the Wayback Machine, we note that both machine and human effort are essential to developing a high‐quality data set from archived web information.
information science & library science; computer science, information systems