Abstract: Common Crawl is a multi-petabyte longitudinal dataset containing over 100 billion web pages which is widely used as a source of language data for sequence model training and in web science research. Each of its constituent archives is on the order of 75 TB in size. Using it for research, particularly longitudinal studies, which necessarily involve multiple archives, is therefore very expensive in terms of compute time and storage space and/or web bandwidth. Two new methods for mitigating this problem are presented here, based on exploiting and extending the much smaller (<200 gigabytes (GB) compressed) _index_ which is available for each archive. By adding Last-Modified timestamps to the index we enable longitudinal exploration using only a single archive. By comparing the distribution of index features for each of the 100 segments into which each archive is divided with their distribution over the whole archive, we have identified the least and most representative segments for a number of recent archives. This allows the segment(s) that are most representative of an archive to be used as proxies for the whole. We illustrate this approach in an analysis of changes in URI length over time, leading to an unanticipated insight into how the creation of Web pages has changed over time.
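The URI-length analysis mentioned in the abstract can be pictured with a minimal Python sketch. It assumes index rows have already been reduced to (URI, Last-Modified header value) pairs; the `records` list and the `uri_length_by_year` helper are hypothetical illustrations, not the paper's actual index schema or code.

```python
from collections import defaultdict
from email.utils import parsedate_to_datetime
from statistics import mean

# Hypothetical records: (URI, Last-Modified header value) pairs, standing in
# for rows of an index that has been extended with Last-Modified timestamps.
records = [
    ("http://example.org/a.html", "Tue, 05 Mar 2002 10:00:00 GMT"),
    ("http://example.org/about/index.html", "Mon, 11 Aug 2003 08:30:00 GMT"),
    ("https://example.com/products/item?id=12345&ref=home",
     "Wed, 15 Jan 2020 12:00:00 GMT"),
    ("https://example.com/x", "not a date"),  # unparseable header, skipped
]


def uri_length_by_year(rows):
    """Return {year: mean URI length}, grouping rows by Last-Modified year."""
    lengths = defaultdict(list)
    for uri, last_modified in rows:
        try:
            year = parsedate_to_datetime(last_modified).year
        except (TypeError, ValueError):
            # Malformed headers raise TypeError or ValueError depending on
            # the Python version; such rows are ignored in this sketch.
            continue
        lengths[year].append(len(uri))
    return {year: mean(vals) for year, vals in sorted(lengths.items())}


result = uri_length_by_year(records)
for year, avg in result.items():
    print(year, avg)
```

Grouping by the year of last modification is what lets a single archive stand in for a time series: pages crawled together can still carry modification dates spread over many years.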

What problem does this paper attempt to address?

This paper addresses the excessive compute and storage cost of conducting longitudinal Web analysis with Common Crawl. Common Crawl is a multi-petabyte dataset containing more than 100 billion web pages, widely used for language model training and Web science research. Each archive holds approximately 75 TB of data, so a longitudinal study that processes multiple archives consumes a large amount of computing time, storage space, or bandwidth. To mitigate this, the author proposes two new methods:

1. **Extend the index file**: Use and extend the much smaller index (less than 200 GB compressed) available for each archive, reducing the resources needed compared with processing the complete archive. In particular, adding the Last-Modified timestamp to the index makes longitudinal exploration possible using only a single archive.
2. **Select the most representative segments**: Each archive is divided into 100 random subsets (called segments). The most representative segments are identified by comparing the distribution of index features in each segment with their distribution over the entire archive. These segments can then serve as proxies for the entire archive, reducing the computational cost.

### Method overview

1. **Improve the index file**:
   - Add the Last-Modified timestamp to the index file, so that changes in Web content can be analyzed along the time dimension.
   - With this timestamp, researchers can obtain historical information about when Web pages were created or last changed without accessing the entire archive.
2. **Select representative segments**:
   - For each archive, which is divided into 100 segments, evaluate the representativeness of each segment by analyzing the distribution of its index features.
   - Use statistical methods (such as rank correlation) to measure the similarity between each segment and the entire archive, thereby identifying the most representative segments.
   - Use these segments as proxies for the entire archive, reducing the demand for computing resources.

### Experimental verification

The author verified the effectiveness of these proposals through a two-part experiment:

1. **Segment representativeness measurement**:
   - Analyze the segments of archives from four different years (2019, 2020, 2021, 2023) and evaluate their representativeness.
   - The results show that the best segments of each year correlate very highly with the entire archive (> 0.95), and that the worst segments should be avoided.
2. **URI length change analysis**:
   - Use the Last-Modified timestamp to analyze the trend in URI length over time.
   - This revealed some unexpected insights, such as the creation of Web pages gradually shifting from manual writing to automatic generation.

### Conclusion

With these improved methods, the author shows how to use Common Crawl effectively for large-scale longitudinal Web analysis while significantly reducing the computational cost. This improves research efficiency and also reveals some new trends in Web development.
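The segment-selection idea summarized above (scoring each segment by rank correlation against the whole archive) can be sketched in plain Python. This is an illustration under stated assumptions, not the author's implementation: the choice of MIME type as the index feature, the toy counts, and the helper names `average_ranks`, `spearman`, and `rank_segments` are all hypothetical.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined; treated as 0 in this sketch
    return cov / sqrt(var_x * var_y)


def rank_segments(segment_counts, archive_counts):
    """Sort segments from most to least similar to the whole archive."""
    keys = sorted(archive_counts)
    whole = [archive_counts[k] for k in keys]
    scores = {
        seg: spearman([counts.get(k, 0) for k in keys], whole)
        for seg, counts in segment_counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy feature counts (assumed feature: MIME type of each indexed page).
archive = Counter({"text/html": 900, "application/pdf": 50,
                   "image/jpeg": 30, "text/plain": 20})
segments = {
    "seg-0": Counter({"text/html": 95, "application/pdf": 6,
                      "image/jpeg": 3, "text/plain": 1}),
    "seg-1": Counter({"text/html": 10, "application/pdf": 40,
                      "image/jpeg": 2, "text/plain": 30}),
}

ranked = rank_segments(segments, archive)
for name, rho in ranked:
    print(name, round(rho, 3))
```

A rank-based measure is a natural fit when feature counts are heavily skewed, because it compares the ordering of feature values rather than their raw magnitudes.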

Improved methodology for longitudinal Web analytics using Common Crawl

Big Data Science Over the Past Web

Methods and Approaches to Using Web Archives in Computational Communication Research

Modeling Updates of Scholarly Webpages Using Archived Data

Using the wayback machine to mine websites in the social sciences: A methodological resource

Incremental Structured Web Database Crawling Via History Versions

Identify Temporal Websites Based on User Behavior Analysis

You, the Web and Your Device: Longitudinal Characterization of Browsing Habits

Analysing Parallel and Passive Web Browsing Behavior and its Effects on Website Metrics

Changes in Web client access patterns: Characteristics and caching implications

Lost but not forgotten: finding pages on the unarchived web

iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages

Digital humanities and web archives: Possible new paths for combining datasets

A Brief History of Web Crawlers

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

Web Archives Metadata Generation with GPT-4o: Challenges and Insights

Beyond time delays: How web scraping distorts measures of online news consumption

The Blind Men and the Internet: Multi-Vantage Point Web Measurements