Methods to Assess the UK Government's Current Role as a Data Provider for AI

Neil Majithia, Elena Simperl
2024-11-28
Abstract: The compositions of generative AI training corpora remain closely-guarded secrets, causing an asymmetry of information between AI developers and organisational data owners whose digital assets may have been incorporated into the corpora without their knowledge. While this asymmetry is the subject of well-known ongoing lawsuits, it also inhibits the measurement of the impact of open data sources for AI training. To address this, we introduce and implement two methods to assess open data usage for the training of Large Language Models (LLMs) and 'peek behind the curtain' in order to observe the UK government's current contributions as a data provider for AI. The first method, an ablation study that utilises LLM 'unlearning', seeks to examine the importance of the information held on UK government websites for LLMs and their performance in citizen query tasks. The second method, an information leakage study, seeks to ascertain whether LLMs are aware of the information held in the datasets published on the UK government's open data initiative data.gov.uk. Our findings indicate that UK government websites are important data sources for AI (heterogeneously across subject matters) while data.gov.uk is not. This paper serves as a technical report, explaining in-depth the designs, mechanics, and limitations of the above experiments. It is accompanied by a complementary non-technical report on the ODI website in which we summarise the experiments and key findings, interpret them, and build a set of actionable recommendations for the UK government to take forward as it seeks to design AI policy.
While we focus on UK open government data, we believe that the methods introduced in this paper present a reproducible approach to tackle the opaqueness of AI training corpora and provide organisations a framework to evaluate and maximize their contributions to AI development.
Computers and Society, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
This paper evaluates the UK government's current role as a data provider for AI, and in particular how far its data sources contribute to the performance of large language models (LLMs).

**Research Question: To what extent do the UK government's data sources contribute to the performance of AI models?**

To answer this question, the authors propose two methods for evaluating the role of UK government data in training LLMs:

1. **Ablation Study**:
   - Use the "unlearning" technique to remove the data from UK government websites from LLMs, then measure how this removal affects the models' performance in citizen-query tasks.
   - The goal is to examine the importance of government-website information to LLMs and their performance in citizen-query tasks.
2. **Information Leakage Study**:
   - Aims to determine whether LLMs can recall the information in the datasets published on data.gov.uk.
   - By designing specific prompts and analyzing the LLMs' responses, evaluate whether these models actually used the data from data.gov.uk for training.

### Research Background

As artificial intelligence (AI) models increasingly dominate the technology field, much remains unknown about the content of their underlying training corpora. Governments usually collect and manage large amounts of high-quality data, which can provide important support for the development of AI. However, because the specific composition of generative AI training corpora is often a commercial secret, data-sharing schemes are difficult to plan.

### Method Overview

1. **Ablation Study**:
   - **Objective**: Evaluate the importance of government-website information to LLMs in citizen-query tasks.
   - **Method**: Adopt the "unlearning" technique proposed by Yao et al., which gradually adjusts the model parameters to increase the model's loss on the target dataset while keeping its performance on other datasets unchanged.
   - **Evaluation Framework**: Compare the model's performance on a series of citizen-query tasks before and after ablation, focusing on structural errors (Type 1) and knowledge errors (Type 2).
2. **Information Leakage Study**:
   - **Objective**: Determine whether LLMs can recall the datasets on data.gov.uk.
   - **Method**: Design specific prompts to test whether LLMs can correctly answer questions related to data.gov.uk data.
   - **Evaluation Criteria**: Measure accuracy by comparing the model's answers with the actual data.

### Key Findings

- **Ablation Study**:
  - After ablation, all models showed a significant increase in knowledge errors (Type 2), indicating that government websites are crucial for LLMs' knowledge acquisition.
  - The performance differences among models suggest that some models may be more dependent on specific types of data sources.
- **Information Leakage Study**:
  - LLMs perform poorly in recalling the data on data.gov.uk, indicating that these data are not widely used for training LLMs.

### Conclusion

This research provides an in-depth understanding of the UK government's role as an AI data provider and proposes a reproducible method for evaluating the role of other organizations and datasets in AI training corpora. This is significant for formulating AI policy and maximizing data contributions.
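
The unlearning step summarised in the Method Overview (raise the loss on the target data while holding performance on other data steady) can be illustrated with a toy model. The sketch below is not the authors' implementation or the exact procedure of Yao et al.; it swaps the LLM for a two-weight logistic classifier and assumes a simple combined objective, "ascend on the forget-set loss, descend on the retain-set loss". The names `unlearn`, `retain_weight`, and the tiny datasets are hypothetical and exist only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss_and_grad(w, data):
    """Mean binary cross-entropy of a logistic model and its gradient w.r.t. w."""
    loss = 0.0
    grad = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi
    n = len(data)
    return loss / n, [g / n for g in grad]


def unlearn(w, forget, retain, lr=0.1, retain_weight=1.0, steps=50):
    """Assumed objective: minimise (-loss_forget + retain_weight * loss_retain).

    Each step moves the weights up the forget-set gradient (ascent) and
    down the retain-set gradient (descent).
    """
    w = list(w)
    for _ in range(steps):
        _, g_forget = mean_loss_and_grad(w, forget)
        _, g_retain = mean_loss_and_grad(w, retain)
        w = [
            wi + lr * (gf - retain_weight * gr)
            for wi, gf, gr in zip(w, g_forget, g_retain)
        ]
    return w


# Hypothetical data: the forget set depends only on feature 1,
# the retain set only on feature 0.
forget_set = [([0.0, 1.0], 1), ([0.0, -1.0], 0)]
retain_set = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
w_before = [2.0, 2.0]  # stands in for a model that has learned both sets
w_after = unlearn(w_before, forget_set, retain_set)

print("forget loss:", mean_loss_and_grad(w_before, forget_set)[0],
      "->", mean_loss_and_grad(w_after, forget_set)[0])
print("retain loss:", mean_loss_and_grad(w_before, retain_set)[0],
      "->", mean_loss_and_grad(w_after, retain_set)[0])
```

In this toy setting the forget-set loss rises while the retain-set loss does not, which mirrors the before-and-after comparison the ablation study relies on. With a real LLM the two sets overlap in what they teach the model, so the trade-off between forgetting and retaining is far less clean.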
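
The evaluation criterion for the information leakage study (compare each model answer with the actual published value) might be scored along the following lines. The summary does not state the matching rule the authors used, so the 1% relative tolerance for numeric answers, the case-insensitive string match, and the names `is_match` and `recall_rate` are all assumptions made for this sketch; the sample prompts and values are invented.

```python
import math


def is_match(answer, truth, rel_tol=0.01):
    """Return True if a model answer matches the published value.

    Assumed rule: numeric values match within a relative tolerance;
    anything else must be equal as case-insensitive, trimmed text.
    """
    if answer is None:  # the model gave no usable answer
        return False
    try:
        return math.isclose(float(answer), float(truth), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return str(answer).strip().lower() == str(truth).strip().lower()


def recall_rate(responses, ground_truth, rel_tol=0.01):
    """Fraction of prompts whose response matches the dataset value."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    hits = sum(
        is_match(responses.get(prompt_id), value, rel_tol)
        for prompt_id, value in ground_truth.items()
    )
    return hits / len(ground_truth)


# Invented example: values keyed by a hypothetical prompt identifier.
ground_truth = {"q1": 1200, "q2": "Leeds", "q3": 3.5}
responses = {"q1": "1205", "q2": " leeds ", "q3": "7"}
print(recall_rate(responses, ground_truth))
```

A low recall rate across many prompts is the kind of signal the study reads as evidence that a dataset was not widely used in training; a stricter or looser tolerance would shift the rate, so the threshold should be reported alongside the result.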