Accuracy of training data and model outputs in Generative AI: CREATe Response to the Information Commissioner's Office Consultation

Zihao Li, Weiwei Yi, Jiahong Chen
2024-05-30
Abstract: The accuracy of Generative AI is increasingly critical as Large Language Models become more widely adopted. Due to potential flaws in training data and hallucination in outputs, inaccuracy can significantly impact individuals' interests by distorting perceptions and leading to decisions based on flawed information. Therefore, ensuring these models' accuracy is not only a technical necessity but also a regulatory imperative. The ICO's call for evidence on the accuracy of Generative AI marks a timely effort to ensure responsible Generative AI development and use. CREATe, the Centre for Regulation of the Creative Economy based at the University of Glasgow, has conducted relevant research involving intellectual property, competition, information and technology law. We welcome the ICO's call for evidence on the accuracy of Generative AI, and we are happy to highlight aspects of data protection law and AI regulation that we believe should receive attention.
Computers and Society, Emerging Technologies
What problem does this paper attempt to address?
This paper addresses the accuracy of training data and of model outputs in Generative AI (GenAI). As large language models (LLMs) become widely used, flaws in the training data and hallucination in the outputs can seriously distort information, which in turn affects individuals' interests and decision-making. Ensuring the accuracy of these models is therefore not only a technical necessity but also an urgent regulatory requirement. Specifically, the paper raises the following five key issues:

1. **Accuracy Paradox**: Disclosing the statistical accuracy rate of a generative AI model is not enough on its own and may lead to the "Accuracy Paradox": users rely too heavily on these indicators and stop verifying outputs, which increases risk rather than reducing it. Even a model with a high accuracy rate cannot guarantee 100% credibility, because LLMs only predict the probability of word occurrence and do not understand the content they process.
2. **The trade-off between accuracy and privacy**: Improving the accuracy of inputs, models, and outputs often comes at a cost to privacy. The cost includes not only the technical identifiability of individuals but also social risks such as more precise commercial targeting and social stratification. Developers should clearly state that they will not infringe other interests when improving accuracy.
3. **Over-reliance on the compliance of developers and deployers**: Relying too heavily on developers' accuracy compliance measures may ultimately place a burden on users. Developers may shift the responsibility for submitting accurate personal information onto users, forcing them to provide accurate and up-to-date personal data and to sacrifice privacy in order to meet the business needs of AI developers.
4. **Unforeseeable specific application scenarios**: The application scenarios of LLMs are complex and changeable, and downstream developers may build systems that depart from the original design purpose. Accuracy obligations for general-purpose models should therefore focus more on reviewing content and outputs to ensure that information is accurate and reliable.
5. **The accuracy of training data cannot be directly translated into the accuracy of output**: Although most training data is reliable and trustworthy, LLMs generate new answers by recombining that data and may disregard the authenticity and credibility of those answers. Focusing only on the accuracy of input data is therefore not enough; output verification and content review mechanisms also need attention.

In conclusion, the paper emphasizes the complexity and challenges of data accuracy and model output accuracy in generative AI and identifies several key issues that need further attention.