Future and AI-Ready Data Strategies: Response to DOC RFI on AI and Open Government Data Assets

Hamidah Oderinwale, Shayne Longpre
2024-07-26
Abstract: The following is a response to the US Department of Commerce's Request for Information (RFI) regarding AI and Open Government Data Assets. First, we commend the Department for its initiative in seeking public insights on the organization and sharing of data. To facilitate scientific discovery and advance AI development, it is crucial for all data producers, including the Department of Commerce and other governmental entities, to prioritize the quality of their data corpora. Ensuring data is accessible, scalable, and secure is essential for harnessing its full potential. In our response, we outline best practices and key considerations for AI and the Department of Commerce's Open Government Data Assets.
Computers and Society
What problem does this paper attempt to address?
This paper asks how government data assets can be managed and shared so that they better support artificial intelligence (AI) development and scientific discovery. It responds to the U.S. Department of Commerce's Request for Information (RFI) on AI and open government data assets, and raises the following issues and proposed solutions:

1. **Data Quality and Accessibility**
   - Ensure the quality, accessibility, and security of government data assets so that their full potential can be realized.
   - Publish data in machine-readable, open formats that are based on public standards and free of usage restrictions.
2. **Metadata Standardization**
   - Make metadata standards more consistent, especially across different systems and datasets.
   - Introduce timestamp metadata tags that record when a dataset was created, edited, and deleted, to support reproducible research.
3. **Labeling and Annotation**
   - Invest in building AI-ready datasets, especially high-quality labeled data.
   - Labels help automated AI or scientific analysis programs distinguish relevant from irrelevant records and identify potential biases, omissions, or errors in the data.
4. **Data Sharing Platforms**
   - Publish government data to public repositories (such as GitHub and Hugging Face) to improve its searchability and visibility.
   - These platforms make it easier for developers to find datasets that meet their needs.
5. **Special Data Customization Service**
   - Automate the request process and make it transparent by replacing manually submitted email requests with database forms.
   - Process data customization requests automatically and generate the requested datasets.
6. **Patent Data and Semantic Search**
   - Use structured metadata and semantic search techniques to improve the discovery and use of government datasets.
   - Semantic search helps users find relevant data more accurately than keyword matching alone.
7. **Benchmark Testing and Evaluation**
   - Develop benchmarks for policy- and government-specific knowledge to help ensure the reliability and accuracy of generative AI outputs.
   - Avoid data contamination problems and develop more powerful models.
8. **Data Format Optimization**
   - Improve the data formats used by federal agencies and the Department of Commerce: avoid formats from which content is hard to extract, such as PDF, and use machine-readable formats (such as JSON or XML) instead.
9. **Document Publication and Version Control**
   - Publish government documents on platforms that support DOIs and version control (such as PubPub) to ensure that documents are persistent and traceable.

By addressing these issues, the paper aims to help the government build a future-oriented, AI-ready data infrastructure that advances scientific research and AI development.
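
As an illustration of the timestamp metadata tags and machine-readable formats recommended above, the following is a minimal sketch of a dataset metadata record serialized to JSON. The `DatasetMetadata` class, its field names, and the sample values are hypothetical and chosen only for this example; they are not a schema proposed in the paper or used by any agency.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are assumptions, not a standard."""
    title: str
    file_format: str
    created_at: str                    # ISO 8601 timestamp, UTC
    modified_at: str                   # ISO 8601 timestamp, UTC
    deleted_at: Optional[str] = None   # stays None while the dataset is live


def utc_timestamp(year: int, month: int, day: int) -> str:
    """Return an ISO 8601 timestamp for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).isoformat()


def to_json(record: DatasetMetadata) -> str:
    """Serialize the record to JSON with a stable key order."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Sample values are invented for illustration only.
record = DatasetMetadata(
    title="Example dataset",
    file_format="JSON",
    created_at=utc_timestamp(2024, 1, 15),
    modified_at=utc_timestamp(2024, 7, 26),
)
serialized = to_json(record)
print(serialized)
```

Keeping creation, modification, and deletion times as explicit fields lets a researcher state exactly which version of a dataset an analysis used, which is the reproducibility benefit the summary attributes to timestamp tags.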