The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

Shachar Don-Yehiya, Leshem Choshen, Omri Abend
2024-08-16
Abstract: Human-model conversations provide a window into users' real-world scenarios, behavior, and needs, and thus are a valuable resource for model development and research. While for-profit companies collect user data through the APIs of their models, using it internally to improve their own models, the open source and research community lags behind. We introduce the ShareLM collection, a unified set of human conversations with large language models, and its accompanying plugin, a Web extension for voluntarily contributing user-model conversations. Where few platforms share their chats, the ShareLM plugin adds this functionality, thus allowing users to share conversations from most platforms. The plugin allows the user to rate their conversations, both at the conversation and the response levels, and delete conversations they prefer to keep private before they ever leave the user's local storage. We release the plugin conversations as part of the ShareLM collection, and call for more community effort in the field of open human-model data. The code, plugin, and data are available.
Computation and Language
What problem does this paper attempt to address?
The paper mainly addresses the following issues:

1. **Promoting the collection and sharing of human-model dialogue data in the open-source community**: Currently, for-profit companies collect dialogue data between users and large language models (LLMs) through their model APIs and use it to improve their own models, while the open-source and research communities have made slower progress in this area. The paper proposes a solution to reduce this disparity.
2. **Establishing a unified dataset**: The authors collected existing human-model dialogue datasets and unified them into a single format, called the ShareLM collection. These datasets cover dialogues from different sources, including different models and users from various national backgrounds.
3. **Developing a plugin to continuously collect dialogue data**: To overcome the limitation of existing datasets as static collections, the paper introduces the ShareLM plugin, a Chrome extension that allows users to easily contribute their dialogue records with various models. This plugin supports multiple platforms, and users can rate dialogues or choose to delete sensitive dialogues to protect privacy.
4. **Enhancing user control over their data**: With the delayed upload feature, users can review and delete any dialogue records they do not wish to make public before the dialogues leave local storage. Additionally, users can view and manage their own data, including providing feedback.
5. **Increasing data diversity**: To address the issue of insufficient population coverage in LLM training data, the plugin encourages users to provide some demographic information (such as age, gender, and country), which helps improve the model's understanding of different groups.
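The delayed upload idea in point 4 can be illustrated with a minimal sketch. This is not the plugin's code: the class names, the `HOLD_SECONDS` value, and the method names are all assumptions made for illustration. It only shows the general pattern of holding conversations locally so they can be rated or deleted before anything is released for upload.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical hold period; the source does not state the plugin's actual delay.
HOLD_SECONDS = 24 * 60 * 60


@dataclass
class Conversation:
    conv_id: str
    messages: list[str]
    saved_at: float                 # time the chat was captured locally (seconds)
    rating: Optional[int] = None    # optional conversation-level rating


@dataclass
class LocalStore:
    """Illustrative stand-in for the user's local storage."""
    pending: dict[str, Conversation] = field(default_factory=dict)

    def save(self, conv: Conversation) -> None:
        self.pending[conv.conv_id] = conv

    def delete(self, conv_id: str) -> None:
        # A deleted conversation is never returned by due_for_upload().
        self.pending.pop(conv_id, None)

    def rate(self, conv_id: str, rating: int) -> None:
        self.pending[conv_id].rating = rating

    def due_for_upload(self, now: float) -> list[Conversation]:
        # Release only conversations whose hold period has elapsed.
        due = [c for c in self.pending.values()
               if now - c.saved_at >= HOLD_SECONDS]
        for conv in due:
            del self.pending[conv.conv_id]
        return due
```

In this sketch, a conversation removed with `delete` during the hold period is simply never returned by `due_for_upload`, which mirrors the guarantee that private chats do not leave local storage.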
In summary, the paper aims to advance the development of language models by creating an open and dynamically growing dialogue dataset that serves the needs of the open-source community, and by strengthening user engagement and data control through the plugin's rating, delayed upload, and deletion features.