Abstract: Human-model conversations provide a window into users' real-world scenarios, behavior, and needs, and thus are a valuable resource for model development and research. While for-profit companies collect user data through the APIs of their models, using it internally to improve their own models, the open source and research community lags behind. We introduce the ShareLM collection, a unified set of human conversations with large language models, and its accompanying plugin, a Web extension for voluntarily contributing user-model conversations. Where few platforms share their chats, the ShareLM plugin adds this functionality, thus allowing users to share conversations from most platforms. The plugin allows users to rate their conversations, at both the conversation and the response levels, and to delete conversations they prefer to keep private before they ever leave the user's local storage. We release the plugin conversations as part of the ShareLM collection, and call for more community effort in the field of open human-model data. The code, plugin, and data are available.

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

1. **Promoting the collection and sharing of human-model dialogue data in the open-source community**: Currently, for-profit companies collect dialogue data between users and large language models (LLMs) through their model APIs and use it to improve their own models, while the open-source and research communities have made slower progress in this area. The paper proposes a solution to balance this disparity.
2. **Establishing a unified dataset**: The authors collected existing human-model dialogue datasets and unified them into a single format, called the ShareLM collection. These datasets cover dialogues from different sources, including different models and users from various national backgrounds.
3. **Developing plugins to continuously collect dialogue data**: To overcome the limitation of existing datasets as static collections, the paper introduces the ShareLM plugin, a Chrome extension that allows users to easily contribute their dialogue records with various models. This plugin supports multiple platforms, and users can rate dialogues or choose to delete sensitive dialogues to protect privacy.
4. **Enhancing user control over their data**: With the delayed upload feature, users can review and delete any dialogue records they do not wish to make public before the dialogues leave local storage. Additionally, users can view and manage their own data, including providing feedback.
5. **Increasing data diversity**: To address the issue of insufficient population coverage in LLM training data, the plugin encourages users to provide some demographic information (such as age, gender, and country), which helps improve the model's understanding of different groups.

In summary, the paper aims to advance the development of language models by creating an open and dynamically growing dialogue dataset, particularly addressing the needs of the open-source community, and enhancing user engagement and data control through technical means.
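
The delayed upload idea described above can be illustrated with a small sketch. This is not the plugin's actual code (the plugin is a Chrome extension, and this summary does not describe its internals); it is a minimal Python model under stated assumptions. The class names `PendingConversation` and `LocalQueue`, the `delay_seconds` grace period, and the field names are all hypothetical. It only shows the general pattern: conversations wait in local storage, the user may delete them during that time, and only the conversations that remain after the delay are handed off for upload.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingConversation:
    """A conversation held locally before upload (hypothetical schema)."""
    conversation_id: str
    messages: List[str]
    saved_at: float                 # time (in seconds) the conversation was stored locally
    rating: Optional[int] = None    # optional conversation-level rating


@dataclass
class LocalQueue:
    """In-memory stand-in for local storage with a grace period before upload."""
    delay_seconds: float            # assumed grace period; the real value is not given here
    pending: Dict[str, PendingConversation] = field(default_factory=dict)

    def save(self, conv: PendingConversation) -> None:
        # Store the conversation locally; nothing is uploaded yet.
        self.pending[conv.conversation_id] = conv

    def delete(self, conversation_id: str) -> None:
        # The user removes a conversation before it ever leaves local storage.
        self.pending.pop(conversation_id, None)

    def due_for_upload(self, now: float) -> List[PendingConversation]:
        # Return (and remove) conversations whose grace period has elapsed.
        due = [c for c in self.pending.values()
               if now - c.saved_at >= self.delay_seconds]
        for c in due:
            del self.pending[c.conversation_id]
        return due
```

In this sketch, a conversation deleted during the grace period is never returned by `due_for_upload`, which mirrors the summary's point that users can remove records before they leave local storage.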

The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
