Moderator: Moderating Text-to-Image Diffusion Models through Fine-grained Context-based Policies

Peiran Wang, Qiyu Li, Longxuan Yu, Ziyao Wang, Ang Li, Haojian Jin
DOI: https://doi.org/10.1145/3658644.3690327
2024-09-12
Abstract: We present Moderator, a policy-based model management system that allows administrators to specify fine-grained content moderation policies and modify the weights of a text-to-image (TTI) model to make it significantly more challenging for users to produce images that violate the policies. In contrast to existing general-purpose model editing techniques, which unlearn concepts without considering the associated contexts, Moderator allows admins to specify what content should be moderated, under which context, how it should be moderated, and why moderation is necessary. Given a set of policies, Moderator first prompts the original model to generate images that need to be moderated, then uses these self-generated images to reverse fine-tune the model to compute task vectors for moderation and finally negates the original model with the task vectors to decrease its performance in generating moderated content. We evaluated Moderator with 14 participants to play the role of admins and found they could quickly learn and author policies to pass unit tests in approximately 2.29 policy iterations. Our experiment with 32 stable diffusion users suggested that Moderator can prevent 65% of users from generating moderated content under 15 attempts and require the remaining users an average of 8.3 times more attempts to generate undesired content.
Cryptography and Security
What problem does this paper attempt to address?
The main problem this paper addresses is: **how to moderate the content generated by text-to-image (TTI) diffusion models through fine-grained, context-based policies, so that it becomes much harder for users to generate policy-violating images**. Specifically, the paper focuses on the following aspects:

1. **Differences in content moderation requirements**: Different platforms, regions, and user groups have different requirements for content moderation. For example, some platforms prohibit weapon-related content, while others may only restrict false information under specific circumstances. These differences call for a flexible and customizable moderation method.
2. **The importance of context**: Merely deleting or replacing objects is not sufficient to moderate content effectively; the context also needs to be considered. For example, nudity may be allowed in a medical setting but may need to be moderated in other situations.
3. **Limitations of existing methods**: Existing content moderation methods, such as identifying and rejecting inappropriate text prompts or defining built-in negative prompts, are somewhat effective but cannot completely prevent users from generating inappropriate content. A new method is therefore needed to improve the effectiveness of moderation.

To solve these problems, the paper proposes a system named **Moderator**. Moderator achieves its goals in the following ways:

- **Policy-based model management**: Administrators can specify fine-grained content moderation policies, including the content to be moderated, the context in which it should be moderated, the way of moderation, and the reason for moderation.
- **Self-reverse Fine-tuning (SRFT)**: Moderator prompts the original model to generate the content to be moderated, uses this self-generated data to extract task vectors, and weakens the model's ability to generate violating content by subtracting these task vectors from the original weights.
- **Multi-dimensional moderation methods**: Moderator supports multiple moderation methods, such as removing or mosaicking the target content, or replacing the target content with alternative content.
- **Explicit purpose labeling**: Each policy includes the purpose of moderation, helping administrators clearly understand why moderation is necessary.

Through these methods, Moderator can significantly reduce the likelihood of users generating violating content, and it performed well in the reported experiments. For example, in a study with 32 Stable Diffusion users, Moderator prevented 65% of users from generating violating content within 15 attempts and required the remaining users to make an average of 8.3 times more attempts to generate inappropriate content.

In conclusion, this paper aims to provide an efficient and flexible content moderation solution by introducing the Moderator system to address content safety issues in text-to-image models.
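
The weight-editing step described above (extract a task vector, then subtract it from the original model) can be illustrated with a minimal sketch. This is generic task-vector negation under stated assumptions, not the paper's implementation: weights are represented as plain Python lists rather than tensors, and the names `task_vector`, `negate`, the `scale` coefficient, and the parameter name `unet.layer0` are hypothetical choices made for illustration.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
# A real TTI model would hold tensors; lists keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def task_vector(original: Weights, fine_tuned: Weights) -> Weights:
    """Per-parameter difference between fine-tuned and original weights."""
    return {
        name: [f - o for f, o in zip(fine_tuned[name], original[name])]
        for name in original
    }


def negate(original: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Subtract a scaled task vector from the original weights.

    `scale` is an assumed knob for how strongly the vector is applied.
    """
    return {
        name: [o - scale * v for o, v in zip(original[name], vector[name])]
        for name in original
    }


# Hypothetical weights before and after fine-tuning on self-generated images.
original = {"unet.layer0": [0.5, -1.0, 2.0]}
fine_tuned = {"unet.layer0": [0.75, -1.0, 1.5]}

tau = task_vector(original, fine_tuned)
moderated = negate(original, tau, scale=1.0)

print(tau)        # {'unet.layer0': [0.25, 0.0, -0.5]}
print(moderated)  # {'unet.layer0': [0.25, -1.0, 2.5]}
```

The edited weights move away from the fine-tuned weights along the same direction that fine-tuning moved toward them, which is the intuition behind using negation to reduce the model's ability to produce the moderated content.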