Abstract: Visual framing analysis is a key method in social sciences for determining common themes and concepts in a given discourse. To reduce manual effort, image clustering can significantly speed up the annotation process. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem (MP). Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, solely from provided local, pairwise probabilities of two images belonging to the same cluster. We discuss the efficacy of numerous embedding spaces to detect visual frames and show its superiority over other clustering methods. To this end, we employ the climate change dataset ClimateTV which contains images commonly used for visual frame analysis. For broad visual frames, DINOv2 is a suitable embedding space, while ConvNeXt V2 returns a larger number of clusters which contain fine-grain differences, i.e. speech and protest. Our insights into embedding space differences in combination with the optimal clustering - by definition - advances automated visual frame detection. Our code can be found at https://github.com/KathPra/MP4VisualFrameDetection.
What problem does this paper attempt to address?
This paper addresses automated visual frame detection: clustering images so that common themes and concepts can be extracted for social science research. Specifically, the authors aim to reduce the manual annotation workload and to improve the accuracy of automated detection by formulating the clustering task as the Minimum Cost Multicut Problem (MP).
### Main problems of the paper
1. **Challenges in automated visual frame detection**:
- Visual frame analysis plays an important role in social science research, but most existing methods rely on manual annotation, which is time-consuming and prone to introducing bias.
- Automatically detecting abstract and diverse concepts remains a challenging task and is an active research area.
2. **Limitations of existing methods**:
- Traditional clustering methods (such as k-means and DBSCAN) have limited effectiveness when dealing with complex and diverse datasets.
- Predefined frames may limit the discovery of new frames, which biases the results.
3. **Proposed new method**:
- The authors formulate the image clustering task as the Minimum Cost Multicut Problem (MP) and represent image features with the embedding spaces of several vision models and vision-language models.
- Because a solution to the MP maximizes the posterior probability of the clustering given the pairwise probabilities, the method obtains the optimal clustering without hyperparameters such as a preset number of clusters.
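
For orientation, the MP is commonly written in the multicut literature as the integer program below. The notation is an assumption taken from that literature, not necessarily the paper's own: $G = (V, E)$ is the image graph, $y_e = 1$ means edge $e$ is cut (its two images land in different clusters), and $p_e$ is the local probability that the two images joined by $e$ belong to the same cluster.

$$
\min_{y \in \{0,1\}^{E}} \; \sum_{e \in E} c_e \, y_e
\qquad \text{s.t.} \qquad
y_e \le \sum_{e' \in C \setminus \{e\}} y_{e'}
\quad \forall\, C \in \operatorname{cycles}(G),\; \forall\, e \in C,
\qquad
c_e = \log \frac{p_e}{1 - p_e}.
$$

The cycle constraints forbid cutting exactly one edge of any cycle, so every feasible $y$ corresponds to a valid partition of the images. Cutting an edge with $p_e > 0.5$ incurs a positive cost and cutting one with $p_e < 0.5$ earns a negative one, which is why the number of clusters falls out of the optimization instead of being set beforehand.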
### Solutions
- **Minimum Cost Multicut Problem (MP)**: Map the images onto a fully connected graph whose edge weights represent the cosine similarity between image embeddings. Solving the MP yields the clustering with the minimum cut cost.
- **Selection of embedding space**: Compare the embedding spaces of multiple powerful visual foundation models (such as DINOv2, ConvNeXt V2, CLIP, etc.), and evaluate their effectiveness in detecting visual frames.
- **Calibration term**: Introduce a calibration term to adjust the decision boundaries of different embedding spaces to ensure the accuracy and consistency of clustering results.
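
The pipeline in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the edge cost is taken to be the cosine similarity minus a hypothetical `calibration` offset (the summary does not give the actual calibration formula), and the multicut is approximated with greedy additive edge contraction, a well-known heuristic that does not guarantee the optimal solution. The function names are illustrative.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T


def greedy_multicut(costs):
    """Greedy additive edge contraction on a complete graph.

    costs[i, j] > 0 rewards putting i and j in the same cluster,
    costs[i, j] < 0 rewards separating them. Heuristic, not exact.
    Returns one integer cluster label per node.
    """
    n = costs.shape[0]
    w = np.array(costs, dtype=float)
    np.fill_diagonal(w, -np.inf)          # no self-edges
    members = {i: [i] for i in range(n)}  # representative -> nodes

    while len(members) > 1:
        active = sorted(members)
        sub = w[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[a, b] <= 0:                # no merge lowers the cost
            break
        i, j = active[a], active[b]
        # Contract edge (i, j): parallel edges to other clusters add up.
        w[i, :] += w[j, :]
        w[:, i] += w[:, j]
        w[i, i] = -np.inf
        w[j, :] = -np.inf                 # retire j
        w[:, j] = -np.inf
        members[i].extend(members.pop(j))

    labels = np.empty(n, dtype=int)
    for label, rep in enumerate(sorted(members)):
        labels[members[rep]] = label
    return labels


def cluster_embeddings(embeddings, calibration=0.5):
    """Cluster embeddings; `calibration` is a hypothetical decision offset."""
    sim = cosine_similarity_matrix(embeddings)
    return greedy_multicut(sim - calibration)


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cluster_embeddings(toy))
```

In this sketch, raising `calibration` makes more edges repulsive and therefore yields more, smaller clusters; lowering it merges more images. That mirrors the idea that different embedding spaces need different decision boundaries, although the value used here is arbitrary.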
### Experimental verification
- **Dataset**: Use multiple datasets for experiments, including ImageNette, ImageWoof and ClimateTV, to verify the effectiveness of the method.
- **Performance evaluation**: Evaluate clustering quality with metrics such as variation of information (VI) and conditional entropy, and compare against traditional clustering methods.
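
As a reminder of what these metrics measure, the sketch below computes conditional entropy and VI from two label lists, using the standard definition VI(A, B) = H(A | B) + H(B | A). The function names are illustrative and entropies are in nats; the paper's evaluation code may differ in normalization or logarithm base.

```python
import math
from collections import Counter


def conditional_entropy(labels_a, labels_b):
    """H(A | B) in nats for two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))   # counts of (a, b) pairs
    marginal_b = Counter(labels_b)             # counts of b alone
    return -sum(
        (count / n) * math.log(count / marginal_b[b])
        for (_, b), count in joint.items()
    )


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A | B) + H(B | A); zero iff the partitions coincide."""
    return (conditional_entropy(labels_a, labels_b)
            + conditional_entropy(labels_b, labels_a))


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(variation_of_information(truth, [5, 5, 7, 7]))  # same partition
    print(variation_of_information(truth, [0, 1, 0, 1]))  # independent split
```

Lower values are better for both metrics. The two conditional entropies are also informative on their own: one grows when a predicted cluster mixes several ground-truth classes, the other when a class is split over several clusters.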
### Conclusion
By formulating the clustering task as the Minimum Cost Multicut Problem and comparing the embedding spaces of multiple vision models, the authors improve the accuracy and efficiency of automated visual frame detection. The method reduces the manual annotation workload and can also surface new visual frames, which supports social science research.