Abstract: Visual framing analysis is a key method in social sciences for determining common themes and concepts in a given discourse. To reduce manual effort, image clustering can significantly speed up the annotation process. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem (MP). Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, solely from provided local, pairwise probabilities of two images belonging to the same cluster. We discuss the efficacy of numerous embedding spaces to detect visual frames and show the superiority of the MP formulation over other clustering methods. To this end, we employ the climate change dataset ClimateTV, which contains images commonly used for visual frame analysis. For broad visual frames, DINOv2 is a suitable embedding space, while ConvNeXt V2 returns a larger number of clusters which capture fine-grained differences, e.g., speech and protest. Our insights into embedding space differences, in combination with a clustering that is optimal by definition, advance automated visual frame detection. Our code can be found at https://github.com/KathPra/MP4VisualFrameDetection.
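
For orientation, the link between pairwise probabilities and the multicut objective mentioned in the abstract can be written out as follows. This is the commonly used integer-programming form of the problem, given here as an assumption about the setup rather than as the paper's exact notation: $G = (V, E)$ is the image graph, $p_e$ is the probability that the two images joined by edge $e$ belong to the same cluster, and $y_e = 1$ means that edge $e$ is cut.

```latex
% Maximizing the posterior under independent edge probabilities
%   \prod_{e \in E} p_e^{\,1 - y_e} (1 - p_e)^{\,y_e}
% is equivalent to minimizing a sum of log-odds costs over cut edges:
\min_{y \in \{0,1\}^{E}} \; \sum_{e \in E} c_e \, y_e ,
\qquad c_e = \log \frac{p_e}{1 - p_e}

% Cycle constraints ensure that y describes a valid clustering:
% no cycle may contain exactly one cut edge.
\text{s.t.} \quad y_e \le \sum_{e' \in C \setminus \{e\}} y_{e'}
\qquad \forall\, C \in \operatorname{cycles}(G),\ \forall\, e \in C
```

Edges with $p_e > 0.5$ have positive cost, so cutting them is penalized, while edges with $p_e < 0.5$ have negative cost, so cutting them is rewarded. The number of clusters is not fixed in advance; it follows from the solution.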

What problem does this paper attempt to address?

The paper addresses the challenges of automated visual frame detection, in particular clustering images to extract common themes and concepts in social science research. The authors aim to reduce the manual annotation workload and improve the accuracy of automated detection by formulating the clustering task as the Minimum Cost Multicut Problem (MP).

### Main problems of the paper

1. **Challenges in automated visual frame detection**:
   - Visual frame analysis plays an important role in social science research, but most existing methods rely on manual annotation, which is time-consuming and prone to introducing bias.
   - Automatically detecting abstract and diverse concepts remains challenging and is an active research area.
2. **Limitations of existing methods**:
   - Traditional clustering methods (such as k-means and DBSCAN) have limited effectiveness on complex and diverse datasets.
   - Pre-defined frames may limit the discovery of new frames, which biases the results.
3. **Proposed new method**:
   - The authors formulate the image clustering task as the Minimum Cost Multicut Problem (MP) and represent image features using the embedding spaces of multiple vision models and vision-language models.
   - By maximizing the posterior probability, this method can obtain the optimal clustering without hyperparameters.

### Solutions

- **Minimum Cost Multicut Problem (MP)**: Images are mapped to the nodes of a fully connected graph whose edge weights represent the cosine similarity between image embeddings. Solving the MP yields the clustering with the minimum cut cost.
- **Selection of embedding space**: The embedding spaces of multiple powerful visual foundation models (such as DINOv2, ConvNeXt V2, and CLIP) are compared, and their effectiveness in detecting visual frames is evaluated.
- **Calibration term**: A calibration term adjusts the decision boundary of each embedding space to keep the clustering results accurate and consistent.

### Experimental verification

- **Dataset**: Experiments use multiple datasets, including ImageNette, ImageWoof, and ClimateTV, to verify the effectiveness of the method.
- **Performance evaluation**: The quality of the clustering results is evaluated with metrics such as variation of information (VI) and conditional entropy, and compared with traditional clustering methods.

### Conclusion

By formulating the clustering task as the Minimum Cost Multicut Problem and combining it with the embedding spaces of multiple vision models, the authors improve the accuracy and efficiency of automated visual frame detection. The method reduces the manual annotation workload and can also discover new visual frames, which supports social science research.
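
The multicut pipeline summarized above (cosine-similarity graph, calibrated edge costs, clustering by cutting edges) can be illustrated with a minimal sketch. It is not the authors' implementation: the mapping from similarity to a pseudo-probability, the `calibration` parameter, and all function names are assumptions made for illustration, and the solver is a simple greedy edge-contraction heuristic rather than an exact multicut solver.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def edge_costs(similarity, calibration=0.5, eps=1e-6):
    """Signed edge costs: positive favors joining two images, negative favors cutting.

    Assumption: similarity s is mapped to a pseudo-probability p = (s + 1) / 2.
    The hypothetical `calibration` value is the probability at which the cost
    changes sign, i.e. it shifts the decision boundary.
    """
    p = np.clip((similarity + 1.0) / 2.0, eps, 1.0 - eps)
    b = np.clip(calibration, eps, 1.0 - eps)
    return np.log(p / (1.0 - p)) - np.log(b / (1.0 - b))


def greedy_additive_edge_contraction(costs):
    """Heuristic multicut: repeatedly merge the two clusters joined by the
    largest positive cost, summing parallel edges, until no positive cost remains."""
    n = costs.shape[0]
    w = costs.astype(float).copy()
    np.fill_diagonal(w, -np.inf)          # never merge a cluster with itself
    labels = np.arange(n)                 # each image starts in its own cluster
    active = np.ones(n, dtype=bool)       # cluster representatives still in play
    while True:
        mask = active[:, None] & active[None, :]
        masked = np.where(mask, w, -np.inf)
        i, j = np.unravel_index(np.argmax(masked), masked.shape)
        if masked[i, j] <= 0:
            break                         # only repulsive edges are left
        # Contract cluster j into cluster i; costs to other clusters add up.
        w[i, :] += w[j, :]
        w[:, i] += w[:, j]
        w[i, i] = -np.inf
        active[j] = False
        labels[labels == j] = i
    # Relabel clusters as 0..k-1 in order of their representative index.
    _, labels = np.unique(labels, return_inverse=True)
    return labels


# Toy example: two groups of 2-D "embeddings" pointing in opposite directions.
toy = np.array([[1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1]])
toy_labels = greedy_additive_edge_contraction(edge_costs(cosine_similarity_matrix(toy)))
print(toy_labels)
```

Raising `calibration` makes more edges repulsive and tends to produce more, smaller clusters; lowering it has the opposite effect. The number of clusters is never specified directly.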

I Spy With My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames

New Fusional Framework Combining Sparse Selection and Clustering for Key Frame Extraction.

A Novel Compact Yet Rich Key Frame Creation Method for Compressed Video Summarization

FrameFinder: Explorative Multi-Perspective Framing Extraction from News Headlines

Video abstraction based on the visual attention model and online clustering

A novel video abstraction method based on fast clustering of the regions of interest in key frames

Video Abstraction via Attention Model and On-Line Clustering

OpenFraming - Open-sourced Tool for Computational Framing Analysis of Multilingual Data.

Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage

Toward effective image forensics via a novel computationally efficient framework and a new image splice dataset

Towards Effective Image Forensics via A Novel Computationally Efficient Framework and A New Image Splice Dataset

An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture

Hyperspectral Video Analysis by Motion and Intensity Preprocessing and Subspace Autoencoding

Semi-supervised and Deep learning Frameworks for Video Classification and Key-frame Identification

IVS3D: An Open Source Framework for Intelligent Video Sampling and Preprocessing to Facilitate 3D Reconstruction

Conditional deep clustering based transformed spatio-temporal features and fused distance for efficient video retrieval

Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows

Understanding Compositional Structures in Art Historical Images Using Pose and Gaze Priors

A General Framework for Comparing Embedding Visualizations Across Class-Label Hierarchies

The Bare Necessities: Designing Simple, Effective Open-Vocabulary Scene Graphs

Intelligent Frame Selection as a Privacy-Friendlier Alternative to Face Recognition