Abstract: Unsupervised learning has become an essential building block of AI systems. The representations it produces, e.g. in foundation models, are critical to a wide variety of downstream applications. It is therefore important to carefully examine unsupervised models to ensure not only that they produce accurate predictions, but also that these predictions are not "right for the wrong reasons", the so-called Clever Hans (CH) effect. Using specially developed Explainable AI techniques, we show for the first time that CH effects are widespread in unsupervised learning. Our empirical findings are enriched by theoretical insights, which interestingly point to inductive biases in the unsupervised learning machine as a primary source of CH effects. Overall, our work sheds light on unexplored risks associated with practical applications of unsupervised learning and suggests ways to make unsupervised learning more robust.

What problem does this paper attempt to address?

This paper addresses the widespread "Clever Hans" (CH) effect in unsupervised learning. When unsupervised models generate representations or detect anomalies, they may reach correct predictions by relying on data-quality artifacts (such as irrelevant text annotations in images or high-frequency noise), so the predictions are right for the wrong reasons. The name comes from the famous historical case of Clever Hans, a horse that seemed able to perform mathematical calculations but was actually responding to subtle cues from its trainer. Using specially developed Explainable AI (XAI) techniques, the paper shows for the first time that the CH effect is widespread in unsupervised learning and, drawing on theoretical insights, identifies the inductive bias of the unsupervised learning machine as a primary source of the effect. These findings expose risks of unsupervised learning in practical applications and suggest ways to make it more robust.

### Main research contents

1. **Identifying and understanding the CH effect in unsupervised learning**:
   - Use the BiLRP technique to analyze how unsupervised models build their representations from input features.
   - Combine it with the virtual-layer technique to reveal the roles of pixels and frequency components in predicting anomalies.
2. **Experimental verification**:
   - In the representation-learning task, apply the PubMedCLIP model to COVID-19 X-ray scans and find that the model judges similarity based on text annotations in the images rather than on actual medical features.
   - On the ImageNet dataset, use the CLIP, SimCLR, and BarlowTwins models for classification tasks and find that these models rely on artifacts in the images (such as logos and humans in the background) to judge similarity.
   - In the anomaly-detection task, use the MVTec-AD dataset for industrial inspection and find that the D2Neighbors model over-relies on high-frequency components to detect anomalies.
3. **Methods to mitigate the CH effect**:
   - Prune the feature maps in the early layers of the CLIP model that respond most strongly to text.
   - Insert a blurring layer at the input of the anomaly-detection model to reduce its dependence on high-frequency components.

### Experimental results

- **Representation learning**:
  - In the COVID-19 classification task, the false-positive rate of the model without CH mitigation is as high as 40% on the GitHub subset; after CH mitigation, performance improves significantly.
  - In the ImageNet classification task, the accuracy of the CLIP model is 85.0% on the original data but drops to 80.5% on data with inserted logos. After CH mitigation, performance recovers to a level close to the original.
- **Anomaly detection**:
  - The F1 score of the D2Neighbors model is 0.92 on the original data but drops to 0.80 under deployment conditions in which an anti-aliasing algorithm is introduced. After CH mitigation, it improves to 0.93.
  - The more complex PatchCore model shows a similar trend: after CH mitigation, its performance improves from 0.85 to 0.96.

### Discussion

The paper emphasizes the importance of identifying and mitigating the CH effect in unsupervised learning. Unlike in traditional supervised learning, the CH effect in unsupervised learning stems more from the inductive bias of models and learning algorithms than from the data itself. Using advanced XAI techniques, the paper both reveals the formal reasons for these unstable behaviors and shows how to systematically improve model performance on difficult data subsets or under deployment conditions by pruning high-frequency or erroneously amplified features. In conclusion, the paper provides theoretical and practical guidance for the reliable application of unsupervised learning and points out directions for future research.
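To make the blurring-layer mitigation from the summary more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy distance-to-nearest-neighbor anomaly score as a simplified stand-in for a detector such as D2Neighbors, synthetic 16x16 grayscale images, and a 5x5 Gaussian kernel; the function names (`gaussian_kernel`, `blur`, `anomaly_score`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Normalized 2-D Gaussian kernel (size is assumed to be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k1d = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k2d = np.outer(k1d, k1d)
    return k2d / k2d.sum()


def blur(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Low-pass filter a 2-D image; edge padding keeps the output shape."""
    pad = kernel.shape[0] // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    h, w = image.shape
    # Sum of shifted copies weighted by the (symmetric) kernel.
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def anomaly_score(x, train, kernel=None) -> float:
    """Squared distance to the nearest training image.

    If a kernel is given, the query and the training images are blurred
    first, which plays the role of the blurring layer at the input.
    """
    if kernel is not None:
        x = blur(x, kernel)
        train = [blur(t, kernel) for t in train]
    return min(float(np.sum((x - t) ** 2)) for t in train)


# Synthetic data (assumption): a smooth "normal" image, a copy with
# high-frequency checkerboard noise, and a copy with a genuine local defect.
ii, jj = np.indices((16, 16))
normal = np.outer(np.linspace(0.0, 1.0, 16), np.linspace(0.0, 1.0, 16))
noisy = normal + 0.1 * ((-1) ** (ii + jj))
defect = normal.copy()
defect[5:11, 5:11] += 0.5

kernel = gaussian_kernel()
train = [normal]

raw_noise = anomaly_score(noisy, train)
blurred_noise = anomaly_score(noisy, train, kernel)
blurred_defect = anomaly_score(defect, train, kernel)

print("noise score, no blur:   ", raw_noise)
print("noise score, with blur: ", blurred_noise)
print("defect score, with blur:", blurred_defect)
```

In this toy setting the blur suppresses most of the score contributed by the high-frequency pattern while the low-frequency defect still stands out. Whether the same holds for a real detector and real data depends on the kernel width and on how much of the genuine anomaly signal lives in the high frequencies.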

The Clever Hans Effect in Unsupervised Learning
