Vision-Language Interaction via Contrastive Learning for Surface Anomaly Detection in Consumer Electronics Manufacturing
Honghao Gao, Wangyang Jiang, Qionghuizi Ran, Ye Wang
DOI: https://doi.org/10.1109/tce.2024.3378771
2024-01-01
IEEE Transactions on Consumer Electronics
Abstract: As the consumer electronics industry develops, surface anomaly detection methods help identify products with surface flaws and defects and prevent them from entering the commodity market. However, anomalous product surfaces are both scarce and varied in type, which poses significant challenges to deep learning models: they struggle to capture the features associated with surface abnormalities, and surface anomaly detection becomes correspondingly harder. To address this problem, we propose a vision-language interaction method via contrastive learning for surface anomaly detection in the consumer electronics manufacturing domain. First, surface images of consumer electronics products are captured from surveillance video during the manufacturing process, and product specifications in text form are collected from production plans and manufacturing requirements. Both are encoded as embedding sequences and then transformed into feature vectors via a self-attention mechanism. A deep learning model learns visual information about a product's appearance from the images and learns language descriptions of its key features, such as shape, colour, and texture, from the texts. Second, a contrastive learning method lets the vision and language information interact in order to learn feature representations from images and texts. This approach fuses the text features and then replicates the fused features to match the number of image features, so that image-text feature pairs can be formed. The primary objective is to maximize the similarity between these pairs, thereby improving detection performance. Third, a similarity-based classification method identifies anomalous targets by calculating the cosine similarity between each pair of image and text features. If the similarity is greater than a given threshold, the product is assumed to be good; otherwise, it is considered anomalous. Finally, experiments conducted on public datasets such as CIFAR-10, MNIST, and Fashion-MNIST demonstrate the applicability of the proposed approach in the realm of consumer electronics manufacturing.
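The pairing and thresholding steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the text features are fused, so element-wise averaging is assumed here, and the function names (`mean_fuse`, `cosine_similarity`, `classify`), the toy two-dimensional feature vectors, and the threshold value of 0.5 are all hypothetical choices made for illustration. The encoders and the contrastive training objective are omitted entirely.

```python
import math


def mean_fuse(text_features):
    """Fuse text feature vectors by element-wise averaging (an assumed fusion rule)."""
    count = len(text_features)
    return [sum(column) / count for column in zip(*text_features)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(image_features, text_features, threshold=0.5):
    """Label each image feature as 'good' or 'anomalous' by thresholded similarity."""
    fused = mean_fuse(text_features)
    # Replicate the fused text feature once per image feature to form pairs.
    pairs = [(image, fused) for image in image_features]
    results = []
    for image, text in pairs:
        similarity = cosine_similarity(image, text)
        # Similarity above the threshold means the product is assumed to be good.
        label = "good" if similarity > threshold else "anomalous"
        results.append((label, similarity))
    return results


# Toy, hand-made feature vectors (hypothetical; real features would come from encoders).
image_features = [[1.0, 0.0], [0.0, 1.0]]
text_features = [[1.0, 0.0], [1.0, 0.2]]
labels = [label for label, _ in classify(image_features, text_features, threshold=0.5)]
```

With these toy inputs, the first image feature is closely aligned with the fused text feature and is labelled good, while the second is nearly orthogonal to it and is labelled anomalous. In practice the threshold would have to be tuned on validation data.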
telecommunications; engineering, electrical & electronic