Health CLIP: Depression Rate Prediction Using Health Related Features in Satellite and Street View Images.

Tianjian Ouyang,Xin Zhang,Zhenyu Han,Yu Shang,Yong Li
DOI: https://doi.org/10.1145/3589335.3651451
2024-01-01
Abstract:Mental health is a state of mental well-being that enables people to cope with the stresses of life, realize their abilities, learn well and work well, and contribute to their community. It has intrinsic and instrumental value and is integral to our well-being, and its correlation with environmental factors has been a subject of growing interest. As the pressure of society keeps growing, depression has become a severe problem in modern cities, and finding a way to estimate depression rate is of significance to relieve the problem. In this study, we introduce a Contrastive Language-Image Pretraining (CLIP) based novel approach to predict mental health indicators, especially depression rate, through satellite and street view images. Our methodology uses state-of-the-art Multimodal Large Language Model (MLLM), GPT4-vision, to generate health related captions for satellite and street view images, then we use the generated image-text pairs to fine-tune the CLIP model, making its image encoder extract health related features such as green spaces, sports fields, and infrastructral characteristics. The fine-tuning process is employed to bridge the semantic gap between textual descriptions and visual representations, enabling a comprehensive analysis of geo-tagged images. Consequently, our methodology achieves a notable R2 value of 0.565 on prediction of depression rate in New York City with the combination of satellite and street view images. The successful deployment of Health CLIP in a real-world scenario underscores the practical applicability of our approach.
What problem does this paper attempt to address?