Peeking into the Gray Area of Mobile World - an Empirical Study of Unlabeled Android Apps.
Sen Chen, Lingling Fan, Cuiyun Gao, Fu Song, Yang Liu
DOI: https://doi.org/10.1109/issre52982.2021.00065
2021-01-01
Abstract: The real-world dataset collected by our industrial partner, Pwnzen Infotech Inc., one of the leading industrial security companies, contains a large number of unlabeled Android applications (called unlabeled apps in this paper) that, according to the industrial black-list (i.e., signatures) and white-list (i.e., certificates), are unlikely to belong to either known Android malware families or ordinary benign apps. Such apps have rarely been studied, yet they are important for peeking into the gray area of the mobile world. Understanding the negative characteristics of these samples is a time-consuming task for software analysts, and these characteristics can lead to security or privacy threats for app users, significant negative impacts on mobile system performance, and bad user experience. To investigate the characteristics of these industrial unlabeled apps at large scale in practice, and to provide insights to industrial software analysts as well as research communities, we collect a large-scale dataset of unlabeled apps (22,886 in total) from our industrial partners. Given the common perception among industrial software analysts that a high percentage of these unlabeled apps could share similar behaviors, we leverage popular community-detection techniques, based on app features widely used in malware detection, to cluster these unlabeled apps. We then investigate the common behaviors of the different clusters with substantial human effort and cross-validate the results among co-authors. Our manual analysis, which samples apps from different clusters, unveils the characteristics of these unlabeled apps and discovers 11 categories, some of which have never been discovered by previous grayware research. Our exploration also finds that community-based techniques are not effective enough at clustering unlabeled apps, so manual analysis is encouraged as an important first step towards studying unlabeled apps and understanding their characteristics. Finally, we highlight the lessons learned through real case studies, a comparison with existing malware/grayware research, in-depth discussion with industrial partners, and feedback from industrial partners.
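
To make the clustering step mentioned in the abstract concrete, the following is a minimal sketch of one possible community-detection approach, not the authors' actual pipeline. It assumes each app is represented by a set of requested Android permissions, links apps whose Jaccard similarity reaches a threshold, and then runs a simple deterministic label propagation over the resulting graph. The app names, the feature choice, the 0.5 threshold, and the helper names `jaccard`, `build_graph`, and `label_propagation` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(features, threshold):
    """Link two apps when their feature sets are similar enough."""
    graph = {app: set() for app in features}
    for a, b in combinations(sorted(features), 2):
        if jaccard(features[a], features[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def label_propagation(graph, max_rounds=20):
    """Asynchronous label propagation with deterministic tie-breaking.

    Every node starts in its own community and repeatedly adopts the
    most frequent label among its neighbors (smallest label on ties).
    """
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated apps keep their own label
            counts = Counter(labels[n] for n in graph[node])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy data: app name -> set of requested permissions.
features = {
    "app_a": {"SEND_SMS", "READ_CONTACTS", "INTERNET"},
    "app_b": {"SEND_SMS", "READ_CONTACTS", "INTERNET", "READ_SMS"},
    "app_c": {"SEND_SMS", "READ_CONTACTS", "READ_SMS"},
    "app_d": {"CAMERA", "RECORD_AUDIO"},
    "app_e": {"CAMERA", "RECORD_AUDIO", "INTERNET"},
    "app_f": {"VIBRATE"},
}

graph = build_graph(features, threshold=0.5)  # threshold is an assumption
labels = label_propagation(graph)
```

On this toy input, `app_a`, `app_b`, and `app_c` end up sharing one label, `app_d` and `app_e` share another, and `app_f` stays alone. A sketch like this only groups apps by feature overlap; as the abstract notes, such clusters still require manual inspection to determine what behavior the apps actually have in common.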