Abstract: A long-standing challenge in analyzing information leaks within mobile apps is to automatically identify the code operating on sensitive data. Because all existing solutions rely on system APIs (e.g., for the IMEI or GPS location) or on features of user interfaces (UI), content from app servers, such as a user’s Facebook profile or payment history, falls through the cracks. Finding such content is important given that most apps today are web applications, whose critical data are often on the server side. Meanwhile, operations on such data within mobile apps are often hard to capture, since all server-side information is delivered to the app in the same way, sensitive or not. A unique observation of our research is that in modern apps, a program is essentially semantics-rich documentation carrying meaningful program elements, such as method names, variables, and constants, that reveal the sensitive data involved, even when the program is under moderate obfuscation. Leveraging this observation, we develop a novel semantics-driven solution for automatic discovery of sensitive user data, including data from the server side. Our approach utilizes natural language processing (NLP) to automatically locate the program elements (variables, methods, etc.) of interest, and then performs a learning-based program structure analysis to accurately identify those that indeed carry sensitive content. Using this new technique, we analyzed 445,668 popular apps, an unprecedented scale for this type of research. Our work brings to light the pervasiveness of information leaks and the channels through which the leaks happen, including unintentional over-sharing across libraries and aggressive data acquisition behaviors. Further, we found that many high-profile apps and libraries are involved in such leaks. Our findings contribute to a better understanding of the privacy risks in mobile apps and also highlight the importance of data protection in today’s software composition.
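To make the first step of the approach concrete, the following is a minimal sketch of how program element names could be tokenized and matched against privacy-related terms. It is an illustration only, not the authors' implementation: the keyword set, the function names `split_identifier` and `looks_sensitive`, and the simple exact-match rule are all assumptions introduced here, and the abstract's learning-based structure analysis is not modeled.

```python
import re

# Hypothetical keyword list chosen for illustration; the abstract does not
# specify the vocabulary used by the actual system.
SENSITIVE_KEYWORDS = {"password", "profile", "payment", "location", "email", "imei"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase word tokens."""
    tokens = []
    # First break on underscores and any other non-word characters.
    for part in re.split(r"[_\W]+", name):
        # Then break camelCase: acronym runs, capitalized words, digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens]


def looks_sensitive(name, keywords=SENSITIVE_KEYWORDS):
    """Return True if any token of the identifier is a sensitive keyword."""
    return any(tok in keywords for tok in split_identifier(name))


if __name__ == "__main__":
    for ident in ["getUserProfile", "payment_history", "a", "HTTPClient"]:
        print(ident, split_identifier(ident), looks_sensitive(ident))
```

A name-based filter like this would only nominate candidates. A heavily obfuscated identifier such as `a` carries no usable tokens, which is why the abstract limits its claim to moderate obfuscation and follows the NLP step with a structure analysis to confirm which candidates carry sensitive content.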

Automated Identification of Sensitive Data via Flexible User Requirements

Automated Identification of Sensitive Data from Implicit User Specification

Owner-Centric Protection of Unstructured Data on Smartphones

Taming Information-Stealing Smartphone Applications (On Android)

Identifying User-Input Privacy in Mobile Applications at a Large Scale

Sensitive data identification for multi‐category and multi‐scenario data

Detecting Passive Content Leaks and Pollution in Android Applications

Automated Android Application Permission Recommendation

An Application Programming Interface (API) Sensitive Data Identification Method Based on the Federated Large Language Model

Enhanced User Data Privacy with Pay-by-data Model

Privacy Requirements Patterns for Mobile Operating Systems

UiRef: analysis of sensitive user inputs in Android applications

UIPicker: User-Input Privacy Identification in Mobile Applications

Semantics-Aware Privacy Risk Assessment Using Self-Learning Weight Assignment for Mobile Apps

Characterizing Privacy Risks of Mobile Apps with Sensitivity Analysis

Finding Clues for Your Secrets: Semantics-Driven, Learning-Based Privacy Discovery in Mobile Apps

Method for identifying permission-irrelevant private data in Android application program

LeakSemantic: Identifying Abnormal Sensitive Network Transmissions in Mobile Applications

Mobile APP Personal Information Security Detection and Analysis

Automated Detection of Consistence Between App Behavior and Privacy Policy of Android Apps

Security Analysis for Android Applications Using Sensitive Path Identification