Privacy-Preserving Internet Traffic Publication

Longkun Guo,Hong Shen
DOI: https://doi.org/10.1109/trustcom.2016.0152
2016-01-01
Abstract: As machine learning (ML)-based traffic classification develops, Internet traffic data is published publicly to serve as test data. Although the IP addresses therein are anonymized, the data still states explicitly which records belong to the same user. Using this information, an adversary can identify a user among the anonymized users. The paper first gives a k-anonymity method that reduces the probability of information leak to $P/k$, where $P$ is the probability of information leak without k-anonymity. Assuming the number of flows belonging to an IP address follows a normal distribution, the information loss is shown to be $\frac{\mu^2 + \sigma^2}{k\mu^2 + \sigma^2}$, where $\mu$ and $\sigma$ are respectively the mean and the variance of the normal distribution. Random noise is then added to further reduce the probability of information leak to $P/k^2$, with an expected distortion rate of approximately $2^{d + \log k - \log |X|}$, where $d$ is the number of dimensions and $|X|$ is the number of vectors. Finally, real-world Internet traffic data is used to evaluate the utility of the anonymized traffic data. According to the experimental results, the k-anonymized, noise-added data can be clustered with an overall accuracy close to the state-of-the-art results for non-anonymized traffic data.
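To make the quantities in the abstract concrete, the following is a minimal Python sketch that only evaluates the stated closed-form expressions. It is not the authors' anonymization algorithm. It rests on several assumptions: the parenthesized numerals in the abstract are read as exponents (so the distortion rate is taken as 2 raised to d + log k - log |X|), the logarithms are taken to be base 2, and sigma is plugged into the formula exactly as the symbol appears. The function and parameter names are hypothetical.

```python
import math


def leak_probability(p, k, with_noise=False):
    """Leak probability after k-anonymity: P/k, or P/k^2 once noise is added."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return p / (k * k) if with_noise else p / k


def information_loss(mu, sigma, k):
    """Evaluate (mu^2 + sigma^2) / (k * mu^2 + sigma^2) as stated in the abstract."""
    return (mu ** 2 + sigma ** 2) / (k * mu ** 2 + sigma ** 2)


def expected_distortion(d, k, num_vectors):
    """Evaluate 2^(d + log k - log |X|); base-2 logarithms are an assumption."""
    return 2.0 ** (d + math.log2(k) - math.log2(num_vectors))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    p, k = 0.5, 5
    print("leak with k-anonymity:", leak_probability(p, k))
    print("leak with added noise:", leak_probability(p, k, with_noise=True))
    print("information loss:", information_loss(mu=3.0, sigma=4.0, k=k))
    print("expected distortion:", expected_distortion(d=3, k=k, num_vectors=1000))
```

Under the base-2 assumption the distortion expression simplifies to 2^d * k / |X|, so in this reading it grows linearly with k and shrinks as the number of vectors grows.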