A Privacy Preserving Markov Model for Sequence Classification.

Suxin Guo, Sheng Zhong, Aidong Zhang
DOI: https://doi.org/10.1145/2506583.2506636
2013-01-01
Abstract: Sequence classification has attracted much interest in recent years because it differs from traditional classification tasks and has wide applications in many fields, such as bioinformatics. Because it is not easy to define specific "features" for sequence data as in traditional feature-based classification, many methods have been developed to exploit the particular characteristics of sequences. One common way of classifying sequence data is to use probabilistic generative models, such as the Markov model, to learn the probability distribution of sequences in each class. An issue that must be considered in sequence classification research is privacy. In many cases, especially in bioinformatics, the sequence data contains sensitive information, which obstructs data mining. For example, the DNA and protein sequences of individuals are highly sensitive and should not be released without protection. In the real world, however, data is usually distributed among different parties, and a party that trains only on its own data may not obtain a strong enough model. This raises a problem when several parties, each holding a set of sequences, want to learn Markov models on the union of their data but do not want to reveal their data to the others because of privacy concerns. In this paper, we address this problem and propose a method to train Markov models, from first-order models to models of order k where k > 1, on sequence data distributed among parties without revealing each party's private sequences to the others. We apply homomorphic encryption to protect the sensitive information.
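To make the underlying computation concrete, the following is a minimal plaintext sketch of order-k Markov model classification over data held by several parties. It is not the paper's protocol: the abstract does not describe the protocol's details, so everything here is an assumption for illustration, including the function names (`count_transitions`, `merge_counts`, `log_likelihood`, `classify`), the add-one smoothing, and the choice to ignore the initial-state distribution. The sketch assumes that each party can summarize its sequences as transition counts and that the joint model depends only on the sum of those counts; in a privacy-preserving setting, that summation is the step one would expect to protect, for example with an additively homomorphic scheme.

```python
from collections import Counter
import math


def count_transitions(sequences, k):
    """Count (context, next_symbol) pairs for an order-k Markov model."""
    counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[(seq[i - k:i], seq[i])] += 1
    return counts


def merge_counts(per_party_counts):
    """Sum the per-party counts in the clear.

    Assumption: this plaintext sum stands in for the step that a
    privacy-preserving protocol would carry out on encrypted counts.
    """
    total = Counter()
    for counts in per_party_counts:
        total.update(counts)
    return total


def log_likelihood(seq, counts, k, alphabet, alpha=1.0):
    """Log-probability of seq's transitions, with add-alpha smoothing.

    The first k symbols are not scored (initial distribution ignored).
    """
    context_totals = Counter()
    for (context, _), n in counts.items():
        context_totals[context] += n
    ll = 0.0
    for i in range(k, len(seq)):
        context, symbol = seq[i - k:i], seq[i]
        numerator = counts[(context, symbol)] + alpha
        denominator = context_totals[context] + alpha * len(alphabet)
        ll += math.log(numerator / denominator)
    return ll


def classify(seq, class_counts, k, alphabet):
    """Return the class whose Markov model gives seq the highest likelihood."""
    return max(
        class_counts,
        key=lambda c: log_likelihood(seq, class_counts[c], k, alphabet),
    )


# Toy usage: two parties, two classes, first-order model (k = 1).
party_a = {"pos": ["ACACAC"], "neg": ["GGTTGG"]}
party_b = {"pos": ["CACACA"], "neg": ["TTGGTT"]}
k = 1
joint = {
    label: merge_counts(
        [count_transitions(party[label], k) for party in (party_a, party_b)]
    )
    for label in ("pos", "neg")
}
print(classify("ACAC", joint, k, alphabet="ACGT"))  # pos
```

Because the model for each class is determined entirely by summed counts, no party's raw sequences are needed once the counts are combined; how to combine them without revealing the individual counts is the problem the paper addresses.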