Classifying Partially Labeled Networked Data via Logistic Network Lasso

Nguyen Tran, Henrik Ambos, Alexander Jung
DOI: https://doi.org/10.48550/arXiv.1903.10926
2019-03-26
Abstract: We apply the network Lasso to classify partially labeled data points which are characterized by high-dimensional feature vectors. In order to learn an accurate classifier from limited amounts of labeled data, we borrow statistical strength, via an intrinsic network structure, across the dataset. The resulting logistic network Lasso amounts to a regularized empirical risk minimization problem using the total variation of a classifier as a regularizer. This minimization problem is a non-smooth convex optimization problem which we solve using a primal-dual splitting method. This method is appealing for big data applications as it can be implemented as a highly scalable message passing algorithm.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the classification of partially labeled networked data. Specifically, the authors focus on data points that are represented by high-dimensional feature vectors and related to one another through an intrinsic network structure. Because labeled data is limited, traditional classification methods may lack the statistical strength needed to learn an accurate classifier. The paper therefore proposes the **Logistic Network Lasso (lnLasso)**, which exploits the network structure of the data to improve classification performance.

### Specific description of the problem

1. **Partially labeled data**: In practical applications, obtaining a large amount of labeled data is usually expensive and time-consuming, so learning an effective classifier from a small amount of labeled data is an important challenge.
2. **High-dimensional feature vectors**: Each data point is represented by a high-dimensional feature vector, which makes the classification task more complex.
3. **Network structure**: The data points are linked by an intrinsic network structure (such as a social network or a literature citation network), and this structure can provide additional information for classification.

### Solutions

To address these challenges, the paper proposes the following:

- **Logistic Network Lasso (lnLasso)**: This method learns the classifier by regularized empirical risk minimization. The regularizer is the total variation (TV) of the classifier, which encourages the classifier to be approximately constant on well-connected subgraphs (clusters). The optimization problem is:

  \[ \hat{w} \in \arg\min_{w \in C} \hat{E}(w)+\lambda\|w\|_{TV} \]

  where:
  - \(\hat{E}(w)\) is the empirical risk, which measures the error of the classifier on the training set.
  - \(\|w\|_{TV}=\sum_{\{i, j\} \in E} A_{ij}\|w(j)-w(i)\|\) is the total variation regularization term, which measures how much the classifier differs between adjacent nodes.
  - \(\lambda\) is the regularization parameter, which balances the empirical risk against the regularization term.
- **Large-scale scalability**: To handle large data sets, the paper solves the optimization problem with an algorithm based on a primal-dual splitting method. The algorithm can be implemented as message passing over the network structure and therefore scales well.

### Main contributions

1. **Novel implementation method**: An efficient solver for the Logistic Network Lasso is obtained by applying a primal-dual method.
2. **Convergence proof**: The convergence of the proposed primal-dual method is proved.
3. **Experimental verification**: The effectiveness of the method is verified on data sets with chain-like and grid-like structures.

In conclusion, the paper proposes an efficient classification method that combines network structure with partially labeled data to cope with high-dimensional features and limited labels.
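
To make the objective \(\hat{E}(w)+\lambda\|w\|_{TV}\) concrete, the following is a minimal NumPy sketch that evaluates it on a small chain graph. It is not the authors' code. Several details are assumptions, since the summary does not spell them out: the empirical risk is taken to be the average logistic loss over the labeled nodes with labels in {-1, +1}, the norm in the TV term is the Euclidean norm, and all function and variable names (`total_variation`, `logistic_empirical_risk`, `lnlasso_objective`, the toy data) are hypothetical. The sketch only evaluates the objective; it does not implement the primal-dual solver.

```python
import numpy as np

def total_variation(w, edges, weights):
    """Sum over edges {i, j} of A_ij * ||w[j] - w[i]|| (Euclidean norm assumed)."""
    w = np.asarray(w, dtype=float)
    return float(sum(a * np.linalg.norm(w[j] - w[i])
                     for (i, j), a in zip(edges, weights)))

def logistic_empirical_risk(w, x, y, labeled):
    """Average logistic loss over labeled nodes (assumed form, labels in {-1, +1})."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.asarray(labeled)
    # Each node i has its own weight vector w[i]; the margin is y_i * <w[i], x[i]>.
    margins = y[idx] * np.sum(w[idx] * x[idx], axis=1)
    # log(1 + exp(-margin)), computed in a numerically stable way.
    return float(np.mean(np.logaddexp(0.0, -margins)))

def lnlasso_objective(w, x, y, labeled, edges, weights, lam):
    """Empirical risk plus lam times the total variation of w."""
    return (logistic_empirical_risk(w, x, y, labeled)
            + lam * total_variation(w, edges, weights))

# Toy chain graph 0 - 1 - 2 - 3 with unit edge weights (illustrative data only).
edges = [(0, 1), (1, 2), (2, 3)]
weights = [1.0, 1.0, 1.0]
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([1.0, 0.0, -1.0, 0.0])   # entries for unlabeled nodes are ignored
labeled = [0, 2]                      # only nodes 0 and 2 carry labels

w_const = np.zeros((4, 2))            # constant classifier: zero total variation
w_jump = np.array([[3.0, 4.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]])

print(total_variation(w_const, edges, weights))
print(lnlasso_objective(w_jump, x, y, labeled, edges, weights, lam=0.1))
```

A classifier that is constant across all nodes pays no TV penalty, while `w_jump` pays only for the single edge where its weight vector changes. This is the mechanism that favors classifiers that are piecewise constant over clusters.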