A Top-down Supervised Learning Approach to Hierarchical Multi-label Classification in Networks

Miguel Romero, Jorge Finke, Camilo Rocha
DOI: https://doi.org/10.1007/s41109-022-00445-3
2022-03-24
Abstract: Node classification is the task of inferring or predicting missing node attributes from information available for other nodes in a network. This paper presents a general prediction model for hierarchical multi-label classification (HMC), where the attributes to be inferred can be specified as a strict poset. It is based on a top-down classification approach that addresses hierarchical multi-label classification with supervised learning by building a local classifier per class. The proposed model is showcased with a case study on the prediction of gene functions for Oryza sativa Japonica, a variety of rice. It is compared to the Hierarchical Binomial-Neighborhood, a probabilistic model, by evaluating both approaches in terms of prediction performance and computational cost. The results in this work support the working hypothesis that the proposed model can achieve good levels of prediction efficiency, while scaling up in relation to the state of the art.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses node classification in networks, in particular multi-label classification when the classes have a hierarchical structure. Specifically, it focuses on **Hierarchical Multi-label Classification (HMC)**, where each node can belong to multiple classes simultaneously and the classes are related by a strict partial order (that is, they form a directed acyclic graph, DAG). The authors propose a top-down supervised learning method to meet this challenge.

#### Main problem description:

1. **Node classification problem**:
   - **Definition**: Infer or predict missing node attributes from the information available for other nodes in the network.
   - **Background**: Most existing techniques classify each class independently, ignoring the relationships between classes, which can lead to inconsistent predictions.
2. **Hierarchical multi-label classification problem**:
   - **Definition**: Given the network and the class hierarchy, predict the associations between nodes and classes so that the predictions conform to the "true-path rule": if a node is predicted to belong to a class, it must also be predicted to belong to all of that class's ancestors.
   - **Challenge**: Existing methods either ignore the hierarchical relationships between classes or are too computationally expensive to scale to large data sets.

#### Specific contributions of the paper:

- **Propose a new top-down supervised learning model**: A binary classifier is built for each class, and classification proceeds from the root class down to the leaf classes to keep the predictions consistent.
- **Introduce a correction mechanism**: Cumulative probabilities are used so that the predictions satisfy the true-path rule and inconsistent predictions are avoided.
- **Apply case study**: The effectiveness and computational efficiency of the model are evaluated on gene function prediction for the rice variety Oryza sativa Japonica.

#### Formula representation:

- **Strict Poset**: \[ (C, \prec) \] where \( C \) is the set of classes and \(\prec\) is a strict partial order on the classes, that is, a relation that is irreflexive, asymmetric, and transitive.
- **Cumulative probability calculation**: \[ P(v, c)=\prod_{a \in \text{ancestors}(c)} P(v, a) \] where \( P(v, c) \) is the probability that node \( v \) belongs to class \( c \in C \), and \(\text{ancestors}(c)\) is the set of all ancestor classes of \( c \).

According to the authors, the results support the hypothesis that this approach achieves good prediction performance while reducing computational cost relative to the Hierarchical Binomial-Neighborhood model, which makes it applicable to larger data sets.
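The cumulative probability correction and the true-path rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a tree-shaped hierarchy in which each class has a single parent, reads the factors of the product as the outputs of the local per-class classifiers, and assumes the class's own local output is included in the product. The class names, the probability values, the 0.5 threshold, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy, stored as child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None, "A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B",
}

# Hypothetical outputs of the local binary classifiers for one node v.
LOCAL: Dict[str, float] = {
    "root": 1.0, "A": 0.9, "B": 0.3, "A1": 0.4, "A2": 0.8, "B1": 0.9,
}


def ancestors(cls: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the proper ancestors of cls, nearest ancestor first."""
    found = []
    current = parent[cls]
    while current is not None:
        found.append(current)
        current = parent[current]
    return found


def cumulative_probabilities(
    local: Dict[str, float], parent: Dict[str, Optional[str]]
) -> Dict[str, float]:
    """Multiply each class's local probability by those of all its ancestors."""
    cumulative = {}
    for cls, prob in local.items():
        for anc in ancestors(cls, parent):
            prob *= local[anc]
        cumulative[cls] = prob
    return cumulative


def predict(probs: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Select every class whose probability reaches the threshold."""
    return {cls for cls, p in probs.items() if p >= threshold}


def satisfies_true_path_rule(
    predicted: Set[str], parent: Dict[str, Optional[str]]
) -> bool:
    """Check that every predicted class has all of its ancestors predicted."""
    return all(
        anc in predicted for cls in predicted for anc in ancestors(cls, parent)
    )


# Thresholding the local outputs directly selects B1 (0.9) but not B (0.3).
naive = predict(LOCAL)
# Thresholding the cumulative values instead drops B1, since 0.9 * 0.3 < 0.5.
corrected = predict(cumulative_probabilities(LOCAL, PARENT))

print(sorted(naive), satisfies_true_path_rule(naive, PARENT))
print(sorted(corrected), satisfies_true_path_rule(corrected, PARENT))
```

Because every factor lies in [0, 1], the cumulative value of a class can never exceed that of any of its ancestors, so a single threshold on the cumulative values always yields a set of classes that satisfies the true-path rule. In this toy example the uncorrected local outputs do not.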