Ming Yu,Addie M. Thompson,Karthikeyan Natesan Ramamurthy,Eunho Yang,Aurélie C. Lozano
Abstract:Inferring predictive maps between multiple input and multiple output variables or tasks has innumerable applications in data science. Multi-task learning attempts to learn the maps to several output tasks simultaneously with information sharing between them. We propose a novel multi-task learning framework for sparse linear regression, where a full task hierarchy is automatically inferred from the data, with the assumption that the task parameters follow a hierarchical tree structure. The leaves of the tree are the parameters for individual tasks, and the root is the global model that approximates all the tasks. We apply the proposed approach to develop and evaluate: (a) predictive models of plant traits using large-scale and automated remote sensing data, and (b) GWAS methodologies mapping such derived phenotypes in lieu of hand-measured traits. We demonstrate the superior performance of our approach compared to other methods, as well as the usefulness of discovering hierarchical groupings between tasks. Our results suggest that richer genetic mapping can indeed be obtained from the remote sensing data. In addition, our discovered groupings reveal interesting insights from a plant science perspective.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to effectively share information and discover the hierarchical structure between tasks in multi - task learning. Specifically, the author proposes a new multi - task learning framework, which can automatically infer the complete hierarchical structure of task parameters from the data, assuming that these task parameters follow a hierarchical tree structure. The paper mainly focuses on the applications in plant trait prediction modeling and genome - wide association study (GWAS). By using large - scale automated remote - sensing data, it improves the performance of the prediction model and reveals the hierarchical grouping between tasks, thus obtaining a richer genetic mapping.
### Main problems
1. **Information sharing in multi - task learning**: Traditional multi - task learning methods usually assume that the same information is shared among all tasks, which may overlook the specificity of the relationships between tasks. The method proposed in this paper aims to solve how and with whom to share information between tasks.
2. **Hierarchical structure of task parameters**: The author assumes that task parameters follow a hierarchical tree structure. The leaf nodes of the tree are the parameters of each task, and the root node is a global model that approximates all tasks. This method can automatically learn the hierarchical relationships between tasks from the data.
3. **Challenges in application fields**: In plant trait prediction modeling and GWAS, how to use large - scale automated remote - sensing data to improve the performance of the prediction model and reveal the hierarchical grouping between tasks, so as to obtain a richer genetic mapping.
### Solutions
1. **Multi - task linear regression model**: The author proposes a regularized regression problem, which simultaneously estimates task parameters and the hierarchical relationships between tasks by minimizing the loss function.
2. **Convex clustering penalty term**: A penalty term similar to convex clustering is introduced to encourage the sharing between task parameters, thus forming a hierarchical tree structure of task parameters.
3. **Optimization algorithm**: The proximal decomposition method is adopted to efficiently solve the proposed convex optimization problem, and the numerical convergence of the algorithm is proved.
4. **Theoretical guarantee**: The asymptotic properties of the estimators are provided, and it is proved that under certain conditions, the joint limit distribution of the estimators is concentrated on certain specific lines, reflecting the sharing relationships between task parameters.
### Application effects
1. **Simulation data experiment**: Experiments were carried out on synthetic data sets, and the results show that the proposed method is superior to other baseline methods in prediction accuracy.
2. **Practical application**: Experiments were carried out using real data in plant trait prediction modeling and GWAS. The results show that the proposed method not only improves the prediction accuracy, but also reveals interesting and interpretable relationships between tasks.
In conclusion, by proposing a new multi - task learning framework, this paper solves the problems of information sharing and task hierarchical structure discovery in multi - task learning, and demonstrates its effectiveness and practicality in plant trait prediction modeling and GWAS.