Accelerating ERM for data-driven algorithm design using output-sensitive techniques

Maria-Florina Balcan, Christopher Seiler, Dravyansh Sharma
2024-10-24
Abstract: Data-driven algorithm design is a promising, learning-based approach for beyond worst-case analysis of algorithms with tunable parameters. An important open problem is the design of computationally efficient data-driven algorithms for combinatorial algorithm families with multiple parameters. As one fixes the problem instance and varies the parameters, the "dual" loss function typically has a piecewise-decomposable structure, i.e. is well-behaved except at certain sharp transition boundaries. In this work we initiate the study of techniques to develop efficient ERM learning algorithms for data-driven algorithm design by enumerating the pieces of the sum dual loss functions for a collection of problem instances. The running time of our approach scales with the actual number of pieces that appear as opposed to worst case upper bounds on the number of pieces. Our approach involves two novel ingredients -- an output-sensitive algorithm for enumerating polytopes induced by a set of hyperplanes using tools from computational geometry, and an execution graph which compactly represents all the states the algorithm could attain for all possible parameter values. We illustrate our techniques by giving algorithms for pricing problems, linkage-based clustering and dynamic-programming based sequence alignment.
Data Structures and Algorithms, Machine Learning
What problem does this paper attempt to address?
The main problem this paper attempts to address is the design of computationally efficient data-driven algorithms, particularly for combinatorial algorithm families with multiple parameters. Specifically, the paper focuses on accelerating Empirical Risk Minimization (ERM) by enumerating the pieces of the loss function, so that the running time scales with the number of pieces that actually appear rather than with worst-case bounds.

### Background and Motivation

Traditional combinatorial algorithm design typically targets worst-case problem instances. In practice, however, an algorithm is often run on many related problem instances from the same domain rather than on the worst case. The data-driven algorithm design paradigm uses machine learning techniques to learn good algorithm parameters from multiple problem instances in the same domain, thereby achieving better performance in practical applications.

### Contributions of the Paper

1. **Output-Sensitive Cell Enumeration Algorithm**:
   - A computational geometry-based method is proposed for enumerating the cells induced by a set of hyperplanes. This method is applicable to any piecewise linear loss function and runs in time polynomial in the output size (i.e., the number of pieces).
   - By removing redundant constraints from each polyhedron and performing an implicit search of the adjacency graph of the polyhedra, the method achieves output-polynomial running time.
2. **Applications to Specific Problems**:
   - **Two-Part Tariff Pricing Problem**: An output-sensitive algorithm is proposed for enumerating the pieces of the total revenue as a function of the price parameters.
   - **Hierarchical Clustering Algorithm**: The execution tree method for single-parameter families is extended to multi-dimensional parameter families; the pieces are computed efficiently by tracking the convex polyhedral subdivision of the parameter space.
   - **Dynamic Programming Sequence Alignment**: The execution tree method is extended to an execution directed acyclic graph (DAG), which efficiently computes the polyhedra on which the solution to each subproblem is fixed and performs output-sensitive enumeration of the candidate hyperplanes within each region.

### Key Insights and Challenges

- **Local Sensitivity**: The algorithm leverages the local structure near each cell to reduce the number of candidate hyperplanes, thereby improving efficiency.
- **Output Sensitivity**: Only non-redundant hyperplanes are computed, ensuring output-sensitive time complexity.
- **Multi-Dimensional Parameter Families**: Effective methods for computing pieces are proposed for multi-dimensional parameter families, such as hierarchical clustering and dynamic programming sequence alignment.

### Related Work

- **Data-Driven Algorithm Design**: There has been extensive research on data-driven algorithm design, particularly in clustering, decision tree learning, computational biology, and other fields.
- **Linkage Clustering**: Multi-parameter linkage families are considered, extending the execution tree method for single-parameter families.
- **Sequence Alignment**: Algorithms have been proposed for computing the partition of the weight space into regions that yield different optimal alignments.
- **Pricing Problems**: Computational efficiency has been studied in multi-dimensional mechanism design, particularly for the two-part tariff pricing problem.

### Conclusion

This paper addresses the computational efficiency problem in data-driven algorithm design by proposing an output-sensitive cell enumeration algorithm, particularly for combinatorial algorithm families with multiple parameters. Because the running time depends on the number of pieces that actually occur, these methods can significantly reduce the cost of ERM when handling multiple related problem instances from the same domain.
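To make the piecewise structure behind the two-part tariff application concrete, here is a minimal Python sketch. It is not the paper's algorithm and performs no output-sensitive enumeration; it only illustrates why total revenue is piecewise linear in the price parameters. The buyer model is an assumption: each buyer has a value for every quantity and picks the quantity maximizing value minus payment, buying nothing when no quantity gives positive utility. The names `purchase_quantity`, `total_revenue`, `boundary_lines`, and `sign_vector`, and the input format, are hypothetical.

```python
from itertools import combinations

def purchase_quantity(p1, p2, values):
    """Quantity a buyer purchases under upfront fee p1 and per-unit price p2.

    values[q - 1] is the buyer's value for q units (assumed input format).
    Returns 0 when no quantity yields strictly positive utility.
    """
    best_q, best_u = 0, 0.0
    for q, v in enumerate(values, start=1):
        u = v - p1 - p2 * q
        if u > best_u:  # strict comparison: ties go to the smaller quantity
            best_q, best_u = q, u
    return best_q

def total_revenue(p1, p2, buyers):
    """Sum of payments p1 + p2 * q over all buyers who purchase q > 0 units."""
    total = 0.0
    for values in buyers:
        q = purchase_quantity(p1, p2, values)
        if q > 0:
            total += p1 + p2 * q
    return total

def boundary_lines(buyers):
    """Lines a*p1 + b*p2 = c across which some buyer's choice can change."""
    lines = set()
    for values in buyers:
        units = list(enumerate(values, start=1))
        for q, v in units:
            # Indifference between buying q units and buying nothing.
            lines.add((1.0, float(q), float(v)))
        for (q, v), (r, w) in combinations(units, 2):
            # Indifference between q and r units: (r - q) * p2 = w - v.
            lines.add((0.0, float(r - q), float(w - v)))
    return sorted(lines)

def sign_vector(p1, p2, lines):
    """Side of each boundary line the point lies on; identifies its cell."""
    signs = []
    for a, b, c in lines:
        lhs = a * p1 + b * p2
        signs.append((lhs > c) - (lhs < c))
    return tuple(signs)
```

Two price pairs with the same sign vector lie in the same cell of the line arrangement, so every buyer makes the same purchase at both and the total revenue is a single linear function of `(p1, p2)` there. An ERM procedure only needs to evaluate one candidate per cell, which is why enumerating only the cells that actually occur matters.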