The Long Way to Deforestation (Technical Report): A Type Inference and Elaboration Technique for Removing Intermediate Data Structures

Yijia Chen, Lionel Parreaux
DOI: https://doi.org/10.1145/3674634
2024-10-03
Abstract: Deforestation is a compiler optimization that removes intermediate data structure allocations from functional programs to improve their efficiency. This is an old idea, but previous approaches have proved limited or impractical: they either only worked on compositions of predefined combinators (shortcut fusion), or involved the aggressive unfolding of recursive definitions until a depth limit was reached or a reoccurring pattern was found to tie the recursive knot, resulting in impractical algorithmic complexity and large amounts of code duplication. We present Lumberhack, a general-purpose deforestation approach for purely functional call-by-value programs. Lumberhack uses subtype inference to reason about data structure production and consumption and uses an elaboration pass to fuse the corresponding recursive definitions. It fuses large classes of mutually recursive definitions while avoiding much of the unproductive (and sometimes counter-productive) code duplication inherent in previous approaches. We prove the soundness of Lumberhack using logical relations and experimentally demonstrate significant speedups in the standard nofib benchmark suite.
Programming Languages
What problem does this paper attempt to address?
The problem this paper addresses is **reducing the creation of intermediate data structures in functional programs in order to improve execution efficiency**. Specifically, the paper introduces Lumberhack, a technique for performing "deforestation" in the compiler, an optimization that improves the performance of functional programs by eliminating the allocation of intermediate data structures.

### Problem Background

In functional programming, "deforestation" refers to program transformations that reduce the intermediate data structures created during execution. In the example shown in Figure 1 of the paper, the intermediate list created by `map double ls` is immediately consumed by `map incr`, so its creation can be avoided by fusing the two traversals into one. However, most existing compilers cannot perform this optimization automatically, especially for compositions of recursive functions.

### Limitations of Existing Methods

Many previously proposed deforestation methods have the following limitations:

1. **Code Bloat**: They usually cause a large amount of code duplication, so code size grows dramatically and compilation time becomes impractical.
2. **High Complexity**: They are very complex, which makes them hard to prove correct and hard to implement correctly.
3. **Only Applicable to Non-Strict Languages**: Many methods cannot handle call-by-value semantics and apply only to non-strict languages (such as Haskell).
4. **Limited Applicability**: Some methods apply only to specific predefined combinators rather than to general recursive functions.

### Characteristics of Lumberhack

To solve these problems, the paper proposes a new general deforestation method, Lumberhack. It has the following characteristics:

- **Generality**: It can be applied to call-by-need and call-by-value languages.
- **Wide Applicability**: It can handle programs containing a large number of mutually recursive definitions, not just predefined combinators.
- **Avoids Code Bloat**: It does not lead to uncontrolled code duplication.
- **Easy to Implement and Prove Correct**: It is relatively simple, which makes it easier to implement and to prove correct.

### Experimental Results

In experiments on the standard nofib benchmark suite, Lumberhack shows a significant performance improvement. Across 38 benchmark programs, average speed increased by 8.2%, while the size of the compiled binaries increased by about 1.79 times on average. In particular, 17 programs sped up significantly, by 16.6%, and only two programs slowed down slightly (with an average slowdown of 1.8%).

### Summary

Through type inference and elaboration, Lumberhack can fuse producer and consumer functions while preserving the original program structure, thereby reducing the creation of intermediate data structures and improving execution efficiency. The method is not limited to lists; it also applies to other general functional data structures, such as binary trees.
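
The `map double ls` example mentioned above can be sketched by hand in Haskell to show what fusing a producer with its consumer means. This is only an illustration under stated assumptions: the definitions of `double` and `incr` are guesses at what the names suggest, `unfused` and `fused` are hypothetical names, and the fused version is written manually rather than being output produced by Lumberhack.

```haskell
-- Hand-written illustration of the fusion idea; not Lumberhack's output.
-- The bodies of `double` and `incr` are assumed for the example.
double, incr :: Int -> Int
double x = x * 2
incr x = x + 1

-- Unfused: without any fusion, `map double ls` produces an intermediate
-- list whose only purpose is to be consumed by `map incr`.
unfused :: [Int] -> [Int]
unfused ls = map incr (map double ls)

-- Fused by hand: a single traversal that applies both functions to each
-- element, so no intermediate list is ever built.
fused :: [Int] -> [Int]
fused []       = []
fused (x : xs) = incr (double x) : fused xs

main :: IO ()
main = do
  print (unfused [1, 2, 3])  -- prints [3,5,7]
  print (fused [1, 2, 3])    -- prints [3,5,7]
```

Both functions return the same result; the difference lies only in whether the intermediate list is allocated. According to the summary above, Lumberhack aims to derive this kind of fused definition automatically for general recursive functions, not only for predefined combinators such as `map`.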