Abstract: We present an evaluation of bucketed approximate top-$k$ algorithms. Computing top-$k$ exactly suffers from limited parallelism, because the $k$ largest values must be aggregated along the vector, thus is not well suited to computation on highly-parallel machine learning accelerators. By relaxing the requirement that the top-$k$ is exact, bucketed algorithms can dramatically increase the parallelism available by independently computing many smaller top-$k$ operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-$k$ to select the most important parameters or activations. We also release a fast bucketed top-$k$ implementation for PyTorch.
What problem does this paper attempt to address?
The paper addresses the following problem: on large-scale parallel machine learning accelerators, computing top-\(k\) exactly relies on an inherently serial aggregation step, which limits parallelism and makes the operation inefficient. Specifically:
1. **Limitations of Top-\(k\) Operations**:
- The top-\(k\) operation selects the \(k\) largest elements from a vector of length \(n\).
- Computing top-\(k\) exactly requires aggregating the \(k\) largest values along the vector, so it is poorly suited to highly parallel machine learning accelerators (such as GPUs).
2. **The Need to Improve Parallelism**:
- In large language models (LLMs), top-\(k\) is often used to select the most important parameters or activations, especially in sparsity algorithms.
- To improve efficiency and use more of the available parallel compute, the paper relaxes the requirement that the top-\(k\) result be exact and adopts bucketed approximate top-\(k\) algorithms, which significantly increases the available parallelism.
3. **Specific Problem Description**:
- **Limited Parallelism**: Traditional top-\(k\) algorithms expose limited parallelism on large inputs and cannot fully use the computing power of modern accelerators.
- **High Computational Overhead**: For large \(n\), computing top-\(k\) can be very slow, especially when each training iteration may involve millions of parameters.
- **Low Resource Utilization**: Existing top-\(k\) implementations do not fully utilize the available computing resources. In particular, the top-\(k\) operation takes a large amount of time during the generation of each token.
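To make the serial dependency in item 1 concrete, here is a minimal Python sketch of an exact top-\(k\) that keeps a single running heap of the \(k\) largest values seen so far. The function name `exact_top_k` and the heap-based approach are illustrative assumptions, not taken from the paper; the point is only that every element has to pass through the same shared state.

```python
import heapq


def exact_top_k(values, k):
    """Return the k largest values in descending order (hypothetical helper)."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


print(exact_top_k([3, 9, 1, 7, 5, 8], k=3))  # prints [9, 8, 7]
```

Because each step reads and updates the same heap, the loop cannot be split across independent workers without an extra merge step, which is the kind of aggregation the summary refers to.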
### Solution
The paper proposes a bucketed approximate top-\(k\) algorithm that addresses the problem in the following ways:
- **Bucketing Strategy**: Divide the input vector into multiple buckets, run a smaller top-\(k_b\) operation independently within each bucket, and then merge the per-bucket results.
- **Reduce the Need for Cooperation**: Because the result does not have to be the exact top-\(k\), buckets need less coordination with each other, which improves parallelism.
- **Adapt to Different Application Scenarios**: Choose the bucketing parameters and the per-bucket \(k_b\) according to the ratio of \(k\) to \(n\) (\(k \ll n\) or \(k\propto n\)) to optimize performance.
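The bucketing strategy above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's released PyTorch implementation: the name `bucketed_top_k`, the use of contiguous equal-sized buckets, the choice \(k_b = k / \text{num\_buckets}\), and merging by simple concatenation are all assumptions made for the example.

```python
import heapq


def bucketed_top_k(values, k, num_buckets):
    """Approximate top-k: take the top k_b of each bucket, then concatenate.

    Assumptions (not from the paper): contiguous buckets of equal size,
    k_b = k // num_buckets, and a merge step that is plain concatenation.
    """
    n = len(values)
    if num_buckets <= 0 or k % num_buckets != 0 or n % num_buckets != 0:
        raise ValueError("this sketch assumes num_buckets divides both k and n")
    k_b = k // num_buckets
    bucket_size = n // num_buckets
    selected = []
    # Each iteration touches only its own bucket, so on an accelerator
    # the buckets could be processed independently and in parallel.
    for b in range(num_buckets):
        bucket = values[b * bucket_size:(b + 1) * bucket_size]
        selected.extend(heapq.nlargest(k_b, bucket))
    return sorted(selected, reverse=True)


values = [9, 8, 7, 1, 2, 3, 6, 0]
approx = bucketed_top_k(values, k=4, num_buckets=2)
exact = heapq.nlargest(4, values)
print(approx)  # prints [9, 8, 6, 3]
print(exact)   # prints [9, 8, 7, 6]
```

In this toy input three of the four largest values fall in the first bucket, so the approximate result keeps 3 in place of 7. That is the accuracy cost of removing coordination between buckets; with `num_buckets=1` the sketch reduces to the exact result.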
### Experimental Verification
The paper shows, through theoretical analysis and experiments, that bucketed approximate top-\(k\) can deliver significant speedups in many scenarios while keeping downstream task performance stable. For example, in a sparse attention task, bucketed approximate top-\(k\) gives a speedup of more than 4x with almost no loss in task performance.
In summary, the paper targets the inefficiency of traditional top-\(k\) operations on large-scale parallel hardware, particularly in large language model workloads, by introducing bucketed approximate top-\(k\) algorithms.