The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

Fulu Li
2024-10-24
Abstract: In this paper, we give an in-depth analysis of the mathematical problem formulations and the probabilistic optimization explorations for some of the key components of the Transformer model [33] in the field of generative AI. We explore and discuss potential enhancements to current state-of-the-art methods for some key underlying technologies of generative AI models from an algorithmic and probabilistic optimization perspective. In particular, we present an optimal solution for sub-word encoding (SWE) based on initial settings similar to those of the byte-pair encoding (BPE) algorithm in [9], with objectives similar to those of the WordPiece approach in [28, 31], to maximize the likelihood of the training data. We also present a cross-entropy optimization method to optimize hyperparameters for the word2vec model [17]. In addition, we propose a factored combination of rotary positional encoding (RoPE) [32] and attention with linear biases (ALiBi) [23] with a harmonic series. We also present a probabilistic FlashAttention [6, 7] (PrFlashAttention) method that uses a probability distribution over block distances in the matrix to decide which block is likely to participate in a given round of attention computation, while maintaining the lower-triangular shape of the tensor for autoregressive language models by reshaping the tensors. Finally, we present staircase adaptive quantization (SAQ) of the key-value (KV) cache for multi-query attention (MQA), based on the framework presented in [16], to obtain gradual quantization degradation while achieving reasonable model quality and cost savings.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper focuses on the mathematical modeling and probabilistic optimization of key components of the Transformer model in generative AI. Specifically, it addresses the following issues:

1. **Optimization of Sub-word Encoding**
   - The paper proposes an optimal solution, based on initial settings similar to those of the Byte-Pair Encoding (BPE) algorithm, that maximizes the likelihood of the training data. The goal is to handle rare and unseen words while optimizing the vocabulary size and the word-frequency distribution in the training data.
2. **Optimization of Word2Vec Model Hyperparameters**
   - The paper proposes a cross-entropy optimization method for tuning the hyperparameters of the Word2Vec model in order to improve model performance.
3. **Improvement of Positional Encoding and Attention Mechanism**
   - The paper proposes a factored combination of Rotary Positional Encoding (RoPE) and Attention with Linear Biases (ALiBi) that introduces a harmonic series. This aims to improve the representation of positional information and the capture of long-distance dependencies in the Transformer model.
4. **Introduction of Probabilistic FlashAttention (PrFlashAttention)**
   - The paper proposes a method that uses a probability distribution over matrix block distances to decide which blocks participate in a given round of attention computation, while maintaining the lower-triangular shape of the tensor required by autoregressive language models.
5. **Quantization Optimization in Multi-Query Attention (MQA)**
   - The paper proposes Staircase Adaptive Quantization (SAQ) of the key-value (KV) cache in multi-query attention to achieve gradual quantization degradation while maintaining reasonable model quality and cost savings.
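Item 1 takes the initial settings of BPE as its starting point. For readers unfamiliar with that baseline, the following is a minimal sketch of the standard greedy BPE merge loop. It is not the optimal sub-word encoding the paper proposes, and the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the toy word frequencies are assumptions made purely for illustration.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedy baseline BPE: start from characters, merge the top pair each round."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        corpus = merge_pair(corpus, pair)
        merges.append(pair)
    return merges, corpus


# Hypothetical toy word frequencies, for illustration only.
merges, corpus = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
print(merges)
```

The greedy rule picks whichever pair is most frequent at each step. According to the abstract, the paper instead seeks an encoding that maximizes the likelihood of the training data, which is the objective WordPiece also targets.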
### Summary of Mathematical Formulas

- **Maximum Likelihood Estimation**:
  \[ \max_\theta \left( \sum_{i = 1}^N \log \Pr(A_i \mid Q_i; \theta) \right) \]
  where \(\theta\) denotes the model parameters and \(\Pr(A_i \mid Q_i; \theta)\) is the probability of the correct answer \(A_i\) given the question \(Q_i\).
- **Cross-Entropy Loss Function**:
  \[ \min_\theta \left( -\sum_{t = 1}^T \log \Pr(s_t \mid s_1, s_2, \ldots, s_{t - 1}; \theta) \right) \]
  where \(s_t\) is the \(t\)-th word in the sequence and \(\theta\) denotes the model parameters.
- **KL Divergence**:
  \[ KL(W \| Z) = \sum_{x \in \chi} W(x) \log \left( \frac{W(x)}{Z(x)} \right) \]
  where \(W\) and \(Z\) are probability distributions over the same set \(\chi\).

### Conclusion

Through an in-depth analysis of key components of the Transformer model, this paper proposes a series of improvements from algorithmic and probabilistic optimization perspectives, aiming to improve the performance and efficiency of generative AI models. These improvements include sub-word encoding, hyperparameter optimization, positional encoding and the attention mechanism, and KV cache quantization.
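As a small numerical illustration of the KL divergence and cross-entropy formulas from the summary, the sketch below evaluates both on toy inputs. The function names (`kl_divergence`, `sequence_nll`) and the example distributions are assumptions introduced here, not taken from the paper.

```python
import math


def kl_divergence(w, z):
    """KL(W || Z) = sum_x W(x) * log(W(x) / Z(x)) over a finite support."""
    total = 0.0
    for x, wx in w.items():
        if wx == 0.0:
            continue  # the term 0 * log(0 / z) is taken as 0 by convention
        zx = z.get(x, 0.0)
        if zx == 0.0:
            return math.inf  # W puts mass where Z has none
        total += wx * math.log(wx / zx)
    return total


def sequence_nll(step_probs):
    """Negative log-likelihood: -sum_t log Pr(s_t | s_1, ..., s_{t-1})."""
    return -sum(math.log(p) for p in step_probs)


# Hypothetical toy distributions over a two-symbol alphabet.
w = {"a": 0.5, "b": 0.5}
z = {"a": 0.25, "b": 0.75}
print(kl_divergence(w, z))       # equals 0.5 * log(4/3)
print(sequence_nll([0.5, 0.5]))  # equals 2 * log(2)
```

Here `step_probs` stands in for the per-token conditional probabilities a model would assign to the observed sequence; minimizing `sequence_nll` over the model parameters corresponds to the cross-entropy objective.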