Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost

Sania Nayab, Giulio Rossolini, Giorgio Buttazzo, Nicolamaria Manes, Fabrizio Giacomelli
2024-07-29
Abstract: Today's large language models (LLMs) can solve challenging question-answering tasks, and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention for enhancing the explanation and correctness of outputs. Nevertheless, models require significant time to generate answers augmented with lengthy reasoning details. To address this issue, this paper analyzes the impact of output lengths on LLM inference pipelines and proposes novel metrics to evaluate them in terms of *correct conciseness*. It also examines the impact of controlling output length through a refined prompt engineering strategy, Constrained-CoT (CCoT), which encourages the model to limit output length. Experiments on pre-trained LLMs demonstrated the benefit of the proposed metrics and the effectiveness of CCoT across different models. For instance, constraining the reasoning of LLaMA2-70b to 100 words improves the accuracy from 36.01% (CoT) to 41.07% (CCoT) on the GSM8K dataset, while reducing the average output length by 28 words.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the fact that chain-of-thought (CoT) prompting makes large language models (LLMs) produce long outputs, which in turn increases generation time. Specifically, the paper focuses on the following aspects:

1. **The relationship between output length and inference time**: The paper first shows experimentally how output length affects LLM inference time. As the output length increases, the time the model needs to generate an answer increases significantly, which is a serious issue for applications that require real-time interaction.
2. **Improvement of evaluation metrics**: Existing evaluation metrics focus mainly on the accuracy of the model output and ignore its conciseness and response time. The paper therefore proposes three new evaluation metrics that jointly consider the correctness and conciseness of the output:
   - **Hard-Constrained Concise Accuracy (HCA)**: Counts only the proportion of correct answers whose length does not exceed a specified value \(k\).
   - **Soft-Constrained Concise Accuracy (SCA)**: Applies an exponential decay penalty to correct answers that exceed the maximum length \(k\).
   - **Consistent Concise Accuracy (CCA)**: Additionally considers the consistency of the output length and penalizes outputs with large length variations.
3. **Methods for controlling output length**: To reduce the output length, the paper proposes an improved prompt engineering strategy, Constrained Chain-of-Thought (CCoT). CCoT encourages the model to generate a more concise reasoning process by explicitly asking it, in the prompt, to limit the output length. Experimental results show that CCoT can significantly reduce the output length and generation time while maintaining or improving accuracy.
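The three metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation: the summary does not state the formulas, so the exact penalty forms, the decay parameter `alpha`, the length-variation tolerance `beta`, and the use of the population standard deviation are all assumptions made for this example. Lengths are measured in words.

```python
import math
from statistics import pstdev


def _check(correct, lengths):
    # Both sequences must describe the same, non-empty set of samples.
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equally long")


def hca(correct, lengths, k):
    """Hard-constrained: a sample counts only if it is correct AND at most k words."""
    _check(correct, lengths)
    hits = sum(1 for c, n in zip(correct, lengths) if c and n <= k)
    return hits / len(correct)


def sca(correct, lengths, k, alpha):
    """Soft-constrained: correct answers longer than k are down-weighted.

    Assumed penalty: min(1, exp((k - n) / alpha)), so answers within the
    limit keep full weight and the weight decays exponentially beyond it.
    """
    _check(correct, lengths)
    total = 0.0
    for c, n in zip(correct, lengths):
        if c:
            total += min(1.0, math.exp((k - n) / alpha))
    return total / len(correct)


def cca(correct, lengths, k, alpha, beta):
    """Consistent: SCA scaled down when output lengths vary a lot.

    Assumed penalty: min(1, exp((beta - sigma) / beta)), where sigma is the
    population standard deviation of the output lengths.
    """
    sigma = pstdev(lengths)
    return sca(correct, lengths, k, alpha) * min(1.0, math.exp((beta - sigma) / beta))


if __name__ == "__main__":
    # Hypothetical toy data: four answers, their correctness and word counts.
    correct = [True, True, False, True]
    lengths = [50, 120, 40, 100]
    print("HCA:", hca(correct, lengths, k=100))
    print("SCA:", sca(correct, lengths, k=100, alpha=20))
    print("CCA:", cca(correct, lengths, k=100, alpha=20, beta=20))
```

Under these assumed forms, HCA never exceeds SCA, SCA never exceeds plain accuracy, and CCA never exceeds SCA, which matches the intent of progressively stricter notions of concise correctness.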
In summary, the paper aims to reduce excessive output length in LLM answers by introducing new evaluation metrics and an improved prompt engineering strategy, thereby improving the efficiency and practicality of the model.
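The CCoT idea of stating a length limit directly in the prompt can be illustrated with a small sketch. The prompt wording, the function names, and the whitespace-based word count below are assumptions chosen for illustration; the summary above does not give the exact prompt text used in the paper.

```python
def cot_prompt(question: str) -> str:
    """Plain CoT prompt: ask the model to reason step by step (assumed wording)."""
    return f"{question}\nLet's think step by step."


def ccot_prompt(question: str, max_words: int) -> str:
    """CCoT-style prompt: same request plus an explicit word limit (assumed wording)."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return (
        f"{question}\n"
        f"Let's think step by step and limit the answer length to {max_words} words."
    )


def word_count(text: str) -> int:
    """Approximate output length as the number of whitespace-separated words."""
    return len(text.split())


if __name__ == "__main__":
    # Hypothetical question; the model call itself is intentionally left out.
    question = "A farmer has 12 cows and buys 8 more. How many cows are there now?"
    print(cot_prompt(question))
    print(ccot_prompt(question, max_words=100))
```

The limit in the prompt is only a request: the model may still exceed it, which is why a helper such as `word_count` is useful for checking the lengths that actually come back.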