Abstract: Today's large language models (LLMs) can solve challenging question-answering tasks, and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention for enhancing the explanation and correctness of outputs. Nevertheless, models require significant time to generate answers augmented with lengthy reasoning details. To address this issue, this paper analyzes the impact of output lengths on LLM inference pipelines and proposes novel metrics to evaluate them in terms of *correct conciseness*. It also examines the impact of controlling output length through a refined prompt engineering strategy, Constrained-CoT (CCoT), which encourages the model to limit output length. Experiments on pre-trained LLMs demonstrated the benefit of the proposed metrics and the effectiveness of CCoT across different models. For instance, constraining the reasoning of LLaMA2-70b to 100 words improves the accuracy from 36.01% (CoT) to 41.07% (CCoT) on the GSM8K dataset, while reducing the average output length by 28 words.

What problem does this paper attempt to address?

The paper addresses a cost of chain-of-thought (CoT) prompting: it makes large language model (LLM) outputs long, and long outputs take more time to generate. Specifically, the paper focuses on the following aspects:

1. **The relationship between output length and inference time**: The paper first shows experimentally how output length affects LLM inference time. As the output grows longer, the time the model needs to generate an answer increases significantly, which matters in applications that require real-time interaction.
2. **Improvement of evaluation metrics**: Existing evaluation metrics focus mainly on the accuracy of the model output and ignore its conciseness and response time. The paper therefore proposes three new metrics that consider correctness and conciseness together:
   - **Hard-Constrained Concise Accuracy (HCA)**: Counts only the correct answers whose length does not exceed a specified value \(k\), as a proportion of all answers.
   - **Soft-Constrained Concise Accuracy (SCA)**: Applies an exponentially decaying penalty to correct answers that exceed the maximum length \(k\).
   - **Consistent Concise Accuracy (CCA)**: Additionally accounts for the consistency of output lengths and penalizes outputs whose lengths vary widely.
3. **Methods for controlling output length**: To reduce output length, the paper proposes a refined prompt engineering strategy, Constrained Chain-of-Thought (CCoT). CCoT encourages the model to produce a more concise reasoning process by explicitly asking it, in the prompt, to limit the output length. Experimental results show that CCoT can significantly reduce output length and generation time while maintaining or improving accuracy.
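The three metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation: the summary states only that SCA uses an exponential decay and that CCA penalizes length variation, so the exact functional forms below are assumptions. The decay parameters `alpha` and `beta`, the use of the population standard deviation, and the function names are also assumptions.

```python
import math
from statistics import pstdev


def hca(correct, lengths, k):
    """Hard-Constrained Concise Accuracy: share of all answers that are
    both correct and at most k units long (words or tokens)."""
    hits = sum(1 for c, n in zip(correct, lengths) if c and n <= k)
    return hits / len(correct)


def sca(correct, lengths, k, alpha):
    """Soft-Constrained Concise Accuracy: a correct answer within k counts
    fully; a longer one is down-weighted by exp((k - n) / alpha).
    (Assumed form of the exponential decay penalty.)"""
    total = 0.0
    for c, n in zip(correct, lengths):
        if c:
            total += 1.0 if n <= k else math.exp((k - n) / alpha)
    return total / len(correct)


def cca(correct, lengths, k, alpha, beta):
    """Consistent Concise Accuracy: SCA scaled down when the standard
    deviation of output lengths exceeds a tolerance beta.
    (Assumed form of the consistency penalty.)"""
    sigma = pstdev(lengths)
    consistency = 1.0 if sigma <= beta else math.exp((beta - sigma) / beta)
    return sca(correct, lengths, k, alpha) * consistency


if __name__ == "__main__":
    # Hypothetical data: four answers, one wrong, one correct but too long.
    correct = [True, True, False, True]
    lengths = [50, 120, 40, 100]
    print(hca(correct, lengths, k=100))
    print(sca(correct, lengths, k=100, alpha=20))
    print(cca(correct, lengths, k=100, alpha=20, beta=20))
```

Under these assumptions, plain accuracy on the sample data is 0.75, HCA drops to 0.5 because one correct answer exceeds the limit, and SCA falls between the two because that answer is penalized rather than discarded.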
In summary, the paper's main goal is to reduce excessive output length in LLM-generated answers by introducing new evaluation metrics and a refined prompt engineering strategy, thereby improving the efficiency and practicality of the model.
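As a rough illustration of the prompting strategy, a CCoT prompt can be built by appending an explicit length limit to a standard CoT instruction. The wording of the instruction and the helper name `build_ccot_prompt` are assumptions made for this sketch; the summary above says only that the prompt explicitly asks the model to limit its output length.

```python
def build_ccot_prompt(question: str, max_words: int) -> str:
    """Return a CoT-style prompt with an explicit word limit.

    The instruction wording is hypothetical, not quoted from the paper.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return (
        f"Q: {question}\n"
        f"A: Let's think step by step and limit the answer length "
        f"to {max_words} words."
    )


if __name__ == "__main__":
    # 100 words is the limit used in the abstract's LLaMA2-70b example.
    print(build_ccot_prompt("What is 12 * 7?", max_words=100))
```

The limit is only a request: the model may still exceed it, which is why length-aware metrics are useful alongside the prompt.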

Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

The Impact of Reasoning Step Length on Large Language Models

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness

Concise and Organized Perception Facilitates Reasoning in Large Language Models

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

Extending Token Computation for LLM Reasoning

Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models

Understanding Chain-of-Thought in LLMs through Information Theory

Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

Rational Metareasoning for Large Language Models

Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought

How Likely Do LLMs with CoT Mimic Human Reasoning?

Over-Reasoning and Redundant Calculation of Large Language Models

Synergy-of-Thoughts: Eliciting Efficient Reasoning in Hybrid Language Models