Advanced Prompt Engineering Techniques
Large Language Models (LLMs) can handle complex tasks like math problems and commonsense reasoning with the help of prompt engineering. LLMs are not inherently capable of performing such complicated tasks; they require guidance and optimization to extend their capabilities and broaden the range of tasks they can perform effectively. This guidance comes in the form of prompts, which can specify the desired output format, provide prior knowledge, or walk the LLM through a complex task. Advanced prompting techniques such as Chain-of-Thought (CoT) prompting can significantly improve problem-solving rates in LLMs.
In this article, we will explore the advanced prompt engineering techniques that will help your business gain a competitive advantage.
What is Prompt Design?
Prompt design is the process of creating the most effective prompt for an LLM with a clear objective. Crafting a successful prompt requires a deep understanding of the model in use: different LLMs may interpret the same prompt differently, and some have specific keywords with particular meanings. Depending on the task, domain-specific knowledge is also crucial in prompt creation. Finding the perfect prompt often involves a trial-and-error process.
A prompt has three main types of content: input, context, and examples. The input specifies the information for which the model needs to generate a response; it can take various forms, such as questions, tasks, or entities. Context and examples are optional. Context provides instructions that shape the model's behavior, while examples are input-output pairs included in the prompt to demonstrate the expected response. Together they customize the response format and behavior of the model.
Common prompt design strategies can improve LLM performance significantly. Include clear and concise instructions to guide the model's behavior effectively. Provide examples of the desired response pattern to improve results, adjusting their number to the complexity of the task. Show the model desired patterns rather than patterns to avoid. Provide partial content and let the model generate the rest, drawing on the examples and context. Include any instructions and background information that aid problem-solving. Finally, add prefixes to the input or output to give the model semantic cues or formatting guidance.
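As a sketch, the three content types might be assembled into a single prompt like this; the classification task, labels, and separators are illustrative assumptions, not a fixed format:

```python
# Sketch: combining context, few-shot examples, and the input into one prompt.

def build_prompt(user_input, context=None, examples=None):
    """Combine optional context, optional input-output examples, and the input."""
    parts = []
    if context:
        parts.append(context)                      # instructions on behavior
    for example_in, example_out in (examples or []):
        parts.append(f"Input: {example_in}\nOutput: {example_out}")
    parts.append(f"Input: {user_input}\nOutput:")  # the actual query, with prefixes
    return "\n\n".join(parts)

prompt = build_prompt(
    "The battery drains within an hour.",
    context="Classify each product review as positive or negative.",
    examples=[("Love this phone, the screen is gorgeous.", "positive")],
)
print(prompt)
```

The `Input:`/`Output:` prefixes at the end act as the formatting cue that tells the model where its completion should begin.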
Advanced Prompt Engineering Techniques
Advanced Prompt Engineering Techniques are a set of methods for improving the performance of large language models on complex tasks. These techniques involve providing the LLM with more informative and structured prompts, as well as using prior knowledge and logical reasoning to guide the LLM's responses.
Chain-of-Thought (CoT) Prompting
Chain-of-Thought prompting (CoT) is a technique that provides the LLM with a sequence of intermediate steps that lead to the desired answer. It improves the reasoning abilities of large language models (LLMs). It allows the model to focus on solving one step at a time, rather than having to consider the entire problem all at once. It can be used for several reasoning tasks, including math word problems, commonsense reasoning, and symbolic manipulation. It can be readily implemented in sufficiently large language models without any special training or fine-tuning of the model. For example, CoT prompting in the PaLM model significantly enhanced performance in the GSM8K benchmark, improving it from 17.9% to 58.1%.
Few-shot CoT prompts the LLM with a few worked examples of similar problems, with the reasoning steps written out, to improve its reasoning abilities. It is more effective than a few-shot baseline but can be more complex to implement. Zero-shot CoT involves adding "Let's think step by step" to the original prompt. This prompts the LLM to think about the question and produce a chain of reasoning that leads to the answer. The final answer is then extracted from the LLM's response using a second prompt, "The answer is". Zero-shot CoT has been shown to outperform other methods for evaluating the zero-shot reasoning abilities of LLMs.
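The two-stage zero-shot CoT flow can be sketched as follows; `complete` is a hypothetical stand-in for a real LLM call that returns canned text here, so only the control flow is demonstrated:

```python
# Sketch of two-stage zero-shot CoT prompting.

def complete(prompt):
    # Hypothetical stand-in for an LLM API call; returns canned text.
    if prompt.endswith("Let's think step by step."):
        return " There are 3 cars and each has 4 wheels. 3 * 4 = 12."
    return " 12"

def zero_shot_cot(question):
    # Stage 1: elicit a reasoning chain with the trigger phrase.
    stage1 = f"Q: {question}\nA: Let's think step by step."
    reasoning = complete(stage1)
    # Stage 2: extract the final answer from the chain with "The answer is".
    stage2 = f"{stage1}{reasoning}\nThe answer is"
    answer = complete(stage2)
    return reasoning.strip(), answer.strip()

reasoning, answer = zero_shot_cot("How many wheels do 3 cars have?")
print(answer)  # -> 12
```

With a real model, the second call conditions on the generated reasoning, so the extracted answer reflects the chain rather than a direct guess.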
CoT reasoning emerges in LLMs exceeding 100 billion parameters. This ability may stem from large LLMs' training on extensive datasets that include step-by-step reasoning. While instruction-following isn't essential for CoT, it might enhance its quality, and further research is required to fully understand the origins and potential of CoT reasoning in large LLMs. Researchers have found that CoT prompting consistently outperforms standard baseline prompting across various linguistic styles, annotators, examples, and language models, which shows its robustness and effectiveness in enhancing language models' performance on diverse tasks. Sensitivity in CoT prompting pertains to how prompt design influences model performance; well-matched, clear prompts are crucial, especially for complex tasks. Coherence in CoT ensures that reasoning steps follow a logical order: later steps may build on earlier ones, but earlier steps should not depend on later ones. Removing coherence negatively affected performance.
Self-Consistency
Self-consistency is a technique that samples multiple diverse chains of thought for the same problem and then selects the most consistent answer among these chains by majority vote.
It is used to enhance the performance of language models, especially in tasks requiring multi-step reasoning, like chain-of-thought prompting.
It improves the performance of CoT prompting across various benchmarks, such as GSM8K by 17.9%, SVAMP by 11.0%, and AQuA by 12.2%. It's an unsupervised technique that is compatible with pre-trained language models, requiring no extra human annotation, training, fine-tuning, or model changes. It remains robust across different sampling strategies and parameters, consistently enhancing performance. The benefits of self-consistency become more significant as language model scale increases. For example, it contributes up to +23% accuracy improvement for larger models like LaMDA-137B and GPT-3. Even for large models that already perform well, self-consistency consistently offers additional gains, such as +12%-18% accuracy improvement on tasks like AQuA and GSM8K over PaLM-540B.
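A minimal sketch of the majority-vote mechanism, with `sample_chain` standing in for repeated temperature sampling from an LLM; the canned chains and answers are assumptions:

```python
from collections import Counter

# Sketch of self-consistency: sample several reasoning chains and take a
# majority vote over the final answers.

def sample_chain(question, seed):
    # Hypothetical stand-in for one sampled (reasoning, answer) pair.
    canned = [
        ("16 - 3 - 4 = 9", "9"),
        ("16 - 7 = 9", "9"),
        ("16 - 3 = 13, 13 - 4 = 8", "8"),  # one faulty chain
    ]
    return canned[seed % len(canned)]

def self_consistency(question, num_samples=5):
    answers = [sample_chain(question, s)[1] for s in range(num_samples)]
    # The most frequent final answer wins, regardless of reasoning path.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("I had 16 eggs, ate 3 and gave away 4. How many are left?"))
# -> 9 (the faulty chain is outvoted)
```

The vote is over final answers only, which is why differently worded but correct chains reinforce each other.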
Tree-of-Thoughts (ToT) Prompting
Tree of Thoughts (ToT) is a new framework that extends the Chain-of-Thought approach by allowing language models to explore coherent units of text ("thoughts") as intermediate steps towards problem solving. ToT enables LMs to make deliberate decisions, consider multiple reasoning paths, and self-evaluate choices. It also allows LMs to look ahead or backtrack when necessary for making global decisions.
Tree of Thoughts enhances language models' problem-solving abilities on tasks like Game of 24, Creative Writing, and Mini Crosswords.
For example, IO, CoT, and CoT-SC perform poorly on the task of solving Game of 24, achieving only 7.3%, 4.0%, and 9.0% success rates, respectively. ToT achieves much better results on this task. ToT with a breadth of b = 1 (meaning that it considers one possible solution at a time) already achieves a success rate of 45%, while b = 5 (meaning that it considers five possible solutions at a time) achieves 74%.
ToT is effective in tasks that require non-trivial planning or search. In the average GPT-4 scores for the three methods (ToT, CoT, and IO) across 100 tasks, ToT has the highest average score (7.56), followed by CoT (6.93) and IO (6.19). ToT is able to generate more coherent passages than IO and CoT on average.
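The deliberate search behind ToT can be sketched as a breadth-limited tree search; `propose` and `score` are deterministic stubs standing in for the LLM's thought generator and self-evaluator:

```python
# Sketch of Tree-of-Thoughts as breadth-first search over partial solutions.

def propose(state):
    # Hypothetical thought generator: extend a partial solution by one step.
    return [state + [c] for c in "abc"]

def score(state):
    # Hypothetical self-evaluation: here, prefer states with more 'a' steps.
    return state.count("a")

def tree_of_thoughts(steps=3, breadth=2):
    frontier = [[]]  # start from the empty partial solution
    for _ in range(steps):
        candidates = [s for state in frontier for s in propose(state)]
        # Deliberate pruning: keep only the b highest-scoring states.
        frontier = sorted(candidates, key=score, reverse=True)[:breadth]
    return frontier[0]

print(tree_of_thoughts())  # -> ['a', 'a', 'a']
```

The `breadth` parameter corresponds to b in the results above: a wider frontier keeps more alternative reasoning paths alive at each level.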
Active Prompting
Active prompting uses uncertainty-based active learning to adapt large language models (LLMs) to different tasks. It works in four stages. The first stage is uncertainty estimation: the LLM is queried k times to generate possible answers, with intermediate steps, for a set of training questions. The uncertainty of each question is then calculated from the k answers using a metric called disagreement, which measures how much the k answers differ from each other. The second stage is selection: the most uncertain questions are selected for annotation, starting with the most uncertain and proceeding in decreasing order of uncertainty. The third stage is annotation: humans annotate the selected questions with human-designed CoT reasoning, which provides the LLM with additional information about how to answer such questions. The fourth stage is inference: the LLM answers the questions, using the newly annotated exemplars to improve its performance.
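The disagreement metric from the uncertainty-estimation and selection stages can be sketched as follows; the sampled answers are illustrative stand-ins for k LLM queries:

```python
# Sketch of Active Prompting's uncertainty estimation via disagreement.

def disagreement(answers):
    # Fraction of distinct answers among the k samples.
    return len(set(answers)) / len(answers)

def rank_by_uncertainty(question_answers):
    # question_answers maps each question to its k sampled answers.
    scored = {q: disagreement(a) for q, a in question_answers.items()}
    # Most uncertain questions first; these go to human annotators.
    return sorted(scored, key=scored.get, reverse=True)

samples = {
    "q1": ["7", "7", "7", "7"],    # model is confident
    "q2": ["3", "5", "3", "8"],    # model disagrees with itself
    "q3": ["12", "12", "10", "12"],
}
print(rank_by_uncertainty(samples))  # -> ['q2', 'q3', 'q1']
```

Annotation effort is then spent where the model is least reliable, rather than on questions it already answers consistently.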
Active-Prompt achieves the best performance compared with all baseline models, making it the most effective of the compared methods for improving the performance of large language models (LLMs) on a variety of reasoning tasks.
It outperforms self-consistency by an average of 2.1% with code-davinci-002 and 7.2% with text-davinci-002, suggesting that Active-Prompt is a more effective way to improve the performance of LLMs than self-consistency, an earlier prompting method. The largest improvements are observed on GSM8K (4.2%) and AQuA (3.1%), suggesting that Active-Prompt is particularly effective for tasks that do not rely on the transferability of CoT prompts.
Reasoning WithOut Observation (ReWOO)
ReWOO (Reasoning WithOut Observation) is a technique that detaches the reasoning process from external observations, such as the ability to access and process information from the real world. This detachment significantly reduces the number of tokens that the LLM needs to consume, which in turn improves its efficiency. ReWOO divides the workflow into three separate modules: Planner, Worker, and Solver. The Planner takes a question as input and breaks it down into a sequence of steps, each formulated as a plan. The plans are interdependent, meaning that the output of one plan can be used as the input to another. The Worker takes a plan as input and retrieves external knowledge from tools to provide evidence, which can be anything from factual information to code snippets. The Solver takes the plans and evidence from the Worker module and synthesizes them to generate the final answer to the initial question.
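A minimal sketch of the Planner-Worker-Solver pipeline; the `#E` placeholder plan format, the Search tool, and its canned results are illustrative assumptions, and in the real system the Planner and Solver are each a single LLM call:

```python
# Sketch of ReWOO: plan all steps up front, gather tool evidence, then solve.

TOOLS = {"Search": lambda q: {"capital of France": "Paris",
                              "population of Paris": "about 2.1 million"}.get(q, "unknown")}

def planner(question):
    # Emits every step in advance, so no LLM call is needed per observation.
    return [
        ("#E1", "Search", "capital of France"),
        ("#E2", "Search", "population of #E1"),  # depends on #E1's evidence
    ]

def worker(plan):
    evidence = {}
    for label, tool, arg in plan:
        # Substitute evidence from earlier steps into later plan arguments.
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        evidence[label] = TOOLS[tool](arg)
    return evidence

def solver(question, evidence):
    # In practice, one final LLM call synthesizes plans and evidence.
    return f"{question} -> {evidence['#E2']}"

question = "What is the population of the capital of France?"
evidence = worker(planner(question))
print(solver(question, evidence))
```

Because the plan is fixed before any tool runs, the LLM never re-reads intermediate observations, which is where the token savings come from.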
ReWOO was evaluated on six public NLP benchmarks and a curated dataset. It consistently outperformed the baseline methods on all of the benchmarks. For example, on HotpotQA, a multi-step reasoning benchmark, ReWOO achieved 5× token efficiency and a 4% accuracy improvement. ReWOO also demonstrated robustness under tool-failure scenarios, meaning it can still perform well even when the external tools it relies on are unavailable.
ReWOO outperforms ReAct. ReWOO was able to reduce token usage by 64% with an absolute accuracy gain of 4.4%. It is able to elicit more reasoning capabilities from LLMs than ReAct. ReWOO was also found to be more robust to tool failures than ReAct. When tools malfunction and return errors, ReAct-like ALM systems are highly fragile. ReWOO, on the other hand, is less compromised. ReWOO also performed well on the curated dataset, SOTUQA. SOTUQA is a document QA dataset that is more closely aligned with real-world ALM applications than previous public NLP benchmarks.
ReWOO decouples parametric modules from nonparametric tool calls. This means the LLM can be fine-tuned to offload some of its reasoning ability to smaller language models, substantially reducing the number of parameters the system requires and further improving efficiency. ReWOO can offload reasoning ability from a 175B-parameter GPT-3.5 model to a 7B-parameter LLaMA model, and it has the potential to create truly efficient and scalable ALM systems.
Reason and Act (ReAct)
ReAct is a technique that combines reasoning and acting with language models for solving various language reasoning and decision-making tasks. It prompts language models to generate both verbal reasoning traces and actions. It enables dynamic reasoning, high-level planning for acting, and interaction with external environments.
It is evaluated on four diverse benchmarks, including question answering (HotPotQA), fact verification (Fever), text-based games (ALFWorld), and web page navigation (WebShop). On HotpotQA and Fever, ReAct was able to overcome prevalent issues of hallucination and error propagation in chain-of-thought reasoning. It also outperformed imitation and reinforcement learning methods, with absolute success-rate improvements of 34% and 10% on ALFWorld and WebShop, respectively. This is because ReAct is able to learn from a small number of in-context human examples and apply that knowledge to new situations.
ReAct is designed to be intuitive, general, performant, and robust. It is applicable to diverse tasks, including question answering, fact verification, text games, and web navigation. It provides an interpretable decision-making and reasoning process, allowing humans to inspect reasoning, factual correctness, and even control or correct the agent's behavior during task execution.
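The interleaved reasoning-and-acting loop can be sketched as follows; `llm_step` and the lookup tool are deterministic stubs, and the `Thought:`/`Action:`/`Observation:` format mirrors ReAct-style trajectories:

```python
# Sketch of a ReAct loop: alternate Thought / Action / Observation until
# the model emits a finish action.

def llm_step(transcript):
    # Hypothetical policy: first look the fact up, then answer from it.
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[boiling point of water]"
    return "Thought: I have the fact.\nAction: finish[100 C]"

def lookup(query):
    # Stub external tool.
    return {"boiling point of water": "100 C"}.get(query, "no result")

def react(question, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm_step(transcript)
        transcript += "\n" + step
        action = step.rsplit("Action: ", 1)[1]
        if action.startswith("finish["):
            return action[len("finish["):-1]
        tool, arg = action.split("[", 1)
        # Feed the tool's observation back into the transcript.
        transcript += f"\nObservation: {lookup(arg[:-1])}"
    return None

print(react("What is the boiling point of water?"))  # -> 100 C
```

The growing transcript is what makes the process inspectable: a human can read each thought, action, and observation, and intervene mid-trajectory if needed.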
Reflexion
Reflexion is a framework that uses linguistic feedback to reinforce language agents. Linguistic feedback is feedback expressed in natural language. Reflexion agents learn to reflect on task feedback signals and maintain their own reflective text in an episodic memory buffer; this reflective text is then used to induce better decision-making in subsequent trials. The framework relies on self-reflection: the agent generates verbal self-reflections that provide more informative feedback, stores them in its memory, and uses this information to improve its performance on future trials.
Reflexion is flexible enough to incorporate various types and sources of feedback signals. For example, feedback signals can be scalar values (such as rewards or punishments), or they can be free-form language. Feedback signals can also be external (provided by a human or another agent), or they can be internally simulated (generated by the agent itself).
Reflexion agents outperform strong baseline approaches in decision-making, reasoning, and programming tasks. In decision-making tasks (AlfWorld), Reflexion agents improve by 22% over 12 iterative learning steps. On reasoning questions (HotPotQA), they show a 20% improvement. On Python programming tasks (HumanEval), they achieve an improvement of up to 11%, reaching 91% pass@1 accuracy on HumanEval and surpassing the previous state-of-the-art, GPT-4, which achieves 80%.
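The trial-reflect-retry loop can be sketched as follows; the actor, evaluator, and self-reflection functions are deterministic stubs standing in for LLM calls and task feedback:

```python
# Sketch of the Reflexion loop: attempt, reflect verbally on failure,
# and retry with the reflection held in episodic memory.

def agent(task, memory):
    # Hypothetical actor: succeeds only once a relevant reflection exists.
    if any("edge case" in m for m in memory):
        return "handles empty list"
    return "crashes on empty list"

def evaluate(attempt):
    # Stub task feedback signal (could be a test suite or a scalar reward).
    return "handles" in attempt

def reflect(attempt):
    # Hypothetical verbal self-reflection.
    return f"Last attempt '{attempt}' failed; remember the empty-list edge case."

def reflexion(task, max_trials=3):
    memory = []  # episodic buffer of reflective text
    attempt = ""
    for trial in range(max_trials):
        attempt = agent(task, memory)
        if evaluate(attempt):
            return trial + 1, attempt
        memory.append(reflect(attempt))
    return max_trials, attempt

trials, result = reflexion("write a list-processing function")
print(trials, result)  # succeeds on the second trial
```

The key point the sketch shows is that the model's weights never change; only the text in the memory buffer does, and that alone steers the next trial.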
Expert Prompting
Expert Prompting is an augmented strategy for instructing Large Language Models (LLMs). It envisions a distinguished expert agent tailored to each specific instruction, and the LLM is asked to answer the instruction conditioned on the identity of that envisioned expert. It is an automatic prompting method: expert identities are generated using in-context learning, which requires writing only a few instruction-expert pair exemplars. The generated expert identities have been found to be satisfactory.
Expert Prompting is a generalized prompting method. Expert identities are defined with detailed and elaborate descriptions, so they can match instructions in various domains or genres, and the method adapts to different areas such as nutrition or physics. It is simple to implement, requiring no complex crafting of prompt templates or iterative processes. Writing a good expert identity is critical: it should be specialized, detailed, and comprehensive for each instruction, and the descriptions must be automatically generated to be practical and efficient.
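A minimal sketch of the two-step procedure, with `generate_expert` standing in for the in-context-learning call that writes the expert identity; the identities and prompt format are assumptions:

```python
# Sketch of Expert Prompting: generate an expert identity for the
# instruction, then condition the answer on that identity.

def generate_expert(instruction):
    # Stand-in for an LLM call prompted with a few hand-written
    # instruction-expert exemplars (in-context learning).
    if "diet" in instruction:
        return ("You are a registered dietitian with 15 years of clinical "
                "experience in nutrition planning.")
    return "You are a knowledgeable domain expert."

def expert_prompt(instruction):
    identity = generate_expert(instruction)
    # The identity is prepended so the answer is conditioned on it.
    return f"{identity}\n\nInstruction: {instruction}\nAnswer:"

print(expert_prompt("Suggest a balanced diet for a marathon runner."))
```

In practice the returned string would be sent to the LLM; the sketch stops at prompt construction since that is the whole technique.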
Automatic Prompt Engineering (APE)
APE is a technique that treats the instruction as the “program,” and it optimizes the instruction by searching over a pool of instruction candidates proposed by an LLM. The LLM candidates are scored using a chosen score function, and the instruction with the highest score is selected. APE is inspired by classical program synthesis and the human approach to prompt engineering. Program synthesis is the task of automatically generating code from a natural language description of the desired behavior. The human approach is the process of manually crafting instructions effective at using LLMs to produce desired outputs.
APE achieves human-level performance on zero-shot learning with model-generated instructions on 24/24 Instruction Induction and 17/21 Big-Bench tasks. It surpasses human performance with the InstructGPT model, obtaining an IQM of 0.810 compared to humans' 0.749. To achieve this, a dataset of questions and reasoning steps is generated using InstructGPT with the prompt "Let's think step by step." Then any data points that had incorrect answers were removed. Finally, APE was used to find a prompt starting with "Let's" that maximized the likelihood of these correct reasoning steps. APE produced the prompt "Let's work this out in a step-by-step way to be sure we have the right answer." This generated prompt further improved performance on two tasks: MultiArith from 78.7 to 82.0, and GSM8K from 40.7 to 43.0.
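APE's search-and-score loop can be sketched as follows; `model_answer` is a deterministic stub, and the score function here is simple execution accuracy over a toy labeled set:

```python
# Sketch of APE: score candidate instructions on labeled data, keep the best.

def model_answer(instruction, question):
    # Hypothetical LLM: only the step-by-step instruction yields correct math.
    if "step" in instruction:
        return {"2+2": "4", "3*3": "9"}[question]
    return "unsure"

def score(instruction, dataset):
    # Execution accuracy: fraction of questions answered correctly.
    correct = sum(model_answer(instruction, q) == a for q, a in dataset)
    return correct / len(dataset)

def ape_select(candidates, dataset):
    # Pick the instruction ("program") with the highest score.
    return max(candidates, key=lambda c: score(c, dataset))

dataset = [("2+2", "4"), ("3*3", "9")]
candidates = [
    "Answer the question.",
    "Let's work this out in a step-by-step way to be sure we have the right answer.",
]
print(ape_select(candidates, dataset))
```

In the full method the candidate pool is itself proposed by an LLM and can be iteratively resampled around high-scoring instructions.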
Automatic Chain-of-Thought (Auto-CoT)
Auto-CoT is a process of automatically constructing demonstrations with questions and reasoning chains. It first clusters the questions in a dataset into a few clusters, then selects a representative question from each cluster and generates its reasoning chain using Zero-Shot-CoT with simple heuristics. Auto-CoT has several advantages over other methods: it is automatic, scalable, and effective, generating demonstrations that are accurate and informative.
When comparing the accuracy of Auto-CoT with four baseline methods on ten datasets from three categories of reasoning tasks, Auto-CoT consistently matches or exceeds the performance of CoT with manually designed demonstrations. The reason is that Auto-CoT generates task-adaptive demonstrations, tailored to the specific dataset and reasoning task. In contrast, Manual-CoT may use the same demonstrations for multiple datasets, which can lead to lower accuracy.
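A minimal sketch of the demonstration-construction pipeline, using a trivial keyword grouping in place of the real method's k-means clustering over sentence embeddings; the grouping rule, selection heuristic, and canned reasoning chain are all illustrative assumptions:

```python
# Sketch of Auto-CoT: cluster questions, pick one representative per
# cluster, and attach a Zero-Shot-CoT reasoning chain to each.

def cluster(questions):
    # Toy stand-in for embedding-based k-means: group by presence of digits.
    groups = {}
    for q in questions:
        key = "math" if any(ch.isdigit() for ch in q) else "commonsense"
        groups.setdefault(key, []).append(q)
    return groups

def zero_shot_cot_chain(question):
    # Stand-in for a Zero-Shot-CoT call ("Let's think step by step").
    return f"Let's think step by step. [reasoning for: {question}]"

def build_demonstrations(questions):
    demos = []
    for group in cluster(questions).values():
        # Simple heuristic: prefer the shortest question in each cluster.
        representative = min(group, key=len)
        demos.append((representative, zero_shot_cot_chain(representative)))
    return demos

questions = [
    "If I have 3 apples and eat 1, how many remain?",
    "What is 12 divided by 4?",
    "Why do people carry umbrellas?",
]
for q, chain in build_demonstrations(questions):
    print(q, "->", chain)
```

Sampling one representative per cluster is what gives the demonstration set its diversity, reducing the chance that a single flawed Zero-Shot-CoT chain dominates.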
Automatic Multi-step Reasoning and Tool-use (ART)
ART is a framework that uses large language models to automatically generate intermediate reasoning steps for a new task. The LLMs are frozen, meaning they are not updated during the reasoning process, which allows ART to be more efficient and scalable than frameworks that rely on trainable LLMs. ART selects demonstrations of multi-step reasoning and tool use from a task library; these demonstrations take the form of decompositions, where a decomposition is a high-level description of the steps involved in solving a task. ART then selects and uses tools from a tool library alongside LLM generation to complete the intermediate reasoning steps. At test time, ART seamlessly pauses generation whenever an external tool is called and integrates its output before resuming generation. This allows ART to leverage the capabilities of external tools to solve complex tasks.
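The pause-and-resume mechanics can be sketched as follows; the frozen "LLM", the bracketed tool-call syntax, and the `calc` tool are all illustrative stubs:

```python
# Sketch of ART's tool integration: generation halts at a tool call, the
# tool output is spliced into the context, and generation resumes.

TOOLS = {"calc": lambda expr: str(eval(expr))}  # eval is fine for this toy stub

def llm_generate(context):
    # Hypothetical frozen LLM: emits a tool call, then reads its result.
    if "[calc(" not in context:
        return "Step 1: compute the product. [calc(6*7)]"
    return f"Step 2: the answer is {context.rsplit('-> ', 1)[1]}."

def art_run(task, max_rounds=3):
    context = task
    for _ in range(max_rounds):
        chunk = llm_generate(context)
        context += "\n" + chunk
        if "[" in chunk and "(" in chunk:
            # Pause: parse the tool call, run it, splice the output back in.
            call = chunk[chunk.index("[") + 1 : chunk.rindex("]")]
            name, arg = call.split("(", 1)
            context += f"\n-> {TOOLS[name](arg.rstrip(')'))}"
        else:
            return context  # no tool call means generation is finished
    return context

print(art_run("What is 6 times 7?"))
```

Because the model is frozen, all task adaptation lives in the retrieved demonstrations and the tool outputs spliced into the context.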
ART has been shown to be effective on a variety of tasks, including natural language inference, question answering, and code generation. It outperforms previous approaches to few-shot reasoning and tool-use, and it is able to solve tasks that were previously thought to be impossible for LLMs. Humans can optionally edit decompositions to improve performance. For example, they can correct errors in code or incorporate new tools. ART is extensible, which means that it can be easily extended to include new tasks and tools.
ART consistently matched or outperformed automatically generated CoT reasoning chains on 32 out of 34 BigBench tasks and all MMLU tasks. On average, it achieved an improvement of over 22 percentage points. The use of tools in ART significantly enhanced performance on test tasks, with an average improvement of over 12.3 percentage points compared to scenarios where no tools were allowed. ART also improved over direct few-shot prompting by an average of 10.8 percentage points across unseen BigBench and MMLU tasks. Its improvements were particularly remarkable in tasks requiring arithmetic and algorithmic reasoning, where it outperformed direct few-shot prompting by 12.5%. ART also surpassed previous best-known results for GPT3, which use supervision for decomposition and/or tool use, by 6.1 percentage points. ART allows for human intervention and performance improvement by updating the task and tool libraries with new demonstrations. With additional human feedback, ART surpassed the best-known results for GPT3 by an average of over 20 percentage points on 12 test tasks.
Advanced Prompt Engineering Strategies
You can enhance your prompts with some effective prompting strategies, such as temperature and token control, prompt chaining, multi-turn conversations, and more. Temperature and token control fine-tune language model behavior. Temperature adjusts randomness, with higher values promoting creativity. Lower temperature refines responses for precision. Token control sets response length, useful for brevity.
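A sketch of how these controls are typically exposed; `generate` is a hypothetical client function, and the parameter names mirror common LLM APIs rather than any specific provider's:

```python
# Sketch of temperature and token control via a stubbed client call.

def generate(prompt, temperature=1.0, max_tokens=256):
    # Stub: report the settings instead of calling a real model.
    mode = "creative" if temperature >= 0.8 else "precise"
    return f"[{mode} response to {prompt!r}, <= {max_tokens} tokens]"

# Low temperature + short token limit: terse, precision-oriented answers.
print(generate("Summarize the report.", temperature=0.2, max_tokens=60))
# High temperature: more varied, creative output.
print(generate("Write a slogan for a coffee shop.", temperature=0.9))
```

With a real client, `temperature` rescales the sampling distribution over next tokens and `max_tokens` hard-caps the response length; the stub only labels which regime the settings put you in.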
Prompt chaining is the practice of connecting multiple prompts together to create a continuous flow of conversation, by referencing previous inputs or the language model's previous responses in each prompt. Multi-turn conversations consist of multiple exchanges between the user and the language model, with the user providing multiple prompts or the model providing multiple responses. They allow for a more detailed and nuanced conversation, as the user and the language model can build on each other's contributions. For example, to engage in a detailed discussion, users could chain prompts together to explore a topic in depth. The language model could then provide different perspectives on the topic, allowing for a more nuanced and informative discussion.
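Prompt chaining can be sketched as follows, with `llm` as a deterministic stub; each prompt folds the previous response back into the context so the conversation stays continuous:

```python
# Sketch of prompt chaining: carry the previous response into each prompt.

def llm(prompt):
    # Stub model: echoes the last line of the prompt as its "response".
    return f"response({prompt.splitlines()[-1]})"

def chain(prompts):
    history = []
    last_response = ""
    for p in prompts:
        # Reference the previous response to keep the flow connected.
        full = f"Previously you said: {last_response}\n{p}" if last_response else p
        last_response = llm(full)
        history.append((p, last_response))
    return history

history = chain(["Define inflation.", "Now give one historical example."])
print(history[-1][1])
```

Real chains would also truncate or summarize older turns to stay within the model's context window; that bookkeeping is omitted here.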
Also, tailoring prompts to specific industries or fields ensures relevant responses from LLMs, building user trust and encouraging future use. Domain-specific prompts enable better context understanding and accuracy, as LLMs are trained on domain-relevant text, and this enhances the overall user experience, leading to greater satisfaction. The ability to handle unclear or contradictory user inputs can improve prompting in any LLM by ensuring that the model understands the user's request and generates a relevant, informative response. It involves actively engaging with the user and seeking clarification, using natural language understanding to identify the user's intent, and generating multiple responses. For example, if the user asks "I'm looking for a restaurant," the chatbot could generate responses that recommend restaurants based on the user's location, budget, or dietary restrictions.
Tools For Implementing Prompting Techniques
Tools like Langchain, Guidance AI, Semantic Kernel, and Auto-GPT make it easier for us to interact with language models. These powerful tools offer innovative solutions for crafting more effective and context-aware prompts and enhancing the capabilities of language models.
Langchain is a versatile framework for building data-aware and agentic applications using language models. It was launched in October 2022 by Harrison Chase at Robust Intelligence. Langchain provides standard and extendable interfaces for modules like models, prompts, memory, indexes, chains, agents, and callbacks. This makes it easy to build applications that use language models for a wide range of tasks.
Semantic Kernel is an open-source SDK that makes it easy to integrate AI services like OpenAI, Azure OpenAI, and Hugging Face with traditional programming languages like C# and Python. It provides a set of connectors that make it easy to add memories and models to your apps, giving them a simulated "brain." Semantic Kernel also provides a set of AI plugins that allow your apps to interact with the real world via prompts and native functions. These plugins are like the "body" of your AI app.
Semantic Kernel (SK) focuses on avoiding software bloat. It employs a planner to break down tasks and interlock their parts, turning user queries into desired outcomes. SK enables integration of LLMs with traditional programming by combining natural language semantics with code functions. It uses embeddings-based memory for enhanced application capabilities, supporting prompt engineering, chaining, retrieval-augmented generation, and more. SK offers contextual and long-term vectorized memory, allowing access to external knowledge stores and proprietary data, and incorporates design patterns from AI research for intelligent planning and reasoning.
Guidance by Microsoft is a templating language for controlling large language models (LLMs). It supports a variety of prompt engineering techniques and is well-suited for use with powerful LLMs like GPT-4. Guidance offers efficient and effective control of LLMs by integrating generation, prompting, and logical control in a continuous flow, which matches how LLMs process text. It provides a simple and intuitive syntax based on Handlebars templating. It can be used to create rich output structures with multiple generations, selections, conditionals, and tool use.
It offers a playground-like streaming experience in Jupyter/VSCode Notebooks, making it easy to experiment with different prompts and parameters. Smart seed-based generation caching is supported for optimization, which can significantly speed up the generation process. Guidance is compatible with role-based chat models like ChatGPT, and it seamlessly integrates with Hugging Face models. Guidance offers a number of features that can improve the performance and usability of Hugging Face models, such as guidance acceleration, token healing, and regex pattern guides.
Auto-GPT is an experimental, open-source application that demonstrates the capabilities of the GPT-4 language model. It is a popular tool for designing LLM agents, chaining together LLM thoughts to autonomously achieve user-defined goals. Auto-GPT showcases the potential of GPT-4 to operate autonomously, with key features that include internet access for searches, long-term and short-term memory management, and the ability to use GPT-4 instances for text generation. Auto-GPT supports file storage and summarization using GPT-3.5. The application is extensible with plugins.
Auto-GPT is an AI agent that can achieve goals set in natural language. It breaks down goals into sub-tasks and uses internet resources and tools to complete them. It can operate autonomously without requiring manual commands. Auto-GPT can create and revise its own prompts, and it can manage memory by reading from and writing to databases and files. However, Auto-GPT cannot modify the underlying GPT models, as they are proprietary, and it does not typically access its own base system code.
Want to write high quality prompts for LLMs?
We are a team of researchers and engineers who have been working on AI for a very long time. We have written many prompts, some as long as 500 words. If you are looking to improve the performance of your prompts or set up monitoring systems for your language models, reach out and we'll be happy to help!