The Prompt Optimization Stack

ODSC - Open Data Science
Oct 6, 2023

Editor’s note: Mike Taylor is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “Prompt Optimization with GPT-4 and LangChain,” there!

The difference between the average person using AI and a Prompt Engineer is testing. Most people run a prompt two or three times and find something that works well enough. Prompt Engineers run the same prompt hundreds of times, and A/B test it against other variations to determine how often it meets key evaluation criteria, incurs excessive cost, or gives undesirable responses.

I define prompt engineering as “the process for discovering prompts that reliably yield useful or desired results”, and it’s the reliability that’s the hard part. AI can almost replicate human-level intelligence in some tasks, but unfortunately it seems a human-level lack of reliability is a package deal with the increase in intelligence.

Responses from generative AI models have an element of randomness we’ve never seen in programming a computer. You might run the exact same prompt 100 times, and on the 101st try it gives you a racist response, makes something up, or completely fails to do the task. As models get bigger and demand for GPUs increases, optimizing your prompt can also save you a lot of money.

Logging these edge cases, cutting unnecessary tokens, and testing how to improve the prompt is exciting work, particularly when the field is moving so fast. Despite the insane pace of innovation, there are a few standard tools I find myself converging on over time, which help me A/B test prompts and run them at scale to see when and where they break.

LangChain

Many AI engineers have a love-hate relationship with the open-source framework LangChain, and I know several who have ripped it out of their product, only to later decide to put it back in. Having standard components is important to an industry, even if the standard can be imperfect.

Async Calling

This is something I didn’t realize the value of, but now use all the time. If you run 100 prompts serially (one after another), it can take more than 10 minutes (longer if you get rate-limited). I didn’t expect much of a speed-up from running them asynchronously (all at the same time) but it can be as much as 6x faster! OpenAI can deal with multiple calls at once, even if each individual call takes some time.
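
Here’s a minimal sketch of what that looks like, assuming a recent (late-2023) LangChain release where chat models expose ainvoke; the prompts themselves are made up for illustration:

```python
import asyncio
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)

# Hypothetical batch of test prompts
prompts = [f"Write a one-line tagline for product #{i}" for i in range(100)]

async def run_all(prompts):
    # Fire every call off at once instead of waiting for each one in turn
    return await asyncio.gather(*[llm.ainvoke(p) for p in prompts])

# In a script: results = asyncio.run(run_all(prompts))
# In a notebook (which already has an event loop): results = await run_all(prompts)
```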

Retry Logic

What I particularly enjoy about LangChain is its built-in retry logic. In an industry where time is money, this feature is invaluable. If a prompt fails or outputs undesired results, LangChain automates the re-running process, eliminating the need for manual intervention. This is truly essential when you’re running hundreds of calls, in particular when you’re doing them async. It’s completely unworkable without this, especially with the reliability issues that plague the service.
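
As a sketch of where that knob lives (the parameter values here are arbitrary, and the retry behavior is on by default anyway):

```python
from langchain.chat_models import ChatOpenAI

# Built-in retry-with-backoff on rate limits and transient API errors;
# 6 retries and a 60-second timeout are arbitrary choices for illustration
llm = ChatOpenAI(model="gpt-4", max_retries=6, request_timeout=60)

response = llm.invoke("Summarize the plot of Hamlet in one sentence.")
print(response.content)
```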

Standard Convention

LangChain adheres to industry-standard coding practices. This might sound like table stakes, but you’d be surprised how often this is overlooked in the AI development space. The library’s use of standard conventions makes it easier to integrate with other tools and services, and reading their documentation is like a degree in prompt engineering. As a budding Prompt Engineer, this is one of the few transferable skills you can learn that everybody recognizes. As an employer, it pays to adopt a popular framework rather than train new employees on your custom setup.

Abstract Components

The final feather in LangChain’s cap is its ability to abstract components of your prompt engineering tasks. This means you can reuse code blocks for different projects, streamlining the development process and increasing efficiency. It’s like having a Lego set where each block is compatible and purposeful. Abstracting away which model you use, and stuffing your prompts into standard templates, is often the first step towards running large-scale tests. It can get onerous and feel like overkill the first few times you use it, but trust me, it grows on you.
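
A minimal sketch of that abstraction, using the LCEL pipe syntax from recent LangChain versions (the template and inputs are invented for illustration):

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# The prompt lives in a reusable template; the model is a swappable component
prompt = ChatPromptTemplate.from_template(
    "You are a {tone} copywriter. Write a product description for {product}."
)
model = ChatOpenAI(model="gpt-4")  # swap in another chat model without touching the prompt
chain = prompt | model             # LCEL pipe: template output feeds the model

result = chain.invoke({"tone": "playful", "product": "a solar-powered lamp"})
print(result.content)
```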

LangSmith

Picasso said good artists don’t talk about style, trend, and meaning when they get together — they talk about where to buy good turpentine. For good prompt engineers, the topic is logging, which is why it’s no surprise that the first commercial product from LangChain (free in beta) is a logging and evaluation tool.

Logging

LangSmith provides detailed logs that capture not just the output, but also valuable metadata. This enables you to analyze both successes and failures, drawing insights that help refine your prompts. It’s akin to a flight data recorder for your AI projects — everything gets noted, leaving no room for ambiguity.
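
Getting those logs flowing is mostly a matter of environment variables; a sketch (the project name is whatever you want to call it):

```python
import os

# Point LangChain at LangSmith; every call after this gets traced
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "prompt-optimization-tests"  # hypothetical project name

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")
llm.invoke("Name three uses for a paperclip.")  # this run now shows up in LangSmith
```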

Debugging

Because you have everything in the logs, it acts as a powerful debugging tool that allows you to peel back the layers of your prompt responses. This feature is crucial for understanding the why and how behind each output, helping you optimize for future iterations. I have it on all the time for precisely this reason, because if I see something odd happening I can look in here without running endless print statements.

Test Results

A cool feature is the ability to incorporate test results and evaluation metrics. These aren’t just pass-or-fail grades; you get a deep dive into how each prompt performs, including automated evaluation metrics. This is deeply integrated with LangChain (as you’d expect), so you can use all the standard LangChain evaluators here as well.
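
For instance, the criteria evaluator from langchain.evaluation uses an LLM as the grader; a minimal sketch, with a made-up prediction:

```python
from langchain.evaluation import load_evaluator

# "conciseness" is one of the built-in criteria; you can also define your own
evaluator = load_evaluator("criteria", criteria="conciseness")

result = evaluator.evaluate_strings(
    prediction="Paris is the capital of France, a fact most schoolchildren could tell you.",
    input="What is the capital of France?",
)
print(result)  # typically a dict with reasoning, a Y/N value, and a score
```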

Fine-tuning Data

LangSmith allows you to then take that evaluated prompt data and dump it into a useful format for fine-tuning a custom model, when you’re ready to move off OpenAI. You can train the AI models based on your specific needs, ensuring the prompts are laser-focused for your project’s objectives, and at a lower cost.
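
LangSmith has its own export flows for this, but the target shape is worth knowing. Here’s a sketch of turning rated examples into OpenAI’s chat fine-tuning JSONL format (the rated list is invented; in practice it would come from your logged runs):

```python
import json

# Hypothetical rated examples pulled from your test runs
rated = [
    {"prompt": "Write a tagline for a solar lamp", "response": "Light that never sends a bill.", "rating": 1},
    {"prompt": "Write a tagline for a rain jacket", "response": "A jacket.", "rating": 0},
]

# OpenAI chat fine-tuning expects one JSON object per line with a "messages" list
with open("finetune.jsonl", "w") as f:
    for row in rated:
        if row["rating"] != 1:
            continue  # only keep the examples you rated as good
        record = {"messages": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["response"]},
        ]}
        f.write(json.dumps(record) + "\n")
```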

GPT-4

The state-of-the-art model GPT-4 from OpenAI is what I usually go to first when working on a new task. I tend to use the Playground more these days, because my ChatGPT is polluted with Custom Instructions, and I want to get an unbiased view.

Best Quality

While LangChain lets you abstract away what model you’re using, there’s one model that rises above the rest. GPT-4 is the gold standard for natural language generation, no doubt about it. Whether you’re looking to draft emails, write code, or even script your next marketing campaign, GPT-4 brings to the table an unparalleled quality of output. I’ve seen it compose responses that are so coherent and contextually relevant that they could easily pass for human-written text.

Slow Latency

Now, for the catch — latency. GPT-4’s complexity does come with a speed trade-off. If you’re looking to run prompts at scale, this could be a bottleneck. To me, this is like having a Lamborghini that’s stuck in traffic — it’s great, but not fully utilized. This is a significant consideration for projects that require real-time responsiveness. This is one of the few things that gets me switching over to Anthropic’s Claude.

Most Expensive

Quality comes at a price. Literally. The computational resources required to generate responses from GPT-4 can quickly inflate costs. If you’re a startup with a tight budget, or even an enterprise looking to scale, this is an unavoidable hurdle. However, given its capabilities, you’ll often find that the ROI justifies the cost. In some cases it may be better to test on GPT-3.5-turbo (which is more likely to surface mistakes), find all the flaws and deal with them, then switch back to GPT-4.
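
In practice that just means keeping the model name in one place so the switch is a one-line change; a trivial sketch:

```python
from langchain.chat_models import ChatOpenAI

MODEL = "gpt-3.5-turbo"  # run the cheap tests here and shake out the flaws
# MODEL = "gpt-4"        # then flip to GPT-4 for the final, expensive run

llm = ChatOpenAI(model=MODEL, temperature=0)
```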

Often Unreliable

While GPT-4 offers top-tier performance, it isn’t without its operational hiccups. The service can be frustratingly inconsistent; it’s prone to downtime and random rate-limiting. This unpredictability can be more than just an inconvenience; it can be a project-stopper when you’re short on time, and it’s a serious consideration for everybody I know building LLM products. There are workarounds, like switching to a lesser model when you get rate-limited, and LangChain’s out-of-the-box retry logic helps, but we’re all waiting for the industry to mature.
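
One of those workarounds, assuming a recent LangChain version with runnable fallbacks, looks roughly like this (the model choices are illustrative):

```python
from langchain.chat_models import ChatAnthropic, ChatOpenAI

gpt4 = ChatOpenAI(model="gpt-4", max_retries=2)
claude = ChatAnthropic(model="claude-2")

# If GPT-4 errors out (downtime, rate limits), the call falls through to Claude
llm = gpt4.with_fallbacks([claude])

response = llm.invoke("Classify the sentiment of: 'The battery died after an hour.'")
```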

IPyWidgets

This simple open-source library is basically the quickest possible way to build an interface. It works in Jupyter Notebooks, which to be honest is where most data scientists and prompt engineers spend their time anyway, and it forces you not to procrastinate by pushing pixels around a screen to get the design exactly right.

Notebook UI

IPyWidgets’ notebook-based widgets are a convenient way to spin up a quick interface. When I introduced it to my co-founder he asked, “Why do we need an interface if we’re technical enough to read and edit code?” The answer is that even the most ardent coder gets tired of editing text all day, and something visual is less taxing on the brain. Sure, you could do this in the command line or a spreadsheet, but computers have GUIs for a reason.
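
To give a flavour of how little code that takes, here’s a toy rating widget (not a real tool’s interface, just a sketch with made-up responses):

```python
import ipywidgets as widgets
from IPython.display import display

responses = ["Response A ...", "Response B ..."]  # hypothetical model outputs to rate
ratings = []
output = widgets.Output()

def rate(label):
    def handler(_button):
        if len(ratings) < len(responses):
            ratings.append({"response": responses[len(ratings)], "rating": label})
        with output:
            output.clear_output()
            print(f"Rated {len(ratings)} of {len(responses)}")
    return handler

thumbs_up = widgets.Button(description="👍 good")
thumbs_down = widgets.Button(description="👎 bad")
thumbs_up.on_click(rate(1))
thumbs_down.on_click(rate(0))

display(widgets.HBox([thumbs_up, thumbs_down]), output)
```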

Minimal Styling

The library opts for minimalistic styling, putting functionality at the forefront. This approach keeps you focused on what really matters: optimizing your prompts and scripts without any distractions. In an era of over-sophisticated tech and gratuitous design, simplicity is often overlooked. This gets it right. You can always design a beautiful React front-end later if you need to bring in non-technical raters.

Works in Python

The compatibility with Python and Jupyter Notebooks is a significant win in my book, because I think in Python code. It’s a real chore to have to move out of Python and into TypeScript and React to design some flashy interface, just to build a demo to see if something works. Python is the lingua franca of the AI world, so it makes sense to have an interface that integrates seamlessly with the rest of the code you’ve been writing.

Display HTML

You aren’t stuck with the widgets that come out of the box; you can also display HTML, which opens up a broad spectrum of what you can show. I quite often put my tables and charts into widgets and format them how I like. For prompt optimization, that means displaying the results of the test inline, right after you finish rating.
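
For example, a Pandas table rendered straight into a widget (the numbers are made up):

```python
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

results = pd.DataFrame({
    "prompt": ["variant_a", "variant_b"],
    "pass_rate": [0.82, 0.67],
})  # hypothetical test results

# Render the results table as HTML inside a widget, right below the rating UI
display(widgets.HTML(value=results.to_html(index=False)))
```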

Pandas

Pandas is the Excel of the Python world, and many a data science career has been launched off a good knowledge of it. None of the analysis for prompt engineering is particularly challenging, so you don’t need anything more sophisticated than this.

Pivot Tables

Pandas makes it super easy to create pivot tables, which let you quickly see how each prompt variant is performing across your eval metrics. Pivot tables in Pandas can help you identify patterns and trends, offering a data-driven approach to prompting.
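
A sketch with invented run-level data:

```python
import pandas as pd

# Hypothetical run-level results: one row per prompt call
runs = pd.DataFrame({
    "prompt_variant": ["A", "A", "B", "B", "B", "A"],
    "model": ["gpt-4", "gpt-3.5-turbo", "gpt-4", "gpt-3.5-turbo", "gpt-4", "gpt-4"],
    "passed": [1, 0, 1, 1, 1, 1],
})

# Pass rate per prompt variant, per model
pivot = pd.pivot_table(runs, values="passed", index="prompt_variant",
                       columns="model", aggfunc="mean")
print(pivot)
```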

Filters

With its robust filtering capabilities, Pandas empowers you to slice and dice your data like a Michelin-starred chef. Whether you’re looking to identify what components of prompts had the best success rate, response time, or any other metric, custom filters allow you to get granular and uncover insights that can be easily missed.
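
Plain boolean masks do the slicing (the columns here are again invented):

```python
import pandas as pd

runs = pd.DataFrame({
    "prompt_variant": ["A", "A", "B", "B"],
    "passed": [1, 0, 1, 1],
    "latency_seconds": [2.1, 7.4, 1.8, 3.0],
})  # hypothetical run-level results

failures = runs[runs["passed"] == 0]        # just the failed runs
slow = runs[runs["latency_seconds"] > 5]    # just the slow ones
print(failures)
print(slow)
```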

Eval Metrics

What gets measured gets managed. Pandas lets you run functions against any column in your dataframe, and create new columns based on the output. This allows you to inject whatever evaluation metric you want into the data, making programmatically defined performance metrics easier to work with.
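
For instance, a programmatic length check added as a new column (the budget of eight words is arbitrary):

```python
import pandas as pd

runs = pd.DataFrame({
    "response": ["A short answer.", "A very long rambling answer that goes on and on and on."],
})  # hypothetical responses

# New eval column derived from a function of an existing one
runs["too_long"] = runs["response"].apply(lambda text: len(text.split()) > 8)
print(runs)
```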

CSV Export

Pandas also offers a straightforward CSV export feature, enabling you to take your insights and share them across platforms or with team members. Pandas may be the Excel of Python, but often it’s kind of nice to just put this data in Excel, and this makes it easy.
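
One line does it (the filename is whatever you like):

```python
import pandas as pd

runs = pd.DataFrame({"prompt_variant": ["A", "B"], "pass_rate": [0.82, 0.67]})  # hypothetical summary

# Hand the results to anyone who would rather slice them in Excel
runs.to_csv("prompt_test_results.csv", index=False)
```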

Bonus: Thumb

Thumb is the open-source prompt optimization & testing library I built, which incorporates everything above. I was reusing the same components so often that I thought it’d make sense to package them up as a module, then release it open-source for others to benefit from, and add to.

Async Testing & Caching

Thumb goes beyond basic prompting with its asynchronous testing and caching features. Async testing lets you run multiple prompt tests concurrently, slashing the time it takes to optimize. Caching, on the other hand, stores each result as it comes in, so you don’t waste tokens if your test is interrupted or fails for some reason. This is the single biggest reason I wanted to build thumb: I was so tired of a test breaking in the middle and having to re-run the whole thing from scratch!
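
Basic usage looks something like the sketch below; the call signature is from memory of the README, so check the repo for the current details:

```python
import os
import thumb

os.environ["OPENAI_API_KEY"] = "sk-..."  # your OpenAI key

# Two prompt variants to A/B test against each other
prompt_a = "Tell me a joke"
prompt_b = "Tell me a joke in the style of a stand-up comedian"

# Runs each variant multiple times, caches results as it goes,
# then pops up the rating interface in the notebook
test = thumb.test([prompt_a, prompt_b])
```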

LangChain & LangSmith Integration

Of course thumb was built entirely out of LangChain components under the hood, so you get all the power of the retry logic but without having to deal with the extra admin of remembering exactly how things are formatted. Setting up logging with LangSmith is as simple as a single environment variable: LANGCHAIN_API_KEY. If you want to add additional features or customize it, because it’s open-source you can just peek at the source code and make changes, extending it to handle more of the key LangChain components.

IPyWidgets Interface

Thumb started as a simple internal tool I used to make my work efficient, and I’ve tried to keep it that way. This means you can quickly run the simple user interface in a Jupyter Notebook and rate a bunch of prompts, before dumping the data and doing analysis. Though there’s no shareable link to pass to team members, this focuses on the 80% of the work that happens in the middle, between crafting your first prompt and optimizing it before it goes into production. Once in production, there are plenty of tools out there, like Prodigy or brat, for labelling and rating machine learning responses.

Conclusion

Although things may change, some early leaders and beneficiaries of the AI wave are emerging. Even if these tools aren’t the eventual winners, they will serve as an unavoidable reference for what comes next. Learning these tools today will help propel you into an AI future.

Catch me on stage at ODSC West 2023, where I’m giving the talk “Prompt Optimization with GPT-4 and LangChain.” I’ll run through an actual test with this stack to show you how to optimize your prompts. If you want more Prompt Engineering from me, you can check out my course on Udemy, or see an early release of my book with O’Reilly.

About the author

Mike is a data-driven, technical marketer who built a 50-person marketing agency (Ladder), and 300k people have taken his online courses (LinkedIn, Udemy, Vexpower). He now works freelance on generative AI projects, and is writing a book on Prompt Engineering for O’Reilly Media.

Originally posted on OpenDataScience.com

