Why AutoGPT fails and how to fix it

A couple of weeks after AutoGPT came out, we tried to make it actually usable. If you haven't seen it yet: it looks amazing at first glance, but then falls flat because it creates elaborate plans that are entirely unnecessary. Even a simple request like "find the turning circle of a Volvo V60" mostly sends it into a loop of trying to figure out what its goal is... by googling.

How do you improve that?

Simplifying the code

Step zero was to refactor the system a bit -- starting from the langchain.experimental implementation. If you look at how the repo is structured, it looks quite complex: there's some elaborate assembly of "constraints", "goals", and other pieces, spread across several files and functions. But really it's all just a string template. Rewriting the prompts as actual Jinja templates made prompt engineering much faster and easier.
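To give a flavour of what that looks like, here's a minimal sketch of the idea (the prompt text is abridged and assembled from snippets quoted later in this post, not the verbatim AutoGPT prompt):

```python
from jinja2 import Template

# Minimal sketch: the whole "assembly" of constraints, goals, etc.
# collapses into one readable template. Prompt text is abridged.
GOAL_PROMPT = Template("""\
You need to find the simplest and most effective way to achieve the user's task.

The capabilities of the agent are limited to the following:
{% for capability in capabilities %}{{ loop.index }}. {{ capability }}
{% endfor %}
User task: {{ task }}
""")

prompt = GOAL_PROMPT.render(
    capabilities=["search and browse the web", "read and write text files"],
    task="find the turning circle of a Volvo V60",
)
```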

The second refactor I did was to have the LLM output JSON directly instead of a dash-delimited list. That simplifies output parsing and makes failures explicit (though they remain rare).
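The parsing side then becomes a few lines of standard library code. A sketch of the idea (the function and field names here are mine, not from the repo):

```python
import json

def parse_plan(llm_output: str) -> list[str]:
    # A malformed response now raises immediately instead of being
    # silently mis-split, as a dash-delimited list would be.
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as err:
        raise ValueError(f"LLM returned invalid JSON: {err}") from err
    return data["steps"]
```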

AutoGPT prompts

AutoGPT has basically two prompts:

  1. Goal prompt. This is what turns the user's input ("find the turning circle of a Volvo V60") into a multi-step plan ("1. search the internet for Volvo V60 turning circle, 2. read through first results, 3. return answer to user").
  2. Loop prompt. This takes in everything that fits in the context window: the generated plan, previous chat history, previous commands and their outputs, and so on. The output is a structured object (sketched just after this list) that contains several things, but most importantly the LLM's pick for the next tool to use, along with its input argument values.
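To make that output concrete, here is roughly the shape of the object, written as a Python literal (the field names are illustrative, and the real object carries a few more fields):

```python
# Illustrative shape of the loop prompt's structured output.
next_action = {
    "thoughts": "The plan says to search the internet first",
    "command": {
        "name": "search_web",                           # the tool the LLM picked
        "args": {"query": "Volvo V60 turning circle"},  # its input arguments
    },
}
```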

While the loop prompt had some room for optimization (mostly removing things), the goal prompt was clearly broken. The output felt like a plan written by a GPT-2-level McKinsey consultant, with lots of unnecessary "strategic" steps that made sure the agent would take the longest possible route to the actual user intent.

Of course GPT-4 is not at fault. Rather, it is the prompt that makes AutoGPT fail like that. This gist shows the AutoGPT goal prompt verbatim.

It is very clear now why the agent outputs such bad plans: it's all in the prompt.

An improved AutoGPT prompt

Here are the improvements I made to the prompt. By no means is this a simple step-by-step recipe; the whole area of prompt engineering is still so raw that all you have to go on is intuition, Twitter, and a few guides.

You can see the complete rewritten prompt in this GitHub gist; I'll walk you through the most important parts of it.

Simple writing style

Writing style in the prompt affects the output style of an LLM, so I rewrote everything in a simpler style. For example, instead of "Your task is to devise up to 5 highly effective goals" I wrote "You need to find the simplest and most effective way to achieve the user's task."

List of capabilities

Since GPT is trained on the internet, it may assume (hallucinate) that it has access to anything it has seen there. For example, when asked to find the yearly maintenance cost of a car, AutoGPT once made a plan to run a survey of thousands of car owners in Europe. To fix that I added a list of capabilities:

The capabilities of the agent are limited to the following:
1. search and browse the web
2. read and write text files

These were the actual tools I gave the agent access to, and listing them turned out to be an effective way to curtail its enthusiasm for elaborate plans.
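To make the mapping concrete, here's a sketch of a tool registry matching those two capabilities (stub implementations and names of my own, not the actual wiring I used):

```python
# Stub tools matching the two capabilities advertised in the prompt.
def search_web(query: str) -> str:
    """Search and browse the web (stubbed out here)."""
    raise NotImplementedError("wire this up to your search API of choice")

def read_write_file(path: str, content: str | None = None) -> str:
    """Read a text file, or write `content` to it if given."""
    if content is None:
        with open(path) as f:
            return f.read()
    with open(path, "w") as f:
        f.write(content)
    return ""

# The capability list in the prompt and this registry must stay in
# sync, or the agent will plan around tools it doesn't actually have.
TOOLS = {"search_web": search_web, "read_write_file": read_write_file}
```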

Conditional complexity

At first I tried to reduce the plan's complexity by instructing the goal prompt to output as few steps as possible. This worked well for simple web queries but failed on more complex multi-step tasks (like searching for 5 things and then putting together a comparison table).

Then I realized I could just make the plan's length conditional on the difficulty of the input:

If the task is straightforward, output only one step. Always output less than 7 steps.

That seemed to make the plans consistently good on both simple and complex requests.
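Because the constraint is explicit, it's also cheap to enforce outside the prompt. A sketch of a retry guard (my own addition, not something AutoGPT does), reusing parse_plan from the earlier sketch:

```python
MAX_STEPS = 7

def validated_plan(llm_call, prompt: str, retries: int = 2) -> list[str]:
    # Re-ask the model if it ignores the "less than 7 steps" instruction.
    for _ in range(retries + 1):
        steps = parse_plan(llm_call(prompt))
        if 0 < len(steps) < MAX_STEPS:
            return steps
    raise ValueError(f"plan kept exceeding {MAX_STEPS} steps")
```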

"Be brief"

"Be brief" is the standard approach to reducing verbosity, but I also added a target word count: "Each step should be described in 3-10 words."

Examples, examples, examples

Out of everything I did, putting more examples into the prompt (in-context learning) seemed to have the most consistent effect on plan quality -- improving the task description alone (zero-shot) did not have nearly as much effect. This was true across all parts of the prompt. When I wrote the examples below, I of course had to follow my own instructions to get the desired effect!

So I added four examples. I kept the original one to retain some of the enterprise-gibberish flavour of the original AutoGPT, but added several examples with straightforward plans. I'll reproduce just one below (note that I added a few fields, but won't cover the reasoning for those here):

Example input: Make a list of all Volvo SUVs currently in production with price when new

Example output:

"steps": [ "Search the internet for Volvo SUVs and make a list", "For each SUV, search for its price and add the price to the list", "Return results to the user as a well formatted table, and a list of full URLs used for the information" ],