Three categories of guardrails
There are only three categories of guardrails to prevent harm from agents.
First, relying on hard constraints that only allow certain kinds of behaviour. For example, limiting which tokens can be decoded (structured output) or exposing only a specific set of tools to the agent. Assuming correct implementation, these guarantee that certain behaviours are impossible.
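A minimal sketch of the tool-exposure version of this, assuming a hypothetical dispatcher (the tool names and the `dispatch` function are illustrative, not from any particular framework):

```python
# Hypothetical sketch: the agent can only invoke tools in an explicit
# allowlist. Anything outside the mapping cannot run, no matter what
# the model outputs -- the constraint lives in code, not in the prompt.

ALLOWED_TOOLS = {
    "search_docs": lambda query: f"results for {query!r}",
    "get_time": lambda: "2024-01-01T00:00:00Z",
}

def dispatch(tool_name, *args):
    """Run a tool call only if it is on the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool_name!r} is not exposed to the agent")
    return ALLOWED_TOOLS[tool_name](*args)
```

The guarantee holds regardless of how the model is prompted or jailbroken, which is what distinguishes this category from the next one.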
Second, the LLM's own intelligence. The system prompt, specific training against jailbreaks, and other instructions can prevent unwanted behaviour. Their effectiveness can be validated through evals, red-teaming, and other empirical methods, but not strictly proven.
And third, having a human in the loop. This can be done on different levels, like approving a high-level plan or approving every individual tool call. But fundamentally human judgment is built into the system in some form.
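At the per-tool-call level, this can be sketched as a gate wrapped around execution (a hypothetical shape, not any specific framework's API; `approve` is injected so it could be a CLI prompt, a web UI, or a stub in tests):

```python
# Hypothetical sketch: every tool call passes through a human reviewer.
# The agent proposes; the reviewer decides. A rejected call returns
# None, signalling the agent to re-plan rather than proceed.

def gated_call(tool, args, approve):
    """Execute a tool call only if the reviewer approves it."""
    if approve(f"Agent wants to call {tool.__name__} with arguments {args}"):
        return tool(*args)
    return None

def transfer_funds(amount, recipient):
    # Stand-in for a consequential action worth gating.
    return f"sent {amount} to {recipient}"
```

Approving a high-level plan instead would mean calling the gate once, on the plan, rather than on each action.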
The third option seems best, but it might not be. Even ignoring costs, humans have biases, get hungry and fatigued, aren't good at context-switching, and won't stay alert for 1-in-10,000 issues. LLM judgement and human judgement are similar in nature, and will only become more so as LLMs improve.