The stages of an LLM app seem to go like this:
- Hardcode the first prompt, get the end-to-end app working.
- Realise that the answers are bad.
- Do some prompt engineering.
- Realise the answers are still bad.
- Do some more prompt engineering.
- Discover vector databases!!!1
- Dump a ton of data as plain strings into the vector db for semantic search on embeddings.
- Post your achievement on Twitter.
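The "semantic search on embeddings" step in that list boils down to nearest-neighbour search over vectors. A minimal sketch, with made-up 4-dimensional vectors standing in for what an embedding model would produce (the corpus texts and numbers here are hypothetical):

```python
import math

# Toy corpus: in a real app each chunk would be embedded by a model;
# these 4-dim vectors are invented for illustration.
corpus = {
    "refund policy": [0.9, 0.1, 0.0, 0.2],
    "shipping times": [0.1, 0.8, 0.3, 0.0],
    "account deletion": [0.0, 0.2, 0.9, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector that should land near "refund policy"
print(search([0.8, 0.2, 0.1, 0.1], k=1))  # → ['refund policy']
```

A vector database does essentially this, plus approximate indexing so it scales past brute force; the ranking principle is the same.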
The journey usually ends here -- with an impressive demo. But the demo is usually hand-picked from many examples, and for most users and most queries the system doesn't work.
What's next? Improving on this takes much more work. Setting up even semi-rigorous evaluation is tedious, including manual labelling. Fetching the right context takes even more work. Prompt engineering turns into orchestrating multi-prompt chains with intent detection, leading to interleaved Python code and LLM calls...
Which is to say, another form of engineering.
What I wanted to focus on, though, is the "fetching the right context" part. While it may seem new, it is the age-old information retrieval problem -- and the solutions are probably similar. So my suggestion to anyone working to remove hallucinations: brush up on your Information Retrieval 101, and be inspired by the search-engine builders of 20+ years ago.
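To make the IR 101 point concrete: classic lexical ranking like BM25 (the scoring function behind those 20-year-old search engines, still the default in Lucene-based systems) can be written in a few lines. A self-contained sketch over a toy collection (documents and query invented for illustration):

```python
import math
from collections import Counter

# Tiny document collection, standing in for context chunks.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N

def idf(term):
    # BM25 inverse document frequency with the usual +0.5 smoothing.
    df = sum(1 for d in tokenized if term in d)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc, k1=1.5, b=0.75):
    # Score one tokenized document against a query string.
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        f = tf[term]
        score += idf(term) * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def search(query, k=1):
    ranked = sorted(range(N), key=lambda i: bm25(query, tokenized[i]),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]

print(search("cat mat"))  # → ['the cat sat on the mat']
```

In practice a hybrid of lexical scoring like this and embedding similarity often beats either alone -- which is exactly the kind of trick the old search-engine literature is full of.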