Index to reduce context limitations

There is a very simple, standard way of solving the problem of too-small GPT context windows. Here is what to do when the context window gets full:

Indexing:

  1. Chunk up your context (book text, documents, messages, whatever).
  2. Put each chunk through the text-embedding-ada-002 embedding model, and store the resulting vectors -- each consisting of 1536 floats.
  3. Build a fast similarity-search index over these vectors, e.g. using spotify/annoy or Pinecone or similar. A minimal sketch of these three steps follows the list.
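
Something like the following, for example -- a minimal indexing sketch, assuming the pre-1.0 `openai` Python client, the `annoy` package, and an `OPENAI_API_KEY` in the environment; the file name, chunk size, and tree count are arbitrary:

```python
import openai
from annoy import AnnoyIndex

EMBED_MODEL = "text-embedding-ada-002"
DIM = 1536  # ada-002 embeddings have 1536 dimensions

def chunk(text, size=1000):
    """Dumb chunker: split every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk(open("book.txt").read())  # hypothetical input file

# Embed every chunk and add the resulting vectors to an annoy index.
response = openai.Embedding.create(model=EMBED_MODEL, input=chunks)
index = AnnoyIndex(DIM, "angular")
for i, item in enumerate(response["data"]):
    index.add_item(i, item["embedding"])

index.build(10)         # 10 trees; more trees -> better recall, slower build
index.save("book.ann")  # the chunk texts themselves still need to be stored, e.g. as a list or in a DB
```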

At runtime:

  1. Given the user's message m, embed it using the same embedding model you used during indexing (step 2 above).
  2. Query the vector index you built in step 3, using the embedding from step 1.
  3. Put the top N resulting chunks into the context window (at least one, and as many as you can afford). The sketch after this list shows all three steps.
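
A query-time sketch, continuing the indexing example above (same openai/annoy assumptions, question text made up):

```python
def retrieve(user_message, chunks, index, n=3):
    # 1. Embed the user's message with the same model used during indexing.
    emb = openai.Embedding.create(model=EMBED_MODEL, input=[user_message])
    query_vec = emb["data"][0]["embedding"]
    # 2. Query the vector index for the n nearest chunk vectors.
    ids = index.get_nns_by_vector(query_vec, n)
    # 3. Return the corresponding chunks, to be pasted into the context window.
    return [chunks[i] for i in ids]

question = "What happens in chapter 3?"
context = "\n\n".join(retrieve(question, chunks, index))
prompt = f"Answer the question using the excerpts below.\n\n{context}\n\nQuestion: {question}"
```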

That's it! And fortunately you don't really need to do any of it yourself: there is a Python package llama_index (no relation to the Llama LLM) to do this in 3 lines of code. Or one line of code in langchain with VectorStoreIndexCreator.
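
For reference, the langchain version looked roughly like this in early releases (exact import paths and class names vary between langchain versions, and the file name is made up):

```python
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorStoreIndexCreator

# Chunking, embedding, and index construction all happen inside from_loaders().
index = VectorStoreIndexCreator().from_loaders([TextLoader("book.txt")])
print(index.query("What happens in chapter 3?"))
```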

None of this is novel - this is a stock problem in the field of information retrieval, and versions of it have been around for a long time. The novel part is that the OpenAI embedding model is cheap and good enough for similarity search to work okay, and the generative GPT model is good enough to make a Q&A bot out of the box.

The generalized version of this consists of the following components (a rough interface sketch follows the list):

  1. A chunker: splits documents/text into atomic pieces. A dumb one splits every n characters; a smarter one might take headings, paragraph breaks, etc. into account.
  2. A feature extractor: embeds each chunk into a (fixed-length) vector. In prehistoric times this might have been tf-idf; today it is usually a neural network's activations.
  3. A vector index: finds similar vectors fast, even if you've stored millions of them. There are many options; see e.g. this list.
  4. A generative model: this is what you already have - ChatGPT.
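
One way to spell out these four components as plain Python interfaces -- a sketch, not any particular library's API; any concrete chunker, embedder, index, or LLM can be swapped in behind them:

```python
from typing import Protocol, Sequence

class Chunker(Protocol):
    def split(self, text: str) -> list[str]: ...

class FeatureExtractor(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...

class VectorIndex(Protocol):
    def add(self, vectors: Sequence[list[float]]) -> None: ...
    def nearest(self, vector: list[float], n: int) -> list[int]: ...

class GenerativeModel(Protocol):
    def complete(self, prompt: str) -> str: ...

def answer(question: str, chunks: list[str], extractor: FeatureExtractor,
           index: VectorIndex, llm: GenerativeModel, n: int = 3) -> str:
    # Embed the question, fetch the n most similar chunks, and let the LLM answer.
    query_vec = extractor.embed([question])[0]
    context = "\n\n".join(chunks[i] for i in index.nearest(query_vec, n))
    return llm.complete(f"{context}\n\nQuestion: {question}")
```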

You could improve on this architecture too. For example, when fetching the 10 nearest neighbors from the index, you could do some LLM-based filtering to make sure you retrieve only the most relevant ones and discard superficially-similar-but-unrelated ones. Or perhaps you want to have three parallel pipelines, one of which chunks on the document level, the second on the paragraph level, and the third on the sentence level.
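
A hedged sketch of the filtering idea: over-fetch candidates, then ask the chat model to judge each one. The prompt wording is just an illustration, and it assumes the same pre-1.0 `openai` client as above:

```python
def filter_relevant(question, candidate_chunks):
    """Keep only the candidates the LLM judges relevant to the question."""
    kept = []
    for chunk in candidate_chunks:  # e.g. the 10 nearest neighbors
        judgement = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Question: {question}\n\nText: {chunk}\n\n"
                           "Does this text help answer the question? Answer yes or no.",
            }],
        )["choices"][0]["message"]["content"]
        if judgement.strip().lower().startswith("yes"):
            kept.append(chunk)
    return kept
```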

There was also a paper where, instead of (or in addition to) embedding the user's question, you first ask the LLM to hallucinate the answer, embed that, and use it to retrieve from the index. Of course, through testing and tinkering, you can come up with many more tricks and improvements like this.
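
A sketch of that hallucinate-then-retrieve trick (the description matches the HyDE approach), under the same openai/annoy assumptions as the earlier examples:

```python
def hyde_retrieve(question, chunks, index, n=3):
    # Let the LLM guess an answer first...
    guess = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short, plausible answer to: {question}"}],
    )["choices"][0]["message"]["content"]
    # ...then embed the guess and use it as the query vector.
    vec = openai.Embedding.create(model=EMBED_MODEL, input=[guess])["data"][0]["embedding"]
    return [chunks[i] for i in index.get_nns_by_vector(vec, n)]
```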