All top LLMs, including all GPT-family and Llama-family models, generate predictions one token at a time. It's inherent to the architecture, and applies to models running behind an API as well as local or self-deployed models.
Armed with this knowledge, we can make a very accurate model of what the LLM's response time will be in any given situation.
The basic formula is this:
T = const. + k N, where:
T is the total response time you see to a query sent to an LLM.
const depends on a whole host of things outside your control: DNS lookups, proxies, queueing, and input token processing.
N is the number of output tokens generated by the model during your request.
k is a measure of how long it takes to generate one token. It's a function of the model specifics on the one hand: its size, quantization, any optimizations that have been applied, etc. On the other hand, it also depends on the execution environment: hardware, interconnect, potentially even CUDA drivers, etc. From a developer's perspective, though, k is a relatively stable value and out of your control (unless you're in charge of your own LLM deployment).
For example, using GPT-4 on OpenAI from my laptop and office network, the constant is around 1 second. The k N component is around 47 seconds. In general, unless you're making very small queries (outputting only 20-30 words or less) or using an extremely small model (smaller than Llama-7B), the k N component will dominate the response time.
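To make the formula concrete, here is a minimal Python sketch. The function names are mine, and the values of k and N in the last line are hypothetical, chosen only for illustration. The 1 second and 47 second figures are the ones quoted above.

```python
def response_time(const_s: float, k_s_per_token: float, n_tokens: int) -> float:
    """T = const + k * N, in seconds."""
    return const_s + k_s_per_token * n_tokens


def generation_share(const_s: float, generation_s: float) -> float:
    """Fraction of the total response time spent on the k N component."""
    return generation_s / (const_s + generation_s)


# Figures quoted above for GPT-4: about 1 s constant, about 47 s for k N.
share = generation_share(const_s=1.0, generation_s=47.0)
print(f"k N share of total response time: {share:.1%}")

# Hypothetical values, for illustration only: k = 0.05 s/token, N = 400 tokens.
example_total = response_time(const_s=1.0, k_s_per_token=0.05, n_tokens=400)
print(f"T = {example_total:.1f} s")
```

With the quoted figures, token generation accounts for nearly all of the wait, which is why shaving the constant rarely pays off.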
You can measure the value of k for any given deployment, and that's exactly what I have done in GPT-3.5 and GPT-4 response times.
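One way to estimate k and the constant, assuming you have recorded (output tokens, seconds) pairs for a batch of requests, is an ordinary least-squares line fit. The sketch below is not necessarily how the measurements in that post were made. The data is synthetic, generated from known parameters so the fit can be checked, and `fit_latency_model` is a hypothetical helper name.

```python
def fit_latency_model(samples):
    """Least-squares fit of T = const + k * N.

    samples: list of (output_tokens, seconds) pairs.
    Returns (const, k), with k in seconds per token.
    """
    count = len(samples)
    mean_tokens = sum(tokens for tokens, _ in samples) / count
    mean_time = sum(seconds for _, seconds in samples) / count
    covariance = sum(
        (tokens - mean_tokens) * (seconds - mean_time) for tokens, seconds in samples
    )
    variance = sum((tokens - mean_tokens) ** 2 for tokens, _ in samples)
    k = covariance / variance
    const = mean_time - k * mean_tokens
    return const, k


# Synthetic data, NOT real measurements: generated from const = 1.0 s and
# k = 0.05 s/token, with a small alternating offset standing in for noise.
TRUE_CONST, TRUE_K = 1.0, 0.05
token_counts = [50, 100, 200, 400, 800]
samples = [
    (n, TRUE_CONST + TRUE_K * n + (0.02 if i % 2 == 0 else -0.02))
    for i, n in enumerate(token_counts)
]

const, k = fit_latency_model(samples)
print(f"const ~ {const:.2f} s, k ~ {k * 1000:.1f} ms/token")
```

For real measurements you would replace `samples` with timings from your own requests, spread over a wide range of output lengths so the slope is well determined.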
I find linearity unintuitive in this context, since these sorts of optimizations often have diminishing returns. So the important thing to remember is this: no matter how many tokens you're generating now, a 2x reduction in token count will get you a 2x reduction in the k N component, and therefore very nearly a 2x reduction in total latency whenever the constant is small by comparison. It's scale-independent. If you're looking for concrete tips, see my post about how to make LLM responses faster.
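The scale independence is easy to check numerically. The sketch below assumes hypothetical values for the constant and k (1 s and 0.05 s/token) and uses a helper name of my own. It compares the speedup from halving N at three very different output lengths.

```python
def halving_speedup(const_s, k_s_per_token, n_tokens):
    """Return (k N speedup, total speedup) from halving the output token count."""
    half = n_tokens / 2
    generation_speedup = (k_s_per_token * n_tokens) / (k_s_per_token * half)
    total_speedup = (const_s + k_s_per_token * n_tokens) / (const_s + k_s_per_token * half)
    return generation_speedup, total_speedup


# Hypothetical parameters, for illustration only.
CONST_S = 1.0
K_S_PER_TOKEN = 0.05

for n_tokens in (100, 1000, 10000):
    generation, total = halving_speedup(CONST_S, K_S_PER_TOKEN, n_tokens)
    print(f"N={n_tokens:>5}: k N speedup {generation:.2f}x, total speedup {total:.2f}x")
```

The k N speedup is 2x at every scale. The total speedup falls a little short of 2x because the constant does not shrink, and the gap narrows as N grows.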
You might also wonder (I did!) why the response time is not linear in input token count. The answer is that input token processing is highly parallel: all the input tokens are known in advance, so the model can process them together in a single pass instead of one at a time. For output, however, generating the (N+1)-th token requires the N-th token, which has to be generated first, so every output token will necessarily add some latency.
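The sequential dependency can be illustrated with a toy generation loop. This is only a sketch: `toy_next_token` is a made-up stand-in for a model's forward pass, not a real LLM, and it ignores how much work each call does. What it shows is that the number of calls that must run one after another is set by the output length, not the prompt length.

```python
def toy_next_token(context):
    """Stand-in for a model's forward pass; NOT a real LLM.

    Returns a made-up "token" that depends on every token in the context.
    """
    return (sum(context) * 31 + len(context)) % 1000


def generate(prompt_tokens, n_output_tokens):
    """Generate tokens one at a time, counting the sequential model calls."""
    context = list(prompt_tokens)
    sequential_calls = 0
    for _ in range(n_output_tokens):
        # The whole prompt is already in `context`, so a longer prompt adds
        # no iterations. Each new token must exist before the next call can
        # start, so the output length sets the number of iterations.
        token = toy_next_token(context)
        sequential_calls += 1
        context.append(token)
    return context[len(prompt_tokens):], sequential_calls


short_prompt = [1, 2, 3]
long_prompt = list(range(500))
output_short, calls_short = generate(short_prompt, n_output_tokens=20)
output_long, calls_long = generate(long_prompt, n_output_tokens=20)
print(calls_short, calls_long)
```

Both prompts need the same number of sequential steps for the same output length, which matches the shape of the formula: input processing sits in the constant, while each output token contributes its own k.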