GPT-3.5 and GPT-4 response times

Some of the LLM apps we've been experimenting with have been extremely slow, so we asked ourselves: what do GPT APIs' response times depend on?

The OpenAI docs have a section on this, and the gist is: response time mostly depends on the number of output tokens generated by the model.

To quantify this, I ran some tests with different max_tokens values, to understand how much latency each token adds. I tested Azure GPT-3.5 as well since a friend claimed Azure models are 3-10x faster (I don't yet have access to Azure GPT-4).

Here are the results.

  • OpenAI GPT-3.5: 73ms per generated token
  • Azure GPT-3.5: 34ms per generated token
  • OpenAI GPT-4: 196ms per generated token

Azure is more than twice as fast for the exact same GPT-3.5 model! And within the OpenAI API, GPT-4 is almost three times slower than GPT-3.5.

You can use these values to estimate the response time of any call, as long as you know roughly how large the output will be. For a request to Azure GPT-3.5 with 600 output tokens, the latency will be roughly 34ms/token x 600 tokens = 20.4 seconds.

Or if you want to stay under a particular response time limit, you can work out your output token budget. If you are using GPT-4 and want responses to always be under 5 seconds, you need to keep outputs under 5 seconds / 0.196 sec/token = 25.5 tokens. You can optionally enforce this via the max_tokens parameter.
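Both calculations above can be captured in a few lines of code. This is a minimal sketch using the per-token figures measured in this post; the function names are mine, and the numbers are ballpark estimates, not guarantees.

```python
# Per-token latencies measured in this post, in milliseconds.
MS_PER_TOKEN = {
    "openai-gpt-3.5": 73,
    "azure-gpt-3.5": 34,
    "openai-gpt-4": 196,
}

def estimated_latency_s(model: str, output_tokens: int) -> float:
    """Estimated response time in seconds for a given output length."""
    return MS_PER_TOKEN[model] * output_tokens / 1000

def token_budget(model: str, max_seconds: float) -> int:
    """Largest output length expected to stay under max_seconds."""
    return int(max_seconds * 1000 / MS_PER_TOKEN[model])

print(estimated_latency_s("azure-gpt-3.5", 600))  # -> 20.4
print(token_budget("openai-gpt-4", 5))            # -> 25
```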

These numbers are bound to change as OpenAI makes changes to its infrastructure. Latency may also vary with total load and other factors. However, the key thing to remember is this: to make GPT API responses faster, generate as few tokens as possible.

If you want to dive deeper, I recently wrote a whole post about how to make GPT faster.

Experiment details

Here's how I ran my experiments.

  • Made an API call for each value of max_tokens, per model and provider, with three repetitions per combination to also gauge variance.
  • Fit a linear regression model to the outputs, including intercept.
  • Reported the coefficient as "latency per token" and ignored the intercept: the fit was good, and the intercept mostly reflects your particular network setup, so it is less broadly relevant.

And some more boring details:

  • Experiments were run on 10-11 May 2023.
  • Network: 500Mbit in Estonia (at these scales, network latency is a very small part of the total wait anyway).
  • For OpenAI, I used my paid account.
  • For Azure, I used US-East endpoints.

For convenience, here are the best linear fits and raw datapoints for each model: