Hacky multimodality

GPT-4 supports images as an optional input, according to OpenAI's press release. As far as I can tell, only one company has access so far. Which makes you wonder: how can you get multimodal support today?

There are basically two ways to add image support to an LLM:

  1. Train a vision encoder that makes the image digestible for an LLM. This is what GPT-4 and the recently released LLaVA do.
  2. Hack support by converting images into text and manipulating images using text (a minimal sketch follows this list). This can range from just OCR-ing the image and chucking the output into GPT, to multi-step workflows like Grounded-Segment-Anything or the paper (which I can't find now) that allowed an Agent to use any HuggingFace model as a Tool.
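
To make the plain-OCR flavour of (2) concrete, here is a minimal sketch. It assumes pytesseract and the pre-1.0 openai Python package are installed and OPENAI_API_KEY is set; the prompt wording and file name are made up for illustration.

```python
# Approach (2), simplest form: OCR the image, then hand the extracted
# text to a text-only chat model.
import openai
import pytesseract
from PIL import Image


def ask_about_image(image_path: str, question: str) -> str:
    # Step 1: turn the image into text (plain OCR).
    extracted_text = pytesseract.image_to_string(Image.open(image_path))

    # Step 2: chuck the OCR output into a text-only LLM along with the question.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You answer questions about an image, given only "
                        "the text that OCR extracted from it."},
            {"role": "user",
             "content": f"OCR output:\n{extracted_text}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]


print(ask_about_image("receipt.png", "What is the total on this receipt?"))
```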

Option (1) is the native and powerful one, whereas (2) is a limited hack. But a major benefit of the second approach is that you don't need access to the weights of the LLM (you can do it with API-only models like ChatGPT or Anthropic's Claude), nor do you need to train anything yourself. Which of course means we will see a lot of (2) in open-source and academic projects -- I expect much more juice to be pressed out of this category of fruit.
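
As a toy illustration of that point, the sketch below needs nothing beyond API access to a text-only model plus off-the-shelf open-source vision models used as text-producing tools. It is a much-simplified cousin of the agent-with-tools workflows mentioned above; the tool names, prompts, and file name are invented for the example, and it again assumes the pre-1.0 openai package, pytesseract, and transformers.

```python
# Approach (2), tool-style: a text-only LLM never sees pixels; it only
# picks a named image-to-text tool, then answers from that tool's output.
import openai
import pytesseract
from PIL import Image
from transformers import pipeline

# Any HuggingFace image-to-text model works here; BLIP is one common choice.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

TOOLS = {
    "ocr": lambda path: pytesseract.image_to_string(Image.open(path)),
    "caption": lambda path: captioner(path)[0]["generated_text"],
}


def chat(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]


def answer(image_path: str, question: str) -> str:
    # Step 1: the LLM chooses a tool by name, based on the question alone.
    choice = chat(
        f"Question about an image: {question}\n"
        "Reply with exactly one word: 'ocr' if reading text in the image "
        "helps most, 'caption' if a general description helps most."
    ).strip().lower()
    tool = TOOLS.get(choice, TOOLS["caption"])

    # Step 2: the chosen tool converts the image into text.
    observation = tool(image_path)

    # Step 3: answer the question from the tool's text output alone.
    return chat(f"Tool output: {observation}\n\nAnswer this question: {question}")


print(answer("menu.jpg", "Which dishes on this menu are vegetarian?"))
```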