Datasets carve the terrain of AI

Lately, Twitter has been full of the 2020 US election, which has displaced everything interesting. That means it’s a good time to blog.

I’ve recently been reading James C. Scott’s Seeing Like a State, which manages to combine forestry, agriculture, and land/city planning into a history of top-down planning. One thing that stuck with me – from centralising units (like the meter) and legal systems in France – was that applying a map or measurement system on top of the world is not just a simplification; it actually carves itself into the territory and changes how people behave. It can be disruptive for existing powers and it can upset local tradition.

There is a clear connection to managing teams and products.

Your development team will improve things that are measured and tracked. A typical example is conversion rate: if thousands of people go through your signup funnel every day, it will get a lot of attention. The data points needed to measure how well the funnel works are generated naturally as users move through it, so you only need to put in the effort of instrumenting and cleaning them.

In contrast, when you’re building an AI decision-maker you need to consciously invest in collecting data. Remember: your team will improve things that are measured and tracked, and it’s impossible to evaluate an AI without at least a rudimentary dataset of, say, 100 examples. So, you need a dataset.
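To make "measured and tracked" concrete, here is a minimal sketch of what evaluating against a small labelled dataset could look like. The `toy_model` function and the four examples are made up for illustration; in practice you would plug in your own model and your own 100 or so examples.

```python
from typing import Callable, Sequence, Tuple


def accuracy(model: Callable[[str], str], dataset: Sequence[Tuple[str, str]]) -> float:
    """Fraction of examples where the model's output equals the desired output."""
    if not dataset:
        raise ValueError("dataset must contain at least one labelled example")
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)


# Hypothetical toy model: flags a message as spam if it mentions "prize".
def toy_model(text: str) -> str:
    return "spam" if "prize" in text.lower() else "ham"


# Hypothetical labelled dataset: (input, desired output) pairs.
toy_dataset = [
    ("You won a PRIZE, click here", "spam"),
    ("Lunch at noon?", "ham"),
    ("Claim your reward today", "spam"),
    ("Meeting notes attached", "ham"),
]

score = accuracy(toy_model, toy_dataset)
print(f"accuracy: {score:.2f}")
```

Once a number like this exists, it can be tracked over time in the same way a conversion rate is.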

Unlabelled data is usually easy to gather because it arises from product usage: users upload pictures of ID documents for verification, capture videos to share with friends in a group chat, send and receive emails, etc.

In contrast, labelled data, meaning pairs of an input and its desired output, is surprisingly hard to create. A typical misunderstanding is that building a dataset is extrinsic to the process of creating an AI product; that it’s a mundane task to be outsourced so the team can properly focus on building models. That couldn’t be more wrong. For standard ML tasks such as image classification, a state-of-the-art model doesn’t give a meaningful improvement over a model from 3-4 years ago. In contrast, the dataset from which the task is learned makes a major difference.

When I talk about “building a dataset” I mean something more than “just drawing boxes”, the dismissive shorthand for labelling images as a menial and mindless task. Building a dataset consists of two critical parts.

Sampling: Choosing which examples to add to the dataset.
Labelling: Judging the correct output, given an input.

One reason sampling is important is that it defines the weight you give to each example. The default, uniform random sampling, might seem like a good conservative choice, but it only works if all relevant examples occur relatively often.

Of course, that’s never the case: the tail is always long and fat. It might rarely snow in Paris, but we don’t want self-driving cars to crash or turn off when it does. If you take a uniform random sample of all the drives a French car makes, you’ll barely see any examples of snow, so in snow your algorithms won’t just be wrong; worse, they’ll be unpredictable.
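A small sketch shows the effect. It assumes a made-up drive log in which 10 of 10,000 drives are snowy, and it compares a uniform sample with a simple per-group (stratified) sample. Stratified sampling is just one possible remedy, and the function names here are illustrative.

```python
import random
from collections import Counter, defaultdict


def uniform_sample(examples, k, rng):
    """Draw k examples, each with equal probability."""
    return rng.sample(examples, k)


def stratified_sample(examples, key, per_group, rng):
    """Draw up to per_group examples from every group defined by key."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


rng = random.Random(0)

# Hypothetical drive log: 10 snowy drives and 9,990 clear-weather drives.
drives = [{"id": i, "weather": "snow" if i < 10 else "clear"} for i in range(10_000)]

uniform = uniform_sample(drives, 100, rng)
stratified = stratified_sample(drives, lambda d: d["weather"], 50, rng)

uniform_counts = Counter(d["weather"] for d in uniform)
stratified_counts = Counter(d["weather"] for d in stratified)

print("uniform:   ", dict(uniform_counts))     # snow is expected in about 0.1 of 100 draws
print("stratified:", dict(stratified_counts))  # every available snowy drive is included
```

The stratified sample deliberately over-weights snow, so any metric computed on it no longer reflects how often snow actually occurs. That is exactly the weighting decision that sampling forces you to make on purpose.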

The sampling step is also important because it can make a big difference in labelling efficiency. Labelling is a menial job if you have to make the same obvious decision 10,000 times per day. However, labelling a well-considered set of 100 edge cases, where the desired answer is not at all obvious, is much more engaging.

To borrow the stop-sign example from Andrej Karpathy’s talk about scaling autonomous driving at Tesla: finding a good ontology and judging the correct decision for each sample is definitely not trivial, and it will have major consequences for the safety and applicability of the system. The difficulty here is identifying, without extensive human input, which examples would be interesting to label for the model.
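One common heuristic for surfacing candidates without much human input is uncertainty sampling: send the examples the current model is least sure about to the labellers first. The sketch below assumes the model outputs a probability per image; the file names and scores are invented, and this is not the approach described in the talk.

```python
from typing import List, Tuple


def most_uncertain(scored: List[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k examples whose predicted probability is closest to 0.5."""
    return sorted(scored, key=lambda item: abs(item[1] - 0.5))[:k]


# Hypothetical model scores: probability that the image shows a stop sign
# the car should obey.
scored_images = [
    ("clear_stop_sign.jpg", 0.99),
    ("empty_road.jpg", 0.01),
    ("stop_sign_on_bus.jpg", 0.55),
    ("occluded_sign.jpg", 0.42),
    ("billboard_with_sign.jpg", 0.70),
]

to_label = most_uncertain(scored_images, 2)
for name, prob in to_label:
    print(name, prob)
```

A heuristic like this only narrows the search. A model can be confidently wrong, so the confident cases still need spot checks, and deciding what the right label is for each surfaced example remains a human judgement.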

So don’t fall into the trap of having an intern, an outsourced company, an uninformed client, or any such external party create your datasets. The goal that a dataset defines will be carved into the terrain of the trained model, the application, and the product. Building datasets is a core activity in the AI development process. Learn to love it — it’s not going away.