A startup’s AI work often starts with a developer hacking on algorithms on the side.
When a generalist engineer with no data science background works on a prediction problem, they often don’t use a dataset at all, or at best a small one. For example, in the early days of Veriff, a front-end developer tried to build a feature for detecting too-dark document pictures (the prediction target being “document is lit well enough”) by writing some code and testing it live on his webcam.
Anecdotal testing is valuable for exploring an algorithm’s behaviour in real life, but it is not sufficient. First, the engineer could never measure error metrics or the trade-offs between them, because he kept changing the data he evaluated the algorithm on, always testing on new images. Second, he had no way of telling whether his method worked in the wild: fake activity from a person who intimately knows the product cannot capture more than a shred of the variety in real user behaviour.
Overall, this approach made it very difficult to improve any algorithm he developed. Success should be well-defined and easy to track; in the darkness-detection case, measured success depended on whether the weather near the office happened to be sunny or cloudy.
The issue described above is the engineer’s problem: he is the one unable to deliver a solution on time. However, there is also a product-related problem. When you build a piece of the product that is rated on its statistical performance, the evaluation dataset becomes the product specification. As a product manager, I would not want ad-hoc experiments by an engineer to define the desired behaviour of the product.
What would be a better approach? As even a junior data scientist would tell you, instead of anecdotal testing you need statistical testing. This roughly means assembling a) a representative sample of data that comes from real-world customer product usage, b) with labels that accurately describe the intended behaviour.
That is actually straightforward. For the darkness-detection task, the data scientist could collect a random sample of 100 user-captured images into a CSV file and then mark, for every image, whether it is too dark for verification. In probably less than an hour, she can start building an algorithm that does a good enough job on this dataset.
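A minimal sketch of what that statistical testing could look like, assuming the sampled images are already loaded as grayscale pixel values; the mean-luminance baseline, the threshold of 60, and the function names are all illustrative, not Veriff’s actual implementation:

```python
from statistics import mean

def is_too_dark(pixels, threshold=60):
    """Flag an image as too dark when its mean luminance
    (grayscale values, 0-255) falls below the threshold."""
    return mean(pixels) < threshold

def evaluate(samples, threshold=60):
    """Score predictions against annotator labels on a fixed sample.
    `samples` is a list of (pixels, label) pairs, where label is True
    when the annotator marked the image as too dark."""
    tp = fp = fn = tn = 0
    for pixels, label in samples:
        pred = is_too_dark(pixels, threshold)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(samples),
    }

# Toy stand-ins for labelled images from the CSV sample
dark = ([20] * 100, True)      # underexposed image, labelled too dark
bright = ([180] * 100, False)  # well-lit image
print(evaluate([dark, bright, bright]))
```

Because the evaluation set is now fixed, changing the threshold produces comparable precision/recall numbers instead of anecdotes from a webcam.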
After the algorithm has been observed in production for a while, you start to discover data you had not considered before.
If there is a well-lit document in the photo but it is only partially in the frame, can you say the document is well-enough lit? Probably.
If there is a well-lit document-like card in the photo but it is not an actual document, can you say the document is well-enough lit? Not at all obvious.
You get into metaphysical discussions: can you even say anything about the document being well-lit if there is no document visible anywhere?
You may think the above examples are unrealistic but I’ve seen users present an egg instead of their document for completing a verification.
It turns out there is even more to a good dataset than the two criteria above. It must be large enough to cover edge cases you care about. It must be fast to iterate on so that previously missed data points can be easily added. The freshest version of the dataset must be easily accessible: changes should quickly reach data scientists. And inevitably you will discover data points where the intended output is not obvious, so you’ll need a process for discussing these cases and then documenting judgement calls made.
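One way to keep those judgement calls traceable is to version the labelling guideline and tag every label with the version it was made under; anything labelled under an older version can then be surfaced for re-review. A minimal sketch — the class names and fields are hypothetical, not a description of any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class LabelledExample:
    image_id: str
    label: str             # e.g. "too_dark", "ok", "unclear"
    guideline_version: int

@dataclass
class Dataset:
    guideline_version: int = 1
    examples: dict = field(default_factory=dict)
    decisions: list = field(default_factory=list)

    def record_judgement(self, case, ruling):
        """Document a judgement call and bump the guideline version,
        so labels made under the old guideline can be found later."""
        self.guideline_version += 1
        self.decisions.append((self.guideline_version, case, ruling))

    def add(self, image_id, label):
        """Label an image under the current guideline version."""
        self.examples[image_id] = LabelledExample(
            image_id, label, self.guideline_version)

    def stale(self):
        """Examples labelled under an older guideline version."""
        return [e for e in self.examples.values()
                if e.guideline_version < self.guideline_version]

ds = Dataset()
ds.add("img_001", "too_dark")
ds.record_judgement(
    "document only partially in frame but well lit",
    "counts as lit well enough",
)
ds.add("img_002", "ok")
print([e.image_id for e in ds.stale()])  # img_001 needs re-review
```

The point is not this particular data structure but the loop it enables: every new edge case produces a documented ruling, and the dataset can tell you which labels predate it.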
Think of another example, a visual object detector for self-driving. The goal is to detect all nearby cars, bicycles and pedestrians; this will be input to the vehicle control system. It’s straightforward to assemble a dataset for the first version of the model: just take a few thousand examples and draw boxes around all relevant objects.
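A dataset entry for such a detector can be as simple as an image reference plus a list of class-tagged boxes. The field names and the `[x_min, y_min, x_max, y_max]` convention below are illustrative assumptions, not any specific labelling tool’s schema:

```python
# Hypothetical annotation record for one image in a detection dataset.
annotation = {
    "image": "frame_000123.jpg",  # illustrative filename
    "boxes": [
        {"class": "car",        "xyxy": [312, 140, 505, 290]},
        {"class": "bicycle",    "xyxy": [ 20, 150,  95, 270]},
        {"class": "pedestrian", "xyxy": [ 88, 120, 130, 260]},
    ],
}
print(len(annotation["boxes"]))  # number of labelled objects: 3
```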
However, after the new model is live, curious questions start rolling in.
Is a car reflected in a storefront window really a car?
Is a bicycle on a bus a bicycle or a bus?
Is a sitting person a pedestrian?
Dealing with all of these questions involves manually reviewing lots of images, making judgement calls, re-labelling data, updating labelling guides, and serving the updated datasets for building new versions of the model.
The default is to expect the data science team to handle all of it. That’s a bad idea.
Data scientists in this context are hired for their ability to do technical work with algorithms, which mostly means a strong math and programming background. However, these skills don’t help in assembling a good dataset, which requires:
- Choosing a valuable sample for labelling.
- Manually labelling thousands of images.
- Coordinating with Product, Data Science and other functions to make judgement calls in uncertain cases. At worst, everyone disagrees on the desired outcome and part of the job is mediating a heated discussion.
- Documenting labelling instructions and changes in them.
- Building labelling workflows in a combination of in-house and outsourced labelling teams.
- Choosing, integrating, and hacking on labelling tools.
- Designing data pipelines for moving between data stores, labelling tools, and model evaluation & training tools.
In short, having good datasets requires setting up loops of iteration on datasets.
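The pipeline work in that list often reduces to a single recurring step: merging a labelling-tool export into the canonical dataset so data scientists always see the freshest labels. A sketch under assumed record fields (`id`, `label`, `labelled_at`) and a last-write-wins merge rule, neither of which comes from any specific tool:

```python
def sync_labels(canonical, export):
    """Merge a labelling-tool export into the canonical dataset,
    keeping the most recent label for each item.
    `canonical` maps item id -> record; `export` is a list of records."""
    merged = dict(canonical)
    for rec in export:
        current = merged.get(rec["id"])
        if current is None or rec["labelled_at"] > current["labelled_at"]:
            merged[rec["id"]] = rec
    return merged

canonical = {"x": {"id": "x", "label": "ok", "labelled_at": 1}}
export = [
    {"id": "x", "label": "too_dark", "labelled_at": 2},  # re-labelled
    {"id": "y", "label": "ok", "labelled_at": 1},        # new item
]
print(sync_labels(canonical, export))
```

Run on a schedule, a step like this closes the loop: re-labelled edge cases reach model evaluation without anyone emailing CSV files around.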
A data scientist is neither hired nor motivated to do the above well. Yet a good dataset can be a 10x multiplier on her productivity. Assigning all dataset-assembly work to her wastes money, because she is highly paid, and gives subpar results, because she’s unlikely to do it well.
At Veriff we’ve brought all the above activities under a single umbrella, a function called DataOps (alternative names in other companies include ML Ops, Data Annotation, Engineering Operations, or something domain-specific – perhaps because “DataOps” has a different meaning in data analytics). Their mission, simply stated, is to provide good datasets, in all the meanings of “good” I’ve discussed above. They are the product specialists of AI, but instead of defining priorities on the feature level, they define the goal in the most detailed yet straightforward way possible: by labelling single data points.
The ideal of the DataOps team is for every prediction task to have a corresponding dataset that comprehensively and accurately reflects what the product intends to achieve, and for this dataset to always be up to date and easily accessible for data scientists. Of course, the more independently of other teams this can be done, the better. This is no easy feat: there is no beaten track to follow, no best practices to implement. DataOps as a discipline is in its very early stages and barely acknowledged as distinct from Data Science.
Companies seem to talk publicly more about models and results than their boring DataOps processes. This is exactly why I find it exciting: it’s both underappreciated and critical. To the standard partnership of Product and Engineering, you need to add not just Data Science but also DataOps to build a successful AI company.
Thanks to Joonatan Samuel, Henri Rästas and Kaur Korjus for feedback on this post, and Maria Jürisson for continuing to drive the DataOps vision at Veriff.