What makes machine learning expensive?

Taivo Pungas

Aug 28, 2018 • 8 min read

It’s often assumed you need a number of PhDs and double the number of developers to create useful machine learning (ML) solutions. This is only sometimes true. If you’re creating leading-edge products — meaning you’re developing brand new machine learning methods — then you will certainly need a team of highly skilled and quite expensive talent. However, most companies can take existing technology and apply it to their own problems, and this can be done without the army of PhDs.

How, then, can you build ML solutions on a smaller-than-Google budget? Machine learning experts of top companies get paid handsomely, and the salaries of the whole tech industry don’t lag far behind. The steep price of hiring ML talent makes it crucial to have them work on problems with maximal return on investment. This requires understanding what makes a machine learning task difficult — and thus expensive.

1. Error Rate

Machine learning always comes with some level of error. The question is what level of accuracy your use case demands. If your system cannot tolerate a single error, then machine learning may not suit your needs.

The statistical approach taken in ML can perform very well, but still fails in some percentage of cases. How much error is acceptable for your solution?

If your solution requires high accuracy (that is, almost no errors) then it may necessitate substantial development work — meaning a larger team, more technical complexity and a longer development time. All of which adds up to increased costs.

Conversely, if you allow a greater margin for error, meaning that the resulting application doesn’t need such a high level of sophistication, then a smaller and less specialized team can produce the solution with less work. All of which lowers your development costs.

So let’s look at a coffee-machine user-authentication solution as an example. Your CEO asks you to make the coffee machine automatically dispense coffee for free to all employees and at the regular price to everyone else.

The worst error the system can make here is giving free coffee to someone who should actually pay. The monetary loss of such an error could be a couple of dollars. On the flip-side, the seriousness of an error that prevents an employee from getting coffee is not that great — the person can just try again or ask a co-worker to get their coffee.

A delicious cup of morning coffee… with a splash of face recognition.

In this example, an accuracy of perhaps 90% will suffice. An error rate of one in ten is manageable, and solving the task at 90% accuracy would take on the order of weeks rather than months. This keeps the cost low.

Compare the coffee machine example with, say, a face recognition feature on a smartphone. If face recognition unlocks everything on the phone, the stakes are much higher.

If a device falls into the hands of someone with malicious intent who then gains access to it, the cost to the owner is high: that access could include credit card details, sensitive work documents, email and social media accounts, private conversations and other personal information. For the face recognition function to be credible, we want accuracy that approaches perfect: we can accept no more than, say, 1 successful ‘attack’ per 100,000 attempts.

Such accuracy requires an extremely good solution. That could take a team of software, hardware, and machine learning engineers two years to produce. This makes it a very expensive development compared to the coffee machine example.
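To make the comparison concrete, here is a rough back-of-the-envelope sketch in Python. The error rates (one in ten, one in 100,000) and the couple of dollars per wrongly dispensed coffee come from the examples above; the number of attempts and the cost of a phone breach are made-up assumptions, so treat the results as an illustration rather than an estimate.

```python
def expected_loss(attempts: int, error_rate: float, cost_per_error: float) -> float:
    """Expected loss = attempts x probability of an error x cost of one error."""
    return attempts * error_rate * cost_per_error


# Hypothetical number of attempts by people who should be refused.
ATTEMPTS = 1_000

# Coffee machine: 90% accuracy, and an error costs a couple of dollars.
coffee_loss = expected_loss(ATTEMPTS, error_rate=0.10, cost_per_error=2.0)

# Phone unlock: 1 successful attack per 100,000 attempts.
# The $5,000 cost per breach is an assumption for illustration only.
phone_loss = expected_loss(ATTEMPTS, error_rate=1 / 100_000, cost_per_error=5_000.0)

# The same phone if it were only as accurate as the coffee machine.
sloppy_phone_loss = expected_loss(ATTEMPTS, error_rate=0.10, cost_per_error=5_000.0)

print(f"coffee machine at 90% accuracy: ${coffee_loss:,.2f}")
print(f"phone at 1 in 100,000:          ${phone_loss:,.2f}")
print(f"phone at 90% accuracy:          ${sloppy_phone_loss:,.2f}")
```

The point is not the exact figures but the shape of the calculation: the more a single error costs, the lower the error rate has to be before the expected loss becomes tolerable.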

All of these people, jumping at the chance of stealthily seeing the Fruit Ninja high scores on your phone.

2. Response Time

How quickly do you want your machine learning solution to respond to a request or an input?

A requirement for a quick response time (say, one second) calls for quite a different solution than a more relaxed requirement of, say, ten seconds. For an even longer response time (an hour, perhaps) the solution can be fairly basic.

The cost soars if the computation has to take place within the app or device. If you require a 1-second response time, the primary computational tasks have to be carried out on the device itself. This requires very sophisticated software plus good integration with the hardware, and in addition you are constrained in your choice of programming language. All this calls for a substantial development team: in the case of smartphone face recognition, tens of people working for 1–2 years.

If a 10-second response time is acceptable, the development challenge shrinks fundamentally. Instead of running on the device itself, computations can be offloaded to a remote server where much more computing power, and any platform of your choice, is available. With 3G/4G technology allowing a round trip to the server in just a few seconds, you can still fit within the 10-second limit. Not having to handle the bulk of the computations on-device makes the solution less technically sophisticated, and so easier, quicker and substantially cheaper to develop (in our experience, perhaps twenty times cheaper).

Once we leave behind the need for response times in the seconds or minutes and can accept response times of an hour or more the development challenge changes yet again. This time, new options present themselves — options which include the possibility of putting a human in the loop for more complex cases.

In these cases you develop face-recognition software that authenticates the obviously genuine and rejects the obviously non-genuine. The grey areas, where only sophisticated, clever, and expensive software could make the required distinctions accurately, are simply handed off to a human to decide. Chatbots do this: they often make very few automated decisions before directing the customer to the appropriate human.
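The hand-off logic can be sketched in a few lines of Python. This is a minimal illustration, assuming the recognition model returns a match score between 0 and 1; the `route` function and both threshold values are hypothetical and would need tuning against real data.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"


def route(match_score: float,
          accept_above: float = 0.95,
          reject_below: float = 0.20) -> Decision:
    """Decide automatically only when the model is confident.

    match_score is assumed to be the model's confidence (0 to 1) that
    the face belongs to the claimed person. The thresholds are
    illustrative assumptions, not recommended values.
    """
    if not 0.0 <= match_score <= 1.0:
        raise ValueError("match_score must be between 0 and 1")
    if match_score >= accept_above:
        return Decision.ACCEPT        # obviously genuine
    if match_score <= reject_below:
        return Decision.REJECT        # obviously non-genuine
    return Decision.HUMAN_REVIEW      # grey area: hand off to a person


for score in (0.99, 0.60, 0.05):
    print(score, route(score).value)
```

Moving the two thresholds closer together automates more cases at the price of more automated mistakes; moving them apart sends more work to humans.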

3. Cost of human input

It’s unlikely that automating a task can be done in a single leap of technological advancement. For most problems, it is much easier to make small steps. For example, when building a customer service chatbot, solving the simplest 50% of cases might be trivial: simply sending the user to the right Help page might work. The other 50% can be left to humans while data is collected and the bot developed further.

This gradual approach to automation is a very common and useful pattern. A downside is that handing the most difficult cases to humans can cost a lot, since human labor is much more expensive than computing. However, this also defines a very clear metric for improvement: increase the percentage of cases the system handles autonomously while keeping the quality up.
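As a rough sketch, the metric and its effect on cost can be written down as follows. The per-case costs below are invented for illustration; only the 50% starting point comes from the chatbot example above.

```python
def automation_rate(handled_by_bot: int, total_cases: int) -> float:
    """Share of cases resolved without a human."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return handled_by_bot / total_cases


def average_cost_per_case(rate: float, bot_cost: float, human_cost: float) -> float:
    """Blend of machine and human cost, weighted by the automation rate."""
    return rate * bot_cost + (1.0 - rate) * human_cost


# Hypothetical costs: $0.01 of computing vs. $4.00 of agent time per case.
BOT_COST, HUMAN_COST = 0.01, 4.00

start = automation_rate(handled_by_bot=500, total_cases=1_000)   # the easy 50%
later = automation_rate(handled_by_bot=800, total_cases=1_000)   # after more development

print(f"{start:.0%} automated: ${average_cost_per_case(start, BOT_COST, HUMAN_COST):.3f} per case")
print(f"{later:.0%} automated: ${average_cost_per_case(later, BOT_COST, HUMAN_COST):.3f} per case")
```

This only tracks cost; in practice you would watch a quality measure alongside it, since automating more cases is only a win if the answers stay good.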

4. Cost of gathering/labeling data

We need humans to gather or label data for us. Self-driving car companies might pay workers to annotate each image by drawing boxes around cars, humans, bicyclists, traffic lights and other objects. To build a speech recognition system, you need to have a set of speech clips, each annotated with a transcript. In the case of fraud detection, every transaction that a human has reviewed yields a label — it is implicit in their decision to either allow or deny the transaction. It is often possible to extract labels from pre-existing processes, like the aforementioned human decisions about fraud, captions on images on Flickr, existing speech transcriptions from the EU parliament, or some other clever source. This way we can get large labeled datasets with the drawback of having some errors in the labels.

Labeled data is the ground truth for machine learning. Producing high-quality datasets may be costly and time-consuming.

The type of model being trained and the performance required usually determine how much labeled data we need. This is rarely known beforehand: a data scientist starts with some amount of data and, based on the results, may decide that more data is needed. As we mentioned in the previous post, the best deep neural networks are very data-hungry and may require millions of labeled examples. At the other extreme are simple classical models, which usually require at least 1000 examples for reasonable performance, though this can vary a lot with the complexity of the task.

In addition to training data, you also need test data to measure how well your system is doing. Andrew Ng has a handy rule for sizing it: you should have enough test data that you can see differences in your quality metric with the desired granularity. For example, if you have 200 test examples, the smallest difference in accuracy you can see is 1 test case, which is 1 / 200 = 0.5%, i.e. half a percentage point. If you care about differences of 0.1%, you need at least 1000 test cases.
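The arithmetic behind this rule is simple enough to put into a small helper. This is only a sketch of the calculation described above; the function names are our own, and the granularity is passed as a decimal string so that the division stays exact.

```python
import math
from fractions import Fraction


def smallest_visible_difference(n_test: int) -> float:
    """One test case out of n_test is the finest accuracy step you can observe."""
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return 1 / n_test


def min_test_examples(granularity: str) -> int:
    """Fewest test examples needed to resolve the given accuracy difference.

    granularity is a decimal string such as "0.001" (meaning 0.1%);
    using Fraction avoids floating-point rounding in the division.
    """
    step = Fraction(granularity)
    if step <= 0:
        raise ValueError("granularity must be positive")
    return math.ceil(1 / step)


print(f"{smallest_visible_difference(200):.1%}")  # step size with 200 test examples
print(min_test_examples("0.001"))                 # examples needed for 0.1% steps
```

Note that this is a lower bound on test-set size: it tells you the finest step you can see at all, not how many examples you need for that difference to be statistically reliable.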

5. Cost of interpretability

Sometimes a client or the law demands that each decision be interpretable. There is an inherent trade-off between interpretability and accuracy on the task: simply solving a task is easier than solving it while explaining to a human how each decision is made. The cost, then, shows up as a drop in model performance, which translates directly into dollars because of the error rate requirements.

One thing that distinguishes machine learning from the much older field of statistics is that ML is an engineer’s approach: most ML systems target maximum accuracy on the task rather than a perfect understanding of how the model works. Thus it is acceptable and common in ML to use black-box models that work very well but whose inner workings are difficult or impossible to understand. Statistics, by contrast, tries to make sure every nut and bolt has a known, specific function.

But despair not: not all machine learning models are black boxes. For example, the inner workings of decision trees and, to a lesser extent, random forests are easy to interpret, as are most linear models. Since interpretability matters more in business than in scientific benchmark problems, it has been somewhat neglected in research, but there are already some neat tools for looking into black boxes.

The most famous approach is LIME, which answers the question “how would the output change with a slight modification of inputs?” It thus gives a local interpretation, as opposed to the much more difficult problem of global interpretation, which tries to explain the decision process for all possible inputs. Even a human cannot usually provide global interpretation: could you perfectly describe how you go from a set of pixel values to understanding that an image contains a king?

The reasoning is difficult. Do some of these pixels seem especially kingly?
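The idea of a local interpretation can be illustrated with a much simpler stand-in than LIME itself. The sketch below is not LIME's algorithm or API (LIME fits a simple, interpretable model to many perturbed samples around one input); it just nudges each input feature in turn and records how much the output moves. The `toy_model` function is a made-up black box used purely for demonstration.

```python
from typing import Callable, List, Sequence


def local_sensitivity(model: Callable[[Sequence[float]], float],
                      x: Sequence[float],
                      delta: float = 0.01) -> List[float]:
    """Change in model output per unit change of each feature, near x.

    A crude finite-difference probe: it explains the model only in the
    neighbourhood of this one input, not for all possible inputs.
    """
    base = model(x)
    effects = []
    for i in range(len(x)):
        nudged = list(x)
        nudged[i] += delta                       # slightly modify one input
        effects.append((model(nudged) - base) / delta)
    return effects


def toy_model(features: Sequence[float]) -> float:
    """Hypothetical black box: we pretend we cannot read this formula."""
    a, b = features
    return 3.0 * a - 2.0 * b * b


effects = local_sensitivity(toy_model, [1.0, 2.0])
for name, effect in zip(("a", "b"), effects):
    print(f"feature {name}: {effect:+.2f} output change per unit")
```

Run at a different input, the effect of `b` would come out differently, which is exactly what makes the interpretation local rather than global.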

As you can see, small changes in requirements can make costs rise or fall quite dramatically. The five most telling variables are:

how accurate do you need the outputs to be?
how quickly do you need to produce those outputs?
how much human work enters the loop?
how large should the datasets be?
how important is it to interpret the system’s decisions?

The key to lower costs in a machine learning application is to be critical of the requirements. Do you really need correct decisions 100% of the time? Are you sure the answer absolutely must be given within one second? Who will need to interpret the decision, and why?

Of course, depending on the application, there may simply be trade-offs you cannot make. But if you can, the cost and time savings you reap will be considerable.

This post originally appeared on Medium.
