Testing the Un-testable Machine Learning Model
If one of my former students found out that I don't bother to write test cases for the Random Forest model I just built, they'd likely complain to the chair of the computer science department that I wasted their time by shoving software testing down their throats for the past eight years. As a grad student I spent my dissertation making testing more feasible and useful, and as a professor I built introductory programming courses around the idea of test-driven development. Most of those techniques, however, are useless when testing software that has no known outcomes for unseen inputs, which is exactly the situation we face when deploying machine learning models in live production.
Having recently made the switch from academia to industry, I'm not trying to test the un-testable. Instead, a much better use of my time is developing and measuring simple metrics that can ensure my model isn't hurting the business while I wait for the results of an A/B test (or something similar). The alternative is grim: either I deploy a model and eventually find out it hurt my bottom line, or I wait a year or two to collect enough data to avoid that situation. In industry, unlike in academia, I don't have that kind of time or the luxury to fail. Therefore, I tackle the problem of unseen data in a practical manner by:
- Using historical data to generate a distance metric that can serve as a sanity check for model predictions
- Measuring how similar my live data are to my training dataset to get an idea if my model will generalize
- Verifying that the model is internally and externally consistent between development and deployment to ensure I’ve deployed what I expected to
The Oracle Problem
Testing a machine learning model is fundamentally different from other testing scenarios for two reasons: 1) we can't feasibly trace through the guts of the learned decision-making process (for example, to explore all possible paths of a neural network), and 2) the oracle problem: for a set of novel inputs that our model has not encountered before, it may be impossible to verify whether the predicted results are correct. For example, imagine we are trying to predict how many pairs of shoes a new customer, Samantha, will buy on a website in a particular year. Since Samantha wasn't in our training data set, we cannot possibly know the correct answer for her while we are building the model, because it is an event that hasn't transpired yet.
Although we presumably built our model on historical data with known correct answers, what assurance do we have that a productionalized model is performing correctly? Even setting aside the fact that we don't have known targets for new inputs, the logic of a non-trivial model is too opaque for white-box testing. We can't get a low-level trace of how the model's decision-making works, nor can we typically compare it to an existing model if we are building one for the first time (so we can't do regression testing in this scenario either).
Sanity Checking and Metamorphic Testing
Unfortunately, the gold standard for the evaluation of a predictive model on unseen inputs without a defined target value will always be “watch and wait.” You’ll have to wait for the target values to actually occur (e.g. how many shoes Samantha ended up buying) before you have a definitive answer, but in the meantime you can watch your model.
Let’s imagine for a moment that we’re dealing with a new problem: we are trying to predict what Samantha’s shoe size is (the target) for an online company, depending on a series of fit questions she answers (features). For example, she may tell us her current shoe size is an 8.5, and if she reports that her feet slip out of heels as she walks, we would consider suggesting she wear a size 8. Although she may purchase a pump from our website in the same session she took our fit survey, we won’t know if our prediction was correct until she chooses to return the product or not (a forty-five day window exists for returns and exchanges). How can we have some assurance that the deployed model is not hurting the business, without knowing what a new customer’s true shoe size actually is?
An obvious way to check the model is to rely on known relationships between the inputs and the targets. Metamorphic testing is the idea that making modifications to one or more input parameters should produce expected changes in the target predictions. If you're familiar with mutation testing, the concept is similar, except that here we mutate the inputs rather than the code and expect to see corresponding changes in the output. From a testing perspective, these mutations would occur on the training dataset because we would like to have a known correct target. For example, if our customer reports a higher level of perceived toe compression, the model would be expected to recommend a larger shoe size for her than it would have otherwise.
In this case, feature importance can be useful to identify which feature mutations could affect the prediction, and a distance metric (discussed below) could be applied to see if the amount of change to the prediction is consistent with the amount of change to the input(s).
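As a minimal sketch, a metamorphic check can be written as a generic helper that takes any scoring function. Note that `toy_predict` below is a hypothetical stand-in for a trained model, and the toe-compression scale is an assumption for illustration:

```python
def metamorphic_check(predict, row, feature, delta, expect="increase"):
    """Mutate one input feature and verify the prediction moves as expected.

    `predict` is any scoring function mapping a feature dict to a size.
    """
    base = predict(row)
    mutated = dict(row, **{feature: row[feature] + delta})
    new = predict(mutated)
    if expect == "increase":
        return new >= base
    return new <= base


# Hypothetical stand-in for a trained model: recommended size scales with the
# reported current size and perceived toe compression (assumed 0-10 scale).
def toy_predict(row):
    return row["current_size"] + 0.1 * row["toe_compression"]


row = {"current_size": 8.5, "toe_compression": 3}
# Higher reported toe compression should never yield a smaller recommendation.
assert metamorphic_check(toy_predict, row, "toe_compression", delta=2)
```

In practice the same helper would wrap the real model's `predict` call, and the set of mutations to run would come from the feature-importance analysis described above.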
Although metamorphic testing is a viable approach to testing machine learning models, in practice it is not commonly used. First, testing in general is often, sadly, an afterthought in many real-world applications; conventional thinking leads us to ask why invest time in an activity that, if all goes well, does not add any measurable value or reveal any novel artifacts? Secondly, metamorphic testing, if done manually, is time-consuming and depends on human inspection and judgement about what mutations are meaningful. Although there is a small body of research focused on automating metamorphic testing (e.g. Murphy, 2010), the utility of using such an approach for verification or validation purposes is limited since this type of testing can reveal the presence of an error, but not the absence of it, for a particular input. It still also requires significant investment of human effort to decide on and document the potential mutations and how they are expected to affect the targets in cases of change thresholds and/or non-determinism. Given the real-world limitations of applying even these automated approaches, is there something else we can do that might more directly reveal faults for very little effort?
Distance Metrics for Good Business Decisions
One approach I've found that adds real value when testing machine learning models is the development and use of a distance metric for the predictions. For a given model, there is often business knowledge that is easily quantified and intuitively obvious enough to approximate such a metric: for example, we know that shoe sizes are normally distributed, that they have a known standard deviation, and that recommending a size three units away from what someone is currently wearing is almost certainly something we never want to do.
Our training set with known target values is an excellent database from which to derive a distance metric. For a given input shoe size, we can plot the distribution of actual purchased sizes and count the number of units, on average, between the input size and a purchased size that the customer kept. From such a distribution, we might decide to reject any model-predicted size that differs from the input by more than 1.0. We could also annotate which purchases were kept versus returned as a measure of how "correct" a prediction was.
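A sketch of deriving such a threshold from history might look like the following, with entirely simulated data standing in for real purchase records (the distributions and the 95th-percentile cutoff are assumptions, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical purchase history: the size reported on the fit survey versus
# the size the customer ultimately bought and kept.
reported = np.round(rng.normal(8.5, 1.0, 2000) * 2) / 2    # half-size grid
kept = reported + rng.choice([-0.5, 0.0, 0.0, 0.5], 2000)  # most keep their size

diffs = np.abs(kept - reported)
# Take a high percentile of historical differences as the rejection threshold.
threshold = np.percentile(diffs, 95)
print(threshold)
```

Any live prediction whose distance from the customer's reported size exceeds this data-derived threshold would then be flagged or rejected.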
A distance metric can be used as a guardrail in a deployed algorithm when derived from historical data; if we are certain very few people keep a shoe that is more than a size different than their current size, perhaps we would reject such a live recommendation by our model in favor of a truncated version that is always within one unit of their input size.
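The guardrail itself can be a few lines of code. The one-unit limit below is an assumed business rule for illustration, not a universal constant:

```python
def clamp_recommendation(predicted, reported, max_distance=1.0):
    """Guardrail: never recommend a size more than `max_distance` units away
    from the size the customer says they currently wear (assumed business rule)."""
    low, high = reported - max_distance, reported + max_distance
    return min(max(predicted, low), high)


print(clamp_recommendation(11.0, 8.5))  # model said 11; guardrail returns 9.5
print(clamp_recommendation(8.0, 8.5))   # within one unit; passes through as 8.0
```

A variant of the same idea is to reject the out-of-range prediction entirely and fall back to a simpler default, rather than truncating it.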
Model Non-Determinism: A Philosophical Approach
In non-academic settings, the urgency of delivering useful results often dictates the allocation of resources, especially developers' and scientists' time. Data scientists are hesitant to test real-world machine learning models to perfection because they are intuitively aware of how dependent a model's accuracy is on the data it was trained on. Even if you are building a deterministic model, any fluctuation in the training data set could yield slightly different predictions. In this sense, even a deterministic model (which is what we will focus on) can be viewed as flexible.
It is better to spend your limited time selecting, filtering, and cleaning your training dataset than to try to perfect your trained model to a degree that the business problem does not warrant.
For example, whether a customer chooses to keep a size 9 versus a 9.5 shoe may depend more on the specific characteristics of different shoe styles (e.g., open-toe versus closed) than on the size itself. Depending on the dataset, it may not be possible to build a shoe size model in isolation, let alone test it. Likewise, just because a customer kept a certain shoe, it does not mean that it was the correct or ideal size: perhaps they were simply too lazy to return it. Even if you are building a deterministic model, your business problem may be fuzzy in that the training data isn't truly a ground truth: customers may simply not notice or care about suboptimal outcomes.
Rather than attempting to validate that the model is always predicting the ideal size (if there even is such a thing), it may be best to make sure it’s not predicting anything truly bad and to focus testing efforts on verifying that the deterministic model is internally and externally consistent. For example, does the model:
1. Produce the same output for the same set of inputs?
2. Produce the same output for data seen in production versus an identical sample in training?
3. Predict roughly the same ratio of different sizes as we saw in training for whatever input feature(s) were most important?
4. Gracefully handle malformed user input with exceptions and/or reject predictions with low confidence in favor of a backup alternative?
The questions above all have known answers in training data and serve not only to verify the model for consistency (i.e., points 1 and 2 above) but also to validate the model by observing if the live test set is similar to that which was used in training (point 3). Unit testing using historical data can allow us to see if our models and the infrastructure within which they are deployed are recognizing and handling exceptions correctly (point 4).
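The distribution comparison in point 3 can be automated as a lightweight drift check. The following is a minimal sketch using a population stability index, where the data, bin count, and the conventional 0.1/0.25 thresholds are all illustrative assumptions:

```python
import numpy as np


def population_stability_index(train, live, bins=10):
    """Rough drift check: compare a feature's live distribution to training.

    By convention, PSI below 0.1 is read as "no meaningful shift" and above
    0.25 as a red flag worth investigating.
    """
    # Bin edges come from the training distribution; live values falling
    # outside them are simply dropped, which is acceptable for a sketch.
    edges = np.histogram_bin_edges(train, bins=bins)
    train_frac = np.histogram(train, bins=edges)[0] / len(train)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # Floor empty bins at a tiny fraction to avoid log(0).
    train_frac = np.clip(train_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - train_frac) * np.log(live_frac / train_frac)))


rng = np.random.default_rng(0)
train_sizes = rng.normal(8.5, 1.0, 5000)   # hypothetical training shoe sizes
live_same = rng.normal(8.5, 1.0, 1000)     # live traffic from the same population
live_shifted = rng.normal(9.5, 1.0, 1000)  # live traffic that has drifted upward

print(population_stability_index(train_sizes, live_same))     # small: no shift
print(population_stability_index(train_sizes, live_shifted))  # large: drift
```

Run periodically against production inputs, a check like this gives an early warning that the model is scoring a population it never saw in training, long before the true target values arrive.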
The tests above are easy to code and automate; they help ensure that the model is deterministic, which is important when you go to evaluate it against either a baseline or a future competitor model. If your dataset splitter or model uses randomization, you should also make sure to set random seeds in the final productionalized version for reproducibility in future model comparison endeavors.
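Checks like these fit naturally into an ordinary unit-test suite. In the sketch below, `predict_size` is a hypothetical deterministic stand-in for the deployed scoring function, and the frozen training snapshot is invented for illustration:

```python
import unittest


# Hypothetical stand-in for the deployed scoring function; in practice this
# would wrap the trained model with its random seeds pinned.
def predict_size(row):
    return round((row["current_size"] + 0.1 * row["toe_compression"]) * 2) / 2


class ModelConsistencyTests(unittest.TestCase):
    def test_same_input_same_output(self):
        # Point 1: the model is deterministic for identical inputs.
        row = {"current_size": 8.5, "toe_compression": 3}
        self.assertEqual(predict_size(row), predict_size(row))

    def test_matches_training_snapshot(self):
        # Point 2: a row frozen at build time, with the prediction recorded then.
        snapshot = ({"current_size": 9.0, "toe_compression": 0}, 9.0)
        row, expected = snapshot
        self.assertEqual(predict_size(row), expected)

    def test_rejects_malformed_input(self):
        # Point 4: missing features should raise rather than silently score.
        with self.assertRaises(KeyError):
            predict_size({})


unittest.main(argv=["model_consistency"], exit=False)
```

Because the suite runs against recorded historical inputs rather than live targets, it sidesteps the oracle problem entirely while still catching deployment mismatches.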
Few people like to spend time testing real-world software even though testing early and often saves money in the long run. For data science projects, verifying machine learning models often has strong limitations due to the oracle problem: no one knows what the correct output should be for an input (or set of inputs). Thus, a lot of the testing in this domain relies on assumptions regarding the data, human effort, and approximate results.
Consequently, even in the lucky instances where a team has both the time and the desire to devote to testing, it may be more effective to collect better training data than to invest time and overhead in verifying models whose value does not depend on absolute (i.e., perfect) precision. Instead, using a distance metric as a guardrail against predictions that are likely to be grossly incorrect is an extremely budget-conscious tool that, in combination with A/B testing, can prevent harm to the business while waiting for live results to come in and validate new machine learning models.