Mixture-model likelihoods for outlier detection
Data science at Secondmind sits at the intersection of real-world data and problems on one side, and the cutting-edge research and development work done by our researchers and engineers on the other.
But connecting the real and R&D worlds is not without its challenges.
In this blog post we'll look at how we can adapt some of our modelling techniques to make them robust to the presence of "outliers".
What is an outlier?
When measuring any physical process, you're likely to encounter outliers.
Broadly speaking, an outlier is any data that is not generated by the process we're interested in modelling. Outliers may be generated because the apparatus used for measurement suffers from some transient malfunction, for example, or the data transmission/transcription process becomes corrupted through human error. As such, to define any outlier (what we're not interested in) we must first define what we are interested in, and we do this using a model.
Let's consider the data below:
This could represent a set of measurements a customer has sent us from an electric motor, where the power output of the motor is measured as the applied current is varied. There are a number of measurements that seem out of place: a power spike around measurement 200, three sharp dips in the current between measurements 400 and 600, and a current spike around measurement 900.
Typically, we want to construct highly accurate predictive models of the motor's response to inputs such as the current. These models may be used to simulate an electric motor (or combustion engine) for testing elements of our Active Learning platform. Alternatively, the models could be deployed in an active learning loop within the platform itself. As such, it's crucial to account for the transient current spikes and detector noise in the data we receive, because these features (aka "outliers") do not relate to the true behaviour of the motor/engine in question. We consider these issues distinct from engine "noise" itself. For example, notice how the power output appears to vary more when the current is low. This sort of variance in our data is likely due to true stochasticity in the complex physical and chemical processes taking place within the motor/engine and, therefore, this phenomenon is something we do want to model correctly.
We want to build a model that can predict the power output as a function of current, so let's visualise this data slightly differently in the figure below, putting the current on the x-axis and power on the y-axis.
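To make the discussion concrete, here is a small numpy sketch that generates synthetic data with the same qualitative features: a smooth central response, noise that is larger at low current, and a handful of contaminating spikes. The functional forms (a sine mean, an exponentially decaying noise level) are invented for illustration and are not the actual motor data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Inputs in [0, 1]: "current" on the x-axis.
x = rng.uniform(0.0, 1.0, size=n)

# Hypothetical smooth motor response plus heteroskedastic noise:
# the noise level is larger at x = 0 than at x = 1.
mean = np.sin(2 * np.pi * x)
noise_std = 0.5 * np.exp(-2.0 * x)
y = mean + noise_std * rng.normal(size=n)

# Contaminate 0.5% of the points with draws from a broad distribution,
# mimicking the current/power spikes described above.
n_out = int(0.005 * n)
idx = rng.choice(n, size=n_out, replace=False)
y[idx] = rng.normal(0.0, 10.0, size=n_out)
```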
We can see that most of the data is generated from some noisy central process. As we saw above, the noise also appears to be heteroskedastic, in that it has some x dependence (the variance of the data seems greater at x=0 than at x=1). Thus, a heteroskedastic chained Gaussian process (HetGP; e.g. Saul et al. 2016) may be suitable for modelling the motor response. HetGPs are a novel yet extremely powerful extension of the Gaussian process framework that allows heteroskedastic data to be modelled in a scalable way, letting us accurately capture any heteroskedastic behaviour we see in our motor data.
The 'spikes' we identified in the first figure are also clearly visible and are indicated by the arrows. As previously discussed, we don't want our models to pay much attention to these points, as we don't think they're representative of the true engine behaviour that generates the majority of the data.
So, what should we do with these points?
We could, of course, develop some heuristics to remove them, and apply these heuristics before the data is even passed into our models for training. However, such heuristics can be time-consuming to define well and quite inflexible, in that they are often not easily adaptable from one modelling problem/dataset to another. Instead, we'll adopt another approach. Let's leave the data as it is for now and instead consider our heteroskedastic Gaussian process modelling assumption.
We assume that our data is drawn from a Gaussian distribution, i.e.

Y ∼ N(f, exp(g)²)    (1)
where f and g are latent Gaussian processes (we have dropped their x dependence, i.e. f(x), for clarity) and the exp transform ensures positivity of the variance. It is this assumption that allows us to write down the "likelihood" of the model, P(Y|x), which is essentially the quantity we are trying to maximise during model training. Let's see how such a model performs on this data.
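In code, the per-observation log-likelihood under this assumption is just a Gaussian log-density whose variance is parametrised through g. The minimal numpy sketch below is for illustration only; the actual models are built on a full Gaussian process framework.

```python
import numpy as np

def log_lik_good(y, f, g):
    """Log-density of N(y | f, exp(g)^2): f is the latent mean and
    exp(g) the latent standard deviation, as in the Gaussian
    assumption above."""
    var = np.exp(g) ** 2
    return -0.5 * (np.log(2 * np.pi * var) + (y - f) ** 2 / var)
```

Training maximises the sum of this quantity over the dataset with respect to the parameters governing f and g.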
Whilst we can see that the model generally captures the mean function of the generating process, the noise estimate is clearly perturbed by the presence of potential outliers at x∼0.4. This is not necessarily an issue with the model as such: it's trying to explain the data as best it can, and the variance really is much larger at x∼0.4 because of the outliers. It's just that the variance it has learned isn't the quantity we're interested in. Let's now reconsider our modelling assumptions.
A robust likelihood
We stated earlier that we had assumed our data is generated by a heteroskedastic Gaussian process. Let's revisit this, now considering the implications of the candidate outliers at x∼0.4.
These potential outliers are clearly not generated from the same process as the central data, so we should adapt our likelihood to account for this. We can write this adapted likelihood as a mixture distribution, that is, a weighted sum of our distribution of interest, N(f, exp(g)²), and some other distribution. It doesn't matter too much how we parametrise this other distribution (it's not something we care much about, after all) as long as it is sufficiently broad to encapsulate any data it may encounter. Here we use a Gaussian distribution, N(0, 100).
Thus, we can rewrite our modelling assumption as

Y ∼ (1 − w_outlier) N(f, exp(g)²) + w_outlier N(0, 100)    (2)
Here, w_outlier ∈ [0, 1] is a single global parameter (i.e. it has no x dependence) that describes the relative weight of the outlier component in the mixture. In general, we do not know the value of this parameter, so it is learned from the data during model optimisation. We call our model with this mixture likelihood a "RobustHetGP", as it is now "robust" to the presence of any outliers.
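For given values of f, g and the outlier weight, the mixture log-likelihood of a single observation can be sketched as below; np.logaddexp keeps the sum of the two weighted components numerically stable. Note we treat the "100" in N(0, 100) as a variance here, which is an assumption about the notation.

```python
import numpy as np

OUTLIER_VAR = 100.0  # broad outlier component N(0, 100);
                     # we assume 100 denotes the variance

def log_norm(y, mean, var):
    # Gaussian log-density.
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def log_lik_mixture(y, f, g, w_outlier):
    """log[(1 - w) N(y | f, exp(g)^2) + w N(y | 0, 100)]."""
    log_good = np.log1p(-w_outlier) + log_norm(y, f, np.exp(g) ** 2)
    log_out = np.log(w_outlier) + log_norm(y, 0.0, OUTLIER_VAR)
    # Stable log-sum-exp of the two weighted components.
    return np.logaddexp(log_good, log_out)
```

In practice the outlier weight would be a trainable parameter, constrained to [0, 1] (e.g. via a sigmoid transform) and optimised alongside the Gaussian process parameters.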
Let's now compare the predictions of such a model to the data:
We can see that this model results in a much better fit to the central data and has effectively ignored the potential outliers at x∼0.4. This is fantastic, as we believe this is a much more accurate representation of the engine's true response to the current.
For each data point we can also ask the model how likely it is that the point was drawn from our outlier distribution. This value, p_outlier, is obtained by integrating our latent functions (recall that f and g describe distributions over functions in our Gaussian process framework) over the components of our robust model likelihood, e.g.

p_outlier(Y) = 1 − (1 − w_outlier) ∫∫ P_good(Y|f, g) p(f) p(g) df dg / ∫∫ P_mix(Y|f, g) p(f) p(g) df dg

where P_good and P_mix represent the probability of observing data Y under the generating distributions described in equations 1 and 2 respectively. Typically, this integral does not have an analytical solution, but there are numerical techniques we can adopt to solve it. We compute p_outlier for each point in our dataset, as indicated by the colour scale in the figure below (points with p_outlier > 0.5 are circled in red).
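One simple numerical technique is Monte Carlo: draw samples of the latent functions at the input of interest, evaluate both densities for each sample, and form the ratio of the averages. The sketch below assumes you already have posterior samples of f and g from your GP code; it is an illustration of the idea rather than our actual implementation.

```python
import numpy as np

def log_norm(y, mean, var):
    # Gaussian log-density.
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def p_outlier(y, f_samples, g_samples, w_outlier, outlier_var=100.0):
    """Monte Carlo estimate of the probability that observation y was
    drawn from the outlier component, given posterior samples of the
    latent functions f and g at the corresponding input."""
    var = np.exp(g_samples) ** 2
    p_good = np.exp(log_norm(y, f_samples, var))   # central component, per sample
    p_out = np.exp(log_norm(y, 0.0, outlier_var))  # outlier component (no latents)
    p_mix = (1 - w_outlier) * p_good + w_outlier * p_out
    return 1.0 - (1 - w_outlier) * p_good.mean() / p_mix.mean()
```

A point sitting far from the central process gets a p_outlier close to 1, while a typical point gets a small value.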
The model correctly identifies the five outlier points (0.5% of the data; note that the model learned w_outlier = 0.00455, remarkably close to the true value).
Thus, a relatively minor adjustment to our modelling assumptions has equipped our model to cope with potential outliers in a robust manner, allowing us to accurately model the true engine response to current.
Another use case for the mixture likelihood described above is to alert us to instances of model misspecification. In other words, where our modelling assumptions may be flawed.
Below is some new synthetic data. It has the same "outliers" and x values as before, but the generative distribution (i.e. the motor/engine) is different: it has a more concentrated central process, with heavier tails towards greater y values. It should be noted, though, that at each x it has the same mean and variance as the data in our earlier example.
Let's now fit our robust model to this data:
Clearly, our model has determined that there are many more outliers in this data. In fact, w_outlier now has a value of ∼0.043, roughly 10× larger than for the earlier data! Whilst an outlier contamination of 4-5% may not seem large, it could be cause for concern if we expect it to be at the sub-percent level.
In fact, a larger value of w_outlier may indicate that our Gaussian assumption, the N(f, exp(g)²) term in our model likelihood, is a poor description of this data, and that we should investigate other ways of modelling it.
In this technical blog post we've shown how a relatively simple change to our models can make them robust to the presence of outliers, a common phenomenon in real-world automotive data. This gives our models the extra expressivity to reason about whether individual data points are valid, as well as to determine globally what proportion of the data can be explained well by our assumed model. Doing this within the principled modelling framework provided by Gaussian processes saves us the extremely time-intensive procedure of developing heuristics for cleaning each dataset we investigate. This allows us to have more meaningful and insightful conversations with our clients about their data at an earlier stage in our work with them, and ultimately to construct more accurate engine calibration models in less time.