Whether your team has recently deployed their first or their 100th ML model to production, you likely appreciate the importance of proactively monitoring your models’ inputs and outputs. In an ever-changing world, the model you trained in the lab is almost certainly performing differently out in the wild. In order to ensure that all stakeholders can trust your ML systems, you need to ensure that your models continue to perform well, that they can be understood, and that they don’t discriminate unfairly. These ideas form the major pillars for AI Observability: Performance, Explainability, and Fairness. In this post, we’ll walk through the components of building an enterprise-wide system for model monitoring, so that you can provide an always-on visibility to stakeholders across your company, including data scientists as well as risk officers, legal advisers, and business stakeholders. This guide will help you understand what you’d need to build for your enterprise, or, you can reach out to us at ArthurAI to learn more about our turnkey system for enterprise model monitoring.

Platform Agnostic Integration

At a large organization, you might have many different data science teams, all of whom use their favorite tools for model development and deployment. Perhaps some corners of your org are big fans of Tensorflow, others love SageMaker, others love H20, and yet some others have in-house tooling built with open source stacks. We see this pattern all the time and we think it’s great: let data scientists use the tools that they love so they can be as effective as possible. When it comes to model development and deployment, this heterogeneity can be a sign of a diverse and creative data science community. However, from the perspective of technology risk management, all these disparate systems might make you (and your legal team) a little uneasy. To answer a simple question like “Are all the models doing ok today?”, you’d have to go check many different systems and hope they all offer comparable information. What you’d really want is a single go-to place for seeing how all your models are performing, and a way to be notified as soon as something goes wrong. This is why it is important that across your ML ecosystem you need to create a centralized and standardized layer for model monitoring and governance. With this goal in mind, your monitoring system needs to be platform agnostic and allow drop-in integration across all those different stacks.

Fortunately, much of model monitoring can be achieved without tight coupling to the specific environments for model development and deployment. As we’ll describe below, monitoring the stability of a model’s outputs is a process that can be completely ignorant of the model’s development process or deployment environment. The same is true for monitoring a model’s inputs and for ensuring a model is fair. And with many methods for model explainability, the specifics of the model training procedure are irrelevant once we have a trained model to probe. Therefore, it is possible to build out a platform-agnostic monitoring system that is accessed through REST APIs so that all the models in your ecosystem are consistently monitored in real-time, no matter how they were built or deployed. Even better, it’ll be important to provide data scientists and engineers with easy-to-use client libraries in their favorite languages (mostly Python, but possibly also Java, Scala, or R).

Performance and Data Monitoring

Model Outputs

The most salient question you want to answer is this: Is my model performing as well today as I expect it to? Notions of “good” performance will vary depending on different business cases, but there are many general themes you can bring to bear. In the ideal setting, your models will have access to a timely ground-truth so that you can effortlessly compute things like accuracy, confusion matrices, ROC curves, and so on. Monitoring a model’s accuracy can make you aware of a performance issue as soon as it happens. In fact, it would be great if you could make any such metrics extensible, because surely someday you’ll find a data scientist who wants something slightly different than the metrics you are calculating in your system. Allowing users to onboard custom metrics calculations will ensure that no data scientists are left out. In a use case such as product recommendation or targeted advertising, you’ll know instantly whether your model’s outputs resulted in a click or not. However, there are many cases where ground-truth is hard to come by, and what can we do then?

An example is issuing credit cards - if you approve someone for a credit card, it will likely be months or years before you decide that was a bad decision. In these cases, you’ll need to generate proxy metrics for accuracy by monitoring the stability of a model’s inputs and outputs. The stability of a model’s output predictions is a useful proxy in the absence of ground truth. Since a model, once fitted and deployed, never changes it’s view of the world (its decision surface), any notable changes to its output can be attributed to significant changes in its inputs - something your data scientists will want to know about. You should consider monitoring the distribution of output predictions from a model, whether a regression model, classification model, multilabel model, multimodal model, or anything else. There are many ways to quantify similarity/stability of distributions, depending on the type of model and dimensionality (more on that below). When it comes to monitoring the stability of distributions, you might quantify changes through time, or changes relative to baseline (such as the training set), or both.

Data Drift

Just as important as monitoring a model’s outputs, it is vital to monitor the stability of a model’s inputs: often referred to as Data Drift or Concept Drift. Your model is trained on a fixed snapshot of the world - the training data. Once your model is deployed, the world will inevitably change and depending on how severe that change is, the learned decision boundary of your model might become irrelevant. By regularly quantifying the similarity of the data coming through the model today to the data the model was trained on, you’ll be able to quickly identify when things start to go off the rails.

There are many ways to quantify distributional similarity and to build a system for data drift detection. Your first approach might be to monitor each of the incoming features to a model and quantify similar to the training set. For example, if a model has Age as an input feature, then you’d want to look at the training set and develop a statistical profile of the data in that training set. Then going forward, you can calculate the similarity of the data coming through, as compared to the training data. Measuring the similarity of two distributions can be achieved in many ways including non-parametric hypothesis tests, KL Divergence, Jensen-Shannon Divergence, Population Stability Index, Wasserstein distance, and many more. What these have in common is that they will take in two empirical distributions as input and will result in a similarity score as output. When that similarity score starts to have big changes, it is an indication that the data today is quite different than the training set, and maybe it’s time for a retrain and redeploy.

Computing distributional similarity for each input feature independently will generally tackle a large proportion of the problem, but is not without limitations. An early thing to consider is that we’re implicitly looking for drift only in the marginal distributions of each feature and not in the higher-dimensional joint distribution over the data. For high dimensional datasets, and especially for unstructured data such as imagery and text, it will be important to consider multivariate methods for quantifying distributional similarity. While there is generally not an easily-computed high-dimensional analog for KL Divergence, we can take a model-based approach to quantify data drift. In this approach, we will train a model of some kind on the training set. This model isn’t a classification or regression model, but instead is some kind of density model (or similar) which is trying to understand how the data is distributed in the high dimensional space. Once we have fitted such a model to the training set, we have a snapshot of what “normal” data looks like. Going forward, we can collect new inferences and query this model to quantify how well the new inferences adhere to the distribution of the training dataset. For each single inference, this gives us a powerful mechanism for anomaly detection, since we can now identify incoming points that the model has scored but yet don’t really look anything like the training data. Further, as we aggregate over larger groups of inferences, we have an holistic view into multivariate data drift in a high dimensional space.

That general framework can be accomplished with many different modeling techniques. One approach might be using an Isolation Forest or KDTree to fit to your dataset. More recent techniques show promise for fitting properly-normalized probability models to high dimensional datasets, including Variational Autoencoders and Normalizing Flows. Additionally, preprocessing methods such as dimensionality reduction have been shown to be helpful for high-scale problems. In all cases, you’d need to also build a system for training, storing, containerizing, and serving each one of these density models so that streaming inferences can be scored as they come in.

Finally, it is worth noting that not all data drift is created equal. This idea is sometimes referred to as ‘virtual’ concept drift and denotes instances where the data has drifted but in a direction that doesn’t materially affect the model’s outputs. Thus, it is very helpful to combine data drift monitoring for each feature with a simultaneous quantification of how important each feature is for model decisioning. Combining data drift modeling with model explainability (more on that below) is a powerful way to prioritize your teams’ time and attention.

Explainability as a Service

Understanding a model’s decisions is an important part of building trust for the adoption of ML across your organization. With increasingly-complex ML techniques, their flexibility is often accompanied by a difficulty in understanding why they make their predictions. The field of Explainable AI has put forth many valuable techniques for calculating explanations of model predictions. In your model-monitoring system, if you are able to provide these explanations for every prediction your models make, this can go a long way toward building trust and comfort among a broad class of stakeholders.

There are many great techniques for calculating local explanations of ML models. These methods could be model agnostic (like LIME), model-based (like DeepLift), or both (like Shap). In all cases, you need to access a model’s predict function in order to probe the relationships between a model’s inputs and outputs. For your monitoring platform, this means you don’t need to be tightly coupled to the model training environment. Instead, you only need access to the final trained model and ability to probe it. You might hook into existing model deployment APIs, or you might build and replicate a model microservice solely for computing explanations on the fly. In this case, you’ll need to think about containerization for replicating the model’s dependencies and environment. And if your model requires high throughput or high dimensional data (or both) you’ll want to think about ways for autoscaling the computation of explanations in order to keep up with the inference load. Additionally, you might put some thought into refactoring some of those favorite explainability methods to make them more performant for your use cases.

Once you’ve computed explanations for every inference coming through your models, it opens many exciting possibilities for helping your organization. First, it provides your data scientists with a granular view in a model’s decision surface, allowing them to identify areas where a model might be underperforming and helping to debug models in production. Second, these explanations form a useful audit trail for your modeling system, ensuring that every decision that a model makes is logged and can be understood at a later time. And finally, considering local and global feature importance will help you understand and prioritize data drift and the emergence of new patterns in your data.

Algorithmic Fairness

It is important to not only ensure your models are making the “right” decision from a statistical standpoint, but also the “right” decisions from an ethical standpoint. It has become clear across many industries that systems for automated decision making can exacerbate disparate impact and discrimination against specific groups of people. Ensuring that your models are fair is tantamount to ensuring that they result in similar predictions/outputs for all relevant subpopulations of your customer base. Traditionally, fairness analysis is conducted over protected classes as defined by race, age, and sex. But for your business, there could be many other dimensions and factors over which you want to ensure your model is resulting in comparable outcomes. For example, you might want to understand model disparities by geography, income, spending level, or any business-driven segmentation of your customer base.

Many researchers have attempted to quantify notions of algorithmic fairness, with many such definitions proposed in recent years. For your business, you should decide which definitions and fairness metrics are most aligned with your company’s goals for Responsible AI. The next step is to build a Fairness Audit framework into all modeling done in your organization. In this pursuit, you would examine all models’ inputs and outputs per any sensitive/protected classes and arrive at a quantification of disparate impact. Naturally, this kind of fairness audit would not be a static process but would be an ongoing analysis that is continually conducted as new inferences go through your models. Ideally, you could provide an easy-to-use dashboard for identifying, exploring, and mitigating disparate impact in your models. This dashboard would make these fairness metrics accessible to all important stakeholders in your organization, and ensure that this information is not just residing with data scientists.

APIs, Storage, and Compute

With inferences stream into your monitoring platform, and real-time metrics being computed for drift, accuracy, and more, you’ll need to put some careful thought into data storage, access, and streaming analytics. Some data science uses cases operate in large batches while some might operate in a streaming fashion, so a combination architecture with Kafka and Spark might prove useful. Many of the metrics and analytics we’ve described can be computed in a streaming context and autoscaled to meet load requirements. Once these metrics, analytics, and explanations have been computed over the inferences, it would nice to make all of this data available to data science teams to explore on their own. You might consider storing the metrics in a datastore appropriate for fast access and large scale, so that your data scientists can quickly slice and dice this data. You could even connect this backend store to an interactive data visualization dashboard, allowing your teams to explore a model’s decision space and better understand areas to improve and debug.

Real-Time Alerting

Once you’re calculating all the previously described metrics (and housing them in a scalable data store), you’ve got everything you need for real-time alerting. You can let data scientists know the moment a model’s accuracy starts to drop too much or an important feature seems to have drifted significantly away from the training set. You can build alert integrations directly to where people are spending their time, including email, Slack, or ServiceNow. Apologies in advance when you get a 2am wakeup call about your model’s performance plummeting.

User Interface

The great thing about instrumenting a platform-agnostic model monitoring tool, is that it suddenly enables disparate stakeholders to have effortless access to a model’s outputs and behaviors. Concepts that might have typically been stuck inside a data scientist’s notebooks (things like model explanations) are now readily available for a broader audience to consume. You now have a few different personas to think about when designing interfaces. The first might be data science practitioners who are hands-on with model development and will want a very specific and technical view into the data surrounding each model. The second might be data science leaders and risk management leaders, who primarily will be concerned with ensuring that everything is healthy and nothing needs escalation. Finally, you’ll want to think about less-technical business stakeholders who are using these models to accomplish their business goals (and accepting these operational risks). You’ll want to make sure they have easy-to-understand access to major insights around model performance and risk.

The Other Important Stuff

Let’s not forget about things like role-based access control (RBAC), single sign on (SSO) integration with enterprise user directories, end-to-end encryption and other policy compliance. You’ve got to get this stuff right the first try, so be sure to move very carefully and thoughtfully through these topics.

Putting It All Together

With these pieces in place, your data science teams should be able to effortlessly onboard new (and old) models into your monitoring system and begin sending inferences and telemetry. Your monitoring system will aggregate this information and compute real-time metrics for stability, performance, fairness, and anything else important to your organization. Alerting will provide real-time awareness so that data scientists can begin solving problems before it’s too late. The dashboarding you’ve built will provide access to these concepts to a whole new suite of stakeholders across your business - not just for data scientists anymore. Good luck!