If you’re an enterprise with AI models in production across multiple lines of business, chances are you already have a model governance plan in place to comply with business, operational, and regulatory requirements.
But governance goes beyond access controls, fancy arrow flow charts, policy PDFs, and checklists. How can you ensure your resulting model is accurate, robust, and reliable over time?
Model monitoring is a critical part of the AI lifecycle that enables data science teams to detect—and ultimately address—issues like data drift and algorithmic bias, while providing the necessary tools for correcting performance issues in the real world.
According to Harvard Business Review, nearly 80% of AI projects won’t scale beyond proof of concept. Of the remaining 20%, it can take anywhere from 3 months to 2 years to roll a model into production. Given the significant time and investment poured into ML model development, experimentation, training, and deployment, why wouldn’t you want to ensure that your model keeps performing well once it is in production?
What Is Worth Monitoring in a Model?
1. Performance
The right performance metrics depend on the type of model, and a complex model may call for a combination of them.
- Regression: MAE, MSE, RMSE, Residual Histogram, Predicted vs. True, Forecast Horizon
- Classification: Accuracy, Confusion Matrix, Precision/Recall Tradeoff, F1 Score, AUC-ROC, Lift Curve, Cumulative Gains Curve, Calibration Curve, etc.
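As a rough illustration of a few of the metrics listed above, here is a minimal sketch in plain Python. The function names `regression_metrics` and `classification_metrics` are hypothetical helpers written for this post, not part of any library or of the Arthur platform, and the classification helper assumes binary labels.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and RMSE for paired lists of numbers."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, F1, and confusion counts (binary labels)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
    }
```

In practice you would likely reach for an established metrics library. The point here is only that every one of these numbers needs `y_true`.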
These measures of accuracy are only useful, however, if you have access to the ground truth of your model. In a production environment, ground truth can sometimes take days, weeks, or even years to acquire (if it is available at all).
In order to assess the performance of models without these prohibitive delays, proxies such as data drift can be used as a real-time barometer of your model’s activity. Drift metrics can be a meaningful leading indicator of the need for models to be retrained or restructured.
2. Drift and Degradation
Over time, data changes. Data drift is defined as a variation in the production data from the data that was used to train, test, and validate the model before putting it into production. The first two questions we often ask ourselves when we are faced with data drift are “When did this happen?” and “How did this happen?” Coincidentally, those are also some of the best ways to characterize different types of data drift. We can start to characterize how drift occurs by looking at the distributional makeup of the drift (i.e. what pieces of our data are drifting). We can then look at when the drift occurs by characterizing drift by intensity or timeline.
Covariate drift is one of our most common types of real-world drift. It occurs when there is a change in the feature space of the model—or, in other words, when the distribution of one or more of the features has changed.
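One common way to put a number on covariate drift for a single numeric feature is the Population Stability Index (PSI), which compares binned distributions of a reference sample and a production sample. The sketch below is an assumption-laden illustration: PSI is our choice for the example rather than a description of how any particular platform measures drift, and the function name `psi`, the bin count, and the `eps` smoothing value are all hypothetical.

```python
import math


def psi(reference, production, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of the reference sample; production
    values outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A PSI of zero means the binned distributions match. What counts as "too much" drift is a judgment call that depends on the feature and the model, so any alert threshold should be treated as a tunable assumption.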
Concept drift is a change in the relationship between input and output data variables over time. This change can be gradual, recurring, or sudden. Oftentimes in production, we may not have access to the ground truth immediately. For example, if we are predicting whether or not a customer will default on a loan, we may not know if we are correct for months or even years. In cases like these, it can be useful to also evaluate the relationship between features and your predicted values, instead of just the true (or ground truth) target variables.
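To make the last idea concrete, here is a minimal sketch that compares how strongly one feature tracks the model’s predicted values in a reference window versus a production window, using Pearson correlation. Both helper names (`pearson`, `relationship_shift`) are hypothetical, and correlation is only one possible choice of relationship measure. A large shift is a prompt to investigate, not proof of concept drift, since confirming a change in the true input-output relationship still requires ground truth.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def relationship_shift(ref_feature, ref_pred, prod_feature, prod_pred):
    """Absolute change in feature-to-prediction correlation between two windows."""
    return abs(pearson(prod_feature, prod_pred) - pearson(ref_feature, ref_pred))
```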
Distribution drift is common in real-world model scenarios. A key feature of distributional drift in production is that it is silent: nothing visibly fails while the data shifts. Arthur provides the ability to both monitor and set alerts for different kinds of distributional drift, using both statistical and model-based methodologies.
Model degradation, or model decay, happens when a model’s performance becomes less reliable over time due to changes in the environment. When data drift occurs, the assumption we made that production data would resemble our training data no longer holds, which can cause the model to decay.
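Once ground truth does arrive, degradation can be tracked directly. The sketch below keeps a rolling window of recent labeled predictions and flags the model when accuracy falls meaningfully below a baseline. The class name `RollingAccuracyMonitor` and its `window` and `tolerance` parameters are hypothetical values chosen for illustration.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent labeled predictions."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, y_true, y_pred):
        """Add one labeled prediction; the oldest one drops out when the window is full."""
        self.outcomes.append(1 if y_true == y_pred else 0)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing has been recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True only when the window is full and accuracy is below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance
```

Waiting for a full window before alerting is a deliberate choice here, since a handful of early mistakes would otherwise trigger false alarms.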
3. Explainability
End users, data scientists, business leaders, and regulators need to understand how models make decisions. Improving model transparency can reduce model development and debugging time, highlight areas of concern for data drift and bias, and increase overall trust in the model.
Explainability vs. Interpretability
Model interpretability refers to models that are inherently understandable to humans. These models are simple enough that a human looking at the logic and internals of the model can understand how the model makes an inference given a particular input. In practice, few models are truly interpretable. Instead, there is often a tradeoff between interpretability and performance—especially for models performing complex tasks. High-performing models that do complex tasks are often the least interpretable models.
The goal of model explainability is to provide visibility into models that are too complex to be inherently interpretable. This often requires additional models and other techniques to generate explanations that are comprehensible to humans. Arthur’s platform offers powerful explainability techniques to provide prediction-level and whole model-level visibility into any model, including advanced “what if” analysis and feature importance ranking.
Local vs. Global Explainers
Global explainers provide holistic model-level explanations. Global explanations are often presented as a summary of feature importance across the entire model. These explanations show which input features make the greatest impact on the output predictions of the model.
Since global explanations serve as simplified summaries of model behavior, they may not be accurate for specific data samples. However, they can help data scientists contextualize data drift to understand when a model needs to be retrained. This is especially useful when ground truth labels are unavailable. Global explainers are also useful to identify differences between groups for bias/fairness or debugging purposes.
Local explainers provide a hypothesis of why a model made the prediction it did given a specific input sample. These explanations are useful in providing specific explanations to end users. They can also be helpful to data scientists when trying to identify and understand the cause of specific production issues. Local explanations can be aggregated across many samples to form global explanations.
Types of Explanations
For explainers to be useful, they must present explanations in a way that is comprehensible and intuitive to humans. This is often presented in the form of data visualizations for feature importance. Explanations for tabular data can be intuitively represented as a bar plot of feature importances. Explanations for computer vision models and image data are provided by highlighting the most significant regions of an image, while natural language processing models can be explained by highlighting significant words and phrases.
Arthur leverages the industry standard LIME and SHAP algorithms to provide local and global explanations for tabular, computer vision, and natural language processing models. Both algorithms create simplified surrogate models to provide local explanations, which can be aggregated into global explanations. LIME and SHAP are model-agnostic explainers, meaning that they can generate explanations for any type of model, without accessing the internal logic and parameters of the model. Depending on the particular model and use case, a data scientist may favor one of these algorithms over the other. On the Arthur platform, LIME is used for image and text data, while SHAP is used for tabular data.
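To show what “model-agnostic” means in practice, here is a small sketch that computes exact Shapley values by brute force, touching the model only through a `predict` callable. This is not the SHAP library and not how Arthur computes explanations: real tools approximate these values because exact enumeration grows exponentially with the number of features. The function names and the single-`baseline` convention (a missing feature is replaced by its baseline value) are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value; the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def global_importance(predict, samples, baseline):
    """Aggregate local explanations into a global one: mean absolute attribution."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, v in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(v)
    return [t / len(samples) for t in totals]
```

For a toy model such as `lambda z: 2 * z[0] + 3 * z[1] + z[0] * z[2]`, the local attributions for a sample sum to the difference between its prediction and the baseline prediction, and the interaction term is split evenly between the two features involved.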
4. Bias / Fairness
Local and federal regulations around detecting and addressing bias are in the works (see NYC hiring law, Algorithmic Accountability Act of 2022).
We have learned that traditional approaches that amount to “fairness through unawareness” simply do not work. In this case, ignorance is not bliss, and it is not enough to satisfy existing or upcoming regulations. This unawareness-based approach may avoid the discriminatory practice known as disparate treatment, but it does not address the possibility of the discriminatory practice known as disparate impact (this distinction is rooted in the Civil Rights Act of 1964). Essentially, a model that does not take into account membership of a protected class can still have adverse effects on members of that protected class.
Detecting bias and discriminatory practices requires actively probing your data to see if groups are being treated unfairly. Arthur does this active probing for you and makes it easy to detect bias by making comparisons between subgroups, even if that group identity is not being used as an input to your model.
There are a number of different metrics to quantify fairness. The three most common ones are demographic parity, equality of opportunity, and equalized odds. Arthur allows you to quickly identify, quantify, and visualize the degree of bias/fairness (using standard or custom fairness metrics) in your model’s outputs.
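The three metrics named above can be sketched in a few lines. In this illustration, demographic parity compares selection rates across groups, equality of opportunity compares true positive rates, and equalized odds compares both true and false positive rates. The helper names `group_rates` and `fairness_gaps` are hypothetical, labels and predictions are assumed to be binary (0 or 1), and reporting each metric as a max-minus-min gap is just one convention among several.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate, and false positive rate."""
    stats = {}
    for t, p, g in zip(y_true, y_pred, groups):
        s = stats.setdefault(g, {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
        s["n"] += 1
        s["pred_pos"] += p
        if t == 1:
            s["pos"] += 1
            s["tp"] += p
        else:
            s["neg"] += 1
            s["fp"] += p
    return {
        g: {
            "selection_rate": s["pred_pos"] / s["n"],
            "tpr": s["tp"] / s["pos"] if s["pos"] else 0.0,
            "fpr": s["fp"] / s["neg"] if s["neg"] else 0.0,
        }
        for g, s in stats.items()
    }


def fairness_gaps(rates):
    """Largest between-group gap for each fairness criterion (0.0 means parity)."""
    def gap(key):
        values = [r[key] for r in rates.values()]
        return max(values) - min(values)

    tpr_gap, fpr_gap = gap("tpr"), gap("fpr")
    return {
        "demographic_parity_gap": gap("selection_rate"),
        "equal_opportunity_gap": tpr_gap,
        "equalized_odds_gap": max(tpr_gap, fpr_gap),
    }
```

Note that `groups` is passed alongside the predictions rather than into the model, which matches the point above: you can compare subgroups even when group identity is not a model input.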
If bias is identified, Arthur can help mitigate it using post-processing techniques that do not require fundamentally changing your training data or model architecture.
Don’t forget that an essential part of model governance is tracking model health post-production through model monitoring. Automated monitoring of performance, drift, model degradation, explainability, bias, and fairness as well as alerts/notifications of potential issues is an important aspect of ensuring responsible AI in your MLOps lifecycle.
Want to learn more about Arthur? Schedule a demo with one of our experts to see the AI monitoring, explainability, and bias analytics platform in action.