Machine learning models in production can degrade in many unexpected ways. There are certain questions, concepts, and metrics we know ahead of time that we’ll want to check on week over week. And then, of course, there are unknown unknowns: problems, questions, and calculations that we cannot anticipate today but that may become critical six to twelve months from now. For this reason, the Arthur platform has been built to be as configurable as possible, to not be prescriptive, and to give our users full and flexible interfaces for interrogating and understanding model performance. In this post, we’ll describe some of the architectural choices that underlie this design philosophy, and share a couple of the powerful modes of interaction they make possible.
As your model data is ingested into the Arthur platform, it is indexed into a distributed columnar datastore. Not only is this highly scalable (it can easily handle petabytes of model data), but it also supports rapid aggregations and queries, giving you exploratory, ad-hoc access to all the information about historical model performance. You don’t have to make the difficult choice to subsample or summarize your data, and you don’t need to pick your favorite metrics ahead of time. You’ll have at your fingertips all the historical data about your model’s inputs, predictions, performance, explanations, and other insights. This makes it possible to slice and dice model performance across any facets or subpopulations that are relevant to you. For Arthur users, this capability results in two particularly powerful ways of monitoring data and models.
The Arthur platform provides an interactive data visualization suite that allows you to explore and understand the data pertaining to your model. Our backend architecture allows us to compute the necessary aggregations, groupings, and filters in milliseconds, even over hundreds of millions of data points.
In the examples below, we can visualize the distributions and correlations among model inputs, outputs, and even explanations. We can quickly navigate through different time slices, facet the data by groupings of different variables, and understand a model’s predictions and data landscape.
In addition to the rich set of visualizations and metrics available in the Arthur UI, you can also fetch any and all of this underlying data (and computation) through our API. Our API-first approach means that data scientists can quickly check in on model performance using a familiar tool, such as a Jupyter notebook. Our Query Engine exposes a SQL-like language that will be immediately familiar to data scientists, so that they can compute and visualize ad-hoc summaries and aggregations on large sets of data. As an example, one day we might be curious how our model is performing for males versus females, and whether that has been changing over time. We construct a query with familiar group-bys and filters, and the Arthur backend computes aggregations over millions of inferences in just a few milliseconds. The result of this query is easy to drop into a pandas DataFrame for quick visualization.
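To make that workflow concrete, here is a minimal sketch of the kind of group-by aggregation described above. Since we are illustrating the shape of the result rather than Arthur’s actual API or schema, the column names below are hypothetical, and we mimic the aggregation locally with pandas:

```python
import pandas as pd

# Hypothetical sample of inference data, as it might be returned by a
# monitoring API. Column names are illustrative, not Arthur's schema.
inferences = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-04", "2021-01-05", "2021-01-11", "2021-01-12",
         "2021-01-18", "2021-01-19", "2021-01-25", "2021-01-26"]),
    "sex": ["Male", "Female", "Male", "Female",
            "Male", "Female", "Male", "Female"],
    "prediction": [1, 0, 1, 1, 0, 0, 1, 0],
    "ground_truth": [1, 0, 0, 1, 0, 1, 1, 0],
})

# The local equivalent of a SQL-like query such as:
#   SELECT week, sex, AVG(prediction = ground_truth) AS accuracy
#   FROM inferences GROUP BY week, sex
inferences["correct"] = inferences["prediction"] == inferences["ground_truth"]
weekly = (
    inferences
    .groupby([pd.Grouper(key="timestamp", freq="W"), "sex"])["correct"]
    .mean()
    .rename("accuracy")
    .reset_index()
)
print(weekly)
```

The resulting DataFrame has one row per (week, sex) pair, which drops straight into a plotting library for a quick accuracy-over-time comparison across the two groups.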
In addition to model evaluation metrics, it is a snap to get a quick view of the distributions of a model’s inputs or outputs, to understand how they may be shifting over time.
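As one way to quantify such a shift, the Population Stability Index (PSI) is a common drift measure that compares binned distributions from a reference window (e.g., training data) and a recent production window. The sketch below uses synthetic data and is a generic illustration of the technique, not Arthur’s implementation:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D samples."""
    # Fix bin edges from the reference window so both samples are comparable.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # A small floor avoids log(0) in empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
stable = rng.normal(0.0, 1.0, 10_000)    # same distribution: PSI near 0
shifted = rng.normal(0.5, 1.0, 10_000)   # mean has drifted: PSI much larger

print(psi(reference, stable))
print(psi(reference, shifted))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.2 as a significant shift worth investigating.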
From a notebook, we can quickly and easily dive into any subpopulation and assess model performance and data stability. Once we are alerted to an issue with model performance or data drift, having this data at our fingertips empowers us to conduct an investigation and drill down to its root cause.
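One simple way to start such a root-cause drill-down, sketched here with hypothetical feature names and synthetic data, is to rank input features by how far their summary statistics moved between a baseline window and the alerted window:

```python
import numpy as np
import pandas as pd

# Hypothetical feature data for two time windows; in practice these
# frames would be pulled down from the monitoring API.
rng = np.random.default_rng(1)
baseline = pd.DataFrame({
    "age": rng.normal(40, 10, 5_000),
    "income": rng.normal(60_000, 15_000, 5_000),
})
alerted = pd.DataFrame({
    "age": rng.normal(40, 10, 5_000),             # unchanged
    "income": rng.normal(75_000, 15_000, 5_000),  # shifted upward
})

# Absolute standardized mean difference per feature: a quick, unitless
# ranking of which inputs are the most likely drift suspects.
smd = ((alerted.mean() - baseline.mean()) / baseline.std()).abs()
print(smd.sort_values(ascending=False))
```

Here the drifted `income` feature rises to the top of the ranking, pointing the investigation at the right input before any deeper per-feature analysis begins.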