ML Model Monitoring

Keep the Lights On: Making Deployed AI/ML Better for Everyone

It’s easier than ever to build and deploy ML models. Storage is cheap, compute is cheap, pre-trained models are prevalent and, did I mention, cheap! As developers and practitioners, we’ve felt the pressure to deliver models into production to provide analytics for internal processes all the way up to running decision-making for mission-critical problems. Our job is, in a nutshell, to create a model that performs well right now, according to one or more downstream KPI-related metrics. That’s generally doable—we handle data identification, ETL pipelines (and their variants), model training, verification, and beyond. A model is trained that exceeds expectations and is deployed. But what happens after that?

No matter how clean the input data, no matter how well-trained the model, a model pushed to production will degrade with respect to downstream metrics. Input distributions will shift (think COVID-19 impacting restaurant seating, hurricanes rolling through wedding destinations, or sudden demand spikes due to viral marketing). Furthermore, those dynamics, relative to downstream business metrics, may unduly impact particular subgroups due to latent legal, demographic, or political shifts. A “perfect” deployed model today is not a perfect model tomorrow, or in three months, or in a year. It’s important to keep the future in mind when deploying now, and to understand that model deployment is not the end of the model lifecycle. More bluntly: once the thing is properly built, we need to make sure it stays good.
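To make “input distributions will shift” concrete: a standard first-line check compares a feature’s training-time distribution against its live distribution with a two-sample Kolmogorov–Smirnov test. Here is a minimal, self-contained sketch of that idea (my own illustration, not Arthur’s implementation; the feature values, sample sizes, and significance level are made up for the example):

```python
import math
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past all copies of the smaller value in each sample.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def drifted(reference, live, alpha=0.05):
    """Reject 'same distribution' at level alpha using the
    asymptotic KS critical value."""
    n, m = len(reference), len(live)
    c = math.sqrt(-0.5 * math.log(alpha / 2))  # ~1.358 for alpha=0.05
    return ks_statistic(reference, live) > c * math.sqrt((n + m) / (n * m))

random.seed(0)
reference = [random.gauss(3.0, 1.0) for _ in range(2000)]  # training-time feature
live_bad = [random.gauss(1.5, 1.0) for _ in range(2000)]   # post-shock regime

print(drifted(reference, reference))  # False: identical data, no drift
print(drifted(reference, live_bad))   # True: the input regime shifted
```

A production system would run a check like this per feature on a schedule and alert when the statistic crosses the critical value; the point here is only that the degradation described above is detectable from the inputs alone, before labels ever arrive.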

Even more bluntly: model monitoring—production monitoring—shouldn’t be walled off behind an enterprise sales team, held outside of the developer-first MLOps pipeline, but instead should be easily accessible by all practitioners. That’s something we at Arthur want to, and can, enable. Over the last few years, we’ve built the world’s strongest model monitoring solution, battle-tested by some of the world’s largest enterprise clients. We’re looking forward to spending more time focusing on the developer community, working closely with practitioners to learn and shape what the most effective model monitoring solution should be.

Story Time: Making Sure Organs Go to the Right Place

Career data scientists, machine learning practitioners, machine learning scientists, statisticians, business analysts—with a wide variety of application areas, and a global scope for applications, we’re all being asked to build and deploy models. Arthur is a product built by engineers who needed a product like Arthur in their past jobs, and need a product like Arthur for their side projects. In my own career wearing many of those hats,[1] I’ve built and deployed models for bundled advertising campaign pricing, enablement of Indian election prediction markets, global blood donation recommendation systems, international drug interdiction allocation and efficacy estimation, television advertising allocation, and organ donation, to name a few. I’ll lean on that experience for a little “monitoring matters” wisdom below, after a quick story.

In organ exchange, patients in need of an organ enter an organized barter market to find a willing, compatible donor. Organized kidney exchange has existed for two decades, and I’ve been heavily involved in that process for 13+ years, with large exchanges running code I wrote to match patients to donors, and organizational committees using my code to provide “what-if” analyses during policymaking decisions. Time and time again, it’s been made clear to me that deploying a computationally “optimal” approach to clearing these exchanges, then letting that code run day after day, is not sufficient. Value judgments are made, medical technology improves, supply increases or decreases, the legal landscape shifts—what worked well yesterday may not work well today. In short, we write code to solve a problem based on a model of the real world at a given point in time; that model is a noisy proxy for what actually matters, and what matters changes over time. In practice, an “optimal approach” is deployed, but:

  • The model is uncertain. The inputs are noisy to begin with. Problems include missing variables, missing constraints, improperly set weights and costs, and beyond. Is a particular transplant center reliable? Does a particular social variable correlate with likelihood to donate?
  • The model is brittle to shifts in the underlying environment. For example, during COVID, living organ donation rates dropped due to fear of entering a hospital and/or capacity constraints at transplant centers. How does that impact transplantation rates? And, if COVID impacted particular populations more than others, how does that impact metrics for fairness and bias in organ allocation? Any allocative model will disparately treat specific subpopulations, and measuring and monitoring for that is imperative for downstream policymaking.
  • The model is poorly understood by stakeholders. Visualizing complex statistics is hard. But we use machine learning models precisely to address problems that are hard for humans to understand. So it’s important to communicate results to end stakeholders (in the organ exchange case, doctors, patients, donors, lawyers, etc.). Those statistics change over time, as the world changes. Communicating that change in a comprehensible way matters.
  • Certain demographics are systematically mistreated by the model. When we train models, we typically aim to maximize/minimize a specific objective function. That function may maximize utility or welfare for the many at the cost of utility or welfare for the few. This plays out in healthcare, including organ exchange, frequently; and, this can change drastically as the underlying political, legal, or demographic landscape shifts—regardless of the model that was trained and deployed.
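One way to put numbers on that last bullet is a group-fairness check such as the disparate impact ratio: each subgroup’s positive-outcome rate divided by the most-favored group’s rate, flagged when it falls below the common four-fifths (0.8) threshold. A minimal sketch of that check (my illustration only; the group labels, counts, and threshold are assumptions, not real organ-exchange data):

```python
from collections import defaultdict

def positive_rates(decisions):
    """decisions: iterable of (group, got_positive_outcome) pairs."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, positive in decisions:
        totals[group] += 1
        positives[group] += int(positive)
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact(decisions, threshold=0.8):
    """Each group's positive rate relative to the most-favored group;
    flag groups whose ratio falls below the four-fifths threshold."""
    rates = positive_rates(decisions)
    best = max(rates.values())
    return {g: {"rate": r, "ratio": r / best, "flagged": r / best < threshold}
            for g, r in rates.items()}

# Toy audit log for an allocation model: (subgroup, was_matched).
log = [("A", True)] * 60 + [("A", False)] * 40 \
    + [("B", True)] * 30 + [("B", False)] * 70
report = disparate_impact(log)
print(report["B"]["ratio"])    # 0.5: well below the 0.8 bar
print(report["B"]["flagged"])  # True
```

Run continuously over production decisions rather than once at training time, a metric like this is what surfaces the slow demographic shifts described above, since a model that passed this check at deployment can fail it a year later without a single line of code changing.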

In my experience, these general concerns arise in most application areas, not just organ exchange. Pricing advertisements depends on underlying social trends as well as external demand for correlated inventory. Drug interdiction success rates correlate with weather as well as USCIS/CBP patrol policies. Worldwide blood donation efficacy correlates with national and WHO policy. I’m sure you can think of examples from your own past or present, too. That’s part of our motivation behind building Arthur—creating a scalable platform for solving general problems in model performance across industries.

Monitoring Models with Arthur

We built Arthur to monitor models in production and to aid in the model verification process. Our enterprise clients, spanning banking, healthcare, agriculture, logistics, news, and beyond, have all felt the pain of deploying models without monitoring, whether directly via revenue loss or indirectly via damage to their brand. We are continuing to bring that technology to individual developers and teams: distribution and concept drift detection tied to downstream KPIs; bias and fairness definition, detection, and mitigation; and model explainability across the board. Our platform already handles structured and unstructured data, and we have an exciting roadmap for the coming year: expanding core strengths like computer vision and NLP connected to foundation models, robust approaches to measuring all of these metrics, and effortless scaling as our clients’ needs grow.

We look forward to continuing to partner with the MLOps community! If you’re in Austin next week (Feb 21-23), come find us at the Data Science Salon Austin—we’re going to be sharing some exciting things we’ve been building to tackle these very issues.


[1] All of these examples are independent of my time building Arthur! We encounter the same style of broad, far-reaching problem at Arthur, and I am happy to dive into those details over a coffee or beer. My goal with this post is to identify with the reader as an ML practitioner, not necessarily a C-suite executive.