ML Model Monitoring

Detecting Unexpected Drift in Time Series Features

Aside from crystal balls, time series models have become the predominant approach to predicting the future. Often implemented at organizations with access to large amounts of historical data, these models leverage time-dependent patterns and trends in input features (like economic indicators, weather patterns, even heartbeats) to forecast into the future. Time series models have use cases spanning nearly every data type and industry, yet a great deal of uncertainty remains when putting these models into production, and particularly when monitoring them.

The Challenges of Monitoring Time Series Models

Validating time series models on historical data is standard practice before putting them into production. Monitoring them once they are in production is trickier: forecasting horizons may stretch far into the future, so the ground truth that traditional validation and performance metrics depend on may take a prohibitively long time to become available, exposing organizations to the risks of underperforming models in the meantime. Many data scientists address this limitation by using data drift metrics on input features as a leading indicator of model performance and as a key input to deciding if and when a model should be retrained. However, time series models often contain features that naturally drift over time and display seasonality, meaning that some data drift is expected and is no longer, on its own, a principled justification for retraining. In the remainder of this article, I will discuss an approach that preserves data drift metrics as an informative leading indicator of model performance: first account for the expected drift of input features, then apply traditional drift metrics to the residuals of those time-dependent features.
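To make the idea of a drift metric concrete, here is a minimal sketch of one common choice, the Population Stability Index (PSI), computed with NumPy. The bin count, the synthetic data, and any alert threshold you would pair with it are illustrative assumptions, not prescriptions:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a current sample."""
    # Bin edges come from the reference (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)   # reference window
same = rng.normal(0, 1, 5000)       # no drift: PSI near zero
shifted = rng.normal(0.5, 1, 5000)  # mean shift: PSI clearly elevated
```

Applied to a raw seasonal feature, a metric like this will fire every cycle; applied to the residuals after expected drift is removed, it fires only on unexpected drift.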

To get a better intuition for this approach, let’s squeeze in an ice cream break. The ice cream shop near my house is open year round. It is situated right across the street from one of the city’s largest playgrounds and is a go-to for parents needing to make good on bribes they’ve offered their children. The line for this shop can sometimes wrap around the block. This past summer, I started to fancy myself a sweet-toothed soothsayer; before even leaving my house, I could tell how long the wait would be based on the weather that day. At 90 degrees Fahrenheit, a single scoop of pistachio would take upwards of an hour, while at 70 degrees, the place was practically deserted. My dessert-line wait-time model was working just fine, until temperatures started to decline. On a late winter’s day this past year, my model failed me. A meager lunch of leftovers and an unseasonably warm (70-degree) day had me craving a banana split, so I dusted off my old model and predicted that I could be back home with plenty of time to spare before my mail carrier arrived with my eagerly awaited copy of Designing Machine Learning Systems. It turns out a 70-degree day in winter is not at all the same as a 70-degree day in summer, and I ended up spending close to 40 minutes in line, missing my book delivery. I had fallen victim to a case of mistaken stationarity.

Stationarity in Time Series Data

Dealing with time series problems more pressing than my quest for a banana split, lickety-split, usually warrants a machine learning algorithm. Many common approaches (known as autoregressive models) build forecasts on the fact that quantities close in time are often similar (i.e., yesterday’s wait time is likely pretty similar to today’s). More recently, neural network architectures designed to handle long sequences of data (like natural language) have emerged as a popular approach for time series forecasting. These approaches all involve a series of preprocessing steps, some of which we can use to establish a notion of expected drift in order to isolate the informative signal that is unexpected drift.

Data cleaning and feature selection/extraction are so commonplace in the preprocessing pipeline at this point that they will not be discussed here. Rather, special attention will be devoted to a potential thorn in the side of anyone attempting to glean insights from time series data: the removal of nonstationarities. A stationary series has a constant mean and variance, and an autocovariance that depends only on the lag between observations; in the context of time series features, this equates to data that essentially does not depend on when it was observed. The nonstationarities of most features of interest can usually be broken down into trend components and seasonal components. Seasonal cycles could be yearly, quarterly, daily, or even lengths of time that seem arbitrary at first glance. Fortunately, there are a number of open source approaches to identifying and decomposing those cyclical, time-dependent components of a model's features, so that fluctuations in the remaining signal can be disentangled from the expected drift.
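As a toy illustration of such a decomposition, the sketch below hand-rolls a trend-plus-seasonality removal with NumPy on synthetic monthly data; the series and its components are made up for illustration, and a real pipeline would lean on a decomposition library rather than this minimal version:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(240)                       # e.g. 20 years of monthly observations
trend = 0.05 * t                         # slow upward drift
seasonal = 10 * np.sin(2 * np.pi * t / 12)  # yearly cycle
noise = rng.normal(0, 1, t.size)
y = trend + seasonal + noise

# Step 1: remove the linear trend via least squares
slope, intercept = np.polyfit(t, y, 1)
detrended = y - (slope * t + intercept)

# Step 2: remove the seasonal component by subtracting each month's mean
monthly_mean = np.array([detrended[m::12].mean() for m in range(12)])
residual = detrended - monthly_mean[t % 12]
```

The `residual` series is what you would feed to a drift metric: its variation is the part of the signal not explained by the expected trend and seasonality.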

Approaches to Addressing Nonstationarities in Data

One tool that has proven particularly useful and flexible in addressing nonstationarities is the open-source Python package Darts, whose stated primary goal is “simplifying the whole time series machine learning experience.” Darts includes functionality that helps detect and extract nonstationarities in data. Users can feed in a raw time-dependent feature and retrieve a transformed one: a time-independent feature that takes into account where (or really, when) a value occurs in time, so that values separated in time can still be meaningfully compared with standard metrics, such as the drift metrics described at the start of this guide for evaluating, or reevaluating, your model’s performance.

There are other common approaches to forcing stationarity on time series features. Many fall under the umbrella of differencing: tracking the difference between consecutive observations rather than the observations themselves. In practice, "consecutive" could mean consecutive days, weeks, quarters, or years; essentially any lag works, as long as it is consistently applied. Tracking these differences often accounts for the trends and seasonal tendencies of features in such a way that the remaining quantities become informative signals which can be tracked for drift.
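A minimal sketch of both flavors of differencing on synthetic monthly data (the series, its trend, and its period are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(120)  # ten years of monthly observations
y = 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, t.size)

# First-order differencing: removes the linear trend,
# but the yearly cycle survives (as a phase-shifted sine)
first_diff = np.diff(y)

# Seasonal (lag-12) differencing: removes the yearly cycle,
# leaving a roughly constant offset from the trend plus noise
seasonal_diff = y[12:] - y[:-12]
```

In practice the two are often combined (difference at lag 12, then at lag 1) when a feature exhibits both a trend and a seasonal cycle.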

It is worth noting that many modern approaches, particularly neural network architectures, can natively handle nonstationary data. Though the removal of nonstationarities may not be a critical part of the preprocessing pipeline for these models, it may still be worth creating stationary versions of those features, particularly when forecasting horizons are distant and unexpected trends in input features can serve as a leading indicator of performance.

Returning to our original motivation, the key idea is that by accounting for expected drift in our time-dependent features, we can become sensitive to unexpected drift and use it to guide decisions about evaluating, or reevaluating, our model. Fundamentally, this approach is the difference between a short line for a root beer float on a temperate day in summer and a long line for that same float on an unseasonably temperate day in winter.


What specific performance metrics are used to assess time series models before and after accounting for expected drift?

Commonly used metrics for assessing time series models include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Symmetric Mean Absolute Percentage Error (sMAPE). These metrics quantify forecast accuracy by comparing the values predicted by the model to the actual values observed in the time series. The metrics themselves are the same before and after accounting for expected drift; what changes is the signal being monitored (raw features versus their stationary residuals), so in either case they highlight discrepancies between predicted and observed outcomes and guide refinement of the forecasting model.
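These error metrics are simple enough to state directly; here is a sketch of each with NumPy (the example arrays are arbitrary):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (undefined when y_true contains zeros)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def smape(y_true, y_pred):
    """Symmetric MAPE: bounds the percentage error by averaging the scales."""
    return float(np.mean(2 * np.abs(y_pred - y_true) /
                         (np.abs(y_true) + np.abs(y_pred))) * 100)

y_true = np.array([100.0, 110.0, 120.0])
y_pred = np.array([ 90.0, 115.0, 130.0])
```

Note that MAPE is asymmetric and breaks down near zero values, which is why sMAPE is often preferred for series that cross or approach zero.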

How does one quantify the level of acceptable drift before deciding to retrain a time series model?

Quantifying the level of acceptable drift before deciding to retrain a time series model involves setting explicit threshold values for data drift metrics. These thresholds are informed by historical model performance, industry benchmarks, or the specific objectives of the forecasting task. A sensitivity analysis can be particularly useful for understanding how various levels of drift affect the model's predictive performance. Once thresholds are established, exceeding them signals that the model's predictions are likely losing accuracy and that retraining may be warranted. This approach helps balance the costs of model updates against the risks of relying on outdated predictions.

What are the limitations or challenges of using the Darts package to address nonstationarities in time series data?

The Darts package, while a valuable tool for addressing nonstationarities in time series data, has certain limitations, such as its computational demands on extensive datasets or intricate time series. Its breadth can also deter users new to Python or those unfamiliar with advanced time series analysis techniques. While Darts offers a wide range of functionality for decomposing and modeling time series data, it may not suit every type of nonstationary data and may not always be the most efficient approach, particularly for datasets characterized by high irregularity or noise. Additionally, integrating Darts into broader ML pipelines can be challenging, especially when those pipelines use other programming languages or unusual computational environments.