ML Model Monitoring

Detecting Unexpected Drift in Time Series Features

Aside from crystal balls, time series models have become the predominant approach to predicting the future. Often implemented at organizations with access to large amounts of historical data, these models leverage time-dependent patterns and trends in input features (like economic indicators, weather patterns, even heartbeats) to forecast into the future. Time series models have use cases spanning nearly every data type and industry, yet a great deal of uncertainty remains when putting these models into production, and particularly when monitoring them.

The Challenges of Monitoring Time Series Models

Validating time series models on historical data is standard practice before putting them into production. Monitoring them once they are in production is trickier: forecasting horizons may stretch far into the future, so the ground truth that traditional validation and performance metrics depend on may take a prohibitively long time to become available, exposing organizations to the risks of underperforming models in the meantime. Many data scientists address this limitation by using data drift metrics on input features as a leading indicator of model performance and as a key input to deciding if and when a model should be retrained. However, time series models often contain features that naturally drift over time and display seasonality, meaning that some data drift is expected and is no longer, on its own, a principled justification for retraining. In the remainder of this article, I will discuss an approach that preserves data drift metrics as an informative leading indicator of model performance: first account for the expected drift of input features, then apply traditional drift metrics to the residuals of those time-dependent features.
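To make the idea of a drift metric concrete, here is a minimal sketch of one common choice, the Population Stability Index (PSI), computed with NumPy. The bin count, the synthetic data, and any alert threshold you would pair with it are illustrative assumptions, not prescriptions:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a current sample."""
    # Bin edges come from the reference (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)   # reference window
same = rng.normal(0, 1, 5000)       # no drift: PSI near zero
shifted = rng.normal(0.5, 1, 5000)  # mean shift: PSI clearly elevated
```

Applied to a raw seasonal feature, a metric like this will fire every cycle; applied to the residuals after expected drift is removed, it fires only on unexpected drift.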

To get a better intuition for this approach, let’s squeeze in an ice cream break. The ice cream shop near my house is open year round. It is situated right across the street from one of the city’s largest playgrounds and is a go-to for parents needing to make good on bribes they’ve offered their children. The line for this shop can sometimes wrap around the block. This past summer, I started to fancy myself a sweet-toothed soothsayer; before even leaving my house, I could tell how long the wait would be based on the weather that day. At 90 degrees Fahrenheit, a single scoop of pistachio would take upwards of an hour, while at 70 degrees, the place was practically deserted. My dessert-line wait-time model was working just fine, until temperatures started to decline. On a late winter’s day this past year, my model failed me. A meager lunch of leftovers and an unseasonably warm (70-degree) day had me craving a banana split, so I dusted off my old model and predicted that I could be back home with plenty of time to spare before my mail carrier arrived with my eagerly awaited copy of Designing Machine Learning Systems. It turns out a 70-degree day in winter is not at all the same as a 70-degree day in summer, and I ended up spending close to 40 minutes in line, missing my book delivery. I had fallen victim to a case of mistaken stationarity.

Stationarity in Time Series Data

Dealing with time series problems more pressing than my quest for a banana split, lickety-split, usually warrants a machine learning algorithm. Many common approaches (known as autoregressive models) build forecasts on the fact that quantities close in time are often similar (i.e., yesterday’s wait time is likely pretty similar to today’s). More recently, neural network architectures designed to handle long sequences of data (like natural language) have emerged as a popular approach for time series forecasting. These approaches all involve a series of preprocessing steps, some of which we can use to establish a notion of expected drift in order to isolate the informative signal that is unexpected drift.

Data cleaning and feature selection/extraction are so commonplace in the preprocessing pipeline at this point that they will not be discussed here. Rather, special attention will be devoted to a potential thorn in the side of anyone attempting to glean insights from time series data: the removal of nonstationarities. A stationary series has a constant mean and variance, and an autocovariance that depends only on the lag between observations; in the context of time series features, this equates to data that essentially does not depend on when it was observed. The nonstationarities of most features of interest can usually be broken down into trend components and seasonal components. Seasonal cycles could be yearly, quarterly, daily, or even lengths of time that seem arbitrary at first glance. Fortunately, there are a number of open source approaches to identifying and decomposing those cyclical, time-dependent components of a model's features, so that fluctuations in the remaining signal can be disentangled from the expected drift.
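As a toy illustration of such a decomposition, the sketch below hand-rolls a trend-plus-seasonality removal with NumPy on synthetic monthly data; the series and its components are made up for illustration, and a real pipeline would lean on a decomposition library rather than this minimal version:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(240)                       # e.g. 20 years of monthly observations
trend = 0.05 * t                         # slow upward drift
seasonal = 10 * np.sin(2 * np.pi * t / 12)  # yearly cycle
noise = rng.normal(0, 1, t.size)
y = trend + seasonal + noise

# Step 1: remove the linear trend via least squares
slope, intercept = np.polyfit(t, y, 1)
detrended = y - (slope * t + intercept)

# Step 2: remove the seasonal component by subtracting each month's mean
monthly_mean = np.array([detrended[m::12].mean() for m in range(12)])
residual = detrended - monthly_mean[t % 12]
```

The `residual` series is what you would feed to a drift metric: its variation is the part of the signal not explained by the expected trend and seasonality.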

Approaches to Addressing Nonstationarities in Data

One tool that has proven particularly useful and flexible in addressing nonstationarities is the open-source Python package Darts, whose stated primary goal is “simplifying the whole time series machine learning experience.” Darts includes functionality that helps detect and extract nonstationarities in data. Users can feed in a raw time-dependent feature and retrieve a transformed one: a time-independent feature that takes into account where (or really, when) a value occurs in time, so that values separated in time can still be meaningfully compared with standard metrics, such as the drift metrics described at the start of this guide for evaluating, or reevaluating, your model’s performance.

There are other common approaches to forcing stationarity on time series features. Many fall under the umbrella of differencing: tracking the difference between consecutive observations rather than the observations themselves. In practice, "consecutive" could mean consecutive days, weeks, quarters, or years; essentially any lag works, as long as it is consistently applied. Tracking these differences often accounts for the trends and seasonal tendencies of features in such a way that the remaining quantities become informative signals which can be tracked for drift.
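A minimal sketch of both flavors of differencing on synthetic monthly data (the series, its trend, and its period are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(120)  # ten years of monthly observations
y = 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, t.size)

# First-order differencing: removes the linear trend,
# but the yearly cycle survives (as a phase-shifted sine)
first_diff = np.diff(y)

# Seasonal (lag-12) differencing: removes the yearly cycle,
# leaving a roughly constant offset from the trend plus noise
seasonal_diff = y[12:] - y[:-12]
```

In practice the two are often combined (difference at lag 12, then at lag 1) when a feature exhibits both a trend and a seasonal cycle.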

It is worth noting that many modern approaches, particularly neural network architectures, can natively handle nonstationary data. Though the removal of nonstationarities may not be a critical part of the preprocessing pipeline for these models, it may still be worth creating stationary versions of those features, particularly when forecasting horizons are distant and unexpected trends in input features can serve as a leading indicator of performance.

Returning to our original motivation, the key idea is that by accounting for expected drift in our time-dependent features, we can become sensitive to unexpected drift and use it to guide decisions about evaluating, or reevaluating, our model. Fundamentally, this approach is the difference between a short line for a root beer float on a temperate day in summer and a long line for that same float on an unseasonably temperate day in winter.


What specific performance metrics are used to assess time series models before and after accounting for expected drift?

Commonly used metrics for assessing time series models include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Symmetric Mean Absolute Percentage Error (sMAPE). These metrics quantify forecast accuracy by comparing the values predicted by the model to the actual values observed in the time series. The metrics themselves are the same before and after accounting for expected drift; what changes is the signal being monitored (raw features versus their stationary residuals), so in either case they highlight discrepancies between predicted and observed outcomes and guide refinement of the forecasting model.
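These error metrics are simple enough to state directly; here is a sketch of each with NumPy (the example arrays are arbitrary):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (undefined when y_true contains zeros)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def smape(y_true, y_pred):
    """Symmetric MAPE: bounds the percentage error by averaging the scales."""
    return float(np.mean(2 * np.abs(y_pred - y_true) /
                         (np.abs(y_true) + np.abs(y_pred))) * 100)

y_true = np.array([100.0, 110.0, 120.0])
y_pred = np.array([ 90.0, 115.0, 130.0])
```

Note that MAPE is asymmetric and breaks down near zero values, which is why sMAPE is often preferred for series that cross or approach zero.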

How does one quantify the level of acceptable drift before deciding to retrain a time series model?

Quantifying the level of acceptable drift before deciding to retrain a time series model involves setting explicit threshold values for data drift metrics. These thresholds are informed by historical model performance, industry benchmarks, or the specific objectives of the forecasting task. A sensitivity analysis can be particularly useful for understanding how various levels of drift affect the model's predictive performance. Once thresholds are established, exceeding them signals that the model's predictions are likely losing accuracy and that retraining may be warranted. This approach helps balance the costs of model updates against the risks of relying on outdated predictions.

What are the limitations or challenges of using the Darts package to address nonstationarities in time series data?

The Darts package, while a valuable tool for addressing nonstationarities in time series data, has certain limitations, such as its computational demands on extensive datasets or intricate time series. Its breadth can also deter users new to Python or those unfamiliar with advanced time series analysis techniques. While Darts offers a wide range of functionality for decomposing and modeling time series data, it may not suit every type of nonstationary data and may not always be the most efficient approach, particularly for datasets characterized by high irregularity or noise. Additionally, integrating Darts into broader ML pipelines can be challenging, especially when those pipelines use other programming languages or unusual computational environments.