MLOps and Continuous Training
When embarking on an individual data science project, documenting, standardizing, and tracking may not seem like a top priority. However, as data science teams expand and numerous teams emerge within an organization, along with the development of centralized MLOps systems, the importance of establishing standards and tracking systems becomes evident. These measures allow data science teams to work efficiently and avoid costly mistakes.
A well-designed MLOps system should track all the information and components needed to retrain a model from scratch such that it is approximately identical to the original model (same training data, model algorithm, hyperparameters, etc.). Most of this is done through model registries, feature stores, CI/CD and continuous training (CT) tools such as Dagshub, and other versioning systems. With these systems, model artifacts, data, and code versions can be tracked effectively, which enables continuous retraining and, if necessary, emergency rollbacks of production models.
Within the model pipeline, two crucial components must be carefully maintained and tracked across iterations to facilitate this seamless orchestration.
1. Model: What is the current model deployed in production and what conditions are needed to retrain it or pass inference data into it?
To answer these questions, we need to keep track of several key elements:
- The saved model artifact, which serves as the core representation of the trained model.
- The predict function, which handles the model’s predictions.
- The software and environment requirements necessary for running the training and prediction processes.
- The code employed to train the model, encapsulating the very essence of its development.
2. Data: What data was utilized for training and validation?
This aspect necessitates monitoring two data subsets:
- The training data, which molds the model’s understanding and helps it glean patterns and insights.
- The test and validation data, which enables the evaluation of the model’s performance and generalization abilities.
However, in order to know how to bring these pieces together, more information is needed: the model schema.
What is a model schema?
A model schema describes the relationship between a dataset and the model.
A model schema is much like a database schema in that it outlines the structure and relevant metadata of a dataset. In the case of a model schema, this outline describes the relationship between the dataset and a model. It includes information such as which data columns are used as direct inputs to the model, how model inputs and outputs are structured, and what bounds on data values are expected.
The ultimate goal of a model schema is to allow a user to load and reconstruct a dataset as it was used during the initial training and validation phase. This in turn enables easier model retraining and rollbacks, data validation, and any further analysis or model validation that is needed. Data scientists can reload and explore their datasets and model outputs without the need to refer back to training code. Meanwhile, MLOps administrators can define robust model schema standards and establish automated systems for data and model monitoring and validation, streamlining the overall process.
In the following sections, I describe what information a model schema should contain, walk through a concrete example of how one may be structured, and discuss how adopting model schemas can standardize work across an organization and strengthen the overall MLOps system. This post uses a tabular classification model as its running example, but the principles apply to a wide range of model types and data domains.
What information should be captured in the model schema?
Exactly what information a model schema needs will depend on the model type, use case, size of the organization, and how the schema is intended to be used. In general, best practice is to standardize model schemas across an organization to facilitate automation. The recommendations below are for a tabular model and should be treated as a starting point:
1. Data Columns
How are the data columns used and how does each column relate to the model?
- Model inputs: Actual input values to the model
- Model outputs: Predicted probabilities, logits, and/or predicted class
- Non-input data: Any data that is relevant to the model but not used as model input, such as ID columns, timestamps, or cohort features of interest (for example, race or gender for bias tracking)
- Model target: The ground truth column(s) that the model is predicting
For each column type, consider what information would be needed for a colleague to reload the data and evaluate the model without access to the code used to train the model. For model inputs, non-input data, and target columns, include basic information for each column:
- Column name
- Column index: Where the column is located in the DataFrame it is saved in
- Data type: What the expected datatype for the column is
- Categorical or continuous: This will tell future users how the column should be treated during analysis
- Categories (if applicable): What are the expected categories if the column is categorical?
- Value bounds: What are the expected bounds if the column is numerical?
- Cohort/segment column (if applicable): Whether or not the column denotes data cohorts or segments of interest, used to compare model performance across groups
- Sample ID or timestamp column: Whether the column is a sample ID or timestamp column (this may only be applicable in some cases)
For model output columns, it is useful to include both raw model outputs and the final prediction made by the model. In addition to the fields outlined above, columns with raw model outputs may require a field to map raw outputs to the final model output. If the model is a multi-label model, it is also important to note which prediction task the column corresponds to.
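As a rough illustration of the per-column fields listed above, the sketch below builds one schema entry per column as a plain Python dictionary. The helper `make_column_entry`, the field names, the column positions, the bounds, the category codes, and the `maps_to` field are all assumptions made for this example, not an established standard.

```python
from typing import Any, Dict, List, Optional, Tuple


def make_column_entry(
    name: str,
    index: int,
    dtype: str,
    kind: str,                                  # "categorical" or "continuous"
    categories: Optional[List[Any]] = None,     # expected categories, if categorical
    bounds: Optional[Tuple[float, float]] = None,  # (min, max), if numerical
    is_cohort: bool = False,                    # used to compare performance across groups
    is_id: bool = False,                        # sample ID column
    is_timestamp: bool = False,                 # timestamp column
) -> Dict[str, Any]:
    """Build one schema entry. Field names are illustrative, not a standard."""
    if kind not in ("categorical", "continuous"):
        raise ValueError(f"unknown column kind: {kind!r}")
    if kind == "continuous" and categories is not None:
        raise ValueError("continuous columns should not list categories")
    return {
        "name": name,
        "index": index,
        "dtype": dtype,
        "kind": kind,
        "categories": categories,
        "bounds": list(bounds) if bounds is not None else None,
        "is_cohort": is_cohort,
        "is_id": is_id,
        "is_timestamp": is_timestamp,
    }


# Hypothetical entries: positions, bounds, and category codes are placeholders.
age_entry = make_column_entry("age", index=2, dtype="int64",
                              kind="continuous", bounds=(16, 99))
group_entry = make_column_entry("group", index=5, dtype="int64",
                                kind="categorical", categories=[1, 2],
                                is_cohort=True)

# A raw-output column additionally records how it maps to the final prediction.
proba_entry = make_column_entry("proba_1", index=0, dtype="float64",
                                kind="continuous", bounds=(0.0, 1.0))
proba_entry["maps_to"] = {"column": "prediction", "rule": "threshold",
                          "threshold": 0.5}
```

Keeping every entry to plain dictionaries, lists, strings, and numbers means the schema can later be written to a text format without custom serialization code.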
2. Data Access
Details about how to access the data and how it is structured. The specifics of what is needed here may vary depending on how data is managed within the organization. Some examples of useful content are given below:
- Data files: For each data file, give the filepath, which components of the data are included (e.g. [‘model inputs’, ‘model target’]), expected column names, the join key used to integrate the data with other sources, and any other information needed to load the data file.
- Data splits: If test or validation data is included in the data files, provide details about training vs test indexes. If training and validation data is saved in separate files, describe where each set can be located.
- Inference data: If applicable, provide details on where inference data is to be stored and how it is formatted if different from training data.
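One possible layout for this data access information is sketched below. The file paths, the `sample_id` join key, the key names, and the choice to store splits as half-open row ranges are all assumptions for illustration; an organization with a data catalog or feature store would likely record different details.

```python
# Hypothetical data access section; every path and key name is a placeholder.
data_access = {
    "data_files": [
        {
            "path": "data/model_data.csv",
            "components": ["model inputs", "non-input data", "model target"],
            "join_key": "sample_id",
        },
        {
            "path": "data/model_outputs.csv",
            "components": ["model outputs"],
            "join_key": "sample_id",
        },
    ],
    # Splits stored as half-open row ranges [start, stop) within each file.
    "data_splits": {"train": [0, 800], "test": [800, 1000]},
    "inference_data": None,  # not applicable in this sketch
}


def files_with(component: str, access: dict) -> list:
    """Return the paths of all data files that contain the given component."""
    return [f["path"] for f in access["data_files"]
            if component in f["components"]]


def rows_for_split(split: str, access: dict) -> range:
    """Return the row positions that belong to a named split."""
    start, stop = access["data_splits"][split]
    return range(start, stop)
```

With this layout, a tool that only needs the ground truth can call `files_with("model target", data_access)` instead of hard-coding a filename.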
As an example, I will go through a model I trained on a modified version of the folktables dataset ACSTravelTime. The objective of the model is to predict whether an individual has a commute of more than 20 minutes, as indicated by the binary target “JWMNP.”
The dataset contains the following features:
- SERIALNO: person serial number
- WAGP: Wages or salary income past 12 months
- AGEP: Age of the householder
- SCHL: Educational attainment
- MAR: Marital status
- SEX: Sex
- DIS: Disability recode
- MIG: Mobility status (lived here 1 year ago)
- RELP: Relationship
- RAC1P: Recoded detailed race code
- PUMA: Public use microdata area code
- CIT: Citizenship status
- OCCP: Occupation recode
- JWTR: Means of transportation to work
- POWPUMA: Place of work PUMA
- POVPIP: Income-to-poverty ratio recode
The dataset is derived from the American Community Survey Public Use Microdata Sample (PUMS) and full documentation can be found here.
Not all of the features in this dataset were used as input to the model. SERIALNO was used only as an identifier and join key during data manipulation. RAC1P and SEX were not used as model input, but were still important to track in order to evaluate model fairness. The features include a combination of numerical and categorical data.
I used data from the years 2014 and 2015 as training data and reserved data from 2016 to test my model.
Prior to training my model, I defined each of the relevant groups of columns in my code. I also defined data bounds for numerical features, and which columns were to be used to track specific cohorts of interest within the dataset.
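A minimal sketch of what such definitions might look like is given here. The column names come from the feature list above, but the split into numerical and categorical inputs and the numerical bounds are assumptions made for this example, as is the helper `split_record`.

```python
ID_COLUMN = "SERIALNO"
TARGET_COLUMN = "JWMNP"

# Tracked alongside the data but never passed to the model.
NON_INPUT_COLUMNS = ["SERIALNO", "RAC1P", "SEX"]
# Columns used to compare model performance across groups.
COHORT_COLUMNS = ["RAC1P", "SEX"]

# Assumed typing of the remaining features (not confirmed by the source).
NUMERICAL_INPUTS = ["WAGP", "AGEP", "POVPIP"]
CATEGORICAL_INPUTS = ["SCHL", "MAR", "DIS", "MIG", "RELP", "PUMA",
                      "CIT", "OCCP", "JWTR", "POWPUMA"]
INPUT_COLUMNS = NUMERICAL_INPUTS + CATEGORICAL_INPUTS

# Placeholder bounds; real bounds should come from the data documentation.
NUMERICAL_BOUNDS = {
    "WAGP": (0, 1_000_000),
    "AGEP": (16, 99),
    "POVPIP": (0, 501),
}


def split_record(record: dict):
    """Split one row into model inputs, non-input attributes, and the target."""
    inputs = {c: record[c] for c in INPUT_COLUMNS}
    non_inputs = {c: record[c] for c in NON_INPUT_COLUMNS}
    return inputs, non_inputs, record[TARGET_COLUMN]
```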
This makes it easy to separate input and non-input data without losing track of non-input attributes.
I trained a gradient boosting classifier and, being satisfied with the results, saved the model as a pickle file.
While this is usually the end of the story in a data science blog post, the next few steps are key to maintaining and improving upon a model in a real-world production environment.
While we already have the data saved somewhere (in this case, in the U.S. census database), it is important to keep track of the exact data used to train and validate a model. This especially comes into play when the time comes to retrain a model or roll it back to a previous version. It may also be required for regulatory purposes.
For my ACSTravelTime model, I reorganized all of my input and non-input data for both my train and test sets into one DataFrame, keeping track of the indexes for each dataset.
Model outputs on both training and test datasets are also important to maintain, as this facilitates comparisons of multiple versions of a model over time. I saved raw prediction probabilities and the final prediction for my ACSTravelTime model and combined training and test data into one DataFrame. I saved both DataFrames as .csv files.
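The idea of stacking training and test rows into one table while remembering which rows belong to which split can be sketched with the standard library alone. This is an illustration rather than the code used for the model: the toy rows, the output column names `proba_1` and `prediction`, and the helper names are assumptions, and plain lists of dictionaries stand in for DataFrames.

```python
import csv
import io


def combine_splits(train_rows, test_rows):
    """Stack train and test rows; return the rows plus half-open index ranges."""
    train_rows, test_rows = list(train_rows), list(test_rows)
    combined = train_rows + test_rows
    split_index = {
        "train": [0, len(train_rows)],
        "test": [len(train_rows), len(combined)],
    }
    return combined, split_index


def to_csv_text(rows):
    """Serialize a non-empty list of dict rows to CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy rows standing in for model outputs (all values are made up).
train_out = [
    {"SERIALNO": "a1", "proba_1": 0.91, "prediction": 1},
    {"SERIALNO": "a2", "proba_1": 0.12, "prediction": 0},
]
test_out = [{"SERIALNO": "b1", "proba_1": 0.67, "prediction": 1}]

rows, split_index = combine_splits(train_out, test_out)
csv_text = to_csv_text(rows)
```

The `split_index` dictionary is the piece worth keeping: stored next to the file, it lets anyone recover the exact train and test rows later.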
Model Schema Architecture
I designed a model schema for this model with the objective of making it easy to reload the dataset, define data validation standards, and evaluate and monitor the model’s performance. The schema is first structured as a Python dictionary and then saved as a .yaml file, which can easily be reloaded along with the dataset.
As described above, I divided my model schema into 2 sections. In the data_access section, I provided all the information needed to load the dataset and run basic validation using just the model schema file.
In the data_columns section, I provided more granular details for each column, which would allow for full data validation, model evaluation, and model monitoring. For each input and non-input column, there are details about data bounds and categories, the data type of the column, and whether the column should be used to track cohort performance.
The model_outputs section details which columns correspond to the final prediction and the predicted probability for each class.
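To make the overall shape concrete, here is a heavily abbreviated sketch of a schema dictionary and how it could be written to and read from YAML. It is not the full schema file: the file paths, field names, bounds, category codes, and output column names are placeholders, and nesting `model_outputs` under `data_columns` is an assumption based on the column types listed earlier. The save and load helpers assume the third-party PyYAML package is installed.

```python
def build_schema() -> dict:
    """Return a small illustrative schema; all paths and values are placeholders."""
    return {
        "data_access": {
            "data_files": [
                {"path": "acs_travel_time_data.csv",
                 "components": ["model_inputs", "non_input_data", "model_target"],
                 "join_key": "SERIALNO"},
                {"path": "acs_travel_time_outputs.csv",
                 "components": ["model_outputs"],
                 "join_key": "SERIALNO"},
            ],
            "data_splits": {"train": [0, 2], "test": [2, 3]},
        },
        "data_columns": {
            "model_inputs": {
                "AGEP": {"dtype": "int64", "kind": "continuous",
                         "bounds": [16, 99]},
            },
            "non_input_data": {
                "SERIALNO": {"dtype": "str", "is_id": True},
                "SEX": {"dtype": "int64", "kind": "categorical",
                        "categories": [1, 2], "is_cohort": True},
            },
            "model_target": {
                "JWMNP": {"dtype": "int64", "kind": "categorical",
                          "categories": [0, 1]},
            },
            "model_outputs": {
                "final_prediction": "prediction",
                "class_probabilities": {0: "proba_0", 1: "proba_1"},
            },
        },
    }


def save_schema(schema: dict, path: str) -> None:
    """Write the schema to a YAML file (requires PyYAML)."""
    import yaml  # third-party: pip install pyyaml
    with open(path, "w", encoding="utf-8") as f:
        yaml.safe_dump(schema, f, sort_keys=False)


def load_schema(path: str) -> dict:
    """Read a schema back from a YAML file (requires PyYAML)."""
    import yaml
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


schema = build_schema()
```

Calling `save_schema(schema, "model_schema.yaml")` followed by `load_schema("model_schema.yaml")` should round-trip the dictionary, since it contains only plain dictionaries, lists, strings, numbers, and booleans.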
Using my model schema file, I am able to reload my data from scratch and understand how my training data was structured, what data bounds I should expect, and which columns were of interest when tracking data subsets. Along with other versioning tools, I could use this to help me retrain a new version of my model, roll back to a previous version, and perform monitoring and evaluation on my model.
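As one example of the checks a schema makes possible, the sketch below validates rows against per-column bounds and categories and lists the cohort columns. The function names, the field names (`bounds`, `categories`, `is_cohort`), and the sample values are assumptions for illustration; rows are plain dictionaries rather than a DataFrame so the example has no dependencies.

```python
def validate_rows(rows, columns):
    """Return (row_number, column, message) for every schema violation found."""
    problems = []
    for i, row in enumerate(rows):
        for name, spec in columns.items():
            if name not in row:
                problems.append((i, name, "missing column"))
                continue
            value = row[name]
            bounds = spec.get("bounds")
            if bounds is not None and not (bounds[0] <= value <= bounds[1]):
                problems.append((i, name, f"value {value} outside {bounds}"))
            categories = spec.get("categories")
            if categories is not None and value not in categories:
                problems.append((i, name, f"unexpected category {value}"))
    return problems


def cohort_columns(columns):
    """Return the names of columns flagged for cohort tracking."""
    return [name for name, spec in columns.items() if spec.get("is_cohort")]


# Hypothetical column specs and rows; bounds and codes are placeholders.
columns = {
    "AGEP": {"bounds": [16, 99]},
    "SEX": {"categories": [1, 2], "is_cohort": True},
}
rows = [
    {"AGEP": 35, "SEX": 1},    # passes both checks
    {"AGEP": 140, "SEX": 3},   # out of bounds and unexpected category
]

problems = validate_rows(rows, columns)
```

The same function could be pointed at incoming inference data, which is one way a schema can feed an automated monitoring job.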
The complete notebook used in this example is available here.
The full model schema file is available here.
The Value of Model Schemas
At first glance, model schemas might appear superfluous, possibly dismissed as an afterthought by an individual data scientist. However, within the organizational landscape, they are a vital part of maintaining consistency, efficiency, and automation in an MLOps system.
They provide an outline of exactly how data is used in a model, while also establishing a framework for comprehensive evaluation and continuous monitoring. Model schemas can fit easily into an MLOps system as a model metadata artifact, and be saved within model registries or other versioning systems. This level of standardization allows data science and MLOps tools to be automated and speeds up the process of retraining, updating, and monitoring models. It also reduces the reliance on institutional knowledge, facilitating hand-offs between team members.
With a shared understanding captured in model schemas, data scientists and MLOps practitioners can automate more of their tools and processes. This streamlines the workflow, reduces the risk of human error, and makes collaboration across the organization’s data science teams easier.