From simple chatbots to document classifiers to generative models like GPT-3, natural language processing models are seemingly everywhere these days. NLP models are powerful tools for processing unstructured text data—but with great power comes great responsibility. If you’re not monitoring your NLP models just as you would your tabular models, you can overlook a number of sticky issues that could quickly become billion-dollar problems.
Here are a few things that any organization deploying NLP models into production should be doing to ensure that those models continue to perform as expected.
Monitor for data drift to prevent drops in performance.
An NLP model, just like any other machine learning model, is trained on a training dataset that represents the world at a point in time—and as the world changes and the input data coming into your language model starts to change, your model will start to run into performance issues. Over time, your NLP model performance will degrade if you’re not proactively monitoring for and correcting data drift as it occurs.
Monitoring NLP models for data drift involves comparing the statistical similarity of new input documents to your training set documents. As input documents shift in their typical word-use, you might need to update your model to account for the linguistic patterns. Understanding when and where drift is occurring is essential to maintaining the integrity of your NLP models over time.
The Arthur platform monitors NLP models for data drift and automatically alerts you when your input documents start drifting beyond acceptable bounds.
Monitor for bias to protect against discrimination.
There are several methods of detecting and mitigating bias, which we cover extensively in blog posts, Making Models More Fair and A Crash Course in Fair NLP for Practioners. In addition to the techniques laid out in those blog posts, there are a few additional ways to track potential bias in your machine learning models.
First, you can track the performance of your NLP models on input documents partitioned by sensitive attributes that you may know. For example, you can track whether your medical document classifier is more or less accurate for documents from medical visits of men versus women, or Black patients versus white patients. Uncovering differences in accuracy and other performance measures across different subgroups can help you identify—and fix—unfair model bias. Check out our Data Drift Detection Part II: Unstructured Data in NLP and CV blog post. The Arthur platform offers this type of performance-bias analysis for natural language and tabular models, as well as the ability to partition by multiple attributes at a time to provide you more granular insights into potential biases
Another step is to take proactive steps to debias language outputs from your models. We at Arthur are working on building out more features to do this in our platform—stay tuned for more!
Use token-level explanations to understand black box NLP model behavior.
Whether your NLP model is a “bag of words” model (the position of the words doesn’t matter) or a sequence-based model (the context and position of words do matter), getting a token-level explanation of your model output is incredibly useful and important for understanding why your model might be getting something wrong, and understanding how to take the right steps to fix it.
For example, if we are using a medical document classifier to predict different document types in a given set of medical records, we can use this technique to understand the specific words that resulted in a given document being classified as a document about cardiovascular and pulmonary issues.
On the far right-hand side, we see the top words contributing to the predicted class in green and words contributing against it in red. We see this illustrated in the document itself. This is useful for a domain expert to understand in human context the “real world” value of the highlighted words. For example, we see the word “procedure” appears at the top of the list of explanations, meaning it most significantly impacted the (correctly) predicted classification of Cardiovascular/Pulmonary.
If we click through, on the left-hand side, some of the other incorrect predictions, we begin to notice this word is also a top influencer in those predictions as well. See Urology, below:
Above, we see that “procedure” was the second most influential word in this classification prediction.
This is where we see the importance of domain expertise and human involvement. Given the context of a healthcare system, it is easy for us to imagine how the word “procedure” might appear frequently across medical specialties. Perhaps it’s worth exploring a different model that accounts for the positioning of particular words, or perhaps we simply need to retrain how it treats certain words based on what we know is most influencing predictions. In any case, we see how providing this analysis for each prediction and word is crucial information.
Monitor your NLP Models with Arthur.
The Arthur platform has recently released extensive monitoring support for NLP models, including NLP data drift detection, token-level explainability that provides insight into the key drivers in NLP classification model predictions, and bias detection for NLP models.
If you’re deploying NLP models into production and are looking for a solution for monitoring those models over time, we’d love to connect and show you how Arthur can help. Request a demo today.