Performance monitoring has always been at the heart of Arthur’s mission and offering. Teams don’t put models into production for no reason, yet every “state of ML” report still finds that ML teams struggle to communicate performance to their external stakeholders. Our mission has always been to help teams create workflows and toolkits that not only use reporting to build better models, but also empower them to communicate with external stakeholders.
Nowhere is that mission more important in 2023 than in generative AI. Organizations are terrified of falling behind and are scrambling to implement LLMs into their processes. But it’s just as important for teams to think critically about how they plan to evaluate these systems and communicate the findings to external stakeholders.
In this blog post, we’ll cover some of the core challenges teams run into when they try to evaluate generative text outputs. Then we’ll give a brief overview of metrics commonly discussed in the research community and explain why we don’t see them working in production environments. Finally, we’ll offer our suggestions for how to evaluate and monitor generative text in ways that produce actionable outcomes.
Core Challenges of Evaluating Generated Text
1. There isn’t one ground truth output.
In open text generation, the output of the model is unstructured text, so there is no ground truth label in the same way we might think of evaluating a traditional classification or regression task. One solution for this gap is to ask humans to accomplish the same task asked of a generative text model, and use the human-generated text as ground truth or the “correct” solution to the task. This is time-consuming and infeasible to scale to the needs of continuously evaluating a production application.
Even given infinite resources, there is no clear definition of what the best piece of text is.
An evaluation procedure for a generative text model should properly take into account that the quality of a response depends on its intended context, use case, and audience. For example, a language model that answers science questions should strike a balance between the simplicity and the thoroughness of its answers based on the user at inference time, so evaluation metrics should flexibly account for this contextual shift.
2. We lack consistent automated metrics for comparing two pieces of text.
Ideally, we’d like an automated score that could help us compare model outputs, or select between models that produced a given set of inferences. While many metrics have been proposed, it remains difficult to select which metric, and which associated hyperparameters, best suit a specific use case. Different metrics behave differently on underlying qualities of text such as tone, relevance, and truthfulness, as well as coherence, grammar, and lack of repetition. In the next section, we’ll outline some options for automated metrics.
What Automated Metrics Exist?
When we have a reference text, there are some metrics we can explore for comparing the model generated text to the reference text. In this section, we’ll discuss the benefits and limitations of some commonly used metrics.
N-gram Precision Metrics: BLEU and ROUGE
The first class of metrics directly compares the overlap of tokens in the golden reference text to the tokens in the model generated text. A popular example of these metrics is the BLEU score. BLEU was originally developed for machine translation and compares the number of n-grams in the machine generated text that also exist in the reference text.
The first term, a brevity penalty of the form min(1, e^(1 − ref_length/gen_length)), penalizes generated text that is shorter than the reference. If the length of the reference text equals the length of the generated text, the first term equals 1 and the value of the metric is determined entirely by the second term, the precision.
To calculate the precision term, BLEU measures the fraction of i-grams in the generated text that are also contained in the reference text, for i = 1 to 4, and combines the four fractions as a geometric mean.
BLEU and its variants are very brittle:
- They don’t capture semantic information. Swapping the word “happy” for “joyous” and swapping it for “sad” are penalized equally.
- They don’t effectively measure grammar. Jumbling the order of the words in a piece of text can render the text nonsensical, while barely changing the BLEU score.
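The calculation, and the brittleness it produces, can be sketched in a few lines of plain Python. This is a simplified sentence-level BLEU with crude smoothing, not the official corpus-level implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: brevity penalty times the
    geometric mean of the 1- to max_n-gram modified precisions."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each reference n-gram can be matched at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Smooth zero counts so the geometric mean stays defined.
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty: equals 1 when the candidate is at least as long
    # as the reference, and decays exponentially for shorter candidates.
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat"
print(bleu(ref, ref))  # identical text scores 1.0
print(bleu(ref, "the happy cat sat on the mat"))
print(bleu(ref, "the sad cat sat on the mat"))
```

Note that the “happy” and “sad” variants receive exactly the same score: the metric only sees that one unfamiliar token was inserted, not what it means.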
Adding Context: BERTScore
The next class of metrics attempts to add semantic meaning and contextual information to techniques like BLEU, by leveraging embeddings instead of raw tokens or words. Embeddings are learned representations that map words to a vector in high-dimensional space such that the vectors capture the meaning and relationship between words.
In BERTScore, each token in the reference text and the generated text is first embedded using the BERT language model. Using a language model like BERT provides embeddings that carry both semantic meaning and contextual information, because BERT can generate different representations of the same word depending on its surrounding sentence, or context. The BERTScore is then a combination of:
- A precision term: Of all the embeddings in the generated text, compute the mean of the maximum cosine similarity with any embedding in the reference text.
- A recall term: Of all the embeddings in the reference text, compute the mean of the maximum cosine similarity with any embedding in the generated text.
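The two terms can be sketched with toy, hand-made token vectors standing in for BERT embeddings (real BERTScore also supports idf weighting of tokens, which we omit here):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore(ref_embs, gen_embs):
    """Greedy-matching precision/recall over token embeddings,
    combined into an F1, as in BERTScore."""
    # Precision: each generated token greedily matches its most
    # similar reference token.
    precision = sum(max(cosine(g, r) for r in ref_embs)
                    for g in gen_embs) / len(gen_embs)
    # Recall: each reference token greedily matches its most
    # similar generated token.
    recall = sum(max(cosine(r, g) for g in gen_embs)
                 for r in ref_embs) / len(ref_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-d "embeddings" standing in for BERT token vectors.
ref = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
gen = [[0.9, 0.2, 0.0], [0.0, 1.0, 0.0]]
print(bertscore(ref, gen))
```

With real BERT embeddings, “happy” and “joyous” would land near each other in vector space while “sad” would not, which is exactly the distinction BLEU cannot make.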
While BERTScore addresses a core issue of incorporating semantic information into the metric, it has its own drawbacks:
- Introducing embeddings increases the computational complexity of the metric.
- When using embedding-based metrics, the choice of embedding can influence what aspect of the text (semantic meaning, tone, style) the metric is most optimized to measure.
Correlating with a Human Text Distribution: MAUVE
For both BLEU and BERTScore, a single human reference text is compared to a single model generated text. The authors of MAUVE propose a metric that compares the distribution of model generated texts to the distribution of human generated texts, and show that this can better correlate with human judgments of models completing generative tasks.
To compute MAUVE, samples are taken from a human distribution of text and from the model. A separate model (the authors use GPT-2) is then used to embed all the texts. Each underlying distribution is then approximated by clustering the texts and counting the number in each cluster. The final MAUVE score estimates the divergence between the human text distribution and the model text distribution.
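A drastically simplified sketch of this pipeline, using toy 2-d “embeddings”: both samples are quantized against shared centroids, and the resulting cluster distributions are compared with a symmetrized KL divergence. (Real MAUVE uses learned embeddings, proper k-means clustering, and summarizes a whole frontier of divergences between mixture distributions, rather than this single number.)

```python
import math

def nearest(x, centroids):
    """Index of the centroid closest to point x (squared distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))

def cluster_histogram(points, centroids, smoothing=1e-6):
    """Approximate a text distribution by counting points per cluster."""
    counts = [smoothing] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mauve_like(human_embs, model_embs, k=4):
    """Compare the two samples' cluster distributions; 0 = identical."""
    centroids = (human_embs + model_embs)[:k]  # crude stand-in for k-means
    p = cluster_histogram(human_embs, centroids)
    q = cluster_histogram(model_embs, centroids)
    return 0.5 * (kl(p, q) + kl(q, p))  # symmetrized KL divergence

human = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1]]
model = [[0.0, 0.1], [1.0, 0.9]]
print(mauve_like(human, model, k=4))
```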
Downsides of MAUVE:
- MAUVE still requires a reference, human-generated text distribution.
- MAUVE scores can vary based on the hyperparameters chosen; the authors provide best-practice recommendations for setting them.
Monitor What Matters
All of the metrics described above require a reference text to compare model output against, which is often difficult to obtain in production settings. Exciting new metrics like USR explore training a model on a set of model texts and ground truth texts to score new texts, so that no reference is required at scoring time. Model-based metrics are also a promising direction toward metrics that users can adapt for a given use case. But in production, “good” comes down to whether the LLM is successful in the product it is deployed in.
When deploying a language model in an application or other business setting, we can more directly measure performance by collecting implicit performance metrics from the LLM’s environment. Two types of performance measures could be relevant to track: direct feedback signals from users (for example, ratings, edits, or requests to regenerate a response) and downstream task outcomes (for example, whether the user completed the task the LLM was meant to assist with).
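As a minimal sketch (the event names below are hypothetical, not an Arthur or any product API), implicit user feedback can be logged per inference and rolled up into simple rates:

```python
from collections import Counter

# Hypothetical event log from an LLM-powered writing assistant:
# each inference records what the user did with the output.
events = [
    {"inference_id": 1, "event": "accepted"},
    {"inference_id": 2, "event": "regenerated"},
    {"inference_id": 3, "event": "accepted"},
    {"inference_id": 4, "event": "edited"},
]

counts = Counter(e["event"] for e in events)
acceptance_rate = counts["accepted"] / len(events)
regeneration_rate = counts["regenerated"] / len(events)
print(acceptance_rate, regeneration_rate)  # 0.5 0.25
```

Rates like these, tracked over time and sliced by segment, give a production-grounded view of quality without requiring any reference text.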
These metrics are just the starting point in a holistic monitoring solution for LLMs, and are not meant to encourage driving engagement or automation at the expense of fairness or cognitive engagement. For a deeper look at designing human-centered evaluation for LLMs, check out Teresa Datta’s recent blog.
Future Research Directions
How can we develop metrics that are optimized for the context of the deployment?
Given the above performance proxies for an LLM in production, we’d like automated metrics that correlate with those criteria. Such metrics could then be used during phases of the model lifecycle where production signal is not available, such as model selection and validation.
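For instance, one simple way to check whether a candidate automated metric tracks a production signal is to measure their correlation over the same set of inferences (the numbers below are made up for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: a candidate metric's scores vs. a production
# signal (e.g. user acceptance rate) for the same inferences.
metric_scores = [0.2, 0.5, 0.6, 0.9]
acceptance = [0.1, 0.4, 0.7, 0.8]
print(pearson(metric_scores, acceptance))
```

A metric that correlates strongly with the production signal can then stand in for that signal during offline model selection and validation.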
Can we use LLMs to generate feedback for other LLMs?
There is exciting research exploring the possibility of using LLMs to grade, score, and monitor other LLMs. For example, in Self-Refine, the authors propose a framework in which an LLM iteratively improves its output on a task using feedback generated by the same model. At Arthur, we are exploring utilizing embeddings and LLMs for scoring LLM outputs, providing natural language descriptions of model performance, and benchmarking the strengths and limitations of using LLMs during evaluation.
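A schematic of a Self-Refine-style loop, with the model call stubbed out: a real implementation would replace `llm` with an actual API call, and the prompts here are illustrative rather than the paper’s own.

```python
def llm(prompt):
    """Stub standing in for a real LLM call (e.g. an API request)."""
    if "Give feedback" in prompt:
        return "Make it more concise."
    return "A refined draft."

def self_refine(task, n_iterations=3):
    """Self-Refine-style loop: the same model drafts, critiques its own
    output, and revises, for a fixed number of iterations."""
    output = llm(task)
    history = [output]
    for _ in range(n_iterations):
        feedback = llm(f"Give feedback on this response to '{task}': {output}")
        output = llm(f"Revise the response to '{task}' using this feedback: "
                     f"{feedback}\nResponse: {output}")
        history.append(output)
    return output, history

final, drafts = self_refine("Summarize the report", n_iterations=2)
print(len(drafts))  # 3: the initial draft plus two revisions
```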