When it comes to text evaluation, you may have heard of methods like BLEU (which scores n-gram overlap with reference text) or BERTScore (which compares embeddings from a pre-trained language model). While these methods are still useful, they have taken a bit of a back seat recently as LLMs have advanced. In this post, we’ll discuss what LLM-guided evaluation, or using LLMs to evaluate LLMs, looks like, as well as some pros and cons of this approach as it currently stands.

What does LLM-guided evaluation look like?

The process of using LLMs to conduct evaluation—as opposed to the BLEU and BERTScore methods mentioned above—is conceptually a bit simpler. You take the generated text whose quality you want to evaluate, you pass it into a prompt template that you provide to an LLM, and then that LLM provides feedback about how “good” the generated text was.
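As a minimal sketch of that flow, the snippet below fills a prompt template with the text to be graded, passes it to an evaluator, and parses the reply. The template wording, the 1 to 5 scale, and the function names are illustrative assumptions rather than a prescribed format, and `llm` stands in for whichever model client you actually use.

```python
from typing import Callable

# Hypothetical template: the wording and the 1-5 scale are assumptions
# made for illustration, not a recommended rubric.
EVAL_TEMPLATE = """You are grading a response written by another model.

Instruction given to the model:
{instruction}

Response to grade:
{response}

Rate how helpful the response is on a scale from 1 (poor) to 5 (excellent).
Answer with the number only."""


def build_eval_prompt(instruction: str, response: str) -> str:
    """Fill the template with the text to be evaluated."""
    return EVAL_TEMPLATE.format(instruction=instruction, response=response)


def evaluate(instruction: str, response: str, llm: Callable[[str], str]) -> int:
    """Send the filled prompt to an evaluator and parse its score.

    `llm` is any function that maps a prompt string to a completion string.
    """
    raw = llm(build_eval_prompt(instruction, response))
    score = int(raw.strip().split()[0])  # raises ValueError on non-numeric replies
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {raw!r}")
    return score


if __name__ == "__main__":
    # Stand-in for a real model call so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return "4"

    print(evaluate("Summarize the article.", "The article argues ...", fake_llm))
```

Keeping the model call behind a plain callable makes it easy to swap evaluators later, which matters given how much the choice of evaluator can change the results.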

One way to think about LLM-guided evaluation is using LLMs for classification. This typically involves providing an evaluator LLM with a scoring rubric in a prompt template. This rubric instructs the evaluator LLM how to classify some other candidate LLM’s response (see the appendix of Meta’s LIMA paper for an example).

There’s also binary classification, where you might ask something like “Does this response contain hate speech?” (or any other property) and get a “yes” or “no” answer. To do this, you would construct examples of hate speech (or whichever property you want to evaluate for), provide them in a prompt template, and then prompt the language model to classify a new response as “yes” or “no” based on those examples.

Supplying labeled examples in the prompt this way is called few-shot prompting, and we’ve found that it can go quite a long way in creating a basic first implementation of LLM-guided evaluation.
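A few-shot binary classifier along these lines might be sketched as follows. The prompt layout, the function names, and the benign “apology” property used in the demo are all assumptions chosen for illustration; `llm` is again a placeholder for a real model call.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (response text, "yes" or "no")


def build_few_shot_prompt(question: str, examples: List[Example], candidate: str) -> str:
    """Lay out the question, the labeled examples, and the new response to classify."""
    lines = [question, ""]
    for text, label in examples:
        lines += [f"Response: {text}", f"Answer: {label}", ""]
    lines += [f"Response: {candidate}", "Answer:"]
    return "\n".join(lines)


def classify(
    question: str,
    examples: List[Example],
    candidate: str,
    llm: Callable[[str], str],
) -> bool:
    """Return True for a "yes" answer and False for a "no" answer."""
    raw = llm(build_few_shot_prompt(question, examples, candidate)).strip().lower()
    if raw.startswith("yes"):
        return True
    if raw.startswith("no"):
        return False
    raise ValueError(f"expected yes/no, got: {raw!r}")


if __name__ == "__main__":
    # Made-up examples for a harmless property.
    examples = [
        ("Sorry for the confusion, here is the fix.", "yes"),
        ("The capital of France is Paris.", "no"),
    ]

    # Stand-in for a real model call so the sketch runs offline.
    def fake_llm(prompt: str) -> str:
        return "yes"

    print(classify("Does this response contain an apology?", examples,
                   "My apologies, I misread the question.", fake_llm))
```

Rejecting anything other than a clear “yes” or “no” is a deliberate choice here: it surfaces malformed evaluator output instead of silently counting it as a negative.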

Why does LLM-guided evaluation help?

The reasons we see value in this approach tend to revolve around speed and sensitivity. 

Typically Faster to Implement

Compared to the amount of work it may have required before the era of LLMs to get an evaluation pipeline set up, it’s relatively quick and easy to create a first implementation of LLM-guided evaluation.

For LLM-guided evaluation, you’ll need to prepare two things: a description of your evaluation criteria in words, and a few examples to provide in your prompt template. Contrast this with the effort and data collection required to pre-train your own NLP model, or to fine-tune an existing one, to use as an evaluator. When using LLMs for these tasks, evaluation criteria are much quicker to iterate on.

Typically More Sensitive

The sensitivity aspect can be good and bad. On the positive side, sometimes a single subtle word or token (e.g. “not”) changes the meaning of a sentence. LLMs handle these cases more flexibly than pre-trained NLP models and the earlier evaluation methods discussed above. On the flip side, this sensitivity can make LLM evaluators quite unpredictable, which we discuss more in the next section.
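To make the “not” example concrete, here is a small sketch of a word-overlap score. It is a simplified unigram precision, not full BLEU (which also uses higher-order n-grams and a brevity penalty), and the sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference (clipped counts)."""
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_words)
    overlap = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    return overlap / len(cand_words)


if __name__ == "__main__":
    reference = "the service was good"
    candidate = "the service was not good"  # opposite meaning, one extra word
    print(unigram_precision(candidate, reference))  # 0.8
```

The candidate reverses the meaning of the reference yet still scores 0.8, because four of its five words match. An evaluator that reads the sentence as a whole has a better chance of catching that reversal.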

What are some challenges with using LLMs as evaluators?

Too Much Sensitivity/Variability to Be a Fully Automatic Solution on Their Own

As we discussed above, LLM evaluators are more sensitive than other evaluation methods. There are many different ways to set up an LLM as an evaluator, and it can behave very differently depending on your configuration choices, such as which LLM you use as the evaluator and how you format and structure the prompt.
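One simple way to quantify this variability is to run several evaluator configurations over the same responses and check how often they agree. The sketch below assumes each configuration is wrapped as a function from a response to a label; the two keyword-based stand-ins in the demo are hypothetical placeholders for real LLM-backed evaluators.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Evaluator = Callable[[str], str]  # maps a candidate response to a label


def agreement_rate(labels_a: List[str], labels_b: List[str]) -> float:
    """Fraction of positions where two label lists match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def pairwise_agreement(
    evaluators: Dict[str, Evaluator], responses: List[str]
) -> Dict[Tuple[str, str], float]:
    """Label every response with every evaluator, then compare each pair."""
    labels = {name: [ev(r) for r in responses] for name, ev in evaluators.items()}
    return {
        (a, b): agreement_rate(labels[a], labels[b])
        for a, b in combinations(sorted(labels), 2)
    }


if __name__ == "__main__":
    # Keyword rules standing in for two differently configured LLM evaluators.
    def strict(r: str) -> str:
        return "yes" if "sorry" in r.lower() else "no"

    def lenient(r: str) -> str:
        text = r.lower()
        return "yes" if "sorry" in text or "apolog" in text else "no"

    responses = [
        "Sorry about that.",
        "We apologize for the delay.",
        "Here is your answer.",
        "Thanks!",
    ]
    print(pairwise_agreement({"strict": strict, "lenient": lenient}, responses))
```

Low agreement between two configurations is a signal that the evaluation depends heavily on setup choices and probably needs human spot checks before you rely on it.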

Constrained by the Difficulty of the Task Being Evaluated

Another challenge is that LLMs will often struggle if evaluating the task at hand requires too many reasoning steps or too many variables to be managed simultaneously. We do anticipate this improving over time as more tools and APIs become available for LLMs to interface with, but for now, this is the case.

While we have not covered this specific limitation in our experiments, previous research has shown limits to transformer “reasoning.” Transformers, the AI models that power LLMs, are sometimes able to do tasks that require multi-step reasoning (e.g. multiplying numbers with many digits), but there’s a limit to how well they can generalize beyond their training examples. Even if you fine-tune an LLM on many examples of multi-step multiplication, you get massive correctness drop-offs once you go just beyond the size of the problems in the training data. For more detail about this particular phenomenon, check out this paper.

Our Experiments

We’ve launched a suite of products for LLMs at Arthur this year—most recently Arthur Bench, an LLM evaluation product. Additionally, our ML Engineering team has experimented extensively with LLM-guided evaluation, particularly focusing on the sensitivity and variability challenge.

In a recent webinar, ML Engineers Max Cembalest & Rowan Cheung did a deep dive into some of these experiments. They tested some of the well-known LLMs (gpt-3.5-turbo, claude-2, LLaMa2-70b, command, etc.) as both candidates and evaluators, under the hypothesis that an LLM evaluator would be biased towards text it itself had generated over text other models had generated. Watch the webinar on YouTube to see the results in detail and find out if this hypothesis was supported.
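For readers who want to probe the same hypothesis on their own data, one possible way to tabulate it is sketched below. This is not the method used in the webinar: the data layout, the function name, and the placeholder model names and scores are all assumptions, and the numbers in the demo are made up purely so the code runs.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (evaluator name, candidate name) -> scores the evaluator gave that candidate's text
Scores = Dict[Tuple[str, str], List[float]]


def self_preference_gap(scores: Scores) -> Dict[str, float]:
    """For each evaluator, mean score on its own text minus mean score on others' text.

    A positive gap is consistent with self-preference; it does not prove it,
    since the evaluator's own text may simply be better.
    """
    gaps: Dict[str, float] = {}
    for evaluator in sorted({ev for ev, _ in scores}):
        own = scores.get((evaluator, evaluator), [])
        others = [
            s
            for (ev, cand), vals in scores.items()
            if ev == evaluator and cand != evaluator
            for s in vals
        ]
        if own and others:  # skip evaluators without both kinds of data
            gaps[evaluator] = mean(own) - mean(others)
    return gaps


if __name__ == "__main__":
    # Made-up placeholder scores, not experimental results.
    toy_scores: Scores = {
        ("model_a", "model_a"): [4, 4],
        ("model_a", "model_b"): [2, 4],
        ("model_b", "model_a"): [3],
        ("model_b", "model_b"): [3],
    }
    print(self_preference_gap(toy_scores))
```

Comparing each evaluator's gap against how other evaluators rate the same candidate helps separate genuine quality differences from bias toward one's own output.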

Interested in learning more about Arthur Bench? Find more information here.