You probably haven’t heard of human-centered evaluation of LLMs, and that needs to change. Human-centered work studies how real humans interact with technology: in our case, how humans (with all of their cognitive biases and quirks) interact with LLMs, and how these models affect individual human decision-making.

This work was accepted to the CHI 2023 workshop on Generative AI and HCI. To read our full paper, please visit here.

What are LLMs? 

In the past year, Large Language Models (LLMs) have exploded in popularity across research, industry, and public awareness and accessibility. For instance, ChatGPT set historic records for customer growth, reaching over 100 million users in its first two months. These models predict the next token (a character, subword, or word) from the preceding context, generating free-form text for a wide variety of tasks in almost any specified style. This means people can fold LLMs into their daily lives again and again: to decide what to eat for breakfast, to write responses to yesterday's unanswered emails, to develop a mid-morning sales pitch, to generate a funny joke during a break from work, and so on.
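To make "predicting the next token" concrete, here is a toy sketch: a bigram model that picks the most frequent next word given the previous word. This is an illustration only, not how production LLMs work; real models use large neural networks trained on enormous corpora of subword tokens, and sample rather than always taking the single most likely token.

```python
from collections import Counter, defaultdict

# Tiny toy corpus; a real LLM trains on trillions of subword tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next token given the previous token."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

Chaining such predictions, token after token, is what lets a language model generate free-form text; LLMs do the same thing with vastly richer context than a single preceding word.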

A variety of concerning issues with LLMs have already been identified, such as biased, toxic, or hallucinated outputs, but these largely reflect distributional or instance-wise properties of the outputs themselves. The potential ubiquity of this technology means we need to consider how humans will actually interact with and use it, while acknowledging that we are all prone to cognitive biases and other quirks. This area of research is called human-centered evaluation, and it has not yet been thoroughly explored for LLMs. It is, however, already well established in the Explainable AI (XAI) community.

What is Explainable AI? 

Defining explainability for ML models is a subject of ongoing discussion. For our purposes, we will focus on the most common type of model transparency seen in industry: post-hoc explanations of black-box models. These methods typically rely only on a trained model's inputs and outputs to identify patterns in how the model makes decisions, with the aim of letting stakeholders understand a model's decision-making process, improve trust, and mitigate downstream harms. Arthur's explainability features offer a variety of explanation options, including counterfactual explanations (understanding how a model's prediction would have changed under a hypothetical, what-if scenario) and popular feature-attribution methods such as LIME and SHAP.
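To make the counterfactual idea concrete, here is a minimal sketch (purely illustrative, not Arthur's implementation): given a black-box decision function, we search for the smallest change to one input that flips the model's decision. The loan-approval rule and the one-feature brute-force search are hypothetical stand-ins for a real model and a real counterfactual optimizer.

```python
# Hypothetical black-box model: approve a loan if a score clears a threshold.
# (Toy rule for illustration only.)
def model(income, debt):
    return "approve" if income - 0.5 * debt >= 50 else "deny"

def counterfactual_income(income, debt, step=1, max_delta=200):
    """Find the smallest income increase that flips the model's decision.

    A brute-force, single-feature search standing in for the optimization a
    real counterfactual explainer performs over all features at once.
    """
    original = model(income, debt)
    for delta in range(0, max_delta, step):
        if model(income + delta, debt) != original:
            return delta
    return None  # no counterfactual found within the search range

# The "what-if" question: how much more income would have changed the outcome?
print(model(40, 10))                  # deny: score 40 - 5 = 35 < 50
print(counterfactual_income(40, 10))  # 15: score 55 - 5 = 50 flips to approve
```

The returned delta is the explanation: "your loan would have been approved if your income were $15k higher." Methods like LIME and SHAP answer a different question, attributing the prediction across input features rather than proposing a what-if change.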

There are a variety of quantitative metrics for evaluating XAI; these are mostly scores that measure certain ideal properties of explanations (for more information, see this piece). Perhaps more importantly, there are also qualitative evaluation considerations, and these matter for two main reasons.

What makes Explainable AI (and LLMs) unique?

Qualitative evaluation is important for XAI and LLMs because they are distinct from the classic ML paradigm for two key reasons:

1. There is no ground truth.

There is no exactly correct explanation for a black box system. LLMs are inherently open-ended systems that don’t have a ground truth output for each input.

2. In practice, XAI and LLM outputs are often actually meant for some downstream decision or task.

Practitioners often use XAI as an assistive tool for model debugging, generating hypotheses, and ensuring compliance. LLM outputs are likewise often a tool to help you decide: what email to send to your client, what quick summary of an important document to read, what answer to give to a pertinent question, and so on. In other words, the context of XAI and LLM use involves a human using an explainer or LLM as a piece of evidence to make some decision. Thus, we need to consider how people use, receive, and comprehend AI outputs, especially because humans are susceptible to cognitive biases when processing information and making decisions.

What does qualitative, human-centered evaluation look like in practice?

There are three areas of focus to consider.

1. Understanding Users’ Mental Models

A user’s mental model of a technology (a term popularized in design by Don Norman) is their internal understanding of how that technology works. For instance, maybe your mental model of a crosswalk is that pushing the crosswalk button will make the walk signal appear more quickly. In many cities, however, that button does not actually do anything. This is an example of a user's mental model not aligning with a technology's true model. People rely heavily on their mental models of technology to decide when to use it, to evaluate how much to trust the outputted results, and to make sense of those results (see, e.g., Cabrera et al. 2023, He & Gadiraju 2022, Kaur et al. 2022).

These personal mental models are formed from a user's perceptions of and interactions with the technology, and from how they believe the system works. While ML practitioners may have had access to specialized training on how LLMs work, this is not the case for the vast majority of the general population. We cannot assume that everyone will share our understanding of how a technology works. To our knowledge, no work has explored the mental models the general public holds of LLMs. How a general user believes an LLM to work may be very different from how it actually works, and this mismatch can be dangerous. It is not difficult to imagine frightening scenarios where users anthropomorphize or deify an LLM chatbot, understanding it to be a "magical" source of ground truth. This could quickly lead to conspiracy theories and the legitimization of disinformation campaigns. It is important to consider whether this is an issue of messaging and education (informing the public through AI literacy) or of regulation (policies that require algorithm providers to supply accurate, comprehensible warning labels on the limitations of their technology).

2. Evaluating Use Case Utility

As previously discussed, XAI and LLMs are often tools for accomplishing some other goal. In the XAI literature, the term use case refers to a specific usage scenario and its associated downstream task or end goal. That literature has found that although it might be easy to assume an explanation will help a user accomplish a task like model debugging or model understanding, this is not necessarily the case. When performance on the downstream task is actually measured, the presence of explanations can have no effect, or even a negative effect, especially if the XAI is faulty (see, e.g., Jacobs et al. 2021, Adebayo et al. 2020). Very limited work has explored the utility of LLMs in use case–specified user studies, but a user study of Microsoft/GitHub's Copilot, an LLM-based code generation tool, found that it “did not necessarily improve the task completion time or success rate.” In short, we want to understand whether an AI assistive tool actually helps users accomplish the end goal.

3. Acknowledging Cognitive Engagement

Cognitive effort is a form of labor, and unsurprisingly, people tend to favor less demanding forms of cognition and other mental shortcuts. For example, when asked to "agree" to a user agreement while signing up for a new platform, you are probably more likely to check the box than to cognitively engage with the language of the agreement.

Unfortunately, this human tendency can lead to unintended or dangerous outcomes because humans are susceptible to a wide variety of cognitive biases. For XAI, this manifests as practitioners only superficially examining explanations instead of digging deeply, leading to over-trust, misuse, and a lack of accurate understanding of the outputs. This can be dangerous when it results in the over-confident deployment of a faulty or biased model. 

Issues of cognitive engagement should be held front and center by researchers of LLMs. Because of their massive scale and public accessibility, LLMs may quickly become ubiquitous in all aspects of daily life. Realistically, how much will users cognitively engage with the sheer volume of generated outputs to ensure they are correct and aligned with their intentions? Consider an LLM-generated email: how often, and how deeply, will a user review it before sending? And what if it's not just one email, but every email? Will users always catch when the generated output says something incorrect, or worse, inappropriate? The ever-growing pull of digital stimulation on our limited attention only compounds the problem.

Another concern is that LLM outputs often sound very confident, even when they are completely false. Moreover, when a user points out an error, these models have a documented tendency to argue that the user is wrong and that their response is correct. (Indeed, some have called LLMs "mansplaining as a service.") This can make it harder for humans to apply cognitive checks to LLM outputs.

Why is this important?

LLMs have massive reach, so the consequences of lacking a qualitative understanding of the utility of their outputs are grave. Beyond the environmental and computational costs of such models, the social consequences of offloading our cognitive work onto these agents are entirely unknown.

We need to understand how users make decisions about whether to utilize the outputs of LLMs, the mental models that users have of these technologies, whether LLM outputs are actually helpful in downstream tasks, and how much users cognitively engage with the outputs to verify their correctness and lack of harm. It is dangerous to continue developing and making available larger and larger language models without a proper understanding of how humans will (or will not) cognitively engage with their outputs.