
Why LLM Hallucinations Are So Hard to Deal With

Exploring the complexities of LLM hallucinations: challenges and solutions in AI


Introduction

Large language models (LLMs) producing nonsensical or contradictory content in response to a prompt, what is normally called a hallucination, is a hard problem to solve. LLM hallucinations can take many forms, such as solving a math problem incorrectly or making false statements about presidents. However, describing hallucinated content in this way, although helpful for our colloquial understanding, may not benefit us in the long term.

How we describe LLM hallucinations is important. If we stick to high-level definitions of hallucinated content, such as unverifiable or false responses, it is hard to dissect what exactly we mean. But if we only describe specific instances, such as getting a step of a math problem wrong or producing a research paper when answering a question, we don’t have a good way to build a general understanding of how hallucinations occur. Moreover, all of these ways of talking about hallucinations grant some agency to the model. Although unintentional, when we say that an LLM got a math problem wrong or said something false about a celebrity, we are implying that the LLM somehow has the capability of knowing the correct answer, when in reality none of the LLMs released to date has the capability to understand.

Why does this matter?

Without proper definitions and a shared understanding of LLM hallucinations, the AI community cannot create high-quality datasets about them. And without high-quality datasets, our ability to build solutions that tackle hallucinations is hindered, because we cannot train models or produce valid evaluations.

The datasets that exist today are fairly broad, binary (and thus unable to give any granular feel for the types of hallucinations), and at times a bit dirty. This isn’t to say there haven’t been some great attempts at analyzing hallucinations: one of our favorite papers provides preliminary taxonomies for hallucinated content, while another produces what are seen as some of the best datasets to date. But overall, the field of hallucination detection and mitigation is quite nascent, and the entire AI community needs high-quality data.
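To make the contrast concrete, here is a minimal sketch in Python of the difference between a binary hallucination label and a more granular, taxonomy-style label. The category names and fields are purely illustrative assumptions, not an existing dataset schema or Arthur’s taxonomy.

```python
from dataclasses import dataclass

# A binary label only records *that* a response hallucinated, not *how*.
@dataclass
class BinaryLabel:
    response: str
    is_hallucination: bool

# Hypothetical, illustrative categories: not Arthur's actual taxonomy.
HYPOTHETICAL_CATEGORIES = [
    "fabricated_fact",       # e.g., a false statement about a public figure
    "reasoning_error",       # e.g., a wrong step in a math problem
    "fabricated_citation",   # e.g., citing a research paper that does not exist
    "prompt_contradiction",  # the output contradicts the prompt itself
]

# A granular label records the kind of failure and the offending span,
# which is what makes pattern analysis and targeted evaluation possible.
@dataclass
class GranularLabel:
    response: str
    category: str  # one of HYPOTHETICAL_CATEGORIES
    evidence: str  # the specific claim or step that is wrong

example = GranularLabel(
    response="A 1997 paper by Smith et al. settled the question.",
    category="fabricated_citation",
    evidence="No such 1997 paper by Smith et al. exists.",
)
```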

Here at Arthur, one of our focuses is LLM hallucinations. We believe that building a rigorous understanding of what hallucinations are, where they come from, and how they are generated will not only deepen our own understanding but also help us create such a dataset for the AI community. Read the blog posts from the Arthur team on our work comparing hallucination rates across different language models and starting to analyze the types of hallucinations that occur.

Help us collect data!

We created a taxonomy so that we can create high-quality data. There are two ways to go about this: either start by generating themes from existing research, datasets, and other sources and then collect data against them, or collect data first and see what themes emerge from the data itself. We are at the point where we need to start collecting some data!
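As a rough sketch of how the two approaches can work together in practice, the snippet below tallies collected hallucination reports against a set of seed themes (top-down) and sets aside anything that doesn’t fit so new themes can emerge from the leftovers (bottom-up). The theme names and report format are hypothetical, purely for illustration.

```python
from collections import Counter

# Hypothetical seed themes drawn from prior research (top-down starting point).
SEED_THEMES = {"fabricated_fact", "reasoning_error", "fabricated_citation"}

def triage_reports(reports):
    """Count reports per seed theme; park the rest for bottom-up theme discovery."""
    counts = Counter()
    uncategorized = []
    for report in reports:  # each report: {"text": ..., "theme": ...}
        theme = report.get("theme")
        if theme in SEED_THEMES:
            counts[theme] += 1
        else:
            uncategorized.append(report)  # candidates for new, emergent themes
    return counts, uncategorized

counts, leftovers = triage_reports([
    {"text": "Cited a paper that does not exist", "theme": "fabricated_citation"},
    {"text": "Invented a quote from an interview", "theme": None},
])
print(counts)     # Counter({'fabricated_citation': 1})
print(leftovers)  # [{'text': 'Invented a quote from an interview', 'theme': None}]
```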

As you stumble across hallucinations, please fill out this form. Any and all hallucinated content will be useful. We will use this to inform our taxonomy development and, in the near future, we will open source a high-quality LLM hallucinations dataset for the AI community to build upon. If you have any questions, concerns, or want to collaborate, feel free to email raphael@arthur.ai.

FAQ

What are the key benefits of having a well-defined taxonomy for AI systems and their outputs?

Having a robust taxonomy provides several important benefits: it enables standardized evaluation of AI systems, facilitates targeted improvements by clearly delineating issues, improves the transparency and explainability of an AI system’s inner workings, and enables better knowledge sharing across the broader AI community.

How do taxonomies differ between narrow AI applications and general/large language models?

Taxonomies for narrow AI tend to be more tightly scoped and stable, focusing on specific output types. In contrast, taxonomies for general large language models need to be more expansive and flexible to account for the wider range of potential failure modes, while still requiring rigorous definitions and transparency.

How can taxonomies be practically applied to improve AI system robustness and safety?

Taxonomies have practical applications such as informing comprehensive testing, guiding targeted mitigation efforts, boosting interpretability, facilitating collaborative learning, and supporting regulatory compliance, all of which can enhance the overall robustness and safety of AI systems.