Hallucination Experiment


LLMs have taken the world by storm, but they are by no means foolproof sources of truth.

At Arthur, we wanted to understand the frontier of what LLMs are capable of, to help mitigate the risks businesses might face when incorporating these impressive yet stochastic tools into existing workflows.

We set out to explore, both quantitatively and qualitatively, how some of today's top LLMs compare when responding to challenging questions.


We compiled a dataset of challenging questions (along with their expected answers) from three categories: Combinatorial Mathematics, U.S. Presidents, and Moroccan Political Leaders.

These questions were designed to contain a key ingredient that gets LLMs to blunder: they demand multiple steps of reasoning about information. 

The models we tested were gpt-3.5 (~175b parameters) and gpt-4 (~1.76 trillion parameters) from OpenAI, claude-2 from Anthropic (parameter count unknown), llama-2 (70b parameters) from Meta, and the Command model from Cohere (~50b parameters).

We recorded three responses from each LLM in order to get a better glimpse into the ranges of possible answers a model might give, in particular to see if some models were sometimes correct but sometimes incorrect.

For each question, we categorized each LLM response into one of three categories:

  1. The response was the correct answer.
  2. The response did not attempt to answer the question (mentioning the likelihood of getting it wrong, or saying further clarification is needed, as the reason for not answering).
  3. The response contained a hallucination.

(Note: over time, we intend for our taxonomy of response types to grow, for example to explicitly distinguish between different types of hallucinations.)
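The three categories above can be sketched as a small data structure. This is only an illustrative sketch: the names `ResponseType` and `tally`, and the example labels, are hypothetical and are not taken from our evaluation code.

```python
from collections import Counter
from enum import Enum


class ResponseType(Enum):
    """The three response categories described above (names are illustrative)."""
    CORRECT = "correct"
    ABSTAINED = "abstained"          # the model did not attempt an answer
    HALLUCINATION = "hallucination"


def tally(labels):
    """Count how many responses fall into each category, including zeros."""
    counts = Counter(labels)
    return {response_type: counts.get(response_type, 0) for response_type in ResponseType}


# Hypothetical labels for one model's three attempts at a single question.
example_labels = [
    ResponseType.CORRECT,
    ResponseType.HALLUCINATION,
    ResponseType.ABSTAINED,
]
example_counts = tally(example_labels)
```

Using an enum rather than free-form strings would make it straightforward to add finer-grained hallucination types later without breaking existing tallies.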


On Combinatorics & Probability, we saw a wide range of behavior, with gpt-4 performing best, followed closely by claude-2.

On U.S. Presidents, we saw claude-2 get more correct than gpt-4, and we saw llama-2 get many correct. 

On Moroccan Political Leaders, we saw gpt-4 perform the best, with claude-2 and llama-2 almost entirely abstaining from answering.

Across multiple attempts at the same question, we saw diversity in the response types from the LLMs: the same model could sometimes be correct, sometimes be slightly incorrect, and sometimes avoid answering.

Any time you claim to have observed model behavior, it is important to try again a few times to see if the behavior persists!
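One simple way to check whether a behavior persists is to group hand-labeled responses by model and question and flag any pair whose repeated attempts received different labels. The sketch below assumes a hypothetical record format of (model, question id, label); the model names and labels in it are invented for illustration.

```python
from collections import defaultdict


def find_inconsistent(records):
    """Return {(model, question): labels} for every model/question pair
    whose repeated attempts did not all receive the same label."""
    seen = defaultdict(set)
    for model, question, label in records:
        seen[(model, question)].add(label)
    return {key: labels for key, labels in seen.items() if len(labels) > 1}


# Hypothetical hand-labeled records: (model, question id, label).
records = [
    ("model-a", "q1", "correct"),
    ("model-a", "q1", "hallucination"),
    ("model-a", "q1", "abstained"),
    ("model-b", "q1", "correct"),
    ("model-b", "q1", "correct"),
    ("model-b", "q1", "correct"),
]
inconsistent = find_inconsistent(records)
```

With these made-up records, only the ("model-a", "q1") pair is flagged, since model-b gave the same kind of response on all three attempts.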

The full dataset of answers to questions can be found here. We share some interesting excerpts of answers below.

Example question:

How many people served as U.S. president between 1880 and 2000?

Three different gpt-4 answers to this question:

Column 1: Correct at the end, even though it first says the answer is 20 and accidentally lists Grover Cleveland twice.

Column 2: Incorrectly says the answer is 20 at the beginning even though it goes on to list all 23 (and only listed Grover Cleveland once this time).

Column 3: Incorrect, states the answer is 21, lists Grover Cleveland twice, then states the answer is 22.
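To see why this question is easy to get wrong, the count can be checked with a short script. This is a sketch under stated assumptions: "between 1880 and 2000" is read inclusively, so anyone in office at any point in those years counts, and the years below are calendar years of taking and leaving office. The helper name `people_in_office` is ours, for illustration only.

```python
# (name, year term began, year term ended). Grover Cleveland appears twice
# because he served two non-consecutive terms.
TERMS = [
    ("Rutherford B. Hayes", 1877, 1881),
    ("James A. Garfield", 1881, 1881),
    ("Chester A. Arthur", 1881, 1885),
    ("Grover Cleveland", 1885, 1889),
    ("Benjamin Harrison", 1889, 1893),
    ("Grover Cleveland", 1893, 1897),
    ("William McKinley", 1897, 1901),
    ("Theodore Roosevelt", 1901, 1909),
    ("William Howard Taft", 1909, 1913),
    ("Woodrow Wilson", 1913, 1921),
    ("Warren G. Harding", 1921, 1923),
    ("Calvin Coolidge", 1923, 1929),
    ("Herbert Hoover", 1929, 1933),
    ("Franklin D. Roosevelt", 1933, 1945),
    ("Harry S. Truman", 1945, 1953),
    ("Dwight D. Eisenhower", 1953, 1961),
    ("John F. Kennedy", 1961, 1963),
    ("Lyndon B. Johnson", 1963, 1969),
    ("Richard Nixon", 1969, 1974),
    ("Gerald Ford", 1974, 1977),
    ("Jimmy Carter", 1977, 1981),
    ("Ronald Reagan", 1981, 1989),
    ("George H. W. Bush", 1989, 1993),
    ("Bill Clinton", 1993, 2001),
]


def people_in_office(terms, start, end):
    """Return the set of distinct people with a term overlapping [start, end]."""
    return {name for name, began, ended in terms if began <= end and ended >= start}


presidents = people_in_office(TERMS, 1880, 2000)
overlapping_terms = [t for t in TERMS if t[1] <= 2000 and t[2] >= 1880]

print(len(overlapping_terms))  # 24 terms, counting Cleveland's two terms separately
print(len(presidents))         # 23 distinct people
```

The gap between terms and people is exactly the Grover Cleveland trap: the question asks how many people served, so the two Cleveland terms collapse into one entry.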

Example question:

In an opaque bag are chips with numbers from 1 to 20. If I draw three chips without replacing them, what is the probability that at least one of them has a number less than 16?

Three different gpt-3.5 answers to this question:

Column 1: Correct, albeit more complex than necessary, and containing some nonsense.

Column 2: Almost correct. Same strategy as the previous solution, except that it misses the last step: subtracting the probability from 1.

Column 3: Multiple mistakes.
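For reference, the shortest route to this answer is the complement: compute the probability that none of the three chips is below 16, then subtract it from 1. The sketch below assumes the bag holds exactly one chip per number from 1 to 20; the variable names are ours.

```python
from fractions import Fraction
from math import comb

total_chips = 20
chips_16_or_more = 5  # chips numbered 16, 17, 18, 19, 20
draws = 3

# Probability that all three drawn chips are numbered 16 or higher.
p_none_below_16 = Fraction(comb(chips_16_or_more, draws), comb(total_chips, draws))

# Complement: at least one drawn chip is numbered below 16.
p_at_least_one = 1 - p_none_below_16

print(p_none_below_16)   # 1/114
print(p_at_least_one)    # 113/114
```

Skipping the final subtraction, as in the second answer above, yields 1/114 instead of 113/114, which is the probability of the opposite event.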

Example question:

How many ways can a teacher arrange a group of 6 students into 6 seats in the front row if the class has 13 students?

Three different llama-2 answers to this question:

Column 1: Avoiding answering due to a supposed problematic element in the question.

Column 2: Mistaken logic.

Column 3: Avoiding answering due to a supposed inconsistency in the question.
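The wording of this question admits more than one reading, which may be part of why models hesitate. A quick sketch of the two most natural readings follows. Which reading matches the expected answer in the dataset is not stated here, so both are shown as assumptions.

```python
from math import factorial, perm

# Reading 1 (assumption): the teacher picks any 6 of the 13 students and
# seats them in order, i.e. ordered selections of 6 from 13.
choose_and_seat = perm(13, 6)

# Reading 2 (assumption): the group of 6 is already fixed, and only their
# order across the 6 seats varies.
seat_fixed_group = factorial(6)

print(choose_and_seat)   # 1235520
print(seat_fixed_group)  # 720
```

Under either reading the question has a well-defined numeric answer, so the supposed inconsistency is not a reason to decline.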

Example question:

How many prime ministers of Morocco were there in the 20th century?

Three different claude-2 answers to this question:

Column 1: Incorrect.

Column 2: Avoided answering.

Column 3: Correct.

Example question:

Which U.S. president succeeded the president who served the shortest term?

Three different Cohere Command answers to this question:

Column 1: Incorrect with no reasoning.

Column 2: Incorrect with reasoning.

Column 3: Correct.

What is the bottom line for businesses?

It is crucial to compare models on the specific task you are building for your business. This experiment demonstrates that there are significant risks in using LLMs without properly evaluating whether they can handle the types of tasks they are expected to handle. It also demonstrates that there are real differences in how LLM providers are preparing their models to answer challenging questions. For example, on certain domains it seems that claude-2 is better at recognizing its limits than either GPT model.

What work will Arthur be doing?

Arthur will be sharing discoveries about behavior differences and best practices with the public in our journey to make LLMs work for everyone.

The procedure we followed to manually evaluate the LLM responses in our dataset will be included as a scoring workflow in Arthur Bench, an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models.