Does Your LLM Do What You Ask It To Do?


Large language models (LLMs) have been running wild for the past year, and we’ve been hard at work providing products that enable our customers to leverage these in the most effective and performant way for their business and use cases. 

At Arthur, one of the things we spend a lot of time researching is hallucinations. A hallucination is when a generative AI model (in this case, an LLM) generates a response that is incorrect or made up. Technically speaking, this happens because LLMs don’t have a deep understanding of the material they’ve been trained on; instead, they use that data to generate output probabilistically. Occasionally, the generated output is completely made up but communicated in a way that seems plausible.

As our research and understanding of hallucinations have progressed, we noticed that hallucinations seem to be decreasing in prevalence in newer generations of frontier LLMs. Though model developers have by no means perfected the ability for their models to be entirely factual (it’s not clear such a thing is possible or well-defined), many top-of-the-line LLMs are now competent at remaining grounded against the context the user provides them. This way of using LLMs, now commonly referred to as RAG or grounded generation, has exploded in popularity among businesses looking to integrate generative AI with their proprietary data.

To explain why this matters, let’s pretend that you are a customer service representative looking to shorten time to resolution with an LLM-enabled chatbot. For this application, you want the chatbot to leverage your product’s documentation as well as previous support cases to help customers resolve issues faster with fewer personnel. To do this effectively, you set up an integration with ChatGPT so that your users can have a conversation with an AI Support Agent. When users interact with this Support Agent, you also pass in relevant documentation and resolved support tickets that are related to the conversation. This additional information is passed into the context window of the LLM request and allows the AI Support Agent to “understand” your product. It is important that the agent isn’t hallucinating when users interact with it, as this would mean the agent is giving the customer incorrect, or ungrounded, information.

To give a real-world example, recently Air Canada rolled out a customer service application that leveraged LLMs to respond to customer requests. A passenger had an interaction with this application where it gave incorrect information (i.e., hallucinated) regarding the airline’s bereavement policy.

The passenger claimed to have been misled on the airline’s rules for bereavement fares when the chatbot hallucinated an answer inconsistent with airline policy. The Tribunal in Canada’s small claims court found the passenger was right and awarded them $812.02 in damages and court fees. (Forbes)

In this iteration of our Generative Assessment Project (GAP), we evaluate how well popular LLMs use data in a RAG context according to our instructions of what to discuss and what not to discuss. Based on our experiment, we provide suggestions for application developers integrating GenAI into their workflows using RAG systems and LLMs on which models will perform best at using the provided context; this minimizes the risk that these models will hallucinate answers. 

The results from our experiment are available as a dataset on Hugging Face, and the experiment code is available in a GitHub repo.


We wanted to compare how good LLMs are at answering questions using a context. Doing this task well involves an inverse skill: recognizing when the necessary information to answer a question is absent, and choosing instead to not answer. One name for this is “staying grounded” in the context that you provide in your prompt to the LLM. 

If an LLM is perfectly grounded in its responses to our experiment, that means that the information it gives you can always be traced back to the context it was provided in its prompt (we provide the context we used across the board in our experiment, an excerpt from the Wikipedia page for Python, below). 

Whether or not an LLM is perfectly grounded in its response, we additionally check whether it was correct (either using our own background knowledge about the questions we were asking or doing some quick internet searching of our own). This way, we give credit to the models that have memorized lots of useful information from their training data, while also marking that they provided information that went outside the scope of their provided context despite the instruction in the prompt not to do so.

Questions & Context

The data consists of questions about the Python programming language. The context for each example is the same: the opening summary paragraphs and the History section from the Python Wikipedia page.

The questions were chosen such that roughly half of them are answerable given the context provided, and the rest are not. We instruct the LLMs to only answer questions if they are answerable, i.e. if the necessary information is included in this context.

Example of Answerable: What happened in January 2019 regarding the leadership of the Python project?

Example of Not Answerable: What frameworks and libraries are most commonly used with Python today, especially for web development and data science?

Figure 1: The Wikipedia page for Python, against which we instructed LLMs to stay grounded.


template = """You are a helpful AI assistant. You only answer questions about the context you are provided. Even if you know the answer, you NEVER provide information or answer any questions about information outside the 'context: ' the user gives you. context: question: Now, answer the question BRIEFLY based on the context. If the question is asking about something not in the context, politely explain that you CANNOT answer the question. """

Figure 2: The prompt template.
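A minimal sketch of how a template like the one in Figure 2 might be filled in for each question. The `{context}` and `{question}` placeholders, the `build_prompt` helper, and the sample strings are assumptions for illustration; the experiment code in the GitHub repo may do this differently.

```python
# Hypothetical version of the Figure 2 template with explicit placeholders.
# The placeholder names {context} and {question} are assumed, not taken
# from the experiment code.
TEMPLATE = (
    "You are a helpful AI assistant. You only answer questions about the context "
    "you are provided. Even if you know the answer, you NEVER provide information "
    "or answer any questions about information outside the 'context: ' the user "
    "gives you. context: {context} question: {question} Now, answer the question "
    "BRIEFLY based on the context. If the question is asking about something not "
    "in the context, politely explain that you CANNOT answer the question. "
)


def build_prompt(context: str, question: str) -> str:
    """Insert one context passage and one question into the template."""
    return TEMPLATE.format(context=context, question=question)


# Illustrative inputs only; the real context is the Wikipedia excerpt in Figure 1.
example_prompt = build_prompt(
    context="Python was conceived in the late 1980s by Guido van Rossum.",
    question="Who conceived Python?",
)
```

Because the same context is reused for every question, only the `question` argument would change from one call to the next.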


While LLM-assisted evaluation is increasingly popular and time-saving, we decided it would be particularly valuable to grade answers manually, since doing so will help us measure the calibration of future LLM-assisted evaluators against our human judgment.

Each LLM answer was graded as either correct or incorrect. This was a personal judgment of whether the LLM correctly understood the question and gave true information about the Python programming language. Most of the time, this was straightforward with prior knowledge of Python, but it occasionally required double-checking with internet searches and carefully rereading the Python Wikipedia page to know whether the LLMs’ statements were factual.

Each LLM answer was also graded as either grounded or ungrounded. This was based on checking whether the LLM mentioned any information that was not also mentioned in the provided context from Wikipedia. Therefore, mentioning true information about Python (e.g. its relation to C) would be graded as ungrounded if that information was not included in the context we provided to the LLMs in our prompts.

You can view every grade we assigned in the dataset we uploaded to Hugging Face.
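The two grades can be recorded independently for each answer and then aggregated per model. The sketch below assumes a hypothetical `Grade` record and made-up grades for two placeholder models; it is not the schema of the published dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    """One manually graded answer (hypothetical record, not the dataset schema)."""
    model: str
    correct: bool   # true information and a correct reading of the question
    grounded: bool  # nothing mentioned beyond the provided context


def score_by_model(grades: list[Grade]) -> dict[str, dict[str, float]]:
    """Return the fraction of correct and of grounded answers for each model."""
    by_model: dict[str, list[Grade]] = defaultdict(list)
    for g in grades:
        by_model[g.model].append(g)
    return {
        model: {
            "correctness": sum(g.correct for g in rows) / len(rows),
            "groundedness": sum(g.grounded for g in rows) / len(rows),
        }
        for model, rows in by_model.items()
    }


# Made-up grades for illustration. A true fact from outside the context
# is graded correct but ungrounded.
grades = [
    Grade("model-a", correct=True, grounded=True),
    Grade("model-a", correct=True, grounded=False),
    Grade("model-b", correct=True, grounded=True),
    Grade("model-b", correct=False, grounded=True),
]
scores = score_by_model(grades)
```

Keeping the two booleans separate is what lets an answer count as correct while still being marked ungrounded.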


Figure 3: Measuring changes in correctness & groundedness on our custom benchmark.

No model scored 100% on the correctness metric: each model had at least one instance of reporting incorrect information about Python or of responding in a way that misread the intent behind the question. Llama-3-70b got the most questions correct, with 52 out of 53. Grading correctness accurately required us to read each model’s answers carefully to spot subtle mistakes; for instance, Llama-3-8b claimed that improved error reporting was introduced in Python 3.12, when it was actually introduced in Python 3.11.

Though no models were entirely correct, several models scored 100% on the groundedness metric: Llama-3-70b, both Claude models, gpt-4-turbo-1106, and gpt-4-turbo-0429. 

This is by no means the final word on these models—further experimentation with different choices of context and answerable/unanswerable questions may reveal different trends from the ones above.


To put these results in the context of how these LLMs might be deployed, it helps to get a rough sense of the relative cost of one model versus another.

We ran our experiments using the OpenAI SDK to call the GPT models, and we used AWS Bedrock to prompt all the other models. Below, we have collected the price per thousand tokens (which differs from model to model, and between inputs/prompts and outputs/responses) from the OpenAI pricing site and the AWS Bedrock pricing site.

These price rates do not reflect the exact cost of running each model: the way text is processed into tokens can be very different from model to model, and the amount of text that models tend to output can vary as well, both of which can have a drastic impact on cost.

So, putting aside the complexities of true end-to-end system cost for now, we can get an approximate comparison of model affordability by simply viewing the token price rate between the offerings from OpenAI and AWS Bedrock.

Figure 4: LLMs sorted by price per token for inputs (prompts) and outputs (LLM responses).

However, we can do even better. We calculate the actual experiment cost for each model as a function of its token pricing, the actual volume of output it generated (which ranged from a couple of words to entire paragraphs), and the specific tokenizer each LLM uses for processing text. These tokenizer choices may not exactly match the procedure used by every model version tested, but we used the following to estimate our experiment cost: the Mistral-specific tokenizer for the Mistral models, the Command-R-specific tokenizer for the Cohere models, the Llama-specific tokenizer for the Llama models, and tiktoken for all the OpenAI and Anthropic models.
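As a rough sketch of this calculation, the cost of one call can be estimated from the token counts of the prompt and response and the per-thousand-token prices. The `count_tokens` whitespace splitter below is only a stand-in for a model-specific tokenizer, and the prices in the usage lines are placeholders, not quotes from either pricing page.

```python
from typing import Callable, Iterable


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words.
    A real estimate would use the tokenizer matched to each model."""
    return len(text.split())


def call_cost(
    prompt: str,
    response: str,
    input_price_per_1k: float,
    output_price_per_1k: float,
    tokenizer: Callable[[str], int] = count_tokens,
) -> float:
    """Estimated cost of one prompt/response pair, in the currency of the prices."""
    input_cost = tokenizer(prompt) / 1000 * input_price_per_1k
    output_cost = tokenizer(response) / 1000 * output_price_per_1k
    return input_cost + output_cost


def experiment_cost(
    pairs: Iterable[tuple[str, str]],
    input_price_per_1k: float,
    output_price_per_1k: float,
    tokenizer: Callable[[str], int] = count_tokens,
) -> float:
    """Sum the estimated cost over every (prompt, response) pair in a run."""
    return sum(
        call_cost(p, r, input_price_per_1k, output_price_per_1k, tokenizer)
        for p, r in pairs
    )


# Placeholder prices and texts: the same prompt with a brief vs. a verbose answer.
prompt = "word " * 1000
brief = experiment_cost([(prompt, "yes")], 1.0, 2.0)
verbose = experiment_cost([(prompt, "word " * 500)], 1.0, 2.0)
```

With identical prices, the verbose answer costs roughly twice as much as the brief one here, which is the kind of difference a per-token price table alone does not show.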

Figure 5: LLMs sorted by the actual estimated cost of running our experiments.

Now, the cost-effectiveness of Claude-3 Haiku is even more apparent. Even though its token rates are higher than those of, say, Llama-3-8b, it uses its tokens more efficiently. If you look at the actual outputs produced by Llama-3-8b, you will see that it often produced far more tokens than necessary to answer the questions. This over-answering can add a lot of cost to an experiment, and it is not captured by token prices alone. The end-to-end cost of an experiment is a much better signal of the efficiency of your entire pipeline of text processing, prompting, LLM effectiveness, output parsing, etc.

Evaluating the cost of LLMs in real-experiment terms like Figure 5 (as opposed to abstract token-rate terms like Figure 4) will become more important for enterprises over the coming years, especially those with a user base speaking many of the world’s languages. Notably, the new GPT-4o model from OpenAI has undergone major improvements in the cost-effectiveness of its tokenizer, with a focus on efficiently processing the world’s most widely spoken languages. This means this new model, and those that use the same tokenizer, will need fewer tokens to express ideas in those languages relative to models using the previous tokenizer.

When comparing LLMs in the coming years, especially across languages, the generic cost per token will not tell you how briefly and correctly your LLM does what you ask of it. Measuring the full cost of your experiment will give you a better view of how ready your system is for reliable and affordable handling of natural language.


One of the things that we believe at Arthur is that there is no one-size-fits-all model for every use case. Application integrators and users should determine the requirements necessary for their use case when picking a model and then evaluate which model best fits their needs. 

We can now make some rough judgments that take into account all of these factors: 

  • Performance measured by the capacity to retain information for this particular benchmark
  • Performance measured by the capacity to follow our instructions to stay grounded to context
  • Cost measured by the experiment cost formula for the inputs/outputs on this particular benchmark
  • Accessibility, whether a model is open-access (released weights) or closed-access (API only)

Based on the results of our experiments, Arthur makes the following recommendations for application integrators and users:

Overall Best: Anthropic’s Claude v3 Haiku

  • The Haiku model makes the overall best choice due to its balance of ease-of-use, cost, and performance.
  • Haiku is available for use on AWS Bedrock, which makes it incredibly easy to get started with minimal setup.
  • Haiku is more than an order of magnitude cheaper than other large foundation models with similar performance (about 20x–40x cheaper per token than gpt-4-turbo-0429).

Best Open-Access Models

  • Llama-3-70b and Cohere’s command-R-plus will both make sense for enterprise deployments. Llama-3-70b scored higher on factual correctness and on following the instruction to remain grounded to the context of the prompt, but this was just a single domain (the Wikipedia page for Python), so results could easily change on different domains.

Below is a deeper dive into the results of our experiments.

General Trend: Larger + Later Foundation Models (GPT-4, Claude Sonnet, Command-R-Plus, Llama V3) Perform Better

The consistent trend in model development is that models tend to get more correct and better at following instructions as models get bigger and as time progresses (since companies presumably are improving their training datasets, fine-tuning datasets, etc.). 

The trend is certainly visible within the models available on AWS Bedrock: moving “up” a model series within a provider tends to improve results, for example moving from cmd-R to cmd-R-plus, from Llama-3-8b to Llama-3-70b, or from Mistral’s open-weights models to their API-only Large model.

Open-Weight Excellence from Command-R-Plus and Llama-3-70b

Cohere’s command-R-plus and Meta’s Llama-3-70b did very well on this assessment. These models have had their weights released online, so they are free to use if you already manage your own infrastructure. For this experiment, we used them via AWS Bedrock, so we still paid an API cost in a price range similar to Claude-3-sonnet.

Claude-3-Haiku: API-Only, But Very Good for Its Cost

Take a look at where claude-3-haiku sits relative to gpt-4-turbo, as well as command-R-plus and Llama-3-70b. The price per token of claude-3-haiku via AWS Bedrock is considerably lower than that of command-R-plus and Llama-3-70b, and it is about 20x–40x cheaper than gpt-4-turbo-0429. Certainly there will be tasks, like Python code generation, that gpt-4-turbo-0429 is better equipped for than claude-3-haiku, but for tasks that involve text comprehension with the added requirement of groundedness when you instruct for it, claude-3-haiku looks extremely capable and affordable.

OpenAI gpt-3.5-turbo-0125: A Regression in Groundedness

The only model series that regressed in groundedness is the gpt-3.5-turbo series. We don’t have any direct information about what happened between 1106 and 0125 to cause this regression, but we do know that OpenAI lowered the cost of the LLM, so it is possible that reductions in parameter count, parameter precision, or compute spent per predicted token contributed to worse groundedness despite improved correctness.

Figure 6: gpt-3.5-turbo pricing history.

The trend of this series, just judging by the price points OpenAI has decided to go with, may indicate a fundamental strategic emphasis on cost reduction over time. This has led to many benefits for customers, including us at Arthur. But given what we have observed in this experiment, we believe unanticipated changes in the distribution of upstream LLM behavior demand that proper testing be put in place to handle the friction of model upgrades.
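One way such a test could look is a simple gate that reruns a fixed benchmark against the new model version and flags any metric that drops by more than a tolerance. The function name, the tolerance, and the scores below are hypothetical; they are not the results of this experiment.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each metric whose score dropped by more than `tolerance`,
    mapped to the size of the drop. A metric missing from `candidate`
    is treated as a score of 0.0."""
    regressions = {}
    for metric, old_score in baseline.items():
        drop = old_score - candidate.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores for two versions of the same model series:
# correctness improves while groundedness gets worse.
old_version = {"correctness": 0.90, "groundedness": 0.95}
new_version = {"correctness": 0.93, "groundedness": 0.85}

regressions = find_regressions(old_version, new_version)
```

Checking each metric separately matters here, since an upgrade can improve correctness while still regressing on groundedness.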