The Ultimate Guide to LLM Experimentation and Development in 2024

Frontier research labs and massive tech companies are currently competing to produce the best-performing and most popular AI models. Necessity being the mother of invention, rapidly improving AI, and the need to make use of it, has brought a flood of new tools and tactics that let both individual developers and enterprise businesses rebuild existing workflows more efficiently while introducing advanced new capabilities to their tech stacks.

During this time at Arthur, we have been working to build best-in-class, enterprise-grade products that enable our customers to experiment, develop, govern, and monitor their AI applications—including ones which leverage generative AI. There have been a lot of changes in the industry, ecosystem, tooling, and deployment over this time, both in what we’ve observed while building our products as well as what we’ve seen in interacting with our customers. 

In this article (and in the webinar we hosted last week), we highlight the major shifts we’ve seen over the last six months and explain how administrators, developers, and users can leverage tools, processes, and workflows to better get value out of LLMs.

General Themes

Open Catching Up to Closed?

The largest models from the labs with the quickest and strongest start on the research and scaling front (OpenAI’s GPT-4 and Anthropic’s Claude-3) are still better across most tasks that involve wide-ranging information and nuanced instruction-following. Commonly-used benchmarks suggest that open-weight LLMs like Llama-3-70b, Mixtral-8x7b, DBRX, and Command-R-Plus are well on their way to catching up to the frontier. For example, their scores on MMLU roughly reach the scores achieved by Google and Anthropic’s models in 2023.

Figure 1: Comparing MMLU scores over time, from Nathan Lambert’s Interconnects.

With platforms like AWS Bedrock and Hugging Face, picking up a new language model for LLM experimentation could not be easier, and the increase in performance for open-weight models makes it easier for application developers to pick the model that is best suited for their task. 

Small Catching Up to Big?

Small language models (SLMs) show promise as an efficient alternative for tasks narrow enough not to require the top LLMs. Rather than attempting to reflect the entire internet and the knowledge of all humanity, smaller, more modest language models trained on specific domains of text and tasks can be a more controllable and computationally feasible way to run AI on your own hardware.

Figure 2: Phi-3 (~4 billion parameters) running directly on an iPhone. The text in the right image took ~10 seconds to generate.

Improvements in the training of small models, as well as improvements in the software and computation running the matrix math for the models, have enabled small language models to run directly on mobile hardware, which will probably be an important trend in coming years.

GPT-4 Still Shows Its Dominance for Particularly Quantitative Tasks

We prompted different large-scale models to compute a relatively complex mathematical operation: a distance function (the Jensen-Shannon divergence) between two probability distributions.

instruction = """ Produce python code that generates two probability distributions, both gaussians but with different means & standard deviations. The first distribution should be N(0, 1) and the second distribution should be N(1, 2) Write python code that calculates the jensen-shannon divergence between the two distributions Print your answer in the format "Solution: ". Do not give me code to execute on my own, I want you to execute the code and return the value """

GPT-4 did this correctly while the other models made mistakes. Either they implemented the formula incorrectly, or they did not properly take into account the spatial information of where the distributions sit along a number line.

Figure 3: (top-left) GPT-4 gets an A on this task. (top-right) Claude-3-Haiku gets an A-; it writes correct code for the formula but does not sample widely enough. (bottom-left) Llama-3-70b gets a B-; it made an error in the formula, and also didn’t sample widely enough. (bottom-right) DBRX gets a C+ for hallucinating a function from scikit-learn that doesn’t exist.
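For reference, a correct solution can be sketched in a few lines of NumPy. The grid bounds and step count below are our own choices, but sampling widely enough to cover the tails of both distributions is exactly the step several models fumbled:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def js_divergence(p, q):
    # Discretized JSD: 0.5 * KL(p || m) + 0.5 * KL(q || m), where m is the mixture
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Sample widely enough to cover the tails of both distributions;
# under-sampling here was one of the mistakes the weaker models made
x = np.linspace(-10, 10, 10_001)
p = gaussian_pdf(x, 0, 1)  # N(0, 1)
q = gaussian_pdf(x, 1, 2)  # N(1, 2)
print(f"Solution: {js_divergence(p, q):.4f}")
```

The result is in nats and bounded above by ln 2, a useful sanity check on any model's answer.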

Overall, we see the general performance of these large foundation models catching up with OpenAI’s GPT models across a handful of dimensions, but it’s clear that GPT-class models are still state-of-the-art when it comes to particularly nuanced and complex tasks. 

How to Get Started Experimenting and Developing

Truly knowing whether you are achieving positive ROI, whether from big models, small models, closed models, or open models, is still a bit of a Wild West at this point in time. This largely stems from a lack of clear, objective criteria that define quality LLM performance at more subjective tasks like summarization and question answering.

But the tools and techniques to get started with LLM experimentation are very easy to use, and getting easier all the time. We will cover a suite of these in three basic categories:

  • Touchpoints: Quick, minimal LLM experimentation interfaces
  • Evaluation: Metrics and relevant benchmark datasets
  • Enhancing Prompts: RAG, APIs, and well-chosen examples for your LLM to see how it’s done

The touchpoints, evaluation methods, and prompt enhancements we highlight below are suggestions for quick LLM experimentation, and do not represent official product endorsements or specific claims about what the affiliated companies or individuals can provide for your business. Rather, we want to highlight various projects that demonstrate what we think are steps in the right direction towards a world of productive AI tools that facilitate steering and understanding for the common practitioner.


Touchpoints

Touchpoints are the interfaces that route between you and the LLMs you are using. You want these to be minimal, easy to set up, and fast to run. If you are early in your development, you probably don't want them to introduce too many unnecessary features; you may be better served by the touchpoints that are simplest and stay out of your way, so that your process can be flexible and fluid.


LiteLLM

If you want to start your LLM experimentation by testing some of the major providers, LiteLLM is a good client library. It maintains a common format for your LLM inputs, so swapping between providers is nearly frictionless, and the same code can easily support a host of LLM choices.

Figure 4: The simple client from LiteLLM allows for painless swapping between providers.

Other popular but heavier LLM clients, such as LangChain, may introduce friction when performing these kinds of swaps: the size of the library and the layers of Python classes/objects you must navigate can obscure how parameters are handled. LiteLLM, in contrast, keeps things minimal enough that parameter handling stays transparent.
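As a rough sketch of what this looks like in practice (the model identifier strings are illustrative and vary by provider, and API keys are assumed to be set as environment variables):

```python
MODELS = ["gpt-4", "claude-3-haiku-20240307", "command-r-plus"]  # illustrative names

def build_messages(prompt: str) -> list:
    # LiteLLM accepts the OpenAI chat format for every provider it supports
    return [{"role": "user", "content": prompt}]

def ask(model: str, prompt: str) -> str:
    from litellm import completion  # lazy import; pip install litellm
    # The same call works regardless of which provider hosts the model
    response = completion(model=model, messages=build_messages(prompt))
    return response.choices[0].message.content

if __name__ == "__main__":
    for model in MODELS:
        print(model, "->", ask(model, "Name one prime number."))
```

Swapping providers is just a matter of changing the model string; nothing else in the call site moves.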


Ollama

Ollama makes experimenting with open-source models easy, with a git-like CLI to fetch all the latest models (at various levels of quantization, so you can run them quickly from a laptop) and prompt them from the terminal.

Figure 5: With Ollama installed on my computer, all the work needed to use new LLMs from my terminal is `ollama pull` & `ollama run`.

Ollama also spins up a local API so that you can call your LLM from other applications. That means any application you write in Python, Javascript, or any other language can get responses from an API that you yourself are running via your own computer as the server.
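Because Ollama exposes its local API over plain HTTP, calling it needs nothing beyond the standard library. A minimal sketch, assuming the default endpoint and a model you have already pulled:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama to return the whole response as one JSON object
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    data = json.dumps(build_payload(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("llama3", "Why is the sky blue?"))
```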

Hugging Face Transformers

The transformers Python package from Hugging Face has likely been the most popular and highly-used touchpoint in LLM application development over the past few years. The broad global ecosystem of open-source developers sharing models and datasets on the Hugging Face platform makes transformers the default go-to library used in many applications and tutorials in LLM development and NLP more broadly.

The work that the Hugging Face team has put in to make free AI more accessible and understandable for the entire world is an extraordinary set of achievements. However, this does not mean that it will always be the optimal code to run for a given choice of LLM and hardware. We think of Hugging Face more as a starting touchpoint for broad access across the ecosystem rather than a go-to touchpoint for fast responses, since newer libraries have implemented more hardware-efficient software to run LLMs.
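For completeness, here is roughly what the transformers touchpoint looks like; the model name is just an assumption, and any causal LM from the Hub would do:

```python
GEN_KWARGS = {"max_new_tokens": 64, "do_sample": False}  # deterministic, short outputs

def strip_prompt(generated: str, prompt: str) -> str:
    # text-generation pipelines echo the prompt by default; remove it for clean output
    return generated[len(prompt):] if generated.startswith(prompt) else generated

def generate(prompt: str, model_name: str = "microsoft/Phi-3-mini-4k-instruct") -> str:
    from transformers import pipeline  # lazy import; downloads weights on first use
    pipe = pipeline("text-generation", model=model_name)
    return strip_prompt(pipe(prompt, **GEN_KWARGS)[0]["generated_text"], prompt)
```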


MLX

MLX is an open-source Apple project that recreates a PyTorch-like Python package from the ground up for running and training standard and state-of-the-art AI models efficiently on Apple hardware. The project demonstrates that a willingness to rewrite existing standard modules from scratch with hardware-awareness can bring massive improvements to the speed and memory-efficiency of the matrix math that underpins LLMs and all of modern AI.

Figure 6: (left) Prompting Genstruct via transformers. (right) Same LLM & prompt in MLX, roughly the same amount of code, but significantly faster.

More Touchpoints to Explore

Many wonderful projects are focusing on creating the fastest touchpoints possible for LLMs by focusing on making better use of CPUs, GPUs, and thinking through hardware-software integration from the ground up. Llama.cpp is a project that seeks to get maximum performance for running LLMs on CPUs. vLLM is a project that is making more efficient use of GPU memory and parallel processing. Groq is a company that stands out via their patented LPUs (language processing units) and their leveraging of functional programming in the backend.

It is hard to say exactly which touchpoint will be most successful in the future for LLM application developers, because the context of your application development cycle, the hardware you have available, and the nature of your experimentation needs will dictate which option makes the most sense for you. Regardless of which path you take, we hope the overview of touchpoints above gives a quick sense of how to get started running LLMs quickly, easily, and efficiently with the hardware you already have.


Evaluation

The first few rounds of informal experimentation can reveal a lot about the strengths and weaknesses of different LLMs. Many researchers in the industry actively extol the virtues of evaluating models by just trying out chats and performing a "vibe check."

By attempting to over-quantify things and leave evaluation entirely to testing software, developers can sometimes miss the qualitative traits that make certain LLMs stand out, whether for linguistic flourish or entertainingly strange patterns of association.

However, without some quantitative evaluation pipelines in place, it will be hard to know during the development of real applications when one LLM is truly the better choice over another. 

Types of Metrics

There are roughly three general categories of LLM evaluation we have seen emerge in popular usage:

  1. Metrics that strictly evaluate the exact correctness of an LLM's response
  2. Metrics that measure the distance of an LLM’s response from some ground truth / reference / “golden” response
  3. Heuristics that determine the quality of LLM responses without a reference output

Evaluating LLMs using multiple-choice benchmarks like MMLU falls into the first category: the LLM is prompted with a description of a scenario and four possible answers associated with the symbols A, B, C, & D. The LLM then outputs a single token, which will either be the token for the correct answer (A, B, C, or D) or some irrelevant token that gets marked as incorrect.
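A toy scorer for this first category might look like the following; real harnesses are more careful (e.g. comparing token probabilities), but the shape is the same:

```python
def exact_match_accuracy(predictions, answers):
    # Category 1: a response earns credit only if it is exactly the correct letter
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# An off-list token like "E" (or any rambling) is simply marked wrong
print(exact_match_accuracy(["A", "B", "E"], ["A", "B", "C"]))  # 2 of 3 correct
```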

Evaluating LLMs using embeddings or BERTScore is an example of the second category: the LLM outputs a string that is measured against a target string; commonly a score of 1 indicates exact equivalence, whereas 0 indicates complete irrelevance.
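A minimal stand-in for this second category is token-overlap F1. BERTScore itself compares contextual embeddings, but this simpler proxy shows the shape of a reference-based metric:

```python
def token_f1(candidate: str, reference: str) -> float:
    # Category 2: partial credit for overlapping tokens (sets, so repeats are ignored)
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0, a completely unrelated response scores 0.0, and partial overlap lands in between.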

Finally, anything ranging from the highly formal (like aggregating A/B tests via Elo scores, discussed below) to the highly informal (like just judging the vibe and deciding quality based on what you see in the outputs) can be examples of the third category.

Figure 7: Simplified comparison of evaluating via exact correctness (top), partial correctness (middle), or quality-sans-correctness (bottom).


Elo Rankings

Elo (named for the Hungarian-American physicist and chess master Arpad Elo) has become a very popular general-purpose metric of the third kind, used to quantitatively rank LLMs based on their success in head-to-head blind A/B preference tests.

The LMSYS Chatbot Arena facilitates ongoing online blind A/B tests to compile a public Elo leaderboard for the public to compare the top LLMs at the common task of giving the better of two responses to a user’s prompt.
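The Elo update behind such a leaderboard is only a few lines; this is the standard chess formula, with an illustrative K-factor of 32:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a is 1.0 if A won the A/B vote, 0.0 if B won, 0.5 for a tie
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b
```

Upsets against higher-rated models move ratings more than expected wins, which is what lets the leaderboard converge as votes accumulate.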

Elo has its pitfalls as a measurement tool. Online user voting can be swayed by superficial features, such as which of two LLMs wrote the longer response; when people pick the better of two written responses, they can misattribute a response's length and opacity for depth and quality. But applied across a diverse enough set of participants and prompt domains, Elo can effectively reveal the frontier of models, like Claude-3-Haiku, that give sufficiently good answers relative to the top LLMs while staying cost-effective.

Figure 8: Plot giving a rough sense of the cost-effectiveness of Claude-3-Haiku vs LLMs, measured by token pricing & Elo.

Summarization Scoring & LLM-as-Judge in Arthur Bench

In this example notebook from the Arthur Bench repo, an open-source framework for evaluating LLMs, we show how to use LLMs as a judge to facilitate a quantitative comparison between LLMs available on AWS Bedrock at summarizing some news articles. 

Using LLMs as a judge can sometimes introduce biases into your evaluation. However, it can be a quick initial signal before you dive deeper yourself into more manual and reliable forms of evaluation.

Evaluation Data

What examples are you testing the model on? Do they reflect your intended use case? This is the real bottleneck you need to address before you can even get value from your metrics. If you set up a good curation loop, this can be seamlessly integrated into how you are experimenting!

This blog post by Hamel Husain discusses how important it is to focus on determining the right evaluation dataset & metrics for your application, and how determining the makeup of these datasets can be harmoniously integrated into your workflows for experimentation and application development cycles.

Figure 9: Hamel Husain’s depiction of the virtuous cycle of data curation integrated into experimentation.

Taste-Driven Experimentation

One trend we think is important is that benchmarking is becoming more taste-driven by people who know exactly what they want to be doing with LLMs. For example, DeepMind’s Nicolas Carlini developed his own personal repo to benchmark LLMs against a bunch of coding tasks he was already trying to use them for.

Figure 10: Examples of tasks & cross-model performance results on Nicolas Carlini’s personal coding benchmark.

Enhancing Prompts

With access to LLMs set up and some decisions in place around your evaluation data & metrics, you will be well prepared to see both qualitatively and quantitatively how much better your LLM responses can get if you maximize how much helpful information you can include in your prompts.

RAG (Retrieval-Augmented Generation)

This is a common design pattern you may already be familiar with, since many businesses have seen the value of RAG for bringing their company's specific contextual data directly into their LLM prompts without the LLMs ever needing to have been trained on that data. RAG is all about deciding what helpful information you can fetch before you prompt an LLM, whether from an online source like a public weather API or an internal datasource like your enterprise's data lake.

RAG does not take much to program into your LLM application: it typically requires a few lines of code beyond whatever existing code you have in place to send your prompts to an LLM, using the user’s recent query and existing chat history to fetch relevant context. 

One simple enhancement that we have seen work well across many RAG applications is to allow for a flexible number of rounds of retrieval to take place so that the LLM can evaluate whether it has yet received enough context to answer! 
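Framework aside, this flexible-rounds pattern can be sketched in plain Python. Here `retrieve` and `llm` are hypothetical stand-ins for your retriever and LLM client:

```python
def multi_round_rag(question, retrieve, llm, max_rounds=3):
    # retrieve(query) -> list of passages; llm(prompt) -> string response
    context = []
    query = question
    for _ in range(max_rounds):
        context.extend(retrieve(query))
        verdict = llm(
            f"Context: {context}\nQuestion: {question}\n"
            "Reply ENOUGH if the context suffices to answer; "
            "otherwise reply with a better search query."
        )
        if verdict.strip() == "ENOUGH":
            break
        query = verdict  # let the LLM steer the next retrieval round
    return llm(f"Context: {context}\nAnswer the question: {question}")
```

The key design choice is that the LLM itself decides when retrieval can stop, capped by `max_rounds` so the loop always terminates.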

Here is how this looks in DSPy, a library which introduces some extra complexity for the simple experimenter but has some great features we discuss later on. We only show the skeleton for this code and don’t specify how we prompt the LLM for writing intermediate queries & writing the final answer, since we want to show the modularity of a RAG workflow independent of how we query the database and how we specifically prompt the LLM to address the user.

Figure 11: Simple schema for multiple rounds of RAG alongside the skeleton of its corresponding DSPy implementation. 

Tools & Agents

Getting an LLM to write responses that we can plug directly into something else unlocks many opportunities, and this is mainly what people mean by the terms “tool” and “agent.” This simple LangChain blog post highlights the basic properties of bringing tools and agentic loops to your LLM applications.

A tool for an LLM is just a specific template for it to fill outputs into. The simplest tools are things like public weather APIs; an LLM uses a tool like that by writing a response that compiles to valid JSON indicating which inputs it is plugging into the API, like {"property": "temperature", "city": "New York City"} (as opposed to writing a loose, free-form string like "What's the temperature in NYC?").

Figure 12: LLM tools & agentic workflows simply look like aligning responses with API formats & looping outputs into subsequent inputs.
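A sketch of the dispatch side, with a hypothetical stand-in for a real weather API: the model's JSON response is parsed and plugged straight into the function, and loose prose fails loudly at the parsing step.

```python
import json

def weather_api(prop: str, city: str) -> str:
    # Hypothetical stand-in for a real weather API call
    return f"{prop} in {city}: 65F"

def dispatch_tool_call(llm_output: str) -> str:
    # json.loads raises if the model wrote loose prose instead of valid JSON
    args = json.loads(llm_output)
    return weather_api(args["property"], args["city"])
```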

Some LLMs Use Tools Better Than Others

To effectively use tools like calculators, search engines, weather APIs, and more, LLMs need to be explicitly trained on a multitude of examples of how to write responses that work as inputs to the tools.

The Hugging Face model card for Cohere's Command-R-Plus walks through how to assemble messages into the different template forms their models were trained to recognize so that they can reliably write valid inputs to tools. One prompt template instructs the model to output tool-usage queries in JSON: this produces the "Action: ```json ..." string seen in Figure 13 on the left. A second prompt template then instructs the model, after retrieval is complete and context is assembled, to write responses with in-text citations against what was retrieved from the web.

Figure 13: (left) Response using an internet search API tool. (right) Response with in-text citations.

Other LLMs may be able to write queries and responses of similar conceptual or linguistic quality, but getting LLM responses to reliably satisfy the JSON format or reliably stay grounded with in-text citations does not come out of the box with most LLMs. The fact that Command-R-Plus can perform these different types of actions reliably in response to different prompt templates is a testament to the careful fine-tuning the Cohere team has done in training their models to achieve more properties than factual correctness alone!

Constraining & Structuring Outputs

You will sometimes want to make sure outputs follow a certain structure even if you aren't using "tools." Sometimes your constraint could be as simple as considering a small list of possible outputs and having the LLM pick from the list. Or you won't need to plug your LLM responses into external tools, but you merely want them to have specific properties, like corresponding to certain data types you have in mind. We think of this necessity, for LLM responses of a specific type, as object-oriented prompting. Outlines is a Python library that is awesome for this: it guarantees that LLM outputs follow whatever format you need, as long as that format is expressible as a formal grammar.

Figure 14: (top) Constraining an LLM to output a choice from a list. (bottom) Constraining an LLM to output an object from a custom class.

Providing Examples

RAG, tools, and structured outputs make up a lot of the progress in bringing new capabilities and enhanced reliability to the LLM externally. But what about unlocking the capabilities latent within the LLM weights internally by improving the prompt we include and how we demonstrate the job we want done?

“Chain-of-thought” prompting refers to the simple trick of simulating metacognition to get better LLM outputs: that is, including the kind of wording and phrasing that might show up on the blog of a particularly encouraging writing tutor. Experimental evidence suggests that these simple phrases can move the needle on getting LLMs to score better on benchmarks.

But we can do better than chain-of-thought prompting by explicitly giving example input/output pairs of how we want a job to get done when we prompt LLMs. If you give a model enough examples of doing a job, even if it was never trained on that job, it may be able to learn from your examples and do that job in the manner you demonstrated. “In-context learning” and “few-shot prompting” are both terms people are using for this. “Monkey see monkey do” is a much older proverb that similarly attempts to capture the power of demonstrations at teaching by simply showing.

Figure 15: Prompt examples can guide the LLM output in a desired direction (here, solving a problem with intermediate equations explicitly written).
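Mechanically, a few-shot prompt is just string assembly; a minimal, hypothetical builder might look like:

```python
def few_shot_prompt(examples, query):
    # examples: list of (input, output) pairs demonstrating the task
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    # End with the new query and an open "Output:" slot for the LLM to complete
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)
```

The trailing open "Output:" is what invites the model to continue the pattern your examples established.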

Including lots and lots of examples in your prompts can help, but there is a limit to how useful this can be relative to how much input your model can receive. The "context window" refers to the maximum number of tokens a model can allocate attention to at any one moment, and filling a context window to its maximum can sometimes hurt more than it helps due to the increased complexity of your generation.

But some new LLMs like Google DeepMind’s Gemini series have such long contexts—millions of tokens can be processed together at a given moment—that it unlocks yet-unexplored potentials for in-context learning.

Figure 16: Prompt examples can help a lot, and LLMs like Google DeepMind’s Gemini can receive hundreds or thousands of examples.

Auto-Choosing Examples to Provide

Coming up with the best examples to fit in the prompt is tedious, especially when you are considering many permutations and combinations of dozens of possible examples you could include and endlessly tweak the wording of (and especially when you are trying hundreds or thousands of examples, like Google DeepMind did).

Jason Liu, developer of the instructor library, has an excellent blog post on how to go about selecting these examples in an automated way with minimal complexity.

DSPy is a more complex approach, which may not be ideal for early experimentation but shows promise as a general purpose framework for building LLM pipelines that can auto-adapt to changes to any one particular node of the pipeline. If you make changes like switching up your LLM providers or retrievers, you can re-compile your program with DSPy to learn which examples work best for your new pipeline and auto-write additional text to nudge the prompts in the direction of better benchmark performance. It was designed to be analogous to PyTorch: every time the LLM, retriever, evaluation criteria, or anything else is modified, DSPy can re-optimize a new set of prompts and examples that max out your evaluation criteria.

Figure 17: Code excerpt from the DSPy documentation with a PyTorch-like pipeline & optimization paradigm.

The Risks of Learning From Examples

The ability for models to learn from examples presents risks too. Anthropic’s recent project on many-shot jailbreaking demonstrates the risk of applications that give the user too much leeway to override the model’s safety training. Even with the impressive advances in what seem like quantum leaps in advanced reasoning from AI over the past year, these algorithms fall prey to spurious correlations and are therefore susceptible to malicious steering. Due to this risk, robust security measures and strict monitoring of your system’s inputs and outputs are still of paramount importance.

Figure 18: Many-shot jailbreaking overrides a model’s safety training by mere demonstration of falling victim to malicious prompts.


Conclusion

There is a need for a more principled, scientific, and repeatable way to pick the right LLM and the right tools for the job. Perhaps this will never be easy and automatic, and a level of flexibility and ad-hoc artistry will always be necessary to decide which patchwork of features best serves an application's needs. But at Arthur, we believe that for AI to really work for people, they need control over steering the behavior of their AI applications. To get where they need to go, developers and designers need a combination of fast iteration speeds, clearly-defined success criteria, data relevant to their use case, and the flexibility to integrate whatever combination of tools optimizes their workflows for their success criteria.

It can be daunting to keep up with the pace of AI development, and we hope that through the collection of exciting tools and techniques in this piece you have learned something new or seen a path to experimentation you may not yet have been able to feasibly start. As enterprises rebuild their workflows and rethink how they plan to get value out of their own data, it will be crucial to ensure teams don't reinvent the wheel, and instead make the most of the simple and powerful tools flourishing in the open-source community.