The Most Robust Way to Evaluate LLMs
Arthur Bench helps teams evaluate the available LLM options quickly, easily, and consistently.


“LLMs are one of the most disruptive technologies since the advent of the Internet. Arthur has created the tools needed to deploy this technology more quickly and securely, so companies can stay ahead of their competitors without exposing their businesses or their customers to unnecessary risk.”

Model Selection & Validation
Arthur Bench helps companies compare the different LLM options available using consistent metrics so they can determine the best fit for their application in a rapidly evolving AI landscape.
Budget & Privacy Optimization
Not all applications require the most advanced or expensive LLMs; in some cases, a less expensive AI model can perform a task equally well. Additionally, bringing models in-house can offer greater control over data privacy.
Translating Academic Benchmarks to Real-World Performance
Bench helps companies test and compare the performance of different models quantitatively with a set of standard metrics to ensure accuracy and consistency. Additionally, companies can add and configure customized benchmarks, enabling them to focus on what matters most to their specific business and customers.
Arthur Bench is the key to fast, data-driven LLM evaluation
Full Suite of Scoring Metrics
From summarization quality to hallucinations, Bench comes complete with a full suite of scoring metrics, ready to leverage. Additionally, you can create and add your own scoring metrics.
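To make the idea of a custom scoring metric concrete, here is a minimal, library-independent sketch in Python. It does not use the Bench API; the names `exact_match`, `token_overlap`, and `compare_models` are hypothetical and chosen only for illustration. It assumes a metric is simply a function that takes a candidate output and a reference output and returns a score between 0 and 1.

```python
from statistics import mean
from typing import Callable, Dict, List

# Assumption for this sketch: a scorer maps (candidate, reference) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match(candidate: str, reference: str) -> float:
    """Standard-style metric: 1.0 if the texts match, ignoring case and outer whitespace."""
    return 1.0 if candidate.strip().lower() == reference.strip().lower() else 0.0


def token_overlap(candidate: str, reference: str) -> float:
    """Custom metric: fraction of distinct reference tokens that appear in the candidate."""
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    cand_tokens = set(candidate.lower().split())
    return len(ref_tokens & cand_tokens) / len(ref_tokens)


def compare_models(
    outputs: Dict[str, List[str]],
    references: List[str],
    scorer: Scorer,
) -> Dict[str, float]:
    """Score every model's outputs against the same references with the same metric."""
    results: Dict[str, float] = {}
    for name, candidates in outputs.items():
        if len(candidates) != len(references):
            raise ValueError(f"{name}: expected {len(references)} outputs, got {len(candidates)}")
        results[name] = mean(scorer(c, r) for c, r in zip(candidates, references))
    return results


if __name__ == "__main__":
    references = ["Paris", "4"]
    outputs = {
        "model_a": ["Paris", "5"],
        "model_b": ["paris ", "4"],
    }
    print(compare_models(outputs, references, exact_match))
```

Because every model is scored against the same references with the same function, swapping `exact_match` for `token_overlap` changes the metric without changing the comparison itself.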
Intuitive User Interface
Use the Arthur user interface to quickly run and compare your test runs and visualize how the performance of each LLM differs.
Local and Cloud-based Versions
Get Bench from our GitHub repo and run it locally, or sign up for our cloud-based SaaS offering. We offer both versions for maximum flexibility.
Completely Open Source
The best part is that Bench is completely open source, so new metrics and other valuable features will continue to be added as the project and community grow.
The Generative Assessment Project
A research initiative ranking the strengths and weaknesses of large language model offerings from industry leaders like OpenAI, Anthropic, and Meta as well as other open source models.


