Stop Guessing.
Start Shipping Agents.

An open-source toolkit for building, testing, and monitoring AI agents in production.

How it works

One workflow for the whole agent lifecycle.

Step 1

Manage

Keep prompts versioned, tagged, and promotable across environments. Roll back in seconds when something regresses.

Step 2

Experiment

Test prompt changes, model swaps, and RAG configs against real data before anything ships. Know what changed and why it mattered.

Step 3

Monitor

Trace every agent run end to end. Catch hallucinations, failures, and drift in production before your users do.

All working together. No changes to your stack.

Ship Reliable AI Agents.
Fast.

MANAGE
1/5

Prompts that behave like code.

Most teams treat prompts like config files — unversioned, untracked, and painful to roll back. One bad change can quietly break production.

Version and promote prompts across environments without redeploying your agent

Roll back in seconds when performance drops — no firefighting, no guesswork

Template prompts to control structure and variables at runtime, across teams or tenants
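
In practice, promotion and rollback reduce to tag operations. A minimal sketch, assuming a hypothetical Python client (`PromptRegistry`, its methods, and the endpoint are illustrative, not the toolkit's actual API):

```python
# Hypothetical sketch: resolve the prompt currently tagged "production" at
# runtime, so promoting or rolling back never requires redeploying the agent.
from prompt_registry import PromptRegistry  # hypothetical client library

registry = PromptRegistry(url="http://localhost:8080")  # assumed endpoint

# Fetch whatever version is currently tagged "production".
prompt = registry.get_prompt("support-triage", tag="production")

# Fill template variables at runtime, per team or tenant.
rendered = prompt.render(tone="Professional", tenant="acme-co")

# Rollback is a retag, not a redeploy (illustrative call):
# registry.tag("support-triage", version=11, tag="production")
```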

[Image: stylized prompt templates with highlighted lines, labeled 'SEO optimized', 'Brand voice', and 'Tone: Professional']
EXPERIMENT
2/5

Test changes before they reach users.

Swapping a model or tweaking a prompt is a gamble without structured tests. Most teams ship first and find out what broke second.

A/B test prompts, models, and RAG configs against real production data — not synthetic examples

Test full agent workflows — tool use, reasoning paths, and output formatting, not just single completions

Score results automatically or with human review — and see exactly what changed and why
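
A sketch of what a structured experiment looks like, assuming a hypothetical Python client (`Experiment`, the scorer names, and the dataset handle are illustrative, not the toolkit's actual API):

```python
# Hypothetical sketch: A/B test two prompt versions against replayed
# production data and score the results automatically.
from agent_experiments import Experiment  # hypothetical client library

exp = Experiment(
    name="triage-prompt-v12-vs-v13",
    variants={
        "control": "support-triage@12",    # current production prompt
        "candidate": "support-triage@13",  # the proposed change
    },
    dataset="prod-traces-last-7d",  # real production data, not synthetic examples
)

# Automatic scoring; low-confidence cases can be routed to human review.
results = exp.run(scorers=["correctness", "format_adherence"])
print(results.summary())  # per-variant scores: what changed and why it mattered
```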

TRACE
3/5

See exactly what your agent did.

When an agent fails, you shouldn't have to piece together logs and hope for the best.

Inspect every step — inputs, tool calls, reasoning paths, and outputs across every run

Filter by prompt version, user, outcome, or cost to find the source of a failure fast

Built on OpenTelemetry — works with LangChain, LlamaIndex, OpenAI, Anthropic, and anything else in the OpenInference ecosystem
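
Because tracing is built on OpenTelemetry, wiring it up is standard OTel setup. A minimal Python sketch (the collector endpoint and the attribute keys are assumptions about your deployment, not fixed names):

```python
# Standard OpenTelemetry setup; only the endpoint is deployment-specific.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")

# Wrap an agent run so every step lands in one end-to-end trace.
with tracer.start_as_current_span("agent.run") as span:
    span.set_attribute("prompt.version", "12")  # illustrative attribute keys,
    span.set_attribute("user.id", "u-42")       # used later for filtering
    # ... agent steps (tool calls, completions) emit child spans here
```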

[Image: trace detail for a 'jirabotAgent' run, stepping through reading a Slack thread, searching Jira issues, and creating a Jira issue, with input/output data for each step]
MONITOR
4/5

Know before your users do.

Quality problems in production are invisible until someone complains. By then it's too late.

Continuously run evals on live traffic — hallucination, PII, prompt injection, toxicity, and correctness

Set alerts the moment quality drifts — not after a user escalation

Validate before you ship with curated datasets and pre-deployment test runs
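
A sketch of the continuous checks, assuming a hypothetical monitoring client (`Monitor`, the eval names, and the alert expression are illustrative, not the engine's actual API):

```python
# Hypothetical sketch: attach continuous evals and an alert to live traffic.
from agent_monitoring import Monitor  # hypothetical client library

monitor = Monitor(agent="support-triage")

# Run these checks on production runs, continuously.
monitor.add_evals(["hallucination", "pii", "prompt_injection", "toxicity"])

# Notify the team the moment quality drifts, before a user escalation.
monitor.alert_when("hallucination.rate > 0.02", channel="#oncall-agents")
```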

[Image: annotations table for a trace, listing continuous evaluation entries with eval names, scores, explanations, pass status, and cost]
INTEGRATE
5/5

Works with what you already have.

You shouldn't have to rebuild your stack to get observability.

Use any model — OpenAI, Anthropic, Cohere, or open-source

Bring any framework — LangChain, LangGraph, LlamaIndex, Vercel AI SDK, and more

Deploy your way — Docker, CloudFormation, or Helm. Your environment, your data.
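
Hooking in an existing stack is typically one instrumentor call per framework. A sketch assuming the OpenInference instrumentation packages (install names shown in the comment; your tracer setup may differ):

```python
# Instrument an existing stack without rewriting it. Assumes, e.g.:
#   pip install openinference-instrumentation-openai \
#               openinference-instrumentation-langchain
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor

# Each instrumentor hooks the library's calls into the OpenTelemetry
# tracer provider already configured for your app.
OpenAIInstrumentor().instrument()
LangChainInstrumentor().instrument()
```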

[Image: Model Providers list in the Arthur web app, showing Anthropic, OpenAI, Google Gemini, Amazon Bedrock, Vertex AI, and vLLM with enabled/disabled status]

| | Agent Framework | Eval Platform | Arthur Engine + Toolkit |
| --- | --- | --- | --- |
| Build and run agents | ✓ | | ✓ |
| Prompt versioning & management | Basic | | ✓ |
| Structured A/B experiments | | | ✓ |
| Real-time guardrails (hallucination, PII, injection) | | | ✓ |
| End-to-end trace debugging | | | ✓ |
| Traditional ML model eval | | | |
| Self-hosted / open-source | Varies | SaaS | MIT licensed |

Perfect fit

How It Fits Into the Arthur Engine

The Agent Toolkit is part of the Arthur Engine — Arthur's free, open-source AI evaluation and monitoring platform. The Engine provides the foundation: real-time guardrails, LLM eval infrastructure, and flexible deployment. The Toolkit builds on top of that with the full agent development workflow.

Ready to turn your AI into real-world impact?

We’ll help you move from pilots and prototypes to production-grade applications, with evaluation every step of the way.
