
How to Set Up Tracing for AI Agents in Production: A Python Guide

April 23, 2026
4 min read

So your agent works flawlessly in your dev environment, then starts hallucinating in front of customers, calling the wrong tools, or silently returning bad results, and you have no idea why.

The root cause is almost always the same: no visibility into what the agent is actually doing.

Tracing solves this. It gives you a structured, step-by-step record of every decision your agent makes — every LLM call, every tool invocation, every document retrieved — so you can debug failures, measure performance, and build the feedback loops that turn a prototype into a reliable production system.

In this guide, we'll walk through how to set up production-grade tracing for Python-based AI agents. We'll cover the tracing standards that matter, what specifically to instrument, and how to get it done with just a few lines of code using the Arthur Observability SDK. Then we'll show how tracing connects to the broader production lifecycle: prompt management, evaluations, and guardrails.

TLDR

  • Tracing is the foundation of production agent observability; without it, debugging is guesswork.
  • Use OpenTelemetry with OpenInference semantic conventions for the richest agent trace data.
  • Instrument five areas at minimum: LLM calls, tool invocations, RAG calls, application metadata, and key decision points.
  • The Arthur Observability SDK (pip install arthur-observability-sdk) gives you one-line instrumentation for 30+ Python frameworks.
  • Auto-instrumentation is a starting point — always add manual spans at your agent's custom decision points.
  • Once tracing is in place, connect it to prompt management, continuous evaluations, experiments, and guardrails to close the production feedback loop.

Why Tracing Is the Foundation of Agent Observability

Agents aren't simple API calls. A single user request can trigger a chain of LLM calls, tool invocations, retrieval operations, and branching logic that spans multiple services. When something goes wrong — and in production, it will — you need to see the full execution path to understand where it broke.

Two real-world examples from Arthur's Forward Deployed Engineering (FDE) team illustrate this well.

One customer was preparing to roll out their agent to pilot users and had no idea where to start improving it. Should they focus on RAG tuning, prompt engineering, or context engineering? By instrumenting their agent with tracing, they could investigate the specific requests where the agent got the wrong answer, see exactly where it deviated from expected behavior, and focus their limited resources on the changes that actually mattered.

Another customer was selling their agent to large enterprise buyers who needed proof the agent was trustworthy before rolling it out across their organization. Collecting detailed traces allowed them to build a behavior dataset of real customer requests, establishing the evidence their buyers needed to move forward.

In both cases, the teams that invested in observability early were the ones that shipped with confidence. The ones that didn't were stuck in demos.

Semantic Conventions: OTEL GenAI vs. OpenInference

Within OpenTelemetry, there are two competing semantic conventions for encoding AI agent behavior into spans.

The OpenTelemetry (OTEL) GenAI semantic conventions are the official community standard, but they are still maturing. They provide basic coverage for LLM calls yet currently lack some of the expressiveness needed for production agent workloads.

The OpenInference standard is an open-source alternative that offers several advantages for production agents:

  • Richer semantic detail for LLM calls, including full prompt/completion content, token counts, cost, and model parameter metadata.
  • First-class support for retrieval and re-ranking spans, which is critical for RAG-heavy agents.
  • Better distinctions between different span types — LLM, TOOL, AGENT, CHAIN, RETRIEVER — instead of treating everything as a generic span.
  • Explicit types for messages (system, user, assistant, tool), documents, tools, and tool calls.
  • Out-of-the-box auto-instrumentation for popular LLM frameworks including LangChain, LlamaIndex, OpenAI, and more.

The difference is stark when you compare traces side by side. OpenInference traces clearly show the full hierarchy of an agent's reasoning: which tools were called, what context was retrieved, what the LLM was asked, and what it produced at each step. OTEL GenAI traces capture the same events but with less granularity, making it harder to diagnose issues at the level of detail that production debugging demands.
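To make that hierarchy concrete, here is a minimal sketch of the attributes an OpenInference LLM span carries, represented as a plain dict. The attribute keys follow the published OpenInference conventions; the values are invented for illustration.

```python
# A sketch of an OpenInference-style LLM span's attributes. Keys follow the
# OpenInference semantic conventions; the values are invented for illustration.
llm_span_attributes = {
    "openinference.span.kind": "LLM",  # vs. TOOL, AGENT, CHAIN, RETRIEVER
    "llm.model_name": "gpt-4",
    # Messages are flattened with explicit roles, one index per message.
    "llm.input_messages.0.message.role": "user",
    "llm.input_messages.0.message.content": "What are my top-selling products?",
    "llm.output_messages.0.message.role": "assistant",
    "llm.output_messages.0.message.content": "Your top seller this month is ...",
    # Token counts make cost and performance visible per call.
    "llm.token_count.prompt": 42,
    "llm.token_count.completion": 117,
}
```

Because every message, role, and count has its own typed attribute, a trace viewer can reconstruct the full conversation at each step instead of showing an opaque blob.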

When building the Arthur platform, we chose OpenInference for these reasons, though the OTEL GenAI conventions are catching up.

What to Trace: The Five Areas You Can't Skip

Most common agent frameworks have auto-instrumentation packages for OpenTelemetry, but auto-instrumentation is a starting point, not the finish line. At minimum, you should instrument these five areas:

LLM calls. Trace every interaction with full context: prompts, completions, model configuration, token counts, and cost. Without this, debugging unexpected outputs or reasoning about performance tradeoffs is nearly impossible.

Tool invocations. When an agent calls an API, queries a database, or executes code, capture inputs, outputs, and latency. Many performance issues come down to inefficient tool usage that only becomes obvious when laid out in a trace.
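As a sketch of what to capture per tool call, here is a hypothetical decorator (not part of the Arthur SDK) that records inputs, output, and latency as a structured record you could attach to a span:

```python
import time
from functools import wraps

# Hypothetical helper, not part of the Arthur SDK: record each tool call's
# inputs, output, and latency so inefficient tool usage shows up in traces.
def traced_tool(records):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            records.append({
                "tool.name": fn.__name__,
                "tool.inputs": {"args": args, "kwargs": kwargs},
                "tool.output": result,
                "tool.latency_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return decorator

records = []

@traced_tool(records)
def lookup_order(order_id):
    # Stand-in for a real API or database call.
    return {"order_id": order_id, "status": "shipped"}

lookup_order("A-123")
```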

RAG and retrieval calls. Agents often take wrong actions because they had bad context. Tracing retrieval calls lets you see exactly which documents were pulled — and often more importantly, which documents were not pulled — and why the model acted on them. This is the most common silent failure mode we see in production agents.
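OpenInference flattens retrieved documents into indexed span attributes, so each document's id, content, and score is individually visible. A minimal sketch, with invented documents:

```python
# Sketch: flatten retrieved documents into OpenInference-style retriever
# span attributes. The query and documents are invented for illustration.
def retrieval_attributes(query, documents):
    attrs = {
        "openinference.span.kind": "RETRIEVER",
        "input.value": query,
    }
    for i, doc in enumerate(documents):
        prefix = f"retrieval.documents.{i}.document"
        attrs[f"{prefix}.id"] = doc["id"]
        attrs[f"{prefix}.content"] = doc["content"]
        attrs[f"{prefix}.score"] = doc["score"]
    return attrs

attrs = retrieval_attributes(
    "refund policy",
    [{"id": "kb-17", "content": "Refunds are issued within 30 days...", "score": 0.91}],
)
```

With scores and contents recorded per document, you can see at a glance when a low-relevance chunk crowded out the document the agent actually needed.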

Application metadata. User IDs, session IDs, and domain identifiers connect agent behavior back to real user experiences. When a customer reports an issue, pulling up the exact traces for their session dramatically shortens time to resolution.

Key decision points. Make sure spans include key decision points or context important for the agent to function correctly. In our FDE engagements, we often see teams add manual spans at these points, then use the logged context to build test datasets and continuous evaluations to validate behavior in production.

Setting Up Tracing in Python with the Arthur SDK

The Arthur Observability SDK is a Python package that handles OpenTelemetry instrumentation for your agent with minimal setup. It supports 30+ frameworks out of the box and integrates directly with the Arthur Engine for trace storage, analysis, and the production feedback loop.

Installation

Install the core SDK and any framework-specific extras you need:

pip install arthur-observability-sdk

For framework-specific auto-instrumentation, install the corresponding extra:

pip install "arthur-observability-sdk[openai]"
pip install "arthur-observability-sdk[langchain]"
pip install "arthur-observability-sdk[anthropic]"

Or install everything at once:

pip install "arthur-observability-sdk[all]"

Initialize and Instrument

Getting tracing running takes just a few lines:

from arthur_observability_sdk import Arthur

# Initialize the SDK
arthur = Arthur(
    api_key="your-api-key",       # or set ARTHUR_API_KEY env var
    task_id="<your-task-uuid>",
    service_name="my-agent",
)

# Instrument your framework (swap in the instrument_<framework_name>() method for your framework)
arthur.instrument_openai()

That's it. Every action your OpenAI agent makes is now traced — prompts, completions, token counts, latency, and model configuration are all captured automatically.

The SDK supports over 30 frameworks out of the box. Whether your agent is built on Google ADK, CrewAI, AWS Strands Agents, Anthropic, Pydantic AI, DSPy, Haystack, LlamaIndex, MCP, or others — each has a corresponding instrument_*() method and install extra. This means you don't need to wire up OpenTelemetry manually for each framework; the Arthur SDK handles the instrumentation and exports traces using OpenInference semantic conventions automatically.

(Optional) Adding Session and User Context

Production agents serve real users. Tagging traces with session and user IDs lets you pull up the exact execution path when someone reports an issue:

with arthur.attributes(session_id="sess-1", user_id="user-42"):
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "What are my top-selling products?"}]
    )

Every span created inside the arthur.attributes() context manager automatically inherits the session and user metadata.

For more info on how to configure the Arthur observability SDK to instrument your agent, check out the docs.

Once You Have Traces: Closing the Loop

Tracing is the foundation, but the real value comes when you connect traces to the rest of your production agent workflow. Once tracing is in place, you unlock every other capability:

  • Observability & Tracing — the conceptual foundation for why tracing matters and how to think about agent observability.
  • Prompt Management — version, test, and roll back prompts without redeploying code, with traces linked to the prompt version that produced them.
  • Continuous Evaluations — run automated pass/fail checks against production traces to catch behavioral regressions before your users do.
  • Experiments & Supervised Evals — build datasets from real production failures and test improvements offline before promoting to production.
  • Guardrails — intercept agent behavior in real time with pre-LLM and post-LLM checks, emitting guardrail events as telemetry within the same trace.

To see this full pattern applied end-to-end — from instrumenting day one through evals, prompt iteration, and production deployment — read From Vibe-Coded Jira Bot to Reliable Agent.

Common Pitfalls and How to Avoid Them

After working with dozens of production agent deployments, here are the tracing mistakes we see most often:

Relying solely on auto-instrumentation. Auto-instrumentation captures framework-level operations but misses your agent's custom logic, branching conditions, and domain-specific decision points. Always verify what's actually being captured and add manual spans where needed.

Not tracing RAG calls. This is the most common silent failure mode. The agent confidently generates a response based on whatever context it received — bad context produces bad answers, and you'll never know unless you trace retrieval to see exactly which documents were pulled and which were missed.

Skipping session and user metadata. When a customer reports that the agent gave them a wrong answer last Tuesday, you need to find that exact trace. Without session and user IDs attached to your spans, you're searching for a needle in a haystack. Always use arthur.attributes() to tag traces with the context you'll need for debugging.

Getting Started

Tracing is not optional for production agents — it's the foundation that every other production capability depends on. Install the Arthur SDK, call instrument_*() for your framework, tag your traces with session and user context, and add manual spans at key decision points. The teams that instrument early are the ones that ship with confidence. Start tracing from day one.

Arthur Engine provides the Agent Development Lifecycle (ADLC) for AI agents — tracing, prompt management, continuous evaluations, experiments, and guardrails — in a single platform. Get started with Arthur Engine or explore the open-source Arthur Observability SDK on GitHub.