How to Set Up Tracing for AI Agents in Production: A Mastra TypeScript Guide
Your TypeScript agent nails every test case in development. Then it hits production and starts calling the wrong tools, hallucinating details to customers, or quietly returning bad results — and you have no way to tell what went wrong or where.
The root cause is almost always the same: zero visibility into what the agent is actually doing at runtime.
Tracing fixes this. It gives you a structured, step-by-step record of every decision your agent makes — every LLM call, every tool invocation, every document retrieved — so you can debug failures, measure performance, and build the feedback loops that turn a demo into a dependable production system.
In this guide, we'll walk through how to set up production-grade tracing for TypeScript AI agents built with Mastra. We'll cover the tracing standards worth adopting, what specifically to instrument, and how to get traces flowing into the Arthur Engine with just a few lines of configuration. Then we'll show how tracing plugs into the broader production lifecycle — prompt management, evaluations, experiments, and guardrails — so you can go from "it works on my machine" to "it works for our customers."
TL;DR
- Tracing is the foundation of production agent observability. Without it, debugging is guesswork.
- Use OpenTelemetry with OpenInference semantic conventions for the richest agent trace data.
- Mastra provides comprehensive, built-in OpenTelemetry tracing for agents, LLM calls, tools, workflows, and integrations — no manual wiring needed.
- The @mastra/arthur exporter connects your Mastra agent to Arthur Engine in a few lines of code.
- Auto-instrumentation is a starting point — always add custom metadata, child spans, and session context at your agent's key decision points.
- Once tracing is in place, connect it to prompt management, continuous evaluations, experiments, and guardrails to close the production feedback loop.
Why Tracing Is the Foundation of Agent Observability
Agents aren't simple request-response APIs. A single user query can set off a cascade of LLM reasoning steps, tool invocations, retrieval operations, and conditional branching that spans multiple services and packages. In the TypeScript ecosystem, this complexity compounds fast: teams are composing agents from rapidly evolving frameworks like Mastra, Vercel AI SDK, and LangChain.js, deploying to serverless runtimes where execution is ephemeral, and wiring together tools that call external APIs, databases, and other agents. The surface area for silent failures is enormous.
Without tracing, debugging a production agent amounts to reading logs and guessing. You might know the agent returned a bad answer, but you can't tell whether the problem was a flawed prompt, a bad retrieval result, a tool that returned unexpected data, or a reasoning step where the model went off track. Tracing gives you the structured, hierarchical record you need: every LLM call, every tool invocation, every retrieved document captured as a span — complete with inputs, outputs, latency, and metadata. When a customer reports an issue, you pull up their session trace and see exactly what happened. When you're deciding whether to invest in RAG tuning versus prompt engineering, traces show you where the agent is actually failing rather than forcing you to guess. The teams that instrument early are the ones that ship with confidence.
OpenTelemetry and Semantic Conventions for Agent Tracing
OpenTelemetry (OTEL) has become the open-source standard for distributed observability. The good news is that the same standard extends cleanly to agent observability, which means your agent traces can participate in the same distributed trace as your Express middleware, your database queries, and your API gateway.
The more consequential decision is which semantic conventions to use for encoding AI-specific behavior into those traces. Two options exist within the OpenTelemetry ecosystem, and the choice meaningfully affects how much visibility you get into agent failures.
The OTEL-community GenAI semantic conventions are the official standard, but they're still maturing. They cover the basics of LLM calls but treat most agent operations as generic spans — which means retrieval steps, tool invocations, and reasoning chains all look the same in your trace viewer. When you're trying to figure out why your agent called the wrong tool or acted on stale context, that lack of granularity slows you down.
The OpenInference standard was designed specifically for generative AI workloads and draws clearer distinctions between span types — LLM, TOOL, AGENT, CHAIN, RETRIEVER — so you can immediately see the structure of your agent's reasoning in a trace. It captures richer metadata for LLM calls (full prompt and completion content, token counts, cost, model parameters) and has first-class support for retrieval and re-ranking spans, which is critical if your agent relies on RAG. Mastra's Arthur exporter implements OpenInference conventions, so all of this structure is captured automatically when you export traces to Arthur Engine.
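To make the distinction concrete, here is a sketch of the attributes an OpenInference-style span might carry for a tool call versus an LLM call. The attribute keys follow OpenInference conventions; the tool name and values are hypothetical, and in practice the exporter sets these for you:

```typescript
// Hypothetical attribute payloads following OpenInference semantic conventions.
// Distinct span kinds let a trace viewer render tools, LLM calls, and
// retrievals differently instead of as generic spans.
const toolSpanAttributes = {
  'openinference.span.kind': 'TOOL',
  'tool.name': 'lookup_order', // hypothetical tool
  'input.value': JSON.stringify({ orderId: '42' }),
  'output.value': JSON.stringify({ status: 'shipped' }),
}

const llmSpanAttributes = {
  'openinference.span.kind': 'LLM',
  'llm.model_name': 'gpt-4o',
  'llm.token_count.prompt': 512,
  'llm.token_count.completion': 96,
  'input.value': 'Where is order 42?',
  'output.value': 'Order 42 shipped yesterday.',
}

console.log(toolSpanAttributes['openinference.span.kind']) // TOOL
console.log(llmSpanAttributes['openinference.span.kind']) // LLM
```

Because the span kind is an explicit attribute rather than an inference from span names, any OpenInference-aware viewer can reconstruct the agent's reasoning structure without custom parsing.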
What to Trace in Your Agent
Auto-instrumentation captures a lot out of the box, but make sure your traces cover these five areas at minimum: LLM calls (prompts, completions, model config, token counts, cost), tool invocations (inputs, outputs, and latency for every API or database call your agent makes), RAG and retrieval calls (which documents were pulled and which were missed — the most common silent failure mode in production agents), application metadata (user IDs, session IDs, and domain identifiers that tie traces back to real user sessions), and key decision points (custom spans at your agent's branching logic and reasoning steps). Auto-instrumentation is the starting point; the manual spans you add at your agent's custom decision points are what make traces actually useful for debugging.
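As a sketch of what the application-metadata and decision-point pieces look like, the helper below builds the attributes you would attach to a custom child span at a branching point. The `session.id` and `user.id` keys follow OpenInference-style conventions; the helper, the `agent.decision.*` keys, and the field names are hypothetical, and you would attach the result through whatever span API your tracer exposes:

```typescript
// Sketch (assumed names): metadata for a custom decision-point span.
type DecisionContext = {
  sessionId: string
  userId: string
  decision: string // which branch the agent chose
  candidates: string[] // options that were considered
}

function decisionSpanAttributes(ctx: DecisionContext): Record<string, string> {
  return {
    'session.id': ctx.sessionId, // ties the span back to a real user session
    'user.id': ctx.userId,
    'agent.decision': ctx.decision, // custom attribute, not a standard key
    'agent.decision.candidates': ctx.candidates.join(','),
  }
}

const attrs = decisionSpanAttributes({
  sessionId: 'sess-123',
  userId: 'user-456',
  decision: 'use_search_tool',
  candidates: ['use_search_tool', 'answer_directly'],
})
console.log(attrs['agent.decision']) // use_search_tool
```

Recording the candidates alongside the chosen branch is what lets you later ask "why didn't the agent answer directly here?" from the trace alone.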
Setting Up Tracing with the Arthur Exporter
The @mastra/arthur package is an exporter that sends Mastra traces to Arthur Engine using OpenTelemetry and OpenInference semantic conventions. Setup takes just a few minutes.
Installation
Install the Arthur exporter alongside your Mastra dependencies:
```shell
npm i @mastra/arthur@latest
```
Before configuring the exporter, you need:
- An Arthur Engine instance. Follow the Docker Compose deployment guide to spin one up.
- An API key. Generate one from the Arthur Engine UI.
- A Task ID (optional). Create a task to route traces to a specific project.
Task Routing
Arthur Engine associates traces with tasks — logical groupings that correspond to your agents or services — in two ways:
- By service name: Set serviceName in the observability config. Arthur Engine automatically routes traces to the task matching that name, creating it if one doesn't exist.
- By task ID: Pass an explicit taskId to the exporter to send traces to a specific task directly.
If both are provided, taskId takes precedence.
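The precedence rule behaves like the small resolver below. This is illustrative only: the function and return shape are not part of the @mastra/arthur API, which applies this logic for you:

```typescript
// Illustrative only — not the @mastra/arthur API. Shows how taskId takes
// precedence over serviceName when both are configured.
type Routing =
  | { mode: 'task-id'; taskId: string }
  | { mode: 'service-name'; serviceName: string }

function resolveTaskRouting(serviceName: string, taskId?: string): Routing {
  // An explicit taskId wins; otherwise traces route by service name,
  // auto-creating the task if it doesn't exist.
  return taskId
    ? { mode: 'task-id', taskId }
    : { mode: 'service-name', serviceName }
}

console.log(resolveTaskRouting('my-agent', 'task-42').mode) // task-id
console.log(resolveTaskRouting('my-agent').mode) // service-name
```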
Environment Variables
```shell
# Required
ARTHUR_API_KEY=your-api-key
ARTHUR_BASE_URL=http://localhost:3030

# Optional — route traces to a pre-existing task by ID
ARTHUR_TASK_ID=your-task-id
```
Zero-Config Setup
With environment variables set, the exporter works with no inline configuration:
```typescript
import { Mastra } from '@mastra/core'
import { Observability } from '@mastra/observability'
import { ArthurExporter } from '@mastra/arthur'

export const mastra = new Mastra({
  observability: new Observability({
    configs: {
      arthur: {
        serviceName: 'my-agent',
        exporters: [new ArthurExporter()],
      },
    },
  }),
})
```
That's it. Every action your Mastra agent takes — LLM calls, tool invocations, workflow steps, memory operations — is now traced and exported to Arthur Engine automatically.
Explicit Configuration
You can also pass credentials directly (takes precedence over environment variables):
```typescript
import { Mastra } from '@mastra/core'
import { Observability } from '@mastra/observability'
import { ArthurExporter } from '@mastra/arthur'

export const mastra = new Mastra({
  observability: new Observability({
    configs: {
      arthur: {
        serviceName: 'my-service',
        exporters: [
          new ArthurExporter({
            apiKey: process.env.ARTHUR_API_KEY!,
            endpoint: process.env.ARTHUR_BASE_URL!,
            taskId: process.env.ARTHUR_TASK_ID,
          }),
        ],
      },
    },
  }),
})
```
For more detail on configuring the Arthur exporter for Mastra, visit the docs.
Closing the Loop: From Traces to Reliable Agents
Tracing is the foundation, but its real value emerges when you connect traces to the rest of your production agent workflow. Once instrumentation is in place, you unlock every other capability in Arthur Engine's Agent Development Lifecycle (ADLC):
Observability & Tracing — the conceptual foundation for why tracing matters and how to think about agent observability at scale.
Prompt Management — version, test, and roll back prompts without redeploying code. Every trace links to the prompt version that produced it, so you can see exactly how prompt changes affect real-world behavior.
Continuous Evaluations — run automated pass/fail checks against production traces to catch behavioral regressions before your users do. Only unsupervised evals can run continuously, and they need to be binary pass/fail — range-based scoring is inconsistent and pushes judgment to humans.
Experiments & Supervised Evals — build datasets from real production failures (surfaced through traces), then test improvements offline before promoting to production.
Guardrails — intercept agent behavior in real time with pre-LLM checks (PII redaction, prompt injection detection, sensitive data blocking) and post-LLM checks (hallucination detection, toxicity filtering, output format validation). Guardrail events emit as telemetry within the same trace, giving you a unified view of the agent's full execution path including safety interventions.
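As one concrete illustration of the binary pass/fail principle behind continuous evals, a check can be a simple predicate over a trace's output. The check below is a hypothetical example (invented names and policy), not an Arthur Engine API:

```typescript
// Hypothetical binary eval: a yes/no verdict, no numeric score to interpret.
type TraceOutput = { text: string; toolCalls: string[] }

function passesRefundPolicyCheck(output: TraceOutput): boolean {
  // Fail if the agent promised a refund without calling the refund tool —
  // a behavioral regression that traces make detectable automatically.
  const promisedRefund = /refund/i.test(output.text)
  const calledRefundTool = output.toolCalls.includes('issue_refund')
  return !promisedRefund || calledRefundTool
}

console.log(passesRefundPolicyCheck({ text: 'Your refund is on the way', toolCalls: [] })) // false
console.log(passesRefundPolicyCheck({ text: 'Here is your order status', toolCalls: [] })) // true
```

A verdict like this either passes or it doesn't, which is what makes it safe to run unattended against every production trace.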
To see this full pattern applied end to end — from instrumenting day one through evals, prompt iteration, and production deployment — read From Vibe-Coded Jira Bot to Reliable Agent.
Getting Started
Tracing is not optional for production agents — it's the foundation that every other production capability depends on. Install @mastra/arthur, configure the exporter with your Arthur Engine credentials, tag your traces with session and user context, and add child spans at your agent's key decision points. Start tracing from day one.
Arthur Engine provides the Agent Development Lifecycle (ADLC) for AI agents — tracing, prompt management, continuous evaluations, experiments, and guardrails — in a single platform. Get started with Arthur Engine or explore the open-source Arthur Observability SDK on GitHub. For Mastra-specific setup, see the @mastra/arthur exporter documentation.