Model Risk Management in the Age of Agentic AI: A Strategic Guide for Enterprise Leaders

By: Arthur Team

June 1, 2026

Most large financial institutions have crossed an inflection point with agentic AI in the last 18 months. Agents are now processing claims, drafting credit memos, summarizing filings, routing tickets, querying databases, and increasingly calling other agents to do the same. They have moved out of pilot status and into production, often faster than the governance functions around them were prepared for.

For senior AI executives, this creates a strategic problem that doesn't sit cleanly inside any one function. It cuts across model risk, internal audit, security, engineering, and compliance, which is why senior AI leadership teams are the ones increasingly being asked to answer it rather than any single team underneath them. The clearest place this friction is showing up is in model risk management frameworks like SR 11-7, which has governed model risk at US banks for more than a decade. The principles are sound, and examiners are starting to apply them to agentic systems. The catch is that those principles were written for a very different kind of model. Translating them to systems that reason on the fly, pull data at runtime, and act with much less human oversight than traditional models is where the work actually is.

The principles of model risk management still apply to agents. What has to change is the operational layer underneath them, and because that work cuts across model risk, audit, compliance, security, and engineering, it's increasingly senior AI leadership that has to own it.

Traditional MRM Was Built for a Different Set of Models

Traditional MRM frameworks were designed in and for a world of static, deterministic models, and that core assumption is what breaks first when agents enter the picture.

For most of MRM’s history, the discipline has worked in a fairly stable environment. Models were developed by quant teams, validated against structured inputs and outputs, and revalidated on predictable cycles. The frameworks that govern this work, with SR 11-7 being the most prominent example in US banking, sit on three familiar pillars: sound model development and implementation, independent validation, and ongoing monitoring backed by effective governance.

Each pillar of the framework carries an implicit assumption that the model is deterministic, reproducible, and consistent. Pass the same input through a traditional model and you get the same output every time, and that output is what you end up validating, monitoring, and documenting.

Agents don't work that way. The same input passed through the same agent twice can produce two different outputs, because the agent decides at runtime which tools to use, what data to retrieve, and how to chain together reasoning steps. Its behavior is shaped by the version of the underlying LLM, the state of upstream systems, and the specific context of the interaction, and it can shift because of a prompt change, a new tool, a retrieval strategy update, or a provider quietly releasing a new version of the underlying LLM. None of that fits cleanly into a governance model designed for static, deterministic systems.

Why This Is a Senior-Leader Conversation

Three forces are converging right now to push agent governance out of the back office and onto the agenda of the people running enterprise AI.

The first is regulatory attention. SR 11-7 is the loudest signal in US banking, but the underlying principles show up everywhere model risk gets governed. The EU AI Act, sector-specific regulators, and internal audit standards are all asking variations of the same questions, just in different vocabularies. Examiners want a complete inventory, evidence of independent validation, ongoing monitoring, and clear lines of accountability, regardless of what the system being governed is.

The second is risk concentration. As agents take on more real work, the consequences of getting governance wrong grow with them. A loan-recommendation agent producing biased outputs, a customer support agent leaking PII, a research agent fabricating citations: none of these risks is hypothetical anymore. Each now carries the regulatory, reputational, and financial exposure that used to be reserved for traditional systems.

The third, and arguably the most interesting, is internal demand. First-line application teams are increasingly asking for governance infrastructure rather than resisting it. These teams want guardrails, observability, evals, and approval workflows because they don’t want their agents going to production without them. That is a meaningful shift from how model risk has historically worked, where controls were pushed down from the second line. The pressure for governance is now coming from both directions at once: the second line pushing controls down, and the first line pulling them in. That's a much better starting point for setting enterprise strategy than the old model, where MRM defined the rules and engineering treated them as friction.

Who Owns Agent Governance Across the Organization

Agent governance isn't owned by a single function. It's read by several at once, each looking at the same data through its own lens. A working agent governance strategy gives every stakeholder a different view into the same underlying system. The core data is identical (what agents exist, what they're doing, how they're behaving), but each function reads it for its own purpose.

Model risk and audit functions read it as evidence: complete inventories, decision-path traces, and after-the-fact records of what each agent did and which policies applied. Compliance reads it as enforcement, looking for consistent application of controls rather than per-team variations. Engineering reads it as productivity, both to ship faster and to debug why agents fail in production. Security reads it as access, especially as MCP servers and cross-system tool calls widen the surface area.

The point is that every function is looking at the same data, just through a different lens. The strategic move is to build one governance layer that serves every function around it, rather than having each function build its own incomplete version.

Where Model Risk Frameworks Hold, and Where Operationalization Has to Change

The core pillars of model risk management still apply to agents, but each one has to be operationalized differently to actually hold up in an agentic environment. The real question for senior AI leaders isn't whether to extend traditional MRM frameworks like SR 11-7 to agents, but how each pillar gets implemented in practice.

Inventory: MRM frameworks require a comprehensive inventory of models with documented purpose, methodology, assumptions, and limitations. Agent inventories built on manual registration are already incomplete at most large institutions, and the reason is pretty simple. Agents are entering the environment through too many doors at once: internal teams, vendor platforms, feature updates to existing SaaS tools, and individual employees spinning up their own in low-code environments. Very few of those paths route through a model risk committee. Continuous discovery is becoming the practical baseline instead, with the output being a living registry tied to accountable owners and use-case classifications.
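As a rough illustration of what a living registry could look like, here is a minimal Python sketch. Every name in it (`AgentRecord`, `AgentRegistry`, the `source` labels, the example agent IDs) is hypothetical and chosen for this example; none of it comes from SR 11-7 or from a specific product. It assumes some discovery process emits an event each time an agent is seen, and it shows the core idea: entries are created by discovery rather than by manual registration, and anything without an accountable owner or use-case classification is surfaced as a gap.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class AgentRecord:
    """One inventory entry. Field names are illustrative assumptions."""
    agent_id: str
    source: str                       # e.g. "internal", "vendor", "saas_feature", "low_code"
    owner: Optional[str] = None       # accountable owner, once assigned
    use_case: Optional[str] = None    # use-case classification, once assigned
    last_seen: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AgentRegistry:
    """In-memory registry populated by discovery events (hypothetical design)."""

    def __init__(self) -> None:
        self._records: Dict[str, AgentRecord] = {}

    def observe(self, agent_id: str, source: str) -> AgentRecord:
        """Record that an agent was seen; create the entry if it is new."""
        record = self._records.get(agent_id)
        if record is None:
            record = AgentRecord(agent_id=agent_id, source=source)
            self._records[agent_id] = record
        else:
            record.last_seen = datetime.now(timezone.utc)
        return record

    def assign(self, agent_id: str, owner: str, use_case: str) -> None:
        """Attach an accountable owner and a use-case classification."""
        record = self._records[agent_id]
        record.owner = owner
        record.use_case = use_case

    def ungoverned(self) -> List[str]:
        """IDs of discovered agents still missing an owner or classification."""
        return sorted(
            r.agent_id
            for r in self._records.values()
            if r.owner is None or r.use_case is None
        )


registry = AgentRegistry()
registry.observe("fraud-review-agent", source="internal")
registry.observe("vendor-summarizer", source="vendor")
registry.assign("fraud-review-agent", owner="fraud-ops", use_case="fraud_triage")
print(registry.ungoverned())
```

In this sketch the `ungoverned()` list is the part a model risk committee would care about: it is the set of agents that entered through a door that never routed past them.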

Validation: Every MRM framework calls for independent assessment of conceptual soundness. For traditional models, validators reason from the mathematical formulation. For an agent, soundness lives in the decision path, not in a formula. Validators need to see, for a given interaction, what the agent reasoned, what tools it called, and what data it retrieved. Without that level of depth, there's no real way to tell whether a bad outcome came from a faulty retrieval, a tool misfire, or a reasoning error in the middle of the chain. Outcomes analysis gets messier too. Agents don't just produce a single number you can grade against an actual result. They produce narratives, recommendations, and structured outputs, and each of those needs its own evaluation criteria.
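To make the decision-path idea concrete, here is a small Python sketch of a trace record a validator could reason from. The structure is an assumption for illustration only: the `TraceStep` fields, the three step kinds, and the step names are hypothetical, and the `ok` flag is assumed to be set by whatever evaluator or human reviewer assesses each step. The point it shows is that once each step is recorded, a bad outcome can be attributed to a retrieval, a tool call, or a reasoning step instead of to the agent as a whole.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TraceStep:
    """One step in an agent's decision path (illustrative schema)."""
    kind: str         # "reasoning", "tool_call", or "retrieval"
    name: str         # hypothetical step label
    ok: bool          # assumed to be set by an evaluator or reviewer
    detail: str = ""  # free-text note on what happened


def first_failure(trace: List[TraceStep]) -> Optional[TraceStep]:
    """Return the earliest step marked as failed, or None if every step passed."""
    for step in trace:
        if not step.ok:
            return step
    return None


trace = [
    TraceStep("retrieval", "recent_customer_activity", ok=True),
    TraceStep("tool_call", "vendor_counterparty_query", ok=False, detail="timeout"),
    TraceStep("reasoning", "draft_recommendation", ok=True),
]

failed = first_failure(trace)
if failed is not None:
    print(f"{failed.kind} failed at {failed.name}: {failed.detail}")
```

Looking for the earliest failure is a deliberate simplification. Later steps in a chain often look fine in isolation while inheriting a problem from an earlier one, so the first failed step is usually the more useful starting point for a validator.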

Ongoing Monitoring and Governance: MRM frameworks expect monitoring to detect degradation and trigger revalidation when material changes occur, backed by governance with clear roles, senior oversight, and effective challenge from independent parties. Agents complicate both halves at once.

On the monitoring side, agent behavior shifts for reasons that never show up in a code repository: prompt edits, tool additions, an LLM provider releasing a new model version. Continuous evaluation against production traffic has to become the baseline, and "material change" has to expand to include the kinds of changes a code-based change management process would never see.
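One way to picture an expanded definition of "material change" is to fingerprint everything that shapes an agent's behavior, not just its code. The Python sketch below is a minimal illustration under stated assumptions: the configuration keys (`prompt`, `tools`, `model_version`, `retrieval_strategy`) and their values are hypothetical, and a real program would need its own policy for which differences actually trigger revalidation.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 hash of an agent's behavioral configuration."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_fields(approved: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Names of the configuration fields that differ from the approved baseline."""
    keys = set(approved) | set(current)
    return sorted(k for k in keys if approved.get(k) != current.get(k))


# Hypothetical baseline captured at the last validation.
approved = {
    "prompt": "You review flagged transactions.",
    "tools": ["core_banking_lookup"],
    "model_version": "provider-model-v1",
    "retrieval_strategy": "top_k_5",
}

# Hypothetical current state: a tool was added and the provider shipped a new model.
current = dict(
    approved,
    tools=["core_banking_lookup", "vendor_risk_query"],
    model_version="provider-model-v2",
)

if fingerprint(current) != fingerprint(approved):
    print("material change:", changed_fields(approved, current))
```

Neither of the two changes in this example would appear in a code diff, which is the gap the paragraph above describes. A fingerprint only detects that something changed, so it complements continuous evaluation against production traffic and does not replace it.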

On the governance side, oversight is fragmented in most organizations today, with application teams building guardrails ad hoc and nothing rolling up to centralized reporting. Meeting the MRM bar calls for two things: a unified view that compliance, risk, and engineering can all work from, and governance that works across stacks, since most institutions are running agents across more than one cloud provider, framework, or LLM, and that isn't going to get simpler.

Putting This Into Practice

Imagine a global bank deploying a fraud detection agent to support its reviews of suspicious transactions. The agent pulls flagged transactions from the core banking system, looks at the customer's recent activity, queries vendor data on counterparty risk, cross-references known fraud patterns, and drafts a recommendation: clear, hold, or escalate to a human investigator. It's the kind of workflow that's an obvious productivity win for a team that's always behind on its queue.

The interesting thing is what this single agent demands from a governance standpoint. It touches every pillar of model risk management at once:

Model risk teams want it in the inventory with an accountable owner and a use-case classification.
Validators want a traceable decision path, since that's where soundness lives for an agent.
Compliance teams need assurance that the agent isn't introducing disparate impact across customer segments.
Audit teams need an evidence trail behind every recommendation.
Engineering teams need to know the second accuracy starts drifting, because a quiet regression here isn't just a quality issue. It's an active source of customer harm.

The fraud agent isn't a special case. Almost every production agent in a regulated environment looks like this. And the governance posture has to work for all five of these functions, from one foundation, or it stops working at all.

What "Winning" Looks Like

The institutions getting this right are turning governance into a source of velocity, while the ones that aren't are quietly setting themselves up for one of two failure modes. The organizations that get this right will share a few characteristics. They will have more agents reaching production faster, because first-line teams can ship with confidence, knowing that the guardrails, evaluations, and observability are all there. Their compliance and audit teams will be able to self-serve the answers they need, which turns regulatory reviews and audit cycles from periodic fire drills into a steady rhythm.

These organizations will also have figured out that governance infrastructure is a foundation for agent velocity rather than a tax on it. Strong controls are what let the business move faster, because the consequences of getting something wrong stay bounded and observable. The institutions that don’t build these controls will end up in one of two places. Some will slow agent deployment significantly because the risk surface feels too uncontrolled to push further. Others will keep deploying and absorb the consequences through incidents, regulatory findings, and remediation costs that end up much larger than the original investment would have been.

What separates the institutions that get this right from those that don't isn't the framework they pick. It's whether they've rebuilt the operational layer underneath their existing model risk discipline, and whether they've done it as a cross-functional effort owned at the senior AI leadership level rather than as a project handed off to any single team.

The Strategic Takeaway

Model risk management as a discipline has held up well for decades because its principles are sound, and they still are. SR 11-7 is one example. The EU AI Act is another. Internal audit standards are a third. What has to change across all of them is the operational layer underneath. The authors of SR 11-7, for instance, couldn't reasonably have anticipated this kind of system when they were writing for a world where the most sophisticated model in production was a logistic regression with a few hundred features.

For AI leadership teams that are now beginning to think about this, the practical question at hand is how to build a governance foundation that satisfies SR 11-7, adjacent regulatory frameworks, and internal stakeholders who all want visibility. The institutions that build that foundation once, and serve it to every function that needs it, are the ones that turn agentic AI from a governance headache into a competitive advantage.

If your team is working through how to extend its model risk and AI governance program to agentic systems, we've been having a lot of these conversations at Arthur.

Our platform was built for exactly the operational gap this post describes: continuous discovery of agents across the enterprise, decision-path traces validators can actually reason from, continuous evaluations against production traffic, and real-time guardrails that intercept bad inputs and outputs before they cause harm. The same foundation serves model risk, audit, compliance, security, and engineering from one source of truth, which is the architecture this problem ultimately requires.

If you're sizing up what it would take to get there, we're happy to compare notes. Reach out to our team here and we'll walk you through what we've seen work across other large institutions navigating the same shift.
