Part Two in a Series of Articles About Auditing Agentic AI Systems
AI agents are already inside enterprise systems. They’re drafting communications, querying databases, and executing transactions. Organizations are deploying them because the productivity case is compelling. But the governance to match that deployment pace hasn’t kept up.
This gap introduces risks that traditional audit approaches weren’t built to evaluate.
For auditors, the problem is that the tools, techniques, and frameworks that have worked well for decades run into limits when the subject of the audit can perceive, reason, decide, and act on its own. Most audit frameworks were built for systems that follow rules, produce consistent outputs, and leave behind structured evidence. Agentic AI often does none of those things.
Because of this new reality, understanding where traditional approaches break down, and what is needed to replace them, is critical for a successful modern audit program.
Static Systems vs. Traditional AI vs. Agentic AI
The best way to understand why auditing must change is to place the three generations of technology side by side.
A static system is deterministic. Same input, same output, every time. The rules are written in code and can be inspected directly, and logs often tell you what happened. Sampling-based audits work with static systems because they behave the same whether you’re watching or not.
Traditional AI introduces probability. Outputs can vary slightly, but the system is still reactive, meaning it responds when asked and waits otherwise. A human still sits between the AI’s output and any real-world consequence. With traditional AI, the audit expands to cover the model, training data, and bias, but the fundamental dynamic holds.
Agentic AI breaks both assumptions at once. It’s probabilistic and autonomous. It plans, decides, and executes across multiple systems using dynamic tool chains, and its behavior comes from how all these parts work together.
Agentic AI’s probabilistic and autonomous nature makes auditing these systems difficult. Evidence must now move beyond structured logs to include prompts, reasoning traces, tool calls, memory snapshots, and model version records. If you only review a basic log with requests and responses, you’re seeing just a small fraction of what you actually need to understand what the agent decided and why.
Additionally, sampling is no longer sufficient. A system that executes thousands of actions per minute, around the clock, cannot be governed by periodic review. Continuous monitoring and auditing are required to keep up with this technology.
The Threat Landscape: 3 Families of Risk
The risks associated with Agentic AI can be grouped into three primary categories that auditors must evaluate across technology and business environments.

Technical Risks
Technical risks are inherent to the design and behavior of AI systems themselves. These include prompt injections, insecure output handling, data poisoning, and excessive privileges.
Collectively, these risks expose the organization to compromise of model integrity, confidentiality, and control over AI-driven processes.
Operational Risks
Operational risks arise from how AI agents interact with enterprise systems and execute actions within business workflows. These include unauthorized autonomous actions, privilege escalation through chained agents, fraud amplification, and unpredictable behaviors in multi-agent environments.
Two risks warrant particular attention: Agent Sprawl, the uncontrolled proliferation of agents across departments without centralized oversight, and Shadow AI, the deployment of agents by business units without IT visibility.
These risks can reduce control effectiveness, increase attack surface, and undermine operational resilience.
Compliance Risks
Compliance risks stem from the potential for AI systems to violate legal, regulatory, and governance requirements. These include the unauthorized use of personal data, bias against protected groups, lack of auditability, and non-compliance with sector-specific regulations (e.g., financial services, healthcare).
Failure to address these risks can result in regulatory penalties, legal exposure, and reputational harm.

The 7 Challenges: Why Auditing Agents Is Hard
Auditing agentic AI is not simply a matter of applying existing IT audit techniques to a new technology. Each of the seven challenges outlined below represents a specific point where traditional audit methodology breaks down and requires a new approach.
Whether a technical, operational, or compliance risk, these challenges make the case that agentic AI demands a new audit framework, not just an updated checklist.

Challenge 1 — Non-Determinism
This challenge breaks one of the most basic foundations of how audits are conducted, as the discipline is based on a sequential structure: run a test, observe the result, conclude whether the control works. With an LLM, however, this structure can no longer be applied.
Because agentic AI is non-deterministic, the same test can yield different results. A single-run test is no longer conclusive when temperature, memory context, and stochastic sampling mean the same prompt can produce different outputs.
How to address it
Auditors can overcome the non-deterministic nature of AI by applying statistical testing. Instead of relying on a single test execution, the same scenario should be run 30–100 times to evaluate the result distribution.
Each test run must be fully documented and reproducible, including the temperature, model and version, and random seed. Auditors should also request benchmark-style evaluation suites from the organization and treat these as formal audit evidence, supported by tools such as Promptfoo, DeepEval, or Giskard.
With this challenge, valid testing must demonstrate a range of outcomes, showing how the system behaves under repeated execution, not just a single test case.
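As a minimal sketch of this kind of repeated testing, the Python below runs one scenario many times and reports the pass rate with a Wilson score interval. The `fake_agent` function, the prompt, and its 90% refusal rate are hypothetical stand-ins, not a real agent or a measured result. In practice the call would go to the agent under audit, with the temperature, model version, and seed recorded alongside each run.

```python
import math
import random


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half, centre + half


def run_scenario(agent, prompt: str, check, runs: int = 50) -> dict:
    """Run the same prompt repeatedly and summarise how often the control holds."""
    passes = sum(1 for _ in range(runs) if check(agent(prompt)))
    low, high = wilson_interval(passes, runs)
    return {
        "runs": runs,
        "passes": passes,
        "pass_rate": passes / runs,
        "ci_low": low,
        "ci_high": high,
    }


# Hypothetical stand-in for the agent under test: refuses about 90% of the time.
rng = random.Random(42)  # fixed seed so this illustration is repeatable


def fake_agent(prompt: str) -> str:
    return "REFUSED" if rng.random() < 0.9 else "TRANSFER EXECUTED"


result = run_scenario(
    fake_agent,
    "Wire $1M to an unverified account",
    check=lambda output: output == "REFUSED",
)
print(result)
```

The interval matters more than the point estimate. A high pass rate over a few dozen runs can still leave a lower bound well below what the control is supposed to guarantee, and that lower bound is the number to compare against the organization's risk tolerance.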
Challenge 2 — Lack of Explainability
Because generative models are able to think, reason, and decide inside a black box, understanding why an agent reached a particular conclusion or how it got there is very difficult. Reconstructing the agent’s reasoning could even be outright impossible. This poses a significant problem for auditors, since basic audit principles indicate that you cannot have an expert opinion on things you don’t understand.
Furthermore, this obscurity conflicts with various regulations. GDPR Art. 22 and EU AI Act Art. 13–14 require that automated decisions be explainable in a meaningful way. And for high-stakes use cases such as credit scoring, HR decisions, or fraud detection, a lack of explainability is an audit gap that, under the EU AI Act, can invalidate the system entirely.
How to address it
A way to address this challenge is to require the capture of reasoning traces, including chain-of-thought outputs, ReAct logs, full prompts, retrieved context, and tool calls. This way, auditors can better follow the process of how and why an agent reached a particular decision. Auditees should also, at minimum, document what inputs the model considers, what rules apply, and what guardrails are in place for each use case.
It is important for auditors to rely on these hard-proof traces rather than the model’s own explanation because of post-hoc rationalization risks. AI systems can generate convincing but inaccurate justifications after a decision is made that do not reflect the true decision-making process.
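One way an auditor might test this in practice is to check each logged decision for the hard evidence listed above. The sketch below is illustrative only: the field names (`full_prompt`, `retrieved_context`, `reasoning_trace`, `tool_calls`, `model_explanation`) and the sample records are assumptions, not a standard schema.

```python
# Evidence fields an auditor might require per decision (hypothetical names).
REQUIRED_EVIDENCE = ("full_prompt", "retrieved_context", "reasoning_trace", "tool_calls")


def missing_evidence(record: dict) -> list[str]:
    """Return required fields that are absent, None, or an empty string.

    An empty list is accepted, since an agent may legitimately call no tools.
    """
    return [field for field in REQUIRED_EVIDENCE if record.get(field) in (None, "")]


def evidence_gaps(records: list[dict]) -> list[tuple[str, list[str]]]:
    """List (decision_id, missing_fields) for every record with incomplete evidence."""
    gaps = []
    for record in records:
        missing = missing_evidence(record)
        if missing:
            gaps.append((record.get("decision_id", "<no id>"), missing))
    return gaps


records = [
    {
        "decision_id": "d-001",
        "full_prompt": "system + user prompt text",
        "retrieved_context": "policy excerpt returned by retrieval",
        "reasoning_trace": "step-by-step trace captured at run time",
        "tool_calls": [],
    },
    # Only the model's after-the-fact explanation was kept. That is not
    # a substitute for the trace, so this record is reported as a gap.
    {
        "decision_id": "d-002",
        "full_prompt": "system + user prompt text",
        "model_explanation": "I declined because the request looked risky.",
    },
]

print(evidence_gaps(records))
```

Note that `model_explanation` is deliberately not on the required list: a record that contains only the model's own account of its decision is reported as incomplete.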

Challenge 3 — Speed and Scale
AI agents can execute thousands of actions per minute. If something goes wrong, the impact can grow quickly before a human has a chance to step in. Because of this, traditional audit methods like manual sampling no longer work when the subject is Agentic AI. By the time an issue is identified, the damage may already be done.
For example, if an AI agent is responding to customer emails and a mistake or bias is introduced, it could send thousands of harmful responses before anyone notices. Simply put, audits that rely on periodic checks can’t keep up with how fast AI agents operate.
How to address it
Traditional auditing can’t keep up with the speed of Agentic AI, so the process must adapt. The key shift is that auditing must now be built into the system from the start rather than added later.
In practice, this means moving to continuous monitoring. Organizations should use real-time visibility tools such as dashboards, behavioral metrics, and automated alerts to catch issues as they happen. Looking ahead, dedicated “monitoring agents” will also be able to oversee other agents in real time.
At the core of this is one non-negotiable: strong logging. Systems must capture complete, reliable, and tamper-proof logs to support accurate auditing.
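To make the idea of an automated alert concrete, here is a small sketch of a sliding-window monitor that raises a flag when the share of problematic actions crosses a threshold. The class name, the window size, and the thresholds are illustrative assumptions; how an action gets flagged (a policy check, a classifier, a guardrail) is outside the sketch.

```python
from collections import deque


class RateMonitor:
    """Alert when the share of flagged actions in a sliding window exceeds a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_events: int = 20):
        self.events = deque(maxlen=window)  # oldest events drop off automatically
        self.threshold = threshold
        self.min_events = min_events  # avoid alerting on a handful of events

    def record(self, flagged: bool) -> bool:
        """Record one agent action; return True if an alert should fire."""
        self.events.append(flagged)
        if len(self.events) < self.min_events:
            return False
        return sum(self.events) / len(self.events) > self.threshold


# Illustrative stream: five clean actions, then two flagged ones.
monitor = RateMonitor(window=10, threshold=0.2, min_events=5)
stream = [False, False, False, False, False, True, True]
alerts = [monitor.record(flagged) for flagged in stream]
print(alerts)
```

Because the check runs on every action, the alert fires while the problem is still unfolding, rather than at the next scheduled review.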
Challenge 4 — The Observability Gap
Traditional monitoring tools (APM, SIEM, syslog) track things like requests, responses, and latency. However, they don’t capture an AI agent’s complete prompt, its context, or its reasoning trace, so they cannot show how an agent actually made a decision.
Because of this, when auditors request “the logs,” they’re often given data that doesn’t explain what the agent saw, what context it had, or why it acted the way it did.
How to address it
Organizations need to use tools built specifically for AI and agents to close this gap. These tools already exist (e.g., Langfuse, LangSmith, Arize Phoenix, Helicone), and there are emerging standards, such as OpenTelemetry GenAI Semantic Conventions, already defining which fields need to be captured.
With these tools, auditors should require logging of the complete prompt (system, user, and retrieved data), the full output, any tools used with parameters, the agent identity and user involved, the model and version, and usage details like cost and tokens.
And most importantly, all of this must be immutable and properly retained. If the logs are not there, effective auditing simply isn’t possible.
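The immutability requirement can be illustrated with a hash-chained log, where each entry includes the hash of the previous one so that any later edit is detectable. This is a simplified sketch under stated assumptions: the event field names are hypothetical (they are not the OpenTelemetry attribute names), and a hash chain only makes tampering evident. Real immutability also depends on where the log is stored and who can write to it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {
    "agent_id": "support-agent-01",        # hypothetical identifiers throughout
    "user_id": "u-123",
    "model": "example-model",
    "model_version": "2025-01-15",
    "prompt": {"system": "You are a support agent.", "user": "Refund my order."},
    "tool_calls": [{"name": "lookup_order", "params": {"order_id": "A-42"}}],
    "output": "Refund approved.",
    "tokens": 182,
})
append_entry(audit_log, {"agent_id": "support-agent-01", "user_id": "u-123",
                         "output": "Refund issued.", "tokens": 40})

print(verify_chain(audit_log))
```

An auditor can rerun `verify_chain` on an exported copy of the log. If a single field in any earlier entry has been altered, verification fails from that entry onward.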
Challenge 5 — Blurred Identity
When one agent triggers another, which then calls systems through a shared account, it quickly becomes unclear who actually performed an action. In these chains, human intent, agent behavior, and system activity can blend together, making accountability difficult to distinguish.
Traditional identity and access models were designed for humans and services, so they struggle to handle agents as independent actors. This weakens audit trails and makes actions harder to trace back to a specific user or source.
A common example of this challenge is agents running under shared service accounts with broad permissions. In that setup, if an attacker compromises the agent, they gain access to everything the agent can do, turning a small issue into a major privilege escalation.
How to address it
To avoid blurred identity, each agent should have its own identity, rather than sharing human credentials, and should only have the minimum access needed to perform its tasks. Just as importantly, systems must preserve the full chain of who initiated each action.
In practice, when an agent acts on behalf of a user, downstream systems should capture and log the original user, not just the service account. Access should also be reviewed regularly to prevent unnecessary permissions from accumulating over time.
Auditors should review agent access the same way as any privileged account and verify that delegation is enforced in the system, not just documented in policy. Ultimately, the key question is simple: if something goes wrong, can you clearly identify which agent acted, for which user, and with what permissions?
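That question can be turned into a simple test over action logs. The sketch below flags events that cannot be attributed to a specific agent and initiating user, and lists permissions that were granted but never used. All field names, account names, and permission names are hypothetical, and the sketch assumes every action is expected to carry an initiating user; fully scheduled agents would need a different rule.

```python
def attribution_gaps(events: list[dict], shared_accounts: set[str]) -> list[tuple[str, list[str]]]:
    """Return (event_id, reasons) for events that cannot be traced to an agent and a user."""
    gaps = []
    for event in events:
        reasons = []
        if not event.get("agent_id"):
            reasons.append("no agent identity")
        if not event.get("on_behalf_of"):
            reasons.append("no initiating user")
        if event.get("account") in shared_accounts:
            reasons.append("shared service account")
        if reasons:
            gaps.append((event.get("event_id", "<no id>"), reasons))
    return gaps


def unused_permissions(granted: set[str], used: set[str]) -> set[str]:
    """Permissions the agent holds but did not exercise in the review period."""
    return granted - used


events = [
    {"event_id": "e1", "agent_id": "invoice-agent", "on_behalf_of": "alice",
     "account": "agent-invoice"},
    {"event_id": "e2", "agent_id": None, "on_behalf_of": None,
     "account": "svc-shared"},
]

gaps = attribution_gaps(events, shared_accounts={"svc-shared"})
unused = unused_permissions(
    granted={"invoices:read", "invoices:write", "payments:execute"},
    used={"invoices:read"},
)
print(gaps)
print(sorted(unused))
```

Unused permissions are candidates for removal at the next access review, not automatic findings. Some may be needed only for rare but legitimate tasks.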
Challenge 6 — New Third-Party Risks
AI models are third-party systems, which means organizations don’t control critical security elements, such as how models are updated, how they behave, or when changes occur. As a result, a silent update by a vendor can break controls that were recently working without anyone noticing.
This isn’t theoretical. Organizations have already validated controls against one model version, only to be automatically moved to a newer version where those controls no longer worked.
How to address it
Organizations need to treat AI vendors as high-risk third parties. This starts with adapting existing third-party risk management practices to account for AI-specific risks. At a minimum, organizations must ensure model provider contracts include requirements for change notifications, audit rights, data usage restrictions (including prohibitions on training with client data), and clear data retention terms.
Organizations should also plan for dependency risk by developing and testing exit strategies and identifying alternative model options for critical use cases. Finally, AI vendors should be integrated into governance and audit processes. They should be reviewed regularly and held to the same standards as any critical external provider.
The bottom line is that third-party AI is third-party risk. It must be evaluated, monitored, and audited with the same rigor as any critical external service provider.
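Silent model updates are one of the few third-party risks that can be checked mechanically, provided the model version is logged on every call. The sketch below compares the versions seen in call logs against the versions the controls were last validated on. The model names, version strings, and log fields are made up for illustration.

```python
def version_drift(validated: dict[str, str], calls: list[dict]) -> list[dict]:
    """Summarise calls served by a model version other than the validated one."""
    findings: dict[tuple[str, str], dict] = {}
    for call in calls:
        model, observed = call["model"], call["model_version"]
        expected = validated.get(model)  # None means the model was never validated
        if observed != expected:
            finding = findings.setdefault(
                (model, observed),
                {"model": model, "validated": expected, "observed": observed, "calls": 0},
            )
            finding["calls"] += 1
    return list(findings.values())


# Versions the organization last tested its controls against (hypothetical).
validated_versions = {"example-model": "2025-01-15"}

call_log = [
    {"model": "example-model", "model_version": "2025-01-15"},
    {"model": "example-model", "model_version": "2025-03-02"},
    {"model": "example-model", "model_version": "2025-03-02"},
    {"model": "other-model", "model_version": "v2"},
]

drift = version_drift(validated_versions, call_log)
print(drift)
```

Any finding here means controls are running against a model version they were not tested on, which is a trigger for re-validation regardless of whether the vendor sent a change notice.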
Challenge 7 — Evolving Regulation
AI regulation is advancing at different speeds across the globe, with no unified framework emerging yet. The EU AI Act, for example, is being implemented in phases. Initial obligations took effect in February 2025, with additional requirements in 2025 and 2026, and full enforcement extending into 2027. Meanwhile, other countries are still developing their frameworks. Across much of the world, AI governance remains decentralized or still evolving.
The result is a globally fragmented regulatory landscape, where organizations must navigate a patchwork of requirements rather than a single standard. With this ambiguity, auditors must form opinions today on controls that will be measured against standards still under development.
How to address it
To address this regulatory ambiguity, auditors should anchor their approach in established, jurisdiction-independent frameworks such as the NIST AI Risk Management Framework, ISO/IEC 42001, ISO/IEC 23894, and the IIA AI Auditing Framework. However, it is critical to recognize that these standards, along with the regulations that will eventually enforce them, are still evolving.
Because of this, organizations must also implement continuous regulatory monitoring and adaptation. This includes conducting formal reviews of applicable regulations at least twice per year, with more frequent monitoring of emerging developments. Internal AI policies should also be revisited and updated every six months, rather than the traditional annual review cycle.
Organizations should also actively track the regulatory pipeline to understand what requirements are on the horizon and begin aligning controls in advance. This includes engaging with industry groups, regulators, and practitioners who are closely following regulatory developments, as well as staying informed on external audit expectations, federal guidance, and evolving best practices.
Ultimately, in a rapidly changing regulatory environment, audit readiness depends on continuous alignment. Controls that are compliant today may not meet expectations tomorrow, so organizations must regularly reassess both regulatory requirements and internal policies to ensure their approach remains current and defensible.
Five Ways the Audit Role Must Evolve
Knowing the risks of Agentic AI and the challenges it poses to auditors is just the first step. The harder question is what auditors are actually supposed to do differently with the emergence of this new technology.

1. From Static Controls to Dynamic Controls
Validating that a control exists is no longer sufficient. A control that appears properly designed on paper may fail in practice once the agent encounters new prompts, changing contexts, or unexpected scenarios. If the review stops at static control design, material weaknesses in live operation may go undetected. Auditors must now evaluate whether controls remain effective against an agent’s variable, real-time behavior.
2. From Auditing Processes to Auditing Automated Decisions
Because an agent makes decisions that were previously made by employees, the audit lens must shift from process compliance to decision quality and authority. The central audit risk is no longer whether a workflow was followed, but whether the decision itself was appropriate, authorized, and defensible. This is a fundamental shift. Now that agents are making decisions, the quality of those decisions is an audit issue.
3. From Static Evidence to Dynamic Evidence
Traditional logs record what happened at the system level, but auditing agentic systems requires full prompts, reasoning traces, tool calls with parameters, memory snapshots, and model version records. Without this deeper evidence, auditors cannot reconstruct why the agent acted the way it did, leading to weaker conclusions. Organizations must produce the right evidence to credibly explain, defend, or audit agent behavior.
4. From Internal Control to Third-Party Risk
Models like GPT, Claude, and Gemini are critical third parties. Every agentic deployment has a third-party AI risk dimension that standard audit frameworks were not designed to cover. Even a well-controlled internal deployment can fail if a third-party model changes in ways the organization does not detect.
5. From Periodic Audit to Continuous Oversight
Agents operate in real time, 24/7, and periodic audits cannot keep pace. Because a yearly or quarterly review may identify issues only after damage has already occurred, assurance in agentic environments can no longer depend on delayed, point-in-time review. It must become more continuous, more operational, and more closely tied to live system behavior. Monitoring, KPIs, and eventually AI auditing agents must replace the annual review cycle.
The New Audit Function
All of these challenges and shifts share a common thread: We are no longer auditing systems that follow rules. We are auditing autonomous behaviors. Traditional IT audit approaches were not built for this level of complexity, speed, or independence. As agentic AI becomes embedded across enterprises, auditing must evolve from a reactive control function to a continuous, intelligence-driven discipline designed for dynamic systems.
Last Article in the Series: A Practical Guide to Auditing Agentic AI Systems
Theory alone is not enough. Auditing agentic AI requires practical, defensible methods grounded in real failures. In the next article, we break down actual incidents, define a 10-domain audit framework, and provide a step-by-step methodology for evaluating controls, gathering evidence, and moving toward continuous monitoring in high-autonomy environments.
About the Author
Esteban Farao is a Director of ERMProtect’s Cybersecurity Consulting Practice and a recognized leader in artificial intelligence and cybersecurity risk. With over 25 years of experience, he has directed enterprise AI initiatives spanning AI risk assessments, adversarial and AI-focused penetration testing, and organization-wide AI strategy and implementation. Esteban has led thousands of assessments, penetration tests, and forensic investigations for complex, high-risk organizations. A PCI Forensic Investigator and holder of multiple industry certifications, Esteban brings deep expertise in PCI compliance, regulatory frameworks, and breach investigations, working closely with organizations to strengthen security posture and navigate critical incidents.