How Do You Govern AI You Can’t See Inside Your Apps?

Richard, thanks for having me. Over the last few years, the center of gravity for AI in the enterprise shifted from obvious, blockable apps to quiet behaviors embedded in tools everyone already uses. That makes access control feel quaint; the real work now is visibility, observability, and behavior control across browsers, SaaS, and on‑premises stacks. In our conversation, we explore why blocking breeds shadow AI, how to put guardrails where work actually happens, and what orchestration and telemetry must provide to turn discovery into day‑to‑day governance. We also discuss data borders for valuable internal knowledge, the browser’s new role, and how to measure reliance when AI is nearly invisible inside approved workflows.

As AI features embed into tools like office suites, design platforms, and coding assistants, where does the real control point move to, and why? Can you share a story where access was permitted but behavior wasn’t visible? What first three controls would you implement?

The control point moves from app access to the behavioral layers where work, prompts, and data collide—most tangibly in the browser and in the orchestration or policy planes behind it. When AI lives inside Canva, Copilot, Google Workspace with Gemini, or Adobe integrations, you can no longer rely on a single “allow/deny” for ChatGPT, Claude, or Gemini; the question becomes how much AI behavior is shaping approved work. A memorable moment: we granted a team access to an AI‑enabled document editor, saw no spikes in app usage, yet quality drifted—templates came back with odd phrasings and subtle policy gaps because embedded suggestions were quietly accepted without record. We had access, but not behavior; it felt like looking through frosted glass. My first three controls: instrument prompts and outputs at the point of use (with redaction), enforce data‑scope policies that travel with identity regardless of app, and add pre‑flight checks that test prompts against policy before they reach any model.
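
To make those first three controls concrete, here is a minimal Python sketch of a pre-flight check that redacts red-line content, enforces an identity-bound data scope, and records the decision; the patterns, scope names, and identity fields are illustrative assumptions, not any particular product's API.

```python
import re
from dataclasses import dataclass

# Illustrative red-line patterns; a real deployment would source these
# from the organization's data-classification policy.
REDLINE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # e.g. SSN-like identifiers
    re.compile(r"(?i)\bproject\s+atlas\b"),  # hypothetical internal codename
]

@dataclass
class Identity:
    user: str
    role: str
    allowed_scopes: set  # data scopes this identity may expose to AI features

def redact(text: str) -> str:
    """Strip red-line content before the prompt is logged or sent anywhere."""
    for pattern in REDLINE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def preflight(identity: Identity, prompt: str, requested_scope: str) -> dict:
    """Check a prompt against policy before it reaches any model."""
    clean = redact(prompt)
    if requested_scope not in identity.allowed_scopes:
        decision = "block"
    elif clean != prompt:
        decision = "review"  # red-line content found; hold for a human
    else:
        decision = "allow"
    # The instrumented event is what makes behavior visible later on.
    return {"user": identity.user, "scope": requested_scope,
            "decision": decision, "prompt": clean}

# Example: an analyst may use 'marketing-public' but not 'finance-restricted'.
analyst = Identity("a.chen", "analyst", {"marketing-public"})
print(preflight(analyst, "Summarize project atlas pricing", "finance-restricted"))
```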

When AI is no longer a separate app but a behavior inside approved software, how should risk assessments change? Which data-flow questions become non‑negotiable? What metrics best reveal hidden AI influence on work quality and cycle time?

Risk assessments need to treat AI as an operational behavior, not an external vendor risk; that means modeling the workflow steps where AI can read, transform, or generate content and asking how those steps intersect with sensitive data. Non‑negotiable questions include: where does prompt context originate, what data leaves the tenant boundary, and which outputs get auto‑applied versus held for review. Because visibility is the next enterprise AI problem, I look for signals that expose hidden influence: the ratio of AI‑suggested to human‑authored edits, the share of cycle time spent in “accept suggestion” states, and the divergence between policy‑aligned templates and AI‑modified artifacts. When those metrics shift while app access remains unchanged, it’s a tell that embedded behavior, not external use, is driving outcomes.
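
As a rough illustration of those signals, a minimal sketch of how the edit-ratio and accept-state metrics could be computed from instrumented editor events; the event shape and numbers are hypothetical.

```python
from collections import Counter

# Hypothetical edit events captured from an instrumented editor.
# Each event records who authored the change and how long the step took.
events = [
    {"author": "ai_suggestion", "accepted": True,  "seconds": 4},
    {"author": "human",         "accepted": True,  "seconds": 120},
    {"author": "ai_suggestion", "accepted": True,  "seconds": 3},
    {"author": "ai_suggestion", "accepted": False, "seconds": 2},
    {"author": "human",         "accepted": True,  "seconds": 90},
]

applied = [e for e in events if e["accepted"]]
counts = Counter(e["author"] for e in applied)

# Ratio of AI-suggested to human-authored edits that actually landed.
ai_ratio = counts["ai_suggestion"] / max(counts["human"], 1)

# Share of cycle time spent in "accept suggestion" states.
total_time = sum(e["seconds"] for e in events)
accept_time = sum(e["seconds"] for e in events if e["author"] == "ai_suggestion")
accept_share = accept_time / total_time

print(f"AI/human edit ratio: {ai_ratio:.2f}, accept-state time share: {accept_share:.1%}")
```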

Many teams try to block AI and end up creating more shadow AI. What early warning signs show this is happening? Which targeted allowances reduce workarounds? Can you outline a 30‑60‑90 day playbook to shift from blocking to governed enablement?

Early warning shows up as mismatches: official tools are “quiet,” yet deliverables show AI fingerprints—repetitive phrasing, generic imagery, or uniform code comments. You may also see odd browsing patterns where employees jump between sanctioned browsers and personal ones, or copy‑paste spikes that betray side‑door use. Targeted allowances that help include sanctioned prompting inside approved browsers, narrow access to internal knowledge with clear data borders, and a safe‑harbor policy for declaring use without penalty. For a 30‑60‑90: first 30, instrument the browser and top tools for prompt and output events, publish a visible safe‑use policy, and open an intake channel. Next 60, roll out data‑scope controls tied to identity, pilot human‑in‑the‑loop for auto‑apply features, and tune content filters. By 90, formalize orchestration policies across key apps, stand up audit reports that trace prompts to outputs, and deprecate blunt blocks in favor of governed enablement.

Useful AI often depends on internal knowledge, yet that’s exactly what leaders want to protect. How do you set “data borders” that keep value high without leakage? What red‑line categories do you enforce, and how do you test them in practice?

I define data borders as enforceable scopes bound to identity and purpose, not to applications. Practically, that means policy tagging on content repositories, prompt‑time resolvers that only fetch allowed snippets, and output routes that keep results inside the tenant unless explicitly approved. Red‑line categories include regulated data, proprietary know‑how, and accumulated expertise that gives the business its edge; those never traverse external systems without approved minimization and redaction. Testing happens in two loops: pre‑production with synthetic prompts designed to tease out leakage, and post‑deployment canary prompts that run daily in the live environment. If either loop can elicit red‑line content beyond scope, we treat it as a containment breach and halt the pathway until policy and filters are corrected.
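
A minimal sketch of a prompt-time resolver that respects those borders might look like the following; the policy tags, scope names, and snippets are invented for illustration.

```python
# Hypothetical knowledge snippets, each tagged with a policy code at ingestion time.
KNOWLEDGE = [
    {"id": "kb-101", "scope": "sales-collateral", "redline": False,
     "text": "Public case study summary."},
    {"id": "kb-207", "scope": "pricing-internal", "redline": True,
     "text": "Negotiated discount thresholds."},
]

def resolve(identity_scopes: set, purpose: str, query: str) -> list:
    """Fetch only snippets whose policy tags fall inside the caller's data border."""
    allowed = []
    for snippet in KNOWLEDGE:
        if snippet["redline"]:
            continue  # red-line content never crosses the border
        if snippet["scope"] not in identity_scopes:
            continue  # scope travels with identity, not with the app
        allowed.append({**snippet, "purpose": purpose, "query": query})
    return allowed

# A rep drafting external collateral sees only in-scope, non-red-line snippets.
print(resolve({"sales-collateral"}, "customer-proposal", "case studies for retail"))
```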

Visibility alone rarely satisfies IT leaders. What additional layers—observation, testing, guardrails—turn visibility into real control? Describe a concrete workflow where you instrumented prompts, data access, and outputs. What did the telemetry teach you?

Real control stacks three layers: continuous observation of prompts and outputs, adversarial testing that tries to subvert policy, and guardrails that can block, route, or require human review. In one workflow for customer proposals, we instrumented the browser to capture prompt metadata, used a resolver that logged each data fetch by scope, and tagged outputs with policy lineage before they hit the document system. Telemetry showed that most risk came from “helpful” auto‑inserts pulling from broad internal knowledge rather than from overt data grabs; a handful of phrases recurred, revealing that one embedded suggestion pattern was drifting outside policy. We tightened the resolver’s scope and moved those auto‑inserts behind a review gate; proposal quality improved and the subtle policy drift disappeared.
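
One way to picture the guardrail layer is a small router that decides what happens to an auto-insert based on the scopes its sources touched; the workflow and scope names below are hypothetical.

```python
# Hypothetical guardrail: decide what happens to an AI-generated insert
# based on the data scopes its source fetches actually touched.
WORKFLOW_POLICY = {"customer-proposal": {"sales-collateral", "marketing-public"}}

def route_insert(workflow: str, touched_scopes: set) -> str:
    """Return 'apply', 'review', or 'block' for an auto-insert."""
    allowed = WORKFLOW_POLICY.get(workflow, set())
    if not touched_scopes:
        return "apply"       # nothing sensitive was consulted
    if touched_scopes <= allowed:
        return "apply"       # every source sits inside the workflow's border
    if touched_scopes & allowed:
        return "review"      # mixed sources: hold for a human
    return "block"

# An insert that quietly pulled internal pricing gets held at a review gate.
print(route_insert("customer-proposal", {"sales-collateral", "pricing-internal"}))
```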

The browser has become a central place where prompts, data, and work collide. What controls belong in the browser versus in each app? How do you handle prompt security, extensions, and session isolation? Any lessons learned from incident postmortems?

The browser owns cross‑app controls: prompt capture and redaction, policy checks before prompts leave the tenant, session isolation, and extension governance. Apps should enforce fine‑grained permissions on read, write, send, and delete, but the browser is where oversight converges because it’s where people live all day. For prompt security, we inject a pre‑flight that inspects context, strips red‑line data, and annotates prompts with policy tags; extensions are allow‑listed and required to emit governance events. Session isolation keeps personal and enterprise contexts apart, preventing cookie and token bleed. Postmortems taught us that a single unsanctioned extension can unwind weeks of careful policy; locking extension installs and requiring event emission closed that door.
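
A toy sketch of the extension-governance piece, assuming an allow-list plus a requirement that extensions emit governance events; the extension IDs are made up.

```python
# Hypothetical extension governance: only allow-listed extensions that emit
# governance events may run in the enterprise browser profile.
ALLOWLIST = {
    "grammar-helper":  {"emits_events": True},
    "screenshot-tool": {"emits_events": False},
}

def extension_allowed(ext_id: str) -> bool:
    """An extension is permitted only if it is allow-listed and instrumented."""
    meta = ALLOWLIST.get(ext_id)
    return bool(meta and meta["emits_events"])

for ext in ("grammar-helper", "screenshot-tool", "unknown-sidebar"):
    print(ext, "->", "allowed" if extension_allowed(ext) else "blocked")
```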

Fragmented AI creates fragmented visibility across SaaS and on‑premises tools. Which integration patterns help unify observability—proxies, SDKs, event buses, or EDR‑style sensors? What governance data should every agent emit by default?

You need a blend: proxies to normalize traffic where feasible, SDKs for deep app‑level hooks, an event bus for consistent telemetry, and EDR‑style sensors in the browser and endpoints for catch‑all coverage. Each pattern alone misses something; together they stitch a coherent fabric across SaaS and on‑premises. Every agent should emit a compact governance record: user identity, data scope invoked, action attempted, model or feature invoked, confidence if available, source references consulted, and output destination. With that standard signal flowing through an event bus, you can apply policies in real time and perform post‑incident analysis that isn’t hostage to any single platform’s logging quirks.
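
As an illustration, a compact governance record published onto a shared bus might look like this sketch; the field names follow the list above, while the bus itself is a stand-in for whatever messaging layer an organization already runs.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class GovernanceEvent:
    """The compact record every agent emits, whichever collector captured it."""
    user: str
    data_scope: str
    action: str              # read / write / send / delete
    model: str
    confidence: Optional[float]
    sources: list
    destination: str

class EventBus:
    """Stand-in for the shared bus; real deployments would use Kafka, Pub/Sub, etc."""
    def __init__(self):
        self.events = []
    def publish(self, topic: str, event: GovernanceEvent):
        self.events.append((topic, asdict(event)))

bus = EventBus()
# A browser sensor and a SaaS SDK hook emit the same shape of record.
bus.publish("ai.governance", GovernanceEvent(
    "a.chen", "sales-collateral", "read", "copilot-draft", 0.71,
    ["kb-101"], "proposals/acme.docx"))
print(json.dumps(bus.events, indent=2))
```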

Vendors pitch orchestration layers to coordinate agents. What capabilities separate mere coordination from true cross‑platform visibility? How would you evaluate an orchestration platform—APIs, policy reach, audit depth, latency, or failure isolation? Share a proof‑of‑value checklist.

Coordination moves tasks; visibility explains behavior and enforces guardrails. True platforms expose rich APIs for policy injection, propagate identity‑bound data scopes across tools, and offer audit trails that trace prompts to outputs with action lineage. I evaluate on API openness, policy reach across SaaS and on‑premises, audit depth that survives platform upgrades, latency that doesn’t break user flow, and failure isolation so one agent’s stall doesn’t cascade. My proof‑of‑value checklist: instrument a browser‑based workflow, route prompts through the orchestrator with pre‑flight checks, enforce data borders, capture end‑to‑end audit, and simulate an outage to verify graceful degradation. If I can see and control behavior without users feeling the gears grinding, it’s a pass.
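
For the outage item on that checklist, a rough sketch of failure isolation: wrap the orchestrator call in a timeout and degrade to a flagged local fallback rather than letting the stall cascade; the orchestrator here is simulated.

```python
import concurrent.futures
import time

def call_orchestrator(prompt: str) -> str:
    """Stand-in for a cross-platform orchestrator call; here it simulates a stalled agent."""
    time.sleep(3)
    return "orchestrated result for: " + prompt

def with_failure_isolation(prompt: str, timeout_s: float = 1.0) -> str:
    """One agent's stall degrades to a flagged fallback instead of reaching the user as a hang."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_orchestrator, prompt)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "orchestrator unavailable: local draft used, flagged for review"
    finally:
        pool.shutdown(wait=False, cancel_futures=True)

print(with_failure_isolation("draft the weekly summary"))
```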

Once AI is active, leaders want granular behavior control. How do you specify and enforce allowed actions (read, write, send, delete) per context? What’s your approach to approvals, human‑in‑the‑loop, and rollback? Describe a scenario where this avoided harm.

I express permissions as policy contracts tied to context: for a given identity, location, and data scope, AI may read and suggest but not auto‑write; it may prepare drafts but cannot send externally without a human checkpoint. Approvals live where users work, not in far‑off portals, and human‑in‑the‑loop is mandatory when outputs cross data borders or touch red‑line categories. Rollback is built in: every auto‑apply change creates a reversible delta with lineage back to the originating prompt. In one scenario, an embedded agent tried to auto‑insert “helpful” internal insights into a customer artifact; the policy contract forced a review step, the reviewer rejected the insertion, and rollback purged traces from the document history. The result: value preserved, leakage avoided, and a clear audit of why.
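
A policy contract of that kind can be expressed very simply; this sketch uses invented roles, scopes, and action sets to show the allow / review / block split.

```python
# Hypothetical policy contract: what the AI may do for a given identity role,
# location, and data scope. Anything not listed requires a human checkpoint.
CONTRACTS = {
    ("analyst", "corp-network", "customer-docs"): {
        "allowed": {"read", "suggest"},        # AI may read and propose edits
        "review_required": {"write", "send"},  # a human must approve these
        "forbidden": {"delete"},
    },
}

def enforce(role: str, location: str, scope: str, action: str) -> str:
    """Resolve an attempted action against the contract for this context."""
    contract = CONTRACTS.get((role, location, scope))
    if contract is None:
        return "block"  # no contract means no autonomy
    if action in contract["allowed"]:
        return "allow"
    if action in contract["review_required"]:
        return "review"
    return "block"

# An embedded agent tries to auto-send a draft externally: forced to a review step.
print(enforce("analyst", "corp-network", "customer-docs", "send"))    # -> review
print(enforce("analyst", "corp-network", "customer-docs", "delete"))  # -> block
```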

Measuring reliance is tricky when AI is invisible inside workflows. Which metrics reveal over‑dependence—prompt frequency, auto‑applied changes, decision criticality, or model confidence vs. ground truth? How do you separate productivity gains from risk accumulation?

I watch for clusters: rising prompt frequency with a growing share of auto‑applied changes, especially in high‑criticality decisions, signals reliance that may outrun oversight. Where models expose confidence, I compare it to ground‑truth outcomes; persistent acceptance of low‑confidence suggestions is a red flag. To separate gains from risk, I map cycle‑time reductions to error rates and policy exceptions; if time drops but exceptions rise, the “help” may be turning into unvetted automation. A year or two ago we were chasing access, not behavior; that is why this shift in metrics matters: look past app counts to the invisible hands shaping the work.
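
One way to operationalize the confidence-versus-ground-truth check, using hypothetical accepted-suggestion records:

```python
# Hypothetical accepted suggestions with model confidence and a later
# ground-truth verdict (did the suggestion survive review or cause rework?).
accepted = [
    {"confidence": 0.35, "correct": False, "criticality": "high"},
    {"confidence": 0.40, "correct": False, "criticality": "high"},
    {"confidence": 0.90, "correct": True,  "criticality": "low"},
    {"confidence": 0.30, "correct": True,  "criticality": "medium"},
]

LOW_CONF = 0.5
low_conf_accepted = [s for s in accepted if s["confidence"] < LOW_CONF]
low_conf_share = len(low_conf_accepted) / len(accepted)
low_conf_error_rate = (
    sum(not s["correct"] for s in low_conf_accepted) / len(low_conf_accepted)
    if low_conf_accepted else 0.0
)

# Persistent acceptance of low-confidence suggestions that later prove wrong,
# especially in high-criticality decisions, is the over-dependence flag.
print(f"low-confidence acceptances: {low_conf_share:.0%}, "
      f"of which wrong: {low_conf_error_rate:.0%}")
```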

Regulated data and proprietary know‑how raise stakes. How do you conduct data mapping and lineage for embedded AI features? What’s your process for testing data minimization and redaction end‑to‑end? Share any audit findings that changed your roadmap.

Start by tagging repositories with policy codes, then map which prompts can trigger which resolvers, and log each data touch as a lineage edge from source to output. For minimization, we test with synthetic prompts that push for more context than policy allows and verify that resolvers fetch only the slivers within scope; redaction runs both pre‑prompt and post‑output to ensure no leakage. End‑to‑end, we replay real workflows with telemetry on, then compare expected lineage to actual events; any drift is treated as an audit gap. One audit surfaced that a generic auto‑suggest feature was reaching broader knowledge than intended inside sanctioned workflows; we narrowed the resolver scope and moved that feature behind a human‑in‑the‑loop gate, which immediately reduced exposure.
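
Comparing expected lineage to replayed telemetry can be as simple as set arithmetic over source-to-output edges; the edge names here are illustrative.

```python
# Each data touch is logged as a lineage edge from a tagged source to an output.
expected_edges = {("kb-101", "proposal-draft"), ("kb-114", "proposal-draft")}
actual_edges   = {("kb-101", "proposal-draft"), ("kb-207", "proposal-draft")}

unexpected = actual_edges - expected_edges  # sources consulted outside the map
missing    = expected_edges - actual_edges  # mapped sources that never fired

# Any drift between the expected map and the replayed telemetry is an audit gap.
if unexpected:
    print("audit gap: out-of-scope sources touched:", unexpected)
if missing:
    print("note: expected sources not consulted:", missing)
```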

Policies without enforcement fail. Which technical controls—network egress, token controls, model routing, content filters—matter most in real environments? How do you tune them to avoid breaking legitimate use? Describe the escalation path when controls block work.

In practice, network egress gates where data can travel, token controls bind usage to identity and purpose, model routing sends prompts to the right place with the right policy, and content filters act as the last barrier. Tuning starts with observation: capture real prompts, run them through dry‑run filters, and only then tighten thresholds; let safe prompts pass while catching edge cases. When controls block work, front‑line users need an in‑context appeal that routes to a rapid review team; the team can approve once with a logged exception, adjust policy if warranted, or propose an alternative workflow. This keeps the system humane: users aren’t punished for trying to work, and governance learns from real friction.
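
A dry-run pass over captured prompts might look like this sketch, with invented filter patterns; the point is to see what would be blocked before anything actually is.

```python
import re

# Filters run in dry-run first: they log what they would block without blocking.
FILTERS = {
    "account-number":    re.compile(r"\b\d{10,12}\b"),
    "internal-codename": re.compile(r"(?i)\bproject\s+atlas\b"),  # hypothetical
}

captured_prompts = [
    "Summarize the Q3 renewal options for Acme",
    "Draft an email about project Atlas rollout",
    "Check account 123456789012 for overdue invoices",
]

would_block = {name: 0 for name in FILTERS}
for prompt in captured_prompts:
    for name, pattern in FILTERS.items():
        if pattern.search(prompt):
            would_block[name] += 1

# Tighten thresholds only after reviewing the dry-run counts against real usage.
print(would_block)  # e.g. {'account-number': 1, 'internal-codename': 1}
```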

Change management is often overlooked. How do you train teams to recognize when “help” turns into unvetted automation? What documentation, prompts hygiene, and review rituals actually stick? Any anecdotes where frontline feedback reshaped governance?

We teach a simple mental model: suggestions are drafts, automation is action, and anything crossing data borders demands review. Documentation that sticks is short, example‑driven, and lives beside the tools; prompt hygiene checklists help people avoid oversharing context. Weekly rituals—lightweight peer reviews of AI‑touched artifacts—build the muscle without adding bureaucracy. Frontline feedback once revealed that policy language felt abstract; they suggested showing the exact pre‑flight prompt view the system sees. When we exposed that in the browser, understanding clicked into place and governance complaints dropped.

If you could standardize one telemetry schema for AI in the enterprise, what fields would it include—user, data scope, action, model, confidence, source refs, output destination? How would you use those fields for real‑time policy and post‑incident analysis?

My baseline schema includes: user, role, session, data scope requested and resolved, action attempted (read, write, send, delete), model or feature, confidence if present, source references, output destination, and policy decision with reason. In real time, the policy engine evaluates user plus scope plus action to decide allow, review, or block, and annotates the event with why. Post‑incident, that same record lets us reconstruct what happened without guessing—who asked for what, which knowledge was consulted, where the output went, and whether the system warned or intervened. Visibility aimed inward at usage depends on this compact, consistent event that every agent can emit.
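
A sketch of the post-incident side, assuming events stored with those fields; the log entry and destinations are hypothetical.

```python
# Hypothetical event log using the baseline schema fields.
events = [
    {"user": "a.chen", "role": "analyst", "session": "s-91",
     "scope_requested": "pricing-internal", "scope_resolved": "sales-collateral",
     "action": "read", "model": "copilot-draft", "confidence": 0.7,
     "sources": ["kb-101"], "destination": "proposals/acme.docx",
     "decision": "review", "reason": "requested scope exceeds contract"},
]

def reconstruct(destination: str, log: list) -> list:
    """Post-incident: trace every prompt whose output reached a given destination."""
    return [
        {k: e[k] for k in ("user", "scope_requested", "scope_resolved",
                           "sources", "decision", "reason")}
        for e in log if e["destination"] == destination
    ]

for record in reconstruct("proposals/acme.docx", events):
    print(record)
```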

Budget and ownership get murky across IT, security, and business units. Who should own AI visibility and control, and how do you fund it? What SLAs, KPIs, and operating cadence keep everyone aligned? Share a model org chart that has worked.

Ownership sits best with a cross‑functional AI governance office that includes security, IT, data, and key business lines, with a single accountable leader for visibility and control. Funding comes from a shared pool justified by risk reduction and cycle‑time improvements inside approved software; tying dollars to both safety and productivity keeps support broad. SLAs cover response times for blocks and appeals, KPIs track policy exceptions, review throughput, and quality drift, and the cadence includes regular stakeholder reviews alongside frontline feedback loops. Org‑wise, place the governance office parallel to platform engineering and security, with dotted lines to major business units; this lets policy travel without getting trapped inside any single function.

What is your forecast for enterprise AI visibility?

Visibility will move from episodic discovery to continuous, behavior‑level observability that lives in the browser and orchestration layers. As AI keeps disappearing into approved software and everyday workflows, leaders will demand deeper insight into where AI is active, what data it can reach, and how much control they retain once features are in use. The most useful AI will still press against data borders, so the winning enterprises will be those that keep visibility aimed inward at usage while preventing outward leakage of proprietary and sensitive information. The old block‑and‑control model has fallen behind; the next wave blends instrumentation, testing, and guardrails so work feels natural and safe at the same time.
