Eight questions every AI & Data practitioner runs into — and the deep answers our practice has built up to address them. Architectures, cost benchmarks, training tracks, peer-reviewed research, and the running newsletter, all in one navigable shelf.
7 Live · 1 Coming Soon
The 8 in the Repo
Numbered. Curated. In order.
1
2
3
4
5
6
7
8
Advanced Research & Peer Review Library
What did the smartest people in AI publish this year?
Thirty papers from the venues that actually move the needle — NeurIPS, ICLR, ICML, ACL, CVPR — highly selected and rewritten out of academic prose into something you'd read on a Sunday. Click any paper to dive into the original.
AI companies are hungry for more training data. Defunct startups are in their sights.
Our takeaway: selling day-to-day employee work data is helping failed startups recoup funds. Emails, Slack chats, and Jira tickets are fetching real prices.
We need to talk to clients about: who actually owns ROI?
Our takeaway: companies are still struggling to track ROI through business outcomes rather than AI deployment milestones. One way to kickstart productive conversations: ask if they're putting ROI responsibilities in the right place.
AI is acing its benchmark exams. Does that translate to business value?
The Stanford Institute for Human-Centered AI released its 2026 AI Index Report recently. There's lots of good news, including data that show AI models continuing to accelerate in performance capabilities based on widely-used benchmarks. But the Stanford researchers flag a critical caveat: strong benchmarks don't necessarily predict strong or reliable performance in real-world implementations.
Like many AI developments, there's an analogue with humans here. Good test scores might be a good predictor of career performance, or it might just mean someone has gotten really good at taking tests. So far, at the enterprise level, AI benchmarks are saturating faster than real-world deployment wisdom is accumulating.
Data · Model Training & Privacy
Slack chats, Jira tickets and email archives are commanding attention at startup fire sales
Speaking of real-world implementations: move aside, ping pong tables and cold brew taps (and patents and customer data). There's a new asset that defunct startups are selling to recoup funds when they close up shop. AI companies are looking for any real-world data they can use to train their models, from employee Slack chats to emails to Jira tickets.
SimpleClosure, a startup that helps other companies wind down operations, recovered more than $1M on behalf of founders with this approach in the past year, typically paying $10k–$100k per company. SimpleClosure CEO Dori Yona called it a "gold rush" as AI companies try to get their hands on real-world work data to improve their models. You can read more at Fast Company.
60→100%
AI Coding Accuracy in One Year
On SWE-bench Verified — a human-validated benchmark designed to test AI on real-world software engineering tasks — AI performance rose from 60% to near-human levels in a single year. The speed of that gain is what matters: benchmarks designed to last years are saturating in months.
Stanford HAI 2026 AI Index
0 of 9
RAI Benchmarks Reported by Most Leading Models
While nearly every frontier model developer publishes capability benchmark results, responsible AI benchmark reporting on issues like fairness, safety, and factuality is largely absent. That doesn't mean AI providers aren't testing on RAI topics like bias, safety, or factuality — but the lack of transparency leaves buyers in the dark.
Stanford HAI 2026 AI Index
02
The Prompt
A conversation starter for your next client session.
Who owns the measurement of ROI on AI investments?
Clients have gotten the message about connecting AI deployments to targeted business outcomes, but many are still struggling to show ROI. Is that because the impact is missing, or because they're missing a cohesive strategy for measuring it? Asking who "owns" ROI metrics opens a conversation about measurement architecture, governance ownership, and the gap between pilot success and production value — without positioning the client as behind. Use it to figure out where the real friction is.
03
In case you missed it
News, analysis & assets worth attention.
Enterprise AI Rollouts
Adobe levels up its AI efforts in Creative Cloud
Adobe incorporated AI capabilities into Photoshop, Illustrator and its other creative products early on. But its newly announced Firefly AI Assistant is what Ars Technica is calling "Claude Code for creative apps" — it works across the Adobe Creative Cloud suite and orchestrates workflows as needed to get to the user's requested outcome.
It's not just the major AI providers switching from task-specific AI to larger orchestration now. To be determined: will existing power users of Adobe embrace it, or will it open the door for less experienced creatives?
Center for Advanced AI Points of View · Internal Research
Check out our PoVs on MCP and Google's TurboQuant
Our researchers frequently write points of view on developments in the AI space. This issue, we're highlighting two: one on Model Context Protocol, which has emerged as a standard for multi-agent deployment; and one on Google's TurboQuant, which you may have seen in the news recently.
TurboQuant is a compression technique that reduces the memory overhead of the key-value cache. In short: with this, you need less memory to run AI. Google first published a related paper last year, but garnered new attention when they announced they'd present the research at a conference this April.
List price is fiction. The real number lives in tokens, tiers, regions, and what wasn't in the SOW. Four sub-topics that turn partner pricing pages into apples-to-apples decisions.
#03 · 3.a · Data Platform Cost Comparison · v2.0
Five platforms. One workload. A <10% spread.
When you put GCP, Azure Fabric, AWS, Databricks, and Snowflake on the same enterprise workload — 5,000 ETL jobs a day, 3.5 petabytes of data, 10 TB ingested daily — annual costs land between $3.28M and $3.7M. The dollar gap is real, but it's narrower than the strategy gap. Here's what's actually inside the bill.
5 platforms compared$3.28M low end$3.70M high end<10% delta
CuratorTeresa Tung·Lead — Center for Advanced Data
01
The story begins with a misconception.
Every data platform RFP we've seen starts the same way: leadership wants to know which platform is cheapest. Procurement builds a pricing matrix. Engineering picks the architecture. The board signs off on a number.
And then, twelve months in, the bill arrives — and almost nobody is over budget by more than a rounding error.
That's not an accident. It's the math. Modern cloud-native and independent data platforms — at enterprise scale — converge on cost. The interesting question isn't "which is cheapest." It's which one matches how your business actually works.
02
First, a fair fight.
To compare platforms honestly, you need an identical workload running on each. We picked one that looks like a real Fortune-500 data estate, not a partner benchmark.
The Sample Medium-Sized Enterprise Workload
5K
ETL jobs / day
20 min
Avg. execution time / job
3.5 PB
Data volume in platform
10 TB
Ingested daily
A second profile — an MVP/pilot at 40 jobs/day, 15 min/job, 100 TB in platform, 25 GB/day ingested — runs alongside as a sanity check.
03
Two ways to buy the same outcome.
Every modern data platform falls into one of two archetypes. Understanding the split is the prerequisite to understanding the bill.
Archetype A
Cloud Native Services
From AWS, GCP, and Azure — cloud-managed offerings where consumption drives the cost.
One bill, one partner. Single-cloud procurement and support contract.
Linear pricing model. Storage + per-query compute, easier to forecast.
Lower baseline for steady-state BI and analytics workloads.
Archetype B
Independent Data Platforms
Databricks and Snowflake — software deployed on top of cloud infrastructure. Consumption drives both software and infrastructure costs.
Dual billing. Platform service units (e.g., DBUs) plus cloud instances + storage + networking.
Higher complexity to forecast — but more predictable for Spark/ETL-heavy workloads with reserved resources.
04
Now the receipts.
Same workload. Same enterprise scale. Five different ways to deliver it. Here's what each actually costs, broken into the three layers that drive the bill: ETL pipeline compute, warehouse analytics compute, and storage.
Cloud Native Data Platform Services
Layer
GCP Native
Azure Native (Microsoft Fabric)
AWS Native
Pipeline compute (ETL)
Dataflow + BQ Spark + Composer
200 workers n2-std-4 + 3,200 BQ Slots
Dataflow 15 hrs / BQ Slots 24 hrs
$167K / mo
Data Factory + Spark
F2048 (2,048 CUs)
15 hrs / day
$165.9K / mo
AWS Glue + EMR
100 DPUs (G.2X)
15 hrs / day
$198.9K / mo
Warehouse compute (Analytics)
BigQuery Enterprise Slots
2,000 Slots
15 hrs / day
$54K / mo
Synapse DW + Power BI
F1024 (1,024 CUs)
15 hrs / day
$82.9K / mo
Redshift Serverless
384 RPUs
15 hrs / day
$61.5K / mo
Storage
BigQuery Active + Long-term
1,750 TB active + 1,750 TB long-term
50% active / 50% long-term
$52.5K / mo
OneLake (ADLS)
1,750 TB hot + 1,750 TB archive
50% hot / 50% archive
$43.75K / mo
AWS S3 Tiered
3,500 TB
88% Glacier
$19K / mo
Monthly total
$273.5K
$292.5K
$279K
Annual total
$3.28M
$3.51M
$3.30M
Independent Data Platform — deployed on AWS
Layer
Databricks on AWS
Snowflake on AWS
Pipeline compute (ETL)
Databricks Jobs + Spark
25 nodes r5n.4xlarge
48 hrs / node / day
$130K / mo
Snowpipe + Snowpark
5× XL Warehouses
15 hrs / day
$144K / mo
Warehouse compute (Analytics)
All-Purpose Clusters
75 nodes r5n.4xlarge
33 hrs / day (warm)
$157.5K / mo
Virtual Warehouses
4× XL Warehouses
15 hrs / day
$115.2K / mo
Storage
AWS S3 Tiered
3,500 TB
88% Glacier
$19K / mo
Snowflake Native + AWS S3
700 TB internal + 1,750 TB S3
50% compressed / 50% cold
$24K / mo
Monthly total
$307K
$283K
Annual total
$3.70M
$3.40M
05
The plot twist.
Look across both tables. Annual spend ranges from $3.28M (GCP Native) to $3.70M (Databricks on AWS). That's a delta factor of less than 10% on a multi-million-dollar enterprise commitment.
GCP Native
$3.28M
AWS Native
$3.30M
Snowflake on AWS
$3.40M
Azure Fabric
$3.51M
Databricks on AWS
$3.70M
$0annual cost — same workload, five platforms$3.70M
The MVP / pilot profile tells the same story even tighter: a delta factor of less than 5%.
If the cost spread is <10%, cost is not the deciding factor. Strategy is. Operating model is. Where you want to be in three years is.
06
So how do you actually decide?
A three-step executive decision hierarchy. Use TCO to confirm the choice, not to make it.
1
Strategic Positioning
Anchor the decision in the target operating model, governance posture, and innovation ambition. What kind of data company do we want to be?
2
Platform Archetype
Select the platform best aligned to the workload profile and enterprise consumption model. Cloud Native, Databricks, or Snowflake?
3
Validate Commercials
Use TCO to confirm the choice — not to replace the strategic decision with a narrow price comparison.
The three archetypes, at a glance
Cloud Native
Modularity + Engineering Control
Composable services aligned to existing cloud strategy
Strong fit for engineering-led operating models
More flexibility in architecture design and optimization
Databricks
Advanced Analytics + AI/ML
Lakehouse-centric platform for Data Engineering and Machine Learning
Strong support for streaming and notebook-heavy workflows
Well-suited for innovation-led data product teams
Snowflake
Governed Consumption + BI Scale
Enterprise-friendly model for governed analytics consumption
Strong data sharing and standardized business access
Well-suited for governed BI and EDW modernization
07
The bottom line.
At enterprise scale, cost differences across viable platform options are often narrower than expected. The more durable differentiators are governance model, engineering flexibility, business consumption patterns, and long-term innovation needs.
Platform selection should be driven first by strategic fit and operating model — with commercials used to validate the choice.
Ready to map this against your estate?
This breakdown reflects a Sample Medium-Sized Enterprise workload. Cost comparisons are illustrative for the defined workload profile and may vary based on architecture design, optimization practices, and enterprise commitments. The next step is overlaying your actual data volumes, job profiles, and existing cloud commitments against this framework to identify your archetype.
Frontier IQ is the real-time intelligence dashboard our practice uses to track generative and agentic AI models — not just the strongest, but the fastest, cheapest, and most practical options for the workload in front of you. Today it tracks 656 models, more than 100 providers, and the GPU SKUs across every major cloud. It's how we sit down with client executives and build agentic platforms that are rigorous and economically defensible.
Every week a new model lands and a new headline declares it "the best." Procurement bookmarks the link. Engineering kicks off a benchmark. Someone, somewhere, signs off on a model choice based on a single score on a single chart.
And then the production bill arrives.
Benchmarks tell you what a model can do. They don't tell you what it costs to run. A frontier score on reasoning is a starting line, not a finish line. The interesting question — the one that actually decides whether your agent ships — is which model gives you the right capability at the right unit economics for the way your workload actually runs.
[Image Suggestion: A split-screen visual — left side a polished AI leaderboard with confetti and a "WINNER" badge over a single benchmark score; right side the same model rendered as a real production bill with line items, GPU hours, and a highlighted total. Caption beneath: "Same model. Two very different stories."]
02
First, a fair fight — at scale.
To compare models honestly, you need a single source of truth that updates as the frontier moves. Frontier IQ pulls from public sources, normalizes everything into one schema, and refreshes automatically.
What's inside the dashboard, today
656
Generative & agentic models
100+
Inference & API providers
All
Major-cloud GPU SKUs
4
Use-case benchmark families
Benchmarks are organized by what the model is actually being asked to do: general intelligence, software engineering, agentic workflows, and multimodal workflows. For each, the dashboard surfaces the strongest, the cheapest, and the fastest — so the right answer depends on the question, not the headline.
03
The Frontier Curve.
A model isn't a point. It's a moving line. The Frontier Curve plots benchmark score on the y-axis against time on the x-axis, tracking the progress of both open-weight and closed-weight models as the field evolves.
It's how you tell the difference between a one-off spike and a real shift in the state of the art — and it's how you spot when an open-weight model is closing the gap on a closed one fast enough that procurement strategy needs to change.
[Image Suggestion: A clean, dark-mode line chart with two distinct curves — one in purple for closed-weight models, one in light cyan for open-weight — both rising over a 24-month x-axis with labeled inflection points (model release dates). Show the open-weight curve closing the gap at the right edge.]
04
Today's leaderboard, by capability.
A snapshot of where the frontier sits right now. The headline: there's no single "best model." There are best models for things.
Reasoning
A crowded summit.
The strongest model on the reasoning benchmark today is GPT-5.4 Pro (extra-high reasoning). The strongest open-weight model is GLM-5 by Zhipu AI.
Anthropic's Claude Opus 4.7 sits in the top tier.
Meta's Muse Spark sits in the top tier.
Most frontier-lab models perform within 90% of the leader — the leaderboard is full, not empty.
Software Engineering
A truly jagged frontier.
No single lab leads everywhere. The right answer depends on which slice of "software engineering" you mean.
Bug-fixing benchmarks:Claude models dominate.
General programming:OpenAI's GPT models dominate.
Terminal use:a mixture of Gemini and OpenAI on the frontier.
05
Now connect the score to the receipt.
Performance alone isn't enough. Frontier IQ pairs every benchmark with the cost economics behind it — list price per token, throughput per dollar, and the cheapest credible option in each performance band.
When the cheapest option is also a serious option
The dashboard differentiates closed and open-weight models when filtering for cost. Two examples worth flagging:
Closed-weight, low-cost
Gemini 3 Flash
Delivers a blend of strong performance with low cost — making it a credible default for high-volume agentic workloads where cost is a hard constraint.
Open-weight, low-cost
Kimi K2.5
Can be served very cheaply with good performance — a strong option when self-hosting is on the table or when the workload demands open-weight portability.
[Image Suggestion: A scatter plot with benchmark score on the y-axis and dollars-per-million-tokens on the x-axis. Each model is a dot, color-coded purple for closed-weight and cyan for open-weight. Highlight Gemini 3 Flash and Kimi K2.5 sitting in the desirable upper-left quadrant ("high score, low cost") with a labeled callout for each.]
06
Managed API or self-host? The math has an answer.
Benchmarks tell engineering what a model can do. For FinOps, the next question is harder: at what point does it become cheaper to run this model on our own GPUs than to pay per token? The Frontier IQ cost analysis tool plots exactly that.
You select a model. It charts the economics of a managed API against self-hosting on cloud GPUs and surfaces the break-even point — the monthly token volume at which self-hosting starts saving money. Two illustrative cases:
Case A — Phi-4 (small model, by Microsoft)
Dimension
Managed API
Self-hosted on cloud GPU
Setup
Pay-per-token
No capacity planning
Pricing scales with usage
Single GPU instance
Self-managed serving stack
Fixed monthly cost
Verdict
Self-hosting wins at scale. A single GPU delivers enough monthly token capacity that, past the break-even point, Phi-4 is materially cheaper to host than to call. For small models with steady-state production volume, owning the GPU is the right answer.
Case B — DeepSeek v3.2 (large model)
Dimension
Managed API
Self-hosted on cloud GPU
Setup
Pay-per-token
No capacity planning
Pricing scales with usage
Large AWS instance
8 × H200 GPUs
~$45,000 / month
Verdict
Managed API wins. The break-even point is far higher than the monthly token capacity that a single 8×H200 instance can deliver. For large models like DeepSeek v3.2, self-hosting doesn't make economic sense at typical enterprise volume — you pay for unused capacity.
The size of the model dictates the deployment strategy. Small models reward ownership; large models reward elasticity. Frontier IQ shows the crossover point in dollars, not in vibes.
07
From dashboard to deployed agent.
Frontier IQ isn't only a dashboard. All of its curated intelligence is exposed via API — which means agents themselves can consume it. The dashboard becomes a tool, not a destination.
1
Connect the agent
Give Claude (or any capable agent) the Frontier IQ skill and an API key. The agent now has live access to model benchmarks, provider pricing, and GPU SKU economics.
2
Brief it like an analyst
"Build a budget and project-cost estimate comparing open-weight and closed-weight models for a KYC / Anti-Money-Laundering agent." The agent runs for about two minutes.
3
Get a defensible cost model
What comes back: a model comparison across open and closed-weight options, API cost projections, self-hosted GPU projections, and a budget summary for pilot, growth/scaling, and full production deployment in the enterprise.
[Image Suggestion: A three-frame storyboard. Frame 1: an analyst hands a single-line brief to an agent icon. Frame 2: the agent silhouette spins through dashboard panels (benchmarks, pricing, GPU SKUs) with a small "~2 min" timer. Frame 3: a clean output document titled "KYC/AML Agent — Budget & Cost Model" with three labeled tiers (Pilot / Growth / Production) and crisp dollar figures.]
What makes this work
Curated data
Models · APIs · Infrastructure
Frequently updated public data on every tracked model
API cost per provider, normalized for comparison
Infrastructure cost across public-cloud GPU SKUs
Tokenomics tools
Context-window economics
MCP server tooling: see how each server consumes context
Model agentic workflows with progressive disclosure instead of full disclosure
Significant savings on context-engineering and per-call cost
API-first
Built for agents, not just humans
Every dashboard view is also a tool an agent can call
Securely-keyed access for enterprise integrations
Continuously upgraded as the frontier moves
08
The bottom line.
The goal of Frontier IQ is simple: help us and our clients understand the frontier of AI capability — and the economics and costs behind it.
Capability without cost is a press release. Cost without capability is a procurement spreadsheet. Frontier IQ is the place we put the two together — so the model strategy you walk into the boardroom with survives contact with the bill.
Ready to map the frontier against your workload?
Frontier IQ figures are illustrative of current public benchmark and pricing data; actual model selection and deployment economics will depend on workload profile, traffic patterns, region, and enterprise commitments. The next step is overlaying your specific use case — KYC, claims, code, customer service, anything — against the live frontier and the live cost curves.
#03 · 3.c · Claude Deployment Channels · April 2026
One model. Five front doors. Wildly different rooms.
You think you're picking Claude. You're actually picking five products. Same Sonnet. Same Opus. Same token economics — to a rounding error. Everything else — the feature surface, the residency story, the IAM model, the day-one velocity — splits five ways the moment you choose a door. This is the architect's guide to the door.
AuthorsAtish Ray·Lan Guan·Atish Ray (Chief AI Architect) · Lan Guan (Chief AI & Data Officer)
01
The story begins with a misconception.
Every enterprise Claude conversation we've sat in starts the same way. Leadership picks the model. Procurement picks the cloud. Engineering picks the SDK. Hands are shaken. Decks are filed. The deal closes.
Three months later, a developer files a ticket. Why doesn't Fast Mode work? Why is the Skills Marketplace empty? Where did Computer Use go? Why does our Foundry deployment ship our data to the United States?
That's not a tooling gap. That's the channel.
Claude isn't one product. It's the same model surfaced through five different procurement, governance, and feature shells — and the shell is what your CIO, your CTO, your enterprise architect, and your AI platform lead are actually buying.
The interesting question isn't "do we use Claude." It's "which front door makes the rest of our stack feel like one stack — and which features can we live without on day one?"
02
First, name the doors.
Anthropic ships five enterprise channels for building agents with Claude — and three more knowledge-worker surfaces sitting alongside. You can't choose what you can't name.
The five enterprise channels
1A
Claude in AWS Bedrock
1B
Claude Platform on AWS
2
Claude in GCP Vertex
3A
Claude in Azure Foundry
4A
Anthropic Managed Platform
Three knowledge-worker surfaces ride alongside: Claude in Microsoft 365 (3B) as the agent inside Copilot, claude.ai (4B) for the web and mobile chat experience, and Claude Desktop (4C) for power users. Same model. Different consumption shells. Different price tags.
03
Two ways to buy the same model.
Pull the logos off and the five channels collapse into two archetypes. The split is the whole story. Everything downstream — features, governance, residency, billing — falls out of which side you're on.
Archetype A
Hyperscaler-Operated
Bedrock (1A), Vertex (2), Foundry (3A) — Claude served from inside the cloud catalog you already buy from. Your IAM. Your audit trail. Your commit.
Cloud-native everything. Identity, networking (PrivateLink / VNet), observability, FinOps, cost attribution — all native to the hyperscaler you already operate.
Existing commit applies. Burns AWS EDP, MACC, or GCP commit. No new partner to onboard, no new procurement motion.
Feature surface is narrower. Messages API plus the cloud's own agent stack. The native server-side tools, beta features, and Skills Marketplace live on the other archetype.
Archetype B
Anthropic-Operated
Claude Platform on AWS (1B), Anthropic Managed Platform (4A) — Anthropic's native infrastructure, with optional cloud billing as a procurement convenience layer.
The full Anthropic feature set. Messages, Batches, Files, Models, Skills, Agents, Sessions APIs. Server-side tools, MCP connectors, Fast Mode, Skills Marketplace, Computer Use, beta access.
Data leaves your cloud boundary. Processed by Anthropic; non-US data routes to US.
04
Now the receipts.
Same Sonnet. Same Opus. Five very different ways to deliver them. Below: the row-by-row breakdown across the dimensions that actually drive the architecture decision.
Archetype A · Hyperscaler-Operated
Dimension
Bedrock (1A)
Vertex (2)
Foundry (3A)
Infrastructure
AWS-managed
Google-managed
Anthropic-managed (3P)
Availability
Native catalog (Bedrock)
Native catalog (Vertex / Gemini Enterprise model garden)
Azure Marketplace subscription, Foundry model catalog as 3P
Data residency
Fully within AWS — global & multiple regions (US, EU, APAC)
Fully within GCP — global & multiple regions (US, EU, APAC)
US only. Processed by Anthropic; data from non-US comes to US
Available features
Messages API only — comparable features delivered via AWS APIs
Messages API only — comparable features delivered via Gemini Enterprise APIs
Messages, Skills, Files, Token-count APIs. Foundry does not provide built-in content filtering for Claude at deployment time
Available models
Claude and other models on Bedrock
Claude and other models on Gemini Enterprise
Claude through marketplace. Not all Foundry regions support Claude for Claude Code deployments
Azure Marketplace billing, Microsoft Azure Consumption Commitment (MACC) eligible — no Azure credits
Pre-integrated apps
—
—
M365 (Copilot, Copilot Studio, Excel)
Claude Code
Seamless integration with Bedrock
Routed through Vertex AI; no Anthropic account or API key needed
Fully supported — only 2 regions for Claude Code
Claude Cowork
Claude Desktop app (macOS / Windows) running in 3P mode; routes inference to Bedrock with integrated IAM
Not available yet
Not available yet
Where it shines
Longest-running agents · deepest GovCloud / compliance posture · Intelligent Prompt Routing between Claude tiers automatically · most GA features · most enterprise deployments
Google Search grounding built-in · A2A GA (Google is a co-creator) · deepest data warehouse integration · strong on developer features
1,400+ Logic App connectors · M365 / SharePoint / Fabric grounding · GPT and Claude on one platform · partial Claude Platform integration through Marketplace
Where it doesn't
No native web search for Claude (must wire third-party). A2A still beta. No Vertex-style built-in data warehousing.
No long-running agent duration guarantee. MCP tool search disabled by default. Cowork 3P mode not yet available — only AWS has it.
Data doesn't stay in the Azure boundary — biggest architectural limitation. Newest partnership (Feb 2026); most features still beta/preview. No batch API. Two regions only for Claude Code.
Archetype B · Anthropic-Operated
Dimension
Claude Platform on AWS (1B)
Anthropic Managed Platform (4A)
Infrastructure
Anthropic-operated
Anthropic-operated
Front door
AWS account, AWS billing, AWS IAM — no separate Anthropic account
Anthropic accounts + API keys; SSO for Enterprise
Data residency
Anthropic infrastructure outside AWS — global and US regions
Processed by Anthropic; data from non-US comes to US
Available APIs
Full set — Messages, Batches, Files, Models, Skills, Agents, Sessions
Integrated with Claude Platform and claude.ai web — session memory, auto-compaction, Fast Mode, web tools, MCP connectors
Native — Claude Code can integrate
Claude Cowork
Full features — chat, Skills Marketplace, Computer Use
Native — Claude Cowork can integrate
05
The plot twist.
The model's the same. The token price lands inside a rounding error. The feature surface does not. This is where channels actually compete — and where most "Claude vs Claude" conversations should start.
Bedrock (1A)Hyperscaler-operated
Most GA · deepest GovCloud · Intelligent Prompt Routing
Vertex (2)Hyperscaler-operated
Search grounding · A2A GA · BigQuery integration
Foundry (3A)Hyperscaler-operated
1,400+ Logic App connectors · M365 grounding · most features still preview/beta
Claude on AWS (1B)Anthropic-operated
Full feature set · AWS billing & IAM · Cowork 3P mode
Anthropic (4A)Anthropic-operated
Earliest features · Fast Mode · full Skills Marketplace
Messages API onlyfeature surface — same model, five channelsFull Anthropic native
The asymmetries that matter aren't on the price page. They're on the spec sheet:
Bedrock · 1A
The compliance king with a search problem.
Most GA features. Deepest GovCloud and IL4–IL5 posture. Intelligent Prompt Routing across Claude tiers — automatic.
No native web search for Claude — wire a third-party.
A2A still in beta.
No Vertex-style built-in data warehousing.
Vertex · 2
Built for builders, missing the long run.
Google Search grounding native. A2A is GA — Google co-created the spec. Deepest data warehouse integration in the field.
No long-running agent duration guarantee.
MCP tool search disabled by default.
Cowork 3P mode not yet available — only AWS has it.
Foundry · 3A
The newest partnership — and the boundary problem.
1,400+ Logic App connectors. M365, SharePoint, and Fabric grounding. GPT and Claude on one platform.
Data doesn't stay in the Azure boundary — the biggest architectural limitation.
Newest partnership (Feb 2026); most features still beta/preview.
No batch API. Two regions only for Claude Code. Content safety not auto-applied.
1B + 4A
Where the native feature set actually lives.
Anthropic's full surface. Whatever ships next, ships here first.
Fast Mode — 6× speed on Opus 4.6.
Full Skills Marketplace, Computer Use, full Cowork.
Cost optimization: Batch −50% + cache reads −90%.
Claude Code with session memory and auto-compaction.
Token price is not the deciding factor. Feature velocity is. Residency is. Governance is. Existing cloud commit is. Strategy is. Operating model is. Where you want to be in three quarters is.
06
So how do you actually decide?
The deck offers a clean three-step hierarchy. Use it in this order. Skip a step, and you're optimizing the wrong axis.
1
Lead with governance posture.
Strict geographic data residency (EU, APAC)? Regulated industries needing cloud-boundary processing? Cloud-native IAM, VNet/PrivateLink, centralized audit? Cloud-native observability, cost attribution, FinOps? Existing cloud commitments (EDP / MACC / GCP commit)? FedRAMP High / DoD IL4–IL5 (Bedrock GovCloud only)? Need uncapped IP indemnification (AWS, GCP)? Yes to any — start hyperscaler-operated (1A, 2, 3A).
2
Then layer in feature ambition.
Need access to new models and the latest features? Multi-cloud flexibility, integrating Claude from a private cloud? Low-to-medium-complexity agentic apps on managed infrastructure? Dedicated engineering support and custom contracts? Skills Marketplace, Computer Use, full Cowork? Low latency: Fast Mode (6× speed on Opus 4.6)? Specialized advisor tooling (mid-generation pairing)? Claude Code session memory and auto-compaction? Cost optimization: Batch −50% + cache reads −90%?Yes to any — pair with Anthropic-operated (1B or 4A) for those workloads.
3
Build hybrid by design — not by accident.
Production workloads run on the hyperscaler path: Bedrock / Vertex / Foundry for Claude API, agent orchestration, and 3P MCP / Skills / Tools / Data — under AWS, GCP, or Azure administration, IAM, and operations. Exploration, specialized engineering, and rapid prototyping run on the Anthropic-hosted surface: full feature set, agent harness, Skills, MCP servers, connectors — under Anthropic admin, SSO, and IAM. Production where governance matters. Rapid prototyping where features matter.
The three patterns, at a glance
Pattern A · AWS-Anchored
Governance + Cloud-Native Estate
Bedrock for regulated workloads, GovCloud, FedRAMP High, IL4–IL5
A footnote on M365 — because someone on the call will ask.
Two products will collide in your M365 conversation, and they share a name. Claude-enabled Microsoft 365 Copilot (with Cowork inside Microsoft) and Anthropic Claude Cowork (the desktop app). Same word. Different products. Different bills.
Dimension
Claude in M365 Copilot (incl. Cowork)
Anthropic Claude Cowork
Where it runs
Cloud — inside Microsoft 365 (subprocessor)
Desktop app (macOS / Windows) on Anthropic infrastructure
Data access
Full M365 graph: Outlook, Teams, SharePoint, Excel via Work IQ
Local files · browser · MCP connectors (Drive, Slack, Salesforce)
Governance
Microsoft DLP, Conditional Access, Purview audit — runs within Microsoft's security, identity, and governance framework
Folder-level sandboxing — less centrally governed
Best for
M365-standardized enterprises with compliance boundaries
Power users, cross-tool flows, non-M365 estate
Price
$30/user/mo M365 Copilot license — Anthropic INCLUDED, not separate
$20/mo Pro · $100–$200/mo Max · $25–$125/seat Team
Availability
Toggle Dec 8 2025 → Subprocessor Jan 7 2026 → end March 2026
GA — macOS January 2026 · Windows February 2026
Update cadence
Microsoft cadence — historically slower
Anthropic-controlled — fast iteration
Geographic exclusion
Excluded: EU/EFTA/UK by default · GCC/DoD/sovereign
US-anchored; EU residency in beta
08
The bottom line.
Claude is one model and five products. The token price will not decide for you. Governance posture, feature velocity, residency, and existing cloud commit will.
Most enterprises end up with Patterns A or B (AWS-anchored) for production governance, supplemented by Pattern D (Anthropic-direct) for exploration and beta features. Channel selection should be driven first by operating model — with token economics used to validate the choice, not make it.
Pick the door for the room you actually want to live in.
Ready to map this against your estate?
This breakdown reflects the deployment options as of April 2026, verified against AWS, MS Learn, and Anthropic documentation. Re-validation runs quarterly — feature parity across hyperscaler channels moves on Anthropic's release cadence, not the cloud providers'. The next step is overlaying your residency requirements, existing cloud commits, M365 footprint, and target operating model against this framework to identify your channel mix.
"AI Architecture" isn't a slogan — it's a nine-viewpoint, ISO/IEC/IEEE 42010-aligned reference architecture for intelligent agents that works on any cloud and any model. And it isn't theoretical: Costco is shipping it in production right now. Pick a door. Read the framework, read what it looks like when a real enterprise applies it end-to-end, see how it lands on each major platform — or read the security pattern that runs through all three.
Every consulting firm has a reference architecture. Most live in PowerPoints that nobody reads twice. This one was different — because someone shipped it.
Behind Door A — The Blueprint — is the v7 Intelligent Agent Reference Architecture itself: nine domains, ISO/IEC/IEEE 42010 viewpoints, the agent-washing problem named, the 13 specification dimensions, eight archetypes, the integration protocols, the multi-agent topologies, and a deep-dive into the OWASP-aligned risk catalog.
Behind Door B — Costco Runs It — is the same framework applied to Costco's enterprise platform. Nexus architecture (core anchoring + satellite autonomy). GCP-first composable design. A 6-month MVP plan. A 5-year roadmap from MVP through strategic differentiation. Four priority use cases — Call Center, Personalized Search, Knowledge Assist, GEO — mapped to the same level-3 platform capabilities.
Behind Door C — Intelligent Digital Brain · Ecosystem — is the same framework translated onto each Major Agentic Platform: AWS, Azure, GCP, OpenAI on AWS, Databricks, and Snowflake. Service by service, layer by layer — and the nine universal gaps every platform leaves behind, with the partner stack that fills them.
Behind Door D — AI Security Architecture — is the security pattern that runs through all three: a four-zone enterprise stack (Channels → Agentic DMZ → Agentic Apps → Agentic Foundation) with the Agentic DMZ as the load-bearing security boundary, mapped to the same nine viewpoints from Door A. Not a layer to bolt on. A zone to architect around.
Read them in any order. The framework explains why; the spotlight shows what; the ecosystem map shows where; the security pattern shows how to keep it from blowing up. Together they cover the full distance from "we should build agents" to "this is what production looks like — on your platform, behind your boundary."
#01 · 1.a · The Reference Architecture · v7 · April 2026
"Logical architecture" is too vague for AI. Here's the blueprint that isn't.
The frameworks we inherited — 4+1 from 1995, C4 from the desktop era, the catch-all "logical architecture" of TOGAF and Zachman — were built before LLMs existed. They flatten data, models, cognition, security, and orchestration into a single hand-waving box. This is the alternative: nine domain-specific viewpoints, ISO/IEC/IEEE 42010-compliant, that name every component an intelligent agent system actually has — and let you build it on any cloud, with any model, without a rewrite.
9 architecture domains234 source slidesISO 42010 alignedv7 · April 2026
Walk into any enterprise AI program and someone will ask for the "logical architecture." A box marked Agent Framework. A box marked Vector Database. A box marked LLM. Arrows. Everyone nods.
Then the system fails in production. Why? Because the boxes hid everything that mattered.
4+1 was created in 1995, the heart of the client-server era. LLMs did not exist. Apps were desktop and batch-oriented. C4, designed for evolutionary architecture in agile teams, never made data for models a first-class citizen — and has no place for model lifecycle, model monitoring, or observability.
AI systems aren't a logical-architecture problem. They're a multi-viewpoint problem. When an enterprise architect asks "where's your logical architecture?", the right answer is: "Our logical architecture is expressed through multiple viewpoints per ISO 42010 — data, runtime, cognitive, security, integration, infrastructure, model, DevMLOps, and multi-agent orchestration. Each is a first-class architectural viewpoint."
That's not hand-waving. That's the blueprint. Nine viewpoints. One reference architecture. Partner-neutral by construction.
02
The nine viewpoints, named.
Every intelligent agent system decomposes into nine complementary architecture domains. Skip one and you've shipped a prototype. Cover all nine and you've shipped a system. Each maps cleanly to an ISO/IEC/IEEE 42010 viewpoint — meaning your enterprise architect already has a vocabulary for it.
Fig 1. The nine domains and how they relate. Eight feed into and consume from the agent's core; the ninth — Multi-Agent Orchestration — wraps the whole system as the coordination/interaction viewpoint.
Domain 1 · Information Viewpoint
Data Architecture
Spans physical data storage, ingestion pipelines from numerous sources, transformation of data into knowledge, data for model training, and agent state and operations data.
Ingestion pipelines, embeddings, indices for semantic search
Graph data — nodes, edges, attributes
Interaction history, tool cache, FAQ cache, workflow state
Concerns: data flows, schemas, provenance, embeddings, feature lineage
Domain 2 · Information Viewpoint
Runtime Architecture
Reusable, standard implementations of common functions to applications. Structures application flow control and enables observability. Where ReAct lives. Where harnesses are built.
The information processing mechanisms an intelligent agent uses to achieve its goals. Capabilities mapped to technologies, plus the information flow patterns that yield intelligent behavior.
Identity and access management for users, agents, and agent tools. Plus data privacy and integrity, system availability, and harmful use by both users and agents.
IAM for users, agents, and tools — authentication, authorization, encryption, key management
Protocols and standards for discovering and securely integrating agents and tools. The plumbing that lets agents call anything — and lets anything call agents.
MCP (Model Context Protocol) — Anthropic's open standard. Adopted by Claude Desktop, Zed, Replit, Codeium, Sourcegraph
A2A (Agent-to-Agent) — Google's open standard. Backed by 50+ companies including Atlassian, Cohere, Salesforce, PayPal
Two stacks under one roof: traditional compute/storage/network for agent applications, plus specialized hardware for model training and inference.
Application tech stack — agent orchestration frameworks, vector and graph DBs with ingestion pipelines, data transformed into searchable knowledge
Model tech stack — web-scale training datasets, specialized training/inference software, GPUs and TPUs
Sensors and actuators for agents to interact with their environments
Domain 9 · Coordination/Interaction Viewpoint
Multi-Agent Orchestration
Agent team roles, tasks, delegation authority, inter-agent communication, workflow management, and governance. Where teams of specialists become a system.
Hierarchical Team — single manager coordinates supporting agents
Fully Connected Team — all agents communicate directly with each other
Team of Teams — manager coordinates a collection of teams, each with its own manager
There is a widespread "agent-washing" trend to label even simple services as "Agents." A form-validation service gets called a "Validator Agent." A logging service gets called a "Logging Agent." This linguistic inflation creates architectural confusion — and ships brittle systems.
An agent is an individual, goal-oriented system that is the source of its own action, with autonomous decision-making across multiple possible actions. When a validation service is called a "Validator Agent," the implication is that the service has autonomy, goals, and decision-making capability that it simply does not possess.
The deck's solution is a five-row taxonomy that names the distinction. Read this twice.
Component Type
Rule-of-thumb to recognize it
Example
Agent
If it decides which action to take from multiple options and then uses the results of the actions to select the next action — it's an Agent
ReAct Agent — Reads user query, uses search, calculator, and other tools in a loop to gather information, perform calculations, and create an answer
Workflow
If it deterministically processes inputs step-by-step (even if it uses cognitive capabilities) — it's a Workflow
Call Analysis Workflow — Convert audio recording speech to text, classify the intent, analyze sentiment, report results
Tool
If it is used by an agent to perform a specific task — it's a Tool
Web Search Tool — Searches for content relevant to a user query on the web
Runtime Architecture Service
If it performs a common service to an application — it's a Runtime Architecture Service
Logging Service — Records application events and errors with metadata to logs
Application Component
If it performs application-specific functionality — it's an Application Component
Tax Calculation Component — Calculates sales tax based upon location. Screens, reports, interfaces, business logic
Fig 2. All five conditions must hold. A thermostat — boundary, loop, multi-action space, setpoint goal, internal decision logic — qualifies. A form-validation service has none of them.
"Agentic AI" literally means "Agentic human-made Intelligent Agents." "AI Agents" literally means "Human-made Intelligent Agents Agents."Both are as redundant as saying "ATM Machine." Simply say AI or Intelligent Agent — and reserve agent for systems that actually have agency.
04
How to specify an intelligent agent — without hand-waving.
The deck names 13 dimensions required to actually specify an agent. Skip any of them and you're building on assumptions. The same 13 work for a thermostat, a conversational agent, a visual monitoring agent, an autonomous vehicle, or a humanoid robot — only the values change.
Dimension
What it answers
Example · Conversational Agent
Example · Monitoring Agent
Agent Archetype
The role the agent will play (developer, analyst, manager…)
Conversational Agent
Visual Monitoring Agent
Goals & Performance Measure
The goal the agent must achieve, and how success is measured
Support users by answering questions and performing tasks
Detect objects, faces, events in images, videos, or live camera feeds
Environment
Where the agent is designed to operate
Virtual on PC/mobile or physical at a kiosk
Anywhere a camera can capture light
Sensors
How the agent gathers information
Camera, microphone, touch screen, keyboard, mouse
Camera
Actuators
How the agent interacts with its environment
Screen, speakers, messaging system
Screen, messaging system
Cognitive Capabilities
The faculties needed to decide next action
Intent Classification, Memory, Speech to/from Text, Language Understanding & Generation
Visual Perception, Language Generation
Powering Technologies
The technologies that power those capabilities
Multi-Modal LLM, ASR Model, Speech Generation Model
CNN, LLM
Information Flow Pattern
How information flows through capabilities to select next action
ReAct — reasons what tools to use, invokes them, observes until complete
Reflex — perceives objects in image and generates report
Action Space
The set of actions the agent can perform
Communicate, Reason, Plan, Invoke Tools
Detect Objects, Send Alerts, Log Detections
Action Decision Engine
The mechanism by which the agent selects next action
LLM
If-Then Rules
Tools
External tools the agent needs
Flight booking, PTO lookup, search
None
Skills
Procedural knowledge for multi-step tasks
Book a flight and hotel
Not necessary for simple reflex agent
Team Membership
The team and collaborating roles
Call center team — agents and humans
Visual inspection team — agents and humans
05
From thermostat to humanoid robot.
Intelligent agents range in complexity from thermostats to humanoid robots. The same 13 specification dimensions describe all of them. The values change. The framework doesn't.
Dimension
Thermostat
Virtual Agent · Pre-LLM
Virtual Agent · Post-LLM
Autonomous Vehicle
Humanoid Robot
Goal
Maintain temperature
Answer simple questions, perform transactions
Answer complex questions, perform transactions
Transport passengers to destination
Perform physical tasks — assemble a product
Sensors
Thermometer
Digital messages, screen, speakers, camera, touch screen, microphone
Digital messages, screen, speakers, camera, touch screen, microphone
Cameras, sonar
Cameras, microphones
Actuators
AC and Heater switches
Digital messages, screen, speakers
Digital messages, screen, speakers
Steering, brake, accelerator
Hands, arms, legs, feet
Action Decision Engine
If-Then rules on temperature
If-Then rules on intent and entities
LLM using instructions, context, history, tool output
ML Models
ML Models
Russell-Norvig Class
Simple Reflex
Simple Reflex
Goal-Oriented, Learning
Utility, Learning
Utility, Learning
Fig 3. The break point between rules-based and LLM-driven sits between the pre-LLM and post-LLM virtual agents. Everything to the right of the break runs on probabilistic models — and inherits all the architectural complexity that follows.
06
Eight agent archetypes you'll actually build.
Similar to the roles humans play in an organization, agent designs fit into common patterns based on their goals, capabilities, and tools. Eight archetypes cover the vast majority of enterprise agents.
Manager
Coordinates Specialist Teams
Plans which agents perform which tasks
Reasons about agent outputs
Example: software product manager coordinating Software Engineer and QA agents
Conversational
Natural-Language User Support
Classifies user intent
RAG-based answering or scripted dialog
"Why was my bill so high?" / "What's the sick leave policy?"
RPA Robot: rule-based document classification, screen vision, multi-app data entry
07
ReAct, RAG, and the rise of "harness engineering."
The runtime patterns that ship most production agents are ReAct (reasoning + acting in a loop) and RAG (retrieval-augmented generation). Together they form the OODA loop of modern agent systems — and the orchestration code that wraps them has earned its own name.
Pattern · ReAct
The OODA loop in action.
The model is given a prompt that asks it to Reason / Think, describes available tools, and the model responds with the next Action. The orchestrator takes the action by invoking tools and provides their outputs as Observations in the next prompt. Loop until the LLM reasons it has enough information.
Iterate: Reason → Act → Observe → repeat
If no known tool exists, the orchestrator can invoke a search service or invent a tool and store it in the registry
Origin: Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, Oct 2022
Pattern · RAG
Retrieval-augmented generation, three phases.
Ingest unstructured data into a vector database. Retrieve via metadata + keyword + semantic search + reranking. Generate using prompt templates, history, and top relevant context.
Ingest: extract metadata · break into chunks · create embeddings via bi-encoder
Generate: create prompt with user query + relevant context + history + instructions · LLM completion
Fig 4. ReAct is iterative — the loop only exits when the LLM concludes it has enough information. RAG is linear — each of the three phases enriches the context window before the model speaks.
08
Agents vs Workflows — the architectural decision that's not optional.
Agents are often confused with workflows. They aren't the same. Locus of control tells you which is which: in the agent, or in the orchestration engine.
Agent
Locus of control: in the agent.
Anything that perceives its environment through sensors and acts upon its environment through actuators — with goals, autonomy, and cognitive capabilities to decide which action to take next.
Adaptability: high — can change approach based on results, backtrack, try alternatives
Choose when: human-like reasoning is valuable, problems require creative problem-solving, multiple tools need dynamic coordination, outcomes > process consistency
Workflow
Locus of control: in the engine.
A structured sequence of predefined steps that transform inputs into outputs through deterministic operations — even if some of those steps use LLMs and ML models.
Choose when: process steps are well-defined, compliance is critical, high-volume repeatable operations, predictable performance, auditable execution
09
The integration layer is finally a real layer.
For decades, "AI integration" meant bespoke API wrappers per partner. 2025–2026 changed that. Two protocols emerged as the actual standards — one for tool use, one for agent-to-agent — plus a small zoo of commerce-specific protocols for the autonomous-purchasing era.
Protocol
What it standardizes
Owner / Backers
MCP (Model Context Protocol)
How AI models and agents connect to and interact with tools, APIs, data sources, and external resources. Client-server architecture for tools, resources, prompts.
Anthropic · adopted by Claude Desktop, Zed, Replit, Codeium, Sourcegraph
MCP Apps
First official MCP extension. Servers deliver HTML-based UIs (dashboards, forms, visualizations, workflows) that render in sandboxed iframes. Bidirectional via JSON-RPC over postMessage.
Supported by ChatGPT, Claude Desktop, Visual Studio Code, Goose
WebMCP
JavaScript library + W3C proposal letting websites expose client-side functionality as MCP-compatible tools agents can invoke directly in the browser. No backend required.
Currently in Chrome 146 Canary
A2A (Agent-to-Agent)
Application-level protocol for autonomous agents to discover capabilities (Agent Cards), negotiate modalities, manage long-running tasks, and exchange context.
Google · backed by 50+ companies including Atlassian, Cohere, Salesforce, PayPal
llms.txt
Markdown file at /llms.txt offering LLM-friendly site overview — like robots.txt and sitemap.xml. Companion /llms-full.txt for full flattened docs.
Auto-generated by Mintlify, Fern; supported by MCP servers for IDE integration
ACP (Agentic Commerce Protocol)
Agent-driven product discovery and checkout, with built-in tax, shipping, fraud protection via Shared Payment Tokens (SPTs).
Stripe + BigCommerce
UCP (Universal Commerce Protocol)
Lets AI agents facilitate purchases directly in AI Mode and Gemini app. Integrates with Google Shopping Graph (50B+ products).
Google · Shopify, Walmart, Etsy
AP2 (Agent Payments Protocol)
Payment-transaction layer for AI agents purchasing on behalf of consumers and merchants. Complements UCP.
Google · Mastercard, PayPal
OpenAPI (Swagger)
Industry-standard for describing RESTful APIs in machine-readable JSON/YAML. Widely used for LLM function calling — converts API definitions to tool schemas.
Compatible with OpenAI, Anthropic, others
10
Four ways to put agents on a team.
Multi-agent systems consist of specialized agents — each with their own goals, tasks, cognitive capabilities, and tools. How they communicate is an architectural choice, not a default. Four patterns cover the field.
Pattern A
Hierarchical Team
A single manager agent coordinates several supporting agents
Each team-member agent only communicates with the manager
Members do not talk to each other
Pattern B
Fully-Connected Team
All agents can communicate directly with each other
Each agent decides when to communicate and what to send
Most flexible — also the hardest to govern
Pattern C
Team of Teams
A manager agent coordinates a collection of teams
Each team has its own manager
Hierarchical at scale — the org-chart pattern
Pattern D
Custom Workflow
Each agent communicates with a subset of others
Some of the workflow is deterministic
Parts allow agents to reason and decide next actions
Fig 5. The four multi-agent topologies. Solid edges are deterministic; dashed edges in the custom workflow are points where an agent reasons about what to do next.
11
It's never just "the LLM."
Implementing intelligent agents involves integrating a portfolio of models — each performing specific functions, each with different inputs and outputs. The deck names six frequently-used types. If your architecture diagram has one box marked "LLM," it's wrong.
Model Type
Examples
Architecture
Key Functions
Native Multimodal
Gemini, GPT, Claude, Grok
End-to-end multimodal transformers built from the ground up to natively process video + audio + images + text + code simultaneously without separate fusion layers · video tokenized at ~258+ tokens/frame, audio at ~32+ tokens/second
Video understanding, audio transcription with speaker ID, multi-hour media analysis, cross-modal reasoning, multimodal agent orchestration
LLMs
GPT, Claude, Llama, Mixtral, Gemini
Decoder-only transformers (typically) with billions of parameters trained on massive text corpora using next-token prediction
Text generation, reasoning, tool calling, code generation, planning, memory management, orchestration logic for multi-agent systems
Bi-Encoders (Embedding)
SBERT, BGE, E5, Instructor, Nomic Embed
Dual transformer encoders that independently encode queries and documents into fixed-dimensional embeddings (384–1536 dimensions); similarity via dot product / cosine
Token-level pruning models or learned compression transformers that identify and remove less informative tokens while preserving semantic content
Context window management, reducing API costs, handling long documents, improving latency, fitting more context within token limits
Small Language Models (SLMs)
Phi, Gemma
Compact decoder transformers (1–10B parameters) using knowledge distillation and high-quality training data
Edge deployment, fast inference, tool calling in latency-sensitive contexts, local agents, cost-effective repeated operations
12
The chapter that demanded its own page.
Security architecture in AI systems isn't a footnote — it's a category of its own, with risks that don't exist anywhere else in software engineering. Prompt injection. Excessive agency. Vector and embedding weakness. Unbounded consumption. The deck dedicates a full risk catalog mapped to OWASP Top 10. Click in for the full breakdown.
Coming Soon
The Specification Worksheet
The Intelligent Agent Tech Arch Specification worksheet — a record for each component across all nine domains. The deck embeds it as a downloadable. Future drop.
13
From blueprint to working architecture.
The deck names a seven-activity, two-phase process for moving from "we want to build agents" to "we have a future-state architecture and a roadmap." Use it to scope assessments, brief teams, and sequence work.
1
Assess Requirements & Current Capabilities
Understand AI Platform Requirements — identify business processes and tasks agents will automate; identify agent archetypes and their technical requirements. Deliverable: high-level requirements driving agent architecture.
2
Survey Current Architecture Assets
Interview technical resources to understand current-state agent architecture components in place. Deliverable: inventory of current architecture assets.
3
Identify Gaps & Opportunities
Given requirements and current state, identify gaps and opportunities for expansion to meet future requirements. Deliverable: architecture gap assessment.
4
Identify In-Scope Architecture Components
Create an inventory of architecture components needed to realize current and planned requirements. Deliverable: to-be agent architecture component inventory.
5
Recommend Tools, Patterns, Frameworks
For each in-scope architecture component, identify, assess, and select relevant products. Deliverable: to-be agent architecture specification.
6
Create Implementation Roadmap
Develop a roadmap for realizing the architecture — which may include implementing a proof-of-concept application. Deliverable: roadmap for architecture implementation.
14
The bottom line.
Build agent systems on a nine-viewpoint blueprint, not a one-box logical diagram. Reserve the word "agent" for things that actually have agency. Specify every agent across all 13 dimensions — goal, environment, sensors, actuators, capabilities, tools, action space, decision engine, team. Pick your runtime patterns intentionally — ReAct, RAG, harness — and your multi-agent topology — hierarchical, fully connected, team-of-teams, or custom — to match the work.
The point of partner-neutral architecture isn't theoretical purity. It's the option to lift-and-shift across clouds and models without rewriting your system. The framework is what survives the next partner cycle. The partners are what you swap out.
Pick any cloud. Pick any model. Lift-and-shift without a rewrite.
Ready to assess your agent architecture?
This page is a living summary of the v7 Intelligent Agent Reference Architecture, released 2026-04-22 by the Accenture Center for Advanced AI. Content is under active development — some sections are complete, others under construction. Expect gaps. Re-validate against the latest Toolkit GA release on the KX before scoping a new engagement.
Agent systems introduce a class of risks that don't exist anywhere else in software engineering — and most of them are now codified in the OWASP LLM Top 10 (2025 release). This is the catalog: 13 distinct risks across 5 categories, every one mapped to an OWASP entry where one exists, plus the controls and guardrails that mitigate each — slotted into the exact stage of the request → orchestration → LLM → output pipeline where they belong.
13 distinct risks5 risk categoriesOWASP LLM Top 10 · 20255-stage control plane
Five categories. One pipeline. Thirteen ways it goes wrong.
Most security thinking inherited from web applications still applies — authentication, authorization, encryption, key management. But agents add five new risk categories: Confidentiality, Integrity, Availability, Harmfulness, Honesty. Each is sourced by a different actor — the user, the agent itself, the model, the system designer, or an external attacker. Each lands at a different stage of the pipeline. Each needs a different control.
What follows is the deck's full catalog, reproduced with every risk, every OWASP mapping, and every description.
02
⚠️ The risk catalog, part 1 — Confidentiality & Integrity.
Eight risks. Six map directly to the OWASP LLM Top 10 (2025); two are agent-specific extensions where OWASP does not yet have an entry.
Category
Source
Risk
OWASP
Description
1. Confidentiality
User
LLM02:2025 Sensitive Information Disclosure
Yes
LLMs expose sensitive data — PII, proprietary algorithms, confidential details — through their output. Includes credential leakage, business data disclosure, and IP exposure. When embedded in applications, LLMs can unintentionally reveal sensitive information, resulting in unauthorized data access, privacy violations, and legal/compliance issues.
1. Confidentiality
Agent
LLM06:2025 Excessive Agency
Yes
LLM systems have too much authority to call functions or interface with other systems, enabling damaging actions from unexpected or manipulated outputs. Root causes: excessive functionality, permissions, and autonomy granted to the LLM. Impact varies based on which systems the LLM application can interact with.
1. Confidentiality
Agent
Unauthorized Agent Use
Related to LLM06
An agent discovers another agent and delegates a task to it — but the requesting agent is not authorized.
1. Confidentiality
Agent
Unauthorized Data Access by Agent
Related to LLM06
An agent accesses data it is not authorized to access.
1. Confidentiality
Agent
Unauthorized Tool Use by Agent
Related to LLM06
An agent discovers and invokes a tool it is not authorized to use.
1. Confidentiality
System Design
LLM07:2025 System Prompt Leakage
Yes
Disclosure of system prompts or instructions that guide model behavior — which may contain sensitive information not intended to be discovered. The core risk isn't the prompt itself but the underlying sensitive data, guardrail details, or permission structures revealed. System prompts should never contain credentials or be used as security controls.
1. Confidentiality
System Design
LLM08:2025 Vector and Embedding Weaknesses
Yes
Affects systems using RAG with LLMs. Vulnerabilities in vector generation, storage, and retrieval can lead to unauthorized access, data leakage, cross-context information exposure, and embedding-inversion attacks. In multi-tenant environments, weaknesses can result in information leaks between users or contradictory knowledge retrieval.
2. Integrity
Attacker
LLM03:2025 Supply Chain
Yes
Vulnerabilities affecting the integrity of training data, models, and deployment platforms. Risks: third-party package vulnerabilities, compromised pre-trained models, weak model provenance. Newer fine-tuning methods like "LoRA" and on-device LLMs further increase attack surface.
2. Integrity
Attacker
LLM04:2025 Data and Model Poisoning
Yes
Training data is manipulated to introduce vulnerabilities, backdoors, or biases that compromise model security and behavior. Can degrade performance, generate toxic content, enable downstream system exploitation. Poisoning can target pre-training, fine-tuning, or embedding processes — risks especially high when using external data sources.
03
⚠️ The risk catalog, part 2 — Availability, Harmfulness & Honesty.
Five more risks plus seven harm-content sub-cases. Prompt injection is the most dangerous of these — it's the only one that can bypass nearly every other control if not caught at the input stage.
Category
Source
Risk
OWASP
Description
3. Availability
User
LLM10:2025 Unbounded Consumption
Yes
Excessive and uncontrolled inference operations leading to denial of service, financial losses, model theft, or performance degradation. Attack vectors: variable-length input flooding, denial-of-wallet attacks, continuous input overflow, resource-intensive queries. The high computational demands of LLMs make them particularly susceptible to resource exploitation.
3. Availability
User
Unbounded Task Steps
Related to LLM10
Agents and agent teams typically take multiple steps (observe, decide, act) to complete goals. The vulnerability is that the team — and individual agents — will continue acting yet never (or only after an exceedingly large number of steps) complete their goal.
4. Harmfulness
User
LLM01:2025 Prompt Injection
Yes
User prompts alter the LLM's behavior in unintended ways — potentially causing the model to violate guidelines, generate harmful content, enable unauthorized access, or influence critical decisions. Inputs can affect the model even if they are imperceptible to humans, making this particularly dangerous. Both direct and indirect prompt injections can lead to security breaches.
4. Harmfulness
LLM
LLM05:2025 Improper Output Handling
Yes
Insufficient validation, sanitization, and handling of LLM outputs before passing to other systems. Since LLM outputs can be controlled by prompt input, this creates risks similar to giving users indirect access to additional functionality. Successful exploitation can result in XSS, CSRF, privilege escalation, or remote code execution.
4. Harmfulness
LLM
Biased Content Generation
Related to LLM05
Model generates biased content.
4. Harmfulness
LLM
Hate Speech Generation
Related to LLM05
Model generates hate speech.
4. Harmfulness
LLM
Insult Generation
Related to LLM05
Model generates insults.
4. Harmfulness
LLM
Sexual Content Generation
Related to LLM05
Model generates sexual content.
4. Harmfulness
LLM
Violent Content Generation
Related to LLM05
Model generates violent content.
4. Harmfulness
LLM
Misconduct Suggestion
Related to LLM05
Model suggests misconduct.
5. Honesty
LLM
LLM09:2025 Misinformation
Yes
LLMs produce false or misleading information that appears credible, with hallucination being a major cause. Compounded by user overreliance — excessive trust in LLM outputs without verification. Risks: factual inaccuracies, unsupported claims, misrepresentation of expertise, generation of unsafe code.
04
The five-stage control plane.
LLM-powered applications present unique risks that can be mitigated by implementing controls at each stage of processing: request, tool/data access, model consumption, agent action, and model output. The deck slots every guardrail into exactly one of these five stages.
Stage 1
Input Guardrails — at the prompt.
Catch the malicious request before it touches the model. The single highest-leverage stage in the pipeline.
Fig 6. Five stages, six OWASP-mapped threats. Each guardrail has its preferred stage but every later stage has the chance to catch what slipped through earlier.
No single control is enough. The point isn't to pick one stage — it's to defend at every stage simultaneously. Prompt injection bypasses Stage 4 if you didn't catch it at Stage 1. Excessive agency can't be undone at the output if it already wired the agent to a system it shouldn't have reached. Defense in depth is not a slogan here. It's the architecture.
05
The risk management process.
AI risk management starts with a comprehensive assessment of AI risks across the enterprise. Controls then need to be implemented to mitigate the risks. Risk management resources continuously monitor risk metrics and address issues. Three activities. One ongoing loop.
1
Assess Risks of AI Applications
Create the AI risks catalog. Define risk KPIs. Assess each application against the catalog. The same catalog reproduced above is the starting point.
2
Plan Risk Mitigation
Define controls for each AI risk. Match every entry in the catalog to one or more guardrails in the five-stage control plane. Document the mapping.
3
Monitor & Address
Continuously monitor risk KPIs. Address issues as they emerge. Re-assess on cadence. This isn't a project. It's an operating discipline.
06
The bottom line.
Agent and model security is its own discipline. 13 distinct risks. 5 categories. 6 OWASP LLM Top 10 mappings. One control plane spanning input, orchestration, model, output, and usage stages. Treat it as a first-class architecture domain — because per ISO 42010, that's exactly what it is.
Ready to map this against your applications?
The catalog above is the input to a real risk register. The next step is overlaying your in-flight and proposed agent applications against it, scoring each on likelihood and impact, and assigning controls from the five-stage plane to mitigate. Additional context: A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures (arXiv:2506.19676).
Most reference architectures live in PowerPoints. This one runs in production. Costco set out to build an enterprise agentic AI platform on the same nine-viewpoint blueprint described in 1.a — and turned it into a Nexus architecture (core anchoring + satellite autonomy), a GCP-first composable stack, a 6-month MVP, a 5-year roadmap through FY31, and four priority use cases. This is what the framework looks like with receipts.
Nexus architecture6 months MVP plan5 years · FY26 → FY314 priority use cases
In 1.a we argued that "logical architecture" is too vague for AI — that nine domain-specific, ISO 42010-aligned viewpoints are the actual answer. 1.b is what happens when an enterprise actually does it.
Costco is one of the world's largest retailers. Their challenge wasn't "should we use AI." It was "how do we build an enterprise platform that lets every team build agents — without each one reinventing data, models, governance, security, and operations."
The deliverable: a target-state Enterprise Agentic AI Platform Architecture Blueprint covering guiding principles, the Nexus architecture, the full capability stack, layer-by-layer technology decisions, MVP scoping, a 5-year roadmap, and architecture mappings for the four priority use cases — Call Center, Personalized Search, Knowledge Assist, and GEO (Generative Engine Optimization).
Every choice you'll see below was made against the same nine-domain blueprint from 1.a. The framework gave them the structure; their context (Fortune-15 scale, GCP-first posture, regulated workloads, knowledge-heavy use cases) drove the specifics.
02
First, the guiding principles.
Before naming a single technology, the team named what they wanted to be. Two layers: enterprise-wide architecture principles inherited from Costco's existing EA practice, and AI-specific principles layered on top.
Enterprise Architecture · 10 principles
The Costco baseline.
The principles every enterprise initiative inherits — including agentic AI.
Business and IT Alignment with measurable value
Customer-Centric Design
Security, Compliance, and Privacy by Design
Simplicity and Scalability
Modular and API-Driven Architecture
Reuse Over Build or Buy
Global Availability and Resilience
Data-Driven Decision Making
Adaptive Governance
Innovation and Continuous Improvement & Automation
AI Architecture · 6 principles
The agentic-AI overlay.
What changes when you put intelligent agents on top.
Lead from the Top
Responsible Development & Deployment
Composable AI Architecture with GCP First
Interoperability
Empower the Workforce
Partner for Acceleration
03
Seven design principles before any technology.
The platform's design principles operate at a higher altitude than tools. Get these right and the technology choices fall out almost mechanically.
Principle 1
Knowledge-first context engineering
Semantic data modeling
Context isolation
High-quality data preparation and normalization
Continuous knowledge governance and lifecycle management
Principle 2
Federated deployment, centralized governance
Domain autonomy, platform consistency
Shared reference architecture with local extensions
Common guardrails enforced through a central policy
Unified agent registry and identity
Automated deployment tooling
Principle 3
Standards-driven, governance by design
Standardized agent lifecycle management and certification
Governance embedded into workflows
Standardized interfaces and protocols
Global safety and risk framework
Unified observability, telemetry, auditability
Principle 4
Composable design for rapid innovation
Service-oriented design approach
Loose coupling via abstractions
Declarative orchestration
Principle 5
Elasticity for high-volume processing
Elastic, on-demand orchestration
Resilient and fault-tolerant execution
Inferencing through request batching
High throughput, low latency
Principle 6
High-performance, safety-first agent ops
Standardized red teaming and AI judge framework
Tunable agent reasoning levels based on task complexity
Defensive UI for agentic experience
Network isolation
Principle 7
Cost efficient by design
Right-sized models and adaptive routing
FinOps by design — cost visibility and guardrails
Operational simplification through platform consolidation
Semantic caching
04
The plot twist: Nexus architecture.
The single biggest architectural decision in the deck isn't which model, which database, or which cloud. It's this: agents will be federated in the organization. Centralized in some places. Distributed in others. The trick is knowing which is which.
The Core
Anchored capabilities. Built once.
The core represents the solutions developed as foundational and differentiated capabilities of the organization. Built and operated centrally — because consistency is the moat.
The knowledge layer — a shared organizational substrate
Utility agents — pre-built, certified, reusable
Centralized governance spanning custom and commodity agents
AI operations — the control plane for the whole estate
The Satellites
Autonomous capabilities. Bought, not built.
Satellites represent the non-differentiated or commodity agentic capabilities developed by ecosystem products — for faster time to market. Agents stay close to the data, process, and experience affinity.
Salesforce, SAP, ServiceNow agents — "agents as a service"
Each satellite owns its own domain
The core enables centralized governance for both differentiated custom and commodity agents
Fig 7. The Nexus topology. Differentiated capabilities live in the core. Commodity agentic capabilities (Salesforce, SAP, ServiceNow) ride as satellites — close to the data they already serve, brokered through MCP, governed centrally.
The point of Nexus is sovereignty over what's differentiated and speed over what isn't. Build the knowledge layer and governance once, in the core. Buy commodity agents from partners and let them live close to the data they already serve. One central control plane. Many federated executors.
05
The capability stack — five layers, top to bottom.
Costco's enterprise agentic AI platform decomposes into five capability layers. Each is an architectural concern, with its own ownership, technology decisions, and governance posture.
Layer 1
AI Strategy & Tech Business Mgmt
Establishes AI technology strategy and standards
Governs investments and organizational behavior
Ensures alignment with priorities and responsible behaviors
Technical governance and operational services for AI agents
The substrate every team builds on
Layer 3
Data & Analytics
Delivers and governs high-quality, trusted data to power AI agents
Provides data and analytics capabilities that inform AI strategy
Continuously shapes the enterprise direction
Layer 4
Solution Delivery & Management
Designs, delivers, and manages AI use cases end-to-end
Ensures solutions are built, deployed, and continuously improved
Delivers measurable business value
Layer 5
Infrastructure, Operations & Security
Resilient, secure, and optimized cloud and infrastructure services
Continuously runs AI solutions responsibly across the enterprise
The runway under everything
Fig 8. The five layers. Strategy at the top sets direction; the platform layer is the substrate every team builds on; data feeds it; solutions ride on it; infrastructure runs underneath. The two purple-highlighted layers are the ones the deck treats as bookends.
06
Now the parts list — Level 3 capabilities, by domain.
Drilling down: the platform's seven internal domains and the specific Level 3 capabilities each one ships. An asterisk-marked existing capability means it lives in Costco's estate today and will need enhancements during use-case enablement.
Domain
Level 3 capabilities
Cloud & Infrastructure
Server / Container (Agent Run Time) · Cost Control (Tagging, Budgets, Alerts) · Observability (Telemetry) · Identity & Access Management · Network Management (VPC, Subnets, Routes) · API Management · Standards & Policy Management (NIST Controls) · Vulnerability Management
Data
Enterprise Data Governance (Data Catalog) · Analytical Data Stores · Operational Data Stores (Near & Real-time Data Products) · Object Storage · Data Security Management (Masking, Encryption) · Data Integration Management (ETL, Pub/Sub, CDC)
Model Registry (approved foundation models) · Model Fine-Tuning (domain adaptation) · Model Benchmarking (right-fit per use case) · Model Security (Guardrails, Content Filter)
The technology decisions — GCP first, but not GCP only.
"Composable AI Architecture with GCP First" is a guiding principle, not a religion. Where GCP-native fits, use it. Where it doesn't, build or buy. Below: the layer-by-layer decisions reproduced from the deck — exactly as scoped — across GCP services, non-GCP services, and 3rd-party services.
Knowledge Layer · Technology Decisions (1/5)
Capability
What it does
GCP Services
Non-GCP / 3rd Party
Knowledge Ingestion
Scalable processing of semi-structured, unstructured, and structured enterprise data (documents, images, audio, video, relational). Modules for entity/metadata extraction, classification tags, chunking for embeddings, enrichment for downstream retrieval.
Gemini Enterprise · Vector Search · Alloy DB · Cloud Run
None
Knowledge Retrieval
Optimizes the search space by combining retrieval/reranking strategies to identify the most optimal and relevant context to pass to the language model.
Services for generating, validating, and integrating synthetic data to support prompt tuning, scenario generation, and evaluation. Provides broad coverage of diverse data types including edge-case and safety scenarios.
None
Python · RAGAS / DeepEvals
Metadata Management
Defines, organizes, and governs metadata across knowledge assets. Covers data access rules, categories, timestamps, lineage, quality attributes. Enables retrieval filtering and context isolation via high-precision descriptors.
Dataplex
None
Taxonomy Management
Structured classification system that organizes knowledge into categories, hierarchies, and relationships. Creates a consistent vocabulary that humans and AI models can interpret reliably.
Dataplex
UI and Backend
Ontology Management
Semantic representation of the business domain capturing entities, attributes, relationships, constraints, and interactions. Provides LLMs and agents with structural understanding to improve grounding and reasoning.
Alloy DB · Firestore (optional)
UI and Backend
Knowledge Graph
Dynamic representation of knowledge that models concepts within a particular domain and the relationships between them. The digital brain of the AI agent.
None
Neo4j · UI and Backend
Vector Store
Specialized databases for storing and searching high-dimensional numerical representations of data, enabling AI systems to find semantically similar items.
Alloy DB
None
Model Layer · Technology Decisions (2/5)
Capability
What it does
GCP Services
Non-GCP / 3rd Party
Model Registry
Set of approved models from different providers, exposed via the AI gateway. Provides scoped access to approved models.
APIGEE
Kong · LiteLLM
Model Security
Mechanisms to enforce safety constraints, prohibited topics, refusal behavior, and output filtering at the model level.
Model Armor
None
Model Benchmarking
Suite for testing and evaluating base models and custom models against well-defined metrics; creates benchmarks for business-related functional areas.
Vertex AI Evaluation Service
Front-end and back-end service
Model Fine-Tuning
Capability to train or adapt foundation models with domain-specific Costco data so the model internalizes the vocabulary, semantics, and constraints of the problem space.
Vertex AI Fine Tuning
None
Agent Layer · Technology Decisions (3/5)
Capability
What it does
GCP Services
Non-GCP / 3rd Party
Agent Orchestration
Highly customizable, low-code and pro-code, scalable framework with chain-of-thought reasoning, dynamic task decomposition and management. Agents collaborate via integrated memory; multi-agent collaboration via a 3-layer orchestrator/super/utility agent topology.
Vertex AI Agent Engine · Google ADK
None
Agent Tools and Protocols
Pre-built services that allow agents to integrate securely to enterprise data and systems (CRM, ERP, ITSM, etc.).
APIGEE
Kong · LiteLLM
Agent Memory
Secure, governed, persistent layer that lets agents store specific episodes of interactions for later retrieval — so they can learn from past interactions. Stores key facts, preferences, actions, and outcomes across semantic, episodic, and entity dimensions. All options will be available through a memory abstraction.
Vertex AI Agent Engine Memory Bank
Langmem · mem0
Agent Explainability
Continuous stream of spans and traces capturing agent interactions, prompts, tool usage, latency, cost, errors, and action outcomes — providing observability into agent execution.
Cloud Trace · Cloud Logging · Cloud Monitoring
None specified
Prompt Registry
Centralized, version-controlled catalog where all prompt templates are managed and stored. Treats prompts as first-class artifacts — reviewed, tested, tagged, versioned. Single source of truth.
Vertex AI Prompt Management
GitHub / CICD
Agent Governance · Technology Decisions (4/5)
Capability
What it does
GCP Services
Non-GCP / 3rd Party
Agent Certification
Process of assessing agents against capability maturity and readiness dimensions. Capability maturity defines the autonomy/agency level; readiness is measured by security, effectiveness, and interoperability aspects.
None
Custom Developed (Python + REACT)
Agent Evaluation
Measurement systems that evaluate how well an agent reasons, retrieves, and acts. Ensures continuous reliability and tracks drift over time.
System of record for all certified agents — capturing identity, owner, purpose, versions, allowed tools/data, policy constraints. Each agent documented through an A2A-compliant Agent Card.
None
Custom Developed (Python + REACT)
Agent Security
Treats every AI agent like a non-human entity with strong control over what it can access and do. Each agent has a unique, verifiable identity used for authentication, authorization, and full audit logging of actions and tool calls.
Vertex AI Agent Engine Identity
None
AI Operations · Technology Decisions (5/5)
Capability
What it does
GCP Services
Non-GCP / 3rd Party
AI Gateway
Centralized control plane between agent applications, model providers, and MCP servers. Enforces governance and operations at runtime — auth, rate limits, policy checks, logging/tracing, spend/budget controls. Standardizes access; enables semantic caching and usage analytics.
APIGEE
Kong · LiteLLM
MCP Gateway
Control plane / proxy layer managing how agents securely access tools, data, and resources through MCP servers. Acts as the policy-enforcing middle layer — validating requests, brokering capabilities, ensuring every tool invocation follows enterprise rules around safety, observability, authorization.
APIGEE
Kong · LiteLLM
Agent Deploy
Enhances traditional DevOps with checks unique to agentic systems — prompt scanning, MCP tool scanning in the pipeline.
(Assessment in flight)
GHEC · GitHub Actions
Agent Observability
Collects, analyzes, and observes how agents behave in production. Captures end-to-end telemetry across agent runs, model calls, tool interactions — latency, errors, quality signals, cost.
GCS · AlloyDB
Arize · Dynatrace · Grafana
Agent Improvement
Continuous cycle of making agents more accurate, safe, cost-efficient based on real production signals. Uses evaluations and human feedback to facilitate reinforcement learning for continuous improvement.
Vertex AI Fine-tuning
None
08
MVP scoping — three sizes. Pick one.
Costco's deck offers three MVP scoping options, each strictly additive: small is foundational; medium adds utility agents and an agent appraisal framework; large adds prompt analytics and a knowledge graph builder. Increasing in scope as you move to the right.
Option · MVP-Small
No-regret foundational capabilities.
The baseline platform. Eight deliverables. Everything below is required regardless of which path Costco picks.
POC validation for APIGEE, Aura DB, Dataplex, and Dynatrace integration for operational metrics
Cloud & Infrastructure foundation — GCP onboarding, IAM foundation, IaC, containers
Platform foundation services — Alloy DB, Agent Engine, Aura DB
Knowledge Layer — Data-to-Knowledge patterns for RAG-based use cases
Approved language models configured in AI Gateway (APIGEE) for governed access
Agent governance — human feedback collection, operational metrics, Dynatrace integration
Semantic Memory as a Service — for consistency and cost reduction
Knowledge Serving Layer (Hybrid Search and Semantic Search)
Option · MVP-Medium
Utility agents + agent appraisal.
MVP-Small + 5 deliverables. Adds the first wave of platform-supplied agents and a real evaluation framework.
Knowledge Serving Layer enhancement to serve the intent graph
Knowledge Assist (utility agent) and Intent Resolver (utility agent)
Agentic AI Evaluation Framework + Agent Appraisal Dashboard
Option · MVP-Large
Prompt analytics + KG builder.
MVP-Medium + 2 deliverables. The fully scoped platform launch.
Prompt Analytics Dashboard — track and monitor interaction patterns; insight for performance and security improvement
Knowledge Graph Builder Service — manage and maintain domain graphs leveraging ontology and taxonomy
Fig 9. The MVP options are strictly nested — Large contains Medium, Medium contains Small. The 6-month timeline maps each scope to the months it lands in, with the Pharmacy FAQ on Knowledge Assist as the named delivery milestone.
09
The 5-year roadmap — FY26 through FY31.
MVP gets you to month 6. The deck looks five years out. Three macro phases: MVP build, platform maturity / operational excellence, and strategic differentiation. Agentic capabilities with repeatable patterns go into the platform — not into individual use cases.
Fig 10. The 5-year program isn't sequential — it's overlapping. Maturity work begins in mid-FY27 while MVP wraps; strategic differentiation begins in FY29 while maturity continues. The lower row names the three flagship strategic outcomes by FY31.
1
FY26 · MVP Build 1.0 + 2.0
Q3 FY26 — Q2 FY27. Cloud Foundation setup for Agentic AI · KAD, POCs, and Testing · Knowledge Ingestion (D2K pipeline) · Knowledge Retrieval (Semantic, Hybrid, Graph RAG) · Vector Stores · Model Registry Setup · Model Security (Model Armor) · MCP Gateway setup · AI Gateway Setup · Agent Deploy · Agent Memory · Prompt Registry · Pre-built Utility Agents · Agent Pattern Catalog · Agent Certification (Process) · Agent Explainability / Observability · Agent Evaluation · AI Gateway / Observability / Graph DB · Certify D2K with Knowledge Assist · Platform Testing (Pen Testing, Vulnerability Testing).
Metadata Management · Knowledge Graph Builder · Knowledge Operations · Taxonomy and Ontology Management · Adaptive Learning Framework · Agent Improvement (RL Models, Cross-Encoder re-rankers) · Model Benchmarking · Agent Certification (Implement) · Agent Registry · Agent Security · Agent Onboarding · AI Gateway Setup Enhancements (A2A integration, integrate with OpenAI, Anthropic) · Fine-tuning workbench · AI for BI · Knowledge Assist · Chargeback model · Platform consumption tracking.
3
FY29–FY31 · Strategic Differentiation
POCs for strategic differentiation: Agent Commerce, Agent Marketplace, Agent Economy · Mind of Costco Ecosystem (Organizational Knowledge Graph, Organization Memory Graph) · Autonomous agents — Controlled Autonomy (continuous environmental sensing + IT/business operation actions) · Publishing Costco-specific agents for external marketplace integration (e.g., GEO, instant checkout from ChatGPT) · Agent Commerce (UCP). Plus ongoing platform operations, maintenance, and enhancements (Vector Store, Agent Engine provisioning).
10
Four priority use cases — same framework, different surfaces.
The platform exists to enable use cases — not the other way around. The deck names four priority workloads, each mapped to the same Level 3 capability matrix. Same scaffolding, four different agents on top.
Use Case 1
Call Center · Contact Center
Conversational + content-analyst archetypes
Mapped to the full MVP capability matrix — Level 3 capabilities across Cloud, Data, Knowledge, Model, Agent, Governance, Operations
Inherits the platform's identity-management, data and tool controls, MCP / A2A connectors
Use Case 2
Personalized Search
Knowledge-heavy retrieval with member-context personalization
Routes to AI Gateway with model-tier selection by query complexity
Use Case 3
Knowledge Assist
Utility agent — synthesizes and contextualizes trusted enterprise knowledge for users
By month 6 of MVP: framework ready with Pharmacy FAQ
Foundation for the Intent Resolver utility agent and the Knowledge Graph Builder service
Use Case 4
GEO · Generative Engine Optimization
Crawl phase begins month 4 of MVP
FY29–FY31: publishing Costco-specific agents for external marketplace integration — including instant checkout from ChatGPT
Agent Commerce on the Universal Commerce Protocol (UCP) joins the program in the strategic differentiation phase
11
The bottom line.
Costco didn't build a use case. They built a platform — and use cases ride on top. Nexus architecture for sovereignty over what's differentiated and speed over what isn't. GCP-first composable design for partner leverage without lock-in. Five capability layers, seven internal domains, dozens of L3 capabilities. A 6-month MVP that proves the foundation. A 5-year roadmap that extends from foundational onboarding to autonomous agents and Costco-specific marketplace integration.
Most importantly: every architectural choice traces back to one of the seven design principles — knowledge-first context engineering, federated deployment with centralized governance, standards-driven by design, composable for rapid innovation, elastic for high-volume processing, safety-first ops, and cost-efficient by design.
If 1.a tells you why the framework matters, 1.b shows you what it looks like in production.
Want the framework behind this?
The architectural decisions on this page weren't invented for Costco — they were the deliberate application of the v7 Intelligent Agent Reference Architecture from 1.a. Open the blueprint to see the nine-domain, ISO 42010-aligned framework that informed every choice above. Or jump straight to the OWASP-aligned risk catalog deep-dive that's now part of the standard pre-flight check.
#01 · 1.c · Intelligent Digital Brain · Ecosystem · February 2026
One brain. Six platforms. The same nine gaps.
The blueprint in 1.a tells you what to build. The Costco spotlight in 1.b shows you how a Fortune-15 enterprise actually built it. 1.c shows you what it looks like on each Major Agentic Platform — AWS, Azure, GCP, OpenAI on AWS, Databricks, and Snowflake — service by service, layer by layer. And it shows you something more uncomfortable: every platform leaves the same handful of gaps. Knowing where the natives stop is the difference between a brain that ships and a brain that stalls.
The platform decision is not the architecture decision.
Almost every enterprise agentic AI conversation begins with the same question: "Should we build on AWS, Azure, GCP, Databricks, or Snowflake?" It's the wrong opening question — but the right one to disarm.
The right opening question is: "What does an Intelligent Digital Brain actually look like?" Once you know that, the platform question stops being a religious war and becomes a translation exercise.
Same brain. Different services. Different gaps. The brain has the same 23 capabilities on every platform — agent orchestration, semantic layer, model recipe, governance, observability, and so on. What changes from one platform to the next is which native services map to which capability — and, critically, where the platform's natives run out.
[Image Suggestion: A hex-grid of six logos (AWS · Azure · GCP · OpenAI on AWS · Databricks · Snowflake), each connected by purple threads to a single luminous "Brain" node in the center. Subtle ghosted text below: "Same blueprint, different translations."]
02
First, what the brain actually is.
Before you can map a brain onto a platform, you need to agree on the brain. The L2 reference is a layered architecture organized around seven steps of agentic execution — the loop every enterprise agent runs, regardless of cloud:
The seven-step agentic execution flow (L2)
1
Orchestrate · agents coordinate domain requests
2
Gateway · models, tools, knowledge as control points
3
Reason · ensemble of continuously-learning models
4
Ground · semantic layer + ontology + data products
5
Act · update data products as agents make changes
6
Integrate · enterprise systems with embedded agents
7
Govern · controls + visibility logs at every stage
Underneath, the brain is organized into five enterprise layers — Industry Pattern Libraries · AI Lifecycle Management · Agent Ensemble · Domain Ontologies + Specialized Models · Data Foundation — sitting on a shared Brain Infrastructure of compute, networking, identity, secrets, and resilience.
That's the constant. The platform choice determines the spelling — which native services play which roles — but not the structure.
03
Six platforms, mapped.
For each platform we draw the same L2 picture, then label every capability with the native services that fulfill it. What follows is a quick-reference card per platform — the headline services that do show up natively, and the platform's distinctive flavor.
Platform 1 · AWS
The deepest service catalog.
Strong across orchestration, model recipe, data foundation, and infrastructure. The brain plumbing is mostly already there.
Microsoft Foundry plus Azure OpenAI Service form the spine; Semantic Kernel and AutoGen carry agent orchestration.
Orchestration: Microsoft Agent Framework SDK, Foundry workflows, Semantic Kernel/AutoGen
Models: Azure OpenAI Service, Azure Machine Learning, Foundry IQ for grounding
Data: Azure Synapse, Data Factory, Cosmos DB (Gremlin), Data Lake Storage Gen2, Purview
Govern + observe: Foundry evaluations, Azure Content Safety, Responsible AI Toolbox, Azure Monitor, Application Insights, Azure Red Teaming Agent
Platform 3 · Google Cloud
Vertex everywhere, ADK for agents.
Vertex AI, Agent Builder, and the Agent Development Kit (ADK) form the agentic substrate; Gemini provides cognition.
Orchestration: Vertex AI Agent Builder, ADK, Vertex AI Pipelines, A2A protocols
Models: Vertex AI, Model Garden, Gemini, Vertex AI Studio
Data: BigQuery, AlloyDB, Spanner, Dataform, Feature Store
Govern + observe: Vertex AI Model Registry, Gen AI Evaluation Service, Cloud Monitoring, Agentic SOC, BigQuery Data Lineage
Platform 4 · OpenAI on AWS
Cognition over governed plumbing.
A hybrid pattern: OpenAI provides the cognition (intent + reasoning + planning); AWS provides the governed Digital Brain (memory, knowledge, tools, observability). Best illustrated by the agentic-commerce customer-journey blueprint in the deck — a 9-step flow from "My internet keeps dropping since I moved" to a credit + a shipped Wi-Fi extender, with the agent never seeing raw payment data.
Cognition: OpenAI models for intent + context + reasoning
Brain Core (AWS): Graph DB + OpenSearch for journey context retrieval
Action: Agentic Commerce Protocol (ACP) for policy-gated, idempotent execution
Memory: Resolution outcomes link back to the journey graph for similarity matching next time
Platform 5 · Databricks
Lakehouse-native, Unity Catalog-governed.
Mosaic AI is the agentic surface; Unity Catalog runs governance end-to-end across data, models, agents, and tools.
Orchestration: Mosaic AI, Workflows, AI Gateway, Serving Endpoints (LangGraph optional)
Govern + observe: Horizon, Trust Center, Cortex GUARD, Object Tagging, RBAC/Access Policies, Masking Policies, Access History, Query History/Profiling, Snowflake Observability
[Image Suggestion: Six small thumbnail-style "L2 architecture" cards arranged in a 3×2 grid, each one labeled with a platform name and showing a simplified 5-layer brain stack with native service tags. All six cards share the same shape and stack — only the labels differ — making the "same brain, different services" claim instantly visual.]
04
Where every platform is enough.
It is genuinely true that the major platforms cover the brain's plumbing well. If your engineering team is ready to wire it up, you can build the following layers entirely native — on any of the six. This is the consensus zone.
The "yes" column — fully native, all six platforms
Capability
Why it works natively
Brain Infrastructure
Compute, networking, security, identity, multi-tenancy, resilience — the platforms have spent a decade on this. Generally sufficient. Usually no third-party needed.
Data Accessibility
Secure access to enterprise data sources is solved. Lake Formation, Azure Data Lake, BigQuery IAM, Unity Catalog, and Snowflake RBAC are all enterprise-credible.
Model Recipe (fine-tuning)
Domain adaptation works on Bedrock + SageMaker, Azure ML + Foundry, Vertex AI fine-tuning, Mosaic AI, and Cortex Fine-Tuning. Hugging Face / vLLM only enters the picture for hybrid or non-native model mixing.
AI Lifecycle Automation (CI/CD)
CI/CD promotion gates exist: CodePipeline + CodeBuild, Azure DevOps, Cloud Build + Vertex Pipelines, GitHub + Workflows, Native App Releases. The pipelines themselves are fine; the eval-metric gates are where third-party adds value.
Infrastructure Observability
CloudWatch + X-Ray, Azure Monitor + Application Insights, Cloud Monitoring, Mosaic AI monitoring, Snowflake Observability — runtime/infra signals are well-covered.
This is the part of the deck that took the longest to build, and it's the part that pays back fastest. After mapping all six platforms layer-by-layer, the same nine capabilities fall short on every platform — sometimes by design, sometimes because the category is genuinely young, sometimes because the platforms are racing toward it but not there yet.
The nine universal gaps — and what fills them
Capability
Why natives fall short
What fills the gap
Industry Pattern Libraries
Platforms ship general templates. None ship deep vertical "industry cognition" or reusable domain-agent IP.
Vertical-specialized agents (banking KYC, fraud, claims, marketing ops) are not provided out of the box anywhere.
Accenture Industry Agents
+ Salesforce Agentforce · ServiceNow agents · SAP Joule extensions · Microsoft Foundry partner packs
Domain Ontology Engineering
No major platform provides ontology authoring + lifecycle tooling. The graph storage is there; the engineering is not.
TopBraid · PoolParty · Protégé (OSS)
Knowledge Representation (advanced reasoning)
Neptune / Cosmos DB Gremlin / BigQuery graphs / Iceberg via Snowflake all store graphs — but ontology-driven reasoning patterns and rules engines need more.
Stardog · Neo4j · TerminusDB (OSS)
Semantic Layer (enterprise governance)
Data semantics are covered by Glue / Synapse / BigQuery / Unity Catalog / Snowflake Semantic Model — but enterprise stewardship workflows + semantic contracts at scale are not.
Collibra · Alation · Atlan · OpenMetadata (OSS)
Agent Decision Lineage
Model registries exist (SageMaker, Azure AI, Vertex, MLflow, Cortex). The "why" trace across multi-agent decisions — evidence packs across chained reasoning — does not.
MLflow + OpenLineage
+ Collibra/Alation · Arize/Fiddler for QA gates
Agent Quality Observability
Hallucination detection · semantic correctness · tool misuse · agent behavior drift — all newer than the infra-observability tooling, and inconsistent across platforms.
Baseline explainability exists; chain-of-reasoning explainability across multiple agents working together does not.
Fiddler · TruEra · Arize · Evidently (OSS)
Agent Certification & Readiness
CI/CD + custom evals get you partway. No platform ships a productized "is this agent ready for production" certification framework.
Arize · WhyLabs · W&B + Great Expectations
+ custom certification scorecards
[Image Suggestion: A "platform coverage heatmap" — six columns (one per platform) and 23 rows (one per L2 capability). Cells are color-coded green (fully native), amber (partial), or grey (gap). The nine universal-gap rows show consistent grey/amber bands across all six columns — visualizing the thesis instantly.]
None of the gaps are fatal. All of the gaps are predictable. A team that walks in already knowing the nine has a 6-month head start on a team that learns them by hitting them.
06
What a vertical brain looks like.
The reference becomes concrete the moment you industry-fy it. The deck includes a worked example: The Banking Digital Brain on AWS — A Runtime Architecture Flow Blueprint. Same seven-step loop, banking-specific organs.
Banking Experience Layer
Where bankers actually work.
Banker Copilots
Investigator Workbench
Contact Center Assist
Digital Channels
Back-office Automation
Banking Data Foundation
Customer 360 + Risk 360, governed.
Redshift · Kinesis · S3 · Glue · Lake Formation
Customer 360 + Risk 360 as the headline data products
Plus an Industry Agent harness for partner-supplied vertical agents
Cycle: Sense → Interpret → Evaluate → Learn → Govern → Reflect → Deploy
07
So how do you actually pick?
Cost spreads at enterprise scale are narrower than the headlines suggest (see 3.a). Capability gaps are uniform across platforms (see Section 05). So what does drive the choice?
1
Existing data gravity
If your data already lives somewhere, the brain probably should too. A Redshift + S3 estate wants AWS. A Synapse + Fabric estate wants Azure. A Lakehouse-of-record on Databricks or a warehouse-of-record on Snowflake are equally compelling reasons to stay put.
2
Operating model fit
If your team already lives in Vertex / ADK or in Foundry / Semantic Kernel, you'll ship faster on the platform whose mental model you've already internalized. The "best" platform is the one your engineers already trust.
3
Cognition strategy
If the cognition you need is OpenAI-shaped, the OpenAI-on-AWS pattern (Section 03 · Platform 4) is a real architecture, not a fallback. Hybrid is a first-class choice.
4
Plan for the gaps anyway
Whichever platform wins, the same nine gaps are coming. Budget for them — ontology engineering, agent decision lineage, agent certification, vertical agent IP — and build the partner stack into the architecture diagram from day one.
The three flavors of the ecosystem, at a glance
Hyperscalers
AWS · Azure · GCP
Deepest service breadth; full vertical stack from compute to cognition
OpenAI-on-AWS belongs here as a hybrid pattern
Best fit when the brain spans many capabilities and your data already lives there
Lakehouse-First
Databricks
Mosaic AI as the agentic surface · Unity Catalog governance end-to-end
Strong fit for ML-heavy, streaming-heavy, lakehouse-of-record estates
Multi-cloud portable
Warehouse-First
Snowflake
Cortex agents · Horizon governance · Marketplace + Native Apps for distribution
Strong fit for governed-BI consumption, data sharing, and clean-room patterns
Industry agents arrive via Marketplace the same way data does
08
The bottom line.
The platform decision and the architecture decision are not the same decision. The architecture is constant. The platform is a translation of that constant into a specific set of services and a specific set of gaps.
Bring the blueprint. Map it onto your platform. Plan for the same nine gaps that every platform has. Then you can have the cost conversation — because you'll know what you're actually pricing.
Ready to map the brain to your platform?
The full executive deck — every L2 architecture diagram, the per-platform native services, the gap tables, the banking and agentic-commerce examples — is the source of record. Open it for the diagrams, then talk to Atish for the engagement view.
#01 · 1.d · AI Security Architecture · v1 · May 2026
Security isn't a layer. It's a zone. Architect around it.
Most enterprise security thinking still applies to AI — identity, encryption, network segmentation, audit, key management. What changes is the threat surface. Models are non-deterministic. Prompts are executable. Tools have side effects. The data that trains the system can be the attack vector against it. This is the architectural answer: four zones, twelve layers, thirty-nine capabilities, with the Agentic DMZ as a load-bearing security boundary every model interaction must cross — by design, not by exception.
Walk into an enterprise AI program and someone will ask where the security layer goes. A box marked Guardrails. A box marked Content Filter. A box marked PII Redaction. Arrows. Everyone nods.
Then the system fails in production. Why? Because the boxes hid everything that mattered.
Web applications taught us that security is a cross-cutting concern — auth in front, encryption in transit, RBAC at the data tier. AI inherits all of that. And then breaks the model. A model is not a database. A prompt is not a query. A tool call is not a stored procedure. The attack surface isn't a port to close — it's a behavior to constrain.
AI security isn't a layer. It's a zone. When a CISO asks "where's the AI security layer?", the right answer is: "There isn't one. There's a controlled boundary — the Agentic DMZ — that every model interaction crosses. And there are security capabilities in every other zone that make the boundary mean something."
That's not hand-waving. That's the pattern. Four zones. One boundary. Twelve layers of control.
02
The four zones, named.
Every enterprise-grade agentic system decomposes into four zones with distinct security, governance, and execution characteristics. Skip one and you've shipped a demo. Cover all four and you've shipped a system. Each zone is a trust boundary — meaning every transition between them is a place where security controls earn their keep.
Fig 1. The four-zone agentic stack. Zone 2 is the load-bearing security boundary — every external interaction crosses it before reaching agent execution; every model invocation crosses back through it before reaching the user. The other three zones contribute security capabilities that make the boundary enforceable.
Each zone has a distinct security mandate. None of them works without the others.
Zone 1 · Channels — Authenticate the actor. Authorize the action. Capture consent. If you cannot identify who is on the other end of the wire, no downstream control matters.
Zone 2 · Agentic DMZ — Normalize the input. Filter sensitive data. Enforce content policy. Defend against prompt injection. This is where the "AI" part of AI security earns its name.
Zone 3 · Agentic Apps — Isolate execution. Mediate tool access. Bound agent autonomy. The model can suggest anything; the runtime decides what actually executes.
Zone 4 · Agentic Foundation — Encrypt at rest. Govern the model registry. Monitor drift. Audit every token. The platform-level controls that make incident response possible.
03
Zone 2 is the idea everything else rests on.
A DMZ — demilitarized zone — is a forty-year-old network pattern: a controlled space between a trusted interior and an untrusted exterior, where every transition is mediated by explicit security controls. The Agentic DMZ applies the same pattern to AI — a controlled boundary between users and agent execution, where every prompt is normalized, every input is filtered, and every model boundary is enforced before reasoning begins.
Fig 2. A DMZ is a forty-year-old pattern. The Agentic DMZ is the same pattern at a new substrate — controlled boundary, mediated transitions, explicit controls — with prompt injection, PII, and tool-access taking the place of port-level firewalls and IDS rules. Same shape. New attack surface.
Three layers do the work:
Signal Processing & Normalization — Speech-to-text with diarization and language detection. Text-to-speech with consistent voice identity. Multimodal normalization that strips raw input down to a structured, tagged format. The model never sees raw audio, raw HTML, or raw user upload. It sees a normalized representation the rest of the boundary controls can reason about.
Session & Flow Control — Turn management. Conversation state. Flow governance. Rate limiting. Loop prevention. This is the layer that catches the abuse pattern before the prompt-injection layer does. An agent that detects a barge-in storm or a backchannel flood doesn't need a content filter — it needs a circuit breaker.
Input & Prompt Security — PII detection, masking, and tokenization on the way in. Toxicity detection, domain restrictions, and compliance guardrails on inputs and outputs. Adversarial detection, tool-access controls, and model-boundary enforcement against prompt-injection attempts. This is the layer most people mean when they say "AI security." It is not the only one.
The Agentic DMZ is the answer to a single architectural question: where do the AI-specific controls live? Not scattered through every microservice. Not bolted onto the model wrapper. Not duplicated by every team that ships an agent. In one named zone, with one named owner, that every interaction must cross.
04
The threat surface, decomposed.
Three industry standards have converged on a shared map of the AI threat surface. None of them replaces the others. Together they tell you what to look for, where to look for it, and how to talk about it with people who don't build AI.
OWASP LLM Top 10 (2025) — Application-level risks. Prompt injection, sensitive information disclosure, supply-chain compromise, insecure output handling, excessive agency, training-data poisoning, model denial-of-service, insecure plugin design, overreliance, model theft. This is the developer's catalog. If you build an agent, you should be able to name all ten.
MITRE ATLAS — Adversarial Threat Landscape for Artificial-Intelligence Systems. The same idea as MITRE ATT&CK, applied to ML. This is the red team's catalog. Tactics and techniques an attacker uses against models in the wild — reconnaissance, initial access, ML model access, evasion, exfiltration, impact.
NIST AI Risk Management Framework — The governance frame. Map → Measure → Manage → Govern. This is the board's catalog. What an enterprise has to be able to say about its AI systems before regulators, auditors, or a customer's risk team will let them through procurement.
The architecture's job is not to repeat any of these. The architecture's job is to make sure every entry in every catalog has a place in the stack where the control belongs — and a person whose name is on enforcing it.
Door A's risk catalog already maps thirteen agent-specific risks across five categories — Confidentiality, Integrity, Availability, Harmfulness, Honesty — onto a five-stage control pipeline aligned to the OWASP LLM Top 10. This page does not reproduce that catalog. It places the catalog into the four-zone architecture so the controls have somewhere to live.
05
Three control disciplines. Every zone uses all three.
Inside every zone, security controls fall into one of three disciplines. Most teams ship the first one and forget the other two. That is the most common reason a working AI system becomes an unworkable AI security incident.
Prevent — Stop the bad outcome from happening. Authentication. Authorization. PII redaction. Prompt-injection defense. Tool-access policy. Container isolation. Network segmentation. Encryption. Most of the work, none of the visibility.
Detect — Notice when prevention fails. Anomaly detection on prompts. Drift monitoring on models. Distributed tracing on agent runs. Token analytics. Conversation replay. Audit logging. The instrumentation that turns "something feels off" into a ticket.
Respond — Contain the blast radius. Kill-switches at the model gateway. Rollback at the agent registry. Quarantine at the tool gateway. Incident response playbooks that name the on-call. Post-incident review that closes the gap that opened the door. The discipline that turns one bad day into a learning, not a press release.
Fig 3. The control matrix. Twelve cells, four zones, three disciplines. Zone 2's row carries the heaviest load — it is the AI-specific zone — but no row is allowed to be empty. A zone without all three disciplines is a zone with a hole in it.
A zone without all three disciplines is a zone with a hole in it. Prevent without Detect is a guess.Detect without Respond is a complaint.Respond without Prevent is theatre. The four-zone pattern works because every zone is built to do all three.
06
Security shows up in seven of the nine viewpoints.
Door A — The Blueprint — names nine architectural viewpoints for any intelligent agent system. AI security is not a tenth viewpoint. It is a property that shows up in seven of the original nine, and the architect's job is to know where.
Viewpoint (from 1.a)
Where security lives
Anchor zone
Data
Classification, lineage, retention, residency. Encryption at rest and in transit. Access policy on every data store the agent reads or writes.
Zone 4
Runtime
Container isolation. Sandboxing. Memory hygiene between sessions. Side-effect containment for tool calls.
Zone 3
Cognitive
Prompt-injection defense. Output validation. Adversarial-input detection. Boundary enforcement on what the model can be asked to do.
Zone 2
Security
The architect's stewardship of every other row. Threat model. Control catalog. Control owner. Audit cadence.
All zones
Integration
Tool-invocation gateway. Permission scope on each connector. Response validation. Per-call authorization, not session-level grants.
Zone 3
Infrastructure
Network segmentation. Identity infrastructure. Key management. Hardware-backed enclaves where the workload requires them.
Zone 4
Model
Model registry with provenance. Prompt versioning. Drift monitoring. Model-supply-chain controls — including what was used to train it and what was used to fine-tune it.
Zone 4
DevMLOps
Secure CI/CD for prompts and models. Pre-deployment evaluation gates. Environment promotion controls. Rollback paths.
Zone 4
Multi-agent
Agent-to-agent authentication. Delegation boundaries. Conflict resolution that does not silently expand authority.
Zone 3
Security shows up in every row. What changes is which zone holds the primary control and which discipline — Prevent, Detect, Respond — owns the response. The viewpoint says what to think about. The zone says where to put it. The discipline says how to enforce it.
07
Prompt injection, walked all the way through.
Pick one risk and trace it across all four zones. Prompt injection is the right one — it is the AI-specific risk most people have heard of, the one most often miscategorized as "just a content-filter problem," and the one whose mitigation pattern reveals every part of the architecture at once.
Zone 1 — Channels. Authenticate the user. Bind the session to a verified identity. If the request comes from an authenticated, authorized actor, you have a name attached to the bad input. If it doesn't, the rest of the controls have less to work with.
Zone 2 — Agentic DMZ. Normalize the input — strip exotic Unicode, decode embedded payloads, separate user content from system instructions. Detect adversarial patterns. Filter known injection signatures. Tag retrieved content (RAG context, tool output) as untrusted so the model treats it as data, not as instruction. This is the layer that catches most attempts.
Zone 3 — Agentic Apps. Enforce least privilege at the tool-invocation gateway. The model can request a high-impact action; the gateway decides whether the current session is authorized to perform it. Bound agent autonomy with policy: a model that wants to call a destructive API should never be the only voice in the decision.
Zone 4 — Agentic Foundation. Log the prompt, the retrieved context, the model output, and the tool call as one correlated trace. Monitor drift in detection efficacy over time — adversaries adapt. Replay conversations on demand. If detection failed in Zone 2 and authorization caught it in Zone 3, the audit trail in Zone 4 is what tells you why.
Fig 4. The same prompt injection traced across the stack. Zone 1 names the actor. Zone 2 normalizes and filters and catches most attempts. Zone 3's tool gateway denies the privileged action even if Zone 2 missed. Zone 4's correlated trace tells the post-incident review what to fix. No single zone defeats it. Four zones in sequence do.
No single zone defeats prompt injection. Four layers of partial defense, applied in sequence, do. A control that works ninety percent of the time, layered four times, gets you to four nines. That is the architectural insight. The rest is engineering discipline.
08
A footnote on Door B — because architecture is the control.
Door B — Costco Runs It — is built on a Nexus architecture: differentiated capabilities anchored in a sovereign core, commodity capabilities federated to satellites that ride close to the data they already serve. That pattern is not a security pattern. It happens to be a security pattern.
Look at what Nexus does, in security terms:
The core is a trust boundary. Knowledge layer, governance, model registry, and central control plane live in the core. Differentiated decisions cannot be made outside it. One control plane. One audit trail. One on-call.
The satellites are blast radius limits. Salesforce, SAP, ServiceNow run their commodity agents close to their own data, brokered through MCP. A compromise at a satellite cannot cascade into the core unless the core's policy layer permits it.
MCP is the controlled boundary. Every cross-zone call is mediated. Tool-access policy travels with the request. The protocol itself is the place security is enforced — not a separate "gateway tier" that has to remember to be there.
The four-zone pattern is what Costco is shipping. The Nexus core is Zones 3 and 4. The satellites are bounded extensions of Zone 3, mediated through Zone 2 boundary controls expressed as MCP policy. "Run it anywhere" and "secure it everywhere" are the same sentence.
09
The bottom line.
AI security is not a layer to add. It is a zone to architect around. The Agentic DMZ is the load-bearing concept; the four-zone stack is what makes it enforceable; the three control disciplines are how each zone stays honest; and the nine viewpoints from Door A are where the work actually gets done.
Three things to walk away with:
Name the boundary. If your team cannot point at the one zone every model interaction must cross, you do not have a boundary. You have hope. Hope is not a control.
Name the controls per zone. Identity in Zone 1. Prompt security in Zone 2. Tool-access mediation in Zone 3. Governance and audit in Zone 4. Every zone needs Prevent, Detect, and Respond. No zone gets a pass.
Name the standards behind it. OWASP LLM Top 10 for the developer's catalog. MITRE ATLAS for the red-team's catalog. NIST AI RMF for the board's catalog. One architecture, three audiences, the same pattern underneath.
This is the security pattern that runs through Doors A, B, and C. The framework explains the viewpoints; the spotlight shows the Nexus pattern; the ecosystem shows where each platform's gaps live. This page shows the boundary they all enforce.
Ready to secure your agent architecture?
This page is a v1 articulation of the AI security architecture pattern that threads through the v7 Intelligent Agent Reference Architecture, the Costco Nexus blueprint, and the six-platform Intelligent Digital Brain ecosystem map. The four-zone model and capability inventory are reproduced from the Agentic Stack — Capabilities & Descriptions source materials extracted on 2026-04-15. Content is under active development. Re-validate against the latest source release before scoping a new engagement.
Tools change every quarter. Foundations don't.Human in the Lead is where we keep the curriculum that turns engineers, analysts, and leaders into people who can actually command agentic AI — paired with the foundational concept primers that explain what's happening underneath, and the partner field reports that tell us what's actually shipping. Three ways to keep your team in the lead. Pick your door.
Training program · Foundational concept · Partner field report
01
Three ways to keep humans in the lead.
Some teams get there through a structured, multi-day bootcamp. Others get there through one perfect weekend with a primer that finally makes the math click. And some get there by reading the field report from someone who just spent the week in San Francisco at the partner's biggest event of the year. Human in the Lead holds all three.
Behind Door A — Citizens Spotlight — is Human-in-the-Lead Training, the multi-day agentic AI program we ran for Citizens. Four modules · 417 slides, all built on the premise that humans stay in command of the agents. All four are live now — Day 0 (the May 2025 foundations preview) plus the three live days of the September 2025 Citizens AI Academy Track C: Banking Reinvention, Tool Use & Reasoning, Memory & Planning.
Behind Door B — Words as Numbers — is something more foundational: the vector embeddings primer our Center for Advanced AI built to teach the building block underneath every modern generative AI system. 26 slides. Worked examples. The math, demystified. If you've ever sat in a room where someone said "just embed it" and you weren't sure what that meant — this is the door.
Behind Door C — Agentic Enterprise — is the Google Cloud Next '26 recap: every announcement that matters from Google's biggest event of the year, organized by the six-layer stack Google itself laid out — Agentic Taskforce, Agent Platform, Agentic Defense, Agentic Data Cloud, Research & Frontier Models, AI Hypercomputer. The deck Google's own alliance team handed us. The thesis, the receipts, and the customer stories — translated into something you can actually use on a Monday.
Read in any order. The primer explains the foundation; the bootcamp shows how to build on it; the field report tells you what one of the world's three biggest AI partners is actually shipping. Together they cover the full distance from "what's a vector?" to "here's what Google announced last week, and why it matters for your roadmap."
#04 · 4.b · Vector Embeddings · v5
How machines turn words into numbers.
Every modern AI system — from search to chatbots to recommenders — runs on the same foundational trick. Take a word, an image, a sound clip, a heartbeat, anything that isn't a number. Turn it into a list of numbers. Then let the math find what's similar, what's different, and what belongs together. This is that trick, demystified.
26 slides · the foundational primer2 authors · CAAI∞ dimensions · in theory
AuthorsLan Guan·Mo Nomeli·Center for Advanced AI · Lan Guan (Chief AI & Data Officer) · Mo Nomeli (CAAI Global Lead AI Learning & Emerging Tech)
01
Computers think in numbers. Humans don't.
Imagine a database of 50,000 companies. Tabular data is easy. Names, CEOs, headquarters, employee counts, industries. Find all companies with more than 1,000 employees. Sort CEOs alphabetically. Calculate the average company size. One SQL query. Done.
Now imagine a press release attached to one of those rows: "Acme Inc. revealed a significant strategic shift under its newly appointed CEO, Jane Smith. Smith outlined a comprehensive plan focusing on sustainable growth initiatives..."
Now ask: Which other CEOs are pursuing sustainability? Is this strategy shift common in the industry? How might this affect Acme Inc.'s stock price?
Fig 1. Structured tabular data is rich with operations — filter, sort, calculate. Unstructured text is rich with insights — but those insights are locked behind a wall of language nuance, context, and meaning. Embeddings break that wall down.
Unstructured data is where the real signal lives. The challenge is that traditional tools can't process it — they need additional steps to unlock the value. Vector embeddings are those additional steps.
02
Measure. Compare. Discover.
Once data is in vector form, the math takes over — and it's a particular kind of math. Three operations matter.
Measure the distance between individual data points
Determine the similarity between different data points
Transform data in ways that are useful for analysis
The way "similarity" gets measured is the part most people skip. The standard answer is cosine similarity — the angle between two vectors. Three angles tell the whole story.
Fig 2. The three states of cosine similarity. Near 0° means the vectors point in the same direction — they're similar. Near 90° means they're perpendicular — unrelated. Near 180° means they point opposite ways — they oppose each other.
03
What is a vector embedding, exactly?
Strip away the jargon and the answer is genuinely simple. A vector is a fixed-length array of numbers that represents a point in a mathematical space.
Each number in the array corresponds to a direction (or dimension) within that space, and its value determines the vector's magnitude in that direction. Vectors in machine learning can have thousands of dimensions — those are difficult to visualize. But simpler vectors with two or three dimensions can be easily graphed and understood.
A vector embedding — or simply, an "embedding" — is a way to turn things that aren't numbers (like words or pictures) into a list of numbers. This list captures the important qualities and relationships within the original data. Embeddings capture semantic similarity, tone, and hierarchical relationships: "MIT" will be close to "University." "Happy" will be farther than "Sad." "Car" will be close to "Vehicle."
Here's the worked example from the deck. Three West Coast cities, each described by three numbers — longitude, latitude, and population. That's a 3-dimensional embedding. The cities exist as points in a 3D space.
City
Longitude
Latitude
Population (Millions)
Los Angeles
-122.4
37.8
4.2
Seattle
-122.3
47.6
3.9
Vancouver
-123.1
49.3
2.4
Fig 3. Three cities in three dimensions. Move along the longitude axis, then up the latitude axis, then up the population axis — and you've placed each city in its own spot in space. Now imagine doing this with 1,536 dimensions instead of 3. That's what a real text embedding looks like.
Key takeaway: embeddings work the same way — just with more dimensions. More dimensions mean capturing more complex nuances and revealing hidden patterns that would be invisible in 2 or 3 dimensions.
04
Translating data for computers.
Computers struggle to directly understand the way humans communicate — text, pictures, sounds. To help, we turn these formats into numerical representations called "vectors" that computers can process more easily. Same trick. Three different modalities.
Fig 4. Three modalities. Three models. One unified output format. Once everything is a vector, the same math works on all of it — which is why a single AI system can search across audio, text, and video at once.
05
Data has a secret code.
Each modality encodes a different kind of "meaning." Same idea, four different signatures.
Modality 1 · Text
Text embeddings understand how words are related.
Words like "king" and "queen" sit close together
"King" and "car" sit far apart
The geometry IS the meaning
Modality 2 · Image
Image embeddings turn pictures into a special code.
The code remembers what the picture looks like — colors, shapes, smoothness
An orange sits closer to a yellow object than a black one
Visual similarity becomes spatial proximity
Modality 3 · Audio
Audio embeddings turn sounds into a code.
The code remembers pitch, instrument, character of the sound
A piano and a guitar have different codes — even playing the same note
Acoustic identity becomes a vector
Modality 4 · Temporal
Temporal embeddings track changes over time.
Records how heart rate moves during rest, sleep, running
Compare heart rates across activities — spot unusual patterns
Time-series shape becomes a fingerprint
06
Why old NLP failed at meaning.
Before embeddings, computers tried to handle language with two main techniques: n-grams (contiguous sequences of n words — unigrams, bigrams, trigrams) and bag-of-words representations. They worked, mostly. Until they didn't.
The problem: those approaches were context-agnostic. They counted word frequencies. They ignored what words actually meant in context. A vector embedding fixes that.
Aspect
N-grams
Vector Embeddings
Definition
Contiguous sequences of n words (unigrams, bigrams, trigrams)
Dense, continuous vector representations for words or sentences
Representation
Based on word frequencies within n-grams
Captures meaning and context
Limitations
Sparse (high-dimensional vectors with many zeros) · Context-agnostic · Ignores word order
Fig 5. Same string of letters, two different points in space. The whole reason embeddings unlocked modern NLP is that they finally taught machines what every human already knew — context changes meaning.
07
What you can actually do with embeddings.
Once your data is in vector form, a whole catalog of capabilities unlocks. Six come up most often.
Use case 1
Finding similar things — semantic search.
Embeddings help find similar words, documents, or even products. The classic example: news articles about the same topic. Or — "healthy breakfast options" retrieves content like "nutritious meals." Even though the words are different, the meaning is close.
Use case 2
Organizing data — automatic categorization.
Embeddings group similar things together and help label them — teaching computers how to sort items automatically. In a customer service use case, embeddings can categorize and retrieve similar inquiries and pain points, leading to faster resolution.
Use case 3
Better search engines.
Embeddings make search engines smarter. They can find what you're looking for even if you don't use the exact same words as the underlying content.
Use case 4
Smart recommendations.
Websites use embeddings to suggest things you might like. Watch a certain kind of movie and they'll suggest similar ones — because the movies are nearby in vector space.
Use case 5
Seeing the big picture.
Embeddings can be turned into pictures — visualizations — to see how different pieces of data relate to one another at a glance. That's how you find the unexpected clusters.
Use case 6
Faster learning.
Embeddings let computers use what they've already learned for new tasks. The model trained for one job can be repurposed for the next — so it learns faster.
08
Where do you put a billion vectors?
Once you've embedded everything, you need somewhere to store, index, and search across massive datasets of unstructured data. That's a vector database — purpose-built for this exact job.
Fig 6. Two databases. Two different jobs. The traditional one finds exact matches. The vector one finds meaningful neighbors. Both are useful — for different things.
Where vector databases shine — five popular use cases.
LLM Retrieval Augmented Generation (RAG): powering advanced chatbots and generative AI systems that need to access and process vast amounts of information. Embeddings help retrieve the most relevant vectors (top K) to ground LLM responses in accurate, contextually rich data.
Question and answer systems: enabling accurate and relevant responses to user questions.
Recommender systems: tailoring suggestions (products, content, etc.) based on user preferences and similarity analysis.
Semantic search: providing search results based on the meaning and context of the query, not just keywords.
Image, video, and audio search: finding similar media based on visual or audio characteristics.
09
The architecture that put embeddings on every roadmap.
If you've heard of RAG — Retrieval-Augmented Generation — you've heard of the architecture that made vector embeddings business-critical. Here's how it actually works.
Fig 7. The five-step RAG flow. The LLM doesn't know your private data — but it doesn't have to. The vector DB retrieves the relevant docs, the Q/A system stitches them into the prompt, and the LLM reasons over both together. Embeddings are the bridge.
10
A worked exercise — see it for yourself.
The deck closes with an exercise. Take seven words. Cluster them.
The list: [sciences, weather, institute, college, school, university, climate]
The challenge: arrange them into two clusters — one for education, one for weather. You can probably do this in your head. The question is whether the math agrees.
Here are the actual 3-dimensional embeddings from the deck:
Word
Embedding (3-dim)
sciences
[0.7, 0.5, 0.3]
weather
[0.2, 0.7, 0.5]
institute
[0.75, 0.4, 0.25]
college
[0.65, 0.35, 0.4]
school
[0.6, 0.3, 0.45]
university
[0.7, 0.45, 0.35]
climate
[0.15, 0.65, 0.55]
Fig 8. The math agrees. sciences, institute, college, school, university sit in one neighborhood; weather, climate sit in another. The embeddings encode meaning even at just 3 dimensions — and the distance between clusters is itself a measurement of the semantic gap.
Key takeaway: the embedding analysis reveals that words related to education share similar numerical representations, forming a distinct cluster — and the same applies to weather-related terms. Embeddings capture these nuances of meaning, which can be far more powerful than simple keyword analysis.
11
Eight key future trends.
Where embeddings go next, in the deck's words.
Cross-modal embeddings to handle text, image, audio together
Integration with quantum computing to accelerate similarity search
Ethical AI to reduce bias
Continuous learning to adapt to new data dynamically
Explainable embeddings to understand relationships
Integrating with AI agents
Unsupervised learning enhancements using embeddings
Ensemble RAG
12
Five takeaways.
Vector embeddings are the foundational trick. Bridging the gap — translating various types of data (words, images, etc.) into a format that computers can easily work with. Understanding relationships — embeddings aren't just about the data itself; they capture how different pieces of data relate to one another. Unlocking generative AI — embeddings empower many types of generative AI, where the goal is to create new things (text, images, code, etc.). Condensing information — instead of dealing with complex raw data, embeddings provide a compact, meaningful representation. Powering data-driven decisions — by understanding data through embeddings, we can make informed decisions and create innovative solutions.
And on the business side: smarter search, deeper insights — find documents, products, or information based on true meaning, not just keyword matches. Enhanced customer understanding — analyze feedback, reviews, and social media sentiment with nuance for actionable insights. Streamlined processes — automate tasks that rely on understanding language, from support ticket routing to content summarization. Competitive edge — extract valuable information and patterns from text data that traditional methods miss.
The next time someone says "just embed it," you'll know exactly what they mean — and exactly what makes it work.
Three questions to ask your team next.
The deck closes with three questions to spark the right conversations. Use them. They surface where embeddings can deliver the most value in your organization.
Challenges: "What are some current tasks where our ability to understand language is a bottleneck?" This surfaces pain points embeddings might address.
Data: "What kinds of text data do we have that might be underutilized — customer support, market search, compliance, etc.?"
Feasibility: "Are there areas where a small-scale embedding project could be a good proof-of-concept?" This promotes actionable next steps.
#04 · 4.c · Google Cloud Next '26 · Recap
Everything Google just announced. Translated.
Once a year, Google Cloud puts every product team on a stage in San Francisco and says "this is what we believe the next twelve months of enterprise AI looks like." Next '26 was that stage. Six layers. Hundreds of announcements. One thesis: the Agentic Enterprise — where intelligence meets action. This is that 71-slide field report, organized by Google's own stack and translated into something a delivery lead can actually use on Monday.
71 slides · partner field report6 layers · the Google AI stack5 Google alliance authors
SourceGoogle Cloud · Next '26 official recap·Delivered by the Google alliance team — Anil Mehta, Blaise Abderholden, Chase Crowson, Nishant Kulkarni, Anjana Nandi. Proprietary to Google Cloud; internal Accenture distribution only. All product names, customer stories, statistics, and launch-stage indicators (GA / Preview / Pre-announcement) reproduced from the source deck.
Watch first — 8 minutes · narrated walk-through of Google's six-layer agentic stack
01
The thesis: where intelligence meets action.
Last year, the keynote story was models. This year, the keynote story is agents — and Google's framing for it is sharper than most. "The Agentic Enterprise at scale." Context for every action. Agents for every process. Intelligence for every person. Success for every industry.
Strip the marketing varnish and the underlying claim is concrete: agents only matter if they can act — read your data, hold context across tools, follow policy, and finish work without supervision. That's the sentence the entire deck is engineered to defend, layer by layer.
Google's structural argument for why they are the partner to build this on rests on three pillars they repeated all week: full-stack co-design (every layer optimized for AI together), multicloud-by-default (their tools work where your data already lives), and enterprise-ready hyperscaler (resilience, scalability, security, sovereignty). The line they kept hammering: "Google Cloud is the only provider to offer first-party solutions across the entire AI stack."
Fig 1. Google's stack, in their own words. Read it top-down (where work happens) or bottom-up (what makes it possible). Every layer below has its own product slate — and every layer's headline announcement at Next '26 is built to make the layer above it more capable.
02
The big rebrand: Vertex AI is now Gemini Enterprise.
If you take only one thing from Next '26, take this: Google has unified its AI portfolio under a single Gemini Enterprise umbrella. The business-user app, the developer platform formerly known as Vertex AI, and the customer-experience suite are now one named system: Gemini Enterprise, Gemini Enterprise Agent Platform, and Gemini Enterprise for Customer Experience.
The plain-English version: "Vertex AI" is now "Gemini Enterprise Agent Platform" — and it's no longer pitched as a model-serving platform with some agent features tacked on. It's pitched as the place you build, scale, govern, and optimize agents, with the old Vertex capabilities (Model Garden, Model Builder, Agent Builder) folded inside.
Google's framing for the platform is a four-word sentence — and the architecture makes good on each verb:
Fig 2. The four-pillar story Google told all week. Build covers ADK, Agent Studio, and the Agent Garden of pre-built agents. Scale is Agent Runtime — sub-second cold starts and Memory Bank for long-term context. Govern is the new identity, registry, and gateway primitives that make zero-trust enforceable per agent. Optimize is simulation, evaluation, and observability — the operations layer most agent platforms still skip.
The customer logo wall on this slide reads like an enterprise-AI honor roll: L'Oréal, Citi, Color Health, Bloomberg, Deutsche Bank, Goldman Sachs, Mercedes-Benz, PayPal, Reddit, ServiceNow, Snyk, Toyota, Unilever, Wayfair, WPP, Yahoo. The two stories Google chose to lead with: L'Oréal built a proprietary "Beauty Tech Agentic Platform" on the Agent Platform with ADK; Citi launched Citi Sky, an AI wealth platform that is now proactively handling 90% of rollovers via the AI assistant. That second number is the kind of receipt a CFO can act on.
03
Agentic Taskforce: the front door for everyone else.
If Agent Platform is for developers, Agentic Taskforce is for everyone else — and Google split it into two distinct products: the Gemini Enterprise app (where employees create and orchestrate agents) and Gemini Enterprise for CX (where the same agents serve customers). The two share the same Agent Platform plumbing underneath. That symmetry is the whole point.
The headline features inside the Gemini Enterprise app are a tour of every agent UX pattern of the past year, packaged together:
Agent Designer(private preview) — anyone can build complex multi-system workflows in natural language. The pitch is "low-code agent creation without the bottleneck of asking IT."
Canvas Mode(private preview) — an interactive co-creation editor for Docs and Slides that pulls in your work and personal context. M365 interoperability means you can export to Microsoft Office formats — a clear shot at Copilot.
Projects in Gemini Enterprise(experimental) — a "shared brain" for teams that strictly grounds the AI in explicitly added files, preventing context loss and irrelevant hallucinations.
Inbox in Gemini Enterprise(experimental) — a unified hub for managing long-running agents at scale, with status alerts via email and chat.
Skills(experimental) — codify your unique expertise into reusable Skills, invokable anywhere you use Gemini.
Long-running Agents(experimental) — multi-step workflows like end-to-end financial reconciliation or sales-prospect sequencing without constant human supervision.
The CX side is where Google is making its sharpest competitive claim: "the only platform that seamlessly unifies shopping and service." The product suite is Omnichannel Gateway → CX Agent Studio → AI Commerce Search → Agent Assist → Conversational Insights — covering the full arc from intent-aware search to live agent coaching. The receipts on this slide are the quietly impressive part:
Fig 3. The CX receipts. Humana's 80 million calls per year is the kind of scale that makes the Agent Assist story credible — that's not a pilot, that's production.
04
Workspace: the agentic operating system for work.
The Workspace announcements are where the deck stops being abstract. Workspace Intelligence is the central claim: a secure system that "inherently understands complex semantic relationships within your specific work ecosystem" — apps, collaborators, domain knowledge — so you don't have to repeat context in tasks. In English: your agents already know who your team is and what you're working on.
The interface that exposes this to users is Ask Gemini in Google Chat(preview) — pitched as "a unified command line for all of your work." Three things make it land:
A daily briefing that surfaces important tasks, unread threads, and urgent action items.
Skills in Workspace — completing complex tasks like generating documents and slides directly from chat.
Expanded third-party connectors — Gemini now bridges Workspace content with external tools like Asana, Jira, and Salesforce. This is the connector breadth that's been the missing piece versus Microsoft 365 Copilot.
The new in-product AI features cover the full Workspace surface: Docs Enhancements generates infographics and triages documents from comments. Slides Generation produces full editable decks in one shot using shared context. Interactive Canvas in Sheets builds spreadsheets via natural language and creates interactive mini-apps (dashboards, kanban boards) on top of live data. Drive Insights & Projects centralizes file context for Gemini. Avatars in Vids(GA) converts presentations into videos with branded avatars including company logos and backdrops.
Two more bets worth flagging:
Workspace MCP Server(public preview) — lets developers bring advanced Workspace capabilities (synthesizing Drive documents, drafting Gmail responses, managing Calendar and Chat logic) directly into their AI applications and agents within a secure, open framework. This is a meaningful bet on MCP as the agent-tool standard.
Rapid Enterprise Migration with Workspace(preview) — Google's claim is that migrating from Microsoft 365 to Workspace is now up to 5× faster with a new cloud-based data import service plus AI-powered Office macro converter, Office file editing in Gmail, and redlining in Docs. Read this as the M365-displacement play getting sharper teeth.
And the security/governance posture caught up to the agent story: AI control center, regional data locking (US and EU now, Germany and India coming), and client-side encryption that lets you "authoritatively deny access to any agent and any entity, including Google itself." That last clause is unusually direct phrasing for a hyperscaler.
05
Agentic Defense: the SOC gets a fleet.
The security layer is where Google's Wiz acquisition starts paying off in the keynote. The Wiz AI-Application Protection Platform (AI-APP) went GA — agentless visibility into AI applications across any CSP, hosted, custom code, cloud and PaaS. And Wiz introduced a color-coded fleet of AI agents that maps neatly to a real SOC's day:
Fig 4. Wiz's color-coded agent fleet. The pattern is the same one Google used elsewhere all week — specialize agents by job, then put a workflow agent on top of them. The Triage and Investigation Agent in Google Security Operations did the same thing on the broader SecOps platform — Google says it has triaged 5+ million alerts, turning a 30-minute analyst job into roughly one minute.
Two other security stories worth reading carefully:
Google Cloud Fraud Defense(pre-announcement) — explicitly framed as "the evolution of reCAPTCHA", repositioned as a unified trust platform for the agentic web. The single layer verifies humans, bots, and autonomous AI agents across the entire digital commerce journey from registration to payment. Read between the lines: as agents start buying things on behalf of humans, "is this traffic legit?" becomes a much harder question — and Google wants to be the one answering it.
Dark Web Intelligence(preview) in Google Threat Intelligence — Gemini-powered processing of 10 million dark web events daily at 98% accuracy, dynamically profiling each customer's brand and assets to surface relevant data leaks and insider threats. Stops attacks before the first match is struck, in their phrasing.
06
Agentic Data Cloud: the context engine under everything.
Every agent claim above only works if the underlying data layer can keep up. The Agentic Data Cloud announcements are dense — six product families with a slate of features each — but the through-line is consistent: turn the data platform into something agents can use directly, without a human-built pipeline in between.
BigQuery got the headline numbers. Fluid Scaling with true per-second billing claims up to 34% cost savings on dynamic workloads. Advanced Runtime Optimizations claim up to 200× faster queries with no schema or code changes — and a 35% YoY improvement in query speed and 40% YoY reduction in query processing costs. Native multimodal processing via ObjectRef and ai.parse_document lets developers parse and analyze documents alongside structured data inside the Knowledge Catalog. TimesFM and Tabular FM bring zero-shot forecasting and tabular classification directly into BigQuery — no model training required.
The single most important new product on this layer is the Knowledge Catalog(GA), framed as "always-on enterprise semantics" — a dynamic context engine that replaces static data dictionaries, extracts entities, resolves conflicting definitions, and maps complex business relationships. The Deep Research Agent in Gemini Enterprise natively leverages it. Bloomberg Media's CTO is quoted as the proof point — they unified enterprise metadata and business context through Knowledge Catalog to launch their Data Access AI Agent. Spotify's CTO appears two slides later citing Apache Iceberg interoperability.
Other announcements worth tracking by name:
Lightning Engine for Spark(GA) — vectorized execution engine claiming 4.9× faster query completion than open-source Spark. Unifying lakehouse architecture is pitched at 117% ROI with payback under six months.
Iceberg REST Catalog(preview) — full read/write interoperability between BigQuery, Spark, and third-party OSS engines.
SAP BDC for BigQuery(preview) — bidirectional, zero-copy data sharing between SAP Business Data Cloud and Google's Agentic Data Cloud. Read this as: SAP gravity, no copying required.
Dashboard Agents in Looker (pre-announcement) — natural language questions inside dashboards for context-aware answers. Looker Hosted MCP Server(pre-announcement) exposes Looker's governed semantic layer to MCP-using agents.
AlloyDB AI(preview) supports 10B+ vectors, 6× faster than standard PostgreSQL, processing 100k rows/second for less than 1/10th of a cent. The Open-source MCP Toolbox now integrates 40+ distinct databases.
Spanner Omni(preview) — downloadable Spanner edition that deploys beyond Google Cloud infrastructure. Mercado Libre's senior tech manager is quoted on cross-cloud resilience. Oracle Database@Google Cloud expanded to 20 global regions.
07
Research & Frontier Models: voice gets a face.
Two model announcements headlined this layer — both about conversation, not reasoning benchmarks. That's the tell about where Google thinks the next year's user expectations are heading.
Gemini Live API + Live Avatar(private preview) — the transition from audio-only to face-to-face multimodal AI. Native audio-to-audio reasoning synchronized with real-time video rendering. The framing: "a lifelike, expressive visual presence" instead of disembodied voice.
Gemini 3.1 Flash TTS(preview) — Google's most expressive text-to-speech model, with 200+ audio tags for steering pacing and expressiveness, supporting more than 70 languages. All outputs carry SynthID watermarking. The benchmark slide showed it leading the Artificial Analysis Text-to-Speech Arena Quality Elo at 1211 — narrowly beating ElevenLabs v3, Inworld TTS Max, MiniMax Speech 2.0 HD, and others.
Read these as a single play: by next year, the default support agent, the default training video, and the default product walkthrough will all be able to look at you and respond in your language. If your customer-experience roadmap doesn't have a voice/avatar lane, that's the gap to close.
08
AI Hypercomputer: the receipts under the receipts.
Every agent capability above eventually cashes out in compute, network, and storage. The AI Hypercomputer announcements are where Google made its loudest hardware noise — and the headline is the 8th-generation TPU, split for the first time into two distinct chips with two distinct jobs.
Fig 5. TPU 8 is two chips. TPU 8t is the training powerhouse — Google's claim is months-to-weeks for frontier-model training, with one superpod hitting 9,600 chips and 2 PB of shared high-bandwidth memory. TPU 8i is the inference engine — designed specifically for the agentic-workflow case where long-context decoding chokes on memory bandwidth. Read together: training and serving are now different products with different chips.
Around the TPUs, Google announced the supporting cast in the kind of detail that only matters to people running the workloads — but those are the people writing the checks:
Virgo Network — collapsed-fabric data center architecture with 4× the bandwidth of previous generations, connecting up to 134K TPUs into a single, non-blocking cluster.
Managed Lustre — now delivering 10 TB/s of bandwidth, claimed at 10× faster than last year and 20× faster than other hyperscalers for a single instance. Capacity scaled to 80 PB via C4NX instances and Hyperdisk Exapools.
Cloud Storage Rapid — Rapid Bucket and Rapid Cache. Native PyTorch and JAX integrations. Checkpoint writes 3.2× faster, restores 5× faster with Rapid Bucket.
Compute — new C4N series processing up to 95M packets/sec (40% faster than other hyperscalers, per Google), M4N series with Hyperdisk Extreme delivering 26.57 GiB RAM per vCPU and a 20% Oracle TCO reduction, Axion N4A Arm-based processors, Axion C4A.metal bare metal, H4D with Cloud RDMA, and pre-announcements for Z4D and Z4M.
GKE Agent Sandbox — gVisor kernel isolation (the same tech securing Gemini), launching up to 300 sandboxes per second per cluster, with 30% better price-performance than competitors when running AI agents.
GKE hypercluster(private GA) — single conformant GKE control plane managing millions of accelerators across 256,000 nodes spanning multiple GCP regions. GKE Pod Snapshots reduce pod start-up time by up to 81% for large models like Llama 3.2 70B and shrink the overprovision buffer by 92%.
Cloud Run — now serving up to 70B+ parameter models on serverless via NVIDIA RTX PRO 6000 Blackwell GPU, with full managed remote MCP server, Cloud Run Instances for long-running agents, and Cloud Run Sandboxes for isolated code execution.
Google Distributed Cloud — Gemini deployable in connected or fully air-gapped environments. Support for NVIDIA Blackwell B200/B300 GPUs, A4/M2/M3 machine families, 6 PB object storage per zone, and a new sovereign agentic AI architecture that keeps workflows entirely within the customer's secure organization boundary.
Networking — Agent Gateway as the "air-traffic controller" for agentic traffic, natively understanding MCP and A2A protocols. Cloud Network Insights for end-to-end visibility. GKE Inference Gateway with multi-region support, predictive latency boost, and disaggregated serving — Google's quoted result: "reduced Time to First Token (TTFT) latency by over 35% for Qwen3-Coder."
09
The launch-stage cheat sheet.
The deck uses three tags consistently — GA, Preview, and Pre-announcement — and they matter for sequencing. GA is now. Preview is months. Pre-announcement is "we want this on your roadmap, not yet on your contract." Here's the same content sorted by what you can actually deploy versus what you're committing your roadmap to:
Fig 6. Same announcements, sorted by what you can actually build with today. The GA column is the deal-grade list. The Preview column is your pilot list. The Pre-announcement column is your strategy-deck list.
10
The bottom line.
Stripped of the keynote choreography, Next '26 said three things that matter for any team building on Google Cloud over the next twelve months:
Vertex AI is now Gemini Enterprise Agent Platform. Update your slides, your statements of work, and your customer-facing decks. The capability set is broader than Vertex was, but every Vertex investment carries forward — Model Garden, Model Builder, and Agent Builder are folded inside.
Agents are governed objects now, not configurations. Agent Identity, Agent Registry, Agent Gateway, Agent Simulation, Agent Observability — these aren't features, they're a fleet-management posture. If you're proposing an agent-heavy architecture and your governance story is a sentence, your governance story is too short.
The infrastructure receipts are real, but most of them are pre-announced. TPU 8t/8i, Virgo Network, the new compute series — these are roadmap items, not GA hardware. Use them in strategy decks; build pilots on what's GA today (BigQuery Fluid Scaling, Knowledge Catalog, Workspace Intelligence, Wiz AI-APP, Triage Agent).
The competitive read: Google's strongest move at Next '26 was the unification under Gemini Enterprise — both as a brand and as an architecture. The story they're telling against Microsoft is no longer "we have better models" — it's "we are the only provider with first-party solutions across the entire AI stack." Whether that claim survives contact with a real M365-shop procurement cycle is the question every account team will be running into next quarter.
Read this with Door A and Door B. The bootcamp teaches your team to command agents. The primer explains what's under the agents. This door tells you what one of your three biggest partners is shipping — so when a client asks "what does your Google bench look like on agent governance?", you have something better than a brochure to point at.
Want the original 71 slides?
This recap reproduces the structure, claims, and customer stories from Google Cloud's official Next '26 deck. For the original — including embedded blog links, session videos, and the customer reference library — reach out to your Google alliance contact.
#04 · 4.a · Citizens · Human-in-the-Lead Training · Day 0 · May 2025
Before you build your first agent. The foundations.
Day 0 is the first day of the Agentic AI bootcamp — and the day everyone wishes they'd had before they started. 97 slides covering what an agent actually is, why "agentic" is more than marketing, the SPAR framework that anchors everything else, and the eleven topics that map onto the rest of the week. Run as a live track for Citizens; reusable as a foundations primer for any new team after them.
97 slides · taught live11 topics on the agenda5 agentic levels mapped1 Citizens cohort
CuratorMo Nomeli·CAAI Global Lead AI Learning & Emerging Tech · Source: Intro to Agents — Day 0 · Citizens · Human-in-the-Lead Training · May 2025
01
"Is everything with an LLM an agent?"
That's the question Day 0 opens with — and it's the right one. Because the answer is no.
An LLM in a chat box is not an agent. An LLM that retrieves a document is not an agent. An LLM that calls a function is closer, but still not quite. The line between "calling an LLM" and "running an agent" is fuzzy enough that most teams build for months without agreeing on what they're building. Day 0 fixes that, in two moves: define the term, then place every system on a spectrum.
Once everyone in the room knows what counts as an agent — and what level of agent they're actually building — the rest of the week stops being a vocabulary fight and starts being engineering work.
02
Five levels of agentic.
Most "agents on the market" sit at Level 2 or 3. A few specialized systems reach Level 4 in narrow domains. Level 5 is hypothetical. Knowing the level you're at — and the level you're targeting — kills more debates than any other framework on Day 0.
Level 1
Rule-Based Automation
Fixed rules and workflows. Repetitive tasks like data entry or form processing. Like cruise control in a car.
No adaptability
Full human oversight required
Deterministic by design
Level 2
Intelligent Automation
ML, NLP, and computer vision processing unstructured data. Basic predictions. End-to-end automation, but inside rigid parameters.
More capable than Level 1
Still needs human supervision
Bounded by configured rules
Level 3
Agentic Systems
Plan, reason, generate across modalities. LLMs + memory + reinforcement learning. Customer support, financial analysis in digital domains.
Operates well within predefined boundaries
Struggles with novel/complex situations
Most enterprise agents today live here
Level 4
Semi-Autonomous Agentic Systems
Comparable to self-driving cars in mapped areas. Independently pursue goals, adapt strategies, manage workflows. Still needs domain constraints.
Adjusts based on feedback
Limited and defined domains only
The current frontier of production systems
Level 5
Fully Autonomous Systems
Hypothetical. Understands any goal, develops strategies, learns from experience, adapts across domains without human input. General AI.
Value-aligned decisions
Seamless cross-system integration
Not real yet — and possibly never
03
SPAR: the four-beat agent loop.
Once you know what level you're building at, you need a mental model for what an agent actually does. Day 0 uses SPAR — the simplest loop that captures every real agentic system.
The SPAR cycle — every agent runs this loop
S
Sense · Gather information, input, and context. Check what is needed to complete the task.
P
Plan · Think, analyze, map what approach fits the criteria. Outline specific steps to accomplish the goal.
A
Act · Execute the plan — usually requiring coordination across tools, assets, and action sequences in a defined environment.
R
React · Learn from experience. Reflect on results. Did the outcome meet the criteria? Did it satisfy the goal?
The integration of Sense → Plan → Act → React is the fundamental shift away from traditional automation. Linear scripts don't react. Agents do.
Throughout the rest of the week, every advanced topic — multi-agent systems, tool use, planning, evaluation — gets traced back to which beat of SPAR it lives in. That's the reason this framework comes first.
04
A single agent has five components.
Zoom into any agent — single or multi — and you'll find these five organs. Day 0 introduces them; the rest of the week deep-dives each one.
The five-component anatomy
Component
What it does
Where the rest of the week goes
Profile & Persona
Who is this agent? What role does it play? What rubric or grounding defines its voice?
Day 0 covers profile generation: human-crafted vs LLM-generated vs data-generated.
Action & Tool Use
What can the agent do? Which APIs, scripts, knowledge bases, and external systems can it reach?
Tool Use deep-dive (slides 66-96). RAISE framework, the Detective's Dilemma, tool overload.
Knowledge & Memory
What does the agent retain beyond the immediate chat? Other agent conversations, API instructions, domain knowledge.
Embeddings, RAG, knowledge graphs — covered later in the week.
Reasoning & Evaluation
Zero-shot, few-shot, chain-of-thought, tree-of-thought. Plus self-consistency and LLM-as-judge for evaluation.
Reasoning + benchmarking sessions later in the week.
Planning & Feedback
Single-path (chain-of-thought) vs multi-path (tree-of-thought). Planning with vs without human feedback.
Planning gets its own deep-dive. Feedback threads through Privacy/Safety/Ethics.
05
The core agent cycle.
SPAR is the abstract loop. The core agent cycle is what it looks like when you actually instrument it with software components.
1
Perception
The agent receives and interprets incoming requests — text, voice, API calls — and extracts user intent.
2
Reasoning
It analyzes the collected information, identifies patterns, and formulates a plan. Evaluates options and seeks clarification when needed.
3
Action
The agent executes the plan: retrieves data, generates a response, triggers external scripts, calls tools.
4
Observing & Learning
It assesses results, refines its approach for future tasks, and logs new knowledge or mistakes — feeding the loop back into supervised, unsupervised, or reinforced learning.
06
Tool use, taught through a crime scene.
The longest section of Day 0 — about 30 slides — is on tool use. The teaching frame is "The Detective's Dilemma": you're a detective with too many tools, the wrong tools, or no tools at all. Sound like your AI agent project?
RAISE Framework
The four parts of an agent's tool ecosystem.
Controller — the dialogue + LLM core that decides what to do next
Working Memory — system prompt, task instructions, conversation history, scratchpad
Tool Pool — databases, scripting, interpreters, knowledge bases, external AI tools
Example Pool — <Q, A> pairs the agent can retrieve from when planning
The Tool Use Lifecycle
From request to result.
Query arrives → Controller parses
Retrieve relevant examples from Example Pool
Plan actions, write to Working Memory
Execute against the Tool Pool, observe results
Loop until the goal is met or escalation triggered
Mo Tools, Mo Problems
The minimalism principle.
Avoid tool overload. Each tool added increases the agent's choice-set exponentially.
How agents see tools. Tools are not menus — they're descriptions the LLM has to understand.
Tool resilience. Tools fail. Plan for failure modes from day one.
Bridge tooling. Sometimes you need a tool to call a tool to call a tool. Sometimes you shouldn't.
07
What goes wrong (and why).
Day 0 names the failure modes early so the rest of the week can focus on countermeasures. Eight categories show up over and over in real production systems.
Decision visibility is limited. Why did the agent do that? Often unanswerable.
Data & Model Dependency
Flawed data propagates errors through every downstream agent action.
Coordination Complexity
Multi-agent collaboration bottlenecks become increasingly difficult as you scale.
Non-Determinism
Unpredictability causes cascading errors. Same input, different output.
Limited Customization
Rigid templates limit adaptation to specific business contexts.
Integration & Scalability
Plugging into existing enterprise systems is harder than the demos suggest.
Ethical Risks
Autonomy introduces trust issues. Who's responsible when the agent acts wrong?
08
What we tell teams on Day 0.
Day 0 closes with concrete advice: ten best practices distilled from production deployments, plus a tour of the platform landscape teams will actually pick from.
The ten Day 0 best practices
Build discipline
Foundations
Start simple. MVP-first, basic planning, no premature complexity
Clear success criteria. Define specific goals upfront
Constrained environments. Develop in controlled settings to manage non-determinism
Leverage existing tools. Reuse, don't reinvent
Operating posture
Production
Robust orchestration. Strong management for agent collaboration
Closes the Day 0 loop — sets up the Privacy/Safety/Ethics deep-dive later in the week
The platform landscape — what teams will actually pick from
Platform
Strength
Watch-out
LangChain
Flexible LLM workflows, modular, large community.
Developer-focused; higher technical barrier.
CrewAI
Multi-agent collaboration with task-based roles. Code + visual.
Effective for "crews" but can be opinionated.
AutoGPT
Low-code, drag-and-drop visual editor for continuous agents.
Can be challenging to set up reliably.
SuperAgent
Open-source framework + cloud platform, optimized for fast iteration.
Developer-centric; lacks visual builder.
MetaGPT
Simulates a "development team" to generate full-stack prototypes.
Niche focus on software development specifically.
CAMEL
Communication and negotiation between agents for adaptive decisions.
Primarily research-grade.
09
What Day 0 sets up.
Day 0 isn't about building anything. It's about arriving on Day 1 with the same vocabulary, the same mental model, and the same definition of "agent" as everyone else in the room.
From here the program goes deeper across the three live days of the Citizens AI Academy Track C (September 2025). Day 1 is Intro to Agents + Reinventing Banking. Day 2 is Tool Use + Reasoning. Day 3 is Memory + Planning + the A.G.E.N.T design framework. Each day builds on the SPAR cycle and the five-component anatomy you just learned.
Ready to run Day 0 with your team?
The full deck — all 97 slides, including diagrams, agenda, the SPAR walkthrough, the Detective's Dilemma narrative, and the platform landscape — is available for download. The same content has been delivered live to Citizens; reach out to discuss running it for your team.
#04 · 4.a · Citizens AI Academy · Track C · Day 1 · September 2025
From prompts to agency.
Day 1 is where the cohort moves from "calling an LLM" to "running an agent." Three sessions in the morning — Intro to Agents, Understanding Agents, Reinventing Banking with Agents — close out with a live KYC multi-agent demo on Accenture's AI Refinery. Then the Pod runs Hypersprint #1 against the real Citizens backlog.
107 slides · taught live3 sessions in the morning1 KYC demo · AI Refinery1 Hypersprint vs Pod backlog
CuratorMo Nomeli·CAAI Global Lead AI Learning & Emerging Tech · Source: Citizens AI Academy · Track C · Day 1 · September 2025
01
"Find me the best mortgage."
Day 1 opens with a banking scenario that lands harder than the generic "book a vacation" example. You are a customer. You want a mortgage on the house at 123 Main St. Lowest rate. Close in 25 days. The bank's digital assistant builds you a perfect plan in seconds — partner lender, rate sheet, document checklist, timeline.
Then reality hits. The promotional rate expired yesterday. Your "verified funds" sit behind a 3-day settlement period. The recommended insurer doesn't cover your flood zone. Now the customer does the real work — manually hunting for new rates, scrambling to liquidate, finding a different insurer. The plan looked perfect because it never had to operate in the real world.
That's the gap Day 1 names: between generative AI (which gives you a plan) and agentic AI (which can execute the plan, react when reality doesn't match, and finish the job). For a Citizens cohort, this isn't theoretical — it's the difference between a chatbot that sounds smart and an agent that actually closes the loan.
02
Agentic AI vs traditional AI vs chatbots.
Day 1 makes the team draw the lines clearly. Otherwise the rest of the week becomes a vocabulary fight.
Three categories — what each one actually does
Category
What it does
Banking example
Traditional AI / ML
Single prediction or classification. Stateless. Same input → same output.
A fraud-scoring model that returns a 0–1 risk score on a transaction.
Chatbot (GenAI)
Generates text. Can converse. No memory across sessions. No actions in external systems.
A customer-service bot that answers FAQ but can't actually unlock your account.
Agentic AI
Generates a plan, executes it via tools, observes results, adapts. Has goals, memory, and the ability to act in external systems.
An onboarding agent that pulls KYC docs, validates them, runs sanctions screening, and only escalates the edge cases to a human.
03
SPAR — the anchor, taught again.
SPAR is taught on Day 0 as the foundations. Day 1 brings it back as the working frame for the rest of the week. Every later concept — Tool Use, Reasoning, Memory, Planning, Multi-Agent — maps back to one or more SPAR phases.
SPAR · the four-phase agent loop
S
Sense · gather information, input, context. Check what's needed to complete the task.
P
Plan · think, analyze, map an approach. Outline specific steps to accomplish the goal.
A
Act · execute. Coordinate across tools, assets, action sequences in a defined environment.
R
React · learn from experience. Reflect on results. Did the outcome meet the criteria?
The integration of Sense → Plan → Act → React is the fundamental shift away from traditional automation. Linear scripts don't react. Agents do.
04
Five levels of agentic — placed on a banking map.
The Agentic Progression Framework runs Levels 1 through 5. Most production banking systems live at Level 2 or 3. Knowing which level you're targeting kills more debates than any other framework on Day 1.
Level 1
Rule-Based Automation
Fixed rules and workflows. Repetitive tasks like data entry, form processing. Like cruise control.
Comparable to self-driving cars in mapped areas. Independently pursue goals, adapt strategies, manage workflows.
Adjusts based on feedback
Limited and defined domains only
The current frontier of production
Level 5
Fully Autonomous
Hypothetical. Understands any goal, develops strategies, learns from experience, adapts across domains without human input.
Value-aligned decisions
Seamless cross-system integration
Not real yet — and possibly never
05
Three ways agents collaborate.
The afternoon "Understanding Agents" session adds a frame for what comes later in the week: how multiple agents work together. Three patterns — each with a banking equivalent.
Pattern 1
Centralized
One orchestrator agent at the top, all decisions and routing flow through it. Specialists below execute.
Easy to reason about
Single point of failure
Banking parallel: a Loan Origination Manager calling out to credit-check, valuation, and KYC sub-agents
Pattern 2
Decentralized
Peer agents communicate directly. No top-down router. Coordination via shared protocol or message bus.
More resilient
Harder to audit
Banking parallel: peer fraud-detection agents sharing flags across regions
Pattern 3
Hierarchical
A tree. Top-level coordinator, sub-orchestrators, leaf specialists. Decisions cascade through tiers.
Scales to complex workflows
More moving parts to test
Banking parallel: regulatory reporting where region → product → entity all roll up
Open standards matter here. The protocols (MCP, A2A) that let these patterns work without each agent inventing its own dialect are taught later in the program.
06
Reinventing banking — where agents land.
The afternoon track maps where agentic AI actually lands in financial services. Six functions, each with a value proposition the cohort can take back to their Pod.
Function
Where the agent lives
Value delivered
Sales & Service (Banking)
Quick access to product info, contextual recommendations, account servicing.
Effective underwriting, proactive risk assessment vs reactive remediation.
Reduced risk · better data protection · faster processing.
Technology Development
Streamlined software development, code generation, test scaffolding.
Improved workflow · increased efficiency · shorter dev cycles.
07
The KYC and AML deep dive.
Two banking workflows get the deep treatment on Day 1: Anti-Money Laundering screening and Know-Your-Customer onboarding. Both are high-volume, high-stakes, and well-suited to a Level-3 agentic system.
AML / Sanctions
Revolutionizing alert adjudication
Automate high-volume sanctions, PEP, and adverse-media alert screening 24/7 with high consistency
Generative AI inside agents drafts initial SAR narratives, aiding investigators
Manages rising alert volumes without proportional staff increases
KYC / KYB
Streamlining customer onboarding
Automate data gathering and verification from diverse sources during onboarding
Intelligent document processing extracts and validates info for due diligence
Deeper risk insights by analyzing complex ownership structures, multi-source screening
Continuous, agent-driven monitoring ensures ongoing compliance and timely risk reassessment
08
The road-ahead reality check.
Day 1 doesn't close on hype. It closes on the operational risks the cohort needs to keep front of mind for the rest of the week.
Risk
What it looks like in banking
Data, talent, integration
Most production agentic systems stall on data quality, scarce ML/AI talent, or integration with legacy core-banking systems — not on model capability.
Regulatory horizon
Banking regulators expect explainability, decision-trail audits, and clear human accountability. Agents that can't show their work fail audit.
Trust & transparency
Why did the agent decide that? If the answer is "because the LLM said so," you have a problem. Decision logs are non-negotiable.
Ethical & operational
Bias propagation in credit decisions. Hallucinated SAR narratives. Customers with no clear path to dispute an agent's decision.
Job impact
Agents augment investigators and analysts more than they replace them. The Day 1 framing: "agents handle the volume; humans handle the judgment."
09
What Day 1 sets up.
By the end of Day 1, the cohort has a shared vocabulary (agentic vs GenAI vs traditional ML), a shared frame (SPAR), a shared map (the 5 levels, the 3 collaboration patterns), and a shared business case (KYC and AML, demoed live).
From here the bootcamp goes deeper into each capability. Day 2 attacks tool use and reasoning. Day 3 attacks memory and planning. Each day builds on the SPAR cycle the team locked in today.
Ready to run Day 1 with your team?
The full deck — all 107 slides, including the mortgage hook, the SPAR walkthrough, the 5-level framework, the 3 collaboration patterns, and the AI Refinery KYC multi-agent demo — is available for download. The same content was delivered live to the Citizens cohort in September 2025; reach out to discuss running it for yours.
#04 · 4.a · Citizens AI Academy · Track C · Day 2 · September 2025
Tools. And the power of pause.
Day 2 is two deep dives. Tool Use with Agents in the morning — the Detective's Dilemma, the RAISE framework, "Mo Tools, Mo Problems" minimalism, progressive tool access. Reasoning with Agents in the afternoon — fast vs slow thinking, LLMs vs LRMs, multi-agent reasoning, metacognitive awareness. Hypersprint #2 begins after lunch.
CuratorMo Nomeli·CAAI Global Lead AI Learning & Emerging Tech · Source: Citizens AI Academy · Track C · Day 2 · September 2025
01
Why tools matter — the building blocks of action.
Day 2 opens by tying tools back to the agentic levels from Day 1. Level 1 is a switch statement. Level 2 introduces criteria and decision-making about which tool to call. Level 3 is where the agent actually orchestrates multiple tools — figuring out the order, handling dependencies, making the calls.
The frame: tools are the bridge between abstract goals and tangible outcomes. An agent without tools is a chatbot with goals and no hands. An agent with tools can move money, file SARs, update CRM records, send compliance notifications. Tools are what turn "could" into "did."
And the limit: an agent is bounded by its understanding of the tools' capabilities, when to use them, and how to use them effectively. This is why Day 2 spends a third of its time on tool design — because tool design is agent design.
02
The Detective's Dilemma — taught with banking.
Day 2's central narrative is "the Detective's Dilemma." Picture a banking representative preparing an enhanced-due-diligence reply for a KYC review. The LLM has been trained on Citizens' policies. It outlines internal procedures. It drafts a template response. It explains itself.
And then it stops. Because outlining procedures is not the same as performing them. The agent needs tools — to actually pull the income docs, run sanctions screening, log the case, generate the SAR. Without tools, the LLM is a detective who knows the case backwards and forwards but can't open the evidence locker.
03
"Mo Tools, Mo Problems" — the access paradox.
More tools = more capability. More tools = more failure modes. Day 2 names this paradox directly so the cohort doesn't fall into it.
Take a hypothetical agent with three well-described tools, each with high resilience and detailed descriptions:
The ability to send emails
The ability to query a customer-service database (with access controls scoped to that customer's history)
A connection to the data lake to populate prioritized issues
In theory: the agent can find novel issues in customer-service calls and notify the right authority. Any foreseeable problems? Yes — many. The agent could email the wrong recipient. It could surface a false positive that triggers an investigation. It could inadvertently expose customer data through an over-broad query. Each new tool added increases the failure surface multiplicatively, not additively.
04
The RAISE framework — operationalized.
Day 2 spends real time inside RAISE — the framework that defines an agent's tool ecosystem. Built on top of the ReAct method (Reason + Act in a loop), RAISE adds a memory mechanism that mirrors human short-term + long-term memory.
Component 1
Controller
The dialogue + LLM core. Decides what to do next based on the current task plan and the contents of working memory.
Reads the prompt + history
Generates the next action
Parses tool outputs into observations
Component 2
Working Memory
Short-term scratchpad for the current task. System prompt, task instruction, conversation history, retrieved examples, task trajectory.
Resets per task (or per session)
Bounded by context window
Where the agent's "thinking out loud" lives
Component 3
Tool Pool
Databases, scripting interpreters, knowledge bases, external AI services — the things the agent can actually call.
Each tool has an input/output spec
Each tool has a description the LLM reads
Tool errors flow back as observations
Component 4
Example Pool
A library of past <Q, A> pairs the agent can retrieve from when planning. The agent's long-term reference.
Retrieved on prompt
Injected into working memory
The "I've seen this before" mechanism
RAISE in action — the agentic loop
1
Query arrives → Controller parses intent and writes task plan to Working Memory.
2
Retrieve relevant examples from the Example Pool → injected into Working Memory.
3
Plan actions, write thought to scratchpad, execute against the Tool Pool.
4
Observe results, update Working Memory, loop until goal met or escalation triggered.
RAISE is the operating model for Level-3 banking agents. The Day 2 lab finds the "tool internal monologue" in the running code — making the loop visible.
05
Tools fail. Plan for it.
Day 2 ends the tool track with the operational reality: tools fail. APIs go down. Data is stale. Calls time out. The agent has to be designed for resilience from day one, not as an afterthought.
Strategy 1
Tool Resilience
Build retry logic, fallback paths, and graceful degradation into every tool wrapper. An agent with a flaky API should know to wait, retry, or escalate — not silently fail.
Strategy 2
Progressive Tool Access
Don't give a new agent the keys to everything on day one. Start with read-only access. Then read-write to a sandbox. Then read-write to production with human approval. Then unattended.
Strategy 3
Test, test, test
Adversarial scenarios. Tool-failure simulations. Edge cases. Production agents that have never failed in testing will fail in production. Better to fail in the lab.
06
Reasoning — fast and slow.
The afternoon shifts from "what tools" to "how the agent thinks." Day 2 leans on Daniel Kahneman's two-systems framing, which makes the architectural choice tangible.
System
How it operates
Banking analog
System 1 — Fast
Quick, automatic, pattern-matched. Little effort. The "snap judgment" mode.
Real-time fraud rules — millisecond decisions on transaction approval.
System 2 — Slow
Deliberate, reasoned, multi-step. Like planning a chess move. Higher latency, higher accuracy on novel problems.
Multi-step fraud-pattern investigation across an account history; SAR drafting.
The Day 2 lesson: combine both. Fast checks for routine cases (low latency, high consistency). Slow reasoning for edge cases (high latency, accepted because the case warranted it). Imagine rerouting a $1.2M pharmaceutical shipment to avoid a storm — only to cross routes that violate international transport regulations. That's a System 1 mistake. The Day 2 framing for banking: "think carefully, deeply, and reason thoroughly" — but only when the case earns it.
07
LLMs vs LRMs — the power of pause.
Day 2 introduces Large Reasoning Models as a distinct category from Large Language Models. Both look the same from the API, but they're trained differently and behave differently.
Characteristic
Large Language Models (LLMs)
Large Reasoning Models (LRMs)
Training Data
Vast unstructured text corpora.
Structured data + explicit reasoning frameworks.
Reasoning Depth
Surface-level, statistical pattern-matching.
Causal relationships, systematic analysis.
Adaptability
Generalizes broadly across language tasks.
Specializes narrowly in technical / logic-heavy domains.
Key Strength
Translation, summarization, dialogue.
Math, coding, multi-step decision-making.
Output Type
Probabilistic text outputs.
Deterministic logical conclusions.
The compute model is also different. LLMs got better via train-time compute scaling — more data, more parameters. That curve is hitting limits (finite data, finite compute). LRMs scale via test-time compute — letting the model think longer at inference, exploring more reasoning paths. The "power of pause" is the model spending more inference tokens on hard problems.
08
Many small reasoners beat one big one.
The Day 2 reasoning track ends with a counterintuitive finding from recent research: collaborative debate frameworks of smaller models can exceed the reasoning capacity of a single large LLM — at a fraction of the cost.
Single Reasoner
Scale test-time compute
Give one strong reasoner more inference tokens. Use prompts that elicit deep thinking ("what factors might make this recommendation unreliable?"). Strong baseline.
Debate
Two reasoners, one truth
Even with smaller language models (SLMs), debate frameworks can exceed LLM performance at a 14x cost factor. Diverse perspectives challenge each model's reasoning.
Multi-Agent
Many small + diverse
Smaller, more diverse models with reasoning capabilities. Scale wide instead of scale up. Individually limited; collectively they surpass each other as a team operating at different "thinking" speeds.
Metacognitive awareness is the new horizon. LRMs are starting to surface their own uncertainty — "progress is being made but we need to reconcile these discrepancies." Recognizing uncertainty is the prerequisite for Human-in-the-Loop escalation. When the agent can flag its own confusion, the human review path has a clear trigger. That's the holy grail of explainability and observability rolled into one.
09
What Day 2 sets up.
By the end of Day 2, the cohort has both the action layer (tools, RAISE, progressive access, resilience) and the thinking layer (LLMs, LRMs, fast/slow, multi-agent reasoning) for what they're going to build.
Day 3 brings memory and planning — what the agent knows and how it decides what to do next. The team will need both in their Hypersprint #2 work.
Ready to run Day 2 with your team?
The full deck — all 91 slides, including the Detective's Dilemma, the RAISE framework, "Mo Tools, Mo Problems," progressive tool access, the LLM/LRM comparison, and the multi-agent debate research — is available for download.
#04 · 4.a · Citizens AI Academy · Track C · Day 3 · September 2025
Memory. Planning. A.G.E.N.T
Day 3 is the structural day. Morning: Memory in Agents — the three layers, context windows, long-term storage, feedback loops. Afternoon: Planning Agentic Workflows — when to use agents, when not to, the Three Circles of Opportunity, and the A.G.E.N.T design framework the cohort will use for every agent they build.
CuratorMo Nomeli·CAAI Global Lead AI Learning & Emerging Tech · Source: Citizens AI Academy · Track C · Day 3 · September 2025
01
Memory isn't recording — it's reconstruction.
Day 3 opens with a thought experiment. Think back to a fond memory. What were the sounds? The smells? The conversations in the background? Who was there?
And then the trick: are you remembering the event itself, or your last retelling of it? Most "memories" are actually reconstructions — built from fragments, refined each time you recall them. Memory isn't a recording. It's a story we keep rewriting.
That's the framing for the agent's memory architecture. An agent's memory isn't a transcript of everything it has seen. It's a curated, structured, prioritized representation of what mattered. The Day 3 task: design that curation deliberately, because if you don't, the LLM's context window will do it for you — badly.
02
The three layers of agent memory.
Day 3's core memory model has three layers. Each one solves a different problem; together they make agents that actually learn.
Layer 1
Short-Term Memory
The agent's working scratchpad. Recent interactions, current task context. Ensures contextual continuity within a single session.
Lives in the LLM's context window
Bounded by token limits
Resets between sessions
Layer 2
Long-Term Memory
Persistent storage beyond the session. User preferences, past interactions, workflows, domain-specific knowledge.
Vector stores, knowledge graphs, relational DBs
Retrieved on demand into working memory
Where the agent gets continuity
Layer 3
Feedback Loops
The mechanism that keeps memory useful over time. Refines both short-term and long-term memory, prunes stale info, reinforces what works.
Human-in-the-loop ratings
Outcome-based reinforcement
Memory consolidation: turning experience into knowledge
03
Short-term memory — the context window.
Day 3's short-term-memory section uses a concrete metaphor: picture yourself at a busy intersection in London. The cars are documents. Should you pay attention to pedestrians, red buses, or taxis? Multiple databases are firing queries into the context window at once. Did the model focus on what its scope was? What if it missed the queen walking by?
Real-world impact of short-term memory choices
Decision
What it controls
Banking-stakes failure mode
Context window size
How much text the model can process at once. Newer models (Llama 4, GPT-5) support millions of tokens.
Larger windows can impact performance — model attention degrades. Stuffing more in isn't always better.
Token management
Which tokens to keep, which to summarize, which to evict from the active context.
Critical KYC document evicted to make room for chitchat → due-diligence error.
Landmark events
Tagged moments in the conversation the agent must remember regardless of token pressure.
Customer's stated risk tolerance gets buried in transcript noise → agent recommends an unsuitable product.
Attention mechanisms
How the model weights different parts of the context when generating output.
The framing the cohort takes home: guide short-term memory toward better outcomes. Highlight important info using repetition, clear statements, or explicit tags like <<IMPORTANT>>. In project management, key milestones, decisions, and challenges should be clearly noted without unnecessary detail. The agent reads what you tell it to read — engineer the prompt accordingly.
04
Long-term memory — and why banking needs it.
Day 3 makes the business case for long-term memory with a concrete banking failure pattern: customer uses airline Wi-Fi when logging into the banking portal. The portal flags as unusual login activity from unsecure Wi-Fi. The account is locked. The customer calls and walks through a long process to unlock.
With long-term memory? The agent remembers this customer travels for work, has been to airports 47 times this year, and uses unsecure Wi-Fi in 31% of sessions without incident. The flag never fires. The customer never calls.
Why LTM matters for banking processes
Customer Outcomes
20-30%
Higher customer satisfaction through personalized, natural interactions
Customer interactions build on past experiences
Advisors understand preferences and solve issues smoothly
Operational Quality
50%+
Reduction in error rates (per businesses adopting LTM-enabled AI)
Build data on processes
Longitudinal study of interactions, pain-points, friction
Where Current LLMs Fall Short
Today
Each session is amnesia by default
No native preference recall
Context window ≠ long-term memory
05
Designing long-term memory — five steps.
Day 3 walks the cohort through a five-step build process for LTM. By the end, every Pod has a vocabulary for talking about how their agent remembers things.
1
Select a framework
LangGraph for graph-based memory and orchestration. CrewAI for memory inside multi-agent crews. LangChain for episodic, semantic, and procedural memory modules. LlamaIndex for knowledge-base management.
2
Define memory requirements
What needs to persist? What can be reconstructed on demand? What's transient? Categorize as events, facts, or how-to memories.
3
Build retrieval mechanisms
Vector search (Pinecone) for semantic retrieval. Relational stores for structured data. Graphs (Neo4j) for relationship-heavy queries. Tag everything for explicit retrieval paths.
4
Implement memory consolidation
How does experience become knowledge? Summarization, landmark tagging, periodic distillation. Without consolidation, your LTM becomes a write-only log.
5
Integrate memory with agent reasoning
Memory only helps if the agent uses it. Wire the retrieval calls into the reasoning loop. Make the agent's memory visible in its scratchpad.
06
Feedback loops — and the SAFELOOP discipline.
Day 3 closes its memory section with feedback loops — the third memory layer, and the most operationally risky. LLMs can over-optimize specific metrics through feedback loops, missing the broader context. Without discipline, feedback loops cause behavior drift.
The Day 3 mnemonic — feedback with human oversight — spells out the discipline:
Letter
Practice
Why it matters
S — Supervision
Human oversight prevents unintended outcomes.
Without it, the agent optimizes for the wrong proxy.
A — Alignment
Loops should enhance capabilities while staying ethical.
Performance gains that violate policy are losses.
F — Foresight
Anticipate risks and design carefully.
Most feedback-loop failures are foreseeable.
E — Examination
Regular audits ensure accuracy and catch behavior drift.
Drift is gradual; audits are how you catch it.
L — Limits
Guard against over-optimization of narrow metrics.
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
O — Oversight
Vigilant monitoring is non-negotiable.
Production isn't lab. Real users break things lab tests miss.
O — Outcomes
Measure success against broad goals, not just metrics.
The afternoon shifts from memory to planning. Before the cohort designs an agent, they need to know whether the use case deserves one. The Day 3 framework: The Three Circles of Agentic Opportunity.
Circle 1
Effort — is it worth it?
Practical, straightforward process
Team is ready and willing to adapt
You can start small and scale up
Potential benefits justify investment
Implement without disrupting core operations
Circle 2
Feasibility — can it be done?
Tasks follow clear, consistent rules and repeatable steps
Data and processes are organized and accessible
AI can produce reliable, verifiable outcomes before human review
Circle 3
High Impact — will it matter?
Automating tasks boosts efficiency and frees up skilled workers
Prioritize repetitive, time-consuming tasks like data entry and reporting
Automation should align with strategic goals, not just convenience
The sweet spot is the intersection of all three circles — high-value, feasible, efficient to automate, and the kind of task teams frequently complain about. Day 3's "Agentic AI Prioritization Metric" is a 2x2 the cohort actually votes on:
Quadrant
What it is
What to do
High Impact, Low Complexity
Quick Wins.
Your ideal agentic opportunity. Build this first.
High Impact, High Complexity
Strategic Projects.
Future opportunities requiring careful planning.
Low Impact, Low Complexity
Low Priority.
Nice-to-have agents. Defer.
Low Impact, High Complexity
Avoid.
Not worth the effort.
08
A.G.E.N.T — the design framework.
The capstone of Day 3 is the A.G.E.N.T framework — the design checklist every Citizens Pod will run on every agent they build for the rest of the week (and beyond). Five components, five questions.
Component
Key Question
Key Elements
Actionable Steps
A — Agent Identity
Who is the agent?
Purpose, role, scope.
Craft a clear mission. Outline responsibilities and limits. Align design with goals.
G — Gear & Brain
What powers the agent?
AI model, tools, knowledge sources.
Select a model balancing performance + cost. Integrate the right tools/APIs. Build accurate knowledge sources.
Run real-world scenarios. Collect feedback and track performance. Plan for growth.
09
What Day 3 sets up.
By the end of Day 3, the cohort has a memory architecture (3 layers + 5 LTM steps + SAFELOOP discipline), a prioritization model (3 circles + 2x2 quadrants), and a design framework (A.G.E.N.T) — everything they need to scope, design, and trust an agent end-to-end.
Day 3 closes the curriculum arc taught live to the September 2025 Citizens cohort. Days 4 and 5 of the Academy continued with Multi-Agent Orchestration, Scaling, Evaluation, Guardrails, and the Agentic Case Study — covered later in the bootcamp series as those modules are written up.
Ready to run Day 3 with your team?
The full deck — all 122 slides, including the 3-layer memory model, the 5-step LTM build, the SAFELOOP discipline, the Three Circles of Opportunity, the prioritization 2x2, and the complete A.G.E.N.T framework — is available for download.
#04 · 4.a · Citizens Spotlight · Human-in-the-Lead Training · May 2025
Five days. One Citizens cohort. Humans in the lead.
Human-in-the-Lead Training — a live, multi-day agentic AI program delivered for Citizens, built on a simple premise: humans stay in command of the agents, not the other way around. Four modules: Day 0 — Intro to Agents (the May 2025 foundations preview), then the three live days of the September 2025 Citizens AI Academy · Track C — Banking Reinvention, Tool Use & Reasoning, Memory & Planning. Pick a day. Read what was actually taught.
4 modules · Day 0 + Days 1–34 modules · all written up417 slides · across 4 modules1 Citizens cohort · Track C
CuratorMo Nomeli·CAAI Global Lead AI Learning & Emerging Tech · Source: Human-in-the-Lead Training · Citizens · 5-day curriculum · May 2025
Pick a day
Foundations → Deep dives → Capstone
01
What "Day 0" actually means.
Most agentic AI training jumps straight to "build something." That's the wrong starting point. Day 0 is the day before the building starts — when the team agrees on what an agent is, what level of autonomy they're targeting, and what mental model they'll use for the next four days.
If Day 0 lands, every later day compounds on it. If Day 0 is skipped, every later day re-litigates the same vocabulary fights — and the curriculum slows to a crawl. Hence: Day 0 first. Always.
Days 1, 2, and 3 take the foundations and go deep. Day 1 is Intro to Agents + Reinventing Banking with Agents (with a live KYC multi-agent demo on AI Refinery). Day 2 is Tool Use + Reasoning (RAISE, "Mo Tools Mo Problems," LRMs and the power of pause). Day 3 is Memory + Planning + the A.G.E.N.T design framework. Same curriculum, taught live to the Citizens Track C cohort in September 2025.
#06 · AI Refinery 101 · By Accenture
Stop Googling. Start shipping.
Every team building agents has the same problem: scattered docs, partner-by-partner learning curves, and a brand-new agent harness re-invented every quarter. AI Refinery™ by Accenture is the platform we built to make that problem go away — one place to develop and execute AI multi-agent solutions, with the agents, models, memory, governance, safety, and APIs already wired together. This is the 101.
12 utility agents12 huddle partners8 model types10 API surfaces
If you've shipped an agent in the last twelve months, you know the drill. Pick a model. Wire a vector store. Bolt on a tool-calling layer. Wrap it in something that looks like memory. Add guardrails. Add evals. Add an orchestrator. Hope it doesn't break. Then watch the next team start over from scratch.
The market gives you ingredients. What you actually want is a kitchen.
That's what AI Refinery is. It's a platform — not a framework, not a wrapper, not a "starter kit" — for developing and executing AI multi-agent solutions. Three things it's designed to help you do, straight from the docs:
Adopt and customize large language models (LLMs) to meet specific business needs.
Integrate generative AI across various enterprise functions using a robust AI stack.
Foster continuous innovation with minimal human intervention.
Seamless integration. Ongoing advancements. The platform isn't trying to be every framework. It's trying to be the substrate that the rest of your agentic stack builds on. One reference. One environment. One toolkit your team actually uses.
02
The four pillars.
Everything in AI Refinery hangs off four load-bearing capabilities. Get these right and the rest follows.
Fig 1. The four pillars. Together they form the substrate every agentic application built on AI Refinery rides on top of.
Pillar 1
Flexible Agentic Teams
Enable agents to autonomously perform tasks
Make decisions and interact with other agents and systems
Composable teams — not isolated agents
Pillar 2
Comprehensive Model Catalog
LLMs, VLLMs, rerankers, and more
Choose models to power your agents
Available through agentic workflow or direct API calls
Pillar 3
Scalable Distiller Framework
Designed to streamline complex workflows
Orchestrates various agents handling different tasks
The connective tissue between everything else
Pillar 4
Agent Memory
Retain context across interactions
Personalize interactions per user
Provide coherent responses over time
03
Twelve utility agents. Ready to deploy.
Built-in utility agents are the workhorses — engineered to streamline tasks like Retrieval-Augmented Generation (RAG), data analytics, and image generation. Ready-to-deploy. Configure with YAML. Deploy with minimal Python. Use one or chain them inside an orchestrator to build a multi-agent solution.
Agent
What it does
A2A Agent
Supports the integration of agents that are exposed over the Agent2Agent (A2A) protocol — for seamless communication and collaboration.
Analytics Agent
Streamlines data analysis tasks for insightful decision-making.
Author Agent
Enhances writing processes with AI-driven content creation.
Critical Thinker Agent
Analyzes conversations to identify issues and provide insights.
Deep Research Agent
Handles complex user queries through multi-step, structured research to produce comprehensive, citation-supported reports.
Image Generation Agent
Creates high-quality images (both text-to-image and image-to-image).
Image Understanding Agent
Analyzes and interprets visual data for deeper insights.
MCP Agent
Integrates Model Context Protocol (MCP) support for dynamic tool discovery and invocation via MCP servers.
Planning Agent
Designs realistic plans by analyzing user interactions and goals.
Research Agent
Handles complex queries using RAG via web search and vector search methods.
Search Agent
Answers queries by searching the internet, specifically using Google.
Tool Use Agent
Interacts with external tools to perform tasks and deliver results.
Configuration is intentionally minimal. Below is the actual sample from the docs — a project that wires up the SearchAgent to perform web searches and respond to user queries.
YAML · project config# configure your utility agents in this listutility_agents:
- agent_class: SearchAgent # The class of the agentagent_name:"Search Agent"# A name that you chooseorchestrator:agent_list:# list the configured agents here
- agent_name:"Search Agent"
Python · deploy & queryimport asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads API_KEY from .env
api_key = str(os.getenv("API_KEY"))
async def search_demo():
distiller_client = DistillerClient(api_key=api_key)
distiller_client.create_project(
config_path="example.yaml",
project="example"
)
async with distiller_client(
project="example",
uuid="test_user"
) as dc:
responses = await dc.query(
query="Who won the FIFA world cup 2022?"
)
async for response in responses:
print(response['content'])
if __name__ == "__main__":
asyncio.run(search_demo())
The example demonstrates a single agent. Configure additional agents under utility_agents and include them in orchestrator.agent_list to develop a multi-agent solution.
04
Three super agents. For when one agent isn't enough.
Super Agents are engineered to handle complex tasks by orchestrating multiple agents — creating dynamic and powerful collaborations. Three of them ship with the SDK.
Super Agent · 1
Base Super Agent
Decomposes a complex task into several subtasks, assigning each to the appropriate agents.
Dynamic decomposition — the agent decides who does what
Best for open-ended, exploratory workflows
Super Agent · 2
Flow Super Agent
Executes a deterministic workflow configured by the user among agents.
You define the steps · the platform runs them
Best when the path is known and reliability matters more than flexibility
Super Agent · 3
Evaluation Super Agent
Systematically assesses the performance of utility agents based on predefined metrics and sample queries — a structured approach to improving agent performance.
Treats agent quality as something measurable
Generates the feedback loop for continuous improvement
05
The Trusted Agent Huddle.
Twelve utility agents and three super agents would already be a strong roster. But the platform doesn't ask you to choose between AI Refinery and the rest of your stack. The Trusted Agent Huddle brings third-party agents into the same orchestration fabric — a roster of 12 partners whose agents you can call alongside the built-ins.
Partner agent
Where it runs
Amazon Bedrock Agent
Hosted on AWS — uses the reasoning of foundation models, APIs, and data to break down user requests, gather information, and complete tasks.
Azure AI Agent
Cloud-hosted on Microsoft Azure — interprets queries, invokes tools, executes tasks, and returns results.
CB Insights Agent
Hosted on the CB Insights market intelligence platform — verified market intelligence, company profiles, deal information, business analytics.
Databricks Agent
Hosted on Databricks — uses Databricks Genie so business teams interact with their data in natural language.
Google Vertex Agent
Hosted on Google Cloud Platform — leverages Google's foundation models, search, and conversational AI to automate tasks and personalize interactions.
Pega Agent
Hosted on Pega Platform — analyzes business workflows in real time, generates context-aware answers using enterprise knowledge to streamline issue resolution.
SAP Agent
Hosted on SAP — automates workflows, analyzes real-time business data, assists in financial operations, delivers contextual responses.
Salesforce Agent
Hosted on Salesforce — routes cases, provides order details, extends databases, responds to queries.
ServiceNow Agent
Hosted on ServiceNow — workflow automation, intelligent support, decision-making enhancement, user experience improvement.
Snowflake Agent
Hosted on Snowflake — business teams interact with their data through natural language and analyze data intuitively.
Wolfram Agent
Hosted on Wolfram Alpha — advanced computations, visualizations, scientific and mathematical queries, knowledge-based data retrieval.
Writer AI Agent
From Writer.com — generates, refines, and structures content using integrated tools and customizable guidelines.
06
The model catalog. Eight types. One choice point.
The model catalog offers a wide range of AI solutions for text and image processing — accessible through the agentic workflow or directly via API calls. Eight model types currently shipped, each with named providers and specific models from the catalog.
microsoft · llmlingua-2-bert-base-multilingual-cased-meetingbank
Type 4
Rerankers
For optimizing search result rankings
Reorders retrieved documents by query relevance
Type 5
Diffusers
For image generation tasks
black-forest-labs · FLUX.1-schnell
Type 6
Segmentation Models
For high-quality image segmentation
Type 7
Text-to-Speech (TTS)
For converting text to speech
Azure · AI-Speech
Type 8
Automatic Speech Recognition (ASR)
For converting speech to text
Azure · AI-Transcription
07
Safety, by default.
AI Refinery prioritizes safety — offering key features to ensure ethical and secure interactions. Two safety features ship today, each crucial for maintaining privacy and promoting responsible AI usage across applications.
Safety · 1
PII Masking
Safeguards personally identifiable information by masking sensitive data — like emails and phone numbers — before they reach backend systems or AI agents.
Configurable — define what counts as PII for your context
Reversible — original values are recoverable when authorized
Toggleable — turn it on or off per workflow
Aligns with global data protection standards
Safety · 2
Responsible AI (RAI)
Applies safety and policy rules to user queries handled by Large Language Models. Ships with default rules. Welcomes custom ones.
Default rules filter illegal, harmful, and discriminatory content
Allows users to create custom rules for specific needs
Ensures ethical AI operations
08
Four advanced features that pay for themselves.
These are the capabilities that move you past prototype-grade. Shared memory.Prompt compression.Reranking.Self-reflection. Each one solves a problem you'd otherwise solve manually — over and over.
Feature · 1
Agents' Shared Memory
Lets multiple AI agents access and utilize common memory resources — enhancing collaboration for more coherent and contextually aware responses.
Chat History Module: stores and retrieves chat conversations efficiently — agents maintain context across interactions
Relevant Chat History Module: fetches and summarizes the most pertinent past conversations, focusing on key insights and themes
Variable Memory Module: manages key-value pairs for storing and updating user-specific data — for personalization and continuity
Feature · 2
Prompt Compression
Reduces the size of input prompts while retaining essential information — enabling faster, more cost-effective processing.
Streamlines content from top-ranked documents
Enhances efficiency in generating comprehensive responses
Translation: smaller bills, same answer quality.
Feature · 3
Reranking
Improves the relevance of retrieved documents by reordering them based on their pertinence to the query.
Prioritizes the most relevant information first
Ensures the agent provides precise, meaningful responses
The difference between "found it" and "found something close"
Feature · 4
Self-Reflection
Enables Utility Agents to iteratively refine responses by evaluating and regenerating them until they meet quality standards.
Ensures responses are correct and relevant
Strategies include selecting the best attempt or aggregating information for the final output
Quality as a process, not a wish
09
Ten APIs. One platform.
The AI Refinery platform offers a comprehensive suite of APIs to enhance AI application development — from generating text responses to utilizing machine learning models. Each API focuses on a specific area to meet diverse project needs.
Fig 2. The 10 API areas. Distiller (highlighted) is the orchestration entry point — every other API is a primitive your agents can call directly. Realtime Distiller and Physical AI are the streaming and embodied-AI extensions.
API
What it gives you
Audio
Tools for audio processing and analysis, including speech recognition.
Chat Completion
Generates responses using LLMs supported by AI Refinery.
Distiller
Enables agentic project creation and access to other AI Refinery features.
Realtime Distiller
Streaming variant of Distiller for realtime agent workflows.
Embeddings
Creates the embedding of textual data using embedding models supported by AI Refinery.
Images
Provides image generation and segmentation capabilities.
Knowledge
Offers knowledge extraction and knowledge graph functionalities.
Models
Access the list of models currently supported by AI Refinery.
Moderations
Evaluates whether the input contains any potentially harmful content.
Physical AI (preview)
Provides advanced tools for video-based understanding, simulation, and synthesis of the physical world.
Training
Enables customization of AI models with personal data through training capabilities.
Observability
Enables querying logs, metrics, and traces for monitoring and debugging AIRefinery applications.
10
The bottom line.
Stop Googling. Start shipping. AI Refinery™ by Accenture isn't asking you to learn a new partner — it's asking you to stop relearning the same patterns every quarter. 12 utility agents ready to deploy. 3 super agents for orchestration. 12 trusted partner integrations via the Trusted Agent Huddle. 8 model types in the catalog. 10 API surfaces. 2 safety features — PII masking and Responsible AI. 4 advanced features — shared memory, prompt compression, reranking, self-reflection. All wired together.
The platform's three design intents from the docs: adopt and customize LLMs to meet specific business needs, integrate generative AI across enterprise functions using a robust AI stack, and foster continuous innovation with minimal human intervention. Each one is a problem most teams solve in private. AI Refinery solves them once, in shared infrastructure, so your team can focus on what's actually different about your use case.
The harness is built. Bring your agents.
Get started.
The full SDK documentation is live — including quickstarts, project guidelines, tutorials for every utility agent, multi-agent workflow patterns, the agent library, the model catalog, and the complete API reference. Generate API keys, install the SDK, and ship your first project today.
Twelve heavyweight partners. 189 capabilities. One head-to-head map. The agentic layer doesn't sit in isolation — it rides on top of eighteen platform modules across Governance, Data & AI, and Foundation. This map shows where it lives in the broader operating model. Click Module 18 below to enter the live ecosystem comparison.
Module 18 — the Agentic Layer ecosystem comparison — is live. Click to explore.
CuratorJimmy Priestas·Global AI & Data Lead — Digital Core
Enterprise AI Operating Model
Governance Framework
1Strategy & Value Enablement
2Governance & Operating Model
3Value Realization
4Platform Orchestration & Control
5Enablement & Self Service
Data & AI Backbone
Data
6Data Mgmt. & Governance
7Integration & Interoperability
8Ingestion
9Data Storage & Processing
10Experimentation & Consumption
11Insights & Analytics
AI
16
Classic AI/ML
Multi-modal AI co-existing, including vision, language, speech
17
Gen AI Services & Pre-Built Industry Solutions
Gen AI Architecture & Governance · Design, Boost, Build, Operationalize · Pre-built industry solutions accelerate reinvention journey
18
Agentic AI
5 Key Agentic AI Capabilities that can be built individually, or combined for maximal Enterprise Reinvention in Agentic solutions.
Knowledge
Development of enterprise-wide knowledge capacity with adaptive learning.
Models
Customize pre-built foundation models to drive reinvention and value.
Agents
Embed the power of generative AI across end-to-end workflows to drive increased value.
Governance
Dynamically route queries to the most appropriate model based on use case specificity.
Infrastructure
Compute, security, and confidential infrastructure that underpins agentic workloads at scale.
Enter the Agentic AI Atlas
Digital Foundation
12Cloud Infrastructure
13Continuum Control Plane
14Security
15Composable Integration
AI Agentic Architecture · Ecosystem Comparison of the Agentic Layer
Who actually wins the agentic layer? 12 partners · 189 capabilities
An interactive atlas of the Agentic Layer (Module 18 of the enterprise AI operating model), mapping how 12 ecosystem partners cover 189 capabilities across agents, governance, models, infrastructure, and knowledge. Scope is limited to the Agentic Layer; this is not a comparison of the partners' full enterprise portfolios.
12
Ecosystem Partners
189
Capabilities
5
Domains
2,268
Data Points
The ecosystem partners
Click any ecosystem partner to see how they cover the 189 Module 18 capabilities, plus strategic strengths, gaps, and ideal use cases. Scope is limited to the Agentic AI Layer; broader enterprise capabilities outside Module 18 are not assessed here.
CuratorJimmy Priestas·Global AI & Data Lead — Digital Core
Component architecture
The full Agentic AI Layer organized into five domains. Each tile is a capability — click to see how every ecosystem partner implements it. Use the filter to color the diagram by partner coverage.
CuratorJimmy Priestas·Global AI & Data Lead — Digital Core
Has capabilityN/A
The capabilities
Browse the full capability hierarchy. Click any capability to see how every ecosystem partner implements it.
CuratorJimmy Priestas·Global AI & Data Lead — Digital Core
The matrix
The full comparison grid. Scroll horizontally to see all 12 ecosystem partners side-by-side. Capability column stays pinned.
CuratorJimmy Priestas·Global AI & Data Lead — Digital Core
Strategic analysis
Each ecosystem partner's architectural strengths, notable gaps, and ideal-fit scenarios — strictly within the scope of Module 18 (Agentic AI Layer). Content is red-teamed for balance: every partner has substantive strengths and substantive gaps. Claims are limited to capabilities mapped in this atlas; broader enterprise portfolios are out of scope.
CuratorJimmy Priestas·Global AI & Data Lead — Digital Core
Ecosystem Partner
Strengths
Gaps
Ideal Use Case
#08 · AI Everywhere
Where the practice puts AI to work. Seven fronts.
Accenture's Reinvention Services brings the full breadth of the firm to bear on every client problem — organized into seven Reinvention Partner areas that map to how clients actually think about their business. Pick a front. Each one is its own playbook for embedding data and AI at scale, and each one is being assembled now. Cybersecurity. Digital Core. Finance. Industry & Enterprise. Song. Supply Chain & Engineering. Talent.
7 partner areas1 reinvention thesisComing Soon
Pick your front
Reinvention Partners · seven areas of the practice
#08 · 8.a · Cybersecurity
Cyber-resilience. Value through trust.
The Cybersecurity Reinvention Partner reinvents how enterprises defend, protect, and grow value through trust — building defenses, protecting enterprises, managing risk, and enabling emerging technologies. This chapter is in build. The full playbook will cover the AI & data layer of cyber: agentic SOCs, identity for non-human actors, model-and-data security patterns, and the partner stack underneath.
Coming SoonReinvention Partner · 8.a
#08 · 8.b · Digital Core
The digital foundations, reinvented.
The Digital Core Reinvention Partner reinvents the foundations every enterprise runs on — technology strategy and architecture, data and AI, modernizing and managing applications, infrastructure, data, and cloud. This chapter is in build. The full playbook will cover the architecture patterns, the modernization plays, and the AI-native operating model that ties them together.
Coming SoonReinvention Partner · 8.b1 sub-chapter live
Inside Digital Core
Sub-chapters of the Digital Core playbook
#08 · 8.b.i · Digital Core · Enterprise Architecture
The architecture beneath the architecture.
Enterprise Architecture is the connective tissue of Digital Core — the patterns, principles, and decisions that determine whether AI lands as a product, a platform, or a pile of pilots. This sub-chapter is in build. It will cover the EA reference patterns we use, the decision frameworks behind them, and the partner ecosystem that supports each layer.
Coming SoonSub-chapter · 8.b.i
#08 · 8.c · Finance
Financial performance, reinvented.
The Finance Reinvention Partner reinvents financial performance by supporting the CFO agenda — driving best-in-class performance and delivering insights and benchmarking across the enterprise. This chapter is in build. The full playbook will cover AI in close-and-consolidate, predictive forecasting, working-capital optimization, and the data foundations underneath.
Coming SoonReinvention Partner · 8.c
#08 · 8.d · Industry & Enterprise
Core value chains. End-to-end.
The Industry & Enterprise Reinvention Partner reinvents core industry value chains and drives end-to-end, cross-functional reinvention to deliver growth and long-term value. This chapter is in build. The full playbook will cover the industry-specific AI patterns we deploy, the cross-functional decision frameworks, and where the highest-value reinventions are landing today.
Coming SoonReinvention Partner · 8.d
#08 · 8.e · Song
How clients grow.
Song reinvents how clients grow — bringing together customer growth strategy, marketing, sales, service, commerce, design, digital products, data, and AI to create customer-led growth. This chapter is in build. The full playbook will cover agentic CX, generative creative, conversational commerce, and the data foundations that make personalization at scale possible.
Coming SoonReinvention Partner · 8.e
#08 · 8.f · Supply Chain & Engineering
Across the product and asset lifecycle.
The Supply Chain & Engineering Reinvention Partner helps clients leverage AI and digital technologies across product and asset lifecycles to build competitive advantage. This chapter is in build. The full playbook will cover digital twin patterns, agentic supply planning, generative engineering, and the partner stack across PLM, MES, and ERP.
Coming SoonReinvention Partner · 8.f
#08 · 8.g · Talent
How people and organizations work.
The Talent Reinvention Partner reinvents how people and organizations work — delivering leadership, talent, operating models, and change to accelerate the workforce agenda. This chapter is in build. The full playbook will cover human-AI collaboration patterns, agent-as-coworker operating models, the change-management frameworks underneath, and the skills architecture we deploy.
Coming SoonReinvention Partner · 8.g
>
Atlas Assistant
Ask about any section · architecture · costs · training · ecosystem
Note: To use this feature, please drop this HTML file into a new Claude Conversation as an artifact.