Issue #66 · AI Agent Insider
Google I/O 2026: Gemini 3.5 Flash Ships to Production as Token Volumes Hit 3.2 Quadrillion Per Month
Wednesday, May 27, 2026 · 15 min read
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Hook
Google just disclosed the production token volumes running through its infrastructure, and the numbers reframe the entire conversation about AI adoption pace. That figure, 3.2 quadrillion tokens per month – roughly 7x the volume of a year ago – is not a benchmark result or a press-release projection. It is the actual throughput of a system that has crossed from developer tool to mass-market infrastructure. The model that will carry most of that load going forward, Gemini 3.5 Flash, launched at I/O 2026 this week and skipped the preview phase entirely: it goes straight to production with benchmark scores that compete with the previous generation’s flagship models at four times the speed and less than half the price. The scale story, the economics story, and the governance story this week all point at the same thing: the agentic era is not approaching; it is already running at quadrillion-token scale.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
This Week’s Signal
Google I/O 2026: Gemini 3.5 Flash Is in Production and the Scale Numbers Are Structural
Google I/O runs every year, and every year the announcements are framed as pivotal. This year the framing is harder to dismiss because the numbers are concrete and the comparison is direct.
At the I/O 2026 keynote on May 19, Sundar Pichai disclosed that Google is now processing 3.2 quadrillion tokens per month across its surfaces. A year ago at I/O 2025, that number was 480 trillion tokens. Two years ago it was 9.7 trillion. The compound growth rate on that series is not something that can be attributed to developer experimentation or enterprise pilots – it is a consumer-scale adoption curve. AI Mode in Search reached 1 billion users in its first year. The Gemini app now has 900 million monthly active users, up from 400 million at I/O 2025. These are not potential addressable market projections. They are current active user figures from a deployed system.
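To make the growth series concrete, here is a minimal sketch that turns the three monthly figures quoted above into year-over-year multiples. The labels and function names are ours, chosen for illustration; the only inputs are the three volumes from the keynote.

```python
# Monthly token volumes quoted in the keynote, in trillions of tokens.
# The dictionary keys are illustrative labels, not official names.
monthly_tokens_trillions = {
    "two years ago": 9.7,
    "I/O 2025": 480.0,
    "I/O 2026": 3200.0,  # 3.2 quadrillion
}


def growth_multiples(series):
    """Return (label, multiple) for each consecutive pair of data points."""
    labels = list(series)
    return [
        (f"{prev} -> {cur}", series[cur] / series[prev])
        for prev, cur in zip(labels, labels[1:])
    ]


def annualized_multiple(first, last, years):
    """Geometric-mean growth multiple per year between two data points."""
    return (last / first) ** (1 / years)


multiples = growth_multiples(monthly_tokens_trillions)
two_year = annualized_multiple(9.7, 3200.0, years=2)

for label, multiple in multiples:
    print(f"{label}: {multiple:.1f}x")
print(f"Annualized over two years: {two_year:.1f}x per year")
```

The most recent year works out to about 6.7x, which is where the rounded 7x headline comes from. The year before it was closer to 49x, so the curve is decelerating in relative terms even as absolute volume climbs.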
The model anchoring the next stage of that growth is Gemini 3.5 Flash, which shipped at I/O to general availability – no preview period, no restricted access window, directly to production across the Gemini app, Google Search, and the developer platform. The capability claims come with specific benchmark numbers: 76.2% on Terminal-Bench 2.1 (versus Gemini 3.1 Pro’s 70.3%), 1,656 Elo on GDPval-AA (versus 3.1 Pro’s 1,314), and 84.2% on the CharXiv Reasoning benchmark. Artificial Analysis clocked Flash at nearly 280 tokens per second, compared to 60-70 tokens per second for GPT-5.5 and Claude Opus 4.7. Pichai stated directly that Flash delivers “cutting-edge performance in agentic AI” at less than half the price of comparable frontier models, and in some cases close to a third of it.
The economics calculation Pichai laid out in the keynote is worth quoting as a concrete operator framing: leading companies that are currently processing a trillion tokens daily and move 80% of their workloads to Gemini 3.5 Flash would save over a billion dollars a year. That is not a general claim about AI cost reduction – it is a specific arithmetic outcome from the pricing differential at a specific scale tier.
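The arithmetic behind that claim can be run backwards to see what price gap it implies. The sketch below assumes a 365-day year and treats the saving as a single blended per-token differential; neither assumption comes from the keynote, and real bills split input and output tokens at different rates.

```python
# Figures from the keynote framing.
TOKENS_PER_DAY = 1_000_000_000_000      # one trillion tokens daily
SHARE_MOVED = 0.80                      # 80% of workloads moved to Flash
TARGET_SAVINGS_USD = 1_000_000_000      # "over a billion dollars a year"

# Assumption for this sketch: the workload runs every day of the year.
DAYS_PER_YEAR = 365


def tokens_moved_per_year(tokens_per_day, share_moved, days=DAYS_PER_YEAR):
    """Total tokens shifted to the cheaper model over one year."""
    return tokens_per_day * share_moved * days


def implied_saving_per_million(target_savings_usd, tokens_moved):
    """Blended price gap (USD per million tokens) needed to hit the target."""
    return target_savings_usd / (tokens_moved / 1_000_000)


moved = tokens_moved_per_year(TOKENS_PER_DAY, SHARE_MOVED)
saving = implied_saving_per_million(TARGET_SAVINGS_USD, moved)

print(f"Tokens moved per year: {moved:.3e}")
print(f"Implied saving: ${saving:.2f} per million tokens")
```

Under these assumptions the claim holds whenever the blended price gap between the incumbent model and Flash exceeds roughly $3.42 per million tokens. Swap in your own volume and share to see where your break-even sits.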
The broader I/O picture matters alongside the Flash numbers. Google also launched Gemini Omni, a multimodal model trained simultaneously on text, audio, images, and video – the first Google model the company describes as a “world model” rather than a language model with vision capabilities appended. DeepMind CEO Demis Hassabis described it publicly as “a pivotal step toward AGI” and told reporters: “We’re at the foothills of the singularity.” Gemini Omni Flash is rolling out this week to subscribers and to YouTube Shorts at no cost. Gemini Omni is not a focus of this issue’s signal, but it is worth flagging as the context behind why Google is projecting further acceleration in the token throughput numbers.
The infrastructure underpinning the scale story is also new this week. Google announced 8th-generation TPUs with a dual-chip architecture: TPU 8T for training (nearly 3x the computing power of the previous generation) and TPU 8 for inference with dramatically reduced latency. Training now distributes across multiple data centers simultaneously using JAX and Pathways, scaling to over 1 million TPUs worldwide. Google’s projected 2026 capital expenditure is $180-190 billion – roughly six times the $31 billion spent in 2022.
Antigravity 2.0, Google’s agent development platform, also shipped at I/O as a desktop application, CLI, and SDK – moving from a coding environment to a full hub for building and managing autonomous agents. The Flash variant used inside Antigravity is optimized for the agentic loop specifically and is 12x faster than competing coding models in that context.
What this means for your stack. The 3.2 quadrillion token number is the structural signal. AI processing at that throughput has crossed a threshold where marginal cost per token becomes the primary economic variable in product design decisions. The Flash pricing and speed profile – 280 tokens per second, less than half the price of comparable models – changes the cost structure of any agent workflow that is currently running on a premium-tier model primarily because no cheaper option delivered adequate quality. For operators: run your current highest-volume agent workflows through Gemini 3.5 Flash against your own ground-truth benchmark set. The combination of speed, accuracy, and price is designed specifically to capture the workloads currently priced on Sonnet- and Opus-tier models. The question is whether Flash’s quality profile is sufficient for your specific task – and the only way to know is to test it against your production task distribution, not against public benchmarks that do not represent your use case.
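A rough sketch of what that test can look like in practice. Everything here is a placeholder: the `Case` type, the stub models, the exact-match scoring, and the two-point tolerance are illustrative assumptions, and the stubs stand in for whatever client calls your incumbent and candidate models actually require.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One ground-truth example drawn from your production task distribution."""
    prompt: str
    expected: str


def accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Share of cases where the model output matches the expected answer.

    Exact match after trimming and lowercasing is an assumption; most agent
    tasks need a task-specific scorer instead.
    """
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        1 for c in cases
        if model(c.prompt).strip().lower() == c.expected.strip().lower()
    )
    return hits / len(cases)


def can_switch(incumbent_acc: float, candidate_acc: float, max_drop: float = 0.02) -> bool:
    """True if the candidate is within the tolerated accuracy drop."""
    return candidate_acc >= incumbent_acc - max_drop


# Hypothetical stand-ins for real model calls.
ANSWERS = {"2+2": "4", "capital of France": "Paris", "opposite of hot": "cold", "3*3": "9"}
CASES = [Case(prompt, expected) for prompt, expected in ANSWERS.items()]


def incumbent_stub(prompt: str) -> str:
    return ANSWERS[prompt]


def candidate_stub(prompt: str) -> str:
    # Deliberately gets one case wrong so the comparison has something to show.
    return "6" if prompt == "3*3" else ANSWERS[prompt]


incumbent_acc = accuracy(incumbent_stub, CASES)
candidate_acc = accuracy(candidate_stub, CASES)
print(incumbent_acc, candidate_acc, can_switch(incumbent_acc, candidate_acc))
```

The point of the shape is that the decision rule lives in one place and takes your own cases as input. Public benchmark scores never enter the function.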
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
3 Operator Playbooks
1. Gartner Predicts 40% of Enterprise AI Agent Projects Will Be Canceled by 2027 – DOMAIN: Operator Wins & Failures
Gartner published two related findings this week that every team with an active agent deployment should read together.
The first is a prediction from their 2025-2026 research cycle, now being cited heavily in enterprise planning contexts: over 40% of agentic AI projects will be canceled by the end of 2027. The second, published May 26, is the mechanism: applying uniform governance across AI agents will itself be a primary driver of that failure. The argument is precise – the problem is not that organizations are deploying agents without governance. The problem is that they are applying the same governance framework to a low-stakes research summarizer agent that they apply to an agent with write access to a CRM or financial system. That overhead makes the low-stakes agents bureaucratically unworkable without meaningfully reducing risk on the high-stakes ones.
The adoption backdrop makes this urgent. Gartner separately projects that 40%+ of enterprise applications will feature task-specific AI agents by end of 2026, up from less than 5% in 2025. IDC projects agentic automation will enhance capabilities in 40%+ of enterprise apps by 2027. The deployment wave is not hypothetical – 54% of enterprises now report running AI agents in production (Ampcome mid-year report). Governance infrastructure that has yet to catch up with that deployment rate is the problem Gartner is flagging.
The failure mode they describe is familiar from cloud and data governance cycles: a policy team writes a comprehensive governance framework appropriate for the most sensitive deployment scenario, then applies it as a blanket standard to every agent project regardless of scope. The cost of compliance kills the low-stakes projects. The high-stakes projects get through anyway because they have executive air cover. The result is governance theater that fails at both ends – bureaucratic overhead for safe agents, insufficient scrutiny for risky ones.
Your move: Audit every current and planned agent deployment against a tiered risk classification, not a single governance standard. The criteria that actually determine blast radius: Does the agent have write access to production systems? Can it take actions that are hard to reverse (sending emails, making purchases, modifying records)? Does it process personal or regulated data? Can it escalate its own privileges? Agents that answer “no” to all four should have lightweight oversight: logging, a defined approval path for new capability additions, and a quarterly review. Agents that answer “yes” to one or more need the full framework: explicit scope limits, consent gates on destructive actions, audit trails, rollback capability, and a named owner. Do not let the governance burden for a sensitive deployment kill the deployment decision for a safe one. That trade is exactly how you end up in the 40% that cancel.
2. 97 Million MCP Downloads, But Most Deployments Are Missing Spec-Compliant Authentication – DOMAIN: Security & Trust
The Model Context Protocol has become a de facto standard faster than any comparable integration protocol in recent memory. Since launch in November 2024, combined Python and TypeScript SDK downloads crossed 97 million monthly. OpenAI adopted it in March 2025. Microsoft announced Copilot Studio support in March 2025. Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation in December 2025. By end of 2026, Gartner projects 40%+ of enterprise applications will include MCP-capable AI agents.
The security gap in all of this is concrete: most MCP server deployments, including many in production today, are not implementing the authentication the spec requires. For a spec-compliant remote MCP server, the requirements are: OAuth 2.1 with PKCE when authorization is implemented, HTTPS on all endpoints, discoverable authorization server metadata, Protected Resource Metadata per RFC 9728, and Resource Indicators per RFC 8707 to prevent token audience confusion attacks. The last two – RFC 9728 and RFC 8707 – are the ones most commonly absent from current deployments.
Why it matters specifically: Resource Indicators are what bind a token to a single audience. When they are missing, the resource server has no mechanism to verify that a token was issued for it specifically, so an attacker can perform a token audience confusion attack – presenting a token issued for one service to a different service and having it accepted. Protected Resource Metadata is the discovery half of the same defense: it tells clients which authorization server protects the resource and which resource identifier to request a token for, and without it clients cannot reliably scope their requests. In an agentic system where one agent orchestrates multiple MCP tools – calendar, email, CRM, code execution – an improperly scoped or audience-confused token is a privilege escalation path across the agent’s entire tool surface.
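The audience check itself is small. The sketch below assumes the access token is a JWT whose signature, issuer, and expiry have already been verified elsewhere, and that `claims` is the decoded payload; the function names and URLs are hypothetical. The one detail taken from the JWT standard is that `aud` may be either a single string or a list of strings.

```python
def audiences(claims):
    """Normalize the JWT 'aud' claim to a list of strings."""
    aud = claims.get("aud")
    if aud is None:
        return []
    if isinstance(aud, str):
        return [aud]
    return [a for a in aud if isinstance(a, str)]


def token_is_for_this_server(claims, resource_id):
    """Accept the token only if this server's identifier is a listed audience.

    This covers the audience check only. Signature, issuer, expiry, and scope
    validation are assumed to happen before this function is called.
    """
    return resource_id in audiences(claims)


# Hypothetical identifiers for illustration.
THIS_SERVER = "https://mcp.example.com/calendar"

ok = token_is_for_this_server({"aud": THIS_SERVER}, THIS_SERVER)
confused = token_is_for_this_server({"aud": "https://mcp.example.com/email"}, THIS_SERVER)
unbound = token_is_for_this_server({"sub": "agent-7"}, THIS_SERVER)
print(ok, confused, unbound)
```

A token with no audience at all is rejected here, which is the conservative choice: an unbound token is exactly what a confused-audience replay relies on.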
Auth0 made Auth for MCP generally available on May 6, 2026. WorkOS has MCP-compatible OAuth with enterprise SSO and fine-grained authorization. Stytch and others are shipping MCP auth tooling. The ecosystem has moved – but deployments have not caught up.
Your move: If you are running any remote MCP server in production, run a spec compliance audit against these five requirements this week. The audit takes less than a day. If RFC 9728 Protected Resource Metadata is not exposed by your server, that is the first priority fix – it is among the requirements most commonly missing, and clients need it to discover the right authorization server and resource identifier. Pair it with RFC 8707 audience validation on every incoming token, which is the check that actually blocks the token confusion attack. If you are evaluating auth providers for new MCP deployments, filter on spec-complete OAuth 2.1 support as a hard requirement, not a “supported soon” roadmap item. The 97 million downloads figure means your organization’s MCP tooling is very likely already in use by engineers who are not waiting for an auth security review. Find it before someone else does.
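One piece of that audit can be scripted. The sketch below checks a Protected Resource Metadata document that you have already fetched from the server's `/.well-known/oauth-protected-resource` location and parsed into a dict. The field names `resource` and `authorization_servers` come from RFC 9728; the function names, example URLs, and the choice to require at least one HTTPS authorization server are our assumptions about how a strict audit would read it.

```python
from urllib.parse import urlparse


def is_https(url):
    """True if the value is a string URL with the https scheme."""
    return isinstance(url, str) and urlparse(url).scheme == "https"


def audit_protected_resource_metadata(doc, expected_resource):
    """Return a list of findings; an empty list means this check passed."""
    findings = []

    resource = doc.get("resource")
    if resource is None:
        findings.append("missing 'resource'")
    else:
        if resource != expected_resource:
            findings.append("'resource' does not match this server's identifier")
        if not is_https(resource):
            findings.append("'resource' is not an https URL")

    servers = doc.get("authorization_servers")
    if not isinstance(servers, list) or not servers:
        findings.append("no 'authorization_servers' listed")
    else:
        for server in servers:
            if not is_https(server):
                findings.append(f"authorization server is not https: {server!r}")

    return findings


# Hypothetical documents for illustration.
good = {
    "resource": "https://mcp.example.com",
    "authorization_servers": ["https://auth.example.com"],
}
print(audit_protected_resource_metadata(good, "https://mcp.example.com"))
print(audit_protected_resource_metadata({}, "https://mcp.example.com"))
```

This covers one of the five requirements. It says nothing about PKCE enforcement or token validation, which need their own checks.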
3. JetBrains Koog 1.0: The Enterprise-Grade Agent Framework for the JVM Ecosystem – DOMAIN: Infrastructure & DevTools
JetBrains announced Koog 1.0 at the KotlinConf 2026 keynote on May 27. Koog is an open-source framework for building AI agents in Kotlin and Java – not a Python framework with JVM bindings, but a framework built from first principles for JVM and Kotlin Multiplatform environments.
The 1.0 designation carries a specific commitment: no breaking changes to stable modules for at least one year. For enterprise teams making an agent framework decision, that commitment is the critical differentiator between an experimental SDK and a production foundation. Most of the agent framework ecosystem – LangGraph, CrewAI, AutoGen – has been evolving rapidly enough that teams have had to absorb breaking API changes on timescales that create maintenance overhead in production systems. Koog 1.0 is betting that the market for stable, long-term-supported agent frameworks is real and that the JVM/Kotlin ecosystem is underserved.
The capability additions in 1.0 are practically relevant: local Android AI inference via LiteRT model integration (agents that run entirely on-device without API calls), a redesigned Java interop layer with a cleaner API surface, OpenTelemetry integration across Kotlin Multiplatform targets (the observability requirement that enterprise agent deployments have consistently cited as the gap in experimental frameworks), improved persistence and memory for long-running agents, and Anthropic prompt caching support to reduce latency and token cost on repeated prompt patterns.
The last item is worth noting: Anthropic prompt caching is a production cost optimization that many teams are not using and should be. For agents with long system prompts or frequently repeated context blocks, prompt caching reduces token costs on cached content by roughly 90% and latency by 60-80%. Koog 1.0 building this in natively means JVM teams do not have to implement it as a custom layer.
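To see where the roughly 90% figure comes from and when it does not apply, here is a back-of-envelope sketch. It assumes cache reads are billed at 10% of the base input price and cache writes at 125%, and that every call lands while the cache entry is still alive. The prefix size and base price are hypothetical placeholders; check the multipliers against current pricing before relying on them.

```python
# Assumed billing multipliers relative to the base input-token price.
CACHE_WRITE_MULTIPLIER = 1.25   # first call writes the prefix to the cache
CACHE_READ_MULTIPLIER = 0.10    # later calls read it back


def prefix_cost(calls, prefix_tokens, price_per_mtok, cached):
    """Cost in USD of sending the same prompt prefix on every call."""
    base = prefix_tokens / 1_000_000 * price_per_mtok
    if not cached:
        return calls * base
    # One cache write, then (calls - 1) cache reads.
    return base * CACHE_WRITE_MULTIPLIER + (calls - 1) * base * CACHE_READ_MULTIPLIER


def savings_fraction(calls, prefix_tokens=20_000, price_per_mtok=3.0):
    """Fraction of prefix cost saved by caching; negative means caching cost more."""
    uncached = prefix_cost(calls, prefix_tokens, price_per_mtok, cached=False)
    cached = prefix_cost(calls, prefix_tokens, price_per_mtok, cached=True)
    return 1 - cached / uncached


for n in (1, 2, 10, 100):
    print(f"{n:>3} calls: {savings_fraction(n):+.1%}")
```

Under these assumptions a single call costs more with caching than without, two calls already come out ahead, and the saving approaches 90% only as the prefix is reused many times. That is why the optimization pays off mainly for agents with long, stable system prompts.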
Your move: If your organization runs Java or Kotlin services and you have an agent deployment in scope, evaluate Koog 1.0 against your current framework choice on three criteria: the stability commitment (can your team absorb breaking changes in production?), the OpenTelemetry integration maturity (does your observability stack already run OTel?), and the Android/on-device path (if any of your agent use cases touch mobile or offline-capable deployment, local LiteRT inference is a capability that no Python framework can offer in the same ecosystem). The on-device angle is particularly relevant for any agent workflow where latency, connectivity, or data privacy constraints make cloud API calls impractical. Koog 1.0 is not trying to be the default agent framework for every team – it is trying to be the right framework for JVM shops that need stability, observability, and native platform integration.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Steal This
The Agent Governance Tier Matrix
Gartner’s prediction that over 40% of agentic AI projects will be canceled, with uniform governance as a primary driver, is not a warning to add more governance – it is a warning to apply the right governance to the right tier. Use this matrix before each new agent deployment decision.
AGENT GOVERNANCE TIER MATRIX
================================
Classify each agent deployment before assigning governance overhead.
CLASSIFICATION INPUTS
Answer each question Y / N / Partial for the agent being evaluated:
ACCESS SCOPE
[ ] Does the agent have write access to production databases or systems?
[ ] Can the agent send external communications (email, Slack, API calls with side effects)?
[ ] Can the agent make purchases, financial transactions, or commit resources?
[ ] Can the agent modify access controls, permissions, or credentials?
[ ] Does the agent operate on personal, regulated, or confidential data?
REVERSIBILITY
[ ] Can the agent's actions be reversed within 5 minutes?
[ ] Is there a log of every action the agent takes (sufficient for replay/audit)?
[ ] Is there an automatic rollback trigger if output quality drops below threshold?
EXPOSURE
[ ] Is the agent's output visible externally (customers, regulators, partners)?
[ ] Does the agent interact with other agents that have broader access scope?
[ ] Is the agent in a regulated industry context (finance, health, legal)?
TIER ASSIGNMENT
TIER 1 -- LIGHTWEIGHT (all answers are N or Partial on Access, Y on Reversibility, N on Exposure)
Governance:
- Logging to a queryable store (30-day minimum retention)
- Named owner and defined scope document (1-page max)
- Quarterly capability review
- Incident notification path (who to call if it breaks)
Required BEFORE deployment:
[ ] Scope document signed by owner
[ ] Log destination confirmed
[ ] Quarterly review scheduled
TIER 2 -- STANDARD (any Y on Access, all Y on Reversibility, all N on Exposure)
Governance:
- Everything in Tier 1
- Explicit scope limits documented and technically enforced (not just policy)
- Human approval gate for any new tool/capability additions
- Monthly output quality review
- Defined rollback procedure tested before go-live
Required BEFORE deployment:
[ ] Scope limits technically enforced (not just described)
[ ] Rollback tested in staging
[ ] Monthly review scheduled with accountable reviewer
TIER 3 -- FULL OVERSIGHT (any Y on Access + any external Exposure, or any N on Reversibility)
Governance:
- Everything in Tier 2
- Consent gate on each class of consequential action (not each instance)
- Immutable audit trail (append-only log, tamper-evident)
- Data lineage for any output that informs a human decision
- Legal/compliance review before production deployment
- Incident response runbook specific to this agent
- 90-day review cadence with documented pass/fail criteria
Required BEFORE deployment:
[ ] Legal/compliance sign-off documented
[ ] Immutable audit trail operational and tested
[ ] Incident runbook written and distributed to on-call
[ ] Pass/fail criteria defined and baselined
USAGE NOTES
- Tier assignment is for the MOST SENSITIVE capability the agent has, not an average. One write-access tool makes it Tier 2 minimum.
- Re-classify whenever scope changes. Adding a new tool resets the check.
- Tier 1 agents should not require more than 2 hours of governance work. If they do, your governance process is miscalibrated.
- The goal is proportional oversight, not maximal oversight.
The agents Gartner predicts will fail are Tier 1 projects buried under Tier 3 process.
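If you want the matrix to run as a pre-deployment check, it reduces to a few lines. The sketch below is one possible reading, not a canonical one. It takes the Y / N / Partial answers for each group and returns the highest tier that matches. Two cases the matrix leaves open are resolved conservatively and flagged in the comments: a Partial on Reversibility is treated like an N, and an agent with no Access flags but some Exposure lands in Tier 2.

```python
VALID_ANSWERS = {"Y", "N", "Partial"}


def assign_tier(access, reversibility, exposure):
    """Map checklist answers to a governance tier (1, 2, or 3).

    Each argument is a list of "Y" / "N" / "Partial" answers for that group.
    """
    for answer in [*access, *reversibility, *exposure]:
        if answer not in VALID_ANSWERS:
            raise ValueError(f"unexpected answer: {answer!r}")

    any_access = "Y" in access
    any_exposure = "Y" in exposure
    # Assumption: anything short of Y on every Reversibility question
    # (including Partial) is treated as not reversible.
    fully_reversible = all(a == "Y" for a in reversibility)

    if not fully_reversible or (any_access and any_exposure):
        return 3
    if any_access:
        return 2
    if all(a == "N" for a in exposure):
        return 1
    # Assumption: no Access flags but some Exposure is not covered by the
    # matrix, so default to the middle tier rather than the lightest one.
    return 2


# Hypothetical agents for illustration.
summarizer = assign_tier(["N"] * 5, ["Y"] * 3, ["N"] * 3)
crm_writer = assign_tier(["Y", "N", "N", "N", "N"], ["Y"] * 3, ["N"] * 3)
customer_mailer = assign_tier(["N", "Y", "N", "N", "Y"], ["Y"] * 3, ["Y", "N", "N"])
print(summarizer, crm_writer, customer_mailer)
```

Run it whenever a tool is added to an agent, in line with the re-classification note above.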
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Bottom Line
Google’s I/O 2026 numbers tell a story that does not require interpretation: AI processing at 3.2 quadrillion tokens per month, growing 7x year-over-year, running through a model that costs less than half of comparable alternatives at four times the speed, is no longer an industry story – it is an infrastructure story. The capability question for most enterprise workloads is effectively answered; the remaining question is cost structure and model selection for specific task profiles. The two other stories this week frame what that production reality looks like from the inside. Gartner’s 40% cancellation prediction is not a warning against deploying agents – it is a warning against applying the same governance weight to a document summarizer and a system with write access to production databases. The organizations that get governance proportionality right will be running agents across every function; the ones that don’t will have canceled their agent programs before they shipped the interesting work. And the MCP authentication gap – 97 million downloads, most deployments missing spec-compliant OAuth 2.1 – is the specific security debt accumulating under all of this adoption velocity. The standard is defined, the tooling exists, and the cost of an audit is a day of work. The cost of a token audience confusion attack across an agent’s entire tool surface is not.
AI Insider is published by Digital Forge Studios Inc.
Stay sharp.
New issues every weekday. No spam, no fluff — just the practitioner's edge.