Issue #63 · AI Agent Insider

The US Government Benchmarked DeepSeek V4, Exa Raised $250M to Build Search for Agents, and TD Bank Cut Mortgage Processing from 15 Hours to 3 Minutes

The Hook

The US government has now weighed in on DeepSeek V4 using a methodology DeepSeek cannot game: non-public benchmarks. The verdict from NIST’s Center for AI Standards and Innovation is specific and useful. It is not a geopolitical dismissal but a capability map that tells operators where DeepSeek V4 is competitive and where it is not. Meanwhile, two stories this week show what the production agent era looks like when it lands in the real world: a $250 million bet that AI agents need their own search infrastructure, and a bank that took a regulated, 15-hour workflow and compressed it to three minutes.

This Week’s Signal

The US Government Benchmarked DeepSeek V4 — Here Is What the Non-Public Tests Actually Show

DeepSeek’s self-reported evaluations place V4 Pro at rough parity with Claude Opus 4.6 and GPT-5.4. The US government’s Center for AI Standards and Innovation ran a different test and reached a different conclusion.

NIST’s CAISI released its evaluation of DeepSeek V4 Pro this month using a suite that includes two instruments DeepSeek cannot optimize against: ARC-AGI-2’s semi-private dataset and PortBench, CAISI’s internally developed software engineering evaluation. The finding: DeepSeek V4’s capabilities lag the US frontier by approximately 8 months, placing it closer to GPT-5, released in August 2025, than to GPT-5.5 or Claude’s current generation.

The benchmark breakdown is the part operators need to read carefully, because the gap is not uniform.

DeepSeek V4 Pro is near-competitive or competitive in mathematics: 96-97% on OTIS-AIME-2025, 96% on PUMaC 2024, and 96% on SMT 2025 — within 3 percentage points of GPT-5.5 across all three. On FrontierScience (expert-level scientific reasoning), V4 scores 74% — matching GPT-5.4 mini and one point behind Anthropic’s Opus 4.6. On general GPQA-Diamond reasoning, V4 scores 90%, trailing GPT-5.5’s 96% but ahead of most mid-tier alternatives.

The gaps open sharply on the harder evaluations. ARC-AGI-2 abstract reasoning: 46% for DeepSeek V4 vs 79% for GPT-5.5, a 33-point gap that is not noise but a structural capability difference in novel task generalization. CTF-Archive-Diamond cybersecurity: 32% for V4 vs 71% for GPT-5.5, a 39-point gap on a benchmark that measures whether a model can navigate difficult real-world offensive security challenges. DeepSeek V4 scores identically to Claude Opus 4.6 on CTF, which suggests both are constrained at the same ceiling rather than that V4 is uniquely weak. PortBench software engineering: 44% for V4 vs 78% for GPT-5.5, a 34-point gap on the non-contaminated CLI porting task.

The IRT-estimated Elo scores summarize the distribution: GPT-5.5 at 1260 ± 28, Claude Opus 4.6 at 999 ± 27, DeepSeek V4 Pro at 800 ± 28. On this scale, each 200-point increment represents a 3x increase in the odds of solving a given task. V4 is roughly one full increment (199 points) below Opus 4.6 and a little over two increments (460 points) below GPT-5.5 in aggregate.
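The odds interpretation can be made concrete. The following is a minimal sketch, assuming the scale behaves exactly as described (each 200 points multiplies the odds of solving a task by 3) and using only the point estimates, not the error bars. The function name `odds_ratio` is illustrative and not part of CAISI's published method.

```python
# Point estimates quoted in this issue; the +/- intervals are ignored here.
ELO = {"GPT-5.5": 1260, "Claude Opus 4.6": 999, "DeepSeek V4 Pro": 800}


def odds_ratio(stronger: str, weaker: str, points_per_3x: float = 200.0) -> float:
    """Odds multiplier implied by the Elo gap, assuming 3x odds per 200 points."""
    gap = ELO[stronger] - ELO[weaker]
    return 3.0 ** (gap / points_per_3x)


if __name__ == "__main__":
    pairs = [
        ("Claude Opus 4.6", "DeepSeek V4 Pro"),
        ("GPT-5.5", "Claude Opus 4.6"),
        ("GPT-5.5", "DeepSeek V4 Pro"),
    ]
    for stronger, weaker in pairs:
        print(f"{stronger} vs {weaker}: {odds_ratio(stronger, weaker):.1f}x odds")
```

Under that assumption, the gaps work out to roughly 3x (Opus 4.6 over V4), 4x (GPT-5.5 over Opus 4.6), and 12.5x (GPT-5.5 over V4) in the odds of solving a given task.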

The cost picture offsets this substantially. CAISI compared DeepSeek V4 against GPT-5.4 mini — the most cost-competitive US reference model — across 7 benchmarks. DeepSeek V4 was 53% less expensive in the best case and only 41% more expensive in the worst. On 5 of 7 benchmarks, V4 was the more cost-efficient option.

The operational read is not “DeepSeek V4 is inadequate.” It is that the capability profile has now been mapped honestly. For math and science tasks where V4 approaches parity with US frontier models, the cost advantage (up to 53% against GPT-5.4 mini in CAISI’s best case) is decisive: there is no rational reason to pay full GPT-5.5 rates for a workflow where 96% math accuracy and 74% science accuracy are sufficient. Routine coding may fit the same logic, but V4 trails GPT-5.5 by 34 points on PortBench, so complex software engineering does not. For abstract reasoning, novel problem generalization, and security-sensitive agentic tasks, the 33 to 39 point gaps on non-contaminated benchmarks are the signal operators needed to size the risk correctly. DeepSeek V4 is an excellent tool in its capability range. The CAISI evaluation tells you precisely where that range ends.

What this means for your stack: Run your production workloads through the CAISI capability map. If your agent pipeline is doing summarization, code generation, mathematical reasoning, or structured data extraction, V4 Pro is competitive at materially lower cost. If your agent is doing multi-step agentic planning, novel task generalization, or anything adjacent to security operations, the abstract reasoning and CTF gaps are real, and the cost savings do not cover the output quality risk.
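One way to put that guidance into a pipeline is a small routing table. This is a sketch only: the category names, the model labels, and the `route` function are hypothetical, and the groupings simply mirror the two lists in the paragraph above rather than anything CAISI publishes.

```python
# Categories where this issue reads V4 Pro as competitive at lower cost.
LOW_GAP = {
    "summarization",
    "code_generation",
    "math_reasoning",
    "structured_extraction",
}

# Categories where the ARC-AGI-2 and CTF gaps argue against V4 Pro.
HIGH_GAP = {
    "agentic_planning",
    "novel_task_generalization",
    "security_operations",
}


def route(task_category: str) -> str:
    """Return an illustrative model label for a task category."""
    if task_category in HIGH_GAP:
        return "frontier-model"
    if task_category in LOW_GAP:
        return "deepseek-v4-pro"
    # Anything unlisted has not been mapped; test it before choosing.
    return "needs-evaluation"


if __name__ == "__main__":
    for category in ("math_reasoning", "security_operations", "translation"):
        print(category, "->", route(category))
```

Defaulting unknown categories to "needs-evaluation" rather than to the cheaper model keeps the cost savings from being applied to work nobody has measured.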

3 Operator Playbooks

1. Exa Raises $250M to Build the Search Engine AI Agents Actually Need — DOMAIN: Infrastructure & DevTools

Traditional search was built for humans: latency measured in seconds, results optimized for click-through, coverage biased toward popular pages. That architecture does not serve AI agents. Agents search at scale — potentially hundreds of thousands of queries per second per deployment — and need precision, freshness, and machine-readable results rather than ranked blue links.

Exa (formerly Metaphor) closed a $250 million Series C led by Andreessen Horowitz at a $2.2 billion valuation this week, explicitly building the search infrastructure for this gap. The company operates its own independent crawler (not relying on Google or Bing as a backend), a third-generation vector database, and a search API designed for agent consumption rather than human browsing. Current users: Cursor, Cognition, HubSpot, OpenRouter, and Monday.com. 400,000+ developers are already using the API.

The capital will fund the next generation of search models and the infrastructure to handle the query volumes that production agent deployments demand. CEO Will Bryk framed the mission directly: “organizing the world’s knowledge, but this time for AI” — with comprehensiveness, freshness, and precision requirements that human-scale search was never designed to meet.

The market signal is worth reading carefully. Exa is not optimizing an existing search engine for AI — it is building a new one from first principles for a world where agents are the primary consumers of web knowledge. That is a structurally different product than Bing or Google’s search API, and the $2.2 billion valuation suggests investors believe the bottleneck is real and large.

Your move: If your agents are running web searches at any meaningful volume, audit the search infrastructure underneath them. Are you routing through standard search APIs built for human queries? Test Exa’s API on a representative sample of your agent’s actual search patterns — not benchmark queries, but the specific context-retrieval and research tasks your agents actually execute in production. The precision and freshness characteristics that matter for agents are meaningfully different from those that matter for human search interfaces. If your agents are currently producing hallucinations or outdated context in search-dependent workflows, the issue may be the search layer, not the model.
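The audit described above can be run with a provider-agnostic harness. The sketch below assumes you wrap each search provider in your own function that returns URLs and publish dates; `Result`, `Sample`, `evaluate`, and the 30-day staleness cutoff are all hypothetical names and defaults, and no specific vendor API is shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass
class Result:
    url: str
    published: Optional[date]  # None when the provider returns no date


@dataclass
class Sample:
    query: str                # a real query taken from production agent logs
    relevant_urls: set[str]   # hand-labeled ground truth for that query


SearchFn = Callable[[str], list[Result]]


def evaluate(search: SearchFn, samples: list[Sample], today: date,
             k: int = 5, max_age_days: int = 30) -> dict[str, float]:
    """Compute precision@k and the share of dated results older than the cutoff."""
    hits = total = stale = dated = 0
    for sample in samples:
        top = search(sample.query)[:k]
        total += len(top)
        hits += sum(r.url in sample.relevant_urls for r in top)
        for r in top:
            if r.published is not None:
                dated += 1
                stale += (today - r.published).days > max_age_days
    return {
        "precision_at_k": hits / total if total else 0.0,
        "stale_fraction": stale / dated if dated else 0.0,
    }


def fake_search(query: str) -> list[Result]:
    """Stand-in provider so the sketch runs without network access."""
    return [
        Result("https://a.example/x", date(2026, 5, 1)),
        Result("https://b.example/y", date(2025, 1, 1)),
    ]


if __name__ == "__main__":
    samples = [Sample("example query", {"https://a.example/x"})]
    print(evaluate(fake_search, samples, today=date(2026, 5, 20)))
```

Running the same labeled samples through each provider's wrapper gives a like-for-like comparison on your own traffic instead of on benchmark queries.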

2. TD Bank Compressed Mortgage Pre-Adjudication from 15 Hours to 3 Minutes — DOMAIN: Operator Wins & Failures

This is the production case study that belongs in every enterprise AI planning document: not a demo, not a benchmark, but a regulated financial workflow at a major bank, measured before and after.

TD Bank’s Layer 6 research team announced on May 21 that it built an agentic AI system for mortgage and HELOC pre-adjudication. Internal tests show processing time dropped from approximately 15 hours to under 3 minutes per application. The agent handles document classification, income verification, and summary memo generation — the labor-intensive analytical work that precedes a human underwriter’s final decision.

The workflow is instructive precisely because of its regulatory context. Mortgage pre-adjudication is not an experimental use case: it carries fair lending obligations, data lineage requirements, and audit trail demands. TD Bank’s public announcement suggests the system has cleared at least a preliminary compliance review. The Trustworthy AI review process the bank references is a governance gate, not a marketing label.

The numbers matter for the broader enterprise calculation. A 300x compression ratio (15 hours to 3 minutes) is not a marginal improvement; it is a workflow category change. Pre-adjudication is a throughput-constrained bottleneck in mortgage processing; the limiting factor has historically been human analyst capacity, not borrower demand. An agent that processes applications at 3 minutes each changes the staffing math, the SLA commitments, and the product design of a mortgage pipeline.
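The arithmetic behind the 300x figure, and what it implies for throughput, is short enough to check directly. This sketch assumes a single worker handling applications one after another and treats the reported "approximately 15 hours" and "under 3 minutes" as exact values; queueing, retries, and human review time are not modeled.

```python
BEFORE_MINUTES = 15 * 60   # approximately 15 hours per application
AFTER_MINUTES = 3          # under 3 minutes per application


def compression_ratio(before: float, after: float) -> float:
    """How many times faster the new process is."""
    return before / after


def per_hour(minutes_per_item: float) -> float:
    """Items one sequential worker completes in an hour."""
    return 60 / minutes_per_item


if __name__ == "__main__":
    print(f"ratio: {compression_ratio(BEFORE_MINUTES, AFTER_MINUTES):.0f}x")
    print(f"before: {per_hour(BEFORE_MINUTES):.3f} applications/hour")
    print(f"after: {per_hour(AFTER_MINUTES):.0f} applications/hour")
```

On those assumptions, one sequential worker goes from about one application every 15 hours to 20 applications per hour.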

Your move: Map your own highest-volume, highest-time-cost analytical workflows against the TD Bank pre-adjudication profile. The characteristics that made this a viable agent target: structured inputs (applications, documents), deterministic evaluation criteria (income ratios, document completeness), and a defined output format (summary memo for human review). If you have workflows that match this profile — structured document analysis, criteria-based classification, summary generation — they are agent candidates regardless of industry. Build the governance case alongside the technical one: define what “correct” looks like, establish the error rate your compliance or operations team will accept, and instrument the agent’s output for audit before deploying.
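The three characteristics named above can be turned into a simple screening checklist. This is a sketch under stated assumptions: the `Workflow` fields, the function names, and both example workflows are hypothetical, and a real screen would add the governance criteria (accepted error rate, audit instrumentation) your compliance team defines.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    structured_inputs: bool        # e.g. applications, documents
    deterministic_criteria: bool   # e.g. income ratios, document completeness
    defined_output_format: bool    # e.g. summary memo for human review


def missing_traits(w: Workflow) -> list[str]:
    """List which of the three traits the workflow lacks."""
    checks = {
        "structured inputs": w.structured_inputs,
        "deterministic evaluation criteria": w.deterministic_criteria,
        "defined output format": w.defined_output_format,
    }
    return [trait for trait, present in checks.items() if not present]


def is_agent_candidate(w: Workflow) -> bool:
    """A workflow matches the profile only when no trait is missing."""
    return not missing_traits(w)


if __name__ == "__main__":
    candidates = [
        Workflow("claims document triage", True, True, True),
        Workflow("open-ended strategy memo", False, False, True),
    ]
    for w in candidates:
        print(w.name, is_agent_candidate(w), missing_traits(w))
```

Reporting the missing traits, not just a yes or no, shows what would have to change before a workflow becomes a candidate.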

3. Anthropic’s Claude Mythos Is in Restricted Preview — And the Capability Rumor That Matters Most Is Autonomous Vulnerability Discovery — DOMAIN: Security & Trust

Anthropic has approximately 50 partner organizations in restricted preview for Claude Mythos, its next-generation model. The benchmark numbers have not been released. What has circulated through partner reports and industry coverage this week is the claim that will define the security conversation around Mythos: the model may be capable of autonomously discovering previously unknown software vulnerabilities.

This is not the same claim as “the model is good at CTF challenges.” CTF performance measures whether a model can solve known-form problems in a structured competition context. Autonomous zero-day discovery — identifying genuinely novel exploitable weaknesses in production software — is a qualitatively different capability. If the Mythos reports are accurate, it places the model in a category that changes how security teams think about AI risk, not just AI capability.

The CAISI evaluation of DeepSeek V4 (this issue’s Signal) gives context for why this matters: DeepSeek V4 and Claude Opus 4.6 both score 32% on CTF-Archive-Diamond, a controlled cybersecurity benchmark, and GPT-5.5 scores 71%. A model with genuine autonomous vulnerability discovery capability would represent a step-change beyond any of these numbers, and would explain why Anthropic is limiting access to roughly 50 organizations rather than releasing broadly.

The restricted preview model is itself a policy signal. Anthropic is not choosing limited distribution because of insufficient server capacity — it is making a deliberate access control decision about a capability with asymmetric offensive/defensive implications. The 50-partner cohort almost certainly includes major enterprise security teams, government partners, and safety researchers whose job is to characterize what the model can actually do in adversarial conditions before broader deployment.

Your move: If you run security operations, red-team, or vulnerability management programs, get on Anthropic’s enterprise radar now. The organizations in the Mythos preview cohort are building institutional knowledge about AI-assisted security operations that will translate directly into competitive advantage when the model reaches broader availability. For everyone else: the Mythos restricted preview is the earliest warning that the next capability tier for frontier AI will require new thinking about access control, not just new thinking about deployment. When a model can find vulnerabilities in software you depend on, the security model for AI in your organization needs to account for that — before the capability is widely available, not after.

Steal This

The Model Selection Matrix for Production Agent Workloads

Use the CAISI evaluation framework to route workloads to the right model based on actual capability gaps — not vendor claims. Updated with DeepSeek V4’s CAISI benchmark results.

MODEL SELECTION MATRIX — AGENT WORKLOADS (May 2026)
=====================================================
Based on NIST/CAISI non-public benchmark evaluation.
All Elo estimates: GPT-5.5 (1260), Opus 4.6 (999), DeepSeek V4 Pro (800)

TASK TYPE -> RECOMMENDED MODEL (based on capability/cost tradeoff)

MATHEMATICS & QUANTITATIVE REASONING
DeepSeek V4 Pro: 96-97% (near-parity with GPT-5.5 at 99-100%)
Recommendation: DeepSeek V4 Pro (up to 53% cost advantage, minimal accuracy loss)
Use when: financial modeling, data analysis, quantitative extraction, arithmetic

SCIENTIFIC REASONING & RESEARCH
DeepSeek V4 Pro: 74% (tied with GPT-5.4 mini, 1pt behind Opus 4.6)
Recommendation: DeepSeek V4 Pro for volume; Opus 4.6 for high-stakes decisions
Use when: literature review, research synthesis, technical documentation

GENERAL CODING & SOFTWARE ENGINEERING
DeepSeek V4 Pro: 74% SWE-Bench (vs GPT-5.5 81%, Opus 4.6 79%); 44% PortBench (vs GPT-5.5 78%)
Recommendation: Opus 4.6 or GPT-5.5 for complex refactors; V4 for boilerplate
Use when: code generation, unit tests, documentation, SQL/scripting

ABSTRACT REASONING & NOVEL TASK GENERALIZATION
DeepSeek V4 Pro: 46% ARC-AGI-2 (vs GPT-5.5 79%, Opus 4.6 63%)
Recommendation: GPT-5.5 for novel planning; Opus 4.6 minimum for agentic tasks
Do NOT use DeepSeek V4 for: open-ended planning, unfamiliar domain agents,
  creative problem-solving with high error cost

CYBERSECURITY & ADVERSARIAL TASKS
DeepSeek V4 Pro: 32% CTF (tied with Opus 4.6; GPT-5.5 at 71%)
Recommendation: GPT-5.5 for security operations; all others for non-adversarial
Do NOT use any sub-GPT-5.5 model for: red-team agents, vuln scanning,
  security code review, incident response automation

COST DECISION GATE
Current task cost/run: $____
Cost with V4 Pro equivalent (up to ~53% reduction, best case): $____
If V4 Pro accuracy delta is within your acceptable range (check above):
  -> Migrate to V4 Pro
If task falls in the Abstract Reasoning or Cybersecurity section:
  -> Do NOT migrate; accuracy loss exceeds cost savings

BENCHMARK YOUR OWN WORKLOAD
1. Sample 50 representative production tasks
2. Run through current model + DeepSeek V4 Pro
3. Score accuracy against your own ground truth (not model claims)
4. If V4 accuracy within 5 percentage points at 50%+ lower cost -> migrate
5. If accuracy delta exceeds 5 percentage points on judgment-critical tasks -> stay
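
The benchmarking steps at the end of the matrix can be sketched as a decision function. Several things here are assumptions rather than part of the matrix: the 5% threshold is read as percentage points, the case the steps leave open (accuracy holds but savings fall short of 50%) returns "review", and the function names and sample figures are hypothetical. With 50 tasks, a single task is worth 2 points, so a result near the threshold deserves a larger sample.

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of tasks scored correct against your own ground truth."""
    return sum(results) / len(results)


def migration_decision(current: list[bool], candidate: list[bool],
                       current_cost: float, candidate_cost: float,
                       max_drop: float = 0.05,
                       min_saving: float = 0.50) -> str:
    """Apply steps 4 and 5: compare accuracy drop and cost saving to thresholds."""
    drop = accuracy(current) - accuracy(candidate)
    saving = 1 - candidate_cost / current_cost
    if drop > max_drop:
        return "stay"
    if saving >= min_saving:
        return "migrate"
    return "review"  # accuracy holds, but savings are below the bar


if __name__ == "__main__":
    # Hypothetical scores on 50 sampled production tasks.
    current = [True] * 48 + [False] * 2      # 96% correct
    candidate = [True] * 46 + [False] * 4    # 92% correct
    print(migration_decision(current, candidate,
                             current_cost=1.00, candidate_cost=0.47))
```

In the example, a 4-point accuracy drop at a 53% cost saving returns "migrate"; the same accuracy at only a 20% saving would return "review".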

The Bottom Line

Three things happened this week that together describe where the production AI agent market actually stands in May 2026. The US government published an independent, non-gameable benchmark evaluation of DeepSeek V4 that gives operators the capability map they should have had before making model decisions — and the conclusion is more nuanced than either the boosters or the skeptics wanted: V4 is a cost-efficient tool for well-defined tasks and a risky choice for the agentic work where novel reasoning matters. Exa raised $250 million on the thesis that search infrastructure built for humans is architecturally inadequate for agents operating at production scale — a bet that will look prescient if agent query volume follows the trajectory every major lab is forecasting. And TD Bank published a real number from a real regulated deployment: 15 hours to 3 minutes, in mortgage pre-adjudication, at a major bank. That ratio is the business case operators in every industry should be building toward — not as a demo benchmark, but as the standard against which the cost of not automating is measured.


AI Insider is published by Digital Forge Studios Inc.
