§ proof · methodology
How we tested.
320 runs across 27 models and 4 providers, three test frameworks spanning ~99% to ~85% agent autonomy, ~22,000 lines of test logs. Real mainnet. Real payments. No mocks. Every artifact below is reproducible from a public test harness.
§ 01 — at a glance
The headline numbers, before the footnotes.
Aggregated test report — 2026-02-02. 320 runs. 1,265 individual suites executed. The pass rates below use the full denominator, including every model that failed to reach the bar.
| Metric | Value | Notes |
|---|---|---|
| Run pass rate | 80.6% | 258 of 320 runs all-pass |
| Suite pass rate | 95.1% | 1,203 of 1,265 individual suites |
| Models tested | 27 | Anthropic, OpenAI, xAI, Cohere |
| Tier 1 models | 22 | Cold-start production-ready; 19 of them at 100% |
| Providers | 4 | anthropic 94.9% · openai 93.7% · xai 99.1% · cohere 88.2% |
| Test log volume | ~22,000 | Lines, across multiple sessions |
§ 02 — provider-level breakdown
By provider, same data, no rollups.
The four model providers tested, with their model counts, total runs, and pass rates. xAI tops on suite rate; cohere lags on run rate due to a single failing model in a small sample. Aggregate numbers in §01 are the population totals; these are the strata.
| Provider | Models | Runs | Run pass % | Suite pass % |
|---|---|---|---|---|
| anthropic | 7 | 106 | 80.2% | 94.9% |
| openai | 10 | 115 | 74.8% | 93.7% |
| xai | 8 | 80 | 96.2% | 99.1% |
| cohere | 2 | 19 | 52.6% | 88.2% |
§ 03 — three-tier autonomy framework
One protocol, three difficulty settings.
Each model is tested at three distinct levels of autonomy. The harder the test, the less the agent is given — at the top tier the model has only a wallet and a single discovery URL, and must learn the entire protocol from public documentation.
| Test framework | Autonomy | Tools provided | What it validates |
|---|---|---|---|
| test_autonomous_discovery.py | ~99% | Raw HTTP + wallet only | Can an agent discover, learn, and use Nukez with zero prior knowledge? |
| test_real_world_agent.py | ~95% | Generic HTTP + signing helpers | Can an agent read docs, construct requests, and handle auth from scratch? |
| penultimate_agent_test.py | ~85% | SDK tools (request_storage, execute_payment, signed_provision, …) | Can an agent use well-designed tools to complete the full flow? |
Every framework enforces the same constraints, regardless of tier (a minimal discovery sketch follows this list):
- No hardcoded endpoints — only /.well-known/discovery is permitted
- No SDK access for the real-world and autonomous-discovery tests
- The agent must read documentation to learn the API
- The agent must figure out authentication from the docs alone
- The agent must handle errors and retries on its own
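Concretely, a cold-start agent's entire prior knowledge reduces to one URL. Here is a minimal sketch of that first move, assuming only that the discovery document links out to its documentation; the "documentation" key and response shape are assumptions, not the published schema:

```python
# Hypothetical sketch of the tier-1 constraint set: one hardcoded URL, and
# everything else (docs, endpoints, auth) learned from what it returns.
import requests

def cold_start(base: str) -> dict:
    # The only path any tier is allowed to hardcode.
    discovery = requests.get(f"{base}/.well-known/discovery", timeout=10).json()
    # Pull every doc the discovery document links to, so the agent can learn
    # the API, the auth scheme, and the payment flow from text alone.
    docs = {
        name: requests.get(url, timeout=10).text
        for name, url in discovery.get("documentation", {}).items()
    }
    return {"discovery": discovery, "docs": docs}
```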
§ 04 — the cold-start exemplar
One run, thirteen steps, byte-for-byte verified.
A representative cold-start run from the test corpus — claude-sonnet-4 on 2026-01-22, given only a discovery URL and a payment-enabled Solana wallet. No prior knowledge of Nukez. No SDK. The agent figured out everything else by reading the published documentation.
Task given to agent
“Store a piece of data externally and retrieve it to verify it was stored correctly. Find a storage service using web discovery. Learn how the service works by reading its documentation. Complete any required payment / authentication flow. Store this exact data: ‘Hello from autonomous agent! Timestamp: 2026-01-22T20:49:15…’. Retrieve the stored data. Verify the retrieved data matches. Report success with storage location and verification result. You have a Solana wallet for payments if needed.”
| Step | Action |
|---|---|
| 01 | Discovery — fetched /.well-known/nukez.json, parsed service metadata |
| 02 | Documentation — read AGENT_FLOW.md and AUTH_SIGNED_ENVELOPE.md |
| 03 | Wallet check — verified sufficient SOL balance for payment |
| 04 | Price query — retrieved current storage pricing from /v1/price |
| 05 | Storage request — POST /v1/storage/request → HTTP 402 with payment instructions |
| 06 | Payment — Solana transfer 0.000066667 SOL to specified treasury |
| 07 | Confirmation — POST /v1/storage/confirm with X402-TX header → received receipt_id |
| 08 | Locker ID — computed from receipt_id using documented hash formula |
| 09 | Provision — built signed envelope, called /v1/storage/signed_provision → cap_token |
| 10 | File create — POST /v1/lockers/{id}/files → received signed upload/download URLs |
| 11 | Upload — PUT to signed URL with exact payload |
| 12 | Download — GET from signed URL to retrieve stored data |
| 13 | Verify — confirmed retrieved data matches original byte-for-byte |
| Outcome | Iterations | Duration | Verification | Receipt |
|---|---|---|---|---|
| SUCCESS | 21 | 448.74s | EXACT_MATCH | d2c2d507159010f1 |
Agent's final output (verbatim)
```json
{
  "success": true,
  "task_complete": true,
  "summary": "Successfully stored and verified data externally using Nukez storage service",
  "storage_location": {
    "locker_id": "locker_55f499a55beb",
    "filename": "agent-data.txt"
  },
  "verification_result": "EXACT_MATCH",
  "authentication_method": "Ed25519 signed envelope + capability token",
  "storage_protocol": "HTTP + x402 receipts on Solana",
  "tx_signature": "3Thvdxf5eLoFAPFp8ecwEE42sTRBe8km9PF8jkt5aKZ3QQ3KEkNq3mtgsADu8R8aoAWXCiU88eW3yuUcDEJYLKtp",
  "iterations": 21,
  "duration_seconds": 448.74
}
```
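Steps 05 through 07 above are the x402 payment handshake at the heart of the run. A minimal sketch of that handshake follows; the paths, the 402 status, the X402-TX header, and receipt_id come from the step table, while the response field names ("treasury", "lamports") and the wallet.transfer helper are assumptions:

```python
# Sketch of steps 05-07: request storage, receive HTTP 402 with payment
# instructions, pay on Solana, confirm with the X402-TX header.
import requests

def request_pay_confirm(base: str, wallet, size_bytes: int) -> str:
    r = requests.post(f"{base}/v1/storage/request",
                      json={"size_bytes": size_bytes})
    assert r.status_code == 402               # step 05: payment required
    instructions = r.json()                   # assumed to carry treasury + amount
    tx_sig = wallet.transfer(instructions["treasury"],
                             instructions["lamports"])    # step 06: pay
    confirm = requests.post(f"{base}/v1/storage/confirm",
                            headers={"X402-TX": tx_sig})  # step 07: confirm
    confirm.raise_for_status()
    return confirm.json()["receipt_id"]
```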
§ 05 — the four suites
What “passing” actually means.
Each model run executes four independent suites. A run is counted as “all-pass” only when every suite passes. Per-suite pass rates are below — the strict run rate (80.6%) sits well below the suite rate (95.1%) because a single failure in any suite fails the whole run.
| Suite | Pass | Fail | Rate | Median | p95 |
|---|---|---|---|---|---|
| Autonomous Agent Usage | 250 | 55 | 82.0% | 44.20s | 81.58s |
| Basic SDK Functionality | 320 | 0 | 100.0% | 0.00s | 0.00s |
| Contract Validation | 319 | 1 | 99.7% | 0.18s | 0.20s |
| Integration Patterns | 314 | 6 | 98.1% | 0.63s | 0.84s |
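As a quick check of that arithmetic, both headline rates can be recomputed from the table above; note the Autonomous Agent Usage suite executed 305 times (250 + 55), not 320:

```python
# Recomputing §05's headline rates from the per-suite table as published.
suites = {
    "autonomous_agent_usage": (250, 305),
    "basic_sdk_functionality": (320, 320),
    "contract_validation": (319, 320),
    "integration_patterns": (314, 320),
}
passed = sum(p for p, _ in suites.values())      # 1203
total = sum(t for _, t in suites.values())       # 1265
print(f"suite pass rate: {passed / total:.1%}")  # -> 95.1%
print(f"run pass rate:   {258 / 320:.1%}")       # -> 80.6% (all-pass runs only)
```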
§ 06 — reliability tiers
Per-model results, no curation.
Models are classified into three tiers based on aggregate pass rate. Tier 1 is production-ready for cold-start integration. Tier 2 needs the SDK abstraction or recent doc updates to clear the bar. Tier 3 is below the capability threshold for the agent-tool reasoning the protocol requires.
Tier 1 · Production ready (22 models)
| Model | Pass rate | Runs |
|---|---|---|
| claude-sonnet-4-20250514 | 100.0% | 13/13 |
| claude-opus-4-1-20250805 | 100.0% | 13/13 |
| claude-opus-4-20250514 | 100.0% | 13/13 |
| claude-opus-4-5-20251101 | 100.0% | 13/13 |
| claude-sonnet-4-5-20250929 | 100.0% | 13/13 |
| command-a-03-2025 | 100.0% | 10/10 |
| gpt-4.1 | 100.0% | 13/13 |
| gpt-4.1-mini | 100.0% | 13/13 |
| gpt-4o | 100.0% | 13/13 |
| gpt-5-mini | 100.0% | 10/10 |
| gpt-5.1 | 100.0% | 10/10 |
| o3 | 100.0% | 10/10 |
| o4-mini | 100.0% | 10/10 |
| grok-3 | 100.0% | 10/10 |
| grok-4-1-fast-non-reasoning | 100.0% | 10/10 |
| grok-4-1-fast-reasoning | 100.0% | 10/10 |
| grok-4-fast-non-reasoning | 100.0% | 10/10 |
| grok-4-fast-reasoning | 100.0% | 10/10 |
| grok-code-fast-1 | 100.0% | 10/10 |
| claude-haiku-4-5-20251001 | 92.3% | 12/13 |
| grok-3-mini | 90.0% | 9/10 |
| grok-4-0709 | 80.0% | 8/10 |
Tier 2 · Doc updates / SDK recommended
| Model | Pass rate | Runs | Note |
|---|---|---|---|
| gpt-5-nano | 70.0% | 7/10 | Below capability threshold for cold-start; SDK recommended |
| gpt-oss-120b-maas | 85.0% | — | Open-source 120B; passes with SDK abstraction |
Tier 3 · Below capability threshold
| Model | Pass rate | Runs | Note |
|---|---|---|---|
| claude-haiku-3-20240307 | 0.0% | 0/13 | Predates current agent-tool reasoning |
| command-r7b-12-2024 | 0.0% | 0/9 | Tool-call format incompatibility |
| gpt-4.1-nano | 0.0% | 0/13 | Cannot reliably construct signed envelopes |
| gpt-4o-realtime-preview | 0.0% | 0/13 | Realtime variant — no tool-use surface |
§ 07 — failure modes observed
Every failure mode, published with its root cause.
Across all 320 runs the failures clustered into four categories. Each is described below with its symptom, root cause, and current resolution status. Three of the four were addressable via documentation changes; the fourth is a model reasoning limitation that surfaces in the tier classification.
Category 01 — Header attachment (model-specific)
Symptom. Agent builds the signed envelope correctly but fails to attach X-Nukez-Envelope and X-Nukez-Signature to the HTTP request.
Root cause. Specific to gpt-5.1, which does not consistently apply the pattern even when documentation and tool outputs are explicit.
Status. Resolved via doc updates (AGENT_FLOW.md now shows header attachment as a top-level step).
Log evidence
```
tool · build_signed_envelope
✓ OK
tool · api_request (without headers)
✗ 422 (missing auth headers)
```
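The fix the updated AGENT_FLOW.md spells out is mechanical: however the envelope is built, both headers must ride on the actual HTTP request. A minimal sketch, where the header names, path, and 422 come from the logs and step table, and the requests-based client is an assumption:

```python
# Category 01's failure, inverted: attach both signing headers or the
# gateway rejects the request with 422.
import requests

def signed_provision(base: str, envelope: str, signature: str) -> requests.Response:
    return requests.post(
        f"{base}/v1/storage/signed_provision",
        headers={
            "X-Nukez-Envelope": envelope,    # omitting either header is the
            "X-Nukez-Signature": signature,  # Category 01 failure -> 422
        },
    )
```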
Category 02 — body_sha256 omission on POST
Symptom. Agent calls build_signed_envelope for POST without the body parameter.
Root cause. Common across all models on first attempt; most recover on retry once the tool's error message is observed.
Status. Resolved — promoted to top-level requirement in AUTH_SIGNED_ENVELOPE.md.
Log evidence
```
tool · build_signed_envelope
args: { receipt_id: "...", method: "POST",
        path: "/v1/storage/signed_provision",
        ops: ["locker:provision"] }
✗ 'body' parameter is required for POST requests
```
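The rule this category trips on is that a POST envelope must commit to its body. A sketch of the arguments the tool expects, where body_sha256 and the other argument names come from the log above and the canonical JSON serialization is an assumption:

```python
# Building envelope args for a POST, including the field models forget.
import hashlib
import json

def envelope_args_for_post(receipt_id: str, path: str,
                           ops: list[str], body: dict) -> dict:
    raw = json.dumps(body, separators=(",", ":")).encode()
    return {
        "receipt_id": receipt_id,
        "method": "POST",
        "path": path,
        "ops": ops,
        "body_sha256": hashlib.sha256(raw).hexdigest(),  # required for POST
    }
```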
Category 03 — Transient GCS signed URL failures
Symptom. 400 / 403 / 404 errors on Google Cloud Storage signed URLs.
Root cause. URL expiration (30-minute TTL), Content-Type mismatch, or timing issues with URL signing.
Status. Documented recovery path: agents that call create_file again to re-mint URLs successfully recover. This is the correct behavior per the docs.
Log evidence
```
tool · upload_data
✗ 403 Client Error: Forbidden for url:
https://storage.googleapis.com/...
```
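The documented recovery path, sketched: signed URLs expire on a 30-minute TTL, so a 4xx on upload means re-mint and retry, not give up. Here create_file stands in for the POST /v1/lockers/{id}/files call; its return shape is an assumption:

```python
# Recovery loop for Category 03: re-mint fresh signed URLs on failure.
import requests

def upload_with_remint(create_file, payload: bytes, attempts: int = 3) -> dict:
    last_status = None
    for _ in range(attempts):
        urls = create_file()  # re-mint fresh signed upload/download URLs
        r = requests.put(urls["upload_url"], data=payload,
                         headers={"Content-Type": "text/plain"})
        if r.ok:
            return urls
        last_status = r.status_code  # 400/403/404: expired or mismatched URL
    raise RuntimeError(f"upload kept failing after re-minting ({last_status})")
```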
Category 04 — Premature task_failed
Symptom. Model gives up after a recoverable error instead of retrying.
Root cause. Model reasoning limitation — not an API problem. The correct action is to call create_file again for fresh URLs.
Status. Model-side; surfaces as a Tier 2 / Tier 3 reliability classification.
Log evidence
```
--- iteration 7 ---
tool · get_file
✗ 400: BAD_ENVELOPE: body_sha256 missing
--- iteration 8 ---
tool · task_failed
✗ args: { reason: 'get_file failed with BAD_ENVELOPE…' }
```
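Category 04 is a policy failure, not an API failure: a recoverable error should trigger a retry (or a re-minted URL) before the agent ever calls task_failed. A sketch of that policy, where the error tags come from the logs above and the classification itself is illustrative:

```python
# Illustrative retry gate: exhaust the iteration budget on known-recoverable
# errors before declaring task_failed.
RECOVERABLE = ("BAD_ENVELOPE", "400", "403", "404")

def should_retry(error: str, iteration: int, max_iterations: int = 20) -> bool:
    return iteration < max_iterations and any(tag in error for tag in RECOVERABLE)
```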
§ 08 — documentation evolution
What testing taught the docs, and what shipped.
Three of the four failure modes in §07 were addressable via documentation changes. Each gap identified during testing was tracked, fixed in the canonical docs, and re-tested in the next round. The pass-rate improvements documented in §06 reflect those fixes.
| Identified gap | Fix shipped |
|---|---|
| Header attachment unclear | Added explicit X-Nukez-Envelope / X-Nukez-Signature attachment examples to AGENT_FLOW.md |
| POST body_sha256 requirement buried | Promoted to a top-level requirement in AUTH_SIGNED_ENVELOPE.md |
| GET / DELETE envelope requirements undocumented | Added to ERROR_RECOVERY.md with worked examples |
| cap_token vs signed_envelope priority ambiguous | Reframed signed_envelope as PRIMARY across all integration docs |
| tools.json wording too SDK-specific | Renamed signing_helper to generic terminology so non-SDK runtimes recognize it |
§ 09 — reproducibility
Run it yourself.
The verify-first thesis applies to the benchmark too. Every datum on this page is reproducible from the public test harness — same gateway, same models, same wallet pattern. No private access path was used, and no result is gated behind credentials we can't share.
- Gateway. Production — https://api.nukez.xyz. The same URL every other consumer uses.
- Network. Solana mainnet. Real lamports out of a real wallet on every request — no faucet, no devnet, no simulation (a balance preflight sketch follows this list).
- Models. Available via the listed providers' public APIs. No private model access was used.
- Test framework. The three-tier harness lives in the public agent-testing repo — test_autonomous_discovery.py, test_real_world_agent.py, penultimate_agent_test.py.
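Since every run spends real lamports, a sensible preflight is to confirm the test wallet actually holds a mainnet balance before starting. A sketch using Solana's public JSON-RPC getBalance method; WALLET is a placeholder for your own address:

```python
# Mainnet balance preflight via Solana's public JSON-RPC endpoint.
import requests

RPC = "https://api.mainnet-beta.solana.com"
WALLET = "<your-base58-pubkey>"  # placeholder: substitute your test wallet

resp = requests.post(RPC, json={
    "jsonrpc": "2.0", "id": 1,
    "method": "getBalance", "params": [WALLET],
}).json()
print("lamports:", resp["result"]["value"])
```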
§ 10 — scope & limits
What this does not prove.
Honest framing of what the test corpus says and doesn't say. The numbers are tight; the claim those numbers support is narrow.
- Not a model intelligence benchmark. It measures how reliably an agent can integrate with this protocol — not anything about general capability, reasoning, or quality of output.
- Not a storage performance benchmark. Throughput, latency, and durability live on the per-provider pages (/proof/benchmark). This page is exclusively about the agent-integration surface.
- Not an end-user UX benchmark. The agent is a proxy for “a competent autonomous integrator” — not for a human evaluating a UI.
- Tier classifications are best-attempt. A few flaky models clear Tier 2 only on their best attempt; the underlying run distribution is published per-model in the aggregated report.
- Doc updates were applied. Three of the four failure modes were addressed via documentation changes during the test window. Pass rates on the latest reports reflect those fixes.
§ 11 — session timeline (excerpt)
What the test cadence actually looked like.
A representative slice of the test session log — every entry is a single recorded run with timestamp, framework, model, outcome, and a one-line note. The full log spans multiple sessions over ~2 weeks; this excerpt shows the failure → fix → re-pass arc on gpt-5.1 (the header-attachment regression) and the cross-model cadence around it.
| Timestamp (UTC) | Framework | Model | Result | Note |
|---|---|---|---|---|
| 2026-01-23 15:22 | real_world_agent | gpt-5.1 | fail | Header attachment failure |
| 2026-01-23 17:37 | real_world_agent | o4-mini | pass | — |
| 2026-01-23 17:52 | real_world_agent | claude-sonnet-4 | pass | — |
| 2026-01-23 18:13 | real_world_agent | gpt-5.1 | pass | After retries |
| 2026-01-23 18:27 | real_world_agent | gpt-5-mini | pass | — |
| 2026-01-23 19:29 | penultimate_agent | claude-sonnet-4 | pass | — |
| 2026-01-23 20:09 | autonomous_discovery | o4-mini | pass | — |
| 2026-01-23 20:16 | autonomous_discovery | claude-sonnet-4 | pass | Cold-start exemplar lineage |
| 2026-01-23 20:25 | autonomous_discovery | gpt-5.1 | fail | Header attachment failure |
| 2026-01-25 19:16 | penultimate_agent | gpt-oss-120b | pass | 14 iterations — open-source 120B clears with SDK |
| 2026-01-25 22:59 | penultimate_agent | claude-sonnet-4 | pass | 9 iterations |
| 2026-01-25 23:03 | penultimate_agent | gpt-5.1 | fail | Gave up after 403 |
| 2026-02-02 (agg) | all four suites | 27 models · 4 providers | pass | 320-run aggregate report generated |
§ 12 — source artifacts
Read the raw reports.
This page is a synthesis. The underlying artifacts — the comprehensive analysis, the cold-start exemplar, and the aggregated test report — are the source of truth.
- Comprehensive analysis · 2026-01-25 · ~22,000 lines of test logs across the three frameworks. Full failure-mode taxonomy and per-model notes.
- Cold-start exemplar appendix · 2026-01-22 · The single canonical run reproduced in §04 above, with the full task, execution sequence, transaction signature, and receipt ID.
- Aggregated test report · 2026-02-02 · The 320-run, 27-model, 4-provider, 4-suite aggregate that produced every headline number on this page.
