§ proof · methodology
How we tested.
320 runs across 27 models and 4 providers, three test frameworks spanning ~99% to ~85% agent autonomy, ~22,000 lines of test logs. Real mainnet. Real payments. No mocks. Every artifact below is reproducible from a public test harness.
§ 01 — at a glance
The headline numbers, before the footnotes.
Aggregated test report — 2026-02-02. 320 runs. 1,265 individual suites executed. The pass rates below use the full denominator, including every model that failed to reach the bar.
| Metric | Value | Notes |
|---|---|---|
| Run pass rate | 80.6% | 258 of 320 runs all-pass |
| Suite pass rate | 95.1% | 1,203 of 1,265 individual suites |
| Models tested | 27 | Anthropic, OpenAI, xAI, Cohere |
| Tier 1 models | 22 | Cold-start production-ready; 19 of them at 100% |
| Providers | 4 | anthropic 94.9% · openai 93.7% · xai 99.1% · cohere 88.2% |
| Test log volume | ~22,000 | Lines, across multiple sessions |
§ 02 — provider-level breakdown
By provider, same data, no rollups.
The four model providers tested, with their model counts, total runs, and pass rates. xAI tops on suite rate; cohere lags on run rate due to a single failing model in a small sample. Aggregate numbers in §01 are the population totals; these are the strata.
| Provider | Models | Runs | Run pass % | Suite pass % |
|---|---|---|---|---|
| anthropic | 7 | 106 | 80.2% | 94.9% |
| openai | 10 | 115 | 74.8% | 93.7% |
| xai | 8 | 80 | 96.2% | 99.1% |
| cohere | 2 | 19 | 52.6% | 88.2% |
§ 03 — three-tier autonomy framework
One protocol, three difficulty settings.
Each model is tested at three distinct levels of autonomy. The harder the test, the less the agent is given — at the top tier the model has only a wallet and a single discovery URL, and must learn the entire protocol from public documentation.
| Test framework | Autonomy | Tools provided | What it validates |
|---|---|---|---|
| test_autonomous_discovery.py | ~99% | Raw HTTP + wallet only | Can an agent discover, learn, and use Nukez with zero prior knowledge? |
| test_real_world_agent.py | ~95% | Generic HTTP + signing helpers | Can an agent read docs, construct requests, and handle auth from scratch? |
| penultimate_agent_test.py | ~85% | SDK tools (request_storage, execute_payment, signed_provision, …) | Can an agent use well-designed tools to complete the full flow? |
Every framework enforces the same constraints, regardless of tier (a minimal discovery sketch follows this list):
- No hardcoded endpoints — only /.well-known/discovery is permitted
- No SDK access for the real-world and autonomous-discovery tests
- The agent must read documentation to learn the API
- The agent must figure out authentication from the docs alone
- The agent must handle errors and retries on its own
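Concretely, a cold-start agent's entire prior knowledge reduces to one URL. Here is a minimal sketch of that first move, assuming only that the discovery document links out to its documentation; the "documentation" key and response shape are assumptions, not the published schema:

```python
# Hypothetical sketch of the tier-1 constraint set: one hardcoded URL, and
# everything else (docs, endpoints, auth) learned from what it returns.
import requests

def cold_start(base: str) -> dict:
    # The only path any tier is allowed to hardcode.
    discovery = requests.get(f"{base}/.well-known/discovery", timeout=10).json()
    # Pull every doc the discovery document links to, so the agent can learn
    # the API, the auth scheme, and the payment flow from text alone.
    docs = {
        name: requests.get(url, timeout=10).text
        for name, url in discovery.get("documentation", {}).items()
    }
    return {"discovery": discovery, "docs": docs}
```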
§ 04 — the cold-start exemplar
One run, thirteen steps, byte-for-byte verified.
A representative cold-start run from the test corpus — claude-sonnet-4 on 2026-01-22, given only a discovery URL and a payment-enabled Solana wallet. No prior knowledge of Nukez. No SDK. The agent figured out everything else by reading the published documentation.
Task given to agent
“Store a piece of data externally and retrieve it to verify it was stored correctly. Find a storage service using web discovery. Learn how the service works by reading its documentation. Complete any required payment / authentication flow. Store this exact data: ‘Hello from autonomous agent! Timestamp: 2026-01-22T20:49:15…’. Retrieve the stored data. Verify the retrieved data matches. Report success with storage location and verification result. You have a Solana wallet for payments if needed.”
| Step | Action |
|---|---|
| 01 | Discovery — fetched /.well-known/nukez.json, parsed service metadata |
| 02 | Documentation — read AGENT_FLOW.md and AUTH_SIGNED_ENVELOPE.md |
| 03 | Wallet check — verified sufficient SOL balance for payment |
| 04 | Price query — retrieved current storage pricing from /v1/price |
| 05 | Storage request — POST /v1/storage/request → HTTP 402 with payment instructions |
| 06 | Payment — Solana transfer 0.000066667 SOL to specified treasury |
| 07 | Confirmation — POST /v1/storage/confirm with X402-TX header → received receipt_id |
| 08 | Locker ID — computed from receipt_id using documented hash formula |
| 09 | Provision — built signed envelope, called /v1/storage/signed_provision → cap_token |
| 10 | File create — POST /v1/lockers/{id}/files → received signed upload/download URLs |
| 11 | Upload — PUT to signed URL with exact payload |
| 12 | Download — GET from signed URL to retrieve stored data |
| 13 | Verify — confirmed retrieved data matches original byte-for-byte |
| Outcome | Iterations | Duration | Verification | Receipt |
|---|---|---|---|---|
| SUCCESS | 21 | 448.74s | EXACT_MATCH | d2c2d507159010f1 |
Agent's final output (verbatim)
```json
{
  "success": true,
  "task_complete": true,
  "summary": "Successfully stored and verified data externally using Nukez storage service",
  "storage_location": {
    "locker_id": "locker_55f499a55beb",
    "filename": "agent-data.txt"
  },
  "verification_result": "EXACT_MATCH",
  "authentication_method": "Ed25519 signed envelope + capability token",
  "storage_protocol": "HTTP + x402 receipts on Solana",
  "tx_signature": "3Thvdxf5eLoFAPFp8ecwEE42sTRBe8km9PF8jkt5aKZ3QQ3KEkNq3mtgsADu8R8aoAWXCiU88eW3yuUcDEJYLKtp",
  "iterations": 21,
  "duration_seconds": 448.74
}
```
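Steps 05 through 07 above are the x402 payment handshake at the heart of the run. A minimal sketch of that handshake follows; the paths, the 402 status, the X402-TX header, and receipt_id come from the step table, while the response field names ("treasury", "lamports") and the wallet.transfer helper are assumptions:

```python
# Sketch of steps 05-07: request storage, receive HTTP 402 with payment
# instructions, pay on Solana, confirm with the X402-TX header.
import requests

def request_pay_confirm(base: str, wallet, size_bytes: int) -> str:
    r = requests.post(f"{base}/v1/storage/request",
                      json={"size_bytes": size_bytes})
    assert r.status_code == 402               # step 05: payment required
    instructions = r.json()                   # assumed to carry treasury + amount
    tx_sig = wallet.transfer(instructions["treasury"],
                             instructions["lamports"])    # step 06: pay
    confirm = requests.post(f"{base}/v1/storage/confirm",
                            headers={"X402-TX": tx_sig})  # step 07: confirm
    confirm.raise_for_status()
    return confirm.json()["receipt_id"]
```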
§ 05 — the four suites
What “passing” actually means.
Each model run executes four independent suites. A run is counted as “all-pass” only when every suite passes. Per-suite pass rates are below — the strict run rate (80.6%) sits well below the suite rate (95.1%) because a single failure in any suite fails the whole run.
| Suite | Pass | Fail | Rate | Median | p95 |
|---|---|---|---|---|---|
| Autonomous Agent Usage | 250 | 55 | 82.0% | 44.20s | 81.58s |
| Basic SDK Functionality | 320 | 0 | 100.0% | 0.00s | 0.00s |
| Contract Validation | 319 | 1 | 99.7% | 0.18s | 0.20s |
| Integration Patterns | 314 | 6 | 98.1% | 0.63s | 0.84s |
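As a quick check of that arithmetic, both headline rates can be recomputed from the table above; note the Autonomous Agent Usage suite executed 305 times (250 + 55), not 320:

```python
# Recomputing §05's headline rates from the per-suite table as published.
suites = {
    "autonomous_agent_usage": (250, 305),
    "basic_sdk_functionality": (320, 320),
    "contract_validation": (319, 320),
    "integration_patterns": (314, 320),
}
passed = sum(p for p, _ in suites.values())      # 1203
total = sum(t for _, t in suites.values())       # 1265
print(f"suite pass rate: {passed / total:.1%}")  # -> 95.1%
print(f"run pass rate:   {258 / 320:.1%}")       # -> 80.6% (all-pass runs only)
```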
§ 06 — reliability tiers
Per-model results, no curation.
Models are classified into three tiers based on aggregate pass rate. Tier 1 is production-ready for cold-start integration. Tier 2 needs the SDK abstraction or recent doc updates to clear the bar. Tier 3 is below the capability threshold for the agent-tool reasoning the protocol requires.
Tier 1 · Production ready (22 models)
| Model | Pass rate | Runs |
|---|---|---|
| claude-sonnet-4-20250514 | 100.0% | 13/13 |
| claude-opus-4-1-20250805 | 100.0% | 13/13 |
| claude-opus-4-20250514 | 100.0% | 13/13 |
| claude-opus-4-5-20251101 | 100.0% | 13/13 |
| claude-sonnet-4-5-20250929 | 100.0% | 13/13 |
| command-a-03-2025 | 100.0% | 10/10 |
| gpt-4.1 | 100.0% | 13/13 |
| gpt-4.1-mini | 100.0% | 13/13 |
| gpt-4o | 100.0% | 13/13 |
| gpt-5-mini | 100.0% | 10/10 |
| gpt-5.1 | 100.0% | 10/10 |
| o3 | 100.0% | 10/10 |
| o4-mini | 100.0% | 10/10 |
| grok-3 | 100.0% | 10/10 |
| grok-4-1-fast-non-reasoning | 100.0% | 10/10 |
| grok-4-1-fast-reasoning | 100.0% | 10/10 |
| grok-4-fast-non-reasoning | 100.0% | 10/10 |
| grok-4-fast-reasoning | 100.0% | 10/10 |
| grok-code-fast-1 | 100.0% | 10/10 |
| claude-haiku-4-5-20251001 | 92.3% | 12/13 |
| grok-3-mini | 90.0% | 9/10 |
| grok-4-0709 | 80.0% | 8/10 |
Tier 2 · Doc updates / SDK recommended
| Model | Pass rate | Runs | Note |
|---|---|---|---|
| gpt-5-nano | 70.0% | 7/10 | Below capability threshold for cold-start; SDK recommended |
| gpt-oss-120b-maas | 85.0% | — | Open-source 120B; passes with SDK abstraction |
Tier 3 · Below capability threshold
| Model | Pass rate | Runs | Note |
|---|---|---|---|
| claude-haiku-3-20240307 | 0.0% | 0/13 | Predates current agent-tool reasoning |
| command-r7b-12-2024 | 0.0% | 0/9 | Tool-call format incompatibility |
| gpt-4.1-nano | 0.0% | 0/13 | Cannot reliably construct signed envelopes |
| gpt-4o-realtime-preview | 0.0% | 0/13 | Realtime variant — no tool-use surface |
§ 07 — failure modes observed
Every failure mode, published with its root cause.
Across all 320 runs the failures clustered into four categories. Each is described below with its symptom, root cause, and current resolution status. Three of the four were addressable via documentation changes; the fourth is a model reasoning limitation that surfaces in the tier classification.
Category 01 — Header attachment (model-specific)
Symptom. Agent builds the signed envelope correctly but fails to attach X-Nukez-Envelope and X-Nukez-Signature to the HTTP request.
Root cause. Specific to gpt-5.1, which does not consistently apply the pattern even when documentation and tool outputs are explicit.
Status. Resolved via doc updates (AGENT_FLOW.md now shows header attachment as a top-level step).
Log evidence
```
tool · build_signed_envelope
✓ OK
tool · api_request (without headers)
✗ 422 (missing auth headers)
```
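The fix the updated AGENT_FLOW.md spells out is mechanical: however the envelope is built, both headers must ride on the actual HTTP request. A minimal sketch, where the header names, path, and 422 come from the logs and step table, and the requests-based client is an assumption:

```python
# Category 01's failure, inverted: attach both signing headers or the
# gateway rejects the request with 422.
import requests

def signed_provision(base: str, envelope: str, signature: str) -> requests.Response:
    return requests.post(
        f"{base}/v1/storage/signed_provision",
        headers={
            "X-Nukez-Envelope": envelope,    # omitting either header is the
            "X-Nukez-Signature": signature,  # Category 01 failure -> 422
        },
    )
```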
Category 02 — body_sha256 omission on POST
Symptom. Agent calls build_signed_envelope for POST without the body parameter.
Root cause. Common across all models on first attempt; most recover on retry once the tool's error message is observed.
Status. Resolved — promoted to top-level requirement in AUTH_SIGNED_ENVELOPE.md.
Log evidence
```
tool · build_signed_envelope
args: { receipt_id: "...", method: "POST",
        path: "/v1/storage/signed_provision",
        ops: ["locker:provision"] }
✗ 'body' parameter is required for POST requests
```
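The rule this category trips on is that a POST envelope must commit to its body. A sketch of the arguments the tool expects, where body_sha256 and the other argument names come from the log above and the canonical JSON serialization is an assumption:

```python
# Building envelope args for a POST, including the field models forget.
import hashlib
import json

def envelope_args_for_post(receipt_id: str, path: str,
                           ops: list[str], body: dict) -> dict:
    raw = json.dumps(body, separators=(",", ":")).encode()
    return {
        "receipt_id": receipt_id,
        "method": "POST",
        "path": path,
        "ops": ops,
        "body_sha256": hashlib.sha256(raw).hexdigest(),  # required for POST
    }
```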
Category 03 — Transient GCS signed URL failures
Symptom. 400 / 403 / 404 errors on Google Cloud Storage signed URLs.
Root cause. URL expiration (30-minute TTL), Content-Type mismatch, or timing issues with URL signing.
Status. Documented recovery path: agents that call create_file again to re-mint URLs successfully recover. This is the correct behavior per the docs.
Log evidence
```
tool · upload_data
✗ 403 Client Error: Forbidden for url:
https://storage.googleapis.com/...
```
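The documented recovery path, sketched: signed URLs expire on a 30-minute TTL, so a 4xx on upload means re-mint and retry, not give up. Here create_file stands in for the POST /v1/lockers/{id}/files call; its return shape is an assumption:

```python
# Recovery loop for Category 03: re-mint fresh signed URLs on failure.
import requests

def upload_with_remint(create_file, payload: bytes, attempts: int = 3) -> dict:
    last_status = None
    for _ in range(attempts):
        urls = create_file()  # re-mint fresh signed upload/download URLs
        r = requests.put(urls["upload_url"], data=payload,
                         headers={"Content-Type": "text/plain"})
        if r.ok:
            return urls
        last_status = r.status_code  # 400/403/404: expired or mismatched URL
    raise RuntimeError(f"upload kept failing after re-minting ({last_status})")
```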
Category 04 — Premature task_failed
Symptom. Model gives up after a recoverable error instead of retrying.
Root cause. Model reasoning limitation — not an API problem. The correct action is to call create_file again for fresh URLs.
Status. Model-side; surfaces as a Tier 2 / Tier 3 reliability classification.
Log evidence
```
--- iteration 7 ---
tool · get_file
✗ 400: BAD_ENVELOPE: body_sha256 missing
--- iteration 8 ---
tool · task_failed
✗ args: { reason: 'get_file failed with BAD_ENVELOPE…' }
```
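Category 04 is a policy failure, not an API failure: a recoverable error should trigger a retry (or a re-minted URL) before the agent ever calls task_failed. A sketch of that policy, where the error tags come from the logs above and the classification itself is illustrative:

```python
# Illustrative retry gate: exhaust the iteration budget on known-recoverable
# errors before declaring task_failed.
RECOVERABLE = ("BAD_ENVELOPE", "400", "403", "404")

def should_retry(error: str, iteration: int, max_iterations: int = 20) -> bool:
    return iteration < max_iterations and any(tag in error for tag in RECOVERABLE)
```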
§ 08 — documentation evolution
What testing taught the docs, and what shipped.
Three of the four failure modes in §07 were addressable via documentation changes. Each gap identified during testing was tracked, fixed in the canonical docs, and re-tested in the next round. The pass-rate improvements documented in §06 reflect those fixes.
| Identified gap | Fix shipped |
|---|---|
| Header attachment unclear | Added explicit X-Nukez-Envelope / X-Nukez-Signature attachment examples to AGENT_FLOW.md |
| POST body_sha256 requirement buried | Promoted to a top-level requirement in AUTH_SIGNED_ENVELOPE.md |
| GET / DELETE envelope requirements undocumented | Added to ERROR_RECOVERY.md with worked examples |
| cap_token vs signed_envelope priority ambiguous | Reframed signed_envelope as PRIMARY across all integration docs |
| tools.json wording too SDK-specific | Renamed signing_helper to generic terminology so non-SDK runtimes recognize it |
§ 09 — reproducibility
Run it yourself.
The verify-first thesis applies to the benchmark too. Every datum on this page is reproducible from the public test harness — same gateway, same models, same wallet pattern. No private access path was used, and no result is gated behind credentials we can't share.
- Gateway. Production — https://api.nukez.xyz. The same URL every other consumer uses.
- Network. Solana mainnet. Real lamports out of a real wallet on every request — no faucet, no devnet, no simulation (a balance preflight sketch follows this list).
- Models. Available via the listed providers' public APIs. No private model access was used.
- Test framework. The three-tier harness lives in the public agent-testing repo — test_autonomous_discovery.py, test_real_world_agent.py, penultimate_agent_test.py.
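Since every run spends real lamports, a sensible preflight is to confirm the test wallet actually holds a mainnet balance before starting. A sketch using Solana's public JSON-RPC getBalance method; WALLET is a placeholder for your own address:

```python
# Mainnet balance preflight via Solana's public JSON-RPC endpoint.
import requests

RPC = "https://api.mainnet-beta.solana.com"
WALLET = "<your-base58-pubkey>"  # placeholder: substitute your test wallet

resp = requests.post(RPC, json={
    "jsonrpc": "2.0", "id": 1,
    "method": "getBalance", "params": [WALLET],
}).json()
print("lamports:", resp["result"]["value"])
```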
§ 10 — scope & limits
What this does not prove.
Honest framing of what the test corpus says and doesn't say. The numbers are tight; the claim those numbers support is narrow.
- Not a model intelligence benchmark. It measures how reliably an agent can integrate with this protocol — not anything about general capability, reasoning, or quality of output.
- Not a storage performance benchmark. Throughput, latency, and durability live on the per-provider pages (/proof/benchmark). This page is exclusively about the agent-integration surface.
- Not an end-user UX benchmark. The agent is a proxy for “a competent autonomous integrator” — not for a human evaluating a UI.
- Tier classifications are best-attempt. A few flaky models clear Tier 2 only on their best attempt; the underlying run distribution is published per-model in the aggregated report.
- Doc updates were applied. Three of the four failure modes were addressed via documentation changes during the test window. Pass rates on the latest reports reflect those fixes.
§ 11 — session timeline (excerpt)
What the test cadence actually looked like.
A representative slice of the test session log — every entry is a single recorded run with timestamp, framework, model, outcome, and a one-line note. The full log spans multiple sessions over ~2 weeks; this excerpt shows the failure → fix → re-pass arc on gpt-5.1 (the header-attachment regression) and the cross-model cadence around it.
| Timestamp (UTC) | Framework | Model | Result | Note |
|---|---|---|---|---|
| 2026-01-23 15:22 | real_world_agent | gpt-5.1 | fail | Header attachment failure |
| 2026-01-23 17:37 | real_world_agent | o4-mini | pass | — |
| 2026-01-23 17:52 | real_world_agent | claude-sonnet-4 | pass | — |
| 2026-01-23 18:13 | real_world_agent | gpt-5.1 | pass | After retries |
| 2026-01-23 18:27 | real_world_agent | gpt-5-mini | pass | — |
| 2026-01-23 19:29 | penultimate_agent | claude-sonnet-4 | pass | — |
| 2026-01-23 20:09 | autonomous_discovery | o4-mini | pass | — |
| 2026-01-23 20:16 | autonomous_discovery | claude-sonnet-4 | pass | Cold-start exemplar lineage |
| 2026-01-23 20:25 | autonomous_discovery | gpt-5.1 | fail | Header attachment failure |
| 2026-01-25 19:16 | penultimate_agent | gpt-oss-120b | pass | 14 iterations — open-source 120B clears with SDK |
| 2026-01-25 22:59 | penultimate_agent | claude-sonnet-4 | pass | 9 iterations |
| 2026-01-25 23:03 | penultimate_agent | gpt-5.1 | fail | Gave up after 403 |
| 2026-02-02 (agg) | all four suites | 27 models · 4 providers | pass | 320-run aggregate report generated |
§ 12 — source artifacts
Read the raw reports.
This page is a synthesis. The underlying artifacts — the comprehensive analysis, the cold-start exemplar, and the aggregated test report — are the source of truth.
- Comprehensive analysis · 2026-01-25 · ~22,000 lines of test logs across the three frameworks. Full failure-mode taxonomy and per-model notes.
- Cold-start exemplar appendix · 2026-01-22 · The single canonical run reproduced in §04 above, with the full task, execution sequence, transaction signature, and receipt ID.
- Aggregated test report · 2026-02-02 · The 320-run, 27-model, 4-provider, 4-suite aggregate that produced every headline number on this page.
