Nukez

§ proof · methodology

How we tested.

320 runs across 27 models and 4 providers. Three test frameworks spanning ~99% to ~85% agent autonomy. ~22,000 lines of test logs. Real mainnet. Real payments. No mocks. Every artifact below is reproducible from a public test harness.

§ 01 — at a glance

The headline numbers, before the footnotes.

Aggregated test report — 2026-02-02. 320 runs. 1,265 individual suites executed. The pass rates below use the full denominator, including every model that failed to reach the bar.

Metric             Value      Notes
Run pass rate      80.6%      258 of 320 runs all-pass
Suite pass rate    95.1%      1,203 of 1,265 individual suites
Models tested      27         Anthropic, OpenAI, xAI, Cohere
Tier 1 models      22         Cold-start production-ready
Providers          4          anthropic 94.9% · openai 93.7% · xai 99.1% · cohere 88.2%
Test log volume    ~22,000    Lines, across multiple sessions

§ 02 — provider-level breakdown

By provider, same data, no rollups.

The four model providers tested, with their model counts, total runs, and pass rates. xAI tops on suite rate; cohere lags on run rate due to a single failing model in a small sample. Aggregate numbers in §01 are the population totals; these are the strata.

Provider     Models    Runs    Run pass %    Suite pass %
anthropic    7         106     80.2%         94.9%
openai       10        115     74.8%         93.7%
xai          8         80      96.2%         99.1%
cohere       2         19      52.6%         88.2%

§ 03 — three-tier autonomy framework

One protocol, three difficulty settings.

Each model is tested at three distinct levels of autonomy. The harder the test, the less the agent is given — at the top tier the model has only a wallet and a single discovery URL, and must learn the entire protocol from public documentation.

Test framework                  Autonomy    Tools provided                                                       What it validates
test_autonomous_discovery.py    ~99%        Raw HTTP + wallet only                                               Can an agent discover, learn, and use Nukez with zero prior knowledge?
test_real_world_agent.py        ~95%        Generic HTTP + signing helpers                                       Can an agent read docs, construct requests, and handle auth from scratch?
penultimate_agent_test.py       ~85%        SDK tools (request_storage, execute_payment, signed_provision, …)    Can an agent use well-designed tools to complete the full flow?

Every framework enforces the same constraints, regardless of tier:

  • No hardcoded endpoints — only /.well-known/ discovery is permitted (see the sketch after this list)
  • No SDK access for the higher-autonomy (real-world, autonomous-discovery) tests
  • The agent must read documentation to learn the API
  • The agent must figure out authentication from the docs alone
  • The agent must handle errors and retries on its own
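
To make the zero-knowledge entry point concrete, below is a minimal Python sketch of the one step every tier gets for free: fetching the /.well-known/nukez.json discovery document. Only the well-known path and the production gateway URL appear on this page; the shape of the returned metadata is an assumption.

import json
import urllib.request

BASE_URL = "https://api.nukez.xyz"  # production gateway, per §09

def discover(base_url: str) -> dict:
    # The single URL a top-tier agent is given; everything else is learned.
    with urllib.request.urlopen(f"{base_url}/.well-known/nukez.json") as resp:
        return json.load(resp)

meta = discover(BASE_URL)
# From here the agent must locate the published docs (e.g. AGENT_FLOW.md)
# inside the metadata and learn the rest of the protocol from them.
print(json.dumps(meta, indent=2))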

§ 04 — the cold-start exemplar

One run, thirteen steps, byte-for-byte verified.

A representative cold-start run from the test corpus — claude-sonnet-4 on 2026-01-22, given only a discovery URL and a payment-enabled Solana wallet. No prior knowledge of Nukez. No SDK. The agent figured out everything else by reading the published documentation.

Task given to agent

“Store a piece of data externally and retrieve it to verify it was stored correctly. Find a storage service using web discovery. Learn how the service works by reading its documentation. Complete any required payment / authentication flow. Store this exact data: ‘Hello from autonomous agent! Timestamp: 2026-01-22T20:49:15…’. Retrieve the stored data. Verify the retrieved data matches. Report success with storage location and verification result. You have a Solana wallet for payments if needed.”

Step    Action
01      Discovery · fetched /.well-known/nukez.json, parsed service metadata
02      Documentation · read AGENT_FLOW.md and AUTH_SIGNED_ENVELOPE.md
03      Wallet check · verified sufficient SOL balance for payment
04      Price query · retrieved current storage pricing from /v1/price
05      Storage request · POST /v1/storage/request → HTTP 402 with payment instructions
06      Payment · Solana transfer of 0.000066667 SOL to the specified treasury
07      Confirmation · POST /v1/storage/confirm with X402-TX header → received receipt_id
08      Locker ID · computed from receipt_id using the documented hash formula
09      Provision · built signed envelope, called /v1/storage/signed_provision → cap_token
10      File create · POST /v1/lockers/{id}/files → received signed upload/download URLs
11      Upload · PUT to signed URL with the exact payload
12      Download · GET from signed URL to retrieve the stored data
13      Verify · confirmed retrieved data matches the original byte-for-byte
Outcome    Iterations    Duration    Verification    Receipt
SUCCESS    21            448.74s     EXACT_MATCH     d2c2d507159010f1
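
Steps 05 through 07 form the x402 payment arc. The sketch below traces it in Python under stated assumptions: the endpoints and the X402-TX header come from the step table above, while the request payload, the 402 body fields (treasury, amount_sol), the confirm response shape, and the send_sol wallet helper are illustrative stand-ins.

import requests

BASE = "https://api.nukez.xyz"

def send_sol(to_address: str, amount_sol: float) -> str:
    # Hypothetical wallet helper: pays the treasury, returns the tx signature.
    raise NotImplementedError("wire up a real Solana wallet client here")

resp = requests.post(f"{BASE}/v1/storage/request", json={"size_bytes": 1024})
assert resp.status_code == 402  # payment required; instructions in the body

quote = resp.json()
tx_sig = send_sol(quote["treasury"], quote["amount_sol"])  # e.g. 0.000066667 SOL

confirm = requests.post(
    f"{BASE}/v1/storage/confirm",
    headers={"X402-TX": tx_sig},  # the documented payment-proof header
)
receipt_id = confirm.json()["receipt_id"]  # feeds the locker ID computation in step 08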

Agent's final output (verbatim)

{
  "success": true,
  "task_complete": true,
  "summary": "Successfully stored and verified data externally
              using Nukez storage service",
  "storage_location": {
    "locker_id": "locker_55f499a55beb",
    "filename":  "agent-data.txt"
  },
  "verification_result": "EXACT_MATCH",
  "authentication_method": "Ed25519 signed envelope + capability token",
  "storage_protocol": "HTTP + x402 receipts on Solana",
  "tx_signature": "3Thvdxf5eLoFAPFp8ecwEE42sTRBe8km9PF8jkt5aKZ3QQ3KEkNq3mtgsADu8R8aoAWXCiU88eW3yuUcDEJYLKtp",
  "iterations": 21,
  "duration_seconds": 448.74
}
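
The authentication_method line above names the scheme. Below is a minimal sketch of what such an Ed25519 signed envelope could look like, using PyNaCl: the field names mirror the build_signed_envelope arguments shown in §07, but the canonicalization and header packing are assumptions; AUTH_SIGNED_ENVELOPE.md remains the authoritative spec.

import base64
import hashlib
import json
from nacl.signing import SigningKey

signing_key = SigningKey.generate()  # in practice, the agent wallet's Ed25519 key

def build_signed_envelope(receipt_id, method, path, ops, body=b""):
    envelope = {
        "receipt_id": receipt_id,
        "method": method,
        "path": path,
        "ops": ops,
        "body_sha256": hashlib.sha256(body).hexdigest(),  # required on POST (see §07)
    }
    payload = json.dumps(envelope, sort_keys=True).encode()  # assumed canonicalization
    signature = signing_key.sign(payload).signature
    return {
        "X-Nukez-Envelope": base64.b64encode(payload).decode(),
        "X-Nukez-Signature": base64.b64encode(signature).decode(),
    }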

§ 05 — the four suites

What “passing” actually means.

Each model run executes four independent suites. A run is counted as “all-pass” only when every suite passes. Per-suite pass rates are below — the strict run rate (80.6%) sits well below the suite rate (95.1%) because a single failure in any suite fails the whole run.

Suite                      Pass    Fail    Rate      Median    p95
Autonomous Agent Usage     250     55      82.0%     44.20s    81.58s
Basic SDK Functionality    320     0       100.0%    0.00s     0.00s
Contract Validation        319     1       99.7%     0.18s     0.20s
Integration Patterns       314     6       98.1%     0.63s     0.84s

§ 06 — reliability tiers

Per-model results, no curation.

Models are classified into three tiers based on aggregate pass rate. Tier 1 is production-ready for cold-start integration. Tier 2 needs the SDK abstraction or recent doc updates to clear the bar. Tier 3 is below the capability threshold for the agent-tool reasoning the protocol requires.

Tier 1 · Production ready (22 models)

Model                          Pass rate    Runs
claude-sonnet-4-20250514       100.0%       13/13
claude-opus-4-1-20250805       100.0%       13/13
claude-opus-4-20250514         100.0%       13/13
claude-opus-4-5-20251101       100.0%       13/13
claude-sonnet-4-5-20250929     100.0%       13/13
command-a-03-2025              100.0%       10/10
gpt-4.1                        100.0%       13/13
gpt-4.1-mini                   100.0%       13/13
gpt-4o                         100.0%       13/13
gpt-5-mini                     100.0%       10/10
gpt-5.1                        100.0%       10/10
o3                             100.0%       10/10
o4-mini                        100.0%       10/10
grok-3                         100.0%       10/10
grok-4-1-fast-non-reasoning    100.0%       10/10
grok-4-1-fast-reasoning        100.0%       10/10
grok-4-fast-non-reasoning      100.0%       10/10
grok-4-fast-reasoning          100.0%       10/10
grok-code-fast-1               100.0%       10/10
claude-haiku-4-5-20251001      92.3%        12/13
grok-3-mini                    90.0%        9/10
grok-4-0709                    80.0%        8/10

Tier 2 · Doc updates / SDK recommended (2 models)

Model                Pass rate    Runs    Note
gpt-5-nano           70.0%        7/10    Below capability threshold for cold-start; SDK recommended
gpt-oss-120b-maas    85.0%        —       Open-source 120B; passes with SDK abstraction

Tier 3 · Below capability threshold (4 models)

Model                      Pass rate    Runs    Note
claude-haiku-3-20240307    0.0%         0/13    Predates current agent-tool reasoning
command-r7b-12-2024        0.0%         0/9     Tool-call format incompatibility
gpt-4.1-nano               0.0%         0/13    Cannot reliably construct signed envelopes
gpt-4o-realtime-preview    0.0%         0/13    Realtime variant — no tool-use surface

§ 07 — failure modes observed

Every failure mode, published with its root cause.

Across all 320 runs the failures clustered into four categories. Each is described below with its symptom, root cause, and current resolution status. Three of the four were addressable via documentation changes; the fourth is a model reasoning limitation that surfaces in the tier classification.

Category 01 · Header attachment (model-specific)

Symptom. Agent builds the signed envelope correctly but fails to attach X-Nukez-Envelope and X-Nukez-Signature to the HTTP request.

Root cause. Specific to gpt-5.1, which does not consistently apply the pattern even when the documentation and tool outputs are explicit.

Status. Resolved via doc updates (AGENT_FLOW.md now shows header attachment as a top-level step).

Log evidence

tool · build_signed_envelope
  ✓  OK
tool · api_request (without headers)
  ✗  422  (missing auth headers)
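
The fix is the last step the failing runs skipped: actually attaching the two headers to the request. Continuing the hypothetical build_signed_envelope sketch from §04, with the endpoint taken from the exemplar:

body = b'{"ops": ["locker:provision"]}'  # illustrative payload
headers = build_signed_envelope(
    receipt_id=receipt_id,
    method="POST",
    path="/v1/storage/signed_provision",
    ops=["locker:provision"],
    body=body,
)
resp = requests.post(
    f"{BASE}/v1/storage/signed_provision",
    data=body,
    headers=headers,  # X-Nukez-Envelope + X-Nukez-Signature actually attached
)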

Category 02 · body_sha256 omission on POST

Symptom. Agent calls build_signed_envelope for POST without the body parameter.

Root cause. Common across all models on first attempt; most recover on retry once the tool's error message is observed.

Status. Resolved — promoted to top-level requirement in AUTH_SIGNED_ENVELOPE.md.

Log evidence

tool · build_signed_envelope
  args: { receipt_id: "...", method: "POST",
          path: "/v1/storage/signed_provision",
          ops: ["locker:provision"] }
  ✗  'body' parameter is required for POST requests
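
The corrected call simply supplies the body so its hash can be computed. Continuing the same hypothetical helper, with the argument shape taken from the failing log above:

body = b'{"ops": ["locker:provision"]}'  # illustrative payload
headers = build_signed_envelope(
    receipt_id="...",  # elided in the log above
    method="POST",
    path="/v1/storage/signed_provision",
    ops=["locker:provision"],
    body=body,  # the parameter the first attempts omitted
)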

Category 03 · Transient GCS signed URL failures

Symptom. 400 / 403 / 404 errors on Google Cloud Storage signed URLs.

Root cause. URL expiration (30-minute TTL), Content-Type mismatch, or timing issues with URL signing.

Status. Documented recovery path: agents that call create_file again to re-mint URLs successfully recover. This is the correct behavior per the docs.

Log evidence

tool · upload_data
  ✗  403 Client Error: Forbidden for url:
      https://storage.googleapis.com/...
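
The documented recovery path, sketched as a loop: on a 4xx from a signed URL, re-mint fresh URLs and retry. Here create_file() is a hypothetical stand-in for POST /v1/lockers/{id}/files from §04, and the retry bound is illustrative.

def upload_with_remint(locker_id, filename, payload, attempts=3):
    for _ in range(attempts):
        urls = create_file(locker_id, filename)  # re-mints signed upload/download URLs
        resp = requests.put(urls["upload"], data=payload)
        if resp.ok:
            return resp
        # A 400/403/404 here is typically URL expiry or signing skew; re-mint and retry.
    raise RuntimeError("upload failed after re-minting signed URLs")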

Category 04 · Premature task_failed

Symptom. Model gives up after a recoverable error instead of retrying.

Root cause. Model reasoning limitation — not an API problem. The correct action is to call create_file again for fresh URLs.

Status. Model-side; surfaces as a Tier 2 / Tier 3 reliability classification.

Log evidence

--- iteration 7 ---
tool · get_file
  ✗  400: BAD_ENVELOPE: body_sha256 missing
--- iteration 8 ---
tool · task_failed
  args: { reason: 'get_file failed with BAD_ENVELOPE…' }
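
The behavior the weaker models miss reduces to a small policy: classify the error, retry the recoverable ones, and only emit task_failed once retries are exhausted. The error markers below come from the logs in this section; the control flow is an illustrative sketch, not the tested agents' code.

RECOVERABLE_MARKERS = ("BAD_ENVELOPE", "403", "404", "expired")

def should_retry(error_message: str) -> bool:
    # Bad envelopes and stale signed URLs are repairable; retry instead of giving up.
    return any(marker in error_message for marker in RECOVERABLE_MARKERS)

def run_step(step, max_retries=2):
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return step()
        except RuntimeError as err:  # stand-in for the harness's API error type
            if not should_retry(str(err)):
                raise
            last_error = err  # recoverable: rebuild the envelope or re-mint URLs
    raise last_error  # only now is task_failed justified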

§ 08 — documentation evolution

What testing taught the docs, and what shipped.

Three of the four failure modes in §07 were addressable via documentation changes. Each gap identified during testing was tracked, fixed in the canonical docs, and re-tested in the next round. The pass-rate improvements documented in §06 reflect those fixes.

Identified gap                                     Fix shipped
Header attachment unclear                          Added explicit X-Nukez-Envelope / X-Nukez-Signature attachment examples to AGENT_FLOW.md
POST body_sha256 requirement buried                Promoted to a top-level requirement in AUTH_SIGNED_ENVELOPE.md
GET / DELETE envelope requirements undocumented    Added to ERROR_RECOVERY.md with worked examples
cap_token vs signed_envelope priority ambiguous    Reframed signed_envelope as PRIMARY across all integration docs
tools.json wording too SDK-specific                Renamed signing_helper to generic terminology so non-SDK runtimes recognize it

§ 09 — reproducibility

Run it yourself.

The verify-first thesis applies to the benchmark too. Every datum on this page is reproducible from the public test harness — same gateway, same models, same wallet pattern. No private access path was used, and no result is gated behind credentials we can't share.

  • Gateway. Production — https://api.nukez.xyz. The same URL every other consumer uses.
  • Network. Solana mainnet. Real lamports out of a real wallet on every request — no faucet, no devnet, no simulation.
  • Models. Available via the listed providers' public APIs. No private model access was used.
  • Test framework. The three-tier harness lives in the public agent-testing repo — test_autonomous_discovery.py, test_real_world_agent.py, penultimate_agent_test.py.

§ 10 — scope & limits

What this does not prove.

Honest framing of what the test corpus says and doesn't say. The numbers are tight; the claim those numbers support is narrow.

  • Not a model intelligence benchmark. It measures how reliably an agent can integrate with this protocol — not anything about general capability, reasoning, or quality of output.
  • Not a storage performance benchmark. Throughput, latency, and durability live on the per-provider pages (/proof/benchmark). This page is exclusively about the agent-integration surface.
  • Not an end-user UX benchmark. The agent is a proxy for “a competent autonomous integrator” — not for a human evaluating a UI.
  • Tier classifications are best-attempt. A few flaky models show as Tier 2 on their best attempt; the underlying per-model run distribution is published in the aggregated report.
  • Doc updates were applied. Three of the four failure modes were addressed via documentation changes during the test window. Pass rates in the latest reports reflect those fixes.

§ 11 — session timeline (excerpt)

What the test cadence actually looked like.

A representative slice of the test session log — every entry is a single recorded run with timestamp, framework, model, outcome, and a one-line note. The full log spans multiple sessions over ~2 weeks; this excerpt shows the failure → fix → re-pass arc on gpt-5.1 (the header-attachment regression) and the cross-model cadence around it.

Timestamp (UTC)     Framework               Model                      Result    Note
2026-01-23 15:22    real_world_agent        gpt-5.1                    fail      Header attachment failure
2026-01-23 17:37    real_world_agent        o4-mini                    pass
2026-01-23 17:52    real_world_agent        claude-sonnet-4            pass
2026-01-23 18:13    real_world_agent        gpt-5.1                    pass      After retries
2026-01-23 18:27    real_world_agent        gpt-5-mini                 pass
2026-01-23 19:29    penultimate_agent       claude-sonnet-4            pass
2026-01-23 20:09    autonomous_discovery    o4-mini                    pass
2026-01-23 20:16    autonomous_discovery    claude-sonnet-4            pass      Cold-start exemplar lineage
2026-01-23 20:25    autonomous_discovery    gpt-5.1                    fail      Header attachment failure
2026-01-25 19:16    penultimate_agent       gpt-oss-120b               pass      14 iterations — open-source 120B clears with SDK
2026-01-25 22:59    penultimate_agent       claude-sonnet-4            pass      9 iterations
2026-01-25 23:03    penultimate_agent       gpt-5.1                    fail      Gave up after 403
2026-02-02 (agg)    all four suites         27 models · 4 providers    pass      320-run aggregate report generated

§ 12 — source artifacts

Read the raw reports.

This page is a synthesis. The underlying artifacts — the comprehensive analysis, the cold-start exemplar, and the aggregated test report — are the source of truth.

  • Comprehensive analysis · 2026-01-25 · ~22,000 lines of test logs across the three frameworks. Full failure-mode taxonomy and per-model notes.
  • Cold-start exemplar appendix · 2026-01-22 · The single canonical run reproduced in §04 above, with the full task, execution sequence, transaction signature, and receipt ID.
  • Aggregated test report · 2026-02-02 · The 320-run, 27-model, 4-provider, 4-suite aggregate that produced every headline number on this page.

back to proof · or jump to the benchmark matrix →