Engineering Evidence

We test wavebird the way we would want our infrastructure tested. Every claim on this page is backed by reproducible evidence from controlled benchmark runs and a comprehensive pre-pilot validation campaign with fault injection.

Last updated: 2026-03-30

What we measured

Speed

28.76 ms end-to-end

Measured runtime path from request entry to a sponsoring decision being ready, excluding the AI model’s own wait time. p99 means 99% of requests were this fast or faster.
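
To make the p99 definition above concrete, here is a generic nearest-rank percentile sketch (an illustration of the statistic itself, not the wavebird harness code; the sample latencies are made up):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the value that p% of samples fall at or below."""
    ordered = sorted(latencies_ms)
    # Nearest-rank method: ceil(p/100 * n), expressed with ceiling division.
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Ten hypothetical request latencies in milliseconds.
latencies = [12.1, 14.7, 9.8, 28.8, 15.2, 13.0, 11.4, 16.9, 10.2, 14.1]
p99 = percentile(latencies, 99)  # 28.8: with 10 samples, p99 is the slowest one
```

With small sample counts the p99 is dominated by the single slowest request, which is why benchmark runs use thousands of measured requests after warmup.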

Reliability

0 missing proofs

Over 8 hours (263,534 total slots processed) with fault injection active, we observed 0 missing proofs across the 260,321 terminal slots expected to produce proof.

Resilience

7 SSP failure scenarios

We simulated seven exchange failure modes (plus three PostgreSQL failures). Result: 0 crashes, 0 unrecoverable states, correct circuit breaker activation and recovery.

Terms used on this page
Slot
One sponsoring opportunity in the runtime (one decision attempt).
Proof
A signed evidence record produced for filled slots and used for audit and settlement.
Beacon
A post-render signal from the wrapper/app confirming a creative was rendered.
Mock-SSP
A simulated ad exchange response used to measure the internal ad path without public network noise.
Fault injection
Deliberate, randomized failures introduced during the campaign (latency jitter, slow responses, HTTP errors, malformed responses, drops, no-bid spikes).
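
A signed evidence record of the kind described under "Proof" can be illustrated with a generic HMAC sketch (the field names and key are hypothetical; this is not the actual proof-pack format or signing scheme):

```python
import hashlib
import hmac
import json

def sign_proof(slot_id, price_micro, key):
    """Serialize a minimal proof record and attach an HMAC-SHA256 signature."""
    record = {"slot_id": slot_id, "price_micro": price_micro}
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record

def verify_proof(record, key):
    """Recompute the signature over the unsigned fields and compare in constant time."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

key = b"audit-demo-key"
proof = sign_proof("slot-001", 1_250, key)
valid = verify_proof(proof, key)  # True; tampering with any field breaks verification
```

The point of such a signature is that any later audit or settlement step can detect a record that was altered after the decision was made.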

How we tested

In March 2026, we ran our pre-pilot validation campaign: a set of automated tests designed to find problems before the first real partner connects. We did not test under ideal conditions. We deliberately broke things.

Our Mock-SSP chaos mode randomly injected network delays, server errors, malformed responses, dropped connections, and traffic spikes into the test runs. The goal is simple: prove correct behavior under failure before we connect a live partner.

Mock-SSP

Mock-SSP simulates an ad exchange response inside the benchmark harness and inside the pre-pilot chaos campaign so we can measure the internal ad path without public network noise.

Proof integrity

We processed 10,000 sponsoring slots at 100 concurrent connections with fault injection active. Result: 0 missing proofs, 0 invalid signatures, 0 orphaned beacons.

Settlement accuracy

We ran 5,000 slots through 6 billing scenarios — including micro-unit price boundaries, duplicate detection, and multi-SSP fallback attribution. Result: exact reconciliation in every scenario (0 billing errors).
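
A reconciliation pass with duplicate detection can be sketched as follows (a simplified illustration with hypothetical slot-record fields, not the actual settlement code; integer micro-units stand in for the price-boundary handling mentioned above):

```python
def reconcile(slot_records):
    """Sum fill prices in integer micro-units, counting each slot ID exactly once.

    Returns (total, duplicate_ids) so callers can assert exact reconciliation.
    """
    seen, duplicates, total = set(), [], 0
    for record in slot_records:
        if record["slot_id"] in seen:
            duplicates.append(record["slot_id"])  # billed once, flagged for review
            continue
        seen.add(record["slot_id"])
        total += record["price_micro"]  # integers avoid floating-point drift
    return total, duplicates

records = [
    {"slot_id": "a1", "price_micro": 1_250},
    {"slot_id": "a2", "price_micro": 990},
    {"slot_id": "a1", "price_micro": 1_250},  # duplicate delivery report
]
total, dupes = reconcile(records)
balanced = (total == 2_240)  # exact reconciliation: ledger total matches expectation
```

Working in integer micro-units is a common design choice here: it makes "exact reconciliation" a strict equality check rather than a tolerance comparison.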

Resilience

We tested 7 SSP failure scenarios plus 3 PostgreSQL failure scenarios. Result: 0 crashes and correct circuit breaker activation and recovery in all scenarios.

Found and fixed during the campaign

Settlement attribution bug in multi-SSP fallback: slots were incorrectly attributed to the timed-out primary SSP.

Full campaign details

In March 2026, we ran a comprehensive pre-pilot validation campaign with chaos fault injection active. The campaign tested proof integrity, settlement accuracy, resilience, concurrency limits, and sustained stability.

Proof Chain Integrity

  • 10,000 slots processed at concurrency 100 with chaos faults active.
  • 294 latency jitter faults, 42 slow responses, 18 HTTP errors, 7 malformed responses, and 7 connection drops injected.
  • Result: 0 missing proofs, 0 invalid signatures, 0 orphaned beacons.
  • Every filled slot has a correctly signed proof pack.

Settlement Accuracy

  • 5,000 slots across 6 test scenarios.
  • Standard mixed-outcome run, micro-unit price boundaries, multi-SSP fallback attribution, duplicate detection, CS profile breakdown, and 30-minute duration stability.
  • Result: exact reconciliation in all scenarios, 0 billing errors.

Found and fixed: settlement attribution bug in multi-SSP fallback. Slots were incorrectly attributed to the timed-out primary SSP.

Resilience Under Failure

  • 7 SSP failure scenarios tested: connection refused, timeout, HTTP 500, HTTP 429, partial failure with fallback, flapping, and slow response.
  • 3 PostgreSQL failure scenarios: mid-runtime drop, never available, and slow queries.
  • Redis fail-policy: explicitly changed from implicit fail-open to configurable fail-closed (`CSL_RATE_LIMIT_REDIS_FAIL_POLICY`).
  • Result: 0 crashes, 0 unrecoverable states, correct circuit breaker activation and recovery in all scenarios.
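
The circuit breaker behavior referenced above follows a standard pattern: open after repeated failures, block calls while open, then allow a probe after a cooldown. A minimal generic sketch (not wavebird's actual implementation; thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=5.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self):
        if self.opened_at is None:
            return True  # closed: calls pass through
        # Half-open: let a probe through once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures, self.opened_at = 0, None  # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()  # open the breaker

breaker = CircuitBreaker(threshold=2, cooldown=0.01)
breaker.record_failure()
breaker.record_failure()        # breaker opens
blocked = breaker.allow()       # False: calls are rejected while open
time.sleep(0.02)
probe = breaker.allow()         # True: half-open probe allowed after cooldown
breaker.record_success()        # successful probe closes the breaker again
```

"Correct activation and recovery" in the results above corresponds to this cycle: the breaker opens under a failing dependency, sheds load instead of crashing, and closes once the dependency recovers.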

Sustained Load (8 Hours)

  • 263,534 slots processed over 8 hours with chaos faults active.
  • 0 missing proofs across 260,321 proofable terminal slots.
  • All 8 hourly quality gates passed.
  • 0 handle leaks; metric cardinality remained bounded (growing only from 67 to 81).
  • Chaos faults injected: 242 latency jitter, 38 slow responses, 12 HTTP errors, 8 malformed responses, 8 connection drops, 83 no-bid spikes.

Open finding: in-memory accumulation causes memory growth over extended runs. Slot eviction and ledger compaction are implemented and active. This is under continued optimization.

Under load

We pushed the system from 10 to 200 concurrent connections to find where it starts to struggle. The answer: it never crashes. It gets slower, but it keeps working.

“c100” means 100 concurrent connections.

Concurrent connections   Response time (p99)   Throughput   Errors
10                       64 ms                 333 ops/s    0
25                       293 ms                126 ops/s    0
50                       695 ms                92 ops/s     0
75                       1,203 ms              73 ops/s     0
100                      1,764 ms              64 ops/s     0
150                      3,267 ms              52 ops/s     0
200                      3,590 ms              33 ops/s     0

At 200 concurrent connections, p99 response time increases to 3.6 seconds but every response is still valid (2xx). Under that extreme load we see decision poll timeouts; when load drops back to 25 connections, the system recovers within 30 seconds.

How to read this table

The “Errors” column is HTTP-level errors. In these runs, every response was 2xx at every concurrency level. Under extreme load we do observe decision poll timeouts (2 at c100, 130 at c150, and 1,871 at c200). The system degrades gracefully rather than failing hard. Spike recovery from c200 to c25 completes within 30 seconds.

Sustained operation (8 hours)

We ran the system continuously for 8 hours with fault injection active, processing 263,534 sponsoring slots. All 8 hourly quality gates passed. 0 missing proofs across 260,321 terminal slots expected to produce proof. 0 handle leaks.

Open finding

What we found: memory usage grows over extended runs because in-memory state accumulates faster than it is cleaned up. Slot eviction and ledger compaction are implemented and active. This is under continued optimization.

Detailed methodology

The benchmark suite and the pre-pilot campaign were both run under controlled conditions. The goal was to measure the wavebird runtime itself, not the public internet or live model providers.

Evidence date
2026-03-23 (benchmarks), 2026-03-30 (pre-pilot campaign)
Execution mode
Local host benchmark harness
Exchange substitute
Mock-SSP
Runs
7
Warmup requests
1000
Measured requests per run after warmup
Not yet published in the current sanitized evidence bundle
Selection method
Median per benchmark
Pre-pilot campaign
Chaos fault injection via configurable Mock-SSP chaos mode with latency jitter (30%), slow responses (3%), HTTP errors (2%), malformed responses (1%), connection drops (0.5%), and periodic no-bid spikes.
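
The published injection rates imply a per-request fault draw roughly like the following (an illustrative sketch using the configured rates above; the function and fault names are hypothetical, and real requests and slots need not map one-to-one):

```python
import random

# Per-request fault probabilities from the campaign configuration above.
FAULT_RATES = {
    "latency_jitter": 0.30,
    "slow_response": 0.03,
    "http_error": 0.02,
    "malformed_response": 0.01,
    "connection_drop": 0.005,
}

def draw_fault(rng):
    """Independently roll each fault type; return the faults injected for one request."""
    return [name for name, rate in FAULT_RATES.items() if rng.random() < rate]

rng = random.Random(42)  # seeded for reproducible chaos runs
faults = [draw_fault(rng) for _ in range(10_000)]
jitter_count = sum("latency_jitter" in f for f in faults)  # ≈ 3,000 of 10,000
```

Seeding the random source is what makes a chaos campaign reproducible: the same seed replays the same fault schedule against a fixed build.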

Per-run variation exists internally and will be published once the sanitized artifact bundle is ready. The original benchmark methodology remains unchanged and the March 23 results remain valid.

Full benchmark metrics

Benchmarks

March 23, 2026

Firewall p99 latency

0.22 ms

Filtering step before any ad request leaves the runtime.

Mock-SSP round-trip p99 latency

15.28 ms

Internal ad path against a controlled exchange substitute.

End-to-end p99 latency

28.76 ms

Measured runtime path with external model wait time excluded.

Settlement max runtime

887.58 ms

Longest measured settlement run in the current evidence pack.

Mock-SSP request throughput

1,364.82 ops/s

Controlled request throughput inside the benchmark harness.

Pre-pilot campaign

March 30, 2026

Proof integrity

10,000 slots

Processed at c100 with 0 missing proofs.

Settlement accuracy

5,000 slots

6 scenarios with exact reconciliation.

SSP resilience

7 failure modes

0 crashes across SSP failure scenarios.

Concurrency tested

c10–c200

Graceful degradation under spike load.

Sustained load

263,534 slots

Processed over 8 hours with 0 proof gaps.

What this does not claim

We are transparent about what this evidence does and does not prove:

  • These are internal measurements, not third-party audits.
  • Latency was measured locally, not across the public internet or live model providers.
  • The exchange partner was simulated (Mock-SSP), not a live partner.
  • These numbers are not a production SLA.
  • The first live partner integration is the next milestone.

What is still open

Two things are not where we want them yet: beacon processing slows down above 50 concurrent connections, and the 8-hour sustained run shows more memory growth than our target allows. Both are under active optimization.

  • Beacon p99 at concurrency 50 and 100 remains above target in the original benchmark suite.
  • In-memory state accumulation during extended sustained load is under active optimization. Slot eviction and ledger compaction are implemented and reducing growth, but the 8-hour soak test does not yet meet the <20% memory growth target.
  • Jobs/sec remains below target in the original benchmark suite.

Artifacts

Downloadable artifacts will be published once the sanitized bundle is ready for public release. Pre-pilot campaign reports are available internally as machine-readable JSON artifacts.

Next step

See how it integrates

If the runtime evidence is what you needed, the next step is the integration path.