Benchmarks | Canticle

╭─ System Gates ─╮

Repository Verification Commands

python -m pytest test_seam_all/test_seam.py -q
python -m pytest tests/audit -q
python -m pytest tools/history tools/streams -q
python -m tools.history.verify_integrity
python -m tools.history.verify_routing
python -m tools.history.verify_continuity
python -m tools.streams.verify_streams
git diff --check
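
The commands above can be chained into a single fail-fast gate. What follows is a minimal sketch, not the repository's actual tooling: the `run_gates` helper and its return shape are hypothetical, `sys.executable` stands in for `python`, and the sketch assumes every command signals failure through a nonzero exit code.

```python
import subprocess
import sys

# Verification commands copied from the list above.
# sys.executable replaces "python" so the current interpreter is used (an assumption).
GATES = [
    [sys.executable, "-m", "pytest", "test_seam_all/test_seam.py", "-q"],
    [sys.executable, "-m", "pytest", "tests/audit", "-q"],
    [sys.executable, "-m", "pytest", "tools/history", "tools/streams", "-q"],
    [sys.executable, "-m", "tools.history.verify_integrity"],
    [sys.executable, "-m", "tools.history.verify_routing"],
    [sys.executable, "-m", "tools.history.verify_continuity"],
    [sys.executable, "-m", "tools.streams.verify_streams"],
    ["git", "diff", "--check"],
]


def run_gates(commands):
    """Run each command in order and stop at the first nonzero exit code.

    Returns (passed, failed_command); failed_command is None when all pass.
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return False, cmd
    return True, None
```

Calling `run_gates(GATES)` from the repository root would run all eight checks and report the first one that fails, if any.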

A metric improvement is rejected when it is purchased by reduced provenance, exactness, isolation, safety, or holdout discipline.

╭─ Evidence Hierarchy ─╮

Ordered from weakest to strongest. SEAM never uses a lower evidence class to claim a higher one.

1. Static Inspection
2. Syntax / Import
3. Unit Test
4. Negative Input
5. Integration
6. Real Adapter
7. Fixed Benchmark
8. Repeated Eval
9. Sealed Holdout
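
One way to make this ordering enforceable is to encode it as an ordered enumeration. The sketch below is illustrative only: the `Evidence` enum and the `supports_claim` helper are hypothetical names, not part of SEAM.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence classes from the list above, weakest (1) to strongest (9)."""
    STATIC_INSPECTION = 1
    SYNTAX_IMPORT = 2
    UNIT_TEST = 3
    NEGATIVE_INPUT = 4
    INTEGRATION = 5
    REAL_ADAPTER = 6
    FIXED_BENCHMARK = 7
    REPEATED_EVAL = 8
    SEALED_HOLDOUT = 9


def supports_claim(available: Evidence, claimed: Evidence) -> bool:
    """True only if the evidence held is at least as strong as the class claimed."""
    return available >= claimed
```

Under this sketch, a passing unit test cannot back a fixed-benchmark claim: `supports_claim(Evidence.UNIT_TEST, Evidence.FIXED_BENCHMARK)` is `False`.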

╭─ Benchmark Suite ─╮

Latest tracked runs across SEAM's internal suites and public memory benchmarks.

Fidelity & Exactness (Active)

MIRL Compilation (100%)

Source-to-MIRL transformation preserves exact text, spans, and provenance. This is a binary gate with no tolerance.

Hash Verification (SHA-256)

Every artifact, fixture, and bundle verified by hash equality. Mismatch invalidates the artifact.
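
Hash-equality checking of this kind can be sketched with the standard library. The function names below are hypothetical and the snippet shows only the general technique, not SEAM's verification code.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_hex: str) -> bool:
    """True only when the computed digest equals the recorded one exactly."""
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(sha256_hex(data), expected_hex.lower())
```

Any single-byte change to the artifact produces a different digest, so a `False` result is treated as invalidating the artifact rather than as a partial match.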

Quote/Span Retention (Exact)

Quote, span, and table-cell retention verified through round-trip reconstruction. Direct-query exactness enforced.
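
A round-trip check of this sort might look like the following sketch. It assumes spans are stored as character offsets plus the exact text they covered; the `Span` type and the `failed_spans` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset (assumed representation)
    end: int    # exclusive character offset
    text: str   # exact text recorded when the span was captured


def failed_spans(reconstructed: str, spans: Iterable[Span]) -> List[Span]:
    """Return every span whose recorded text no longer matches the reconstruction."""
    return [s for s in spans if reconstructed[s.start:s.end] != s.text]
```

An empty result means every span survived the round trip; any entry in the list fails the gate, since the comparison is exact string equality.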

Corruption Rejection (Binary)

Exactness gates are binary unless the governing contract explicitly defines tolerance. Corruption is always rejected.

Retrieval (Active)

Metric	Measurement	Gate Type
Recall @ fixed budget	Fixed candidate and context token budgets	Regression
Precision	Irrelevant-context rate measurement	Regression
First relevant rank	Rank of first relevant record	Regression
Displacement	Relevant evidence displacement score	Binary
Per-query delta	Per-category and per-query regression analysis	Regression
Trace correctness	Retrieval trace audit	Binary
Scope isolation	Cross-scope leakage negatives	Binary
Latency	Wall-clock retrieval time at fixed workload	Regression
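
Three of the metrics in the table have simple standard definitions, sketched below. This is an illustration under assumptions: `budget` is treated as a candidate count (the table also mentions context token budgets, which are not modeled here), and the function names are hypothetical.

```python
def recall_at_budget(ranked_ids, relevant_ids, budget):
    """Fraction of relevant records that appear in the first `budget` candidates."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant records")
    return len(set(ranked_ids[:budget]) & relevant) / len(relevant)


def first_relevant_rank(ranked_ids, relevant_ids):
    """1-based rank of the first relevant record, or None if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, record_id in enumerate(ranked_ids, start=1):
        if record_id in relevant:
            return rank
    return None


def irrelevant_context_rate(ranked_ids, relevant_ids, budget):
    """Share of the first `budget` candidates that are not relevant."""
    top = ranked_ids[:budget]
    if not top:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for record_id in top if record_id not in relevant) / len(top)
```

Holding `budget` fixed across runs is what makes two recall numbers comparable; a higher recall obtained with a larger budget would not count as an improvement.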

Context & PACK Density (Active)

Budget Compliance (Token)

Token count and compression ratio verified against operator-specified budgets. Never exceeds allocation.
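
A budget check along these lines could be sketched as follows. The whitespace tokenizer is a stand-in assumption (a real check would use the target model's tokenizer), and `check_budget` is a hypothetical helper.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated tokens (an assumption)."""
    return len(text.split())


def check_budget(original: str, packed: str, budget: int) -> dict:
    """Report token count, compression ratio, and whether the budget is respected."""
    original_tokens = count_tokens(original)
    packed_tokens = count_tokens(packed)
    ratio = original_tokens / packed_tokens if packed_tokens else float("inf")
    return {
        "packed_tokens": packed_tokens,
        "compression_ratio": ratio,
        "within_budget": packed_tokens <= budget,
    }
```

For example, packing an 8-token input into 4 tokens against a budget of 4 yields a compression ratio of 2.0 and passes; the same output against a budget of 3 fails.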

Reference Retention (Ref)

Provenance and evidence references survive PACK compression. Reconstruction verified where required.

Semantic Retention (Task)

Answer quality measured at fixed prompt budget. Denser representation accepted only when utility remains within gates.

Short/Long Regression

Tested on both short and long inputs to catch regressions that only appear at scale boundaries.

Answer Quality (Gated)

Judge Identity (ID'd)

Answerer and judge model identity recorded. Temperature zero is not treated as proof of determinism; repeated outputs are verified.

Repeated Agreement (N-Run)

Repeated-run agreement measured. Confidence intervals or observed variance reported when feasible.
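
Repeated-run agreement and observed variance can be computed with the standard library, as in this sketch. Defining agreement as the share of runs matching the most common output is an assumption, and both function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def agreement_rate(outputs):
    """Share of runs that produced the single most common output."""
    if not outputs:
        raise ValueError("at least one run is required")
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)


def score_summary(scores):
    """Mean and population variance of per-run scores, with the run count."""
    return {"n": len(scores), "mean": mean(scores), "variance": pvariance(scores)}
```

An agreement rate below 1.0 at temperature zero is direct evidence that the setup is not deterministic, which is why the outputs themselves are compared instead of the sampling settings.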

Per-Category Scores (Split)

Retrieval-miss vs answerer-miss classification. Per-category scoring with abstention behavior tracking.
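
The split between retrieval misses and answerer misses can be illustrated with a small classifier. This sketch assumes a retrieval miss means at least one gold evidence record was absent from the retrieved set; the record fields, outcome labels, and function names are all hypothetical, and abstention tracking is left out for brevity.

```python
from collections import Counter


def classify_outcome(retrieved_ids, gold_ids, answer_correct):
    """Label one query as correct, retrieval_miss, or answerer_miss."""
    if answer_correct:
        return "correct"
    # Assumption: missing any gold evidence record counts as a retrieval miss.
    if not set(gold_ids) <= set(retrieved_ids):
        return "retrieval_miss"
    # All evidence was available, so the wrong answer is attributed to the answerer.
    return "answerer_miss"


def tally_by_category(records):
    """Count (category, outcome) pairs across a list of per-query records."""
    counts = Counter()
    for r in records:
        outcome = classify_outcome(r["retrieved"], r["gold"], r["correct"])
        counts[(r["category"], outcome)] += 1
    return counts
```

Separating the two failure types shows whether a low category score should be addressed in retrieval or in the answering model.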

Performance & Resource Bounds (Active)

Dimension	Methodology	Status
Wall-clock latency	Fixed workloads, reported environment	Tracked
Throughput	Records/sec at standard ingest load	Tracked
Peak memory	Monitored during benchmark runs	Tracked
Disk growth	SQLite + derived index footprint	Tracked
DB query count	Per-operation query audit	Binary
Index rebuild time	Full derived-state rebuild from canonical	Tracked
Context size	Token output at fixed budget	Binary
Startup / shutdown	Cold start and graceful shutdown timing	Tracked
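
Wall-clock latency and peak memory for a fixed workload can be captured with the standard library, as sketched below. The `measure` helper is hypothetical. Note that `tracemalloc` only sees allocations made through Python's allocator and adds overhead of its own, so timings taken this way are inflated relative to an uninstrumented run.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, stats) with elapsed seconds and peak bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
        tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": peak}
```

Reporting the environment alongside these numbers (hardware, Python version, dataset size) is what allows a later run to be compared against them.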
