╭─ System Gates ─╮

Repository Verification Commands

python -m pytest test_seam_all/test_seam.py -q
python -m pytest tests/audit -q
python -m pytest tools/history tools/streams -q
python -m tools.history.verify_integrity
python -m tools.history.verify_routing
python -m tools.history.verify_continuity
python -m tools.streams.verify_streams
git diff --check
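
The commands above can be chained so that a single failure stops the run. The following is a minimal sketch, assuming the commands are run from the repository root; `run_gates` and `all_passed` are hypothetical helpers and not part of the repository.

```python
import shlex
import subprocess
import sys

# Command list copied from the section above. The runner below is a
# hypothetical helper, not something the repository is known to ship.
GATE_COMMANDS = [
    "python -m pytest test_seam_all/test_seam.py -q",
    "python -m pytest tests/audit -q",
    "python -m pytest tools/history tools/streams -q",
    "python -m tools.history.verify_integrity",
    "python -m tools.history.verify_routing",
    "python -m tools.history.verify_continuity",
    "python -m tools.streams.verify_streams",
    "git diff --check",
]


def run_gates(commands, stop_on_failure=True):
    """Run each command in order and return (command, exit_code) pairs."""
    results = []
    for command in commands:
        argv = shlex.split(command)
        # Reuse the current interpreter when the command starts with "python".
        if argv[0] == "python":
            argv[0] = sys.executable
        completed = subprocess.run(argv)
        results.append((command, completed.returncode))
        if completed.returncode != 0 and stop_on_failure:
            break
    return results


def all_passed(results, expected_count):
    """True only if every expected command ran and exited with status 0."""
    return len(results) == expected_count and all(code == 0 for _, code in results)
```

Calling `all_passed(run_gates(GATE_COMMANDS), len(GATE_COMMANDS))` treats a skipped command the same as a failed one, which fits the fail-closed spirit of the gates.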

A metric improvement is rejected when it comes at the cost of reduced provenance, exactness, isolation, safety, or holdout discipline.
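
The rejection rule can be read as a guard over five dimensions. This is a minimal sketch under stated assumptions: the dimension names come from the sentence above, while the numeric "delta" encoding and the choice to treat missing evidence as a failure are illustrative, not SEAM's documented behavior.

```python
# Guarded dimensions named in the rule above.
GUARDED_DIMENSIONS = (
    "provenance",
    "exactness",
    "isolation",
    "safety",
    "holdout_discipline",
)


def accept_change(metric_delta, guard_deltas):
    """Return (accepted, reason) for a proposed change.

    metric_delta: change in the target metric (positive means better).
    guard_deltas: maps each guarded dimension to its change, where a
    negative value means the dimension was reduced.
    Assumption: a dimension with no reported delta counts as a failure.
    """
    missing = [d for d in GUARDED_DIMENSIONS if d not in guard_deltas]
    if missing:
        return False, "missing evidence for: " + ", ".join(missing)
    regressed = [d for d in GUARDED_DIMENSIONS if guard_deltas[d] < 0]
    if regressed:
        return False, "reduced: " + ", ".join(regressed)
    if metric_delta <= 0:
        return False, "no metric improvement"
    return True, "accepted"
```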

╭─ Evidence Hierarchy ─╮

Ordered from weakest to strongest. SEAM never uses a lower evidence class to claim a higher one.

1. Static Inspection
2. Syntax / Import
3. Unit Test
4. Negative Input
5. Integration
6. Real Adapter
7. Fixed Benchmark
8. Repeated Eval
9. Sealed Holdout
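
One way to make the ordering enforceable is to encode the nine classes as an ordered enum. This is a sketch only: the class names mirror the list above, and `violates_hierarchy` is a hypothetical helper.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence classes from the list above, weakest (1) to strongest (9)."""
    STATIC_INSPECTION = 1
    SYNTAX_IMPORT = 2
    UNIT_TEST = 3
    NEGATIVE_INPUT = 4
    INTEGRATION = 5
    REAL_ADAPTER = 6
    FIXED_BENCHMARK = 7
    REPEATED_EVAL = 8
    SEALED_HOLDOUT = 9


def violates_hierarchy(evidence: Evidence, claim: Evidence) -> bool:
    """True when a lower evidence class is used to claim a higher one."""
    return evidence < claim
```

For example, `violates_hierarchy(Evidence.UNIT_TEST, Evidence.FIXED_BENCHMARK)` is true, because a unit test cannot stand in for a fixed benchmark.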

╭─ Benchmark Suite ─╮

Latest tracked runs across SEAM's internal suites and public memory benchmarks.

Fidelity & Exactness

Status: Active

100%
MIRL Compilation

Source-to-MIRL transformation preserves exact text, spans, and provenance. Binary gate — no tolerance.

SHA-256
Hash Verification

Every artifact, fixture, and bundle verified by hash equality. Mismatch invalidates the artifact.
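
Hash-equality verification can be sketched with the standard library. The helper names below are hypothetical; only the use of SHA-256 comes from the text above.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_hex: str) -> bool:
    """True only when the digest equals the recorded hash exactly.

    Any mismatch, including a truncated or malformed expected hash,
    invalidates the artifact.
    """
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())
```

`hmac.compare_digest` is used here for a constant-time comparison; a plain `==` would give the same verdict.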

Exact
Quote/Span Retention

Quote, span, and table-cell retention verified through round-trip reconstruction. Direct-query exactness enforced.
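
A round-trip span check can be sketched as follows. The `Span` record (character offsets plus the stored quote) is an assumed shape for illustration, not SEAM's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span record: character offsets plus the stored quote."""
    start: int
    end: int
    text: str


def span_retained(source: str, span: Span) -> bool:
    """True when slicing the source at the span offsets yields the exact quote."""
    in_bounds = 0 <= span.start <= span.end <= len(source)
    return in_bounds and source[span.start:span.end] == span.text


def failed_spans(source: str, spans):
    """Return every span that does not survive the round trip."""
    return [s for s in spans if not span_retained(source, s)]
```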

Binary
Corruption Rejection

Exactness gates are binary unless the governing contract explicitly defines tolerance. Corruption is always rejected.
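
The binary-by-default rule can be expressed as a gate that only relaxes when a tolerance is passed explicitly. This is an illustrative sketch; `exactness_gate` is a hypothetical name, and the tolerance would come from the governing contract.

```python
def exactness_gate(observed, expected, tolerance=None):
    """Pass only on exact equality unless a tolerance is supplied.

    tolerance=None models the default binary gate. A numeric tolerance
    models a contract that explicitly allows a bounded deviation.
    """
    if tolerance is None:
        return observed == expected
    return abs(observed - expected) <= tolerance
```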

Retrieval

Status: Active

Metric                 Measurement                                     Gate Type
---------------------  ----------------------------------------------  ----------
Recall @ fixed budget  Fixed candidate and context token budgets       Regression
Precision              Irrelevant-context rate measurement             Regression
First relevant rank    Rank of first relevant record                   Regression
Displacement           Relevant evidence displacement score            Binary
Per-query delta        Per-category and per-query regression analysis  Regression
Trace correctness      Retrieval trace audit                           Binary
Scope isolation        Cross-scope leakage negatives                   Binary
Latency                Wall-clock retrieval time at fixed workload     Regression
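
Three of the retrieval metrics above have standard definitions that can be sketched directly. The function names and the use of a fixed candidate count `k` as the budget are assumptions for illustration; SEAM's own harness may define budgets in tokens rather than candidates.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant records found in the top k candidates."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant records")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def irrelevant_context_rate(ranked_ids, relevant_ids, k):
    """Fraction of the top k candidates that are not relevant."""
    relevant = set(relevant_ids)
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for rid in top if rid not in relevant) / len(top)


def first_relevant_rank(ranked_ids, relevant_ids):
    """1-based rank of the first relevant record, or None if absent."""
    relevant = set(relevant_ids)
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant:
            return rank
    return None
```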

Context & PACK Density

Status: Active

Token
Budget Compliance

Token count and compression ratio verified against operator-specified budgets. Never exceeds allocation.
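
A budget check reduces to counting tokens and comparing against the allocation. In this sketch the whitespace tokenizer is a stand-in assumption; a real check would use the tokenizer of the target model.

```python
def count_tokens(text: str) -> int:
    """Placeholder tokenizer (assumption): whitespace-separated tokens."""
    return len(text.split())


def within_budget(packed: str, budget_tokens: int) -> bool:
    """True when the packed context does not exceed the allocation."""
    return count_tokens(packed) <= budget_tokens


def compression_ratio(original: str, packed: str) -> float:
    """Original token count divided by packed token count."""
    packed_tokens = count_tokens(packed)
    if packed_tokens == 0:
        raise ValueError("packed context is empty")
    return count_tokens(original) / packed_tokens
```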

Ref
Reference Retention

Provenance and evidence references survive PACK compression. Reconstruction verified where required.
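
Reference retention can be checked as a set difference between the references present before and after compression. The reference identifiers below are hypothetical strings; the actual PACK reference format is not specified here.

```python
def missing_references(source_refs, packed_refs):
    """Return references present in the source but absent after packing."""
    return sorted(set(source_refs) - set(packed_refs))


def references_retained(source_refs, packed_refs) -> bool:
    """True when every source reference survives compression."""
    return not missing_references(source_refs, packed_refs)
```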

Task
Semantic Retention

Answer quality measured at fixed prompt budget. Denser representation accepted only when utility remains within gates.

±
Short/Long Regression

Tested on both short and long inputs to catch regressions that only appear at scale boundaries.

Answer Quality

Status: Gated

ID'd
Judge Identity

Answerer and judge model identities are recorded. Temperature zero is not proof of determinism, so repeated outputs are verified.

N-Run
Repeated Agreement

Repeated-run agreement measured. Confidence intervals or observed variance reported when feasible.
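
Repeated-run agreement and observed variance can be computed with the standard library. This is a sketch: defining agreement as the share of runs that match the most common output is an assumption, and other definitions (for example pairwise agreement) are equally plausible.

```python
import statistics
from collections import Counter


def agreement_rate(outputs):
    """Fraction of runs whose output equals the most common output."""
    if not outputs:
        raise ValueError("at least one run is required")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def score_summary(scores):
    """Mean and sample variance of per-run scores (variance 0.0 for one run)."""
    variance = statistics.variance(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": statistics.mean(scores), "variance": variance}
```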

Split
Per-Category Scores

Retrieval-miss vs answerer-miss classification. Per-category scoring with abstention behavior tracking.
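
The split between retrieval misses and answerer misses can be sketched as a small classifier. The labels, the use of `None` to represent an abstention, and the rule that all gold evidence must be retrieved are assumptions made for illustration.

```python
from collections import Counter


def classify_outcome(gold_evidence, retrieved, answer, gold_answer):
    """Label one question as abstained, correct, retrieval_miss, or answerer_miss."""
    if answer is None:
        # Abstentions are tracked separately from wrong answers.
        return "abstained"
    if answer == gold_answer:
        return "correct"
    if not set(gold_evidence) <= set(retrieved):
        # The needed evidence never reached the answerer.
        return "retrieval_miss"
    # The evidence was available but the answer was still wrong.
    return "answerer_miss"


def tally_by_category(records):
    """records: iterable of (category, outcome_label) pairs."""
    tally = {}
    for category, outcome in records:
        tally.setdefault(category, Counter())[outcome] += 1
    return tally
```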

Performance & Resource Bounds

Status: Active

Dimension           Methodology                                Status
------------------  -----------------------------------------  -------
Wall-clock latency  Fixed workloads, reported environment      Tracked
Throughput          Records/sec at standard ingest load        Tracked
Peak memory         Monitored during benchmark runs            Tracked
Disk growth         SQLite + derived index footprint           Tracked
DB query count      Per-operation query audit                  Binary
Index rebuild time  Full derived-state rebuild from canonical  Tracked
Context size        Token output at fixed budget               Binary
Startup / shutdown  Cold start and graceful shutdown timing    Tracked
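
Wall-clock latency, throughput, and peak memory for a fixed workload can be measured with the standard library. This is a rough sketch with a hypothetical `measure` helper; note that `tracemalloc` only sees allocations made through Python's allocator and adds overhead, so latency and memory are better measured in separate runs when precision matters.

```python
import time
import tracemalloc


def measure(workload, *args, **kwargs):
    """Run workload once and report wall-clock seconds and peak traced bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = workload(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}


def records_per_second(record_count, seconds):
    """Throughput for an ingest run; guards against a zero-length interval."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return record_count / seconds
```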