deterministic · reproducible · hash-verified
Every claim is auditable. Every result is reproducible. Every index is rebuildable.
python -m pytest test_seam_all/test_seam.py -q
python -m pytest tests/audit -q
python -m pytest tools/history tools/streams -q
python -m tools.history.verify_integrity
python -m tools.history.verify_routing
python -m tools.history.verify_continuity
python -m tools.streams.verify_streams
git diff --check
A metric improvement is rejected when it is purchased by reduced provenance, exactness, isolation, safety, or holdout discipline.
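The rejection rule above can be sketched as a small acceptance check. This is a minimal illustration, not SEAM's actual gate: `RunReport`, `DISCIPLINES`, and `accept_improvement` are hypothetical names, and the sketch assumes each of the five disciplines is reported as a simple pass/fail flag.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the five disciplines named in the rule above.
DISCIPLINES = ("provenance", "exactness", "isolation", "safety", "holdout")


@dataclass
class RunReport:
    metric: float                                      # headline metric, higher is better
    disciplines: dict = field(default_factory=dict)    # name -> True if the check passed


def accept_improvement(baseline: RunReport, candidate: RunReport) -> bool:
    """Accept only when the metric improves and no discipline regresses."""
    for name in DISCIPLINES:
        passed_before = baseline.disciplines.get(name, False)
        passed_now = candidate.disciplines.get(name, False)
        if passed_before and not passed_now:
            return False  # improvement was purchased by a weakened discipline
    return candidate.metric > baseline.metric
```

Under these assumptions, a candidate that scores higher but fails a provenance check the baseline passed is rejected regardless of the size of the gain.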
From weakest to strongest — SEAM never uses a lower evidence class to claim a higher one.
Latest tracked runs across SEAM's internal suites and public memory benchmarks. Each entry carries its methodology, sub-metric breakdown, and prior hash-verified runs.
Source-to-MIRL transformation preserves exact text, spans, and provenance. Binary gate — no tolerance.
Every artifact, fixture, and bundle verified by hash equality. Mismatch invalidates the artifact.
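A hash-equality check of this kind might look like the sketch below. The source does not name the hash function, so SHA-256 is an assumption here, and `sha256_of` and `verify_artifact` are hypothetical helper names rather than SEAM's tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Stream the file through SHA-256 so large bundles are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_hex: str) -> bool:
    """Binary gate: True only when the recorded and computed digests are equal."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Any mismatch returns `False`; there is no partial credit, which matches the "mismatch invalidates the artifact" rule.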
Quote, span, and table-cell retention verified through round-trip reconstruction. Direct-query exactness enforced.
Exactness gates are binary unless the governing contract explicitly defines tolerance. Corruption is always rejected.
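One way to picture a binary exactness gate for quotes and spans is a round-trip comparison against the source text. The sketch below is illustrative only: the `Span` record and its character-offset convention are assumptions, not the MIRL format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the source
    end: int    # exclusive character offset
    text: str   # text the stored record claims to quote


def span_is_exact(source: str, span: Span) -> bool:
    """A span passes only if its offsets are valid and the text matches exactly."""
    if not (0 <= span.start <= span.end <= len(source)):
        return False
    return source[span.start:span.end] == span.text


def round_trip_gate(source: str, spans: list) -> bool:
    """Binary gate: a single corrupted span fails the whole record."""
    return all(span_is_exact(source, s) for s in spans)
```

Note that the comparison is case-sensitive and whitespace-sensitive, so no tolerance is applied unless a caller adds one explicitly.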
| Metric | Measurement | Gate Type |
|---|---|---|
| Recall @ fixed budget | Fixed candidate and context token budgets | Regression |
| Precision | Irrelevant-context rate measurement | Regression |
| First relevant rank | Rank of first relevant record | Regression |
| Displacement | Relevant evidence displacement score | Binary |
| Per-query delta | Per-category and per-query regression analysis | Regression |
| Trace correctness | Retrieval trace audit | Binary |
| Scope isolation | Cross-scope leakage negatives | Binary |
| Latency | Wall-clock retrieval time at fixed workload | Regression |
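Several of the retrieval metrics in the table can be illustrated with a short sketch. The function names, the record-ID representation, and the per-record token counts are all assumptions made for the example; SEAM's real scoring code may differ.

```python
from typing import Optional


def within_budget(ranked: list, token_counts: dict,
                  max_candidates: int, max_tokens: int) -> list:
    """Keep ranked records until the candidate or context token budget is hit."""
    kept, used = [], 0
    for record_id in ranked[:max_candidates]:
        cost = token_counts[record_id]
        if used + cost > max_tokens:
            break
        kept.append(record_id)
        used += cost
    return kept


def recall_at_budget(kept: list, relevant: set) -> float:
    """Share of relevant records that survived the fixed budgets."""
    if not relevant:
        raise ValueError("recall is undefined without relevant records")
    return len(relevant.intersection(kept)) / len(relevant)


def irrelevant_context_rate(kept: list, relevant: set) -> float:
    """Share of kept records that are not relevant (0.0 for an empty context)."""
    if not kept:
        return 0.0
    return sum(1 for r in kept if r not in relevant) / len(kept)


def first_relevant_rank(ranked: list, relevant: set) -> Optional[int]:
    """1-based rank of the first relevant record, or None if none was retrieved."""
    for rank, record_id in enumerate(ranked, start=1):
        if record_id in relevant:
            return rank
    return None
```

Holding `max_candidates` and `max_tokens` constant across runs is what makes the recall figures comparable from one run to the next.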
Token count and compression ratio verified against operator-specified budgets. Output never exceeds the allocation.
Provenance and evidence references survive PACK compression. Reconstruction verified where required.
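The budget and provenance checks can be sketched together. Everything here is an assumption for illustration: whitespace splitting stands in for the real tokenizer, the compression ratio is taken as original tokens divided by packed tokens, and evidence references are treated as literal strings that must still appear in the packed text.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for the real tokenizer.
    return len(text.split())


def check_pack(original: str, packed: str, budget: int, refs: list) -> dict:
    """Report budget compliance, compression ratio, and any lost references."""
    original_tokens = count_tokens(original)
    packed_tokens = count_tokens(packed)
    ratio = original_tokens / packed_tokens if packed_tokens else float("inf")
    return {
        "within_budget": packed_tokens <= budget,
        "compression_ratio": ratio,
        "missing_refs": [r for r in refs if r not in packed],
    }
```

A packed output would pass only when `within_budget` is true and `missing_refs` is empty; a high compression ratio alone is not sufficient.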
Answer quality measured at fixed prompt budget. Denser representation accepted only when utility remains within gates.
Tested on both short and long inputs to catch regressions that only appear at scale boundaries.
Answerer and judge model identity recorded. Temperature-zero is not proof of determinism — repeated outputs verified.
Repeated-run agreement measured. Confidence intervals or observed variance reported when feasible.
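Repeated-run agreement and a simple confidence interval could be computed along these lines. This is a sketch under stated assumptions: agreement is defined as the share of runs matching the most common output, and the interval uses a normal approximation with z = 1.96, which is rough for small run counts.

```python
import math
import statistics
from collections import Counter


def agreement_rate(outputs: list) -> float:
    """Fraction of repeated runs that produced the most common output."""
    if not outputs:
        raise ValueError("at least one run is required")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def mean_with_interval(scores: list, z: float = 1.96):
    """Mean score plus a normal-approximation interval (None with < 2 runs)."""
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, None
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)
```

An agreement rate below 1.0 at temperature zero is direct evidence that the setup is not deterministic, which is why the outputs are compared rather than assumed equal.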
Retrieval-miss vs answerer-miss classification. Per-category scoring with abstention behavior tracking.
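The miss classification can be illustrated with a small decision function. The `Outcome` labels and the rule for telling the two miss types apart (whether any gold evidence reached the answerer) are assumptions for this sketch, not SEAM's documented taxonomy.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAINED = "abstained"
    RETRIEVAL_MISS = "retrieval_miss"
    ANSWERER_MISS = "answerer_miss"


def classify(gold_ids: set, retrieved_ids: set,
             answer: Optional[str], is_correct: bool) -> Outcome:
    """Attribute a failure to retrieval or to the answerer."""
    if answer is None:
        return Outcome.ABSTAINED          # tracked separately from wrong answers
    if is_correct:
        return Outcome.CORRECT
    if not (gold_ids & retrieved_ids):
        return Outcome.RETRIEVAL_MISS     # the evidence never reached the answerer
    return Outcome.ANSWERER_MISS          # evidence was present but misused


def tally_by_category(rows) -> dict:
    """rows: iterable of (category, Outcome) pairs; returns per-category counts."""
    table = {}
    for category, outcome in rows:
        table.setdefault(category, Counter())[outcome] += 1
    return table
```

Separating the two miss types shows whether a score drop should be investigated in the retrieval layer or in the answering model.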
| Dimension | Methodology | Status |
|---|---|---|
| Wall-clock latency | Fixed workloads, reported environment | Tracked |
| Throughput | Records/sec at standard ingest load | Tracked |
| Peak memory | Monitored during benchmark runs | Tracked |
| Disk growth | SQLite + derived index footprint | Tracked |
| DB query count | Per-operation query audit | Binary |
| Index rebuild time | Full derived-state rebuild from canonical | Tracked |
| Context size | Token output at fixed budget | Binary |
| Startup / shutdown | Cold start and graceful shutdown timing | Tracked |
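Wall-clock latency and peak memory for a fixed workload could be captured with the standard library alone, as in the sketch below. `measure` is a hypothetical helper, and `tracemalloc` only sees allocations made through Python's allocator, so memory used by native code (SQLite, for example) would need a process-level monitor instead.

```python
import time
import tracemalloc


def measure(workload, *args, **kwargs):
    """Run a workload once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = workload(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # always stop tracing, even if the workload raises
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload; a real run would call the operation under test.
    total, seconds, peak_bytes = measure(lambda n: sum(range(n)), 100_000)
    print(f"result={total} elapsed={seconds:.6f}s peak={peak_bytes}B")
```

Reported numbers are only comparable when the workload and environment are held fixed, so the environment description belongs alongside every timing.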