'It worked when I tried it' is not a test for a non-deterministic system.
Treat prompts and rules like code: a golden dataset of inputs with known-good outputs, run headless on every change, gated by a hook that fails the build below baseline. The eval that blocks the merge is the one that prevents regressions.