Systematically generating tests that would have caught Anthropic's top‑K bug
We introduce fractional proof decomposition, a technique for scaling testing compute logarithmically, instead of linearly, with bug rarity. We achieve this efficiency by fusing partial evaluation and property-based testing.
Most large projects have a limited compute budget for testing, lest it slows down the CI pipeline. This means missing rare edge cases until customers find them in production.
I love formal verification! And I want to bring the power of reasoning about program structure to testing. In this blog post, I will briefly introduce fractional proof decomposition, our new technique that makes testing compute scale logarithmically with bug rarity instead of linearly.
As a demo, I’ll walk through how this approach would have caught Anthropic’s recent approximate top-K bug. Note that we generate these unit tests without relying on the bug reproducer code, using just the high-level specification of top-K sampling.
Read more →