Blog

Systematically generating tests that would have caught Anthropic's top‑K bug

We introduce fractional proof decomposition, a technique for scaling testing compute logarithmically, instead of linearly, with bug rarity. We achieve this efficiency by fusing partial evaluation and property-based testing.

Most large projects have a limited compute budget for testing, lest it slows down the CI pipeline. This means missing rare edge cases until customers find them in production. I love formal verification! And I want to bring the power of reasoning about program structure to testing. In this blog post, I will briefly introduce fractional proof decomposition, our new technique that makes testing compute scale logarithmically with bug rarity instead of linearly. As a demo, I’ll walk through how this approach would have caught Anthropic’s recent approximate top-K bug. Note that we generate these unit tests without relying on the bug reproducer code, using just the high-level specification of top-K sampling.
Read more →