LLM-Generated Code Unlocks Superpowers LLMs Don’t Possess
I trust code an LLM writes in ways I’d never trust the LLM directly. Many other practitioners feel the same way. We’re getting comfortable with LLM-generated code.
Why We Trust the Code
The problem with trusting LLMs directly is straightforward: they’re probabilistic. Ask the same question twice, get different answers. And how do we even verify a prose response is correct? Often we can’t.
The key move is crystallization. We take a probabilistic process (the LLM’s token generation) and freeze its output into a deterministic artifact. The code compiles or it doesn’t. It passes tests or it doesn’t. Once verified, it runs the same way every time.
This works because of an asymmetry: generating correct code is hard, but checking it is cheap. The LLM explores a solution space to produce a candidate. We run it, test it, inspect it. That verification cost is a fraction of what it would take to write the code ourselves.
So when we use LLM-generated code, we’re not trusting the LLM. We’re trusting the code, an artifact we’ve verified through execution. The LLM’s unreliability gets filtered out by the verification step.
One nuance worth calling out: running tests is cheap, but knowing your tests are *sufficient* is hard. Verification works well when the problem has clear correctness criteria: parsing a specific format, computing a known function, transforming data in well-defined ways. It’s less reliable when edge cases are unknown or failure modes are subtle. The comfort we feel with LLM-generated code is real, but it’s bounded by verification quality.
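That crystallization step can be made concrete with a minimal sketch: execute a candidate against known input/output pairs. The candidate source, the `parse_version` function it defines, and the test cases are all invented for illustration; the point is only that "compiles and passes tests" is a cheap, mechanical filter.

```python
# Hypothetical LLM output: a candidate program as a string.
candidate_source = """
def parse_version(s):
    major, minor, patch = s.split(".")
    return (int(major), int(minor), int(patch))
"""

def verify(source, tests):
    """Run candidate code and check it against known input/output pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # does it even run? the first cheap filter
    except Exception:
        return False
    fn = namespace.get("parse_version")
    if fn is None:
        return False
    try:
        return all(fn(inp) == expected for inp, expected in tests)
    except Exception:
        return False

tests = [("1.2.3", (1, 2, 3)), ("10.0.1", (10, 0, 1))]
print(verify(candidate_source, tests))  # a passing candidate becomes a trusted artifact
```

Once a candidate passes, the probabilistic process that produced it no longer matters: the artifact behaves the same way on every run.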
The Superpowers
With that caveat, the core observation stands: code doesn’t share the LLM’s limitations.
- LLMs are probabilistic. Code executes deterministically.
- LLMs are slow, generating tokens one at a time. Code runs at compiled speed.
- LLMs have bounded context windows. Code can process arbitrarily large datasets.
- LLMs approximate. Code computes exactly.
- LLMs are hard to verify. How do we know a text response is correct? Code is testable, sometimes even formally provable.
The LLM doesn’t overcome its inherent constraints. By generating code, it routes around them.
This is why code generation isn’t just another LLM use case alongside summarization or chat. It’s categorically different, an escape hatch from the LLM’s own limitations.
The Larger Unlock
Today’s code generation workflow is straightforward: the LLM generates a candidate, a human verifies it, and the code gets deployed. One shot, human in the loop.
But the same property that makes us comfortable today (cheap verification) enables something more powerful.
If verification is cheap, we can do a lot of it. Instead of generating one candidate and hoping it’s right, we can generate thousands. We test them all. We filter, cluster, and surface the best solutions. The LLM becomes less like a programmer and more like a search engine over the space of possible programs.
Here’s how this works mechanically: the same non-determinism that makes us verify LLM outputs in the first place is what enables exploration at scale. Turn up the temperature, generate more candidates, and the randomness explores corners of solution space a deterministic process would never reach. The “unreliability” becomes a feature. Verification is what makes it safe to explore aggressively.
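The generate-many, verify-cheaply loop fits in a few lines. In this sketch, `sample_candidate` is a stand-in for a high-temperature LLM call: it randomly perturbs a template so some candidates are wrong, and the test filter removes them. Everything here is illustrative, not a real model call.

```python
import random

def sample_candidate(rng):
    # Stand-in for high-temperature sampling: random exploration
    # of a tiny solution space (three possible operators).
    op = rng.choice(["+", "-", "*"])
    return f"def add(a, b):\n    return a {op} b"

def passes_tests(source):
    ns = {}
    exec(source, ns)
    return all(ns["add"](a, b) == a + b for a, b in [(1, 2), (5, -3), (0, 0)])

rng = random.Random(0)
candidates = [sample_candidate(rng) for _ in range(1000)]   # generate thousands
survivors = [c for c in candidates if passes_tests(c)]      # cheap verification filter
print(len(survivors), "of", len(candidates), "candidates verified")
```

Note that the randomness is doing real work: a deterministic generator would emit the same candidate a thousand times, while the noisy one covers the whole space, and verification makes that noise safe.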
This is already happening at the frontier. DeepMind’s AlphaCode generates millions of candidate programs and filters them through test cases. Their FunSearch system used LLMs to discover new mathematical constructions by treating code generation as combinatorial search. These are early examples, but they demonstrate the pattern: generate many, verify cheaply, surface the best.
This connects to a broader shift in how AI systems spend compute. For years, the scaling story was about training: bigger models, more data, longer runs. But training-time scaling may be hitting diminishing returns, and fresh publicly available training data is increasingly scarce.
The new frontier is inference-time scaling. Instead of baking all capability into weights during training, you let the model search longer at runtime. Explore more possibilities. Backtrack. Verify. Retry.
Code generation is the natural fit for this approach. We can let the model explore aggressively because we have a reliable filter. Other domains (open-ended text, nuanced reasoning) lack clear verification signals. Code has them.
Richard Sutton’s “Bitter Lesson” argued that 70 years of AI research points to one conclusion: general methods that leverage scale consistently beat clever human-designed approaches. Search and learning, powered by compute, win.
Code generation is where this lesson plays out next. Not just because LLMs are getting smarter at writing code in one shot, but because the combination of generation and verification enables search at scale.
What This Means
If this framing is right, a few implications follow.
First, compute demand is shifting, not plateauing. Training is hitting walls, but inference-time search is bounded mainly by cost and verification quality. For problems with clear correctness criteria, more compute means more exploration. The constraint evolves from “do we have enough data to train on?” to “do we have enough compute to search with?”
Second, data walls matter less than they appear. If inference-time search scales, we’re less dependent on training data to bake in every capability. The model learns general patterns, then searches for specific solutions at runtime.
Third, code is the domain where AI capabilities level up first. Not because code is easy, but because it’s verifiable. The same property that makes us comfortable trusting LLM-generated code today makes it amenable to scaled search tomorrow. What works here will propagate to other domains as verification methods improve.
That instinct many of us share (trusting the code, not the LLM) turns out to be a leading indicator. It points to a mode of AI capability that routes around the LLM’s inherent limitations by generating artifacts that don’t share them.
The probabilistic model generates. The deterministic code executes. And in that handoff lies more power than either possesses alone.