Blog/2025-12-11/LLMs Excel At Easy Verification Problems

From Rest of What I Know

In Blog/2025-12-01/Grounding Your Agent I talked about how grounding your agent allows it to make better decisions. This is akin to the approach you would take if you were to debug code.

The core device in debugging is the structure of the discovery loop. Once you can reproduce the issue, you run a reduction loop:

  1. Check whether the current example still reproduces the bug
  2. If it does, shrink the example and repeat
  3. Else revert the last reduction and terminate
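The loop above can be sketched in a few lines. This is a simplified, element-at-a-time take on input minimization (in the spirit of delta debugging); the `reproduces` predicate and the toy inputs are placeholders for your real repro check.

```python
def reduce_example(failing_input, reproduces):
    """Greedily shrink `failing_input` while the bug still reproduces."""
    current = failing_input
    shrunk = True
    while shrunk:                        # 1. enter the loop
        shrunk = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if reproduces(candidate):    # 2. reduction kept the bug: keep it
                current = candidate
                shrunk = True
                break
    return current                       # 3. no reduction works: terminate

# Toy usage: the "bug" triggers whenever both 3 and 7 are present.
mre = reduce_example([1, 2, 3, 4, 5, 6, 7], lambda xs: 3 in xs and 7 in xs)
```

On the toy input this shrinks the list down to `[3, 7]`: every other element can be removed without losing the failure.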

The end of this loop yields a Minimal Reproducible Example (MRE) that you can then use to drive your debugging loop. In the classic case of a regression, that debugging loop is a bisection over your codebase's history, using the MRE to ground the search.
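The bisection itself is a binary search, in the spirit of `git bisect`. A minimal sketch, assuming an ordered commit history where the first commit is known good and the last known bad, and a `fails_on` check that runs the MRE against a single commit (both are stand-ins here):

```python
def first_bad_commit(commits, fails_on):
    """Binary-search for the first commit on which the MRE fails."""
    lo, hi = 0, len(commits) - 1         # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if fails_on(commits[mid]):
            hi = mid                     # bug already present: look earlier
        else:
            lo = mid + 1                 # still good: bug landed later
    return commits[lo]

# Toy history: the regression landed at commit "d".
history = list("abcdefg")
bad = first_bad_commit(history, lambda c: c >= "d")
```

Each iteration halves the suspect range, so even a long history collapses in a handful of MRE runs.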

Using LLMs to retrieve data from their memory will succeed in many cases, it's true. But that's a seductive (and false) god. The majority of an LLM's knowledge is there as a substrate for its intelligence and ability to reason. Much of it can be retrieved, but between knowledge cutoffs and imperfect recall, the better use is to treat them as reasoning agents rather than as knowledge stores.

I suspect this is the primary reason for the high variance between different people's experiences with LLMs. Some, like me, use them in both attended and unattended modes to write code. Others use them only under close attention, and still others find that they generate code that's useless to them.

In my experience, everyone I know who talks about using LLMs successfully in semi-attended modes uses them after transforming the problem into a checkable. LLMs are pretty good at working toward a yes/no answer. The primary response from my mesh checker script was just an affirmative or negative on whether the generated mesh was a proper solid without self-intersections. That was sufficient for the LLM to iterate.

In fact, it is not whether the problem is tractable that matters so much as whether the problem is a checkable. Certainly, in the extreme, no LLM is going to quickly solve your TSP faster or better with just an "Is this the fastest?" check, but there is a large class of problems where you can easily check at the end whether or not the solution is any good. LLMs are great at these.
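TSP illustrates the asymmetry well: finding an optimal tour is hard, but verifying that a proposed tour is valid, and scoring it, takes a few lines. The city coordinates and tour below are made-up examples.

```python
import math

def check_tour(tour, cities):
    """Return (is_valid, total_length) for a proposed TSP tour."""
    # Valid means: each city index appears exactly once.
    valid = sorted(tour) == list(range(len(cities)))
    # Total length of the closed tour, wrapping back to the start.
    length = sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )
    return valid, length

cities = [(0, 0), (0, 1), (1, 1), (1, 0)]
ok, length = check_tour([0, 1, 2, 3], cities)   # walk the unit square
```

The checker can't tell you the tour is *optimal*, but "valid, length 4.0" is exactly the kind of cheap, objective feedback an LLM can iterate against.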

And particularly, if the problem is a checkable, you can often write a program that verifies a solution, then use the LLM in a loop to generate solutions based on that feedback.
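A minimal sketch of that generate-and-verify loop, with hypothetical stand-ins: `propose` represents a call to an LLM (any function that takes the last failure message and returns a new candidate), and `verify` is your checker program, returning a yes/no plus feedback.

```python
def solve_with_feedback(propose, verify, max_attempts=10):
    """Loop an LLM-style generator against a verifier until it passes."""
    feedback = None
    for _ in range(max_attempts):
        candidate = propose(feedback)    # e.g. an LLM API call with feedback
        ok, feedback = verify(candidate)
        if ok:                           # the checker said yes: done
            return candidate
    raise RuntimeError("no verified solution within budget")

# Toy stand-ins: the "LLM" just counts up; the checker wants a
# positive multiple of 4.
counter = iter(range(100))
solution = solve_with_feedback(
    propose=lambda fb: next(counter),
    verify=lambda x: (x > 0 and x % 4 == 0, f"{x} rejected"),
)
```

The design point is that the verifier, not the model, holds the definition of success; the model only has to propose, and the loop grounds it.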