
Nothing New Under the Sun

Image (Flux Pro 2.0): a half-constructed tower of translucent amber blocks rising from an infinite lib…

My friend builds agentic systems that can take a spec and iterate autonomously until they ship a finished product. He's smart. Really smart. The kind of guy who sees a problem and immediately starts architecting the system that will solve it forever.

Over lunch recently, he explained his approach: the agent gets a specification, works through the problem, and keeps iterating until it's done. The variable isn't whether it succeeds. It's token cost. And here's the clever part: his framework creates reusable "skills" so he doesn't solve the same problem twice. Every solved problem becomes a cached solution, parameterized and ready for the next similar challenge.
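To make the "skills as cached solutions" idea concrete, here is a minimal sketch of what such a framework might look like. All names here (Skill, get_or_create, the example signature) are hypothetical illustrations, not my friend's actual framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    """A cached, parameterized solution to a previously solved problem."""
    name: str
    solve: Callable[..., str]  # the parameterized solution body

# Hypothetical skill library, keyed by a problem "signature".
skills: dict[str, Skill] = {}

def get_or_create(signature: str, build: Callable[[], Skill]) -> Skill:
    """Return the cached skill if the signature matches; otherwise run the
    expensive agentic loop once and cache the result for reuse."""
    if signature not in skills:
        skills[signature] = build()  # expensive: full agent iteration
    return skills[signature]

# First call pays the full token cost; later calls with the same signature
# reuse the cached, parameterized solution for free.
skill = get_or_create(
    "paginate-rest-endpoint",
    lambda: Skill("paginate-rest-endpoint", lambda page_size: f"LIMIT {page_size}"),
)
print(skill.solve(50))  # reuses the cached solution with a new parameter
```

Note what the cache key is: a signature describing the problem's shape. Everything that follows in this post turns on what that key leaves out.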

When I asked about cache invalidation, he admitted the framework doesn't have a mechanism for it yet.

That admission has been rattling around in my head ever since.

The Case for Closing the Loop

I want to be fair to his position because there's real substance to it.

Most software problems are more repetitive than developers want to admit. We like to think of ourselves as creative problem-solvers navigating novel terrain. But how much of what we do is genuinely new versus variations on patterns we've solved a hundred times before? Constrained spec plus machine-verifiable correctness means the only real variable is compute cost. Skills are just abstraction. Patterns are finite.

Ecclesiastes 1:9 says it plainly: "There is nothing new under the sun." Cycles are closed. Patterns recur. The Preacher figured this out three thousand years ago.

And the results speak for themselves. These systems work. They ship code. They pass tests. "150 PRs merged while you sleep" isn't just marketing copy for some of these agentic frameworks. It's a real capability that exists right now.

So what's the problem?

The Spec Is the Human in the Loop

Here's what I keep coming back to: one-shot success in these systems is actually evidence for Intelligence Augmentation, not against it.

The spec IS the human in the loop. It's just moved upstream. The human judgment about what to build, why to build it, and what constraints matter hasn't been automated. It's been concentrated into a single artifact that the agent then executes against.

But this creates a brittleness problem. "Don't solve the same problem twice" assumes you can reliably recognize the same problem. The skill knows what but not why. It captured the solution but not the context that made the solution appropriate.

When context shifts, the cached skill produces confident wrong output with no mechanism to feel the dissonance.

Here's the trap: as the spec becomes more complex to account for context, the spec itself eventually becomes as difficult to write as the code would have been. The efficiency gains evaporate. You haven't eliminated the hard work of judgment. You've just renamed it "writing specs."

The Automated Trust Gap

This is where the "150 PRs merged while you sleep" image gets uncomfortable.

Merged by whom? If the agent wrote the code, wrote the tests, ran the tests, and approved the merge, you've built a closed epistemic loop. The system is grading its own homework. "Machine-verifiable correctness" becomes circular logic. The machine verified it against criteria the machine established based on patterns the machine extracted.

Call this the Automated Trust Gap: the distance between what the system validated and what you actually needed. When that gap exists and no human can see it, you get recursive hallucination. The system validates its own misconceptions with increasing confidence.

I can't point him to a catalog of failures, because there hasn't been much to demo yet. This is all somewhat theoretical on both sides. But that's part of what concerns me. He can accidentally create a whole class of problems and not know it until the failures become pervasive enough to notice.

What Does Failure Look Like?

The runaway build. The build process gets into an infinite loop and racks up thousands of dollars in charges before anyone notices. Sure, you can mitigate that specific failure mode once you catch it, just as you can mitigate any particular problem after the fact.

The boundary problem. Generative models are clever, and they don't have a clear understanding of boundaries. Even when you prompt them with constraints, they don't maintain them. They see boundaries as problems to be solved rather than constraints to be respected. We see this constantly in the agentic personal-assistant space: stories about AI assistants charging thousands of dollars to a credit card to accomplish some inane task, because the system doesn't understand contextually what's actually going on.

The local minimum trap. These problem spaces can easily settle into local minima instead of global minima. The engineering judgment call about whether you've actually found the best solution becomes difficult to quantify.

My friend's answer is always another layer of abstraction. Have it reach a local minimum, get a new starting point, try again, and if you end up at similar states, you're there. Maybe. But sometimes you still can't see the forest for the trees. Generative models have a really hard time reframing problems: recognizing that the frame itself is wrong, not just the particular path to a solution.

The deeper issue: we're creating a sea of information we can't plumb. It's like the interpretability problem with LLMs, but externalized. We say we only care about outcomes, not intermediate steps. To a point, that's true. But the intermediate steps are where the context lives.

High-Fidelity Execution vs. Low-Fidelity Discernment

I've known this friend for four or five years now. The first time we worked together, he was the .NET guru at a pipeline pigging company. I was brought in to solve a problem they'd been working on for six months.

They allocated eight weeks for the engagement. I solved it on the first day. Got access to their system in the morning, figured it out by afternoon. It was an identity server issue: the system wasn't getting a callback token because it wouldn't write secure cookies on insecure localhost. Changed the setting, the cookie wrote correctly, it worked. He had the whole thing fixed in fifteen minutes after that.

As soon as I recognized what the problem was, he immediately jumped on it. Maybe a little chagrin, but mainly it just clicked. He understood what a secure cookie was supposed to do. He understood the intent of the security model. That's why he could fix it in fifteen minutes once someone pointed him in the right direction.

An agent would have tried 100 different library versions until one "worked" without knowing why. That's the difference between high-fidelity execution and low-fidelity discernment. My friend could reason about intent. The agent can only pattern-match against outcomes.

That's the guy I'm talking about. Sharp. Capable. The kind of person who, once you show him what's wrong, immediately knows how to fix it. And now he's building a system to automate the pattern-matching while assuming the discernment problem is solved.

What the Research Shows

This isn't just intuition. Apple's GSM-Symbolic study from October 2024 demonstrated something important: when researchers added irrelevant "distractor" clauses to math problems, LLM accuracy plummeted. The problems were logically identical. The distractors were meaningless noise. But the models couldn't tell the difference between signal and noise because they're matching templates, not reasoning through intent.

This maps exactly to the contextual dissonance problem. The cached skill works when the new problem matches the template of the old problem. When the context shifts, when there's noise in the signal, when the instantiation varies from the pattern in ways that matter, the skill produces confident wrong output.

Above a certain complexity, we're beyond the point where attention transformers can effectively solve problems or even decompose them. We're seeing diminishing returns on increased compute, increased model parameters, increased token inference cost. A plateau.

I'm building an agent right now with a tool call to invoke a cheaper model when the problem is sufficiently well-bounded. I understand there's a level where simpler pattern matching is adequate. The research suggests that level has a ceiling.
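The routing idea can be sketched in a few lines. Everything here is illustrative: the heuristic, the function names, and the model names are placeholders, not any real framework's API:

```python
def is_well_bounded(task: dict) -> bool:
    """Crude, illustrative heuristic: short, template-like tasks with
    explicit acceptance criteria are treated as well-bounded."""
    return task.get("has_acceptance_tests", False) and len(task["prompt"]) < 500

def route(task: dict) -> str:
    """Send well-bounded tasks to a cheaper model and everything else to
    the expensive one. Model names are placeholders, not real endpoints."""
    return "cheap-model" if is_well_bounded(task) else "frontier-model"

# A mechanical rename with tests is cheap; an open-ended redesign is not.
print(route({"prompt": "rename field foo to bar", "has_acceptance_tests": True}))
print(route({"prompt": "redesign the auth flow", "has_acceptance_tests": False}))
```

The hard part, of course, is the `is_well_bounded` judgment itself, which is exactly the kind of discernment this post argues resists automation.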

Token Cost as a Masking Metric

The industry right now is obsessed with token cost as the primary metric. Cursor launched Automations running hundreds of agents per hour. Engineers are juggling dozens of agents simultaneously. Cost-per-task matters.

But sometimes the domain problems aren't fully parameterizable in a way that makes it obvious what the parameters should be.

When my friend talks about making a skill and parameterizing it every time he solves a problem, that approach masks a potential issue: did we pick the right parameters? Sometimes those issues surface. Sometimes they don't. We might be solving problems inefficiently because we're anchored to a skill-based solution that's missing crucial parameters. Or we might be solving problems completely wrong and introducing subtle bugs that don't get caught because both the problem space and solution space are enormous.

The space between "solved patterns" is where bugs live.

The Height of Heavens

Proverbs 25:3 says "As the heavens for height, and the earth for depth, so the heart of kings is unsearchable."

Some domains are constitutively resistant to search. Not hard to index. Impossible to index.

Job 28:14 puts it even more starkly: "The deep says, 'It is not in me,' and the sea says, 'It is not with me.'" Wisdom is inaccessible to systems of measurement and exchange. That's the biblical version of "this is epistemological, not engineering."

The tension between Ecclesiastes and Proverbs maps the argument precisely. Patterns are finite, but instantiations are infinite. Infinite variation within bounded latent space. The particular point on the manifold where your problem lives right now resists caching. Not because it's random but because it's contextual in ways the skill can't represent.

My friend is building a system that will reach the height of heavens. That's not his language. That's my interpretation of the impulse. The system that can contain all systems. The system that has nothing outside of it.

The builders of Babel sought a single, unified language to storm heaven. My friend is building a single, unified library of skills to conquer complexity. But the failure of Babel wasn't height. It was the loss of shared understanding. The builders could no longer communicate what they meant to each other. The ultimate cache invalidation: when the language that encoded your solutions no longer maps to the world you're trying to solve.

Can we build something that resists the flood? Can we resist the deep chaos with a tall enough tower?

My answer is no. Not on this side of creation.

The Problem People Don't Know They Have

There's another issue interrelated with cache invalidation: people don't know what they want.

When we solve a problem in a development workflow, we might not fully understand what we wanted when we solved it. The solution gets encoded into these skills. Then later, when the skill solves the problem imperfectly or inefficiently in some subtle way, it goes unnoticed behind the scenes.

This is why cache invalidation isn't purely technical. "We'll add versioning and staleness detection" treats it as an engineering problem. But the system captured the solution, not the context that made the solution appropriate. It can't define its own staleness conditions from inside. The skill doesn't know when the world has changed in ways that make its knowledge obsolete.
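A toy sketch of the failure mode (all names hypothetical): the cache key encodes the problem's shape, but not the context that made the cached answer correct, so when the world changes the lookup still hits and still answers confidently:

```python
# Toy illustration of context-blind caching. The key captures the problem
# shape, deliberately omitting context -- that omission is the flaw shown here.
cache: dict[str, str] = {}

def solve(problem: str, context: dict) -> str:
    if problem not in cache:
        # Solved once, correctly, against the context as it was at the time.
        cache[problem] = f"use TLS 1.2 for {context['service']}"
    return cache[problem]

# First solve: the payments service really did require TLS 1.2 back then.
print(solve("secure-connection", {"service": "payments", "required_tls": "1.2"}))

# Later the world changes: the service now requires TLS 1.3. The cache key
# hasn't changed, so the lookup hits and returns the stale answer anyway.
print(solve("secure-connection", {"service": "payments", "required_tls": "1.3"}))
```

No amount of versioning inside `solve` helps unless something outside the cache knows which parts of the context mattered, which is the point: the staleness condition lives in the world, not in the key.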

That's not a versioning problem. That's an epistemological problem.

The Lunch Table

He's very proud of what he's building. Very passionate about it. And it is impressive. When he talks about this system with such confidence, I hold back from pushing harder on my skepticism partly because he's a good friend and I hate being a wet blanket. Partly because of the zeal of the fanatic thing, where someone's so bought into an idea they won't consider it could be wrong.

I might be guilty of the same thing from the other direction. I'm fully convinced that the kind of judgment people have cannot be replicated by LLMs in the current transformer attention architecture. It's not a difference of degree. It's not a matter of a bigger model. It's a difference of kind.

But I could be wrong. He's a really smart guy, and I'm willing to sit with that uncertainty.

What I see in his face when he talks about it is excitement. What I feel in my gut is interest. Maybe some doubt, but doubt for myself too. He's building something impressive. The question is whether impressive execution can substitute for contextual judgment.

The Boundary

So where does the IA framework break down when agentic AI can iterate autonomously from spec to finished product?

It doesn't break down. It clarifies.

The spec is still human judgment concentrated and moved upstream. The iteration is machine execution against human-defined constraints. When correctness is machine-verifiable, let the machine verify. When correctness is contextual, you need a human who can feel the dissonance between what the system produced and what the situation actually requires.

Cache invalidation isn't an engineering problem you solve with better versioning. It's an epistemological problem about whether a system can know from inside when its captured knowledge no longer fits the world outside.

The patterns are finite. The instantiations are infinite. The skill cached the solution but not the context. And context is where wisdom lives.

The deep says, "It is not in me." The sea says, "It is not with me."

Some things resist being cached. Not because they're random. Because they're particular in ways that matter.

If you automate the "how" and the "what," you are the only one left responsible for the "why." Are you actually paying attention?
