From Eval to Your Own QA Agent: Why Generating Code Is No Longer Enough

In my previous article I wrote about eval not as assessment for its own sake but as a mechanism of truth. That distinction matters. With LLM applications we often look at the model’s output and ask: “How good was the answer?” For engineering, that’s not enough. The better question is: “Can we trust this result in a real process?”

That’s exactly the question that started my work on an autonomous QA agent.

When you start actively using AI tools for development, the first impression is almost always positive. The model writes code quickly, explains unclear parts, generates tests, suggests file structure, helps with refactoring. It feels like a significant chunk of routine work has finally disappeared. But after a few real tasks another feeling sets in: there’s speed, but not enough confidence.

AI can write a component but doesn’t guarantee it works in the browser. It can create an API test but won’t always check the right business logic. It can confidently say the task is done when the test never ran or the check was superficial. At some point it becomes obvious: generating code is only half the process. The other half is verification.

Where the problem actually appears

The problem isn’t that AI writes code badly. Often it writes it fast enough and well enough for a first iteration. The problem is different: AI doesn’t always have a mechanism of proof.

In classic development we don’t consider a task finished just because code was written. We run tests, check the UI, look at logs, hammer edge cases, analyze regressions. We look for confirmation that the idea actually became a working part of the product.

With AI agents there’s an interesting trap. They create a very convincing sense of completion. The code exists, the file changed, the comment is written, the answer looks confident. But there may be no real verification behind it. In development, code review may still catch that. In testing it is harder: a bad test often looks like a good one until release day.

For example, a test might check that a button exists but not what happens when you click it. It might pass without a single meaningful assert. It might depend on the previous test’s state. It might be green but not catch a regression. Formally there’s a test, but no mechanism of truth.

That’s where the eval idea from the previous article became practical for me. If eval isn’t “scoring” but a way to reach the truth, then a QA agent shouldn’t just generate tests. It should verify whether those tests actually have value.
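The button case above can be made concrete. This is a minimal sketch, not code from a real project: CounterButton is a hypothetical stand-in for a UI element, and its broken flag simulates a regression in the click handler. The weak test stays green on the broken button, while the test that clicks and checks the result catches it.

```python
class CounterButton:
    """Hypothetical stand-in for a UI button that increments a counter."""

    def __init__(self, broken: bool = False) -> None:
        self.visible = True
        self.count = 0
        self._broken = broken  # simulates a regression in the click handler

    def click(self) -> None:
        if not self._broken:
            self.count += 1


def weak_test(button: CounterButton) -> None:
    # Only checks that the button exists; never exercises its behavior.
    assert button.visible


def meaningful_test(button: CounterButton) -> None:
    # Checks what actually happens on click.
    button.click()
    assert button.count == 1


def passes(test, button: CounterButton) -> bool:
    """Return True if the test finishes without an AssertionError."""
    try:
        test(button)
    except AssertionError:
        return False
    return True


# Run both tests against a working and a deliberately broken button.
results = {
    "weak_on_working": passes(weak_test, CounterButton()),
    "weak_on_broken": passes(weak_test, CounterButton(broken=True)),
    "meaningful_on_working": passes(meaningful_test, CounterButton()),
    "meaningful_on_broken": passes(meaningful_test, CounterButton(broken=True)),
}
```

Only the last entry in results is False. The weak test is green in both worlds, so it says nothing about whether the feature works.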

Why off-the-shelf tools turned out to be insufficient

Before building your own approach, it makes sense to try what already exists. Many AI tools now promise help with testing: some generate Playwright tests, some analyze the codebase, some create unit tests, and others try to run scenarios through the browser.

In demos this often looks very convincing. Give it a feature description and get a test. Give it a page and get a scenario. Give it an error and get an explanation. But in a real project, complications appear quickly.

First, many tools lack a deep understanding of project context. They see a code fragment or a page but don’t always understand architecture, test strategy, team conventions, data quirks, and CI.

Second, some tools generate tests as a text artifact but don’t complete the full cycle: create, run, read the error, fix, repeat, assess quality.

Third, even if a test passes, the question remains: does it actually check something or just not fail?

That last point became key for me, because QA isn’t about creating files with tests. QA is about reducing risk. A test that doesn’t reduce risk can add noise to the project even if it looks useful.

From idea to Cairn

That’s how the idea of Cairn gradually formed.

Not as “yet another AI test generator.” And not as a tool meant to replace a QA engineer. Rather as an autonomous assistant that takes on the most repetitive part of the path: explore the application, understand the stack, generate a check, run it, analyze the result, and not stop at the first green checkmark.

I wanted to build a system that works closer to how a QA engineer actually thinks. When we test a feature we don’t just write the first scenario and stop. We form a hypothesis, test it, get new facts, refine the scenario, add negative cases, look at stability, think about regression.

Cairn should move the same way: from uncertainty to proof.
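The loop described above (generate a check, run it, analyze the result, and don’t stop at the first green checkmark) can be sketched in a few lines. This is not Cairn’s actual implementation. The names verification_loop, RunResult, and LoopOutcome are assumptions made for illustration, and the generate, run, and has_value callables are stubs standing in for a model call, a test runner, and a quality check.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool
    error: Optional[str] = None


@dataclass
class LoopOutcome:
    accepted: bool
    attempts: int
    history: list = field(default_factory=list)


def verification_loop(
    generate: Callable[[Optional[str]], str],
    run: Callable[[str], RunResult],
    has_value: Callable[[str], bool],
    max_attempts: int = 3,
) -> LoopOutcome:
    """Generate, run, and evaluate a test until it is accepted or attempts run out."""
    feedback: Optional[str] = None
    history: list = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        result = run(candidate)
        if not result.passed:
            # A red test: feed the error back into the next generation.
            feedback = f"test failed: {result.error}"
            history.append(feedback)
            continue
        if not has_value(candidate):
            # A green test is not enough; it must also verify something.
            feedback = "test passed but verifies nothing meaningful"
            history.append(feedback)
            continue
        history.append("accepted")
        return LoopOutcome(True, attempt, history)
    return LoopOutcome(False, max_attempts, history)


# Stubbed demo: three canned candidates stand in for model output.
candidates = iter(["broken test", "assert-free test", "good test"])


def fake_generate(feedback: Optional[str]) -> str:
    return next(candidates)


def fake_run(test: str) -> RunResult:
    if test == "broken test":
        return RunResult(False, "selector not found")
    return RunResult(True)


def fake_has_value(test: str) -> bool:
    return test == "good test"


outcome = verification_loop(fake_generate, fake_run, fake_has_value)
```

In the stubbed run the first candidate fails, the second passes but is rejected as empty, and only the third is accepted. The history list is the kind of marker trail the loop leaves behind.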

The name fits the idea too. A cairn is a stone marker on a route. It isn’t the road itself, but it helps you not get lost. For me that image describes a QA agent’s job well: leave behind markers in the form of checks, logs, tests, decisions, and conclusions.

Core principle: don’t trust the first answer

If I describe Cairn’s philosophy briefly: don’t trust the model’s first answer without verification.

AI can suggest a good test. But that test needs to be executed. If it fails — understand why. If it passes — understand whether it’s a false positive. If it covers the happy path — think about edge cases. If it uses an unstable selector — improve it. If it depends on data — isolate state.

Here eval isn’t a separate stage at the end but part of the process itself. Each iteration should bring the system closer to the truth: what works, what doesn’t, what’s verified, what only looks verified.

That’s why Cairn shouldn’t be just a wrapper around an LLM. It needs architecture that supports cycles, going back, error analysis, observability, and evaluating test quality itself.

Why this matters for teams

In teams already using AI for development, the speed of creating code grows. But with that grows the need for a faster feedback loop. If AI helps write features faster, the QA process can’t stay as slow and manual as before.

But the answer isn’t simply to generate more tests. Quality doesn’t grow from a large number of weak tests. On the contrary, you get flaky suites, long runs, noise in CI, and a false sense of security.

You need an approach that helps create checks with engineering value. That’s why Cairn for me isn’t about “AI instead of QA” but “AI as part of a verification system.”

It should help the team move faster from idea to implementation without skipping the most important step — proof that the implementation works.

Conclusion

After experimenting with different AI tools, one thing became clear to me: the biggest problem isn’t generation but verification. AI is already good enough at creating a first version of a solution. But engineering value appears only when that solution passes verification.

The idea of eval as a mechanism of truth, from the previous article, became the practical foundation here. If we want to build reliable AI-assisted workflows, we need not only models that write code but also systems that can verify that code.

Cairn is my attempt to move in that direction. Not to make another loud AI tool, but to build an autonomous QA cycle that helps distinguish “looks ready” from “actually works.”