I Built an AI Harness Because I Want to Trust AI (Series Final)

20 min read

目次

  1. Series Index
  2. Origin — What I Was Thinking About in 2025
  3. Wall 1: The Context Window Limit
  4. Wall 2: Don't Lean on Learning Either
  5. The Way Out — GraphRAG + MCP
  6. I Don't Trust AI to Begin With
  7. So I Build Harnesses to Hold AI to Determinism
  8. What "Don't Make AI Infer, Lean on Determinism" Actually Means
  9. Trial and Error That Got Me Here
  10. I Spent Two Months on Static-Analysis code-graph, Then Threw It Out
  11. Setting Coverage 90% as a Solo Target Broke the Implementations
  12. Parallel Sub-Agent → Sequential Evaluation
  13. What This Section Is Really Saying
  14. Closing the Series
  15. Afterword — Where Engineering Careers Are Heading

Hi, I'm Ryan, CTO at airCloset.

Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It is unrelated to existing commercial services like Snowflake Cortex or Palo Alto Networks Cortex.

Across the six posts of this series I've worked through how cortex's harness is put together, mechanism by mechanism: the overall picture, the knowledge graph, Auto Review, Self-Healing + Recurrence Prevention, and non-engineer PRs. Having walked through all of them, I want to step one level down for the wrap-up. Why am I building this thing in the first place? That's what this post is about.

The five mechanisms might look independent, but the root is one thing, and the series doesn't close cleanly without that one thing being put into words. Together with the philosophy, I want to look back at the failures that don't show up when you only write about what worked — what I threw away, where I tripped — as a reference point for anyone trying something similar.

Series Index

# Theme Key scene Article
1 Series intro: cortex's harness PRs auto-merge / incidents self-heal before you notice ai-harness-intro
2 Product Graph (cpg) Code, docs, DB, infra unified into one graph cortex-product-graph
3 AI PR review webhook → AI review → auto-fix → squash merge cortex-auto-review
4 Self-Healing + observability + auto-added guardrails Alert → AI investigates → fix PR + new lint/type gate → auto redeploy cortex-self-healing
5 Democratizing the maintenance phase Domain experts open PRs to production; the harness owns the quality gate cortex-non-engineer-prs
6 Series wrap-up The underlying philosophy plus a retrospective on the failures and lessons This post ← you are here

Origin — What I Was Thinking About in 2025

When I started building cortex, there was one question I wanted to answer:

How do I get AI to understand the system accurately?

If AI could understand the system accurately, then PR review, bug investigation, and fixes could all be delegated, and even non-engineers could open up their own development. Conversely, as long as I was stuck on "understand it accurately," everything downstream was sitting on unstable ground. So I spent a lot of time on the prerequisite layer before any of the individual mechanisms.

The two obvious approaches both hit walls.

Wall 1: The Context Window Limit

The first reflex is "just give it all the information it might need." Stuff the codebase, docs, DB schema, infra definitions all into the prompt, and AI gets the whole picture.

That fails on size. Codebase + docs + schemas + infra at our company doesn't come close to fitting into any realistic context window.

"Surely context windows will keep growing, and this'll work eventually?" — the more I thought about it, the less of a future I saw in that direction.

Even with a model whose context window is very large like Gemini, behavior gets unstable when you push it close to the limit. Middle information gets dropped, irrelevant tokens skew the conclusion sideways. This isn't a model-selection problem; it's a structural attention problem. The more unrelated tokens you mix in, the more the attention ratio toward relevant tokens drops mechanically. This is the documented "lost in the middle" phenomenon (information placed at the start and end of long inputs gets used; information placed in the middle is effectively ignored), and stuff the context window full and you routinely end up in a state where the information you thought you handed over isn't actually visible to the model.

So even if "size" is solved, you immediately need a higher-level context to judge which tokens are necessary and which aren't. That problem is recursive and can't be resolved by context window size. Information has to be structured, or AI doesn't make correct judgments. That's true of humans too — but humans are a notch better off, because LLMs don't notice they don't know, and they answer with confidence anyway. Silently wrong is worse than visibly stuck.

The keep-growing-context-windows path didn't have a real resolution in sight.

Wall 2: Don't Lean on Learning Either

The other obvious move is to make AI itself learn. Fine-tune per organization, teach it our codebase, our docs, our business. I considered it. Currently not doing it.

Two reasons. One: getting learning into actual production was still research-phase (in 2025 then; still in 2026 as I write this) and the path to real deployment had high latency. The other is thornier: even if you could learn it, "forgetting" is extremely hard.

A business system has to reflect "the current truth." When the design changes, the DB schema changes, the business rules change, you want to actively erase old knowledge. But "delete just this piece of what's baked into the LLM weights" is unsolved at the research level — there's even a field name for it, machine unlearning, which tells you how hard it is. And on top of that, teaching the model new things also destroys unrelated existing knowledge (called destructive interference / catastrophic forgetting). Lean on learning and both hit at once: the cost of keeping things consistent explodes.

Rather than treating "doesn't learn" as a downside, I came around to: because it doesn't learn, swapping out the external knowledge is enough to reflect the current state, and the consistency story is much simpler. That was the call at the time.

The Way Out — GraphRAG + MCP

With no future in the context-window direction or the learning direction, I came across the GraphRAG concept.

GraphRAG itself is widely discussed elsewhere; for me, what it meant was the framing: "supply only the context that's needed, at the moment it's needed." Combined with MCP (Anthropic's protocol for connecting LLMs to external tools), AI can go fetch what it needs on its own.

What was decisive was that this structure lets AI traverse the graph agentically. Rather than "read everything and find related parts by inference," AI gets to the node it needs and pulls the fact out. Which leads to:

Instead of making AI infer, supply facts as context.

That one sentence became the core of cortex's entire design philosophy.

The first thing I built was a static-analysis-based code-graph, which I then threw away after trial and error, and arrived at the annotation-based product-graph (cpg) — details in the trial-and-error section.

2025 origin. Neither growing the context window nor relying on learning had a future; GraphRAG + MCP became the way through.

I Don't Trust AI to Begin With

The origin section in one line:

I don't trust AI to fill in the blanks for me.

"Don't trust" here is not the same as "have no faith in." This isn't doubting Claude / GPT / Gemini's generation quality. What I mean is:

The first one is a truth no amount of model progress will change. Architecturally, LLMs can't know things that weren't in the training data and aren't in this session's context. "Surely smarter models will pick up on it" — I don't think that future is coming. Smarter is a real direction; smart alone doesn't compensate for not knowing.

The second one is about responsibility, and humans owning it. AI can't decide on its own what "ideal" means. When it tries, it lands on a generic best-practice answer slightly off from the actual situation. Ideal depends on the business, the organization, the moment in time — none of which is visible to AI unless a human verbalizes it and hands it over.

So that conviction is not underestimating AI's capability; it's a design decision to not let AI auto-complete the prerequisites.

Mastering AI is not about giving it freedom — it's about confining its output to a predictable range.

The mechanism for confining it is the harness this series has been describing.

So I Build Harnesses to Hold AI to Determinism

Reading each mechanism through the lens of "don't make AI infer; lean on determinism" surfaces that the five posts in this series are all the same conviction showing up in different layers.

Part 2 — Knowledge Graph: Instead of making AI search the codebase, this mechanism tilts toward making the codebase legible. With @graph-* annotations, code / docs / DB / infra are unified into one graph, so AI doesn't have to grep + infer to find related parts. This is the direct implementation of "supply facts as context" from the origin section. → cortex-product-graph

Part 3 — Auto Review Dimensions: Nine review dimensions (responsibility / severity / type SSoT / etc.) are fixed in advance. When AI does the review, what to check isn't something it gets to infer. "Looking at the PR as a whole" gives AI too much room for inference, so dimensions are split and each is judged independently. Dimensions = locked by the harness, evaluation = AI's job.cortex-auto-review

Part 4 — Self-Healing + Recurrence Prevention: Alert → investigation → fix PR → redeploy. The flow itself is fixed. AI doesn't get to think through "how should we respond to incidents" each time. And Recurrence Prevention — adding lint / CI gates so the same trap can't be stepped on twice — is mechanical refusal at the gate, not trust-AI-not-to-do-it-again. Or put differently: I don't expect AI never to repeat a mistake. → cortex-self-healing

Part 5 — Non-Engineer PRs: If the harness weren't holding quality, business-side folks opening PRs directly to production wouldn't survive a single day. Conversely, with the three mechanisms above stacked up (context locked, dimensions locked, traps locked out mechanically), the person closest to the requirements can ship the change directly. The translation layer and the engineering priority queue disappeared as a downstream consequence of the determinism push. → cortex-non-engineer-prs

So the five mechanisms are "don't make AI infer; lean on determinism" implemented at different layers. The root is one conviction.

What "Don't Make AI Infer, Lean on Determinism" Actually Means

Let me sharpen this phrase that's come up a few times.

"Lean on determinism" does not mean "give AI zero room to infer." Code generation, judging review findings, hypothesizing root causes from error logs — these are domains where AI not inferring is the same as no work getting done.

Where I want to lean on determinism is in domains where variance isn't allowed. Specifically:

What implements this line — where inference is allowed vs. where it isn't — is the harness. To borrow the metaphor from Part 5, the harness lays down rails you can't fall off. On top of the rails, AI runs free (inference works as inference); but it can't fall off the rails sideways.

Put differently, this is equivalent to the framing in Part 2 (cortex-product-graph): "where hallucination gets confined." Saying "no inference allowed" isn't quite right — the harness isn't a thing that makes hallucination go to zero. It's a thing that confines hallucination to places where hallucination is OK (i.e., the inference-allowed zone). The structure and facts about the codebase are pulled deterministically, so hallucination has no opening there; hallucinations on the judgment side get filtered downstream by tests / lint / dimension-by-dimension reviews. The places where hallucination is allowed and the places where it isn't are physically split by the harness. That's the continuation of the Part 2 framing.

Inference-allowed zone (top, green) and inference-forbidden zone (bottom, orange). The harness implements this boundary — i.e., decides where hallucination gets confined.

This is also the underlying basis for Part 1 (Series Intro)'s claim "models commoditize; harnesses differentiate." Model-side quality is converging across Claude / GPT / Gemini, but the harness is codebase-specific and business-specific, so this is where org-level differentiation actually comes from.

A note: this framing isn't confined to cortex's harness. The same stance shapes db-graph MCP, the natural-language interface over internal DB schemas, and Sandbox MCP, which lets non-engineers safely publish AI-built apps. It's the through-line in any platform we build that's based on AI doing meaningful work.

One level more abstract: the individual features aren't where the value is. The value sits in the conviction itself. cortex / db-graph / Sandbox MCP are all that one conviction translated into our own use cases.

The way I think about "design":

Design is translating an abstract principle into a concrete implementation that fits your own use cases.

It's not drawing class diagrams, and it's not laying out architecture diagrams — it's the translation work of "how does this principle take shape under our business / codebase / constraints?" That's where each organization's distinctiveness lives, and that's the value that can't be copied.

Said the other way: another organization copying cortex's surface doesn't reproduce the substance. What gets asked of every org is how it translates this principle into its own use cases.

Trial and Error That Got Me Here

Everything I've written about above is the form that ended up working. Getting to that form involved a lot of throwing away. Two representative examples worth keeping on record, plus one shorter one.

I Spent Two Months on Static-Analysis code-graph, Then Threw It Out

The first thing I built was static-analysis-based code-graph: extracting AST data — imports, call graphs, type dependencies — and putting that into a graph DB. At a glance, the obvious implementation of "make AI understand the codebase."

Why two months? code-graph wasn't just cortex; it spanned our consumer-facing services and internal-system repositories too — over 40 repos in total (cortex being one of them). The mechanically-extractable AST data (imports / call graphs / type dependencies) was usable as-is via tree-sitter, but each repo had its own API endpoints / DB schema / event definitions / Pub/Sub topology, and extracting those boundary nodes (where an app meets the outside) goes beyond mechanical AST analysis and had to be implemented per-repo-type — that's where the time went.

So I spent two months out of the first three on this, and got something that worked end-to-end.

And then I threw it away.

Why: static analysis is great at capturing structure, but it can't traverse on intent or business context. Concretely, three things broke:

→ I switched to the annotation-based approach (@graph-* JSDoc tags write the business intent into the code, and that gets unified with DB schema / docs / infra into one graph). Searchable semantically, and when you traverse, only related stuff comes back. That's the current product-graph (cpg). Don't drag sunk cost forward and you'll get to the final form — discarding two months of investment instead of trying to recoup it was the foundation for everything that came after.

Setting Coverage 90% as a Solo Target Broke the Implementations

Test coverage is still gated at 90%+ (as covered in Part 3). That part hasn't changed. But there was a period when Coverage was treated as a standalone target, and during that period the implementation visibly got worse.

Specifically:

Result: Coverage 90%, quality lower than before. When you look at Coverage alone, the shortest path to "satisfy it" is a weaker implementation that passes the tests.

Two lessons:

Then, as the follow-up: I added linting that mechanically closes off the routes that let you weaken implementations to satisfy Coverage. Two specific examples:

On top of that, cortex's testing guideline opens with "Coverage is not the goal, just a supporting indicator," and threshold-lowering / istanbul ignore workarounds get bounced as Critical in Auto Review. So even when Coverage is satisfied, "this is intentionally deleting a branch" / "this is swallowing the exception" comes back as a Major finding.

From the lesson "a single metric warps implementation," we descended through guideline that states the principle → lint that mechanically rejects → Auto Review that evaluates as a dimension before Coverage 90% finally functioned as the "minimum floor" it should have been all along. This too sits in the lineage of the Recurrence Prevention mechanism from Part 4 (so the same trap can't be stepped on twice).

Parallel Sub-Agent → Sequential Evaluation

Third: an internal-structure call about Auto Review. Distribute the 9 dimensions to parallel sub-agents and evaluate concurrently — the plausible-looking design ("parallel = faster, parallel should also hold quality") I tried first and ended up throwing out.

What actually happened: time, cost, and accuracy all got worse.

Switching to sequential evaluation: same session goes through 9 dimensions in sequence, so context loads once, and each dimension's call has the previous dimension's verdict in front of it. All three — time, cost, accuracy — improve simultaneously.

The takeaway: the distributed-systems intuition that "parallel = faster, parallel = quality holds" breaks its own assumptions in an AI harness. Unlike parallelizing across CPU cores on your machine, with AI the context isn't shared memory; it's per-process state. Sequential evaluation in one session ends up better on speed, token efficiency, and inter-dimension consistency at the same time — a structural property that's easy to miss at design time.

What This Section Is Really Saying

The form I've described across the series is the result of a lot of trial and error. Not starting with the right answer and laying it out from there. The decisions of throwing things away with sunk costs included, the trap of letting a metric I chose turn into the goal, the distributed setup that looked natural but worked against me — those are the things I walked through before landing at the current shape.

Not easy. I don't pretend it was. But if you do walk through it, real results follow — that's the honest read on it now.

Closing the Series

What I most wanted to communicate across these six posts comes down to one thing:

AI coding is not about "how to use AI" — it's about designing the environment AI runs in.

Assuming a large codebase: prompt engineering / model selection / tool selection — each matters individually, but polishing them alone doesn't get you to auto-merging PRs, auto-healing incidents, or non-engineer development. Getting there requires building a codebase / business flow / observability / repair cycle where AI doesn't need to infer. That's not an individual AI skill — that's an environment-design problem (conversely, for a small project of a few dozen files, today's AI models work fine standalone. Harnesses become essential when scale exceeds what one person can hold in their head.).

And the conviction at the root of environment design is, repeating myself, "I don't trust AI" — looking the reality in the face that context that wasn't handed over isn't known, and the ideal state doesn't happen without being told. Once you accept that premise, what to build clarifies naturally.

Looking back, four decisions ended up being the ones that mattered:

If even one of these is useful to someone starting on something similar, that would be great.


Afterword — Where Engineering Careers Are Heading

Slipping off the wrap-up topic — this is something I've been turning over recently, written here in a "loosely held thought" tone, so feel free to skim.

As harnesses mature, I think engineering work splits along two directions.

One direction is value creation from problem identification and business design. In the world of Part 5 — where non-engineer PRs work — "writing code" stops being scarce, and the actual scarce thing becomes the ability to define what to build. The person closest to the requirements (a PMO, a business manager, a domain-deep engineer) ends up driving Claude Code through to the merged PR themselves. This direction looks less like "engineer" and more like a business designer who moves between domain and implementation.

The other direction is building the foundation that lets all of that happen safely and quickly. Non-engineers can open PRs to the production repo only because the harness underneath holds quality — knowledge graph, Auto Review, Self-Healing, Recurrence Prevention, lint, CI, tests, observability stack, all interlocked. Designing / maintaining / evolving that gets harder, not easier. As the house-builder side, rail-layer side, this demands deep infra understanding / security instinct / observability design / a feel for AI's architectural quirks.

I'm building cortex, so I'm spending more time on the latter; building "a foundation where the business can run its own changes" is genuinely fun for me. That said, I'm not the type who fully commits to one side — I move between listening to business questions and assembling the foundation, and the satisfaction from each is its own kind. This isn't a "which is more important?" question — the harness exists precisely so the former is possible, and the former being alive is what gives the latter meaning. They're mutually dependent.

Maybe the era of polishing just "coding ability" is shifting slightly. Where to put your value — or whether to move between both directions — becomes a question more engineers will need to choose into intentionally.


Six posts in, thanks for sticking with me to the end.


# Theme Key scene Article
1 Series intro: cortex's harness PRs auto-merge / incidents self-heal before you notice ai-harness-intro
2 Product Graph (cpg) Code, docs, DB, infra unified into one graph cortex-product-graph
3 AI PR review webhook → AI review → auto-fix → squash merge cortex-auto-review
4 Self-Healing + observability + auto-added guardrails Alert → AI investigates → fix PR + new lint/type gate → auto redeploy cortex-self-healing
5 Democratizing the maintenance phase Domain experts open PRs to production; the harness owns the quality gate cortex-non-engineer-prs
6 Series Final — the underlying philosophy plus a retrospective This post

comments (0)

no comments yet.