Observability Design for the AI Era — Reconciling PII Protection With AI Searchability, and Driving Self-Healing (Part 2)


Contents

  1. The Observability Stack Is a Natural Path for PII
  2. Multi-Layer PII Design — Six Layers
  3. Hash on Both the Write and Search Sides
  4. Integration Surface — "Humans = Web, AI = MCP" on the Same Backend
  5. Human side: AI Operations Portal
  6. AI side: MCP
  7. The Real Driver of Self-Healing
  8. What's Still Open — Defining "What Counts as an Error" and the Stacktrace Design
  9. Closing — Static Edition + Dynamic Edition Are Lined Up; Merging Them Is the Next Series

Hi, I'm Ryan, CTO at airCloset.

In Part 1, I walked through the four monitoring axes (application / infrastructure / CI / LLM) and the deliberately different shape each one ends up in. That more or less wraps up the write side of the observability stack.

But shaping the write side isn't the end of the story. The moment production data flows through the stack, you have to block the path PII can take to slip in — and that's true with or without AI. It's the kind of classic observability problem where, if you cut corners, you walk straight into a leak incident.

Add AI to the picture and the weight of that risk jumps. Log search used to be the domain of SREs and a handful of developers, so even if PII slipped in, the population doing the searching was small. Once AI agents start querying logs broadly via MCP, both the range of searchers and the frequency of searches explode. PII handling that used to get by on "we got lucky, nothing happened" turns into a structural risk that surfaces fast.

And on top of that, if the observability stack isn't queryable by AI, the whole "AI-consumable observability" goal from Part 1 falls apart.

Part 2 is about how I reconciled these two — protecting PII while keeping searchability for AI — and how that combination ends up driving Self-Healing from CI failure to PR proposal.

The Observability Stack Is a Natural Path for PII

App emits a log → it lands in Loki → AI queries it through MCP. Stand up this naive flow and you get:

Plain-text PII pooling in the observability stack means AI can search it directly. This isn't really an AI problem; it's an observability problem, because the stack itself becomes a PII conduit. At the same time, if you scrub PII completely, you lose queries like "I want to investigate Customer A's support ticket," which is a normal support workflow.

cortex (the internal AI platform[^cortex]) had to reconcile both. The key principle was: don't make "block the PII path" and "search by PII" mutually exclusive.

Multi-Layer PII Design — Six Layers

cortex's PII handling is six layers, each with a different role:

| Layer | Purpose | Mechanism |
| --- | --- | --- |
| Write: BQ Policy Tag | Column-level access control | pii_high / pii_medium / pii_low three-tier taxonomy. Without fine-grained reader on the column, SELECT errors out with Access Denied (pure CLS, no dynamic masking) |
| Write: ETL DLP | Strip plain-text PII from derived tables | Cloud DLP redacts during transforms (customer support data, etc.). Placeholders like [EMAIL_ADDRESS] / [PHONE_NUMBER] preserve the structure |
| Write: log hashing | Plain text never reaches Loki | App-side hash via hashEmail (HMAC-SHA256 → 12-char prefix; key lives outside the observability stack) before log emit |
| Search: same function on both sides | Look up a specific customer's logs without ever touching plain text | Query-side runs the same hashEmail before sending to Loki |
| Output: MCP masking | Mask when AI consumes | Column-name detection replaces values with placeholders like ***@***.com |
| Identity separation | Internal staff email isn't customer PII | Edge Router HMAC-signs auth emails; used as attribution identifier |

The fourth row — search with the same function on both sides — is where the security / usability tradeoff gets really tight.

Hash on Both the Write and Search Sides

If you naively remove PII from logs, you can no longer answer "let me look up Customer A's logs." But if you hash at write time and store that hash in the log, the search side can run the same hash function over its input and find the matching records. The plain-text email never reaches Loki from either side.

Figure: hash on both write and search to keep plain-text PII out of the observability stack while preserving search

Concretely:

Write side:

// Application code
logger.info("Subscription updated", {
  user: hashEmail(user.email), // → '7a3f9c2e0b1d' (HMAC-SHA256 12-char prefix)
  plan: "monthly",
});
// → Only the hashEmail result ends up in Loki

Search side (when you want to pull a specific customer's logs):

The search tool's entry point puts the same hashEmail in front of every query. After passing through, only the hashed value reaches Loki:

// Search tool entry: hash first, then query Loki
const hash = hashEmail(input);
// → '7a3f9c2e0b1d'
const logs = await loki.query(`{app="subscription"} |~ "${hash}"`);
// → Returns logs containing the matching hash

Both sides run the same hashEmail, so logs from the same customer collapse to the same hash on lookup. Meanwhile:

This takes a basic property of a keyed hash (same input and same key give the same output) and turns it into a design rule: run the same function on both sides and search keeps working. The tension between security and debuggability compresses cleanly.
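For readers who want something concrete, here is a minimal sketch of what a function like hashEmail could look like, based only on the description above (HMAC-SHA256, 12-character prefix, key held outside the observability stack). The environment variable name PII_HASH_KEY, the fallback key, and the trim + lowercase normalization are assumptions for illustration, not cortex's actual implementation.

```typescript
import { createHmac } from "node:crypto";

// Assumption: the key is injected from a secret store that lives outside
// the observability stack. The env var name and fallback are hypothetical.
const HASH_KEY: string = process.env.PII_HASH_KEY ?? "dev-only-placeholder-key";

// HMAC-SHA256 over the email, hex-encoded, truncated to a 12-char prefix.
// Normalizing (trim + lowercase) is an assumption: without it, the write
// side and the search side could hash "A@x.com" and "a@x.com" differently.
function hashEmail(email: string, key: string = HASH_KEY): string {
  const normalized = email.trim().toLowerCase();
  return createHmac("sha256", key)
    .update(normalized, "utf8")
    .digest("hex")
    .slice(0, 12);
}
```

A 12-character hex prefix carries 48 bits, so two different emails can in principle share a prefix. For log lookup that mostly means an occasional extra line in the results, but it is a tradeoff worth knowing about when choosing the prefix length.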

And of course, this is all just the app log layer. The BQ side is protected by Policy Tag-based column-level access control as its own layer (rows 1–2 of the table above). The whole thing is multi-layered.
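The output layer from the table (MCP masking by column-name detection) can be sketched in a few lines as well. This is an illustration under assumptions: the rule list, the phone placeholder, and the function name maskRow are hypothetical; only the ***@***.com placeholder comes from the table above.

```typescript
type Row = Record<string, unknown>;

// Hypothetical rules: a column whose name matches the pattern has its value
// replaced by the placeholder before the row is handed to the AI consumer.
const MASK_RULES: { pattern: RegExp; placeholder: string }[] = [
  { pattern: /email/i, placeholder: "***@***.com" },
  { pattern: /phone/i, placeholder: "***-****-****" },
];

// Returns a copy of the row with PII-looking columns masked.
// Null / undefined values are passed through so "no value" stays visible.
function maskRow(row: Row): Row {
  const masked: Row = {};
  for (const column of Object.keys(row)) {
    const value = row[column];
    const rule = MASK_RULES.find((r) => r.pattern.test(column));
    masked[column] = rule && value != null ? rule.placeholder : value;
  }
  return masked;
}
```

Detection by column name is cheap but only as good as the naming discipline behind it, which is one reason it sits behind the write-side layers rather than replacing them.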

One more thing worth noting: search tool input arguments (including MCP servers) carry plain-text only at the moment of arrival. The tool runs hashEmail immediately, so neither Loki nor the MCP tool-call log retains plain text. How the search tool handles its arguments is itself part of the multi-layer PII design.
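To make the "hash immediately on arrival" idea concrete, here is a sketch of a search tool entry point where the plain-text argument is consumed on the first line and only the hash flows to the tool-call log and to Loki. The factory name, the injected function types, and the label check are assumptions for illustration. The sketch uses LogQL's `|=` (line contains) filter rather than the regex filter, since the hash is a literal string.

```typescript
type LogFn = (message: string, fields: Record<string, string>) => void;
type QueryFn = (logql: string) => Promise<string[]>;

// Builds a search function around injected dependencies so the plain-text
// email has exactly one consumer: the hash function.
function makeCustomerLogSearch(
  hash: (email: string) => string,
  queryLoki: QueryFn,
  log: LogFn,
) {
  return async function searchCustomerLogs(app: string, email: string): Promise<string[]> {
    const userHash = hash(email); // plain text is not used past this line
    // Guard the label value so caller input cannot alter the LogQL selector.
    if (!/^[A-Za-z0-9_-]+$/.test(app)) {
      throw new Error("invalid app label");
    }
    log("customer log search", { app, user: userHash }); // log the hash only
    return queryLoki(`{app="${app}"} |= "${userHash}"`);
  };
}
```

Passing the hash function in as a dependency also makes it straightforward to check in a test that the write side and the search side really do share one implementation.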

Integration Surface — "Humans = Web, AI = MCP" on the Same Backend

With three observability shapes built and PII handled, the next question is who queries them, and how. The common trap is to build "human dashboard aggregations" and "AI data feeds" separately. The moment you do:

cortex's choice: share one observability backend; only the consumer-facing interface differs.

Figure: the same observability backend (Prometheus / BQ / Loki), with humans going through the web dashboard and AI through MCP

Human side: AI Operations Portal

There's an internal portal (codenamed PI Lab) that aggregates dashboards by monitoring target:

Here's what the MCP usage dashboard actually looks like:

Figure: MCP tool usage dashboard (call count per server / tool, plus average execution time)

Over the past 30 days, service-product-graph had 37,946 calls (with 7,106 errors), gws had 19,350, and db-graph had 17,297, and that's just the top of the list. Which MCP is used how much and where the failures are showing up are both visible at a daily glance. (The "high error rate" some servers appear to have partly reflects typed errors being counted, meaning expected rejections like "permission denied," so the numbers need careful interpretation.) The "annotation graph MCP, ~50,000 calls / 73 users" figure from the previous series came from this same view.
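One way to make that interpretation caveat explicit is to report two rates side by side: the raw error rate and the rate with expected typed rejections removed. The sketch below assumes the backend can count expected rejections separately; the expectedRejections field and the function name are hypothetical.

```typescript
interface ToolCallStats {
  calls: number;
  errors: number;
  // Hypothetical field: errors that are expected typed rejections,
  // such as "permission denied".
  expectedRejections: number;
}

// Returns both the raw error rate and the rate of unexpected failures only.
function errorRates(stats: ToolCallStats): { raw: number; unexpected: number } {
  if (stats.calls === 0) return { raw: 0, unexpected: 0 };
  const unexpectedErrors = Math.max(0, stats.errors - stats.expectedRejections);
  return {
    raw: stats.errors / stats.calls,
    unexpected: unexpectedErrors / stats.calls,
  };
}
```

For service-product-graph, the raw rate from the figures above is 7,106 / 37,946, or roughly 18.7%. The split between expected rejections and unexpected failures isn't given here, so the second number can't be computed from this post alone.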

These pages on the React side pull from BQ / Prometheus / Loki through an internal API. The aggregation logic lives at the API layer.

AI side: MCP

When AI agents need the same data, they go through purpose-specific MCPs:

The design pivot: the human dashboard and the AI MCP share the same backend. No separate "AI aggregation table" and "human aggregation table." Build the observability backend once, then provide a consumer-specific interface layer (web dashboard / MCP) on top.

In DDD terms, MCP and the web dashboard are both just presentation layers: different I/O channels into the same domain (the observability backend). Treating MCP as "something special" leads to duplicate implementations; treating it as one more presentation layer keeps the design clean.
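As a rough sketch of that layering, the code below puts one aggregation function under two thin presentation functions. All names here (ObservabilityBackend, usageSummary, forWebDashboard, forMcpTool) are hypothetical, and the MCP side is only shaped loosely like a text tool result; it is not meant to mirror cortex's actual code.

```typescript
interface ToolUsage {
  server: string;
  calls: number;
  errors: number;
}

// The shared "domain": one query surface over the observability backend.
interface ObservabilityBackend {
  toolUsage(days: number): ToolUsage[];
}

// Aggregation lives in one place, below both presentation layers.
function usageSummary(backend: ObservabilityBackend, days: number) {
  return backend.toolUsage(days).map((u) => ({
    ...u,
    errorRate: u.calls === 0 ? 0 : u.errors / u.calls,
  }));
}

// Presentation layer 1: JSON rows for a web dashboard.
function forWebDashboard(backend: ObservabilityBackend, days: number) {
  return { rows: usageSummary(backend, days) };
}

// Presentation layer 2: a text payload for an MCP-style tool result.
function forMcpTool(backend: ObservabilityBackend, days: number) {
  const lines = usageSummary(backend, days).map(
    (u) =>
      `${u.server}: ${u.calls} calls, ${u.errors} errors (${(u.errorRate * 100).toFixed(1)}%)`,
  );
  return { content: [{ type: "text" as const, text: lines.join("\n") }] };
}
```

Because both functions call usageSummary, a change to how the error rate is computed shows up for humans and for AI at the same time, which is the point of not keeping separate aggregation tables.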

That's exactly why "the observability stack is visible to AI" actually holds. You can build the backend, but without an AI-facing presentation layer (= MCP), AI can't query it. MCP is the piece that makes "hand it to AI" actually work.

The Real Driver of Self-Healing

The layer that keeps the observability stack from being "just a screen to look at" is Self-Healing. I covered the full picture in AI Harness Series Part 4, so I'll skip the details here, but from the observability side, the start and end of the chain are clear:

Figure: Self-Healing chain from CI failure / production alert to PR proposal

The flow:

  1. Detect — Production alert / CI failure fires a Loki LogQL alert
  2. Deliver — POST to event-relay (the internal webhook hub)
  3. Launch — auto-review bot starts up (= an agent backed by Claude Code)
  4. Gather context — The bot pulls full logs via Grafana MCP, traces related PR / commit / code via Product Graph MCP
  5. Propose — File a fix PR
  6. Verify — If CI passes, the bot auto-merges; if not, another bot reviews
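Steps 4 through 6 of the flow above can be sketched as a small orchestration function. This is an illustration only: the interface names, the dependency methods, and the string PR identifier are all hypothetical, and steps 1 through 3 (detect, deliver, launch) are assumed to have already produced the alert object.

```typescript
type AlertSource = "ci-failure" | "production-alert";

interface Alert {
  source: AlertSource;
  summary: string;
}

interface FixContext {
  alert: Alert;
  logs: string;
  relatedCode: string[];
}

// Injected dependencies; in the flow above these would be backed by MCPs and CI.
interface SelfHealingDeps {
  fetchLogs(alert: Alert): Promise<string>;          // step 4: full logs
  traceRelatedCode(alert: Alert): Promise<string[]>; // step 4: PR / commit / code
  fileFixPr(context: FixContext): Promise<string>;   // step 5: returns a PR id
  ciPasses(pr: string): Promise<boolean>;            // step 6
  merge(pr: string): Promise<void>;
  requestReview(pr: string): Promise<void>;
}

type Outcome = { pr: string; result: "merged" | "sent-to-review" };

async function runSelfHealing(alert: Alert, deps: SelfHealingDeps): Promise<Outcome> {
  // Gather context: logs and related code are independent, so fetch in parallel.
  const [logs, relatedCode] = await Promise.all([
    deps.fetchLogs(alert),
    deps.traceRelatedCode(alert),
  ]);
  // Propose: file a fix PR from the gathered context.
  const pr = await deps.fileFixPr({ alert, logs, relatedCode });
  // Verify: auto-merge on green CI, otherwise hand off to another reviewer.
  if (await deps.ciPasses(pr)) {
    await deps.merge(pr);
    return { pr, result: "merged" };
  }
  await deps.requestReview(pr);
  return { pr, result: "sent-to-review" };
}
```

Written this way, it is easy to see where the chain depends on observability: if fetchLogs or traceRelatedCode comes back empty, fileFixPr has nothing to work from.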

So the starting point of Self-Healing is whether the observability stack can hand "what broke" to AI in the right shape. If errors aren't recognized, stacktraces aren't preserved, or related code (PR / commit / graph) isn't reachable, the chain stops cold; any one of those missing is enough. (The specific failure modes are in the next section.) Put another way:

The quality of observability is the ceiling for AI autonomous operation.

That's the central claim of Part 2. Reframe the observability stack as "input that drives AI," not "monitoring infrastructure," and the priorities of your design decisions shift accordingly.

What's Still Open — Defining "What Counts as an Error" and the Stacktrace Design

The biggest remaining issue, honest version.

You can polish the observability stack to a mirror finish, but if the design of what counts as an error and whether the stacktrace survives falls apart, all of it is wasted. I touched on this earlier in AI Harness Series Part 2 in the context of cortex's internal knowledge graph, and it shows up on the observability side too.

Concretely, here are the failure modes:

These are all problems at the code that creates the observability entry point, not at the observability stack itself. No matter how polished the stack is, if the faucet at the entry point is broken, nothing flows out.
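To illustrate what "the faucet at the entry point" can look like in code, here is a minimal sketch of a helper that turns a caught value into log fields without dropping the stacktrace. The function name and field names are hypothetical; the point is only that the stack is captured at the catch site rather than reduced to a message string.

```typescript
interface ErrorLogFields {
  level: "error";
  message: string;
  errorName: string;
  stack: string | null;
}

// Converts whatever was thrown into structured fields for the logger.
// Non-Error throws (strings, objects) are still recorded as errors, but
// they have no stack to preserve, which is itself worth making visible.
function toErrorLogFields(context: string, err: unknown): ErrorLogFields {
  if (err instanceof Error) {
    return {
      level: "error",
      message: `${context}: ${err.message}`,
      errorName: err.name,
      stack: err.stack ?? null,
    };
  }
  return {
    level: "error",
    message: `${context}: ${String(err)}`,
    errorName: "NonErrorThrown",
    stack: null,
  };
}
```

A catch block that logs only err.message at info level still "handles" the error from the application's point of view, but it hands the observability stack nothing an agent can act on.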

What's in place today is three layers, none of them complete:

In other words: "There's a guideline, lint catches some, AI review catches some, but it's not airtight" is the honest description. The real gap is that at the moment new code is being written, there isn't a harness that proactively suggests / completes "this should be treated as an error, this should keep its stacktrace." Auto-review picks things up at PR time, but a proactive harness for the observability entry-point design itself isn't built yet.

"Observability stack: done. Observability target design: still on humans." That's the honest picture. Closing that gap with a harness is the next step.

Closing — Static Edition + Dynamic Edition Are Lined Up; Merging Them Is the Next Series

The code-graph series was about reshaping a static analysis graph so AI could query it — handing the structure of code as fact. This two-part series was about handing what's happening in production right now, also as fact.

| Edition | Shape | What's Handed Over |
| --- | --- | --- |
| Static edition (code-graph + db-graph + annotation graph) | 3-graph parallel + SAME_ENTITY | Code and meaning |
| Dynamic edition (Part 1 + this post) | Prometheus / BQ / Loki + MCP | Production behavior and cost |

The honest part: these two still sit side by side, not joined. For cortex's stated principle of "don't let AI infer — hand it facts" to truly reach completion, the next step is to pour dynamic data into the static graph and merge them. This is the exact same gap I flagged as the "absence of dynamic analysis" residual at the end of code-graph Part 2: putting "how often is this edge actually used in production?" on the static graph's nodes. That's when "hand it as fact" reaches its final form.

Layer Self-Healing on top of static + dynamic and you get "AI autonomously operates," which works today. But merging the two editions into one graph is still ahead — that's the next series.

And one more time: observability target design (what counts as an error, whether the stacktrace survives) is what really sets the ceiling. Turning that into a harness is the next homework item.

Thanks for reading this far.
