Reliable email labelling

Jun 9, 2026

Email looks simple from a distance.

A message arrives. Alfred reads it. A label appears.

But the closer we got to making that feel automatic, the less it looked like a classification problem. It became a trust problem.

An inbox is full of almosts. A calendar invite that is really a webinar. A GitHub review comment that sounds urgent, but only applies to code that has not shipped. A security alert that looks like a robot until it says a secret was exposed. A shareholder notice with a deadline, but no real action for you. A thread that used to be follow_up, then ended with a done reply.

The hard part was not naming ten labels. It was making the right label appear quickly, quietly, and consistently enough that you stop thinking about the system. What follows is the order in which that broke, and how each break got fixed.

The first hurdle: speed was not the whole story

At one point, tagging felt slow. The obvious guess was the model.

It was not the model. It was almost never the model.

Every "tagging is slow" report turned out to be plumbing. A newly connected Gmail account did not always have a fresh users.watchinstalled, so new mail waited for the five-minute poll. Gmail's history API is eventually consistent with its own Pub/Sub, so a webhook could fire with nothing to fetch. A single Google credential flipping to needs_reauth silently stalled all tagging. The classifier itself was sub-second the whole time.

The most embarrassing version of this: for one stretch, production had not run an email-triage workflow in days. The prod trigger row was frozen on an old event shape and matched nothing. Tagging still appeared to work, because a dev server left running on my laptop was quietly triaging the same real mailbox over the five-minute poll. Labels live on Google's side, so the laptop's work showed up in the real inbox. The product looked alive. It was on life support.

If a personal assistant is late, it does not matter how smart it is. The first promise is presence. Most of the time we were not fixing intelligence. We were fixing delivery.

The fix was infrastructural, not magical: install the watch at connect time, re-seed the builtin workflows for existing users at boot so prod stops drifting, keep polling as a real fallback, and route real-time and catch-up through the same persistence helper. One path for dedup. One path for triage. One path for label writes.

The second hurdle: Gmail threads do not behave like rows

Gmail has a quiet trap: thread view unions labels across every message in a thread.

If the first email in a thread was fyi, the second was follow_up, and the final reply closed the loop, Gmail could show all three labels at once. Technically true. Product-wise, unreadable. We saw threads carrying action_needed / action_needed / urgent stacked together.

The first mitigation was "latest message wins": strip the siblings, label the newest. It read clean and it had a race. The find-siblings query only saw labels that were already written, so two triage runs in the same thread could each look, find nothing to strip, and both apply. The lesson stuck: when a find-siblings query depends on data a parallel write has not committed yet, you have a race, not a feature.

So the model of the data changed. The triage table moved to one row per thread, keyed on (user_id, source_thread_id) instead of one row per message. When the same multi-tag bug resurfaced on production months later, the last gap closed: a Postgres advisory lock around the read-apply-strip sequence, plus a recency guard so a run for an older message cannot clobber a newer classification.

The honest framing is that the invariant — at most one Alfred label per thread — lives in Gmail, not in our database. Postgres can serialize the external writes. It cannot declare them true. So the label writer treats Gmail as the source of truth for thread shape: ask for the siblings, strip every old Alfred label, apply the new one to the latest message. The user sees one tag. The system does the fussy work underneath, and self-heals when an old deployment left a stray label behind.

The third hurdle: cheap models need better eyes, not a bigger brother

The cheap classifier read things too literally. A GitHub secret-scanning alert — a Redis URI exposed in a public repo — came through as fyi, because the prompt had filed "alerts" under passive awareness. Later, self-initiated sign-in and magic-link emails came through as urgent, because the prompt literally listed "sign-in verification" under urgent. Both were the model doing exactly what it was told.

The first instinct was to escalate. When the cheap model is uncertain, call the boss model, give it user context, let it think harder. That worked as a design and made the hot path heavier than it needed to be. Email tagging happens all day. It should not summon the most capable model every time a sender looks unfamiliar.

The next instinct was the opposite: cache a confident decision per sender and skip the model entirely on the easy hits. That got killed in review, and the reason is the whole thesis of triage v3:

Never skip the model. A sender who is 99% newsletter can still send the one email that matters. Cheap models are not smart, but smart models are slow and expensive and unsustainable per-email at inbox volume. The smartness we actually need only shows up in the edge cases.

So v3 kept the cheap model on every single email and made it smart by feeding it deterministic observations before it reads the prose:

what kind of sender this is — service envelope, bot, or human
what this sender usually becomes, as a fed histogram, never a bypass
whether the account is work or personal
whether the user already replied in this thread
whether Gmail marked the message important, promotional, or sent
whether the content carries a high-precision signal like exposed credentials

The model still reads the email and still makes the judgment. It just no longer has to infer the entire world from raw prose. Fast on every message, smarter without becoming heavy.

The fourth hurdle: rules can become clutter

Every inbox creates tempting patches. "This sender is usually a newsletter." "This bot is usually harmless." "This phrase often means urgency." Left unchecked, those patches become a wall of exceptions that pass today's example and fail tomorrow's neighbor.

Alfred uses a narrower pattern: observations, one cheap classification, one second cheap pass only on a hard deterministic conflict, and a tiny override floor for things we want to catch with high precision.

The override floor is where restraint had to be enforced against myself. The obvious move was "any security keyword forces urgent." That would have re-created the Dependabot-noise problem and re-tagged every CVE FYI as an emergency. So the floor was narrowed to a single, near-unambiguous signal: a secret or key paired with an exposure verb — exposed, leaked, committed, compromised. A self-initiated "sign in" or "your code is 123456" contains none of those verbs, so it never trips the floor. That was the exact bug that opened v3, now fenced off by construction.

Bad tags are tuned from logs, not vibes. The triage.sender_extractionevent records the sender context, the observations, both passes, the override flags, and the todo decision — enough to debug a mistake without turning the user's inbox into a raw log dump.

The fifth hurdle: a fix in triage is not a fix everywhere

The manufactured-urgency problem was supposed to be closed. Then an evening briefing led with this, as the first thing to check tonight:

The Visitors.now grace period ended today — if you haven't upgraded yet, tracking is now permanently disabled and incoming events are being dropped, so that's the first thing to check tonight.

We had already fixed this on the triage side. A free-tier upgrade ask — "upgrade your plan," "trial ending," "tracking ends soon," "upgrade before you lose the free tier" — is the vendor's conversion lever, not a bill and not a real consequence. Triage now reads it as marketing or fyi, never urgent or payment, and a deadline in that kind of email no longer earns immunity from the manufactured-stake test.

But the briefing does not just render triage tags. It has its own ranking, and its top rank is reserved for "something wrong or at risk — an incident, a fired alarm, a security signal." That ranker read the alarming wording as an incident and led with it, independent of triage — and here reinforced by a stale action_needed tag still sitting on the older threads in that conversation. The lie we had taught triage to ignore walked in through a different door and got believed by a different agent.

A correction lives in one component. The judgment it corrects can be re-derived in another. Every agent that ranks or re-scores has to carry the same discriminator, or the fix only looks complete.

So the same owed-versus-conversion discriminator was ported into the briefing's top rank, written as a principle rather than a Visitors.now-specific patch. A real incident carries a real stake: production is down, a credential is exposed, a bill is actually owed, a person is blocked. Vendor conversion pressure dressed as an incident is none of those, however alarming the wording. And because the briefing reads triage labels that can lag, the rule carries an explicit robustness clause — this holds even if the triage label looks urgent; the label can lag, so judge the content — which means the briefing gets it right without anyone re-triaging the stale threads first.

What it became

Alfred now classifies email across ten labels: urgent, action_needed, follow_up, awaiting_reply, meeting, fyi, done, payment, newsletter, and marketing.

In the dev database, all ten are in use across 307 thread-level triage rows. A taxonomy is not real until the product has to live with it.

What the system really earned was a memory of how sorting goes wrong:

delivery can be mistaken for intelligence
a Gmail thread is not one message, and the invariant lives in Gmail
a bot is not always noise
urgency can be manufactured, and so can a literal-minded prompt
cheap models get better when code hands them the right facts
corrections should come from observed misses, not one-off rules
a fix in triage is not a fix in the briefing — every agent that ranks has to carry the same judgment

The result should feel almost boring. The email arrives. The label is there. The thread has one current state. And you move on. That is the product.