Practical work suggestions from integration activity

Jun 9, 2026

A suggested todo is a small thing.

One line. A checkbox. Maybe a sentence of help.

But if the line is wrong, it is worse than silence. A bad todo does not just misread an email. It creates work.

That was the central problem with Alfred's todo suggestions: not how to find more of them, but how to deserve fewer. The path there was a string of specific, slightly embarrassing mistakes.

First it suggested nothing

The first version of the feature was, on paper, complete. There was a suggest_todo tool. It was wired correctly and force-included in every agent run. The table stayed empty. Zero rows, for everyone.

It was a pure prompt gap. Nothing in any system prompt told an agent the tool existed, what it was for, or when to call it. A capability the model is never told about does not exist. That is its own small lesson about agents: the wiring is necessary and never sufficient.

Then it suggested the wrong things

The first real suggestion on production was "Log in to Claude.ai," minted from an Anthropic sign-in email. The funny part is that the classifier had reasoned correctly: it tagged the email action_needed with the note that this was a self-initiated login, not urgent. The todo got created anyway, because the rule at the time was roughly "if the category is important, propose a todo."

Then a steady parade. "Respond to Deepak's LinkedIn connection request." A chess.com email warning that a 100-day streak would reset before midnight. "You have 7 unread on Linear." "5 people viewed your profile." Each one was easy to patch with a new clause. Patching each one was the actual problem.

Stop example-patching. The original sin was folding the todo decision into the email category. Todo-worthiness is orthogonal to what kind of email arrived. It needed to be its own judgment, made by a cheap model that has good context and reasons like a person, not by a frontier model and not by a growing list of banned phrases.

A tag is not a task

Email triage and todos look adjacent. They are not the same decision.

An email can be done and still contain a trailing ask. An email can be fyi and still carry a real obligation, like an auto-renewal you have to cancel. An urgent email can be urgent for someone else. A meeting email can be ceremonial. The label says what kind of email arrived. The todo asks a different question: is there a real commitment the user should track?

The five gates

The rubric that replaced the patches is deliberately small and ordered.

First: is the obligation on me? The email has to ask the user to do something. Not a teammate named in the body. Not a reviewer. This gate exists because of a specific bug: Alfred created "Run the engineering standup" as the user's todo, even though the email said Sakshi was running it because someone was out. The model had the ownership right there in the text and assigned the task to the wrong person. So it now gets minimal identity context (the user's name and the account being triaged) purely so it can tell "you should do this" from "@alice please review the PR."

Second: is there a real external stake? Someone is waiting. Money is owed or at risk. Access could be lost. There is a hard deadline. A commitment was made to a human. Manufactured urgency does not count: a chess streak resetting is not a real consequence, however the email phrases it. Neither are unread counts, "people viewed your profile," marketing scarcity, or ceremonial notices like an AGM or a "save the date."

Third: would the user forget it? A login code you just requested is already in motion. A mid-flow confirmation will self-resolve. Nothing to remember means no todo.

Fourth: is it actionable from the email alone? "Thoughts?" is too vague. "Send the signed SOW by Friday" is not.

Fifth: is it already handled? If the user already replied or the loop is closed, Alfred should not reopen it as a reminder.

Only when all five gates pass does Alfred propose a todo.
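The ordering of the gates can be sketched as a short-circuiting function. This is only an illustration: in the post the judgments come from the classifier model, not from hand-written booleans, and the GateAnswers shape, the decideTodo name, and the pairing of each gate with an outcome name are assumptions made for the example.

```typescript
// Outcome names follow the post's decision field; everything else is hypothetical.
type TodoDecision =
  | "proposed"
  | "no_obligation"
  | "not_significant"
  | "would_not_forget"
  | "too_vague"
  | "already_handled";

// Hypothetical per-email answers to the five questions, in gate order.
interface GateAnswers {
  obligationOnUser: boolean;    // gate 1: is the obligation on me?
  realExternalStake: boolean;   // gate 2: is there a real external stake?
  userWouldForget: boolean;     // gate 3: would the user forget it?
  actionableFromEmail: boolean; // gate 4: is it actionable from the email alone?
  alreadyHandled: boolean;      // gate 5: is it already handled?
}

// The first failing gate decides the outcome; later gates are never consulted.
function decideTodo(a: GateAnswers): TodoDecision {
  if (!a.obligationOnUser) return "no_obligation";
  if (!a.realExternalStake) return "not_significant";
  if (!a.userWouldForget) return "would_not_forget";
  if (!a.actionableFromEmail) return "too_vague";
  if (a.alreadyHandled) return "already_handled";
  return "proposed";
}

// Illustrative inputs modelled on the post's examples.
const signedSow: GateAnswers = {
  obligationOnUser: true,
  realExternalStake: true,
  userWouldForget: true,
  actionableFromEmail: true,
  alreadyHandled: false,
};
const chessStreak: GateAnswers = { ...signedSow, realExternalStake: false };

console.log(decideTodo(signedSow));   // "proposed"
console.log(decideTodo(chessStreak)); // "not_significant"
```

Because the gates are ordered, an email that fails several of them is recorded under the earliest failure, which keeps each rejection attributable to exactly one gate.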

The real mitigation: trace the no

The important field is not only todoSuggestion. It is todoDecision.

Every classification emits an outcome: proposed, no_obligation, not_significant, would_not_forget, too_vague, or already_handled. That turns "this one was wrong" into "gate two failed on this class of email," which is the difference between tuning a system and adding examples until the model stops annoying you. The rubric stays stable. The logs show where the boundary is wrong.
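One way to read those logs is to tally outcomes per class of email, so a misplaced boundary shows up as a cluster and not as a single anecdote. The sketch below assumes a hypothetical DecisionRecord shape with an emailClass label; the post does not say how Alfred groups or stores its traces, and the sample records are made up.

```typescript
// Hypothetical log record: one per classified email.
interface DecisionRecord {
  emailClass: string;   // assumed grouping label, e.g. "streak_reminder"
  todoDecision: string; // one of the outcomes listed above
}

// Count outcomes per email class: tally[emailClass][todoDecision] = count.
function tallyDecisions(
  records: DecisionRecord[],
): Record<string, Record<string, number>> {
  const tally: Record<string, Record<string, number>> = {};
  for (const r of records) {
    const byClass = tally[r.emailClass] ?? {};
    byClass[r.todoDecision] = (byClass[r.todoDecision] ?? 0) + 1;
    tally[r.emailClass] = byClass;
  }
  return tally;
}

// Made-up records, for illustration only.
const sampleLog: DecisionRecord[] = [
  { emailClass: "streak_reminder", todoDecision: "not_significant" },
  { emailClass: "streak_reminder", todoDecision: "not_significant" },
  { emailClass: "pr_review_bot", todoDecision: "proposed" },
  { emailClass: "pr_review_bot", todoDecision: "no_obligation" },
];

console.log(tallyDecisions(sampleLog));
```

A class that splits between proposed and a rejection outcome, like the made-up pr_review_bot rows here, is the kind of boundary worth inspecting.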

Suggestions happen in real time

Todos are suggested from the email-triage run, not from the daily briefing. That was also a correction. The briefing was the original producer, and a new user once accrued eleven urgent and action-needed threads in six hours and got zero todos, because the briefing cron had not fired yet. The briefing is a render of open loops. It should not be the thing that creates durable tasks.

The triage run is closer to the event. The selectivity rides inside the classify call that already happens, so there is no extra model cost. If the classifier emits a valid suggestion, the workflow tail calls system.suggest_todo.

No human approval is needed because a suggestion has no external side effect. It is not sending mail. It is not changing a calendar. It creates a passive row with provenance. That is the right level of autonomy.
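The workflow tail described above might look roughly like this. The Classification shape, the validity check, and the runWorkflowTail name are assumptions; the injected suggestTodo callback is a synchronous stand-in for system.suggest_todo, whose real signature the post does not give.

```typescript
// Hypothetical shapes; only the field names todoDecision and todoSuggestion come from the post.
interface TodoSuggestion {
  title: string;
}

interface Classification {
  category: string;
  todoDecision: string;
  todoSuggestion?: TodoSuggestion;
}

// Stand-in for system.suggest_todo, injected so the sketch stays self-contained.
type SuggestTodo = (suggestion: TodoSuggestion) => void;

// Runs after the classify call. Returns true when a suggestion was recorded.
// No approval step: the only effect is a passive suggestion row.
function runWorkflowTail(c: Classification, suggestTodo: SuggestTodo): boolean {
  if (c.todoDecision !== "proposed") return false;
  const suggestion = c.todoSuggestion;
  // Assumed meaning of "valid": a suggestion is present and has a non-blank title.
  if (suggestion === undefined || suggestion.title.trim().length === 0) {
    return false;
  }
  suggestTodo(suggestion);
  return true;
}

const createdTitles: string[] = [];
runWorkflowTail(
  {
    category: "action_needed",
    todoDecision: "proposed",
    todoSuggestion: { title: "Send the signed SOW to Priya by Friday" },
  },
  (s) => { createdTitles.push(s.title); },
);
console.log(createdTitles.length); // 1
```

The decision travels with the classification result, so the tail adds no second model call, which matches the cost argument above.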

Duplicates are a trust leak

The same commitment can arrive through multiple channels, or the same thread can be re-triaged after a reply. Without care, Alfred would create a second checkbox for the same loop.

So suggestions carry sources: provider, kind, id, and optional URL. If an open or suggested todo already references the incoming source, Alfred merges the refs instead of creating a duplicate. This is not full semantic dedup — Alfred does not yet know a Slack thread and a Gmail thread are the same obligation unless their sources overlap. But it gets the structural case right, which is the case it can prove.
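A minimal sketch of that structural merge, assuming an in-memory list of todos. The four source fields come from the text above; the Todo shape, the "done" status, and the suggestOrMerge name are hypothetical.

```typescript
interface SourceRef {
  provider: string;
  kind: string;
  id: string;
  url?: string;
}

interface Todo {
  title: string;
  status: "suggested" | "open" | "done"; // "done" is an assumed third state
  sources: SourceRef[];
}

// Two refs point at the same source when provider, kind, and id all match.
function sourceKey(s: SourceRef): string {
  return JSON.stringify([s.provider, s.kind, s.id]);
}

// Merge into an existing open or suggested todo when any source overlaps;
// otherwise create a new suggested todo. Mutates the todos array.
function suggestOrMerge(todos: Todo[], title: string, sources: SourceRef[]): Todo {
  const incoming = sources.map(sourceKey);
  for (const todo of todos) {
    if (todo.status === "done") continue;
    const known = todo.sources.map(sourceKey);
    const overlaps = incoming.some((k) => known.indexOf(k) !== -1);
    if (!overlaps) continue;
    for (const s of sources) {
      const key = sourceKey(s);
      if (known.indexOf(key) === -1) {
        todo.sources.push(s); // add only the refs the todo does not have yet
        known.push(key);
      }
    }
    return todo;
  }
  const created: Todo = { title, status: "suggested", sources: sources.slice() };
  todos.push(created);
  return created;
}
```

The match is purely structural: a Slack thread and a Gmail thread about the same obligation still produce two todos unless one suggestion carries both refs, which is the limitation the paragraph above names.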

What the data says, and what it does not

An audit of the agent-authored todos on production found about 38% of them were noise. The June stringency pass — the real-stake bar, the ownership gate, the manufactured-urgency kills — was validated in dry-run first, reclassifying historical todos to see what the new rubric would keep or kill before it touched anything. The first commit run was a near-miss: 103 of 104 runs skipped, because the same already-tagged guard that protects threads also blocked re-processing. With a force flag, the pass cut the agent-authored todos from 44 to 23.
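The interaction between the already-tagged guard and the force flag can be sketched as follows. All names here are hypothetical, and the sketch applies the guard in both dry-run and commit modes, which may not match how the real dry run behaved.

```typescript
// Hypothetical record of one historical run with an agent-authored todo.
interface HistoricalRun {
  id: string;
  alreadyTagged: boolean;   // the guard that normally prevents re-processing
  passesNewRubric: boolean; // stand-in for re-running the classifier
}

interface PassOptions {
  dryRun: boolean; // report only; nothing would be written
  force: boolean;  // bypass the already-tagged guard
}

interface PassReport {
  skipped: string[];
  kept: string[];
  killed: string[];
  committed: boolean;
}

function stringencyPass(runs: HistoricalRun[], opts: PassOptions): PassReport {
  const report: PassReport = {
    skipped: [],
    kept: [],
    killed: [],
    committed: !opts.dryRun,
  };
  for (const run of runs) {
    // Without force, the guard skips everything that was tagged before.
    if (run.alreadyTagged && !opts.force) {
      report.skipped.push(run.id);
      continue;
    }
    if (run.passesNewRubric) {
      report.kept.push(run.id);
    } else {
      report.killed.push(run.id);
    }
  }
  return report;
}
```

Returning the skipped list alongside kept and killed makes a near-miss like the one above visible in the report itself, not only in the resulting row counts.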

The dry run was crisp where it mattered. Chess and the LinkedIn nudges and the Linear unread counts all died on the manufactured-stake gate. The Sakshi standup and an "@alice please review" both died on the ownership gate, while "Send the signed SOW to Priya by Friday" survived. That is the shape you want.

It is also honest about its limits. Pre-merge code review is still leaky: the model reads a concrete suggestion from a bot as a real obligation, so some "address the comments on PR #N" todos survive a gate that should kill them. And a production alarm to a team alias the user merely sits on still tags urgent: intrinsic significance cannot tell that the user is not the on-call. Those need role-aware context, not another banned phrase. Naming the gap is part of the restraint.

The best todo suggestion is not loud. It is specific. It names the real verb. It names the object. It carries the source. And it arrives only when there is something worth carrying forward. The goal was never a second inbox. It was for Alfred to remember what matters.