
Launch Loop — Orchestration Contract

Source: docs/launch-loop.md

MANDATORY READING for every agent (implementer, evaluator, designer): /Users/sebastian/dev/deepspace/docs/design-guidelines.md

That document is the authoritative aesthetic contract: what the site looks like, what's forbidden, how the typography scale works, where cards are allowed, the CTA budget, the subtraction principle. Every fix dispatched must trace to a rule in either design-guidelines.md or this file. If a brief seems to contradict design-guidelines.md, the guidelines win.


This document defines how Claude runs the autonomous UI/UX improvement loop for the TracePlot site. It is the single source of truth for the rubric, the roles, the workflow, and the stop conditions. If you are an agent reading this: your task definition in the orchestrator prompt is authoritative for what to do, but how you report must follow the contract here so the orchestrator can fold your output into the rubric without reinterpretation.


1. Orchestration principles

  1. Claude orchestrates, agents execute. The orchestrator does not read source files directly, does not run the dev server or build scripts directly, does not edit code directly, does not run the browser or score pages directly. Every concrete action is delegated to a subagent with a self-contained prompt. The orchestrator's job is to decide what to do next, brief the right agent, and merge the result into the rubric.
  2. Evaluation is continuous, not one-shot. Every iteration begins with a fresh evaluation pass. The rubric is never assumed stable between iterations — implementing one fix often moves another score, and regressions must surface the same turn they appear.
  3. Full surface, not just the landing page. Every page the site ships — landing, /start, /reserved, /methodology, /docs, every /docs/[slug], /legal/privacy, /legal/terms, /legal/imprint — is scored. Launch readiness is a property of the whole site, not one hero.
  4. Every fix must be justified by an evaluator finding. The orchestrator never dispatches a fix it cannot point to a rubric finding for. Taste-based changes that aren't traceable to a scored dimension are out of scope.
  5. Images carry no text. All textual UI — labels, badges, headlines, sidebars, pins, controls — is rendered in HTML/JSX as an overlay. Images are the photographic or illustrative layer only.
  6. The dev server is assumed to be running on http://localhost:3000 and hot-reloads on file writes. Agents do not start, stop, or restart it.
  7. Stop when the rubric is satisfied AND no obvious wins remain. Do not pad the loop. Do not invent issues to keep going.

2. The rubric

Ten dimensions, each scored 0–5. The target is every dimension ≥ 4, at every breakpoint, on every page. Stretch goal is ≥ 4.5 average.

| # | Dimension | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| 1 | Typography & hierarchy | broken, unreadable | readable, no hierarchy | some hierarchy, inconsistent scale | clear hierarchy, consistent scale | refined pairing, rhythm, tracking, line-height intentional | editorial-grade — typography itself is part of the brand |
| 2 | Layout & spacing | broken layout | content visible but misaligned | aligned, inconsistent spacing | consistent spacing scale, no dead zones | intentional whitespace, visual balance, asymmetry used purposefully | nothing to add or remove |
| 3 | Responsive (375 / 768 / 1024 / 1440) | breaks on mobile | works but looks bad on one size | works on all sizes, one feels like an afterthought | all sizes work well | each breakpoint feels native, not squished | flawless across the entire range |
| 4 | Header / nav / footer chrome | broken | aligned, unpolished | symmetric, plain | deliberate styling, consistent | hover/focus states, subtle details | memorable without being distracting |
| 5 | Hero & section framing | missing | text + image, awkward | clear headline, decent visual | headline + visual + CTA in conversation | all three pull together, no wasted space | can't imagine it better |
| 6 | Imagery (simple, text-free, HTML overlay on top) | none or stock | generic | on-brand but flat | intentional, matches palette, no baked-in text | distinctive, supports narrative | editorial-grade brand work |
| 7 | Component polish (buttons, cards, pills, badges, tables, forms, accordions) | default browser | styled but inconsistent | consistent styling | consistent + refined (shadows, radii, hover) | micro-details (focus rings, transitions, active states) | design-system feel |
| 8 | Copy integrity (polish only, no meaning rewrites) | broken, typos | clear but bland | clear and on-brand | scannable, each section earns its space | voice consistent, CTAs compelling | every word pulls weight |
| 9 | Accessibility (contrast, focus, semantics, alt, keyboard, ARIA) | broken | several issues | roughly AA | WCAG AA clean | AA+ with keyboard and screen-reader care | AAA where reasonable |
| 10 | Technical polish (console clean, no CLS, images optimized, meta tags, OG image, favicon) | broken | works | clean console, loads fast | images optimized, LCP good | LCP < 2s, CLS < 0.1, no layout shifts, full meta | Lighthouse 95+ across the board |

Dimension-specific anti-patterns (automatic score caps)

Dim 2 — Card-everywhere anti-pattern. A region wrapped in border + bg-surface + rounded-[22px] (or any equivalent "white box on the page background" treatment) must earn its container role. A box is only justified if the content inside is a genuine container:

  • A pricing plan (one of several compared offers) → card OK.
  • A comparison row stacked on mobile → card OK.
  • A diagnostic "Is this you?" sub-question → card OK.
  • Everything else — page heroes, legal/docs hero headers, pill-row groups, TOC asides, article bodies, section titles, publisher info, contact info, "what happens next" lists — does NOT earn a card. Use flat type on the page background + whitespace + optional thin rules for separation.

If a page has more than 2 unrelated card layers stacked on top of each other (excluding legitimate container cards above), dim 2 is capped at 2 for that page until the nesting is removed. Editorial design pattern: page background does the work; type and whitespace create hierarchy; boxes are the exception, not the rule.
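A minimal sketch of how the orchestrator might encode this cap when merging evaluator scores. The score-record shape and function name are illustrative assumptions, not part of the contract:

```typescript
// Hypothetical score record: dimension number -> score (0-5).
type PageScores = Record<number, number>;

// Apply the dim-2 card-nesting cap: more than 2 unrelated stacked card
// layers (legitimate container cards already excluded) caps dim 2 at 2.
function applyCardCap(scores: PageScores, unrelatedCardLayers: number): PageScores {
  if (unrelatedCardLayers <= 2) return scores;
  return { ...scores, 2: Math.min(scores[2] ?? 0, 2) };
}
```

The same shape works for the dim-5 hero cap and the document-mode caps below: each anti-pattern becomes a pure function from (scores, observation) to capped scores, applied during the merge step.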

Dim 5 — Editorial hero rule. Pages whose hero is just eyebrow + h1 + subhead + (optional) small CTA (no dashboard, no form, no mock) render the hero as flat type directly on the page. No card wrapper. Applies to: /start, /reserved, /methodology, /legal/*, /docs, /docs/[slug]. Dim 5 is capped at 3 on any such page whose hero is card-wrapped.

Dim 7 — Approved library primitives. Only these libraries are in use for interactive/a11y-critical UI:

  • @radix-ui/react-accordion — FAQ accordion.
  • @radix-ui/react-dialog — mobile nav drawer, any modal.
  • (Others may be added only via user approval.)

Everything else (buttons, pills, cards-that-survive, badges, tables, form layouts) remains hand-rolled Tailwind. No shadcn/ui adoption, no Headless UI, no Aceternity. Implementers that introduce other libraries must report it and will be rejected by default.

Implementer brief contract

Every implementer dispatch must carry:

  1. The rubric finding(s) being addressed, with file:line citations from an evaluator report.
  2. A numbered list of atomic fixes — each fix is one discrete edit with an expected before/after.
  3. A "do not touch" guardrail list — files the implementer may read but not edit.
  4. Out-of-scope rule: any issue the implementer notices outside its brief is reported, not fixed.
  5. Typecheck + lint requirement: implementer runs pnpm typecheck and pnpm lint on completion and reports pass/fail.
  6. Tab hygiene: implementers that open a browser tab follow §3a.

Document mode vs Marketing mode

Every page belongs to one of two stylistic modes. The mode determines how aggressive the editorial framing can be. Mixing modes (applying marketing chrome to a document page) is an anti-pattern and caps several dimensions.

Marketing mode pages:

  • / (landing)
  • /start
  • /reserved
  • /docs (index)

Marketing mode allows: display serif headlines at large scales, small-caps eyebrow labels, hero composition with visual + CTA, crafted section openers, distinctive treatments per section, imagery, and all the usual brand vocabulary. These pages are selling — they should feel distinctive and intentional.

Document mode pages:

  • /legal/privacy
  • /legal/terms
  • /legal/imprint
  • /docs/[slug] (any)
  • /methodology

Document mode is deliberately plain. The reader is there for information, and decoration competes with content. Document mode pages must:

  1. Use a single font family for running text — body sans-serif throughout. A modest serif h1 is acceptable (at most 1.8–2rem), but h2/h3/h4 must match the body font family — no display serif at huge sizes.
  2. Have a narrow heading scale: h1 ≈ 1.75–2rem, h2 ≈ 1.15–1.3rem (bolder weight), h3 ≈ 1rem (bolder), h4 ≈ body size (italic or semibold).
  3. NOT use eyebrow pills / small-caps section labels above the title.
  4. NOT use small-caps metadata treatment for dates, authors, or "last updated" lines. Plain body text.
  5. NOT have a sidebar / aside TOC. No "On this page" navigation panel. Readers scroll.
  6. NOT have a mobile <details> TOC accordion. Same reason.
  7. NOT have a hero composition (eyebrow + display headline + muted subhead + thin rule). Just: <h1>Title</h1>, optional date line, first body paragraph.
  8. NOT wrap tables in surface cards or scroll-hint gradients. Tables get plain borders and cell padding, full stop.
  9. NOT wrap the article body in any card — just a centered single column with a comfortable max-w-[68ch] measure and sufficient vertical padding.
  10. NOT insert decorative separators between sections. Heading size and vertical rhythm do the separation.

The test: if the page looks like something you'd print and read, it's document mode. If it looks like something designed for a website, marketing-mode styling has leaked in.

A user quote that anchors this rule: "It does not look like a clean document. Full of unnecessary stuff to make it 'beautiful' but just adds noise... The 'On this page' totally unnecessary for terms, I never see that in sites. This is the type of overkill that might feel 'cool' but it's unnecessary."

Dim 1 (typography) and dim 2 (layout) are capped at 3 on any document-mode page that exhibits marketing-mode chrome — eyebrow, display-sized h1, sidebar TOC, mobile details TOC, small-caps metadata, decorated tables, card-wrapped hero. All of them together drag the score to 2.

Scoring rules

  • Each page gets its own 10-dimension score at each of four breakpoints: 375px (mobile), 768px (tablet), 1024px (small desktop), 1440px (large desktop).
  • Dimension 3 (responsive) is scored once per page — it is about the relationship between breakpoints, not a per-breakpoint property.
  • For dimensions that are global (e.g. brand consistency, component polish on shared components), regressions on one page drag the score for all pages that share the component.
  • A dimension is blocking if any breakpoint on any page scores < 4. Launch is not ready until no blockers remain.
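The scoring rules above can be sketched as a blocking check. The data shape here is an illustrative assumption (dimension 3 stored once per page, every other dimension per breakpoint), not something this contract mandates:

```typescript
type Breakpoint = 375 | 768 | 1024 | 1440;

interface PageRubric {
  path: string;
  // Dims 1, 2, 4-10 scored per breakpoint; dim 3 scored once per page.
  perBreakpoint: Record<Breakpoint, Record<number, number>>;
  responsive: number; // dimension 3
}

// A dimension is blocking if any breakpoint on any page scores < 4.
function findBlockers(pages: PageRubric[]): string[] {
  const blockers: string[] = [];
  for (const page of pages) {
    if (page.responsive < 4) {
      blockers.push(`${page.path} dim 3 (${page.responsive})`);
    }
    for (const [bp, dims] of Object.entries(page.perBreakpoint)) {
      for (const [dim, score] of Object.entries(dims)) {
        if (score < 4) {
          blockers.push(`${page.path} dim ${dim} @ ${bp}px (${score})`);
        }
      }
    }
  }
  return blockers;
}
```

Launch is ready on this axis exactly when `findBlockers` returns an empty list for all nine pages.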

What each score means in practice

  • 0 — user would bounce immediately.
  • 1 — visibly wrong, harms credibility.
  • 2 — "fine" but nothing earns its place. Amateur.
  • 3 — clean and professional, doesn't stand out.
  • 4 — launch quality. Noticeable care. This is the minimum floor.
  • 5 — best-in-class. Each dimension at 5 is a stretch, not required.

3. Pages in scope

| Path | Purpose | Notes |
|---|---|---|
| / | Landing / marketing | Primary surface. Hero, steps, comparison, pricing, FAQ, final CTA. |
| /start | Onboarding entry | Post-click destination from most CTAs. |
| /reserved | Confirmation page | Post-reservation success state. |
| /methodology | Product-depth content | Technical trust page. |
| /docs | Docs index | Lists doc entries from /docs/*.md. |
| /docs/[slug] | Individual doc | Markdown render. At least one representative slug is evaluated per round. |
| /legal/privacy | Privacy policy | Legal chrome. |
| /legal/terms | Terms | Legal chrome. |
| /legal/imprint | Imprint | Legal chrome. |

Every round evaluates all of these. Subpages are not second-class citizens.


3a. Browser tab hygiene

Agents open Chrome tabs to evaluate pages. Left unchecked, the browser fills with dozens of stale tabs over a few rounds. Hygiene rules:

  1. Every agent opens at most one tab. Navigate that single tab between URLs and breakpoints — do not create a fresh tab per page. Use mcp__claude-in-chrome__tabs_create_mcp once at the start, then reuse via navigate.
  2. Agents record their tab ID in their report. Last line of every browser-using agent's report: Tab opened by this agent: <tabId> (or none if no tab was opened).
  3. Agents close their own tab before returning whenever possible — via mcp__claude-in-chrome__tabs_close_mcp or a window.close() through javascript_tool (for tabs they created). If they can't close, they still report the tab ID so the orchestrator can.
  4. Orchestrator sweeps at round boundaries. After every round (not between agents within a round), the orchestrator calls tabs_context_mcp, compares against the tab IDs reported by agents, and closes any tab that:
    • was reported by a returned agent, or
    • points to a localhost:3000 page and is not being used by any still-running agent, or
    • is leftover from a previous round. One working tab may be kept on about:blank for the next round's first dispatch.
  5. Never close a tab while a background agent that may own it is still running. If in doubt, wait for the agent to return.
  6. Orchestrator does not share tab IDs between agents. Each agent creates its own, reports it, and the orchestrator tracks ownership in a simple map {agentId: tabId}.
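The sweep decision in rule 4 can be sketched as a predicate over the tabs returned by tabs_context_mcp. The tab shape and helper name are hypothetical; only the rules themselves come from this section:

```typescript
interface TabInfo {
  id: string;
  url: string;
}

// Round-boundary sweep decision (rule 4 above). The orchestrator tracks
// ownership in a Map<agentId, tabId> and derives the two ID sets from it.
function shouldClose(
  tab: TabInfo,
  returnedTabIds: Set<string>, // reported by agents that have returned
  runningTabIds: Set<string>,  // owned by agents still running
  keepTabId: string | null,    // one working tab kept on about:blank
): boolean {
  if (runningTabIds.has(tab.id)) return false;      // rule 5: never close a live agent's tab
  if (tab.id === keepTabId) return false;           // keep one tab for the next round
  if (returnedTabIds.has(tab.id)) return true;      // reported by a returned agent
  if (tab.url.includes("localhost:3000")) return true; // stale eval tab, no live owner
  return false;
}
```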

4. Breakpoints

| Width | Label | Device class |
|---|---|---|
| 375px | mobile | iPhone 13 / 14 / 15 mini baseline |
| 768px | tablet | iPad portrait |
| 1024px | laptop | small desktop, common content-max |
| 1440px | desktop | standard MBP 14" and above |

Each evaluator call screenshots all four per page. The Chrome automation tool supports resize_window for this.


5. Agent roster

Each role is a prompt template, not a long-lived process. Every call is a fresh subagent with a self-contained brief.

5.1 Evaluator agent (subagent_type: general-purpose)

Purpose: score a page against the rubric at all four breakpoints, return a structured report.

Inputs from orchestrator:

  • Target URL(s) on http://localhost:3000.
  • The rubric (by reference to this document).
  • Tab ID if known; otherwise instructions to create one.
  • What changed since last eval (so the agent can focus on regressions).

What the agent does:

  1. Loads the URL in Chrome via MCP tools.
  2. Resizes to each of 375 / 768 / 1024 / 1440 and takes a screenshot at each.
  3. Zooms into header, hero, any card regions that look off, using the zoom action.
  4. Scores each dimension per breakpoint.
  5. Writes findings as a structured report.

Required output format (the orchestrator merges this into the running rubric, so the shape matters):

## Page: <path>
### Scores
| Dim | Mobile | Tablet | Laptop | Desktop | Notes |
|---|---|---|---|---|---|
| 1 Typography | 3 | 4 | 4 | 4 | headline clamp too aggressive on 375 |
| 2 Layout | ... | | | | |
| ... | | | | | |

### Blocking findings (score < 4)
1. **[dim 2, mobile, hero]** Launching-autumn pill renders as empty oval next to mock — cause likely `flex justify-end` + pill-as-sibling. File: `app/_components/landing-page.tsx:54-57`.
2. ...

### Non-blocking observations
- ...

### Regression check
- Fix X from previous round: resolved ✓ / partially / regressed.

Constraints:

  • The agent never edits files.
  • Findings must cite a file and line when possible. Vague findings ("the hero feels cramped") are rejected — the orchestrator will re-dispatch asking for specifics.
  • Each finding names the dimension, breakpoint, and suspected cause.

5.2 Explorer agent (subagent_type: Explore)

Purpose: read-only code discovery when the orchestrator needs to understand where something lives before dispatching a fix.

Typical questions:

  • "Where is the header component defined and what components does it render?"
  • "How is the hero mock composed in landing-page.tsx?"
  • "Which components import ProductMock?"

Output format: compact — file paths, line numbers, and a two-sentence summary. No code dumps unless the orchestrator asks for a snippet.

5.3 Implementer agent (subagent_type: general-purpose)

Purpose: take a specific, bounded fix list and apply it to the codebase.

Inputs from orchestrator:

  • A numbered list of atomic fixes, each with the file, line, rubric dimension, finding, and intended outcome.
  • Guardrails (don't touch X, don't rename Y, keep behavior of Z).
  • "Do not invent new fixes. If you see something broken that's not on the list, report it but do not fix it."

What the agent does:

  1. Reads the affected files.
  2. Applies the fixes.
  3. Runs pnpm typecheck and pnpm lint on what it changed.
  4. Returns a diff summary, not the diff itself.

Output format:

## Fixes applied
1. [dim 4] Header nav spacing — `components/header.tsx:23` — changed gap-4 to gap-6 and wrapped "How it works" in whitespace-nowrap. Typecheck: pass. Lint: pass.
2. ...

## Fixes skipped
- [dim 7] Card hover state — the card already has hover:shadow-card-hover but the token is defined as the same value as shadow-card. Needs token change, not component change. Reporting back.

## New issues observed (not fixed)
- ...

Constraints:

  • Never changes files outside the stated fix list.
  • Never commits. The orchestrator decides if/when to commit.
  • If a fix is impossible as stated, returns that instead of inventing an alternative.

5.4 Image-prompt agent (subagent_type: general-purpose)

Purpose: rewrite a prompt JSON file, run the image generation script, pick the best variant, report back.

Inputs from orchestrator:

  • Which prompt file to rewrite (scripts/prompts/*.json).
  • The rubric finding that motivated the change.
  • Explicit "no text, no labels, no brand names" constraint.
  • Number of variants to generate (default 2; higher if prompt is risky).

What the agent does:

  1. Reads the current prompt JSON and the finding.
  2. Proposes a new prompt string that strips all text, labels, UI chrome, and brand names.
  3. Writes the new JSON to the prompt file.
  4. Runs python3 scripts/generate-image.py --variants N <prompt-file> (requires deleting or renaming existing variants first, since the script skips existing files).
  5. Reads each generated image.
  6. Reports which variant to use and why, or reports failure and proposes a next prompt.

Output format:

## Prompt update: <name>
- **Finding addressed:** [dim 6] <finding>
- **New prompt:** <text>
- **Variants generated:** N at <path>
- **Recommended variant:** vX — <one-line rationale>
- **Fallback:** vY if vX has <specific issue>
- **Concerns:** <anything the orchestrator should know>

5.5 Wire-image agent (subagent_type: general-purpose)

Purpose: replace a placeholder component with an <Image /> that uses the final generated image, plus an HTML overlay layer for any textual UI.

Inputs:

  • Placeholder component path (e.g. components/graphics.tsx:21 ProductMock).
  • Image path under public/images/.
  • The HTML overlay specification — the orchestrator designs the overlay, the agent builds it.
  • Alt text.

What the agent does:

  1. Moves the chosen image from public/images/generated/ to public/images/ with a final name.
  2. Replaces the component's placeholder markup with <Image /> + absolutely-positioned overlay children.
  3. Ensures the overlay is responsive, not a fixed-pixel layer.
  4. Runs typecheck + lint.

5.6 Accessibility agent (subagent_type: general-purpose)

Purpose: dedicated a11y sweep — contrast, focus order, alt text, ARIA, keyboard traps, semantic HTML.

Dispatched: once per round, after visual evaluator, before implementer. Also once at the end before declaring the rubric satisfied.

Output format:

## A11y findings — <page>
| Severity | Dim 9 impact | Finding | File |
|---|---|---|---|
| blocker | -1 | input missing label | app/start/page.tsx:44 |
| ... |

5.7 Technical-polish agent (subagent_type: general-purpose)

Purpose: console errors, network waterfall, image sizes, meta tags, OG image wiring, favicon wiring, CLS check, LCP check.

Dispatched: once per round near the end. This is the dimension-10 specialist.


6. Workflow

6.1 Round anatomy

ROUND N
├── (a) Evaluator × 9 pages (parallel)      ← visual scoring pass
├── (b) A11y agent × 9 pages (parallel)      ← dimension 9 pass
├── (c) Technical-polish agent × 1 (site)    ← dimension 10 pass
├── (d) Orchestrator merges findings → rubric.json, updated scores
├── (e) Orchestrator ranks blockers by (severity × leverage)
├── (f) Orchestrator plans next batch of fixes
│        ├── group 1: independent implementer fixes → parallel dispatch
│        ├── group 2: image prompt updates (if any) → parallel dispatch
│        └── group 3: wire-image fixes (depend on image agent output)
├── (g) Wait for all fixes to return
├── (h) Sanity check: typecheck + lint (via implementer or technical-polish)
└── GOTO ROUND N+1

6.2 Round 0 (baseline)

Round 0 is the only round with no fix dispatch. The orchestrator dispatches:

  • Evaluator on all 9 pages in parallel.
  • A11y on all 9 pages in parallel.
  • Technical-polish once.

Then it merges into docs/launch-rubric-state.md and writes a prioritized issue list. The user can review this if they come back.

6.3 Parallelism rules

  • Evaluators for independent pages are always parallel.
  • Implementers can be parallel only if their file sets do not overlap. The orchestrator is responsible for tracking file ownership within a round.
  • Image agents are always parallel with each other (they touch different prompt files).
  • Wire-image agents must run after their corresponding image agent returns.
  • If two fixes touch the same file, they are batched into one implementer call with both fixes in the list.
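The file-ownership rule can be sketched as a batching pass: any two fixes that share a file end up in the same implementer call, and disjoint batches may dispatch in parallel. The Fix shape is a hypothetical illustration:

```typescript
interface Fix {
  id: number;
  files: string[];
}

// Merge fixes into batches so that no two batches share a file.
// Transitive overlaps (A shares with C, B shares with C) collapse
// into one batch, so parallel batches are always file-disjoint.
function batchFixes(fixes: Fix[]): Fix[][] {
  const batches: { fixes: Fix[]; files: Set<string> }[] = [];
  for (const fix of fixes) {
    const hits = batches.filter((b) => fix.files.some((f) => b.files.has(f)));
    if (hits.length === 0) {
      batches.push({ fixes: [fix], files: new Set(fix.files) });
    } else {
      // Merge every batch this fix overlaps with, then add the fix.
      const [target, ...rest] = hits;
      for (const b of rest) {
        target.fixes.push(...b.fixes);
        b.files.forEach((f) => target.files.add(f));
        batches.splice(batches.indexOf(b), 1);
      }
      target.fixes.push(fix);
      fix.files.forEach((f) => target.files.add(f));
    }
  }
  return batches.map((b) => b.fixes);
}
```

Each inner array becomes one implementer dispatch with all of its fixes in a single numbered list, per the rule above.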

6.4 Regression handling

Every evaluator call includes a "regression check" section that verifies the previous round's fixes. If a fix regressed:

  1. The finding is promoted to top priority for round N+1.
  2. The orchestrator dispatches an explorer agent to understand why, before dispatching a re-fix.

6.5 Loop budget

  • Iterate until excellent. The stop conditions in §8 are the gate; the orchestrator does not stop because of an arbitrary round count. If the rubric isn't satisfied after a round, dispatch the next round. Repeat until the stop conditions are all true.
  • Soft warning at 8 rounds: if blockers remain after 8 rounds, the orchestrator writes a "stuck report" summarizing what keeps failing and dispatches differently (different agent, different approach, different model) rather than re-running the same brief.
  • No hard monetary cap on image generation — the user has lifted it — but the orchestrator will not regenerate the same prompt more than 3 times without changing the prompt strategy. Three failures of the same prompt is a prompt-design problem, not a generator problem.
  • Between rounds, the orchestrator re-reads design-guidelines.md and launch-rubric-state.md to re-ground its judgment.

7. Rubric state file

The orchestrator maintains docs/launch-rubric-state.md as the running record. Format:

# Launch rubric — running state

**Last round:** 3
**Last updated:** 2026-04-11T14:32:00Z
**Status:** blocking findings remaining

## Summary (latest round)
| Page | Avg score | Worst dim | Worst score |
|---|---|---|---|
| / | 3.8 | 6 Imagery | 2 |
| /start | 4.1 | 4 Header | 3 |
| ... |

## Blocking findings
1. ...

## History
- Round 0: baseline avg 2.9
- Round 1: avg 3.3 (+0.4), hero regressed dim 3 mobile
- Round 2: avg 3.8 (+0.5)
- ...

This file is what the orchestrator reads at the start of each round to know where it is. It is also what the user reads to understand progress.


8. Stop conditions

The loop stops and the orchestrator declares launch-ready when all of the following hold:

  1. Every page × every breakpoint × every dimension ≥ 4.
  2. No open regressions from the previous round.
  3. pnpm typecheck clean.
  4. pnpm lint clean.
  5. No console errors in the Chrome tab on any of the 9 pages.
  6. No blockers from the accessibility agent.
  7. Dimension 10 checklist complete: og:image wired, favicon wired, meta tags present, images moved out of /public/images/generated/ into /public/images/ with final names, Next <Image> used where applicable.
  8. The orchestrator has exhausted the "non-blocking observations" list down to items it considers post-launch polish.

On stop, the orchestrator writes a final summary to docs/launch-rubric-state.md and posts a concise report to the user.


9. Failure modes the orchestrator must handle

  • Evaluator disagreement round-to-round. If two consecutive evaluator calls score the same unchanged page 3 points differently, the rubric is noisy. The orchestrator averages and logs the drift, and tightens the rubric prompt.
  • Image regeneration flakiness. If the Gemini call fails, retry once. If it fails twice, log it and move to the next fix.
  • Implementer over-reach. If an implementer returns changes to files outside the brief, the orchestrator rejects the round and re-dispatches with tighter guardrails.
  • Hot-reload desync. If the evaluator sees old output after a fix, the orchestrator navigates the tab to the URL fresh rather than assuming reload.
  • Stuck round. If the same blocker appears in 3 consecutive rounds, promote it to user attention rather than retrying.
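The evaluator-disagreement rule (first bullet above) could be backed by a small drift check when merging a fresh evaluation of an unchanged page. This helper is illustrative, not part of the contract:

```typescript
// Flag noisy dimensions: same unchanged page, two consecutive evaluator
// passes, a score gap of 3 or more. Returns the dims to average and log.
function noisyDimensions(
  prev: Record<number, number>,
  curr: Record<number, number>,
  threshold = 3,
): { dim: number; averaged: number }[] {
  const out: { dim: number; averaged: number }[] = [];
  for (const dim of Object.keys(curr).map(Number)) {
    const a = prev[dim];
    const b = curr[dim];
    if (a !== undefined && Math.abs(a - b) >= threshold) {
      out.push({ dim, averaged: (a + b) / 2 });
    }
  }
  return out;
}
```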

8a. Model selection for agent dispatch

The orchestrator picks the cheapest model that can do the job well:

  • Sonnet — structured instruction-following work with a clear checklist:
    • Evaluator agents (scoring against the rubric)
    • Implementer agents with atomic fix lists
    • Verifier agents (running typecheck/lint, reading files, reporting back)
    • Any agent whose brief reads as "do X, then Y, then Z, then report in this format"
  • Opus — work that requires architectural judgment, design taste, or merging conflicting signals:
    • Writing the rubric state merge
    • Writing prompt rewrites for image generation
    • Deciding whether to strip a card vs. restyle it (genuine design calls)
    • Resolving cross-agent conflicts
    • Any dispatch where "use your judgment" is a significant part of the brief

Dispatches pass model: "sonnet" explicitly when Sonnet is right; otherwise the agent definition's default is used.

9a. Task state discipline

Task status must always reflect reality, never intent. Rules:

  1. pending → in_progress is set at the moment the actual work starts — i.e. the same tool call that dispatches the agent / runs the command / begins the edit. Not before. Never mark in_progress as a way of "queueing" future work.
  2. in_progress → completed is set only after the work has returned a verifiable result — agent returned its report, command exited, edit was saved. Not when I think it's done, not when I've just dispatched, not optimistically.
  3. Dispatch ≠ done. Dispatching an agent is when the task enters in_progress. The task does not become completed until the agent returns and I have merged its output.
  4. If I am about to mark multiple tasks in_progress in the same turn, that is a signal I am batching intent, not tracking work. Only the tasks whose work I am starting in that turn get in_progress. The rest stay pending.
  5. No premature completed. A task that I consider "trivially satisfied" (e.g. a tab sweep that turned out to be empty) still gets the two-step: verify → then update.
  6. When a task is abandoned or superseded, use deleted with a note — never falsely mark it completed.

This discipline exists so the user can trust the task list as a source of truth about what is actually happening in the repo at any moment.

10. What the orchestrator never does

  • Read or edit source files directly.
  • Run the dev server, build, typecheck, lint, test, or image-generation scripts directly.
  • Take browser screenshots or score pages directly.
  • Dispatch a fix that isn't traceable to a rubric finding.
  • Commit or push.
  • Rewrite product copy for meaning (only for polish).
  • Change brand tokens (palette, radii, font stack) without explicit user input.
  • Touch pricing numbers, deposit amount, or legal body text.