
goalkeeper
Open SourceContract-driven autonomous goal execution for Claude Code. A subagent judge gates completion against an explicit Definition of Done, so a passing validator can't ship a placeholder.
Designer, author, and maintainer. Published as an open-source Claude Code plugin under the MIT license.
Tech Stack
Stats
Overview
A Claude Code plugin that lets an agent run for hours toward a written contract without drifting into stubs or premature claims of success. After each checkpoint the validator runs (typecheck, tests, lint, whatever the contract says). When the validator passes, a fresh-context subagent **judge** independently reviews the diff and progress log against the Definition of Done, then either approves or returns a structured fix-list. After 5 rejections it pauses for a human. Inspired by OpenAI Codex `/goal` and Geoffrey Huntley's Ralph loop, with one key addition: the judge. A passing validator is not a finished feature. Tests can pass on stubs, linters can pass on `.todo`s, and the executing agent can rationalize sentinel values. The judge catches what the validator can't measure.
Engineering Highlights
- 01Subagent judge gate — Independent context catches placeholders, sentinel values, and shortcuts the executing agent rationalized away. The README's reject-then-approve demo caught a `MAX_RUNTIME_MS = 9999 // TODO` benchmark threshold that the validator passed twice in a row.
- 02Executor-as-subagent chain execution (v0.3) — Multi-goal chains spawn a fresh-context executor per goal so a 9-goal chain runs in one main-conversation context instead of dying at goal 2-3 from token aging. Main context only orchestrates: spawn executor, spawn judge, apply verdict, advance cursor.
- 03Missions primitive (v0.2) — One level above goals. Supervisor reads a mission charter + the prior goal's artifacts and adaptively decides PROCEED / DONE / ESCALATE. Chains commit linearly at start; missions branch based on what the previous step actually produced.
- 04Cache-aware wakeup pacing — Anthropic's prompt cache has a 5-minute TTL. Wakeup delays are tuned to either stay warm (60–270s) or commit to a full cache miss (1200s+), never the worst-of-both 300s.
- 05Anti-placeholder rule, enforced — Stubs, mocks, `.todo`, `.skip`, and "TODO: real implementation" sentinels are automatic judge rejection. Borrowed from Ralph, made structural.
- 06Pre-existing-failure baselining — `validator_baseline_*` captures the validator's state at goal activation so pre-existing debt doesn't block judge approval. The judge subtracts already-broken paths from its rejection criteria.
- 07Append-only log as message bus — Round N's input is Round N-1's verdict, read straight from `log.md`. No human translates "the agent shipped a placeholder" into "here's what to fix."