goalkeeper

Open Source

Contract-driven autonomous goal execution for Claude Code. A subagent judge gates completion against an explicit Definition of Done, so a passing validator can't ship a placeholder.

Designer, author, and maintainer. Published as an open-source Claude Code plugin under the MIT license.


Tech Stack

Runtime: Claude Code plugin (skills + subagents)
Spec format: Markdown contract with YAML frontmatter, validated against a JSON Schema
State: append-only `log.md` + JSON state files per goal, mission-level state at `.claude/mission.md`
Pacing: cache-aware `ScheduleWakeup` (60–270s warm path, 1200s+ for long waits)
Tests: Python lifecycle harness (80 assertions across 14 state-transition tests)
Schema validator: `scripts/validate-contracts.py`
License: MIT

Stats

Version: v0.3.0

Skills: 8 (`goal`, `goal-prep`, `goal-judge`, `goal-chain`, `goal-supervisor`, `goal-pause`, `goal-resume`, `goal-clear`)

Lifecycle test assertions: 80 across 14 tests

License: MIT

Plugin marketplace: `itsuzef/goalkeeper`

Overview

A Claude Code plugin that lets an agent run for hours toward a written contract without drifting into stubs or premature claims of success. After each checkpoint the validator runs (typecheck, tests, lint, whatever the contract says). When the validator passes, a fresh-context subagent **judge** independently reviews the diff and progress log against the Definition of Done, then either approves or returns a structured fix-list. After 5 rejections the loop pauses for a human.

Inspired by OpenAI Codex `/goal` and Geoffrey Huntley's Ralph loop, with one key addition: the judge. A passing validator is not a finished feature. Tests can pass on stubs, linters can pass on `.todo`s, and the executing agent can rationalize sentinel values. The judge catches what the validator can't measure.
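The validate-then-judge gate described above can be sketched in a few lines of Python. This is an illustration only, not the plugin's implementation: `run_goal`, `Verdict`, the three callables, and the `max_checkpoints` safety bound are hypothetical names invented for this sketch. Only the ordering (validator first, judge second) and the pause after 5 rejections come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    """Hypothetical shape of a judge verdict: approve, or reject with a fix-list."""
    approved: bool
    fix_list: List[str] = field(default_factory=list)


def run_goal(
    work: Callable[[List[str]], None],   # hypothetical executor step, receives the last fix-list
    validate: Callable[[], bool],        # hypothetical validator (typecheck, tests, lint)
    judge: Callable[[], Verdict],        # hypothetical fresh-context review
    max_rejections: int = 5,             # from the text: pause for a human after 5 rejections
    max_checkpoints: int = 100,          # sketch-only bound so the loop always terminates
) -> str:
    fix_list: List[str] = []
    rejections = 0
    for _ in range(max_checkpoints):
        work(fix_list)
        if not validate():
            # The judge is only consulted once the validator passes.
            continue
        verdict = judge()
        if verdict.approved:
            return "approved"
        rejections += 1
        if rejections >= max_rejections:
            return "paused_for_human"
        # The next round starts from the judge's structured fix-list.
        fix_list = verdict.fix_list
    return "budget_exhausted"
```

The point of the shape is that a passing `validate()` never returns `"approved"` on its own; only the judge's verdict can end the loop successfully.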

Engineering Highlights

01. Subagent judge gate: Independent context catches placeholders, sentinel values, and shortcuts the executing agent rationalized away. The README's reject-then-approve demo caught a `MAX_RUNTIME_MS = 9999 // TODO` benchmark threshold that the validator passed twice in a row.
02. Executor-as-subagent chain execution (v0.3): Multi-goal chains spawn a fresh-context executor per goal, so a 9-goal chain runs in one main-conversation context instead of dying at goal 2 or 3 from token aging. The main context only orchestrates: spawn executor, spawn judge, apply verdict, advance cursor.
03. Missions primitive (v0.2): One level above goals. A supervisor reads a mission charter plus the prior goal's artifacts and adaptively decides PROCEED / DONE / ESCALATE. Chains commit to a linear sequence at start; missions branch based on what the previous step actually produced.
04. Cache-aware wakeup pacing: Anthropic's prompt cache has a 5-minute TTL. Wakeup delays are tuned to either stay warm (60–270s) or commit to a full cache miss (1200s+), never the worst-of-both 300s.
05. Anti-placeholder rule, enforced: Stubs, mocks, `.todo`, `.skip`, and "TODO: real implementation" sentinels trigger automatic judge rejection. Borrowed from Ralph, made structural.
06. Pre-existing-failure baselining: `validator_baseline_*` captures the validator's state at goal activation so pre-existing debt doesn't block judge approval. The judge subtracts already-broken paths from its rejection criteria.
07. Append-only log as message bus: Round N's input is Round N-1's verdict, read straight from `log.md`. No human translates "the agent shipped a placeholder" into "here's what to fix."
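
Two of the highlights above, cache-aware pacing and failure baselining, reduce to small pure functions. The sketch below is a guess at their shape rather than the plugin's code: `pick_wakeup_delay` and `new_failures` are hypothetical names, failures are assumed to be identified by path strings, and the rule for waits that fall between the two bands (snap to the nearer edge) is this sketch's own assumption. Only the 60–270s and 1200s+ bands come from the text.

```python
from typing import Set

WARM_MIN_S = 60     # from the text: warm path lower bound
WARM_MAX_S = 270    # from the text: stays under the 5-minute cache TTL
COLD_MIN_S = 1200   # from the text: long waits accept a full cache miss


def pick_wakeup_delay(desired_s: float) -> int:
    """Map a desired wait onto the warm band or the cold band, never the gap between."""
    if desired_s <= WARM_MAX_S:
        return max(WARM_MIN_S, int(desired_s))
    if desired_s >= COLD_MIN_S:
        return int(desired_s)
    # Assumption of this sketch: waits inside the gap snap to the nearer band edge.
    midpoint = (WARM_MAX_S + COLD_MIN_S) / 2
    return WARM_MAX_S if desired_s < midpoint else COLD_MIN_S


def new_failures(current: Set[str], baseline: Set[str]) -> Set[str]:
    """Failures the goal is responsible for: current failures minus those
    already present when the goal was activated."""
    return current - baseline
```

Under these assumptions, `pick_wakeup_delay(300)` returns 270 instead of landing just past the cache TTL, and `new_failures({"tests/a.py", "tests/b.py"}, {"tests/a.py"})` leaves only `tests/b.py` for the judge to weigh.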