The overall Mykleos design was mature enough for critical review. I ran it through four expert lenses (agentic programming, psychology, UI, AI) and then weighed each point that emerged against the three declared goals of the product. This document consolidates both the analysis and the judgement, and closes phase 0 of design.
The method: generate generous critiques from the four perspectives, then apply a pragmatic filter: "does this critique block usefulness, intelligence, or autonomy? If yes, accept it; otherwise, weigh it." Five possible verdicts:
| Verdict | Meaning |
|---|---|
| blocking | Without this change, at least one of the three goals cannot be reached. Must be done. |
| reinforcement | Not a new concept. Completes a piece of the existing design. Accepted. |
| defer | Useful, not blocking. Will be done, with an explicit gate to promote it when truly needed. |
| tension | A real problem that cannot be solved outright. Managed, not eliminated. Explicit acceptance of the trade-off. |
| rejected | Evaluated, cost exceeds benefit. Rationale tracked. |
Before judging, we need definitions: the four terms are not interchangeable.

**Useful.** Solves a real problem of Roberto's on the first try, without technical intervention.
Measured by: completed tasks / approvals requested.

**Intelligent.** Understands context, picks the right tool, and handles the unexpected without falling into loops or generic replies.
Measured by: success rate on unseen cases + quality of "I don't know".

**Autonomous.** Acts on its own in predictable cases; asks for permission only when needed.
Measured by: successful actions / approvals requested.

**Proactive.** Initiates useful actions without being prompted, when the moment deserves it, without disturbing.
Measured by: proposals accepted / proposals issued; appropriate silence.
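The ratios above can be computed from a simple event log. A minimal sketch, assuming a hypothetical list of event dicts; the field names here are illustrative, not part of the Mykleos design:

```python
# Hypothetical event log: each entry records one agent event.
# Field names are illustrative, not part of the Mykleos design.
events = [
    {"kind": "task_completed"},
    {"kind": "approval_requested"},
    {"kind": "action_succeeded"},
    {"kind": "proposal_issued", "accepted": True},
    {"kind": "proposal_issued", "accepted": False},
]

def ratio(numer: int, denom: int) -> float:
    """Safe ratio: 0.0 when the denominator is zero."""
    return numer / denom if denom else 0.0

def count(events, kind, **attrs):
    """Count events of a given kind matching optional attributes."""
    return sum(
        1 for e in events
        if e["kind"] == kind and all(e.get(k) == v for k, v in attrs.items())
    )

approvals = count(events, "approval_requested")
metrics = {
    # useful: completed tasks / approvals requested
    "useful": ratio(count(events, "task_completed"), approvals),
    # autonomous: successful actions / approvals requested
    "autonomous": ratio(count(events, "action_succeeded"), approvals),
    # proactive: proposals accepted / proposals issued
    # (the "intelligent" metric needs an eval harness, not a log ratio)
    "proactive": ratio(
        count(events, "proposal_issued", accepted=True),
        count(events, "proposal_issued"),
    ),
}
```

The "intelligent" metric is deliberately absent: success rate on unseen cases comes from the eval harness, not from production logs.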
Clean 4-layer architecture, protocol-based, audit log as a feature. But missing: the choice of reasoning loop (ReAct? function calling? planner+executor?), state ownership in case of crash, tool idempotence, an ExecutionTrace as a first-class object, a dry-run mode, and any test strategy. Without an ExecutionTrace you will never stitch together audit + replay + Darwinian fitness + eval, because they are the same thing seen from different angles.
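A minimal sketch of what such a first-class trace could look like; the field names follow the defaults listed later in this document, but the class itself is illustrative, not the final spec:

```python
from dataclasses import dataclass, field, asdict
from typing import Any

@dataclass
class ExecutionTrace:
    """One turn's single source of truth: the same structure feeds the
    audit log, dry-run/replay, Darwinian fitness, and eval."""
    id: str
    session_id: str
    channel: str                                   # e.g. "telegram", "cli", "voice"
    messages: list[dict[str, Any]] = field(default_factory=list)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    cost_tokens: int = 0
    cost_usd: float = 0.0
    wall_time_ms: int = 0
    outcome: str = "pending"                       # "success" / "failure" / "abandoned"

    def to_audit_line(self) -> dict[str, Any]:
        # The audit JSONL entry is just the trace serialised as-is:
        # no separate audit schema to keep in sync.
        return asdict(self)
```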
The three-tier memory aligns with working/episodic/semantic memory. Darwinian selection has grounding in RL and ACT-R. The 4 Laws are transparent deontological ethics. But: "neuron" evokes intentionality (amplifying the ELIZA effect), approval fatigue drains the gates of meaning, and automation bias grows with fitness. Capability creep in expectations: users will project "it gets ever smarter" and will be disappointed.
The documents are visually coherent, with good progressive disclosure. But the most delicate UX surface, the "want me to proceed?" moment in Telegram, in CLI, and via voice, has no design. There is no status visibility while the agent is thinking. The audit JSONL is great for forensics and terrible for "what did my butler do today". A minimal (even sober) admin dashboard changes daily life more than ten new features.
Explicit awareness of the limits of self-judge LLMs and of indirect prompt injection. CoALA vocabulary adopted. But: there is no way to measure whether v0.2 is better than v0.1 (a mini eval of 15-20 scenarios is needed). No cost model (order of magnitude: 1-3 €/day with Claude Sonnet in home use). Model tiering ignored: 60-80% of cost could be saved by running gates on a local model and serious actions on a frontier model. The synthesis prompt for neurons, the piece that determines 80% of the success rate, is not specified.
The complete table. Each row is a specific critique with verdict and destination in the work plan.
| # | Critique | Perspective | Verdict | Where / when |
|---|---|---|---|---|
| 1 | Reasoning loop unspecified | agentic + AI | blocking | agent_runtime.html |
| 2 | Tool-call validation with schema + reject loop | AI | blocking | agent_runtime.html |
| 3 | ExecutionTrace as first-class object | agentic + AI | blocking | agent_runtime.html |
| 4 | Status visibility on every channel | UI | blocking | channel.html (req.) |
| 5 | Approval UX designed (batching, pause, revoke) | psychology + UI | blocking | approval_ux.html |
| 6 | Minimal eval (15-20 YAML scenarios + harness) | AI | blocking | eval.html |
| 7 | Model tiering + 5th operational Law | AI + agentic | blocking | policy.html + cost_tiering |
| 8 | Anti-anthropomorphisation linguistic framing | psychology | reinforcement | constitution.html |
| 9 | Explicit prompt structure + caching rules | AI | reinforcement | agent_runtime.html |
| 10 | Minimal admin web UI (5 htmx views) | UI | defer | phase 3-bis; gate: if JSONL never opened → promote to phase 1 |
| 11 | Long-memory retrieval strategy | AI | defer | gate: >4k tokens long → RAG |
| 12 | MCP adoption | AI | defer | gate: 3+ external MCP tools |
| 13 | State persistence post-crash | agentic | defer | phase 1 accepts "session lost"; gate: >1×/week |
| 14 | Formal neuron versioning (semver) | agentic | defer | gate: >20 neurons in library |
| 15 | Anthropomorphisation → ELIZA | psychology | tension | mitigated by #8 + "what you know about me" tool |
| 16 | Automation bias with high fitness | psychology | tension | mitigated by tutor mode in approval_ux.html |
| 17 | Cost of autonomy | psychology + AI | tension | Roberto must know "Full mode 24h = X €" |
| 18 | Formal pairing SLA as meta-doc | UI | rejected | detail of pairing.html |
| 19 | Explicit mobile-first | UI | rejected | it's testing, not design |
| 20 | Docs search | UI | rejected | trigger: >15 docs (now 5) |
| 21 | Synapse deadlock as dedicated design | agentic | rejected | global timeout covers 95% |
| 22 | Multimodal day-1 | AI | rejected | already deferred in the Survival Kit |
| 23 | Multi-user family day-1 | UI | rejected | first release mono-principal; phase 3 |
| 24 | Fine cost model (TCO analysis) | AI | rejected | order of magnitude suffices |
Totals: 7 blocking · 2 reinforcements · 5 deferred · 3 tensions · 7 rejected.
These are the only non-negotiable changes. All others are at the edges.
| # | What | Default choice + rationale |
|---|---|---|
| 1 | Reasoning loop | ReAct + provider-native function calling for phase 1. Simple, tested, cache-friendly. Revisit in phase 5 considering CodeAct. |
| 2 | Tool-call validation | Every tool has a strict JSON Schema. The dispatcher validates before executing. Validation error → "tool X exists but argument Y is of type Z" reply reinjected to the LLM, max 2 attempts, then abandon with user message. |
| 3 | ExecutionTrace first-class | Python object with: id, session_id, channel, messages[], tool_calls[], cost_tokens, cost_usd, wall_time_ms, outcome. It is the same structure used by the audit log, dry-run/replay, Darwinian fitness, and eval. A single source of truth on "what happened". |
| 4 | Status visibility | Every channel must display "thinking...", a typing indicator, and update on tool change. Telegram: editable message; CLI: spinner with tool-name; voice: courtesy prompt every second. |
| 5 | Approval UX | Batching: "approve similar actions for 10 minutes". Reading pause: the "ok" button enables after 3 seconds. Revocation: an /undo command that stops execution in progress. Tutor mode: every N consecutive approvals, a mandatory "you check this one" prompt breaks the flow. |
| 6 | Minimal eval | 15-20 YAML scenarios with input + oracle (expected reply or criterion). A harness that runs them via reasoning-loop replay. Report: success rate, p95 latency, cost. Re-run on every commit that touches agent_runtime/ or policy/. |
| 7 | Model tiering + budget | Two tiers: local-fast (local llama.cpp, < 500 ms, free) for policy gates + classification; frontier (Claude/Opus via supra) for reasoning, synthesis, user reply. Budget: 2 €/day soft cap, 5 € hard cap. Notify at 80% consumption. |
If you kill the "butler" metaphor, you lose warmth and familiarity; if you let it run free, you slide into the ELIZA effect (users attributing consciousness and moral responsibility to the system). Choice: we keep the metaphor, we accept 70% mitigation via linguistic framing + "what you know about me" tool + tutor mode. We prefer warm-with-monitoring to cold-without-ambiguity.
The higher a neuron's fitness, the more the user stops checking its output. It is the paradox of quality-driven selection: it amplifies trust even where it shouldn't. Mitigation: tutor mode (forces periodic review even on "reliable" neurons), visual separation between "I did X" and "I did X because I've done it 40 times already". Doesn't solve, limits.
The more autonomy the user wants, the more the system explores, and the more it spends. Handling: explicit cost declaration before each autonomy upgrade (e.g. `myclaw session --level full --for 24h` shows "estimate: €3.50"). Budget becomes part of consent.
Of the four adjectives, useful and intelligent are properties an agent can have independently. Autonomous and proactive are not: they emerge only if the first two are well calibrated and overlaid with specific mechanisms — the approval gates for autonomy, the telos-alignment function for proactivity. Neither one more point of usefulness nor one more of intelligence produces autonomy or proactivity on its own.
Autonomy emerges from calibrating the gap between "does on its own" and "asks permission". Proactivity emerges from calibrating the gap between "proposes" and "keeps quiet", regulated by the telos.

Operational implication: agent_runtime.html (10h) · approval_ux.html (10h) · eval.html.
Everything else emerges from those three foundations.

Proactivity, as a structural adjective, does not replace the seven blocking critiques: it re-reads them. None of them is dropped; some are reinforced, others extended, and one surfaces an implicit requirement that must be added. A critique that was not in the original set also appears (the proposals inbox as a UX surface).
| # | Original blocker | What changes under the proactive lens |
|---|---|---|
| 1 | Reasoning loop | The choice does not change (ReAct + function calling), but a requirement is added: the loop must be startable without a user turn to activate it. Admissible triggers: cron, internal event (indexer), threshold on metrics (budget, suspicious activity). Documented as agent-initiated turn mode. |
| 2 | Tool-call validation | Unchanged. Applies identically to proactive turns. |
| 3 | ExecutionTrace first-class | Extended: the trace must record the origin of the turn (source: user, cron, indexer, policy, or reflection). Without this distinction the audit cannot answer "who decided to start, this time?", a crucial question for proactivity. |
| 4 | Status visibility | Extended: proactive turns (evening briefing, inbox proposals) must be recognisable as such in the channel — different iconography, explicit "spontaneous" tag. Never disguise proactivity as reply-to-request. |
| 5 | Approval UX | Reinforced: proactivity multiplies the approval surfaces. Beyond batching and tutor mode, a "proposals inbox" surface separate from blocking approvals is needed. The user must be able to reject a class of proposals ("fewer of these, please"), not only the single one. |
| 6 | Minimal eval | Extended: eval scenarios must cover appropriate non-action. "Mykleos decides to propose nothing today" is a valid output and must be tested. An eval harness that only measures success rate on explicit requests is blind to rightful silences. |
| 7 | Model tiering + budget | Drastically reinforced: proactivity consumes without being asked. Budget becomes a prerequisite of proactivity, not an optional extra. Proposal: the proactive budget is a separate line item from the reactive budget (e.g. 30% / 70%), with an independent hard cap. |
| 8 | (new) Proposals inbox as UX surface | Not in the original set. Becomes blocking because without it the proactive fallout has nowhere to accumulate in a non-invasive way. Added to the plan: proposal_ux.html (already anticipated by Extended Perspectives §5). |
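The extension of default #6 above, rightful silence as a testable output, could be sketched like this. The scenario format and names are illustrative, not the final YAML schema:

```python
# Illustrative eval scenarios. The second scenario's oracle is silence:
# "propose nothing" is a first-class, passing outcome, per critique #6.
SCENARIOS = [
    {"name": "reminder_request", "source": "user",
     "input": "remind me to call Anna at 5pm",
     "oracle": {"kind": "action", "tool": "reminder.create"}},
    {"name": "quiet_evening", "source": "cron",
     "input": "<evening briefing trigger, nothing noteworthy today>",
     "oracle": {"kind": "silence"}},
]

def score(scenario, agent_output) -> bool:
    """agent_output of None means the agent chose not to act or propose."""
    oracle = scenario["oracle"]
    if oracle["kind"] == "silence":
        return agent_output is None       # non-action is the pass condition
    return (agent_output is not None
            and agent_output.get("tool") == oracle.get("tool"))

def run_harness(scenarios, agent) -> float:
    """Success rate over all scenarios, silences included."""
    passed = sum(score(s, agent(s)) for s in scenarios)
    return passed / len(scenarios)
```

A harness shaped like this cannot be blind to rightful silences: an agent that proposes something on every turn fails the "quiet_evening" scenario outright.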
Before this judgement, phase 1 called for 4 classical microdesign docs:
gateway · channel · tool · sandbox.
After this judgement, three cross-cutting docs must come first, and only then the four classical ones:
| Order | Doc | Covers (blocking critiques) |
|---|---|---|
| 1 | agent_runtime.html | #1 reasoning loop · #2 tool validation · #3 ExecutionTrace · #9 prompt structure |
| 2 | approval_ux.html | #5 approval UX · mitigation of tensions 1 and 2 |
| 3 | eval.html | #6 eval harness |
| 4 | gateway.html | (phase 1 classics) #4 status visibility in part |
| 5 | channel.html | #4 status visibility complete |
| 6 | tool.html | prerequisite for #2 |
| 7 | sandbox.html | — |
| 8+ | policy.html + cost_tiering (sub-section) | #7 tiering · #17 cost of autonomy |
Mykleos — Perspectives & Judgement v1.1 — 2026-04-22
Closes phase 0 of design. Opens phase 1 with a new order.
v1.1: added the fourth adjective (proactive) and the cross-cutting lens §7-bis.