
The cognitive engine

how Metnos plans and executes, without calling the cloud
Microdesign — architecture of the four-layer engine.

Audience: anyone who wants to understand how Metnos thinks,
why it answers fast and learns from use.
Reading time: 15 minutes.

Contents

  1. In one line: one model, called once
  2. The idea: propose everything at once, and remember
  3. The four layers of the engine
  4. The cascade through a turn
  5. The plan: what the engine proposes
  6. Layer 0 — Fastpath: self-learned shortcuts
  7. Layer 1 — Autopath: autopaths learned from feedback
  8. Layer 2 — the plan check
  9. Layer 3 — Proposer, Executor, Recovery, Terminator
  10. How it learns and speeds up
  11. What changes for the user
  12. Why it beats LLM-in-loop frameworks
  13. Honest numbers
  14. Open questions

1. In one line: one model, called once

When a request arrives, Metnos no longer calls the planner five or six times in a row. First it tries the shortcuts it already knows; if those don’t apply, it asks the local language model to propose the whole plan in a single call; then it executes it deterministically, and if something goes wrong it attempts a targeted recovery or honestly explains what’s missing. Four layers, one model, never the cloud.

2. The idea: propose everything at once, and remember

Today’s engine starts from two simple insights.

First insight: a single proposal. Instead of querying the model at every step, query it once and ask for the whole plan in one shot: the complete list of steps, the links between them, and the final message. The model sees the whole problem at once, builds internal coherence (it knows step 4 needs step 3’s output) and produces a far more stable plan. The execution that follows is pure deterministic mechanics: no more dice.

Second insight: remember what works. When a plan reaches the end successfully, the system keeps it, indexed by the «meaning» of the request. The next time a request of the same kind arrives, the plan is already there: it starts in a few milliseconds, without bothering the model. The system becomes its own planner — it learns to stop asking.

These two insights turn into four layers, tried in cascade. The distinction that matters is between what remembers and what reasons: the first two layers (L0 and L1) are stateful memory — a store that grows with use and answers without bothering the model when it recognizes the request. The other two (L2 and L3) are stateless: they keep nothing from one turn to the next, but re-check the plan and, when needed, build it from scratch. The third is a check, the fourth is the engine proper that proposes, executes, recovers and — if there’s no way out — honestly admits the limit.

3. The four layers of the engine

Every turn goes through the layers top to bottom. As soon as one layer resolves, the turn ends: the layers below aren’t even touched. The first two (L0 and L1) are the two stateful layers, the memory that learns: they are very fast and don’t use the language model. The other two (L2 and L3) are stateless — they keep nothing between turns — and only the fourth, only when needed, actually calls the model.

[Figure: User request → intent_extractor (verb + object) → L0 · Fastpath, self-learned shortcut (hash + cosine) → on a miss, L1 · Autopath, learned autopath (semantic + intent match) → on a miss, L2 · Validator (plan check) and L3 · Engine: Proposer (1 LLM call), Executor (deterministic), Recovery (4 error classes), Terminator (honest limit), that is propose · execute · recover · admit the limit; selector: simple | metis | frontier → Answer to the user. A hit at L0 or L1 goes straight to the answer.]
The four-layer cascade: each layer tries to answer; if it doesn’t recognize the request it passes to the next. Fastpath and Autopath answer in a few milliseconds without the model; the engine (L3) calls it once.

L0

Fastpath

self-learned shortcuts · stateful

File: runtime/engine/fastpath.py · store fastpaths.sqlite.

What it does: the first of the two stateful layers. It recognizes requests that the full plan has already solved successfully: a shortcut is produced automatically on every successful turn, and the only valves on that growth are manual removal from the admin console and aging. Recognition is two-stage: first an exact fingerprint compare (hash, under 5 ms, zero model), then a meaning compare with BGE-M3 embeddings (cosine, under 150 ms). On a match, it runs the saved plan immediately.

When it wins: always, if there’s a match. A self-learned shortcut takes precedence over everything else.

L1

Autopath

learned autopaths · stateful

File: runtime/engine/autopath.py · store autopath.sqlite.

What it does: the second stateful layer, the memory that grows on its own. It keeps the plans that worked, indexed by the meaning of the request. It searches first by semantic similarity (clusters of related requests) and then by exact intent match. If it recognizes the request, it runs the already-proven plan without calling the model.

When it acts: after Fastpath, only if the intent is complete (verb + object recognized). It also records every turn so it can learn in the future.

L2

Validator

plan check · stateless

File: runtime/engine/validator.py.

What it does: a deterministic check of a freshly proposed plan, before running it and without the model. It keeps nothing between turns: it is the first of the two stateless layers. It verifies that the cited tools exist, that arguments have valid form, that references between steps point at something real. If it finds errors, it asks the engine to re-propose once. That way a trivial mistake is fixed without even starting execution.

When it acts: between proposal and execution. On by default.

L3

Engine

propose · execute · recover · stateless

File: runtime/engine/{proposer,executor,recovery,terminator}.py.

What it does: the heart of the engine, four stateless components working in sequence. The Proposer asks the model for the whole plan in a single call. The Executor carries it out step by step, fully deterministically. Recovery kicks in if a step fails: it classifies the error and tries an alternative. The Terminator is the last honest resort: if there’s no way out, it tells the user what’s missing.

When it acts: only if neither Fastpath nor Autopath recognized the request.

Two layers with memory, two without. The backbone of the engine is this asymmetry: only L0 (fastpath) and L1 (autopath) hold state that persists and grows with use; they are the system’s memory. The fastpath recognizes the same request already solved; the autopath recognizes a plan generalized to a cluster of related requests, promoted by positive feedback. The Validator (L2) and the engine (L3), by contrast, are stateless: they re-check and rebuild the plan from scratch every time, remembering no past turns. Learning is the memory’s job; reasoning is the engine’s.

A single entry point. All of this is orchestrated by runtime/engine/dispatch.py, which exposes a single turn function and annotates which layer answered (fastpath, autopath, engine, recovery or terminator). The dispatcher knows nothing about application domains: it only orchestrates the layers. Adding a new engine or a new recovery strategy doesn’t require touching the other layers.
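
The cascade behind that single entry point can be pictured with a minimal sketch. It is not the contents of dispatch.py: the name run_turn comes from the turn diagram, while TurnResult, the layer signature and the toy layers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    answer: str
    layer: str  # which layer answered: "fastpath", "autopath", "engine", ...


# Assumed layer shape: take the request, return an answer or None on a miss.
Layer = Callable[[str], Optional[str]]


def run_turn(request: str, layers: list[tuple[str, Layer]]) -> TurnResult:
    """Try each layer in order; the first one that answers ends the turn."""
    for name, layer in layers:
        answer = layer(request)
        if answer is not None:
            return TurnResult(answer=answer, layer=name)
    raise RuntimeError("the last layer must always produce an answer")


# Toy layers (hypothetical): the memory layers hit only on known requests.
known_shortcuts = {"what is in my inbox": "3 unread messages"}


def fastpath(request: str) -> Optional[str]:
    return known_shortcuts.get(request)


def autopath(request: str) -> Optional[str]:
    return None  # nothing learned yet in this toy setup


def engine(request: str) -> Optional[str]:
    return f"planned and executed: {request}"


LAYERS = [("fastpath", fastpath), ("autopath", autopath), ("engine", engine)]

hit = run_turn("what is in my inbox", LAYERS)
miss = run_turn("find the spam mail", LAYERS)
print(hit.layer, "|", miss.layer)
```

The dispatcher only walks a list, so adding a layer means adding an entry to that list rather than editing the others.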

4. The cascade through a turn

Here is the full path, from the user’s message to the final answer. Read it top to bottom: each level is tried only if the one above hasn’t already resolved.

 USER: "find the spam mail and move it to trash"
   |
   v
 +-------------------+
 | fast_path (pre)   |  regex on trivial patterns ("what time is it", "where am I")
 | (zero LLM)        |  --> match? answer in ~50ms and stop
 +-------------------+
   | (no match)
   v
 +-------------------+
 | intent_extractor  |  fast-tier LLM ~370ms
 | verb + object +   |  --> (verb="move", object="messages",
 | keywords          |       keywords=["spam","trash"])
 +-------------------+
   |
   v
+--------------------------------------------------------------+
| ENGINE (engine/dispatch.run_turn)                            |
|                                                              |
|  +----------------+                                          |
|  | L0 Fastpath    |  hash (<5ms) + BGE-M3 cosine (<150ms)    |
|  | lookup         |  --> self-learned shortcut ? execute     |
|  +----------------+                                          |
|     | (miss)                                                 |
|     v                                                        |
|  +----------------+                                          |
|  | L1 Autopath    |  semantic match + intent_hash            |
|  | lookup         |  --> learned autopath ? execute (no LLM) |
|  +----------------+                                          |
|     | (miss)                                                 |
|     v                                                        |
|  +----------------+                                          |
|  | L3 Proposer    |  wise-tier LLM, 1-shot                   |
|  | propose        |  --> plan JSON {steps, fillers,          |
|  | (Qwen 35B-A3B) |                 final_message}           |
|  +----------------+                                          |
|     |                                                        |
|     v                                                        |
|  +----------------+                                          |
|  | L2 Validator   |  typecheck plan; error? --> re-propose   |
|  +----------------+                                          |
|     |                                                        |
|     v                                                        |
|  +----------------+                                          |
|  | Executor       |  for each step:                          |
|  | (deterministic)|  - resolve from_step + FILLER + RUNTIME  |
|  +----------------+  - invoke_executor + Vaglio + accumulate |
|     |                                                        |
|     v                                                        |
|  +----------------+                                          |
|  | render final   |  template "Found {N} mail, moved"        |
|  +----------------+                                          |
+--------------------------------------------------------------+
   |
   v  (ok? yes --> Autopath records + answers user)
   |
   |  (error? yes --> Recovery)
   v
 +-------------------+
 | Recovery          |  classify error: wrong_tool / wrong_args /
 | classify + retry  |  missing_input
 |                   |  --> re-propose excluding the failed tool
 |                   |  --> execute again
 +-------------------+
   |
   v  (ok? yes --> answer)
   |
   |  (out_of_scope or recovery failed? --> Terminator)
   v
 +-------------------+
 | Terminator        |  honest explanation: cause + suggested action
 | honest dead-end   |  --> record the gap in terminator_log.sqlite
 |                   |  --> "Can't resolve: X. To proceed: Y."
 +-------------------+
   |
   v
 USER: final message (answer or action request)

A note on the order. The Validator is layer 2 by numbering, but it acts after the proposal: its natural place is between the Proposer and the Executor, because it checks the plan that was just proposed. The numbering reflects design priority, not the exact temporal order.

5. The plan: what the engine proposes

The Proposer doesn’t produce free text: it produces a structured object we call the framework (the plan). It has three parts:

{
  "steps": [
    {"tool": "find_messages",
     "args": {"folder": "INBOX", "query": "is:unread"}},
    {"tool": "classify_entries",
     "args": {"from_step": 1, "dimension": "spam"}},
    {"tool": "filter_entries",
     "args": {"from_step": 2, "where_field": "spam", "where_value": "spam"}},
    {"tool": "move_messages",
     "args": {"from_step": 3, "dst_folder": "${FILLER:trash_folder}"}}
  ],
  "fillers": {
    "trash_folder": {
      "prompt": "What is the trash folder for this account?",
      "default": "Trash",
      "tier": "fast"
    }
  },
  "final_message": "Moved ${step4.ok_count} mail to trash."
}

Inside the arguments, four kinds of placeholder appear, which the Executor resolves deterministically at the right moment:

| Placeholder | Meaning |
|---|---|
| from_step: N | Take the entries produced by step N (1-based) and pass them to this step. |
| ${stepN.field} | Extract a field from step N’s result (supports nested paths and projections). Used mostly in the final message. |
| ${FILLER:name} | A slot filled on the fly by a small fast-tier model call (cached), or by the default value. |
| ${RUNTIME:key} | Turn context value: actor (who is speaking), lang, channel. |

Why the proposal is reliable. The model call can be constrained by a deterministic grammar (GBNF): the decoder rejects, in real time, any token that would violate the schema, so the plan can’t come out malformed. Also, the model doesn’t see every tool in the system, only the ones most relevant to the request (about a dozen, selected by affinity): the prompt stays short and the proposal faster and more precise.
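
To make the string placeholders from the table concrete, here is a minimal sketch of how they could be resolved. It is an illustration under assumptions, not the Executor’s code: the resolve function and its arguments are hypothetical, fillers fall back to their default instead of calling a model, and projections are not handled.

```python
import re


def resolve(text, step_results, fillers, runtime):
    """Replace ${stepN.field}, ${FILLER:name} and ${RUNTIME:key} in a string."""

    def lookup_step(match):
        index = int(match.group(1))            # steps are 1-based
        value = step_results[index - 1]
        for part in match.group(2).split("."):
            value = value[part]                # walk nested paths
        return str(value)

    def lookup_filler(match):
        # Simplification: always use the default, never a model call.
        return str(fillers[match.group(1)]["default"])

    def lookup_runtime(match):
        return str(runtime[match.group(1)])

    text = re.sub(r"\$\{step(\d+)\.([\w.]+)\}", lookup_step, text)
    text = re.sub(r"\$\{FILLER:(\w+)\}", lookup_filler, text)
    text = re.sub(r"\$\{RUNTIME:(\w+)\}", lookup_runtime, text)
    return text


# Toy data shaped like the plan above (values are made up).
step_results = [{}, {}, {}, {"ok_count": 7}]
fillers = {"trash_folder": {"default": "Trash"}}
runtime = {"lang": "en"}

message = resolve("Moved ${step4.ok_count} mail to trash.",
                  step_results, fillers, runtime)
folder = resolve("${FILLER:trash_folder}", step_results, fillers, runtime)
print(message, "|", folder)
```

Because every substitution is a plain lookup, the same plan and the same step results always render the same message.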

6. Layer 0 — Fastpath: self-learned shortcuts

The Fastpath is the fast lane. Every turn that the full plan completes successfully creates a shortcut on its own, with no approval step, because the chains consist of executors that are already vetted and tested. From then on, that request (and its near variants) skips everything else and is served directly. Shortcuts age out on their own (never reused, stale, or over the total cap), die when an executor supersedes them or disappears, and can be removed by hand from the admin console.

Recognition happens in two stages. First an exact compare on the phrase fingerprint (deterministic, under 5 ms, no model at all). If that isn’t enough, a meaning compare: the request is turned into a vector with BGE-M3 and compared by cosine against the saved shortcuts (under 150 ms). A self-learned shortcut always wins, even over an automatically-learned autopath.
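
The two stages can be sketched as follows. This is an illustration under assumptions, not the contents of fastpath.py: FastpathStore, the phrase normalization and the 0.9 threshold are hypothetical, and tiny hand-written vectors stand in for BGE-M3 embeddings.

```python
import hashlib
import math


def fingerprint(phrase: str) -> str:
    """Exact-match key: hash of the normalized phrase (normalization is assumed)."""
    normalized = " ".join(phrase.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FastpathStore:
    def __init__(self, threshold: float = 0.9):  # hypothetical threshold
        self.threshold = threshold
        self.by_hash = {}    # fingerprint -> saved plan
        self.vectors = []    # (embedding, saved plan)

    def learn(self, phrase, embedding, plan):
        self.by_hash[fingerprint(phrase)] = plan
        self.vectors.append((embedding, plan))

    def lookup(self, phrase, embedding):
        # Stage 1: exact fingerprint, no embedding needed.
        plan = self.by_hash.get(fingerprint(phrase))
        if plan is not None:
            return plan, "hash"
        # Stage 2: closest saved shortcut by cosine similarity.
        best_plan, best_score = None, 0.0
        for vector, candidate in self.vectors:
            score = cosine(embedding, vector)
            if score > best_score:
                best_plan, best_score = candidate, score
        if best_score >= self.threshold:
            return best_plan, "cosine"
        return None, "miss"


store = FastpathStore()
store.learn("Move spam to trash", [1.0, 0.0, 0.0], "plan-A")
print(store.lookup("move  spam to TRASH", [0.0, 1.0, 0.0]))   # same phrase after normalization
print(store.lookup("trash the spam", [0.95, 0.05, 0.0]))      # near variant
print(store.lookup("what time is it", [0.0, 1.0, 0.0]))       # unrelated
```

The cheap exact check runs first, so the embedding comparison is paid only when the wording differs.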

7. Layer 1 — Autopath: autopaths learned from feedback

The Autopath is the memory that grows on its own, with no one writing rules by hand. It keeps three tables in its sqlite store:

| Table | What it holds | When it’s written |
|---|---|---|
| autopaths | Plans that worked, indexed by request meaning and grouped by semantic cluster. For each: the plan, usage counters, a composite score, the status (champion/challenger). | Promoted automatically when a plan proves itself on the same kind of request. |
| anti_autopaths | Plans that failed repeatedly. For each: the reason and an expiry (about 30 days). | When a plan fails repeatedly on the same intent. At expiry, the system tries again. |
| observations | A record of every turn: intent, executed plan, latency, semantic vector. Append-only. | Always, at the end of a turn. It’s the source of truth for promoting or retiring autopaths. |

The search proceeds in two stages: first by semantic similarity (the request falls into the cluster of related requests already seen and the «champion» plan of that cluster is served), then, if needed, by exact intent match. When several plans compete for the same cluster, a champion/challenger scheme applies: the challenger must prove itself better before it can take the champion’s place.
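
A champion/challenger rule of this kind can be sketched in a few lines. The text does not define the composite score or the promotion criteria, so everything here is an assumption: the score is a plain success rate, and min_uses and margin are hypothetical knobs.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    plan_id: str
    successes: int = 0
    failures: int = 0

    @property
    def uses(self) -> int:
        return self.successes + self.failures

    @property
    def score(self) -> float:
        # Hypothetical stand-in for the composite score: success rate.
        return self.successes / self.uses if self.uses else 0.0


def pick_champion(champion, challenger, min_uses=5, margin=0.05):
    """Keep the champion unless the challenger has enough evidence
    and a clearly better score."""
    if challenger.uses < min_uses:
        return champion
    if challenger.score > champion.score + margin:
        return challenger
    return champion


champion = Candidate("plan-A", successes=8, failures=2)
too_new = Candidate("plan-B", successes=2, failures=0)
proven = Candidate("plan-C", successes=19, failures=1)

print(pick_champion(champion, too_new).plan_id)
print(pick_champion(champion, proven).plan_id)
```

Requiring both a minimum number of uses and a margin keeps a lucky newcomer from displacing a plan that has worked many times.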

The practical effect. The first time you ask «find the spam mail and move it to trash», the model takes a few seconds to generate the plan. The second time, the Autopath already has that plan in memory: the pipeline starts in a few milliseconds. From then on, that request no longer goes through the language model.

8. Layer 2 — the plan check

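Before executing a freshly proposed plan, the Validator runs it through a sieve, purely deterministically and without the model. It verifies that the cited tools exist, that the arguments have a valid form, and that references between steps point at something real.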

If it finds even one error, it doesn’t start execution: it asks the Proposer to re-propose once, excluding the wrong plan. That way a typo or a crooked argument is fixed at no cost, without wasting a heavier recovery call. It’s on by default.
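
The checks the Validator performs (tools exist, arguments well formed, step references valid) can be sketched as a pure function over the plan. This is a minimal illustration, not the contents of validator.py: validate_plan, the error strings and the toy tool set are assumptions.

```python
def validate_plan(plan, known_tools):
    """Return a list of error strings; an empty list means the plan may run."""
    errors = []
    for position, step in enumerate(plan.get("steps", []), start=1):
        tool = step.get("tool")
        if tool not in known_tools:
            errors.append(f"step {position}: unknown tool {tool!r}")
        args = step.get("args")
        if not isinstance(args, dict):
            errors.append(f"step {position}: args must be an object")
            continue
        ref = args.get("from_step")
        # A reference must be a 1-based index of an earlier step.
        if ref is not None and not (isinstance(ref, int) and 1 <= ref < position):
            errors.append(
                f"step {position}: from_step {ref!r} does not point at an earlier step"
            )
    return errors


KNOWN_TOOLS = {"find_messages", "classify_entries", "filter_entries", "move_messages"}

good_plan = {"steps": [
    {"tool": "find_messages", "args": {"folder": "INBOX"}},
    {"tool": "move_messages", "args": {"from_step": 1, "dst_folder": "Trash"}},
]}
bad_plan = {"steps": [
    {"tool": "find_mesages", "args": {"folder": "INBOX"}},          # typo in the tool name
    {"tool": "move_messages", "args": {"from_step": 5}},            # dangling reference
]}

print(validate_plan(good_plan, KNOWN_TOOLS))
print(validate_plan(bad_plan, KNOWN_TOOLS))
```

Returning the full error list, rather than stopping at the first problem, gives a re-proposal everything it needs to avoid in one pass.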

9. Layer 3 — Proposer, Executor, Recovery, Terminator

Proposer — the one-shot proposal

The Proposer is the only component that actually talks to the language model. It receives the request and the extracted intent and produces, in a single call, the whole plan. No step-by-step reasoning, no iterative tool-calls: a single structured object.

The Proposer comes in selectable variants, chosen through the METNOS_ENGINE setting. The four-layer figure shows the selector as simple | metis | frontier.

Executor — execution without doubts

The Executor is the most rigorous component: zero language model in the main loop, only deterministic Python. For each step: it resolves references to previous steps (from_step), fills the ${FILLER:…} slots (a small fast call with cache) and the context values ${RUNTIME:…}, validates the arguments, calls the tool’s executor, passes the result to the Vaglio (the safety judge) and accumulates. At the end it composes the final message from the template with the ${stepN.field} fields.

A hidden virtue: reproducibility. The same plan run on the same data snapshot always produces the same outcome. That means a past turn can be replayed, and tests can make precise assertions — impossible with a stochastic planner.
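
The reproducibility claim is easy to see in a stripped-down executor loop. The sketch below is an assumption-laden illustration, not executor.py: it handles only from_step, leaves out fillers, runtime values and the Vaglio, and uses two made-up tools (toy_find, toy_filter) that are not part of Metnos.

```python
def execute(plan, tools):
    """Run the steps in order; a step may consume the entries of an earlier one."""
    results = []                               # results[i] belongs to step i + 1
    for step in plan["steps"]:
        args = dict(step["args"])              # copy, so the plan is never mutated
        source = args.pop("from_step", None)
        entries = results[source - 1] if source is not None else None
        results.append(tools[step["tool"]](entries, **args))
    return results


# Toy tools (hypothetical): each takes the incoming entries plus its own arguments.
def toy_find(entries, folder):
    return [{"id": 1, "subject": "WIN A PRIZE"}, {"id": 2, "subject": "Meeting notes"}]


def toy_filter(entries, contains):
    return [entry for entry in entries if contains in entry["subject"]]


TOOLS = {"toy_find": toy_find, "toy_filter": toy_filter}
PLAN = {"steps": [
    {"tool": "toy_find", "args": {"folder": "INBOX"}},
    {"tool": "toy_filter", "args": {"from_step": 1, "contains": "PRIZE"}},
]}

first_run = execute(PLAN, TOOLS)
second_run = execute(PLAN, TOOLS)
print(first_run == second_run)
```

With no model in the loop, two runs over the same data can be compared with a plain equality check, which is what makes replay and precise test assertions possible.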

Recovery — the targeted retry

If a step fails, Recovery steps in. Its first job is to understand what kind of error it is, by reading the structured error class the executor returns (no parsing of multilingual text):

| Class | When | Strategy |
|---|---|---|
| wrong_tool | The chosen tool wasn’t right: it failed or is semantically wrong. | Re-propose a different plan, excluding that tool. |
| wrong_args | The tool was right but the arguments were malformed: an empty pipeline, a step limit hit, a reference pointing nowhere. | Re-propose with canonical arguments and explicit references. |
| missing_input | A prerequisite is missing: an index not built, a non-existent folder, a filter producing zero results. | Re-propose an alternative; if the user is needed, hand off to the Terminator. |
| out_of_scope | A non-recoverable error: it needs a physical action from the user (e.g. sharing their location) or a capability is missing entirely. | Recovery does not step in: it hands off immediately to the Terminator. |

Recovery attempts one alternative only, excluding both the failed plan and the tool that caused the problem. No infinite loops: if the alternative doesn’t succeed, the floor passes to the Terminator.
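
The "one alternative, then hand off" rule can be sketched as below. The four class names come from the table; StepError, run_with_recovery and the callback signatures are assumptions made for illustration, not the contents of recovery.py.

```python
RECOVERABLE = {"wrong_tool", "wrong_args", "missing_input"}


class StepError(Exception):
    """Structured failure: carries an error class instead of free text."""

    def __init__(self, error_class, tool):
        super().__init__(f"{error_class} in {tool}")
        self.error_class = error_class
        self.tool = tool


def run_with_recovery(plan, execute, repropose):
    """Return (layer, result); at most one alternative plan is tried."""
    try:
        return "engine", execute(plan)
    except StepError as err:
        if err.error_class not in RECOVERABLE:        # e.g. out_of_scope
            return "terminator", None
        alternative = repropose(exclude_plan=plan, exclude_tool=err.tool)
    try:
        return "recovery", execute(alternative)
    except StepError:
        return "terminator", None                     # no second retry


# Toy callbacks (hypothetical): a plan is just a list of tool names here.
def demo_execute(plan):
    if "broken_tool" in plan:
        raise StepError("wrong_tool", "broken_tool")
    return f"ran {plan}"


def demo_repropose(exclude_plan, exclude_tool):
    return [tool for tool in ["broken_tool", "spare_tool"] if tool != exclude_tool]


outcome = run_with_recovery(["broken_tool"], demo_execute, demo_repropose)
print(outcome)
```

There is no loop anywhere in the function, so the bound on retries is structural rather than a counter that could be misconfigured.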

Terminator — the honest dead-end

The Terminator is the subtlest piece of the system. It’s not an «error handler»: it’s an honest recognition of the limit. When its turn comes, it means everything else has tried and failed. The Terminator doesn’t fake success: it explains the cause to the user and suggests a concrete action, and it records the gap in a store (terminator_log.sqlite) with a counter of how many times it has recurred.

The cause and action messages aren’t hardcoded strings: they go through the multilingual dictionary, so the user reads them in their own language. The turn always closes with a readable answer, never with silence or a raw error.

The gap is evolutionary. When the user resolves the missing piece (shares their location, enables an autopath, builds an index), the same request returns to the normal cascade and typically becomes a learned autopath after a few confirmations. Every recognized dead-end is a chance to grow: the occurrence count is visible in the admin panel and helps decide what to fill structurally.
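
Recording a gap with a recurrence counter could look like the sketch below. The store name terminator_log.sqlite and the message shape come from the text; the gaps table, its columns and the record_gap function are assumptions, and the message is built in plain English here instead of going through the multilingual dictionary.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open the gap log; the schema is a hypothetical example."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS gaps ("
        " cause TEXT PRIMARY KEY,"
        " action TEXT NOT NULL,"
        " occurrences INTEGER NOT NULL)"
    )
    return db


def record_gap(db, cause, action):
    """Bump the counter for a known gap, or insert it on first sight."""
    updated = db.execute(
        "UPDATE gaps SET occurrences = occurrences + 1 WHERE cause = ?",
        (cause,),
    ).rowcount
    if updated == 0:
        db.execute(
            "INSERT INTO gaps (cause, action, occurrences) VALUES (?, ?, 1)",
            (cause, action),
        )
    db.commit()
    return f"Can't resolve: {cause}. To proceed: {action}."


db = open_log()
record_gap(db, "location not shared", "share your location")
message = record_gap(db, "location not shared", "share your location")
count = db.execute(
    "SELECT occurrences FROM gaps WHERE cause = ?", ("location not shared",)
).fetchone()[0]
print(message, count)
```

Keying the table on the cause is what turns repeated dead-ends into a single row with a growing counter, the figure an admin panel would sort by.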

10. How it learns and speeds up

Under every chat answer, the user sees small feedback buttons.

Pressing anything is optional: if a pipeline reaches the end without errors, the Autopath counts it as a positive observation anyway. Explicit feedback speeds things up and refines them, but the system learns from silence too. The more it’s used, the more often requests are served from memory instead of the model — and therefore the faster.

11. What changes for the user

1. First requests are faster

Even on the first try, the engine doesn’t query the step-by-step planner but makes a single proposal. It feels like a system that «thinks for a moment» instead of «freezing for a minute».

2. Repeated requests are instant

After a few turns of the same family of requests, the autopath is learned and the following turns start from memory. The answer arrives in a fraction of a second: that’s the difference between «an assistant» and «an immediate reaction».

3. Failures become informative

When something can’t be done, the Terminator says exactly what is needed. No vague «generic error» messages, no loops of useless retries: a clear sentence, an operational hint, and — where appropriate — a dialog to resolve the missing piece on the spot.

12. Why it beats LLM-in-loop frameworks

Frameworks for «LLM agent workflows» (LangGraph and the like) cover part of the same problem: reducing the model’s in-loop variance by giving structure to tasks. But they have limits this engine doesn’t:

| Aspect | LLM-in-loop frameworks | Metnos engine |
|---|---|---|
| Origin of the workflows | hand-written, manual maintenance | optional seed, but they grow on their own from feedback (✓ ✗ ↺) |
| Learning | no | yes: a memory populated from turns that went well |
| Anti-error | no | anti_autopaths table that excludes failed paths |
| Structured recovery | generic retries | 4 orthogonal error classes + targeted re-proposal |
| Honest dead-end | infinite retry or crash | Terminator: classifies + suggests an action |
| Traceability | none, or text logs | observations record + admin panel |
| Name consistency | free names | closed vocabulary (verb + object + qualifier) guaranteed |

The analogy that makes it click

Traditional LLM-in-loop frameworks are like artists with a blank canvas: every turn they reinvent the composition. Lovely creativity, but the risk of grabbing the wrong tool is real and every piece costs.

The Metnos engine is like a craftsman with a sketchbook: the model sketches the plan once (focused, constrained creativity), and from then on the craftsman reopens the book at the right page. The model’s creativity is still there, but it’s confined to the moment of invention: execution is a faithful reproduction of the sketch. That avoids hallucinations during execution, gains speed, and reaches zero cost once the book is already rich.

13. Honest numbers

On a sample of real requests (mail, files, web, location, time, multi-turn dialogs, scheduling, contacts), comparing the iterative cloud planner with the local single-proposal engine:

| Mode | Coverage | Average latency | Model | Cost |
|---|---|---|---|---|
| Iterative planner (cloud, step-by-step) | equivalent | ~76 s | cloud frontier | paid |
| Local engine (single proposal + memory) | equivalent | ~12 s cold, <1 s from memory | local Qwen 3.6 35B-A3B | 0 |

Roughly six times faster on a cold request (~12 s against ~76 s), with equivalent coverage and zero cost (the model runs locally on unified-memory hardware). Latency drops further as the memory fills up: recurring requests start from cache, in under a second.
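
How much the memory helps depends on how often it hits. The sketch below works through that arithmetic with the table’s latencies; the hit rates are hypothetical inputs, not measurements, and the "<1 s" memory path is taken at its 1 s upper bound.

```python
COLD_S = 12.0     # ~12 s for a cold request (from the table)
MEMORY_S = 1.0    # "<1 s from memory", taken as an upper bound
CLOUD_S = 76.0    # ~76 s for the iterative cloud planner (from the table)


def expected_latency(hit_rate: float) -> float:
    """Average latency when a fraction hit_rate of turns is served from memory."""
    return hit_rate * MEMORY_S + (1.0 - hit_rate) * COLD_S


for rate in (0.0, 0.5, 0.9):   # hypothetical hit rates
    latency = expected_latency(rate)
    print(f"hit rate {rate:.0%}: {latency:.1f} s average, "
          f"{CLOUD_S / latency:.1f}x faster than the cloud planner")
```

With no hits the engine sits at the cold figure, and the average falls linearly toward the memory latency as the hit rate rises.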

Honesty about the limit. Some requests can’t be satisfied in any mode: without a shared location you can’t say where you are, without an enabled autopath you can’t search that service. In those cases the Terminator collects the failure and explains to the user what’s needed. No one «wins» falsely: the failure is classified and the user knows what to do.

14. Open questions