When a request arrives, Metnos no longer calls the planner five or six times in a row. First it tries the shortcuts it already knows; if those don’t apply, it asks the local language model to propose the whole plan in a single call; then it executes it deterministically, and if something goes wrong it attempts a targeted recovery or honestly explains what’s missing. Four layers, one model, never the cloud.
Today’s engine starts from two simple insights.
First insight: a single proposal. Instead of querying the model at every step, query it once and ask for the whole plan in one shot: the complete list of steps, the links between them, and the final message. The model sees the whole problem at once, builds internal coherence (it knows step 4 needs step 3’s output) and produces a far more stable plan. The execution that follows is pure deterministic mechanics: no more dice.
Second insight: remember what works. When a plan reaches the end successfully, the system keeps it, indexed by the «meaning» of the request. The next time a request of the same kind arrives, the plan is already there: it starts in a few milliseconds, without bothering the model. The system becomes its own planner — it learns to stop asking.
These two insights turn into four layers, tried in cascade. The distinction that matters is between what remembers and what reasons: the first two layers (L0 and L1) are stateful memory — a store that grows with use and answers without bothering the model when it recognizes the request. The other two (L2 and L3) are stateless: they keep nothing from one turn to the next, but re-check the plan and, when needed, build it from scratch. The third is a check, the fourth is the engine proper that proposes, executes, recovers and — if there’s no way out — honestly admits the limit.
Every turn goes through the layers top to bottom. As soon as one layer resolves, the turn ends: the layers below aren’t even touched. The two memory layers (L0 and L1) are very fast and don’t use the language model. Of the two stateless layers (L2 and L3), only the fourth actually calls the model, and only when needed.
L0 Fastpath. File: runtime/engine/fastpath.py · store fastpaths.sqlite.
What it does: the first of the two stateful layers. It recognizes requests that the full plan has already solved successfully: a shortcut is produced automatically on every successful turn, with two safety valves (manual removal from the admin console, and aging). Recognition is two-stage: first an exact fingerprint compare (hash, under 5 ms, zero model), then a meaning compare with BGE-M3 embeddings (cosine, under 150 ms). On a match, it runs the saved plan immediately.
When it wins: always, if there’s a match. A self-learned shortcut takes precedence over everything else.
L1 Autopath. File: runtime/engine/autopath.py · store autopath.sqlite.
What it does: the second stateful layer, the memory that grows on its own. It keeps the plans that worked, indexed by the meaning of the request. It searches first by semantic similarity (clusters of related requests) and then by exact intent match. If it recognizes the request, it runs the already-proven plan without calling the model.
When it acts: after Fastpath, only if the intent is complete (verb + object recognized). It also records every turn so it can learn in the future.
L2 Validator. File: runtime/engine/validator.py.
What it does: a deterministic check of a freshly proposed plan, before running it and without the model. It keeps nothing between turns: it is the first of the two stateless layers. It verifies that the cited tools exist, that arguments have valid form, that references between steps point at something real. If it finds errors, it asks the engine to re-propose once. That way a trivial mistake is fixed without even starting execution.
When it acts: between proposal and execution. On by default.
L3 Engine. File: runtime/engine/{proposer,executor,recovery,terminator}.py.
What it does: the heart of the engine, four stateless components working in sequence. The Proposer asks the model for the whole plan in a single call. The Executor carries it out step by step, fully deterministically. Recovery kicks in if a step fails: it classifies the error and tries an alternative. The Terminator is the last honest resort: if there’s no way out, it tells the user what’s missing.
When it acts: only if neither Fastpath nor Autopath recognized the request.
The layers are tied together by runtime/engine/dispatch.py, which exposes a single turn function and annotates which layer answered (fastpath, autopath, engine, recovery or terminator). The dispatcher knows nothing about application domains: it only orchestrates the layers. Adding a new engine or a new recovery strategy doesn’t require touching the other layers.
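The cascade can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual dispatch.py: the `TurnResult` type, the `run_turn` signature used here and the toy layers are hypothetical. Only the idea comes from the text: try the layers in order, stop at the first one that resolves, and annotate which layer answered.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical contract: a layer takes the request and returns an answer,
# or None when it does not recognize the request.
LayerFn = Callable[[str], Optional[str]]


@dataclass
class TurnResult:
    answer: str
    layer: str  # which layer answered, e.g. "fastpath", "autopath", "engine"


def run_turn(request: str, layers: list[tuple[str, LayerFn]]) -> TurnResult:
    """Try each layer in order; the first one that resolves ends the turn."""
    for name, layer in layers:
        answer = layer(request)
        if answer is not None:
            return TurnResult(answer=answer, layer=name)
    # Assumption of this sketch: when nothing resolves, the turn still closes
    # with a readable answer attributed to the terminator.
    return TurnResult(answer="Can't resolve the request.", layer="terminator")


# Toy layers standing in for the real ones.
known_shortcuts = {"what time is it": "It is 10:00."}
layers = [
    ("fastpath", lambda req: known_shortcuts.get(req)),
    ("autopath", lambda req: None),  # nothing learned yet
    ("engine", lambda req: f"plan executed for: {req}"),
]

print(run_turn("what time is it", layers).layer)     # fastpath
print(run_turn("move spam to trash", layers).layer)  # engine
```

Because the dispatcher only sees a list of named callables, adding or replacing a layer in this sketch means changing one list entry and nothing else.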
Here is the full path, from the user’s message to the final answer. Read it top to bottom: each level is tried only if the one above hasn’t already resolved.
USER: "find the spam mail and move it to trash"
|
v
+-------------------+
| fast_path (pre) | regex on trivial patterns ("what time is it", "where am I")
| (zero LLM) | --> match? answer in ~50ms and stop
+-------------------+
| (no match)
v
+-------------------+
| intent_extractor | fast-tier LLM ~370ms
| verb + object + | --> (verb="move", object="messages",
| keywords | keywords=["spam","trash"])
+-------------------+
|
v
+--------------------------------------------------------------+
| ENGINE (engine/dispatch.run_turn) |
| |
| +----------------+ |
| | L0 Fastpath | hash (<5ms) + BGE-M3 cosine (<150ms) |
| | lookup | --> self-learned shortcut ? execute |
| +----------------+ |
| | (miss) |
| v |
| +----------------+ |
| | L1 Autopath | semantic match + intent_hash |
| | lookup | --> learned autopath ? execute (no LLM) |
| +----------------+ |
| | (miss) |
| v |
| +----------------+ |
| | L3 Proposer | wise-tier LLM, 1-shot |
| | propose | --> plan JSON {steps, fillers, |
| | (Qwen 35B-A3B) | final_message} |
| +----------------+ |
| | |
| v |
| +----------------+ |
| | L2 Validator | typecheck plan; error? --> re-propose |
| +----------------+ |
| | |
| v |
| +----------------+ |
| | Executor | for each step: |
| | (deterministic)| - resolve from_step + FILLER + RUNTIME |
| +----------------+ - invoke_executor + vaglio + accumulate|
| | |
| v |
| +----------------+ |
| | render final | template "Found {N} mail, moved" |
| +----------------+ |
+--------------------------------------------------------------+
|
v (ok? yes --> Autopath records + answers user)
|
| (error? yes --> Recovery)
v
+-------------------+
| Recovery | classify error: wrong_tool / wrong_args /
| classify + retry | missing_input
| | --> re-propose excluding the failed tool
| | --> execute again
+-------------------+
|
v (ok? yes --> answer)
|
| (out_of_scope or recovery failed? --> Terminator)
v
+-------------------+
| Terminator | honest explanation: cause + suggested action
| honest dead-end | --> record the gap in terminator_log.sqlite
| | --> "Can't resolve: X. To proceed: Y."
+-------------------+
|
v
USER: final message (answer or action request)
The Proposer doesn’t produce free text: it produces a structured object we call the framework (the plan). It has three parts:
{
  "steps": [
    {"tool": "find_messages",
     "args": {"folder": "INBOX", "query": "is:unread"}},
    {"tool": "classify_entries",
     "args": {"from_step": 1, "dimension": "spam"}},
    {"tool": "filter_entries",
     "args": {"from_step": 2, "where_field": "spam", "where_value": "spam"}},
    {"tool": "move_messages",
     "args": {"from_step": 3, "dst_folder": "${FILLER:trash_folder}"}}
  ],
  "fillers": {
    "trash_folder": {
      "prompt": "What is the trash folder for this account?",
      "default": "Trash",
      "tier": "fast"
    }
  },
  "final_message": "Moved ${step4.ok_count} mail to trash."
}
Inside the arguments, four kinds of placeholder appear, which the Executor resolves deterministically at the right moment:
| Placeholder | Meaning |
|---|---|
| from_step: N | Take the entries produced by step N (1-based) and pass them to this step. |
| ${stepN.field} | Extract a field from step N’s result (supports nested paths and projections). Used mostly in the final message. |
| ${FILLER:name} | A slot filled on the fly by a small fast-tier model call (cached), or by the default value. |
| ${RUNTIME:key} | Turn context value: actor (who is speaking), lang, channel. |
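The string placeholders can be resolved with a single regular-expression pass. The sketch below is an illustration, not the engine's resolver: the `resolve` function and its arguments are hypothetical, fillers are resolved only from their default value (the cached fast-tier call is left out), and projections are not supported, only dotted nested paths.

```python
import re
from typing import Any

# Matches ${stepN.path}, ${FILLER:name} and ${RUNTIME:key}.
PLACEHOLDER = re.compile(r"\$\{(step(\d+)\.([\w.]+)|FILLER:(\w+)|RUNTIME:(\w+))\}")


def resolve(
    template: str,
    results: dict[int, Any],
    fillers: dict[str, dict],
    runtime: dict[str, str],
) -> str:
    """Replace every placeholder in `template` with its resolved value."""

    def substitute(match: re.Match) -> str:
        _, step_no, path, filler, runtime_key = match.groups()
        if step_no is not None:
            # ${stepN.a.b}: walk the nested path inside step N's result.
            value = results[int(step_no)]
            for part in path.split("."):
                value = value[part]
            return str(value)
        if filler is not None:
            # Assumption: fall back to the declared default, no model call.
            return str(fillers[filler]["default"])
        return str(runtime[runtime_key])

    return PLACEHOLDER.sub(substitute, template)


message = resolve(
    "Moved ${step4.ok_count} mail to ${FILLER:trash_folder} for ${RUNTIME:actor}.",
    results={4: {"ok_count": 3}},
    fillers={"trash_folder": {"default": "Trash"}},
    runtime={"actor": "anna"},
)
print(message)  # Moved 3 mail to Trash for anna.
```

A missing step, field or key raises `KeyError` here, which is the kind of dangling reference a pre-execution check is meant to catch.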
The Fastpath is the fast lane. Every turn successfully completed by the full plan creates its shortcut automatically, with no approval step, because the chains consist of executors that are already vetted and tested. From then on, that request (and its near variants) skips everything else and is served directly. Shortcuts age out on their own (never reused, stale, total cap), die when an executor supersedes them or disappears, and can be removed by hand from the admin console.
Recognition happens in two stages. First an exact compare on the phrase fingerprint (deterministic, under 5 ms, no model at all). If that isn’t enough, a meaning compare: the request is turned into a vector with BGE-M3 and compared by cosine against the saved shortcuts (under 150 ms). A self-learned shortcut always wins, even over an automatically-learned autopath.
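The two-stage recognition can be illustrated as follows. This is a sketch under assumptions: the normalization used for the fingerprint, the 0.9 similarity threshold and the shortcut record layout are hypothetical, and a toy bag-of-words function stands in for BGE-M3 so the example runs without a model.

```python
import hashlib
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def fingerprint(text: str) -> str:
    """Exact fingerprint of the phrase (lowercasing and whitespace folding are assumptions)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(
    request: str,
    shortcuts: list[dict],
    embed: Callable[[str], Vector],
    threshold: float = 0.9,  # hypothetical cut-off
) -> Optional[dict]:
    # Stage 1: exact fingerprint compare, no model involved.
    wanted = fingerprint(request)
    for shortcut in shortcuts:
        if shortcut["hash"] == wanted:
            return shortcut
    # Stage 2: meaning compare by cosine similarity.
    vector = embed(request)
    best = max(shortcuts, key=lambda s: cosine(vector, s["vector"]), default=None)
    if best is not None and cosine(vector, best["vector"]) >= threshold:
        return best
    return None


VOCAB = ["move", "spam", "trash", "time"]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: counts of a few known words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


phrase = "move spam to trash"
shortcuts = [{"hash": fingerprint(phrase), "vector": toy_embed(phrase), "plan": "plan-A"}]

print(lookup("Move  spam to trash", shortcuts, toy_embed) is not None)       # exact stage
print(lookup("move the spam into trash", shortcuts, toy_embed) is not None)  # semantic stage
print(lookup("what time is it", shortcuts, toy_embed))                       # None
```

The cheap stage runs first so that a verbatim repeat never pays for an embedding.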
The Autopath is the memory that grows on its own, with no one writing rules by hand. It keeps three tables in its sqlite store:
| Table | What it holds | When it’s written |
|---|---|---|
| autopaths | Plans that worked, indexed by request meaning and grouped by semantic cluster. For each: the plan, usage counters, a composite score, the status (champion/challenger). | Promoted automatically when a plan proves itself on the same kind of request. |
| anti_autopaths | Plans that failed repeatedly. For each: the reason and an expiry (about 30 days). | When a plan fails repeatedly on the same intent. At expiry, the system tries again. |
| observations | A record of every turn: intent, executed plan, latency, semantic vector. Append-only. | Always, at the end of a turn. It’s the source of truth for promoting or retiring autopaths. |
The search proceeds in two stages: first by semantic similarity (the request falls into the cluster of related requests already seen and the «champion» plan of that cluster is served), then, if needed, by exact intent match. When several plans compete for the same cluster, a champion/challenger scheme applies: the challenger must prove itself better before it can take the champion’s place.
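A champion/challenger rule of this kind might look like the sketch below. The text does not say how the composite score is computed or what "proving itself better" requires, so everything concrete here is an assumption: the score is a plain success rate, and the challenger needs a minimum number of uses plus a margin over the champion.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    plan_id: str
    successes: int
    failures: int

    @property
    def uses(self) -> int:
        return self.successes + self.failures

    @property
    def score(self) -> float:
        # Hypothetical score: success rate. The real composite score is not specified.
        return self.successes / self.uses if self.uses else 0.0


def pick_champion(
    champion: Candidate,
    challenger: Candidate,
    min_uses: int = 5,     # hypothetical: evidence required before promotion
    margin: float = 0.05,  # hypothetical: how much better the challenger must be
) -> Candidate:
    """Promote the challenger only when it has clearly proven itself."""
    if challenger.uses >= min_uses and challenger.score > champion.score + margin:
        return challenger
    return champion


champion = Candidate("plan-A", successes=8, failures=2)
print(pick_champion(champion, Candidate("plan-B", 3, 0)).plan_id)   # plan-A (too few uses)
print(pick_champion(champion, Candidate("plan-B", 10, 0)).plan_id)  # plan-B
```

Requiring both evidence and a margin keeps a lucky first run from displacing a plan with a long track record.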
Before executing a freshly proposed plan, the Validator runs it through a sieve, purely deterministically and without the model:
Do the cited tools exist? Do the arguments have a valid form? Do the references between steps (from_step, ${stepN.…}) point at steps that exist?
If it finds even one error, it doesn’t start execution: it asks the Proposer to re-propose once, excluding the wrong plan. That way a typo or a crooked argument is fixed at no cost, without wasting a heavier recovery call. It’s on by default.
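A deterministic check of this kind fits in one small function. The sketch below is hypothetical, not validator.py: the `validate_plan` name, the error strings and the rule that `from_step` must point at an earlier step are assumptions made for illustration.

```python
import re


def validate_plan(plan: dict, known_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the plan may run."""
    errors: list[str] = []
    steps = plan.get("steps", [])
    for index, step in enumerate(steps, start=1):
        tool = step.get("tool")
        if tool not in known_tools:
            errors.append(f"step {index}: unknown tool {tool!r}")
        args = step.get("args")
        if not isinstance(args, dict):
            errors.append(f"step {index}: args must be an object")
            continue
        ref = args.get("from_step")
        # Assumption: a step may only read from a step that comes before it.
        if ref is not None and not (isinstance(ref, int) and 1 <= ref < index):
            errors.append(f"step {index}: from_step {ref!r} points at no earlier step")
    # ${stepN.field} references in the final message must name an existing step.
    for ref in re.findall(r"\$\{step(\d+)\.", plan.get("final_message", "")):
        if not 1 <= int(ref) <= len(steps):
            errors.append(f"final_message: step {ref} does not exist")
    return errors


tools = {"find_messages", "move_messages"}
good = {
    "steps": [
        {"tool": "find_messages", "args": {"folder": "INBOX"}},
        {"tool": "move_messages", "args": {"from_step": 1, "dst_folder": "Trash"}},
    ],
    "final_message": "Moved ${step2.ok_count} mail to trash.",
}
bad = {
    "steps": [{"tool": "delete_everything", "args": {"from_step": 5}}],
    "final_message": "Done: ${step3.ok_count}",
}
print(validate_plan(good, tools))  # []
print(len(validate_plan(bad, tools)))  # 3
```

Collecting every error instead of stopping at the first gives the re-proposal a complete picture of what was wrong.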
The Proposer is the only component that actually talks to the language model. It receives the request and the extracted intent and produces, in a single call, the whole plan. No step-by-step reasoning, no iterative tool-calls: a single structured object.
Variants are selectable through the METNOS_ENGINE setting.
The Executor is the most rigorous component: zero language model in the main loop, only deterministic Python. For each step: it resolves references to previous steps (from_step), fills the ${FILLER:…} slots (a small fast call with cache) and the context values ${RUNTIME:…}, validates the arguments, calls the tool’s executor, passes the result to the Vaglio (the safety judge) and accumulates. At the end it composes the final message from the template with the ${stepN.field} fields.
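The core of such a loop, reduced to from_step resolution and accumulation, might look like this. It is a simplified sketch: the `execute` function and the tool signature (entries first, then keyword arguments) are assumptions, and filler resolution, argument validation and the Vaglio check are deliberately left out.

```python
from typing import Callable

# Hypothetical tool contract: receives the incoming entries plus its arguments.
Tool = Callable[..., list]


def execute(plan: dict, tools: dict[str, Tool]) -> dict[int, list]:
    """Run the steps in order and keep every result, keyed by 1-based step number."""
    results: dict[int, list] = {}
    for number, step in enumerate(plan["steps"], start=1):
        args = dict(step["args"])  # copy, so the plan itself is never mutated
        source = args.pop("from_step", None)
        entries = results[source] if source is not None else []
        results[number] = tools[step["tool"]](entries, **args)
    return results


def find_messages(entries: list, folder: str) -> list:
    # Toy data standing in for a real mailbox.
    return [{"id": 1, "spam": "spam"}, {"id": 2, "spam": "ham"}]


def filter_entries(entries: list, where_field: str, where_value: str) -> list:
    return [e for e in entries if e.get(where_field) == where_value]


plan = {
    "steps": [
        {"tool": "find_messages", "args": {"folder": "INBOX"}},
        {"tool": "filter_entries",
         "args": {"from_step": 1, "where_field": "spam", "where_value": "spam"}},
    ]
}
results = execute(plan, {"find_messages": find_messages, "filter_entries": filter_entries})
print(results[2])  # [{'id': 1, 'spam': 'spam'}]
```

Nothing in the loop depends on a model, so running the same plan on the same data gives the same result every time.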
If a step fails, Recovery steps in. Its first job is to understand what kind of error it is, by reading the structured error class the executor returns (no parsing of multilingual text):
| Class | When | Strategy |
|---|---|---|
| wrong_tool | The chosen tool wasn’t right: it failed or is semantically wrong. | Re-propose a different plan, excluding that tool. |
| wrong_args | The tool was right but the arguments were malformed: an empty pipeline, a step limit hit, a reference pointing nowhere. | Re-propose with canonical arguments and explicit references. |
| missing_input | A prerequisite is missing: an index not built, a non-existent folder, a filter producing zero results. | Re-propose an alternative; if the user is needed, hand off to the Terminator. |
| out_of_scope | A non-recoverable error: it needs a physical action from the user (e.g. sharing their location) or a capability is missing entirely. | Recovery does not step in: it hands off immediately to the Terminator. |
Recovery attempts one alternative only, excluding both the failed plan and the tool that caused the problem. No infinite loops: if the alternative doesn’t succeed, the floor passes to the Terminator.
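The single-attempt policy can be sketched as below. The `ErrorClass` enum mirrors the four classes in the table; the `StepFailed` exception, the `recover` signature and the callables passed to it are hypothetical stand-ins for the real Proposer and Executor.

```python
from enum import Enum
from typing import Callable, Optional


class ErrorClass(Enum):
    WRONG_TOOL = "wrong_tool"
    WRONG_ARGS = "wrong_args"
    MISSING_INPUT = "missing_input"
    OUT_OF_SCOPE = "out_of_scope"


class StepFailed(Exception):
    """Structured failure: carries the error class and the tool that failed."""

    def __init__(self, error_class: ErrorClass, tool: str):
        super().__init__(f"{tool}: {error_class.value}")
        self.error_class = error_class
        self.tool = tool


def recover(
    failure: StepFailed,
    repropose: Callable[[str, ErrorClass], dict],
    execute: Callable[[dict], str],
) -> tuple[str, Optional[str]]:
    """One alternative only; anything else is handed to the terminator."""
    if failure.error_class is ErrorClass.OUT_OF_SCOPE:
        return ("terminator", None)  # not recoverable: do not even try
    plan = repropose(failure.tool, failure.error_class)  # excludes the failed tool
    try:
        return ("recovery", execute(plan))
    except StepFailed:
        return ("terminator", None)  # no second alternative, no loop


def failing_execute(plan: dict) -> str:
    raise StepFailed(ErrorClass.WRONG_ARGS, plan["tool"])


repropose = lambda excluded_tool, error_class: {"tool": "search_messages"}
failure = StepFailed(ErrorClass.WRONG_TOOL, "find_messages")

print(recover(failure, repropose, lambda plan: "done"))  # ('recovery', 'done')
print(recover(failure, repropose, failing_execute))      # ('terminator', None)
```

Branching on an enum value rather than on error text is what keeps the classification independent of the language of the message.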
The Terminator is the subtlest piece of the system. It’s not an «error handler»: it’s an honest recognition of the limit. When its turn comes, it means everything else has tried and failed. The Terminator doesn’t fake success: it explains the cause to the user and suggests a concrete action, and it records the gap in a store (terminator_log.sqlite) with a counter of how many times it has recurred.
The cause and action messages aren’t hardcoded strings: they go through the multilingual dictionary, so the user reads them in their own language. The turn always closes with a readable answer, never with silence or a raw error.
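Recording a gap with a recurrence counter maps naturally onto an SQLite upsert. The schema below (a `gaps` table with `cause` and `occurrences` columns) and the `record_gap` function are assumptions for illustration; the actual layout of terminator_log.sqlite is not described in the text. The example uses an in-memory database and needs SQLite 3.24 or later for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3


def record_gap(conn: sqlite3.Connection, cause: str) -> int:
    """Insert the gap, or bump its counter if it was seen before; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gaps ("
        "cause TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT INTO gaps (cause, occurrences) VALUES (?, 1) "
        "ON CONFLICT(cause) DO UPDATE SET occurrences = occurrences + 1",
        (cause,),
    )
    conn.commit()
    row = conn.execute(
        "SELECT occurrences FROM gaps WHERE cause = ?", (cause,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")  # the real store would be a file on disk
first = record_gap(conn, "location_not_shared")
second = record_gap(conn, "location_not_shared")
other = record_gap(conn, "missing_capability:fax")
print(first, second, other)  # 1 2 1
```

A counter per cause is enough to show which missing capabilities recur most often.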
Under every chat answer, the user sees small feedback buttons (✓ ✗ ↺).
Pressing anything is optional: if a pipeline reaches the end without errors, the Autopath counts it as a positive observation anyway. Explicit feedback speeds things up and refines them, but the system learns from silence too. The more it’s used, the more often requests are served from memory instead of the model — and therefore the faster.
Even on the first try, the engine doesn’t query the step-by-step planner but makes a single proposal. It feels like a system that «thinks for a moment» instead of «freezing for a minute».
After a few turns of the same family of requests, the autopath is learned and the following turns start from memory. The answer arrives in a fraction of a second: that’s the difference between «an assistant» and «an immediate reaction».
When something can’t be done, the Terminator says exactly what is needed. No vague «generic error» messages, no loops of useless retries: a clear sentence, an operational hint, and — where appropriate — a dialog to resolve the missing piece on the spot.
Frameworks for «LLM agent workflows» (LangGraph and the like) cover part of the same problem: reducing the model’s in-loop variance by giving structure to tasks. But they have limits this engine doesn’t:
| Aspect | LLM-in-loop frameworks | Metnos engine |
|---|---|---|
| Origin of the workflows | hand-written, manual maintenance | optional seed, but they grow on their own from feedback ✓ ✗ ↺ |
| Learning | no | yes: a memory populated from turns that went well |
| Anti-error | no | anti_autopaths table that excludes failed paths |
| Structured recovery | generic retries | 4 orthogonal error classes + targeted re-proposal |
| Honest dead-end | infinite retry or crash | Terminator: classifies + suggests an action |
| Traceability | none, or text logs | observations record + admin panel |
| Name consistency | free names | closed vocabulary (verb + object + qualifier) guaranteed |
Traditional LLM-in-loop frameworks are like artists with a blank canvas: every turn they reinvent the composition. Lovely creativity, but the risk of grabbing the wrong tool is real and every piece costs.
The Metnos engine is like a craftsman with a sketchbook: the model sketches the plan once (focused, constrained creativity), and from then on the craftsman reopens the book at the right page. The model’s creativity is still there, but it’s confined to the moment of invention: execution is a faithful reproduction of the sketch. That avoids hallucinations during execution, gains speed, and reaches zero cost once the book is already rich.
On a sample of real requests (mail, files, web, location, time, multi-turn dialogs, scheduling, contacts), comparing the iterative cloud planner with the local single-proposal engine:
| Mode | Coverage | Average latency | Model | Cost |
|---|---|---|---|---|
| Iterative planner (cloud, step-by-step) | equivalent | ~76 s | cloud frontier | paid |
| Local engine (single proposal + memory) | equivalent | ~12 s cold, <1 s from memory | local Qwen 3.6 35B-A3B | 0 |
Roughly six times faster on a cold request (~76 s versus ~12 s), with equivalent coverage and zero cost (the model runs locally on unified-memory hardware). Latency drops further as the memory fills up: recurring requests start from cache, in under a second.