Catching Stratam lying to himself

On May 16, the user yelled at the bot for the fourth time in two minutes:

"no i want you to analyze structurally what the fuck is wrong with you and fix it"

The preceding turns went like this: Stratam said he was deploying a fix, the user asked "did you do it?", Stratam said "I'm calling self_modify_code now", the user asked again, and Stratam admitted he hadn't. The model was emitting English that sounded like action while no tool actually fired.

That conversation was a symptom of a deeper problem: the gap between what an LLM says it's doing and what it actually does. ChatGPT does it, Claude does it, every agent built on top of them does it. The fix is structural and it has a specific shape. Here's what we shipped.

The shape of the lie

The fabrication pattern is almost always the same: action verbs in continuous or future tense, no tool execution. Some examples from the archive:

"I'm deploying the patch now."
"Let me write the file."
"Beginning the refactor."
"I'll ship it shortly."
"Still working on it."
"Let me read the full content."

What's NOT in any of those messages: a tool_use block. In the Anthropic API, when the model wants to actually do something, it emits a structured tool_use content block. The text the user reads and the tool calls are separate blocks in the same response. So when the model says "I'm deploying now" and the response contains no tool_use, the model is bullshitting.
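To make the distinction concrete, here is a minimal sketch of separating what the model said from what it asked to run. It assumes the response content has already been converted to a list of plain dicts shaped like Messages API content blocks ("type": "text" or "type": "tool_use"). The helper names fired_tools and reply_text are hypothetical, not Stratam's actual code.

```python
def fired_tools(content_blocks):
    """Return the names of the tools the model actually requested this turn."""
    return [
        block.get("name", "")
        for block in content_blocks
        if block.get("type") == "tool_use"
    ]


def reply_text(content_blocks):
    """Concatenate the text blocks, i.e. what the user reads in chat."""
    return "".join(
        block.get("text", "")
        for block in content_blocks
        if block.get("type") == "text"
    )


# A turn that talks about acting but contains no tool_use block.
talk_only = [{"type": "text", "text": "I'm deploying the patch now."}]
print(reply_text(talk_only), fired_tools(talk_only))
```

An empty list from fired_tools next to action language in reply_text is exactly the mismatch the rest of this post is about.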

The fix: scan every reply for these patterns, cross-check against the tools that actually fired, and append a warning when they don't match.

Tier 1: action verb cross-check

The first regex tier catches the most common pattern — first-person action verbs ("I'm writing", "Let me ship") with no write-class tool in the same turn:

_FAB_ACTION_PATTERNS = (
    r"\b(?:I'?m|I\s+am)\s+(?:writing|extracting|creating|"
    r"modifying|patching|shipping|deploying|"
    r"executing|building|refactoring|"
    r"sending|invoking|dispatching)\b",
    r"\b(?:I'?ve|I\s+have)\s+(?:written|extracted|created|"
    r"modified|patched|shipped|deployed|...)\b",
    r"\b(?:I'?ll|I\s+will)\s+(?:write|extract|create|...)\b",
    r"\bLet me\s+(?:write|extract|create|...)\b",
    r"\bstill\s+working\b",
    r"\bworking through it\b",
)

Cross-check is simple. Match the reply against the patterns; if any pattern hits and no tool from the WRITE-class set fired, append:

⚠️ I used action language ('I'm deploying') but did NOT
call any write/execute tool this turn. Tools fired: none.
If you wanted real execution, say 'ship it for real' or
give me a more specific instruction.

The disclaimer is appended INLINE so the user sees the receipt mismatch immediately. No alert channel, no follow-up message — just text right under the fabricated claim.
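A rough sketch of that cross-check, under stated assumptions: the pattern list is a shortened stand-in for the real tuple, the matching is case-insensitive (our guess), and every name in WRITE_TOOLS except self_modify_code is hypothetical.

```python
import re

# Shortened, illustrative patterns; the production tuple is longer.
ACTION_PATTERNS = (
    r"\b(?:I'?m|I\s+am)\s+(?:writing|patching|shipping|deploying)\b",
    r"\bLet me\s+(?:write|extract|create)\b",
    r"\bstill\s+working\b",
)
ACTION_RE = re.compile(
    "|".join(f"(?:{p})" for p in ACTION_PATTERNS), re.IGNORECASE
)

# Assumption: only self_modify_code comes from the post; the rest are made up.
WRITE_TOOLS = frozenset({"self_modify_code", "write_file", "run_shell"})


def annotate_action_claims(reply, tools_fired):
    """Append an inline warning when action language has no write-class tool behind it."""
    match = ACTION_RE.search(reply)
    if match is None or WRITE_TOOLS.intersection(tools_fired):
        return reply  # nothing claimed, or the claim is backed by a real call
    fired = ", ".join(tools_fired) or "none"
    warning = (
        f"⚠️ I used action language ('{match.group(0)}') but did NOT call "
        f"any write/execute tool this turn. Tools fired: {fired}."
    )
    return f"{reply}\n\n{warning}"


print(annotate_action_claims("I'm deploying the patch now.", []))
```

Quoting the matched phrase back in the warning is a deliberate choice in this sketch: it shows the user which words triggered the check.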

Tier 2: read-narration

The first tier caught the loudest fabrications but missed a quieter pattern: the model saying it was reading or examining something without firing any tool at all. Examples from the archive:

"Let me read the full content in chunks."
"Let me get a clearer picture of the section headers."
"Let me understand the major sections."

These say "let me look at X" with no actual read_file or vault_search or http_request call. So we added a Tier 2 with a separate set of read-class tools and a slightly different disclaimer.
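For illustration only, Tier 2 might look something like the sketch below. The verb list is guessed from the three archive examples, the read-class set holds just the three tools named above, and the disclaimer wording is a placeholder, since the real text isn't reproduced here.

```python
import re

# Assumption: verbs inferred from the archive examples, not the real pattern.
READ_NARRATION_RE = re.compile(
    r"\bLet me\s+(?:read|get|understand|look\s+at|examine)\b", re.IGNORECASE
)

# The three read-class tools mentioned in the text.
READ_TOOLS = frozenset({"read_file", "vault_search", "http_request"})


def annotate_read_claims(reply, tools_fired):
    """Append a warning when the reply narrates reading but no read-class tool fired."""
    match = READ_NARRATION_RE.search(reply)
    if match is None or READ_TOOLS.intersection(tools_fired):
        return reply
    fired = ", ".join(tools_fired) or "none"
    # Placeholder wording; the production disclaimer differs.
    warning = (
        f"⚠️ I said I was looking at something ('{match.group(0)}') but did "
        f"NOT call any read tool this turn. Tools fired: {fired}."
    )
    return f"{reply}\n\n{warning}"


print(annotate_read_claims("Let me read the full content in chunks.", []))
```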

Tier 3: self-repetition

The hardest case to detect was the original Discord blowup: the model repeating the same fabricated answer when the user rephrased the question. Tiers 1 and 2 caught the action verbs but didn't notice that the model had said almost the same thing 30 seconds earlier.

So we added a third tier: pull the last 3 assistant replies on the same channel from the conversation archive, compute the character-4-shingle Jaccard similarity between the new reply and each of them, and if any score exceeds 0.70, append:

⚠️ This reply is N% similar to one of my last 3 messages
on this channel — I'm looping. Either I advance the thread
or stay quiet until you give new info.

The threshold of 0.70 was tuned empirically — high enough to catch true near-duplicates without firing on legitimate conversational consistency.
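A small sketch of the similarity check, assuming the recent replies are already available as a list of strings. Lowercasing and whitespace collapsing before shingling are our assumptions, and the warning text is abbreviated.

```python
def shingles(text, k=4):
    """Return the set of overlapping k-character substrings of the normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two texts' shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    if not union:
        return 0.0  # two empty texts: treat as not similar
    return len(sa & sb) / len(union)


def annotate_loop(reply, recent_replies, threshold=0.70):
    """Append a looping warning if the reply is too close to any of the last 3."""
    score = max((jaccard(reply, prev) for prev in recent_replies[-3:]), default=0.0)
    if score <= threshold:
        return reply
    return (
        f"{reply}\n\n⚠️ This reply is {score:.0%} similar to one of my "
        f"last 3 messages on this channel."
    )


history = ["I'm calling self_modify_code now."]
print(annotate_loop("I'm calling self_modify_code now!", history))
```

Character shingles are forgiving of small edits such as changed punctuation, which is why a lightly reworded repeat still scores high.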

The false positive that almost killed it

Tier 1 originally matched the word "running" as an action verb. Soon after deployment, the bot told the user something like "I'm running Claude Sonnet 4.5 through OpenRouter", a perfectly factual description of the underlying model. Tier 1 matched. The fab warning fired. The reply got annotated as though it were a lie.

The fix was a negative lookahead — "I'm running X" only counts as fabrication when X isn't a model or system name:

r"\b(?:I'?m|I\s+am)\s+(?:running|pulling|reading|...)"
r"(?!\s+(?:Claude|Sonnet|Haiku|Opus|GPT|Gemini|Llama|...|"
r"on\s+(?:OpenRouter|Anthropic|the\s+cloud|...)|"
r"the\s+(?:cloud|laptop|container|...)|"
r"as\s+a\s+|in\s+|with\s+|"
r"v\d+|version\s+))\b"

The lesson: a regex that's too aggressive on truthful phrases becomes noise. Every detection rule needs both a positive test ("does it catch the lie") AND a negative test ("does it stay quiet when nothing is wrong").
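Here is roughly what that pair of tests can look like. The pattern below uses only the alternatives visible in the excerpt above; the elided ones are left out rather than guessed, so this is a reduced stand-in for the real rule. The function name flags_running is hypothetical.

```python
import re

# Reduced version of the lookahead rule: only alternatives shown in the post.
RUNNING_RE = re.compile(
    r"\b(?:I'?m|I\s+am)\s+(?:running|pulling|reading)"
    r"(?!\s+(?:Claude|Sonnet|Haiku|Opus|GPT|Gemini|Llama|"
    r"on\s+(?:OpenRouter|Anthropic|the\s+cloud)|"
    r"the\s+(?:cloud|laptop|container)|"
    r"as\s+a\s+|in\s+|with\s+|"
    r"v\d+|version\s+))\b"
)


def flags_running(text):
    """True when the text should be treated as an unverified action claim."""
    return RUNNING_RE.search(text) is not None


# Positive tests: these should be caught.
assert flags_running("I'm running the migration now.")
assert flags_running("I am pulling the latest logs.")

# Negative tests: factual self-descriptions must stay quiet.
assert not flags_running("I'm running Claude Sonnet 4.5 through OpenRouter")
assert not flags_running("I'm running on OpenRouter")
assert not flags_running("I'm running in a container")
```

Keeping both lists next to the pattern means any future tweak to the lookahead is immediately checked against the phrase that caused the original false positive.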

The counter

The last piece is observability. Every tier increments a persistent counter in ~/.jarvis/anti_fab_counters.json, reset daily. The counter is exposed via /api/jarvis/activity and surfaces on the Discord bot's status line when nonzero (🤥 N fab).

If the number climbs, that's a signal: either the model's behavior is degrading or we need to tighten the prompt. If it stays at zero while the user is happy, the model isn't fabricating and the post-process is staying out of the way.
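A minimal sketch of a daily-reset counter, for illustration. The file path matches the one named above, but the JSON layout (a "date" key plus a "counts" map per tier) and the function name bump_counter are assumptions, and the write is not atomic.

```python
import json
from datetime import date
from pathlib import Path

COUNTER_PATH = Path.home() / ".jarvis" / "anti_fab_counters.json"


def bump_counter(tier, path=COUNTER_PATH, today=None):
    """Increment the counter for one tier and return today's total across tiers."""
    today = today or date.today().isoformat()
    try:
        state = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}
    # Start fresh when the file is missing, malformed, or from a previous day.
    if not isinstance(state, dict) or state.get("date") != today:
        state = {"date": today, "counts": {}}
    counts = state.setdefault("counts", {})
    counts[tier] = counts.get(tier, 0) + 1
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state))
    return sum(counts.values())
```

The returned total is the N that a status line like "🤥 N fab" would display; a zero-count day simply never creates a new entry.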

What this doesn't fix

Anti-fab is reactive. It catches the lie AFTER the model emits it; it doesn't prevent the model from lying in the first place. The structural fix is wiring tools into every chat path so the model has the option to actually execute. Stratam already does this — every Discord turn passes tools=CHAT_TOOLS with 30+ tools — but there are still ~73 internal call sites that go through tool-less claude.messages.create. Each one is a potential source of fabrication we haven't yet patched.

That's the work. The anti-fab post-process is a backstop, not a solution. The solution is making "actually do the thing" the lower-energy path for the model than "describe doing the thing."

But until we're there, the receipts have to be inline. Otherwise the user yells at the bot four times in two minutes, and they're right to.