Acolyte already had strong tooling. It had explicit modes, typed tools, guards, lifecycle feedback, and a custom terminal UI. On paper, that should have been enough to make it very good at coding work.
It was not.
The interesting part was that Acolyte was often close. It could make the correct edit, stay mostly in scope, and still fail the product test because it kept going. It would reconfirm, re-read, re-search, run validation it was not asked to run, or just generally act like it did not know it was allowed to stop.
That turned out to be the real discovery: the missing piece was not another prompt line. It was a missing protocol between the model and the host.
Prompt tuning has limits
I spent time tightening local contracts in the usual places:
- base instructions
- mode preambles
- tool-local instructions
- lifecycle feedback
- guards
That work helped. Acolyte got better at staying inside named file scope, preserving local conventions, and avoiding some obvious redundant tool calls. But the failure kept changing shape instead of disappearing.
One run would drift into unrelated docs. Another would stay in scope but re-read a file it already had. Another would make the correct change and then run `verify` anyway. The more I tuned, the clearer the pattern became: the assistant often knew enough to stop, but the loop gave it no clean way to say so.
That matters because once the host keeps the turn open, the model starts second-guessing itself. It looks like weak judgment, but part of it is structural. The system has no first-class “done” state coming from the model, so the loop keeps inviting more work.
The harness made the problem obvious
To pressure-test this properly, I added a small behavior harness to Acolyte.
It creates temporary workspaces with bounded tasks and runs them through a normal `acolyte run`. The first scenarios were intentionally small:
- docs-link-fix
- single-file-bug-fix
- two-file-rename
- add-focused-test
That part was important. I did not want a giant benchmark or a synthetic score for “intelligence.” I wanted a cheap, inspectable way to compare behavior on tasks that a good assistant should already handle.
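For illustration, a scenario in such a harness can be expressed as plain data plus a correctness check over the final workspace. Everything here (`Scenario`, `expected_writes`, the `check` callback, the example file contents) is a hypothetical sketch, not Acolyte's actual harness code:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Scenario:
    name: str                     # harness scenario id
    files: Dict[str, str]         # initial workspace contents, path -> text
    prompt: str                   # the bounded task given to the assistant
    expected_writes: int          # how many writes a clean run should make
    check: Callable[[Dict[str, str]], bool]  # correctness of final workspace

SCENARIOS = [
    Scenario(
        name="docs-link-fix",
        files={"README.md": "See the [guide](docs/gudie.md).\n"},  # broken link
        prompt="Fix the broken link in README.md.",
        expected_writes=1,
        check=lambda ws: "docs/guide.md" in ws["README.md"],
    ),
]
```

Keeping scenarios this small is the point: each one is cheap to run, and a failure is easy to inspect by diffing the temporary workspace.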
The harness also scores runs using concrete host intervention signals instead of vibes:
- guard blocks
- regenerations
- regeneration caps
- lifecycle errors
- extra discovery before the first write
- write count versus expected scope
- correctness of the final workspace state
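As a sketch, those intervention signals can be folded into a single penalty-based score. The weights and field names below are illustrative assumptions, not the harness's real scoring:

```python
def score_run(metrics: dict, expected_writes: int) -> float:
    """Penalty-based score: a correct, intervention-free run scores 10.0.
    Each host intervention subtracts a fixed (illustrative) penalty."""
    penalties = (
        2.0 * metrics.get("guard_blocks", 0)
        + 1.0 * metrics.get("regenerations", 0)
        + 3.0 * metrics.get("regen_cap_hits", 0)
        + 1.0 * metrics.get("lifecycle_errors", 0)
        + 0.5 * metrics.get("reads_before_first_write", 0)
        # writes beyond the expected scope count against the run
        + 1.0 * max(0, metrics.get("writes", 0) - expected_writes)
    )
    base = 10.0 if metrics.get("final_state_correct", False) else 0.0
    return max(0.0, base - penalties)
```

The useful property is that a run can end in a correct workspace and still score badly, which is exactly the "did the task, then degraded the run" case.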
That was much more useful than just watching the transcript. It let me see the difference between:
- “The model could not do the task.”
- “The model did the task, then degraded the run afterward.”
Those are very different problems.
The actual missing feature
The breakthrough was a tiny host protocol.
The model can now emit a lifecycle signal at the end of generation:
- `@signal done`
- `@signal no_op`
- `@signal blocked`
The host strips that control line from user-facing output, validates it against concrete runtime state, and then decides whether to accept it.
That validation is intentionally narrow:
- `done` is fine if there is no contradiction in current runtime state.
- `no_op` is fine if nothing was written.
- `blocked` is fine if there is not already some higher-priority runtime contradiction.
That means the host is not supervising the strategy. It is not classifying the task or telling the model how to solve it. It is only accepting or rejecting a very small explicit claim about lifecycle state.
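In outline, the host side of such a protocol might look like the following. The regex, the dict-shaped runtime state, and the function names are all assumptions for illustration, not Acolyte's real implementation:

```python
import re

# A control line the model may emit at the end of generation.
SIGNAL_RE = re.compile(r"^@signal (done|no_op|blocked)\s*$", re.MULTILINE)

def extract_signal(output: str):
    """Strip the @signal control line from user-facing output.
    Returns (cleaned_text, signal_or_None)."""
    m = SIGNAL_RE.search(output)
    if not m:
        return output, None
    cleaned = SIGNAL_RE.sub("", output).rstrip() + "\n"
    return cleaned, m.group(1)

def validate_signal(signal: str, state: dict) -> bool:
    """Narrow validation against concrete runtime state.
    No task classification, no strategy supervision."""
    if signal == "done":
        return not state.get("contradiction", False)
    if signal == "no_op":
        return state.get("writes", 0) == 0
    if signal == "blocked":
        return not state.get("higher_priority_contradiction", False)
    return False  # unknown signals are rejected
```

An accepted signal closes the turn; a rejected one simply leaves the loop open, so the host never has to guess what the model meant.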
That is exactly the kind of host/model relationship I want:
- The model decides how to do the work.
- The host provides structure, tools, recovery, and validation.
- The host does not pretend to be the planner.
What changed after that
The effect was immediate, and honestly more dramatic than I expected.
The same bounded tasks that had been spiraling into extra reads, redundant verification, guard blocks, and regeneration loops started stopping cleanly. The scores on the harness jumped, but more importantly, the runs started to feel right.
The assistant would:
- Read the right file.
- Make the requested change.
- Emit `done`.
- Stop.
And for no-op tasks:
- Read the file.
- Conclude no change was needed.
- Emit `no_op`.
- Stop.
That sounds small, but it removed a whole class of self-inflicted degradation.
It also let me simplify the system afterward. Once the signal existed, a surprising amount of earlier tuning became unnecessary:
- some post-edit recovery guards
- some “then stop” prompt pressure
- some explicit reread-avoidance wording
That was a good sign. When you add something and other things become unnecessary, you are probably solving the right problem.
Harder tasks were more revealing
The harness was also useful because it showed where the system was still weak.
Simple bounded edits became strong quickly after the signal landed. But harder structural refactors were a different story. On one rename experiment, Acolyte temporarily inserted AST placeholders into real code before recovering to a correct final result. That is not a prompt problem. That is a sign that the edit-code path still needs work.
That is exactly the kind of distinction I wanted the harness to give me:
- Bounded stop behavior is much better now.
- Structural editing is still shakier.
Those are different product problems, and they need different fixes.
Why I think this matters
I think a lot of coding assistants are still optimized around the wrong abstraction.
People talk about agent behavior as if it comes from one big system prompt. Sometimes it does. Many tools in this space still lean heavily on large monolithic prompts. But the real product behavior comes from the interaction between:
- prompts
- tool contracts
- output formats
- lifecycle structure
- host policy
- model choice
And sometimes the biggest improvement is not more prompt tuning at all. Sometimes it is introducing the right missing contract. That is what happened here.
The useful discovery was not “prompts matter.” Of course they do. The useful discovery was:
if the model has no explicit way to tell the host that it is done, the host will keep the loop alive, and the product will degrade even when the model already did the requested work.
That feels obvious in retrospect. It did not feel obvious before the harness made it painfully clear.
What I like about this fix
The lifecycle signal is one of my favorite kinds of product changes: small, explicit, architectural, easy to explain, and high leverage.
It also fits Acolyte’s philosophy cleanly. I do not want a host that supervises the model with a pile of heuristics. I want a host that gives the model a good environment to work in and a clean way to communicate state back when needed. A lifecycle signal does exactly that. Not a fake tool, not a workaround. A proper protocol.
And now that it exists, I suspect it will end up being useful in more places than this first completion problem.