Marshmallowy: What LLMs Change — and Where the Edges Are

Ludwig Wendzich

February 10, 2026

The genuinely new thing LLMs bring to software isn’t intelligence or automation: it’s marshmallowiness.

LLMs are soft, squishy, language-first systems. They’re unreasonably good at understanding intent, filling in missing context, and producing explanations that feel human.

That changes how you should build. But it also forces you to learn—sometimes the hard way—where marshmallowy systems need boundaries.

Leaning Into Marshmallowy

Traditional software forces users to adapt to rigid structures. LLMs flip that: users can speak naturally and let the system interpret.

This is powerful. A user can say: “Can you chase up that invoice from the plumber, the one from a couple weeks ago?”

No invoice number. Approximate time. Vendor by trade, not name.

Classic software would fall over because this is not a structured query. At best, a UX designer could give you buttons to mash until you had built one. LLM-backed systems, by contrast, thrive on this marshmallowy context.
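To make the plumber example concrete, here is a minimal sketch of what the interpreted request might look like once the soft language has been turned into something a system can match against. The names (Invoice, LooseQuery, candidates) and the seven-day tolerance are illustrative assumptions, not anything from a real product, and the LLM step that produces the LooseQuery is deliberately left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Invoice:
    number: str
    vendor_name: str
    vendor_trade: str   # e.g. "plumber" (hypothetical field)
    issued: date


@dataclass
class LooseQuery:
    """What an LLM might extract from 'that invoice from the plumber,
    the one from a couple weeks ago'. All fields are assumptions."""
    vendor_trade: str
    around: date
    tolerance_days: int


def candidates(query: LooseQuery, invoices: list[Invoice]) -> list[Invoice]:
    """Return invoices that fit the loose description, closest date first."""
    window = timedelta(days=query.tolerance_days)
    matches = [
        inv for inv in invoices
        if inv.vendor_trade == query.vendor_trade
        and abs(inv.issued - query.around) <= window
    ]
    return sorted(matches, key=lambda inv: abs(inv.issued - query.around))


today = date(2026, 2, 10)
query = LooseQuery("plumber", around=today - timedelta(days=14), tolerance_days=7)
invoices = [
    Invoice("INV-1", "Pipes & Co", "plumber", date(2026, 1, 29)),
    Invoice("INV-2", "Sparks Ltd", "electrician", date(2026, 1, 27)),
    Invoice("INV-3", "Pipes & Co", "plumber", date(2025, 12, 1)),
]
result = [inv.number for inv in candidates(query, invoices)]
```

The point of the sketch is the shape of the query: no invoice number, a fuzzy date window, and a vendor identified by trade rather than name.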

That softness is the interface. It’s super-powerful, and the “We have an LLM” mindset is actually quite hard to get used to. That’s another post. But let’s say it does click, and suddenly you see LLM-shaped opportunities everywhere: how does it go wrong? Where do you draw the line?

The Bounds for Humans: Guidance, Not Guessing

Doing Marshmallowy properly doesn’t mean “anything goes”. In practice, fully open-ended interfaces create uncertainty. Users don’t know what’s possible, what’s safe, or what the system expects.

So we’ve learned that marshmallowy systems still need to give the people using them guidance:

  • templates instead of blank slates
  • prompts that reveal capability
  • progressive disclosure instead of overwhelming configuration

The LLM gives flexibility to accept anything, but the product still gives confidence through UX. This means experts can always go further, but new users aren’t left guessing.

Examples

Prompts that reveal capability

It's really cool that you can type anything into this field and the LLM will get it, but that doesn't help users understand, or feel confident about, what they can type in there. Suggested prompts are a great way to give them the guidance and confidence to navigate the big empty text box.

User types @ or / and gets live-filtered parameter completion. The user can still type anything, but the menu reveals what's possible.
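A minimal sketch of that live filtering, assuming "@" introduces parameters or entities and "/" introduces commands. The catalogue entries and the suggest function are hypothetical; a real product would load its own list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    trigger: str   # "@" or "/" (assumed convention)
    label: str     # text shown in the menu


# Hypothetical catalogue, for illustration only.
CATALOGUE = [
    Suggestion("@", "vendor"),
    Suggestion("@", "invoice"),
    Suggestion("/", "chase"),
    Suggestion("/", "reconcile"),
]


def suggest(text: str, catalogue=CATALOGUE) -> list[str]:
    """Return menu entries for the token currently being typed, or [] if none."""
    if not text or text[-1].isspace():
        return []                       # nothing is being typed right now
    token = text.split()[-1]            # the word under the cursor
    if token[0] not in ("@", "/"):
        return []                       # free text: stay out of the way
    trigger, query = token[0], token[1:].lower()
    return [
        s.trigger + s.label
        for s in catalogue
        if s.trigger == trigger and s.label.startswith(query)
    ]
```

Free text never triggers the menu, so the field still accepts anything; the menu only appears when the user asks for it with a trigger character.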


Templates instead of blank slates

Another option is to provide a known, working starting point or example. We have templates for common workflows so people can skip straight past the blank state.

Not just "write whatever you want" — templates guide users toward proven patterns.

The Bounds for Agents: Structure, Not Trust

Marshmallowy also doesn’t mean trusting the model. LLMs lie and they don’t listen. LLMs are persuasive, confident, and wrong in subtle ways. They hallucinate outcomes, skip steps, and slowly drift.

Keeping agents honest

Our job is to ensure that, given that reality, our agents are still reliable for our customers. So agents operate inside non-negotiable constraints that reduce, catch, or eliminate their shortcomings. We’ve developed an agentic system that harnesses an LLM through:

  • explicit planning / todos before action
  • tool-only execution
  • tool verification against intent and permissions
  • completion verification so “done” actually means done
  • drift detection for long-running work

Examples

Explicit planning / todos before action

We know agents perform better when they plan. But they don't always plan. So we make them go through a planning phase before they are allowed to execute. This also gives our users a nice affordance: they can put a task in “Supervised” mode, which requires explicit approval to move out of the Planning phase and into the Execution phase.

Sterling can be required, per task, to seek approval before acting.
But even without approval, Sterling cannot proceed without making a plan first. Period.
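The planning gate can be sketched as a small state machine. This is an illustration under assumptions, not Sterling's implementation: the Task class, its method names, and the two exception types are all hypothetical.

```python
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    EXECUTION = "execution"


class PlanRequired(Exception):
    """Raised when execution is requested before any to-do exists."""


class ApprovalRequired(Exception):
    """Raised when a supervised task has not been approved by a human."""


class Task:
    def __init__(self, supervised: bool = False):
        self.supervised = supervised
        self.phase = Phase.PLANNING      # every task starts in planning
        self.todos: list[str] = []
        self.approved = False

    def add_todo(self, description: str) -> None:
        self.todos.append(description)

    def approve(self) -> None:
        self.approved = True

    def start_execution(self) -> None:
        # Rule 1: no plan, no execution, regardless of mode.
        if not self.todos:
            raise PlanRequired("add at least one to-do before executing")
        # Rule 2: supervised tasks also need explicit human approval.
        if self.supervised and not self.approved:
            raise ApprovalRequired("a human must approve this plan first")
        self.phase = Phase.EXECUTION
```

The order of the checks matters: approval never substitutes for a plan, so an approved but empty task still cannot leave the planning phase.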

Ready to Execute Validates Intent: Context Gathering is Required

And because agents are trained to be super-keen to please, they need some encouragement to slow down and plan more carefully. So we make sure they gather context during their planning phase (with read tools), so they can make a more specific plan.

The agent must use read tools to gather context before executing. It can't just make up a plan.
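One way to enforce that rule is a readiness check that inspects which tools were called during planning. The tool names and the ready_to_execute function below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical names of read-only tools; a real system would define its own.
READ_TOOLS = {"search_invoices", "get_vendor", "list_transactions"}


def ready_to_execute(planning_calls: list[str], todos: list[str]) -> tuple[bool, str]:
    """Decide whether a plan may move to execution.

    planning_calls: names of the tools the agent called while planning.
    todos: the to-do items the agent wrote down.
    """
    if not todos:
        return False, "rejected: no plan"
    if not any(name in READ_TOOLS for name in planning_calls):
        return False, "rejected: gather context with a read tool first"
    return True, "ready"
```

Returning a reason alongside the verdict gives the agent something specific to act on when its request is rejected.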



Quality Pass Before Completion

A separate sub-agent with fresh context (the Senior Accountant) reviews the work, so the agent that did the work can't self-certify. There are three outcomes: approved, fix and retry, or escalate to a human.
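A minimal sketch of a review loop with those three outcomes. The function names are hypothetical, and the cap on attempts (after which the loop escalates) is an added assumption rather than something the text specifies.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    APPROVED = "approved"
    FIX_AND_RETRY = "fix_and_retry"
    ESCALATE = "escalate"


def run_with_review(
    do_work: Callable[[int], str],
    review: Callable[[str], Verdict],
    max_attempts: int = 3,
) -> tuple[Verdict, str]:
    """Run the worker, then hand its output to an independent reviewer.

    do_work receives the attempt number; review sees only the output,
    standing in for a reviewer with fresh context.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    work = ""
    for attempt in range(1, max_attempts + 1):
        work = do_work(attempt)
        verdict = review(work)
        if verdict is not Verdict.FIX_AND_RETRY:
            return verdict, work          # approved, or escalated by the reviewer
    # Assumption: running out of retries is treated as an escalation.
    return Verdict.ESCALATE, work
```

The worker never sees or sets the verdict; only the reviewer can produce APPROVED.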


Task Complete Validates All Todos

Agents are prone to hallucinating actions (especially after they've planned them), so we built in guardrails: during the Planning phase, agents must say which tools they will need to complete each to-do, and which documents or entities they will need to create or update. When they mark a to-do as “complete”, we can then verify that those tools were called against those entities. If not, we reject the completion and ask the agent to try again.

Completion verification so "done" actually means done. Agent says "done" → system checks todos → rejects if work incomplete.
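That check can be sketched as a comparison between what each to-do declared up front and what the tool-call log actually contains. The Todo shape, the (tool, entity) pairs, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Todo:
    description: str
    # (tool name, entity id) pairs the agent declared during planning.
    expected_calls: set[tuple[str, str]]
    done: bool = False


def mark_complete(todo: Todo, call_log: list[tuple[str, str]]) -> tuple[bool, str]:
    """Accept a to-do as complete only if every declared call was really made."""
    missing = todo.expected_calls - set(call_log)
    if missing:
        return False, f"rejected: missing calls {sorted(missing)}"
    todo.done = True
    return True, "accepted"


def task_complete(todos: list[Todo]) -> bool:
    """A task is done only when it has to-dos and all of them were verified."""
    return bool(todos) and all(t.done for t in todos)
```

Because the log is written by the system rather than by the agent, a hallucinated action shows up as a missing entry and the completion is rejected.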


Completion Review Challenge

First task_complete call triggers a challenge: "Look at your original plan. Did you actually do everything?"
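A minimal sketch of that challenge, assuming a per-task gate object. Only the task_complete name and the challenge wording come from the text above; the class and the response shape are hypothetical.

```python
CHALLENGE = "Look at your original plan. Did you actually do everything?"


class CompletionGate:
    """Intercepts the first completion attempt and asks the agent to re-check."""

    def __init__(self) -> None:
        self.challenged = False

    def task_complete(self) -> dict:
        if not self.challenged:
            # First call: don't accept yet, send the agent back to its plan.
            self.challenged = True
            return {"status": "challenge", "message": CHALLENGE}
        # Later calls pass through (other checks would still apply in practice).
        return {"status": "accepted"}
```

The gate is deliberately cheap: it costs one extra round trip per task and gives the agent a chance to notice skipped steps before any verification runs.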

Empowering agents in the finance domain

And while Marshmallowiness is great for allowing extremely flexible systems, we are building a specialised agent so our users can fly.

Our users don’t want to be constructing and tracking schemas for standard finance concepts like Invoices, FX, Prepayments, etc. Taking marshmallowiness to the extreme might look like “Well, can’t users build FX tracking by stringing together Tasks, Artifacts, etc.?” They could, but should they? We aren't building a generic agent; we’re building one that’s fluent in finance.

We should support those structures so our users only need to think about what’s unique about their business domain: whether it is tracking horse breeds or AWS cost centers. We ask ourselves: will every finance team need this? If so, we should lay the foundations.

Examples

Prepayments & Amortization

Instead of "Users build prepayment tracking with spreadsheet artifacts", we have a dedicated model that tracks metadata throughout the lifecycle and provides the convenience methods we know every team will need.
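As an illustration of the kind of convenience method such a model might offer, here is a sketch of a monthly amortization schedule. The text doesn't say which method is used; straight-line amortization is assumed here purely as an example, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def straight_line_schedule(total: Decimal, months: int) -> list[Decimal]:
    """Split a prepayment evenly across months, to the cent.

    Any rounding remainder is put in the final month so the schedule
    always sums exactly to the prepaid amount.
    """
    if months <= 0:
        raise ValueError("months must be positive")
    cent = Decimal("0.01")
    monthly = (total / months).quantize(cent, rounding=ROUND_HALF_UP)
    schedule = [monthly] * (months - 1)
    schedule.append(total - monthly * (months - 1))
    return schedule


schedule = straight_line_schedule(Decimal("1000.00"), 3)
```

Using Decimal rather than float keeps the amounts exact, which matters when the schedule has to reconcile against the original payment.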

FX/Currency Built Into Every Transaction

Instead of "Users build FX tracking with spreadsheet artifacts", we automatically record metadata on every transaction, and conversion happens transparently via callbacks.

Multi-Stage Approval Workflow

Instead of "Users string together clarifications for approval", we offer a complete approval infrastructure.

Invoice Understanding (Not Just Raw PDFs)

Instead of "Users describe to the agent how to read an invoice", we have structured invoice understanding built in.

Designing for Two Audiences

This is the key shift: When you build with LLMs, you’re no longer designing for just one user.

You’re designing for:

  • humans, who need flexibility and confidence
  • agents, which need freedom to interpret and hard edges to stay aligned

Marshmallowy is a new way of thinking: knowing where to lean into softness, knowing where to put very hard edges. Knowing that interpretation is valuable—and authority must be earned.

Get that balance right and you don’t get just a chatbot: you get supervised labour that actually works.

That’s what marshmallowy means.
