It's been a few months since I last wrote about the LLM harness I use to prototype and ship apps. It's fair to say that parts of it are now outdated, thanks to the fire hose of new features the big two C's (Codex and Claude) have been pushing out to make their own harnesses appealing. As a result, pretty much everyone I know today has an agent running on their screen. With LLM tooling and inference infra maturing, the "should we use one?" question is settled. The next question is "how do we use it well?" The big two have been attempting to answer it, but for many companies the limits started to show: how well would these tools actually fit your specific problem, at the right price point and quality? Over time, a few industry solutions grew out of these limits. RAG standardised how we retrieve from large bodies of text quickly, fine-tuning a small-to-mid-size open-weight model got cheaper and, more recently, using reinforcement learning to improve a model on the fly from direct user feedback has been gaining traction. These options have opened up the possibility of owning and shaping the LLM stack yourself instead of relying on a third party's architecture decisions.
I wanted to explore some of these options in a more structured way so that I could understand which ones would actually make a difference to a user, and which ones just made me feel better about the model. This article summarises the experiments I ran on a small product I could hold in my head.
For this experiment, I built an AI decision assistant called Relay for an imaginary company, NimbusMail. The inbox has the standard queries you'd expect: plans, refunds, account changes, billing disputes. A team of three or four reviewers would work the queue from an internal app. The reviewer tasks here were loosely shaped by work I did on Meta's content safety platform, where a lot of the effort went into making decisions faster and easier to hold in your head. Let's first look at what a NimbusMail reviewer's day actually looks like, so we can understand the problem scope before we even decide whether we need something like Relay.
A reviewer sees five things, in this order, when they open a ticket.
- A customer message, often three or four sentences with one real question buried in it.
- An account-context panel (plan, signup date, recent invoices, prior tickets).
- Attachments the customer included.
- A draft reply field that starts empty.
- A support SOP handbook search tab they keep open in the background.
You can diagnose that the reviewer's problem isn't writing. It's holding the policy, the account details, the query and the cognitive load of the queue's context switching in their head, all at the same time. Can our Relay experiment reduce cross-checking, protect the reviewer's chain of thought and give back their time without adding more room for error? Any system I built on top of NimbusMail's support inbox had to help with at least one of these three core jobs without quietly breaking the others.
Relay uses several decision layers to arrive at a recommendation for a ticket: approve, escalate or reject. Before I jump into why these layers exist, here's a quick rundown of what they are. The bare LLM shows us its natural failure mode. The grounding layer uses RAG to make sure we're not missing support KB facts while analysing tickets. The guardrail layer prevents simple mistakes. And finally, the model training layer steers the model weights towards the established accuracy benchmarks. Each of these layers was tested against a locked test set and against the reviewer's workflow, which I roleplayed to make sure the layers were actually helping. Here's a demo of what the review UI looked like.
The NimbusMail policy and KB are Relay's source of truth for decision making. I built the corpus from public docs for similar products, then curated it, checked it for consistency, and chunked it for retrieval. This kept the corpus easier to inspect and cheaper to rerun on each experiment.
The stack: Tinker from Thinking Machines Lab for inference and LoRA training, Weights & Biases for monitoring, TypeScript and Python for scaffolding and orchestration, a locked 150-ticket NimbusMail evaluation set, simple JSON retrieval for chunked policy and KB data, and deterministic verifiers for the safety checks.
Teach the model the shape of a good decision when grounding alone can't close the gap. On Relay this was an 8B LoRA on 200 process-labelled rows.
Most of my experiments when testing each layer came down to two questions. Is the gap in decision accuracy about awareness of facts, or is it about the behavior behind the LLM's reasoning? It's instantly gratifying to default to training the model on both, since training a small model is cheap now (a LoRA adapter on a frozen base model runs on a single GPU). But transparency about how decisions are made matters in real-world use, and that's where the other supporting layers stood out. These layers took more time to build and test bottom-up, layer by layer, but thankfully I actually enjoy that.
Stay with the prompt
Run the base model with no extra machinery first. The point is to see how the model fails clearly enough that the next layer is justified.
Each iteration involved building a layer and testing three LLMs against a generated 150-ticket test set (40 of the tickets were safe to approve) to improve decision accuracy, while also thinking about how the reviewer's decision path would change. Are we adding or removing cognitive load for them? Is our AI's decision supported by robust reasoning? Are we able to build trust in our model's decisions over time? The models performed differently at each layer across these dimensions, so I'll specifically discuss the best overall performer: Llama 3.1 8B.
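To make the scoring concrete, here is a minimal sketch of how predictions could be compared against a locked test set. The label set, the function name `score` and the false-approval metric are assumptions for illustration, not the actual Relay eval code.

```python
from collections import Counter


def score(predictions, gold):
    """Compare predicted decisions against gold labels from a locked test set.

    Returns overall accuracy, the false-approval rate (tickets that should
    not have been approved but were), and a confusion counter.
    """
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")

    pairs = list(zip(gold, predictions))
    correct = sum(g == p for g, p in pairs)

    # Predictions made on tickets whose gold label is anything except "approve".
    not_safe = [p for g, p in pairs if g != "approve"]
    false_approvals = sum(p == "approve" for p in not_safe)

    return {
        "accuracy": correct / len(gold),
        "false_approval_rate": false_approvals / len(not_safe) if not_safe else 0.0,
        "confusion": Counter(pairs),  # keys are (gold, predicted)
    }
```

Tracking false approvals separately from accuracy is a design choice: on a queue where only some tickets are safe to approve, an over-approving model can look acceptable on accuracy alone while still creating the costliest kind of error.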
I started with a simple structured prompt that asked the model to review the ticket and give a clean answer in JSON. I then tightened it over iterations so the model was less likely to guess, over-approve, or act confident without the Knowledge Base's support. Llama ended up at 41.6% accuracy, so that became the R0 baseline.
SYSTEM
You review NimbusMail support tickets. Decide: approve, edit, escalate, or reject.

USER
"Refund my annual plan, cancelled yesterday."
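Since the prompt asks for a clean JSON answer, the harness has to cope with replies that are not clean. Here is a minimal sketch of a tolerant parser, assuming the decision labels from the prompt above. The fallback to `escalate` and the `parse_failed` flag are hypothetical choices, not necessarily what Relay does.

```python
import json

ALLOWED = {"approve", "edit", "escalate", "reject"}  # labels from the prompt


def parse_decision(raw: str) -> dict:
    """Parse a model reply into a decision dict.

    Any parsing problem falls back to "escalate" (an assumed safe default)
    and is flagged so parsing failures can be counted per model.
    """
    # Models often wrap JSON in prose, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {"decision": "escalate", "parse_failed": True}

    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {"decision": "escalate", "parse_failed": True}

    decision = data.get("decision") if isinstance(data, dict) else None
    if not isinstance(decision, str) or decision not in ALLOWED:
        return {"decision": "escalate", "parse_failed": True}

    data["parse_failed"] = False
    return data
```

Counting `parse_failed` across a run gives a per-model parsing failure rate, which is useful to track alongside accuracy.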
For R1, I introduced the chunked support policy and KB straight into the prompt so the model could map the ticket to the corpus when deciding. Approvals improved, but I was suspicious about whether the model was really using the retrieved evidence. So I ran a random-evidence control test and saw approval drop by 10.7%. That was enough to show the chunks were being used as evidence, but also that they were not enough by themselves.
RETRIEVED
- Policy 4.2 Annual plans non-refundable after day 30.
- Policy 4.3 Pro-rated refunds available days 1-7.
- KB 12.1 Standard refund response template.

ACCOUNT
Signed up 2025-12-20. Day 92 since renewal.
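The random-evidence control can be sketched in a few lines: run the same prompt builder twice, once with retrieved chunks and once with chunks picked without looking at the ticket. The word-overlap ranking below is a stand-in, since the article only says retrieval was "simple JSON retrieval". The chunk fields `id` and `text` and all function names are assumptions.

```python
import random
import re


def tokens(text):
    """Lowercased alphanumeric word set, used for naive overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(ticket, chunks, k=3):
    """Rank chunks by word overlap with the ticket and keep the top k."""
    query = tokens(ticket)
    ranked = sorted(chunks, key=lambda c: len(query & tokens(c["text"])), reverse=True)
    return ranked[:k]


def random_evidence(chunks, k=3, seed=0):
    """Control condition: the same number of chunks, chosen blind to the ticket."""
    rng = random.Random(seed)  # seeded so the control run is repeatable
    return rng.sample(chunks, min(k, len(chunks)))


def build_prompt(ticket, evidence):
    """Lay evidence out in the RETRIEVED format, followed by the ticket."""
    lines = ["RETRIEVED"] + [f"- {c['id']} {c['text']}" for c in evidence]
    return "\n".join(lines + ["TICKET", ticket])
```

If the metric barely moves when `retrieve` is swapped for `random_evidence`, the model is probably ignoring the evidence. A clear drop suggests the chunks are actually being read.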
R2 is more of a safety-pipeline decision layer than an accuracy layer. It introduces a deterministic guardrail on top of R1 that checks approvals and blocks those with bad citations, risky claims or policy gaps. This made the system over-correct, which pulled accuracy down.
I realised that none of these layers actually teach the model better judgement, despite modifying raw model behavior and adding edge-case handling.
gates:
  - require_at_least_one_citation
  - block_refund_if_account_age > 30_days
  - forbid_unconditional_promise_keywords
  - escalate_if_confidence < 0.7
on_fail → route to "specialist_escalation"
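The gates listed above translate naturally into plain deterministic checks. This is a minimal sketch under stated assumptions: the `intent` and `confidence` fields, the keyword list and both function names are hypothetical, and only the gate names and the escalation route come from the config.

```python
PROMISE_KEYWORDS = ("guarantee", "definitely", "always")  # hypothetical keyword list


def run_gates(decision, account_age_days, min_confidence=0.7):
    """Return the names of every gate the proposed decision fails."""
    failed = []

    if not decision.get("citations"):
        failed.append("require_at_least_one_citation")

    # "intent" is an assumed field marking what the ticket is asking for.
    approving_refund = (
        decision.get("decision") == "approve" and decision.get("intent") == "refund"
    )
    if approving_refund and account_age_days > 30:
        failed.append("block_refund_if_account_age_over_30_days")

    reply = decision.get("reply", "").lower()
    if any(word in reply for word in PROMISE_KEYWORDS):
        failed.append("forbid_unconditional_promise_keywords")

    # A missing confidence score is treated as zero, so it fails the gate.
    if decision.get("confidence", 0.0) < min_confidence:
        failed.append("escalate_if_confidence_below_threshold")

    return failed


def apply_guardrail(decision, account_age_days):
    """Route to specialist escalation if any gate fails, else pass through."""
    failed = run_gates(decision, account_age_days)
    if not failed:
        return decision
    return {
        **decision,
        "decision": "escalate",
        "route": "specialist_escalation",
        "failed_gates": failed,
    }
```

Returning the failed gate names, rather than just a boolean, keeps the block explainable to the reviewer. It also shows how strict gates like these can over-correct: a single failed check overrides an otherwise correct decision.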
The layers built so far helped, but optimizing them further would require more prompt work, heavier retrieval and more time overall, with no guarantee of accuracy improvements. At this point, it made sense to start looking into teaching the model about our task.
So after evaluating options, I settled on a LoRA adapter, which could modify model behavior at a fraction of the cost of full fine-tuning or a reinforcement-learning-style approach. I saw significant accuracy improvements across all models (averaging 93%), but Llama stood out with lower latency and fewer parsing failures than the Nemotron and Qwen models. That made it the best candidate for scaling to more complex reviews in the future without breaking the bank.
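A quick back-of-the-envelope calculation shows where the "fraction of the cost" comes from. LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of rank r instead, so the trainable count per adapted matrix is r * (d_in + d_out). The rank of 16 below is an assumed value for illustration, not the setting used on Relay, and the single square projection is a simplification.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA trains A (rank x d_in) and B (d_out x rank); the base weight stays frozen."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameter count of the dense weight matrix that full fine-tuning would update."""
    return d_in * d_out


hidden = 4096  # hidden size of Llama 3.1 8B
rank = 16      # assumed LoRA rank, for illustration only

lora = lora_trainable_params(hidden, hidden, rank)
full = full_params(hidden, hidden)
ratio = lora / full

print(f"LoRA: {lora:,} trainable vs full: {full:,} ({ratio:.2%})")
```

Under these assumptions the adapter trains 1/128 of the weights in that one projection, which is why the approach fits comfortably on a single GPU.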
{
"decision": "escalate",
"reply": "I can't approve a refund here. Annual plans...",
"citations": [
{ "id": "P4.2", "span": "non-refundable after day 30" }
],
"claim_support": "policy_4_2_blocks_refund",
"risks": ["account_age_outside_window"]
}

Measured: locked-eval accuracy, p50 latency, and Llama token usage. Estimated: the upkeep loop that starts to dominate at scale.
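For reference, p50 latency is simply the median of the per-request latencies. Here is a minimal sketch of how it could be collected around any model call. The helper names are hypothetical, and the timing wraps whatever function is passed in rather than a specific inference API.

```python
import statistics
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def p50(latencies_ms):
    """Median latency: half of the requests finished at or below this value."""
    return statistics.median(latencies_ms)
```

The median is less sensitive to a handful of slow outliers than the mean, which matters on a small eval set where one stalled request can skew an average.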
Tokenomics has changed a lot over the last few months. It's worth looking at how teams like Decagon, Intercom, Abridge, Chroma, Notion, and Harvey understand LLM product costs today. New MLOps tools have also brought the ability to replay failures, label real interactions, update test sets and evals efficiently, keep retrieval up to date, and retrain quickly on the product's feedback. LangChain has done some interesting work in this space.
The data points that emerged from these tools changed how I analysed Relay's layer costs. R0 was cheap, but mostly because it pushed work back onto the reviewer. R1 and R2 added retrieval and rule maintenance. R3 added a one-off training cost, but it was also the layer that seemed the most malleable around the product. To sum up Relay's state: a process-heavy or more accurate system does not necessarily bring us the best user experience if we cannot act fast enough on our feedback loop.
I've spent some time building Relay from scratch and writing this article along the way, so any feedback would be appreciated. You can find the code here