Not written by AI

Building Relay, layer by layer.

It's been a few months since I last wrote about the LLM harness I use to prototype and ship apps. Fair to say that some parts of it have become outdated because of the fire-hose of new features the big two C's (Codex and Claude) have been pushing out to make their own harness appealing. And as a result, pretty much everyone I know today has an agent running on their screen. The "should we use one" question is now over with LLM tooling and inference infra maturing. The next question is “how do we use it well?” . The big two have been attempting to address it, but the limits to how well they would actually fit your specific problem at the right price point and quality started to show up for many companies. Over time and exposure, a few industry solutions came out of these limits. RAG standardised how we retrieved large texts quickly, fine tuning a small-to-mid size parameter open-weight model got cheaper and more recently, using reinforcement learning to teach it to do better based on direct user feedback on the fly has been gaining traction. These options have opened up the possibility owning and shaping the LLM stack yourself instead of relying on a third party's architecture decisions.

I wanted to explore some of these options in a more structured way so that I can understand which ones would actually make a difference to a user, and which just made me feel better about the model. This article summarises experiments I tried on a small product I could hold in my head.

01 · The reviewer’s day

Setting the scene

For this experiment, I built an AI decision assistant called Relay for an imaginary company, NimbusMail. The inbox has the standard queries you'd expect: Plans, refunds, account changes, billing disputes. A team of three or four reviewers would work the queue from an internal app. The reviewer tasks here were loosely shaped by work I did on Meta's content safety platform, where a lot of the effort went into making decisions faster and easier to hold in your head. Let's first have a look at what a NimbusMail reviewer's day actually looks like so we can understand the problem scope before we even decide if we need something like Relay.

A reviewer sees five things in this order when they open a ticket. - A customer message, often three or four sentences with one real question buried in it. - An account-context panel (plan, signup date, recent invoices, prior tickets) - Attachments the customer included. - A draft reply field that starts empty. - A support SOP handbook search tab they keep open in the background.

Time is waiting, We only got four minutes....

Maya · Tuesday 11:42Queue 27 · coffee #2

~4.0 min per ticket

Three jobs to do well

01Decide right action without missing policy
02Send a reply that doesn’t hurt the relationship
03Stay ahead of the queue

Mood

🙂alert

😐routine

😣tense

🙂writing

😌done

Actions

Skim the customer message. Decide what kind of request it is.

Open the handbook tab. Find the relevant policy and adjacent KB article.

Compare the policy to the account panel. Refund window, plan, date math.

Open a fitting template. Edit tone. Add name and the policy line.

Pick approve, edit-and-send, escalate, or reject. Move on.

Thinking

“Refund on an annual plan. Where’s the signup date?”

“Refund policy… is it 4.2 or 4.3 that applies?”

“Day 92 since renewal. Window’s gone. Unless legacy plans grandfather in…”

“Don’t sound robotic. Cite 4.2. Offer credit as fallback.”

“Send. Next ticket.”

Feeling

Sharp and focused. Queue is later’s problem.

Routine. Muscle memory carries it.

Tight. Where most mistakes happen.

Back on rails. Templates carry weight.

Small dopamine hit. One down, 26 to go.

Pain point

—

Tab-switching tax.

Held in head: policy + account + queue.

—

You can diagnose that the reviewer's problem isn't writing but in-fact holding the policy, the account details, the query and the queue's context switching cognitive load all in their head at the same time. Can our Relay experiment help reduce cross-checking and protect the reviewer's chain of thought while giving back their time without adding more room for error? I need to make sure that any system I built on top of NimbusMail's support inbox had to help with at least one of the three core jobs as well above without quietly breaking the others.

02 · Four layers

What was built

Different decision layers help Relay to come up with either approve, escalate, reject as a recommendation for a ticket. Before I jump into why these layers exist, a quick rundown on what they are: The bare LLM model gives us it's natural failure mode, the grounding layer helped ensure we're not missing support KB facts while analysing tickets via RAG, a guard-rail layer to ensure we prevent simple mistakes. And finally a model training layer which helps steer the model weights towards the established accuracy benchmarks. Each of these layers were tested against a locked test set and the reviewer's workflow which I roleplayed to make sure they were actually helping. Here's a Demo of what the review UI looked like.

The NimbusMail policy and KB will be Relay's source of truth for decision making. I built the corpus from public docs for similar products, then curated it, checked it for consistency, and chunked it for retrieval. Keeping the corpus easier to inspect and cheaper to rerun over each experiment run.

Stack

Tinker from Thinking Machines Labs for inference and LoRA training, Weights & Biases for monitoring, TypeScript and Python for scaffolding and orchestration, a locked 150 ticket NimbusMail evaluation set, simple JSON retrieval for chunked policy and KB data, and deterministic verifiers for the safety checks.

Layer 04

Behavior training

95.3%decision accuracy

Teach the model the shape of a good decision when grounding alone can't close the gap. On Relay this was an 8B LoRA on 200 process-labelled rows.

Cost: $$ build · $ run
When: Add when the gap is repeatable behavior, not missing facts.

Most of my experiments when testing each layer came down to two questions. Is the gap in decision accuracy about the awareness of facts or the existing behavior behind the LLM's reasoning? It's instantly gratifying to default to training (a LoRA adapter on a frozen base model runs on a single GPU) the model on both to solve that since training at on a small model is cheap now, but transparency on how decisions are made is important when it comes real world use, and that's where the other supporting layers stood out. These layers took more time to build, test bottom up per layer, but thankfully I actually enjoy that.

Recommended next layer

Stay with the prompt

Run the base model with no extra machinery first. The point is to see how the model fails clearly enough that the next layer is justified.

Building with the reviewer

Each iteration involved building a layer and testing three LLM models against a generated 150-ticket test set (40 of them were safe to approve) to improve our decision accuracy while at the same time thinking about how the reviewer's decision path would change. Are we adding or removing more cognitive load for them? Is our AI's decision supported by robust reasoning? Are we able to build trust in our model's decisions over time? Each model performed differently across each layer across these dimensions, so I'll be specifically discussing about the best overall performing model: Llama 3.1 8B.

R0 · prompt only

I started with a simple structured prompt that asked the model to review the ticket and give a clean answer in JSON. I then tightened it over iterations so the model was less likely to guess, over-approve, or act confident without the Knowldege Base's support. Llama ended up performing at 41.6% accuracy, so that became the baseline.

In PracticeThe model gave me a draft that said “reject” to almost everything. I still had to read the ticket, search the handbook, cross-check the account, and essentiall write the reply myself. R0 added an extra step (reading the AI response) instead of removing one.

How the reviewer’s job changed~4 min → ~3 min

Before Relay

Read ticketSearch handbookCross-reference policyDraft by handSend

~4 min

R0 · prompt only

Read ticketRead AI draftVerify policy myselfHeavy editSend

~3 min

What the numbers did

4/40safe approved

41.6%accuracy on 150-ticket set

Refused almost everything. Safe-looking, useless.

What the model saw

SYSTEM
You review NimbusMail support tickets.
Decide: approve, edit, escalate, or reject.

USER
"Refund my annual plan, cancelled yesterday."

R1 · grounding

For R1, I introduced the chunked support policy and KB straight into the prompt so the model could map decisions between the corpus and the ticket. Approvals improved but I was suspicious if the model was really using the retrieved evidence, so I ran a random-evidence control test and saw approval drop by 10.7%, this was enough to prove that the chunks were being used as evidence but not enough by themselves.

In PracticeAI Drafts were now accompanied by relevant policy chunks pinned next to it. I stopped opening the handbook and cross-checked only to verify the citations.

How the reviewer’s job changed~3 min → ~2 min

R0 · prompt only

Read ticketRead AI draftVerify policy myselfHeavy editSend

~3 min

R1 · grounding

Read ticketRead draft + citationsSpot-check citationLight editSend

~2 min

What the numbers did

37/40safe approved

66.0%accuracy on 150-ticket set

Approvals came back. Random-evidence control dropped 11 points.

What R1 added to the prompt

RETRIEVED
- Policy 4.2  Annual plans non-refundable after day 30.
- Policy 4.3  Pro-rated refunds available days 1-7.
- KB 12.1     Standard refund response template.

ACCOUNT
Signed up 2025-12-20. Day 92 since renewal.

R2 · guardrail

R2 is more of a safety-pipeline decision layer than an accuracy layer. It introduces a deterministic guardrail on top of R1 that checks and blocks approvals with bad citations, risky claims and policy gaps. This made the model over-correct and pull accuracy down.

I realised that none of these layers actually teach the model better judgement, despite modifying raw model behavior and adding edge-case handling.

In PracticeI saw fewer drafts and more escalation reasons pushed by the AI but the ones they did approve were well thought out. It made for a slightly fuller queue but calmer assessments.

How the reviewer’s job changed~2 min → ~1.5 min

R1 · grounding

Read ticketRead draft + citationsSpot-check citationLight editSend

~2 min

R2 · + guardrail

Read ticketGuardrail checkSpot-check citationLight edit or escalateSend

~1.5 min

What the numbers did

0/40safe approved

44.7%accuracy on 150-ticket set

Blocked everything, including the safe ones.

What the gate checked

gates:
  - require_at_least_one_citation
  - block_refund_if_account_age > 30_days
  - forbid_unconditional_promise_keywords
  - escalate_if_confidence < 0.7

on_fail → route to "specialist_escalation"

R3 · LoRA

The layers built so far helped but optimizing these further would require more prompt work, heavier retrieval and overall more time that may or may not net accuracy improvements. At this point, it made sense to start looking into teaching the model about our task.

So after evaluating options, I settled on using a LoRA adaptor, which could help modify model behavior at a fraction of the cost of a full fine-tuning or reinforcement style approach. I saw significant accuracy improvements across all models (averaging 93%) but LLama stood out with it's low latency and parsing failures between Nemotron and Qwen models, settling it as the best candidate to scaling it across more complex reviews in the future without breaking the bank.

In PracticeThe draft, the citations, and the risk flags all arrived pre-filled, in roughly the shape the I would have wanted in the beginning and a ticket that would have to taken four minutes started taking under one.

How the reviewer’s job changed~1.5 min → ~45 sec

R2 · + guardrail

Read ticketGuardrail checkSpot-check citationLight edit or escalateSend

~1.5 min

R3 · trained

Read ticketScan decision panelConfirm or editSend

~45 sec

What the numbers did

40/40safe approved

95.3%accuracy on 150-ticket set

All 40 safe approved, zero unsafe, clean JSON.

What R3 was trained to produce, per ticket

{
  "decision": "escalate",
  "reply": "I can't approve a refund here. Annual plans...",
  "citations": [
    { "id": "P4.2", "span": "non-refundable after day 30" }
  ],
  "claim_support": "policy_4_2_blocks_refund",
  "risks": ["account_age_outside_window"]
}

R3 · Llama 3.1 8B · 8-epoch LoRA · dev50 per-epoch eval

Accuracy plateaus by epoch 2. Unsafe approvals creep up after.

Held-out accuracy (dev50)Unsafe approvals (dev50)

What each build cost and where it landed

v2-balanced 150-case eval · Llama 3.1 8B · measured live tokens + projected ops tax

What each build cost and where it landed.

Quick read

Accuracy vs. system weight

$ low live run$$ higher live run$* low live run + one-time train

Full breakdown

v2-balanced 150-case eval · Llama 3.1 8B · measured live tokens + projected ops tax

Build

Decision accuracy

Live run

Ops complexity

Iteration speed

Consistency

What it buys

R0prompt only

41.6%

~$0.14 / 1k tickets · 613 tok/ticket · p50 8.7s

low

fastest

low

Clear baseline. Almost all the cleanup stays with the reviewer.

R1+ grounding

66.0%

~$0.45 / 1k tickets · 2.7k tok/ticket · p50 13.5s

medium

fast

medium

Facts arrive on time. Easy to keep iterating before training.

R2+ gate

44.7%

R1 model bill + local verifier pass · p50 13.5s

high

medium

high

Most deterministic row. Useful when you need a hard backstop.

R3+ LoRA

95.3%

~$0.17 / 1k tickets · 851 tok/ticket · p50 15.1s

medium-high

slowest

high

Best judgement row once the task shape has settled.

Measured: locked-eval accuracy, p50 latency, and Llama token usage. Estimated: the upkeep loop that starts to dominate at scale.

Tokenomics has changed a lot over the last few months. It's worth looking at how teams like Decagon, Intercom, Abridge, Chroma, Notion, and Harvey understand LLM product costs today. New MLOps tools have also brought the ability to replay failures, label real interactions, update test sets and evals efficiently, keep retrieval up to date, and training quickly again on the product's feedback. Langchain and has done some interesting work in this space.

The data points that emerged from these tools changed how I analysed Relay's layer costs. R0 was cheap, but mostly because it pushed work back onto the reviewer. R1 and R2 added retrieval and rule maintenance. R3 added a one-off training cost, but it was also the first layer that seemed the most malleable around the product. To sum up Relay's state: A process heavy or more accurate system does not necessarily bring us the best user experience if we can not act fast enough on our feedback loop.

03 · Decision: Iterate

What teams need to consider when building their own stack.

The eval is the spec. Model choice matters less once the eval can describe what good looks like.
Someone owns the test set. Cases drift, labels rot, policy updates leak through. Assign one human and put it in the sprint.
Five-minute replay or it's not production-ready. Trace every decision: latency, confidence, prompt version, tool calls. If a failed run can't be replayed in five minutes, it's not shippable.
Prompt versions need intent, not just diffs. “Tightened refund language to fix Policy 4.2 false negatives” beats “updated prompt” when a regression shows up three versions later.

I've spent some time building Relay from scratch and writing this article along the way, so any feedback would be appreciated. You can find the code here