Notes from four years and counting as Staff AI Engineer at a stealth-mode financial services startup, building a member-facing AI product where every response has fiduciary weight. Stack details are deliberately specific. Company and product names are deliberately not.
Surendra Singh · Sep 2021 – present · SF Bay Area · hybrid.
A member opens the app and asks a question that, in the wrong hands, becomes a regulatory letter — "should I refinance my mortgage," "rebalance my 401(k) toward tech," "is my emergency fund sized right at this income." The product has to answer in plain English, in real time, across concurrent sessions, with a written record that holds up six months later in a compliance review.
The architecture, top-to-bottom: a hybrid routing layer in front of per-intent specialist agents, entitlement-gated retrieval from internal data sources, guardrails on the input and output paths, and an evaluation harness wired into every release.
The interesting parts are not the boxes — they're the seams between them.
The first version of the routing layer was a pure-LLM dispatcher: prompt the model with the user's message and the catalog of specialists, ask which one to call. It worked for demos and broke quietly in production. Two specific failures, both predictable in hindsight: it misrouted ambiguous multi-intent queries, and its decisions were non-deterministic, so the same message could route differently from one run to the next, which is unusable when a routing decision has to be reproduced in audit.
The replacement is a hybrid that puts the deterministic stuff first:
```python
# 1) per-intent classifiers (DeBERTa-v3, fine-tuned on labeled traffic)
intent_scores = {
    intent: classifier.predict_proba(message, session_context)
    for intent, classifier in intent_classifiers.items()
}

# 2) session-level Transformer embedding of the last N turns
context_vec = session_encoder.encode(history[-N:])
intent_scores = rerank_with_context(intent_scores, context_vec)

# 3) LLM fallback ONLY when no intent clears the calibrated threshold
if max(intent_scores.values()) < CONFIDENCE_THRESHOLD:
    return llm_route(message, history)
return max(intent_scores, key=intent_scores.get)  # argmax over intents
```
The classifier is one model per intent, not one multi-class model. Per-intent gives you a calibrated probability per specialist, makes multi-intent dispatch trivial (route to all that clear threshold), and lets you ship a new specialist by training one classifier instead of retraining the whole router. The downside is more models to maintain. Worth it.
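A sketch of how per-intent scores make multi-intent dispatch trivial; the registry, names, and threshold here are hypothetical, not the production ones:

```python
from typing import Callable

Specialist = Callable[[str, list[str]], str]

# Hypothetical registry: intent name -> specialist entry point.
specialists: dict[str, Specialist] = {
    "budgeting": lambda msg, hist: f"[budgeting] {msg}",
    "retirement": lambda msg, hist: f"[retirement] {msg}",
}

def dispatch(message: str, history: list[str],
             intent_scores: dict[str, float],
             threshold: float = 0.85) -> dict[str, str]:
    """Invoke every specialist whose calibrated probability clears the threshold."""
    selected = [i for i, p in intent_scores.items() if p >= threshold]
    if not selected:
        return {}  # caller falls back to the LLM router
    return {i: specialists[i](message, history) for i in selected}
```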
The result: routing latency dropped substantially, accuracy improved on ambiguous multi-intent queries, and — the part that matters in regulated environments — routing decisions became deterministic enough to be reproducible in audit. Same input, same routing decision, every time, with a stored confidence vector.
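A sketch of what gets persisted per routing decision so it can be replayed later; the field names are illustrative, not the production schema:

```python
import hashlib
import time

def routing_record(message: str, intent_scores: dict[str, float],
                   decision: str, router_version: str) -> dict:
    """Audit record for one routing decision; same inputs reproduce the same record."""
    return {
        "ts": time.time(),
        "message_sha256": hashlib.sha256(message.encode()).hexdigest(),
        "router_version": router_version,
        "confidence_vector": intent_scores,  # full per-intent calibrated probabilities
        "decision": decision,
    }
```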
Evaluation infrastructure is where most LLM products go to die. The product reads well in demos, ships with vibes, and the first sign of trouble is a member complaint or a compliance flag. Built the eval stack from the ground up because the off-the-shelf options were either too generic (BLEU, ROUGE — useless here) or required sending production data to a third-party SaaS (non-starter for compliance).
Single-message eval is necessary and not sufficient. The actual product is a conversation — five, ten, twenty turns — and the failures that bite are across-turn failures: the system forgets what the user said three turns ago, contradicts an earlier number, switches tone mid-conversation. Static benchmarks miss all of that.
Built a persona-driven simulator that exercises the full N × M matrix of intents × user personas. Each persona has a profile (income band, life stage, risk appetite, communication style) and a script template. The simulator runs a multi-turn dialogue, the production system responds, the eval scores the entire trajectory.
```python
for persona in personas:
    for intent in intents:
        trajectory = simulator.run(persona, intent, max_turns=12)
        scores = eval_suite.score_trajectory(trajectory)
        eval_db.log(persona, intent, scores, git_sha=CURRENT_SHA)
```
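What a persona entry can look like, as a rough sketch; the dataclass and example values are invented, only the field list mirrors the profile attributes above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    name: str
    income_band: str          # e.g. "75k-100k"
    life_stage: str           # e.g. "new parent", "nearing retirement"
    risk_appetite: str        # "conservative" | "moderate" | "aggressive"
    communication_style: str  # "terse", "chatty", "skeptical", ...
    script_template: str      # seed prompts the simulator expands into turns

personas = [
    Persona("cautious_new_parent", "75k-100k", "new parent",
            "conservative", "skeptical",
            "Asks about emergency fund sizing, pushes back on jargon."),
]
```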
Every PR runs the eval. Every release is tagged with the eval results. Regressions on any of the 18 metrics block merge by default. Self-hosted on infrastructure inside the compliance perimeter — no production conversations leave the building, ever.
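One way to wire the merge gate, as a minimal sketch; the metric names and numbers below are illustrative, not the actual 18-metric suite:

```python
def regressions(baseline: dict[str, float],
                candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Metrics where the candidate scores worse than the baseline (higher is better)."""
    return [m for m, base in baseline.items()
            if candidate.get(m, float("-inf")) < base - tolerance]

# Illustrative values only.
baseline_scores = {"grounding": 0.94, "tone": 0.91, "disclosure_coverage": 1.00}
candidate_scores = {"grounding": 0.95, "tone": 0.89, "disclosure_coverage": 1.00}

failed = regressions(baseline_scores, candidate_scores)
if failed:
    raise SystemExit(f"eval regression on: {', '.join(sorted(failed))}")  # blocks the merge
```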
"Without an eval harness, every release is a guess. With one, every release is a measurement. The two products diverge fast."
"Guardrails" in most LLM stacks means "a regex and a prayer." In a fiduciary context that does not survive contact with a regulator. Designed three layers, each with a specific failure mode it is built to catch:
Before the input reaches any specialist: PII detection and redaction (the user's account numbers do not need to be in the LLM context window), jailbreak/prompt-injection detection (a member trying — usually playfully — to talk the system into giving advice it should not), and topic boundary checks (if the question is medical or legal, decline cleanly with an escalation path, do not improvise).
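A compressed sketch of the input layer's shape; the detector stand-ins below are trivial placeholders for what are actually models and rule sets:

```python
import re
from dataclasses import dataclass

@dataclass
class InputVerdict:
    redacted_message: str
    blocked: bool
    reason: str | None = None

# Stand-ins for the real detectors, which are models and rule sets, not regexes.
def redact_pii(text: str) -> str:
    return re.sub(r"\b\d{8,}\b", "[REDACTED]", text)  # e.g. long account numbers

def detect_prompt_injection(text: str) -> bool:
    return "ignore your instructions" in text.lower()

def out_of_scope(text: str) -> bool:
    return any(t in text.lower() for t in ("diagnosis", "lawsuit"))

def input_guardrails(message: str) -> InputVerdict:
    """PII redaction first, then injection detection, then topic boundary checks."""
    msg = redact_pii(message)
    if detect_prompt_injection(msg):
        return InputVerdict(msg, blocked=True, reason="prompt_injection")
    if out_of_scope(msg):
        return InputVerdict(msg, blocked=True, reason="escalate_out_of_scope")
    return InputVerdict(msg, blocked=False)
```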
Specialist agents fetch context from internal data sources. What they can fetch is gated by the user's entitlement and the query context. A budgeting specialist asked a credit question does not get to read credit data — the wrong specialist is itself a failure mode, and the architecture refuses rather than improvises.
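A sketch of the retrieval gate, assuming a hypothetical entitlement map from specialist to the data domains it may read; in practice the policy lives in an entitlement service, not a dict:

```python
# Hypothetical policy: which data domains each specialist may read.
ENTITLEMENTS: dict[str, set[str]] = {
    "budgeting": {"transactions", "cash_flow"},
    "credit": {"credit_report", "transactions"},
}

class EntitlementError(PermissionError):
    """Raised when a specialist requests data outside its entitlement."""

def fetch_context(specialist: str, domain: str, user_id: str, store) -> list:
    """Refuse, rather than improvise, when the request falls outside policy."""
    allowed = ENTITLEMENTS.get(specialist, set())
    if domain not in allowed:
        raise EntitlementError(f"{specialist} is not entitled to read {domain}")
    return store.query(domain=domain, user_id=user_id)
```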
Before output reaches the user, a final compliance check: forbidden phrasings (anything resembling unlicensed advice), required disclosures (fact-specific, regulator-approved), source attribution coverage (every numeric claim links to its source), and tone calibration (the system is not allowed to be overconfident on questions that warrant nuance). Full audit trail — message in, every guardrail decision, every specialist invocation, every retrieval, the final response — stored, indexed, queryable. Reproducibility is a compliance property.
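The output gate, reduced to named predicates in a sketch; the real checks are classifiers, rule sets, and regulator-approved templates rather than the substring matches standing in for them here:

```python
from dataclasses import dataclass, field

# Stand-ins for the real checks.
def forbidden_phrasing(text: str) -> bool:
    return "you should definitely" in text.lower()

def has_required_disclosures(text: str) -> bool:
    return "[disclosure]" in text.lower()

def overconfident_tone(text: str) -> bool:
    return "guaranteed" in text.lower()

@dataclass
class OutputVerdict:
    approved: bool
    failures: list[str] = field(default_factory=list)

def output_compliance(response: str, numeric_claims: list[dict]) -> OutputVerdict:
    """Final gate before the member sees anything; any failure blocks the response."""
    failures = []
    if forbidden_phrasing(response):
        failures.append("forbidden_phrasing")
    if not has_required_disclosures(response):
        failures.append("missing_disclosure")
    if any(c.get("source") is None for c in numeric_claims):
        failures.append("unattributed_numeric_claim")
    if overconfident_tone(response):
        failures.append("overconfident_tone")
    return OutputVerdict(approved=not failures, failures=failures)
```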
The guardrail layer is the most expensive thing to retrofit. Building it after the fact means tearing apart the response composition path and rebuilding it with the right hooks. We built it first, before the specialists shipped to a single member. Every "let's get the demo working and add safety later" plan I have ever seen has cost more in rework than the time it saved up front.
The single largest architectural decision of the project was non-technical: compliance and legal in the design conversations, not the review meetings. Most AI products invert this — build first, present to compliance, get a list of things to fix, ship a patched version. That cycle is expensive, demoralizing, and produces systems that pass review by accident rather than by design.
The inversion: compliance partners reviewed the routing taxonomy before classifiers were trained. Legal reviewed the disclosure language before specialists were prompted with it. The auditing schema was a joint document. The result is an AI system that survives regulatory review rather than requiring post-hoc patching.
"The thing that makes a regulated product different is not the technology. It's whose name is on the architecture diagram."