Notes from four years and counting as Staff AI Engineer at a stealth-mode financial services startup, building a member-facing AI product where every response has fiduciary weight. Stack details are deliberately specific. Company and product names are deliberately not.
Surendra Singh · Sep 2021 – present · SF Bay Area · hybrid.
A member opens the app and asks a question that, in the wrong hands, becomes a regulatory letter — "should I refinance my mortgage," "rebalance my 401(k) toward tech," "is my emergency fund sized right at this income." The product has to answer in plain English, in real time, across concurrent sessions, with a written record that holds up six months later in a compliance review.
The architecture, top-to-bottom: a hybrid routing layer in front of per-intent specialist agents, entitlement-gated retrieval from internal data sources, guardrails on the input and output paths, and an evaluation harness wired into every release.
The interesting parts are not the boxes — they're the seams between them.
The first version of the routing layer was a pure-LLM dispatcher: prompt the model with the user's message and the catalog of specialists, ask which one to call. It worked for demos and broke quietly in production. Two specific failures, both predictable in hindsight: it misrouted ambiguous multi-intent queries, and its decisions were non-deterministic, so the same message could route differently from one run to the next, which is unusable when a routing decision has to be reproduced in audit.
The replacement is a hybrid that puts the deterministic stuff first:
```python
# 1) per-intent classifiers (DeBERTa-v3, fine-tuned on labeled traffic)
intent_scores = {
    intent: classifier.predict_proba(message, session_context)
    for intent, classifier in intent_classifiers.items()
}

# 2) session-level Transformer embedding of the last N turns
context_vec = session_encoder.encode(history[-N:])
intent_scores = rerank_with_context(intent_scores, context_vec)

# 3) LLM fallback ONLY when no intent clears the calibrated threshold
if max(intent_scores.values()) < CONFIDENCE_THRESHOLD:
    return llm_route(message, history)
return max(intent_scores, key=intent_scores.get)  # argmax over intents
```
The classifier is one model per intent, not one multi-class model. Per-intent gives you a calibrated probability per specialist, makes multi-intent dispatch trivial (route to all that clear threshold), and lets you ship a new specialist by training one classifier instead of retraining the whole router. The downside is more models to maintain. Worth it.
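A sketch of how per-intent scores make multi-intent dispatch trivial; the registry, names, and threshold here are hypothetical, not the production ones:

```python
from typing import Callable

Specialist = Callable[[str, list[str]], str]

# Hypothetical registry: intent name -> specialist entry point.
specialists: dict[str, Specialist] = {
    "budgeting": lambda msg, hist: f"[budgeting] {msg}",
    "retirement": lambda msg, hist: f"[retirement] {msg}",
}

def dispatch(message: str, history: list[str],
             intent_scores: dict[str, float],
             threshold: float = 0.85) -> dict[str, str]:
    """Invoke every specialist whose calibrated probability clears the threshold."""
    selected = [i for i, p in intent_scores.items() if p >= threshold]
    if not selected:
        return {}  # caller falls back to the LLM router
    return {i: specialists[i](message, history) for i in selected}
```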
The result: routing latency dropped substantially, accuracy improved on ambiguous multi-intent queries, and — the part that matters in regulated environments — routing decisions became deterministic enough to be reproducible in audit. Same input, same routing decision, every time, with a stored confidence vector.
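A sketch of what gets persisted per routing decision so it can be replayed later; the field names are illustrative, not the production schema:

```python
import hashlib
import time

def routing_record(message: str, intent_scores: dict[str, float],
                   decision: str, router_version: str) -> dict:
    """Audit record for one routing decision; same inputs reproduce the same record."""
    return {
        "ts": time.time(),
        "message_sha256": hashlib.sha256(message.encode()).hexdigest(),
        "router_version": router_version,
        "confidence_vector": intent_scores,  # full per-intent calibrated probabilities
        "decision": decision,
    }
```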
Evaluation infrastructure is where most LLM products go to die. The product reads well in demos, ships with vibes, and the first sign of trouble is a member complaint or a compliance flag. Built the eval stack from the ground up because the off-the-shelf options were either too generic (BLEU, ROUGE — useless here) or required sending production data to a third-party SaaS (non-starter for compliance).
Single-message eval is necessary and not sufficient. The actual product is a conversation — five, ten, twenty turns — and the failures that bite are across-turn failures: the system forgets what the user said three turns ago, contradicts an earlier number, switches tone mid-conversation. Static benchmarks miss all of that.
Built a persona-driven simulator that exercises the full N × M matrix of intents × user personas. Each persona has a profile (income band, life stage, risk appetite, communication style) and a script template. The simulator runs a multi-turn dialogue, the production system responds, the eval scores the entire trajectory.
```python
for persona in personas:
    for intent in intents:
        trajectory = simulator.run(persona, intent, max_turns=12)
        scores = eval_suite.score_trajectory(trajectory)
        eval_db.log(persona, intent, scores, git_sha=CURRENT_SHA)
```
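What a persona entry can look like, as a rough sketch; the dataclass and example values are invented, only the field list mirrors the profile attributes above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    name: str
    income_band: str          # e.g. "75k-100k"
    life_stage: str           # e.g. "new parent", "nearing retirement"
    risk_appetite: str        # "conservative" | "moderate" | "aggressive"
    communication_style: str  # "terse", "chatty", "skeptical", ...
    script_template: str      # seed prompts the simulator expands into turns

personas = [
    Persona("cautious_new_parent", "75k-100k", "new parent",
            "conservative", "skeptical",
            "Asks about emergency fund sizing, pushes back on jargon."),
]
```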
Every PR runs the eval. Every release is tagged with the eval results. Regressions on any of the 18 metrics block merge by default. Self-hosted on infrastructure inside the compliance perimeter — no production conversations leave the building, ever.
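One way to wire the merge gate, as a minimal sketch; the metric names and numbers below are illustrative, not the actual 18-metric suite:

```python
def regressions(baseline: dict[str, float],
                candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Metrics where the candidate scores worse than the baseline (higher is better)."""
    return [m for m, base in baseline.items()
            if candidate.get(m, float("-inf")) < base - tolerance]

# Illustrative values only.
baseline_scores = {"grounding": 0.94, "tone": 0.91, "disclosure_coverage": 1.00}
candidate_scores = {"grounding": 0.95, "tone": 0.89, "disclosure_coverage": 1.00}

failed = regressions(baseline_scores, candidate_scores)
if failed:
    raise SystemExit(f"eval regression on: {', '.join(sorted(failed))}")  # blocks the merge
```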
"Without an eval harness, every release is a guess. With one, every release is a measurement. The two products diverge fast."
"Guardrails" in most LLM stacks means "a regex and a prayer." In a fiduciary context that does not survive contact with a regulator. Designed three layers, each with a specific failure mode it is built to catch:
Before the input reaches any specialist: PII detection and redaction (the user's account numbers do not need to be in the LLM context window), jailbreak/prompt-injection detection (a member trying — usually playfully — to talk the system into giving advice it should not), and topic boundary checks (if the question is medical or legal, decline cleanly with an escalation path, do not improvise).
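A compressed sketch of the input layer's shape; the detector stand-ins below are trivial placeholders for what are actually models and rule sets:

```python
import re
from dataclasses import dataclass

@dataclass
class InputVerdict:
    redacted_message: str
    blocked: bool
    reason: str | None = None

# Stand-ins for the real detectors, which are models and rule sets, not regexes.
def redact_pii(text: str) -> str:
    return re.sub(r"\b\d{8,}\b", "[REDACTED]", text)  # e.g. long account numbers

def detect_prompt_injection(text: str) -> bool:
    return "ignore your instructions" in text.lower()

def out_of_scope(text: str) -> bool:
    return any(t in text.lower() for t in ("diagnosis", "lawsuit"))

def input_guardrails(message: str) -> InputVerdict:
    """PII redaction first, then injection detection, then topic boundary checks."""
    msg = redact_pii(message)
    if detect_prompt_injection(msg):
        return InputVerdict(msg, blocked=True, reason="prompt_injection")
    if out_of_scope(msg):
        return InputVerdict(msg, blocked=True, reason="escalate_out_of_scope")
    return InputVerdict(msg, blocked=False)
```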
Specialist agents fetch context from internal data sources. What they can fetch is gated by the user's entitlement and the query context. A budgeting specialist asked a credit question does not get to read credit data — the wrong specialist is itself a failure mode, and the architecture refuses rather than improvises.
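A sketch of the retrieval gate, assuming a hypothetical entitlement map from specialist to the data domains it may read; in practice the policy lives in an entitlement service, not a dict:

```python
# Hypothetical policy: which data domains each specialist may read.
ENTITLEMENTS: dict[str, set[str]] = {
    "budgeting": {"transactions", "cash_flow"},
    "credit": {"credit_report", "transactions"},
}

class EntitlementError(PermissionError):
    """Raised when a specialist requests data outside its entitlement."""

def fetch_context(specialist: str, domain: str, user_id: str, store) -> list:
    """Refuse, rather than improvise, when the request falls outside policy."""
    allowed = ENTITLEMENTS.get(specialist, set())
    if domain not in allowed:
        raise EntitlementError(f"{specialist} is not entitled to read {domain}")
    return store.query(domain=domain, user_id=user_id)
```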
Before output reaches the user, a final compliance check: forbidden phrasings (anything resembling unlicensed advice), required disclosures (fact-specific, regulator-approved), source attribution coverage (every numeric claim links to its source), and tone calibration (the system is not allowed to be overconfident on questions that warrant nuance). Full audit trail — message in, every guardrail decision, every specialist invocation, every retrieval, the final response — stored, indexed, queryable. Reproducibility is a compliance property.
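The output gate, reduced to named predicates in a sketch; the real checks are classifiers, rule sets, and regulator-approved templates rather than the substring matches standing in for them here:

```python
from dataclasses import dataclass, field

# Stand-ins for the real checks.
def forbidden_phrasing(text: str) -> bool:
    return "you should definitely" in text.lower()

def has_required_disclosures(text: str) -> bool:
    return "[disclosure]" in text.lower()

def overconfident_tone(text: str) -> bool:
    return "guaranteed" in text.lower()

@dataclass
class OutputVerdict:
    approved: bool
    failures: list[str] = field(default_factory=list)

def output_compliance(response: str, numeric_claims: list[dict]) -> OutputVerdict:
    """Final gate before the member sees anything; any failure blocks the response."""
    failures = []
    if forbidden_phrasing(response):
        failures.append("forbidden_phrasing")
    if not has_required_disclosures(response):
        failures.append("missing_disclosure")
    if any(c.get("source") is None for c in numeric_claims):
        failures.append("unattributed_numeric_claim")
    if overconfident_tone(response):
        failures.append("overconfident_tone")
    return OutputVerdict(approved=not failures, failures=failures)
```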
The guardrail layer is the most expensive thing to retrofit. Building it after the fact means tearing apart the response composition path and rebuilding it with the right hooks. We built it first, before the specialists shipped to a single member. Every "let's get the demo working and add safety later" plan I have ever seen has cost more in rework than the time it saved up front.
The single largest architectural decision of the project was non-technical: compliance and legal in the design conversations, not the review meetings. Most AI products invert this — build first, present to compliance, get a list of things to fix, ship a patched version. That cycle is expensive, demoralizing, and produces systems that pass review by accident rather than by design.
The inversion: compliance partners reviewed the routing taxonomy before classifiers were trained. Legal reviewed the disclosure language before specialists were prompted with it. The auditing schema was a joint document. The result is an AI system that survives regulatory review rather than requiring post-hoc patching.
"The thing that makes a regulated product different is not the technology. It's whose name is on the architecture diagram."