AI Call Center
A production-grade voice agent platform that handles inbound customer calls end-to-end — authentication, knowledge lookup, tool execution, and post-call summaries — while keeping each caller's context strictly isolated.
The problem
A mid-market contact center wants to automate the first 90 seconds of every inbound call: verify the caller, triage the intent, answer the simple questions, and hand the complex ones off to a human with a clean summary. Traditional IVR trees collapse under ambiguity, and a single monolithic LLM call cannot simultaneously reason across billing APIs, policy PDFs, and the caller's ticket history without either hallucinating or taking fifteen seconds to respond.
What the business actually needs is an agent that behaves like a well-trained customer service representative (CSR): it listens, decides which system to query, checks its own answer, and only speaks when it is sure. It must do this inside a 300 ms conversational budget, across thousands of concurrent calls, without ever mixing one caller's data into another's answer.
Why these patterns
Agentic RAG carries the reasoning load. A classical RAG pipeline would embed the caller's question, pull the top-k chunks, and generate — fine for FAQ bots, fatal the moment a question spans a policy document and the caller's account. The agentic variant promotes the retrieval step into a tool call that the planner can repeat, reformulate, or abandon based on a validator's judgment of the draft answer. When a caller asks "why was I charged twice last month," the agent retrieves billing records, cross-checks the refund policy, notices a missing date, re-retrieves, and only then responds.
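The retrieve-validate-reformulate loop can be sketched as follows. This is a minimal illustration, not the production pipeline: `retrieve`, `generate`, and `validate` are stand-ins for the real embedding search, LLM call, and validator model, and the reformulation heuristic is invented for the example.

```python
# Hypothetical sketch: retrieval is a tool the planner may repeat,
# reformulate, or abandon based on a validator's judgment of the draft.

def retrieve(query, index):
    """Toy retrieval: return documents sharing any word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in index if terms & set(doc.lower().split())]

def generate(question, evidence):
    """Stand-in for the LLM call: join the evidence into a draft answer."""
    return f"Answer to {question!r} based on: " + " | ".join(evidence)

def validate(draft, evidence):
    """Stand-in validator: reject drafts with no supporting evidence."""
    return len(evidence) > 0

def agentic_rag(question, index, max_iterations=3):
    query = question
    for _ in range(max_iterations):
        evidence = retrieve(query, index)
        draft = generate(question, evidence)
        if validate(draft, evidence):
            return draft
        # Validator rejected the draft: reformulate and retrieve again.
        query = question + " refund policy billing records"
    return "ESCALATE"  # abandon the loop and hand off to a human

index = ["billing records show a duplicate charge on march 3",
         "refund policy allows reversal within 30 days"]
print(agentic_rag("why was I charged twice", index))
```

In this toy run the first retrieval comes back empty, the validator rejects the draft, and only the reformulated query finds the billing and policy documents — the same repeat-and-refine behavior described above.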
MCP gateway is what makes the tool surface governable at call-center scale. The agent has to reach a dozen systems — telephony state, CRM, billing, knowledge base, escalation queue — and each comes with its own auth, rate limits, and audit requirements. Exposing those directly to the model is a compliance nightmare. Routing every tool call through an MCP gateway gives the security team one place to enforce caller-scoped credentials, one place to rate-limit runaway agents, and one log to replay when a regulator asks what the bot said on a specific call.
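The single-choke-point idea looks roughly like this. It is a sketch, not the MCP protocol itself: the class, tool names, scope strings, and rate-limit policy are all assumptions chosen to show scoped credentials, rate limiting, and auditing living in one place.

```python
# Every tool call passes through one gateway that checks caller-scoped
# credentials, applies a rate limit, and appends an audit record.

import time
from collections import defaultdict

class ToolGateway:
    def __init__(self, tools, rate_limit=5):
        self.tools = tools            # name -> (callable, required scope)
        self.rate_limit = rate_limit  # max calls per caller
        self.calls = defaultdict(int)
        self.audit_log = []           # the single replayable log stream

    def invoke(self, caller, scopes, tool_name, **kwargs):
        fn, required = self.tools[tool_name]
        if required not in scopes:
            self._audit(caller, tool_name, "DENIED: missing scope")
            raise PermissionError(f"{caller} lacks scope {required!r}")
        if self.calls[caller] >= self.rate_limit:
            self._audit(caller, tool_name, "DENIED: rate limit")
            raise RuntimeError("rate limit exceeded")
        self.calls[caller] += 1
        result = fn(**kwargs)
        self._audit(caller, tool_name, "OK")
        return result

    def _audit(self, caller, tool, outcome):
        self.audit_log.append({"ts": time.time(), "caller": caller,
                               "tool": tool, "outcome": outcome})

def lookup_invoice(invoice_id):
    return {"invoice": invoice_id, "amount": 42.00}

gw = ToolGateway({"billing.lookup": (lookup_invoice, "billing:read")})
print(gw.invoke("call-123", {"billing:read"}, "billing.lookup",
                invoice_id="INV-9"))
```

Note that denials are audited too — the replay log has to show what the agent tried, not just what succeeded.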
Tripartite memory is the quiet backbone. The agent carries three memory layers at once: a working memory scoped to the current call, a semantic memory of product and policy facts shared across all calls, and a procedural memory of conversation patterns the organization has learned to prefer. The split is not architectural ornament — it is what prevents the system from ever answering caller A using caller B's context.
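The three layers and their lifetimes can be sketched as a small class. The class and field names are illustrative assumptions; the point is the scoping: working memory is per-instance and dropped at call end, while the semantic and procedural stores are shared.

```python
# Sketch of the tripartite split: three stores, three scopes, three lifetimes.

class TripartiteMemory:
    # Shared across all calls: product and policy facts.
    semantic = {"refund_window_days": 30}
    # Shared across all calls: learned conversation patterns.
    procedural = ["verify identity before discussing billing"]

    def __init__(self, call_id):
        self.call_id = call_id
        self.working = {}  # strictly per-call; discarded at call end

    def remember(self, key, value):
        self.working[key] = value

    def end_call(self):
        # Working memory is dropped, never merged into the shared stores.
        self.working.clear()

call_a = TripartiteMemory("call-A")
call_b = TripartiteMemory("call-B")
call_a.remember("account_summary", "caller A's balance: $12")

# Caller B's working memory cannot see caller A's context...
print("account_summary" in call_b.working)    # False
# ...but both calls share the same semantic facts.
print(call_b.semantic["refund_window_days"])  # 30
```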
What breaks without proper memory isolation
If you build this system with a single shared memory store keyed only by session id, you will eventually leak. Two common failure modes:
The first is prompt-cache contamination. A naive implementation caches the system prompt with recent retrievals prepended for latency. Under load, a cache key collision — or a stale session id reused by the telephony layer — serves one caller's account summary to another. The tripartite split forces working memory to live in a per-call namespace that is provably discarded at call end, so a cache hit across callers is structurally impossible.
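One way to make the cross-caller cache hit structurally impossible is to namespace the cache by call id rather than keying on prompt content alone. The sketch below assumes an in-memory cache; the class and method names are invented for illustration.

```python
# Per-call cache namespace: identical prompts from different calls live in
# different namespaces, and the whole namespace is discarded at call end.

class PromptCache:
    def __init__(self):
        self._by_call = {}  # call_id -> {prompt: cached completion}

    def get(self, call_id, prompt):
        return self._by_call.get(call_id, {}).get(prompt)

    def put(self, call_id, prompt, value):
        self._by_call.setdefault(call_id, {})[prompt] = value

    def end_call(self, call_id):
        # Provably discard the entire namespace at call end.
        self._by_call.pop(call_id, None)

cache = PromptCache()
cache.put("call-A", "system prompt + retrievals", "caller A's summary")

# An identical prompt from another call can never hit caller A's entry.
print(cache.get("call-B", "system prompt + retrievals"))  # None

cache.end_call("call-A")
print(cache.get("call-A", "system prompt + retrievals"))  # None after call end
```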
The second is retrieval bleed. When the agent re-queries mid-call to refine an answer, it often includes prior turn summaries in the retrieval query. If those summaries accidentally land in the shared semantic index instead of the call-scoped working memory, every subsequent caller sees traces of the previous conversation in their retrieval results. Keeping working memory strictly out of the shared vector store — and letting only the procedural memory write back post-call, after redaction — closes that path.
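The write path that closes the retrieval-bleed hole can be sketched as follows: mid-call turn summaries accumulate only in the call-scoped memory, and the one permitted write-back to a shared store happens post-call, after redaction. The stores, class, and redaction rule here are simplified assumptions.

```python
# Turn summaries never touch the shared index mid-call; only a redacted
# procedural record is written back after the call ends.

import re

shared_semantic_index = ["refund policy allows reversal within 30 days"]
shared_procedural_store = []

class CallMemory:
    def __init__(self):
        self.turn_summaries = []  # call-scoped working memory

    def add_turn(self, summary):
        # Mid-call writes go only here, never to the shared vector store.
        self.turn_summaries.append(summary)

    def close(self):
        # Post-call: redact account specifics, then write back the
        # procedural record and discard working memory.
        pattern = re.compile(r"ACCT-\d+")
        for s in self.turn_summaries:
            shared_procedural_store.append(pattern.sub("[REDACTED]", s))
        self.turn_summaries.clear()

mem = CallMemory()
mem.add_turn("caller asked about duplicate charge on ACCT-8841")
print(len(shared_semantic_index))   # 1: shared index untouched mid-call
mem.close()
print(shared_procedural_store[0])   # account id redacted before write-back
```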
Both failure modes are invisible in development and catastrophic in production. The patterns exist because the contact-center industry has already paid the cost of learning this lesson.
Operational considerations
Running this stack in production is less about the models and more about the tail. A few practical notes.
Latency budgets are asymmetric. Humans tolerate a 600 ms pause after they stop speaking, but a 600 ms pause while the agent is speaking sounds like the line dropped. Budget the agentic RAG loop for the former, and stream tokens aggressively for the latter. If the validator is about to reject a draft answer, kill the TTS mid-sentence rather than letting a wrong sentence finish.
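The mid-sentence kill can be sketched as a streaming loop that consults the validator before each token is released to the TTS. This is a toy: the token list, validator rule, and cut marker are all assumptions standing in for a real streaming pipeline.

```python
# Tokens stream to the TTS as they arrive; a validator rejection cancels
# mid-sentence rather than letting a wrong sentence finish.

def stream_answer(tokens, validator):
    spoken = []
    for token in tokens:
        if not validator(spoken + [token]):
            spoken.append("--")  # cut the line mid-sentence
            break
        spoken.append(token)     # in production: send this token to TTS now
    return " ".join(spoken)

# Toy validator: rejects once the draft asserts an unverified refund amount.
def validator(draft):
    return "$50" not in draft

print(stream_answer(["your", "refund", "of", "$50", "is", "approved"],
                    validator))
# → "your refund of --"
```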
The MCP gateway is your observability seam. Put structured logs, trace ids, and caller-scoped audit markers at the gateway, not inside each tool. When an ops engineer needs to reconstruct a bad call at 3 a.m., they should be able to replay every tool call from a single log stream without touching the underlying CRM or billing system.
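The 3 a.m. replay scenario reduces to filtering one log stream by call id. The sketch below assumes the gateway emits one JSON line per tool call with `trace_id` and `call_id` fields; the field names and records are illustrative.

```python
# Reconstruct one call's tool activity from the gateway's single log
# stream, without touching the underlying CRM or billing system.

import json

log_stream = [
    '{"trace_id": "t-77", "call_id": "call-123", "tool": "crm.lookup", "status": "ok"}',
    '{"trace_id": "t-78", "call_id": "call-456", "tool": "billing.lookup", "status": "ok"}',
    '{"trace_id": "t-79", "call_id": "call-123", "tool": "kb.search", "status": "error"}',
]

def replay(call_id, stream):
    """Return every tool-call record belonging to one call, in order."""
    return [record for record in map(json.loads, stream)
            if record["call_id"] == call_id]

for record in replay("call-123", log_stream):
    print(record["trace_id"], record["tool"], record["status"])
```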
Cost scales with silence, not with speech. The agent is cheapest when it is confident — one retrieval, one generation, done. It is most expensive when the caller is vague, because the planner loops. Instrument the average number of agentic RAG iterations per call and alert when it drifts upward; it is almost always the leading indicator of a knowledge-base regression.
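A minimal version of that instrumentation is a rolling mean with a threshold. The window size, baseline, and alert multiplier below are invented for illustration; real deployments would feed this from the metrics pipeline.

```python
# Track the rolling mean of agentic RAG iterations per call and alert
# when it drifts above a multiple of the expected baseline.

from collections import deque

class IterationDriftMonitor:
    def __init__(self, baseline, window=100, threshold=1.5):
        self.baseline = baseline        # expected iterations per call
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, iterations):
        self.window.append(iterations)

    def alerting(self):
        if not self.window:
            return False
        mean = sum(self.window) / len(self.window)
        return mean > self.baseline * self.threshold

mon = IterationDriftMonitor(baseline=1.2, window=10)
for _ in range(10):
    mon.record(1)          # healthy calls: one retrieval each
print(mon.alerting())      # False
for _ in range(10):
    mon.record(4)          # knowledge-base regression: the planner loops
print(mon.alerting())      # True
```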
Plan for graceful handoff. The agent will fail. When it does, the tripartite working memory becomes the handoff payload: a structured summary the human agent reads in three seconds instead of thirty. Design that summary format early, because it is the interface between the automated layer and the human layer, and retrofitting it later is painful.
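A handoff payload along those lines might look like the sketch below. The schema is a hypothetical example of "design the summary format early" — the field names are assumptions, not a prescribed standard.

```python
# A fixed, compact handoff schema built from working memory: something a
# human agent can scan in three seconds.

from dataclasses import dataclass, field

@dataclass
class HandoffSummary:
    caller_verified: bool
    intent: str
    steps_taken: list = field(default_factory=list)
    blocker: str = ""

    def render(self):
        return "\n".join([
            f"VERIFIED: {self.caller_verified}",
            f"INTENT: {self.intent}",
            "STEPS: " + "; ".join(self.steps_taken),
            f"BLOCKER: {self.blocker}",
        ])

summary = HandoffSummary(
    caller_verified=True,
    intent="duplicate charge dispute",
    steps_taken=["pulled billing records", "checked refund policy"],
    blocker="refund amount exceeds automated limit",
)
print(summary.render())
```

Keeping the schema fixed is the point: the human side of the interface learns to read it at a glance, which is exactly what breaks if the format is retrofitted per team later.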