AI Call Center
A production-grade voice agent platform that handles inbound customer calls end-to-end — authentication, knowledge lookup, tool execution, and post-call summaries — while keeping each caller's context strictly isolated.
The problem
A mid-market contact center wants to automate the first 90 seconds of every inbound call: verify the caller, triage the intent, answer the simple questions, and hand the complex ones off to a human with a clean summary. Traditional IVR trees collapse under ambiguity, and a single monolithic LLM call cannot simultaneously reason across billing APIs, policy PDFs, and the caller's ticket history without either hallucinating or taking fifteen seconds to respond.
What the business actually needs is an agent that behaves like a well-trained customer service representative (CSR): it listens, decides which system to query, checks its own answer, and only speaks when it is sure. It must do this inside a 300 ms conversational budget, across thousands of concurrent calls, without ever mixing one caller's data into another's answer.
Why these patterns
Agentic RAG carries the reasoning load. A classical RAG pipeline would embed the caller's question, pull the top-k chunks, and generate — fine for FAQ bots, fatal the moment a question spans a policy document and the caller's account. The agentic variant promotes the retrieval step into a tool call that the planner can repeat, reformulate, or abandon based on a validator's judgment of the draft answer. When a caller asks "why was I charged twice last month," the agent retrieves billing records, cross-checks the refund policy, notices a missing date, re-retrieves, and only then responds.
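The retrieve-validate-reformulate loop can be sketched as follows. This is a minimal illustration, not the production pipeline: `retrieve`, `generate`, and `validate` are stand-ins for the real embedding search, LLM call, and validator model, and the reformulation heuristic is invented for the example.

```python
# Hypothetical sketch: retrieval is a tool the planner may repeat,
# reformulate, or abandon based on a validator's judgment of the draft.

def retrieve(query, index):
    """Toy retrieval: return documents sharing any word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in index if terms & set(doc.lower().split())]

def generate(question, evidence):
    """Stand-in for the LLM call: join the evidence into a draft answer."""
    return f"Answer to {question!r} based on: " + " | ".join(evidence)

def validate(draft, evidence):
    """Stand-in validator: reject drafts with no supporting evidence."""
    return len(evidence) > 0

def agentic_rag(question, index, max_iterations=3):
    query = question
    for _ in range(max_iterations):
        evidence = retrieve(query, index)
        draft = generate(question, evidence)
        if validate(draft, evidence):
            return draft
        # Validator rejected the draft: reformulate and retrieve again.
        query = question + " refund policy billing records"
    return "ESCALATE"  # abandon the loop and hand off to a human

index = ["billing records show a duplicate charge on march 3",
         "refund policy allows reversal within 30 days"]
print(agentic_rag("why was I charged twice", index))
```

In this toy run the first retrieval comes back empty, the validator rejects the draft, and only the reformulated query finds the billing and policy documents — the same repeat-and-refine behavior described above.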
MCP gateway is what makes the tool surface governable at call-center scale. The agent has to reach a dozen systems — telephony state, CRM, billing, knowledge base, escalation queue — and each comes with its own auth, rate limits, and audit requirements. Exposing those directly to the model is a compliance nightmare. Routing every tool call through an MCP gateway gives the security team one place to enforce caller-scoped credentials, one place to rate-limit runaway agents, and one log to replay when a regulator asks what the bot said on a specific call.
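The single-choke-point idea looks roughly like this. It is a sketch, not the MCP protocol itself: the class, tool names, scope strings, and rate-limit policy are all assumptions chosen to show scoped credentials, rate limiting, and auditing living in one place.

```python
# Every tool call passes through one gateway that checks caller-scoped
# credentials, applies a rate limit, and appends an audit record.

import time
from collections import defaultdict

class ToolGateway:
    def __init__(self, tools, rate_limit=5):
        self.tools = tools            # name -> (callable, required scope)
        self.rate_limit = rate_limit  # max calls per caller
        self.calls = defaultdict(int)
        self.audit_log = []           # the single replayable log stream

    def invoke(self, caller, scopes, tool_name, **kwargs):
        fn, required = self.tools[tool_name]
        if required not in scopes:
            self._audit(caller, tool_name, "DENIED: missing scope")
            raise PermissionError(f"{caller} lacks scope {required!r}")
        if self.calls[caller] >= self.rate_limit:
            self._audit(caller, tool_name, "DENIED: rate limit")
            raise RuntimeError("rate limit exceeded")
        self.calls[caller] += 1
        result = fn(**kwargs)
        self._audit(caller, tool_name, "OK")
        return result

    def _audit(self, caller, tool, outcome):
        self.audit_log.append({"ts": time.time(), "caller": caller,
                               "tool": tool, "outcome": outcome})

def lookup_invoice(invoice_id):
    return {"invoice": invoice_id, "amount": 42.00}

gw = ToolGateway({"billing.lookup": (lookup_invoice, "billing:read")})
print(gw.invoke("call-123", {"billing:read"}, "billing.lookup",
                invoice_id="INV-9"))
```

Note that denials are audited too — the replay log has to show what the agent tried, not just what succeeded.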
Tripartite memory is the quiet backbone. The agent carries three memory layers at once: a working memory scoped to the current call, a semantic memory of product and policy facts shared across all calls, and a procedural memory of conversation patterns the organization has learned to prefer. The split is not architectural ornament — it is what prevents the system from ever answering caller A using caller B's context.
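The three layers and their lifetimes can be sketched as a small class. The class and field names are illustrative assumptions; the point is the scoping: working memory is per-instance and dropped at call end, while the semantic and procedural stores are shared.

```python
# Sketch of the tripartite split: three stores, three scopes, three lifetimes.

class TripartiteMemory:
    # Shared across all calls: product and policy facts.
    semantic = {"refund_window_days": 30}
    # Shared across all calls: learned conversation patterns.
    procedural = ["verify identity before discussing billing"]

    def __init__(self, call_id):
        self.call_id = call_id
        self.working = {}  # strictly per-call; discarded at call end

    def remember(self, key, value):
        self.working[key] = value

    def end_call(self):
        # Working memory is dropped, never merged into the shared stores.
        self.working.clear()

call_a = TripartiteMemory("call-A")
call_b = TripartiteMemory("call-B")
call_a.remember("account_summary", "caller A's balance: $12")

# Caller B's working memory cannot see caller A's context...
print("account_summary" in call_b.working)    # False
# ...but both calls share the same semantic facts.
print(call_b.semantic["refund_window_days"])  # 30
```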
What breaks without proper memory isolation
If you build this system with a single shared memory store keyed only by session id, you will eventually leak. Two common failure modes:
The first is prompt-cache contamination. A naive implementation caches the system prompt with recent retrievals prepended for latency. Under load, a cache key collision — or a stale session id reused by the telephony layer — serves one caller's account summary to another. The tripartite split forces working memory to live in a per-call namespace that is provably discarded at call end, so a cache hit across callers is structurally impossible.
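One way to make the cross-caller cache hit structurally impossible is to namespace the cache by call id rather than keying on prompt content alone. The sketch below assumes an in-memory cache; the class and method names are invented for illustration.

```python
# Per-call cache namespace: identical prompts from different calls live in
# different namespaces, and the whole namespace is discarded at call end.

class PromptCache:
    def __init__(self):
        self._by_call = {}  # call_id -> {prompt: cached completion}

    def get(self, call_id, prompt):
        return self._by_call.get(call_id, {}).get(prompt)

    def put(self, call_id, prompt, value):
        self._by_call.setdefault(call_id, {})[prompt] = value

    def end_call(self, call_id):
        # Provably discard the entire namespace at call end.
        self._by_call.pop(call_id, None)

cache = PromptCache()
cache.put("call-A", "system prompt + retrievals", "caller A's summary")

# An identical prompt from another call can never hit caller A's entry.
print(cache.get("call-B", "system prompt + retrievals"))  # None

cache.end_call("call-A")
print(cache.get("call-A", "system prompt + retrievals"))  # None after call end
```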
The second is retrieval bleed. When the agent re-queries mid-call to refine an answer, it often includes prior turn summaries in the retrieval query. If those summaries accidentally land in the shared semantic index instead of the call-scoped working memory, every subsequent caller sees traces of the previous conversation in their retrieval results. Keeping working memory strictly out of the shared vector store — and letting only the procedural memory write back post-call, after redaction — closes that path.
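The write path that closes the retrieval-bleed hole can be sketched as follows: mid-call turn summaries accumulate only in the call-scoped memory, and the one permitted write-back to a shared store happens post-call, after redaction. The stores, class, and redaction rule here are simplified assumptions.

```python
# Turn summaries never touch the shared index mid-call; only a redacted
# procedural record is written back after the call ends.

import re

shared_semantic_index = ["refund policy allows reversal within 30 days"]
shared_procedural_store = []

class CallMemory:
    def __init__(self):
        self.turn_summaries = []  # call-scoped working memory

    def add_turn(self, summary):
        # Mid-call writes go only here, never to the shared vector store.
        self.turn_summaries.append(summary)

    def close(self):
        # Post-call: redact account specifics, then write back the
        # procedural record and discard working memory.
        pattern = re.compile(r"ACCT-\d+")
        for s in self.turn_summaries:
            shared_procedural_store.append(pattern.sub("[REDACTED]", s))
        self.turn_summaries.clear()

mem = CallMemory()
mem.add_turn("caller asked about duplicate charge on ACCT-8841")
print(len(shared_semantic_index))   # 1: shared index untouched mid-call
mem.close()
print(shared_procedural_store[0])   # account id redacted before write-back
```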
Both failure modes are invisible in development and catastrophic in production. The patterns exist because the contact-center industry has already paid the cost of learning this lesson.
Operational considerations
Running this stack in production is less about the models and more about the tail. A few practical notes.
Latency budgets are asymmetric. Humans tolerate a 600 ms pause after they stop speaking, but a 600 ms pause while the agent is speaking sounds like the line dropped. Budget the agentic RAG loop for the former, and stream tokens aggressively for the latter. If the validator is about to reject a draft answer, kill the TTS mid-sentence rather than letting a wrong sentence finish.
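The mid-sentence kill can be sketched as a streaming loop that consults the validator before each token is released to the TTS. This is a toy: the token list, validator rule, and cut marker are all assumptions standing in for a real streaming pipeline.

```python
# Tokens stream to the TTS as they arrive; a validator rejection cancels
# mid-sentence rather than letting a wrong sentence finish.

def stream_answer(tokens, validator):
    spoken = []
    for token in tokens:
        if not validator(spoken + [token]):
            spoken.append("--")  # cut the line mid-sentence
            break
        spoken.append(token)     # in production: send this token to TTS now
    return " ".join(spoken)

# Toy validator: rejects once the draft asserts an unverified refund amount.
def validator(draft):
    return "$50" not in draft

print(stream_answer(["your", "refund", "of", "$50", "is", "approved"],
                    validator))
# → "your refund of --"
```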
The MCP gateway is your observability seam. Put structured logs, trace ids, and caller-scoped audit markers at the gateway, not inside each tool. When an ops engineer needs to reconstruct a bad call at 3 a.m., they should be able to replay every tool call from a single log stream without touching the underlying CRM or billing system.
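The 3 a.m. replay scenario reduces to filtering one log stream by call id. The sketch below assumes the gateway emits one JSON line per tool call with `trace_id` and `call_id` fields; the field names and records are illustrative.

```python
# Reconstruct one call's tool activity from the gateway's single log
# stream, without touching the underlying CRM or billing system.

import json

log_stream = [
    '{"trace_id": "t-77", "call_id": "call-123", "tool": "crm.lookup", "status": "ok"}',
    '{"trace_id": "t-78", "call_id": "call-456", "tool": "billing.lookup", "status": "ok"}',
    '{"trace_id": "t-79", "call_id": "call-123", "tool": "kb.search", "status": "error"}',
]

def replay(call_id, stream):
    """Return every tool-call record belonging to one call, in order."""
    return [record for record in map(json.loads, stream)
            if record["call_id"] == call_id]

for record in replay("call-123", log_stream):
    print(record["trace_id"], record["tool"], record["status"])
```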
Cost scales with silence, not with speech. The agent is cheapest when it is confident — one retrieval, one generation, done. It is most expensive when the caller is vague, because the planner loops. Instrument the average number of agentic RAG iterations per call and alert when it drifts upward; it is almost always the leading indicator of a knowledge-base regression.
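A minimal version of that instrumentation is a rolling mean with a threshold. The window size, baseline, and alert multiplier below are invented for illustration; real deployments would feed this from the metrics pipeline.

```python
# Track the rolling mean of agentic RAG iterations per call and alert
# when it drifts above a multiple of the expected baseline.

from collections import deque

class IterationDriftMonitor:
    def __init__(self, baseline, window=100, threshold=1.5):
        self.baseline = baseline        # expected iterations per call
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, iterations):
        self.window.append(iterations)

    def alerting(self):
        if not self.window:
            return False
        mean = sum(self.window) / len(self.window)
        return mean > self.baseline * self.threshold

mon = IterationDriftMonitor(baseline=1.2, window=10)
for _ in range(10):
    mon.record(1)          # healthy calls: one retrieval each
print(mon.alerting())      # False
for _ in range(10):
    mon.record(4)          # knowledge-base regression: the planner loops
print(mon.alerting())      # True
```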
Plan for graceful handoff. The agent will fail. When it does, the tripartite working memory becomes the handoff payload: a structured summary the human agent reads in three seconds instead of thirty. Design that summary format early, because it is the interface between the automated layer and the human layer, and retrofitting it later is painful.
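A handoff payload along those lines might look like the sketch below. The schema is a hypothetical example of "design the summary format early" — the field names are assumptions, not a prescribed standard.

```python
# A fixed, compact handoff schema built from working memory: something a
# human agent can scan in three seconds.

from dataclasses import dataclass, field

@dataclass
class HandoffSummary:
    caller_verified: bool
    intent: str
    steps_taken: list = field(default_factory=list)
    blocker: str = ""

    def render(self):
        return "\n".join([
            f"VERIFIED: {self.caller_verified}",
            f"INTENT: {self.intent}",
            "STEPS: " + "; ".join(self.steps_taken),
            f"BLOCKER: {self.blocker}",
        ])

summary = HandoffSummary(
    caller_verified=True,
    intent="duplicate charge dispute",
    steps_taken=["pulled billing records", "checked refund policy"],
    blocker="refund amount exceeds automated limit",
)
print(summary.render())
```

Keeping the schema fixed is the point: the human side of the interface learns to read it at a glance, which is exactly what breaks if the format is retrofitted per team later.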