The real challenge isn’t building one brilliant agent. It’s teaching a team of focused agents to think together — without ever speaking to each other directly.
Keywords: Multi-agent systems, agentic AI, AI orchestration, Model Context Protocol (MCP), coordinator pattern
Abstract
This paper presents a production-grade multi-agent research system built on the Coordinator pattern, implementing fan-out/fan-in parallelism, Model Context Protocol (MCP) for decoupled inter-agent communication, and quality-gated phase transitions. The system orchestrates four specialized sub-agents (WebSearchAgent, DocAnalyzer, Synthesizer, and Reporter), each operating on disjoint MCP server sets that enforce least-privilege access. We demonstrate that shared MCP state eliminates direct agent-to-agent coupling while enabling reliable cross-agent data propagation. The pipeline achieves an empirical confidence score of 93/100 on benchmark topics, with 100% quality gate pass rates across discovery, analysis, and synthesis phases.
The paper discusses task decomposition as a Directed Acyclic Graph (DAG), context management across three isolation layers, contradiction detection, and graceful degradation under failure modes. Architecture comparisons with single-agent systems are provided.
Introduction
I’ve spent the better part of the last eighteen months building AI agents on top of Kubernetes, Databricks, and a growing stack of MCP servers. Single-agent systems are elegant: one context window, one system prompt, one accountability chain. They’re also fundamentally limited. When a research task demands simultaneous web retrieval, document parsing, knowledge synthesis, and formatted report generation, a single agent becomes a traffic jam: sequential, context-bloated, and brittle.
The solution isn’t a smarter agent. It’s a smarter team.
Multi-agent systems decompose work across specialized nodes, each carrying only the tools and context it needs. The question then becomes: how do agents coordinate without creating a coordination nightmare?
Direct agent-to-agent messaging is fragile. Shared global state is a nightmare for consistency. What we need is a principled architecture, and that’s exactly what this paper describes.
A multi-agent system is only as reliable as its weakest coordination primitive. MCP servers, used correctly, replace that primitive with something robust: a shared memory substrate that agents write to and read from without ever needing to know that the others exist.
This system is an end-to-end, fully autonomous research pipeline. Fed a single broad topic, such as Artificial Intelligence in 2026, it independently executes everything from source retrieval and structured fact extraction to theme synthesis and contradiction detection. With automated quality evaluation built into every phase, it produces a comprehensive, fully cited report with zero human intervention and no fragile workarounds.
Architecture Overview: Why Multi-Agent?
Before diving into implementation, it’s worth being precise about when to reach for multi-agent architecture and when to stay single-agent. I’ve seen teams over-engineer simple assistants into six-agent monstrosities. I’ve also seen genuinely complex pipelines crammed into a single agent with a 120,000-token context window — technically functional, practically unmaintainable.
The decision rule I use in practice: reach for a multi-agent architecture when a workflow meets four specific criteria:
Phased Execution: The task breaks down into distinct, sequential stages.
Tool Isolation: Different phases require entirely different toolsets.
Parallel Processing: Workstreams can and should run concurrently.
Prompt Specialization: Each step benefits significantly from a tailored system prompt.
This research pipeline is a textbook use case, checking every single box.
The Coordinator Pattern: The Brain That Doesn’t Do the Work
The Coordinator is perhaps the most misunderstood component in multi-agent literature. Engineers instinctively want to give it superpowers — make it also do some retrieval, a bit of analysis. Resist this impulse. The Coordinator’s value comes precisely from its restraint.
It does exactly five things:
Plan — break the topic into a DAG of subtasks
Delegate — assign tasks to the right specialist agent
Collect — fan-in results from parallel agents
Validate — run quality gates between phases
Deliver — return the final report with an audit trail
Every research plan is, at its core, a Directed Acyclic Graph. Phases within a DAG level can run in parallel — all web searches fire simultaneously. Phases between levels are strictly ordered — you cannot analyze documents that haven’t been fetched yet.
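The level-by-level execution described above can be sketched with asyncio. This is a minimal illustration rather than the system's actual scheduler: `run_levels`, `run_task`, `fake_task`, and the task IDs are hypothetical names, and each level is assumed to be a list of task IDs whose dependencies have already completed.

```python
import asyncio


async def run_levels(levels, run_task):
    """Run each DAG level in order; tasks inside a level run concurrently."""
    results = {}
    for level in levels:
        # Fan-out: start every task in this level at once.
        # Fan-in: gather() returns outputs in the same order as `level`.
        outputs = await asyncio.gather(*(run_task(task_id) for task_id in level))
        results.update(zip(level, outputs))
    return results


async def fake_task(task_id):
    # Hypothetical stand-in for delegating a task to a specialist agent.
    await asyncio.sleep(0)
    return f"{task_id}:done"


demo_levels = [["search_a", "search_b"], ["analyze"]]
demo_results = asyncio.run(run_levels(demo_levels, fake_task))
```

By default, the first exception raised by a task propagates out of `gather()`. Passing `return_exceptions=True` would instead let the Coordinator collect partial results and decide what to do with them.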
The Coordinator generates this plan dynamically via an LLM prompt:
- depends_on: list of task IDs that must complete first
Respond ONLY with a JSON array. No preamble.
"""
response = await llm.generate(prompt)
tasks = json.loads(response)
return ResearchPlan(build_dag(tasks))
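One way `build_dag` could turn the parsed task list into executable levels is the standard library's `graphlib` module. This is a sketch under assumptions: the `id` field and the function name `build_levels` are hypothetical (only `depends_on` appears in the prompt above), and the real `build_dag` may work differently.

```python
from graphlib import TopologicalSorter


def build_levels(tasks):
    """Group tasks into levels; every task in a level has all dependencies met."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {task["id"]: set(task.get("depends_on", [])) for task in tasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        levels.append(ready)
        sorter.done(*ready)
    return levels


example_tasks = [
    {"id": "search_a", "depends_on": []},
    {"id": "search_b", "depends_on": []},
    {"id": "analyze", "depends_on": ["search_a", "search_b"]},
    {"id": "synthesize", "depends_on": ["analyze"]},
]
example_levels = build_levels(example_tasks)
```

Because `prepare()` rejects cycles, a malformed plan from the LLM fails at planning time instead of deadlocking mid-run.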
MCP Integration: Shared Memory, Not Shared Secrets
Model Context Protocol is the backbone of this architecture. In a multi-agent system, MCP becomes far more important than in single-agent deployments — it’s the only sanctioned channel for inter-agent data sharing.
The design principle is least privilege: each agent can access only the MCP servers it legitimately needs. A WebSearchAgent has no business reading the knowledge base. A Reporter has no business writing new documents to the doc store. This isn’t just good security hygiene — it prevents subtle bugs where an agent reads stale state from the wrong phase.
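A least-privilege check of this kind might look like the sketch below. The permission table is an assumption made for illustration: the article only states that the WebSearchAgent writes to the doc store and never reads the knowledge base, and that the Reporter never writes to the doc store. The `authorize` helper and the operation names are hypothetical.

```python
# Hypothetical permission table: (server, operation) pairs each agent may use.
ALLOWED = {
    "WebSearchAgent": {("web_search", "read"), ("doc_store", "write")},
    "DocAnalyzer": {("doc_store", "read"), ("knowledge_base", "write")},
    "Synthesizer": {("knowledge_base", "read")},
    "Reporter": {("knowledge_base", "read")},
}


def authorize(agent: str, server: str, operation: str) -> None:
    """Raise PermissionError unless the agent is allowed this operation."""
    if (server, operation) not in ALLOWED.get(agent, set()):
        raise PermissionError(f"{agent} may not {operation} {server}")


# Permitted: the search agent stores what it fetched.
authorize("WebSearchAgent", "doc_store", "write")
```

Unknown agents fall through to an empty permission set, so the default is deny.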
The critical insight is this: agents never talk to each other directly. They communicate through shared data stores. Agent A writes to an MCP server. Agent B reads from it. This gives us three properties that are very hard to achieve with direct messaging:
Decoupling — agents don’t know about each other’s existence
Persistence — data survives agent crashes and restarts
Consistency — single source of truth, no stale copies
Python Code
# Production MCP server: doc_store_mcp
# In real deployments, each MCP server is a microservice.
# `mcp`, `embed`, `vector_db`, `hash_url`, and `now` are assumed to be defined elsewhere.
@mcp.tool()
async def store_document(
    url: str,
    content: str,
    metadata: dict
) -> str:
    """Store a fetched document; return doc_id for downstream agents."""
    embedding = await embed(content)
    doc_id = await vector_db.upsert(
        id=hash_url(url),
        vector=embedding,
        metadata={"url": url, "ts": now(), **metadata}
    )
    return doc_id

@mcp.tool()
async def add_fact(
    subject: str,
    predicate: str,
    obj: str,
    confidence: float,
    source_url: str
) -> str:
    """Write a structured triple to the knowledge graph."""
Exactly three things define each agent: its system prompt (what role it plays), its MCP tool access (what it can do), and its autonomous loop (how it drives its task to completion). Let’s walk through each.
1. WebSearchAgent
The WebSearchAgent is the system’s eyes. Given a research question, it generates 2–3 search queries from different angles, searches, deduplicates results, fetches the top pages, and stores them in doc_store_mcp. It never reads from the knowledge base.
class WebSearchAgent:
    SYSTEM_PROMPT = """
    You are a rigorous research librarian. Your task:
    1. Generate 2-3 search queries from DIFFERENT angles
    2. Search and rank by source credibility
    3. Fetch the top 3 pages per query
    4. Deduplicate (same domain = same perspective)
    5. Store each document with metadata
    Quality threshold: prioritize .gov, .edu, peer-reviewed over blog posts.
    """
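Step 4 of the prompt (same domain = same perspective) could also be enforced deterministically in code instead of being left to the model. The sketch below is one possible approach, not the system's implementation; `dedupe_by_domain` is a hypothetical helper, and treating `www.` as the same domain is an assumption.

```python
from urllib.parse import urlparse


def dedupe_by_domain(urls):
    """Keep the first URL seen for each domain, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        # Normalize case and drop a leading "www." so both forms match.
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if domain and domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept


unique_urls = dedupe_by_domain([
    "https://www.example.edu/a",
    "https://example.edu/b",
    "https://agency.example.gov/c",
])
```

Since the input is assumed to be ranked by credibility already, keeping the first hit per domain keeps the best-ranked page from each source.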
2. DocAnalyzer
The DocAnalyzer is where raw text becomes structured knowledge. It reads documents chunk by chunk, extracts verifiable facts as subject–predicate–object triples, assigns a confidence score based on source type, and writes each fact to the knowledge base. Critically, it scores each fact by source type: an academic paper gets 0.9; a personal blog gets 0.6. Downstream agents can then filter facts against a confidence floor.
facts = await llm.extract_facts(chunk)  # → list of triples
for f in facts:
    await mcp.add_fact(
        subject=f.subject,
        predicate=f.predicate,
        obj=f.object,
        confidence=conf * f.extraction_confidence,
        source_url=source["url"]
    )
    fact_count += 1
return AnalysisResult(fact_count=fact_count)
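The confidence arithmetic can be made concrete with a small sketch. The 0.9 and 0.6 values come from the text, and the product with the extraction confidence mirrors the snippet above. The source-type keys, the 0.5 fallback, and the helper names are assumptions for illustration.

```python
# Source-type priors from the text; the fallback value is a hypothetical default.
SOURCE_CONFIDENCE = {"academic": 0.9, "blog": 0.6}
DEFAULT_SOURCE_CONFIDENCE = 0.5


def fact_confidence(source_type, extraction_confidence):
    """Combine the source prior with the extractor's own confidence."""
    base = SOURCE_CONFIDENCE.get(source_type, DEFAULT_SOURCE_CONFIDENCE)
    return base * extraction_confidence


def above_floor(facts, floor):
    """Keep only facts whose confidence meets the floor."""
    return [fact for fact in facts if fact["confidence"] >= floor]


scored = [
    {"claim": "from a paper", "confidence": fact_confidence("academic", 0.8)},
    {"claim": "from a blog", "confidence": fact_confidence("blog", 0.8)},
]
trusted = above_floor(scored, floor=0.6)
```

With the same extraction confidence of 0.8, the academic fact scores about 0.72 and the blog fact about 0.48, so only the former survives a 0.6 floor.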
3. SynthesizerAgent
The Synthesizer does what its name implies: puts things together. It queries the full fact graph, runs a clustering step (semantically grouping related triples), names each cluster as a theme, draws conclusions within each theme, and critically, identifies gaps — areas where the evidence is thin or contradictory.
The distinction I want to emphasize: analysis breaks things apart; synthesis puts them back together, but richer than they were before. Analysis gives you 30 facts. Synthesis gives you 3 themes that make sense of those 30 facts.
4. ReporterAgent
The Reporter produces the final artifact. It receives themes and facts, structures a report with executive summary, findings by theme, supporting evidence, conclusions, research gaps, and a citation list. Every claim is backed by a cited source. No orphaned assertions.
Context Management Across Three Layers
Context management in multi-agent systems is fundamentally different from single-agent systems. Multiple agents need shared state, but they also need independence: each agent’s local context shouldn’t pollute another’s reasoning.
Key Pattern
The MCP servers act as the shared external memory. Agent A writes to MCP. Agent B reads from MCP. They never call each other. The Coordinator manages execution order, but it never passes raw data between agents; it only passes references (doc_ids, fact_ids) that agents resolve against MCP independently.
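The pass-references-not-data pattern can be illustrated with a toy in-memory store. This is a sketch only: `InMemoryDocStore`, `search_agent`, and `analyzer_agent` are hypothetical stand-ins, not the real MCP servers or agents, and the word count is a placeholder for real analysis.

```python
import hashlib


class InMemoryDocStore:
    """Toy stand-in for an MCP document server (not the real doc_store_mcp)."""

    def __init__(self):
        self._docs = {}

    def store_document(self, url, content):
        doc_id = hashlib.sha256(url.encode()).hexdigest()[:12]
        self._docs[doc_id] = {"url": url, "content": content}
        return doc_id

    def get_document(self, doc_id):
        return self._docs[doc_id]


def search_agent(store, url, content):
    # Writes to the store and hands back only a reference.
    return store.store_document(url, content)


def analyzer_agent(store, doc_id):
    # Resolves the reference itself; it never sees the search agent.
    doc = store.get_document(doc_id)
    return len(doc["content"].split())


store = InMemoryDocStore()
# The coordinator's only job here is to carry the doc_id between phases.
doc_id = search_agent(store, "https://example.org/report", "agents share state through the store")
word_count = analyzer_agent(store, doc_id)
```

Neither agent function imports or calls the other, which is the decoupling property in miniature: swapping either one out only requires honoring the store's interface.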
Quality Gates: The Immune System of the Pipeline
Multi-agent systems fail in subtle ways. An empty search result doesn’t throw an exception — it just produces an analysis of nothing. A low-confidence fact base doesn’t crash — it just generates a confidently wrong report. Quality gates are the system’s immune system, catching these failures before they propagate downstream.
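A quality gate for the two silent failures named above (empty search results and a low-confidence fact base) could be sketched as follows. The thresholds, the `GateResult` type, and the gate function names are all hypothetical; the article does not state the actual gate criteria.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)


def discovery_gate(doc_ids, min_docs=3):
    """Fail when the search phase returned too few documents."""
    reasons = []
    if len(doc_ids) < min_docs:
        reasons.append(f"only {len(doc_ids)} documents, need {min_docs}")
    return GateResult(passed=not reasons, reasons=reasons)


def analysis_gate(confidences, min_facts=5, min_mean=0.7):
    """Fail when there are too few facts or their mean confidence is low."""
    reasons = []
    mean = sum(confidences) / len(confidences) if confidences else 0.0
    if len(confidences) < min_facts:
        reasons.append(f"only {len(confidences)} facts, need {min_facts}")
    if mean < min_mean:
        reasons.append(f"mean confidence {mean:.2f} below {min_mean}")
    return GateResult(passed=not reasons, reasons=reasons)


empty_search = discovery_gate([])
weak_facts = analysis_gate([0.5] * 5)
```

Returning the reasons alongside the pass/fail flag gives the Coordinator something to log in the audit trail when it stops or retries a phase.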
Multi-agent systems fail differently than single-agent systems, and more insidiously. The failures are often silent — not exceptions, but degraded outputs that look fine until you read them closely. Here are the failure modes I’ve encountered and how the system handles each:
The contradiction detection case deserves extra attention. When two high-confidence sources assert contradictory facts (say one paper claims GPT-4 has 1T parameters and another claims 1.76T), the system doesn’t resolve this by source recency or confidence alone. It flags both facts, surfaces the contradiction in the report’s uncertainty section, and recommends the claim for human verification. Confident wrongness is worse than acknowledged uncertainty.
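A simple version of this check groups facts by subject and predicate and flags any group where high-confidence sources disagree on the object. The sketch below is an assumption about how such a detector might work, not the system's actual logic. The dictionary fact format, the 0.8 threshold, and `find_contradictions` are hypothetical, and the sample values reuse the article's illustrative example.

```python
from collections import defaultdict


def find_contradictions(facts, min_confidence=0.8):
    """Return {(subject, predicate): facts} where confident sources disagree."""
    by_claim = defaultdict(list)
    for fact in facts:
        if fact["confidence"] >= min_confidence:
            by_claim[(fact["subject"], fact["predicate"])].append(fact)
    # A contradiction is more than one distinct object for the same claim.
    return {
        claim: group
        for claim, group in by_claim.items()
        if len({fact["obj"] for fact in group}) > 1
    }


sample_facts = [
    {"subject": "GPT-4", "predicate": "parameter_count", "obj": "1T", "confidence": 0.9},
    {"subject": "GPT-4", "predicate": "parameter_count", "obj": "1.76T", "confidence": 0.9},
    {"subject": "GPT-4", "predicate": "developer", "obj": "OpenAI", "confidence": 0.9},
]
flagged = find_contradictions(sample_facts)
```

Exact string comparison only catches literal disagreements. Differently phrased but equivalent objects would need normalization before this check is meaningful.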
Experimental Results
We ran the system against ten benchmark topics spanning technology, science, and policy domains. The results below are representative of the “Artificial Intelligence” benchmark run, which produced the system’s highest confidence score.
“The system didn’t just produce a report. It produced a report with an audit trail — every claim traceable to a source, every phase’s quality metrics recorded, every decision the Coordinator made logged. That’s what separates a research agent from a research system.”
Production Considerations
If you’re taking this architecture from prototype to production, here are the decisions that actually matter:
MCP Server Deployment on Kubernetes
Each MCP server runs as a microservice: doc_store as a Pinecone-backed FastAPI service, knowledge_base as a Neo4j operator deployment, web_search as a cached proxy sidecar. On Databricks, you can leverage Delta Lake as the persistence layer for doc_store, which gives you Unity Catalog lineage for free: every document write is a catalogued asset.
Each MCP server gets its own Kubernetes ServiceAccount with IRSA (IAM Roles for Service Accounts) on AWS, or Workload Identity on GCP. No agent ever holds credentials for an MCP server it’s not authorized to use. The Coordinator doesn’t hold any MCP credentials at all: it only orchestrates and never executes tool calls directly.
Unity Catalog Integration
If you’re on Databricks, registering your MCP servers as UC external locations gives you fine-grained access control without credential sprawl. The doc_store maps to a managed Delta table; the knowledge_base maps to a GraphFrame stored in Unity Catalog. Agents authenticate via token-scoped OAuth — no long-lived credentials, no rotation headaches.
When NOT to Use This Architecture
I want to be direct about this, because I’ve seen the pattern cargo-culted into the wrong problems. This architecture adds real coordination overhead. If your use case is:
A customer support bot answering FAQs → use a single agent with a retrieval tool
A code review assistant → single agent, possibly with file reading tools
A simple Q&A over a document corpus → RAG pipeline, not multi-agent
Any task that fits in 20k tokens of context → don’t add agents for the sake of it
Multi-agent is justified when: the context window genuinely can’t hold all the work, the phases genuinely benefit from parallelism, and different phases have meaningfully different tool requirements. If you can’t check all three, you’re adding complexity for its own sake.
Conclusion
The multi-agent research system described here demonstrates that complex, phased research pipelines can be made reliable, auditable, and production-grade through three design choices: the Coordinator pattern (which enforces separation of planning from execution), MCP-mediated shared state (which eliminates direct agent coupling), and quality-gated phase transitions (which catch failures before they propagate).
The 93/100 confidence score on our benchmark isn’t the interesting number. The interesting number is the full audit trail — 10 MCP tool calls, 30 extracted facts, 3 synthesized themes, all traceable from the final report back to the originating URL. That’s what an enterprise-grade research system looks like.
The architecture is available as a reference implementation. The hardest part isn’t the code — it’s the restraint. Resist giving the Coordinator too many tools. Resist letting agents talk to each other directly. Resist cramming everything into one context window. Multi-agent systems earn their complexity only when the problem genuinely demands it.