Orchestrating Intelligence: A Multi-Agent Research System with Coordinator Pattern, MCP Integration, and Quality-Gated Pipelines


Keywords: Multi-agent systems, agentic AI, AI orchestration, Model Context Protocol (MCP), coordinator pattern

Abstract

This paper presents a production-grade multi-agent research system built on the Coordinator pattern, implementing fan-out/fan-in parallelism, Model Context Protocol (MCP) for decoupled inter-agent communication, and quality-gated phase transitions. The system orchestrates four specialized sub-agents (WebSearchAgent, DocAnalyzer, Synthesizer, and Reporter), each operating on disjoint MCP server sets that enforce least-privilege access.

We demonstrate that shared MCP state eliminates direct agent-to-agent coupling while enabling reliable cross-agent data propagation. The pipeline achieves an empirical confidence score of 93/100 on benchmark topics, with 100% quality gate pass rates across discovery, analysis, and synthesis phases.

The paper discusses task decomposition as a Directed Acyclic Graph (DAG), context management across three isolation layers, contradiction detection, and graceful degradation under failure modes. Architecture comparisons with single-agent systems are provided.

Introduction

I’ve spent the better part of the last eighteen months building AI agents on top of Kubernetes, Databricks, and a growing stack of MCP servers. Single-agent systems are elegant: one context window, one system prompt, one accountability chain. They’re also fundamentally limited. When a research task demands simultaneous web retrieval, document parsing, knowledge synthesis, and formatted report generation, a single agent becomes a traffic jam: sequential, context-bloated, and brittle.

The solution isn’t a smarter agent. It’s a smarter team.

Multi-agent systems decompose work across specialized nodes, each carrying only the tools and context it needs. The question then becomes: how do agents coordinate without creating a coordination nightmare? 

Direct agent-to-agent messaging is fragile. Shared global state is a nightmare for consistency. What we need is a principled architecture, and that’s exactly what this paper describes.

A multi-agent system is only as reliable as its weakest coordination primitive. MCP servers, used correctly, replace that primitive with something robust: a shared memory substrate that agents write to and read from without ever needing to know that the others exist.

This system is an end-to-end, fully autonomous research pipeline. Fed a single broad topic, such as "Artificial Intelligence in 2026", it independently executes everything from source retrieval and structured fact extraction to theme synthesis and contradiction detection. With automated quality evaluation built into every phase, it produces a comprehensive, fully cited report with zero human intervention and no fragile workarounds.

[Figure: Single Agent vs Multi Agent]

Architecture Overview: Why Multi-Agent?

Before diving into implementation, it’s worth being precise about when to reach for multi-agent architecture and when to stay single-agent. I’ve seen teams over-engineer simple assistants into six-agent monstrosities. I’ve also seen genuinely complex pipelines crammed into a single agent with a 120,000-token context window — technically functional, practically unmaintainable.

[Figure: Architecture Overview]

The decision rule I use in practice: reach for a multi-agent architecture when a workflow meets four specific criteria:

  • Phased Execution: The task breaks down into distinct, sequential stages.
  • Tool Isolation: Different phases require entirely different toolsets.
  • Parallel Processing: Workstreams can and should run concurrently.
  • Prompt Specialization: Each step benefits significantly from a tailored system prompt.

This research pipeline is a textbook use case, checking every single box.

[Figure: The pipeline architecture]

The Coordinator Pattern: The Brain That Doesn’t Do the Work

The Coordinator is perhaps the most misunderstood component in multi-agent literature. Engineers instinctively want to give it superpowers — make it also do some retrieval, a bit of analysis. Resist this impulse. The Coordinator’s value comes precisely from its restraint.

It does exactly five things:

  1. Plan — break the topic into a DAG of subtasks
  2. Delegate — assign tasks to the right specialist agent
  3. Collect — fan-in results from parallel agents
  4. Validate — run quality gates between phases
  5. Deliver — return the final report with an audit trail

Python Code

import asyncio

# Coordinator: the manager, not the worker
class ResearchCoordinator:
    def __init__(self):
        self.planner = ResearchPlanner()
        self.web_agent = WebSearchAgent()
        self.analyzer = DocAnalyzerAgent()
        self.synthesizer = SynthesizerAgent()
        self.reporter = ReporterAgent()
        self.quality = QualityChecker()

    async def research(self, topic: str) -> ResearchResult:
        # ── Phase 1: Planning ─────────────────────────────────
        plan = await self.planner.create_plan(topic)

        # ── Phase 2: Discovery (fan-out) ──────────────────────
        search_tasks = plan.get_tasks_by_phase("discovery")
        results = await asyncio.gather(
            *[self.web_agent.execute(t) for t in search_tasks]
        )
        await self.quality.check_discovery(results)  # gate

        # ── Phase 3: Analysis (sequential pipeline) ───────────
        # Each SearchResult carries a list of doc_ids, so flatten them.
        doc_ids = [doc_id for r in results for doc_id in r.doc_ids]
        facts = await self.analyzer.execute(doc_ids)
        await self.quality.check_analysis(facts)  # gate

        # ── Phase 4: Synthesis ────────────────────────────────
        # The synthesizer reads facts from the knowledge base via MCP,
        # so it only needs the topic, not the facts themselves.
        themes = await self.synthesizer.execute(topic)
        await self.quality.check_synthesis(themes)  # gate

        # ── Phase 5: Reporting ────────────────────────────────
        report = await self.reporter.execute(themes, facts)
        score = compute_quality_score(results, facts, themes, report)
        return ResearchResult(report=report, confidence=score)

Task Decomposition as a DAG

Every research plan is, at its core, a Directed Acyclic Graph. Tasks within a DAG level can run in parallel: all web searches fire simultaneously. The levels themselves are strictly ordered: you cannot analyze documents that haven’t been fetched yet.

The Coordinator generates this plan dynamically via an LLM prompt:

Python Code

import json

async def create_plan(self, topic: str) -> ResearchPlan:
    # `llm` and `build_dag` are helpers defined elsewhere in the system.
    prompt = f"""
    Break '{topic}' into 5-7 specific research questions.
    For each question, specify:
    - id: a short unique task ID (referenced by depends_on)
    - question: the specific question to answer
    - agent: "web_search" | "doc_analyzer" | "synthesizer"
    - priority: 1-5
    - depends_on: list of task IDs that must complete first
    Respond ONLY with a JSON array. No preamble.
    """
    response = await llm.generate(prompt)
    tasks = json.loads(response)  # raises json.JSONDecodeError on malformed output
    return ResearchPlan(build_dag(tasks))

[Figure: Directed Acyclic Graph (DAG)]
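To make the "levels" idea concrete, here is a minimal sketch of how a task list could be grouped into levels that can run in parallel. It is not the system's actual build_dag; the function name build_levels and the sample plan are hypothetical, and it assumes each task in the plan carries an `id` field alongside `depends_on`. It relies only on the standard library's graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_levels(tasks: list[dict]) -> list[list[str]]:
    """Group task IDs into levels; tasks in one level do not depend on each other."""
    # Map each task ID to the set of task IDs it depends on.
    graph = {t["id"]: set(t.get("depends_on", [])) for t in tasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task that is runnable right now
        levels.append(ready)
        sorter.done(*ready)  # unlock the tasks that were waiting on this level
    return levels


# Hypothetical plan: two searches, then analysis, then synthesis.
plan = [
    {"id": "q1", "agent": "web_search", "depends_on": []},
    {"id": "q2", "agent": "web_search", "depends_on": []},
    {"id": "a1", "agent": "doc_analyzer", "depends_on": ["q1", "q2"]},
    {"id": "s1", "agent": "synthesizer", "depends_on": ["a1"]},
]
levels = build_levels(plan)
print(levels)  # [['q1', 'q2'], ['a1'], ['s1']]
```

Each inner list could then be handed to asyncio.gather, with the outer list processed in order. A plan with a dependency cycle fails at prepare(), which is a cheap way to reject a malformed LLM response before any agent runs.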

MCP Integration: Shared Memory, Not Shared Secrets

Model Context Protocol is the backbone of this architecture. In a multi-agent system, MCP becomes far more important than in single-agent deployments — it’s the only sanctioned channel for inter-agent data sharing.

The design principle is least privilege: each agent can access only the MCP servers it legitimately needs. A WebSearchAgent has no business reading the knowledge base. A Reporter has no business writing new documents to the doc store. This isn’t just good security hygiene — it prevents subtle bugs where an agent reads stale state from the wrong phase.
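One way to picture least privilege is an explicit access matrix checked before every tool call. The sketch below is an illustration, not the system's enforcement mechanism: the matrix contents are assumptions inferred from the agent descriptions in this article, and the names ACCESS, authorize, and is_allowed are hypothetical.

```python
# Hypothetical access matrix: agent name -> {MCP server name: allowed modes}.
ACCESS = {
    "WebSearchAgent": {"web_search": {"read"}, "doc_store": {"write"}},
    "DocAnalyzer": {"doc_store": {"read"}, "knowledge_base": {"write"}},
    "Synthesizer": {"knowledge_base": {"read"}},
    "Reporter": {"knowledge_base": {"read"}},
}


def authorize(agent: str, server: str, mode: str) -> None:
    """Raise PermissionError unless the agent may use this mode on this server."""
    allowed = ACCESS.get(agent, {}).get(server, set())
    if mode not in allowed:
        raise PermissionError(f"{agent} may not {mode} {server}")


def is_allowed(agent: str, server: str, mode: str) -> bool:
    """Boolean convenience wrapper around authorize()."""
    try:
        authorize(agent, server, mode)
        return True
    except PermissionError:
        return False


print(is_allowed("WebSearchAgent", "doc_store", "write"))      # True
print(is_allowed("WebSearchAgent", "knowledge_base", "read"))  # False
print(is_allowed("Reporter", "doc_store", "write"))            # False
```

Unknown agents and unknown servers fall through to an empty set, so the default is deny. In production the same rule would more likely live in credentials and network policy than in application code.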


The critical insight is this: agents never talk to each other directly. They communicate through shared data stores. Agent A writes to an MCP server. Agent B reads from it. This gives us three properties that are very hard to achieve with direct messaging:

  1. Decoupling — agents don’t know about each other’s existence
  2. Persistence — data survives agent crashes and restarts
  3. Consistency — single source of truth, no stale copies

Python Code

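# Production MCP server: doc_store_mcp
# In real deployments, each MCP server is a microservice
# Assumes the MCP Python SDK. embed, vector_db, hash_url, now, and neo4j
# are application helpers defined elsewhere.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("doc_store_mcp")

@mcp.tool()
async def store_document(url: str, content: str, metadata: dict) -> str:
    """Store a fetched document; return doc_id for downstream agents."""
    # Deterministic ID: re-fetching the same URL overwrites instead of duplicating.
    doc_id = hash_url(url)
    embedding = await embed(content)
    await vector_db.upsert(
        id=doc_id,
        vector=embedding,
        metadata={"url": url, "ts": now(), **metadata},
    )
    return doc_id

# Shown here for brevity; in the full system this tool belongs to the
# knowledge_base server, not doc_store.
@mcp.tool()
async def add_fact(
    subject: str,
    predicate: str,
    obj: str,
    confidence: float,
    source_url: str,
) -> str:
    """Write a structured triple to the knowledge graph."""
    # Stores: (subject) --[predicate]--> (object)
    # e.g.: (GPT-4) --[is_a]--> (large_language_model, conf=0.98)
    # source_url is stored with the triple so every fact stays traceable.
    return await neo4j.merge_triple(subject, predicate, obj, confidence, source_url)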
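The hash_url helper used in store_document is not shown in the article. A minimal sketch of what such a helper might look like follows. The normalization rules (lowercasing scheme and host, trimming a trailing slash, dropping the fragment) and the 16-character ID length are assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def hash_url(url: str) -> str:
    """Return a stable document ID derived from a lightly normalized URL."""
    parts = urlsplit(url.strip())
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",  # drop the fragment; it does not change the fetched page
    ))
    # Truncated SHA-256 hex digest: short, deterministic, safe to use as a key.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


id_a = hash_url("https://Example.com/report/")
id_b = hash_url("https://example.com/report#summary")
print(id_a == id_b)  # True: both normalize to the same URL
```

Because the ID depends only on the URL, two agents that fetch the same page in parallel write to the same record, which keeps the document store free of duplicates without any coordination between them.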

Specialized Agent Deep-Dive

Exactly three things define each agent: its system prompt (what role it plays), its MCP tool access (what it can do), and its autonomous loop (how it drives its task to completion). Let’s walk through each.

1. WebSearchAgent

The WebSearchAgent is the system’s eyes. Given a research question, it generates 2–3 search queries from different angles, searches, deduplicates results, fetches the top pages, and stores them in doc_store_mcp. It never reads from the knowledge base.

class WebSearchAgent:
    SYSTEM_PROMPT = """
    You are a rigorous research librarian. Your task:
    1. Generate 2-3 search queries from DIFFERENT angles
    2. Search and rank by source credibility
    3. Fetch the top 3 pages per query
    4. Deduplicate (same domain = same perspective)
    5. Store each document with metadata
    Quality threshold: prioritize .gov, .edu, peer-reviewed over blog posts.
    Never store content shorter than 500 words.
    """

    async def execute(self, task: SearchTask) -> SearchResult:
        queries = await self.generate_queries(task.question)
        # Fan out: run every query concurrently (requires `import asyncio`).
        raw = await asyncio.gather(*[self.search(q) for q in queries])
        ranked = self.rank_by_credibility(raw)
        doc_ids = []
        for result in ranked[:6]:
            content = await mcp.web_fetch(result.url)
            doc_id = await mcp.store_document(result.url, content, {
                "source_type": classify_source(result.url),
                "query": task.question,
            })
            doc_ids.append(doc_id)
        return SearchResult(doc_ids=doc_ids, source_count=len(doc_ids))

2. DocAnalyzerAgent

The DocAnalyzer is where raw text becomes structured knowledge. It reads documents chunk by chunk, extracts verifiable facts as subject–predicate–object triples, assigns a confidence score, and writes each fact to the knowledge base. Critically, it scores by source type: an academic paper gets 0.9; a personal blog gets 0.6. The stored confidence is this source prior multiplied by the LLM’s own extraction confidence, so downstream agents can filter facts against a confidence floor.

async def execute(self, doc_ids: list[str]) -> AnalysisResult:
    # Prior confidence by source type; unknown types fall back to 0.65.
    source_prior = {
        "academic": 0.90,
        "government": 0.85,
        "news": 0.75,
        "blog": 0.60,
    }
    fact_count = 0
    for doc_id in doc_ids:
        chunks = await mcp.retrieve_document_chunks(doc_id)
        source = await mcp.get_document_metadata(doc_id)
        conf = source_prior.get(source["source_type"], 0.65)
        for chunk in chunks:
            facts = await llm.extract_facts(chunk)  # → list of triples
            for f in facts:
                await mcp.add_fact(
                    subject=f.subject,
                    predicate=f.predicate,
                    obj=f.object,
                    confidence=conf * f.extraction_confidence,
                    source_url=source["url"],
                )
                fact_count += 1
    return AnalysisResult(fact_count=fact_count)
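The scoring arithmetic is worth seeing with numbers. The sketch below reuses the source-type priors from the code above; the function names, the sample facts, and the 0.65 floor applied per fact are illustrative assumptions.

```python
# Source-type priors copied from the DocAnalyzer code; 0.65 is its fallback.
SOURCE_PRIOR = {"academic": 0.90, "government": 0.85, "news": 0.75, "blog": 0.60}
DEFAULT_PRIOR = 0.65


def combined_confidence(source_type: str, extraction_confidence: float) -> float:
    """Source prior multiplied by the extractor's own confidence."""
    return SOURCE_PRIOR.get(source_type, DEFAULT_PRIOR) * extraction_confidence


def above_floor(facts: list[dict], floor: float) -> list[dict]:
    """Keep only the facts at or above the confidence floor."""
    return [f for f in facts if f["confidence"] >= floor]


# Hypothetical facts: same extraction quality, different source types.
facts = [
    {"claim": "A", "confidence": combined_confidence("academic", 0.90)},  # about 0.81
    {"claim": "B", "confidence": combined_confidence("blog", 0.90)},      # about 0.54
    {"claim": "C", "confidence": combined_confidence("news", 0.95)},      # about 0.71
]
kept = above_floor(facts, floor=0.65)
print([f["claim"] for f in kept])  # ['A', 'C']
```

One consequence of multiplying the two factors: with a blog prior of 0.60, no blog-sourced fact can reach a 0.65 floor, however cleanly it was extracted. Whether that is desirable depends on how much weight the pipeline should give to source type.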

3. SynthesizerAgent

The Synthesizer does what its name implies: puts things together. It queries the full fact graph, runs a clustering step (semantically grouping related triples), names each cluster as a theme, draws conclusions within each theme, and critically, identifies gaps — areas where the evidence is thin or contradictory.

The distinction I want to emphasize: analysis breaks things apart; synthesis puts them back together, but richer than they were before. Analysis gives you 30 facts. Synthesis gives you 3 themes that make sense of those 30 facts.
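The article does not specify the clustering algorithm, so the following is only a sketch of the general idea, using a swapped-in technique: greedy threshold clustering on cosine similarity. The two-dimensional vectors stand in for real embeddings, and cluster_facts, find_gaps, the 0.8 threshold, and the "fewer than two facts" gap rule are all hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_facts(facts: list[dict], threshold: float = 0.8) -> list[dict]:
    """Greedy single pass: a fact joins the first cluster whose seed is similar enough."""
    clusters = []  # each cluster: {"seed": vector, "facts": [fact, ...]}
    for fact in facts:
        for cluster in clusters:
            if cosine(fact["vec"], cluster["seed"]) >= threshold:
                cluster["facts"].append(fact)
                break
        else:  # no existing cluster matched, so this fact seeds a new one
            clusters.append({"seed": fact["vec"], "facts": [fact]})
    return clusters


def find_gaps(clusters: list[dict], min_facts: int = 2) -> list[dict]:
    """Flag themes backed by too few facts as research gaps."""
    return [c for c in clusters if len(c["facts"]) < min_facts]


# Toy vectors standing in for embeddings of five extracted facts.
facts = [
    {"id": "f1", "vec": (1.0, 0.0)},
    {"id": "f2", "vec": (0.9, 0.1)},
    {"id": "f3", "vec": (0.0, 1.0)},
    {"id": "f4", "vec": (0.1, 0.9)},
    {"id": "f5", "vec": (0.7, 0.7)},
]
clusters = cluster_facts(facts)
gaps = find_gaps(clusters)
print([len(c["facts"]) for c in clusters])  # [2, 2, 1]
print([f["id"] for f in gaps[0]["facts"]])  # ['f5']
```

Greedy clustering is order-dependent and crude compared with what a production synthesizer would use, but it shows the shape of the step: many facts in, a handful of themes out, and thin themes surfaced as gaps instead of being silently dropped.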

4. ReporterAgent

The Reporter produces the final artifact. It receives themes and facts, structures a report with executive summary, findings by theme, supporting evidence, conclusions, research gaps, and a citation list. Every claim is backed by a cited source. No orphaned assertions.
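The "no orphaned assertions" rule can be enforced mechanically. Here is a minimal sketch, assuming each fact is a dict with a claim and a source_url; the function name render_findings and the output layout are hypothetical, not the Reporter's actual format.

```python
def render_findings(facts: list[dict]) -> str:
    """Render claims with numbered citations; refuse any claim without a source."""
    orphans = [f["claim"] for f in facts if not f.get("source_url")]
    if orphans:
        raise ValueError(f"Uncited claims: {orphans}")

    sources = []  # unique URLs, in order of first use
    lines = []
    for f in facts:
        if f["source_url"] not in sources:
            sources.append(f["source_url"])
        number = sources.index(f["source_url"]) + 1
        lines.append(f"- {f['claim']} [{number}]")

    lines.append("")
    lines.append("Sources")
    lines += [f"[{i}] {url}" for i, url in enumerate(sources, start=1)]
    return "\n".join(lines)


# Hypothetical facts; two of them share a source and so share a citation number.
report = render_findings([
    {"claim": "Claim X", "source_url": "https://a.example/paper"},
    {"claim": "Claim Y", "source_url": "https://b.example/news"},
    {"claim": "Claim Z", "source_url": "https://a.example/paper"},
])
print(report)
```

Failing loudly on an uncited claim is a deliberate choice here: a report that cannot be rendered is easier to notice than one that quietly contains an unsupported sentence.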

Context Management Across Three Layers

Context management in multi-agent systems is fundamentally different from single-agent systems. Multiple agents need shared state, but they also need independence: each agent’s local context shouldn’t pollute another’s reasoning.

[Figure: Three-layer context model]

Key Pattern

The MCP servers act as the shared external memory. Agent A writes to MCP. Agent B reads from MCP. They never call each other. The Coordinator manages execution order, but it never passes raw data between agents; it only passes references (doc_ids, fact_ids) that agents resolve against MCP independently.
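The reference-passing pattern is easy to demonstrate in miniature. In this sketch an in-memory dictionary stands in for an MCP-backed store; InMemoryStore, fetch_agent, and analyze_agent are hypothetical names, and a real store would be a separate service reached through MCP tool calls.

```python
import uuid


class InMemoryStore:
    """Stand-in for an MCP-backed store; a real one would be a separate service."""

    def __init__(self):
        self._items = {}

    def put(self, payload: dict) -> str:
        ref = uuid.uuid4().hex  # opaque reference handed back to the caller
        self._items[ref] = payload
        return ref

    def get(self, ref: str) -> dict:
        return self._items[ref]


def fetch_agent(store: InMemoryStore, url: str) -> str:
    """Writer: stores the content and returns only a reference to it."""
    return store.put({"url": url, "content": f"text fetched from {url}"})


def analyze_agent(store: InMemoryStore, doc_id: str) -> int:
    """Reader: resolves the reference itself; never sees the writer."""
    doc = store.get(doc_id)
    return len(doc["content"].split())


store = InMemoryStore()
# The coordinator's role: pass the reference along, never the payload.
doc_id = fetch_agent(store, "https://example.com/report")
word_count = analyze_agent(store, doc_id)
print(word_count)  # 4
```

The only value that crosses between the two agents is a 32-character ID, so the coordinator's context stays small no matter how large the documents are.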

Quality Gates: The Immune System of the Pipeline

Multi-agent systems fail in subtle ways. An empty search result doesn’t throw an exception — it just produces an analysis of nothing. A low-confidence fact base doesn’t crash — it just generates a confidently wrong report. Quality gates are the system’s immune system, catching these failures before they propagate downstream.

class QualityChecker:
    async def check_discovery(self, results: list[SearchResult]) -> QualityReport:
        # Count stored documents via doc_ids, the field WebSearchAgent populates.
        total_docs = sum(len(r.doc_ids) for r in results)
        # Assumes each SearchResult also records the source_type of its documents.
        source_types = set(r.source_type for r in results)
        if total_docs < 5:
            raise InsufficientSourcesError("Retry with broader queries")
        if len(source_types) < 2:
            raise LowDiversityError("Need multiple source types")
        return QualityReport(passed=True, score=min(100, total_docs * 6))

    async def check_analysis(self, analysis: AnalysisResult) -> QualityReport:
        avg_conf = analysis.average_confidence()
        if analysis.fact_count < 10:
            raise InsufficientFactsError("Analyze more documents")
        if avg_conf < 0.65:
            raise LowConfidenceError("Sources below reliability threshold")
        return QualityReport(passed=True, score=int(avg_conf * 100))

[Figure: Quality Gate Scoring Breakdown]
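The gates above raise errors such as "Retry with broader queries", which implies something catches them and tries again. A minimal sketch of such a retry loop follows. GateError, run_with_gate, the attempt limit, and the backoff delay are all assumptions; the article does not show how the Coordinator reacts to a failed gate.

```python
import asyncio


class GateError(Exception):
    """Hypothetical base class for quality-gate failures."""


async def run_with_gate(phase, gate, max_attempts: int = 3, base_delay: float = 0.01):
    """Run phase(attempt), validate the result with gate(result), retry on GateError."""
    for attempt in range(1, max_attempts + 1):
        result = await phase(attempt)
        try:
            await gate(result)
            return result
        except GateError:
            if attempt == max_attempts:
                raise  # give up: surface the failure instead of continuing silently
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


async def demo():
    async def discovery(attempt: int) -> dict:
        # Pretend that broader queries on each retry find more documents.
        return {"total_docs": 3 * attempt}

    async def gate(result: dict) -> None:
        if result["total_docs"] < 5:
            raise GateError("Retry with broader queries")

    return await run_with_gate(discovery, gate)


result = asyncio.run(demo())
print(result)  # {'total_docs': 6}
```

The phase receives the attempt number so it can change strategy on a retry, for example by broadening its queries. When the final attempt still fails, the error propagates, which keeps a weak discovery phase from feeding the analysis phase.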

Failure Modes and Resilience Strategies

Multi-agent systems fail differently than single-agent systems, and more insidiously. The failures are often silent — not exceptions, but degraded outputs that look fine until you read them closely. Here are the failure modes I’ve encountered and how the system handles each:

[Figure: failure modes and how the system handles each]

The contradiction detection case deserves extra attention. When two high-confidence sources assert contradictory facts (for example, one paper claims GPT-4 has 1T parameters and another claims 1.76T), the system doesn’t resolve this by source recency or confidence alone. It flags both facts, surfaces the contradiction in the report’s uncertainty section, and recommends the claim for human verification. Confident wrongness is worse than acknowledged uncertainty.
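A simple version of this check can be written directly against the triple format. The sketch below is one possible approach, not the system's detector: it assumes facts are dicts with subject, predicate, obj, and confidence keys, and it treats every predicate as single-valued. The function name, the 0.7 threshold, and the sample facts are hypothetical.

```python
from collections import defaultdict


def find_contradictions(facts: list[dict], min_confidence: float = 0.7) -> dict:
    """Return (subject, predicate) pairs with conflicting high-confidence objects."""
    by_key = defaultdict(list)
    for f in facts:
        if f["confidence"] >= min_confidence:
            by_key[(f["subject"], f["predicate"])].append(f)

    conflicts = {}
    for key, group in by_key.items():
        if len({f["obj"] for f in group}) > 1:
            conflicts[key] = group  # keep every side so the report can show them all
    return conflicts


# Hypothetical facts mirroring the example in the text.
facts = [
    {"subject": "GPT-4", "predicate": "parameter_count", "obj": "1T", "confidence": 0.80},
    {"subject": "GPT-4", "predicate": "parameter_count", "obj": "1.76T", "confidence": 0.80},
    {"subject": "GPT-4", "predicate": "is_a", "obj": "large_language_model", "confidence": 0.90},
]
conflicts = find_contradictions(facts)
print(list(conflicts))  # [('GPT-4', 'parameter_count')]
```

The function returns both sides of each conflict and makes no attempt to pick a winner, matching the flag-and-escalate behavior described above. Predicates that legitimately take several values, such as an authorship relation, would need to be excluded or they would show up as false positives.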

Experimental Results

We ran the system against ten benchmark topics spanning technology, science, and policy domains. The results below are representative of the “Artificial Intelligence” benchmark run, which produced the system’s highest confidence score.


“The system didn’t just produce a report. It produced a report with an audit trail — every claim traceable to a source, every phase’s quality metrics recorded, every decision the Coordinator made logged. That’s what separates a research agent from a research system.”

Production Considerations

If you’re taking this architecture from prototype to production, here are the decisions that actually matter:

1. MCP Server Deployment on Kubernetes

Each MCP server runs as a microservice: doc_store as a Pinecone-backed FastAPI service, knowledge_base as a Neo4j operator deployment, web_search as a cached proxy sidecar. On Databricks, you can use Delta Lake as the persistence layer for doc_store, which gives you Unity Catalog lineage for free: every document write is a catalogued asset.


# k8s deployment for doc_store_mcp
apiVersion: apps/v1
kind: Deployment
metadata:
  name: doc-store-mcp
  labels:
    app: mcp-server
    role: doc-store
spec:
  replicas: 3
  selector:
    matchLabels: { app: doc-store-mcp }
  template:
    metadata:
      # Pod labels must match spec.selector, or the Deployment is rejected.
      labels: { app: doc-store-mcp }
    spec:
      containers:
        - name: doc-store
          image: acme/doc-store-mcp:1.2.0
          env:
            - name: PINECONE_API_KEY
              valueFrom:
                secretKeyRef: { name: pinecone-creds, key: api-key }
            - name: EMBEDDING_MODEL
              value: text-embedding-3-large
          resources:
            requests: { memory: "512Mi", cpu: "250m" }
            limits: { memory: "1Gi", cpu: "500m" }

2. Credential Isolation

Each MCP server gets its own Kubernetes ServiceAccount with IRSA (IAM Roles for Service Accounts) on AWS, or Workload Identity on GCP. No agent ever holds credentials for an MCP server it’s not authorized to use. The Coordinator doesn’t hold any MCP credentials at all: it only orchestrates and never executes tool calls directly.

3. Unity Catalog Integration

If you’re on Databricks, registering your MCP servers as UC external locations gives you fine-grained access control without credential sprawl. The doc_store maps to a managed Delta table; the knowledge_base maps to a GraphFrame stored in Unity Catalog. Agents authenticate via token-scoped OAuth — no long-lived credentials, no rotation headaches.

When NOT to Use This Architecture

I want to be direct about this, because I’ve seen the pattern cargo-culted into the wrong problems. This architecture adds real coordination overhead. If your use case is:

  • A customer support bot answering FAQs → use a single agent with a retrieval tool
  • A code review assistant → single agent, possibly with file reading tools
  • A simple Q&A over a document corpus → RAG pipeline, not multi-agent
  • Any task that fits in 20k tokens of context → don’t add agents for the sake of it

Multi-agent is justified when: the context window genuinely can’t hold all the work, the phases genuinely benefit from parallelism, and different phases have meaningfully different tool requirements. If you can’t check all three, you’re adding complexity for its own sake.

Conclusion

The multi-agent research system described here demonstrates that complex, phased research pipelines can be made reliable, auditable, and production-grade through three design choices: the Coordinator pattern (which enforces separation of planning from execution), MCP-mediated shared state (which eliminates direct agent coupling), and quality-gated phase transitions (which catch failures before they propagate).

The 93/100 confidence score on our benchmark isn’t the interesting number. The interesting number is the full audit trail — 10 MCP tool calls, 30 extracted facts, 3 synthesized themes, all traceable from the final report back to the originating URL. That’s what an enterprise-grade research system looks like.

The architecture is available as a reference implementation. The hardest part isn’t the code — it’s the restraint. Resist giving the Coordinator too many tools. Resist letting agents talk to each other directly. Resist cramming everything into one context window. Multi-agent systems earn their complexity only when the problem genuinely demands it.
