Inside the 5%: How Three Companies Made Their AI Pilots Actually Deliver ROI
Published on May 7, 2026

Why do 95% of AI pilots fail to show ROI? Three case studies show the 5% that win: unified knowledge, context by cohort, and measurable support automation.
TL;DR
MIT research found that 95% of enterprise AI pilots return zero measurable ROI. The 5% that succeed share one trait: they build a continuous knowledge layer that gives the AI context about both the product and the customer, instead of bolting a generic AI agent onto a stale knowledge base.
This post breaks down three Brainfish customers in that 5%:
- A cyber SaaS company automating 80% of inbound support tickets while pausing a planned 50% headcount expansion.
- A legal tech platform automating 74% of tickets across 256+ product/pricing combinations and lifting NPS by 37 points.
- A 600-person US city government moving from handwritten SOPs to instant multilingual answers in Microsoft Teams.
The pattern is the same in every case. Read on for what they actually built.
Prefer to watch? Stream the full 45-minute webinar replay →
Why 95% of AI Pilots Fail
In 2025, MIT's NANDA initiative published findings that have since been quoted in nearly every enterprise AI conversation: roughly 95% of generative AI pilots in enterprises return no measurable financial ROI. Most never make it past the proof-of-concept stage.
The reflex is to blame the model. Or the vendor. Or the budget.
It's almost never any of those.
After working inside the knowledge infrastructure of hundreds of AI deployments, the Brainfish team sees the same pattern again and again: AI pilots fail because the AI doesn't have the right context. Not the right model. Not the right prompt. The right context.
A generic AI agent connected to a generalist knowledge base will produce generalist answers. Good enough for a demo. Useless to a real customer with a real product on a real plan in a real region.
The 5% that succeed do something specific:
- They consolidate scattered knowledge (Zendesk articles, Slack threads, call recordings, internal docs, app session recordings) into a single source of truth.
- They segment that knowledge by product, plan, region, and customer cohort.
- They distribute it through the channels their customers already use, instead of forcing migrations.
- They measure deflection AND experience, not just deflection. (The playbook is sketched as a simple configuration below.)
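To make that playbook concrete, here is a minimal configuration-style sketch of the four moves. Every name and value is hypothetical (this is not Brainfish's schema); it only shows the shape of what the 5% keep track of:

# Hypothetical sketch of the four-part playbook as configuration.
# None of these names come from a real product; they only illustrate the shape.
KNOWLEDGE_LAYER_CONFIG = {
    # 1. Consolidate: every source the AI is allowed to read.
    "sources": ["zendesk_articles", "slack_threads", "call_recordings",
                "internal_docs", "app_session_recordings"],
    # 2. Segment: the dimensions that decide which answer a customer should get.
    "segments": {
        "product": ["core", "add_on"],
        "plan": ["starter", "pro", "enterprise"],
        "region": ["US", "EU", "APAC"],
    },
    # 3. Distribute: surfaces that all read from the same knowledge layer.
    "channels": ["in_app_chat", "help_center", "email_agent", "slack_agent"],
    # 4. Measure: experience metrics alongside deflection.
    "metrics": ["deflection_rate", "csat", "nps", "resolution_time"],
}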
Below are three customers doing exactly that.
Case Study 1: Cyber SaaS: 80% Ticket Deflection in a Zero-Tolerance Audience
The before state
A cybersecurity SaaS company serving MSPs and IT partners came to Brainfish with about 5,000 hyper-technical tickets per month. Their audience does not tolerate vague or wrong answers: incorrect guidance during a security incident isn't a customer service problem, it's a risk event.
Their knowledge was scattered across Zendesk articles, Slack threads, and tribal memory in the heads of senior support engineers. They had already tried Zendesk AI and Intercom Fin and couldn't hit the accuracy required for the audience. They were experimenting with pulling Zendesk exports into Claude to manually theme tickets.
The forcing function: they were growing 50% year-over-year and on track to scale support headcount by the same 50%, which the team didn't want to do.
What they built
Brainfish was added as a knowledge layer on top of Zendesk and Slack. Nothing was ripped out.
- Zendesk stayed as the ticketing system. Brainfish replies to incoming chats and emails inside Zendesk with answers built from a unified knowledge base.
- Slack became the escalation surface. When an issue can't be solved by AI, Brainfish routes it to the right Slack channel with full context, not a raw ticket dump. (A simplified sketch of this routing pattern follows this list.)
- Compliance guardrails were built in to prevent the AI from returning PII or providing guidance on regulated cybersecurity scenarios where a wrong answer is dangerous.
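A rough sketch of that escalation pattern in Python. The guardrail topics, confidence threshold, and the knowledge_layer and slack objects are hypothetical stand-ins, not Brainfish or Zendesk APIs:

# Illustrative sketch only: answer what the knowledge layer can answer confidently,
# refuse regulated or PII-bearing requests, and escalate the rest to Slack with context.
BLOCKED_TOPICS = {"active_incident_response", "credential_recovery"}  # hypothetical examples

def handle_ticket(ticket, knowledge_layer, slack):
    # Guardrail: never let the AI answer where a wrong answer is dangerous.
    if ticket["topic"] in BLOCKED_TOPICS or ticket.get("contains_pii", False):
        return escalate(ticket, slack, reason="guardrail")

    answer = knowledge_layer.answer(ticket["question"], cohort=ticket["customer_tier"])
    if answer and answer["confidence"] >= 0.9:
        return {"status": "automated", "reply": answer["text"]}

    # Escalate with full context and a draft answer, not a raw ticket dump.
    return escalate(ticket, slack, reason="low_confidence", draft=answer)

def escalate(ticket, slack, reason, draft=None):
    slack.post(
        channel=f"#support-{ticket['product']}",
        summary={
            "question": ticket["question"],
            "customer_tier": ticket["customer_tier"],
            "reason": reason,
            "suggested_answer": draft["text"] if draft else None,
        },
    )
    return {"status": "escalated", "reason": reason}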
The results
- 80% of inbound chats automated.
- 2–3 second average response time. Critical in cybersecurity, where speed of response during an active incident matters.
- 24/7 coverage across US, EU, and APAC time zones without expanding headcount.
- Headcount plan paused. The 50% support team expansion is no longer needed.
Why it worked
The team didn't replace their stack. They gave their existing stack the context it had been missing.
Case Study 2: Legal Tech: Personalization Across 256+ Product Combinations
The before state
A legal tech platform serves lawyers across multiple practice areas with 4 products, each with 3–4 pricing tiers, which together with packaging options produces 200+ combinations of plan, packaging, and product. One generalist knowledge base was answering every customer the same way, which meant most answers were technically correct and practically useless.
Compounding the problem: law in the US differs by state, law in Australia differs from US law, and APAC has its own regulatory shape. A generic answer about a feature can be wrong for a customer in a different jurisdiction.
What they built
Brainfish helped them split one knowledge base into roughly 12 segmented knowledge sets, without losing a single source of truth.
The architecture:
- Source ingestion. Existing Zendesk content, community articles, learning center material, internal Slack, call recordings, tickets, and app session recordings all flow into Brainfish's knowledge intelligence layer.
- Source-of-truth generation. When sources contradict each other, app session recordings are used as ground truth: they show what the product actually does today, not what someone wrote about it 18 months ago.
- Segmentation. The single source of truth is split into specialist knowledge bases per product, per plan, per region, and per cohort.
- Distribution. Answers are delivered through an in-app AI agent, the help center, an email-reply agent in Zendesk, and a customer success agent in Slack.
Personalization runs without PII. Brainfish only needs to know what plan, product, and region the user is on, not who they are.
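A minimal sketch of what that cohort-based routing can look like: the only keys are product, plan, and region, and each cohort resolves to a specialist knowledge set. Names and structure are hypothetical, not Brainfish's API:

# Illustrative sketch: personalization keyed on cohort, never on identity.
from dataclasses import dataclass

@dataclass(frozen=True)
class Cohort:
    product: str  # e.g. "litigation"
    plan: str     # e.g. "pro"
    region: str   # e.g. "AU"

# Each cohort maps to a specialist knowledge set derived from one source of truth.
SEGMENTED_KB = {
    Cohort("litigation", "pro", "US"): "kb_litigation_pro_us",
    Cohort("litigation", "pro", "AU"): "kb_litigation_pro_au",
    Cohort("conveyancing", "starter", "APAC"): "kb_conveyancing_starter_apac",
}

def answer(question: str, cohort: Cohort) -> str:
    kb = SEGMENTED_KB.get(cohort, "kb_general")  # fall back to the universal set
    # A real system would retrieve and generate here; this only shows the routing.
    return f"[{kb}] answer to: {question}"

print(answer("Can clients e-sign in this jurisdiction?", Cohort("litigation", "pro", "AU")))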
The results
- 74% of inbound tickets automated (up from a more generalist solution that started in the 30–40% range).
- 37-point NPS lift over 12 months. NPS was the harder number to move: it's the proof that customers are getting better experiences, not just fewer tickets reaching humans.
- 98% answer accuracy, validated by audits with the customer's own support team.
- Support team repurposed. Freed-up capacity now runs customer marketing, churn-risk renewals, and field marketing pop-ups in front of customer offices.
Why it worked
Personalization without PII. Specialist knowledge per cohort. Source-of-truth knowledge that updates as the product changes β not a static help center that ages.
Case Study 3: US City Government: From Paper SOPs to Multilingual Instant Answers
The before state
A US city government with about 600 staff was writing standard operating procedures by hand, on paper. Some of those procedures predated the internet. Knowledge was technically accessible, but only through HR. The rest of the workforce had no searchable system.
The workforce is multilingual. Many staff speak Spanish or Haitian Creole as a first language. English-only SOPs created an obvious accessibility gap.
What they built
Two pieces did most of the heavy lifting:
- Video-to-SOP ingestion. Department leads (IT, procurement, HR, parks, public works) recorded themselves performing each procedure. Brainfish converted those recordings into structured SOP documents automatically. What used to take days of writing took minutes of recording.
- Multilingual AI chat. The chat experience was deployed in English, Spanish, and Haitian Creole, then surfaced inside Microsoft Teams so staff could ask questions where they already work. (A simplified sketch of the recording-to-SOP flow follows this list.)
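A simplified sketch of that ingestion flow. The transcribe() and draft_sop() functions are stand-ins for whichever speech-to-text and generation services actually do the work; only the overall shape (one recording in, one structured SOP per language out) reflects the case study:

# Illustrative sketch: recording -> transcript -> structured SOP in each supported language.
SUPPORTED_LANGUAGES = ["en", "es", "ht"]  # English, Spanish, Haitian Creole

def transcribe(recording_path):
    # Stand-in: a real pipeline would call a speech-to-text service here.
    return f"transcript of {recording_path}"

def draft_sop(transcript, language):
    # Stand-in: a real pipeline would generate numbered steps from the transcript.
    return {"language": language, "steps": [f"Step derived from: {transcript}"]}

def ingest_recording(recording_path, department):
    transcript = transcribe(recording_path)
    return [
        {"department": department, **draft_sop(transcript, lang)}
        for lang in SUPPORTED_LANGUAGES
    ]

sops = ingest_recording("procurement_po_approval.mp4", department="procurement")
print(f"Generated {len(sops)} language versions of one SOP")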
The results
- From 0 to ~300 staff with self-serve access to the knowledge base (up from HR-only).
- Native-language answers for Spanish and Creole speakers, including frontline workers in parks, public works, and citizen services.
- Instant onboarding for new hires across every department, with up-to-date SOPs that mirror current process.
- Days of writing collapsed to minutes of recording.
Why it worked
The team met workers where they already were (Microsoft Teams, native language) instead of asking them to learn a new system. The knowledge layer absorbed institutional memory, including the parts that lived in nobody's documentation.
What All Three Have in Common
Three completely different industries. Three completely different stacks. One shared playbook.
The common thread is context: context about the product (what it does, in this region, on this plan, in this version), and context about the customer (who they are, what they're trying to do, what tier they're on).
That's the entire delta between the 5% and the 95%.
How to Tell If Your AI Pilot Is in the 95% (or the 5%)
Five diagnostic questions:
- Where does your knowledge live today? If the answer is "across Zendesk, Notion, Slack, Drive, and a few people's heads" and your AI only reads one of them, you are missing context.
- Does your AI know what plan the customer is on? If every customer gets the same answer regardless of tier, region, or product, your AI is generalist. Generalist answers don't drive ROI.
- What happens to the 20–30% the AI can't solve? If it dumps a raw transcript on a human agent, you're saving deflection cost but adding handoff cost. Escalations need context too.
- Are you measuring NPS or just deflection? Deflection alone doesn't prove the AI is helping the customer. NPS, CSAT, and resolution time prove the experience improved.
- What is your support team doing with the time you saved? The 5% turn freed-up capacity into customer marketing, churn rescue, and retention. The 95% just shrink the team.
If three or more of these answers feel uncomfortable, your pilot is likely in the 95%.
What to Do Next
Three concrete moves, in order of effort:
- Audit your knowledge. Pull every source the AI can reach today (help center, internal docs, tickets, call recordings) and check whether each is up to date, segmented, and free of contradictions with the others. If you have an MCP-capable AI like Claude, you can do this in days, not months, using the Brainfish MCP. (A minimal audit sketch follows this list.)
- Map context to cohort. List the plans, products, regions, and personas you serve. Map which knowledge is universal and which is specific. The specifics are what most pilots miss.
- Don't migrate. Layer. Keep your ticketing, your chat, your help center. Add a knowledge layer underneath that feeds all of them with the right context per customer.
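For the first move, a minimal audit sketch, assuming each document can be reduced to a source, a topic, a claim, and a last-updated date (field names are illustrative, not any vendor's schema):

# Illustrative first-pass knowledge audit: flag stale sources and contradictions.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=180)  # assumption: ~6 months without an update is suspect

def audit_knowledge(documents):
    """documents: list of dicts with 'source', 'topic', 'claim', and 'last_updated' (datetime)."""
    findings = []
    now = datetime.now()

    # 1. Freshness: anything untouched for too long gets flagged.
    for doc in documents:
        if now - doc["last_updated"] > STALE_AFTER:
            findings.append(("stale", doc["source"], doc["topic"]))

    # 2. Contradictions: two sources making different claims about the same topic.
    claims_by_topic = {}
    for doc in documents:
        claims_by_topic.setdefault(doc["topic"], set()).add(doc["claim"])
    for topic, claims in claims_by_topic.items():
        if len(claims) > 1:
            findings.append(("contradiction", topic, sorted(claims)))

    return findings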
See It in Your Own Stack
Brainfish is offering free context audits for AI implementations. No sales pitch: just a structured review of your knowledge, segmentation, and integration setup, benchmarked against the patterns above.
Related Reading
- Watch the full webinar: What AI Support Actually Looks Like When It Works →
- AI Knowledge Base: The Ultimate Guide for 2026 →
- AI Knowledge Base vs. Chatbot: Which Does Your Support Team Actually Need? →
- What We Learned from Analyzing 1M Support Interactions →
- The $90 Billion Race to Supercharge Customer Service with AI →
- How Brainfish works: the knowledge layer for AI →
- Brainfish customer stories →
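Appendix: Instrumenting an Agent Evaluation Loop
If you want to measure experience as well as deflection in your own stack, the script below is an illustrative, self-contained sketch of an agent evaluation loop: it traces a simulated support agent with OpenTelemetry, records latency, hallucination-rate, and PII-exposure metrics, and mocks a call to NVIDIA's NeMo Evaluator for scoring. The agent, endpoints, and evaluation results are all simulated, so treat it as a pattern to adapt rather than a drop-in implementation.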
import time
import requests
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# --- 1. OpenTelemetry Setup for Observability ---
# Configure exporters to print telemetry data to the console.
# In a production system, these would export to a backend like Prometheus or Jaeger.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = SimpleSpanProcessor(ConsoleSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)

metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter(__name__)

# Create custom OpenTelemetry metrics
agent_latency_histogram = meter.create_histogram("agent.latency", unit="ms", description="Agent response time")
agent_invocations_counter = meter.create_counter("agent.invocations", description="Number of times the agent is invoked")
hallucination_rate_gauge = meter.create_gauge("agent.hallucination_rate", unit="percentage", description="Rate of hallucinated responses")
pii_exposure_counter = meter.create_counter("agent.pii_exposure.count", description="Count of responses with PII exposure")

# --- 2. Define the Agent using NeMo Agent Toolkit concepts ---
# The NeMo Agent Toolkit orchestrates agents, tools, and workflows, often via configuration.
# This class simulates an agent that would be managed by the toolkit.
class MultimodalSupportAgent:
    def __init__(self, model_endpoint):
        self.model_endpoint = model_endpoint

    # The toolkit would route incoming requests to this method.
    def process_query(self, query, context_data):
        # Start an OpenTelemetry span to trace this specific execution.
        with tracer.start_as_current_span("agent.process_query") as span:
            start_time = time.time()
            span.set_attribute("query.text", query)
            span.set_attribute("context.data_types", [type(d).__name__ for d in context_data])

            # In a real scenario, this would involve complex logic and tool calls.
            print(f"\nAgent processing query: '{query}'...")
            time.sleep(0.5)  # Simulate work (e.g., tool calls, model inference)
            agent_response = f"Generated answer for '{query}' based on provided context."

            latency = (time.time() - start_time) * 1000

            # Record metrics
            agent_latency_histogram.record(latency)
            agent_invocations_counter.add(1)
            span.set_attribute("agent.response", agent_response)
            span.set_attribute("agent.latency_ms", latency)

            return {"response": agent_response, "latency_ms": latency}

# --- 3. Define the Evaluation Logic using NeMo Evaluator ---
# This function simulates calling the NeMo Evaluator microservice API.
def run_nemo_evaluation(agent_response, ground_truth_data):
    with tracer.start_as_current_span("evaluator.run") as span:
        print("Submitting response to NeMo Evaluator...")
        # In a real system, you would make an HTTP request to the NeMo Evaluator service.
        # eval_endpoint = "http://nemo-evaluator-service/v1/evaluate"
        # payload = {"response": agent_response, "ground_truth": ground_truth_data}
        # response = requests.post(eval_endpoint, json=payload)
        # evaluation_results = response.json()

        # Mocking the evaluator's response for this example.
        time.sleep(0.2)  # Simulate network and evaluation latency
        mock_results = {
            "answer_accuracy": 0.95,
            "hallucination_rate": 0.05,
            "pii_exposure": False,
            "toxicity_score": 0.01,
            "latency": 25.5
        }
        span.set_attribute("eval.results", str(mock_results))
        print(f"Evaluation complete: {mock_results}")
        return mock_results

# --- 4. The Main Agent Evaluation Loop ---
def agent_evaluation_loop(agent, query, context, ground_truth):
    with tracer.start_as_current_span("agent_evaluation_loop") as parent_span:
        # Step 1: Agent processes the query
        output = agent.process_query(query, context)

        # Step 2: Response is evaluated by NeMo Evaluator
        eval_metrics = run_nemo_evaluation(output["response"], ground_truth)

        # Step 3: Log evaluation results using OpenTelemetry metrics
        hallucination_rate_gauge.set(eval_metrics.get("hallucination_rate", 0.0))
        if eval_metrics.get("pii_exposure", False):
            pii_exposure_counter.add(1)

        # Add evaluation metrics as events to the parent span for rich, contextual traces.
        parent_span.add_event("EvaluationComplete", attributes=eval_metrics)

        # Step 4: (Optional) Trigger retraining or alerts based on metrics
        if eval_metrics["answer_accuracy"] < 0.8:
            print("[ALERT] Accuracy has dropped below threshold! Triggering retraining workflow.")
            parent_span.set_status(trace.Status(trace.StatusCode.ERROR, "Low Accuracy Detected"))

# --- Run the Example ---
if __name__ == "__main__":
    support_agent = MultimodalSupportAgent(model_endpoint="http://model-server/invoke")

    # Simulate an incoming user request with multimodal context
    user_query = "What is the status of my recent order?"
    context_documents = ["order_invoice.pdf", "customer_history.csv"]
    ground_truth = {"expected_answer": "Your order #1234 has shipped."}

    # Execute the loop
    agent_evaluation_loop(support_agent, user_query, context_documents, ground_truth)

    # In a real application, the metric reader would run in the background.
    # We call it explicitly here to see the output.
    metric_reader.collect()

Frequently Asked Questions
How long does an AI implementation take?
Implementations using a knowledge layer approach typically run weeks rather than months. With MCP-driven knowledge audits, even large-scale rebuilds (historically 6–8 month projects) can compress to days or weeks of focused work.
Does AI customer support replace human support teams?
In the highest-ROI deployments, no. Brainfish customers consistently redirect freed-up capacity into customer success, retention, churn rescue, and customer marketing. Headcount growth slows (paused expansions are common), but teams rarely shrink in absolute terms, and the work shifts from reactive ticket-handling to proactive revenue work.
How much of customer support can AI realistically automate?
In the case studies above, automation rates ranged from 74% (legal tech) to 80% (cyber SaaS) for inbound support volume. Achievable rates depend on knowledge coverage, segmentation quality, and how many issues require backend integrations versus pure information retrieval.
What is a knowledge layer for AI?
A knowledge layer is a unified, continuously updated source of product and customer context that sits between an organization's data sources (help docs, tickets, call recordings, app sessions, internal docs) and the AI agents that talk to customers and employees. It gives every AI surface (chat, email, Slack, Teams, in-app) the same accurate, segmented information.
Why do most enterprise AI pilots fail?
The most common failure mode is missing context. Generic AI agents pull from a single, often outdated, knowledge source and serve the same answer to every customer regardless of plan, product, region, or use case. Without segmented, continuously updated knowledge, accuracy stays too low to deliver ROI.
What is the MIT 95% AI ROI statistic?
MIT research published in 2025 found that approximately 95% of generative AI pilots in enterprise environments return zero measurable financial ROI. The remaining 5% deliver real efficiency, revenue, or experience gains β typically by addressing context and integration rather than swapping models.
