Inside the 5%: How Three Companies Made Their AI Pilots Actually Deliver ROI
Published on May 7, 2026

Why do 95% of AI pilots fail to show ROI? Three case studies show the 5% that win: unified knowledge, context by cohort, and measurable support automation.
TL;DR
MIT research found that 95% of enterprise AI pilots return zero measurable ROI. The 5% that succeed share one trait: they build a continuous knowledge layer that gives the AI context about both the product and the customer, instead of bolting a generic AI agent onto a stale knowledge base.
This post breaks down three Brainfish customers in that 5%:
- A cyber SaaS company automating 80% of inbound support tickets while pausing a planned 50% headcount expansion.
- A legal tech platform automating 74% of tickets across 256+ product/pricing combinations and lifting NPS by 37 points.
- A 600-person US city government moving from handwritten SOPs to instant multilingual answers in Microsoft Teams.
The pattern is the same in every case. Read on for what they actually built.
Prefer to watch? Stream the full 45-minute webinar replay →
Why 95% of AI Pilots Fail
In 2025, MIT's NANDA initiative published findings that have since been quoted in nearly every enterprise AI conversation: roughly 95% of generative AI pilots in enterprises return no measurable financial ROI. Most never make it past the proof-of-concept stage.
The reflex is to blame the model. Or the vendor. Or the budget.
It's almost never any of those.
After working inside the knowledge infrastructure of hundreds of AI deployments, the Brainfish team sees the same pattern again and again: AI pilots fail because the AI doesn't have the right context. Not the right model. Not the right prompt. The right context.
A generic AI agent connected to a generalist knowledge base will produce generalist answers. Good enough for a demo. Useless to a real customer with a real product on a real plan in a real region.
The 5% that succeed do something specific:
- They consolidate scattered knowledge (Zendesk articles, Slack threads, call recordings, internal docs, app session recordings) into a single source of truth.
- They segment that knowledge by product, plan, region, and customer cohort.
- They distribute it through the channels their customers already use, instead of forcing migrations.
- They measure deflection AND experience, not just deflection. (The playbook is sketched as a simple configuration below.)
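To make that playbook concrete, here is a minimal configuration-style sketch of the four moves. Every name and value is hypothetical (this is not Brainfish's schema); it only shows the shape of what the 5% keep track of:

# Hypothetical sketch of the four-part playbook as configuration.
# None of these names come from a real product; they only illustrate the shape.
KNOWLEDGE_LAYER_CONFIG = {
    # 1. Consolidate: every source the AI is allowed to read.
    "sources": ["zendesk_articles", "slack_threads", "call_recordings",
                "internal_docs", "app_session_recordings"],
    # 2. Segment: the dimensions that decide which answer a customer should get.
    "segments": {
        "product": ["core", "add_on"],
        "plan": ["starter", "pro", "enterprise"],
        "region": ["US", "EU", "APAC"],
    },
    # 3. Distribute: surfaces that all read from the same knowledge layer.
    "channels": ["in_app_chat", "help_center", "email_agent", "slack_agent"],
    # 4. Measure: experience metrics alongside deflection.
    "metrics": ["deflection_rate", "csat", "nps", "resolution_time"],
}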
Below are three customers doing exactly that.
Case Study 1: Cyber SaaS: 80% Ticket Deflection in a Zero-Tolerance Audience
The before state
A cybersecurity SaaS company serving MSPs and IT partners came to Brainfish with about 5,000 hyper-technical tickets per month. Their audience does not tolerate vague or wrong answers: incorrect guidance during a security incident isn't a customer service problem, it's a risk event.
Their knowledge was scattered across Zendesk articles, Slack threads, and tribal memory in the heads of senior support engineers. They had already tried Zendesk AI and Intercom Fin and couldn't hit the accuracy required for the audience. They were experimenting with pulling Zendesk exports into Claude to manually theme tickets.
The forcing function: they were growing 50% year-over-year and on track to scale support headcount by the same 50%, which the team didn't want to do.
What they built
Brainfish was added as a knowledge layer on top of Zendesk and Slack. Nothing was ripped out.
- Zendesk stayed as the ticketing system. Brainfish replies to incoming chats and emails inside Zendesk with answers built from a unified knowledge base.
- Slack became the escalation surface. When an issue can't be solved by AI, Brainfish routes it to the right Slack channel with full context, not a raw ticket dump. (A simplified sketch of this routing pattern follows this list.)
- Compliance guardrails were built in to prevent the AI from returning PII or providing guidance on regulated cybersecurity scenarios where a wrong answer is dangerous.
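A rough sketch of that escalation pattern in Python. The guardrail topics, confidence threshold, and the knowledge_layer and slack objects are hypothetical stand-ins, not Brainfish or Zendesk APIs:

# Illustrative sketch only: answer what the knowledge layer can answer confidently,
# refuse regulated or PII-bearing requests, and escalate the rest to Slack with context.
BLOCKED_TOPICS = {"active_incident_response", "credential_recovery"}  # hypothetical examples

def handle_ticket(ticket, knowledge_layer, slack):
    # Guardrail: never let the AI answer where a wrong answer is dangerous.
    if ticket["topic"] in BLOCKED_TOPICS or ticket.get("contains_pii", False):
        return escalate(ticket, slack, reason="guardrail")

    answer = knowledge_layer.answer(ticket["question"], cohort=ticket["customer_tier"])
    if answer and answer["confidence"] >= 0.9:
        return {"status": "automated", "reply": answer["text"]}

    # Escalate with full context and a draft answer, not a raw ticket dump.
    return escalate(ticket, slack, reason="low_confidence", draft=answer)

def escalate(ticket, slack, reason, draft=None):
    slack.post(
        channel=f"#support-{ticket['product']}",
        summary={
            "question": ticket["question"],
            "customer_tier": ticket["customer_tier"],
            "reason": reason,
            "suggested_answer": draft["text"] if draft else None,
        },
    )
    return {"status": "escalated", "reason": reason}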
The results
- 80% of inbound chats automated.
- 2–3 second average response time. Critical in cybersecurity, where speed of response during an active incident matters.
- 24/7 coverage across US, EU, and APAC time zones without expanding headcount.
- Headcount plan paused. The 50% support team expansion is no longer needed.
Why it worked
The team didn't replace their stack. They gave their existing stack the context it had been missing.
Case Study 2: Legal Tech: Personalization Across 256+ Product Combinations
The before state
A legal tech platform serves lawyers across multiple practice areas with 4 products, each with 3–4 pricing tiers, which together with packaging options produces 200+ combinations of plan, packaging, and product. One generalist knowledge base was answering every customer the same way, which meant most answers were technically correct and practically useless.
Compounding the problem: law in the US differs by state, law in Australia differs from US law, and APAC has its own regulatory shape. A generic answer about a feature can be wrong for a customer in a different jurisdiction.
What they built
Brainfish helped them split one knowledge base into roughly 12 segmented knowledge sets, without losing a single source of truth.
The architecture:
- Source ingestion. Existing Zendesk content, community articles, learning center material, internal Slack, call recordings, tickets, and app session recordings all flow into Brainfish's knowledge intelligence layer.
- Source-of-truth generation. When sources contradict each other, app session recordings are used as ground truth: they show what the product actually does today, not what someone wrote about it 18 months ago.
- Segmentation. The single source of truth is split into specialist knowledge bases per product, per plan, per region, and per cohort.
- Distribution. Answers are delivered through an in-app AI agent, the help center, an email-reply agent in Zendesk, and a customer success agent in Slack.
Personalization runs without PII. Brainfish only needs to know what plan, product, and region the user is on, not who they are.
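A minimal sketch of what that cohort-based routing can look like: the only keys are product, plan, and region, and each cohort resolves to a specialist knowledge set. Names and structure are hypothetical, not Brainfish's API:

# Illustrative sketch: personalization keyed on cohort, never on identity.
from dataclasses import dataclass

@dataclass(frozen=True)
class Cohort:
    product: str  # e.g. "litigation"
    plan: str     # e.g. "pro"
    region: str   # e.g. "AU"

# Each cohort maps to a specialist knowledge set derived from one source of truth.
SEGMENTED_KB = {
    Cohort("litigation", "pro", "US"): "kb_litigation_pro_us",
    Cohort("litigation", "pro", "AU"): "kb_litigation_pro_au",
    Cohort("conveyancing", "starter", "APAC"): "kb_conveyancing_starter_apac",
}

def answer(question: str, cohort: Cohort) -> str:
    kb = SEGMENTED_KB.get(cohort, "kb_general")  # fall back to the universal set
    # A real system would retrieve and generate here; this only shows the routing.
    return f"[{kb}] answer to: {question}"

print(answer("Can clients e-sign in this jurisdiction?", Cohort("litigation", "pro", "AU")))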
The results
- 74% of inbound tickets automated (up from a more generalist solution that started in the 30–40% range).
- 37-point NPS lift over 12 months. NPS was the harder number to move: it's the proof that customers are getting better experiences, not just fewer tickets reaching humans.
- 98% answer accuracy, validated by audits with the customer's own support team.
- Support team repurposed. Freed-up capacity now runs customer marketing, churn-risk renewals, and field marketing pop-ups in front of customer offices.
Why it worked
Personalization without PII. Specialist knowledge per cohort. Source-of-truth knowledge that updates as the product changes β not a static help center that ages.
Case Study 3: US City Government: From Paper SOPs to Multilingual Instant Answers
The before state
A US city government with about 600 staff was writing standard operating procedures by hand, on paper. Some of those procedures predated the internet. Knowledge was technically accessible, but only through HR. The rest of the workforce had no searchable system.
The workforce is multilingual. Many staff speak Spanish or Haitian Creole as a first language. English-only SOPs created an obvious accessibility gap.
What they built
Two pieces did most of the heavy lifting:
- Video-to-SOP ingestion. Department leads (IT, procurement, HR, parks, public works) recorded themselves performing each procedure. Brainfish converted those recordings into structured SOP documents automatically. What used to take days of writing took minutes of recording.
- Multilingual AI chat. The chat experience was deployed in English, Spanish, and Haitian Creole, then surfaced inside Microsoft Teams so staff could ask questions where they already work. (A simplified sketch of the recording-to-SOP flow follows this list.)
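A simplified sketch of that ingestion flow. The transcribe() and draft_sop() functions are stand-ins for whichever speech-to-text and generation services actually do the work; only the overall shape (one recording in, one structured SOP per language out) reflects the case study:

# Illustrative sketch: recording -> transcript -> structured SOP in each supported language.
SUPPORTED_LANGUAGES = ["en", "es", "ht"]  # English, Spanish, Haitian Creole

def transcribe(recording_path):
    # Stand-in: a real pipeline would call a speech-to-text service here.
    return f"transcript of {recording_path}"

def draft_sop(transcript, language):
    # Stand-in: a real pipeline would generate numbered steps from the transcript.
    return {"language": language, "steps": [f"Step derived from: {transcript}"]}

def ingest_recording(recording_path, department):
    transcript = transcribe(recording_path)
    return [
        {"department": department, **draft_sop(transcript, lang)}
        for lang in SUPPORTED_LANGUAGES
    ]

sops = ingest_recording("procurement_po_approval.mp4", department="procurement")
print(f"Generated {len(sops)} language versions of one SOP")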
The results
- From 0 to ~300 staff with self-serve access to the knowledge base (up from HR-only).
- Native-language answers for Spanish and Creole speakers, including frontline workers in parks, public works, and citizen services.
- Instant onboarding for new hires across every department, with up-to-date SOPs that mirror current process.
- Days of writing collapsed to minutes of recording.
Why it worked
The team met workers where they already were (Microsoft Teams, native language) instead of asking them to learn a new system. The knowledge layer absorbed institutional memory, including the parts that lived in nobody's documentation.
What All Three Have in Common
Three completely different industries. Three completely different stacks. One shared playbook.
The common thread is context: context about the product (what it does, in this region, on this plan, in this version), and context about the customer (who they are, what they're trying to do, what tier they're on).
That's the entire delta between the 5% and the 95%.
How to Tell If Your AI Pilot Is in the 95% (or the 5%)
Five diagnostic questions:
- Where does your knowledge live today? If the answer is "across Zendesk, Notion, Slack, Drive, and a few people's heads" and your AI only reads one of them, you are missing context.
- Does your AI know what plan the customer is on? If every customer gets the same answer regardless of tier, region, or product, your AI is generalist. Generalist answers don't drive ROI.
- What happens to the 20–30% the AI can't solve? If it dumps a raw transcript on a human agent, you're saving deflection cost but adding handoff cost. Escalations need context too.
- Are you measuring NPS or just deflection? Deflection alone doesn't prove the AI is helping the customer. NPS, CSAT, and resolution time prove the experience improved.
- What is your support team doing with the time you saved? The 5% turn freed-up capacity into customer marketing, churn rescue, and retention. The 95% just shrink the team.
If three or more of these answers feel uncomfortable, your pilot is likely in the 95%.
What to Do Next
Three concrete moves, in order of effort:
- Audit your knowledge. Pull every source the AI can reach today (help center, internal docs, tickets, call recordings) and check whether each is up to date, segmented, and free of contradictions with the others. If you have an MCP-capable AI like Claude, you can do this in days, not months, using the Brainfish MCP. (A minimal audit sketch follows this list.)
- Map context to cohort. List the plans, products, regions, and personas you serve. Map which knowledge is universal and which is specific. The specifics are what most pilots miss.
- Don't migrate. Layer. Keep your ticketing, your chat, your help center. Add a knowledge layer underneath that feeds all of them with the right context per customer.
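For the first move, a minimal audit sketch, assuming each document can be reduced to a source, a topic, a claim, and a last-updated date (field names are illustrative, not any vendor's schema):

# Illustrative first-pass knowledge audit: flag stale sources and contradictions.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=180)  # assumption: ~6 months without an update is suspect

def audit_knowledge(documents):
    """documents: list of dicts with 'source', 'topic', 'claim', and 'last_updated' (datetime)."""
    findings = []
    now = datetime.now()

    # 1. Freshness: anything untouched for too long gets flagged.
    for doc in documents:
        if now - doc["last_updated"] > STALE_AFTER:
            findings.append(("stale", doc["source"], doc["topic"]))

    # 2. Contradictions: two sources making different claims about the same topic.
    claims_by_topic = {}
    for doc in documents:
        claims_by_topic.setdefault(doc["topic"], set()).add(doc["claim"])
    for topic, claims in claims_by_topic.items():
        if len(claims) > 1:
            findings.append(("contradiction", topic, sorted(claims)))

    return findings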
See It in Your Own Stack
Brainfish is offering free context audits for AI implementations. No sales pitch: just a structured review of your knowledge, segmentation, and integration setup, benchmarked against the patterns above.
Related Reading
- Watch the full webinar: What AI Support Actually Looks Like When It Works →
- AI Knowledge Base: The Ultimate Guide for 2026 →
- AI Knowledge Base vs. Chatbot: Which Does Your Support Team Actually Need? →
- What We Learned from Analyzing 1M Support Interactions →
- The $90 Billion Race to Supercharge Customer Service with AI →
- How Brainfish works: the knowledge layer for AI →
- Brainfish customer stories →
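Appendix: Instrumenting an Agent Evaluation Loop
If you want to measure experience as well as deflection in your own stack, the script below is an illustrative, self-contained sketch of an agent evaluation loop: it traces a simulated support agent with OpenTelemetry, records latency, hallucination-rate, and PII-exposure metrics, and mocks a call to NVIDIA's NeMo Evaluator for scoring. The agent, endpoints, and evaluation results are all simulated, so treat it as a pattern to adapt rather than a drop-in implementation.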
import time
import requests
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# --- 1. OpenTelemetry Setup for Observability ---
# Configure exporters to print telemetry data to the console.
# In a production system, these would export to a backend like Prometheus or Jaeger.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = SimpleSpanProcessor(ConsoleSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)

metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter(__name__)

# Create custom OpenTelemetry metrics
agent_latency_histogram = meter.create_histogram("agent.latency", unit="ms", description="Agent response time")
agent_invocations_counter = meter.create_counter("agent.invocations", description="Number of times the agent is invoked")
hallucination_rate_gauge = meter.create_gauge("agent.hallucination_rate", unit="percentage", description="Rate of hallucinated responses")
pii_exposure_counter = meter.create_counter("agent.pii_exposure.count", description="Count of responses with PII exposure")

# --- 2. Define the Agent using NeMo Agent Toolkit concepts ---
# The NeMo Agent Toolkit orchestrates agents, tools, and workflows, often via configuration.
# This class simulates an agent that would be managed by the toolkit.
class MultimodalSupportAgent:
    def __init__(self, model_endpoint):
        self.model_endpoint = model_endpoint

    # The toolkit would route incoming requests to this method.
    def process_query(self, query, context_data):
        # Start an OpenTelemetry span to trace this specific execution.
        with tracer.start_as_current_span("agent.process_query") as span:
            start_time = time.time()
            span.set_attribute("query.text", query)
            span.set_attribute("context.data_types", [type(d).__name__ for d in context_data])

            # In a real scenario, this would involve complex logic and tool calls.
            print(f"\nAgent processing query: '{query}'...")
            time.sleep(0.5)  # Simulate work (e.g., tool calls, model inference)
            agent_response = f"Generated answer for '{query}' based on provided context."

            latency = (time.time() - start_time) * 1000

            # Record metrics
            agent_latency_histogram.record(latency)
            agent_invocations_counter.add(1)
            span.set_attribute("agent.response", agent_response)
            span.set_attribute("agent.latency_ms", latency)

            return {"response": agent_response, "latency_ms": latency}

# --- 3. Define the Evaluation Logic using NeMo Evaluator ---
# This function simulates calling the NeMo Evaluator microservice API.
def run_nemo_evaluation(agent_response, ground_truth_data):
    with tracer.start_as_current_span("evaluator.run") as span:
        print("Submitting response to NeMo Evaluator...")
        # In a real system, you would make an HTTP request to the NeMo Evaluator service.
        # eval_endpoint = "http://nemo-evaluator-service/v1/evaluate"
        # payload = {"response": agent_response, "ground_truth": ground_truth_data}
        # response = requests.post(eval_endpoint, json=payload)
        # evaluation_results = response.json()

        # Mocking the evaluator's response for this example.
        time.sleep(0.2)  # Simulate network and evaluation latency
        mock_results = {
            "answer_accuracy": 0.95,
            "hallucination_rate": 0.05,
            "pii_exposure": False,
            "toxicity_score": 0.01,
            "latency": 25.5
        }
        span.set_attribute("eval.results", str(mock_results))
        print(f"Evaluation complete: {mock_results}")
        return mock_results

# --- 4. The Main Agent Evaluation Loop ---
def agent_evaluation_loop(agent, query, context, ground_truth):
    with tracer.start_as_current_span("agent_evaluation_loop") as parent_span:
        # Step 1: Agent processes the query
        output = agent.process_query(query, context)

        # Step 2: Response is evaluated by NeMo Evaluator
        eval_metrics = run_nemo_evaluation(output["response"], ground_truth)

        # Step 3: Log evaluation results using OpenTelemetry metrics
        hallucination_rate_gauge.set(eval_metrics.get("hallucination_rate", 0.0))
        if eval_metrics.get("pii_exposure", False):
            pii_exposure_counter.add(1)

        # Add evaluation metrics as events to the parent span for rich, contextual traces.
        parent_span.add_event("EvaluationComplete", attributes=eval_metrics)

        # Step 4: (Optional) Trigger retraining or alerts based on metrics
        if eval_metrics["answer_accuracy"] < 0.8:
            print("[ALERT] Accuracy has dropped below threshold! Triggering retraining workflow.")
            parent_span.set_status(trace.Status(trace.StatusCode.ERROR, "Low Accuracy Detected"))

# --- Run the Example ---
if __name__ == "__main__":
    support_agent = MultimodalSupportAgent(model_endpoint="http://model-server/invoke")

    # Simulate an incoming user request with multimodal context
    user_query = "What is the status of my recent order?"
    context_documents = ["order_invoice.pdf", "customer_history.csv"]
    ground_truth = {"expected_answer": "Your order #1234 has shipped."}

    # Execute the loop
    agent_evaluation_loop(support_agent, user_query, context_documents, ground_truth)

    # In a real application, the metric reader would run in the background.
    # We call it explicitly here to see the output.
    metric_reader.collect()

Frequently Asked Questions
How long does an AI implementation take?
Implementations using a knowledge layer approach typically run weeks rather than months. With MCP-driven knowledge audits, even large-scale rebuilds (historically 6–8 month projects) can compress to days or weeks of focused work.
Does AI customer support replace human support teams?
In the highest-ROI deployments, no. Brainfish customers consistently redirect freed-up capacity into customer success, retention, churn rescue, and customer marketing. Headcount growth slows (paused expansions are common), but teams rarely shrink in absolute terms, and the work shifts from reactive ticket-handling to proactive revenue work.
How much of customer support can AI realistically automate?
In the case studies above, automation rates ranged from 74% (legal tech) to 80% (cyber SaaS) for inbound support volume. Achievable rates depend on knowledge coverage, segmentation quality, and how many issues require backend integrations versus pure information retrieval.
What is a knowledge layer for AI?
A knowledge layer is a unified, continuously updated source of product and customer context that sits between an organization's data sources (help docs, tickets, call recordings, app sessions, internal docs) and the AI agents that talk to customers and employees. It gives every AI surface (chat, email, Slack, Teams, in-app) the same accurate, segmented information.
Why do most enterprise AI pilots fail?
The most common failure mode is missing context. Generic AI agents pull from a single, often outdated, knowledge source and serve the same answer to every customer regardless of plan, product, region, or use case. Without segmented, continuously updated knowledge, accuracy stays too low to deliver ROI.
What is the MIT 95% AI ROI statistic?
MIT research published in 2025 found that approximately 95% of generative AI pilots in enterprise environments return zero measurable financial ROI. The remaining 5% deliver real efficiency, revenue, or experience gains β typically by addressing context and integration rather than swapping models.
