
Operational Context for AI: Why AI Fails in Production (Brainfish Webinar Recap)

Published on February 27, 2026


AI adoption is moving fast, but trust drops when answers come from stale, conflicting, or unowned knowledge. Docs alone rarely capture the judgment and edge cases teams rely on day to day, so “operational context” needs to be captured and kept current. A knowledge layer makes that practical by continuously updating sources, resolving contradictions, and turning messy inputs into something teams can actually depend on.

Operational Context for AI: Top Takeaways from Brainfish Live (Webinar Recap)

AI adoption is surging. But in production, many teams are learning the hard way that access to AI is not the same as trust.

In Brainfish’s first live session, Dani Wilson (Go-To-Market) hosted Brainfish CEO and Co-Founder Daniel Kimber for a candid conversation on what’s breaking enterprise AI initiatives, why “connecting the docs” is not enough, and what it takes to build AI people can rely on.

TL;DR

  • GenAI adoption is widespread, but inaccuracy is the #1 reported negative outcome.
  • AI in production breaks most often because of stale, conflicting, and ownerless knowledge.
  • Documentation is only part of the picture. Teams also need operational context: the judgment, edge cases, and real-world expertise that lives across chats, calls, videos, and SMEs.
  • A knowledge layer continuously captures and updates knowledge, resolves contradictions, structures information for retrieval, and activates it across tools and workflows.

The problem: AI reliability drops fast in production

Most teams are past the experimentation stage. The question is no longer “Can we use AI?” It’s “Can we trust it in front of customers, prospects, and internal teams?”

Daniel highlighted a key pattern:

  • Adoption is high across organizations.
  • A significant portion of adopters still experience negative consequences.
  • The most common issue is inaccuracy, which creates a bigger downstream problem: people stop trusting the system.

When trust disappears, AI becomes something employees “double check,” which often cancels out the productivity gains.

Why AI fails in production: the knowledge underneath is what breaks

Daniel’s framing was direct: today’s models are strong. The bottleneck is usually the underlying knowledge.

In production environments, AI systems often operate over information that is:

  • Stale or conflicting (different answers depending on which doc the model retrieved)
  • Missing feedback loops (no mechanism to correct mistakes and keep knowledge current)
  • Missing ownership (no clear person or process accountable for accuracy)
  • Missing product context (new releases, customer configurations, edge cases)
  • Distributed across noisy sources (Slack, Teams, calls, docs, ticket threads)

A memorable takeaway from the session:

As LLMs get stronger, they scale poor knowledge faster.

The business impact: support, GTM, product, and enablement feel it first

When AI is inaccurate, the cost is not abstract. It shows up as rework, escalations, and credibility loss.

Daniel called out the teams most affected:

  • Support: more escalations, more “human rework” after AI attempts resolution
  • Go-to-market (sales, marketing, CS): credibility issues in deals and customer conversations, increasing renewal risk
  • Product and engineering: AI features don’t get adopted if users cannot rely on them
  • Enablement: teams waste time hunting for the “correct” version of an answer or asset

Why documentation is not enough: most knowledge is not written down

Dani shared a GTM scenario many teams recognize: a seller asks for an asset to move a deal forward and receives multiple conflicting versions. The time cost is bad. The trust cost is worse.

Daniel’s answer: the way humans learn a business is not limited to docs. People ramp through:

  • Learned experiences
  • Edge cases
  • Nuance and judgment
  • “How we actually do this here” context

That knowledge often lives in:

  • Slack and Teams threads
  • Call recordings and videos
  • Tribal knowledge held by SMEs
  • Cross-functional context spread across multiple leaders

This is what Daniel referred to as operational context: real-world expertise that is more current and more nuanced than static documentation.

Treat AI like a new hire (but train it even better)

A practical mental model from the session: treat AI agents like a new teammate.

But there’s a catch. AI does not come with common sense or judgment. If you want AI agents to do end-to-end knowledge work, you need a better way to provide the context your best SMEs use every day.

That leads to the core question:

How do you capture operational context continuously without slowing the company down?

The solution: what a knowledge layer needs to do

Daniel described the “knowledge layer” as the missing foundation that makes AI reliable in real business workflows.

A knowledge layer needs to:

  1. Capture knowledge continuously (not only during releases)
  2. Keep it current through ongoing updates
  3. Eliminate contradictions across systems and sources
  4. Structure knowledge for retrieval so AI can find the right answer quickly
  5. Activate knowledge across tools and teams (not another isolated interface)

The goal is to make knowledge a consolidating force as the company changes, rather than letting context fragment across more tools over time.
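The five responsibilities above can be sketched as a toy data structure. This is a minimal illustration of the concept, not Brainfish's product (the class, method names, and precedence rule are all assumptions made for this example):

```python
from datetime import datetime, timezone

class KnowledgeLayer:
    """Toy sketch: one canonical, verifiable answer per topic."""

    def __init__(self):
        self.entries = {}  # topic -> {"answer", "source", "updated_at", "verified"}

    def capture(self, topic, answer, source):
        """Steps 1-2: capture continuously and keep current.
        Newer input supersedes older input unless a human has verified the answer."""
        current = self.entries.get(topic)
        if current is None or not current["verified"]:
            self.entries[topic] = {"answer": answer, "source": source,
                                   "updated_at": datetime.now(timezone.utc),
                                   "verified": False}

    def resolve(self, topic, verified_answer):
        """Step 3: eliminate contradictions; a human-verified answer wins."""
        self.entries[topic] = {"answer": verified_answer, "source": "human-verified",
                               "updated_at": datetime.now(timezone.utc),
                               "verified": True}

    def retrieve(self, topic):
        """Step 4: structured for retrieval; there is exactly one answer to find."""
        entry = self.entries.get(topic)
        return entry["answer"] if entry else None

    def activate(self, topic, channels):
        """Step 5: push the canonical answer out to every tool, not a new silo."""
        answer = self.retrieve(topic)
        return {channel: answer for channel in channels}
```

The design choice worth noting is in `capture`: unverified knowledge stays fluid and is continuously replaced, while a verified answer acts as the consolidating force the paragraph above describes.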

Webinar Q&A recap


The live Q&A centered on the questions teams ask when they start thinking about enterprise AI governance and accuracy.

Does this replace our knowledge base?

  • Potentially, but most teams are deeply invested in their existing tools. The focus is on improving productivity without forcing a workflow overhaul.

Is this competing with our support bot?

  • The knowledge layer is positioned as the engine underneath the bot, improving answer quality and trust.

Is it secure? Where is data stored?

  • The session described AWS-hosted storage and enterprise options, aligned with typical SaaS security expectations.

Is customer data used to train models?

  • No. The knowledge layer was positioned as consolidating and governing knowledge for retrieval, not training foundation models on customer data.

How do you decide source of truth when documents conflict?

  • Capture broad inputs (including unstructured sources), use signals like time and relevance to generate a draft, and route it for human verification.

Final takeaway

The core message from Brainfish Live:

The future isn’t just smarter models. It’s smarter models + rich operational context.

If teams want AI they can trust in production, the work cannot stop at “connect the docs.” It needs a system that continuously captures, resolves, and activates the knowledge people rely on every day.

FAQ

What is operational context in AI?

Operational context is the real-world knowledge people use to do their jobs: edge cases, judgment, current practices, and nuanced expertise that often lives outside formal documentation.

Why do AI systems hallucinate or give inconsistent answers at work?

A common cause is inconsistent or conflicting knowledge sources. When AI retrieves stale or contradictory information, outputs become unreliable.

What is a knowledge layer?

A knowledge layer is a foundation that continuously captures and updates context, resolves contradictions, structures it for retrieval, and activates trusted knowledge across tools and AI workflows.

How do you handle conflicting documents in an AI knowledge base?

Use signals such as recency and relevance to propose a best-available draft, then route it to humans for verification so the source of truth stays accurate.

Appendix: an illustrative script showing how the reliability concerns above can be monitored in production, tracing and measuring an agent with OpenTelemetry. The calls to NVIDIA's NeMo Evaluator are mocked out here.

import time
import requests  # used by the real (commented-out) evaluator call below
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# --- 1. OpenTelemetry Setup for Observability ---
# Configure exporters to print telemetry data to the console.
# In a production system, these would export to a backend like Prometheus or Jaeger.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = SimpleSpanProcessor(ConsoleSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)

metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter(__name__)

# Create custom OpenTelemetry metrics
agent_latency_histogram = meter.create_histogram("agent.latency", unit="ms", description="Agent response time")
agent_invocations_counter = meter.create_counter("agent.invocations", description="Number of times the agent is invoked")
hallucination_rate_gauge = meter.create_gauge("agent.hallucination_rate", unit="percentage", description="Rate of hallucinated responses")
pii_exposure_counter = meter.create_counter("agent.pii_exposure.count", description="Count of responses with PII exposure")

# --- 2. Define the Agent using NeMo Agent Toolkit concepts ---
# The NeMo Agent Toolkit orchestrates agents, tools, and workflows, often via configuration.
# This class simulates an agent that would be managed by the toolkit.
class MultimodalSupportAgent:
    def __init__(self, model_endpoint):
        self.model_endpoint = model_endpoint

    # The toolkit would route incoming requests to this method.
    def process_query(self, query, context_data):
        # Start an OpenTelemetry span to trace this specific execution.
        with tracer.start_as_current_span("agent.process_query") as span:
            start_time = time.time()
            span.set_attribute("query.text", query)
            span.set_attribute("context.data_types", [type(d).__name__ for d in context_data])

            # In a real scenario, this would involve complex logic and tool calls.
            print(f"\nAgent processing query: '{query}'...")
            time.sleep(0.5) # Simulate work (e.g., tool calls, model inference)
            agent_response = f"Generated answer for '{query}' based on provided context."
            
            latency = (time.time() - start_time) * 1000
            
            # Record metrics
            agent_latency_histogram.record(latency)
            agent_invocations_counter.add(1)
            span.set_attribute("agent.response", agent_response)
            span.set_attribute("agent.latency_ms", latency)
            
            return {"response": agent_response, "latency_ms": latency}

# --- 3. Define the Evaluation Logic using NeMo Evaluator ---
# This function simulates calling the NeMo Evaluator microservice API.
def run_nemo_evaluation(agent_response, ground_truth_data):
    with tracer.start_as_current_span("evaluator.run") as span:
        print("Submitting response to NeMo Evaluator...")
        # In a real system, you would make an HTTP request to the NeMo Evaluator service.
        # eval_endpoint = "http://nemo-evaluator-service/v1/evaluate"
        # payload = {"response": agent_response, "ground_truth": ground_truth_data}
        # response = requests.post(eval_endpoint, json=payload)
        # evaluation_results = response.json()
        
        # Mocking the evaluator's response for this example.
        time.sleep(0.2) # Simulate network and evaluation latency
        mock_results = {
            "answer_accuracy": 0.95,
            "hallucination_rate": 0.05,
            "pii_exposure": False,
            "toxicity_score": 0.01,
            "latency": 25.5
        }
        span.set_attribute("eval.results", str(mock_results))
        print(f"Evaluation complete: {mock_results}")
        return mock_results

# --- 4. The Main Agent Evaluation Loop ---
def agent_evaluation_loop(agent, query, context, ground_truth):
    with tracer.start_as_current_span("agent_evaluation_loop") as parent_span:
        # Step 1: Agent processes the query
        output = agent.process_query(query, context)

        # Step 2: Response is evaluated by NeMo Evaluator
        eval_metrics = run_nemo_evaluation(output["response"], ground_truth)

        # Step 3: Log evaluation results using OpenTelemetry metrics
        hallucination_rate_gauge.set(eval_metrics.get("hallucination_rate", 0.0))
        if eval_metrics.get("pii_exposure", False):
            pii_exposure_counter.add(1)
        
        # Add evaluation metrics as events to the parent span for rich, contextual traces.
        parent_span.add_event("EvaluationComplete", attributes=eval_metrics)

        # Step 4: (Optional) Trigger retraining or alerts based on metrics
        if eval_metrics["answer_accuracy"] < 0.8:
            print("[ALERT] Accuracy has dropped below threshold! Triggering retraining workflow.")
            parent_span.set_status(trace.Status(trace.StatusCode.ERROR, "Low Accuracy Detected"))

# --- Run the Example ---
if __name__ == "__main__":
    support_agent = MultimodalSupportAgent(model_endpoint="http://model-server/invoke")
    
    # Simulate an incoming user request with multimodal context
    user_query = "What is the status of my recent order?"
    context_documents = ["order_invoice.pdf", "customer_history.csv"]
    ground_truth = {"expected_answer": "Your order #1234 has shipped."}

    # Execute the loop
    agent_evaluation_loop(support_agent, user_query, context_documents, ground_truth)
    
    # In a real application, the metric reader would run in the background.
    # We call it explicitly here to see the output.
    metric_reader.collect()
