"Crap In, Crap Out": Why AI Answers Are Always Wrong

Published on December 11, 2025


Most SaaS teams think they have an AI problem. They actually have a knowledge problem. This article shows how outdated documentation sabotages every AI rollout, and introduces the new layer built to fix it.

Here’s the uncomfortable truth about AI in SaaS:
AI isn’t failing. Your knowledge is.

Every time a bot gives the wrong answer, it’s just echoing something outdated that your team wrote, copied, forgot about, or never documented properly in the first place.

People think they have a model problem when they really have a product-knowledge problem.

And the bigger the product, the more brutal that gap becomes.

Your bots aren’t broken (well, some of them aren’t). Your knowledge is.

Most people skip right past that part. The conversation quickly shifts to model quality or vendor choice. Meanwhile, the real failure is in plain sight: the information going into the system was inconsistent, fragmented, or outdated. No model can correct that.

We see this play out across almost every SaaS team we talk to. The product moves in one direction and the knowledge that explains it moves in another.

The part nobody talks about

Modern products change constantly. Buttons shift. Labels change. Enterprise accounts get custom variations. Engineering ships without ceremony. Product ships again. Then again. No team can realistically keep up.

The knowledge behind the product rarely stays aligned. It spreads across places never meant for long-term accuracy. Old help center pages. Slack threads. Internal wikis. Sales decks. Random PDFs in Drive. One engineer’s head. Community posts where users guess at the right answer. None of it stays consistent.

Inside the company, people work around it. A support agent messages the one engineer who knows. A CSM memorizes their accounts’ edge cases. A PM gives the same demo five times a week. Everyone survives through improvisation.

Then AI enters the picture and exposes the gap.

AI simply repeats what it was given.

This is where “crap in, crap out” starts being a technical truth.

The industry’s blind spot

Many companies rushed to try AI in 2025. Few treated it as something that depends entirely on the quality of the information beneath it. They wired AI into their help center and waited to see what would happen.

When the answers came back wrong, the blame went to the AI tool.
“We tried AI and it isn’t ready.”
Or the more permanent version: “AI doesn’t work for our product.”

The harder truth is that most teams aren’t ready for AI because their knowledge isn’t ready. 

A documentation leader we spoke with recently captured it well. She said teams keep asking which AI tool to choose, when the real question is what needs to be cleaned up before AI can be useful at all.

Her company did what many others did. Another department scraped everything it could find and dumped it into a bot. Nobody checked whether the content matched the current product. Nobody cleaned it. When customers started complaining, leadership finally saw the problem writers had been warning about for years: quality matters first.

The cost of skipping the foundation

When AI sits on top of weak product knowledge, the results don’t look chaotic. They look polished. And that’s the danger.

If teams disagree, the AI picks one version and presents it as fact.
If content is outdated, the AI echoes the old behavior.
If flows changed, the AI won’t know.

The system becomes a megaphone for the mess teams didn’t have time to fix.

This is why “AI readiness” has nothing to do with model choice. It’s about whether the underlying truth is stable. If the foundation is weak, everything built on top of it collapses.

What “fix the foundation” actually means

Fixing the foundation isn’t a giant rewrite project. It’s understanding where knowledge lives, how it contradicts itself, and how much of it reflects the real product.

For most teams, the work looks like:

  • Finding where content sits today
  • Spotting the pieces that disagree
  • Seeing what changed in the last few releases
  • Understanding where customers get stuck
  • Mapping the flows that matter most
  • Capturing the product as it exists now
  • Establishing one place that stays accurate

Until recently this was all manual. Writers hunting through folder trees. PMs answering the same questions. CSMs rewriting the same instructions. A constant cycle of rework.
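
You don’t need heavy tooling to start chipping away at this. As a rough illustration only (not a Brainfish feature; every path, field name, and date below is a made-up assumption about how your content might be stored), a small script can at least flag help articles that haven’t been touched since recent releases:

import json
from datetime import datetime
from pathlib import Path

# Hypothetical inputs: a folder of exported help-center articles (one JSON file
# per article) and the dates of recent product releases. Adjust both to match
# however your content is actually stored.
ARTICLES_DIR = Path("help_center_export")
RECENT_RELEASES = [datetime(2025, 10, 7), datetime(2025, 11, 4)]

def find_stale_articles(articles_dir, releases):
    """Flag articles last updated before the oldest of the recent releases."""
    cutoff = min(releases)
    stale = []
    for path in articles_dir.glob("*.json"):
        article = json.loads(path.read_text())
        # Assumes each export carries a naive ISO "last_updated" timestamp,
        # e.g. "2025-06-18T09:30:00".
        last_updated = datetime.fromisoformat(article["last_updated"])
        if last_updated < cutoff:
            stale.append((path.name, article.get("title", "untitled"), article["last_updated"]))
    return stale

if __name__ == "__main__":
    for filename, title, updated in find_stale_articles(ARTICLES_DIR, RECENT_RELEASES):
        print(f"[STALE?] {title} ({filename}), last updated {updated}")

Even a crude pass like this turns “our docs are probably out of date” into a concrete list someone can actually work through.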

This is the gap Brainfish focuses on.

How we approach the problem

We didn’t set out to create a new chatbot or a new help center. The problem we kept returning to was simpler. There wasn’t a system whose job was to understand the product, watch it change, and keep the knowledge aligned.

So that’s what we built: a Knowledge Intelligence Layer.

It learns from what companies already produce. Product walkthroughs. Release demos. Support transcripts. Internal training sessions. Customer session recordings. It also ingests existing docs and flags what no longer matches the product.

If you want to see what that looks like in practice, read how our auto-updating knowledge system works.

By pulling all of this together, Brainfish forms a picture of the product that isn’t stuck in the past. It sees how screens connect. How flows branch. Where steps break. Which terms teams use inconsistently. Instead of a pile of pages, it starts to understand the shape of the product itself.

Once the system sees that structure, it becomes easier to surface inaccuracies.
A renamed field.
A step removed in a new build.
A section customers replay over and over.

Human review still matters. What changes is the amount of detective work required.
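
We’re not publishing internals here, but the underlying idea is easy to picture. The toy sketch below is purely illustrative (every class, screen, and label is a hypothetical stand-in, not Brainfish’s actual model): represent a flow as the steps a user moves through, then diff the flow as it exists today against the flow the docs describe, so renamed labels and removed steps surface automatically for someone to review.

from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    screen: str  # the screen or page the user is on
    label: str   # the button or field the docs tell them to use

def diff_flow(observed, documented):
    """List where the documented flow no longer matches the product."""
    findings = []
    current_labels = {step.screen: step.label for step in observed}
    for doc_step in documented:
        current = current_labels.get(doc_step.screen)
        if current is None:
            findings.append(f"Step removed: '{doc_step.label}' on '{doc_step.screen}' no longer exists")
        elif current != doc_step.label:
            findings.append(f"Renamed: docs say '{doc_step.label}', product now shows '{current}'")
    return findings

# Hypothetical example: an "invite a teammate" flow after a release renamed a tab
# and dropped a confirmation screen.
observed = [Step("Settings", "Members"), Step("Members", "Invite people")]
documented = [Step("Settings", "Team"), Step("Members", "Invite people"), Step("Invite modal", "Send invite")]

for finding in diff_flow(observed, documented):
    print(finding)

Real products are messier than this (flows branch, labels repeat, screens carry state), which is exactly why human review stays in the loop. The point is that the comparison becomes systematic instead of detective work.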

From there, accurate knowledge becomes something every part of the company can use, from support to onboarding to product adoption.

All of it flows from the same source: a continuously updated understanding of the product.

Preparing for AI starts with truth

If you’ve worked in SaaS long enough, you’ve lived this cycle. The product grows more complex. Documentation falls behind. Teams invent shortcuts. Then someone asks whether AI can help fix it.

The real question is simpler.
Is the knowledge underneath all of this something you trust?

If the answer is no, the newest AI model won’t save you.
It will only expose the problem faster.

If the answer is yes, AI finally becomes useful.
Support becomes consistent.
Product adoption improves.
Onboarding gets smoother.
Teams waste less time rewriting the same thing.
Customers stop running into contradictions.

This is the work we’re focused on at Brainfish. Not the hype. Not the shortcuts. Just the foundation that every AI system depends on.

If you want to see how this comes together, you can request a walkthrough here:
https://www.brainfishai.com/get-a-demo

Truth in.
Truth out.

import time
import requests
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# --- 1. OpenTelemetry Setup for Observability ---
# Configure exporters to print telemetry data to the console.
# In a production system, these would export to a backend like Prometheus or Jaeger.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = SimpleSpanProcessor(ConsoleSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)

metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter(__name__)

# Create custom OpenTelemetry metrics
agent_latency_histogram = meter.create_histogram("agent.latency", unit="ms", description="Agent response time")
agent_invocations_counter = meter.create_counter("agent.invocations", description="Number of times the agent is invoked")
hallucination_rate_gauge = meter.create_gauge("agent.hallucination_rate", unit="percentage", description="Rate of hallucinated responses")
pii_exposure_counter = meter.create_counter("agent.pii_exposure.count", description="Count of responses with PII exposure")

# --- 2. Define the Agent using NeMo Agent Toolkit concepts ---
# The NeMo Agent Toolkit orchestrates agents, tools, and workflows, often via configuration.
# This class simulates an agent that would be managed by the toolkit.
class MultimodalSupportAgent:
    def __init__(self, model_endpoint):
        self.model_endpoint = model_endpoint

    # The toolkit would route incoming requests to this method.
    def process_query(self, query, context_data):
        # Start an OpenTelemetry span to trace this specific execution.
        with tracer.start_as_current_span("agent.process_query") as span:
            start_time = time.time()
            span.set_attribute("query.text", query)
            span.set_attribute("context.data_types", [type(d).__name__ for d in context_data])

            # In a real scenario, this would involve complex logic and tool calls.
            print(f"\nAgent processing query: '{query}'...")
            time.sleep(0.5) # Simulate work (e.g., tool calls, model inference)
            agent_response = f"Generated answer for '{query}' based on provided context."
            
            latency = (time.time() - start_time) * 1000
            
            # Record metrics
            agent_latency_histogram.record(latency)
            agent_invocations_counter.add(1)
            span.set_attribute("agent.response", agent_response)
            span.set_attribute("agent.latency_ms", latency)
            
            return {"response": agent_response, "latency_ms": latency}

# --- 3. Define the Evaluation Logic using NeMo Evaluator ---
# This function simulates calling the NeMo Evaluator microservice API.
def run_nemo_evaluation(agent_response, ground_truth_data):
    with tracer.start_as_current_span("evaluator.run") as span:
        print("Submitting response to NeMo Evaluator...")
        # In a real system, you would make an HTTP request to the NeMo Evaluator service.
        # eval_endpoint = "http://nemo-evaluator-service/v1/evaluate"
        # payload = {"response": agent_response, "ground_truth": ground_truth_data}
        # response = requests.post(eval_endpoint, json=payload)
        # evaluation_results = response.json()
        
        # Mocking the evaluator's response for this example.
        time.sleep(0.2) # Simulate network and evaluation latency
        mock_results = {
            "answer_accuracy": 0.95,
            "hallucination_rate": 0.05,
            "pii_exposure": False,
            "toxicity_score": 0.01,
            "latency": 25.5
        }
        span.set_attribute("eval.results", str(mock_results))
        print(f"Evaluation complete: {mock_results}")
        return mock_results

# --- 4. The Main Agent Evaluation Loop ---
def agent_evaluation_loop(agent, query, context, ground_truth):
    with tracer.start_as_current_span("agent_evaluation_loop") as parent_span:
        # Step 1: Agent processes the query
        output = agent.process_query(query, context)

        # Step 2: Response is evaluated by NeMo Evaluator
        eval_metrics = run_nemo_evaluation(output["response"], ground_truth)

        # Step 3: Log evaluation results using OpenTelemetry metrics
        hallucination_rate_gauge.set(eval_metrics.get("hallucination_rate", 0.0))
        if eval_metrics.get("pii_exposure", False):
            pii_exposure_counter.add(1)
        
        # Add evaluation metrics as events to the parent span for rich, contextual traces.
        parent_span.add_event("EvaluationComplete", attributes=eval_metrics)

        # Step 4: (Optional) Trigger retraining or alerts based on metrics
        if eval_metrics["answer_accuracy"] < 0.8:
            print("[ALERT] Accuracy has dropped below threshold! Triggering retraining workflow.")
            parent_span.set_status(trace.Status(trace.StatusCode.ERROR, "Low Accuracy Detected"))

# --- Run the Example ---
if __name__ == "__main__":
    support_agent = MultimodalSupportAgent(model_endpoint="http://model-server/invoke")
    
    # Simulate an incoming user request with multimodal context
    user_query = "What is the status of my recent order?"
    context_documents = ["order_invoice.pdf", "customer_history.csv"]
    ground_truth = {"expected_answer": "Your order #1234 has shipped."}

    # Execute the loop
    agent_evaluation_loop(support_agent, user_query, context_documents, ground_truth)
    
    # In a real application, the metric reader would run in the background.
    # We call it explicitly here to see the output.
    metric_reader.collect()