ragflow/docs/guides/rag_failure_modes_checklist.mdx
PSBigBig × MiniPS b7eca981d4 docs: add RAG failure modes checklist guide (refs #13138) (#13204)
### What problem does this PR solve?

This PR adds a new guide: **"RAG failure modes checklist"**.

RAG systems often fail in ways that are not immediately visible from a
single metric like accuracy or latency. In practice, debugging
production RAG applications requires identifying recurring failure
patterns across retrieval, routing, evaluation, and deployment stages.

This guide introduces a structured, pattern-based checklist (P01–P12) to
help users interpret traces, evaluation results, and dataset behavior
within RAGFlow. The goal is to provide a practical way to classify
incidents (e.g., retrieval hallucination, chunking issues, index
staleness, routing misalignment) and reason about minimal structural
fixes rather than ad-hoc prompt changes.

The change is documentation-only and does not modify any code or
configuration.

Refs #13138


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---
id: rag-failure-modes-checklist
title: RAG failure modes checklist
sidebar_label: RAG failure modes checklist
---
# RAG failure modes checklist
Retrieval-Augmented Generation (RAG) systems rarely fail in ways that show up in a single metric like accuracy or latency.
In practice, debugging a production RAG application means looking at **patterns of incidents** across the whole
pipeline, not just one bad answer.
This page gives a small, opinionated checklist of common RAG failure patterns and how they typically show up when
you inspect runs and evaluations in RAGFlow.
It is inspired by an **MIT-licensed open-source 16-problem RAG failure map** used in several evaluation projects,
and adapted here to match RAGFlow's terminology and tooling.
---
## Why a failure-modes view?
RAGFlow already gives you traces, evals, and dataset views.
The checklist below is about **how to read what you see**:
- When an evaluation score is low, what kind of failure is it?
- When a trace looks noisy, is it a retrieval issue, a prompt issue, or data quality?
- When metrics are “good on average” but users still complain, what should you look at next?
Thinking in explicit failure modes makes it easier to design better eval datasets, to triage incidents, and to
communicate problems to the rest of your team.
---
## How to use this checklist with RAGFlow
A simple way to use this page in day-to-day work:
1. **Start from a symptom**
- Low eval scores on a dataset.
- A single bad run reported by a user.
- A noisy trace or strange tool behavior.
2. **Find the closest pattern**
- Scan the table below for the pattern whose “Typical symptom in RAGFlow” matches what you see.
- It does not need to be perfect; pick the closest one first.
3. **Use the “Where to look in RAGFlow” column**
- Go to the suggested views (traces, evals, datasets, knowledge base, or system logs).
- Confirm or reject the hypothesis.
4. **Record the failure mode**
- Tag the run, add a comment, or track it in your own incident system using the pattern ID (P01–P12); a minimal sketch of such a record follows at the end of this section.
- Over time you will see which patterns dominate your incidents.
You can also adapt these IDs into your own internal taxonomy if you already have an incident process.
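As a minimal illustration of step 4, the sketch below records a tagged incident in a local JSONL file. The `Incident` fields, the helper name, and the file path are all hypothetical; RAGFlow run tags, comments, or your existing incident tracker can serve the same purpose.

```python
# Minimal sketch of step 4 (hypothetical helper, not a RAGFlow API):
# append one tagged incident per line to a local JSONL log.
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class Incident:
    run_id: str      # identifier of the run or trace you inspected
    pattern_id: str  # one of P01-P12, or your own extension
    question: str    # the user question that triggered the incident
    note: str        # short description of what you observed


def record_incident(incident: Incident, path: str = "rag_incidents.jsonl") -> None:
    entry = asdict(incident)
    entry["recorded_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


# Example: a run whose answer contradicted the retrieved documents (P01).
record_incident(Incident(
    run_id="run-2024-001",
    pattern_id="P01",
    question="What is the refund window for annual plans?",
    note="Answer cites a 90-day window that appears in none of the retrieved chunks",
))
```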
---
## Core failure patterns (P01–P12)
The table below lists 12 reusable patterns.
Each describes **what goes wrong**, **how it tends to look in RAGFlow**, and **where to investigate**.
| ID | Pattern name | Typical symptom in RAGFlow | Where to look in RAGFlow |
| --- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| P01 | Retrieval hallucination / grounding drift | Answers look fluent and confident, but contradict the retrieved documents or cite facts that are not there. | Trace view (answer vs. retrieved documents), evals with “groundedness” or “factuality” scores |
| P02 | Chunk boundary or segmentation bug | Key sentences are cut in the middle, or relevant context is split across chunks so no single chunk is useful. | Knowledge base / document preview, chunk metadata, retrieval examples |
| P03 | Embedding mismatch (semantic vs vector distance) | Top-k results look “close” by vector distance, but human reviewers judge them as off-topic or shallow matches. | Vector search results, embedding configuration, eval datasets that directly check retrieval relevance |
| P04 | Index skew or staleness | Users see old or missing data even though the source of truth has been updated; evals pass on old snapshots. | Ingestion jobs, index timestamps, dataset versions, deployment timeline |
| P05 | Query rewriting or router misalignment | Similar user questions get routed to different tools or datasets; some flows never hit the right collection. | Router / tool selection traces, query rewrite logs, routing rules |
| P06 | Long-chain reasoning drift | Multi-step tasks start correctly but violate earlier constraints in later steps (dates, prices, policies, etc.). | Multi-hop traces, intermediate tool outputs, step-by-step evals |
| P07 | Tool-call misuse or ungrounded tools | LLM calls tools with wrong arguments, or calls tools when the answer is already in context; wasted latency and quota. | Tool call spans, arguments vs. retrieved context, cost / latency breakdown |
| P08 | Session memory leak / missing context | Follow-up questions ignore important details from earlier turns, or accidentally reuse stale context from another session. | Conversation history, session identifiers, memory storage / retrieval configuration |
| P09 | Evaluation blind spots | Evals look “green” but users still report obvious failures; dataset examples are too easy or not representative. | Eval dataset definitions, label guidelines, score distributions vs. real incidents |
| P10 | Startup ordering / dependency not ready | Newly deployed versions show spikes of 5xx, empty retrievals, or missing models during the first minutes after release. | Deployment logs, health checks, first-run traces after deploy |
| P11 | Config or secrets drift across environments | The same flow works locally but fails in staging or prod; model names, endpoints, or API keys differ silently. | Environment configs, secret management, environment-specific traces |
| P12 | Multi-tenant / multi-agent interference | Requests from different tenants or agents interfere with each other's state, tools, or rate limits. | Tenant IDs and agent IDs in traces, shared resources (indexes, caches, queues) |
You do not need to use all 12.
It is completely fine to start with 3–5 that match your most common issues, then refine or extend the list.
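If you do adopt the IDs, it can help to keep the vocabulary in one place so tags stay consistent across tools. A small sketch mirroring the table above (the dict name and validation helper are illustrative):

```python
# Sketch: the P01-P12 vocabulary as a plain dict, so incident tags can be validated.
FAILURE_PATTERNS = {
    "P01": "Retrieval hallucination / grounding drift",
    "P02": "Chunk boundary or segmentation bug",
    "P03": "Embedding mismatch (semantic vs vector distance)",
    "P04": "Index skew or staleness",
    "P05": "Query rewriting or router misalignment",
    "P06": "Long-chain reasoning drift",
    "P07": "Tool-call misuse or ungrounded tools",
    "P08": "Session memory leak / missing context",
    "P09": "Evaluation blind spots",
    "P10": "Startup ordering / dependency not ready",
    "P11": "Config or secrets drift across environments",
    "P12": "Multi-tenant / multi-agent interference",
}


def validate_tag(pattern_id: str) -> str:
    """Reject tags that are not part of the shared vocabulary."""
    if pattern_id not in FAILURE_PATTERNS:
        raise ValueError(f"Unknown failure pattern: {pattern_id}")
    return pattern_id
```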
---
## From symptom to pattern: a few examples
Here are three concrete “reading patterns” you can apply inside RAGFlow.
### Example A – Good retrieval, bad answer
- Retrieval evals show high relevance.
- Traces confirm that the correct document is in the top-k results.
- The model still answers incorrectly or adds extra facts.
This is usually **P01 – Retrieval hallucination / grounding drift**:
- The retriever does its job, but the answer prompt does not strictly tie the response to the retrieved context.
- Fixes tend to involve tighter answer prompts, better instructions around quoting sources, or adding explicit groundedness evals.
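A rough way to confirm the P01 hypothesis before writing a full groundedness eval is a lexical-overlap check between the answer and the retrieved chunks. The sketch below is only a triage signal, under the assumption that sentences with little overlap hint at grounding drift; an LLM judge or NLI model is the better long-term eval.

```python
# Naive grounding triage (illustrative only): flag answer sentences with little
# lexical overlap with the retrieved context.
import re


def unsupported_sentences(answer: str, retrieved_chunks: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    context_tokens = set(re.findall(r"\w+", " ".join(retrieved_chunks).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if tokens and len(tokens & context_tokens) / len(tokens) < min_overlap:
            flagged.append(sentence)
    return flagged


# Sentences returned here are candidates for a P01 tag, not proof on their own.
```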
### Example B – Noisy trace, weak retrieval
- User reports “it answers something, but not what I asked”.
- Trace shows multiple tool calls and retries.
- Retrieved chunks are partially related but miss the critical detail.
This often indicates a mix of **P02 – Chunk boundary or segmentation bug** and **P03 – Embedding mismatch**:
- Check how documents were split and whether important sentences are being cut.
- Check embedding model, normalization, and distance metric.
- Add a small retrieval-only eval dataset to isolate the problem from answer generation.
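The retrieval-only dataset can be very small. Below is a sketch of a recall@k check over labeled (query, expected chunk ID) pairs, where `search` is a placeholder for whatever retrieval call you use:

```python
# Sketch: retrieval-only eval (recall@k) to isolate P02/P03 from answer generation.
# `search` is a placeholder: any callable that returns ranked chunk IDs for a query.
from typing import Callable


def recall_at_k(dataset: list[tuple[str, str]],
                search: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    """Fraction of queries whose expected chunk appears in the top-k results."""
    hits = sum(1 for query, expected in dataset if expected in search(query, k))
    return hits / len(dataset) if dataset else 0.0


# Toy usage with a fake retriever, purely for illustration.
fake_index = {"refund policy": ["chunk-12", "chunk-7"], "pricing tiers": ["chunk-3"]}
score = recall_at_k(
    [("refund policy", "chunk-7"), ("pricing tiers", "chunk-9")],
    search=lambda query, k: fake_index.get(query, [])[:k],
)
print(f"recall@5 = {score:.2f}")  # 0.50 for this toy data
```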
### Example C – Everything is green except production
- Automated evals are mostly high.
- Synthetic test questions pass.
- Real user traffic still contains surprising failures that evals never catch.
This is classic **P09 – Evaluation blind spots**:
- Your eval set covers only a narrow slice of real queries.
- Real incidents fall into different patterns (P01–P08, P10–P12) that were never sampled.
- The fix is to feed real incidents back into new eval datasets and label them with the appropriate failure mode.
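One hedged way to close that loop is to turn each logged incident back into an eval example, keeping the pattern tag as a label so blind spots become measurable over time. The field names below are illustrative and assume incidents were logged as in the earlier sketch:

```python
# Sketch: turn tagged incidents into eval dataset rows (field names illustrative).
import json


def incidents_to_eval_rows(incident_log: str = "rag_incidents.jsonl") -> list[dict]:
    rows = []
    with open(incident_log, encoding="utf-8") as f:
        for line in f:
            incident = json.loads(line)
            rows.append({
                "question": incident["question"],           # the real user query
                "notes": incident["note"],                  # what went wrong, for labelers
                "failure_pattern": incident["pattern_id"],  # keeps the taxonomy visible in evals
            })
    return rows
```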
---
## Extending the checklist for your team
This page is intentionally small. It is meant as a **starting vocabulary**, not a finished ontology.
When you see repeated patterns in your own RAGFlow traces:
- Copy the table and add **team-specific variants** (for example, splitting P03 into separate patterns for different indexes).
- Attach pattern IDs (P01–P12 or your own) to incidents, tickets, or run tags.
- Use the distribution of failure modes to prioritize engineering work (a counting sketch follows this list):
- If most incidents are P02/P03, invest in ingestion and indexing.
- If most incidents are P01/P06/P07, focus on prompts, tools, and chain design.
- If many incidents are P09/P10/P11, improve deployment, configs, and eval datasets.
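A minimal sketch of that prioritization step, assuming incidents were logged as in the earlier examples (the log path and field names are illustrative):

```python
# Sketch: count which failure patterns dominate, to decide where to invest next.
import json
from collections import Counter


def pattern_distribution(incident_log: str = "rag_incidents.jsonl") -> Counter:
    counts: Counter = Counter()
    with open(incident_log, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["pattern_id"]] += 1
    return counts


# e.g. Counter({"P03": 14, "P02": 9, "P01": 5}) -> ingestion and indexing work first.
```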
Over time, this checklist should evolve into **your own RAG incident map**, built on top of RAGFlow's traces and
evaluations, and tailored to your stack and users.