### What problem does this PR solve?

This PR adds a new guide: **"RAG failure modes checklist"**.

RAG systems often fail in ways that are not immediately visible from a single metric like accuracy or latency. In practice, debugging production RAG applications requires identifying recurring failure patterns across retrieval, routing, evaluation, and deployment stages.

This guide introduces a structured, pattern-based checklist (P01–P12) to help users interpret traces, evaluation results, and dataset behavior within RAGFlow. The goal is to provide a practical way to classify incidents (e.g., retrieval hallucination, chunking issues, index staleness, routing misalignment) and reason about minimal structural fixes rather than ad-hoc prompt changes.

The change is documentation-only and does not modify any code or configuration.

Refs #13138

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
| id | title | sidebar_label |
| ----------------------- | ---------------------------- | --------------------------- |
| rag-failure-modes-checklist | RAG failure modes checklist | RAG failure modes checklist |

# RAG failure modes checklist

Retrieval-Augmented Generation (RAG) systems rarely fail in ways that a single metric like accuracy or latency can explain.
In practice, debugging a production RAG application means looking at **patterns of incidents** across the whole
pipeline, not just one bad answer.

This page gives a small, opinionated checklist of common RAG failure patterns and how they typically show up when
you inspect runs and evaluations in RAGFlow.

It is inspired by an **MIT-licensed open-source 16-problem RAG failure map** used in several evaluation projects,
and adapted here to match RAGFlow’s terminology and tooling.

---

## Why a failure-modes view?

RAGFlow already gives you traces, evals, and dataset views.
The checklist below is about **how to read what you see**:

- When an evaluation score is low, what kind of failure is it?
- When a trace looks noisy, is it a retrieval issue, a prompt issue, or a data-quality issue?
- When metrics are “good on average” but users still complain, what should you look at next?

Thinking in explicit failure modes makes it easier to design better eval datasets, to triage incidents, and to
communicate problems to the rest of your team.

---

## How to use this checklist with RAGFlow

A simple way to use this page in day-to-day work:

1. **Start from a symptom**
   - Low eval scores on a dataset.
   - A single bad run reported by a user.
   - A noisy trace or strange tool behavior.

2. **Find the closest pattern**
   - Scan the table below for the pattern whose “Typical symptom in RAGFlow” matches what you see.
   - The match does not need to be perfect; pick the closest pattern first.

3. **Use the “Where to look in RAGFlow” column**
   - Go to the suggested views (traces, evals, datasets, knowledge base, or system logs).
   - Confirm or reject the hypothesis.

4. **Record the failure mode**
   - Tag the run, add a comment, or track it in your own incident system using the pattern ID (P01–P12); see the sketch below.
   - Over time you will see which patterns dominate your incidents.

You can also adapt these IDs into your own internal taxonomy if you already have an incident process.

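
If it helps to make step 4 concrete, here is a minimal, team-side sketch of what a tagged incident record could look like. It is not a RAGFlow API: `FailureModeIncident`, the example run identifier, and where you store the records are all assumptions for your own tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The 12 pattern IDs from the table below. Extend this set if you add
# team-specific variants (for example "P03a", "P03b").
VALID_PATTERN_IDS = {f"P{i:02d}" for i in range(1, 13)}


@dataclass
class FailureModeIncident:
    """One observed incident, tagged with a failure-mode pattern ID."""

    pattern_id: str  # e.g. "P01"
    run_id: str      # identifier of the run / trace you inspected (hypothetical format)
    symptom: str     # short free-text description of what you saw
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        if self.pattern_id not in VALID_PATTERN_IDS:
            raise ValueError(f"Unknown pattern ID: {self.pattern_id}")


# Example: a user-reported answer that contradicted the retrieved documents.
incident = FailureModeIncident(
    pattern_id="P01",
    run_id="run-2024-07-01-0042",  # invented identifier
    symptom="Answer cites a discount policy that is not in the retrieved chunks.",
)
print(incident)
```
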
---

## Core failure patterns (P01–P12)

The table below lists 12 reusable patterns.
Each describes **what goes wrong**, **how it tends to look in RAGFlow**, and **where to investigate**.

| ID | Pattern name | Typical symptom in RAGFlow | Where to look in RAGFlow |
| --- | --- | --- | --- |
| P01 | Retrieval hallucination / grounding drift | Answers look fluent and confident, but contradict the retrieved documents or cite facts that are not there. | Trace view (answer vs. retrieved documents), evals with “groundedness” or “factuality” scores |
| P02 | Chunk boundary or segmentation bug | Key sentences are cut in the middle, or relevant context is split across chunks so no single chunk is useful. | Knowledge base / document preview, chunk metadata, retrieval examples |
| P03 | Embedding mismatch (semantic vs. vector distance) | Top-k results look “close” by vector distance, but human reviewers judge them as off-topic or shallow matches. | Vector search results, embedding configuration, eval datasets that directly check retrieval relevance |
| P04 | Index skew or staleness | Users see old or missing data even though the source of truth has been updated; evals pass on old snapshots. | Ingestion jobs, index timestamps, dataset versions, deployment timeline |
| P05 | Query rewriting or router misalignment | Similar user questions get routed to different tools or datasets; some flows never hit the right collection. | Router / tool selection traces, query rewrite logs, routing rules |
| P06 | Long-chain reasoning drift | Multi-step tasks start correctly but violate earlier constraints in later steps (dates, prices, policies, etc.). | Multi-hop traces, intermediate tool outputs, step-by-step evals |
| P07 | Tool-call misuse or ungrounded tools | LLM calls tools with wrong arguments, or calls tools when the answer is already in context; wasted latency and quota. | Tool call spans, arguments vs. retrieved context, cost / latency breakdown |
| P08 | Session memory leak / missing context | Follow-up questions ignore important details from earlier turns, or accidentally reuse stale context from another session. | Conversation history, session identifiers, memory storage / retrieval configuration |
| P09 | Evaluation blind spots | Evals look “green” but users still report obvious failures; dataset examples are too easy or not representative. | Eval dataset definitions, label guidelines, score distributions vs. real incidents |
| P10 | Startup ordering / dependency not ready | Newly deployed versions show spikes of 5xx, empty retrievals, or missing models during the first minutes after release. | Deployment logs, health checks, first-run traces after deploy |
| P11 | Config or secrets drift across environments | The same flow works locally but fails in staging or prod; model names, endpoints, or API keys differ silently. | Environment configs, secret management, environment-specific traces |
| P12 | Multi-tenant / multi-agent interference | Requests from different tenants or agents interfere with each other’s state, tools, or rate limits. | Tenant IDs and agent IDs in traces, shared resources (indexes, caches, queues) |

You do not need to use all 12.
It is completely fine to start with the 3–5 patterns that match your most common issues, then refine or extend the list.

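
If you script any triage or reporting around these IDs, it can be convenient to keep the checklist as plain data. The sketch below is only one possible encoding of the table above; the `FAILURE_MODES` name and the shortened hints are assumptions, not anything RAGFlow ships.

```python
# A machine-readable version of the table above, so triage scripts and
# dashboards can resolve a pattern ID to a name and a starting point.
FAILURE_MODES: dict[str, tuple[str, str]] = {
    "P01": ("Retrieval hallucination / grounding drift", "trace view, groundedness evals"),
    "P02": ("Chunk boundary or segmentation bug", "document preview, chunk metadata"),
    "P03": ("Embedding mismatch", "vector search results, embedding configuration"),
    "P04": ("Index skew or staleness", "ingestion jobs, index timestamps"),
    "P05": ("Query rewriting or router misalignment", "router traces, rewrite logs"),
    "P06": ("Long-chain reasoning drift", "multi-hop traces, step-by-step evals"),
    "P07": ("Tool-call misuse or ungrounded tools", "tool call spans, cost breakdown"),
    "P08": ("Session memory leak / missing context", "conversation history, session IDs"),
    "P09": ("Evaluation blind spots", "eval dataset definitions, score distributions"),
    "P10": ("Startup ordering / dependency not ready", "deployment logs, health checks"),
    "P11": ("Config or secrets drift across environments", "environment configs, secrets"),
    "P12": ("Multi-tenant / multi-agent interference", "tenant/agent IDs, shared resources"),
}

name, where_to_look = FAILURE_MODES["P03"]
print(f"P03 = {name}; start with: {where_to_look}")
```
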
---

## From symptom to pattern: a few examples

Here are three concrete “reading patterns” you can apply inside RAGFlow.

### Example A – Good retrieval, bad answer

- Retrieval evals show high relevance.
- Traces confirm that the correct document is in the top-k results.
- The model still answers incorrectly or adds extra facts.

This is usually **P01 – Retrieval hallucination / grounding drift**:

- The retriever does its job, but the answer prompt does not strictly tie the response to the retrieved context.
- Fixes tend to involve tighter answer prompts, better instructions around quoting sources, or adding explicit groundedness evals.

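
Before building a full groundedness eval, a quick way to sanity-check a P01 hypothesis is a rough lexical comparison between the answer and the retrieved chunks. The sketch below is deliberately naive (word overlap per sentence, an arbitrary 0.6 threshold, no LLM judge) and is not RAGFlow's groundedness metric; treat it as a smoke test only.

```python
import re


def rough_groundedness(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of answer sentences whose content words mostly appear in the context."""
    context_words = set(re.findall(r"[a-z0-9]+", " ".join(retrieved_chunks).lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3]
        if not words:
            supported += 1  # nothing substantive to check in this sentence
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap >= 0.6:  # arbitrary threshold; tune on a few labeled examples
            supported += 1
    return supported / len(sentences)


# Invented example: the answer claims things the retrieved context does not support.
score = rough_groundedness(
    "The warranty lasts 5 years and covers water damage.",
    ["Our standard warranty period is 2 years.", "Accidental damage is not covered."],
)
print(f"rough groundedness: {score:.2f}")  # a low score is a hint to investigate as P01
```
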
### Example B – Noisy trace, weak retrieval

- User reports “it answers something, but not what I asked”.
- Trace shows multiple tool calls and retries.
- Retrieved chunks are partially related but miss the critical detail.

This often indicates a mix of **P02 – Chunk boundary or segmentation bug** and **P03 – Embedding mismatch**:

- Check how documents were split and whether important sentences are being cut.
- Check the embedding model, normalization, and distance metric.
- Add a small retrieval-only eval dataset to isolate the problem from answer generation.

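
The retrieval-only eval mentioned above can be tiny and still useful. The sketch below assumes nothing about RAGFlow internals: `search` is a placeholder for whatever retrieval call you use, the dataset is invented, and recall@k is just one convenient metric.

```python
from typing import Callable

# Each example: a query and the chunk/document ID a correct retrieval must include.
RETRIEVAL_EVAL_SET = [
    {"query": "How long is the standard warranty?", "expected_id": "doc-warranty-001"},
    {"query": "Can I return an opened product?", "expected_id": "doc-returns-003"},
]


def recall_at_k(search: Callable[[str, int], list[str]], k: int = 5) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results."""
    hits = 0
    for example in RETRIEVAL_EVAL_SET:
        retrieved_ids = search(example["query"], k)
        if example["expected_id"] in retrieved_ids:
            hits += 1
    return hits / len(RETRIEVAL_EVAL_SET)


# A fake retriever so the sketch runs as-is; swap in your real retrieval call.
def fake_search(query: str, k: int) -> list[str]:
    return ["doc-warranty-001", "doc-shipping-007"][:k]


print(f"recall@5 = {recall_at_k(fake_search):.2f}")  # 0.50 on this toy set
```
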
### Example C – Everything is green except production

- Automated evals are mostly high.
- Synthetic test questions pass.
- Real user traffic still contains surprising failures that evals never catch.

This is classic **P09 – Evaluation blind spots**:

- Your eval set covers only a narrow slice of real queries.
- Real incidents fall into different patterns (P01–P08, P10–P12) that were never sampled.
- The fix is to feed real incidents back into new eval datasets and label them with the appropriate failure mode.

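
One lightweight way to close that loop is to append every labeled incident to a file that later seeds new eval examples. The JSONL layout, field names, and file path below are assumptions, not a RAGFlow dataset format.

```python
import json
from pathlib import Path


def log_incident_as_eval_candidate(
    path: Path,
    question: str,
    observed_answer: str,
    failure_mode: str,        # whichever pattern ID the incident matched
    expected_behavior: str,
) -> None:
    """Append one real incident to a JSONL file of future eval examples."""
    record = {
        "question": question,
        "observed_answer": observed_answer,
        "failure_mode": failure_mode,
        "expected_behavior": expected_behavior,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Invented incident: the answer contradicted the documentation.
log_incident_as_eval_candidate(
    Path("eval_candidates.jsonl"),
    question="Does the premium plan include phone support?",
    observed_answer="Yes, 24/7 phone support is included.",
    failure_mode="P01",
    expected_behavior="Answer must be grounded in the plan comparison page.",
)
```
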
---

## Extending the checklist for your team

This page is intentionally small. It is meant as a **starting vocabulary**, not a finished ontology.

When you see repeated patterns in your own RAGFlow traces:

- Copy the table and add **team-specific variants** (for example, splitting P03 into separate patterns for different indexes).
- Attach pattern IDs (P01–P12 or your own) to incidents, tickets, or run tags.
- Use the distribution of failure modes to prioritize engineering work (see the sketch below):
  - If most incidents are P02/P03, invest in ingestion and indexing.
  - If most incidents are P01/P06/P07, focus on prompts, tools, and chain design.
  - If many incidents are P09/P10/P11, improve deployment, configs, and eval datasets.

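
Here is a minimal sketch of the tally behind that prioritization, assuming you have been tagging incidents with pattern IDs somewhere (run tags, tickets, or a log like the one in Example C); the sample tags are invented.

```python
from collections import Counter

# Pattern IDs collected from incidents over some period (invented numbers).
incident_tags = ["P02", "P03", "P02", "P09", "P01", "P02", "P03", "P11"]

distribution = Counter(incident_tags)
total = sum(distribution.values())

for pattern_id, count in distribution.most_common():
    print(f"{pattern_id}: {count:2d}  ({count / total:.0%})")
# In this invented sample, P02/P03 dominate, which would argue for spending
# the next iteration on ingestion and indexing rather than prompt changes.
```
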

Over time, this checklist should evolve into **your own RAG incident map**, built on top of RAGFlow’s traces and
evaluations, and tailored to your stack and users.