Mirror of https://github.com/infiniflow/ragflow.git (synced 2026-05-21 00:36:43 +08:00)

## RAG Optimization Description

Optimize the core `BaseTitleChunker` in `rag/flow/chunker/title_chunker/common.py` to improve RAG document chunking quality and retrieval accuracy.

## Key Changes

1. **Format-branched text processing**: Preserve the original whitespace and indentation of Markdown/HTML payloads to maintain document semantics and chunk fidelity; apply full whitespace cleaning to plain-text content only.
2. **Empty chunk filtering**: Filter out whitespace-only lines so that empty chunks do not add noise to the vector database.
3. **Code deduplication**: Unify the Markdown/text/HTML payload extraction logic and remove the repeated code blocks.
4. **None serialization fix**: Avoid converting a `None` value into the literal string `"None"` in chunk text fields.
5. **Production logging**: Log the input and output line counts of the filter logic so that filtering behavior is observable in production.
6. **Backward compatible**: No changes to the chunking hierarchy rules, the output format, or any existing workflow.

## RAG Business Value

- Preserves format fidelity for structured Markdown/HTML files
- Reduces invalid, noisy chunks, which improves RAG retrieval precision
- Cleans plain-text data, which improves vector embedding quality
- Improves code maintainability with no breaking changes
- Provides observable logging for chunk filtering behavior

## Compatibility

- ✅ No API changes
- ✅ No changes to chunking logic
- ✅ All document parsing/chunking workflows unaffected
- ✅ All pre-checks passed, no code conflicts

### Type of change

- [x] Refactoring
- [x] Performance Improvement
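
For reviewers, the format-branched processing, blank-line filtering, `None` handling, and line-count logging listed under Key Changes can be illustrated with a minimal sketch. This is not the code in `common.py`: the names `safe_text`, `normalize_lines`, and `STRUCTURED_FORMATS`, the format labels, and the log message are all hypothetical and chosen only for illustration.

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)

# Hypothetical format labels; the real payload keys may differ.
STRUCTURED_FORMATS = {"markdown", "html"}


def safe_text(value: Optional[Any]) -> str:
    """Return '' for None instead of the literal string 'None'."""
    return "" if value is None else str(value)


def normalize_lines(payload: Optional[str], fmt: str) -> List[str]:
    """Split a payload into lines and drop whitespace-only ones.

    Markdown/HTML lines keep their original whitespace and indentation;
    plain-text lines have whitespace runs collapsed and ends stripped.
    """
    raw_lines = safe_text(payload).splitlines()
    kept: List[str] = []
    for line in raw_lines:
        if not line.strip():
            continue  # whitespace-only line: would become an empty chunk
        if fmt in STRUCTURED_FORMATS:
            kept.append(line)  # preserve indentation and inner spacing
        else:
            kept.append(" ".join(line.split()))  # full whitespace cleaning
    # Input/output counts make the filter's effect visible in logs.
    logger.info(
        "line filter: format=%s input_lines=%d output_lines=%d",
        fmt, len(raw_lines), len(kept),
    )
    return kept


print(normalize_lines("# Title\n\n    indented  code\n", "markdown"))
# ['# Title', '    indented  code']
print(normalize_lines("  hello   world \n\t\n", "text"))
# ['hello world']
```

In this sketch the blank-line filter is shared by both branches, so only the per-line cleaning differs by format; that mirrors the deduplication goal without asserting how the actual extraction code is organized.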