ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-21 00:36:43 +08:00

Files

07heco e194027b01 refactor: optimize BaseTitleChunker to improve RAG document chunk quality (#14247 )

## RAG Optimization Description
Optimize the core `BaseTitleChunker` in
`rag/flow/chunker/title_chunker/common.py` to improve RAG document
chunking quality and retrieval accuracy.

## Key Changes
1. **Format-branched text processing**: Preserve original whitespace &
indentation for Markdown/HTML payloads to maintain document semantics
and chunk fidelity; only perform full whitespace cleaning on plain text
content.
2. **Empty chunk filtering**: Thoroughly filter invalid pure-blank lines
to reduce noisy data in vector database.
3. **Code deduplication**: Unified markdown/text/html payload extraction
logic, removed redundant repeated code blocks.
4. **None serialization fix**: Avoid converting `None` value into
literal `"None"` string in chunk text fields.
5. **Production logging**: Added input/output line count logging for
filter logic, observable in online environment.
6. **100% backward compatible**: No changes to chunking hierarchy rules,
output format and all existing workflows.

## RAG Business Value
- Preserves document format fidelity for structured Markdown/HTML files
- Reduces invalid noisy chunks → improves RAG retrieval precision
- Cleans plain text data → optimizes vector embedding quality
- Improves code maintainability with no breaking changes
- Provides observable logging for chunk filtering behavior

## Compatibility
- ✅ No API changes
- ✅ No chunk logic modifications
- ✅ All document parsing/chunking workflows unaffected
- ✅ All pre-checks passed, no code conflicts

### Type of change

- [x] Refactoring
- [x] Performance Improvement

2026-05-18 10:00:18 +08:00

chunker

refactor: optimize BaseTitleChunker to improve RAG document chunk quality (#14247 )

2026-05-18 10:00:18 +08:00

extractor

Fix: tokenizer issue. (#11902 )

2025-12-11 17:38:17 +08:00

parser

Feat: add button for remove header & footer in pipeline (#14486 )

2026-04-30 12:30:41 +08:00

tests

Feat: Refact pipeline (#13826 )