ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-29 20:17:35 +08:00


tunsuy 292a1a8566 fix: detect and fallback garbled PDF text to OCR (#13366) (#13404)

## Problem

When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer)
cannot map CIDs to the correct Unicode characters and instead outputs PUA
characters (U+E000 to U+F8FF) or `(cid:xxx)` placeholders. The original
code fully trusted pdfplumber text without any garbled-text detection, so
the garbled output reached the final parsed result.

Relates to #13366

## Solution

### 1. Garbled text detection functions
- `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16),
replacement character U+FFFD, control characters, and
unassigned/surrogate codepoints
- `_is_garbled_text(text, threshold)`: Calculates garbled ratio and
detects `(cid:xxx)` patterns
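
The two checks described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the PR: the names `is_garbled_char` and `is_garbled_text`, the choice to ignore whitespace when computing the ratio, the exemption of tab/newline from the control-character check, and treating any `(cid:N)` match as garbled are all assumptions.

```python
import re
import unicodedata

# Assumed shape of the placeholder pdfminer emits for unmapped glyphs.
CID_RE = re.compile(r"\(cid:\d+\)")


def is_garbled_char(ch: str) -> bool:
    """Return True for characters that usually indicate a failed CID-to-Unicode mapping."""
    cp = ord(ch)
    # Private Use Areas: BMP, Plane 15, Plane 16.
    if 0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD:
        return True
    # Unicode replacement character.
    if cp == 0xFFFD:
        return True
    cat = unicodedata.category(ch)
    # Control characters, except ordinary whitespace controls (assumption).
    if cat == "Cc":
        return ch not in "\t\n\r"
    # Unassigned or surrogate code points.
    return cat in ("Cn", "Cs")


def is_garbled_text(text: str, threshold: float = 0.5) -> bool:
    """Return True if text contains a CID placeholder or too many garbled characters."""
    if CID_RE.search(text):
        return True
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    garbled = sum(1 for c in chars if is_garbled_char(c))
    return garbled / len(chars) >= threshold
```

For example, `is_garbled_text("\ue000\ue001ab")` is true at the default threshold (2 of 4 characters are PUA), while ordinary CJK or English text is not flagged.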

### 2. Box-level fallback (in `__ocr()`)
When a text box has ≥50% garbled characters, discard the pdfplumber text
and fall back to OCR recognition.
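
A minimal sketch of the box-level decision, under stated assumptions: `resolve_box_text`, `garbled_ratio`, and the `run_ocr` callback are hypothetical names, and the garbled check here is deliberately simplified (BMP PUA, U+FFFD, and `(cid:N)` only) rather than the full detection the PR describes.

```python
import re
from typing import Callable

CID_RE = re.compile(r"\(cid:\d+\)")


def garbled_ratio(text: str) -> float:
    """Share of non-space characters that are BMP PUA or U+FFFD (simplified check)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars if 0xE000 <= ord(c) <= 0xF8FF or c == "\ufffd")
    return bad / len(chars)


def resolve_box_text(pdf_text: str, run_ocr: Callable[[], str], threshold: float = 0.5) -> str:
    """Keep the extracted text unless it looks garbled; otherwise use OCR output."""
    if CID_RE.search(pdf_text) or garbled_ratio(pdf_text) >= threshold:
        # Extracted text is untrustworthy: discard it and recognize the box image instead.
        return run_ocr()
    return pdf_text
```

Passing OCR as a callback keeps the expensive recognition step lazy, so it only runs for boxes that actually fail the check.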

### 3. Page-level detection (in `__images__()`)
Sample characters from each page; if the garbled rate is ≥30%, clear all
pdfplumber characters for that page, forcing full OCR.
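
The page-level check might look like the sketch below. It assumes the page characters arrive as pdfplumber-style dicts with a `"text"` key; the function names, the sample size of 200, the seeded random sampling, and the simplified garbled test are illustrative assumptions, since the source does not say how sampling is done.

```python
import random
from typing import Dict, List


def looks_garbled(s: str) -> bool:
    """Simplified check: BMP PUA, U+FFFD, or a '(cid:' placeholder."""
    return s.startswith("(cid:") or any(
        0xE000 <= ord(ch) <= 0xF8FF or ch == "\ufffd" for ch in s
    )


def page_is_garbled(chars: List[Dict], threshold: float = 0.3,
                    sample_size: int = 200, seed: int = 0) -> bool:
    """Estimate the garbled rate of a page from a sample of its characters."""
    texts = [c["text"] for c in chars if c.get("text", "").strip()]
    if not texts:
        return False
    if len(texts) > sample_size:
        # Sample to bound the cost on very dense pages (assumed strategy).
        texts = random.Random(seed).sample(texts, sample_size)
    bad = sum(1 for t in texts if looks_garbled(t))
    return bad / len(texts) >= threshold


def filter_page_chars(chars: List[Dict], threshold: float = 0.3) -> List[Dict]:
    """Return no characters for a garbled page, so every box must be filled by OCR."""
    return [] if page_is_garbled(chars, threshold) else chars
```

Clearing the whole page, rather than individual characters, avoids mixing partially trusted text with OCR output on pages whose fonts are broken throughout.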

### 4. Layout recognizer CID filtering
Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text
processing to prevent them from polluting layout analysis.
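
As a rough illustration of the filtering step, a regular expression can strip the placeholders before the text is used. The name `strip_cid` and the exact pattern are assumptions, not the code in `layout_recognizer.py`.

```python
import re

# Matches placeholders such as "(cid:123)" (assumed format).
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def strip_cid(text: str) -> str:
    """Remove (cid:N) placeholders, leaving the surrounding text untouched."""
    return CID_PATTERN.sub("", text)
```

With this sketch, `strip_cid("Ti(cid:12)tle")` yields `"Title"`.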

## Testing
- 29 unit tests covering: normal CJK/English text, PUA characters, CID
patterns, mixed text, boundary thresholds, edge cases
- All 85 existing project unit tests pass without regression

2026-03-10 11:20:31 +08:00

__init__.py

Refactor file utils (#10970)

2025-11-03 18:54:55 +08:00

layout_recognizer.py

fix: detect and fallback garbled PDF text to OCR (#13366) (#13404)

2026-03-10 11:20:31 +08:00

ocr.py

feat(deepdoc): add configurable ONNX thread counts and GPU memory shrinkage (#12777)

2026-01-23 11:36:28 +08:00

operators.py

refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)