ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-24 18:07:35 +08:00


tunsuy 292a1a8566 fix: detect and fallback garbled PDF text to OCR (#13366) (#13404)

## Problem

When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer)
cannot map CIDs to the correct Unicode characters and instead outputs
PUA characters (U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original
code trusted pdfplumber text without any garbled-text detection, so the
garbled output reached the final parsed result.

Relates to #13366

## Solution

### 1. Garbled text detection functions
- `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16),
replacement character U+FFFD, control characters, and
unassigned/surrogate codepoints
- `_is_garbled_text(text, threshold)`: Calculates garbled ratio and
detects `(cid:xxx)` patterns

### 2. Box-level fallback (in `__ocr()`)
When a text box has ≥50% garbled characters, discard the pdfplumber text
and fall back to OCR recognition.
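
As an illustration of the box-level rule, the sketch below picks between the extracted text and an OCR result. The names `garbled_ratio`, `resolve_box_text`, and the `run_ocr` callback are hypothetical, and the ratio is simplified to BMP PUA characters plus U+FFFD; the real `__ocr()` method is not reproduced here.

```python
from typing import Callable


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are BMP PUA or U+FFFD (simplified)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars if c == "\ufffd" or 0xE000 <= ord(c) <= 0xF8FF)
    return bad / len(chars)


def resolve_box_text(
    pdf_text: str,
    run_ocr: Callable[[], str],  # hypothetical callback that OCRs the box image
    threshold: float = 0.5,
) -> str:
    """Keep the extracted text unless it is mostly garbled, then use OCR instead."""
    if garbled_ratio(pdf_text) >= threshold:
        return run_ocr()
    return pdf_text
```

Passing OCR as a callback keeps the expensive recognition step lazy: it only runs for boxes that actually cross the threshold.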

### 3. Page-level detection (in `__images__()`)
Sample characters from each page; if the garbled rate is ≥30%, clear all
pdfplumber characters for that page, forcing full OCR.
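
A minimal sketch of the page-level check follows. It assumes page characters are dicts with a `"text"` key, as pdfplumber exposes them, and samples with a fixed stride; the sampling strategy, the `max_samples` value, and the function names are assumptions, since the description only says characters are sampled.

```python
from typing import Dict, List


def _looks_garbled(char_text: str) -> bool:
    """Simplified check: CID placeholder, U+FFFD, or a BMP PUA character."""
    if char_text.startswith("(cid:"):
        return True
    return any(c == "\ufffd" or 0xE000 <= ord(c) <= 0xF8FF for c in char_text)


def filter_page_chars(
    chars: List[Dict],
    threshold: float = 0.3,
    max_samples: int = 200,  # hypothetical cap on how many chars are inspected
) -> List[Dict]:
    """Return [] when the sampled garbled rate reaches the threshold, else chars."""
    if not chars:
        return chars
    step = max(1, len(chars) // max_samples)  # evenly spaced sample
    sample = chars[::step]
    garbled = sum(1 for c in sample if _looks_garbled(c.get("text", "")))
    if garbled / len(sample) >= threshold:
        return []  # no extracted chars left, so the page goes through full OCR
    return chars
```

The page threshold is lower than the box threshold because a page where nearly a third of the glyphs are unmapped usually indicates a broken font for the whole page, not an isolated box.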

### 4. Layout recognizer CID filtering
Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text
processing to prevent them from polluting layout analysis.
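
Stripping the placeholders can be sketched with a single regular expression. The helper name `strip_cid` and the exact pattern are assumptions; the filtering code in `layout_recognizer.py` is not shown here.

```python
import re

# Assumed placeholder shape: "(cid:" followed by digits and ")".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def strip_cid(text: str) -> str:
    """Remove (cid:NNN) placeholders so they do not count as layout text."""
    return CID_PATTERN.sub("", text)
```

For example, `strip_cid("Total(cid:31): 42")` returns `"Total: 42"`.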

## Testing
- 29 unit tests covering: normal CJK/English text, PUA characters, CID
patterns, mixed text, boundary thresholds, edge cases
- All 85 existing project unit tests pass without regression

2026-03-10 11:20:31 +08:00

resume

Add license and fix IDE warnings (#11985)

2025-12-17 17:04:44 +08:00

__init__.py

Feat: advanced markdown parsing (#9607)

2025-08-21 09:36:18 +08:00

docling_parser.py

Refactor: improve the docling parser's box extraction logic (#13215)

2026-02-28 10:05:24 +08:00

docx_parser.py

Refactor parser code (#9042)