ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-03-30 02:40:00 +08:00

Author	SHA1	Message	Date
黄圣祺	406339af1f	Fix(paddleocr): load all PDF pages for image cropping instead of first 100 (#13811) ## Summary Closes #13803 The `__images__` method in `paddleocr_parser.py` defaulted to `page_to=100`, only loading the first 100 pages for image cropping. However, the PaddleOCR API processes all pages of the PDF. For PDFs with more than 100 pages, page indices beyond 99 were rejected as out of range during crop validation, causing content loss. ## Root Cause ``` __images__(page_to=100) → loads pages 0-99 → page_images has 100 entries PaddleOCR API → processes all 226 pages → tags reference pages 1-226 extract_positions() → converts tag "101" to index 100 crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range" ``` ## Fix Changed `page_to` default from `100` to `10**9`, so all PDF pages are loaded for cropping. Python's list slicing safely handles oversized indices. ## Test plan - [ ] Parse a PDF with >100 pages using PaddleOCR: no more "out of range" warnings - [ ] Parse a PDF with <100 pages: behavior unchanged - [ ] Verify cropped images are generated correctly for all pages 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 09:33:11 +08:00
Magicbook1108	09ff1bc2b0	Fix: paddle ocr coordinate lower > upper (#13630) ### What problem does this PR solve? Fix: paddle ocr coordinate lower > upper #13618 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-16 20:15:26 +08:00
Stephen Hu	d0465ba909	refactor: improve paddle ocr logic (#13467) ### What problem does this PR solve? improve paddle ocr logic ### Type of change - [x] Refactoring	2026-03-09 14:16:57 +08:00
Magicbook1108	826af383b4	Fix: paddle ocr missing outlines (#13441) ### What problem does this PR solve? Fix: paddle ocr missing outlines #13422 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-06 17:19:51 +08:00
Lin Manhui	27a36344d4	Feat: Support PaddleOCR-VL-1.5 interface (#12819) ### What problem does this PR solve? This PR adds support for the PaddleOCR-VL-1.5 interface to the PaddleOCR PDF Parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-27 09:49:46 +08:00
Lin Manhui	4fe3c24198	feat: PaddleOCR PDF parser supports thumbnails and positions (#12565) ### What problem does this PR solve? 1. PaddleOCR PDF parser supports thumbnails and positions. 2. Add FAQ documentation for PaddleOCR PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-13 09:51:08 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00

7 Commits
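
The root cause described in commit 406339af1f can be reproduced with plain Python lists. The sketch below is not the code from `paddleocr_parser.py`: `load_pages`, `tag_to_index`, and `index_in_range` are hypothetical stand-ins for the behavior the commit message describes (page loading with a `page_to` bound, 1-based tag to 0-based index conversion, and the crop range check). It only illustrates why a default of `page_to=100` drops pages and why `10**9` is safe, since list slicing clamps an oversized upper bound.

```python
# Hypothetical stand-ins; the real parser's functions and signatures may differ.

def load_pages(all_pages, page_from=0, page_to=10**9):
    """Return the pages kept for cropping.

    Slicing clamps an oversized upper bound to len(all_pages),
    so a very large page_to simply means "all pages".
    """
    return all_pages[page_from:page_to]


def tag_to_index(tag):
    """Convert a 1-based page tag such as "101" to a 0-based index."""
    return int(tag) - 1


def index_in_range(index, page_images):
    """Mirror the range check quoted in the commit: 0 <= index < len(page_images)."""
    return 0 <= index < len(page_images)


# A 226-page document, as in the commit's example.
pdf_pages = [f"page-{n}" for n in range(1, 227)]

old_images = load_pages(pdf_pages, page_to=100)  # old default: first 100 pages only
new_images = load_pages(pdf_pages)               # new default: all pages

idx = tag_to_index("101")  # index 100
print(len(old_images), index_in_range(idx, old_images))
print(len(new_images), index_in_range(idx, new_images))
```

With the old bound, index 100 fails the range check because only 100 page images exist; with the new default, all 226 pages are available and the same index passes.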