ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-21 16:40:07 +08:00

Author	SHA1	Message	Date
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer** \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
Magicbook1108	69264b3a70	Feat: Refact pipeline (#13826 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 19:26:45 +08:00
黄圣祺	406339af1f	Fix(paddleocr): load all PDF pages for image cropping instead of first 100 (#13811 ) ## Summary Closes #13803 The `__images__` method in `paddleocr_parser.py` defaulted to `page_to=100`, only loading the first 100 pages for image cropping. However, the PaddleOCR API processes all pages of the PDF. For PDFs with more than 100 pages, page indices beyond 99 were rejected as out of range during crop validation, causing content loss. ## Root Cause ``` __images__(page_to=100) → loads pages 0-99 → page_images has 100 entries PaddleOCR API → processes all 226 pages → tags reference pages 1-226 extract_positions() → converts tag "101" to index 100 crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range" ``` ## Fix Changed `page_to` default from `100` to `10**9`, so all PDF pages are loaded for cropping. Python's list slicing safely handles oversized indices. ## Test plan - [ ] Parse a PDF with >100 pages using PaddleOCR — no more "out of range" warnings - [ ] Parse a PDF with <100 pages — behavior unchanged - [ ] Verify cropped images are generated correctly for all pages 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 09:33:11 +08:00
Magicbook1108	09ff1bc2b0	Fix: paddle ocr coordinate lower > upper (#13630 ) ### What problem does this PR solve? Fix: paddle ocr coordinate lower > upper #13618 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-16 20:15:26 +08:00
Stephen Hu	d0465ba909	refactor: improve paddle ocr logic (#13467 ) ### What problem does this PR solve? improve paddle ocr logic ### Type of change - [x] Refactoring	2026-03-09 14:16:57 +08:00
Magicbook1108	826af383b4	Fix: paddle ocr missing outlines (#13441 ) ### What problem does this PR solve? Fix: paddle ocr missing outlines #13422 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-06 17:19:51 +08:00
Lin Manhui	27a36344d4	Feat: Support PaddleOCR-VL-1.5 interface (#12819 ) ### What problem does this PR solve? This PR adds support to PaddleOCR-VL-1.5 interface to the PaddleOCR PDF Parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-27 09:49:46 +08:00
Lin Manhui	4fe3c24198	feat: PaddleOCR PDF parser supports thumnails and positions (#12565 ) ### What problem does this PR solve? 1. PaddleOCR PDF parser supports thumnails and positions. 2. Add FAQ documentation for PaddleOCR PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-13 09:51:08 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00

9 Commits