mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-05-20 16:26:42 +08:00
## Summary

- **Lazy img_np loading**: `np.array(img)` is now deferred until the first OCR text extraction is actually needed, avoiding unnecessary memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via `PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's boxes are normalized with `_to_global_boxes` to produce globally consistent page numbers and position tags.
- **DLA early init**: Move remote-client initialization before model loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy `TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser containers relying on remote inference.
- **Fix outline regression**: Restore `self.outlines = extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped during refactoring and is required by downstream `remove_toc` and metadata handling in `rag/flow/parser/parser.py`.

## Test plan

- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines` is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
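
## Illustrative sketch: chunked parsing with global page numbers

The chunked `parse_into_bboxes` behaviour described in the summary can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the code from this change: `page_batches`, `to_global_boxes`, `parse_in_batches`, the `parse_chunk` callback, and the `page_number` dict key are all hypothetical names chosen for illustration. Only the `PDF_PARSER_PAGE_BATCH_SIZE` variable and the 50-page threshold come from the description above, and using 50 as the default batch size is itself an assumption. The sketch covers page numbers only; position tags are not modelled.

```python
import os
from typing import Callable, Dict, Iterator, List, Tuple

# Assumption: the batch size defaults to 50 when the variable is unset.
BATCH_SIZE = int(os.environ.get("PDF_PARSER_PAGE_BATCH_SIZE", "50"))

Box = Dict[str, object]  # hypothetical box shape: {"page_number": int, ...}


def page_batches(total_pages: int, batch_size: int = BATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges covering [0, total_pages)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def to_global_boxes(boxes: List[Box], page_offset: int) -> List[Box]:
    """Return copies of the boxes with chunk-local page numbers shifted by page_offset."""
    return [{**b, "page_number": int(b["page_number"]) + page_offset} for b in boxes]


def parse_in_batches(
    total_pages: int,
    parse_chunk: Callable[[int, int], List[Box]],
    batch_size: int = BATCH_SIZE,
) -> List[Box]:
    """Parse a document chunk by chunk and merge boxes under document-wide page numbers.

    parse_chunk(start, end) is assumed to return boxes whose page_number is
    relative to the chunk, i.e. 0 for the first page of that chunk.
    """
    all_boxes: List[Box] = []
    for start, end in page_batches(total_pages, batch_size):
        local_boxes = parse_chunk(start, end)
        all_boxes.extend(to_global_boxes(local_boxes, start))
    return all_boxes
```

A stub `parse_chunk` that returns one box per page with chunk-local numbering is enough to check that the merged result carries consecutive document-wide page numbers, which is what the large-PDF item in the test plan asks for.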