ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-16 07:58:18 +08:00

Zhichang Yu c446c403de perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)

## Summary

- **Lazy img_np loading**: `np.array(img)` is now deferred until the
first OCR text extraction is actually needed, avoiding unnecessary
memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via
`PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's
boxes are normalized with `_to_global_boxes` to produce globally
consistent page numbers and position tags.
- **DLA early init**: Remote-client initialization now runs before model
loading in `LayoutRecognizer.__init__`, so setting `DEEPDOC_URL` (or the
legacy `TENSORRT_DLA_SVR`) skips the unnecessary model download in parser
containers that rely on remote inference.
- **Fix outline regression**: Restore `self.outlines =
extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped
during refactoring and is required by downstream `remove_toc` and
metadata handling in `rag/flow/parser/parser.py`.
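
The chunked processing described above can be sketched roughly as follows. This is a minimal illustration and not the code from this commit: `page_batches`, `to_global_boxes`, `parse_in_batches`, the `parse_chunk` callback, the `page_number` box field, and 0-based page numbering are all hypothetical. Only the `PDF_PARSER_PAGE_BATCH_SIZE` variable name and the default of 50 pages come from the summary.

```python
import os
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Assumed default, mirroring the ">50 pages" threshold in the summary.
DEFAULT_BATCH_SIZE = 50

Box = Dict[str, object]


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges that cover [0, total_pages)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def to_global_boxes(boxes: List[Box], page_offset: int) -> List[Box]:
    """Return copies of the boxes with chunk-local page numbers made global.

    Hypothetical stand-in for the `_to_global_boxes` step; the real box
    layout and position tags may differ.
    """
    return [{**b, "page_number": b["page_number"] + page_offset} for b in boxes]


def parse_in_batches(
    total_pages: int,
    parse_chunk: Callable[[int, int], List[Box]],
    batch_size: Optional[int] = None,
) -> List[Box]:
    """Parse a document chunk by chunk and merge boxes with global page numbers.

    `parse_chunk(start, end)` is an assumed callback returning boxes whose
    `page_number` is relative to `start` (0-based).
    """
    if batch_size is None:
        batch_size = int(
            os.environ.get("PDF_PARSER_PAGE_BATCH_SIZE", DEFAULT_BATCH_SIZE)
        )
    merged: List[Box] = []
    for start, end in page_batches(total_pages, batch_size):
        local_boxes = parse_chunk(start, end)
        merged.extend(to_global_boxes(local_boxes, start))
    return merged
```

With a 120-page document and a batch size of 50, this sketch processes the ranges (0, 50), (50, 100), and (100, 120), and the merged boxes carry page numbers 0 through 119 regardless of which chunk produced them.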

## Test plan

- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines`
is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally
consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local
model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility
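
The last two checks could be exercised against a sketch like the one below. It is an assumption-laden illustration, not the actual `LayoutRecognizer`: `resolve_remote_dla_url`, `LayoutRecognizerSketch`, and the `load_local_model` callback are hypothetical, and giving `DEEPDOC_URL` precedence over `TENSORRT_DLA_SVR` when both are set is a guess, since the summary only says either variable enables the remote client.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_remote_dla_url(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the remote DLA endpoint, or None if no variable is set.

    Assumption: DEEPDOC_URL wins over the legacy TENSORRT_DLA_SVR.
    """
    env = os.environ if env is None else env
    for name in ("DEEPDOC_URL", "TENSORRT_DLA_SVR"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


class LayoutRecognizerSketch:
    """Hypothetical stand-in showing remote-client setup before model loading."""

    def __init__(
        self,
        load_local_model: Callable[[], object],
        env: Optional[Mapping[str, str]] = None,
    ) -> None:
        # Resolve the remote endpoint first so the local model is never
        # loaded (or downloaded) when remote inference is configured.
        self.remote_url = resolve_remote_dla_url(env)
        self.model = None
        if self.remote_url is None:
            self.model = load_local_model()
```

Passing the environment as a plain mapping keeps the check independent of the process environment, which makes both the `DEEPDOC_URL` case and the legacy case easy to assert in isolation.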

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-27 16:52:43 +08:00

__init__.py

fix: use context managers for file handles to prevent resource leaks (#13514)

2026-03-11 16:47:06 +08:00

layout_recognizer.py

perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)

2026-04-27 16:52:43 +08:00

ocr.py

fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969)

2026-04-09 19:10:34 +08:00

operators.py

refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)