From d430446e694dafa992c594678901bda4e7ef44bd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=B0=91=E5=8D=BF?= <121151546+shaoqing404@users.noreply.github.com>
Date: Mon, 2 Mar 2026 14:58:37 +0800
Subject: [PATCH] fix: absolute page index mix-up in DeepDoc PDF parser
 (#12848)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What problem does this PR solve?

Summary: This PR addresses critical indexing issues in `deepdoc/parser/pdf_parser.py` that occur when parsing long PDFs with chunk-based pagination:

- **Normalize rotated-table page numbering**: Rotated-table re-OCR now writes `page_number` in chunk-local 1-based form, eliminating the double addition of the `page_from` offset that caused misalignment between table positions and document boxes.
- **Convert absolute positions to chunk-local coordinates**: When inserting tables/figures extracted via `_extract_table_figure`, positions are now converted from absolute (0-based) to chunk-local indices before distance matching and box insertion. This prevents `IndexError` and out-of-range accesses during paged parsing of long documents.

Root cause: The parser mixed absolute (0-based, document-global) and relative (1-based, chunk-local) page numbering systems. Table/figure positions from layout extraction carried absolute page numbers, but the insertion logic expected chunk-local coordinates aligned with `self.boxes` and `page_cum_height`.

Testing (what I did):

Manual verification: Parse a 200+ page PDF with `from_page > 0` and table rotation enabled, and confirm that:
- Tables and figures appear on the correct pages
- No `IndexError` or position mismatches occur
- Page numbers in the output match the expected chunk-local offsets

Automated testing: not done.

## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max, claude 4.5 opus, and me)

### Context

The current implementation loads entire page ranges into memory (`__images__`, `page_chars`, intermediates), which can cause RAM exhaustion on large documents.
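For reference, the index conversion at the heart of the correctness fix can be illustrated in isolation. This is a minimal sketch, not parser code: `to_chunk_local` is a hypothetical helper, and `num_local_pages` stands in for the `len(self.page_cum_height) - 1` bound checked in the patch.

```python
def to_chunk_local(positions, page_from, num_local_pages):
    """Convert absolute 0-based page indices to chunk-local ones.

    Positions whose page falls outside the current parsing window
    [page_from, page_from + num_local_pages) are dropped, mirroring
    the out-of-range skip in the patch.
    """
    local = []
    for pn, left, right, top, bottom in positions:
        local_pn = pn - page_from
        if 0 <= local_pn < num_local_pages:
            local.append((local_pn, left, right, top, bottom))
    return local
```

With `page_from=10` and a 5-page window, an absolute position on page 12 maps to local page 2, while a position on page 5 (before the window) is discarded rather than producing a negative index into `page_cum_height`.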
While the page numbering fix resolves the correctness issues, scalability remains a concern.

### Proposed Architecture

**Pipeline-Driven Chunking with Explicit Resource Management:**

1. **Authoritative chunk planning**: Accept page-range specifications from the upstream pipeline as the single source of truth. The parser should be a stateless worker that processes assigned chunks without making independent pagination decisions.
2. **Granular memory lifecycle**:
   ```python
   for chunk_spec in chunk_plan:
       # Load only chunk_spec.pages into __images__
       page_images = load_page_range(chunk_spec.start, chunk_spec.end)
       # Process with offset tracking
       results = process_chunk(page_images, offset=chunk_spec.start)
       # Explicit cleanup before next iteration
       del page_images, page_chars, layout_intermediates
       gc.collect()  # Force collection of large objects
   ```
3. **Persistent lightweight state**: Keep model instances (layout detector, OCR engine), document metadata (outlines, PDF structure), and configuration across chunks to avoid reinitialization overhead (~2-5 s per chunk for model loading).
4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50) only when the pipeline doesn't supply a plan. Never exceed pipeline-specified ranges, so that memory bounds stay predictable.
5. **Optional: dynamic budgeting**: Expose a memory-budget parameter that adjusts chunk size based on observed image dimensions and format (e.g., reduce chunk size for high-DPI scanned documents).

### Benefits

- **Predictable memory footprint**: RAM usage bounded by `chunk_size × avg_page_size` rather than by total document size
- **Horizontal scalability**: Enables parallel chunk processing across workers
- **Failure isolation**: Page-extraction errors affect only the current chunk, not the entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4 GB per worker)

### Trade-offs

- **Increased I/O**: Re-opening the PDF for each chunk vs. keeping a file handle open (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful coordination between pipeline and parser
- **Warmup cost**: Model-initialization overhead is amortized across chunks (acceptable for documents >100 pages)

### Implementation Priority

This optimization should be **deferred to a separate PR** after the current correctness fix is merged, because:

1. It requires broader architectural changes across the pipeline
2. The current fix is critical for correctness and can be backported
3. The memory optimization needs comprehensive benchmarking on a representative document corpus

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
---
 deepdoc/parser/pdf_parser.py | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/deepdoc/parser/pdf_parser.py b/deepdoc/parser/pdf_parser.py
index 6681e4a89..49880c3c5 100644
--- a/deepdoc/parser/pdf_parser.py
+++ b/deepdoc/parser/pdf_parser.py
@@ -1594,15 +1594,32 @@ class RAGFlowPdfParser:
             return math.sqrt(dx * dx + dy * dy)  # + (pn2-pn1)*10000
 
         for (img, txt), poss in tbls_or_figs:
+            # Positions coming from _extract_table_figure carry absolute 0-based page
+            # indices (page_from offset). Convert back to chunk-local indices so we
+            # stay consistent with self.boxes/page_cum_height, which are all relative
+            # to the current parsing window.
+            local_poss = []
+            for pn, left, right, top, bott in poss:
+                local_pn = pn - self.page_from
+                if 0 <= local_pn < len(self.page_cum_height) - 1:
+                    local_poss.append((local_pn, left, right, top, bott))
+                else:
+                    logging.debug(f"Skip out-of-range table/figure position pn={pn}, page_from={self.page_from}")
+            if not local_poss:
+                logging.debug("No valid local positions for table/figure; skip insertion.")
+                continue
+
             bboxes = [(i, (b["page_number"], b["x0"], b["x1"], b["top"], b["bottom"])) for i, b in enumerate(self.boxes)]
             dists = [
-                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i) for i, rect in bboxes for pn, left, right, top, bott in poss
+                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i)
+                for i, rect in bboxes
+                for pn, left, right, top, bott in local_poss
             ]
             min_i = np.argmin(dists, axis=0)[0]
             min_i, rect = bboxes[dists[min_i][-1]]
             if isinstance(txt, list):
                 txt = "\n".join(txt)
-            pn, left, right, top, bott = poss[0]
+            pn, left, right, top, bott = local_poss[0]
             if self.boxes[min_i]["bottom"] < top + self.page_cum_height[pn]:
                 min_i += 1
             self.boxes.insert(
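As an aside on the memory strategy discussed above: the proposed chunk-lifecycle loop can be sketched as a self-contained, runnable version. All names here (`ChunkSpec`, `plan_chunks`, `run_chunks`, `load_page_range`, `process_chunk`) are hypothetical and not part of the parser; this only illustrates the adaptive-fallback planner and the explicit-cleanup loop, under the assumption that the caller injects the page loader and the chunk processor.

```python
import gc
from dataclasses import dataclass

@dataclass
class ChunkSpec:
    start: int  # absolute 0-based index of the first page in the chunk
    end: int    # absolute 0-based index one past the last page

def plan_chunks(total_pages, max_pages_per_chunk=50):
    """Fallback planner, used only when the pipeline supplies no plan."""
    return [ChunkSpec(s, min(s + max_pages_per_chunk, total_pages))
            for s in range(0, total_pages, max_pages_per_chunk)]

def run_chunks(total_pages, load_page_range, process_chunk):
    """Process a document chunk by chunk, releasing page images between chunks."""
    results = []
    for spec in plan_chunks(total_pages):
        # Load only this chunk's pages; offset keeps chunk-local indices
        # convertible back to absolute page numbers downstream.
        page_images = load_page_range(spec.start, spec.end)
        results.extend(process_chunk(page_images, offset=spec.start))
        # Drop the largest objects before the next iteration and force
        # collection, so peak RAM tracks one chunk rather than the document.
        del page_images
        gc.collect()
    return results
```

A 120-page document with the default budget yields three chunks, `(0, 50)`, `(50, 100)`, and `(100, 120)`, and at no point are more than 50 page images alive at once.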