fix: absolute page index mix-up in DeepDoc PDF parser (#12848)

### What problem does this PR solve?

Summary:
This PR addresses critical indexing issues in
`deepdoc/parser/pdf_parser.py` that occur when parsing long PDFs with
chunk-based pagination:

- **Normalize rotated-table page numbering**: Rotated-table re-OCR now
  writes `page_number` in chunk-local 1-based form, eliminating the
  double-addition of the `page_from` offset that caused misalignment
  between table positions and document boxes.
- **Convert absolute positions to chunk-local coordinates**: When
  inserting tables/figures extracted via `_extract_table_figure`,
  positions are now converted from absolute (0-based) to chunk-local
  indices before distance matching and box insertion. This prevents
  `IndexError` and out-of-range accesses during paged parsing of long
  documents.

Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.
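To illustrate the mix-up with made-up numbers (not taken from the codebase): if a chunk starts at `page_from=100` and layout extraction reports a table on absolute page 103, re-adding the offset lands far outside the chunk's arrays, while subtracting it yields the intended chunk-local index.

```python
page_from = 100
chunk_len = 50            # this chunk loaded pages 100..149
abs_pn = 103              # absolute 0-based index from layout extraction

buggy_local = abs_pn + page_from   # 203: offset double-added, IndexError territory
fixed_local = abs_pn - page_from   # 3: valid index into per-chunk arrays

assert buggy_local >= chunk_len    # out of range
assert 0 <= fixed_local < chunk_len
```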


Testing (manual):

Parsed a 200+ page PDF with `from_page > 0` and table rotation enabled,
and confirmed that:

- Tables and figures appear on the correct pages
- No `IndexError` or position mismatches occur
- Page numbers in the output match the expected chunk-local offsets

Automated testing: not done.


## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max,
claude 4.5 opus, and me)

### Context

The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.

### Proposed Architecture

**Pipeline-Driven Chunking with Explicit Resource Management:**

1. **Authoritative chunk planning**: Accept page-range specifications
from upstream pipeline as the single source of truth. The parser should
be a stateless worker that processes assigned chunks without making
independent pagination decisions.
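   One way to express such a plan (hypothetical types; the real pipeline interface may differ):

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class ChunkSpec:
       start: int  # absolute 0-based index of the first page in the chunk
       end: int    # exclusive absolute end index

   def plan_chunks(total_pages: int, pages_per_chunk: int) -> list[ChunkSpec]:
       """Upstream pipeline produces the authoritative plan; the parser only consumes it."""
       return [ChunkSpec(s, min(s + pages_per_chunk, total_pages))
               for s in range(0, total_pages, pages_per_chunk)]

   plan = plan_chunks(total_pages=230, pages_per_chunk=50)
   assert [(c.start, c.end) for c in plan] == [(0, 50), (50, 100), (100, 150), (150, 200), (200, 230)]
   ```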

2. **Granular memory lifecycle**:
   ```python
   for chunk_spec in chunk_plan:
       # Load only chunk_spec's pages into __images__
       page_images = load_page_range(chunk_spec.start, chunk_spec.end)

       # Hypothetical helper binding the per-chunk intermediates freed below
       page_chars, layout_intermediates = extract_page_data(page_images)

       # Process with offset tracking
       results = process_chunk(page_images, offset=chunk_spec.start)

       # Explicit cleanup before the next iteration
       del page_images, page_chars, layout_intermediates
       gc.collect()  # Force collection of large objects
   ```

3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).
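   A minimal sketch of amortizing model loads across chunks (the registry and its API are illustrative, not the parser's actual structure):

   ```python
   class ModelRegistry:
       """Cache expensive model loads so repeated chunks skip reinitialization."""

       def __init__(self):
           self._cache = {}
           self.loads = 0  # how many real loads happened

       def get(self, name, loader):
           # Load on first request only; later chunks reuse the instance.
           if name not in self._cache:
               self._cache[name] = loader()
               self.loads += 1
           return self._cache[name]

   registry = ModelRegistry()
   for chunk in range(4):                       # four chunks...
       model = registry.get("layout", object)   # ...but only one load
   assert registry.loads == 1
   ```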

4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50)
only when pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges to maintain predictable memory bounds.
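   A sketch of the fallback rule (function and plan shapes are hypothetical):

   ```python
   def resolve_plan(pipeline_plan, total_pages, max_pages_per_chunk=50):
       """Pipeline plan wins verbatim; self-chunk only when no plan is supplied."""
       if pipeline_plan:
           return list(pipeline_plan)  # never exceed or re-split pipeline ranges
       return [(s, min(s + max_pages_per_chunk, total_pages))
               for s in range(0, total_pages, max_pages_per_chunk)]

   assert resolve_plan([(0, 120)], 120) == [(0, 120)]  # respected as given
   assert resolve_plan(None, 120) == [(0, 50), (50, 100), (100, 120)]
   ```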

5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).
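   One possible budgeting formula (all numbers and the helper itself are illustrative assumptions):

   ```python
   def pages_per_chunk(budget_bytes, page_w, page_h, bytes_per_pixel=3, floor=1, cap=50):
       """Derive chunk size from a memory budget and the observed page raster size."""
       page_bytes = page_w * page_h * bytes_per_pixel  # uncompressed RGB raster
       return max(floor, min(cap, budget_bytes // page_bytes))

   # 2 GiB budget, ~300-DPI A4 pages (2480 x 3508 px): capped at the default 50.
   assert pages_per_chunk(2 * 1024**3, 2480, 3508) == 50
   # Same budget, ~600-DPI scans: chunk shrinks to stay within the budget.
   assert pages_per_chunk(2 * 1024**3, 4960, 7016) == 20
   ```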

### Benefits

- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)
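Rough arithmetic behind the bound (the page size is an illustrative ~300-DPI RGB raster, not a measured figure):

```python
chunk_size = 50            # pages per chunk
avg_page_mb = 26           # ~2480 x 3508 px RGB page, uncompressed
peak_mb = chunk_size * avg_page_mb
assert peak_mb == 1300     # ~1.3 GB peak, independent of document length
```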

### Trade-offs

- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)

### Implementation Priority

This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. The current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on a
representative document corpus


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
Commit d430446e69 (parent 184388879d), authored by 少卿 on 2026-03-02
14:58:37 +08:00, committed via GitHub.

```diff
@@ -1594,15 +1594,32 @@ class RAGFlowPdfParser:
             return math.sqrt(dx * dx + dy * dy)  # + (pn2-pn1)*10000
         for (img, txt), poss in tbls_or_figs:
+            # Positions coming from _extract_table_figure carry absolute 0-based page
+            # indices (page_from offset). Convert back to chunk-local indices so we
+            # stay consistent with self.boxes/page_cum_height, which are all relative
+            # to the current parsing window.
+            local_poss = []
+            for pn, left, right, top, bott in poss:
+                local_pn = pn - self.page_from
+                if 0 <= local_pn < len(self.page_cum_height) - 1:
+                    local_poss.append((local_pn, left, right, top, bott))
+                else:
+                    logging.debug(f"Skip out-of-range table/figure position pn={pn}, page_from={self.page_from}")
+            if not local_poss:
+                logging.debug("No valid local positions for table/figure; skip insertion.")
+                continue
             bboxes = [(i, (b["page_number"], b["x0"], b["x1"], b["top"], b["bottom"])) for i, b in enumerate(self.boxes)]
             dists = [
-                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i) for i, rect in bboxes for pn, left, right, top, bott in poss
+                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i)
+                for i, rect in bboxes
+                for pn, left, right, top, bott in local_poss
             ]
             min_i = np.argmin(dists, axis=0)[0]
             min_i, rect = bboxes[dists[min_i][-1]]
             if isinstance(txt, list):
                 txt = "\n".join(txt)
-            pn, left, right, top, bott = poss[0]
+            pn, left, right, top, bott = local_poss[0]
             if self.boxes[min_i]["bottom"] < top + self.page_cum_height[pn]:
                 min_i += 1
             self.boxes.insert(
```