fix: absolute page index mix-up in DeepDoc PDF parser (#12848)

### What problem does this PR solve?

Summary:
This PR addresses critical indexing issues in
`deepdoc/parser/pdf_parser.py` that occur when parsing long PDFs with
chunk-based pagination:

- **Normalize rotated-table page numbering**: Rotated-table re-OCR now
  writes `page_number` in chunk-local 1-based form, eliminating the
  double-addition of the `page_from` offset that caused misalignment
  between table positions and document boxes.
- **Convert absolute positions to chunk-local coordinates**: When
  inserting tables/figures extracted via `_extract_table_figure`,
  positions are now converted from absolute (0-based) to chunk-local
  indices before distance matching and box insertion. This prevents
  `IndexError` and out-of-range accesses during paged parsing of long
  documents.

Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.
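To illustrate the mix-up with made-up numbers (not taken from the codebase): if a chunk starts at `page_from=100` and layout extraction reports a table on absolute page 103, re-adding the offset lands far outside the chunk's arrays, while subtracting it yields the intended chunk-local index.

```python
page_from = 100
chunk_len = 50            # this chunk loaded pages 100..149
abs_pn = 103              # absolute 0-based index from layout extraction

buggy_local = abs_pn + page_from   # 203: offset double-added, IndexError territory
fixed_local = abs_pn - page_from   # 3: valid index into per-chunk arrays

assert buggy_local >= chunk_len    # out of range
assert 0 <= fixed_local < chunk_len
```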


Testing (manual):

Parsed a 200+ page PDF with `from_page > 0` and table rotation enabled,
and confirmed that:

- Tables and figures appear on the correct pages
- No `IndexError` or position mismatches occur
- Page numbers in the output match the expected chunk-local offsets

Automated testing: not done.


## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max,
claude 4.5 opus, and me)

### Context

The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.

### Proposed Architecture

**Pipeline-Driven Chunking with Explicit Resource Management:**

1. **Authoritative chunk planning**: Accept page-range specifications
from upstream pipeline as the single source of truth. The parser should
be a stateless worker that processes assigned chunks without making
independent pagination decisions.
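   One way to express such a plan (hypothetical types; the real pipeline interface may differ):

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class ChunkSpec:
       start: int  # absolute 0-based index of the first page in the chunk
       end: int    # exclusive absolute end index

   def plan_chunks(total_pages: int, pages_per_chunk: int) -> list[ChunkSpec]:
       """Upstream pipeline produces the authoritative plan; the parser only consumes it."""
       return [ChunkSpec(s, min(s + pages_per_chunk, total_pages))
               for s in range(0, total_pages, pages_per_chunk)]

   plan = plan_chunks(total_pages=230, pages_per_chunk=50)
   assert [(c.start, c.end) for c in plan] == [(0, 50), (50, 100), (100, 150), (150, 200), (200, 230)]
   ```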

2. **Granular memory lifecycle**:
   ```python
   for chunk_spec in chunk_plan:
       # Load only chunk_spec's pages into __images__
       page_images = load_page_range(chunk_spec.start, chunk_spec.end)

       # Hypothetical helper binding the per-chunk intermediates freed below
       page_chars, layout_intermediates = extract_page_data(page_images)

       # Process with offset tracking
       results = process_chunk(page_images, offset=chunk_spec.start)

       # Explicit cleanup before the next iteration
       del page_images, page_chars, layout_intermediates
       gc.collect()  # Force collection of large objects
   ```

3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).
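   A minimal sketch of amortizing model loads across chunks (the registry and its API are illustrative, not the parser's actual structure):

   ```python
   class ModelRegistry:
       """Cache expensive model loads so repeated chunks skip reinitialization."""

       def __init__(self):
           self._cache = {}
           self.loads = 0  # how many real loads happened

       def get(self, name, loader):
           # Load on first request only; later chunks reuse the instance.
           if name not in self._cache:
               self._cache[name] = loader()
               self.loads += 1
           return self._cache[name]

   registry = ModelRegistry()
   for chunk in range(4):                       # four chunks...
       model = registry.get("layout", object)   # ...but only one load
   assert registry.loads == 1
   ```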

4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50)
only when pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges to maintain predictable memory bounds.
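   A sketch of the fallback rule (function and plan shapes are hypothetical):

   ```python
   def resolve_plan(pipeline_plan, total_pages, max_pages_per_chunk=50):
       """Pipeline plan wins verbatim; self-chunk only when no plan is supplied."""
       if pipeline_plan:
           return list(pipeline_plan)  # never exceed or re-split pipeline ranges
       return [(s, min(s + max_pages_per_chunk, total_pages))
               for s in range(0, total_pages, max_pages_per_chunk)]

   assert resolve_plan([(0, 120)], 120) == [(0, 120)]  # respected as given
   assert resolve_plan(None, 120) == [(0, 50), (50, 100), (100, 120)]
   ```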

5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).
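   One possible budgeting formula (all numbers and the helper itself are illustrative assumptions):

   ```python
   def pages_per_chunk(budget_bytes, page_w, page_h, bytes_per_pixel=3, floor=1, cap=50):
       """Derive chunk size from a memory budget and the observed page raster size."""
       page_bytes = page_w * page_h * bytes_per_pixel  # uncompressed RGB raster
       return max(floor, min(cap, budget_bytes // page_bytes))

   # 2 GiB budget, ~300-DPI A4 pages (2480 x 3508 px): capped at the default 50.
   assert pages_per_chunk(2 * 1024**3, 2480, 3508) == 50
   # Same budget, ~600-DPI scans: chunk shrinks to stay within the budget.
   assert pages_per_chunk(2 * 1024**3, 4960, 7016) == 20
   ```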

### Benefits

- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)
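Rough arithmetic behind the bound (the page size is an illustrative ~300-DPI RGB raster, not a measured figure):

```python
chunk_size = 50            # pages per chunk
avg_page_mb = 26           # ~2480 x 3508 px RGB page, uncompressed
peak_mb = chunk_size * avg_page_mb
assert peak_mb == 1300     # ~1.3 GB peak, independent of document length
```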

### Trade-offs

- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)

### Implementation Priority

This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. The current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on a
representative document corpus


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
Commit d430446e69 (parent 184388879d), authored by 少卿 on 2026-03-02
14:58:37 +08:00, committed via GitHub.

```diff
@@ -1594,15 +1594,32 @@ class RAGFlowPdfParser:
             return math.sqrt(dx * dx + dy * dy)  # + (pn2-pn1)*10000
         for (img, txt), poss in tbls_or_figs:
+            # Positions coming from _extract_table_figure carry absolute 0-based page
+            # indices (page_from offset). Convert back to chunk-local indices so we
+            # stay consistent with self.boxes/page_cum_height, which are all relative
+            # to the current parsing window.
+            local_poss = []
+            for pn, left, right, top, bott in poss:
+                local_pn = pn - self.page_from
+                if 0 <= local_pn < len(self.page_cum_height) - 1:
+                    local_poss.append((local_pn, left, right, top, bott))
+                else:
+                    logging.debug(f"Skip out-of-range table/figure position pn={pn}, page_from={self.page_from}")
+            if not local_poss:
+                logging.debug("No valid local positions for table/figure; skip insertion.")
+                continue
             bboxes = [(i, (b["page_number"], b["x0"], b["x1"], b["top"], b["bottom"])) for i, b in enumerate(self.boxes)]
             dists = [
-                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i) for i, rect in bboxes for pn, left, right, top, bott in poss
+                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i)
+                for i, rect in bboxes
+                for pn, left, right, top, bott in local_poss
             ]
             min_i = np.argmin(dists, axis=0)[0]
             min_i, rect = bboxes[dists[min_i][-1]]
             if isinstance(txt, list):
                 txt = "\n".join(txt)
-            pn, left, right, top, bott = poss[0]
+            pn, left, right, top, bott = local_poss[0]
             if self.boxes[min_i]["bottom"] < top + self.page_cum_height[pn]:
                 min_i += 1
             self.boxes.insert(
```