少卿 d430446e69 fix:absolute page index mix-up in DeepDoc PDF parser (#12848)
### What problem does this PR solve?

Summary:
This PR addresses critical indexing issues in
deepdoc/parser/pdf_parser.py that occur when parsing long PDFs with
chunk-based pagination:

- **Normalize rotated-table page numbering**: Rotated-table re-OCR now writes `page_number` in chunk-local, 1-based form, eliminating the double addition of the `page_from` offset that caused misalignment between table positions and document boxes.
- **Convert absolute positions to chunk-local coordinates**: When inserting tables/figures extracted via `_extract_table_figure`, positions are now converted from absolute (0-based) to chunk-local indices before distance matching and box insertion. This prevents `IndexError` and out-of-range accesses during paged parsing of long documents.

Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.
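
The mapping can be sketched as follows (a minimal illustration, not the actual parser code; `page_from` is the 0-based index of the chunk's first page):

```python
def to_chunk_local(abs_page: int, page_from: int) -> int:
    """Convert an absolute 0-based page index to the chunk-local
    1-based page number expected by self.boxes and page_cum_height."""
    return abs_page - page_from + 1

# A table found on absolute page 5 of a chunk starting at page 3
# lands on chunk-local page 3 (the third page of the chunk).
```

Applying the offset once here, instead of in both extraction and insertion, is what removes the double addition described above.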


Testing (manual):

Manual verification: Parse a 200+ page PDF with `from_page > 0` and table rotation enabled, and confirm that:

- Tables and figures appear on the correct pages
- No `IndexError` or position mismatches occur
- Page numbers in the output match the expected chunk-local offsets

Automated testing: not done.


## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max, claude 4.5 opus, and me)

### Context

The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.

### Proposed Architecture

**Pipeline-Driven Chunking with Explicit Resource Management:**

1. **Authoritative chunk planning**: Accept page-range specifications
from the upstream pipeline as the single source of truth. The parser
should be a stateless worker that processes assigned chunks without
making independent pagination decisions.
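
   A minimal sketch of such a chunk specification (names are illustrative, not part of the existing codebase):

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class ChunkSpec:
       """One unit of work handed to the parser by the pipeline."""
       start: int  # first page, 0-based, inclusive
       end: int    # last page, exclusive

       @property
       def num_pages(self) -> int:
           return self.end - self.start
   ```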

2. **Granular memory lifecycle**:
   ```python
   import gc

   for chunk_spec in chunk_plan:
       # Load only this chunk's pages into __images__
       page_images = load_page_range(chunk_spec.start, chunk_spec.end)

       # Process with offset tracking so positions stay chunk-local
       results = process_chunk(page_images, offset=chunk_spec.start)

       # Explicit cleanup before the next iteration; only names bound
       # above can be deleted here (page_chars and layout intermediates
       # are released inside process_chunk)
       del page_images
       gc.collect()  # force collection of large image buffers
   ```

3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).
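
   One way to sketch this (hypothetical class and loader names; the real parser keeps such state on `self`):

   ```python
   class ChunkWorker:
       """Keeps heavy, read-only resources alive across chunks so only
       per-chunk page data is loaded and released each iteration."""

       def __init__(self, load_layout_model, load_ocr_engine):
           self._load_layout_model = load_layout_model
           self._load_ocr_engine = load_ocr_engine
           self._layout_model = None
           self._ocr_engine = None

       @property
       def layout_model(self):
           # Loaded lazily once, then reused for every chunk,
           # avoiding the ~2-5s reinitialization per chunk
           if self._layout_model is None:
               self._layout_model = self._load_layout_model()
           return self._layout_model
   ```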

4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50)
only when the pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges, to maintain predictable memory bounds.
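
   The fallback planner could look like this (illustrative helper, not existing code):

   ```python
   def plan_chunks(total_pages: int, max_pages_per_chunk: int = 50):
       """Split a document into (start, end) page ranges; used only
       when the pipeline has not supplied its own chunk plan."""
       return [
           (start, min(start + max_pages_per_chunk, total_pages))
           for start in range(0, total_pages, max_pages_per_chunk)
       ]
   ```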

5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).
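
   A rough sizing rule under an assumed uncompressed RGB raster (3 bytes per pixel; all names illustrative):

   ```python
   def pages_per_budget(budget_bytes: int, page_w: int, page_h: int,
                        bytes_per_pixel: int = 3) -> int:
       """Estimate how many decoded pages fit in a memory budget,
       so high-DPI scans automatically get smaller chunks."""
       page_bytes = page_w * page_h * bytes_per_pixel
       return max(1, budget_bytes // page_bytes)
   ```

   For an A4 page scanned at 300 DPI (about 2480×3508 px), a 2 GiB budget fits roughly 80 pages per chunk.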

### Benefits

- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)

### Trade-offs

- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)

### Implementation Priority

This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. Current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on a
representative document corpus


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)