mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-03-05 15:47:14 +08:00
fix: absolute page index mix-up in DeepDoc PDF parser (#12848)
### What problem does this PR solve?
Summary:
This PR addresses critical indexing issues in
`deepdoc/parser/pdf_parser.py` that occur when parsing long PDFs with
chunk-based pagination:
- **Normalize rotated-table page numbering**: Rotated-table re-OCR now
writes `page_number` in chunk-local, 1-based form, eliminating the
double-addition of the `page_from` offset that misaligned table positions
with respect to document boxes.
- **Convert absolute positions to chunk-local coordinates**: When
tables/figures extracted via `_extract_table_figure` are inserted, their
positions are now converted from absolute (0-based) to chunk-local
indices before distance matching and box insertion. This prevents
`IndexError` and out-of-range accesses during paged parsing of long
documents.
Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.
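To make the mismatch concrete, here is a minimal sketch of the conversion the fix applies (the helper names are illustrative, not from the codebase):

```python
def to_chunk_local(abs_pn: int, page_from: int) -> int:
    """Map an absolute 0-based page index to a chunk-local 0-based index."""
    return abs_pn - page_from

def is_in_chunk(local_pn: int, num_pages_in_chunk: int) -> bool:
    """Guard against out-of-range indices before touching chunk-local arrays."""
    return 0 <= local_pn < num_pages_in_chunk

# A table detected on absolute page 37 of a chunk starting at page 30
# lands on chunk-local index 7. Indexing per-chunk arrays such as
# page_cum_height with the absolute value 37 would overflow a 20-page
# chunk and raise IndexError.
```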
Testing (done by the author):
Manual verification: Parse a 200+ page PDF with `from_page > 0` and table
rotation enabled, then confirm that:
- Tables and figures appear on the correct pages
- No `IndexError` or position mismatches occur
- Page numbers in the output match the expected chunk-local offsets

Automated testing: not done.
## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max, claude 4.5 opus, and me)
### Context
The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.
### Proposed Architecture
**Pipeline-Driven Chunking with Explicit Resource Management:**
1. **Authoritative chunk planning**: Accept page-range specifications
from upstream pipeline as the single source of truth. The parser should
be a stateless worker that processes assigned chunks without making
independent pagination decisions.
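One possible shape for such a pipeline-supplied specification (the `ChunkSpec` name and fields are assumptions, not existing code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkSpec:
    """Hypothetical page-range assignment handed down by the pipeline."""
    start: int  # absolute 0-based first page (inclusive)
    end: int    # absolute 0-based first page NOT in the chunk (exclusive)

    @property
    def num_pages(self) -> int:
        return self.end - self.start

# The pipeline, not the parser, owns the plan; the parser just consumes it.
chunk_plan = [ChunkSpec(0, 50), ChunkSpec(50, 100), ChunkSpec(100, 123)]
```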
2. **Granular memory lifecycle**:
```python
for chunk_spec in chunk_plan:
    # Load only chunk_spec's page range into __images__
    page_images = load_page_range(chunk_spec.start, chunk_spec.end)
    # Process with offset tracking so positions stay chunk-local
    results = process_chunk(page_images, offset=chunk_spec.start)
    # Explicitly drop per-chunk intermediates (images, page_chars,
    # layout intermediates) before the next iteration
    del page_images
    gc.collect()  # force collection of large image buffers
```
3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).
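A sketch of how that split between persistent and transient state could look (all names hypothetical):

```python
class ChunkWorker:
    """Illustrative: expensive resources are constructed once and reused;
    page buffers are scoped to a single call and never stored on self."""

    def __init__(self, layout_model, ocr_engine, doc_meta):
        self.layout_model = layout_model  # persistent: ~2-5s to load
        self.ocr_engine = ocr_engine      # persistent
        self.doc_meta = doc_meta          # outlines, PDF structure, config
        self.chunks_processed = 0

    def process(self, page_images, offset):
        # page_images lives only for the duration of this call, so the
        # worker's resident memory stays bounded by one chunk.
        self.chunks_processed += 1
        return [(offset + i, img) for i, img in enumerate(page_images)]
```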
4. **Adaptive fallback**: Fall back to `max_pages_per_chunk` (default: 50)
only when the pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges, keeping memory bounds predictable.
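The fallback planner could be as simple as the following (names and the tuple-based plan are illustrative):

```python
def fallback_chunk_plan(total_pages: int, max_pages_per_chunk: int = 50):
    """Hypothetical fallback used only when no pipeline plan exists:
    split the document into fixed-size (start, end) page ranges."""
    return [
        (start, min(start + max_pages_per_chunk, total_pages))
        for start in range(0, total_pages, max_pages_per_chunk)
    ]
```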
5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).
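A sketch of how such a budget could translate into a chunk size (all names and numbers are assumptions, not measured values):

```python
def budgeted_chunk_size(page_bytes_estimate: int,
                        memory_budget_bytes: int,
                        hard_cap: int = 50) -> int:
    """Hypothetical sizing rule: fit as many pages as the budget allows,
    based on an observed per-page image size, never exceeding the cap."""
    pages = max(1, memory_budget_bytes // max(1, page_bytes_estimate))
    return min(pages, hard_cap)

# Example: ~25 MB per rendered page under a 2 GB budget allows 80 pages,
# capped at 50; a high-DPI scan at ~100 MB/page shrinks the chunk to 20.
```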
### Benefits
- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)
### Trade-offs
- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)
### Implementation Priority
This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. Current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on
representative document corpus
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
```diff
@@ -1594,15 +1594,32 @@ class RAGFlowPdfParser:
             return math.sqrt(dx * dx + dy * dy)  # + (pn2-pn1)*10000

         for (img, txt), poss in tbls_or_figs:
+            # Positions coming from _extract_table_figure carry absolute 0-based page
+            # indices (page_from offset). Convert back to chunk-local indices so we
+            # stay consistent with self.boxes/page_cum_height, which are all relative
+            # to the current parsing window.
+            local_poss = []
+            for pn, left, right, top, bott in poss:
+                local_pn = pn - self.page_from
+                if 0 <= local_pn < len(self.page_cum_height) - 1:
+                    local_poss.append((local_pn, left, right, top, bott))
+                else:
+                    logging.debug(f"Skip out-of-range table/figure position pn={pn}, page_from={self.page_from}")
+            if not local_poss:
+                logging.debug("No valid local positions for table/figure; skip insertion.")
+                continue
+
             bboxes = [(i, (b["page_number"], b["x0"], b["x1"], b["top"], b["bottom"])) for i, b in enumerate(self.boxes)]
             dists = [
-                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i) for i, rect in bboxes for pn, left, right, top, bott in poss
+                (min_rectangle_distance((pn, left, right, top + self.page_cum_height[pn], bott + self.page_cum_height[pn]), rect), i)
+                for i, rect in bboxes
+                for pn, left, right, top, bott in local_poss
             ]
             min_i = np.argmin(dists, axis=0)[0]
             min_i, rect = bboxes[dists[min_i][-1]]
             if isinstance(txt, list):
                 txt = "\n".join(txt)
-            pn, left, right, top, bott = poss[0]
+            pn, left, right, top, bott = local_poss[0]
             if self.boxes[min_i]["bottom"] < top + self.page_cum_height[pn]:
                 min_i += 1
             self.boxes.insert(
```