mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-27 11:15:59 +08:00

少卿 d430446e69 fix: absolute page index mix-up in DeepDoc PDF parser (#12848)

### What problem does this PR solve?

Summary:
This PR addresses critical indexing issues in
deepdoc/parser/pdf_parser.py that occur when parsing long PDFs with
chunk-based pagination:

- **Normalize rotated table page numbering**: Rotated-table re-OCR now writes
page_number in chunk-local 1-based form, eliminating the double addition of
the page_from offset that caused misalignment between table positions and
document boxes.
- **Convert absolute positions to chunk-local coordinates**: When inserting
tables/figures extracted via _extract_table_figure, positions are now
converted from absolute (0-based) to chunk-local indices before distance
matching and box insertion. This prevents IndexError and out-of-range
accesses during paged parsing of long documents.

Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.
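
The relationship between the two numbering systems can be sketched as a single conversion. This is an illustration only, not the code in deepdoc/parser/pdf_parser.py: the helper name `to_chunk_local` and its `chunk_len` parameter are hypothetical, and it assumes page_from is the absolute 0-based index of the chunk's first page.

```python
def to_chunk_local(abs_page: int, page_from: int, chunk_len: int) -> int:
    """Map an absolute 0-based page index to a chunk-local 1-based page number.

    Hypothetical helper: `page_from` is assumed to be the absolute 0-based
    index of the first page in the chunk, and `chunk_len` the number of
    pages loaded for the chunk.
    """
    local = abs_page - page_from + 1
    if not 1 <= local <= chunk_len:
        raise ValueError(
            f"page {abs_page} lies outside chunk [{page_from}, {page_from + chunk_len})"
        )
    return local


# A chunk that starts at absolute page 100 and holds 50 pages.
first = to_chunk_local(100, page_from=100, chunk_len=50)
last = to_chunk_local(149, page_from=100, chunk_len=50)
```

The offset is applied exactly once. Adding page_from to a value that is already absolute is the kind of double addition described above, and the range check turns a silent out-of-range access into an explicit error.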


Testing (performed by the author):

Manual verification: Parse a 200+ page PDF with from_page > 0 and table
rotation enabled. Confirm that:

- Tables and figures appear on the correct pages
- No IndexError or position mismatches occur
- Page numbers in the output match the expected chunk-local offsets


Automated testing: not done.


## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max, claude 4.5 opus, and me)

### Context

The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.

### Proposed Architecture

**Pipeline-Driven Chunking with Explicit Resource Management:**

1. **Authoritative chunk planning**: Accept page-range specifications
from upstream pipeline as the single source of truth. The parser should
be a stateless worker that processes assigned chunks without making
independent pagination decisions.

2. **Granular memory lifecycle**:
   ```python
   import gc

   for chunk_spec in chunk_plan:
       # Load only chunk_spec.pages into __images__
       page_images = load_page_range(chunk_spec.start, chunk_spec.end)

       # Process with offset tracking; page_chars and layout intermediates
       # are assumed to stay local to process_chunk and die when it returns
       results = process_chunk(page_images, offset=chunk_spec.start)

       # Explicit cleanup before next iteration
       del page_images
       gc.collect()  # Force collection of large objects
   ```

3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).

4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50)
only when pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges to maintain predictable memory bounds.

5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).
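
Items 1 and 4 can be sketched together as a small planning function. Everything here is hypothetical and for illustration: the `ChunkSpec` type, the `plan_chunks` name, and the convention that `end` is exclusive are assumptions, not part of the current parser or pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChunkSpec:
    start: int  # absolute 0-based index of the first page (inclusive)
    end: int    # absolute 0-based index one past the last page (exclusive)


def plan_chunks(
    total_pages: int,
    pipeline_plan: Optional[List[ChunkSpec]] = None,
    max_pages_per_chunk: int = 50,
) -> List[ChunkSpec]:
    """Return the pipeline's plan unchanged, or a fixed-size fallback plan."""
    if pipeline_plan is not None:
        # The upstream plan is authoritative: never split or merge its ranges.
        return list(pipeline_plan)
    if max_pages_per_chunk < 1:
        raise ValueError("max_pages_per_chunk must be at least 1")
    return [
        ChunkSpec(start, min(start + max_pages_per_chunk, total_pages))
        for start in range(0, total_pages, max_pages_per_chunk)
    ]


# Without a pipeline plan, a 120-page document falls back to 50-page chunks.
fallback = plan_chunks(total_pages=120)
```

Returning the upstream plan untouched keeps the parser from making independent pagination decisions; the fixed-size split is used only when no plan is supplied.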

### Benefits

- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)
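
As a rough illustration of the `chunk_size × avg_page_size` bound, the sketch below estimates how many rendered pages fit in a given budget. The function name `pages_per_chunk` is hypothetical, and the estimate counts only uncompressed 8-bit RGB page bitmaps; it ignores `page_chars`, layout intermediates, and model memory, so real limits would be lower.

```python
def pages_per_chunk(
    budget_bytes: int, width_px: int, height_px: int, channels: int = 3
) -> int:
    """Estimate how many uncompressed page images fit in a memory budget."""
    bytes_per_page = width_px * height_px * channels  # one byte per channel assumed
    return max(1, budget_bytes // bytes_per_page)


budget = 2 * 1024**3  # 2 GiB reserved for page images

# An A4 page rendered at 300 DPI is about 2480 x 3508 pixels.
a4_300dpi = pages_per_chunk(budget, 2480, 3508)
# The same page at 150 DPI has a quarter of the pixels.
a4_150dpi = pages_per_chunk(budget, 1240, 1754)
```

Under these assumptions the 300 DPI case fits 82 pages and the 150 DPI case 329, which shows why a fixed page count per chunk gives very different memory footprints for high-DPI scans.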

### Trade-offs

- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)

### Implementation Priority

This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. The current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on a
representative document corpus


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

2026-03-02 14:58:37 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| parser | fix: absolute page index mix-up in DeepDoc PDF parser (#12848) | 2026-03-02 14:58:37 +08:00 |
| vision | refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) | 2026-02-28 11:22:31 +08:00 |
| `__init__.py` | Update comments (#4569) | 2025-01-21 20:52:28 +08:00 |
| README_zh.md | [Feat]Automatic table orientation detection and correction (#12719) | 2026-01-22 12:47:55 +08:00 |
| README.md | [Feat]Automatic table orientation detection and correction (#12719) | 2026-01-22 12:47:55 +08:00 |

README.md

DeepDoc

1. Introduction
2. Vision
3. Parser

1. Introduction

Documents come from many domains, in many formats, and with diverse retrieval requirements, so analyzing them accurately is a very challenging task. DeepDoc was created for that purpose. DeepDoc has two parts so far: vision and parser. If you're interested in our OCR, layout recognition, and TSR results, you can run the following test programs.

```
python deepdoc/vision/t_ocr.py -h
usage: t_ocr.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR]

options:
  -h, --help            show this help message and exit
  --inputs INPUTS       Directory where to store images or PDFs, or a file path to a single image or PDF
  --output_dir OUTPUT_DIR
                        Directory where to store the output images. Default: './ocr_outputs'
```

```
python deepdoc/vision/t_recognizer.py -h
usage: t_recognizer.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR] [--threshold THRESHOLD] [--mode {layout,tsr}]

options:
  -h, --help            show this help message and exit
  --inputs INPUTS       Directory where to store images or PDFs, or a file path to a single image or PDF
  --output_dir OUTPUT_DIR
                        Directory where to store the output images. Default: './layouts_outputs'
  --threshold THRESHOLD
                        A threshold to filter out detections. Default: 0.5
  --mode {layout,tsr}   Task mode: layout recognition or table structure recognition
```

Our models are hosted on HuggingFace. If you have trouble downloading them, setting the following mirror endpoint might help:

```
export HF_ENDPOINT=https://hf-mirror.com
```

2. Vision

We use visual information to solve problems the way humans do.

OCR. Since many documents are presented as images, or can at least be converted to images, OCR is an essential, fundamental, and nearly universal solution for text extraction.
```
    python deepdoc/vision/t_ocr.py --inputs=path_to_images_or_pdfs --output_dir=path_to_store_result
```
The input can be a directory of images or PDFs, or a single image or PDF. The folder 'path_to_store_result' will contain images showing the positions of the results and txt files containing the OCR text.
Layout recognition. Documents from different domains may have different layouts: newspapers, magazines, books, and résumés are all distinct in terms of layout. Only with an accurate layout analysis can a machine decide whether text parts are consecutive, whether a part needs Table Structure Recognition (TSR), or whether a part is a figure described by a caption. We have 10 basic layout components, which cover most cases:
- Text
- Title
- Figure
- Figure caption
- Table
- Table caption
- Header
- Footer
- Reference
- Equation
Try the following command to see the layout detection results.
```
   python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=layout --output_dir=path_to_store_result
```
The input can be a directory of images or PDFs, or a single image or PDF. The folder 'path_to_store_result' will contain images showing the detection results.
Table Structure Recognition (TSR). A data table is a frequently used structure for presenting data, including numbers and text. The structure of a table can be very complex, with hierarchical headers, spanning cells, and projected row headers. Along with TSR, we also reassemble the content into sentences that an LLM can readily comprehend. We have five labels for the TSR task:
- Column
- Row
- Column header
- Projected row header
- Spanning cell
Try the following command to see the table structure recognition results.
```
   python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=tsr --output_dir=path_to_store_result
```
The input can be a directory of images or PDFs, or a single image or PDF. The folder 'path_to_store_result' will contain both images and HTML pages showing the detection results.
Table Auto-Rotation. For scanned PDFs where tables may be incorrectly oriented (rotated 90°, 180°, or 270°), the PDF parser automatically detects the best rotation angle using OCR confidence scores before performing table structure recognition. This significantly improves OCR accuracy and table structure detection for rotated tables.

The feature evaluates 4 rotation angles (0°, 90°, 180°, 270°) and selects the one with highest OCR confidence. After determining the best orientation, it re-performs OCR on the correctly rotated table image.
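
The selection step described above can be sketched generically. This is not DeepDoc's implementation: `pick_best_rotation` is a hypothetical name, and the `rotate` and `ocr_confidence` callables stand in for whatever image rotation and OCR scoring the parser actually uses.

```python
from typing import Any, Callable, Sequence, Tuple


def pick_best_rotation(
    image: Any,
    rotate: Callable[[Any, int], Any],
    ocr_confidence: Callable[[Any], float],
    angles: Sequence[int] = (0, 90, 180, 270),
) -> Tuple[int, Any]:
    """Return (angle, rotated_image) for the angle with the highest score."""
    best_angle, best_image, best_score = angles[0], None, float("-inf")
    for angle in angles:
        candidate = rotate(image, angle)
        score = ocr_confidence(candidate)
        if score > best_score:  # strict comparison: ties keep the earlier angle
            best_angle, best_image, best_score = angle, candidate, score
    return best_angle, best_image


# Toy stand-ins: the "image" is a label and the scores are made up.
fake_scores = {"table@0": 0.31, "table@90": 0.12, "table@180": 0.94, "table@270": 0.15}
angle, upright = pick_best_rotation(
    "table",
    rotate=lambda img, a: f"{img}@{a}",
    ocr_confidence=fake_scores.__getitem__,
)
```

With the made-up scores, the 180° candidate wins; the returned image is the one that would then be passed to OCR again and on to table structure recognition.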

This feature is enabled by default. You can control it via an environment variable:
```
# Disable table auto-rotation
export TABLE_AUTO_ROTATE=false

# Enable table auto-rotation (default)
export TABLE_AUTO_ROTATE=true
```
Or via API parameter:
```
from deepdoc.parser import PdfParser

pdf_path = "path/to/document.pdf"  # placeholder: replace with your PDF file

parser = PdfParser()
# Disable auto-rotation for this call
boxes, tables = parser(pdf_path, auto_rotate_tables=False)
```

3. Parser

Four document formats, PDF, DOCX, EXCEL, and PPT, each have a corresponding parser. The most complex is the PDF parser, because of PDF's flexibility. The output of the PDF parser includes:

- Text chunks with their positions in the PDF (page number and rectangular position).
- Tables, each with a cropped image from the PDF and contents already translated into natural-language sentences.
- Figures, with their captions and the text inside the figures.

Résumé

The résumé is a very complicated kind of document. A résumé composed of unstructured text in various layouts can be resolved into structured data with nearly a hundred fields. We haven't opened the parser yet; what we have opened is the processing method that runs after the parsing procedure.