ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-01 13:27:54 +08:00

黄圣祺 406339af1f Fix(paddleocr): load all PDF pages for image cropping instead of first 100 (#13811)

## Summary

Closes #13803

The `__images__` method in `paddleocr_parser.py` defaulted to
`page_to=100`, so it loaded only the first 100 pages for image cropping.
However, the PaddleOCR API processes **all** pages of the PDF. For PDFs
with more than 100 pages, page indices beyond 99 were rejected as out of
range during crop validation, causing content loss.

## Root Cause

```
__images__(page_to=100) → loads pages 0-99 → page_images has 100 entries
PaddleOCR API → processes all 226 pages → tags reference pages 1-226
extract_positions() → converts tag "101" to index 100
crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range"
```
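
The trace above can be reproduced with a minimal sketch. The helpers `load_pages` and `valid_page_indices` are hypothetical stand-ins, not the actual code in `paddleocr_parser.py`, and page images are modelled as plain strings:

```python
def load_pages(total_pages, page_from=0, page_to=100):
    """Hypothetical stand-in for page loading: one placeholder per loaded page."""
    all_pages = [f"page-{i + 1}" for i in range(total_pages)]
    return all_pages[page_from:page_to]


def valid_page_indices(tag_pages, page_images):
    """Convert 1-based page tags to 0-based indices and keep those in range."""
    indices = [int(tag) - 1 for tag in tag_pages]
    return [i for i in indices if 0 <= i < len(page_images)]


# A 226-page PDF loaded with the old default: only 100 placeholders exist.
page_images = load_pages(total_pages=226)

# Tag "100" maps to index 99 (accepted); tag "101" maps to index 100 (rejected).
in_range = valid_page_indices(["100", "101"], page_images)
```

Under these assumptions, every tag above "100" is dropped by the range check, which matches the content loss described in the summary.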

## Fix

Changed the `page_to` default from `100` to `10**9`, so all PDF pages are
loaded for cropping. Python's list slicing clamps an out-of-range stop
index to the length of the list, so the oversized default is safe.
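
The slicing behavior the fix relies on can be checked in isolation. This is an illustrative sketch with a plain list standing in for the loaded pages, not the parser's actual code:

```python
# A plain list stands in for the pages of a 226-page PDF.
pages = list(range(226))

old_selection = pages[0:100]      # old default: stops after the first 100 pages
new_selection = pages[0:10**9]    # new default: stop is clamped to len(pages)

# Slicing never raises for an oversized stop, unlike direct indexing.
try:
    pages[10**9]
    indexing_raised = False
except IndexError:
    indexing_raised = True
```

Here `old_selection` holds 100 entries and `new_selection` holds all 226, while indexing `pages[10**9]` directly raises `IndexError`. Only the slice form tolerates the large value.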

## Test plan

- [ ] Parse a PDF with >100 pages using PaddleOCR — no more "out of
range" warnings
- [ ] Parse a PDF with <100 pages — behavior unchanged
- [ ] Verify cropped images are generated correctly for all pages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-27 09:33:11 +08:00

resume

Add license and Fix IDE warnings (#11985)

2025-12-17 17:04:44 +08:00

__init__.py

Feat: support epub parsing (#13650)

2026-03-17 20:14:06 +08:00

docling_parser.py

feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527)

2026-03-12 17:09:03 +08:00

docx_parser.py

refactor: let excel use lazy image loader (#13558)