Commit Graph

20 Commits

Author SHA1 Message Date
fa71f8d0c7 refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.

This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.

**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:

* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.

**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.

**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.

**Validation & Testing**
I've tested this to ensure no regressions:

* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.

**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
b226e06e2d refactor: remove debug print statements (#12534)
### What problem does this PR solve?

refactor: remove debug print statements

### Type of change

- [x] Refactoring
2026-01-09 19:23:50 +08:00
011bbe9556 Feat: support context window for docx (#12455)
### What problem does this PR solve?

Feat: support context window for docx

#12303

Done:
- [x] naive.py
- [x] one.py

TODO:
- [ ] book.py
- [ ] manual.py

Fix: incorrect image position
Fix: incorrect chunk type tag

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-01-07 15:08:17 +08:00
4cd4526492 Feat: PDF vision figure parser supports reading context (#12416)
### What problem does this PR solve?

PDF vision figure parser supports reading context.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-05 09:55:43 +08:00
712d537d66 Fix: vision_figure_parser_docx/pdf_wrapper (#12104)
### What problem does this PR solve?

Fix: vision_figure_parser_docx/pdf_wrapper  #11735

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-23 11:51:28 +08:00
b49eb6826b Feat: enhance Excel image extraction with vision-based descriptions (#12054)
### What problem does this PR solve?
issue:
[#11618](https://github.com/infiniflow/ragflow/issues/11618)
change:
enhance Excel image extraction with vision-based descriptions

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-22 10:17:44 +08:00
797e03f843 Fix: none type error. (#11735)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-04 14:14:38 +08:00
ba71160b14 Refa: rm useless code. (#11238)
### Type of change

- [x] Refactoring
2025-11-13 09:59:55 +08:00
bab3fce136 Move some constants to common (#11004)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-05 08:01:39 +08:00
1e45137284 Move 'timeout' to common folder (#10983)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-04 11:51:12 +08:00
41fade3fe6 Fix:wrong param in manual chunk (#10710)
### What problem does this PR solve?

change:
wrong param in manual chunk

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-21 20:10:54 +08:00
6ab96287c9 Feat:Vision Model Image Enhancement in Manual/Paper/Book/One chunker (#10640)
### What problem does this PR solve?
issue:
[#7472](https://github.com/infiniflow/ragflow/issues/7472)
change:
Vision Model Image Enhancement in Manual chunker
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-10-21 09:36:27 +08:00
4eb7659499 Fix bug: broken import from rag.prompts.prompts (#10217)
### What problem does this PR solve?

Fix broken imports

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: jinhai <haijin.chn@gmail.com>
2025-09-23 10:19:25 +08:00
ecdb1701df Perf: test llm before RAPTOR. (#8897)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2025-07-17 16:48:50 +08:00
4a2ff633e0 Fix typo in code (#8327)
### What problem does this PR solve?

Fix typo in code

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-06-18 09:41:09 +08:00
b908c33464 Fix: uncaptured image data with position information (#7683)
### What problem does this PR solve?

Fixed uncaptured figure data with position information. #7466, #7681

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-05-19 19:33:28 +08:00
953b3e1b3f Fix: Sometimes VisionFigureParser.figures may is tuple (#7477)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/7466
I think due to some times we can not get position 

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-06 17:38:22 +08:00
2f768b96e8 perf: optimze figure parser (#7392)
### What problem does this PR solve?

When parsing documents containing images, the current code uses a
single-threaded approach to call the VL model, resulting in extremely
slow parsing speed (e.g., parsing a Word document with dozens of images
takes over 20 minutes).

By switching to a multithreaded approach to call the VL model, the
parsing speed can be improved to an acceptable level.

### Type of change

- [x] Performance Improvement

---------

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
2025-05-06 14:39:45 +08:00
9611185eb4 Feat: add VLM-boosted DocX parser (#6307)
### What problem does this PR solve?

Add VLM-boosted DocX parser

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 11:24:44 +08:00
1d6760dd84 Feat: add VLM-boosted PDF parser (#6278)
### What problem does this PR solve?

Add VLM-boosted PDF parser if VLM is set.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-03-20 09:39:32 +08:00