ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-03-20 22:27:42 +08:00

Author	SHA1	Message	Date
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
Rin	651d9fff9f	security: replace unsafe eval with ast.literal_eval in vision operators (#12236) Addresses a potential RCE vulnerability in NormalizeImage by using ast.literal_eval for safer string parsing. --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-29 13:28:09 +08:00
yihong	4326873af6	refactor: no need to inherit in python3 clean the code (#5659) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-03-05 18:03:53 +08:00
yihong	37aacb3960	Refa: drop useless fasttext (#5470) ### What problem does this PR solve? This patch drops fasttext, which appears to be unused in the code base and is hard to install. Should close #4498 ### Type of change - [x] Refactoring Signed-off-by: yihong0618 <zouzou0208@gmail.com>	2025-02-28 14:30:56 +08:00
Kevin Hu	b08bb56f6c	Display thinking for deepseek r1 (#4904) ### What problem does this PR solve? #4903 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-02-12 15:43:13 +08:00
Mathias Panzenböck	6b389e01b5	Remove use of eval() from operators.py (#4888) Use `np.float32()` instead. ### What problem does this PR solve? Using `eval()` can lead to code injections. I think `eval()` is only used to parse a floating point number here. This change preserves the correct behavior if the string `"None"` is supplied. But if that behavior isn't intended then this part could be just deleted instead, since `np.float32()` is parsing strings anyway: ```Python if isinstance(scale, str): scale = eval(scale) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-12 12:53:42 +08:00
Kevin Hu	ce1e855328	Upgrades Document Layout Analysis model. (#4054) ### What problem does this PR solve? #4052 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-12-17 11:27:19 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2024-11-14 17:13:48 +08:00
Zhichang Yu	a2a5631da4	Rework logging (#3358) Unified all log files into one. ### What problem does this PR solve? Unified all log files into one. ### Type of change - [x] Refactoring	2024-11-12 17:35:13 +08:00
Ikko Eltociear Ashimine	c552a02e7f	chore: update operators.py (#2724) ### What problem does this PR solve? substract -> subtract ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2024-10-08 10:34:52 +08:00
Jin Hai	6b3a40be5c	Format file format from Windows/dos to Unix (#1949) ### What problem does this PR solve? The related source files were in Windows/DOS format; they are now converted to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-08-15 09:17:36 +08:00
KevinHuSh	fd7fcb5baf	apply pep8 formalize (#155)	2024-03-27 11:33:46 +08:00
KevinHuSh	d32322c081	rename vision, add layour and tsr recognizer (#70) * rename vision, add layour and tsr recognizer * trivial fixing	2024-02-22 19:11:37 +08:00
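
The lazy-loading idea described in commit fa71f8d0c7 (keep the raw blob, decode only at the point of use, hand out detached copies, release on `.close()`) can be illustrated with a minimal sketch. This is not the `LazyDocxImage` class from that PR: the class name `LazyBlob`, its methods, and the injectable `decoder` are all hypothetical and chosen so the example runs with the standard library alone. With Pillow, a decoder along the lines of `lambda b: Image.open(io.BytesIO(b))` would presumably be passed in.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyBlob(Generic[T]):
    """Hypothetical sketch: hold raw bytes and decode them only when needed."""

    def __init__(self, blob: bytes, decoder: Callable[[bytes], T]) -> None:
        self._blob: Optional[bytes] = blob   # cheap to keep around
        self._decoder = decoder              # e.g. a PIL-based decoder (assumption)
        self._decoded: Optional[T] = None    # filled on first get()

    @property
    def is_decoded(self) -> bool:
        return self._decoded is not None

    def get(self) -> T:
        """Decode on first use and cache the result."""
        if self._blob is None:
            raise ValueError("LazyBlob is closed")
        if self._decoded is None:
            self._decoded = self._decoder(self._blob)
        return self._decoded

    def detached(self) -> T:
        """Return a fresh decode that shares no state with the cached object."""
        if self._blob is None:
            raise ValueError("LazyBlob is closed")
        return self._decoder(self._blob)

    def close(self) -> None:
        """Drop both the cached decode and the raw bytes."""
        self._decoded = None
        self._blob = None


# Usage: nothing is decoded until get() or detached() is called.
image = LazyBlob(b"raw image bytes", decoder=bytearray)
pixels = image.get()
image.close()
```

Handing out detached copies trades some repeated decoding work for isolation: a consumer that mutates or closes its copy cannot affect other consumers.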

14 Commits
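
Commits 651d9fff9f and 6b389e01b5 both remove `eval()` from string parsing in `operators.py`. The sketch below shows the general technique only; the helper names `parse_literal` and `parse_scale` are hypothetical, and `float()` is used as a standard-library stand-in for the `np.float32()` call that commit 6b389e01b5 describes.

```python
import ast


def parse_literal(value):
    """Hypothetical helper: parse a string holding a Python literal safely."""
    if isinstance(value, str):
        # literal_eval accepts only literals (numbers, strings, lists, tuples,
        # dicts, ...); calls and attribute access raise ValueError instead of
        # being executed, which is what makes it safer than eval().
        return ast.literal_eval(value)
    return value


def parse_scale(scale):
    """Hypothetical helper: parse a numeric scale without eval()."""
    if scale is None or scale == "None":
        return None  # keep the "None" string behaviour the commit mentions
    return float(scale)  # stand-in for np.float32(scale); assumption


# Usage: a list literal is parsed, an injected call is rejected.
mean = parse_literal("[0.485, 0.456, 0.406]")
try:
    parse_literal("__import__('os').getcwd()")
except ValueError:
    print("rejected non-literal input")
```

One consequence of this approach is that arithmetic expressions in strings are no longer evaluated, so callers need to pass plain numbers or literals.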