ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-03-20 22:27:42 +08:00

Author	SHA1	Message	Date
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. 
Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Kevin Hu	bd76b8ff1a	Fix: Tika server upgrades. (#12073 ) ### What problem does this PR solve? #12037 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-23 09:35:52 +08:00
Yongteng Lei	672958a192	Fix: model not authorized (#12001 ) ### What problem does this PR solve? Fix model not authorized. #11973. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-17 19:48:24 +08:00
Jin Hai	0e8b9588ba	Fix error and format issue (#11975 ) ### What problem does this PR solve? 1. Fix error of book chunking. 2. Fix format issues. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-16 19:29:37 +08:00
Jin Hai	43f51baa96	Fix errors (#11804 ) ### What problem does this PR solve? 1. typos 2. grammar errors. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 12:21:18 +08:00
Stephen Hu	b66881a371	Refactor: book parser uses `with` to handle BytesIO (#11800 ) ### What problem does this PR solve? The book parser now uses a `with` block to manage the BytesIO object. ### Type of change - [x] Refactoring	2025-12-08 10:18:46 +08:00
Yongteng Lei	9d8b96c1d0	Feat: add context for figure and table (#11547 ) ### What problem does this PR solve? Add context for figures and tables. ![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e) `==================()` is for demonstration purposes. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 10:21:44 +08:00
coding	971c1bcba7	Fix: missing parameters in by_plaintext method for PDF naive mode (#11408 ) ### What problem does this PR solve? Fix: missing parameters in by_plaintext method for PDF naive mode ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: lih <dev_lih@139.com>	2025-11-21 09:33:36 +08:00
Billy Bao	4b8ce08050	Fix: fix pdf_parser ignored in rag/app/naive.py (#11065 ) ### What problem does this PR solve? Fix: fix pdf_parser ignored in rag/app/naive.py #11000 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-06 15:20:35 +08:00
Billy Bao	cf9611c96f	Feat: Support more chunking methods (#11000 ) ### What problem does this PR solve? Feat: Support more chunking methods #10772 This PR enables multiple chunking methods — including books, laws, naive, one, and presentation — to be used with all existing PDF parsers (DeepDOC, MinerU, Docling, TCADP, Plain Text, and Vision modes). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-05 13:00:42 +08:00
buua436	6ab96287c9	Feat:Vision Model Image Enhancement in Manual/Paper/Book/One chunker (#10640 ) ### What problem does this PR solve? issue: [#7472](https://github.com/infiniflow/ragflow/issues/7472) change: Vision Model Image Enhancement in Manual chunker ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-21 09:36:27 +08:00
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113 ) ### What problem does this PR solve? #9082 #6365 <u> WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore.</u> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-07-30 19:41:09 +08:00
Kevin Hu	dd0ebbea35	Light GraphRAG (#4585 ) ### What problem does this PR solve? #4543 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-01-22 19:43:14 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Jin Hai	e079656473	Update progress info and start welcome info (#3768 ) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Refactoring --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2024-11-30 18:48:06 +08:00
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403 ) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2024-11-14 17:13:48 +08:00
Zhichang Yu	a2a5631da4	Rework logging (#3358 ) Unified all log files into one. ### What problem does this PR solve? Unified all log files into one. ### Type of change - [x] Refactoring	2024-11-12 17:35:13 +08:00
yqkcn	570ad420a8	remove unused import (#2679 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-09-30 16:59:39 +08:00
Kevin Hu	fc867cb959	rename get_txt to get_text (#2649 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-09-29 12:47:09 +08:00
yqkcn	aea553c3a8	Add get_txt function (#2639 ) ### What problem does this PR solve? Add get_txt function to reduce duplicate code ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-09-29 10:29:56 +08:00
Jin Hai	6b3a40be5c	Format file format from Windows/dos to Unix (#1949 ) ### What problem does this PR solve? The related source file was in Windows/DOS format; it is converted to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-08-15 09:17:36 +08:00
KevinHuSh	0171082cc5	fix create dialog bug (#982 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-05-30 09:25:05 +08:00
Zhedong Cen	8dd45459be	Add support for HTML file (#973 ) ### What problem does this PR solve? Add support for HTML file ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-05-30 09:12:55 +08:00
KevinHuSh	7013d7f620	refine text decode (#657 ) ### What problem does this PR solve? #651 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-05-07 12:25:47 +08:00
KevinHuSh	8c07992b6c	refine code (#595 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-04-28 19:13:33 +08:00
Jin Hai	f1c98aad6b	Update version info (#564 ) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Documentation Update - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-04-26 20:07:26 +08:00
KevinHuSh	369400c483	fix bug of table in docx (#510 ) ### What problem does this PR solve? #509 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-23 19:10:33 +08:00
chrysanthemum-boy	72384b191d	Add `.doc` file parser. (#497 ) ### What problem does this PR solve? Add `.doc` file parser, using tika. ``` pip install tika ``` ``` from tika import parser from io import BytesIO def extract_text_from_doc_bytes(doc_bytes): file_like_object = BytesIO(doc_bytes) parsed = parser.from_buffer(file_like_object) return parsed["content"] ``` ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: chrysanthemum-boy <fannc@qq.com>	2024-04-23 15:31:43 +08:00
KevinHuSh	0dfc8ddc0f	enlarge docker memory usage (#501 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-04-23 14:41:10 +08:00
KevinHuSh	a38e163035	remove doc from supported processing types (#488 ) ### What problem does this PR solve? #474 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-22 15:46:09 +08:00
KevinHuSh	ed6081845a	Fit a lot of encodings for text file. (#458 ) ### What problem does this PR solve? #384 ### Type of change - [x] Performance Improvement	2024-04-19 18:02:53 +08:00
KevinHuSh	f6c7204002	refine log format (#312 ) ### What problem does this PR solve? Issue link:#264 ### Type of change - [x] Documentation Update - [x] Refactoring	2024-04-11 10:13:43 +08:00
KevinHuSh	fd7fcb5baf	apply pep8 formalize (#155 )	2024-03-27 11:33:46 +08:00
KevinHuSh	f6aee7f230	add use layout or not option (#145 ) * add use layout or not option * trivial	2024-03-22 19:21:09 +08:00
KevinHuSh	602038ac49	fix task canceling bug (#98 )	2024-03-05 16:33:47 +08:00
KevinHuSh	8a57f2afd5	change callback strategy, add timezone to docker (#96 )	2024-03-05 12:08:41 +08:00
KevinHuSh	7bfaf0df29	fix position extraction bug (#93 ) * fix position extraction bug * remove delimiter for naive parser	2024-03-04 17:08:35 +08:00
KevinHuSh	685b4d8a95	fix table desc bugs, add positions to chunks (#91 )	2024-03-04 14:42:26 +08:00
KevinHuSh	8a726fb04b	solve task execution issues (#90 )	2024-03-01 19:48:01 +08:00
KevinHuSh	7fd1eca582	init README of deepdoc, add picture processor. (#71 ) * init README of deepdoc, add picture processor. * add resume parsing	2024-02-23 18:28:12 +08:00
KevinHuSh	cacd36c5e1	use onnx models, new deepdoc (#68 )	2024-02-21 16:32:38 +08:00
KevinHuSh	a8294f2168	Refine resume parts and fix bugs in retrieval using sql (#66 )	2024-02-19 19:22:17 +08:00
KevinHuSh	407b2523b6	remove unused code, separate layout detection out as a new api. Add new rag method 'table' (#55 )	2024-02-05 18:08:17 +08:00
KevinHuSh	51482f3e2a	Some document API refined. (#53 ) Add naive chunking method to RAG	2024-02-02 19:21:37 +08:00
KevinHuSh	e6acaf6738	Add Q&A and Book, fix task running bugs (#50 )	2024-02-01 18:53:56 +08:00

49 Commits
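
Commit fa71f8d0c7 in the log above describes keeping DOCX images as raw blobs and decoding them only where they are consumed. The following is a minimal sketch of that idea, not the code from `lazy_image.py`. The class name `LazyImage`, its `decoder` parameter, and the `fake_decode` helper are hypothetical, and a stand-in decoder is used so the block runs without Pillow. With Pillow installed, the decoder could plausibly be `lambda blob: PIL.Image.open(io.BytesIO(blob))`.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyImage(Generic[T]):
    """Hypothetical lazy wrapper: holds raw bytes and decodes on demand."""

    def __init__(self, blob: bytes, decoder: Callable[[bytes], T]) -> None:
        self._blob: Optional[bytes] = blob
        self._decoder = decoder
        self.decode_count = 0  # only here to make the laziness observable

    @property
    def closed(self) -> bool:
        return self._blob is None

    def decode(self) -> T:
        """Return a freshly decoded object; nothing decoded is cached."""
        if self._blob is None:
            raise ValueError("image has been closed")
        self.decode_count += 1
        return self._decoder(self._blob)

    def close(self) -> None:
        """Drop the raw bytes so they can be garbage collected."""
        self._blob = None


def fake_decode(blob: bytes) -> dict:
    # Stand-in for a real image decoder (an assumption for this sketch).
    return {"size_in_bytes": len(blob)}


# Extraction keeps only blobs; no decoding happens at this point.
images = [LazyImage(b"\x89PNG" + bytes(n), fake_decode) for n in (10, 20, 30)]

# A consumer decodes exactly the image it needs, then releases it.
decoded = images[1].decode()
images[1].close()

print(decoded["size_in_bytes"])              # 24
print([img.decode_count for img in images])  # [0, 1, 0]
```

Because each `decode()` call returns a new object, consumers do not share decoded state. In this sketch the trade-off is that repeated consumers pay the decode cost again.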