ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-08 08:07:21 +08:00

Author	SHA1	Message	Date
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. 
Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| Deepdoc parsers | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `109`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| Chunk parsers | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| Task/DB layer | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
eviaaaaa	d0ca388bec	Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329 ) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * Centralized Abstraction (`docx_parser.py`): Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * Robust Corrupted Image Fallback (`docx_parser.py`): Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * Legacy Code & Redundancy Purge: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. 
## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * Output Consistency: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * Memory Footprint Drop: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.	2026-03-11 10:00:07 +08:00
lys1313013	37e4485415	feat: add MDX file support (#12261 ) Feat: add MDX file support #12057 ### What problem does this PR solve? <img width="1055" height="270" alt="image" src="https://github.com/user-attachments/assets/a0ab49f9-7806-41cd-8a96-f593591ab36b" /> The page states that MDX files are supported, but uploading fails with the error: "x.mdx: This type of file has not been supported yet!" <img width="381" height="110" alt="image" src="https://github.com/user-attachments/assets/4bbb7d08-cb47-416a-95fc-bc90b90fcc39" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-29 12:54:31 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Jin Hai	43f51baa96	Fix errors (#11804 ) ### What problem does this PR solve? 1. typos 2. grammar errors. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 12:21:18 +08:00
Jin Hai	766d900a41	Refactor: rename rmSpace to remove_redundant_spaces (#10796 ) ### What problem does this PR solve? - rename rmSpace to remove_redundant_spaces - move clean_markdown_block to common module - add unit tests for remove_redundant_spaces and clean_markdown_block ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-28 09:46:32 +08:00
liuzhenghua	5256980ffb	Fix: Solve the OOM issue when passing large PDF files while using QA chunking method. (#8464 ) ### What problem does this PR solve? Using the QA chunking method with a large PDF (e.g., 300+ pages) may lead to OOM in the ragflow-worker module. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-25 10:25:45 +08:00
Kevin Hu	321a280031	Feat: add image preview to retrieval test. (#7610 ) ### What problem does this PR solve? #7608 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-05-13 14:30:36 +08:00
Kevin Hu	1333d3c02a	Fix: float transfer exception. (#6197 ) ### What problem does this PR solve? #6177 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-18 11:13:44 +08:00
Kevin Hu	9d717f0b6e	Fix csv reader exception. (#4628 ) ### What problem does this PR solve? #4552 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-01-24 14:47:19 +08:00
Kevin Hu	13f04b7cca	Fix pdf applying Q&A issue. (#4599 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-01-23 12:30:46 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Kevin Hu	c5da3cdd97	Tagging (#4426 ) ### What problem does this PR solve? #4367 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-01-09 17:07:21 +08:00
TeslaZY	dd13a5d05c	Fix some bugs in text2sql.(#4279 )(#4281 ) (#4280 ) Fix some bugs in text2sql.(#4279)(#4281) ### What problem does this PR solve? - Incorrect results when parsing CSV files of the QA knowledge base in the text2sql scenario. Process CSV files using the csv library. Decouple CSV parsing from TXT parsing - Most LLMs return results in markdown format ```sql query ```. Fix the execution error caused by LLM output in SQL markdown format. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-30 10:32:19 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Jin Hai	e079656473	Update progress info and start welcome info (#3768 ) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Refactoring --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2024-11-30 18:48:06 +08:00
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403 ) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2024-11-14 17:13:48 +08:00
Zhichang Yu	a2a5631da4	Rework logging (#3358 ) Unified all log files into one. ### What problem does this PR solve? Unified all log files into one. ### Type of change - [x] Refactoring	2024-11-12 17:35:13 +08:00
Kevin Hu	f86826b7a0	refactor error message of qwen (#3074 ) ### What problem does this PR solve? #3055 ### Type of change - [x] Refactoring	2024-10-29 10:08:08 +08:00
Kevin Hu	1fce6caf80	make titles in markdown not be split from the following content (#2971 ) ### What problem does this PR solve? #2970 ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2024-10-22 15:25:23 +08:00
Kevin Hu	b540d41cdc	let presentation do raptor (#2838 ) ### What problem does this PR solve? #2837 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-10-15 10:11:09 +08:00
yqkcn	570ad420a8	remove unused import (#2679 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-09-30 16:59:39 +08:00
Kevin Hu	fc867cb959	rename get_txt to get_text (#2649 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-09-29 12:47:09 +08:00
yqkcn	aea553c3a8	Add get_txt function (#2639 ) ### What problem does this PR solve? Add get_txt function to reduce duplicate code ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-09-29 10:29:56 +08:00
Jin Hai	6b3a40be5c	Convert file format from Windows/DOS to Unix (#1949 ) ### What problem does this PR solve? The related source file was in Windows/DOS format; it has been converted to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-08-15 09:17:36 +08:00
cHz	4b195cc14c	fix: Misspelled Variable Name (#1662 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-07-24 11:14:46 +08:00
Zhedong Cen	b75bb1d8d3	Support displaying tables in the chunks of pdf file when using QA parser (#1263 ) ### What problem does this PR solve? Support displaying tables in the chunks of pdf file when using QA parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-24 19:02:18 +08:00
Zhedong Cen	38bd02f402	Support displaying images in the chunks of docx files when using general parser (#1253 ) ### What problem does this PR solve? Support displaying images in chunks of docx files when using general parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-24 16:29:36 +08:00
Zhedong Cen	f8fe4154e8	Place pdf's image at the correct position in QA parser (#1235 ) ### What problem does this PR solve? Place pdf's image at the correct position in QA parser ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-06-24 10:41:03 +08:00
Zhedong Cen	3c1444ab19	Add docx support for manual parser (#1227 ) ### What problem does this PR solve? Add docx support for manual parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-20 17:03:02 +08:00
Zhedong Cen	fb56a29478	Add docx support for QA parser (#1213 ) ### What problem does this PR solve? Add docx support for QA parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-20 16:09:09 +08:00
KevinHuSh	e35f7610e7	fix too long query exception (#1195 ) ### What problem does this PR solve? #1161 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-06-18 09:50:59 +08:00
Zhedong Cen	7920a5c78d	Add markdown support for QA parser (#1180 ) ### What problem does this PR solve? Add markdown support for QA parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-18 09:45:13 +08:00
Zhedong Cen	90975460af	Add pdf support for QA parser (#1155 ) ### What problem does this PR solve? Support extracting questions and answers from PDF files ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-14 15:12:39 +08:00
KevinHuSh	7013d7f620	refine text decode (#657 ) ### What problem does this PR solve? #651 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-05-07 12:25:47 +08:00
KevinHuSh	674b3aeafd	fix disable and enable llm setting in dialog (#616 ) ### What problem does this PR solve? #614 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-30 11:04:14 +08:00
KevinHuSh	8c07992b6c	refine code (#595 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-04-28 19:13:33 +08:00
KevinHuSh	ed6081845a	Fit a lot of encodings for text file. (#458 ) ### What problem does this PR solve? #384 ### Type of change - [x] Performance Improvement	2024-04-19 18:02:53 +08:00
KevinHuSh	392e515c3f	fix bug in knowledgebase configuration reloading (#210 ) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ Issue link:#[[Link the issue here](https://github.com/infiniflow/ragflow/issues/209)] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-03 11:00:50 +08:00
KevinHuSh	6999598101	refine for English corpus (#135 )	2024-03-20 16:56:16 +08:00
KevinHuSh	9a843667b3	fix github account login issue (#132 )	2024-03-19 15:31:47 +08:00
KevinHuSh	7fd1eca582	init README of deepdoc, add picture processor. (#71 ) * init README of deepdoc, add picture processor. * add resume parsing	2024-02-23 18:28:12 +08:00
KevinHuSh	cacd36c5e1	use onnx models, new deepdoc (#68 )	2024-02-21 16:32:38 +08:00
KevinHuSh	a8294f2168	Refine resume parts and fix bugs in retrieval using sql (#66 )	2024-02-19 19:22:17 +08:00
KevinHuSh	5e0a689c43	refactor retrieval_test, add SQL retrieval methods (#61 )	2024-02-08 17:01:01 +08:00
KevinHuSh	407b2523b6	remove unused code, separate layout detection out as a new API. Add new RAG method 'table' (#55 )	2024-02-05 18:08:17 +08:00
KevinHuSh	51482f3e2a	Some document API refined. (#53 ) Add naive chunking method to RAG	2024-02-02 19:21:37 +08:00
KevinHuSh	e6acaf6738	Add Q&A and Book, fix task running bugs (#50 )	2024-02-01 18:53:56 +08:00

48 Commits