ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-03 08:47:48 +08:00

Author	SHA1	Message	Date
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 )

### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- Errors on pages beyond the limit
- Poor image quality as the parser attempted to compensate with missing page data
- Inconsistent chunk splitting between full PDF imports and partial imports

Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
    ...
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000  # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| Deepdoc parsers | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| Chunk parsers | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| Task/DB layer | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default.

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
Stephen Hu	d32967eda8	refactor: let excel use lazy image loader (#13558 ) ### What problem does this PR solve? let excel use lazy image loader ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-23 21:24:40 +08:00
qinling0210	4bc622b409	Fix parameter of calling self.dataStore.get() and warning info during parser (#13068 ) ### What problem does this PR solve? Fix parameter of calling self.dataStore.get() and warning info during parser https://github.com/infiniflow/ragflow/issues/13036 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-09 17:56:59 +08:00
yH	5333e764fc	fix: optimize Excel row counting for files with abnormal max_row (#13018 ) ### What problem does this PR solve? Some Excel files have abnormal `max_row` metadata (e.g., `max_row=1,048,534` with only 300 actual data rows). This causes: - `row_number()` returns incorrect count, creating 350+ tasks instead of 1 - `list(ws.rows)` iterates through millions of empty rows, causing system hang This PR uses binary search to find the actual last row with data. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-06 14:43:52 +08:00
Carve_	23bdf25a1f	feature:Add OceanBase Storage Support for Table Parser (#12923 ) ### What problem does this PR solve? close #12770 This PR adds OceanBase as a storage backend for the Table Parser. It enables dynamic table schema storage via JSON and implements OceanBase SQL execution for text-to-SQL retrieval. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes - Table Parser stores row data into `chunk_data` when doc engine is OceanBase. (table.py) - OceanBase table schema adds `chunk_data` JSON column and migrates if needed. - Implemented OceanBase `sql()` to execute text-to-SQL results. (ob_conn.py) - Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py) ### Test 1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`) <img width="1290" height="783" alt="doc_engine_ob" src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1" /> 2. Upload an Excel file to Knowledge Base (the test used the file shown below). <img width="786" height="930" alt="excel" src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a" /> 3. Choose Table as parsing method. <img width="2550" height="1134" alt="parse_excel" src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0" /> 4. Ask a natural language query in chat. <img width="2550" height="1134" alt="query" src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91" />	2026-01-31 15:11:54 +08:00
qinling0210	b40d639fdb	Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541 ) ### What problem does this PR solve? 1) Create dataset using table parser for infinity 2) Answer questions in chat using SQL ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-19 19:35:14 +08:00
Jin Hai	f0392e7501	Fix IDE warnings (#12315 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 15:04:09 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
buua436	b49eb6826b	Feat: enhance Excel image extraction with vision-based descriptions (#12054 ) ### What problem does this PR solve? issue: [#11618](https://github.com/infiniflow/ragflow/issues/11618) change: enhance Excel image extraction with vision-based descriptions ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-22 10:17:44 +08:00
buua436	ab4b62031f	Fix:csv parse in Table (#11870 ) ### What problem does this PR solve? change: csv parse in Table ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-10 16:44:06 +08:00
Billy Bao	e59458c36b	Fix: parsing excel with chartsheet & Clamp begin to a minimum of 0 to prevent negative indexing (#10819 ) ### What problem does this PR solve? Fix: parsing excel with chartsheet #10815 Fix: Clamp begin to a minimum of 0 to prevent negative indexing #10804 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-28 09:40:37 +08:00
HaiyangP	79399f7f25	Support the case of one cell split by multiple columns. (#9225 ) ### What problem does this PR solve? Support the case of one cell split by multiple columns. The code also remains compatible with the common cell case. This can fix #8606. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) I provide a case of one cell split by multiple columns: [test.xlsx](https://github.com/user-attachments/files/21578693/test.xlsx) The chunk result: <img width="236" height="57" alt="2025-06-17 16-04-07 screenshot" src="https://github.com/user-attachments/assets/b0a499ac-349d-4c3d-8c6e-0931c8fc26de" />	2025-08-11 17:17:56 +08:00
Stephen Hu	ce140f1393	Fix:Better Support Table Value Type (#8822 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8782 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-14 17:51:26 +08:00
HaiyangP	d6a941ebf5	Fix the bug of long type value overflow (#8313 ) ### What problem does this PR solve? This PR fixes #8271 by extending a column's int type to float type when any value in the column is out of the long type range. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-24 18:18:30 +08:00
HaiyangP	baf32ee461	Display only the duplicate column names and corresponding original source. (#8138 ) ### What problem does this PR solve? This PR aims to solve #8120, which requests a better error display for duplicate column names. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-10 10:16:38 +08:00
QuintinTao	1b4016317e	fix bug chunking:expected string or bytes-like object (#7116 ) ### What problem does this PR solve? Fix bug #6990: internal server error while chunking: expected string or bytes-like object ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: unknown <taoshi.ln@chinatelecom.cn>	2025-04-18 14:42:36 +08:00
hy89	b0c21b00d9	Refactor: Optimize error handling and support parsing of XLS(EXCEL97—2003) files. (#5633 ) Optimize error handling and support parsing of XLS(EXCEL97—2003) files.	2025-03-05 11:55:27 +08:00
Kevin Hu	b418ce5643	Fix table parser issue. (#5482 ) ### What problem does this PR solve? #1475 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-28 16:09:12 +08:00
Kevin Hu	c24137bd11	Fix too long integer for `Table`. (#4651 ) ### What problem does this PR solve? #4594 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-01-26 12:54:58 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Kevin Hu	8fb18f37f6	Code refactor. (#4291 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-12-30 18:38:51 +08:00
ly0303521	101b8ff813	fix chunk method "Table" losing content when the Excel file has multiple sheets (#4123 ) ### What problem does this PR solve? Discussed in https://github.com/infiniflow/ragflow/pull/4102 - In excel_parser.py, `total` means the total number of rows in the Excel file, but it was returned in the first iteration, which led to a wrong `to_page`. - In table.py, when an Excel file has multiple sheets, it is divided into multiple parts, each of size 3000; `data` may be empty because it was already recorded in the previous iteration. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-19 17:30:26 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Zhichang Yu	f4c52371ab	Integration with Infinity (#2894 ) ### What problem does this PR solve? Integration with Infinity - Replaced ELASTICSEARCH with dataStoreConn - Renamed deleteByQuery with delete - Renamed bulk to upsertBulk - getHighlight, getAggregation - Fix KGSearch.search - Moved Dealer.sql_retrieval to es_conn.py ### Type of change - [x] Refactoring	2024-11-12 14:59:41 +08:00
Kevin Hu	fc867cb959	rename get_txt to get_text (#2649 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-09-29 12:47:09 +08:00
yqkcn	aea553c3a8	Add get_txt function (#2639 ) ### What problem does this PR solve? Add get_txt function to reduce duplicate code ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-09-29 10:29:56 +08:00
Jin Hai	6b3a40be5c	Format file format from Windows/dos to Unix (#1949 ) ### What problem does this PR solve? The related source files were in Windows/DOS format; they are converted to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-08-15 09:17:36 +08:00
KevinHuSh	7013d7f620	refine text decode (#657 ) ### What problem does this PR solve? #651 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-05-07 12:25:47 +08:00
KevinHuSh	8acc01a227	refine redis connection (#599 ) ### What problem does this PR solve? #591 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-29 08:52:38 +08:00
KevinHuSh	8c07992b6c	refine code (#595 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2024-04-28 19:13:33 +08:00
KevinHuSh	ed6081845a	Fit a lot of encodings for text file. (#458 ) ### What problem does this PR solve? #384 ### Type of change - [x] Performance Improvement	2024-04-19 18:02:53 +08:00
YC	e8570da856	Update table.py to convert clmns to string (#414 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-17 19:48:11 +08:00
KevinHuSh	800b5c7aaa	fix bulk error for table method (#407 ) ### What problem does this PR solve? Issue link:#366 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-17 12:17:14 +08:00
KevinHuSh	23b448cf96	fix docker compose issue (#238 ) ### What problem does this PR solve? Issue link: https://github.com/infiniflow/ragflow/issues/226 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-04-07 09:04:32 +08:00
KevinHuSh	36f2d7b797	To avoid assertion while no rows in excel (#197 ) ### What problem does this PR solve? Issue link: https://github.com/infiniflow/ragflow/issues/196 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Breaking Change (fix or feature that could cause existing functionality not to work as expected) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Test cases - [ ] Python SDK impacted, Need to update PyPI - [ ] Other (please describe):	2024-04-02 10:51:21 +08:00
KevinHuSh	fd7fcb5baf	apply pep8 formalize (#155 )	2024-03-27 11:33:46 +08:00
KevinHuSh	6999598101	refine for English corpus (#135 )	2024-03-20 16:56:16 +08:00
KevinHuSh	de09b0e1a4	resolve table issues (#125 )	2024-03-15 14:59:28 +08:00
KevinHuSh	675a9f8d9a	add dockerfile for cuda environment. Refine table search strategy. (#123 )	2024-03-14 19:45:29 +08:00
KevinHuSh	0feb085c88	refine table parser (#120 )	2024-03-12 18:56:04 +08:00
KevinHuSh	f1f09df901	add local llm implementation (#119 )	2024-03-12 11:57:08 +08:00
KevinHuSh	7fd1eca582	init README of deepdoc, add picture processor. (#71 ) * init README of deepdoc, add picture processor. * add resume parsing	2024-02-23 18:28:12 +08:00
KevinHuSh	cacd36c5e1	use onnx models, new deepdoc (#68 )	2024-02-21 16:32:38 +08:00
KevinHuSh	a8294f2168	Refine resume parts and fix bugs in retrieval using sql (#66 )	2024-02-19 19:22:17 +08:00
KevinHuSh	5e0a689c43	refactor retrieval_test, add SQL retrieval methods (#61 )	2024-02-08 17:01:01 +08:00
KevinHuSh	c5ea37cd30	Add resume parser and fix bugs (#59 ) * Update .gitignore * Update .gitignore * Add resume parser and fix bugs	2024-02-07 19:27:23 +08:00
KevinHuSh	407b2523b6	remove unused code, separate layout detection out as a new api. Add new rag method 'table' (#55 )	2024-02-05 18:08:17 +08:00
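
The page-limit fix described in commit 2846a93998 can be illustrated with a minimal sketch. Only the two constant names and their values come from that commit message. The function bodies, the `total_pages` parameter, and the exclusive upper bound are assumptions made for illustration and are not RAGFlow's actual code.

```python
# Constants as stated in the commit message for 2846a93998.
MAXIMUM_PAGE_NUMBER = 100000  # parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # task/DB layer


def render_pages(total_pages, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
    """Hypothetical stand-in for a page renderer.

    Returns the page indices it would render. In this sketch `page_to`
    is treated as an exclusive upper bound (an assumption).
    """
    start = max(page_from, 0)
    stop = min(page_to, total_pages)  # clamp the window to the real document length
    return list(range(start, stop))


def parse_into_bboxes(total_pages, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
    """Hypothetical caller that forwards the page window explicitly
    instead of silently relying on the callee's default."""
    return render_pages(total_pages, page_from=from_page, page_to=to_page)


# A 1200-page document is covered in full with the sentinel default,
# while a small hardcoded limit truncates it.
full = parse_into_bboxes(total_pages=1200)
truncated = render_pages(total_pages=1200, page_to=300)
print(len(full), len(truncated))
```

Keeping the sentinel in one named constant means a caller that forgets to pass a window still gets "all pages" rather than an arbitrary cut-off.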

47 Commits
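
Commit 5333e764fc states that it uses binary search to find the actual last row with data when `max_row` is abnormal. The following is a minimal sketch of that idea, not the code from the PR. It assumes rows with data are contiguous from row 1, so that "row has data" is true up to some row and false afterwards. `last_data_row` and the `row_has_data` callback are hypothetical names introduced here.

```python
def last_data_row(row_has_data, max_row):
    """Return the 1-based index of the last row with data, or 0 if none.

    Assumes data rows are contiguous from row 1, so `row_has_data(r)`
    is True for every r up to the answer and False for every r after it.
    """
    lo, hi = 0, max_row  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper midpoint so the loop always shrinks
        if row_has_data(mid):
            lo = mid  # the last data row is at mid or later
        else:
            hi = mid - 1  # the last data row is before mid
    return lo


# Example mirroring the commit message: 300 real rows, max_row=1,048,534.
probes = []


def has_data(row):
    probes.append(row)  # record each probe to show how few rows are inspected
    return row <= 300


result = last_data_row(has_data, 1_048_534)
print(result, len(probes))
```

Under the contiguity assumption, the search inspects on the order of log2(max_row) rows instead of iterating over every row. A sheet with gaps between data rows would need a different strategy.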