ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-28 11:43:06 +08:00

Author	SHA1	Message	Date
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync delted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
Jack	343bda1119	Refactor: deco document upload_and_parse API (#14366 ) ### What problem does this PR solve? remove unused "POST /v1/document/upload_and_parse" ### Type of change - [x] Refactoring	2026-04-27 20:35:00 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer** \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
Jack	009e538a4e	Refactor: Consolidation WEB API & HTTP API for document get_filter (#14248 ) ### What problem does this PR solve? Before consolidation Web API: POST /v1/document/filter Http API - GET /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- GET /api/v1/datasets/<dataset_id>/documents?type=filter ### Type of change - [x] Refactoring	2026-04-21 18:55:30 +08:00
Jack	939933649a	Refactor: Consolidation WEB API & HTTP API for document list_docs (#14176 ) ### What problem does this PR solve? Before consolidation Web API: POST /v1/document/list Http API - GET /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API -- GET /api/v1/datasets/<dataset_id>/documents ### Type of change - [x] Refactoring	2026-04-20 14:54:40 +08:00
Qi Wang	57aec2e65d	Fix bug: run Knowledge graph or RAPTOR, it will update an existing task (#14102 ) ### What problem does this PR solve? It fixed the bug: https://github.com/infiniflow/ragflow/issues/14101 When run Knowledge graph or RAPTOR, the last document running status will be wrongly set, see below: It should never touch existing document result. ![Image](https://github.com/user-attachments/assets/14fe1f9e-0541-4093-8111-ed0bd25b87ba) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 16:37:41 +08:00
Magicbook1108	8d52ef2893	Feat: enable sync deleted files for connector (#14000 ) ### What problem does this PR solve? Feat: enable sync deleted files for connector 1. first comes with github ### Type of change - [x] New Feature (non-breaking change which adds functionality) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Added "sync deleted files" feature for data sources, enabling automatic removal of files deleted from the source system. * Added multilingual support for the new sync deleted files setting across multiple languages. * UI Improvements * Improved checkbox form field rendering and layout. * Enhanced full-width display for authentication token input fields.	2026-04-09 16:40:14 +08:00
Jinghan Xu	f6b06fab72	Fix: allow document parsing status recovery after transient errors (#13341 ) ### What problem does this PR solve? Fixes #13285 When an LLM returns a transient error (e.g. overloaded) during parsing, the task progress is set to -1. Previously, the progress could never be updated again, leaving the document permanently stuck in FAIL status even after the task successfully recovered and completed. Three coordinated changes address this: 1. task_service.update_progress: relax the progress update guard to accept prog >= 1 even when current progress is -1, so a task that recovers from a transient failure can report completion. 2. document_service.get_unfinished_docs: include documents that are marked FAIL (progress == -1) but still have at least one non-failed task (task.progress >= 0) in the polling set, so their status can be re-synced once a task recovers. Documents where all tasks have permanently failed are excluded to avoid unnecessary polling. 3. document_service.update_progress: explicitly set document status to RUNNING when not all tasks have finished, instead of preserving whatever stale status (potentially FAIL) the document previously had. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 18:02:12 +08:00
Yongteng Lei	375a910bcf	Fix: add deadlock retry (#13552 ) ### What problem does this PR solve? Add deadlock retry. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 12:39:01 +08:00
Josh	2d2d3cdbcf	Fix document metadata loading for paged listings (#13515 ) ## Summary - scope normal document-list metadata lookups to the current page's document IDs - keep the `return_empty_metadata=True` path dataset-wide because it needs full knowledge of docs that already have metadata - add unit tests for both paged listing paths and the unchanged empty-metadata behavior ## Why `DocumentService.get_list()` and the normal `get_by_kb_id()` path were calling `DocMetadataService.get_metadata_for_documents(None, kb_id)`, which loads metadata for the entire dataset on every page request. That becomes especially problematic on large datasets. The metadata scan path paginates through the full metadata index without an explicit sort, while the ES helper only switches to `search_after` beyond `10000` results when a sort is present. In practice this can lead to unnecessary full-dataset metadata work, slower document-list loading, and unreliable `meta_fields` in list responses for large KBs. This change keeps the existing empty-metadata filter behavior intact, but scopes normal list responses to metadata for the current page only.	2026-03-11 13:42:16 +08:00
Heyang Wang	08f83ff331	Feat: Support get aggregated parsing status to dataset via the API (#13481 ) ### What problem does this PR solve? Support getting aggregated parsing status to dataset via the API Issue: #12810 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>	2026-03-10 18:05:45 +08:00
qinling0210	185ab0d4ef	Fix delete_document_metadata (#13496 ) ### What problem does this PR solve? Avoid getting doc in function delete_document_metadata as the doc might have been removed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 13:44:24 +08:00
BitToby	383986dc5f	fix: re-chunk documents when data source content is updated (#12918 ) Closes: #12889 ### What problem does this PR solve? When syncing external data sources (e.g., Jira, Confluence, Google Drive), updated documents were not being re-chunked. The raw content was correctly updated in blob storage, but the vector database retained stale chunks, causing search results to return outdated information. Root cause: The task digest used for chunk reuse optimization was calculated only from parser configuration fields (`parser_id`, `parser_config`, `kb_id`, etc.), without any content-dependent fields. When a document's content changed but the parser configuration remained the same, the system incorrectly reused old chunks instead of regenerating new ones. Example scenario: 1. User syncs a Jira issue: "Meeting scheduled for Monday" 2. User updates the Jira issue to: "Meeting rescheduled to Friday" 3. User triggers sync again 4. Raw content panel shows updated text ✓ 5. Chunk panel still shows old text "Monday" ✗ Solution: 1. Include `update_time` and `size` in the chunking config, so the task digest changes when document content is updated 2. Track updated documents separately in `upload_document()` and return them for processing 3. Process updated documents through the re-parsing pipeline to regenerate chunks [1.webm](https://github.com/user-attachments/assets/d21d4dcd-e189-4d39-8700-053bae0ca5a0) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-06 12:48:47 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Yongteng Lei	0110151e12	Fix: document remove race condition (#13242 ) ### What problem does this PR solve? Fix document remove race condition. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 11:23:24 +08:00
qinling0210	212d6f3660	Fix metadata in get_list() (#12906 ) ### What problem does this PR solve? test_update_document.py failed as metadata is not included in the response of get_list(), fix the issue. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-30 14:06:49 +08:00
Kevin Hu	32c0161ff1	Refa: Clean the folders. (#12890 ) ### Type of change - [x] Refactoring	2026-01-29 14:23:26 +08:00
qinling0210	9a5208976c	Put document metadata in ES/Infinity (#12826 ) ### What problem does this PR solve? Put document metadata in ES/Infinity. Index name of meta data: ragflow_doc_meta_{tenant_id} ### Type of change - [x] Refactoring	2026-01-28 13:29:34 +08:00
Kevin Hu	e20d56a34c	Fix: metadata update issue (#12815 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-26 18:02:44 +08:00
LIRUI YU	4236a62855	Fix: Cancel tasks before document or datasets deletion to prevent queue blocking (#12799 ) ### What problem does this PR solve? When deleting the knowledge base, the records in the Document and Knowledgebase tables are immediately deleted But there are still a large number of pending task messages in the Redis queue (asynchronous queue) if you did not click on stopping tasks before deleting knowledge base. TaskService.get_task() uses a JOIN query to associate three tables (Task ← Document ← Knowledgebase) Since Document/Knowledgebase have been deleted, the JOIN returns an empty result, even though the Task records still exist task-executor considers the task does not exist ("collect task xxx is unknown"), can only skip and warn log：2026-01-23 16:43:21,716 WARNING 1190179 collect task 110fbf70f5bd11f0945a23b0930487df is unknown 2026-01-23 16:43:21,818 WARNING 1190179 collect task 11146bc4f5bd11f0945a23b0930487df is unknown 2026-01-23 16:43:21,918 WARNING 1190179 collect task 111c3336f5bd11f0945a23b0930487df is unknown 2026-01-23 16:43:22,021 WARNING 1190179 collect task 112471b8f5bd11f0945a23b0930487df is unknown 2026-01-23 16:43:26,719 WARNING 1190179 collect task 112e855ef5bd11f0945a23b0930487df is unknown 2026-01-23 16:43:26,734 WARNING 1190179 collect task 1134380af5bd11f0945a23b0930487df is unknown 2026-01-23 16:43:26,834 WARNING 1190179 collect task 1138cb2cf5bd11f0945a23b0930487df is unknown As a consequence, a large number of such tasks occupy the queue processing capacity, causing new tasks to queue and wait <img width="1910" height="947" alt="9a00f2e0-9112-4dbb-b357-7f66b8eb5acf" src="https://github.com/user-attachments/assets/0e1227c2-a2df-4ef3-ba8f-e04c3f6ef0e1" /> Solution Add logic to stop all ongoing tasks before deleting the knowledge base and Tasks ### Type of change - Bug Fix (non-breaking change which fixes an issue)	2026-01-26 10:45:59 +08:00
Kevin Hu	08c01b76d5	Fix: missing parent chunk issue. (#12789 ) ### What problem does this PR solve? Close #12783 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-23 12:54:08 +08:00
Kevin Hu	3beb85efa0	Feat: enhance metadata arranging. (#12745 ) ### What problem does this PR solve? #11564 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-22 15:34:08 +08:00
6ba3i	960ecd3158	Feat: update and add new tests for web api apps (#12714 ) ### What problem does this PR solve? This PR adds missing web API tests (system, search, KB, LLM, plugin, connector). It also addresses a contract mismatch that was causing test failures: metadata updates did not persist new keys (update‑only behavior). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): Test coverage expansion and test helper instrumentation	2026-01-20 19:12:15 +08:00
qinling0210	b40d639fdb	Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541 ) ### What problem does this PR solve? 1) Create dataset using table parser for infinity 2) Answer questions in chat using SQL ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-19 19:35:14 +08:00
6ba3i	2b20d0b3bb	Fix : Web API tests by normalizing errors, validation, and uploads (#12620 ) ### What problem does this PR solve? Fixes web API behavior mismatches that caused test failures by normalizing error responses, tightening validations, correcting error messages, and closing upload file handles. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-16 11:09:22 +08:00
Vedant Madane	ac936005e6	fix: ensure deleted chunks are not returned in retrieval (#12520 ) (#12546 ) ## Summary Fixes #12520 - Deleted chunks should not appear in retrieval/reference results. ## Changes ### Core Fix - api/apps/chunk_app.py: Include \doc_id\ in delete condition to properly scope the delete operation ### Improved Error Handling - api/db/services/document_service.py: Better separation of concerns with individual try-catch blocks and proper logging for each cleanup operation ### Doc Store Updates - rag/utils/es_conn.py: Updated delete query construction to support compound conditions - rag/utils/opensearch_conn.py: Same updates for OpenSearch compatibility ### Tests - test/testcases/.../test_retrieval_chunks.py: Added \TestDeletedChunksNotRetrievable\ class with regression tests - test/unit/test_delete_query_construction.py: Unit tests for delete query construction ## Testing - Added regression tests that verify deleted chunks are not returned by retrieval API - Tests cover single chunk deletion and batch deletion scenarios	2026-01-15 14:45:55 +08:00
lys1313013	8d3f9d61da	Fix: Delete chunk images on document parser config change. (#12262 ) ### What problem does this PR solve? Modifying a document’s parser config previously left behind obsolete chunk images. If the dataset isn’t manually deleted, these images accumulate and waste storage. This PR fixes the issue by automatically removing associated images when the parser config changes. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-29 12:54:11 +08:00
Lynn	6e9691a419	Feat: message manage (#12196 ) ### What problem does this PR solve? Manage message and use in agent. Issue #4213 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 21:18:13 +08:00
Yongteng Lei	059f375d85	Feat: supports filter documents by empty metadata (#12180 ) ### What problem does this PR solve? Supports filter documents by empty metadata ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 14:06:50 +08:00
Kevin Hu	f0dac1d90e	Fix: loopitem None issue. (#12166 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 12:12:38 +08:00
Yongteng Lei	49dbfdbfb0	Feat: deduplicate metadata lists during updates (#12125 ) ### What problem does this PR solve? Deduplicate metadata lists during updates. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 11:53:16 +08:00
Kevin Hu	8197f9a873	Fix: table tag on chunks. (#12126 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:25:38 +08:00
Yongteng Lei	3ee47e4af7	Feat: document list and filter supports metadata filtering (#12053 ) ### What problem does this PR solve? Document list and filter supports metadata filtering. OR within the same field, AND across different fields Example 1 (multi-field AND): ```markdown Doc1 metadata: { "a": "b", "as": ["a", "b", "c"] } Doc2 metadata: { "a": "x", "as": ["d"] } Query: metadata = { "a": ["b"], "as": ["d"] } Result: Doc1 matches a=b but not as=d → excluded Doc2 matches as=d but not a=b → excluded Final result: empty ``` Example 2 (same field OR): ```markdown Doc1 metadata: { "as": ["a", "b", "c"] } Doc2 metadata: { "as": ["d"] } Query: metadata = { "as": ["a", "d"] } Result: Doc1 matches as=a → included Doc2 matches as=d → included Final result: Doc1 + Doc2 ``` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-22 09:35:11 +08:00
Jin Hai	30019dab9f	Change knowledge base to dataset (#11976 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-17 10:03:33 +08:00
Yongteng Lei	6560388f2b	Fix: correct metadata update behavior (#11919 ) ### What problem does this PR solve? Correct metadata update behavior. #11912 When update `value` is omitted, the corresponding keys are updated to `"value"` regardless of their current values. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-12 12:50:17 +08:00
Yongteng Lei	8370bc61b7	Feat: enhance metadata operation (#11874 ) ### What problem does this PR solve? Add metadata condition in document list. Add metadata bulk update. Add metadata summary. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2025-12-11 09:59:15 +08:00
buua436	65a5a56d95	Refa:replace trio with asyncio (#11831 ) ### What problem does this PR solve? change: replace trio with asyncio ### Type of change - [x] Refactoring	2025-12-09 19:23:14 +08:00
Yongteng Lei	e3f40db963	Refa: make RAGFlow more asynchronous 2 (#11689 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] Refactoring - [x] Performance Improvement	2025-12-03 14:19:53 +08:00
Kevin Hu	a6681d6366	Revert "Refa: make RAGFlow more asynchronous 2" (#11669 ) Reverts infiniflow/ragflow#11664	2025-12-02 19:42:05 +08:00
Yongteng Lei	627c11c429	Refa: make RAGFlow more asynchronous 2 (#11664 ) ### What problem does this PR solve? Make RAGFlow more asynchronous 2. #11551, #11579, #11619. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring - [x] Performance Improvement	2025-12-02 18:57:07 +08:00
Yongteng Lei	9d8b96c1d0	Feat: add context for figure and table (#11547 ) ### What problem does this PR solve? Add context for figure table. ![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e) `==================()` for demonstrating purpose. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 10:21:44 +08:00
Kevin Hu	d1716d865a	Feat: Alter flask to Quart for async API serving. (#11275 ) ### What problem does this PR solve? #11277 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-18 17:05:16 +08:00
Jin Hai	61cf430dbb	Minor tweats (#11271 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-16 19:29:20 +08:00
Jin Hai	296476ab89	Refactor function name (#11210 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-12 19:00:15 +08:00
Yongteng Lei	8ae562504b	Fix: GraphRAG and RAPTOR tasks do not affect document status (#11194 ) ### What problem does this PR solve? GraphRAG and RAPTOR tasks do not affect document status. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-12 12:03:41 +08:00
Kevin Hu	c30ffb5716	Fix: ollama model list issue. (#11175 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-11 19:46:41 +08:00
Kevin Hu	d207291217	Fix: add download stats to kb logs. (#11112 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-10 13:28:07 +08:00
Kevin Hu	3bd1fefe1f	Feat: debug sync data. (#11073 ) ### What problem does this PR solve? #10953 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-06 16:48:04 +08:00
Jin Hai	f98b24c9bf	Move api.settings to common.settings (#11036 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-06 09:36:38 +08:00
Jin Hai	1a9215bc6f	Move some vars to globals (#11017 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-05 14:14:38 +08:00

1 2 3 4

155 Commits