ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-01 05:17:51 +08:00

Author	SHA1	Message	Date
Jack Storment	59bb184e63	feat(moodle): support deleted-file sync (#14548 ) Fixes #14551 ### What problem does this PR solve? The Moodle connector did not let the sync runner clean up indexed documents that were deleted from the source. Other connectors such as dropbox, seafile, webdav, and rss already do this through a slim snapshot pass. This PR adds the same support for Moodle. When `sync_deleted_files` is on, the runner now asks the Moodle connector for a lightweight list of every module id that could be indexed. The runner then compares this list with the index and removes any indexed document whose id is not in the list. The slim pass does not download files. It only goes through courses and modules and yields ids. The id format matches the ids that the loader produces, so the match is exact. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Notes - `MoodleConnector` now also implements `SlimConnectorWithPermSync`. - New `retrieve_all_slim_docs_perm_sync` yields slim docs with the same ids the loader uses (`moodle_resource_<id>`, `moodle_forum_<id>`, `moodle_page_<id>`, `moodle_book_<id>`, `moodle_assign_<id>`, `moodle_quiz_<id>`). - The `Moodle` sync class now returns `(document_generator, file_list)` so the runner can do the cleanup. If the slim snapshot fails, `file_list` is set back to `None` and the run continues without cleanup. - The web data source map exposes `syncDeletedFiles` for Moodle so the option shows up in the UI. ### How was this tested? - `ruff check` passes on the changed Python files. - Manual review of the produced slim ids against the ids the loader builds in `_process_resource`, `_process_forum`, `_process_page`, `_process_book`, and `_process_activity`. - Behavior parity with the merged dropbox (#14476), seafile (#14499), webdav (#14491), and rss (#14493) PRs.	2026-05-07 17:44:46 +08:00
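The cleanup described in the Moodle commit above comes down to a set difference between indexed ids and the ids in the slim snapshot. A minimal sketch, assuming plain string ids; `stale_document_ids` is a hypothetical helper for illustration, not the runner's actual function:

```python
from typing import Iterable, Optional, Set


def stale_document_ids(
    indexed_ids: Iterable[str],
    snapshot_ids: Optional[Iterable[str]],
) -> Set[str]:
    """Return indexed ids that no longer appear in the slim snapshot.

    A snapshot of None stands for "the slim pass failed", in which case
    nothing is reported as stale and cleanup is effectively skipped.
    """
    if snapshot_ids is None:
        return set()
    return set(indexed_ids) - set(snapshot_ids)


# Ids follow the formats listed in the commit message.
indexed = {"moodle_resource_1", "moodle_forum_2", "moodle_page_3"}
snapshot = ["moodle_resource_1", "moodle_page_3"]
stale = stale_document_ids(indexed, snapshot)
skipped = stale_document_ids(indexed, None)
```

Because the match is by exact string, the snapshot has to build ids the same way the loader does, which is the point the commit message makes about id formats.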
Octopus	5c9124c3ef	fix: prepend bucket prefix in Azure Blob (SAS/SPN) to prevent cross-dataset file overwrites (#14174 ) Fixes #14159 ## Problem The `put()`, `get()`, `rm()`, and `obj_exist()` methods in both `azure_spn_conn.py` and `azure_sas_conn.py` ignore the `bucket` parameter entirely, storing all files flat using only the filename. This causes files from different datasets to overwrite each other when they share the same filename. By contrast, the MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix, creating logical folder isolation like `{kb_id}/{filename}`. ## Solution Prepend the `bucket` parameter as a path prefix to all file operations in both Azure storage implementations: - `azure_spn_conn.py`: `create_file`, `delete_file`, `get_file_client` now use `f"{bucket}/{fnm}"` - `azure_sas_conn.py`: `upload_blob`, `delete_blob`, `download_blob`, `get_blob_client` now use `f"{bucket}/{fnm}"` This matches the behavior of all other storage backends (MinIO, S3) and prevents filename collisions across knowledge bases. ## Testing - Verified the fix aligns with how MinIO/S3 connectors handle the bucket parameter - The `health()` method is left unchanged as it uses a fixed test path for connectivity checks only Co-authored-by: octo-patch <octo-patch@github.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:13:43 +08:00
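The collision fixed by the Azure commit above can be reproduced with an in-memory dictionary standing in for the blob container. This is only a sketch: `blob_path`, `put_flat`, and `put_prefixed` are hypothetical names, and only the `f"{bucket}/{fnm}"` key format comes from the commit message.

```python
from typing import Dict


def blob_path(bucket: str, fnm: str) -> str:
    # Key format quoted in the commit message: bucket as a path prefix.
    return f"{bucket}/{fnm}"


def put_flat(store: Dict[str, bytes], bucket: str, fnm: str, data: bytes) -> None:
    # Old behaviour: the bucket argument is ignored.
    store[fnm] = data


def put_prefixed(store: Dict[str, bytes], bucket: str, fnm: str, data: bytes) -> None:
    # Fixed behaviour: each knowledge base gets its own logical folder.
    store[blob_path(bucket, fnm)] = data


flat: Dict[str, bytes] = {}
put_flat(flat, "kb_a", "report.pdf", b"A")
put_flat(flat, "kb_b", "report.pdf", b"B")  # overwrites kb_a's file

prefixed: Dict[str, bytes] = {}
put_prefixed(prefixed, "kb_a", "report.pdf", b"A")
put_prefixed(prefixed, "kb_b", "report.pdf", b"B")  # both files survive
```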
Magicbook1108	911671cef0	Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615 ) ### What problem does this PR solve? Feat: enable sync deleted files for RDBMS & fix remove last file issue ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-05-07 13:31:05 +08:00
Zhichang Yu	86fe78c73f	feat(llm): add MiniMax GroupId header support (#14610 ) ## Summary - Add MiniMax provider GroupId query parameter support in `LiteLLMBase` - Extract `group_id` from key configuration in `__init__` - Append `GroupId` as query parameter to `api_base` in `_construct_complete_args` ## Why this change is needed MiniMax provides an OpenAI-compatible API endpoint (`/v1/chat/completions`), but `GroupId` is a MiniMax-specific account identifier required for billing and rate limiting - it is not part of the OpenAI standard. Looking at LiteLLM's `MinimaxChatConfig`: - `get_complete_url()` only constructs the base URL (e.g., `https://api.minimaxi.com/v1/chat/completions`) - LiteLLM does not automatically inject `GroupId` into requests - This must be handled by the caller (ragflow's chat_model.py) The implementation appends `GroupId` as a query parameter to `api_base`: ```python api_base = completion_args.get("api_base", self.base_url) separator = "&" if "?" in api_base else "?" completion_args["api_base"] = f"{api_base}{separator}GroupId={self.group_id}" ``` This matches MiniMax's official API format (as documented by LlamaFactory): ```bash curl --location 'https://api.minimaxi.chat/v1/text/chatcompletion?GroupId=<your GroupId>' \ --header 'Authorization: Bearer <your API_Key>' ``` ## Test plan - [ ] Verify MiniMax API calls work with GroupId query parameter - [ ] Verify backward compatibility for other providers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 11:54:49 +08:00
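The separator logic quoted in the MiniMax commit above can be exercised on its own. The sketch below wraps it in a standalone function; `with_group_id` is a hypothetical name, and the group id value is made up for illustration.

```python
def with_group_id(api_base: str, group_id: str) -> str:
    """Append GroupId as a query parameter, as in the quoted snippet.

    Uses "&" when the URL already carries a query string, "?" otherwise.
    """
    separator = "&" if "?" in api_base else "?"
    return f"{api_base}{separator}GroupId={group_id}"


plain = with_group_id("https://api.minimaxi.com/v1", "12345")
existing_query = with_group_id("https://api.minimaxi.com/v1?foo=bar", "12345")
```

Like the snippet in the commit message, this does not URL-encode `group_id`; a caller with unusual characters in the id would need to encode it first.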
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. 
When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false\|0\|no\|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done\|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. 
Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. 
## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
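The bounded insert pipeline from section 5 of the GraphRAG commit above can be sketched with `asyncio.Semaphore`. This is an illustration under assumptions, not the PR's `insert_chunks_bounded()`: the function name `insert_bounded`, the `insert_batch` callback, and the fake doc store are hypothetical, and retry and timeout handling are left out. Only the defaults (64 docs per batch, 4 batches in flight) come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def insert_bounded(
    chunks: Sequence[int],
    insert_batch: Callable[[Sequence[int]], Awaitable[None]],
    bulk_size: int = 64,
    concurrency: int = 4,
) -> int:
    """Insert chunks in batches, with at most `concurrency` batches in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(batch: Sequence[int]) -> None:
        async with sem:  # blocks once `concurrency` batches are running
            await insert_batch(batch)

    batches: List[Sequence[int]] = [
        chunks[i : i + bulk_size] for i in range(0, len(chunks), bulk_size)
    ]
    await asyncio.gather(*(worker(b) for b in batches))
    return len(batches)


async def _demo():
    state = {"in_flight": 0, "peak": 0, "inserted": 0}

    async def fake_insert(batch: Sequence[int]) -> None:
        # Stand-in for a doc-store bulk insert; records peak concurrency.
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0)
        state["inserted"] += len(batch)
        state["in_flight"] -= 1

    n = await insert_bounded(list(range(1077)), fake_insert)
    return n, state


n_batches, demo_state = asyncio.run(_demo())
```

With these defaults the number of documents in flight is capped at `bulk_size * concurrency`, which matches the 256 figure given in the commit message.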
Idriss Sbaaoui	38f6484e98	Fix OpenDataLoader naive parsing by normalizing `@OpenDataLoader` and filtering unsupported parser kwargs (#14581 ) ### What problem does this PR solve? This PR fixes a bug where `layout_recognize="<name>@OpenDataLoader"` was misrouted and then failed during parsing in the naive parser path. It now routes correctly to OpenDataLoader and avoids passing unsupported arguments that caused runtime errors. fixes #14572 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-06 15:00:55 +08:00
euvre	8269fa01b4	Fix AttributeError when appending non-streaming tool calls to chat history in Agentic Agent (#14456 ) ### What problem does this PR solve? Fix #14340 ## Problem Description When using an Agentic Agent (not Workflow) with one or more Retrieval tools (e.g., Dataset Retrieval + Memory Retrieval), the agent silently returns an empty response (`agent_response: ""`) after hanging for several minutes. The server logs show: ``` AttributeError: 'ChatCompletionMessageToolCall' object has no attribute 'index' ``` This error propagates as a `GENERIC_ERROR`, causing the canvas to return an empty response. The subsequent Memory save task then receives the empty `agent_response` and logs: ``` Document for referred_document_id XXXX not found ``` ## Reproduction Steps 1. Set `DOC_ENGINE=infinity` (or `elasticsearch` — the engine itself is not the root cause). 2. Create a blank Agentic Agent (not a Workflow). 3. Add two Retrieval tools to the Agent node: - `Retrieval_DS` → Dataset (Knowledge Base) - `Retrieval_Mem` → Memory component 4. Add a Message node with Save to Memory enabled. 5. Launch the agent and send any message (e.g., "hola"). 6. The agent hangs and returns an empty response. ## Root Cause Analysis The crash occurs in `_append_history` and `_append_history_batch` inside `rag/llm/chat_model.py`. These methods directly access `.index` on tool call objects: ```python # _append_history_batch { "index": tc.index, # <-- crashes here ... } ``` However, non-streaming LLM responses (`stream=False`) return `ChatCompletionMessageToolCall` objects, which do not have an `index` field according to the OpenAI API specification. The `index` field only exists on `ChoiceDeltaToolCall` objects returned in streaming responses (`stream=True`). 
When the agentic agent triggers an internal `full_question` call (used to compress multi-turn conversation history), the request is incorrectly routed through `async_chat_with_tools` because `is_tools=True` is set at the `LLMBundle` level. If the LLM decides to emit `tool_calls` during this auxiliary request, the code enters the non-streaming tool loop and crashes when trying to append history. ## Fix Replaced all direct `.index` accesses with `getattr(..., "index", None)` for safe, backward-compatible access: \| Method \| File \| Line \| Change \| \|--------\|------\|------\|--------\| \| `_append_history` \| `rag/llm/chat_model.py` \| ~L304 \| `tool_call.index` → `getattr(tool_call, "index", None)` \| \| `_append_history_batch` \| `rag/llm/chat_model.py` \| ~L332 \| `tc.index` → `getattr(tc, "index", None)` \| \| `_append_history` \| `rag/llm/chat_model.py` \| ~L1467 \| `tool_call.index` → `getattr(tool_call, "index", None)` \| \| `_append_history_batch` \| `rag/llm/chat_model.py` \| ~L1496 \| `tc.index` → `getattr(tc, "index", None)` \| ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-06 14:39:40 +08:00
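The `getattr` fix in the commit above can be shown without the OpenAI SDK by using simple stand-in objects. In this sketch `SimpleNamespace` instances play the role of `ChoiceDeltaToolCall` (has `index`) and `ChatCompletionMessageToolCall` (no `index`); `tool_call_to_dict` is a hypothetical helper, not the code in `rag/llm/chat_model.py`.

```python
from types import SimpleNamespace


def tool_call_to_dict(tc) -> dict:
    """Serialize a tool call for chat history, tolerating a missing `index`."""
    return {
        # Streaming deltas carry `index`; non-streaming tool calls do not.
        "index": getattr(tc, "index", None),
        "id": tc.id,
        "type": "function",
        "function": {
            "name": tc.function.name,
            "arguments": tc.function.arguments,
        },
    }


streaming_call = SimpleNamespace(
    index=0,
    id="call_1",
    function=SimpleNamespace(name="retrieval", arguments="{}"),
)
non_streaming_call = SimpleNamespace(
    id="call_2",
    function=SimpleNamespace(name="retrieval", arguments="{}"),
)

streamed = tool_call_to_dict(streaming_call)
non_streamed = tool_call_to_dict(non_streaming_call)
```

Accessing `non_streaming_call.index` directly would raise `AttributeError`, which is the failure mode the commit describes.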
buua436	5672be0652	Feat: add IMAP deleted document sync (#14539 ) ### What problem does this PR solve? add IMAP deleted document sync ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-06 14:06:46 +08:00
NeedmeFordev	89961962c0	feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525 ) ### What problem does this PR solve? Incremental DingTalk AI Table (Notable) sync did not reconcile rows removed on the remote side with documents already in the knowledge base. This follows the coordinated datasource work in #14362 (“sync deleted files”). This PR adds a full slim snapshot (`retrieve_all_slim_docs_perm_sync`) that lists current record IDs for all sheets without building document blobs, using the same logical document IDs as full ingest (`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`). When `sync_deleted_files` is enabled on incremental runs, `DingTalkAITable._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove KB rows that no longer exist remotely. Design notes: - `_document_id` centralizes the ID string so slim snapshots and `_convert_record_to_document` stay aligned with `hash128(doc.id)` semantics used during ingestion/cleanup. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with other Dropbox-style connectors). - `batch_size` from connector config is coerced to a positive `int` before constructing the connector. - Slim snapshot failures are caught in `_generate`; `file_list` is set to `None` so cleanup is skipped rather than running on partial/error state. 
### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Files changed (summary) \| Area \| Change \| \|------\|--------\| \| `common/data_source/dingtalk_ai_table_connector.py` \| `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `_document_id` shared with document conversion \| \| `rag/svr/sync_data_source.py` \| `DingTalkAITable._generate`: slim snapshot + tuple return; `batch_size` validation; shared `end_ts` with `poll_source` \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `syncDeletedFiles` for DingTalk AI Table in `DataSourceFeatureVisibilityMap` \| Closes / relates to: #14362	2026-05-06 14:06:23 +08:00
Attili-sys	24af0875e5	Feat/configurable metadata display (#13464 ) ### What problem does this PR solve? Currently, RAGFlow's Search and Chat interfaces display only raw vectorized text chunks during retrieval, without contextual information about their source documents. Users cannot see document titles, page numbers, upload dates, or custom metadata fields that would help them understand and trust the retrieved results. This PR introduces an optional metadata display feature that enriches retrieved chunks with document-level metadata in both the Search tab and Chatbot interface. Key improvements: - Search results: Display document metadata as styled badges beneath chunk snippets - Chat citations: Show metadata in citation popovers and reference lists for better source context - LLM context: Metadata is injected into the LLM prompt to enable more accurate, citation-aware responses - External API support: Applications using RAGFlow's SDK retrieval endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in via request parameters - User control: Multi-select dropdown UI allows users to choose which metadata fields to display Implementation approach: - ✅ Reuses existing `DocMetadataService` infrastructure (no new database tables or indices) - ✅ Settings stored in existing JSON configuration fields (`search_config.reference_metadata`, `prompt_config.reference_metadata`) - ✅ No database migrations required - ✅ Disabled by default (fully opt-in and backward-compatible) - ✅ Dynamic metadata field selection populated from actual document metadata keys - ✅ Fixed critical bug where Python's builtin `set()` was shadowed by a route handler function Modified endpoints (all backward-compatible): - `POST /v1/retrieval` (Public SDK) - `POST /v1/searchbots/retrieval_test` (Searchbots) - `POST /v1/chunk/retrieval_test` (UI/Internal) - Chat completions endpoints (via `extra_body.reference_metadata` or `prompt_config`) ### Type of change - [x] New Feature (non-breaking 
change which adds functionality) ### Images <img width="879" height="1275" alt="image" src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb" /> <img width="1532" height="362" alt="image" src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e" /> <img width="2586" height="1320" alt="image" src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776" /> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-04-30 23:13:27 +08:00
Magicbook1108	5fd4579a2f	Fix: sync data source empty list (#14530 ) ### What problem does this PR solve? Fix: sync data source empty list ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 18:56:43 +08:00
bitloi	a69e0c73c7	feat(rss): support deleted-file sync (#14493 ) ### What problem does this PR solve? Partially addresses #14362. This PR enables syncing deleted files for RSS data sources. Previously, RSS incremental sync only returned feed entries whose timestamps were inside the poll window. If an entry was removed from the RSS feed, RAGFlow had no full current RSS snapshot to pass into the shared stale-document cleanup path, so the deleted remote entry could remain in the knowledge base. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector` - reuses the same `rss:<md5(stable_key)>` document ID derivation used by normal RSS ingest - returns `(document_generator, file_list)` for incremental RSS sync when `sync_deleted_files` is enabled - captures the poll end timestamp before snapshot/poll so cleanup does not race against the same sync window - adds start/end logs around RSS slim snapshot collection - exposes the deleted-file sync toggle for RSS in the data source UI Per maintainer request on related datasource PRs, this PR contains no test-case changes. Local verification was run with an external script. Validation: - `uv run ruff check common/data_source/rss_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` - `uv run python /tmp/verify_rss_deleted_sync.py --repo /root/74/ragflow` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 18:56:13 +08:00
NeedmeFordev	bedf9592ef	feat(webdav): support deleted-file sync via slim snapshot (#14491 ) ## What problem does this PR solve? Incremental WebDAV sync only ingested files whose modification time fell inside the poll window; documents removed on the WebDAV server were never removed from the knowledge base. This aligns with [#14362](https://github.com/infiniflow/ragflow/issues/14362) (coordinated datasource “sync deleted files” work). This PR adds a full-tree slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote paths without downloading file contents, using the same logical document IDs as full ingest (`webdav:{base_url}:{file_path}`). When `sync_deleted_files` is enabled on incremental runs, sync returns `(document_generator, file_list)` so `SyncBase` runs `cleanup_stale_documents_for_task` and removes KB rows no longer present remotely. Design notes: - `_list_files_recursive` gains `filter_by_mtime`: snapshot passes `filter_by_mtime=False` (full tree under `remote_path`); `poll_source` keeps mtime-window filtering as before. - Slim snapshot applies the same extension and `size_threshold` rules as `_yield_webdav_documents` so retain IDs match what would be indexed. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with Dropbox-style connectors). ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Files changed \| Area \| Change \| \|------\|--------\| \| `common/data_source/webdav_connector.py` \| `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `filter_by_mtime` on `_list_files_recursive` \| \| `rag/svr/sync_data_source.py` \| WebDAV `_generate`: `file_list` + tuple return; pass `batch_size` from connector config \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` \|	2026-04-30 17:26:27 +08:00
Wang Qi	f45ce00347	Not allow to sort by id (#14526 ) ### What problem does this PR solve? The `id` field is of type "text", not "keyword", so ordering by it causes an error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 14:52:43 +08:00
bitloi	17eda04b8d	feat(zendesk): support deleted-file sync (#14487 ) ### What problem does this PR solve? Refs #14362. This PR enables syncing deleted files for Zendesk data sources. Previously, Zendesk incremental sync never returned a slim remote snapshot to the shared stale-document cleanup path, so deleted remote Zendesk records could remain in RAGFlow. The existing Zendesk slim snapshot also included records that ingestion intentionally skips, such as draft articles, articles without bodies, skipped-label articles, empty-body articles, and tickets with `status == "deleted"`. This PR: - exposes the deleted-file sync option for Zendesk in the data source UI - returns Zendesk slim snapshots during incremental sync when `sync_deleted_files` is enabled - reuses Zendesk indexability rules so cleanup compares against the same records ingestion can materialize - adds start/end logs around Zendesk slim snapshot collection for operational visibility Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/zendesk_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-04-30 14:44:05 +08:00
bitloi	8f75e52bbf	feat(asana): support deleted-file sync (#14468 ) ### What problem does this PR solve? Partially addresses #14362. Adds deleted-file sync support for the Asana data source. Asana already indexes task attachments as documents, but it did not provide the slim document snapshot required by stale-document reconciliation, and the sync wrapper never returned a `file_list` for cleanup. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector` - builds slim IDs with the same `asana:{task_id}:{attachment_gid}` format used by indexed documents - avoids downloading attachment blobs during the snapshot - aborts the snapshot if Asana API errors occur, preventing partial snapshots from deleting valid local docs - captures the incremental poll end time before snapshotting and makes `poll_source()` respect that boundary - exposes the deleted-file sync toggle for Asana in the data source UI Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/asana_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` ### Type of change - [x] New Feature	2026-04-30 14:41:36 +08:00
orbisai0security	e992fe39b2	fix: the oceanbase database connector constructs sql... in ob_conn.py (#14470 ) ## Summary Fix critical severity security issue in `rag/utils/ob_conn.py`. ## Vulnerability \| Field \| Value \| \|-------\|-------\| \| ID \| V-003 \| \| Severity \| CRITICAL \| \| Scanner \| multi_agent_ai \| \| Rule \| `V-003` \| \| File \| `rag/utils/ob_conn.py:691` \| Description: The OceanBase database connector constructs SQL WHERE clauses by directly embedding user-controlled filter expressions using Python f-strings at lines 726, 777, 781, 787, 793, 821, and 827. No parameterization or allowlist validation is applied before the expressions are incorporated into live SQL queries. This is the most critical vulnerability in the codebase because it directly exposes the RAG knowledge base — the platform's core business asset — to complete compromise. ## Changes - `rag/utils/ob_conn.py` ## Verification - [x] Build passes - [x] Scanner re-scan confirms fix - [x] LLM code review passed --- Automated security fix by [OrbisAI Security](https://orbisappsec.com)	2026-04-30 14:25:17 +08:00
Magicbook1108	bb3b99f0a5	Feat: add button for remove header & footer in pipeline (#14486 ) ### What problem does this PR solve? Feat: add button for remove header & footer in pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 12:30:41 +08:00
NeedmeFordev	2932b65da6	feat(seafile): support deleted-file sync via slim snapshot (#14499 ) ### What problem does this PR solve? Incremental Seafile sync only ingests files whose modification time falls in the poll window; documents removed in Seafile were never removed from the knowledge base. This contributes to [#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource “sync deleted files” coordination). This PR adds a slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote file IDs without downloading content, using the same logical IDs as full ingest (`seafile:{repo_id}:{file_id}`). When `sync_deleted_files` is enabled on incremental runs, `SeaFile._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove stale KB documents. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### What changed - `common/data_source/seafile_connector.py`: `SeaFileConnector` implements `SlimConnectorWithPermSync`; `_list_files_recursive(..., filter_by_mtime=...)` supports full-tree listing for snapshots; `retrieve_all_slim_docs_perm_sync()` reuses the same library/root scan as ingest and applies the same size ceiling; logging for snapshot start/end and counts. - `rag/svr/sync_data_source.py`: `SeaFile._generate` validates `batch_size`, captures `end_ts` before snapshot + `poll_source`, wraps slim retrieval in `try`/`except` ( `file_list = None` on failure so ingest continues), returns `(generator, file_list)`. - `web/src/pages/user-setting/data-source/constant/index.tsx`: `syncDeletedFiles` for Seafile in `DataSourceFeatureVisibilityMap`.	2026-04-30 12:05:12 +08:00
Idriss Sbaaoui	9075872435	Fix: Manual/Naive outline tuple unpack crash (#14518 ) ### What problem does this PR solve? This fixes a crash in Manual and Naive parsing when PDF outlines include page numbers as a third tuple value. It makes outline unpacking accept extra values so parsing no longer fails. fixes #14411 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:55:02 +08:00
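One way to make outline unpacking accept extra values, as the commit above describes, is Python's starred unpacking. This is a sketch under assumptions: the `(title, level, page)` tuple layout and the `outline_entries` helper are hypothetical, and the actual parser code may be structured differently.

```python
from typing import Iterable, List, Tuple


def outline_entries(outlines: Iterable[tuple]) -> List[Tuple[str, int]]:
    """Keep (title, level) from each outline item, ignoring trailing values."""
    entries = []
    for title, level, *_rest in outlines:
        # `*_rest` absorbs an optional page number (or anything else after
        # the first two fields), so 2-tuples and 3-tuples both unpack.
        entries.append((title, level))
    return entries


mixed = [("Introduction", 1), ("Background", 2, 7), ("Methods", 1, 12)]
entries = outline_entries(mixed)
```

A strict `for title, level in outlines` would raise `ValueError: too many values to unpack` on the three-element items.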
sapienza yoan	811e9826d0	perf: avoid O(n²) array growth in embedding accumulation (#14369 ) ### What problem does this PR solve? Both tokenizer (`rag/flow/tokenizer/tokenizer.py`) and `BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate embedding batches via `np.concatenate` inside the per-batch loop. `np.concatenate` allocates a new array and copies all existing data on every call, so accumulating N batches is O(N²) in both time and peak memory. Replacing the incremental concatenate with a list-of-batches + a single `np.vstack` at the end gives O(N) total work. For tokenizer the title-vector broadcast `np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does the same job with a single contiguous allocation instead of building a Python list of references. This is purely a CPU/memory optimisation — output shape and dtype are unchanged. Measured impact grows with document size: - 1k chunks (batch 512, 2 iters): ~negligible - 10k chunks (20 iters): ~10× speedup on this stage - 100k chunks (195 iters): ~100× speedup, and peak RAM drops from O(N) extra to near-zero ### Type of change - [x] Performance Improvement Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>	2026-04-30 11:00:10 +08:00
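The accumulation change in the commit above can be compared side by side on small arrays. This sketch assumes NumPy is installed and that each batch and the title vector are 2-D arrays of shape `(n, dim)`; `fake_encode` is a hypothetical stand-in for the embedding call.

```python
import numpy as np


def fake_encode(batch_size: int, dim: int = 4) -> np.ndarray:
    # Hypothetical embedding call: one row per input text.
    return np.full((batch_size, dim), float(batch_size), dtype=np.float32)


batch_sizes = [3, 3, 2]

# Before: concatenate inside the loop, re-copying everything accumulated so far.
incremental = np.empty((0, 4), dtype=np.float32)
for n in batch_sizes:
    incremental = np.concatenate((incremental, fake_encode(n)), axis=0)

# After: collect the batches in a list and stack them once at the end.
parts = [fake_encode(n) for n in batch_sizes]
stacked = np.vstack(parts)

# Title-vector broadcast: repeat one (1, dim) row for every chunk.
title_vec = fake_encode(1)
repeated = np.concatenate([title_vec] * 5)  # list of references, then copy
tiled = np.tile(title_vec, (5, 1))          # single allocation
```

Both pairs produce identical arrays; only the number of intermediate allocations and copies differs.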
FuturMix	2548c28d65	feat: add FuturMix as model provider (#14419 ) ## Summary Add [FuturMix](https://futurmix.ai) as a new model provider. FuturMix is an OpenAI-compatible unified AI gateway that provides access to 22+ models (GPT, Claude, Gemini, DeepSeek, and more) through a single API endpoint and key. - API Base: `https://futurmix.ai/v1` (OpenAI-compatible) - Supported capabilities: Chat, Embedding, Image2Text, TTS, Speech2Text, Rerank ### Changes \| File \| Change \| \|------\|--------\| \| `rag/llm/__init__.py` \| Add `FuturMix` to `SupportedLiteLLMProvider` enum, `FACTORY_DEFAULT_BASE_URL`, and `LITELLM_PROVIDER_PREFIX` \| \| `rag/llm/chat_model.py` \| Add `FuturMixChat(Base)` — follows Astraflow/Avian pattern \| \| `rag/llm/embedding_model.py` \| Add `FuturMixEmbed(OpenAIEmbed)` — follows Astraflow pattern \| \| `rag/llm/cv_model.py` \| Add `FuturMixCV(GptV4)` — follows SILICONFLOW/OpenRouter pattern \| \| `rag/llm/tts_model.py` \| Add `FuturMixTTS(OpenAITTS)` — follows CometAPI/DeerAPI pattern \| \| `rag/llm/sequence2txt_model.py` \| Add `FuturMixSeq2txt(GPTSeq2txt)` — follows StepFun pattern \| \| `rag/llm/rerank_model.py` \| Add `FuturMixRerank(OpenAI_APIRerank)` \| \| `conf/llm_factories.json` \| Add factory config with 8 chat, 2 embedding, 1 image2text, 2 TTS, 1 speech2text models \| \| `docs/guides/models/supported_models.mdx` \| Add FuturMix to supported models table \| ### Models included - Chat: claude-sonnet-4-20250514, claude-3.5-haiku, gpt-4o, gpt-4o-mini, gemini-2.5-flash, gemini-2.0-flash, deepseek-chat, deepseek-reasoner - Embedding: text-embedding-3-small, text-embedding-3-large - Image2Text: gpt-4o - TTS: tts-1, tts-1-hd - Speech2Text: whisper-1 ## Test plan - [ ] Verify FuturMix appears in the model provider list in RAGFlow UI - [ ] Configure FuturMix with API key and test chat completion - [ ] Test embedding model with document indexing - [ ] Test image2text with a sample image 🤖 Generated with [Claude 
Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-30 10:59:37 +08:00
Magicbook1108	de8c6ad0f3	Feat: enable sync deleted file for Discord (#14451 ) ### What problem does this PR solve? Feat: enable sync deleted file for Discord ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:40 +08:00
bitloi	2bc8c6d35e	feat(dropbox): support deleted-file sync (#14476 ) ### What problem does this PR solve? Partially addresses #14362 by adding deleted-file sync support for the Dropbox data source. Dropbox previously did not provide the slim current-file snapshot required by stale document reconciliation, and its sync runner returned only document batches. As a result, enabling deleted-file sync could not remove local documents that had been deleted from Dropbox. This PR: - Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`. - Reuses Dropbox metadata traversal to collect current remote file IDs without downloading file contents. - Wires incremental Dropbox sync to return `(document_generator, file_list)` when `sync_deleted_files` is enabled. - Enables the deleted-file sync toggle for Dropbox in the data source settings UI. - Adds regression coverage for slim snapshots, nested folders, paginated listings, duplicate filenames, and full reindex behavior. Tests: - `uv run pytest test/unit_test/common/test_dropbox_connector.py -q` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `uv run pytest test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py -q` - `uv run ruff check common/data_source/dropbox_connector.py rag/svr/sync_data_source.py test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:11 +08:00
Magicbook1108	db1a73b255	Feat: enable sync deleted files in gitlab (#14481 ) ### What problem does this PR solve? Feat: enable sync deleted files in gitlab ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:04:10 +08:00
Magicbook1108	e0b3070012	Feat: enable sync deleted files for Gmail && fix google drive issues (#14462 ) ### What problem does this PR solve? Feat: enable sync deleted files for Gmail && fix google drive issues ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: bill <yibie_jingnian@163.com> Co-authored-by: balibabu <assassin_cike@163.com>	2026-04-29 17:03:56 +08:00
buua436	c08ced09a7	Fix: add retrieval fallback comments (#14457 ) ### What problem does this PR solve? add retrieval fallback comments ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 14:44:31 +08:00
buua436	a7ce1b1677	Fix: prune deleted doc chunks from retrieval (#14454 ) ### What problem does this PR solve? prune deleted doc chunks from retrieval ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 13:03:09 +08:00
Magicbook1108	3b7a6eaa6c	Feat: sync deleted files in Bitbucket (#14450 ) ### What problem does this PR solve? Feat: sync deleted files in Bitbucket ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 11:29:17 +08:00
Paras Sondhi	74fa54f122	feat(google-drive): optimize memory payload and enable sync deletion (#14372 ) Addresses the Google Drive integration for #14362 This PR completely overhauls the Google Drive sync logic to accurately detect remote deletions, while drastically reducing the memory footprint during the snapshot phase. ### What changed under the hood: * Killed the memory bloat: Swapped out the massive document dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc = namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during `retrieve_all_slim_docs_perm_sync` on massive enterprise drives. * Flawless downstream integration: The `SlimDoc` object relies on simple duck typing. It perfectly delivers the `.id` attribute required by `ConnectorService.cleanup_stale_documents_for_task`, meaning your core `hash128` vector cleanup logic runs natively without modification. * Fixed the Shared Drive blindspot: The standard API query was missing team folders. Injected the `corpora="allDrives"` and `includeItemsFromAllDrives=True` override flags so the connector now accurately maps state across both personal workspaces and organizational Shared Drives. ### Testing: Isolated the Google API retrieval logic locally to prove the `SlimDoc` mapping works and correctly registers state drops when a file is trashed remotely. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Performance Improvement	2026-04-29 10:04:36 +08:00
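A rough sketch of the lightweight snapshot record from the Google Drive commit above. The `SlimDoc` definition is quoted from the commit message; `iter_slim_docs`, `collect_ids`, and the sample file records are hypothetical and stand in for the real Drive listing and cleanup code.

```python
from collections import namedtuple

# Definition quoted in the commit message: a record that carries only the ID.
SlimDoc = namedtuple("SlimDoc", ["id"])


def iter_slim_docs(file_records):
    """Yield one SlimDoc per remote file record, dropping every other field."""
    for record in file_records:
        yield SlimDoc(id=record["id"])


def collect_ids(docs):
    """Consumer that relies only on the `.id` attribute (duck typing)."""
    return {doc.id for doc in docs}


# Hypothetical listing results; real records would come from the Drive API.
records = [
    {"id": "file-1", "name": "report.pdf", "size": 1024},
    {"id": "file-2", "name": "notes.txt", "size": 12},
]
ids = collect_ids(iter_slim_docs(records))
```

Any object exposing `.id` would satisfy the consumer, which is why the namedtuple can replace a heavier document object without changes downstream.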
Stephen Hu	345bec812d	refactor: improve QwenRerank logic (#14388 ) ### What problem does this PR solve? improve QwenRerank logic ### Type of change - [x] Refactoring	2026-04-28 20:17:34 +08:00
Magicbook1108	0d18b293f5	Fix: enable sync deleted file in airtable (#14438 ) ### What problem does this PR solve? Fix: enable sync deleted file in airtable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 20:09:08 +08:00
buua436	e6e80041f5	Fix: agent toolcall null response & schema validation & DeepSeek think history (#14425 ) ### What problem does this PR solve? agent toolcall null response & schema validation & DeepSeek think history ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 17:09:08 +08:00
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync deleted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
Idriss Sbaaoui	2a37562791	Fix manual naive parser position extraction fallback (#14420 ) ### What problem does this PR solve? This PR fixes a regression where Manual pipeline + Naive (Plain Text) PDF parsing crashed with `AttributeError: 'PlainParser' object has no attribute 'extract_positions'` in `rag/app/manual.py`. fixes #14411 ### Type of change: - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 14:21:30 +08:00
Jack	872ff08304	Fix: add executor.shutdown (#14403 ) ### What problem does this PR solve? Add executor shutdown in finally clause to free resources. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 22:38:43 +08:00
Idriss Sbaaoui	4303be223f	Fix metadata parsing regression for upgraded v0.24 datasets (#14383 ) ### What problem does this PR solve? This PR fixes issue #14371 where file parsing failed after upgrading from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema object but was handled like a list and later caused `KeyError: 'properties'`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 16:18:06 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. 
Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
yuch85	0d87cecae2	feat: persist PDF bookmark outline as document metadata (#13287 ) ## Summary PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (`manual` parser uses them for hierarchy detection) and then discarded. This PR persists the outline as `doc.meta_fields["outline"]` — a JSON array of `{"title": str, "depth": int}` objects — so downstream features can use the structural information. ### Why this matters - Complementary to `toc_extraction` — the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure. - Document navigation — frontends can render a clickable TOC from the outline - Entity extraction — the outline provides a structural map for identifying document sections and key topics - Search result context — knowing which section a chunk belongs to helps users evaluate relevance ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/app/naive.py` \| Attach `pdf_parser.outlines` as `__outline__` on first chunk dict \| ~7 \| \| `rag/app/manual.py` \| Same for the manual parser \| ~5 \| \| `rag/svr/task_executor.py` \| Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` \| ~12 \| ### Design decisions - Transient key pattern: The outline is passed from parser → task_executor via `__outline__` on the first chunk dict, then removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata. - No schema changes: Uses the existing `meta_fields` JSON column on the document table. - Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. 
If persistence fails, it logs a warning and continues — parsing is not interrupted. ### Backward compatibility - Fully backward compatible — no existing fields, behavior, or schemas changed - PDFs without outlines are unaffected - Existing `meta_fields` data is preserved (merged, not overwritten) ## Test plan - [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document), verify `meta_fields["outline"]` is populated - [ ] Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields - [ ] Verify existing `meta_fields` data is preserved (not overwritten) when outline is added - [ ] Verify `manual` parser also persists outlines - [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth": 0}, ...]` Related: #9921 (Deterministic Document Access Layer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 11:57:06 +08:00
euvre	f3b7d55a1e	fix: handle Infinity table-not-exist error (3022) in update() methods (#14153 ) ### What problem does this PR solve? ## Summary Closes #6102 When using Infinity as the document store engine (GPU version), calling `update()` on a non-existent table throws an unhandled `InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This causes users to see a raw "3022" error when clicking on a parsed document. ## Root Cause The `update()` methods in both `rag/utils/infinity_conn.py` and `memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)` without catching `InfinityException`. In contrast, other CRUD methods (`insert`, `delete`, `search`) all handle this exception gracefully: \| Method \| Handles table-not-exist? \| Behavior \| \|----------\|--------------------------\|----------\| \| `insert` \| ✅ Yes \| Auto-creates the table \| \| `search` \| ✅ Yes \| Skips the table \| \| `delete` \| ✅ Yes \| Returns 0 \| \| `update` \| ❌ No \| Crashes with 3022 \| Additionally, `api/apps/document_app.py` worked around this with a fragile string match (`"3022" in msg`) to detect the error. ## Changes - `rag/utils/infinity_conn.py`: Catch `InfinityException` in `update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return `False` — consistent with `delete()`. - `memory/utils/infinity_conn.py`: Apply the same fix to its `update()` method. - `api/apps/document_app.py`: Remove the fragile `"3022"` string-matching workaround. Table-not-exist is now handled by the `if not ok` path with an improved error message. ### Type of change - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 11:52:22 +08:00
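The error-handling pattern described in the Infinity commit above might look roughly like the following. This is a sketch with stand-in classes: `FakeInfinityException` and `FakeDatabase` are hypothetical and do not reproduce the Infinity SDK; only the error code 3022, the `get_table` call, and the "log a warning and return `False`" behaviour are taken from the commit message.

```python
import logging

TABLE_NOT_EXIST = 3022  # error code quoted in the commit message


class FakeInfinityException(Exception):
    """Hypothetical stand-in for the SDK exception; carries an error code."""

    def __init__(self, error_code, error_message):
        super().__init__(error_message)
        self.error_code = error_code


class FakeDatabase:
    """Hypothetical in-memory database: table name -> {row_id: row dict}."""

    def __init__(self, tables):
        self._tables = tables

    def get_table(self, name):
        if name not in self._tables:
            raise FakeInfinityException(TABLE_NOT_EXIST, f"table {name} not found")
        return self._tables[name]


def update(db, table_name, row_id, new_values) -> bool:
    """Update a row; return False instead of raising when the table is missing."""
    try:
        table = db.get_table(table_name)
    except FakeInfinityException as exc:
        if exc.error_code == TABLE_NOT_EXIST:
            logging.warning("update skipped, table %s does not exist", table_name)
            return False
        raise  # any other error is still surfaced to the caller
    table[row_id] = {**table.get(row_id, {}), **new_values}
    return True


db = FakeDatabase({"chunks": {"1": {"text": "old"}}})
ok = update(db, "chunks", "1", {"text": "new"})
missing = update(db, "no_such_table", "1", {"text": "new"})
```

Returning a boolean lets the caller take its ordinary `if not ok` path rather than matching on the string "3022" in an error message.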
yuch85	3ad3241ae0	feat: persist RAPTOR layer metadata on summary chunks (#13286 ) ## Summary RAPTOR's recursive clustering builds a `layers` list tracking `(start_idx, end_idx)` boundaries per level, but currently discards this information — only the flat `chunks` list is returned. This makes it impossible to distinguish leaf-level summaries from top-level ones. This PR: - Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__` - Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first summary level, 2 = summary-of-summaries, etc.) - Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch handles it via existing `_int` dynamic template) ### Why this matters Downstream features need to know which RAPTOR layer a summary belongs to: - Retrieving the top-level document summary for entity extraction, search snippets, or document comparison - Filtering by abstraction level — users may want only high-level summaries or only leaf-level cluster summaries - RAPTOR recall quality — #10951 reports summaries not being recalled for definition queries; layer metadata enables targeted retrieval ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/raptor.py` \| Return `(chunks, layers)` tuple \| ~3 \| \| `rag/svr/task_executor.py` \| Build `chunk_layer` mapping, set `raptor_layer_int` \| ~12 \| \| `conf/infinity_mapping.json` \| Add `raptor_layer_int` integer field \| ~1 \| ### Backward compatibility - Additive only — no existing fields or behavior changed - Existing RAPTOR chunks continue to work (they'll have `raptor_layer_int = 0` by default) - New RAPTOR chunks get layer metadata automatically ## Test plan - [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is set on indexed chunks - [ ] Verify `raptor_layer_int` values increase with abstraction level (layer 1 < layer 2 < ...) 
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still works - [ ] Verify Infinity backend accepts the new field Fixes #7488 Related: #4104, #11191, #10951 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 10:20:46 +08:00
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. 
--- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
Lynn	e22cf333ed	Fix: allow search id or _id (#14356 ) ### What problem does this PR solve? Allow search id or _id when using es as doc_engine. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 21:38:19 +08:00
Magicbook1108	25089600d0	Feat: introduce minimum type check for pipeline (#14354 ) ### What problem does this PR solve? Feat: introduce minimum type check for pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-24 21:12:50 +08:00
Idriss Sbaaoui	ca01c7a745	Fix blob sync: skip unsupported files before download (#14357 ) ### What problem does this PR solve? Blob storage sync was downloading unsupported files first and rejecting them later, which wasted bandwidth and made sync slower. This PR skips unsupported extensions before download and applies `allow_images` in blob sync. fixes #14338 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 19:22:32 +08:00
qinling0210	1473000135	Implement retrieval_test in GO (#14231 ) ### What problem does this PR solve? Implement retrieval_test in GO ### Type of change - [x] Refactoring	2026-04-24 15:30:14 +08:00
newyangyang	d84438fd53	fix azure blob put method param (#14329 ) ### What problem does this PR solve? When Azure Blob is used as the file container, clicking "parse file" calls: ```python partial(settings.STORAGE_IMPL.put, tenant_id=task["tenant_id"]) ``` So any storage backend used there must accept `tenant_id` as a keyword argument. `RAGFlowAzureSasBlob.put()` did not, causing: ``` TypeError: ... got an unexpected keyword argument 'tenant_id' ``` It now accepts the argument, so parsing should proceed past this point. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-23 20:40:54 +08:00
Magicbook1108	75a5548b85	Feat: optimize title chunk (#14325 ) ### What problem does this PR solve? Feat: optimize title chunk 1. Add a new button to enable "Use root chunk as H0 heading", so that the first chunk is carried on to all remaining chunks. 2. Update resume agent template ### Type of change - [x] New Feature (non-breaking change which adds functionality) <img width="700" alt="img_v3_02111_63b04951-b3d7-4001-a08b-539db6d5298g" src="https://github.com/user-attachments/assets/4179ac4d-90e7-4353-9b93-d649a455e634" /> <img width="700" alt="image" src="https://github.com/user-attachments/assets/c0ba0f3c-05aa-4f2c-b418-e808ca1a2641" />	2026-04-23 18:55:55 +08:00
Wang Qi	224574831c	Add REDIS zcard (#14316 ) ### What problem does this PR solve? As description. ### Type of change - [x] Refactoring	2026-04-23 12:51:55 +08:00
NeedmeFordev	38e45a1117	Fix: serialize GraphRAG entity resolution merges to avoid graph mutation races (#14237 ) ### What problem does this PR solve? This PR fixes the merge-phase crash reported in #14236 during GraphRAG entity resolution. The issue happens after candidate pair resolution completes, when multiple merge coroutines mutate the same shared `networkx` graph concurrently. In `_merge_graph_nodes`, the code iterates over `graph.neighbors(node1)` and also awaits during edge/description merging. That allows another coroutine to modify the graph adjacency structure in between, which can trigger `RuntimeError: dictionary keys changed during iteration` and can also lead to unsafe shared-graph mutation. This change keeps the PR scoped to that single issue by: - serializing merge-time graph mutations with a dedicated merge lock - snapshotting `graph.neighbors(node1)` with `list(...)` before iteration Together, these changes prevent concurrent mutation of the shared graph during the merge phase and make the merge loop safe against live-view invalidation. Fixes #14236 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-22 16:42:53 +08:00
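The two safeguards named in the GraphRAG commit above (a dedicated merge lock and a `list(...)` snapshot of the neighbors) can be sketched as follows. To stay self-contained this uses a plain dict-of-sets adjacency map instead of `networkx`, and `merge_nodes` is a hypothetical simplification of `_merge_graph_nodes`, not the repository's code.

```python
import asyncio


async def merge_nodes(graph, keep, drop, lock):
    """Merge node `drop` into node `keep` in an undirected adjacency map."""
    async with lock:  # serialize all merge-time mutations of the shared graph
        if keep not in graph or drop not in graph:
            return
        # Snapshot the neighbors so the loop never iterates a live view
        # that is being mutated underneath it.
        for neighbor in list(graph[drop]):
            await asyncio.sleep(0)  # stand-in for awaited edge/description merging
            if neighbor != keep:
                graph[keep].add(neighbor)
                graph[neighbor].add(keep)
            graph[neighbor].discard(drop)
        del graph[drop]


async def main():
    # Hypothetical graph: a-b, a-c, b-d.
    graph = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
    lock = asyncio.Lock()
    await asyncio.gather(
        merge_nodes(graph, "a", "b", lock),
        merge_nodes(graph, "a", "c", lock),
    )
    return graph


result = asyncio.run(main())
```

Without the lock, the `await` inside the loop would let the second merge run while the first is mid-iteration; with it, each merge sees a consistent graph.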

1421 Commits