ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-24 01:47:36 +08:00

Author	SHA1	Message	Date
Jack Storment	59bb184e63	feat(moodle): support deleted-file sync (#14548 ) Fixes #14551 ### What problem does this PR solve? The Moodle connector did not let the sync runner clean up indexed documents that were deleted from the source. Other connectors such as dropbox, seafile, webdav, and rss already do this through a slim snapshot pass. This PR adds the same support for Moodle. When `sync_deleted_files` is on, the runner now asks the Moodle connector for a lightweight list of every module id that could be indexed. The runner then compares this list with the index and removes any indexed document whose id is not in the list. The slim pass does not download files. It only goes through courses and modules and yields ids. The id format matches the ids that the loader produces, so the match is exact. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Notes - `MoodleConnector` now also implements `SlimConnectorWithPermSync`. - New `retrieve_all_slim_docs_perm_sync` yields slim docs with the same ids the loader uses (`moodle_resource_<id>`, `moodle_forum_<id>`, `moodle_page_<id>`, `moodle_book_<id>`, `moodle_assign_<id>`, `moodle_quiz_<id>`). - The `Moodle` sync class now returns `(document_generator, file_list)` so the runner can do the cleanup. If the slim snapshot fails, `file_list` is set back to `None` and the run continues without cleanup. - The web data source map exposes `syncDeletedFiles` for Moodle so the option shows up in the UI. ### How was this tested? - `ruff check` passes on the changed Python files. - Manual review of the produced slim ids against the ids the loader builds in `_process_resource`, `_process_forum`, `_process_page`, `_process_book`, and `_process_activity`. - Behavior parity with the merged dropbox (#14476), seafile (#14499), webdav (#14491), and rss (#14493) PRs.	2026-05-07 17:44:46 +08:00
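The cleanup step described in the Moodle commit above comes down to a set difference between indexed ids and the slim snapshot. The sketch below is illustrative only: `find_stale_ids` is a hypothetical helper, not a RAGFlow function, and the real runner compares hashed ids rather than raw strings.

```python
from typing import Iterable, Optional


def find_stale_ids(indexed_ids: Iterable[str],
                   snapshot_ids: Optional[Iterable[str]]) -> set[str]:
    """Return indexed ids that are missing from the slim snapshot.

    A snapshot of None models the failure case from the PR notes:
    cleanup is skipped instead of running on partial state.
    """
    if snapshot_ids is None:
        return set()
    return set(indexed_ids) - set(snapshot_ids)


# Ids follow the formats listed in the PR notes.
indexed = {"moodle_resource_1", "moodle_page_2", "moodle_quiz_3"}
snapshot = ["moodle_resource_1", "moodle_quiz_3"]
stale = find_stale_ids(indexed, snapshot)
```

Returning an empty set when the snapshot is unavailable means a transient source outage cannot delete valid documents.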
Jin Hai	94324afee9	Go: fix auth issue in hybrid mode (#14611 ) ### What problem does this PR solve? Since the secret key get and set logic was updated, the Go server also needs to be updated. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:14:22 +08:00
buua436	0501134820	Fix: support tool call config (#14616 ) ### What problem does this PR solve? support tool call config ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-07 15:54:57 +08:00
buua436	5b162a0c46	Fix: preserve doc generator download metadata in message (#14626 ) ### What problem does this PR solve? preserve doc generator download metadata ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-07 15:48:36 +08:00
Magicbook1108	911671cef0	Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615 ) ### What problem does this PR solve? Feat: enable sync deleted files for RDBMS & fix remove last file issue ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-05-07 13:31:05 +08:00
Vivek Dubey	33d8320ce8	fix: normalize double-escaped LaTeX backslashes and HTML entities (#14564 ) Fixes #14562 ## Problem LLMs like DeepSeek V4 Flash and Qwen3-MAX return \\( and \\[ (double backslash) in LaTeX output. The preprocessLaTeX() function only handled single backslash delimiters, so equations showed as raw text. HTML entities like `&lt;` and `&gt;` were also not decoded. ## Solution Added a normalization step before the existing delimiter conversion: - \\( → \( and \\[ → \[ - `&lt;` → `<`, `&gt;` → `>`, and `&amp;` → `&` --------- Co-authored-by: Vivek <viveksantoshkumardubey@email.com>	2026-05-06 19:14:34 +08:00
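The normalization step in the LaTeX commit above can be sketched in a few lines. The actual fix lives in the frontend `preprocessLaTeX()` function; this Python version is only an illustration. `normalize_latex_delimiters` is a hypothetical name, handling the closing delimiters `\\)` and `\\]` is an assumption, and `html.unescape` decodes every HTML entity rather than only the three the PR lists.

```python
import html
import re


def normalize_latex_delimiters(text: str) -> str:
    """Decode HTML entities, then collapse doubled backslashes before delimiters."""
    # &lt; -> <, &gt; -> >, &amp; -> & (and any other entity).
    text = html.unescape(text)
    # Two literal backslashes followed by ( ) [ or ] become a single backslash.
    return re.sub(r"\\\\([()\[\]])", r"\\\1", text)


sample = "\\\\(a &lt; b\\\\)"  # the characters: \\(a &lt; b\\)
normalized = normalize_latex_delimiters(sample)
```

A rule this simple would also rewrite a LaTeX line break followed by an optional argument, such as `\\[2pt]`, so a production version may need a narrower pattern.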
Wang Qi	f32034e83e	Refactor: completion -> completions (#14584 ) ### What problem does this PR solve? Keep only /completions; /completion is deprecated. ### Type of change - [x] Refactoring	2026-05-06 17:19:22 +08:00
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. 
When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false\|0\|no\|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done\|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. 
Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. 
## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
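The bounded insert pipeline from section 5 of the GraphRAG commit above can be sketched as follows. This is a minimal illustration under stated assumptions: `insert_bounded` and `fake_insert` are hypothetical names, not the real `insert_chunks_bounded()` helper, and the retry and timeout handling mentioned in the PR is left out.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def insert_bounded(
    docs: Sequence[dict],
    insert_batch: Callable[[Sequence[dict]], Awaitable[None]],
    bulk_size: int = 64,    # default quoted in the PR
    concurrency: int = 4,   # default quoted in the PR
) -> int:
    """Insert docs in batches, with at most `concurrency` batches in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(batch: Sequence[dict]) -> None:
        async with sem:
            await insert_batch(batch)

    batches = [docs[i:i + bulk_size] for i in range(0, len(docs), bulk_size)]
    await asyncio.gather(*(worker(b) for b in batches))
    return len(batches)


async def _demo() -> tuple[int, int]:
    in_flight = 0
    peak = 0

    async def fake_insert(batch: Sequence[dict]) -> None:
        # Stand-in for a doc-store call; records how many batches overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)
        in_flight -= 1

    count = await insert_bounded([{"id": i} for i in range(1077)], fake_insert)
    return count, peak


batch_count, peak_in_flight = asyncio.run(_demo())
```

With these defaults, total in-flight documents stay at or below 64 × 4 = 256, which matches the bound stated in the PR.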
buua436	5672be0652	Feat: add IMAP deleted document sync (#14539 ) ### What problem does this PR solve? add IMAP deleted document sync ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-06 14:06:46 +08:00
NeedmeFordev	89961962c0	feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525 ) ### What problem does this PR solve? Incremental DingTalk AI Table (Notable) sync did not reconcile rows removed on the remote side with documents already in the knowledge base. This follows the coordinated datasource work in #14362 (“sync deleted files”). This PR adds a full slim snapshot (`retrieve_all_slim_docs_perm_sync`) that lists current record IDs for all sheets without building document blobs, using the same logical document IDs as full ingest (`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`). When `sync_deleted_files` is enabled on incremental runs, `DingTalkAITable._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove KB rows that no longer exist remotely. Design notes: - `_document_id` centralizes the ID string so slim snapshots and `_convert_record_to_document` stay aligned with `hash128(doc.id)` semantics used during ingestion/cleanup. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with other Dropbox-style connectors). - `batch_size` from connector config is coerced to a positive `int` before constructing the connector. - Slim snapshot failures are caught in `_generate`; `file_list` is set to `None` so cleanup is skipped rather than running on partial/error state. 
### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Files changed (summary) \| Area \| Change \| \|------\|--------\| \| `common/data_source/dingtalk_ai_table_connector.py` \| `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `_document_id` shared with document conversion \| \| `rag/svr/sync_data_source.py` \| `DingTalkAITable._generate`: slim snapshot + tuple return; `batch_size` validation; shared `end_ts` with `poll_source` \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `syncDeletedFiles` for DingTalk AI Table in `DataSourceFeatureVisibilityMap` \| Closes / relates to: #14362	2026-05-06 14:06:23 +08:00
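Two of the design notes in the DingTalk commit above lend themselves to a short sketch: one function builds the document id for both ingest and snapshot, and a snapshot failure yields `None` so cleanup is skipped. `document_id` and `safe_snapshot` are hypothetical names used only for illustration; the id format is the one quoted in the PR.

```python
import logging
from typing import Callable, Iterable, Optional


def document_id(table_id: str, sheet_id: str, record_id: str) -> str:
    """Single source of truth for the id string, shared by ingest and snapshot."""
    return f"dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}"


def safe_snapshot(list_ids: Callable[[], Iterable[str]]) -> Optional[list[str]]:
    """Collect the full id list, or return None if enumeration fails partway."""
    try:
        return list(list_ids())
    except Exception:
        # A partial list would make valid documents look stale, so drop it.
        logging.exception("slim snapshot failed; skipping cleanup")
        return None
```

Materializing the list inside the `try` block matters: an error raised halfway through a lazy generator is caught there, rather than surfacing later in the middle of cleanup.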
Jin Hai	aa57b5bd8b	Go: move logger to common module (#14545 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-06 10:41:58 +08:00
Attili-sys	24af0875e5	Feat/configurable metadata display (#13464 ) ### What problem does this PR solve? Currently, RAGFlow's Search and Chat interfaces display only raw vectorized text chunks during retrieval, without contextual information about their source documents. Users cannot see document titles, page numbers, upload dates, or custom metadata fields that would help them understand and trust the retrieved results. This PR introduces an optional metadata display feature that enriches retrieved chunks with document-level metadata in both the Search tab and Chatbot interface. Key improvements: - Search results: Display document metadata as styled badges beneath chunk snippets - Chat citations: Show metadata in citation popovers and reference lists for better source context - LLM context: Metadata is injected into the LLM prompt to enable more accurate, citation-aware responses - External API support: Applications using RAGFlow's SDK retrieval endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in via request parameters - User control: Multi-select dropdown UI allows users to choose which metadata fields to display Implementation approach: - ✅ Reuses existing `DocMetadataService` infrastructure (no new database tables or indices) - ✅ Settings stored in existing JSON configuration fields (`search_config.reference_metadata`, `prompt_config.reference_metadata`) - ✅ No database migrations required - ✅ Disabled by default (fully opt-in and backward-compatible) - ✅ Dynamic metadata field selection populated from actual document metadata keys - ✅ Fixed critical bug where Python's builtin `set()` was shadowed by a route handler function Modified endpoints (all backward-compatible): - `POST /v1/retrieval` (Public SDK) - `POST /v1/searchbots/retrieval_test` (Searchbots) - `POST /v1/chunk/retrieval_test` (UI/Internal) - Chat completions endpoints (via `extra_body.reference_metadata` or `prompt_config`) ### Type of change - [x] New Feature (non-breaking 
change which adds functionality) ### Images <img width="879" height="1275" alt="image" src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb" /> <img width="1532" height="362" alt="image" src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e" /> <img width="2586" height="1320" alt="image" src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776" /> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-04-30 23:13:27 +08:00
bitloi	a69e0c73c7	feat(rss): support deleted-file sync (#14493 ) ### What problem does this PR solve? Partially addresses #14362. This PR enables syncing deleted files for RSS data sources. Previously, RSS incremental sync only returned feed entries whose timestamps were inside the poll window. If an entry was removed from the RSS feed, RAGFlow had no full current RSS snapshot to pass into the shared stale-document cleanup path, so the deleted remote entry could remain in the knowledge base. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector` - reuses the same `rss:<md5(stable_key)>` document ID derivation used by normal RSS ingest - returns `(document_generator, file_list)` for incremental RSS sync when `sync_deleted_files` is enabled - captures the poll end timestamp before snapshot/poll so cleanup does not race against the same sync window - adds start/end logs around RSS slim snapshot collection - exposes the deleted-file sync toggle for RSS in the data source UI Per maintainer request on related datasource PRs, this PR contains no test-case changes. Local verification was run with an external script. Validation: - `uv run ruff check common/data_source/rss_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` - `uv run python /tmp/verify_rss_deleted_sync.py --repo /root/74/ragflow` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 18:56:13 +08:00
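The `rss:<md5(stable_key)>` id scheme mentioned in the RSS commit above can be sketched as below. How the connector picks the stable key (for example a GUID or link) is not stated in the PR, so the key here is just an assumed input string, and `rss_document_id` is a hypothetical name.

```python
import hashlib


def rss_document_id(stable_key: str) -> str:
    """Derive a deterministic document id from a feed entry's stable key."""
    digest = hashlib.md5(stable_key.encode("utf-8")).hexdigest()
    return f"rss:{digest}"


# The same key always maps to the same id, so ingest and snapshot agree.
entry_id = rss_document_id("https://example.com/feed/item-1")
```

Because the id depends only on the key, the slim snapshot can rebuild it without fetching or parsing entry content.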
NeedmeFordev	bedf9592ef	feat(webdav): support deleted-file sync via slim snapshot (#14491 ) ## What problem does this PR solve? Incremental WebDAV sync only ingested files whose modification time fell inside the poll window; documents removed on the WebDAV server were never removed from the knowledge base. This aligns with [#14362](https://github.com/infiniflow/ragflow/issues/14362) (coordinated datasource “sync deleted files” work). This PR adds a full-tree slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote paths without downloading file contents, using the same logical document IDs as full ingest (`webdav:{base_url}:{file_path}`). When `sync_deleted_files` is enabled on incremental runs, sync returns `(document_generator, file_list)` so `SyncBase` runs `cleanup_stale_documents_for_task` and removes KB rows no longer present remotely. Design notes: - `_list_files_recursive` gains `filter_by_mtime`: snapshot passes `filter_by_mtime=False` (full tree under `remote_path`); `poll_source` keeps mtime-window filtering as before. - Slim snapshot applies the same extension and `size_threshold` rules as `_yield_webdav_documents` so retain IDs match what would be indexed. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with Dropbox-style connectors). ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Files changed \| Area \| Change \| \|------\|--------\| \| `common/data_source/webdav_connector.py` \| `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `filter_by_mtime` on `_list_files_recursive` \| \| `rag/svr/sync_data_source.py` \| WebDAV `_generate`: `file_list` + tuple return; pass `batch_size` from connector config \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` \|	2026-04-30 17:26:27 +08:00
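The `filter_by_mtime` switch described in the WebDAV commit above can be illustrated with a flat file list. This is a sketch under assumptions: `RemoteFile` and `list_files` are hypothetical, the real connector walks the remote tree recursively, and the half-open window `(start, end]` and the default extensions are illustrative choices.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class RemoteFile:
    path: str
    mtime: float
    size: int


def list_files(
    files: Iterable[RemoteFile],
    start: float,
    end: float,
    filter_by_mtime: bool = True,
    size_threshold: Optional[int] = None,
    extensions: tuple[str, ...] = (".pdf", ".md"),
) -> Iterator[RemoteFile]:
    """Yield indexable files; the mtime window applies only when requested."""
    for f in files:
        if not f.path.lower().endswith(extensions):
            continue
        if size_threshold is not None and f.size > size_threshold:
            continue
        if filter_by_mtime and not (start < f.mtime <= end):
            continue
        yield f


remote = [
    RemoteFile("/docs/old.pdf", mtime=10.0, size=100),
    RemoteFile("/docs/new.md", mtime=55.0, size=100),
    RemoteFile("/docs/huge.pdf", mtime=55.0, size=10_000),
    RemoteFile("/docs/image.png", mtime=55.0, size=100),
]
polled = [f.path for f in list_files(remote, 50.0, 60.0, size_threshold=1_000)]
snapshot = [f.path for f in list_files(remote, 50.0, 60.0, filter_by_mtime=False,
                                       size_threshold=1_000)]
```

Both passes apply the same extension and size rules, so the snapshot retains exactly the ids that ingestion could have produced.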
balibabu	00e03a1945	Fix: LaTeX formulas cannot be displayed on the chat page. (#14531 ) ### What problem does this PR solve? Fix: LaTeX formulas cannot be displayed on the chat page. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 16:12:13 +08:00
bitloi	17eda04b8d	feat(zendesk): support deleted-file sync (#14487 ) ### What problem does this PR solve? Refs #14362. This PR enables syncing deleted files for Zendesk data sources. Previously, Zendesk incremental sync never returned a slim remote snapshot to the shared stale-document cleanup path, so deleted remote Zendesk records could remain in RAGFlow. The existing Zendesk slim snapshot also included records that ingestion intentionally skips, such as draft articles, articles without bodies, skipped-label articles, empty-body articles, and tickets with `status == "deleted"`. This PR: - exposes the deleted-file sync option for Zendesk in the data source UI - returns Zendesk slim snapshots during incremental sync when `sync_deleted_files` is enabled - reuses Zendesk indexability rules so cleanup compares against the same records ingestion can materialize - adds start/end logs around Zendesk slim snapshot collection for operational visibility Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/zendesk_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-04-30 14:44:05 +08:00
bitloi	8f75e52bbf	feat(asana): support deleted-file sync (#14468 ) ### What problem does this PR solve? Partially addresses #14362. Adds deleted-file sync support for the Asana data source. Asana already indexes task attachments as documents, but it did not provide the slim document snapshot required by stale-document reconciliation, and the sync wrapper never returned a `file_list` for cleanup. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector` - builds slim IDs with the same `asana:{task_id}:{attachment_gid}` format used by indexed documents - avoids downloading attachment blobs during the snapshot - aborts the snapshot if Asana API errors occur, preventing partial snapshots from deleting valid local docs - captures the incremental poll end time before snapshotting and makes `poll_source()` respect that boundary - exposes the deleted-file sync toggle for Asana in the data source UI Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/asana_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` ### Type of change - [x] New Feature	2026-04-30 14:41:36 +08:00
Yingfeng	4ee0702aed	Feat: add skills space to context engine (#13908 ) ### What problem does this PR solve? issue #13714 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 12:36:03 +08:00
Magicbook1108	bb3b99f0a5	Feat: add button for remove header & footer in pipeline (#14486 ) ### What problem does this PR solve? Feat: add button for remove header & footer in pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 12:30:41 +08:00
NeedmeFordev	2932b65da6	feat(seafile): support deleted-file sync via slim snapshot (#14499 ) ### What problem does this PR solve? Incremental Seafile sync only ingests files whose modification time falls in the poll window; documents removed in Seafile were never removed from the knowledge base. This contributes to [#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource “sync deleted files” coordination). This PR adds a slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote file IDs without downloading content, using the same logical IDs as full ingest (`seafile:{repo_id}:{file_id}`). When `sync_deleted_files` is enabled on incremental runs, `SeaFile._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove stale KB documents. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### What changed - `common/data_source/seafile_connector.py`: `SeaFileConnector` implements `SlimConnectorWithPermSync`; `_list_files_recursive(..., filter_by_mtime=...)` supports full-tree listing for snapshots; `retrieve_all_slim_docs_perm_sync()` reuses the same library/root scan as ingest and applies the same size ceiling; logging for snapshot start/end and counts. - `rag/svr/sync_data_source.py`: `SeaFile._generate` validates `batch_size`, captures `end_ts` before snapshot + `poll_source`, wraps slim retrieval in `try`/`except` ( `file_list = None` on failure so ingest continues), returns `(generator, file_list)`. - `web/src/pages/user-setting/data-source/constant/index.tsx`: `syncDeletedFiles` for Seafile in `DataSourceFeatureVisibilityMap`.	2026-04-30 12:05:12 +08:00
buua436	47129fdd08	Fix: optimize file batch delete (#14473 ) ### What problem does this PR solve? optimize file batch delete ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:00:39 +08:00
balibabu	7c0584a2b7	Fix: The GraphRAG icon is not displaying. (#14514 ) ### What problem does this PR solve? Fix: The GraphRAG icon is not displaying. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 10:44:05 +08:00
euvre	6dd38eca6a	fix: file logs not displayed in dataset ingestion page (#14479 ) ### What problem does this PR solve? ## Summary Fixed a bug where the File Logs tab in the dataset ingestion page always showed "No logs" even after files were parsed successfully. ## Root Cause Both the File Logs and Dataset Logs tabs on the frontend called the same backend endpoint `/datasets/{dataset_id}/ingestions`. However, the backend only queried `get_dataset_logs_by_kb_id`, which hard-filtered records by `document_id == GRAPH_RAPTOR_FAKE_DOC_ID` (dataset-level logs). As a result, real file-level logs were never returned, causing the table to appear empty. ## Changes ### Backend - `api/apps/restful_apis/dataset_api.py` - Added two new query parameters to `list_ingestion_logs`: - `log_type` — `"file"` or `"dataset"` (default: `"dataset"`) - `keywords` — search keyword for filtering by document / task name - `api/apps/services/dataset_api_service.py` - Updated `list_ingestion_logs` signature to accept `log_type` and `keywords`. - Added conditional routing: - When `log_type == "file"`, call `PipelineOperationLogService.get_file_logs_by_kb_id` - Otherwise, call `PipelineOperationLogService.get_dataset_logs_by_kb_id` - `api/db/services/pipeline_operation_log_service.py` - Extended `get_dataset_logs_by_kb_id` with an optional `keywords` parameter so dataset logs can also be searched. ### Frontend - `web/src/pages/dataset/dataset-overview/hook.ts` - Removed the separate API function switching (`listPipelineDatasetLogs` vs `listDataPipelineLogDocument`). - Unified both tabs to call `listDataPipelineLogDocument` with the new `log_type` query parameter (`"file"` or `"dataset"`). - Ensured `keywords` and filter values are passed through correctly. 
## Behavior After Fix \| Tab \| `log_type` \| Returned Records \| Searchable Field \| \|---\|---\|---\|---\| \| File Logs \| `file` \| Real document-level logs \| `document_name` (file name) \| \| Dataset Logs \| `dataset` \| GraphRAG / RAPTOR / MindMap logs \| `document_name` (task type) \| ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com> Co-authored-by: Wang Qi <wangq8@outlook.com> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-29 22:10:24 +08:00
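The `log_type` routing from the file-logs commit above can be sketched over in-memory records. This is illustrative only: the real service queries the database through `PipelineOperationLogService`, the value of `DATASET_LEVEL_DOC_ID` is a placeholder standing in for `GRAPH_RAPTOR_FAKE_DOC_ID`, and case-insensitive substring matching for `keywords` is an assumption.

```python
DATASET_LEVEL_DOC_ID = "dataset-level-placeholder"  # stand-in value, not the real constant


def list_ingestion_logs(records: list[dict], log_type: str = "dataset",
                        keywords: str = "") -> list[dict]:
    """Return file-level or dataset-level logs, optionally filtered by name."""
    want_file = log_type == "file"
    needle = keywords.lower()
    selected = []
    for record in records:
        is_dataset_level = record["document_id"] == DATASET_LEVEL_DOC_ID
        if is_dataset_level == want_file:
            continue  # record belongs to the other tab
        if needle and needle not in record["document_name"].lower():
            continue
        selected.append(record)
    return selected


logs = [
    {"document_id": "doc-1", "document_name": "report.pdf"},
    {"document_id": "doc-2", "document_name": "notes.md"},
    {"document_id": DATASET_LEVEL_DOC_ID, "document_name": "GraphRAG"},
]
file_logs = list_ingestion_logs(logs, log_type="file")
dataset_logs = list_ingestion_logs(logs)
```

Keeping `"dataset"` as the default mirrors the PR, so callers that omit `log_type` see the same records as before the fix.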
Wang Qi	5018459112	Fix metadata config (#14480 ) ### What problem does this PR solve? Fix metadata config ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 21:09:54 +08:00
Wang Qi	c4d0b0ebcf	Fix visit dataset error (#14490 ) ### What problem does this PR solve? Fix visit dataset error ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 20:17:00 +08:00
balibabu	1692f0928f	Fix: The pipeline column header in the FileLogsTable is displaying incorrectly. (#14489 ) ### What problem does this PR solve? Fix: The pipeline column header in the FileLogsTable is displaying incorrectly. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 19:52:28 +08:00
writinwaters	9280c64518	Docs: Updated Title chunker references (#14483 ) ### What problem does this PR solve? Updated Title chunker references ### Type of change - [x] Documentation Update	2026-04-29 19:37:24 +08:00
Magicbook1108	de8c6ad0f3	Feat: enable sync deleted file for Discord (#14451 ) ### What problem does this PR solve? Feat: enable sync deleted file for Discord ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:40 +08:00
bitloi	2bc8c6d35e	feat(dropbox): support deleted-file sync (#14476 ) ### What problem does this PR solve? Partially addresses #14362 by adding deleted-file sync support for the Dropbox data source. Dropbox previously did not provide the slim current-file snapshot required by stale document reconciliation, and its sync runner returned only document batches. As a result, enabling deleted-file sync could not remove local documents that had been deleted from Dropbox. This PR: - Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`. - Reuses Dropbox metadata traversal to collect current remote file IDs without downloading file contents. - Wires incremental Dropbox sync to return `(document_generator, file_list)` when `sync_deleted_files` is enabled. - Enables the deleted-file sync toggle for Dropbox in the data source settings UI. - Adds regression coverage for slim snapshots, nested folders, paginated listings, duplicate filenames, and full reindex behavior. Tests: - `uv run pytest test/unit_test/common/test_dropbox_connector.py -q` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `uv run pytest test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py -q` - `uv run ruff check common/data_source/dropbox_connector.py rag/svr/sync_data_source.py test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:11 +08:00
Magicbook1108	db1a73b255	Feat: enable sync deleted files in gitlab (#14481 ) ### What problem does this PR solve? Feat: enable sync deleted files in gitlab ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:04:10 +08:00
euvre	a0f9ae16d2	Fix: RAPTOR "Generation scope" reset to "Single file" when selecting "Dataset" (#14477 ) ## Problem In the Dataset Configuration page, changing the RAPTOR Generation scope from "Single file" to "Dataset" and clicking Save did not persist the change. After refreshing or re-entering the page, the scope always reverted to "Single file". ## Root Cause 1. Backend: The `RaptorConfig` Pydantic model in `api/utils/validation_utils.py` was configured with `extra="forbid"` but did not declare a `scope` field. When the frontend sent `"scope": "dataset"`, Pydantic rejected the request. 2. Frontend: The `extractRaptorConfigExt` utility in `web/src/hooks/parser-config-utils.ts` treated `scope` as an unknown field and moved it into the nested `ext` object. Consequently, the backend could not read `raptor_config.get("scope", "file")` correctly, so the default `"file"` was always used. ## Changes - Added `scope: Literal["file", "dataset"]` to the backend `RaptorConfig` model with a default of `"file"`. - Added `scope` to the known-field whitelist in the frontend `extractRaptorConfigExt` helper so it is transmitted as a top-level raptor field instead of being buried in `ext`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-29 18:46:28 +08:00
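The frontend half of the RAPTOR scope bug above comes from how unknown fields are moved into `ext`. The real helper is the TypeScript `extractRaptorConfigExt`; the Python sketch below only mirrors the idea, and every field name other than `scope` is a made-up example.

```python
def split_known_fields(config: dict, known: set[str]) -> dict:
    """Keep whitelisted keys at the top level and nest the rest under 'ext'."""
    top = {k: v for k, v in config.items() if k in known}
    ext = {k: v for k, v in config.items() if k not in known}
    if ext:
        top["ext"] = ext
    return top


form = {"use_raptor": True, "scope": "dataset", "custom_flag": 1}

# Before the fix: 'scope' is not whitelisted, so it is buried in 'ext'.
before = split_known_fields(form, {"use_raptor"})
# After the fix: 'scope' is whitelisted and stays a top-level field.
after = split_known_fields(form, {"use_raptor", "scope"})
```

With `scope` buried in `ext`, a backend lookup such as `raptor_config.get("scope", "file")` falls back to `"file"`, which matches the symptom described in the PR.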
Wang Qi	1b84892e3a	Fix delete graph (#14484 ) ### What problem does this PR solve? Fix delete graph ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 18:09:10 +08:00
Wang Qi	3991bdfaf5	Fix graph task type (#14475 ) ### What problem does this PR solve? Fix graph task type ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 17:05:56 +08:00
Magicbook1108	e0b3070012	Feat: enable sync deleted files for Gmail && fix google drive issues (#14462 ) ### What problem does this PR solve? Feat: enable sync deleted files for Gmail && fix google drive issues ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: bill <yibie_jingnian@163.com> Co-authored-by: balibabu <assassin_cike@163.com>	2026-04-29 17:03:56 +08:00
balibabu	a736948493	Fix: Clicking the button in the bottom-right corner of the `/chats/widget` page fails to display the dialog box. (#14465 ) ### What problem does this PR solve? Fix: Clicking the button in the bottom-right corner of the `/chats/widget` page fails to display the dialog box. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 17:03:33 +08:00
Wang Qi	9690923516	Fix delete graphrag raptor (#14469 ) ### What problem does this PR solve? Fix delete graphrag raptor ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 16:47:42 +08:00
balibabu	ce933357c6	Fix: Dataset: When configuring the "general chunk method," options such as chunk size and parent-child slicing are unavailable. (#14459 ) ### What problem does this PR solve? Fix: Dataset: When configuring the "general chunk method," options such as chunk size and parent-child slicing are unavailable. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: balibabu <assassin_cike@163.com>	2026-04-29 14:37:48 +08:00
Magicbook1108	3b7a6eaa6c	Feat: sync deleted files in Bitbucket (#14450 ) ### What problem does this PR solve? Feat: sync deleted files in Bitbucket ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 11:29:17 +08:00
Paras Sondhi	74fa54f122	feat(google-drive): optimize memory payload and enable sync deletion (#14372 ) Addresses the Google Drive integration for #14362 This PR completely overhauls the Google Drive sync logic to accurately detect remote deletions, while drastically reducing the memory footprint during the snapshot phase. ### What changed under the hood: * Killed the memory bloat: Swapped out the massive document dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc = namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during `retrieve_all_slim_docs_perm_sync` on massive enterprise drives. * Flawless downstream integration: The `SlimDoc` object relies on simple duck typing. It perfectly delivers the `.id` attribute required by `ConnectorService.cleanup_stale_documents_for_task`, meaning your core `hash128` vector cleanup logic runs natively without modification. * Fixed the Shared Drive blindspot: The standard API query was missing team folders. Injected the `corpora="allDrives"` and `includeItemsFromAllDrives=True` override flags so the connector now accurately maps state across both personal workspaces and organizational Shared Drives. ### Testing: Isolated the Google API retrieval logic locally to prove the `SlimDoc` mapping works and correctly registers state drops when a file is trashed remotely. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Performance Improvement	2026-04-29 10:04:36 +08:00
Magicbook1108	0d18b293f5	Fix: enable sync deleted file in airtable (#14438 ) ### What problem does this PR solve? Fix: enable sync deleted file in airtable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 20:09:08 +08:00
euvre	35f6d81b73	Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402 ) ### What problem does this PR solve? ## Summary Migrate two web API endpoints to REST-style HTTP API endpoints, following the pattern established in #14222: `POST /v1/chunk/retrieval_test` becomes `POST /api/v1/datasets/<dataset_id>/search`; `GET /v1/chunk/knowledge_graph` becomes `GET /api/v1/datasets/<dataset_id>/graph`.	2026-04-28 20:00:26 +08:00
Magicbook1108	d532151be0	Feat: more model for paddle (#14436 ) ### What problem does this PR solve? Feat: more model for paddle ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 18:07:00 +08:00
Jack	c330005659	Fix: document level auto metadata config missing after save (#14421 ) ### What problem does this PR solve? Steps to reproduce (existing bug before API migration): 1. create a new dataset 2. upload a file 3. click on "General" in the "Parse" column and then click on "switch or configure ingestion pipeline" 4. click on "Settings" (to the right of "Auto metadata") 5. click "Add" to add new metadata 6. click on "Save" 7. re-open "Settings"; the newly added metadata is not there ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 17:09:23 +08:00
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync deleted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
buua436	444e564329	Fix: align chat recommendation and thumbup APIs (#14413 ) ### What problem does this PR solve? align chat recommendation and thumbup APIs ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 12:55:16 +08:00
Jack	2d522ccb36	Fix: thumbnails issue in chat (#14415 ) ### What problem does this PR solve? In chat, the thumbnails didn't display correctly ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) Steps to reproduce: 1. create dataset and upload a file (see attached) 2. parse the document 3. once parsing completed, create a chat and associate it with the dataset 4. ask a question (DAP VS DAPE comparison) 5. check result	2026-04-28 11:39:29 +08:00
Jack	c81081f8ef	Refactor: Doc change parser (#14327 ) ### What problem does this PR solve? Before migration, Web API: POST /v1/document/change_parser; HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents. After consolidation, RESTful API: PATCH /api/v1/datasets/<dataset_id>/documents ### Type of change - [x] Refactoring	2026-04-27 23:42:57 +08:00
Jack	c5116b90e5	Refactor: migrate document thumbnails API (#14344 ) ### What problem does this PR solve? Before migration: GET /v1/document/thumbnails After migration: GET /api/v1/thumbnails ### Type of change - [x] Refactoring	2026-04-27 21:29:09 +08:00
Jack	49912a156e	Refactor: migrate document run api (#14351 ) ### What problem does this PR solve? Before migration: POST /v1/document/run After migration: POST /api/v1/documents/ingest/ ### Type of change - [x] Refactoring	2026-04-27 21:25:58 +08:00
Jack	a536980e22	Refactor: Doc batch change status (#14337 ) ### What problem does this PR solve? Before migration, Web API: POST /v1/document/change_status. After consolidation, RESTful API: POST /api/v1/datasets/<dataset_id>/documents/batch-update-status ### Type of change - [x] Refactoring	2026-04-27 20:00:23 +08:00
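
Several of the commits above (Dropbox #14476, Google Drive #14372, and the other "sync deleted files" entries) describe the same idea: a connector yields a lightweight snapshot of current remote ids, and the runner removes indexed documents whose ids are missing from it. The sketch below illustrates only that set comparison. `SlimDoc` follows the namedtuple quoted in #14372; `slim_snapshot` and `find_stale_ids` are hypothetical names for illustration and are not taken from the ragflow code base.

```python
from collections import namedtuple
from typing import Iterable, Iterator, List, Optional, Set

# Same shape as the namedtuple quoted in the Google Drive commit message.
SlimDoc = namedtuple("SlimDoc", ["id"])


def slim_snapshot(remote_ids: Iterable[str], batch_size: int = 2) -> Iterator[List[SlimDoc]]:
    """Hypothetical slim pass: yield batches of ids only, no file contents."""
    batch: List[SlimDoc] = []
    for remote_id in remote_ids:
        batch.append(SlimDoc(id=remote_id))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def find_stale_ids(
    indexed_ids: Iterable[str],
    snapshot: Optional[Iterable[List[SlimDoc]]],
) -> Set[str]:
    """Return indexed ids that no longer exist remotely.

    Assumption: a snapshot of None means the slim pass failed or is disabled,
    so nothing is reported as stale and no cleanup happens.
    """
    if snapshot is None:
        return set()
    current = {doc.id for batch in snapshot for doc in batch}
    return set(indexed_ids) - current


indexed = {"moodle_page_1", "moodle_page_2", "moodle_forum_7"}
remote = ["moodle_page_1", "moodle_forum_7"]
stale = find_stale_ids(indexed, slim_snapshot(remote))
```

Because the comparison is by exact id, the slim pass has to produce ids in the same format as the full loader; otherwise every indexed document would look stale.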

1983 Commits