ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-08 08:07:21 +08:00

Author	SHA1	Message	Date
hyl64	02c2587ca4	fix(agent): support iteration item aliases in child nodes (#14146 ) ## Summary This PR fixes the iteration variable mismatch reported in #14142. Changes: - restore compatibility for `IterationItem@result` by exposing `result` alongside `item` - support bare iteration aliases like `{item}`, `{index}`, and `{result}` inside iteration child-node inputs - add focused unit/runtime tests covering both alias styles and multi-item iteration execution ## Validation ```bash pytest -q --noconftest \ test/testcases/test_web_api/test_canvas_app/test_iterationitem_unit.py \ test/testcases/test_web_api/test_canvas_app/test_iteration_runtime_unit.py \ test/testcases/test_web_api/test_canvas_app/test_invoke_component_unit.py ``` Result: `12 passed` Closes #14142	2026-05-12 13:05:21 +08:00
CaptainTimon	2717ee283f	feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679 ) ### What problem does this PR solve? Closes #14674. This PR improves RAPTOR configuration and tree construction while preserving the existing RAPTOR behavior as the default. RAPTOR currently builds summary layers with the original UMAP + GMM clustering path. This PR keeps that default path, and adds: - A hidden backend tree-builder option: - `tree_builder="raptor"`: default, existing RAPTOR behavior. - `tree_builder="psi"`: rank-aware Psi-style tree builder using original embedding-space cosine ranking. - A user-facing clustering method option for the default RAPTOR builder: - `clustering_method="gmm"`: existing default. - `clustering_method="ahc"`: agglomerative hierarchical clustering path. - A RAPTOR UI setting for `Clustering method` and `Max cluster`. ### What changed #### Backend - Added `tree_builder` support for RAPTOR/Psi. - Added `clustering_method` support for GMM/AHC. - Kept existing RAPTOR + GMM as the default. - Added Psi tree building from original-space cosine similarity. - Added bucketed Psi building controls for large inputs: - `raptor.ext.psi_exact_max_leaves` - `raptor.ext.psi_bucket_size` - Added method-aware RAPTOR summary metadata using existing `extra.raptor_method`. - Avoided adding a dedicated DB schema field for experimental method tracking. - Added cleanup/migration logic to avoid mixing stale RAPTOR summary trees. - Added defensive checks for Psi tree construction and summary failures. #### Frontend/UI - Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`. - Added/kept `Max cluster` in RAPTOR settings. - Enlarged max cluster UI limit to `1024`, matching backend validation. - Kept AHC editable even when a RAPTOR task has already finished. 
- Fixed the UI save payload so `clustering_method` and `tree_builder` are serialized through `parser_config.raptor.ext`, avoiding backend validation errors for extra top-level RAPTOR fields. Example saved RAPTOR config: ```json { "raptor": { "max_cluster": 317, "ext": { "clustering_method": "ahc", "tree_builder": "raptor" } } } ``` Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>	2026-05-12 09:42:31 +08:00
buua436	daf8a58c4b	Fix: add codeexec attachments output (#14787 ) ### What problem does this PR solve? add codeexec attachments output ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 19:16:33 +08:00
tmimmanuel	663fc1d42c	fix(opensearch): implement doc-meta dispatch surface on OSConnection (#14577 ) ### What problem does this PR solve? Fixes #14570. On OpenSearch backends (`DOC_ENGINE=opensearch`) every document-metadata write failed with `'OSConnection' object has no attribute 'create_doc_meta_idx'`, so both `PATCH /api/v1/datasets/{ds}/documents/{doc}` with `meta_fields` and `POST /api/v1/datasets/{ds}/metadata/update` were unusable while every other document operation (retrieval, parsing, name update, chunk management) worked correctly on the same OpenSearch cluster. The bug runs deeper than the missing method name in the error message suggests. `DocMetadataService` also reached into `settings.docStoreConn.es.*` directly for the index refresh, the scripted partial update, and the count call, which means that even after adding `create_doc_meta_idx` to `OSConnection` the very next call in the same metadata flow would still raise `AttributeError` because `OSConnection` exposes `self.os` rather than `self.es`. Fixing only the reported symptom would have moved the failure one line down without restoring the feature. This PR adds a uniform document-metadata dispatch surface to both connection classes so they present the same abstract API, and routes the service layer through that surface via `getattr` guards instead of poking at backend-specific attributes. The four new methods on `OSConnection` and `ESConnectionBase` are `create_doc_meta_idx`, `refresh_idx`, `count_idx`, and `replace_meta_fields`. `OSConnection.create_doc_meta_idx` reuses the existing `conf/doc_meta_es_mapping.json` schema in the OpenSearch `body=` form because OpenSearch and Elasticsearch share the same index-creation payload, and `replace_meta_fields` emits a full scripted assignment (`ctx._source.meta_fields = params.meta_fields`) on both backends so removed keys actually disappear instead of being preserved by deep-merge semantics. 
The `getattr`-guarded dispatch in `DocMetadataService` keeps the existing fall-through paths intact for Infinity and OceanBase, which continue to rely on their search-based count fallback and on the delete-then-insert metadata replacement they used before, so this change is strictly additive for those two backends. Verification: `pytest test/unit_test/rag/utils/test_opensearch_doc_meta.py` runs 16 new unit tests that pass locally and pin the `OSConnection` dispatch surface, the `create_doc_meta_idx` short-circuit when the index already exists, the mapping-file payload routing, the `IndicesClient.create` failure path, the `refresh_idx` and `count_idx` success and error sentinels, and the full-assignment script emitted by `replace_meta_fields`. The test module stubs `common.settings` and `rag.nlp` at import time so the suite runs without the heavy backend SDKs that the rest of the repository pulls in transitively. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>	2026-05-11 17:04:28 +08:00
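The `getattr`-guarded dispatch described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `FakeOSConnection`, `FakeInfinityConnection`, `search_count`, and `count_docs` are hypothetical names, and only the method name `count_idx` comes from the commit message.

```python
class FakeOSConnection:
    """Hypothetical backend that exposes the new dispatch surface."""

    def count_idx(self, index_name):
        # A real backend would ask the cluster; here we return a fixed value.
        return 42


class FakeInfinityConnection:
    """Hypothetical backend without count_idx; it keeps a search-based fallback."""

    def search_count(self, index_name):
        return 7


def count_docs(conn, index_name):
    """Use count_idx when the backend provides it, else fall through."""
    count_idx = getattr(conn, "count_idx", None)
    if callable(count_idx):
        return count_idx(index_name)
    # Fall-through path for backends that never implemented the surface.
    return conn.search_count(index_name)
```

The point of the guard is that the service layer never touches a backend-specific attribute such as `.es` or `.os`; it only asks whether the abstract method exists.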
box4wangjing	292b0b8bce	chore: fix some comments to improve readability (#14756 ) ### What problem does this PR solve? fix some comments to improve readability ### Type of change - [x] Documentation Update --------- Signed-off-by: box4wangjing <box4wangjing@outlook.com>	2026-05-11 16:48:48 +08:00
as-ondewo	6fb8c31c22	Fix: Document parse status set to DONE before chunks are retrievable (#13352) ### What problem does this PR solve? The document parse status was set to DONE before the document chunks were actually retrievable from Elasticsearch/OpenSearch, because the code did not wait for the index refresh. As a result, the API could report a parse status of DONE while a chunk retrieval request still returned nothing. Since the index refreshes every 1 second, this was quite likely to happen when waiting for document parsing by polling at a short interval and then retrieving chunks immediately once the status was DONE. I fixed this bug and added a test case that would have caught it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 16:04:08 +08:00
tmimmanuel	6ce014c23b	fix: offload blocking DB/Redis calls to thread pool for high-concurrency support (#13825 ) (#13941 ) ### What problem does this PR solve? Addresses event-loop blocking under high concurrency reported in #13825. When multiple requests hit the API simultaneously, synchronous DB/Redis calls block the async event loop, preventing Quart from handling other requests and causing cascading 502/504 timeouts. This PR wraps all remaining blocking DB/Redis calls in `canvas_app.py`, `chat_api.py`, `session.py`, and `canvas_service.py` with `await thread_pool_exec()` - Offload all synchronous `Service.`, `REDIS_CONN.`, and `APIToken.query` calls to the thread pool - Convert sync endpoint handlers (`list_chats`, `get_chat`, `templates`, `sessions`, etc.) to `async def` - Convert sync helper functions (`_ensure_owned_chat`, `_validate_llm_id`, `_validate_dataset_ids`, etc.) to async - no duplicate sync/async pairs - Wrap `CanvasReplicaService` Redis IO calls (`bootstrap`, `replace_for_set`, `commit_after_run`) - Use `asyncio.gather()` for concurrent file uploads and chat response building Note: This fixes the code-level event-loop blocking, which is a prerequisite for handling concurrent requests. For the full "30 concurrent requests without 502/504" goal described in the issue, users should also tune deployment config: - `WS=4` or higher (HTTP worker processes, default 1) - `MAX_CONCURRENT_CHATS=50` (default 10) - `SANDBOX_EXECUTOR_MANAGER_POOL_SIZE` for workflow-heavy workloads ### Performance verification Reviewer asked for a before-vs-after comparison ([comment](https://github.com/infiniflow/ragflow/pull/13941#issuecomment-4393667231)). 
I built a self-contained microbenchmark that reproduces the exact failure mode this PR targets: an async handler that performs blocking DB/Redis-style calls (50 ms each, 3 per request, 30 concurrent requests) is run twice — once with the pre-PR pattern (sync call directly inside the async handler) and once with the post-PR pattern (`await thread_pool_exec(...)`). The benchmark imports nothing from RAGFlow except `thread_pool_exec` itself, so it is hermetic and reproducible (`THREAD_POOL_MAX_WORKERS=128`, Python 3.13.12). Throughput — wall-clock for 30 concurrent requests (lower is better) \| flavour \| wall(s) \| p50(s) \| p95(s) \| max(s) \| \|---\|---:\|---:\|---:\|---:\| \| before \| 4.986 \| 0.158 \| 0.207 \| 0.269 \| \| after \| 0.248 \| 0.181 \| 0.230 \| 0.231 \| The pre-PR handler serializes the entire load on the event-loop thread, so 30 × 3 × 50 ms ≈ 4.5 s shows up as the wall time. The post-PR handler parallelizes the blocking work across the thread pool and finishes the same load in 248 ms — a ~20× speedup on this workload. Event-loop responsiveness — latency of an unrelated probe coroutine while the 30 slow requests are running (lower is better) \| flavour \| samples \| probe p50 (ms) \| probe p95 (ms) \| probe max (ms) \| \|---\|---:\|---:\|---:\|---:\| \| before \| 1 \| 5442.26 \| 5442.26 \| 5442.26 \| \| after \| 28 \| 0.88 \| 11.53 \| 98.02 \| This is the metric that maps directly to "the API still answers other requests while one is busy". A 5 ms-interval probe was scheduled while the 30 slow handlers ran. With the pre-PR code the event loop was frozen for the entire duration of the blocking work, so only one probe sample was ever picked up and it waited 5,442 ms. After the PR, 28 probe samples landed with p50 0.88 ms / p95 11.53 ms, meaning unrelated requests are no longer starved by the slow ones. That is the regression mode behind the cascading 502/504s reported in #13825. 
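The before/after pattern measured above can be reproduced in miniature with the sketch below. It is not the benchmark script from the PR: it uses the standard library's `asyncio.to_thread` as a stand-in for RAGFlow's `thread_pool_exec`, and the constants (5 requests, 2 blocking calls of 20 ms each) are assumptions chosen to keep the run short.

```python
import asyncio
import time

BLOCK_S = 0.02   # duration of one simulated blocking DB/Redis call
CALLS = 2        # blocking calls per request
REQUESTS = 5     # concurrent requests


def blocking_call():
    time.sleep(BLOCK_S)


async def handler_before():
    # Pre-fix pattern: the sync call runs directly on the event-loop thread.
    for _ in range(CALLS):
        blocking_call()


async def handler_after():
    # Post-fix pattern: the sync call is offloaded to a worker thread.
    for _ in range(CALLS):
        await asyncio.to_thread(blocking_call)


async def wall_time(handler):
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(REQUESTS)))
    return time.perf_counter() - start


def run_benchmark():
    before = asyncio.run(wall_time(handler_before))
    after = asyncio.run(wall_time(handler_after))
    return before, after
```

With the blocking variant, the requests run back to back, so the wall time is roughly `REQUESTS * CALLS * BLOCK_S`; with the offloaded variant, the requests overlap and the wall time approaches `CALLS * BLOCK_S` plus scheduling overhead.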
<details> <summary>Raw benchmark output</summary> ``` config: 30 concurrent requests, 3 blocking calls of 50ms each per request, THREAD_POOL_MAX_WORKERS=128 === Throughput (lower wall is better) === flavour wall(s) p50(s) p95(s) max(s) before 4.986 0.158 0.207 0.269 after 0.248 0.181 0.230 0.231 === Event-loop responsiveness (lower probe latency is better) === flavour samples probe p50(ms) probe p95(ms) probe max(ms) before 1 5442.26 5442.26 5442.26 after 28 0.88 11.53 98.02 ``` </details> The benchmark script is included as a comment on the PR for reproducibility. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement Closes [#13825](https://github.com/infiniflow/ragflow/issues/13825) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 15:08:55 +08:00
buua436	a03b95f8c4	Fix: shared dataset chunk index lookup (#14764 ) ### What problem does this PR solve? shared dataset chunk index lookup ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:50:08 +08:00
Achieve3318	16354f4e14	fix(dify): guard retrieval argument error behavior (#14169 ) ## What problem does this PR solve? The Dify-compatible `/dify/retrieval` endpoint recently gained stricter parsing and validation for its request payload, including: - Normalized `retrieval_setting.top_k` and `retrieval_setting.score_threshold` types. - Clear separation between malformed arguments vs missing required fields. Previously, there was no unit test explicitly guarding the exact error code and message contract for these cases. ## What does this PR change? - Add guard-style unit test in `test_dify_retrieval_routes_unit.py`: - `test_retrieval_argument_error_messages`: - Sends a request with malformed numeric options: - `retrieval_setting = {"top_k": "not-int", "score_threshold": "not-float"}` - Asserts `code == RetCode.ARGUMENT_ERROR` and message contains `"invalid or malformed arguments:"`. - Sends a request with required fields missing: - Empty payload (`{}`) - Asserts `code == RetCode.ARGUMENT_ERROR` and message contains `"required arguments are missing:"`. This test encodes the intended behavior of the Dify retrieval API so future refactors cannot silently regress error handling. ## Type of change - [x] Tests (add coverage and guardrails for existing behavior) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 13:17:42 +08:00
Wang Qi	3838770e7a	GraphRAG feature - Part 1 - add spacy to extract entity and relation (#14670 ) ### What problem does this PR solve? GraphRAG feature - Part 1 - add spacy to extract entity and relation <img width="1621" height="1288" alt="image" src="https://github.com/user-attachments/assets/aadeddad-94da-46c6-adad-9c3784181f61" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 12:59:59 +08:00
hyl64	77ce88dfcc	fix(prompt): reserve system budget in message_fit_in (#14164 ) ## Summary This PR fixes the `message_fit_in()` truncation bug reported in #13607. Changes: - fix the user-message truncation branch to reserve room for the system prompt token budget - guard the zero-token edge case to avoid dividing by zero in the truncation ratio check - add focused regression tests covering both the user-dominant truncation path and the zero-token boundary case ## Validation ```bash pytest -q --noconftest test/unit_test/rag/prompts/test_generator_message_fit_in.py ``` Result: `2 passed` Closes #13607	2026-05-11 12:44:27 +08:00
Ahmad Intisar	3c4d1da98f	Feature/table parser column roles (#13710 ) ### What problem does this PR solve? The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes. For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value. The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion. This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns. Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 10:06:04 +08:00
Mehmet Karakose	7ec87f7cb7	fix(auth): fall back to session-based auth in _load_user (#14569 ) ## Summary Closes #13663. OAuth / OIDC callbacks call `login_user(user)` which writes `_user_id` into the session cookie, but `_load_user()` in `api/apps/__init__.py` only ever looked at the `Authorization` header. The SPA's response interceptor wipes the Authorization value from `localStorage` on the first 401 it sees — meaning that during the post-redirect window after an OAuth login, a single transient 401 sends every subsequent request back to the login page even though `login_user()` had already established a perfectly good server-side session. The reporter's analysis traces this all the way through the redirect → `navigate('/')` → first request → empty header → 401 → `removeAll()` → infinite-redirect-to-login chain. ## What changed - New `_load_user_from_session()` helper that reads `session["_user_id"]`, looks up the user in `UserService` (with the same `StatusEnum.VALID` and `access_token` checks already used elsewhere), and assigns `g.user`. - Every `return None` path in `_load_user()` now routes through that helper before giving up: - missing `Authorization` header - malformed `bearer ` prefix - empty / too-short JWT payload - JWT signature failure - JWT-resolved user not found / has no `access_token` - `APIToken.query()` fallback exhausted The JWT and API-token paths still take precedence — the session is only consulted when those can't authenticate the request. So existing local-login and SDK callers see no behaviour change; only OAuth / OIDC users that hit the original race now stay logged in. The Bearer-prefix issue called out in #13663 (lines 103-110) is already handled in the current code, so this PR only addresses the second half of the report. 
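The precedence described above (token-based authentication first, session only as a fallback) might look roughly like this. It is a simplified sketch, not the PR's code: `load_user`, `load_user_from_session`, and the two lookup dictionaries are hypothetical stand-ins for the JWT, API-token, and `UserService` lookups; only the session key `_user_id` comes from the commit message.

```python
def load_user_from_session(session, users_by_id):
    """Fallback: resolve the user from the server-side session, if any."""
    user_id = session.get("_user_id")
    if not user_id:
        return None
    return users_by_id.get(user_id)


def load_user(headers, session, users_by_token, users_by_id):
    """Try the Authorization header first; consult the session only on failure."""
    auth = headers.get("Authorization", "")
    token = auth[len("Bearer "):] if auth.startswith("Bearer ") else ""
    user = users_by_token.get(token) if token else None
    if user is not None:
        return user
    # Every failure path ends here instead of returning None immediately.
    return load_user_from_session(session, users_by_id)
```

Because the session is consulted last, callers that already send a valid token see no change in behaviour.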
## Test plan - [ ] Configure OIDC under `oauth` in `service_conf.yaml` - [ ] Click the OIDC login button, complete auth at the IdP - [ ] Confirm that navigating between pages no longer bounces back to `/login` - [ ] Confirm local email/password login still issues + accepts JWTs - [ ] Confirm SDK/API key callers still authenticate via `Authorization: Bearer <api-token>` --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 09:59:52 +08:00
Hunnyboy1217	782084780e	feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628 ) (#14677 ) ### What problem does this PR solve? S3-family connector syncs currently re-download every in-window object just so we can compute `xxhash128(blob)` and compare against `Document.content_hash`. Anything that bumps `LastModified` without changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays full bandwidth and re-parses files that didn't actually change. #14628 covers the broader incremental-ingestion redesign; this PR is the first slice. The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2 / GCS / OCI / S3-compat) now implements a new `FingerprintConnector` interface: `list_keys()` paginates `list_objects_v2` and yields `KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The orchestrator joins those against the connector's existing `{doc_id: content_hash}` map and only calls `get_value(key)` when the fingerprint differs. Unchanged keys are skipped entirely — no `GetObject`, no re-parse. No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing `Document.content_hash` column per @yingfeng's suggestion; the connector decides at listing time whether to populate it. Local uploads and connectors that don't opt in fall through to the existing post-download `xxhash128(blob)` path with no behavior change. This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS (PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint` (PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation (PR-4). Holding those back keeps this PR additive and reviewable on its own. 
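A rough sketch of the listing-time bypass follows. It makes several assumptions that differ from the PR: `hashlib.blake2b` with a 16-byte digest stands in for `xxhash128` (both yield 32 hex characters), the known-hash map is keyed by object key rather than `doc_id`, and `fingerprint_from_etag` and `keys_to_fetch` are hypothetical helper names. Only `KeyRecord(key, fingerprint)` is taken from the commit message.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRecord:
    key: str
    fingerprint: str


def fingerprint_from_etag(etag):
    """Hash the ETag (quotes stripped) into a 32-hex-char fingerprint."""
    normalized = etag.strip().strip('"')
    # Stand-in for xxhash128: a 128-bit BLAKE2b digest from the stdlib.
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).hexdigest()


def keys_to_fetch(listing, known_hashes):
    """Return only the keys whose fingerprint differs from the stored hash."""
    return [rec.key for rec in listing if known_hashes.get(rec.key) != rec.fingerprint]
```

Keys that are new or whose ETag changed are returned for download; everything else is skipped without a `GetObject` call.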
#### Files touched - `common/data_source/models.py` — new `KeyRecord`; optional `fingerprint` on `Document` - `common/data_source/interfaces.py` — `IncrementalCapability` enum, `FingerprintConnector` ABC - `common/data_source/blob_connector.py` — `BlobStorageConnector` implements `FingerprintConnector`; per-object download factored into `_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`, `get_value` all share it - `rag/svr/sync_data_source.py` — `_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop; `_run_task_logic` plumbs `doc.fingerprint` into the upload dict - `api/db/services/document_service.py` — `list_id_content_hash_map_by_kb_and_source_type()` helper - `api/db/services/connector_service.py` + `file_service.py` — fingerprint flows through `duplicate_and_parse → upload_document` and lands in `content_hash` - `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests covering ETag normalization (single-part, multipart, quoted, empty), `list_keys()` not calling `GetObject`, `get_value()` materializing with fingerprint, deterministic/stable fingerprints, and the bypass loop asserting `GetObject` is not called on a match #### Worth flagging for review Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a `LastModified` window when `poll_range_start` was set. New code uses `_fingerprint_filtered_generator` (full bucket listing + fingerprint compare) outside of explicit `reindex=1`. Strictly better for unchanged-bucket cases since it skips `GetObject`, but it does mean every sync now does a full `list_objects_v2` paginate. Should still be cheap for most buckets — flagging in case anyone has a very large bucket where the time-window filter was meaningful. On migration: existing rows have `content_hash = xxhash128(blob)` from the old code. The first sync after this lands sees ETag-derived fingerprints that don't match, re-fetches every object once, and writes the new fingerprint. 
From the second sync onward the bypass works as expected. "Slow day one, fast every day after." A `fingerprint_backfill: trust` opt-out is sketched in the design doc but not in this PR. #### Test plan - [x] `uv run ruff check` — clean on all 8 touched files - [x] `uv run pytest test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed - [x] Broader unit-test suite — no regressions in anything I touched - [ ] Manual smoke against a real S3 bucket — configure a connector, run sync twice, expect the second sync to log `bypassed=N, fetched=0` and no `GetObject` calls in CloudTrail / bucket access logs - [ ] Manual smoke with `reindex=1` — confirm the full re-download path still works ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-09 20:03:56 +08:00
Liu An	57b24be6d6	Docs: Update version references to v0.25.2 in READMEs and docs (#14731 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.25.1 to v0.25.2 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-05-09 19:06:05 +08:00
euvre	f4b8f53b6d	Fix: restore embedding model switching for datasets with existing chunks (#14732 ) ### What problem does this PR solve? ## Problem During the REST API refactoring (#13690), the `/api/v2/kb/check_embedding` endpoint was removed and never migrated to the new RESTful structure. The frontend was pointed to the `/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a completely different function). Additionally, a hard guard was introduced that rejects any `embd_id` change when `chunk_num > 0`, making it impossible to switch embedding models on datasets with existing chunks. ## Root Cause 1. Missing endpoint: The old `check_embedding` logic (sample random chunks, re-embed with the new model, compare cosine similarity) was not carried over to the new REST API service layer. 2. Wrong frontend URL: `checkEmbedding` in `api.ts` pointed to `/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated check endpoint. 3. Overly restrictive guard: `dataset_api_service.py` line 310 blocked all `embd_id` updates when `chunk_num > 0`. This check did not exist in the pre-refactor code — it was incorrectly introduced during the refactor. 
## Changes ### Backend - `api/apps/services/dataset_api_service.py` - Remove the `chunk_num > 0` hard guard on `embd_id` updates - Add `check_embedding()` service function: samples random chunks, re-embeds them with the candidate model, computes cosine similarity, returns compatibility result (avg ≥ 0.9 = compatible) - Add `import re` for the `_clean()` helper - `api/apps/restful_apis/dataset_api.py` - Add `POST /datasets/<dataset_id>/embedding/check` endpoint following the new REST API conventions - Clean up unused top-level imports (`random`, `re`, `numpy`) ### Frontend - `web/src/utils/api.ts` - Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` → `/datasets/${datasetId}/embedding/check` ### Tests - `test/testcases/test_http_api/test_dataset_management/test_update_dataset.py` - Update `test_embedding_model_with_existing_chunks` to assert success (`code == 0`) instead of expecting the old `102` error - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py` - Update `test_update_route_branch_matrix_unit` to assert `RetCode.SUCCESS` when updating `embd_id` on a chunked dataset, replacing the old `chunk_num` error assertion ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-09 18:48:57 +08:00
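The compatibility rule described for `check_embedding()` (average cosine similarity of at least 0.9 between stored and re-embedded sample chunks) can be illustrated as below. This is a sketch in pure Python, not the service function itself: `cosine` and `embeddings_compatible` are hypothetical names, and the sampling and re-embedding steps are assumed to have happened already.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embeddings_compatible(stored, candidate, threshold=0.9):
    """Compare stored vectors with vectors from the candidate model."""
    if not stored or len(stored) != len(candidate):
        return False
    sims = [cosine(s, c) for s, c in zip(stored, candidate)]
    return sum(sims) / len(sims) >= threshold
```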
akie	c11650bb4c	Fix IDOR: Add permission checks to file ancestry endpoints (#14725) Close #14292 ## Issue File ancestry endpoints return folder metadata without validating tenant permissions, allowing any authenticated user to query arbitrary `file_id` values across tenant boundaries. ## Affected Endpoints - `GET /v1/file/parent_folder?file_id={file_id}` - `GET /v1/file/all_parent_folder?file_id={file_id}` - `GET /api/v1/files/{id}/ancestors` ## Root Cause These endpoints skip the permission check that other file operations (Delete, Download, Move) perform. ## Expected Permission Check All file operations should follow this 3-step validation: - Check `file.tenant_id` - Check if `user_id` belongs to this tenant (via the `user_tenant` join table) - Check KB permission type (team permission) Code reference: This is implemented in `checkFileTeamPermission()` and used by Delete/Download/Move, but missing from GetParentFolder/GetAllParentFolders. ## Reproduction ```bash # User B (tenant: BBB) accessing User A's file (tenant: AAA) curl -H "Authorization: Bearer USER_B_TOKEN" \ "http://localhost:9384/v1/file/parent_folder?file_id=AAA_FILE_123" # Result: Returns User A's folder metadata ❌ # Expected: "No authorization." ✅ ``` ## Fix Pass `userID` from the handler to the service and call `checkFileTeamPermission()`, the same as the Download/Delete/Move handlers. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 16:03:23 +08:00
jony376	3b6eeabb09	Fix: private dataset authorization bypass in shared dataset access checks (#14645 ) ### Related issues Closes #14644 ### What problem does this PR solve? This PR fixes an authorization bug where datasets marked with `permission = me` could still be accessed by other members of the same tenant through APIs that relied on `KnowledgebaseService.accessible()` or `DocumentService.accessible()`. Before this change, those shared access helpers only checked tenant membership and did not enforce the dataset's permission mode. As a result, a non-owner who knew a private `dataset_id` could still reach downstream document and chunk operations even though the dataset was intended to be owner-only. This change updates the central access checks so that: - dataset owners always retain access - joined tenant members only get access when the dataset permission is `TEAM` - private datasets with `permission = me` remain inaccessible to non-owners - document-level access follows the same dataset permission rules The PR also adds regression coverage for private-vs-team dataset access behavior. 
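The access rules listed above can be condensed into a small predicate. This is a sketch under assumptions, not the updated `KnowledgebaseService.accessible()`: the dictionary fields (`created_by`, `tenant_id`, `permission`) and the lowercase permission values are hypothetical simplifications of the real model.

```python
def dataset_accessible(dataset, user_id, joined_tenant_ids):
    """Owner always passes; tenant members pass only for team datasets."""
    if dataset["created_by"] == user_id:
        return True
    if dataset["permission"] != "team":
        # permission = me: private to the owner, even inside the same tenant.
        return False
    return dataset["tenant_id"] in joined_tenant_ids
```

The key change relative to the old behaviour is the middle check: tenant membership alone is no longer sufficient.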
### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - Added `test/unit_test/api/db/services/test_dataset_access_permissions.py` - Attempted to run: `python -m pytest test\\unit_test\\api\\db\\services\\test_dataset_access_permissions.py -q` - Local execution in this workspace is currently blocked during test collection because the environment is missing the `strenum` dependency --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: jony376 <jony376@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com> Co-authored-by: d 🔹 <liusway405@gmail.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Magicbook1108 <newyorkupperbay@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: sxxtony <166789813+sxxtony@users.noreply.github.com> Co-authored-by: sxxtony <sxxtony@users.noreply.github.com> Co-authored-by: Baki Burak Öğün <63836730+bakiburakogun@users.noreply.github.com> Co-authored-by: bakiburakogun <bakiburakogun@users.noreply.github.com> Co-authored-by: Panda Dev <56657208+pandadev66@users.noreply.github.com> Co-authored-by: Haruko386 <tryeverypossible@163.com> Co-authored-by: D2758695161 <13510221939@163.com> Co-authored-by: Hunter <hunter@yitong.ai> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: buua436 <sz_buua@foxmail.com> Co-authored-by: web-dev0521 <jasonpette1783@gmail.com> Co-authored-by: Tim Wang <38489718+wanghualoong@users.noreply.github.com> Co-authored-by: wanghualoong <wanghualoong@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: qinling0210 <88864212+qinling0210@users.noreply.github.com> Co-authored-by: dale053 <star05223@outlook.com>	2026-05-09 13:30:14 +08:00
Ricardo-M-L	1046042e01	fix(llm): replace mutable default `gen_conf={}` with None + defensive copy (#14566) ### What 19 methods across `rag/llm/chat_model.py` and `rag/llm/cv_model.py` declare `gen_conf={}` (or `gen_conf: dict = {}`) as a parameter default and then mutate `gen_conf` in place, typically `del gen_conf["max_tokens"]`, `gen_conf["penalty_score"] = ...`, or `gen_conf.pop(...)` as part of provider-specific normalization. ### The two bugs in this pattern 1. Mutable default argument (Python footgun). Python evaluates default values once at function-definition time, so the single `{}` dict is shared across every caller that doesn't pass `gen_conf`. The first such call's mutations leak into the default seen by every subsequent call. ```python # Before def chat_streamly(self, system, history, gen_conf={}, **kwargs): if "max_tokens" in gen_conf: del gen_conf["max_tokens"] # mutates the SHARED default dict ... ``` After call N with `max_tokens` set, call N+1 that omits `gen_conf` no longer sees `max_tokens`, even though the caller never touched it. 2. Caller-dict pollution. When the caller does pass a `gen_conf` dict, the same in-place mutations modify the caller's dict. A reused `gen_conf` (very common for chat-loop callers that build the config once and pass it on every turn) silently loses `max_tokens`, `presence_penalty`, etc. after the first round. ### The fix In every affected method: - Change `gen_conf={}` (or `gen_conf: dict = {}`) → `gen_conf=None`. - Add `gen_conf = dict(gen_conf or {})` as the first statement of the body so all subsequent mutations operate on a fresh local copy. ```python # After def chat_streamly(self, system, history, gen_conf=None, **kwargs): gen_conf = dict(gen_conf or {}) if "max_tokens" in gen_conf: del gen_conf["max_tokens"] # local copy, safe ... ``` This is byte-for-byte identical provider-side behavior for callers that already pass a fresh `gen_conf` per call. The new `dict(...)` copy is O(small constant) per call. 
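Both failure modes and the fix can be observed in a few lines. The functions below are hypothetical reductions written for illustration, not methods from `chat_model.py`:

```python
def normalize_shared(gen_conf={}):
    # Buggy pattern: the default dict is created once and shared by all calls.
    gen_conf["penalty_score"] = 1.0
    return gen_conf


def normalize_copy(gen_conf=None):
    # Fixed pattern: always work on a fresh local copy.
    gen_conf = dict(gen_conf or {})
    gen_conf.pop("max_tokens", None)
    gen_conf["penalty_score"] = 1.0
    return gen_conf
```

Two calls to `normalize_shared()` return the very same dict object, so a mutation made by one call is visible to the next. `normalize_copy()` returns a new dict each time and leaves a caller-supplied dict untouched.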
### Files changed - `rag/llm/chat_model.py`: 17 methods - `rag/llm/cv_model.py`: 2 methods ### Tests Adds `test/unit_test/rag/llm/test_gen_conf_no_mutable_default.py`, an `ast`-based regression guard that walks both modules and asserts no parameter named `gen_conf` ever has a mutable literal (`{}` or `[]`) as its default. The test caught five additional `gen_conf: dict = {}` sites that an initial `gen_conf={}` text grep had missed (annotated parameters with whitespace), and would fail again if the pattern is ever reintroduced. ``` $ pytest test/unit_test/rag/llm/test_gen_conf_no_mutable_default.py -v ============================== 3 passed in 0.04s =============================== ``` `ruff check` passes on all touched files. ### Notes - This PR is intentionally focused on just the `gen_conf` default + copy fix. There's a related (but separate) `history.insert(0, ...)` pattern in the same files that mutates the caller's history list in 12 places, left for a follow-up so this PR stays mechanical and easy to review. ### Latest revision (`700bb54a7`): addresses CodeRabbit review - Type annotation: `gen_conf: dict = None` → `gen_conf: dict | None = None` (5 occurrences in `chat_model.py`). The old annotation was a static-checker mismatch since `None` isn't a `dict`. - Regression test: the AST check accessed `default.keys` directly. `ast.List` has no `.keys` attribute, so a future `gen_conf=[]` would crash with `AttributeError` instead of being caught. Use `getattr` for both `.keys` (Dict) and `.elts` (List). Manually verified the updated check correctly catches both `gen_conf={}` and `gen_conf=[]` while ignoring `gen_conf=None` and non-empty literals. --------- Co-authored-by: Ricardo <ricardo@example.com>	2026-05-09 13:11:44 +08:00
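The `ast`-based regression guard described in #14566 could look roughly like the sketch below. It is not the test shipped in the PR: the function name `find_mutable_gen_conf_defaults` is hypothetical, and this variant flags any dict, list, or set literal (empty or not), which is stricter than the check the commit message describes.

```python
import ast


def find_mutable_gen_conf_defaults(source: str) -> list[tuple[str, int]]:
    """Return (function name, line number) for every function whose
    `gen_conf` parameter defaults to a dict, list, or set literal.
    Hypothetical helper, written for illustration only."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        positional = args.posonlyargs + args.args
        # `defaults` lines up with the tail of the positional parameters.
        pairs = list(zip(positional[len(positional) - len(args.defaults):], args.defaults))
        # Keyword-only parameters carry None in kw_defaults when they are required.
        pairs += [(a, d) for a, d in zip(args.kwonlyargs, args.kw_defaults) if d is not None]
        for arg, default in pairs:
            if arg.arg == "gen_conf" and isinstance(default, (ast.Dict, ast.List, ast.Set)):
                offenders.append((node.name, node.lineno))
    return offenders


SAMPLE = '''
def bad(self, system, history, gen_conf={}, **kwargs): pass
def annotated(self, gen_conf: dict = {}): pass
def good(self, gen_conf=None): pass
async def kw_only(self, *, gen_conf=[]): pass
'''

offenders = find_mutable_gen_conf_defaults(SAMPLE)
print([name for name, _ in offenders])
```

Because the check works on the parsed signature rather than on text, the annotated form `gen_conf: dict = {}` is caught the same way as the bare `gen_conf={}`, which is the gap the commit message says a plain grep missed.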
Xing Hong	c428187350	Fix: validate kb_ids as UUIDs before SQL interpolation in use_sql (#14087 ) ### What problem does this PR solve? The use_sql() function in dialog_service.py constructed SQL WHERE clauses and Infinity table names by directly interpolating kb_id values using Python f-strings, with no validation of the input values. A malformed or maliciously crafted kb_id (introduced via a compromised admin account or a separate injection vector) could alter the structure of the generated SQL query, potentially leading to unauthorized data access or data manipulation. This PR adds strict UUID format validation for all kb_id values before they are interpolated into any SQL string, causing requests with invalid IDs to fail fast with a ValueError rather than executing a tampered query. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2026-05-09 10:52:06 +08:00
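A minimal sketch of the fail-fast validation that #14087 describes, assuming IDs are UUIDs that the standard `uuid` module can parse. The helper name `validate_kb_ids` is hypothetical and is not taken from `dialog_service.py`.

```python
import uuid


def validate_kb_ids(kb_ids):
    """Return kb_ids normalised to 32-character hex, or raise ValueError.
    Hypothetical helper: only the parsed-and-reformatted value is returned,
    so nothing from the raw input reaches the SQL string."""
    validated = []
    for kb_id in kb_ids:
        try:
            validated.append(uuid.UUID(str(kb_id)).hex)
        except ValueError as exc:
            raise ValueError(f"invalid kb_id: {kb_id!r}") from exc
    return validated


safe_ids = validate_kb_ids(["12345678-1234-5678-1234-567812345678"])
# Only validated values are interpolated into the WHERE clause.
where = " OR ".join(f"kb_id = '{kb_id}'" for kb_id in safe_ids)
print(where)
```

Returning the re-serialised `.hex` form rather than the original string is a deliberate choice in this sketch: `uuid.UUID` accepts several input spellings, and normalising ensures the interpolated text contains only hex digits.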
Wang Qi	7d35e40c7b	Refactor: Allow searching multiple datasets (#14685) ### What problem does this PR solve? Allow searching across multiple datasets: 1. Support `/datasets/search`. 2. Remove `/graph/search`; use `/graph` instead. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-05-08 19:01:35 +08:00
dale053	26d70189b6	fix: enforce tenant-scoped authorization for chatbot SDK endpoints (#14592 ) Closes #14590 ## Self Checks - [x] I have searched for existing issues [search for existing issues](https://github.com/infiniflow/ragflow/issues), including closed ones. - [x] I confirm that I am using English to submit this report ([Language Policy](https://github.com/infiniflow/ragflow/issues/5910)). - [x] Non-english title submitions will be closed directly ( 非英文标题的提交将会被直接关闭 ) ([Language Policy](https://github.com/infiniflow/ragflow/issues/5910)). - [x] Please do not modify this template :) and fill in all the required fields. ## RAGFlow workspace code commit ID `a1b2c3d4e5f67890123456789abcdef12345678` ## RAGFlow image version `0.13.1` ## Other environment information - Hardware parameters: N/A - OS type: Linux 6.17.0-22-generic - Others: API key authentication via `Authorization: Bearer <token>` ## Actual behavior The chatbot API endpoints: - `POST /chatbots/<dialog_id>/completions` - `GET /chatbots/<dialog_id>/info` validate only that the bearer token exists in `APIToken`, but do not verify that `dialog_id` belongs to the same tenant as that token. Current flow (simplified): 1. Route extracts bearer token and checks `APIToken.query(beta=token)`. 2. If token exists, request is accepted. 3. Downstream service resolves dialog globally by ID (`DialogService.get_by_id(dialog_id)` in `conversation_service.py`). 4. No tenant ownership check is enforced for `dialog_id`. Impact: Any user with a valid API key can attempt arbitrary `dialog_id` values and access/invoke chatbots outside their own tenant boundary if IDs are known/guessed/leaked. 
Security classification: - Vulnerability class: Broken Access Control (IDOR, OWASP Top 10 A01) - Severity recommendation: Critical - Exploit prerequisite: any valid API key + discoverable target `dialog_id` ## Expected behavior Requests to `/chatbots/<dialog_id>/completions` and `/chatbots/<dialog_id>/info` must be authorized only when: 1. bearer token is valid, and 2. `dialog_id` belongs to the same `tenant_id` as the token. Otherwise, reject with authorization failure (e.g., 403 or 404-equivalent policy). ## Steps to reproduce 1. Prepare two tenants: - Tenant A with API key `TOKEN_A` - Tenant B with chatbot `dialog_id = DIALOG_B` 2. Send request from Tenant A to Tenant B chatbot completion endpoint: ```bash curl -X POST "https://<host>/chatbots/DIALOG_B/completions" \ -H "Authorization: Bearer TOKEN_A" \ -H "Content-Type: application/json" \ -d '{"question":"hello","stream":false}' ``` 3. Observe request is processed (or reaches dialog resolution) without tenant ownership rejection. 4. Repeat against info endpoint: ```bash curl -X GET "https://<host>/chatbots/DIALOG_B/info" \ -H "Authorization: Bearer TOKEN_A" ``` 5. Observe the same missing ownership enforcement. ## Additional information Affected code paths: - `api/apps/sdk/session.py` - `chatbot_completions(dialog_id)` - `chatbots_inputs(dialog_id)` - `api/db/services/conversation_service.py` - `async_iframe_completion(...)` uses global dialog lookup Suggested fix: 1. In both chatbot endpoints: - Resolve `tenant_id = objs[0].tenant_id` from validated token. - Fetch dialog with tenant-scoped query (`DialogService.query(id=dialog_id, tenant_id=tenant_id)`). - Reject if dialog is not found/owned by tenant. 2. Defense in depth: - Require and enforce `tenant_id` in service-layer dialog resolution for external flows. - Avoid global `get_by_id(dialog_id)` where user-controlled dialog IDs are reachable. 3. Add regression tests: - Positive: same-tenant token + dialog succeeds. 
- Negative: cross-tenant token + dialog fails for both endpoints.	2026-05-08 18:00:18 +08:00
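The suggested fix in #14592 (resolve the tenant from the token, then look up the dialog scoped to that tenant) can be sketched as follows. The in-memory `TOKENS` and `DIALOGS` tables, the `Dialog` class, and `resolve_dialog` are hypothetical stand-ins for `APIToken` and `DialogService`, not RAGFlow code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialog:
    id: str
    tenant_id: str


# Hypothetical stand-ins for the APIToken and Dialog tables.
TOKENS = {"TOKEN_A": "tenant_a", "TOKEN_B": "tenant_b"}
DIALOGS = {
    "DIALOG_A": Dialog("DIALOG_A", "tenant_a"),
    "DIALOG_B": Dialog("DIALOG_B", "tenant_b"),
}


class AuthorizationError(Exception):
    pass


def resolve_dialog(token: str, dialog_id: str) -> Dialog:
    """Return the dialog only if it belongs to the token's tenant."""
    tenant_id = TOKENS.get(token)
    if tenant_id is None:
        raise AuthorizationError("invalid API key")
    dialog = DIALOGS.get(dialog_id)
    # Missing and foreign dialogs get the same error, so the response
    # does not reveal whether a dialog ID exists in another tenant.
    if dialog is None or dialog.tenant_id != tenant_id:
        raise AuthorizationError("dialog not found")
    return dialog


print(resolve_dialog("TOKEN_B", "DIALOG_B").id)
```

The positive and negative regression cases listed in the report map directly onto this function: a same-tenant token and dialog succeed, and `TOKEN_A` with `DIALOG_B` raises.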
web-dev0521	a32ebf32bd	Fix: handle null document_metadata in kb_prompt to prevent citation crash (#14651 ) (#14666 ) ### What problem does this PR solve? Fixes #14651. `kb_prompt()` in `rag/prompts/generator.py` crashes with `AttributeError: 'NoneType' object has no attribute 'items'` during agent citation generation when a retrieved chunk carries `document_metadata: null`. Root cause. The crash happens at `rag/prompts/generator.py:132-133`: ```python meta = ck.get("document_metadata", {}) for k, v in meta.items(): ``` `dict.get(key, default)` only returns the default when the key is missing. When the key is present with an explicit `None` value, `.get()` returns `None`, and `.items()` crashes. How the chunk gets `None`. It's a round-trip inside RAGFlow itself, not bad input from retrieval: 1. The agent stores retrieved chunks via `agent/canvas.py:814`, which routes them through `chunks_format()`. 2. `rag/prompts/generator.py:61` canonicalizes the field with `chunk.get("document_metadata")` (no default), so chunks without metadata become `{"document_metadata": None, ...}`. 3. `agent/component/agent_with_tools.py:314` feeds those canonicalized chunks back into `kb_prompt()` for citation generation, and `.get("document_metadata", {})` no longer protects us. Fix. One-line change at `rag/prompts/generator.py:132`: use `ck.get("document_metadata") or {}` so an explicit `None` is also coerced to `{}`. The line-61 `None` is intentionally part of the API/UI contract — the frontend handles it via optional chaining (`web/src/components/markdown-content/index.tsx:184`, `web/src/pages/next-search/search-view.tsx:217`) — so the fix belongs at the consumer, not the producer. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-08 16:54:33 +08:00
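The `dict.get` behaviour behind #14666 is easy to reproduce in isolation. The two helper names below are hypothetical; only the expressions inside them come from the commit message.

```python
def read_metadata_before(ck: dict):
    # The default is used only when the key is absent.
    return ck.get("document_metadata", {})


def read_metadata_after(ck: dict) -> dict:
    # `or {}` also covers a key that is present with an explicit None.
    return ck.get("document_metadata") or {}


chunk = {"content": "...", "document_metadata": None}

print(read_metadata_before(chunk))   # None, so .items() would raise AttributeError
print(read_metadata_after(chunk))    # {}
for key, value in read_metadata_after(chunk).items():
    print(key, value)                # loop body never runs for an empty dict
```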
web-dev0521	d51fb88573	Fix: enforce tenant authorization on document download endpoint (#14618 ) (#14625 ) ### What problem does this PR solve? Closes #14618. The `GET /v1/document/get/<doc_id>` endpoint in `api/apps/document_app.py` was protected only by `@login_required` and called `DocumentService.get_by_id(doc_id)` without verifying that the document's knowledge base belonged to the requesting user's tenant. Any authenticated user who knew (or guessed) a document ID could download files belonging to any other tenant — a cross-tenant IDOR. This PR adds a `DocumentService.accessible(doc_id, current_user.id)` check before serving the file. The helper already exists and joins `Document` → `Knowledgebase` → `UserTenant` to verify the requesting user belongs to the tenant that owns the document's KB. The same pattern is already used by `api/apps/restful_apis/document_api.py` and mirrors the tenant scoping in the SDK route at `api/apps/sdk/doc.py`. The check returns the existing `"Document not found!"` error for both non-existent and inaccessible documents, so attackers cannot use the response to enumerate valid doc IDs across tenants. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Other (please describe): Security fix (cross-tenant IDOR / authorization bypass)	2026-05-08 14:24:03 +08:00
jony376	6547751936	Fix: missing authorization checks in `/files/link-to-datasets` (#14649 ) ### Related issues Closes #14648 ### What problem does this PR solve? This PR fixes an authorization flaw in `POST /files/link-to-datasets`. Before this change, the endpoint only checked whether the supplied `file_ids` and `kb_ids` existed. It did not verify whether the authenticated user was actually allowed to access those files or target datasets. As a result, an authenticated user who knew valid IDs could relink another user's files to arbitrary datasets. This was especially risky because the relinking flow is state-changing: the background worker removes existing file-document mappings and then recreates documents under the attacker-supplied dataset IDs. This change makes the route enforce the same permission model already used by nearby file and document operations: - each resolved file must pass `check_file_team_permission(...)` - each target dataset must pass `check_kb_team_permission(...)` - authorization is enforced before scheduling background relinking work ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - Added regression coverage in `test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py` - Covered: - unauthorized file access is rejected - unauthorized dataset access is rejected - existing success path still returns immediately after scheduling background work - Attempted to run: - `python -m pytest test\\testcases\\test_web_api\\test_file_app\\test_file2document_routes_unit.py -q` - Local execution in this workspace is currently blocked by missing test dependencies during bootstrap, including `ragflow_sdk` --------- Co-authored-by: jony376 <jony376@gmail.com>	2026-05-08 13:49:23 +08:00
buua436	f703169117	Refa: migrate document preview/download to RESTful API (#14633 ) ### What problem does this PR solve? migrate document preview/download to RESTful API ### Type of change - [x] Refactoring	2026-05-08 13:26:13 +08:00
sxxtony	59c35100c5	Perf: push metadata filters down to Elasticsearch (#14576 ) ### What problem does this PR solve? Fixes #14412. `common.metadata_utils.meta_filter` evaluates user-defined metadata conditions in Python after `DocMetadataService.get_flatted_meta_by_kbs` loads the entire `meta_fields` table into memory. Past a few thousand documents per knowledge base this becomes a memory bottleneck and a wasted ES round-trip — every filter request currently fetches up to 10000 metadata rows even when the resulting `doc_ids` list is tiny. This PR adds an ES push-down path that translates the same filter language into a `bool` query and returns just the matching document IDs. Changes - `common/metadata_es_filter.py` (new): pure-Python translator from the RAGflow filter list to ES DSL. Covers every operator the in-memory path supports (`=`, `≠`, `>`, `<`, `≥`, `≤`, `in`, `not in`, `contains`, `not contains`, `start with`, `end with`, `empty`, `not empty`) with `case_insensitive: true` on `prefix` and `wildcard` for parity with the existing lower-cased Python comparisons. User wildcard metacharacters are escaped before being injected into `wildcard` patterns. Negative operators (`≠`, `not in`, `not contains`, ranges) are wrapped with an `exists` guard so they do not accidentally match documents missing the key, matching the legacy `if k not in metas` behaviour. - `api/db/services/doc_metadata_service.py`: new `DocMetadataService.filter_doc_ids_by_meta_pushdown(kb_ids, filters, logic)` that returns the doc IDs ES matched, or `None` to signal the caller should fall back to the in-memory path. Returns `None` when the active doc store is Infinity (`meta_fields` is a JSON column, not a dotted-object mapping), when any filter cannot be expressed in DSL (`UnsupportedMetaFilter`), or when the ES request or metadata index lookup errors. - `common/metadata_utils.py`: `apply_meta_data_filter` accepts an optional `kb_ids` argument. 
When supplied, conditions go through push-down first via a new `_try_meta_pushdown` helper; on `None` the function falls back to the original `meta_filter` call. Default behaviour is unchanged for callers that don't pass `kb_ids`. - Updated all four callers (`agent/tools/retrieval.py`, `api/db/services/dialog_service.py` ×2, `api/apps/services/dataset_api_service.py`, `api/apps/sdk/session.py`) to forward `kb_ids` so the push-down path is exercised in production. - `test/unit_test/common/test_metadata_es_filter.py` (new): 35 unit tests covering every operator's DSL shape, value coercion (`ast.literal_eval`, lowercasing, ISO-date pass-through), wildcard escaping, OR-logic wrapping that protects negative clauses, and the doc-ID extractor. Behaviour preserved - The in-memory `meta_filter` is untouched and still services every fallback case (Infinity backend, unknown operators, ES outages). - The eligibility / credibility / issue-multiplier semantics described in the LLM-driven `auto` and `semi_auto` modes still hand the LLM the full in-memory `metas` dict to choose conditions from. Only the evaluation of those generated conditions is pushed down. - Existing tests in `test/unit_test/common/test_metadata_filter_operators.py` continue to pass (14/14). Test plan - `pytest test/unit_test/common/test_metadata_es_filter.py` — 35 passed. - `pytest test/unit_test/common/test_metadata_filter_operators.py` — 14 passed. - `ruff check` clean on every modified file. - Reviewer please validate the ES query shapes against a live cluster — particularly `case_insensitive` on `wildcard` and `prefix` (requires ES 7.10+) and the `exists` + `must_not` pairing for `≠`. Notes - The first cut caps each push-down request at 10000 results, matching the existing `get_flatted_meta_by_kbs` limit, and logs a warning when the cap is hit. A `search_after` follow-up would let us drop the cap entirely once the push-down path is validated. 
- Operator parity with the in-memory path is exact for the canonical unicode operators (`≥`, `≤`, `≠`) used internally; the ASCII aliases (`>=`, `<=`, `!=`) are normalised by `convert_conditions` before they reach the translator. ### Type of change - [x] Performance Improvement --------- Co-authored-by: sxxtony <sxxtony@users.noreply.github.com>	2026-05-07 21:23:43 +08:00
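To illustrate the translation #14576 describes, here is a sketch covering three of the operators. It is not the code in `common/metadata_es_filter.py`: the `{"key", "op", "value"}` filter shape, the `meta_fields.<key>` field path, and the function names are assumptions, and `UnsupportedMetaFilter` is redefined locally for the example.

```python
class UnsupportedMetaFilter(Exception):
    """Local stand-in: signals that the caller should fall back to in-memory filtering."""


def escape_wildcard(text: str) -> str:
    """Backslash-escape the wildcard metacharacters (*, ?, \\) in user input."""
    out = []
    for ch in text:
        if ch in "\\*?":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def to_es_clause(flt: dict) -> dict:
    """Translate one filter into an Elasticsearch query clause (assumed shapes)."""
    field = f"meta_fields.{flt['key']}"  # assumed field path
    op, value = flt["op"], flt.get("value")
    exists = {"exists": {"field": field}}
    if op == "=":
        return {"term": {field: {"value": value}}}
    if op == "≠":
        # The exists guard keeps documents that lack the key from matching.
        return {"bool": {"must": [exists],
                         "must_not": [{"term": {field: {"value": value}}}]}}
    if op == "contains":
        pattern = f"*{escape_wildcard(str(value))}*"
        return {"wildcard": {field: {"value": pattern, "case_insensitive": True}}}
    raise UnsupportedMetaFilter(op)


print(to_es_clause({"key": "author", "op": "≠", "value": "bob"}))
```

Raising on anything the translator does not recognise mirrors the fallback contract in the commit message: the caller treats the exception as a signal to use the in-memory `meta_filter` instead.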
Magicbook1108	c29335cbff	Feat: support local provider for code exec component & remove some outdated models (#14637 ) ### What problem does this PR solve? Feat: support local provider for code exec component & remove some outdated models ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-07 21:23:13 +08:00
Jin Hai	94324afee9	Go: fix auth issue in hybrid mode (#14611) ### What problem does this PR solve? Because the secret-key get and set logic was updated, the Go server needs the same update. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:14:22 +08:00
Wang Qi	c50028b1f3	Fix team member cannot edit agent (#14612) ### What problem does this PR solve? Follow-up to https://github.com/infiniflow/ragflow/pull/14602, fixing the case where a team member cannot edit an agent. New behavior: team members may do everything except delete. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-07 15:09:13 +08:00
Magicbook1108	911671cef0	Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615 ) ### What problem does this PR solve? Feat: enable sync deleted files for RDBMS & fix remove last file issue ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-05-07 13:31:05 +08:00
Jin Hai	1d0519d025	Fix secret key inconsistency across the RAGFlow servers (#14591) ### What problem does this PR solve? Consider two API servers, A and B, and a Redis server. If A and Redis restart, B keeps holding the obsolete secret key, which leads to errors. TODO: `app.config['SECRET_KEY']` and `app.secret_key` still hold the obsolete secret key. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 10:10:02 +08:00
Wang Qi	f32034e83e	Refactor: completion -> completions (#14584) ### What problem does this PR solve? Keep only `/completions`; deprecate `/completion`. ### Type of change - [x] Refactoring	2026-05-06 17:19:22 +08:00
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. 
When `False`, the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. 
Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. 
## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
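The bounded insert pipeline from section 5 of #14238 can be sketched with a plain `asyncio.Semaphore`. The name `insert_chunks_bounded` and the defaults (64 per batch, 4 in flight) come from the commit message; the signature, the `insert_batch` callback, and the fake insert used for the demo are assumptions, and the retry and timeout handling the PR mentions is left out.

```python
import asyncio


async def insert_chunks_bounded(chunks, insert_batch, bulk_size=64, concurrency=4):
    """Insert chunks in batches of `bulk_size`, with at most `concurrency`
    batches in flight. Sketch only: the signature is assumed."""
    semaphore = asyncio.Semaphore(concurrency)

    async def _insert_one(batch):
        async with semaphore:  # blocks while `concurrency` batches are running
            await insert_batch(batch)

    batches = [chunks[i:i + bulk_size] for i in range(0, len(chunks), bulk_size)]
    await asyncio.gather(*(_insert_one(batch) for batch in batches))
    return len(batches)


# Demo with a fake doc-store insert that records peak concurrency.
stats = {"in_flight": 0, "peak": 0, "inserted": 0}


async def fake_insert(batch):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    stats["inserted"] += len(batch)
    stats["in_flight"] -= 1


batch_count = asyncio.run(insert_chunks_bounded(list(range(1077)), fake_insert))
print(batch_count, stats["peak"], stats["inserted"])
```

With these defaults the number of documents in flight is bounded by `bulk_size * concurrency`, which is the 256 figure quoted in the commit message.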
Sebastion	7e83c5f421	fix: authorize beta document downloads by tenant (#14496 ) ## Summary This fixes a missing authorization check in the beta API document download endpoint: - CWE: CWE-862 (Missing Authorization) - Severity: Medium - Affected route/file: `GET /api/v1/documents/<document_id>` in `api/apps/sdk/doc.py` - Data flow: the route reads a bearer beta API token, resolves the token with `APIToken.query(beta=token)`, accepts `document_id` directly from the URL, loads the document with `DocumentService.query(id=document_id)`, and then fetches the backing object through `File2DocumentService.get_storage_address()` / `settings.STORAGE_IMPL.get()`. Before this change, that flow verified that the API token was valid, but it did not verify that the token's tenant owned the document's knowledge base. A caller with any valid beta API token and a known document ID could therefore reach storage for a document belonging to another tenant. ## Fix The endpoint now takes the tenant ID from the resolved API token and checks the document's knowledge base with: ```python KnowledgebaseService.query(id=doc[0].kb_id, tenant_id=tenant_id) ``` If the knowledge base is not owned by the token tenant, the request returns an access error before any storage lookup occurs. This mirrors the tenant-scoped ownership checks used by the dataset-scoped document download path and keeps the patch small. ## Tests Added unit coverage for `download_doc()` to assert that: - the beta token tenant ID is used in the knowledge-base ownership lookup; - cross-tenant access returns `You do not have access to this document.`; - storage resolution is not called before tenant authorization succeeds; - the existing same-tenant empty-file and successful-download paths still run after the authorization gate passes. I also verified the final patch is limited to `api/apps/sdk/doc.py` and the related document SDK route unit test. 
A local `pytest` invocation could not complete in this checkout because the shared test fixture attempts to log in to a RAGFlow server at `127.0.0.1:9380`, which was not running in the local environment. ## Security analysis This is exploitable when an attacker has a valid beta API token for their own tenant and obtains or guesses a document ID from another tenant. The token alone should not grant access to other tenants' files, but the direct document route previously authorized only the token itself and not the requested resource. The new tenant-scoped knowledge-base check binds the requested document back to the token tenant before storage is accessed, preventing cross-tenant document downloads through this endpoint. Before submitting, we attempted to disprove this by checking whether existing dataset-scoped routes, token validation, or framework protections already enforced ownership. They do not apply to this direct document-ID route: it bypassed the dataset path parameter and used only `DocumentService.query(id=document_id)` before reading storage. cc @lewiswigmore	2026-05-06 14:55:41 +08:00
Shiyao Huang	406b36a452	fix(#14389 ): normalize list metadata values for in filters (#14410 ) ## Summary - normalize string items for list-valued metadata filters in `meta_filter` - fix `in` / `not in` case asymmetry when document metadata is lowercased but filter list values are not - add regression tests that cover the original issue scenario using uppercase list values ## Validation - `PYTHONPATH=external/ragflow pytest external/ragflow/test/unit_test/common/test_metadata_filter_operators.py -q` ## Notes - I commented on #14389 before opening this PR to claim the issue. - The new tests use `value=["F2", "F11"]` so they fail on the old implementation and pass with this fix. - This also benefits other non-comparison operators that flow through the same normalization path. Co-authored-by: copizza <copizza@users.noreply.github.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-05-06 14:28:25 +08:00
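The case asymmetry fixed in #14410 comes down to lowercasing one side of an `in` comparison but not the other. A minimal sketch, assuming string metadata is compared case-insensitively; `normalize` and `matches_in` are hypothetical names, not functions from `meta_filter`.

```python
def normalize(value):
    """Lowercase strings, recursing into lists; leave other types alone."""
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, list):
        return [normalize(item) for item in value]
    return value


def matches_in(doc_value, filter_values, negate=False):
    """Evaluate `in` / `not in` with both sides normalised the same way."""
    hit = normalize(doc_value) in normalize(filter_values)
    return not hit if negate else hit


# Document metadata was lowercased at index time; the filter list was not.
print(matches_in("f2", ["F2", "F11"]))
print(matches_in("f3", ["F2", "F11"], negate=True))
```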
Attili-sys	24af0875e5	Feat/configurable metadata display (#13464 ) ### What problem does this PR solve? Currently, RAGFlow's Search and Chat interfaces display only raw vectorized text chunks during retrieval, without contextual information about their source documents. Users cannot see document titles, page numbers, upload dates, or custom metadata fields that would help them understand and trust the retrieved results. This PR introduces an optional metadata display feature that enriches retrieved chunks with document-level metadata in both the Search tab and Chatbot interface. Key improvements: - Search results: Display document metadata as styled badges beneath chunk snippets - Chat citations: Show metadata in citation popovers and reference lists for better source context - LLM context: Metadata is injected into the LLM prompt to enable more accurate, citation-aware responses - External API support: Applications using RAGFlow's SDK retrieval endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in via request parameters - User control: Multi-select dropdown UI allows users to choose which metadata fields to display Implementation approach: - ✅ Reuses existing `DocMetadataService` infrastructure (no new database tables or indices) - ✅ Settings stored in existing JSON configuration fields (`search_config.reference_metadata`, `prompt_config.reference_metadata`) - ✅ No database migrations required - ✅ Disabled by default (fully opt-in and backward-compatible) - ✅ Dynamic metadata field selection populated from actual document metadata keys - ✅ Fixed critical bug where Python's builtin `set()` was shadowed by a route handler function Modified endpoints (all backward-compatible): - `POST /v1/retrieval` (Public SDK) - `POST /v1/searchbots/retrieval_test` (Searchbots) - `POST /v1/chunk/retrieval_test` (UI/Internal) - Chat completions endpoints (via `extra_body.reference_metadata` or `prompt_config`) ### Type of change - [x] New Feature (non-breaking 
change which adds functionality) ### Images <img width="879" height="1275" alt="image" src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb" /> <img width="1532" height="362" alt="image" src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e" /> <img width="2586" height="1320" alt="image" src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776" /> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-04-30 23:13:27 +08:00
buua436	05ee7f8bb6	Fix: remove delete_documents uuid validation (#14533 ) ### What problem does this PR solve? remove delete_documents uuid validation ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 18:56:33 +08:00
Yingfeng	4ee0702aed	Feat: add skills space to context engine (#13908 ) ### What problem does this PR solve? issue #13714 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 12:36:03 +08:00
buua436	06c6da5d94	Fix: add document delete permission check (#14472 ) ### What problem does this PR solve? add document delete permission check ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:01:09 +08:00
buua436	47129fdd08	Fix: optimize file batch delete (#14473 ) ### What problem does this PR solve? optimize file batch delete ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:00:39 +08:00
Liu An	ce4c782fd7	Docs: Update version references to v0.25.1 in READMEs and docs (#14488 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.25.0 to v0.25.1 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-04-30 10:49:26 +08:00
euvre	6dd38eca6a	fix: file logs not displayed in dataset ingestion page (#14479 ) ### What problem does this PR solve? ## Summary Fixed a bug where the File Logs tab in the dataset ingestion page always showed "No logs" even after files were parsed successfully. ## Root Cause Both the File Logs and Dataset Logs tabs on the frontend called the same backend endpoint `/datasets/{dataset_id}/ingestions`. However, the backend only queried `get_dataset_logs_by_kb_id`, which hard-filtered records by `document_id == GRAPH_RAPTOR_FAKE_DOC_ID` (dataset-level logs). As a result, real file-level logs were never returned, causing the table to appear empty. ## Changes ### Backend - `api/apps/restful_apis/dataset_api.py` - Added two new query parameters to `list_ingestion_logs`: - `log_type` — `"file"` or `"dataset"` (default: `"dataset"`) - `keywords` — search keyword for filtering by document / task name - `api/apps/services/dataset_api_service.py` - Updated `list_ingestion_logs` signature to accept `log_type` and `keywords`. - Added conditional routing: - When `log_type == "file"`, call `PipelineOperationLogService.get_file_logs_by_kb_id` - Otherwise, call `PipelineOperationLogService.get_dataset_logs_by_kb_id` - `api/db/services/pipeline_operation_log_service.py` - Extended `get_dataset_logs_by_kb_id` with an optional `keywords` parameter so dataset logs can also be searched. ### Frontend - `web/src/pages/dataset/dataset-overview/hook.ts` - Removed the separate API function switching (`listPipelineDatasetLogs` vs `listDataPipelineLogDocument`). - Unified both tabs to call `listDataPipelineLogDocument` with the new `log_type` query parameter (`"file"` or `"dataset"`). - Ensured `keywords` and filter values are passed through correctly. 
## Behavior After Fix \| Tab \| `log_type` \| Returned Records \| Searchable Field \| \|---\|---\|---\|---\| \| File Logs \| `file` \| Real document-level logs \| `document_name` (file name) \| \| Dataset Logs \| `dataset` \| GraphRAG / RAPTOR / MindMap logs \| `document_name` (task type) \| ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com> Co-authored-by: Wang Qi <wangq8@outlook.com> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-29 22:10:24 +08:00
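The conditional routing described in the commit above can be sketched roughly as follows. This works on in-memory records rather than the real database layer. The placeholder value of `GRAPH_RAPTOR_FAKE_DOC_ID`, the helper names, the case-insensitive keyword match, and the idea that file logs are simply the records whose `document_id` differs from the placeholder are all assumptions made for illustration.

```python
# Placeholder value; the real constant lives in the RAGFlow codebase.
GRAPH_RAPTOR_FAKE_DOC_ID = "graph_raptor_fake_doc_id"


def _matches(record, keywords):
    # Assumed behaviour: case-insensitive substring match on document_name.
    return not keywords or keywords.lower() in record["document_name"].lower()


def get_dataset_logs(records, keywords=""):
    # Dataset-level logs are stored against the fake document ID.
    return [
        r for r in records
        if r["document_id"] == GRAPH_RAPTOR_FAKE_DOC_ID and _matches(r, keywords)
    ]


def get_file_logs(records, keywords=""):
    # Assumption: every other record is a real file-level log.
    return [
        r for r in records
        if r["document_id"] != GRAPH_RAPTOR_FAKE_DOC_ID and _matches(r, keywords)
    ]


def list_ingestion_logs(records, log_type="dataset", keywords=""):
    # Route on log_type; anything other than "file" falls back to dataset logs.
    if log_type == "file":
        return get_file_logs(records, keywords)
    return get_dataset_logs(records, keywords)
```

Defaulting `log_type` to `"dataset"` keeps callers that never send the parameter on the pre-fix behaviour.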
Wang Qi	5018459112	Fix metadata config (#14480 ) ### What problem does this PR solve? Fix metadata config ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 21:09:54 +08:00
bitloi	2bc8c6d35e	feat(dropbox): support deleted-file sync (#14476 ) ### What problem does this PR solve? Partially addresses #14362 by adding deleted-file sync support for the Dropbox data source. Dropbox previously did not provide the slim current-file snapshot required by stale document reconciliation, and its sync runner returned only document batches. As a result, enabling deleted-file sync could not remove local documents that had been deleted from Dropbox. This PR: - Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`. - Reuses Dropbox metadata traversal to collect current remote file IDs without downloading file contents. - Wires incremental Dropbox sync to return `(document_generator, file_list)` when `sync_deleted_files` is enabled. - Enables the deleted-file sync toggle for Dropbox in the data source settings UI. - Adds regression coverage for slim snapshots, nested folders, paginated listings, duplicate filenames, and full reindex behavior. Tests: - `uv run pytest test/unit_test/common/test_dropbox_connector.py -q` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `uv run pytest test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py -q` - `uv run ruff check common/data_source/dropbox_connector.py rag/svr/sync_data_source.py test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:11 +08:00
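The stale document reconciliation mentioned in the commit above comes down to a set difference between locally indexed documents and the slim snapshot of remote file IDs. The sketch below is a minimal illustration under that assumption. `find_stale_documents` and the ID strings are hypothetical and do not reflect the actual connector code.

```python
def find_stale_documents(local_doc_ids, remote_file_ids):
    """Return local document IDs that no longer exist in the remote snapshot.

    local_doc_ids:   iterable of IDs currently indexed locally
    remote_file_ids: iterable of IDs from the slim snapshot (no file contents)
    """
    remote = set(remote_file_ids)  # set membership keeps the scan linear
    # Preserve local order so deletions are reported deterministically.
    return [doc_id for doc_id in local_doc_ids if doc_id not in remote]
```

Comparing IDs rather than file names is what keeps duplicate filenames in different folders from being mistaken for one another.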
Wang Qi	b684c89950	Add backward compat APIs (#14427 ) ### What problem does this PR solve? Add backward compat APIs: ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 15:15:49 +08:00
euvre	35f6d81b73	Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402 ) ### What problem does this PR solve? ## Summary Migrate two web API endpoints to REST-style HTTP API endpoints, following the pattern established in #14222: \| Old Endpoint \| New Endpoint \| \|---\|---\| \| `POST /v1/chunk/retrieval_test` \| `POST /api/v1/datasets/<dataset_id>/search` \| \| `GET /v1/chunk/knowledge_graph` \| `GET /api/v1/datasets/<dataset_id>/graph` \|	2026-04-28 20:00:26 +08:00
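For client code that still uses the old paths, the mapping in the table above can be captured in a small lookup. This is only a sketch: `migrate_endpoint` is a hypothetical helper, and only the two path pairs listed in the commit message are taken from the source.

```python
# Old (method, path) -> new (method, path template), taken from the table above.
LEGACY_TO_REST = {
    ("POST", "/v1/chunk/retrieval_test"): ("POST", "/api/v1/datasets/{dataset_id}/search"),
    ("GET", "/v1/chunk/knowledge_graph"): ("GET", "/api/v1/datasets/{dataset_id}/graph"),
}


def migrate_endpoint(method, path, dataset_id):
    """Translate a legacy endpoint into its REST-style replacement.

    Raises KeyError if the (method, path) pair is not one of the migrated ones.
    """
    new_method, template = LEGACY_TO_REST[(method.upper(), path)]
    return new_method, template.format(dataset_id=dataset_id)
```

The dataset ID moves from the request payload or query string into the URL path, so a caller has to know it before building the request.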
Magicbook1108	85575259ac	Fix: google authentication - gmail && google-drive (#14422 ) ### What problem does this PR solve? Fix: google authentication - gmail && google-drive ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 18:09:02 +08:00
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync deleted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
Jack	c81081f8ef	Refactor: Doc change parser (#14327 ) ### What problem does this PR solve? Before migration: Web API `POST /v1/document/change_parser`; HTTP API `PATCH /api/v1/datasets/<dataset_id>/documents`. After consolidation: RESTful API `PATCH /api/v1/datasets/<dataset_id>/documents`. ### Type of change - [x] Refactoring	2026-04-27 23:42:57 +08:00

264 Commits