ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-20 16:26:42 +08:00

Author	SHA1	Message	Date
plind	dd76653dc1	feat: add tag management for Agents with filtering and sorting (#14774 ) (#14799 ) ## Summary Closes #14774. Adds free-form tags on agents (UserCanvas) with full UI + API: - Stored as comma-separated `tags` column on `UserCanvas` with online migration. - New endpoints: `GET /v1/agents/tags` (aggregate counts) and `PUT /v1/agent/<id>/tags` (write). `GET /v1/agents` accepts a `tags=` query. - "Edit tags" item in agent dropdown opens a chip-style editor dialog; tags render as badges on each agent card. - New "Tags" facet in the agents filter bar, with counts. ## Implementation notes - Tag matching is exact-token: the SQL filter wraps stored tags as `,…,` and matches `,ml,` so `ml` doesn't match `ml-ops`. - Server-side normalization in `UserCanvasService.update_tags`: dedup (case-insensitive), per-tag cap of 64 chars, total length capped at 512 chars to fit the column, commas inside tag values are replaced with spaces. - Tenant authorization: `PUT /v1/agent/<id>/tags` gates on `UserCanvasService.accessible(canvas_id, tenant_id)`. - Tag listing scope: `UserCanvasService.list_tags` follows the same own + team-shared rule as `get_by_tenant_ids`. - i18n: keys added to `en.ts` and `zh.ts` only (per project convention; other locales fall back). - `HomeCard` gets a non-breaking `extra?: ReactNode` slot for the chip row; no `src/components/ui/` files modified. ## Test plan - [ ] Backend boot runs `migrate_db` → confirm `user_canvas.tags` column exists (`DESCRIBE user_canvas`). - [ ] Agents page renders cards normally (no console error from missing field). - [ ] `⋯ → Edit tags` opens a dialog that stays open (regression: dialog was unmounting with the dropdown). - [ ] Typing a tag without pressing Enter and clicking Save persists it (regression: last typed tag was being dropped). - [ ] Chip input supports Enter/comma to commit, Backspace on empty to remove, `×` to remove individual chip. - [ ] Tag containing a comma sent via API is stored with the comma replaced by a space. - [ ] 20 long tags sent via API does not error (length cap silently truncates). - [ ] "Tags" filter in the filter bar shows counts and narrows the list. - [ ] Filtering by `ml` does not return agents tagged `ml-ops`. - [ ] UI in Chinese shows 编辑标签 / 添加标签以整理和筛选你的智能体 etc. - [ ] `PUT /v1/agent/<other-tenant-id>/tags` returns `Agent not found or no permission.`	2026-05-13 21:41:32 +08:00
Ethan T.	8c5845f6ca	fix: use context manager for pdfplumber to prevent resource leak (#13512 ) ## Summary - Convert `pdfplumber.open()` to use `with` context manager in `api/utils/file_utils.py` (`thumbnail_img` function) - If any exception occurs between `open()` and `close()`, the PDF file handle leaks - The rest of the codebase (e.g. `read_potential_broken_pdf` in the same file) already uses `with pdfplumber.open(...)` correctly ## Test plan - [x] PDF thumbnail generation works correctly with context manager - [x] Resources properly cleaned up on exceptions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-13 21:09:51 +08:00
Ahmad Intisar	e994051eb9	Feature/generic api connector (#13545 ) # feat: Add Generic REST API Connector ## What problem does this PR solve? RAGFlow supports many specific data source connectors (MySQL, Slack, Google Drive, etc.), but there was no way to connect an arbitrary REST API as a data source. Users with custom or third-party APIs had to write a new connector class for each one. This PR adds a generic, configuration-driven REST API connector that lets users connect any REST API as a data source entirely through the UI — no code changes needed per API. --- ## Features ### Core Connector (`common/data_source/rest_api_connector.py`) - Implements `LoadConnector` and `PollConnector` interfaces for full and incremental sync - Configurable authentication: None, API Key (custom header), Bearer Token, Basic Auth - Pluggable pagination: Page-based, Offset-based, Cursor-based, or None - Smart page-size inference from user's query parameters to avoid duplicate/conflicting params - Configurable request delay between pages to prevent API rate limiting - Auto-detection of the items array in JSON responses (`items`, `results`, `data`, `records`, or first list found) - Advanced field mapping with dot-notation (`country.name`), array wildcards (`newsType[].name`), type hints, and default values - Optional content template rendering (`"Title: {title}\nBody: {body}"`) - HTML stripping for content fields - Stable document IDs via `hash128` from a configurable ID field or auto-generated from item content - Pydantic configuration schema with automatic coercion of UI string inputs to dicts/lists ### Backend Registration (`rag/svr/sync_data_source.py`, `common/constants.py`, `common/data_source/config.py`) - `REST_API` sync class wired into RAGFlow's `func_factory` - Full sync (`load_from_state`) and incremental polling (`poll_source`) support - Credentials and config passed from task to connector following existing patterns (MySQL, SeaFile, etc.) ### Test Connection Endpoint (`api/apps/connector_app.py`) - `POST /v1/connector/<id>/test` validates config schema, authentication, and API connectivity without triggering a sync - Clear error messages for auth failures vs. config issues ### Frontend UI (`web/src/pages/user-setting/data-source/constant/`) - Postman-style configuration:* Base URL, Query Parameters (key=value per line), Auth, Content Fields, Metadata Fields, Pagination Type - Auth-type-aware form: fields for API key header/value, Bearer token, or Basic username/password appear only when relevant - Advanced Settings toggle for: Custom Headers, Max Pages, Request Delay, Poll Timestamp Field, Request Body (POST) - Connector icon (SVG) and i18n strings (English) - "Test Connection" button to validate before syncing --- ## Controls & Safety - Configurable max pages safety cap (default: 1000, adjustable in UI) - Configurable request delay between pages (default: 0.5s, adjustable in UI) - Auth errors (401/403) fail immediately without retries; transient errors retry with exponential backoff - Diagnostic logging: auth setup confirmation, request details on failure, content field extraction status --- ## Type of change - [x] New Feature (non-breaking change which adds functionality) ##Visual Screenshots of Features <img width="482" height="510" alt="Screenshot 2026-03-11 at 5 19 52 PM" src="https://github.com/user-attachments/assets/dcb7ab4a-1622-44f3-bb02-d6f0527314c4" /> (Connector can be configured within the external data sources tab) Configuration Parameters: <img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 46 PM" src="https://github.com/user-attachments/assets/5e154e71-4ab5-4872-bfb2-04f02b73c18a" /> <img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 54 PM" src="https://github.com/user-attachments/assets/00cb14b7-0bcf-4b94-9d71-34e93369ecb2" /> Connection can be tested before attaching to dataset: <img width="981" height="681" alt="Screenshot 2026-03-11 at 5 21 40 PM" src="https://github.com/user-attachments/assets/aaa6eeeb-89a7-4349-bc34-2423bf8be9ee" /> Ingestion tested with API connector (works perfectly fine): <img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 22 30 PM" src="https://github.com/user-attachments/assets/afcd0d58-cadd-4152-badc-d2f14d96fbec" /> Search & Retrieval works as well with metadata flow: <img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 23 05 PM" src="https://github.com/user-attachments/assets/d41ee935-dcf7-4456-b317-22a76ca032c0" /> --------- Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-13 20:35:01 +08:00
jony376	7f699d1202	Fix: enforce tenant authorization for `tenant_rerank_id` in retrieval flows (#14782 ) ### Related issues Closes #14781 ### What problem does this PR solve? Some retrieval endpoints accepted caller-supplied `tenant_rerank_id` and resolved it through `get_model_config_by_id(...)`. That helper loaded `TenantLLM` rows by global database id and returned decoded model configuration without checking whether the model belonged to the authenticated tenant or the dataset owner tenant. This meant dataset access was validated, but rerank-model selection was not. A caller who knew or could guess another tenant's `tenant_rerank_id` could attempt retrieval with a foreign rerank model config, creating a cross-tenant authorization gap for model usage. This PR closes that gap by making `tenant_rerank_id` resolution tenant-aware across the retrieval paths that accept it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Solution - Extend `get_model_config_by_id(...)` to accept an optional `allowed_tenant_ids` set and reject `TenantLLM` rows whose `tenant_id` is outside that set. - Pass the allowed tenant scope from retrieval endpoints that accept `tenant_rerank_id`: - `api/apps/sdk/doc.py` - `api/apps/sdk/session.py` - `api/apps/services/dataset_api_service.py` - Use the authenticated tenant plus dataset-owner tenant ids already derived by each retrieval flow as the authorization boundary for rerank model selection. - Add focused unit coverage to assert unauthorized `tenant_rerank_id` values are rejected and that the allowed tenant set is propagated correctly. ### Testing - `python -m py_compile` on: - `api/db/joint_services/tenant_model_service.py` - `api/apps/services/dataset_api_service.py` - `api/apps/sdk/doc.py` - `api/apps/sdk/session.py` - Added unit tests in: - `test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py` - `test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py` ### Notes for reviewers - This change is intentionally narrow: it affects only the `tenant_rerank_id` path, not the normal `rerank_id` name-based resolution path. - Local lint/syntax checks passed. - Full pytest execution could not be completed in this environment because the local test runtime is missing `strenum`, so the route-test files fail during collection before exercising the updated cases. --------- Co-authored-by: jony376 <jony376@gmail.com>	2026-05-13 19:53:08 +08:00
Wang Qi	f3b3596c29	Speed up ragflow server (#14894 ) ### What problem does this PR solve? Speed up ragflow server ### Type of change - [ ] Refactoring	2026-05-13 18:01:33 +08:00
buua436	8cb2bf04fb	Fix: llm add api key overridden (#14885 ) ### What problem does this PR solve? llm add api key overridden ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 17:15:32 +08:00
Wang Qi	ff685d3131	Delete duplicate route (#14883 ) ### What problem does this PR solve? The delete /graph is duplicated of `/datasets/<dataset_id>/<index_type>`, delete it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 15:57:44 +08:00
Wang Qi	45d676bc05	Fix delete graphrag not take effect in UI (#14879 ) ### What problem does this PR solve? Fix delete graphrag not take effect in UI ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 13:49:16 +08:00
Wang Qi	64bd0130d3	Add REST API backward compatibility (#14872 ) ### What problem does this PR solve? Add REST API backward compatibility ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 11:44:40 +08:00
dale053	5a5e766386	fix(api): authorize owner_ids for list chats and search apps (#14775 ) Closes #14768 ### What problem does this PR solve? The `list_chats` and `list_searches` REST API endpoints did not enforce authorization on the `owner_ids` query parameter. Any authenticated user could pass arbitrary tenant IDs to `owner_ids` and retrieve chats or search apps belonging to other tenants they are not a member of. This PR resolves the issue by: 1. Looking up the current user's authorized tenants via `TenantService.get_joined_tenants_by_user_id` and rejecting any `owner_ids` that fall outside that set. 2. When no `owner_ids` are provided, scoping the query to only the user's authorized tenants instead of returning an unfiltered result. 3. Adding unit tests that verify unauthorized `owner_ids` are rejected with `OPERATING_ERROR`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 09:43:44 +08:00
CaptainTimon	2717ee283f	feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679 ) ### What problem does this PR solve? Closes #14674. This PR improves RAPTOR configuration and tree construction while preserving the existing RAPTOR behavior as the default. RAPTOR currently builds summary layers with the original UMAP + GMM clustering path. This PR keeps that default path, and adds: - A hidden backend tree-builder option: - `tree_builder="raptor"`: default, existing RAPTOR behavior. - `tree_builder="psi"`: rank-aware Psi-style tree builder using original embedding-space cosine ranking. - A user-facing clustering method option for the default RAPTOR builder: - `clustering_method="gmm"`: existing default. - `clustering_method="ahc"`: agglomerative hierarchical clustering path. - A RAPTOR UI setting for `Clustering method` and `Max cluster`. ### What changed #### Backend - Added `tree_builder` support for RAPTOR/Psi. - Added `clustering_method` support for GMM/AHC. - Kept existing RAPTOR + GMM as the default. - Added Psi tree building from original-space cosine similarity. - Added bucketed Psi building controls for large inputs: - `raptor.ext.psi_exact_max_leaves` - `raptor.ext.psi_bucket_size` - Added method-aware RAPTOR summary metadata using existing `extra.raptor_method`. - Avoided adding a dedicated DB schema field for experimental method tracking. - Added cleanup/migration logic to avoid mixing stale RAPTOR summary trees. - Added defensive checks for Psi tree construction and summary failures. #### Frontend/UI - Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`. - Added/kept `Max cluster` in RAPTOR settings. - Enlarged max cluster UI limit to `1024`, matching backend validation. - Kept AHC editable even when a RAPTOR task has already finished. - Fixed the UI save payload so `clustering_method` and `tree_builder` are serialized through `parser_config.raptor.ext`, avoiding backend validation errors for extra top-level RAPTOR fields. Example saved RAPTOR config: ```json { "raptor": { "max_cluster": 317, "ext": { "clustering_method": "ahc", "tree_builder": "raptor" } } } Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>	2026-05-12 09:42:31 +08:00
黄圣祺	415169d497	fix(dify): add GET method support to /dify/retrieval for health check (#13837 ) ## Summary - Add GET method handler to `/api/v1/dify/retrieval` endpoint for Dify external knowledge base connectivity verification - GET requests return a simple success response; POST requests retain existing retrieval logic unchanged ## Problem When Dify integrates with RAGFlow as an external knowledge base, it sends periodic GET requests to the retrieval endpoint for health/connectivity checks. The endpoint only accepted POST, causing werkzeug to return `405 Method Not Allowed`. After several successful POST retrievals, the failing GET health checks trigger Dify's circuit breaker, causing all subsequent requests to fail. Traceback from the issue: ``` werkzeug.exceptions.MethodNotAllowed: 405 Method Not Allowed: The method is not allowed for the requested URL. ``` ## Changes - `api/apps/sdk/dify_retrieval.py`: Added a separate GET route handler (`retrieval_health_check`) that returns `get_json_result(data=True)` ## Test plan - [ ] Verify `GET /api/v1/dify/retrieval` returns `{"code": 0, "message": "success", "data": true}` - [ ] Verify `POST /api/v1/dify/retrieval` with valid API key and body still works as before - [ ] Verify Dify external knowledge base integration no longer returns 405 errors Closes #13788 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-12 09:37:07 +08:00
tmimmanuel	663fc1d42c	fix(opensearch): implement doc-meta dispatch surface on OSConnection (#14577 ) ### What problem does this PR solve? Fixes #14570. On OpenSearch backends (`DOC_ENGINE=opensearch`) every document-metadata write failed with `'OSConnection' object has no attribute 'create_doc_meta_idx'`, so both `PATCH /api/v1/datasets/{ds}/documents/{doc}` with `meta_fields` and `POST /api/v1/datasets/{ds}/metadata/update` were unusable while every other document operation (retrieval, parsing, name update, chunk management) worked correctly on the same OpenSearch cluster. The bug runs deeper than the missing method name in the error message suggests. `DocMetadataService` also reached into `settings.docStoreConn.es.*` directly for the index refresh, the scripted partial update, and the count call, which means that even after adding `create_doc_meta_idx` to `OSConnection` the very next call in the same metadata flow would still raise `AttributeError` because `OSConnection` exposes `self.os` rather than `self.es`. Fixing only the reported symptom would have moved the failure one line down without restoring the feature. This PR adds a uniform document-metadata dispatch surface to both connection classes so they present the same abstract API, and routes the service layer through that surface via `getattr` guards instead of poking at backend-specific attributes. The four new methods on `OSConnection` and `ESConnectionBase` are `create_doc_meta_idx`, `refresh_idx`, `count_idx`, and `replace_meta_fields`. `OSConnection.create_doc_meta_idx` reuses the existing `conf/doc_meta_es_mapping.json` schema in the OpenSearch `body=` form because OpenSearch and Elasticsearch share the same index-creation payload, and `replace_meta_fields` emits a full scripted assignment (`ctx._source.meta_fields = params.meta_fields`) on both backends so removed keys actually disappear instead of being preserved by deep-merge semantics. The `getattr`-guarded dispatch in `DocMetadataService` keeps the existing fall-through paths intact for Infinity and OceanBase, which continue to rely on their search-based count fallback and on the delete-then-insert metadata replacement they used before, so this change is strictly additive for those two backends. Verification: `pytest test/unit_test/rag/utils/test_opensearch_doc_meta.py` runs 16 new unit tests that pass locally and pin the `OSConnection` dispatch surface, the `create_doc_meta_idx` short-circuit when the index already exists, the mapping-file payload routing, the `IndicesClient.create` failure path, the `refresh_idx` and `count_idx` success and error sentinels, and the full-assignment script emitted by `replace_meta_fields`. The test module stubs `common.settings` and `rag.nlp` at import time so the suite runs without the heavy backend SDKs that the rest of the repository pulls in transitively. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>	2026-05-11 17:04:28 +08:00
box4wangjing	292b0b8bce	chore: fix some comments to improve readability (#14756 ) ### What problem does this PR solve? fix some comments to improve readability ### Type of change - [x] Documentation Update --------- Signed-off-by: box4wangjing <box4wangjing@outlook.com>	2026-05-11 16:48:48 +08:00
Sank	592dba1489	Refact: Added a private helper _visibility_and_status_filter (#13627 ) ### What problem does this PR solve? Added a private helper _visibility_and_status_filter(joined_tenant_ids, user_id) that returns the Peewee condition: visible to user (team or own) and status is VALID. ### Type of change - [x] Refactoring --------- Co-authored-by: Serobabov Aleksandr <40SerobabovAS@region.cbr.ru> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-11 15:21:41 +08:00
tmimmanuel	6ce014c23b	fix: offload blocking DB/Redis calls to thread pool for high-concurrency support (#13825 ) (#13941 ) ### What problem does this PR solve? Addresses event-loop blocking under high concurrency reported in #13825. When multiple requests hit the API simultaneously, synchronous DB/Redis calls block the async event loop, preventing Quart from handling other requests and causing cascading 502/504 timeouts. This PR wraps all remaining blocking DB/Redis calls in `canvas_app.py`, `chat_api.py`, `session.py`, and `canvas_service.py` with `await thread_pool_exec()` - Offload all synchronous `Service.`, `REDIS_CONN.`, and `APIToken.query` calls to the thread pool - Convert sync endpoint handlers (`list_chats`, `get_chat`, `templates`, `sessions`, etc.) to `async def` - Convert sync helper functions (`_ensure_owned_chat`, `_validate_llm_id`, `_validate_dataset_ids`, etc.) to async - no duplicate sync/async pairs - Wrap `CanvasReplicaService` Redis IO calls (`bootstrap`, `replace_for_set`, `commit_after_run`) - Use `asyncio.gather()` for concurrent file uploads and chat response building Note: This fixes the code-level event-loop blocking, which is a prerequisite for handling concurrent requests. For the full "30 concurrent requests without 502/504" goal described in the issue, users should also tune deployment config: - `WS=4` or higher (HTTP worker processes, default 1) - `MAX_CONCURRENT_CHATS=50` (default 10) - `SANDBOX_EXECUTOR_MANAGER_POOL_SIZE` for workflow-heavy workloads ### Performance verification Reviewer asked for a before-vs-after comparison ([comment](https://github.com/infiniflow/ragflow/pull/13941#issuecomment-4393667231)). I built a self-contained microbenchmark that reproduces the exact failure mode this PR targets: an async handler that performs blocking DB/Redis-style calls (50 ms each, 3 per request, 30 concurrent requests) is run twice — once with the pre-PR pattern (sync call directly inside the async handler) and once with the post-PR pattern (`await thread_pool_exec(...)`). The benchmark imports nothing from RAGFlow except `thread_pool_exec` itself, so it is hermetic and reproducible (`THREAD_POOL_MAX_WORKERS=128`, Python 3.13.12). Throughput — wall-clock for 30 concurrent requests (lower is better) \| flavour \| wall(s) \| p50(s) \| p95(s) \| max(s) \| \|---\|---:\|---:\|---:\|---:\| \| before \| 4.986 \| 0.158 \| 0.207 \| 0.269 \| \| after \| 0.248 \| 0.181 \| 0.230 \| 0.231 \| The pre-PR handler serializes the entire load on the event-loop thread, so 30 × 3 × 50 ms ≈ 4.5 s shows up as the wall time. The post-PR handler parallelizes the blocking work across the thread pool and finishes the same load in 248 ms — a ~20× speedup on this workload. Event-loop responsiveness — latency of an unrelated probe coroutine while the 30 slow requests are running (lower is better) \| flavour \| samples \| probe p50 (ms) \| probe p95 (ms) \| probe max (ms) \| \|---\|---:\|---:\|---:\|---:\| \| before \| 1 \| 5442.26 \| 5442.26 \| 5442.26 \| \| after \| 28 \| 0.88 \| 11.53 \| 98.02 \| This is the metric that maps directly to "the API still answers other requests while one is busy". A 5 ms-interval probe was scheduled while the 30 slow handlers ran. With the pre-PR code the event loop was frozen for the entire duration of the blocking work, so only one probe sample was ever picked up and it waited 5,442 ms. After the PR, 28 probe samples landed with p50 0.88 ms / p95 11.53 ms, meaning unrelated requests are no longer starved by the slow ones. That is the regression mode behind the cascading 502/504s reported in #13825. <details> <summary>Raw benchmark output</summary> ``` config: 30 concurrent requests, 3 blocking calls of 50ms each per request, THREAD_POOL_MAX_WORKERS=128 === Throughput (lower wall is better) === flavour wall(s) p50(s) p95(s) max(s) before 4.986 0.158 0.207 0.269 after 0.248 0.181 0.230 0.231 === Event-loop responsiveness (lower probe latency is better) === flavour samples probe p50(ms) probe p95(ms) probe max(ms) before 1 5442.26 5442.26 5442.26 after 28 0.88 11.53 98.02 ``` </details> The benchmark script is included as a comment on the PR for reproducibility. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement Closes [#13825](https://github.com/infiniflow/ragflow/issues/13825) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 15:08:55 +08:00
Paul Y Hui	a0efc453f3	Fix: safe argument guard and remove redundant redis call (#14060 ) ### What problem does this PR solve? - Moved if not all([email, new_pwd, new_pwd2]) guard to the top, before any decryption that could crash on None value - Removed the redundant REDIS_CONN.get() call — one call is sufficient ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-05-11 15:02:24 +08:00
Ricardo-M-L	5ef7f50eef	fix: use context manager for ThreadPoolExecutor in file_service.py (#14144 ) ## Summary - Wrap 2 `ThreadPoolExecutor` instances in `file_service.py` with `with` statement - Ensures threads are properly shut down after all futures complete ## Problem `parse_docs()` (line 532) and the file processing method (line 694) create `ThreadPoolExecutor` instances that are never shut down. In a long-running server process, this leaks thread resources on every invocation — threads remain alive consuming memory even after all submitted work is complete. ## Fix Replace bare `ThreadPoolExecutor()` with `with ThreadPoolExecutor() as exe:` context manager, which calls `executor.shutdown(wait=True)` on exit. ## Test plan - [x] Verified both call sites use `with` statement after fix - [x] No remaining bare `ThreadPoolExecutor` in `file_service.py` - [x] `document_service.py:1066` is a module-level executor (different pattern, not changed in this PR) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 14:02:45 +08:00
buua436	a03b95f8c4	Fix: shared dataset chunk index lookup (#14764 ) ### What problem does this PR solve? shared dataset chunk index lookup ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:50:08 +08:00
buua436	024c8cb0b5	Fix: dataset search rerank id type (#14759 ) ### What problem does this PR solve? issue: https://github.com/infiniflow/ragflow/issues/14748 change: dataset search rerank id type ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:48:05 +08:00
jony376	46897d6fa4	Fix: bind memory message `user_id` to authenticated user for JWT auth (#14745 ) ### Related issues Closes #14744 ### What problem does this PR solve? The Memory REST endpoint `POST /api/v1/messages` previously persisted whatever `user_id` the client sent in the JSON body. Memory rows were therefore attributed to an arbitrary string, even when the caller authenticated as a normal workspace user via JWT (browser/session-style bearer token decoded into an access token). That broke attribution and audit semantics for shared memories (team visibility): any authorized writer could spoof another subject id. The Python SDK already sends an optional `user_id` for integrations using API keys (`APIToken`) to tag an external subject distinct from the tenant owner user. ### Solution - Record `g.auth_via_api_token` in `_load_user` (`api/apps/__init__.py`): set `True` only when authentication resolves via `APIToken`, otherwise `False` after JWT-based login succeeds. - In `POST /messages` (`memory_api.add_message`): if the request was authenticated with an API key, keep accepting optional `user_id` from the body (default empty string). For JWT-authenticated users, always set stored `user_id` to `current_user.id` and ignore the client field. - Guard reads of `g` with `RuntimeError` handling so isolated imports or tests without a Quart application context do not fail when resolving `user_id`. - Document on `RAGFlow.add_message` that `user_id` is only meaningful for API-key authentication. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - `python -m py_compile` on modified modules (`api/apps/__init__.py`, `api/apps/restful_apis/memory_api.py`). - Recommended: run web/SDK memory message tests (`test_add_message`, `test_message_routes_unit`) against a full environment with `quart` and configured services. ### Notes for reviewers - Behavior change only for callers using JWT-style authorization on `POST /messages`; API-key callers keep prior optional `user_id` semantics. Co-authored-by: jony376 <jony376@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-11 13:26:05 +08:00
Achieve3318	16354f4e14	fix(dify): guard retrieval argument error behavior (#14169 ) ## What problem does this PR solve? The Dify-compatible `/dify/retrieval` endpoint recently gained stricter parsing and validation for its request payload, including: - Normalized `retrieval_setting.top_k` and `retrieval_setting.score_threshold` types. - Clear separation between malformed arguments vs missing required fields. Previously, there was no unit test explicitly guarding the exact error code and message contract for these cases. ## What does this PR change? - Add guard-style unit test in `test_dify_retrieval_routes_unit.py`: - `test_retrieval_argument_error_messages`: - Sends a request with malformed numeric options: - `retrieval_setting = {"top_k": "not-int", "score_threshold": "not-float"}` - Asserts `code == RetCode.ARGUMENT_ERROR` and message contains `"invalid or malformed arguments:"`. - Sends a request with required fields missing: - Empty payload (`{}`) - Asserts `code == RetCode.ARGUMENT_ERROR` and message contains `"required arguments are missing:"`. This test encodes the intended behavior of the Dify retrieval API so future refactors cannot silently regress error handling. ## Type of change - [x] Tests (add coverage and guardrails for existing behavior) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 13:17:42 +08:00
Wang Qi	3838770e7a	GraphRAG feature - Part 1 - add spacy to extract entity and relation (#14670 ) ### What problem does this PR solve? GraphRAG feature - Part 1 - add spacy to extract entity and relation <img width="1621" height="1288" alt="image" src="https://github.com/user-attachments/assets/aadeddad-94da-46c6-adad-9c3784181f61" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 12:59:59 +08:00
web-dev0521	cc207b5b05	Refactor: tidy up ThreadPoolExecutor lifecycle in file_service and task executor (#14668 ) ## Summary - Wrap the `ThreadPoolExecutor` instances in `FileService.parse_docs` and `FileService.get_files` with `with ... as exe:` blocks for deterministic cleanup - Replace the `concurrent.futures.ThreadPoolExecutor` in `do_handle_task` with `asyncio.create_task(asyncio.to_thread(build_TOC, ...))`, preserving the existing parallelism with chunk insertion while leveraging the surrounding async context - Drop the now-unused `import concurrent` and the `executor.shutdown(wait=False)` call in the `finally` block Closes #14622. No behavioral change, no public API change. Net diff: ~19 insertions / 25 deletions across two files. ## Test plan - [ ] `uv run ruff check api/db/services/file_service.py rag/svr/task_executor.py` passes - [ ] Upload a multi-file batch through the chat/file endpoint and confirm `FileService.parse_docs` still returns combined parsed text - [ ] Trigger `FileService.get_files` via the chat reference flow with a mix of image and non-image files; verify both `raw=True` and `raw=False` paths return correctly - [ ] Run a `naive`-parser document task with `toc_extraction: true` and confirm the TOC chunk is generated and inserted exactly as before - [ ] Run a `naive`-parser document task with `toc_extraction: false` and confirm the path with `toc_thread = None` is unaffected - [ ] Cancel a running task to exercise the `finally` block and confirm cleanup still works without the executor shutdown call --------- Co-authored-by: web-dev0521 <jasonpette1783@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-05-11 12:59:00 +08:00
Ahmad Intisar	3c4d1da98f	Feature/table parser column roles (#13710 ) ### What problem does this PR solve? The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes. For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value. The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion. This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns. Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 10:06:04 +08:00
Mehmet Karakose	7ec87f7cb7	fix(auth): fall back to session-based auth in _load_user (#14569 ) ## Summary Closes #13663. OAuth / OIDC callbacks call `login_user(user)` which writes `_user_id` into the session cookie, but `_load_user()` in `api/apps/__init__.py` only ever looked at the `Authorization` header. The SPA's response interceptor wipes the Authorization value from `localStorage` on the first 401 it sees — meaning that during the post-redirect window after an OAuth login, a single transient 401 sends every subsequent request back to the login page even though `login_user()` had already established a perfectly good server-side session. The reporter's analysis traces this all the way through the redirect → `navigate('/')` → first request → empty header → 401 → `removeAll()` → infinite-redirect-to-login chain. ## What changed - New `_load_user_from_session()` helper that reads `session["_user_id"]`, looks up the user in `UserService` (with the same `StatusEnum.VALID` and `access_token` checks already used elsewhere), and assigns `g.user`. - Every `return None` path in `_load_user()` now routes through that helper before giving up: - missing `Authorization` header - malformed `bearer ` prefix - empty / too-short JWT payload - JWT signature failure - JWT-resolved user not found / has no `access_token` - `APIToken.query()` fallback exhausted The JWT and API-token paths still take precedence — the session is only consulted when those can't authenticate the request. So existing local-login and SDK callers see no behaviour change; only OAuth / OIDC users that hit the original race now stay logged in. The Bearer-prefix issue called out in #13663 (lines 103-110) is already handled in the current code, so this PR only addresses the second half of the report. ## Test plan - [ ] Configure OIDC under `oauth` in `service_conf.yaml` - [ ] Click the OIDC login button, complete auth at the IdP - [ ] Confirm that navigating between pages no longer bounces back to `/login` - [ ] Confirm local email/password login still issues + accepts JWTs - [ ] Confirm SDK/API key callers still authenticate via `Authorization: Bearer <api-token>` --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 09:59:52 +08:00
Hunnyboy1217	782084780e	feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628 ) (#14677 ) ### What problem does this PR solve? S3-family connector syncs currently re-download every in-window object just so we can compute `xxhash128(blob)` and compare against `Document.content_hash`. Anything that bumps `LastModified` without changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays full bandwidth and re-parses files that didn't actually change. #14628 covers the broader incremental-ingestion redesign; this PR is the first slice. The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2 / GCS / OCI / S3-compat) now implements a new `FingerprintConnector` interface: `list_keys()` paginates `list_objects_v2` and yields `KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The orchestrator joins those against the connector's existing `{doc_id: content_hash}` map and only calls `get_value(key)` when the fingerprint differs. Unchanged keys are skipped entirely — no `GetObject`, no re-parse. No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing `Document.content_hash` column per @yingfeng's suggestion; the connector decides at listing time whether to populate it. Local uploads and connectors that don't opt in fall through to the existing post-download `xxhash128(blob)` path with no behavior change. This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS (PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint` (PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation (PR-4). Holding those back keeps this PR additive and reviewable on its own. #### Files touched - `common/data_source/models.py` — new `KeyRecord`; optional `fingerprint` on `Document` - `common/data_source/interfaces.py` — `IncrementalCapability` enum, `FingerprintConnector` ABC - `common/data_source/blob_connector.py` — `BlobStorageConnector` implements `FingerprintConnector`; per-object download factored into `_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`, `get_value` all share it - `rag/svr/sync_data_source.py` — `_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop; `_run_task_logic` plumbs `doc.fingerprint` into the upload dict - `api/db/services/document_service.py` — `list_id_content_hash_map_by_kb_and_source_type()` helper - `api/db/services/connector_service.py` + `file_service.py` — fingerprint flows through `duplicate_and_parse → upload_document` and lands in `content_hash` - `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests covering ETag normalization (single-part, multipart, quoted, empty), `list_keys()` not calling `GetObject`, `get_value()` materializing with fingerprint, deterministic/stable fingerprints, and the bypass loop asserting `GetObject` is not called on a match #### Worth flagging for review Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a `LastModified` window when `poll_range_start` was set. New code uses `_fingerprint_filtered_generator` (full bucket listing + fingerprint compare) outside of explicit `reindex=1`. Strictly better for unchanged-bucket cases since it skips `GetObject`, but it does mean every sync now does a full `list_objects_v2` paginate. Should still be cheap for most buckets — flagging in case anyone has a very large bucket where the time-window filter was meaningful. On migration: existing rows have `content_hash = xxhash128(blob)` from the old code. The first sync after this lands sees ETag-derived fingerprints that don't match, re-fetches every object once, and writes the new fingerprint. From the second sync onward the bypass works as expected. "Slow day one, fast every day after." A `fingerprint_backfill: trust` opt-out is sketched in the design doc but not in this PR. #### Test plan - [x] `uv run ruff check` — clean on all 8 touched files - [x] `uv run pytest test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed - [x] Broader unit-test suite — no regressions in anything I touched - [ ] Manual smoke against a real S3 bucket — configure a connector, run sync twice, expect the second sync to log `bypassed=N, fetched=0` and no `GetObject` calls in CloudTrail / bucket access logs - [ ] Manual smoke with `reindex=1` — confirm the full re-download path still works ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-09 20:03:56 +08:00
euvre	f4b8f53b6d	Fix: restore embedding model switching for datasets with existing chunks (#14732 ) ### What problem does this PR solve? ## Problem During the REST API refactoring (#13690), the `/api/v2/kb/check_embedding` endpoint was removed and never migrated to the new RESTful structure. The frontend was pointed to the `/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a completely different function). Additionally, a hard guard was introduced that rejects any `embd_id` change when `chunk_num > 0`, making it impossible to switch embedding models on datasets with existing chunks. ## Root Cause 1. Missing endpoint: The old `check_embedding` logic (sample random chunks, re-embed with the new model, compare cosine similarity) was not carried over to the new REST API service layer. 2. Wrong frontend URL: `checkEmbedding` in `api.ts` pointed to `/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated check endpoint. 3. Overly restrictive guard: `dataset_api_service.py` line 310 blocked all `embd_id` updates when `chunk_num > 0`. This check did not exist in the pre-refactor code — it was incorrectly introduced during the refactor. ## Changes ### Backend - `api/apps/services/dataset_api_service.py` - Remove the `chunk_num > 0` hard guard on `embd_id` updates - Add `check_embedding()` service function: samples random chunks, re-embeds them with the candidate model, computes cosine similarity, returns compatibility result (avg ≥ 0.9 = compatible) - Add `import re` for the `_clean()` helper - `api/apps/restful_apis/dataset_api.py` - Add `POST /datasets/<dataset_id>/embedding/check` endpoint following the new REST API conventions - Clean up unused top-level imports (`random`, `re`, `numpy`) ### Frontend - `web/src/utils/api.ts` - Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` → `/datasets/${datasetId}/embedding/check` ### Tests - `test/testcases/test_http_api/test_dataset_management/test_update_dataset.py` - Update `test_embedding_model_with_existing_chunks` to assert success (`code == 0`) instead of expecting the old `102` error - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py` - Update `test_update_route_branch_matrix_unit` to assert `RetCode.SUCCESS` when updating `embd_id` on a chunked dataset, replacing the old `chunk_num` error assertion ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-09 18:48:57 +08:00
buua436	330257b611	Fix: Add legacy system healthz route (#14738 ) ### What problem does this PR solve? Add legacy system healthz route ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-09 17:49:26 +08:00
akie	c11650bb4c	Fix IDOR: Add permission checks to file ancestry endpoints (#14725 ) Close #14292 ## Issue File ancestry endpoints return folder metadata without validating tenant permissions, allowing any authenticated user to query arbitrary `file_id` values across tenant boundaries. ## Affected Endpoints - `GET /v1/file/parent_folder?file_id={file_id}` - `GET /v1/file/all_parent_folder?file_id={file_id}` - `GET /api/v1/files/{id}/ancestors` ## Root Cause These endpoints skip the permission check that other file operations (Delete, Download, Move) perform. ## Expected Permission Check All file operations should follow this 3-step validation: - Check file.tenant_id - Check if user_id belongs to this tenant (via user_tenant join table) - Check KB permission type (team permission) Code reference: This is implemented in `checkFileTeamPermission()` and used by Delete/Download/Move, but missing from GetParentFolder/GetAllParentFolders. ## Reproduction ```bash # User B (tenant: BBB) accessing User A's file (tenant: AAA) curl -H "Authorization: Bearer USER_B_TOKEN" \ "http://localhost:9384/v1/file/parent_folder?file_id=AAA_FILE_123" # Result: Returns User A's folder metadata ❌ # Expected: "No authorization." ✅ Fix Pass userID from handler to service and call checkFileTeamPermission() — same as Download/Delete/Move handlers. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 16:03:23 +08:00
Magicbook1108	f7e8c39dcc	Fix: filter api in dataset document (#14728 ) ### What problem does this PR solve? Fix: filter api in dataset document ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-09 14:45:40 +08:00
jony376	3b6eeabb09	Fix: private dataset authorization bypass in shared dataset access checks (#14645 ) ### Related issues Closes #14644 ### What problem does this PR solve? This PR fixes an authorization bug where datasets marked with `permission = me` could still be accessed by other members of the same tenant through APIs that relied on `KnowledgebaseService.accessible()` or `DocumentService.accessible()`. Before this change, those shared access helpers only checked tenant membership and did not enforce the dataset's permission mode. As a result, a non-owner who knew a private `dataset_id` could still reach downstream document and chunk operations even though the dataset was intended to be owner-only. This change updates the central access checks so that: - dataset owners always retain access - joined tenant members only get access when the dataset permission is `TEAM` - private datasets with `permission = me` remain inaccessible to non-owners - document-level access follows the same dataset permission rules The PR also adds regression coverage for private-vs-team dataset access behavior. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - Added `test/unit_test/api/db/services/test_dataset_access_permissions.py` - Attempted to run: `python -m pytest test\\unit_test\\api\\db\\services\\test_dataset_access_permissions.py -q` - Local execution in this workspace is currently blocked during test collection because the environment is missing the `strenum` dependency --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: jony376 <jony376@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com> Co-authored-by: d 🔹 <liusway405@gmail.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Magicbook1108 <newyorkupperbay@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: sxxtony <166789813+sxxtony@users.noreply.github.com> Co-authored-by: sxxtony <sxxtony@users.noreply.github.com> Co-authored-by: Baki Burak Öğün <63836730+bakiburakogun@users.noreply.github.com> Co-authored-by: bakiburakogun <bakiburakogun@users.noreply.github.com> Co-authored-by: Panda Dev <56657208+pandadev66@users.noreply.github.com> Co-authored-by: Haruko386 <tryeverypossible@163.com> Co-authored-by: D2758695161 <13510221939@163.com> Co-authored-by: Hunter <hunter@yitong.ai> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: buua436 <sz_buua@foxmail.com> Co-authored-by: web-dev0521 <jasonpette1783@gmail.com> Co-authored-by: Tim Wang <38489718+wanghualoong@users.noreply.github.com> Co-authored-by: wanghualoong <wanghualoong@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: qinling0210 <88864212+qinling0210@users.noreply.github.com> Co-authored-by: dale053 <star05223@outlook.com>	2026-05-09 13:30:14 +08:00
Wang Qi	42504fa18c	Bugfix: keep document api backward compatible (#14726 ) ### What problem does this PR solve? Bugfix: keep document api backward compatible Fix 1: https://github.com/infiniflow/ragflow/issues/14634 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-09 13:03:09 +08:00
VincentLambert	870bc59365	Fix: Bedrock api_key overridden by existing-key fallback in add_llm (#14707 ) ## Summary - Adding a Bedrock model from the frontend fails with `Fail to access model(Bedrock/<model>).Expecting value: line 1 column 1 (char 0)`. - The assembled Bedrock JSON credentials are silently replaced by `"x"` before the connection test, causing `json.loads("x")` to raise a `JSONDecodeError`. ## What problem does this PR solve? Commit `050113482` introduced a fallback in `add_llm()` that reuses the existing DB key when `req.get("api_key") is None`: ```python if req.get("api_key") is None: api_key = existing_api_key if existing_api_key is not None else "x" ``` For Bedrock, credentials are sent as separate fields (`auth_mode`, `bedrock_ak`, `bedrock_sk`, `bedrock_region`, `aws_role_arn`) — the frontend does not send an `api_key` field. The function correctly assembles the JSON key: ```python api_key = apikey_json(["auth_mode", "bedrock_ak", "bedrock_sk", "bedrock_region", "aws_role_arn"]) ``` But since `req.get("api_key")` is `None`, the override immediately replaces `api_key` with `"x"` (or a stale DB value). `LiteLLMBase` then calls `json.loads("x")` for Bedrock auth → `JSONDecodeError`. ## Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ## Changes `api/apps/llm_app.py` Write the assembled key into `req["api_key"]` so the `None` check evaluates to `False` and the override is skipped — consistent with how `Tencent Cloud` is already handled. ```python # Before api_key = apikey_json(["auth_mode", "bedrock_ak", "bedrock_sk", "bedrock_region", "aws_role_arn"]) # After req["api_key"] = apikey_json(["auth_mode", "bedrock_ak", "bedrock_sk", "bedrock_region", "aws_role_arn"]) api_key = req["api_key"] ``` ## Test plan - [ ] Configure a Bedrock provider in Model Providers with valid AWS credentials - [ ] Add a Bedrock chat model — verify no `Expecting value` error - [ ] Update the same model — verify the existing key is reused correctly when credentials fields are left empty 🤖 Generated with [Claude Code](https://claude.ai/claude-code) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-09 10:54:58 +08:00
Xing Hong	c428187350	Fix: validate kb_ids as UUIDs before SQL interpolation in use_sql (#14087 ) ### What problem does this PR solve? The use_sql() function in dialog_service.py constructed SQL WHERE clauses and Infinity table names by directly interpolating kb_id values using Python f-strings, with no validation of the input values. A malformed or maliciously crafted kb_id (introduced via a compromised admin account or a separate injection vector) could alter the structure of the generated SQL query, potentially leading to unauthorized data access or data manipulation. This PR adds strict UUID format validation for all kb_id values before they are interpolated into any SQL string, causing requests with invalid IDs to fail fast with a ValueError rather than executing a tampered query. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2026-05-09 10:52:06 +08:00
VincentLambert	c44dc85143	Fix: IMAGE2TEXT→CHAT fallback with model_type normalization in tenant_model_service (#14704 ) ## Summary - When a model is registered as `chat` in `tenant_llm` but has the `IMAGE2TEXT` tag in `llm_factories.json`, requesting it as `image2text` (e.g. PDF parser) fails with `Tenant Model with name <model> and type image2text not found`. - After resolution via the new fallback, the returned `config_dict["model_type"]` was still `"chat"`, causing `tenant_llm_service.model_instance()` to instantiate `ChatModel` instead of `CvModel` — breaking `describe_with_prompt` at ingestion time. ## What problem does this PR solve? RAGFlow already has a `CHAT→IMAGE2TEXT` fallback: when a chat model is not found, it retries with `image2text`. The symmetric fallback (`IMAGE2TEXT→CHAT`) was missing. This matters for multimodal models declared as `model_type: "chat"` with an `IMAGE2TEXT` tag in `llm_factories.json` (e.g. models added after tenant creation, or providers where a single model serves both purposes). The frontend PDF parser selector correctly surfaces these models via the `IMAGE2TEXT` tag, but the backend fails to resolve them at runtime. ## Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ## Changes `api/db/joint_services/tenant_model_service.py` 1. Add `IMAGE2TEXT→CHAT` fallback in `get_model_config_by_type_and_name`: when an `image2text` model is not found in `tenant_llm`, retry with `chat` — but only if the `llm` table confirms `IMAGE2TEXT` capability via the `tags` field. This mirrors the philosophy of the existing `CHAT→IMAGE2TEXT` fallback: substitution is only allowed when the model has declared the required capability. 2. Normalize `config_dict["model_type"]` to `image2text` after the fallback, so the caller (`model_instance`) correctly routes to `CvModel` instead of `ChatModel`. 3. Extend the type validation guard to allow `(requested=image2text, found=chat)` alongside the existing `(requested=chat, found=image2text)` exception. ## Test plan - [ ] Add a model with `model_type=chat` and `tags` containing `IMAGE2TEXT` to a tenant - [ ] Select it as PDF parser in a knowledge base - [ ] Verify ingestion succeeds without `image2text not found` or `describe_with_prompt` errors - [ ] Verify the same model still works correctly in chat context 🤖 Generated with [Claude Code](https://claude.ai/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-09 10:40:58 +08:00
Octopus	653b00b94c	fix(sync): scope document IDs per connector to prevent cross-KB collisions (#14378 ) Fixes #14360 ## Problem When the same blob storage bucket is connected to multiple knowledge bases (each through a different data source connector), the sync pipeline hashes only the blob path (`bucket_type:bucket_name:object_key`) to derive the document ID. Every connector pointing at the same bucket therefore produces identical IDs for the same object. The collision guard in `FileService.upload_document` then fires for the second knowledge base: ``` Existing document id collision with another knowledge base; skipping update. ``` This makes it impossible to index the same bucket into more than one KB simultaneously. ## Solution Include `connector_id` in the hash input so that each connector produces a distinct document ID even when the underlying blob path is identical: ```python # Before "id": hash128(doc.id), # After "id": hash128(f"{task['connector_id']}:{doc.id}"), ``` Because each KB connection uses its own connector (with a unique `connector_id`), documents are now namespaced per connector and no collision occurs. Note: This is a breaking change for existing synced data sources. After upgrading, a re-sync will create new documents with the updated ID format. Old documents (indexed under the previous format) will remain in the database but can be manually deleted or cleaned up via a re-sync with reindex enabled. ## Testing - Verified that the one-line change produces unique IDs for two connectors pointing at the same S3 path. - Existing unit test `test_upload_document_skips_cross_kb_document_id_collision` continues to pass — the collision guard in `FileService` is still valid for genuinely colliding IDs from other sources. --------- Co-authored-by: octo-patch <octo-patch@github.com>	2026-05-09 10:33:54 +08:00
Wang Qi	7d35e40c7b	Refactor : Allow search multiple datasets (#14685 ) ### What problem does this PR solve? Refactor : Allow search multiple datasets 1. support /datasets/search 2. get rid of /graph/search, use /graph ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-05-08 19:01:35 +08:00
dale053	26d70189b6	fix: enforce tenant-scoped authorization for chatbot SDK endpoints (#14592 ) Closes #14590 ## Self Checks - [x] I have searched for existing issues [search for existing issues](https://github.com/infiniflow/ragflow/issues), including closed ones. - [x] I confirm that I am using English to submit this report ([Language Policy](https://github.com/infiniflow/ragflow/issues/5910)). - [x] Non-english title submitions will be closed directly ( 非英文标题的提交将会被直接关闭 ) ([Language Policy](https://github.com/infiniflow/ragflow/issues/5910)). - [x] Please do not modify this template :) and fill in all the required fields. ## RAGFlow workspace code commit ID `a1b2c3d4e5f67890123456789abcdef12345678` ## RAGFlow image version `0.13.1` ## Other environment information - Hardware parameters: N/A - OS type: Linux 6.17.0-22-generic - Others: API key authentication via `Authorization: Bearer <token>` ## Actual behavior The chatbot API endpoints: - `POST /chatbots/<dialog_id>/completions` - `GET /chatbots/<dialog_id>/info` validate only that the bearer token exists in `APIToken`, but do not verify that `dialog_id` belongs to the same tenant as that token. Current flow (simplified): 1. Route extracts bearer token and checks `APIToken.query(beta=token)`. 2. If token exists, request is accepted. 3. Downstream service resolves dialog globally by ID (`DialogService.get_by_id(dialog_id)` in `conversation_service.py`). 4. No tenant ownership check is enforced for `dialog_id`. Impact: Any user with a valid API key can attempt arbitrary `dialog_id` values and access/invoke chatbots outside their own tenant boundary if IDs are known/guessed/leaked. Security classification: - Vulnerability class: Broken Access Control (IDOR, OWASP Top 10 A01) - Severity recommendation: Critical - Exploit prerequisite: any valid API key + discoverable target `dialog_id` ## Expected behavior Requests to `/chatbots/<dialog_id>/completions` and `/chatbots/<dialog_id>/info` must be authorized only when: 1. bearer token is valid, and 2. `dialog_id` belongs to the same `tenant_id` as the token. Otherwise, reject with authorization failure (e.g., 403 or 404-equivalent policy). ## Steps to reproduce 1. Prepare two tenants: - Tenant A with API key `TOKEN_A` - Tenant B with chatbot `dialog_id = DIALOG_B` 2. Send request from Tenant A to Tenant B chatbot completion endpoint: ```bash curl -X POST "https://<host>/chatbots/DIALOG_B/completions" \ -H "Authorization: Bearer TOKEN_A" \ -H "Content-Type: application/json" \ -d '{"question":"hello","stream":false}' ``` 3. Observe request is processed (or reaches dialog resolution) without tenant ownership rejection. 4. Repeat against info endpoint: ```bash curl -X GET "https://<host>/chatbots/DIALOG_B/info" \ -H "Authorization: Bearer TOKEN_A" ``` 5. Observe the same missing ownership enforcement. ## Additional information Affected code paths: - `api/apps/sdk/session.py` - `chatbot_completions(dialog_id)` - `chatbots_inputs(dialog_id)` - `api/db/services/conversation_service.py` - `async_iframe_completion(...)` uses global dialog lookup Suggested fix: 1. In both chatbot endpoints: - Resolve `tenant_id = objs[0].tenant_id` from validated token. - Fetch dialog with tenant-scoped query (`DialogService.query(id=dialog_id, tenant_id=tenant_id)`). - Reject if dialog is not found/owned by tenant. 2. Defense in depth: - Require and enforce `tenant_id` in service-layer dialog resolution for external flows. - Avoid global `get_by_id(dialog_id)` where user-controlled dialog IDs are reachable. 3. Add regression tests: - Positive: same-tenant token + dialog succeeds. - Negative: cross-tenant token + dialog fails for both endpoints.	2026-05-08 18:00:18 +08:00
Lynn	ada6d47880	Fix: move file check (#14681 ) ### What problem does this PR solve? Restrict file move operations: prevent moving a folder to itself or to one of its own subfolders. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-08 17:58:37 +08:00
Lynn	69197d4a8f	Fix: type of tenant_rerank_id (#14667 ) ### What problem does this PR solve? Update the type of tenant_rerank_id in validation. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-08 15:32:34 +08:00
buua436	d843035c8b	Fix: add compatibility route for document download under /v1 (#14663 ) ### What problem does this PR solve? add compatibility route for document download under /v1 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-08 14:44:02 +08:00
Tim Wang	1bcb6deb6f	Fix: collapsible thinking display and separate deep research retrieval tag (#14613 ) ## Summary - Collapsible thinking: Replace `<section>` with `<details>` for `<think>` content, so model thinking output is collapsed by default (click to expand). Works for all models that output `<think>` tags (Qwen3, DeepSeek, Gemini, Claude, etc.). - Fix double thinking tags: When reasoning/deep research mode is enabled in knowledge base chat, both the retrieval progress and model thinking were wrapped in `<think>` tags, producing two "Thinking..." blocks. Now retrieval progress uses a dedicated `<retrieving>` tag rendered as a separate "Retrieving..." collapsible with a distinct green accent. ### Before - Thinking content displayed as flat gray-bordered `<section>`, occupying significant screen space - Deep research + model thinking both use `<think>` → two identical "Thinking..." blocks ### After - Thinking content collapsed by default in a `<details>` element, click "Thinking..." to expand - Deep research shows "Retrieving..." (green border), model thinking shows "Thinking..." (gray border) ## Changes Backend (`api/db/services/dialog_service.py`) - Deep research callback: replace `start_to_think`/`end_to_think` marker flags with direct `<retrieving>`/`</retrieving>` answer text Frontend - `web/src/utils/chat.ts`: `replaceThinkToSection()` now uses `<details>` instead of `<section>`; add new `replaceRetrievingToSection()` - 4 tsx files: import and pipe `replaceRetrievingToSection`, whitelist `details`, `summary`, `retrieving` in DOMPurify `ADD_TAGS` - 4 less files: `section.think` → `details.think` with `<summary>` styles; add `details.retrieving` with green accent; dark mode and RTL variants ## Test plan - [ ] Open a chat WITHOUT knowledge base, ask a question to a model with thinking (e.g. Qwen3) → thinking content should be collapsed by default, click "Thinking..." to expand - [ ] Open a chat WITH knowledge base and reasoning enabled, ask a question → "Retrieving..." (green) shows retrieval progress, "Thinking..." (gray) shows model thinking, each independently collapsible - [ ] Verify dark mode renders correctly for both collapsible blocks - [ ] Verify RTL layout renders correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: wanghualoong <wanghualoong@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-08 14:40:00 +08:00
web-dev0521	d51fb88573	Fix: enforce tenant authorization on document download endpoint (#14618 ) (#14625 ) ### What problem does this PR solve? Closes #14618. The `GET /v1/document/get/<doc_id>` endpoint in `api/apps/document_app.py` was protected only by `@login_required` and called `DocumentService.get_by_id(doc_id)` without verifying that the document's knowledge base belonged to the requesting user's tenant. Any authenticated user who knew (or guessed) a document ID could download files belonging to any other tenant — a cross-tenant IDOR. This PR adds a `DocumentService.accessible(doc_id, current_user.id)` check before serving the file. The helper already exists and joins `Document` → `Knowledgebase` → `UserTenant` to verify the requesting user belongs to the tenant that owns the document's KB. The same pattern is already used by `api/apps/restful_apis/document_api.py` and mirrors the tenant scoping in the SDK route at `api/apps/sdk/doc.py`. The check returns the existing `"Document not found!"` error for both non-existent and inaccessible documents, so attackers cannot use the response to enumerate valid doc IDs across tenants. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Other (please describe): Security fix (cross-tenant IDOR / authorization bypass)	2026-05-08 14:24:03 +08:00
jony376	6547751936	Fix: missing authorization checks in `/files/link-to-datasets` (#14649 ) ### Related issues Closes #14648 ### What problem does this PR solve? This PR fixes an authorization flaw in `POST /files/link-to-datasets`. Before this change, the endpoint only checked whether the supplied `file_ids` and `kb_ids` existed. It did not verify whether the authenticated user was actually allowed to access those files or target datasets. As a result, an authenticated user who knew valid IDs could relink another user's files to arbitrary datasets. This was especially risky because the relinking flow is state-changing: the background worker removes existing file-document mappings and then recreates documents under the attacker-supplied dataset IDs. This change makes the route enforce the same permission model already used by nearby file and document operations: - each resolved file must pass `check_file_team_permission(...)` - each target dataset must pass `check_kb_team_permission(...)` - authorization is enforced before scheduling background relinking work ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - Added regression coverage in `test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py` - Covered: - unauthorized file access is rejected - unauthorized dataset access is rejected - existing success path still returns immediately after scheduling background work - Attempted to run: - `python -m pytest test\\testcases\\test_web_api\\test_file_app\\test_file2document_routes_unit.py -q` - Local execution in this workspace is currently blocked by missing test dependencies during bootstrap, including `ragflow_sdk` --------- Co-authored-by: jony376 <jony376@gmail.com>	2026-05-08 13:49:23 +08:00
buua436	f703169117	Refa: migrate document preview/download to RESTful API (#14633 ) ### What problem does this PR solve? migrate document preview/download to RESTful API ### Type of change - [x] Refactoring	2026-05-08 13:26:13 +08:00
Lynn	412fae7ac2	Fix: display error (#14654 ) ### What problem does this PR solve? Use right key in error text. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-08 13:11:59 +08:00
sxxtony	59c35100c5	Perf: push metadata filters down to Elasticsearch (#14576 ) ### What problem does this PR solve? Fixes #14412. `common.metadata_utils.meta_filter` evaluates user-defined metadata conditions in Python after `DocMetadataService.get_flatted_meta_by_kbs` loads the entire `meta_fields` table into memory. Past a few thousand documents per knowledge base this becomes a memory bottleneck and a wasted ES round-trip — every filter request currently fetches up to 10000 metadata rows even when the resulting `doc_ids` list is tiny. This PR adds an ES push-down path that translates the same filter language into a `bool` query and returns just the matching document IDs. Changes - `common/metadata_es_filter.py` (new): pure-Python translator from the RAGflow filter list to ES DSL. Covers every operator the in-memory path supports (`=`, `≠`, `>`, `<`, `≥`, `≤`, `in`, `not in`, `contains`, `not contains`, `start with`, `end with`, `empty`, `not empty`) with `case_insensitive: true` on `prefix` and `wildcard` for parity with the existing lower-cased Python comparisons. User wildcard metacharacters are escaped before being injected into `wildcard` patterns. Negative operators (`≠`, `not in`, `not contains`, ranges) are wrapped with an `exists` guard so they do not accidentally match documents missing the key, matching the legacy `if k not in metas` behaviour. - `api/db/services/doc_metadata_service.py`: new `DocMetadataService.filter_doc_ids_by_meta_pushdown(kb_ids, filters, logic)` that returns the doc IDs ES matched, or `None` to signal the caller should fall back to the in-memory path. Returns `None` when the active doc store is Infinity (`meta_fields` is a JSON column, not a dotted-object mapping), when any filter cannot be expressed in DSL (`UnsupportedMetaFilter`), or when the ES request or metadata index lookup errors. - `common/metadata_utils.py`: `apply_meta_data_filter` accepts an optional `kb_ids` argument. When supplied, conditions go through push-down first via a new `_try_meta_pushdown` helper; on `None` the function falls back to the original `meta_filter` call. Default behaviour is unchanged for callers that don't pass `kb_ids`. - Updated all four callers (`agent/tools/retrieval.py`, `api/db/services/dialog_service.py` ×2, `api/apps/services/dataset_api_service.py`, `api/apps/sdk/session.py`) to forward `kb_ids` so the push-down path is exercised in production. - `test/unit_test/common/test_metadata_es_filter.py` (new): 35 unit tests covering every operator's DSL shape, value coercion (`ast.literal_eval`, lowercasing, ISO-date pass-through), wildcard escaping, OR-logic wrapping that protects negative clauses, and the doc-ID extractor. Behaviour preserved - The in-memory `meta_filter` is untouched and still services every fallback case (Infinity backend, unknown operators, ES outages). - The eligibility / credibility / issue-multiplier semantics described in the LLM-driven `auto` and `semi_auto` modes still hand the LLM the full in-memory `metas` dict to choose conditions from. Only the evaluation of those generated conditions is pushed down. - Existing tests in `test/unit_test/common/test_metadata_filter_operators.py` continue to pass (14/14). Test plan - `pytest test/unit_test/common/test_metadata_es_filter.py` — 35 passed. - `pytest test/unit_test/common/test_metadata_filter_operators.py` — 14 passed. - `ruff check` clean on every modified file. - Reviewer please validate the ES query shapes against a live cluster — particularly `case_insensitive` on `wildcard` and `prefix` (requires ES 7.10+) and the `exists` + `must_not` pairing for `≠`. Notes - The first cut caps each push-down request at 10000 results, matching the existing `get_flatted_meta_by_kbs` limit, and logs a warning when the cap is hit. A `search_after` follow-up would let us drop the cap entirely once the push-down path is validated. - Operator parity with the in-memory path is exact for the canonical unicode operators (`≥`, `≤`, `≠`) used internally; the ASCII aliases (`>=`, `<=`, `!=`) are normalised by `convert_conditions` before they reach the translator. ### Type of change - [x] Performance Improvement --------- Co-authored-by: sxxtony <sxxtony@users.noreply.github.com>	2026-05-07 21:23:43 +08:00
Jin Hai	94324afee9	Go: fix auth issue in hybrid mode (#14611 ) ### What problem does this PR solve? Since secret key get and set logic is updated, the go server also need to update. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:14:22 +08:00
buua436	0501134820	Fix: support tool call config (#14616 ) ### What problem does this PR solve? support tool call config ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-07 15:54:57 +08:00

1 2 3 4 5 ...

1605 Commits