ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-29 20:17:35 +08:00

Author	SHA1	Message	Date
Jake Armstrong	93d3deb5e4	Fix admin CLI system variable commands (#14956 ) ## What Fixes #12409. Implements admin CLI support for: - `list vars;` - `show var <name-or-prefix>;` - `set var <name> <value>;` ## Changes - Wire Go CLI variable commands to the admin API. - Support integer and quoted string values in `SET VAR`. - Return variable rows as `data_type`, `name`, `setting_type`, and `value`. - Add exact-name lookup with prefix fallback for `SHOW VAR`. - Validate values by stored data type: `string`, `integer`, `bool`, and `json`. - Keep the legacy Python admin CLI/server behavior aligned. - Update admin CLI docs and add focused tests. ## Verification - `go test -count=1 ./internal/cli` - `python3.12 -m py_compile admin/server/services.py admin/server/routes.py api/db/services/system_settings_service.py admin/client/parser.py admin/client/ragflow_client.py` - Python admin CLI parser smoke test for `SET VAR`, quoted values, `SHOW VAR`, and `LIST VARS`. - Attempted `./run_go_tests.sh`; local environment is missing native tokenizer/linker artifacts: - `internal/cpp/cmake-build-release/librag_tokenizer_c_api.a` - `-lstdc++` Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-18 19:08:45 +08:00
Wang Qi	732e4741c4	Bugfix: fix tag show (#14980 ) ### What problem does this PR solve? Bugfix: fix tag show ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-18 18:55:01 +08:00
Hamza Amin Khokhar	2dbe3b8a62	fix: metadata_condition returning all docs when filter matches nothing (#14967 ) ### What problem does this PR solve? When _parse_doc_id_filter_with_metadata returns [], the empty list is falsy so the WHERE id IN (...) clause was silently skipped, causing the full dataset to be returned instead of an empty result. Change `if doc_ids:` to `if doc_ids is not None:` in both get_list() and get_by_kb_id() to distinguish between no filter (None) and a filter that matched zero documents ([]). Fixes #14962 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-18 18:54:30 +08:00
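The difference between "no filter" (`None`) and "a filter that matched nothing" (`[]`) described in the commit above can be sketched as follows. The `select_docs` function and the in-memory `DOCS` table are hypothetical stand-ins for illustration, not RAGFlow's `get_list()` or `get_by_kb_id()`.

```python
from typing import Optional

# Hypothetical in-memory stand-in for a documents table.
DOCS = [{"id": "a"}, {"id": "b"}, {"id": "c"}]


def select_docs(doc_ids: Optional[list] = None) -> list:
    """Return every row when doc_ids is None, otherwise only the listed ids."""
    rows = DOCS
    # `if doc_ids:` would treat [] as "no filter" and return every row.
    # `is not None` keeps the filter active even when it matched nothing.
    if doc_ids is not None:
        wanted = set(doc_ids)
        rows = [row for row in rows if row["id"] in wanted]
    return rows
```

With this guard, `select_docs([])` yields an empty result, while `select_docs()` still returns the full set.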
Wang Qi	13b422037f	Refactor: enhance graphrag - part 2 (#14972 ) ### What problem does this PR solve? 1. Expose `batch_chunk_token_size` for configuration. 2. Retrieve chunks when building the subgraph for each document, instead of retrieving all documents' chunks at the beginning. 3. Fetch all chunks for a document; the limit used to be hard-coded to 10000. 4. Delete the unused method `run_graphrag`. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring Follow on: #14617	2026-05-18 16:10:21 +08:00
dev	b12eaee38b	fix(api): enforce tenant access for connector routes (#14747 ) ### What problem does this PR solve? Fixes #14746. Adds tenant access checks for connector-by-id REST routes before reading connector details, mutating connector config/status, deleting connectors, rebuilding, or listing sync logs. Unauthorized callers now receive `RetCode.AUTHENTICATION_ERROR` with `No authorization.` without reaching the connector/log mutation paths. Validation: - `python3 -m pytest --confcutdir=test/testcases/test_web_api/test_connector_app test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py` - `uvx ruff check api/apps/restful_apis/connector_api.py api/db/services/connector_service.py test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: dev111-actor <dev111-actor@users.noreply.github.com>	2026-05-18 16:09:26 +08:00
Wang Qi	56d73d0c2c	Refactor: speed up ragflow server, save startup memory (#14973 ) ### What problem does this PR solve? Refactor: speed up RAGFlow server startup and reduce startup memory, saving about 200 MiB and 5-9 seconds of start time. ##### Before 1241292 | | \_ python3 api/ragflow_server.py RAGFlow server is ready after 25.61845850944519s initialization. ##### After 1019968 | | \_ python3 api/ragflow_server.py RAGFlow server is ready after 16.205134391784668s initialization. ### Type of change - [x] Refactoring	2026-05-18 15:55:59 +08:00
dale053	fe82a96193	Fix: add SSRF guard for agent test_db_connection endpoint (#14860 ) ### What problem does this PR solve? Closes #14858 The `test_db_connection` endpoint in the agent API accepts a user-supplied `host` and connects to it directly via database drivers (MySQL/PostgreSQL) without any validation. This allows an attacker to probe internal network addresses (e.g. `127.0.0.1`, `10.x.x.x`, link-local, etc.) through the server — a classic Server-Side Request Forgery (SSRF) vulnerability. This PR adds an SSRF guard that resolves the host and rejects any address that is not globally routable before the database connection is attempted. Changes: - `common/ssrf_guard.py` — Added `assert_host_is_safe()`, a host-level counterpart of the existing `assert_url_is_safe()`, designed for non-HTTP protocols (database drivers) where there is no URL to parse. - `api/apps/restful_apis/agent_api.py` — Call `assert_host_is_safe(req["host"])` at the top of `test_db_connection` so that non-public hosts are rejected early with a clear error message. Fixes #14858 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-18 14:32:44 +08:00
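A host-level guard of the kind described in the commit above might look roughly like this. It is a minimal sketch using only the standard library; the names `is_public_ip` and `assert_host_is_safe_sketch` are hypothetical, and the real `assert_host_is_safe()` in `common/ssrf_guard.py` may differ.

```python
import ipaddress
import socket


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not loopback, private, link-local)."""
    return ipaddress.ip_address(address).is_global


def assert_host_is_safe_sketch(host: str) -> None:
    """Resolve the host and raise ValueError if any resolved address is not public."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as exc:
        raise ValueError(f"cannot resolve host {host!r}") from exc
    for info in infos:
        address = info[4][0]  # sockaddr tuple: the first element is the IP string
        if not is_public_ip(address):
            raise ValueError(f"host {host!r} resolves to non-public address {address}")
```

One limitation of this sketch: the database driver resolves the name again when it connects, so a resolve-then-check guard does not by itself rule out DNS rebinding.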
qinling0210	f1d2383572	Push metadata filters down to Infinity (#14974 ) ### What problem does this PR solve? Push metadata filters down to Infinity ### Type of change - [x] Refactoring	2026-05-18 14:22:04 +08:00
Kevin Hu	7cdc74bbe5	Refactor: Drop the vector fetch for ES (#14970 ) ## Summary - Stop pulling chunk vectors (`q__vec`) back from Elasticsearch in the main retrieval path. ES already knows them; shipping them was pure bandwidth/memory overhead. - Recover the per-chunk cosine similarity via a second KNN-only ES call filtered by the candidate chunk ids. The new `_score` is merged with locally computed term similarity using the user-configured `vector_similarity_weight`. - Lazily fetch the chunk embedding only for the chunks `insert_citations` actually needs. ## Details `rag/nlp/search.py`* - `Dealer.search`: no longer appends `q__vec` to the ES select list. OceanBase still gets it (its rerank path is unchanged). - New `Dealer._knn_scores(sres, idx_names, kb_ids)`: a `MatchDenseExpr` over the cached query vector filtered by `id IN sres.ids`, returning `{chunk_id: cosine_score}` via ES `_score`. - New `Dealer.rerank_with_knn(...)`: term similarity from `qryr.token_similarity` plus the ES-supplied KNN score, combined with `tkweight`/`vtweight` and the existing rank-feature bonus. - New `Dealer.fetch_chunk_vectors(chunk_ids, tenant_ids, kb_ids, dim)`: on-demand vector fetch for citation use. - `Dealer.retrieval` routes Infinity → unchanged, OceanBase → existing local `rerank`, ES → new KNN-score path. `common/doc_store/es_conn_base.py`* - New `get_scores(res)` helper returning `{_id: _score}` directly from hit headers (ES doesn't surface `_score` through `get_fields`). `api/db/services/dialog_service.py` - New top-level `_hydrate_chunk_vectors(...)` helper. On ES it back-fills `ck["vector"]` from `fetch_chunk_vectors` right before `insert_citations`. No-op on Infinity / OB (their chunks already carry vectors). - Both `decorate_answer` closures became `async` and are `await`-ed at all call sites in `async_chat` and `async_ask`. 
## Backend behavior

| Backend | Returns chunk vec in main search | Sim source | Vectors for citations |
|---|---|---|---|
| ES | No | second KNN call (`_score`) merged with term sim | fetched on demand |
| Infinity | No (unchanged) | normalized `_score` | already on chunks |
| OceanBase | Yes (kept) | local hybrid rerank | already on chunks |

## Test plan	2026-05-18 14:21:56 +08:00
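The score merge on the ES path described in the commit above can be illustrated with a small sketch. It assumes the term weight is `1 - vector_similarity_weight`, treats a chunk missing from the KNN result as scoring 0, and omits the rank-feature bonus. None of these details is confirmed by the commit message, and `combine_scores` is a hypothetical name.

```python
def combine_scores(term_sim: dict, knn_score: dict, vector_similarity_weight: float) -> dict:
    """Blend per-chunk term similarity with a KNN score using a single weight."""
    vtweight = vector_similarity_weight
    tkweight = 1.0 - vtweight  # assumption: the two weights sum to 1
    return {
        chunk_id: tkweight * term + vtweight * knn_score.get(chunk_id, 0.0)
        for chunk_id, term in term_sim.items()
    }
```

For example, with a weight of 0.25, a chunk with term similarity 0.5 and KNN score 1.0 ends up at 0.625.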
Rene Arredondo	9f2fb4611f	Fix: guard empty/whitespace embedding inputs in LLMBundle (#14428 ) (#14924 ) Closes #14428 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-18 14:11:54 +08:00
Idriss Sbaaoui	e98f3e5c0d	Fix session deletion leaking chat-upload blobs (#14969 ) ### What problem does this PR solve? This fixes a bug where files uploaded in chat were left in storage after the session was deleted. It now removes those chat-uploaded blobs during session deletion. fixes #14965 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-18 11:14:27 +08:00
wdeveloper16	14c0985182	feat: bump Python minimum from 3.12 to 3.13, drop strenum backport (#14767 ) Closes #14753 ## What changed

| File | Change |
|---|---|
| `pyproject.toml` | `requires-python` → `>=3.13,<3.15`; remove `strenum==0.4.15` |
| `Dockerfile` | `uv python install 3.13`, `uv sync --python 3.13` |
| `.github/workflows/tests.yml` | `uv sync --python 3.13` on both matrix legs |
| `CLAUDE.md` | dev setup command + requirements note updated |
| `deepdoc/parser/mineru_parser.py` | `from strenum import StrEnum` → `from enum import StrEnum` |
| `agent/tools/code_exec.py` | same |

`StrEnum` has been in the stdlib since Python 3.11 — the `strenum` backport package is no longer needed once the floor is 3.13. ## Why uv.lock is not regenerated `uv lock --python 3.13` fails because: 1. The infiniflow/graspologic fork pins `numpy>=1.26.4,<2.0.0` 2. `tensorflow-cpu>=2.20.0` (the first release with cp313 wheels) depends on `ml-dtypes>=0.5.1`, which requires `numpy>=2.1.0` 3. These two constraints are irreconcilable on Python 3.13 The lockfile regeneration requires loosening the `numpy` upper bound in the `infiniflow/graspologic` fork. Once that fork commit is updated and the SHA in `pyproject.toml:49` is bumped, `uv lock --python 3.13` will succeed. ## RFC corrections Two claims in the original RFC (#14753) did not hold up under code review: - "graspologic hard-blocks 3.13" — the infiniflow fork at the pinned commit has no `<3.13` Python constraint. The blocker is the transitive `numpy<2.0.0` conflict with tensorflow-cpu's test dependency, not a direct Python version cap. - "free-threading throughput gains for I/O-bound workload" — Python 3.13 free-threading requires a special `--disable-gil` build and provides no benefit for async I/O code (the GIL is already released during I/O). The real motivation is forward compatibility and improved error messages.	2026-05-15 14:40:53 +08:00
plind	c9622d0924	fix(agentbot): aggregate structured output in non-streaming completions (#14848 ) ## What problem does this PR solve? Closes #13384. The `/api/v1/agentbots/<agent_id>/completions` non-streaming path returned the first yielded SSE chunk and exited: ```python async for answer in agent_completion(objs[0].tenant_id, agent_id, **req): return get_result(data=answer) ``` That meant structured output, the full assistant message, and reference data were all dropped when an agent was called with `stream=false`. Streaming worked because each event was forwarded individually; non-streaming was returning a raw SSE-formatted string from a single early event. The v1 endpoint at [`agent_api.py:1006-1050`](https://github.com/infiniflow/ragflow/blob/main/api/apps/restful_apis/agent_api.py#L1006-L1050) already handles this correctly. This PR mirrors that aggregation in the SDK beta endpoint: parse each SSE line, accumulate `content` from `message` events, merge `reference`, collect `outputs.structured` from each `node_finished` event keyed by `component_id`, and attach all of them to the final response. ## Type of change - [x] Bug fix (non-breaking change which fixes an issue) ## Test plan - [ ] Build an agent with a node that emits structured output, call `POST /api/v1/agentbots/<agent_id>/completions` with `stream=false` and a beta API token, verify `data.structured.<component_id>` is present in the response. - [ ] Same agent with `stream=true` — verify behavior is unchanged. - [ ] Agent without structured output — verify `data.structured` is omitted, `content` and `reference` still aggregated correctly.	2026-05-15 12:42:33 +08:00
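The aggregation described in the commit above (parse each SSE line, accumulate `content`, merge `reference`, collect `outputs.structured` by `component_id`) could be sketched like this. The event layout, `{"event": ..., "data": {...}}` carried on `data:` lines, is an assumption made for illustration, and `aggregate_sse` is a hypothetical name rather than the endpoint's actual code.

```python
import json


def aggregate_sse(lines):
    """Fold a stream of SSE lines into one non-streaming response dict."""
    content_parts, reference, structured = [], {}, {}
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and other SSE fields
        try:
            event = json.loads(line[len("data:"):].strip())
        except json.JSONDecodeError:
            continue  # skip payloads that are not JSON
        if not isinstance(event, dict) or not isinstance(event.get("data"), dict):
            continue
        data = event["data"]
        if event.get("event") == "message":
            content_parts.append(data.get("content") or "")
            reference.update(data.get("reference") or {})
        elif event.get("event") == "node_finished":
            output = (data.get("outputs") or {}).get("structured")
            if output:
                structured[data.get("component_id")] = output
    result = {"content": "".join(content_parts), "reference": reference}
    if structured:  # omitted entirely when no node emitted structured output
        result["structured"] = structured
    return result
```

The key point is that the loop consumes every event before returning, rather than returning from inside the loop on the first chunk.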
Sebastion	547b8cf9d8	security: always use RestrictedUnpickler in deserialize_b64 (CWE-502) (#14803 ) ## Summary Harden `api/utils/configs.deserialize_b64` so that it always routes pickle data through the existing `RestrictedUnpickler` (`restricted_loads`) rather than falling back to bare `pickle.loads()`. - CWE-502 — Deserialization of Untrusted Data - File / function: `api/utils/configs.py` → `deserialize_b64` - Caller: `SerializedField.python_value` in `api/db/db_models.py` (invoked by Peewee whenever a pickled DB column is read) ## The issue Before this change, `deserialize_b64` consulted a `use_deserialize_safe_module` config flag that defaults to `False` and is not set anywhere in the repository: ```python use_deserialize_safe_module = get_base_config('use_deserialize_safe_module', False) if use_deserialize_safe_module: return restricted_loads(src) return pickle.loads(src) # <-- default path ``` So the default code path was unrestricted `pickle.loads()` on bytes read from a MySQL `SerializedField(serialized_type=PICKLE)` column. Any attacker who can influence those bytes (SQL injection elsewhere, compromised DB credentials, a backup restored from an untrusted source, or a compromised replication peer) can craft a pickle payload that achieves arbitrary code execution on the ragflow application server when the field is next read. Today no model in-tree instantiates a `SerializedField` with the default PICKLE type — only `JsonSerializedField` is used in practice — so the attack surface is currently latent rather than actively reachable through an HTTP endpoint. But the insecure-by-default behaviour is a sharp edge: any future field that uses the default PICKLE serialization would silently inherit RCE-on-read semantics. 
## The fix ```diff - use_deserialize_safe_module = get_base_config( - 'use_deserialize_safe_module', False) - if use_deserialize_safe_module: - return restricted_loads(src) - return pickle.loads(src) + return restricted_loads(src) ``` `restricted_loads` is the existing `RestrictedUnpickler` already defined in the same file, which limits permitted modules to `numpy` and `rag_flow`. The config flag (and the now-dead `get_base_config` import) are removed. Diff is 1 insertion / 6 deletions, scoped to a single function. ## Testing - Built a malicious pickle whose `__reduce__` resolves to `posix.system('id')`. Pre-fix: executes. Post-fix: `restricted_loads` raises `UnpicklingError: global 'posix.system' is forbidden`. - Round-tripped a benign `numpy.ndarray` through `serialize_b64` → `deserialize_b64`. Values preserved bit-for-bit. - Confirmed `use_deserialize_safe_module` is not set in any config file in the tree, so removing the flag does not change any operator-facing knob that was actually in use. ## A note on `restricted_loads` itself The existing `SECURITY.md` notes that `restricted_loads`'s `numpy` allow-list can still be reached via `numpy.f2py.diagnose.run_command`. This PR does not attempt to fix that — it is a separate hardening question about tightening the allow-list to specific symbols rather than whole modules. The change here strictly improves on the status quo (bare `pickle.loads`) and brings the default path in line with what the `restricted_loads` helper was clearly designed for. Happy to follow up with a separate PR narrowing the allow-list if that direction is welcome. ## Adversarial review Before submitting, we tried to argue this finding away. 
The two strongest objections are (1) "no field uses PICKLE today, so this is unreachable" — true, but the default behaviour of a security-sensitive helper still matters because new fields silently inherit it; and (2) "the attacker already needs DB write access, which is game over" — partially true, but pickle-RCE meaningfully escalates data tampering into code execution on the application host (filesystem, internal network, in-process secrets), which is not equivalent. The fix is one line of real code, has no behavioural cost for legitimate callers, and removes an insecure default. We decided it was worth filing. --- <sub>_Submitted by Sebastion — autonomous open-source security research from [Foundation Machines](https://foundationmachines.ai). Free for public repos via the [Sebastion AI GitHub App](https://github.com/marketplace/sebastion-ai)._</sub>	2026-05-15 10:58:27 +08:00
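For readers unfamiliar with the pattern, a restricted unpickler along the lines the commit above relies on can be sketched with the standard library alone. The allow-list below (`collections`) is purely illustrative; per the commit message, the real `restricted_loads` permits `numpy` and `rag_flow`, and its implementation may differ from this sketch.

```python
import io
import pickle

# Illustrative allow-list; the real helper permits different modules.
ALLOWED_MODULES = {"collections"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve globals outside the allow-list."""

    def find_class(self, module, name):
        if module.split(".")[0] not in ALLOWED_MODULES:
            raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")
        return super().find_class(module, name)


def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads that enforces the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Because `find_class` is consulted before any callable named in the payload is invoked, a payload whose `__reduce__` points at a forbidden global fails with `UnpicklingError` instead of executing. As the commit notes, allowing whole modules is still coarse; narrowing the check to specific `(module, name)` pairs would be stricter.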
buua436	58819f5d3e	fix: add document download endpoint and refactor existing download function (#14927 ) ### What problem does this PR solve? add document download endpoint and refactor existing download function ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-15 09:36:58 +08:00
wdeveloper16	a98994ff91	fix: close db connections reliably in test_db_connection (#14777 ) ## Summary - Fixes resource-management bugs in the `POST /agents/test_db_connection` endpoint where database connections could be left open on error (part of #14750) ## Changes - `api/apps/restful_apis/agent_api.py` — `test_db_connection`: - mysql / mariadb / oceanbase / postgres: replaced bare `db.connect()` / `db.close()` fallthrough with `with db.connection_context()` and a probe `SELECT 1` — guaranteed close on both success and exception - mssql: nested `try/finally` blocks so `cursor.close()` and `db.close()` are always called even when `cursor.execute()` raises - trino: wrapped cursor ops in `try/finally` for the same reason - Removed the `if req["db_type"] != "mssql": db.connect(); db.close()` shared fallthrough block — each branch now owns its teardown - Consolidated to a single `return get_json_result(...)` after the if/elif chain	2026-05-14 16:45:44 +08:00
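The nested `try/finally` shape described for the mssql and trino branches in the commit above can be sketched generically. The example uses `sqlite3` purely as a stand-in driver so that it runs anywhere; `probe_connection` is a hypothetical name, not the endpoint's code.

```python
import sqlite3


def probe_connection(path: str = ":memory:") -> bool:
    """Open a connection, run SELECT 1, and always release cursor and connection."""
    conn = sqlite3.connect(path)
    try:
        cursor = conn.cursor()
        try:
            cursor.execute("SELECT 1")
            return cursor.fetchone() == (1,)
        finally:
            cursor.close()  # runs even if execute() raises
    finally:
        conn.close()  # runs even if cursor creation or the probe fails
```

Each resource is closed by the block that opened it, so no shared fallthrough path is needed after the branches.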
dale053	bd99a22661	fix: atomic chunk/token counter updates for documents and knowledge b… (#14867 ) ### What problem does this PR solve? Fixes #14866. Previously, `DocumentService.increment_chunk_num` and `decrement_chunk_num` updated the `Document` row and its parent `Knowledgebase` row in two separate, non-transactional statements. If the second update failed (DB error, connection drop, etc.) after the first one succeeded, the document and knowledge base chunk/token counters would drift apart and stay inconsistent. There was also a behavioral asymmetry between the two methods: - `increment_chunk_num` only logged a warning when the document row was missing and returned a value that callers usually treated as success. - `decrement_chunk_num` raised `LookupError` in the same situation. This PR makes the counter updates atomic and aligns the missing-document behavior between the two methods: - Wrap the `Document` and `Knowledgebase` updates in `increment_chunk_num` / `decrement_chunk_num` inside a `DB.atomic()` block so both succeed or both roll back together. - Raise `LookupError` from `increment_chunk_num` when the target document no longer exists, matching `decrement_chunk_num`. - Update `reset_document_for_reparse` in `document_api_service.py` to catch the new `LookupError` and return a proper "Document not found!" API error instead of propagating the exception. No schema changes, no API contract changes for the success path; only the failure mode for a missing document during reparse is now a clean error response instead of an uncaught exception. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-14 14:48:52 +08:00
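The all-or-nothing update described in the commit above can be illustrated with the standard library's `sqlite3`, whose connection context manager commits on success and rolls back when an exception escapes. This is a sketch only: the actual change uses `DB.atomic()`, the two-table schema here is invented for the example, and raising when the knowledge base row is missing is an assumption added to demonstrate the rollback.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with one document and one knowledge base."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, chunk_num INTEGER, token_num INTEGER)")
    conn.execute("CREATE TABLE knowledgebase (id TEXT PRIMARY KEY, chunk_num INTEGER, token_num INTEGER)")
    conn.execute("INSERT INTO document VALUES ('d1', 0, 0)")
    conn.execute("INSERT INTO knowledgebase VALUES ('kb1', 0, 0)")
    conn.commit()
    return conn


def increment_chunk_num(conn, doc_id, kb_id, chunk_num, token_num) -> None:
    """Update both counters in one transaction; neither changes if either step fails."""
    sql = "UPDATE {} SET chunk_num = chunk_num + ?, token_num = token_num + ? WHERE id = ?"
    with conn:  # commit on success, roll back if an exception escapes
        cur = conn.execute(sql.format("document"), (chunk_num, token_num, doc_id))
        if cur.rowcount == 0:
            raise LookupError(f"Document {doc_id} not found")
        cur = conn.execute(sql.format("knowledgebase"), (chunk_num, token_num, kb_id))
        if cur.rowcount == 0:  # assumption: treated as an error in this sketch
            raise LookupError(f"Knowledge base {kb_id} not found")
```

If the second update fails, the first is rolled back, so the document and knowledge base counters cannot drift apart.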
Ethan T.	ba8cb9dd4a	fix: replace mutable default arguments with None in LLM chat models (#13513 ) ## Summary - Replace `gen_conf={}` with `gen_conf=None` + guard in `rag/llm/chat_model.py` (12 instances across Base, BaiChuanChat, LocalLLM, MistralChat, ReplicateChat, BaiduYiyanChat, GoogleChat classes) - Replace `doc_ids=[]` with `doc_ids=None` + guard in `api/db/services/document_service.py` (1 instance) - Mutable default arguments are shared across all calls, causing potential cross-request state contamination - See Python docs: https://docs.python.org/3/faq/programming.html#why-are-default-values-shared-between-objects ## Test plan - [x] Verify LLM calls work with and without explicit gen_conf - [x] No behavior change for existing callers — `None` is replaced with `{}` at function entry 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-14 14:46:47 +08:00
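The cross-call contamination that the commit above removes is easy to reproduce. In this sketch, `chat_shared` and `chat_fixed` are hypothetical functions written for illustration; they are not methods from `rag/llm/chat_model.py`.

```python
def chat_shared(message, gen_conf={}):
    """Buggy: the default dict is created once and reused by every call."""
    gen_conf.setdefault("history", []).append(message)
    return gen_conf


def chat_fixed(message, gen_conf=None):
    """Fixed: a fresh dict is created per call when none is supplied."""
    if gen_conf is None:
        gen_conf = {}
    gen_conf.setdefault("history", []).append(message)
    return gen_conf
```

Calling `chat_shared` twice without `gen_conf` leaves both messages in the same dict, whereas each `chat_fixed` call starts from an empty one.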
dale053	714f777fa0	Fix: missing authentication on agent file upload and download endpoints (#14854 ) ### What problem does this PR solve? Closes #14853 The `/agents/download` and `/agents/<agent_id>/upload` endpoints in the agent API are missing `@login_required` and `@add_tenant_id_to_kwargs` decorators, allowing unauthenticated access. This is a security issue — any user can upload files to or download files from an agent without being logged in. Additionally, the upload endpoint bypasses canvas access control (`@_require_canvas_access_async`). This PR adds the missing authentication and authorization decorators to both endpoints and replaces the manual `user_id` / `created_by` lookups with the `tenant_id` provided by the auth middleware, making these endpoints consistent with the rest of the agent API. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-14 13:48:41 +08:00
Ricardo-M-L	48b4aa3e93	Fix WebDriver resource leak in HTML-to-PDF conversion (#14310 ) ### What problem does this PR solve? In `api/utils/web_utils.py`, `__get_pdf_from_html()` creates a Chrome WebDriver but only calls `driver.quit()` inside the `TimeoutException` handler. If the page element becomes stale before the timeout (no exception raised), the WebDriver is never quit, leaking the Chrome browser process and returning `None`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Changes - Move the PDF printing logic and `driver.quit()` outside the `except` block so they execute on all code paths - Use `try/finally` to ensure `driver.quit()` is always called, even if the `Page.printToPDF` DevTools call fails Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-14 13:28:58 +08:00
Br1an	d46bbd30f7	Fix: send input and output token usage to Langfuse (#13294 ) ### What problem does this PR solve? Closes #9837 The Langfuse integration currently only sends the output text to `langfuse_generation.update()` without including token usage information. This means Langfuse cannot track input/output token consumption for cost analysis and monitoring. ### Solution Add the `usage` parameter to `langfuse_generation.update()` with: - `input`: approximate input token count from `message_fit_in()` - `output`: approximate output token count from `num_tokens_from_string(answer)` - `total`: sum of input and output ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-14 13:11:37 +08:00
buua436	b89878c593	Fix: dataset document download route (#14910 ) ### What problem does this PR solve? dataset document download route ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-14 10:59:06 +08:00
plind	dd76653dc1	feat: add tag management for Agents with filtering and sorting (#14774 ) (#14799 ) ## Summary Closes #14774. Adds free-form tags on agents (UserCanvas) with full UI + API: - Stored as comma-separated `tags` column on `UserCanvas` with online migration. - New endpoints: `GET /v1/agents/tags` (aggregate counts) and `PUT /v1/agent/<id>/tags` (write). `GET /v1/agents` accepts a `tags=` query. - "Edit tags" item in agent dropdown opens a chip-style editor dialog; tags render as badges on each agent card. - New "Tags" facet in the agents filter bar, with counts. ## Implementation notes - Tag matching is exact-token: the SQL filter wraps stored tags as `,…,` and matches `,ml,` so `ml` doesn't match `ml-ops`. - Server-side normalization in `UserCanvasService.update_tags`: dedup (case-insensitive), per-tag cap of 64 chars, total length capped at 512 chars to fit the column, commas inside tag values are replaced with spaces. - Tenant authorization: `PUT /v1/agent/<id>/tags` gates on `UserCanvasService.accessible(canvas_id, tenant_id)`. - Tag listing scope: `UserCanvasService.list_tags` follows the same own + team-shared rule as `get_by_tenant_ids`. - i18n: keys added to `en.ts` and `zh.ts` only (per project convention; other locales fall back). - `HomeCard` gets a non-breaking `extra?: ReactNode` slot for the chip row; no `src/components/ui/` files modified. ## Test plan - [ ] Backend boot runs `migrate_db` → confirm `user_canvas.tags` column exists (`DESCRIBE user_canvas`). - [ ] Agents page renders cards normally (no console error from missing field). - [ ] `⋯ → Edit tags` opens a dialog that stays open (regression: dialog was unmounting with the dropdown). - [ ] Typing a tag without pressing Enter and clicking Save persists it (regression: last typed tag was being dropped). - [ ] Chip input supports Enter/comma to commit, Backspace on empty to remove, `×` to remove individual chip. 
- [ ] Tag containing a comma sent via API is stored with the comma replaced by a space. - [ ] 20 long tags sent via API does not error (length cap silently truncates). - [ ] "Tags" filter in the filter bar shows counts and narrows the list. - [ ] Filtering by `ml` does not return agents tagged `ml-ops`. - [ ] UI in Chinese shows 编辑标签 / 添加标签以整理和筛选你的智能体 etc. - [ ] `PUT /v1/agent/<other-tenant-id>/tags` returns `Agent not found or no permission.`	2026-05-13 21:41:32 +08:00
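The normalization and exact-token matching rules from the implementation notes above can be sketched in plain Python. This is an illustration under assumptions: the functions `normalize_tags` and `has_tag` are hypothetical, dropping whole tags once the 512-character budget is reached is one possible reading of "length cap silently truncates", and the real matching happens in SQL rather than in Python.

```python
MAX_TAG_LEN = 64     # per-tag cap from the notes above
MAX_TOTAL_LEN = 512  # total stored length cap from the notes above


def normalize_tags(tags) -> str:
    """Clean, de-duplicate (case-insensitively), and join tags with commas."""
    seen, kept, total = set(), [], 0
    for raw in tags:
        tag = raw.replace(",", " ").strip()[:MAX_TAG_LEN]
        key = tag.lower()
        if not tag or key in seen:
            continue
        cost = len(tag) + (1 if kept else 0)  # +1 for the separating comma
        if total + cost > MAX_TOTAL_LEN:
            break  # assumption: stop adding tags once the column would overflow
        seen.add(key)
        kept.append(tag)
        total += cost
    return ",".join(kept)


def has_tag(stored: str, tag: str) -> bool:
    """Exact-token match: wrap both sides in commas so 'ml' does not match 'ml-ops'."""
    return f",{tag}," in f",{stored},"
```

Wrapping the stored value in commas turns every tag, including the first and last, into a `,tag,` token, which is why a plain substring test becomes an exact match.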
Ethan T.	8c5845f6ca	fix: use context manager for pdfplumber to prevent resource leak (#13512 ) ## Summary - Convert `pdfplumber.open()` to use `with` context manager in `api/utils/file_utils.py` (`thumbnail_img` function) - If any exception occurs between `open()` and `close()`, the PDF file handle leaks - The rest of the codebase (e.g. `read_potential_broken_pdf` in the same file) already uses `with pdfplumber.open(...)` correctly ## Test plan - [x] PDF thumbnail generation works correctly with context manager - [x] Resources properly cleaned up on exceptions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-13 21:09:51 +08:00
Ahmad Intisar	e994051eb9	Feature/generic api connector (#13545 ) # feat: Add Generic REST API Connector ## What problem does this PR solve? RAGFlow supports many specific data source connectors (MySQL, Slack, Google Drive, etc.), but there was no way to connect an arbitrary REST API as a data source. Users with custom or third-party APIs had to write a new connector class for each one. This PR adds a generic, configuration-driven REST API connector that lets users connect any REST API as a data source entirely through the UI — no code changes needed per API. --- ## Features ### Core Connector (`common/data_source/rest_api_connector.py`) - Implements `LoadConnector` and `PollConnector` interfaces for full and incremental sync - Configurable authentication: None, API Key (custom header), Bearer Token, Basic Auth - Pluggable pagination: Page-based, Offset-based, Cursor-based, or None - Smart page-size inference from user's query parameters to avoid duplicate/conflicting params - Configurable request delay between pages to prevent API rate limiting - Auto-detection of the items array in JSON responses (`items`, `results`, `data`, `records`, or first list found) - Advanced field mapping with dot-notation (`country.name`), array wildcards (`newsType[].name`), type hints, and default values - Optional content template rendering (`"Title: {title}\nBody: {body}"`) - HTML stripping for content fields - Stable document IDs via `hash128` from a configurable ID field or auto-generated from item content - Pydantic configuration schema with automatic coercion of UI string inputs to dicts/lists ### Backend Registration (`rag/svr/sync_data_source.py`, `common/constants.py`, `common/data_source/config.py`) - `REST_API` sync class wired into RAGFlow's `func_factory` - Full sync (`load_from_state`) and incremental polling (`poll_source`) support - Credentials and config passed from task to connector following existing patterns (MySQL, SeaFile, etc.) 
### Test Connection Endpoint (`api/apps/connector_app.py`) - `POST /v1/connector/<id>/test` validates config schema, authentication, and API connectivity without triggering a sync - Clear error messages for auth failures vs. config issues ### Frontend UI (`web/src/pages/user-setting/data-source/constant/`) - Postman-style configuration: Base URL, Query Parameters (key=value per line), Auth, Content Fields, Metadata Fields, Pagination Type - Auth-type-aware form: fields for API key header/value, Bearer token, or Basic username/password appear only when relevant - Advanced Settings toggle for: Custom Headers, Max Pages, Request Delay, Poll Timestamp Field, Request Body (POST) - Connector icon (SVG) and i18n strings (English) - "Test Connection" button to validate before syncing --- ## Controls & Safety - Configurable max pages safety cap (default: 1000, adjustable in UI) - Configurable request delay between pages (default: 0.5s, adjustable in UI) - Auth errors (401/403) fail immediately without retries; transient errors retry with exponential backoff - Diagnostic logging: auth setup confirmation, request details on failure, content field extraction status --- ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Visual Screenshots of Features [Screenshots omitted: connector configuration within the external data sources tab; configuration parameters; connection test before attaching to a dataset; ingestion tested with the API connector; search and retrieval with metadata flow.] --------- Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-13 20:35:01 +08:00
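The dot-notation field mapping with array wildcards mentioned in the connector description above (`country.name`, `newsType[].name`) can be illustrated with a small resolver. This is a sketch under assumptions: `extract_field` is a hypothetical helper, and the connector's actual handling of type hints, missing wildcard matches, and defaults is not shown in the commit message.

```python
def extract_field(item: dict, path: str, default=None):
    """Resolve a path like 'a.b' or 'list[].name' against a nested JSON item."""
    values, saw_wildcard = [item], False
    for part in path.split("."):
        wildcard = part.endswith("[]")
        key = part[:-2] if wildcard else part
        next_values = []
        for value in values:
            if not isinstance(value, dict) or key not in value:
                continue  # missing keys simply contribute nothing
            child = value[key]
            if wildcard:
                if isinstance(child, list):
                    next_values.extend(child)  # fan out over array elements
            else:
                next_values.append(child)
        saw_wildcard = saw_wildcard or wildcard
        values = next_values
    if saw_wildcard:
        return values  # wildcard paths always yield a list (possibly empty)
    return values[0] if values else default
```

A path without `[]` returns a single value or the default, while a wildcard path collects one value per array element.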
jony376	7f699d1202	Fix: enforce tenant authorization for `tenant_rerank_id` in retrieval flows (#14782 ) ### Related issues Closes #14781 ### What problem does this PR solve? Some retrieval endpoints accepted caller-supplied `tenant_rerank_id` and resolved it through `get_model_config_by_id(...)`. That helper loaded `TenantLLM` rows by global database id and returned decoded model configuration without checking whether the model belonged to the authenticated tenant or the dataset owner tenant. This meant dataset access was validated, but rerank-model selection was not. A caller who knew or could guess another tenant's `tenant_rerank_id` could attempt retrieval with a foreign rerank model config, creating a cross-tenant authorization gap for model usage. This PR closes that gap by making `tenant_rerank_id` resolution tenant-aware across the retrieval paths that accept it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Solution - Extend `get_model_config_by_id(...)` to accept an optional `allowed_tenant_ids` set and reject `TenantLLM` rows whose `tenant_id` is outside that set. - Pass the allowed tenant scope from retrieval endpoints that accept `tenant_rerank_id`: - `api/apps/sdk/doc.py` - `api/apps/sdk/session.py` - `api/apps/services/dataset_api_service.py` - Use the authenticated tenant plus dataset-owner tenant ids already derived by each retrieval flow as the authorization boundary for rerank model selection. - Add focused unit coverage to assert unauthorized `tenant_rerank_id` values are rejected and that the allowed tenant set is propagated correctly. 
### Testing - `python -m py_compile` on: - `api/db/joint_services/tenant_model_service.py` - `api/apps/services/dataset_api_service.py` - `api/apps/sdk/doc.py` - `api/apps/sdk/session.py` - Added unit tests in: - `test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py` - `test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py` ### Notes for reviewers - This change is intentionally narrow: it affects only the `tenant_rerank_id` path, not the normal `rerank_id` name-based resolution path. - Local lint/syntax checks passed. - Full pytest execution could not be completed in this environment because the local test runtime is missing `strenum`, so the route-test files fail during collection before exercising the updated cases. --------- Co-authored-by: jony376 <jony376@gmail.com>	2026-05-13 19:53:08 +08:00
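The tenant-scoped lookup described in the commit above can be sketched as follows. This is a minimal illustration, not the RAGFlow implementation: `TenantModelRow`, the in-memory `ROWS` table, and the exception types are assumptions made for the example; only the function name `get_model_config_by_id` and the optional `allowed_tenant_ids` set come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TenantModelRow:
    """Hypothetical stand-in for a TenantLLM row."""
    id: int
    tenant_id: str
    model_name: str


# Toy in-memory table; the real code loads rows from the database.
ROWS = {
    1: TenantModelRow(1, "tenant-a", "rerank-small"),
    2: TenantModelRow(2, "tenant-b", "rerank-large"),
}


def get_model_config_by_id(
    model_id: int, allowed_tenant_ids: Optional[set[str]] = None
) -> TenantModelRow:
    row = ROWS.get(model_id)
    if row is None:
        raise LookupError(f"model {model_id} not found")
    # None keeps the legacy unscoped behaviour; a set enforces the boundary.
    if allowed_tenant_ids is not None and row.tenant_id not in allowed_tenant_ids:
        raise PermissionError(f"model {model_id} is outside the allowed tenant scope")
    return row
```

A retrieval endpoint would pass the authenticated tenant plus the dataset-owner tenant ids as `allowed_tenant_ids`, so a guessed id from another tenant fails before any model configuration is returned.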
Wang Qi	f3b3596c29	Speed up ragflow server (#14894 ) ### What problem does this PR solve? Speed up ragflow server ### Type of change - [ ] Refactoring	2026-05-13 18:01:33 +08:00
buua436	8cb2bf04fb	Fix: llm add api key overridden (#14885 ) ### What problem does this PR solve? llm add api key overridden ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 17:15:32 +08:00
Wang Qi	ff685d3131	Delete duplicate route (#14883 ) ### What problem does this PR solve? The delete `/graph` route duplicates `/datasets/<dataset_id>/<index_type>`, so this PR removes it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 15:57:44 +08:00
Wang Qi	45d676bc05	Fix delete graphrag not take effect in UI (#14879 ) ### What problem does this PR solve? Fix delete graphrag not take effect in UI ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 13:49:16 +08:00
Wang Qi	64bd0130d3	Add REST API backward compatibility (#14872 ) ### What problem does this PR solve? Add REST API backward compatibility ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 11:44:40 +08:00
dale053	5a5e766386	fix(api): authorize owner_ids for list chats and search apps (#14775 ) Closes #14768 ### What problem does this PR solve? The `list_chats` and `list_searches` REST API endpoints did not enforce authorization on the `owner_ids` query parameter. Any authenticated user could pass arbitrary tenant IDs to `owner_ids` and retrieve chats or search apps belonging to other tenants they are not a member of. This PR resolves the issue by: 1. Looking up the current user's authorized tenants via `TenantService.get_joined_tenants_by_user_id` and rejecting any `owner_ids` that fall outside that set. 2. When no `owner_ids` are provided, scoping the query to only the user's authorized tenants instead of returning an unfiltered result. 3. Adding unit tests that verify unauthorized `owner_ids` are rejected with `OPERATING_ERROR`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 09:43:44 +08:00
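The two authorization rules in the commit above (reject foreign `owner_ids`, and default to the caller's own tenants when none are given) can be sketched like this. The function name `resolve_owner_ids` and the use of `PermissionError` are assumptions for illustration; the actual endpoints return an `OPERATING_ERROR` result instead of raising.

```python
def resolve_owner_ids(requested, joined_tenant_ids):
    """Return the tenant ids a listing query may be scoped to.

    requested: owner_ids from the query string (may be empty or None).
    joined_tenant_ids: tenants the current user belongs to.
    """
    authorized = set(joined_tenant_ids)
    if not requested:
        # No filter supplied: scope to the user's own tenants, never "all".
        return sorted(authorized)
    unauthorized = [oid for oid in requested if oid not in authorized]
    if unauthorized:
        raise PermissionError(f"not authorized for owner_ids: {unauthorized}")
    return list(requested)
```

The important detail is the empty case: without it, omitting `owner_ids` would fall through to an unfiltered query.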
CaptainTimon	2717ee283f	feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679 ) ### What problem does this PR solve? Closes #14674. This PR improves RAPTOR configuration and tree construction while preserving the existing RAPTOR behavior as the default. RAPTOR currently builds summary layers with the original UMAP + GMM clustering path. This PR keeps that default path, and adds: - A hidden backend tree-builder option: - `tree_builder="raptor"`: default, existing RAPTOR behavior. - `tree_builder="psi"`: rank-aware Psi-style tree builder using original embedding-space cosine ranking. - A user-facing clustering method option for the default RAPTOR builder: - `clustering_method="gmm"`: existing default. - `clustering_method="ahc"`: agglomerative hierarchical clustering path. - A RAPTOR UI setting for `Clustering method` and `Max cluster`. ### What changed #### Backend - Added `tree_builder` support for RAPTOR/Psi. - Added `clustering_method` support for GMM/AHC. - Kept existing RAPTOR + GMM as the default. - Added Psi tree building from original-space cosine similarity. - Added bucketed Psi building controls for large inputs: - `raptor.ext.psi_exact_max_leaves` - `raptor.ext.psi_bucket_size` - Added method-aware RAPTOR summary metadata using existing `extra.raptor_method`. - Avoided adding a dedicated DB schema field for experimental method tracking. - Added cleanup/migration logic to avoid mixing stale RAPTOR summary trees. - Added defensive checks for Psi tree construction and summary failures. #### Frontend/UI - Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`. - Added/kept `Max cluster` in RAPTOR settings. - Enlarged max cluster UI limit to `1024`, matching backend validation. - Kept AHC editable even when a RAPTOR task has already finished. 
- Fixed the UI save payload so `clustering_method` and `tree_builder` are serialized through `parser_config.raptor.ext`, avoiding backend validation errors for extra top-level RAPTOR fields.

Example saved RAPTOR config:

```json
{ "raptor": { "max_cluster": 317, "ext": { "clustering_method": "ahc", "tree_builder": "raptor" } } }
```

Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>	2026-05-12 09:42:31 +08:00
黄圣祺	415169d497	fix(dify): add GET method support to /dify/retrieval for health check (#13837 ) ## Summary - Add GET method handler to `/api/v1/dify/retrieval` endpoint for Dify external knowledge base connectivity verification - GET requests return a simple success response; POST requests retain existing retrieval logic unchanged ## Problem When Dify integrates with RAGFlow as an external knowledge base, it sends periodic GET requests to the retrieval endpoint for health/connectivity checks. The endpoint only accepted POST, causing werkzeug to return `405 Method Not Allowed`. After several successful POST retrievals, the failing GET health checks trigger Dify's circuit breaker, causing all subsequent requests to fail. Traceback from the issue: ``` werkzeug.exceptions.MethodNotAllowed: 405 Method Not Allowed: The method is not allowed for the requested URL. ``` ## Changes - `api/apps/sdk/dify_retrieval.py`: Added a separate GET route handler (`retrieval_health_check`) that returns `get_json_result(data=True)` ## Test plan - [ ] Verify `GET /api/v1/dify/retrieval` returns `{"code": 0, "message": "success", "data": true}` - [ ] Verify `POST /api/v1/dify/retrieval` with valid API key and body still works as before - [ ] Verify Dify external knowledge base integration no longer returns 405 errors Closes #13788 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-12 09:37:07 +08:00
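The method split described in the commit above can be sketched without a web framework. The dispatcher `dify_retrieval` and the `retrieve` callback are hypothetical names for this example; the real change registers a separate GET route handler in `api/apps/sdk/dify_retrieval.py`. The GET payload mirrors the response listed in the test plan.

```python
def dify_retrieval(method, payload=None, retrieve=None):
    """Return (status, body) for a request to the retrieval endpoint."""
    if method == "GET":
        # Connectivity probe: answer cheaply, with no auth or retrieval work.
        return 200, {"code": 0, "message": "success", "data": True}
    if method == "POST":
        # Existing retrieval logic, passed in here as a callback.
        return 200, retrieve(payload or {})
    return 405, {"message": "Method Not Allowed"}
```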
tmimmanuel	663fc1d42c	fix(opensearch): implement doc-meta dispatch surface on OSConnection (#14577 ) ### What problem does this PR solve? Fixes #14570. On OpenSearch backends (`DOC_ENGINE=opensearch`) every document-metadata write failed with `'OSConnection' object has no attribute 'create_doc_meta_idx'`, so both `PATCH /api/v1/datasets/{ds}/documents/{doc}` with `meta_fields` and `POST /api/v1/datasets/{ds}/metadata/update` were unusable while every other document operation (retrieval, parsing, name update, chunk management) worked correctly on the same OpenSearch cluster. The bug runs deeper than the missing method name in the error message suggests. `DocMetadataService` also reached into `settings.docStoreConn.es.*` directly for the index refresh, the scripted partial update, and the count call, which means that even after adding `create_doc_meta_idx` to `OSConnection` the very next call in the same metadata flow would still raise `AttributeError` because `OSConnection` exposes `self.os` rather than `self.es`. Fixing only the reported symptom would have moved the failure one line down without restoring the feature. This PR adds a uniform document-metadata dispatch surface to both connection classes so they present the same abstract API, and routes the service layer through that surface via `getattr` guards instead of poking at backend-specific attributes. The four new methods on `OSConnection` and `ESConnectionBase` are `create_doc_meta_idx`, `refresh_idx`, `count_idx`, and `replace_meta_fields`. `OSConnection.create_doc_meta_idx` reuses the existing `conf/doc_meta_es_mapping.json` schema in the OpenSearch `body=` form because OpenSearch and Elasticsearch share the same index-creation payload, and `replace_meta_fields` emits a full scripted assignment (`ctx._source.meta_fields = params.meta_fields`) on both backends so removed keys actually disappear instead of being preserved by deep-merge semantics. 
The `getattr`-guarded dispatch in `DocMetadataService` keeps the existing fall-through paths intact for Infinity and OceanBase, which continue to rely on their search-based count fallback and on the delete-then-insert metadata replacement they used before, so this change is strictly additive for those two backends. Verification: `pytest test/unit_test/rag/utils/test_opensearch_doc_meta.py` runs 16 new unit tests that pass locally and pin the `OSConnection` dispatch surface, the `create_doc_meta_idx` short-circuit when the index already exists, the mapping-file payload routing, the `IndicesClient.create` failure path, the `refresh_idx` and `count_idx` success and error sentinels, and the full-assignment script emitted by `replace_meta_fields`. The test module stubs `common.settings` and `rag.nlp` at import time so the suite runs without the heavy backend SDKs that the rest of the repository pulls in transitively. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>	2026-05-11 17:04:28 +08:00
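Why `replace_meta_fields` assigns the whole object instead of merging can be seen with plain dictionaries. This is a sketch of the semantics only: both helpers below operate on in-memory dicts and are assumptions for illustration, not the OpenSearch or Elasticsearch client calls.

```python
import copy


def deep_merge(stored, update):
    """Partial-update semantics: keys absent from `update` survive."""
    merged = copy.deepcopy(stored)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def replace_meta_fields(stored, meta_fields):
    """Full-assignment semantics, like ctx._source.meta_fields = params.meta_fields."""
    updated = copy.deepcopy(stored)
    updated["meta_fields"] = copy.deepcopy(meta_fields)
    return updated


doc = {"id": "doc-1", "meta_fields": {"author": "alice", "year": 2024}}
new_meta = {"author": "alice"}  # the user removed "year"

merged = deep_merge(doc, {"meta_fields": new_meta})
replaced = replace_meta_fields(doc, new_meta)
print(merged["meta_fields"])    # "year" is still present
print(replaced["meta_fields"])  # "year" is gone
```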
box4wangjing	292b0b8bce	chore: fix some comments to improve readability (#14756 ) ### What problem does this PR solve? fix some comments to improve readability ### Type of change - [x] Documentation Update --------- Signed-off-by: box4wangjing <box4wangjing@outlook.com>	2026-05-11 16:48:48 +08:00
Sank	592dba1489	Refact: Added a private helper _visibility_and_status_filter (#13627 ) ### What problem does this PR solve? Added a private helper _visibility_and_status_filter(joined_tenant_ids, user_id) that returns the Peewee condition: visible to user (team or own) and status is VALID. ### Type of change - [x] Refactoring --------- Co-authored-by: Serobabov Aleksandr <40SerobabovAS@region.cbr.ru> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-11 15:21:41 +08:00
tmimmanuel	6ce014c23b	fix: offload blocking DB/Redis calls to thread pool for high-concurrency support (#13825 ) (#13941 ) ### What problem does this PR solve? Addresses event-loop blocking under high concurrency reported in #13825. When multiple requests hit the API simultaneously, synchronous DB/Redis calls block the async event loop, preventing Quart from handling other requests and causing cascading 502/504 timeouts. This PR wraps all remaining blocking DB/Redis calls in `canvas_app.py`, `chat_api.py`, `session.py`, and `canvas_service.py` with `await thread_pool_exec()` - Offload all synchronous `Service.`, `REDIS_CONN.`, and `APIToken.query` calls to the thread pool - Convert sync endpoint handlers (`list_chats`, `get_chat`, `templates`, `sessions`, etc.) to `async def` - Convert sync helper functions (`_ensure_owned_chat`, `_validate_llm_id`, `_validate_dataset_ids`, etc.) to async - no duplicate sync/async pairs - Wrap `CanvasReplicaService` Redis IO calls (`bootstrap`, `replace_for_set`, `commit_after_run`) - Use `asyncio.gather()` for concurrent file uploads and chat response building Note: This fixes the code-level event-loop blocking, which is a prerequisite for handling concurrent requests. For the full "30 concurrent requests without 502/504" goal described in the issue, users should also tune deployment config: - `WS=4` or higher (HTTP worker processes, default 1) - `MAX_CONCURRENT_CHATS=50` (default 10) - `SANDBOX_EXECUTOR_MANAGER_POOL_SIZE` for workflow-heavy workloads ### Performance verification Reviewer asked for a before-vs-after comparison ([comment](https://github.com/infiniflow/ragflow/pull/13941#issuecomment-4393667231)). 
I built a self-contained microbenchmark that reproduces the exact failure mode this PR targets: an async handler that performs blocking DB/Redis-style calls (50 ms each, 3 per request, 30 concurrent requests) is run twice, once with the pre-PR pattern (sync call directly inside the async handler) and once with the post-PR pattern (`await thread_pool_exec(...)`). The benchmark imports nothing from RAGFlow except `thread_pool_exec` itself, so it is hermetic and reproducible (`THREAD_POOL_MAX_WORKERS=128`, Python 3.13.12).

Throughput: wall-clock for 30 concurrent requests (lower is better)

| flavour | wall(s) | p50(s) | p95(s) | max(s) |
|---|---:|---:|---:|---:|
| before | 4.986 | 0.158 | 0.207 | 0.269 |
| after | 0.248 | 0.181 | 0.230 | 0.231 |

The pre-PR handler serializes the entire load on the event-loop thread, so 30 × 3 × 50 ms ≈ 4.5 s shows up as the wall time. The post-PR handler parallelizes the blocking work across the thread pool and finishes the same load in 248 ms, a ~20× speedup on this workload.

Event-loop responsiveness: latency of an unrelated probe coroutine while the 30 slow requests are running (lower is better)

| flavour | samples | probe p50 (ms) | probe p95 (ms) | probe max (ms) |
|---|---:|---:|---:|---:|
| before | 1 | 5442.26 | 5442.26 | 5442.26 |
| after | 28 | 0.88 | 11.53 | 98.02 |

This is the metric that maps directly to "the API still answers other requests while one is busy". A 5 ms-interval probe was scheduled while the 30 slow handlers ran. With the pre-PR code the event loop was frozen for the entire duration of the blocking work, so only one probe sample was ever picked up and it waited 5,442 ms. After the PR, 28 probe samples landed with p50 0.88 ms / p95 11.53 ms, meaning unrelated requests are no longer starved by the slow ones. That is the regression mode behind the cascading 502/504s reported in #13825. 
<details> <summary>Raw benchmark output</summary> ``` config: 30 concurrent requests, 3 blocking calls of 50ms each per request, THREAD_POOL_MAX_WORKERS=128 === Throughput (lower wall is better) === flavour wall(s) p50(s) p95(s) max(s) before 4.986 0.158 0.207 0.269 after 0.248 0.181 0.230 0.231 === Event-loop responsiveness (lower probe latency is better) === flavour samples probe p50(ms) probe p95(ms) probe max(ms) before 1 5442.26 5442.26 5442.26 after 28 0.88 11.53 98.02 ``` </details> The benchmark script is included as a comment on the PR for reproducibility. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement Closes [#13825](https://github.com/infiniflow/ragflow/issues/13825) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 15:08:55 +08:00
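The before/after pattern measured in the commit above can be reproduced in miniature. This sketch is not the benchmark script from the PR: it uses the standard library's `asyncio.to_thread` as a stand-in for RAGFlow's `thread_pool_exec`, and the handler names, delay, and request count are assumptions chosen to keep the run short.

```python
import asyncio
import time


def blocking_call(delay=0.05):
    """Stand-in for a synchronous DB or Redis call."""
    time.sleep(delay)
    return "row"


async def handler_blocking():
    # Pre-fix pattern: the sync call runs on the event-loop thread.
    return blocking_call()


async def handler_offloaded():
    # Post-fix pattern: the sync call runs in a worker thread.
    return await asyncio.to_thread(blocking_call)


async def timed(handler, n=8):
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start


async def main():
    before = await timed(handler_blocking)
    after = await timed(handler_offloaded)
    return before, after


before, after = asyncio.run(main())
print(f"before={before:.3f}s after={after:.3f}s")
```

With the blocking handler the eight calls run back to back, so the wall time grows with the request count; with the offloaded handler they overlap, bounded by the size of the default thread pool.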
Paul Y Hui	a0efc453f3	Fix: safe argument guard and remove redundant redis call (#14060 ) ### What problem does this PR solve? - Moved the `if not all([email, new_pwd, new_pwd2])` guard to the top, before any decryption that could crash on a `None` value - Removed the redundant `REDIS_CONN.get()` call; one call is sufficient ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-05-11 15:02:24 +08:00
Ricardo-M-L	5ef7f50eef	fix: use context manager for ThreadPoolExecutor in file_service.py (#14144 ) ## Summary - Wrap 2 `ThreadPoolExecutor` instances in `file_service.py` with `with` statement - Ensures threads are properly shut down after all futures complete ## Problem `parse_docs()` (line 532) and the file processing method (line 694) create `ThreadPoolExecutor` instances that are never shut down. In a long-running server process, this leaks thread resources on every invocation — threads remain alive consuming memory even after all submitted work is complete. ## Fix Replace bare `ThreadPoolExecutor()` with `with ThreadPoolExecutor() as exe:` context manager, which calls `executor.shutdown(wait=True)` on exit. ## Test plan - [x] Verified both call sites use `with` statement after fix - [x] No remaining bare `ThreadPoolExecutor` in `file_service.py` - [x] `document_service.py:1066` is a module-level executor (different pattern, not changed in this PR) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 14:02:45 +08:00
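The context-manager fix in the commit above looks roughly like this. The `parse` worker and the `parse-docs` thread name prefix are assumptions for the example; the point is only that leaving the `with` block calls `shutdown(wait=True)`, so no worker threads outlive the call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parse(name):
    """Hypothetical per-file work."""
    return name.upper()


def parse_docs(names):
    # Leaving the with-block shuts the pool down and joins its threads.
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="parse-docs") as exe:
        futures = [exe.submit(parse, n) for n in names]
        return [f.result() for f in futures]


results = parse_docs(["a.txt", "b.txt"])
leftover = [t.name for t in threading.enumerate() if t.name.startswith("parse-docs")]
print(results, leftover)
```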
buua436	a03b95f8c4	Fix: shared dataset chunk index lookup (#14764 ) ### What problem does this PR solve? shared dataset chunk index lookup ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:50:08 +08:00
buua436	024c8cb0b5	Fix: dataset search rerank id type (#14759 ) ### What problem does this PR solve? issue: https://github.com/infiniflow/ragflow/issues/14748 change: dataset search rerank id type ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:48:05 +08:00
jony376	46897d6fa4	Fix: bind memory message `user_id` to authenticated user for JWT auth (#14745 ) ### Related issues Closes #14744 ### What problem does this PR solve? The Memory REST endpoint `POST /api/v1/messages` previously persisted whatever `user_id` the client sent in the JSON body. Memory rows were therefore attributed to an arbitrary string, even when the caller authenticated as a normal workspace user via JWT (browser/session-style bearer token decoded into an access token). That broke attribution and audit semantics for shared memories (team visibility): any authorized writer could spoof another subject id. The Python SDK already sends an optional `user_id` for integrations using API keys (`APIToken`) to tag an external subject distinct from the tenant owner user. ### Solution - Record `g.auth_via_api_token` in `_load_user` (`api/apps/__init__.py`): set `True` only when authentication resolves via `APIToken`, otherwise `False` after JWT-based login succeeds. - In `POST /messages` (`memory_api.add_message`): if the request was authenticated with an API key, keep accepting optional `user_id` from the body (default empty string). For JWT-authenticated users, always set stored `user_id` to `current_user.id` and ignore the client field. - Guard reads of `g` with `RuntimeError` handling so isolated imports or tests without a Quart application context do not fail when resolving `user_id`. - Document on `RAGFlow.add_message` that `user_id` is only meaningful for API-key authentication. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - `python -m py_compile` on modified modules (`api/apps/__init__.py`, `api/apps/restful_apis/memory_api.py`). 
- Recommended: run web/SDK memory message tests (`test_add_message`, `test_message_routes_unit`) against a full environment with `quart` and configured services. ### Notes for reviewers - Behavior change only for callers using JWT-style authorization on `POST /messages`; API-key callers keep prior optional `user_id` semantics. Co-authored-by: jony376 <jony376@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-11 13:26:05 +08:00
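The attribution rule from the commit above reduces to one small decision. The helper name `resolve_message_user_id` and its parameters are hypothetical; in the PR the flag is stored as `g.auth_via_api_token` and the JWT identity comes from `current_user.id`.

```python
def resolve_message_user_id(auth_via_api_token, body, current_user_id):
    """Pick the user_id to store with a memory message."""
    if auth_via_api_token:
        # API-key integrations may tag an external subject; default is "".
        return body.get("user_id", "")
    # JWT callers are always attributed to themselves; the body field is ignored.
    return current_user_id
```

Under this rule a browser user who posts `{"user_id": "someone-else"}` is still recorded under their own id.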
Achieve3318	16354f4e14	fix(dify): guard retrieval argument error behavior (#14169 ) ## What problem does this PR solve? The Dify-compatible `/dify/retrieval` endpoint recently gained stricter parsing and validation for its request payload, including: - Normalized `retrieval_setting.top_k` and `retrieval_setting.score_threshold` types. - Clear separation between malformed arguments vs missing required fields. Previously, there was no unit test explicitly guarding the exact error code and message contract for these cases. ## What does this PR change? - Add guard-style unit test in `test_dify_retrieval_routes_unit.py`: - `test_retrieval_argument_error_messages`: - Sends a request with malformed numeric options: - `retrieval_setting = {"top_k": "not-int", "score_threshold": "not-float"}` - Asserts `code == RetCode.ARGUMENT_ERROR` and message contains `"invalid or malformed arguments:"`. - Sends a request with required fields missing: - Empty payload (`{}`) - Asserts `code == RetCode.ARGUMENT_ERROR` and message contains `"required arguments are missing:"`. This test encodes the intended behavior of the Dify retrieval API so future refactors cannot silently regress error handling. ## Type of change - [x] Tests (add coverage and guardrails for existing behavior) Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 13:17:42 +08:00
Wang Qi	3838770e7a	GraphRAG feature - Part 1 - add spacy to extract entity and relation (#14670 ) ### What problem does this PR solve? GraphRAG feature - Part 1 - add spacy to extract entity and relation <img width="1621" height="1288" alt="image" src="https://github.com/user-attachments/assets/aadeddad-94da-46c6-adad-9c3784181f61" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 12:59:59 +08:00
web-dev0521	cc207b5b05	Refactor: tidy up ThreadPoolExecutor lifecycle in file_service and task executor (#14668 ) ## Summary - Wrap the `ThreadPoolExecutor` instances in `FileService.parse_docs` and `FileService.get_files` with `with ... as exe:` blocks for deterministic cleanup - Replace the `concurrent.futures.ThreadPoolExecutor` in `do_handle_task` with `asyncio.create_task(asyncio.to_thread(build_TOC, ...))`, preserving the existing parallelism with chunk insertion while leveraging the surrounding async context - Drop the now-unused `import concurrent` and the `executor.shutdown(wait=False)` call in the `finally` block Closes #14622. No behavioral change, no public API change. Net diff: ~19 insertions / 25 deletions across two files. ## Test plan - [ ] `uv run ruff check api/db/services/file_service.py rag/svr/task_executor.py` passes - [ ] Upload a multi-file batch through the chat/file endpoint and confirm `FileService.parse_docs` still returns combined parsed text - [ ] Trigger `FileService.get_files` via the chat reference flow with a mix of image and non-image files; verify both `raw=True` and `raw=False` paths return correctly - [ ] Run a `naive`-parser document task with `toc_extraction: true` and confirm the TOC chunk is generated and inserted exactly as before - [ ] Run a `naive`-parser document task with `toc_extraction: false` and confirm the path with `toc_thread = None` is unaffected - [ ] Cancel a running task to exercise the `finally` block and confirm cleanup still works without the executor shutdown call --------- Co-authored-by: web-dev0521 <jasonpette1783@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-05-11 12:59:00 +08:00
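The `asyncio.create_task(asyncio.to_thread(...))` pattern named in the commit above can be sketched as follows. `build_toc`, `insert_chunks`, and `handle_task` here are toy stand-ins written for this example, not the functions in `rag/svr/task_executor.py`; the sleeps only simulate work.

```python
import asyncio
import time


def build_toc(chunks):
    """Toy synchronous TOC builder: takes the text before the first colon."""
    time.sleep(0.05)
    return [c.split(":")[0] for c in chunks]


async def insert_chunks(chunks):
    """Toy async chunk insertion."""
    await asyncio.sleep(0.05)
    return len(chunks)


async def handle_task(chunks, toc_extraction):
    toc_task = None
    if toc_extraction:
        # Start the TOC build in a worker thread without awaiting it yet.
        toc_task = asyncio.create_task(asyncio.to_thread(build_toc, chunks))
    inserted = await insert_chunks(chunks)  # overlaps with the TOC build
    toc = await toc_task if toc_task is not None else None
    return inserted, toc


print(asyncio.run(handle_task(["Intro: a", "Setup: b"], True)))
```

Because the task lives on the event loop, no separate executor has to be created or shut down by the caller.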
Ahmad Intisar	3c4d1da98f	Feature/table parser column roles (#13710 ) ### What problem does this PR solve? The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes. For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value. The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion. This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns. Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 10:06:04 +08:00
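The column-role idea in the commit above can be illustrated with a single row. The role names (`vectorize`, `metadata`, `both`) and the default of `both` come from the commit message; the function `build_chunk`, the shape of the roles mapping, and the `"col: value"` text format are assumptions made for this sketch.

```python
def build_chunk(row, column_roles=None):
    """Split one table row into embedding text and filterable metadata."""
    roles = column_roles or {}
    text_parts = []
    metadata = {}
    for column, value in row.items():
        role = roles.get(column, "both")  # unlisted columns keep the old behaviour
        if role in ("vectorize", "both"):
            text_parts.append(f"{column}: {value}")
        if role in ("metadata", "both"):
            metadata[column] = value
    return "; ".join(text_parts), metadata


row = {"title": "Floods", "content": "Heavy rain", "country": "Brazil"}
roles = {"title": "vectorize", "content": "vectorize", "country": "metadata"}
print(build_chunk(row, roles))
```

With these roles `country` no longer appears in the text that gets embedded, but it remains available as a filter.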
Mehmet Karakose	7ec87f7cb7	fix(auth): fall back to session-based auth in _load_user (#14569 ) ## Summary Closes #13663. OAuth / OIDC callbacks call `login_user(user)` which writes `_user_id` into the session cookie, but `_load_user()` in `api/apps/__init__.py` only ever looked at the `Authorization` header. The SPA's response interceptor wipes the Authorization value from `localStorage` on the first 401 it sees — meaning that during the post-redirect window after an OAuth login, a single transient 401 sends every subsequent request back to the login page even though `login_user()` had already established a perfectly good server-side session. The reporter's analysis traces this all the way through the redirect → `navigate('/')` → first request → empty header → 401 → `removeAll()` → infinite-redirect-to-login chain. ## What changed - New `_load_user_from_session()` helper that reads `session["_user_id"]`, looks up the user in `UserService` (with the same `StatusEnum.VALID` and `access_token` checks already used elsewhere), and assigns `g.user`. - Every `return None` path in `_load_user()` now routes through that helper before giving up: - missing `Authorization` header - malformed `bearer ` prefix - empty / too-short JWT payload - JWT signature failure - JWT-resolved user not found / has no `access_token` - `APIToken.query()` fallback exhausted The JWT and API-token paths still take precedence — the session is only consulted when those can't authenticate the request. So existing local-login and SDK callers see no behaviour change; only OAuth / OIDC users that hit the original race now stay logged in. The Bearer-prefix issue called out in #13663 (lines 103-110) is already handled in the current code, so this PR only addresses the second half of the report. 
## Test plan - [ ] Configure OIDC under `oauth` in `service_conf.yaml` - [ ] Click the OIDC login button, complete auth at the IdP - [ ] Confirm that navigating between pages no longer bounces back to `/login` - [ ] Confirm local email/password login still issues + accepts JWTs - [ ] Confirm SDK/API key callers still authenticate via `Authorization: Bearer <api-token>` --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 09:59:52 +08:00
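The precedence described in the commit above (token first, session only as a fallback) can be sketched with injected lookups. `load_user`, `verify_bearer`, and `find_user_by_id` are hypothetical names for this example; the real `_load_user` also handles the `APIToken` path and user status checks, which are omitted here.

```python
def load_user(headers, session, verify_bearer, find_user_by_id):
    """Resolve the current user, consulting the session only as a fallback."""
    auth = headers.get("Authorization", "")
    if auth.lower().startswith("bearer "):
        user = verify_bearer(auth[7:].strip())
        if user is not None:
            return user  # the token path takes precedence
    # Missing or invalid token: fall back to the id written by login_user().
    user_id = session.get("_user_id")
    return find_user_by_id(user_id) if user_id else None
```

A request with an empty `Authorization` header but a valid session cookie now resolves to a user instead of producing a 401.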
Hunnyboy1217	782084780e	feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628 ) (#14677 ) ### What problem does this PR solve? S3-family connector syncs currently re-download every in-window object just so we can compute `xxhash128(blob)` and compare against `Document.content_hash`. Anything that bumps `LastModified` without changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays full bandwidth and re-parses files that didn't actually change. #14628 covers the broader incremental-ingestion redesign; this PR is the first slice. The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2 / GCS / OCI / S3-compat) now implements a new `FingerprintConnector` interface: `list_keys()` paginates `list_objects_v2` and yields `KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The orchestrator joins those against the connector's existing `{doc_id: content_hash}` map and only calls `get_value(key)` when the fingerprint differs. Unchanged keys are skipped entirely — no `GetObject`, no re-parse. No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing `Document.content_hash` column per @yingfeng's suggestion; the connector decides at listing time whether to populate it. Local uploads and connectors that don't opt in fall through to the existing post-download `xxhash128(blob)` path with no behavior change. This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS (PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint` (PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation (PR-4). Holding those back keeps this PR additive and reviewable on its own. 
#### Files touched - `common/data_source/models.py` — new `KeyRecord`; optional `fingerprint` on `Document` - `common/data_source/interfaces.py` — `IncrementalCapability` enum, `FingerprintConnector` ABC - `common/data_source/blob_connector.py` — `BlobStorageConnector` implements `FingerprintConnector`; per-object download factored into `_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`, `get_value` all share it - `rag/svr/sync_data_source.py` — `_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop; `_run_task_logic` plumbs `doc.fingerprint` into the upload dict - `api/db/services/document_service.py` — `list_id_content_hash_map_by_kb_and_source_type()` helper - `api/db/services/connector_service.py` + `file_service.py` — fingerprint flows through `duplicate_and_parse → upload_document` and lands in `content_hash` - `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests covering ETag normalization (single-part, multipart, quoted, empty), `list_keys()` not calling `GetObject`, `get_value()` materializing with fingerprint, deterministic/stable fingerprints, and the bypass loop asserting `GetObject` is not called on a match #### Worth flagging for review Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a `LastModified` window when `poll_range_start` was set. New code uses `_fingerprint_filtered_generator` (full bucket listing + fingerprint compare) outside of explicit `reindex=1`. Strictly better for unchanged-bucket cases since it skips `GetObject`, but it does mean every sync now does a full `list_objects_v2` paginate. Should still be cheap for most buckets — flagging in case anyone has a very large bucket where the time-window filter was meaningful. On migration: existing rows have `content_hash = xxhash128(blob)` from the old code. The first sync after this lands sees ETag-derived fingerprints that don't match, re-fetches every object once, and writes the new fingerprint. 
From the second sync onward the bypass works as expected. "Slow day one, fast every day after." A `fingerprint_backfill: trust` opt-out is sketched in the design doc but not in this PR. #### Test plan - [x] `uv run ruff check` — clean on all 8 touched files - [x] `uv run pytest test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed - [x] Broader unit-test suite — no regressions in anything I touched - [ ] Manual smoke against a real S3 bucket — configure a connector, run sync twice, expect the second sync to log `bypassed=N, fetched=0` and no `GetObject` calls in CloudTrail / bucket access logs - [ ] Manual smoke with `reindex=1` — confirm the full re-download path still works ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-09 20:03:56 +08:00
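The bypass loop from the commit above can be sketched on a plain listing. Several things here are assumptions: `hashlib.blake2b` with a 16-byte digest stands in for `xxhash128` (both give 32 hex characters, but the values differ), the known-hash map is keyed by object key rather than document id, and `plan_sync` is a hypothetical name rather than the orchestrator's method.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRecord:
    key: str
    fingerprint: str


def fingerprint_of(etag):
    """Hash a normalized ETag; blake2b stands in for xxhash128 here."""
    normalized = etag.strip().strip('"')
    return hashlib.blake2b(normalized.encode(), digest_size=16).hexdigest()


def plan_sync(listing, known_hashes):
    """Split listed objects into those to download and those to skip."""
    to_fetch, bypassed = [], []
    for key, etag in listing:
        record = KeyRecord(key, fingerprint_of(etag))
        if known_hashes.get(record.key) == record.fingerprint:
            bypassed.append(record.key)  # unchanged: no GetObject, no re-parse
        else:
            to_fetch.append(record)
    return to_fetch, bypassed


known = {"a.pdf": fingerprint_of('"abc"')}
to_fetch, bypassed = plan_sync([("a.pdf", '"abc"'), ("b.pdf", '"def"')], known)
print([r.key for r in to_fetch], bypassed)
```

The migration note in the commit follows directly: rows whose stored hash was computed from the blob never match an ETag-derived fingerprint, so everything lands in `to_fetch` on the first sync.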
euvre	f4b8f53b6d	Fix: restore embedding model switching for datasets with existing chunks (#14732 ) ### What problem does this PR solve? ## Problem During the REST API refactoring (#13690), the `/api/v2/kb/check_embedding` endpoint was removed and never migrated to the new RESTful structure. The frontend was pointed to the `/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a completely different function). Additionally, a hard guard was introduced that rejects any `embd_id` change when `chunk_num > 0`, making it impossible to switch embedding models on datasets with existing chunks. ## Root Cause 1. Missing endpoint: The old `check_embedding` logic (sample random chunks, re-embed with the new model, compare cosine similarity) was not carried over to the new REST API service layer. 2. Wrong frontend URL: `checkEmbedding` in `api.ts` pointed to `/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated check endpoint. 3. Overly restrictive guard: `dataset_api_service.py` line 310 blocked all `embd_id` updates when `chunk_num > 0`. This check did not exist in the pre-refactor code — it was incorrectly introduced during the refactor. 
## Changes ### Backend - `api/apps/services/dataset_api_service.py` - Remove the `chunk_num > 0` hard guard on `embd_id` updates - Add `check_embedding()` service function: samples random chunks, re-embeds them with the candidate model, computes cosine similarity, returns compatibility result (avg ≥ 0.9 = compatible) - Add `import re` for the `_clean()` helper - `api/apps/restful_apis/dataset_api.py` - Add `POST /datasets/<dataset_id>/embedding/check` endpoint following the new REST API conventions - Clean up unused top-level imports (`random`, `re`, `numpy`) ### Frontend - `web/src/utils/api.ts` - Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` → `/datasets/${datasetId}/embedding/check` ### Tests - `test/testcases/test_http_api/test_dataset_management/test_update_dataset.py` - Update `test_embedding_model_with_existing_chunks` to assert success (`code == 0`) instead of expecting the old `102` error - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py` - Update `test_update_route_branch_matrix_unit` to assert `RetCode.SUCCESS` when updating `embd_id` on a chunked dataset, replacing the old `chunk_num` error assertion ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-09 18:48:57 +08:00
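The compatibility check described in the commit above (re-embed sampled chunks, compare by cosine similarity, accept when the average is at least 0.9) can be sketched with plain lists. This is not the service function itself: sampling and the embedding calls are left out, and the vectors are passed in directly as an assumption of this example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_embedding(stored_vectors, candidate_vectors, threshold=0.9):
    """Compare stored chunk vectors with vectors from the candidate model."""
    sims = [cosine(s, c) for s, c in zip(stored_vectors, candidate_vectors)]
    avg = sum(sims) / len(sims) if sims else 0.0
    return {"avg_similarity": avg, "compatible": avg >= threshold}


same = check_embedding([[3.0, 4.0]], [[3.0, 4.0]])
different = check_embedding([[1.0, 0.0]], [[0.0, 1.0]])
print(same, different)
```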

1627 Commits