Commit Graph

722 Commits

Author SHA1 Message Date
6499bce2a6 fix: Langfuse chat observation (#15026)
### What problem does this PR solve?

Closes #15025

Langfuse-enabled `dialog_service.async_chat()` regressed to
`langfuse_tracer.start_generation(...)` after the earlier Langfuse v4
migration. Langfuse v4 uses `start_observation(as_type="generation")`,
so the remaining `start_generation` call can fail when chat tracing is
enabled.

This restores the migrated `start_observation(as_type="generation")`
call for chat observations while preserving the existing trace context,
model, input payload, and update/end flow. It also adds a regression
test with a fake Langfuse v4-style client that exposes
`start_observation()` but not `start_generation()`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Tests

- `.venv/bin/pytest
test/unit_test/api/db/services/test_dialog_service_final_answer.py -q`
- `.venv/bin/ruff check api/db/services/dialog_service.py
test/unit_test/api/db/services/test_dialog_service_final_answer.py`
2026-05-20 15:01:19 +08:00
f169ab4b39 feat(tts): cache synthesized speech in Redis to avoid redundant calls (#14851)
## What problem does this PR solve?

Closes #12017.

TTS output is deterministic for a given `(model, text)` pair, so
re-running the same text through the same TTS model produces the same
bytes — yet `Canvas.tts` and `dialog_service.tts` re-synthesized on
every request. That's slow and wastes provider quota whenever the same
assistant response is replayed, shared across users, or repeated within
a session.

### Change

New helper `rag/utils/tts_cache.py` with `synthesize_with_cache(tts_mdl,
cleaned_text)`:

- **Key:** `tts:cache:{model_id}:{sha256(text)}` — separate namespace
per model, identical cleaned text reuses a single entry across both call
sites.
- **Value:** the hex-encoded audio blob both call sites already
returned. No format change for downstream consumers.
- **TTL:** 7 days by default, configurable via
`RAGFLOW_TTS_CACHE_TTL_SECONDS`.
- **Failure modes:** a Redis hiccup falls back to direct synthesis; a
failed synthesis still returns `None` (existing contract preserved).


[`Canvas.tts`](https://github.com/infiniflow/ragflow/blob/main/agent/canvas.py#L683-L724)
and
[`dialog_service.tts`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/dialog_service.py#L1367-L1380)
now route through the helper; the per-file bytes-accumulation/hex-encode
loop has been removed in favor of one shared implementation.

## Type of change

- [x] New Feature (non-breaking change which adds functionality)

## Test plan

- [ ] **Cache hit, chat path:** Configure a dialog with TTS enabled, ask
the same question twice with `stream=false`. Verify the second response
returns the same `audio_binary` and that the second invocation doesn't
hit the TTS provider (e.g., observe provider-side logs / usage counters;
check no `LLMBundle.tts can't update token usage` log line on the second
run).
- [ ] **Cache hit, agent path:** Same exercise via a Conversational
Agent that includes a Message component playing back the answer.
- [ ] **Cache isolation per model:** Switch tenant's `tts_id` between
two models, run the same text against each — confirm the second model's
first synthesis still happens (no cross-model hits).
- [ ] **TTL override:** Set `RAGFLOW_TTS_CACHE_TTL_SECONDS=120`, confirm
the entry expires after 2 minutes.
- [ ] **Redis unavailable:** Stop Redis (or break the connection).
Verify the TTS endpoint still works — synthesis falls back to direct
calls, with a `TTS cache lookup failed` / `TTS cache store failed`
warning logged.
- [ ] **Failure path:** Configure a TTS model with an invalid API key,
ensure the response still returns successfully with `audio_binary=None`
(no regression vs. current behavior).
2026-05-19 14:20:40 +08:00
525a87be0f Misc: fix some typos (#14987)
### What problem does this PR solve?

Fix minor code quality issues:

1. Fix typo in assertion error message: "Can't fine" → "Can't find"
2. Remove duplicate line in common/connection_utils.py

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2026-05-19 10:47:06 +08:00
198f3c4b9a Fix: validate memory tenant model IDs on update and enforce tenant scope in memory pipeline (#14923)
### Related issues

Closes #14922

### What problem does this PR solve?

`POST /memories` already resolves `tenant_llm_id` and `tenant_embd_id`
through `ensure_tenant_model_id_for_params`, but `PUT
/memories/<memory_id>` accepted client-supplied `tenant_llm_id` /
`tenant_embd_id` without checking that those `tenant_llm` rows belong to
the memory owner’s tenant. A caller could persist another tenant’s row
IDs and later trigger extraction or embedding that loaded foreign model
credentials via `get_model_config_by_id(tenant_model_id)` with no tenant
allow-list.

This change aligns the update path with create: updates that change
models must go through `llm_id` / `embd_id` and
`ensure_tenant_model_id_for_params` scoped to the **memory’s**
`tenant_id` (not only the current user, so team-access cases stay
correct). Direct `tenant_*` fields in the body without `llm_id` /
`embd_id` are rejected. As defense in depth, `memory_message_service`
passes `allowed_tenant_ids` / `requester_tenant_id` into
`get_model_config_by_id` for LLM and embedding resolution so mismatched
IDs cannot be used even if bad data existed. A regression test rejects
payloads that set only `tenant_llm_id` / `tenant_embd_id`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-19 10:11:46 +08:00
b69a6a5d80 Feat: full optimization on connector dashboard (#14979)
### What problem does this PR solve?

This PR improves the connector dashboard task management experience and
adds better visibility into connector execution logs.

### Overview:

#### Before
<img width="700" alt="image"
src="https://github.com/user-attachments/assets/e4a8ed6f-2e18-4f0f-8528-41a514550052"
/>

#### Now:
<img width="700" alt="Screenshot from 2026-05-18 16-31-30"
src="https://github.com/user-attachments/assets/d4ca193b-847a-49ae-9e4f-5fbca60ea627"
/>

### 1. Add a new logging page to the connector dashboard

A new logging page has been added so users can view connector task
execution logs directly from the connector dashboard.

### 2. Merge the Resume button into Confirm

The separate **Resume** button has been removed. The **Confirm** button
now represents different actions depending on the current task state:

- **Save**: Save form changes and reschedule tasks.
- **Stop**: Cancel currently scheduled or running tasks.
- **Resume**: Create new scheduled tasks after the previous tasks have
been stopped.
- **Start**: Start tasks when no task has been started yet.

### 3. Separate syncing and pruning tasks

Connector tasks are now separated into **syncing** and **pruning**.

Pruning is controlled by the **Sync deleted files** option:

- When **Sync deleted files** is disabled, only syncing tasks are shown.
- When **Sync deleted files** is enabled, both syncing and pruning tasks
are shown.

**Now: Sync deleted files disabled**

<img width="700" alt="Sync deleted files disabled"
src="https://github.com/user-attachments/assets/dbd9232e-614a-407f-a0b1-c109e5fa567d"
/>

**Now: Sync deleted files enabled**

<img width="700" alt="Sync deleted files enabled"
src="https://github.com/user-attachments/assets/1f527f48-ccb3-4ee8-97ca-086891489296"
/>

### 4. Update logs in backend

<img width="700" alt="image"
src="https://github.com/user-attachments/assets/10a95a3f-98c1-4e67-8afa-ddf6cda5b0b2"
/>

### 5. Remove connector resume API

- Removed: `POST /v1/connectors/<connector_id>/resume`
- Replaced by: `PATCH /v1/connectors/<connector_id>`


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-19 10:07:11 +08:00
93d3deb5e4 Fix admin CLI system variable commands (#14956)
## What

Fixes #12409.

Implements admin CLI support for:

- `list vars;`
- `show var <name-or-prefix>;`
- `set var <name> <value>;`

## Changes

- Wire Go CLI variable commands to the admin API.
- Support integer and quoted string values in `SET VAR`.
- Return variable rows as `data_type`, `name`, `setting_type`, and
`value`.
- Add exact-name lookup with prefix fallback for `SHOW VAR`.
- Validate values by stored data type: `string`, `integer`, `bool`, and
`json`.
- Keep the legacy Python admin CLI/server behavior aligned.
- Update admin CLI docs and add focused tests.

## Verification

- `go test -count=1 ./internal/cli`
- `python3.12 -m py_compile admin/server/services.py
admin/server/routes.py api/db/services/system_settings_service.py
admin/client/parser.py admin/client/ragflow_client.py`
- Python admin CLI parser smoke test for `SET VAR`, quoted values, `SHOW
VAR`, and `LIST VARS`.
- Attempted `./run_go_tests.sh`; local environment is missing native
tokenizer/linker artifacts:
  - `internal/cpp/cmake-build-release/librag_tokenizer_c_api.a`
  - `-lstdc++`

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-05-18 19:08:45 +08:00
2dbe3b8a62 fix: metadata_condition returning all docs when filter matches nothing (#14967)
### What problem does this PR solve?

When _parse_doc_id_filter_with_metadata returns [], the empty list is
falsy so the WHERE id IN (...) clause was silently skipped, causing the
full dataset to be returned instead of an empty result.

Change `if doc_ids:` to `if doc_ids is not None:` in both get_list() and
get_by_kb_id() to distinguish between no filter (None) and a filter that
matched zero documents ([]).

Fixes #14962

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-18 18:54:30 +08:00
dev
b12eaee38b fix(api): enforce tenant access for connector routes (#14747)
### What problem does this PR solve?

Fixes #14746.

Adds tenant access checks for connector-by-id REST routes before reading
connector details, mutating connector config/status, deleting
connectors, rebuilding, or listing sync logs. Unauthorized callers now
receive `RetCode.AUTHENTICATION_ERROR` with `No authorization.` without
reaching the connector/log mutation paths.

Validation:
- `python3 -m pytest
--confcutdir=test/testcases/test_web_api/test_connector_app
test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py`
- `uvx ruff check api/apps/restful_apis/connector_api.py
api/db/services/connector_service.py
test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: dev111-actor <dev111-actor@users.noreply.github.com>
2026-05-18 16:09:26 +08:00
56d73d0c2c Refactor: speed up ragflow server, save startup memory (#14973)
### What problem does this PR solve?

Refactor: speed up ragflow server, save startup memory, saved 200MiB,
and 5-9 seconds start time.

##### Before
1241292  |   |           \_ python3 api/ragflow_server.py
RAGFlow server is ready after 25.61845850944519s initialization.

##### After
1019968  |   |           \_ python3 api/ragflow_server.py
RAGFlow server is ready after 16.205134391784668s initialization.

### Type of change

- [x] Refactoring
2026-05-18 15:55:59 +08:00
f1d2383572 Push metadata filters down to Infinity (#14974)
### What problem does this PR solve?

Push metadata filters down to Infinity

### Type of change

- [x] Refactoring
2026-05-18 14:22:04 +08:00
7cdc74bbe5 Refactor: Drop the vector fetch for ES (#14970)
## Summary
- Stop pulling chunk vectors (`q_*_vec`) back from Elasticsearch in the
main retrieval path. ES already knows them; shipping them was pure
bandwidth/memory overhead.
- Recover the per-chunk cosine similarity via a second KNN-only ES call
filtered by the candidate chunk ids. The new `_score` is merged with
locally computed term similarity using the user-configured
`vector_similarity_weight`.
- Lazily fetch the chunk embedding only for the chunks
`insert_citations` actually needs.

## Details
**`rag/nlp/search.py`**
- `Dealer.search`: no longer appends `q_*_vec` to the ES select list.
OceanBase still gets it (its rerank path is unchanged).
- New `Dealer._knn_scores(sres, idx_names, kb_ids)`: a `MatchDenseExpr`
over the cached query vector filtered by `id IN sres.ids`, returning
`{chunk_id: cosine_score}` via ES `_score`.
- New `Dealer.rerank_with_knn(...)`: term similarity from
`qryr.token_similarity` plus the ES-supplied KNN score, combined with
`tkweight`/`vtweight` and the existing rank-feature bonus.
- New `Dealer.fetch_chunk_vectors(chunk_ids, tenant_ids, kb_ids, dim)`:
on-demand vector fetch for citation use.
- `Dealer.retrieval` routes Infinity → unchanged, OceanBase → existing
local `rerank`, ES → new KNN-score path.

**`common/doc_store/es_conn_base.py`**
- New `get_scores(res)` helper returning `{_id: _score}` directly from
hit headers (ES doesn't surface `_score` through `get_fields`).

**`api/db/services/dialog_service.py`**
- New top-level `_hydrate_chunk_vectors(...)` helper. On ES it
back-fills `ck["vector"]` from `fetch_chunk_vectors` right before
`insert_citations`. No-op on Infinity / OB (their chunks already carry
vectors).
- Both `decorate_answer` closures became `async` and are `await`-ed at
all call sites in `async_chat` and `async_ask`.

## Backend behavior
| Backend | Returns chunk vec in main search | Sim source | Vectors for
citations |
|---|---|---|---|
| ES | No | second KNN call (`_score`) merged with term sim | fetched on
demand |
| Infinity | No (unchanged) | normalized `_score` | already on chunks |
| OceanBase | Yes (kept) | local hybrid rerank | already on chunks |

## Test plan
2026-05-18 14:21:56 +08:00
9f2fb4611f Fix: guard empty/whitespace embedding inputs in LLMBundle (#14428) (#14924)
Closes #14428 


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-18 14:11:54 +08:00
14c0985182 feat: bump Python minimum from 3.12 to 3.13, drop strenum backport (#14767)
Closes #14753

## What changed

| File | Change |
|---|---|
| `pyproject.toml` | `requires-python` → `>=3.13,<3.15`; remove
`strenum==0.4.15` |
| `Dockerfile` | `uv python install 3.13`, `uv sync --python 3.13` |
| `.github/workflows/tests.yml` | `uv sync --python 3.13` on both matrix
legs |
| `CLAUDE.md` | dev setup command + requirements note updated |
| `deepdoc/parser/mineru_parser.py` | `from strenum import StrEnum` →
`from enum import StrEnum` |
| `agent/tools/code_exec.py` | same |

`StrEnum` has been in the stdlib since Python 3.11 — the `strenum`
backport package is no longer needed once the floor is 3.13.

## Why uv.lock is not regenerated

`uv lock --python 3.13` fails because:

1. The infiniflow/graspologic fork pins `numpy>=1.26.4,<2.0.0`
2. `tensorflow-cpu>=2.20.0` (the first release with cp313 wheels)
depends on `ml-dtypes>=0.5.1`, which requires `numpy>=2.1.0`
3. These two constraints are irreconcilable on Python 3.13

The lockfile regeneration requires loosening the `numpy` upper bound in
the `infiniflow/graspologic` fork. Once that fork commit is updated and
the SHA in `pyproject.toml:49` is bumped, `uv lock --python 3.13` will
succeed.

## RFC corrections

Two claims in the original RFC (#14753) did not hold up under code
review:

- **"graspologic hard-blocks 3.13"** — the infiniflow fork at the pinned
commit has no `<3.13` Python constraint. The blocker is the transitive
`numpy<2.0.0` conflict with tensorflow-cpu's test dependency, not a
direct Python version cap.
- **"free-threading throughput gains for I/O-bound workload"** — Python
3.13 free-threading requires a special `--disable-gil` build and
provides no benefit for async I/O code (the GIL is already released
during I/O). The real motivation is forward compatibility and improved
error messages.
2026-05-15 14:40:53 +08:00
bd99a22661 fix: atomic chunk/token counter updates for documents and knowledge b… (#14867)
### What problem does this PR solve?

Fixes #14866.

Previously, `DocumentService.increment_chunk_num` and
`decrement_chunk_num` updated the `Document` row and its parent
`Knowledgebase` row in two separate, non-transactional statements. If
the second update failed (DB error, connection drop, etc.) after the
first one succeeded, the document and knowledge base chunk/token
counters would drift apart and stay inconsistent.

There was also a behavioral asymmetry between the two methods:

- `increment_chunk_num` only logged a warning when the document row was
missing and returned a value that callers usually treated as success.
- `decrement_chunk_num` raised `LookupError` in the same situation.

This PR makes the counter updates atomic and aligns the missing-document
behavior between the two methods:

- Wrap the `Document` and `Knowledgebase` updates in
`increment_chunk_num` / `decrement_chunk_num` inside a `DB.atomic()`
block so both succeed or both roll back together.
- Raise `LookupError` from `increment_chunk_num` when the target
document no longer exists, matching `decrement_chunk_num`.
- Update `reset_document_for_reparse` in `document_api_service.py` to
catch the new `LookupError` and return a proper "Document not found!"
API error instead of propagating the exception.

No schema changes, no API contract changes for the success path; only
the failure mode for a missing document during reparse is now a clean
error response instead of an uncaught exception.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-14 14:48:52 +08:00
ba8cb9dd4a fix: replace mutable default arguments with None in LLM chat models (#13513)
## Summary
- Replace `gen_conf={}` with `gen_conf=None` + guard in
`rag/llm/chat_model.py` (12 instances across Base, BaiChuanChat,
LocalLLM, MistralChat, ReplicateChat, BaiduYiyanChat, GoogleChat
classes)
- Replace `doc_ids=[]` with `doc_ids=None` + guard in
`api/db/services/document_service.py` (1 instance)
- Mutable default arguments are shared across all calls, causing
potential cross-request state contamination
- See Python docs:
https://docs.python.org/3/faq/programming.html#why-are-default-values-shared-between-objects

## Test plan
- [x] Verify LLM calls work with and without explicit gen_conf
- [x] No behavior change for existing callers — `None` is replaced with
`{}` at function entry

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-14 14:46:47 +08:00
d46bbd30f7 Fix: send input and output token usage to Langfuse (#13294)
### What problem does this PR solve?

Closes #9837

The Langfuse integration currently only sends the output text to
`langfuse_generation.update()` without including token usage
information. This means Langfuse cannot track input/output token
consumption for cost analysis and monitoring.

### Solution

Add the `usage` parameter to `langfuse_generation.update()` with:
- `input`: approximate input token count from `message_fit_in()`
- `output`: approximate output token count from
`num_tokens_from_string(answer)`
- `total`: sum of input and output

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-14 13:11:37 +08:00
dd76653dc1 feat: add tag management for Agents with filtering and sorting (#14774) (#14799)
## Summary
Closes #14774.

Adds free-form tags on agents (UserCanvas) with full UI + API:
- Stored as comma-separated `tags` column on `UserCanvas` with online
migration.
- New endpoints: `GET /v1/agents/tags` (aggregate counts) and `PUT
/v1/agent/<id>/tags` (write). `GET /v1/agents` accepts a `tags=` query.
- "Edit tags" item in agent dropdown opens a chip-style editor dialog;
tags render as badges on each agent card.
- New "Tags" facet in the agents filter bar, with counts.

## Implementation notes
- **Tag matching is exact-token**: the SQL filter wraps stored tags as
`,…,` and matches `,ml,` so `ml` doesn't match `ml-ops`.
- **Server-side normalization** in `UserCanvasService.update_tags`:
dedup (case-insensitive), per-tag cap of 64 chars, total length capped
at 512 chars to fit the column, commas inside tag values are replaced
with spaces.
- **Tenant authorization**: `PUT /v1/agent/<id>/tags` gates on
`UserCanvasService.accessible(canvas_id, tenant_id)`.
- **Tag listing scope**: `UserCanvasService.list_tags` follows the same
own + team-shared rule as `get_by_tenant_ids`.
- **i18n**: keys added to `en.ts` and `zh.ts` only (per project
convention; other locales fall back).
- **`HomeCard`** gets a non-breaking `extra?: ReactNode` slot for the
chip row; no `src/components/ui/` files modified.

## Test plan
- [ ] Backend boot runs `migrate_db` → confirm `user_canvas.tags` column
exists (`DESCRIBE user_canvas`).
- [ ] Agents page renders cards normally (no console error from missing
field).
- [ ] `⋯ → Edit tags` opens a dialog that stays open (regression: dialog
was unmounting with the dropdown).
- [ ] Typing a tag without pressing Enter and clicking Save persists it
(regression: last typed tag was being dropped).
- [ ] Chip input supports Enter/comma to commit, Backspace on empty to
remove, `×` to remove individual chip.
- [ ] Tag containing a comma sent via API is stored with the comma
replaced by a space.
- [ ] 20 long tags sent via API does not error (length cap silently
truncates).
- [ ] "Tags" filter in the filter bar shows counts and narrows the list.
- [ ] Filtering by `ml` does **not** return agents tagged `ml-ops`.
- [ ] UI in Chinese shows 编辑标签 / 添加标签以整理和筛选你的智能体 etc.
- [ ] `PUT /v1/agent/<other-tenant-id>/tags` returns `Agent not found or
no permission.`
2026-05-13 21:41:32 +08:00
7f699d1202 Fix: enforce tenant authorization for tenant_rerank_id in retrieval flows (#14782)
### Related issues

Closes #14781 

### What problem does this PR solve?

Some retrieval endpoints accepted caller-supplied `tenant_rerank_id` and
resolved it through `get_model_config_by_id(...)`. That helper loaded
`TenantLLM` rows by global database id and returned decoded model
configuration without checking whether the model belonged to the
authenticated tenant or the dataset owner tenant.

This meant dataset access was validated, but rerank-model selection was
not. A caller who knew or could guess another tenant's
`tenant_rerank_id` could attempt retrieval with a foreign rerank model
config, creating a cross-tenant authorization gap for model usage.

This PR closes that gap by making `tenant_rerank_id` resolution
tenant-aware across the retrieval paths that accept it.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Solution

- Extend `get_model_config_by_id(...)` to accept an optional
`allowed_tenant_ids` set and reject `TenantLLM` rows whose `tenant_id`
is outside that set.
- Pass the allowed tenant scope from retrieval endpoints that accept
`tenant_rerank_id`:
  - `api/apps/sdk/doc.py`
  - `api/apps/sdk/session.py`
  - `api/apps/services/dataset_api_service.py`
- Use the authenticated tenant plus dataset-owner tenant ids already
derived by each retrieval flow as the authorization boundary for rerank
model selection.
- Add focused unit coverage to assert unauthorized `tenant_rerank_id`
values are rejected and that the allowed tenant set is propagated
correctly.

### Testing

- `python -m py_compile` on:
  - `api/db/joint_services/tenant_model_service.py`
  - `api/apps/services/dataset_api_service.py`
  - `api/apps/sdk/doc.py`
  - `api/apps/sdk/session.py`
- Added unit tests in:
-
`test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`
-
`test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py`

### Notes for reviewers

- This change is intentionally narrow: it affects only the
`tenant_rerank_id` path, not the normal `rerank_id` name-based
resolution path.
- Local lint/syntax checks passed.
- Full pytest execution could not be completed in this environment
because the local test runtime is missing `strenum`, so the route-test
files fail during collection before exercising the updated cases.

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-13 19:53:08 +08:00
663fc1d42c fix(opensearch): implement doc-meta dispatch surface on OSConnection (#14577)
### What problem does this PR solve?

Fixes #14570. On OpenSearch backends (`DOC_ENGINE=opensearch`) every
document-metadata write failed with `'OSConnection' object has no
attribute 'create_doc_meta_idx'`, so both `PATCH
/api/v1/datasets/{ds}/documents/{doc}` with `meta_fields` and `POST
/api/v1/datasets/{ds}/metadata/update` were unusable while every other
document operation (retrieval, parsing, name update, chunk management)
worked correctly on the same OpenSearch cluster.

The bug runs deeper than the missing method name in the error message
suggests. `DocMetadataService` also reached into
`settings.docStoreConn.es.*` directly for the index refresh, the
scripted partial update, and the count call, which means that even after
adding `create_doc_meta_idx` to `OSConnection` the very next call in the
same metadata flow would still raise `AttributeError` because
`OSConnection` exposes `self.os` rather than `self.es`. Fixing only the
reported symptom would have moved the failure one line down without
restoring the feature.

This PR adds a uniform document-metadata dispatch surface to both
connection classes so they present the same abstract API, and routes the
service layer through that surface via `getattr` guards instead of
poking at backend-specific attributes. The four new methods on
`OSConnection` and `ESConnectionBase` are `create_doc_meta_idx`,
`refresh_idx`, `count_idx`, and `replace_meta_fields`.
`OSConnection.create_doc_meta_idx` reuses the existing
`conf/doc_meta_es_mapping.json` schema in the OpenSearch `body=` form
because OpenSearch and Elasticsearch share the same index-creation
payload, and `replace_meta_fields` emits a full scripted assignment
(`ctx._source.meta_fields = params.meta_fields`) on both backends so
removed keys actually disappear instead of being preserved by deep-merge
semantics.

The `getattr`-guarded dispatch in `DocMetadataService` keeps the
existing fall-through paths intact for Infinity and OceanBase, which
continue to rely on their search-based count fallback and on the
delete-then-insert metadata replacement they used before, so this change
is strictly additive for those two backends.

Verification: `pytest
test/unit_test/rag/utils/test_opensearch_doc_meta.py` runs 16 new unit
tests that pass locally and pin the `OSConnection` dispatch surface, the
`create_doc_meta_idx` short-circuit when the index already exists, the
mapping-file payload routing, the `IndicesClient.create` failure path,
the `refresh_idx` and `count_idx` success and error sentinels, and the
full-assignment script emitted by `replace_meta_fields`. The test module
stubs `common.settings` and `rag.nlp` at import time so the suite runs
without the heavy backend SDKs that the rest of the repository pulls in
transitively.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>
2026-05-11 17:04:28 +08:00
292b0b8bce chore: fix some comments to improve readability (#14756)
### What problem does this PR solve?

fix some comments to improve readability

### Type of change

- [x] Documentation Update

---------

Signed-off-by: box4wangjing <box4wangjing@outlook.com>
2026-05-11 16:48:48 +08:00
592dba1489 Refact: Added a private helper _visibility_and_status_filter (#13627)
### What problem does this PR solve?

Added a private helper _visibility_and_status_filter(joined_tenant_ids,
user_id) that returns the Peewee condition: visible to user (team or
own) and status is VALID.

### Type of change
- [x] Refactoring

---------

Co-authored-by: Serobabov Aleksandr <40SerobabovAS@region.cbr.ru>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-05-11 15:21:41 +08:00
6ce014c23b fix: offload blocking DB/Redis calls to thread pool for high-concurrency support (#13825) (#13941)
### What problem does this PR solve?

Addresses event-loop blocking under high concurrency reported in #13825.
When multiple requests hit the API simultaneously, synchronous DB/Redis
calls block the async event loop, preventing Quart from handling other
requests and causing cascading 502/504 timeouts.

This PR wraps all remaining blocking DB/Redis calls in `canvas_app.py`,
`chat_api.py`, `session.py`, and `canvas_service.py` with `await
thread_pool_exec()`
- Offload all synchronous `Service.*`, `REDIS_CONN.*`, and
`APIToken.query` calls to the thread pool
- Convert sync endpoint handlers (`list_chats`, `get_chat`, `templates`,
`sessions`, etc.) to `async def`
- Convert sync helper functions (`_ensure_owned_chat`,
`_validate_llm_id`, `_validate_dataset_ids`, etc.) to async - no
duplicate sync/async pairs
- Wrap `CanvasReplicaService` Redis IO calls (`bootstrap`,
`replace_for_set`, `commit_after_run`)
- Use `asyncio.gather()` for concurrent file uploads and chat response
building

**Note:** This fixes the code-level event-loop blocking, which is a
prerequisite for handling concurrent requests. For the full "30
concurrent requests without 502/504" goal described in the issue, users
should also tune deployment config:
- `WS=4` or higher (HTTP worker processes, default 1)
- `MAX_CONCURRENT_CHATS=50` (default 10)
- `SANDBOX_EXECUTOR_MANAGER_POOL_SIZE` for workflow-heavy workloads

### Performance verification

Reviewer asked for a before-vs-after comparison
([comment](https://github.com/infiniflow/ragflow/pull/13941#issuecomment-4393667231)).
I built a self-contained microbenchmark that reproduces the exact
failure mode this PR targets: an async handler that performs blocking
DB/Redis-style calls (50 ms each, 3 per request, 30 concurrent requests)
is run twice — once with the pre-PR pattern (sync call directly inside
the async handler) and once with the post-PR pattern (`await
thread_pool_exec(...)`). The benchmark imports nothing from RAGFlow
except `thread_pool_exec` itself, so it is hermetic and reproducible
(`THREAD_POOL_MAX_WORKERS=128`, Python 3.13.12).

**Throughput — wall-clock for 30 concurrent requests (lower is better)**

| flavour | wall(s) | p50(s) | p95(s) | max(s) |
|---|---:|---:|---:|---:|
| before | 4.986 | 0.158 | 0.207 | 0.269 |
| after  | 0.248 | 0.181 | 0.230 | 0.231 |

The pre-PR handler serializes the entire load on the event-loop thread,
so 30 × 3 × 50 ms ≈ 4.5 s shows up as the wall time. The post-PR handler
parallelizes the blocking work across the thread pool and finishes the
same load in 248 ms — a **~20× speedup** on this workload.

**Event-loop responsiveness — latency of an unrelated probe coroutine
while the 30 slow requests are running (lower is better)**

| flavour | samples | probe p50 (ms) | probe p95 (ms) | probe max (ms) |
|---|---:|---:|---:|---:|
| before | 1 | 5442.26 | 5442.26 | 5442.26 |
| after  | 28 | 0.88 | 11.53 | 98.02 |

This is the metric that maps directly to "the API still answers other
requests while one is busy". A 5 ms-interval probe was scheduled while
the 30 slow handlers ran. With the pre-PR code the event loop was frozen
for the entire duration of the blocking work, so only one probe sample
was ever picked up and it waited **5,442 ms**. After the PR, 28 probe
samples landed with **p50 0.88 ms / p95 11.53 ms**, meaning unrelated
requests are no longer starved by the slow ones. That is the regression
mode behind the cascading 502/504s reported in #13825.

<details>
<summary>Raw benchmark output</summary>

```
config: 30 concurrent requests, 3 blocking calls of 50ms each per request, THREAD_POOL_MAX_WORKERS=128

=== Throughput (lower wall is better) ===
flavour       wall(s)     p50(s)     p95(s)     max(s)
before          4.986      0.158      0.207      0.269
after           0.248      0.181      0.230      0.231

=== Event-loop responsiveness (lower probe latency is better) ===
flavour       samples    probe p50(ms)    probe p95(ms)    probe max(ms)
before              1          5442.26          5442.26          5442.26
after              28             0.88            11.53            98.02
```

</details>

The benchmark script is included as a comment on the PR for
reproducibility.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement

Closes [#13825](https://github.com/infiniflow/ragflow/issues/13825)

---------

Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-11 15:08:55 +08:00
5ef7f50eef fix: use context manager for ThreadPoolExecutor in file_service.py (#14144)
## Summary
- Wrap 2 `ThreadPoolExecutor` instances in `file_service.py` with `with`
statement
- Ensures threads are properly shut down after all futures complete

## Problem

`parse_docs()` (line 532) and the file processing method (line 694)
create `ThreadPoolExecutor` instances that are never shut down. In a
long-running server process, this leaks thread resources on every
invocation — threads remain alive consuming memory even after all
submitted work is complete.

## Fix

Replace bare `ThreadPoolExecutor()` with `with ThreadPoolExecutor() as
exe:` context manager, which calls `executor.shutdown(wait=True)` on
exit.

## Test plan
- [x] Verified both call sites use `with` statement after fix
- [x] No remaining bare `ThreadPoolExecutor` in `file_service.py`
- [x] `document_service.py:1066` is a module-level executor (different
pattern, not changed in this PR)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-11 14:02:45 +08:00
cc207b5b05 Refactor: tidy up ThreadPoolExecutor lifecycle in file_service and task executor (#14668)
## Summary
- Wrap the `ThreadPoolExecutor` instances in `FileService.parse_docs`
and `FileService.get_files` with `with ... as exe:` blocks for
deterministic cleanup
- Replace the `concurrent.futures.ThreadPoolExecutor` in
`do_handle_task` with `asyncio.create_task(asyncio.to_thread(build_TOC,
...))`, preserving the existing parallelism with chunk insertion while
leveraging the surrounding async context
- Drop the now-unused `import concurrent` and the
`executor.shutdown(wait=False)` call in the `finally` block

Closes #14622.

No behavioral change, no public API change. Net diff: ~19 insertions /
25 deletions across two files.

## Test plan
- [ ] `uv run ruff check api/db/services/file_service.py
rag/svr/task_executor.py` passes
- [ ] Upload a multi-file batch through the chat/file endpoint and
confirm `FileService.parse_docs` still returns combined parsed text
- [ ] Trigger `FileService.get_files` via the chat reference flow with a
mix of image and non-image files; verify both `raw=True` and `raw=False`
paths return correctly
- [ ] Run a `naive`-parser document task with `toc_extraction: true` and
confirm the TOC chunk is generated and inserted exactly as before
- [ ] Run a `naive`-parser document task with `toc_extraction: false`
and confirm the path with `toc_thread = None` is unaffected
- [ ] Cancel a running task to exercise the `finally` block and confirm
cleanup still works without the executor shutdown call

---------

Co-authored-by: web-dev0521 <jasonpette1783@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-05-11 12:59:00 +08:00
782084780e feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628) (#14677)
### What problem does this PR solve?

S3-family connector syncs currently re-download every in-window object
just so we can compute `xxhash128(blob)` and compare against
`Document.content_hash`. Anything that bumps `LastModified` without
changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays
full bandwidth and re-parses files that didn't actually change. #14628
covers the broader incremental-ingestion redesign; this PR is the first
slice.

The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2
/ GCS / OCI / S3-compat) now implements a new `FingerprintConnector`
interface: `list_keys()` paginates `list_objects_v2` and yields
`KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The
orchestrator joins those against the connector's existing `{doc_id:
content_hash}` map and only calls `get_value(key)` when the fingerprint
differs. Unchanged keys are skipped entirely — no `GetObject`, no
re-parse.

No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing
`Document.content_hash` column per @yingfeng's suggestion; the connector
decides at listing time whether to populate it. Local uploads and
connectors that don't opt in fall through to the existing post-download
`xxhash128(blob)` path with no behavior change.

This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent
PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS
(PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint`
(PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation
(PR-4). Holding those back keeps this PR additive and reviewable on its
own.

#### Files touched

- `common/data_source/models.py` — new `KeyRecord`; optional
`fingerprint` on `Document`
- `common/data_source/interfaces.py` — `IncrementalCapability` enum,
`FingerprintConnector` ABC
- `common/data_source/blob_connector.py` — `BlobStorageConnector`
implements `FingerprintConnector`; per-object download factored into
`_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`,
`get_value` all share it
- `rag/svr/sync_data_source.py` —
`_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop;
`_run_task_logic` plumbs `doc.fingerprint` into the upload dict
- `api/db/services/document_service.py` —
`list_id_content_hash_map_by_kb_and_source_type()` helper
- `api/db/services/connector_service.py` + `file_service.py` —
fingerprint flows through `duplicate_and_parse → upload_document` and
lands in `content_hash`
- `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests
covering ETag normalization (single-part, multipart, quoted, empty),
`list_keys()` not calling `GetObject`, `get_value()` materializing with
fingerprint, deterministic/stable fingerprints, and the bypass loop
asserting `GetObject` is *not* called on a match

#### Worth flagging for review

Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a
`LastModified` window when `poll_range_start` was set. New code uses
`_fingerprint_filtered_generator` (full bucket listing + fingerprint
compare) outside of explicit `reindex=1`. Strictly better for
unchanged-bucket cases since it skips `GetObject`, but it does mean
every sync now does a full `list_objects_v2` paginate. Should still be
cheap for most buckets — flagging in case anyone has a very large bucket
where the time-window filter was meaningful.

On migration: existing rows have `content_hash = xxhash128(blob)` from
the old code. The first sync after this lands sees ETag-derived
fingerprints that don't match, re-fetches every object once, and writes
the new fingerprint. From the second sync onward the bypass works as
expected. "Slow day one, fast every day after." A `fingerprint_backfill:
trust` opt-out is sketched in the design doc but not in this PR.

#### Test plan

- [x] `uv run ruff check` — clean on all 8 touched files
- [x] `uv run pytest
test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed
- [x] Broader unit-test suite — no regressions in anything I touched
- [ ] Manual smoke against a real S3 bucket — configure a connector, run
sync twice, expect the second sync to log `bypassed=N, fetched=0` and no
`GetObject` calls in CloudTrail / bucket access logs
- [ ] Manual smoke with `reindex=1` — confirm the full re-download path
still works

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-05-09 20:03:56 +08:00
3b6eeabb09 Fix: private dataset authorization bypass in shared dataset access checks (#14645)
### Related issues
Closes #14644

### What problem does this PR solve?

This PR fixes an authorization bug where datasets marked with
`permission = me` could still be accessed by other members of the same
tenant through APIs that relied on `KnowledgebaseService.accessible()`
or `DocumentService.accessible()`.

Before this change, those shared access helpers only checked tenant
membership and did not enforce the dataset's permission mode. As a
result, a non-owner who knew a private `dataset_id` could still reach
downstream document and chunk operations even though the dataset was
intended to be owner-only.

This change updates the central access checks so that:

- dataset owners always retain access
- joined tenant members only get access when the dataset permission is
`TEAM`
- private datasets with `permission = me` remain inaccessible to
non-owners
- document-level access follows the same dataset permission rules

The PR also adds regression coverage for private-vs-team dataset access
behavior.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Testing

- Added
`test/unit_test/api/db/services/test_dataset_access_permissions.py`
- Attempted to run: `python -m pytest
test\\unit_test\\api\\db\\services\\test_dataset_access_permissions.py
-q`
- Local execution in this workspace is currently blocked during test
collection because the environment is missing the `strenum` dependency

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: jony376 <jony376@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
Co-authored-by: d 🔹 <liusway405@gmail.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Magicbook1108 <newyorkupperbay@gmail.com>
Co-authored-by: chanx <1243304602@qq.com>
Co-authored-by: sxxtony <166789813+sxxtony@users.noreply.github.com>
Co-authored-by: sxxtony <sxxtony@users.noreply.github.com>
Co-authored-by: Baki Burak Öğün <63836730+bakiburakogun@users.noreply.github.com>
Co-authored-by: bakiburakogun <bakiburakogun@users.noreply.github.com>
Co-authored-by: Panda Dev <56657208+pandadev66@users.noreply.github.com>
Co-authored-by: Haruko386 <tryeverypossible@163.com>
Co-authored-by: D2758695161 <13510221939@163.com>
Co-authored-by: Hunter <hunter@yitong.ai>
Co-authored-by: Lynn <lynn_inf@hotmail.com>
Co-authored-by: buua436 <sz_buua@foxmail.com>
Co-authored-by: web-dev0521 <jasonpette1783@gmail.com>
Co-authored-by: Tim Wang <38489718+wanghualoong@users.noreply.github.com>
Co-authored-by: wanghualoong <wanghualoong@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: qinling0210 <88864212+qinling0210@users.noreply.github.com>
Co-authored-by: dale053 <star05223@outlook.com>
2026-05-09 13:30:14 +08:00
c428187350 Fix: validate kb_ids as UUIDs before SQL interpolation in use_sql (#14087)
### What problem does this PR solve?

The use_sql() function in dialog_service.py constructed SQL WHERE
clauses and Infinity table names by directly interpolating kb_id values
using Python f-strings, with no validation of the input values. A
malformed or maliciously crafted kb_id (introduced via a compromised
admin account or a separate injection vector) could alter the structure
of the generated SQL query, potentially leading to unauthorized data
access or data manipulation.

This PR adds strict UUID format validation for all kb_id values before
they are interpolated into any SQL string, causing requests with invalid
IDs to fail fast with a ValueError rather than executing a tampered
query.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-05-09 10:52:06 +08:00
c44dc85143 Fix: IMAGE2TEXT→CHAT fallback with model_type normalization in tenant_model_service (#14704)
## Summary

- When a model is registered as `chat` in `tenant_llm` but has the
`IMAGE2TEXT` tag in `llm_factories.json`, requesting it as `image2text`
(e.g. PDF parser) fails with `Tenant Model with name <model> and type
image2text not found`.
- After resolution via the new fallback, the returned
`config_dict["model_type"]` was still `"chat"`, causing
`tenant_llm_service.model_instance()` to instantiate `ChatModel` instead
of `CvModel` — breaking `describe_with_prompt` at ingestion time.

## What problem does this PR solve?

RAGFlow already has a `CHAT→IMAGE2TEXT` fallback: when a chat model is
not found, it retries with `image2text`. The symmetric fallback
(`IMAGE2TEXT→CHAT`) was missing.

This matters for multimodal models declared as `model_type: "chat"` with
an `IMAGE2TEXT` tag in `llm_factories.json` (e.g. models added after
tenant creation, or providers where a single model serves both
purposes). The frontend PDF parser selector correctly surfaces these
models via the `IMAGE2TEXT` tag, but the backend fails to resolve them
at runtime.

## Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

## Changes

**`api/db/joint_services/tenant_model_service.py`**

1. Add `IMAGE2TEXT→CHAT` fallback in
`get_model_config_by_type_and_name`: when an `image2text` model is not
found in `tenant_llm`, retry with `chat` — but only if the `llm` table
confirms `IMAGE2TEXT` capability via the `tags` field. This mirrors the
philosophy of the existing `CHAT→IMAGE2TEXT` fallback: substitution is
only allowed when the model has declared the required capability.

2. Normalize `config_dict["model_type"]` to `image2text` after the
fallback, so the caller (`model_instance`) correctly routes to `CvModel`
instead of `ChatModel`.

3. Extend the type validation guard to allow `(requested=image2text,
found=chat)` alongside the existing `(requested=chat, found=image2text)`
exception.

## Test plan

- [ ] Add a model with `model_type=chat` and `tags` containing
`IMAGE2TEXT` to a tenant
- [ ] Select it as PDF parser in a knowledge base
- [ ] Verify ingestion succeeds without `image2text not found` or
`describe_with_prompt` errors
- [ ] Verify the same model still works correctly in chat context

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 10:40:58 +08:00
653b00b94c fix(sync): scope document IDs per connector to prevent cross-KB collisions (#14378)
Fixes #14360

## Problem

When the same blob storage bucket is connected to multiple knowledge
bases (each through a different data source connector), the sync
pipeline hashes only the blob path
(`bucket_type:bucket_name:object_key`) to derive the document ID. Every
connector pointing at the same bucket therefore produces **identical
IDs** for the same object. The collision guard in
`FileService.upload_document` then fires for the second knowledge base:

```
Existing document id collision with another knowledge base; skipping update.
```

This makes it impossible to index the same bucket into more than one KB
simultaneously.

## Solution

Include `connector_id` in the hash input so that each connector produces
a distinct document ID even when the underlying blob path is identical:

```python
# Before
"id": hash128(doc.id),

# After
"id": hash128(f"{task['connector_id']}:{doc.id}"),
```

Because each KB connection uses its own connector (with a unique
`connector_id`), documents are now namespaced per connector and no
collision occurs.

**Note:** This is a breaking change for existing synced data sources.
After upgrading, a re-sync will create new documents with the updated ID
format. Old documents (indexed under the previous format) will remain in
the database but can be manually deleted or cleaned up via a re-sync
with reindex enabled.

## Testing

- Verified that the one-line change produces unique IDs for two
connectors pointing at the same S3 path.
- Existing unit test
`test_upload_document_skips_cross_kb_document_id_collision` continues to
pass — the collision guard in `FileService` is still valid for genuinely
colliding IDs from other sources.

---------

Co-authored-by: octo-patch <octo-patch@github.com>
2026-05-09 10:33:54 +08:00
26d70189b6 fix: enforce tenant-scoped authorization for chatbot SDK endpoints (#14592)
Closes #14590

## Self Checks

- [x] I have searched for existing issues [search for existing
issues](https://github.com/infiniflow/ragflow/issues), including closed
ones.
- [x] I confirm that I am using English to submit this report ([Language
Policy](https://github.com/infiniflow/ragflow/issues/5910)).
- [x] Non-english title submitions will be closed directly (
非英文标题的提交将会被直接关闭 ) ([Language
Policy](https://github.com/infiniflow/ragflow/issues/5910)).
- [x] Please do not modify this template :) and fill in all the required
fields.

## RAGFlow workspace code commit ID

`a1b2c3d4e5f67890123456789abcdef12345678`

## RAGFlow image version

`0.13.1`

## Other environment information

- Hardware parameters: N/A
- OS type: Linux 6.17.0-22-generic
- Others: API key authentication via `Authorization: Bearer <token>`

## Actual behavior

The chatbot API endpoints:

- `POST /chatbots/<dialog_id>/completions`
- `GET /chatbots/<dialog_id>/info`

validate only that the bearer token exists in `APIToken`, but do not
verify that `dialog_id` belongs to the same tenant as that token.

Current flow (simplified):

1. Route extracts bearer token and checks `APIToken.query(beta=token)`.
2. If token exists, request is accepted.
3. Downstream service resolves dialog globally by ID
(`DialogService.get_by_id(dialog_id)` in `conversation_service.py`).
4. No tenant ownership check is enforced for `dialog_id`.

Impact: Any user with a valid API key can attempt arbitrary `dialog_id`
values and access/invoke chatbots outside their own tenant boundary if
IDs are known/guessed/leaked.

Security classification:

- Vulnerability class: Broken Access Control (IDOR, OWASP Top 10 A01)
- Severity recommendation: Critical
- Exploit prerequisite: any valid API key + discoverable target
`dialog_id`

## Expected behavior

Requests to `/chatbots/<dialog_id>/completions` and
`/chatbots/<dialog_id>/info` must be authorized only when:

1. bearer token is valid, and
2. `dialog_id` belongs to the same `tenant_id` as the token.

Otherwise, reject with authorization failure (e.g., 403 or
404-equivalent policy).

## Steps to reproduce

1. Prepare two tenants:
   - Tenant A with API key `TOKEN_A`
   - Tenant B with chatbot `dialog_id = DIALOG_B`
2. Send request from Tenant A to Tenant B chatbot completion endpoint:

```bash
curl -X POST "https://<host>/chatbots/DIALOG_B/completions" \
  -H "Authorization: Bearer TOKEN_A" \
  -H "Content-Type: application/json" \
  -d '{"question":"hello","stream":false}'
```

3. Observe request is processed (or reaches dialog resolution) without
tenant ownership rejection.
4. Repeat against info endpoint:

```bash
curl -X GET "https://<host>/chatbots/DIALOG_B/info" \
  -H "Authorization: Bearer TOKEN_A"
```

5. Observe the same missing ownership enforcement.

## Additional information

Affected code paths:

- `api/apps/sdk/session.py`
  - `chatbot_completions(dialog_id)`
  - `chatbots_inputs(dialog_id)`
- `api/db/services/conversation_service.py`
  - `async_iframe_completion(...)` uses global dialog lookup

Suggested fix:

1. In both chatbot endpoints:
   - Resolve `tenant_id = objs[0].tenant_id` from validated token.
- Fetch dialog with tenant-scoped query
(`DialogService.query(id=dialog_id, tenant_id=tenant_id)`).
   - Reject if dialog is not found/owned by tenant.
2. Defense in depth:
- Require and enforce `tenant_id` in service-layer dialog resolution for
external flows.
- Avoid global `get_by_id(dialog_id)` where user-controlled dialog IDs
are reachable.
3. Add regression tests:
   - Positive: same-tenant token + dialog succeeds.
   - Negative: cross-tenant token + dialog fails for both endpoints.
2026-05-08 18:00:18 +08:00
1bcb6deb6f Fix: collapsible thinking display and separate deep research retrieval tag (#14613)
## Summary

- **Collapsible thinking**: Replace `<section>` with `<details>` for
`<think>` content, so model thinking output is collapsed by default
(click to expand). Works for all models that output `<think>` tags
(Qwen3, DeepSeek, Gemini, Claude, etc.).
- **Fix double thinking tags**: When reasoning/deep research mode is
enabled in knowledge base chat, both the retrieval progress and model
thinking were wrapped in `<think>` tags, producing two "Thinking..."
blocks. Now retrieval progress uses a dedicated `<retrieving>` tag
rendered as a separate "Retrieving..." collapsible with a distinct green
accent.

### Before
- Thinking content displayed as flat gray-bordered `<section>`,
occupying significant screen space
- Deep research + model thinking both use `<think>` → two identical
"Thinking..." blocks

### After
- Thinking content collapsed by default in a `<details>` element, click
"Thinking..." to expand
- Deep research shows "Retrieving..." (green border), model thinking
shows "Thinking..." (gray border)

## Changes

**Backend (`api/db/services/dialog_service.py`)**
- Deep research callback: replace `start_to_think`/`end_to_think` marker
flags with direct `<retrieving>`/`</retrieving>` answer text

**Frontend**
- `web/src/utils/chat.ts`: `replaceThinkToSection()` now uses
`<details>` instead of `<section>`; add new
`replaceRetrievingToSection()`
- 4 tsx files: import and pipe `replaceRetrievingToSection`, whitelist
`details`, `summary`, `retrieving` in DOMPurify `ADD_TAGS`
- 4 less files: `section.think` → `details.think` with `<summary>`
styles; add `details.retrieving` with green accent; dark mode and RTL
variants

## Test plan
- [ ] Open a chat WITHOUT knowledge base, ask a question to a model with
thinking (e.g. Qwen3) → thinking content should be collapsed by default,
click "Thinking..." to expand
- [ ] Open a chat WITH knowledge base and reasoning enabled, ask a
question → "Retrieving..." (green) shows retrieval progress,
"Thinking..." (gray) shows model thinking, each independently
collapsible
- [ ] Verify dark mode renders correctly for both collapsible blocks
- [ ] Verify RTL layout renders correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: wanghualoong <wanghualoong@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-08 14:40:00 +08:00
412fae7ac2 Fix: display error (#14654)
### What problem does this PR solve?

Use right key in error text.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-08 13:11:59 +08:00
59c35100c5 Perf: push metadata filters down to Elasticsearch (#14576)
### What problem does this PR solve?

Fixes #14412.

`common.metadata_utils.meta_filter` evaluates user-defined metadata
conditions in Python after `DocMetadataService.get_flatted_meta_by_kbs`
loads the entire `meta_fields` table into memory. Past a few thousand
documents per knowledge base this becomes a memory bottleneck and a
wasted ES round-trip — every filter request currently fetches up to
10000 metadata rows even when the resulting `doc_ids` list is tiny.

This PR adds an ES push-down path that translates the same filter
language into a `bool` query and returns just the matching document IDs.

**Changes**

- `common/metadata_es_filter.py` *(new)*: pure-Python translator from
the RAGflow filter list to ES DSL. Covers every operator the in-memory
path supports (`=`, `≠`, `>`, `<`, `≥`, `≤`, `in`, `not in`, `contains`,
`not contains`, `start with`, `end with`, `empty`, `not empty`) with
`case_insensitive: true` on `prefix` and `wildcard` for parity with the
existing lower-cased Python comparisons. User wildcard metacharacters
are escaped before being injected into `wildcard` patterns. Negative
operators (`≠`, `not in`, `not contains`, ranges) are wrapped with an
`exists` guard so they do not accidentally match documents missing the
key, matching the legacy `if k not in metas` behaviour.
- `api/db/services/doc_metadata_service.py`: new
`DocMetadataService.filter_doc_ids_by_meta_pushdown(kb_ids, filters,
logic)` that returns the doc IDs ES matched, or `None` to signal the
caller should fall back to the in-memory path. Returns `None` when the
active doc store is Infinity (`meta_fields` is a JSON column, not a
dotted-object mapping), when any filter cannot be expressed in DSL
(`UnsupportedMetaFilter`), or when the ES request or metadata index
lookup errors.
- `common/metadata_utils.py`: `apply_meta_data_filter` accepts an
optional `kb_ids` argument. When supplied, conditions go through
push-down first via a new `_try_meta_pushdown` helper; on `None` the
function falls back to the original `meta_filter` call. Default
behaviour is unchanged for callers that don't pass `kb_ids`.
- Updated all four callers (`agent/tools/retrieval.py`,
`api/db/services/dialog_service.py` ×2,
`api/apps/services/dataset_api_service.py`, `api/apps/sdk/session.py`)
to forward `kb_ids` so the push-down path is exercised in production.
- `test/unit_test/common/test_metadata_es_filter.py` *(new)*: 35 unit
tests covering every operator's DSL shape, value coercion
(`ast.literal_eval`, lowercasing, ISO-date pass-through), wildcard
escaping, OR-logic wrapping that protects negative clauses, and the
doc-ID extractor.

**Behaviour preserved**

- The in-memory `meta_filter` is untouched and still services every
fallback case (Infinity backend, unknown operators, ES outages).
- The eligibility / credibility / issue-multiplier semantics described
in the LLM-driven `auto` and `semi_auto` modes still hand the LLM the
full in-memory `metas` dict to choose conditions from. Only the
*evaluation* of those generated conditions is pushed down.
- Existing tests in
`test/unit_test/common/test_metadata_filter_operators.py` continue to
pass (14/14).

**Test plan**

- `pytest test/unit_test/common/test_metadata_es_filter.py` — 35 passed.
- `pytest test/unit_test/common/test_metadata_filter_operators.py` — 14
passed.
- `ruff check` clean on every modified file.
- Reviewer please validate the ES query shapes against a live cluster —
particularly `case_insensitive` on `wildcard` and `prefix` (requires ES
7.10+) and the `exists` + `must_not` pairing for `≠`.

**Notes**

- The first cut caps each push-down request at 10000 results, matching
the existing `get_flatted_meta_by_kbs` limit, and logs a warning when
the cap is hit. A `search_after` follow-up would let us drop the cap
entirely once the push-down path is validated.
- Operator parity with the in-memory path is exact for the canonical
unicode operators (`≥`, `≤`, `≠`) used internally; the ASCII aliases
(`>=`, `<=`, `!=`) are normalised by `convert_conditions` before they
reach the translator.

### Type of change

- [x] Performance Improvement

---------

Co-authored-by: sxxtony <sxxtony@users.noreply.github.com>
2026-05-07 21:23:43 +08:00
0501134820 Fix: support tool call config (#14616)
### What problem does this PR solve?
support tool call config

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-07 15:54:57 +08:00
c50028b1f3 Fix team member cannot edit agent (#14612)
### What problem does this PR solve?

Follow on PR: https://github.com/infiniflow/ragflow/pull/14602
to fix: team member cannot edit agent.
new behavior: beside delete, everything is allowed for team member.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-07 15:09:13 +08:00
1d114f034b Allow more task logs for #14617 (#14624)
### What problem does this PR solve?

Allow more task logs for #14617

### Type of change

- [x] Refactoring
2026-05-07 15:03:08 +08:00
1d0519d025 Fix secret key inconsistency cross the RAGFlow servers (#14591)
### What problem does this PR solve?

A and B, two API servers and a REDIS server.
If A and REDIS restart, B will hold the obsolete secret key and will
lead to error.

TODO:
app.config['SECRET_KEY'] and app.secret_key still hold obsolete secret
key.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 10:10:02 +08:00
5e01feb755 fix(connector_service): add TIMEZONE setting and correct interval log… (#14446)
### What problem does this PR solve?



### Type of change

- [v] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: wiratama <dafa.wiratama@bankraya.co.id>
2026-05-06 14:40:35 +08:00
94f8779a00 Memory API: enforce tenant permissions on memory and message endpoints (#14535)
### What problem does this PR solve?

This PR fixes missing authorization checks in the Memory API.
Previously, several authenticated endpoints accepted caller-supplied
`tenant_id`, `owner_ids`, or `memory_id` values and used them directly
to list, read, update, delete, or search Memory data.

That could allow an authenticated user to access or mutate another
tenant's Memory records if they knew a tenant ID or memory ID. The fix
centralizes Memory access checks and applies them consistently across
Memory and Memory-message operations.

The change:

- Adds helper logic to parse list filters and compute tenant IDs
accessible to `current_user`.
- Requires direct `memory_id` operations to pass Memory access checks
before reading, updating, deleting, or changing message state.
- Filters list/search/recent-message requests to accessible memories
only.
- Applies Memory visibility filtering before count and pagination in
`MemoryService.get_by_filter`.
- Accepts `owner_ids` in the Memory list route, matching the frontend
owner filter while still intersecting values with the caller's
accessible tenants.
- 

### Related issues
Closes #14534 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-06 14:10:47 +08:00
24af0875e5 Feat/configurable metadata display (#13464)
### What problem does this PR solve?

Currently, RAGFlow's Search and Chat interfaces display only raw
vectorized text chunks during retrieval, without contextual information
about their source documents. Users cannot see document titles, page
numbers, upload dates, or custom metadata fields that would help them
understand and trust the retrieved results.

This PR introduces an **optional metadata display feature** that
enriches retrieved chunks with document-level metadata in both the
Search tab and Chatbot interface.

**Key improvements:**
- **Search results**: Display document metadata as styled badges beneath
chunk snippets
- **Chat citations**: Show metadata in citation popovers and reference
lists for better source context
- **LLM context**: Metadata is injected into the LLM prompt to enable
more accurate, citation-aware responses
- **External API support**: Applications using RAGFlow's SDK retrieval
endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in
via request parameters
- **User control**: Multi-select dropdown UI allows users to choose
which metadata fields to display

**Implementation approach:**
-  Reuses existing `DocMetadataService` infrastructure (no new database
tables or indices)
-  Settings stored in existing JSON configuration fields
(`search_config.reference_metadata`, `prompt_config.reference_metadata`)
-  No database migrations required
-  Disabled by default (fully opt-in and backward-compatible)
-  Dynamic metadata field selection populated from actual document
metadata keys
-  Fixed critical bug where Python's builtin `set()` was shadowed by a
route handler function

**Modified endpoints (all backward-compatible):**
- `POST /v1/retrieval` (Public SDK)
- `POST /v1/searchbots/retrieval_test` (Searchbots)
- `POST /v1/chunk/retrieval_test` (UI/Internal)
- Chat completions endpoints (via `extra_body.reference_metadata` or
`prompt_config`)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)


###Images
-
<img width="879" height="1275" alt="image"
src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb"
/>
<br><br>
<br><br>

<img width="1532" height="362" alt="image"
src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e"
/>
<br><br>
<br><br>

<img width="2586" height="1320" alt="image"
src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776"
/>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
2026-04-30 23:13:27 +08:00
4ee0702aed Feat: add skills space to context engine (#13908)
### What problem does this PR solve?

issue #13714

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 12:36:03 +08:00
6dd38eca6a fix: file logs not displayed in dataset ingestion page (#14479)
### What problem does this PR solve?

## Summary

Fixed a bug where the **File Logs** tab in the dataset ingestion page
always showed "No logs" even after files were parsed successfully.

## Root Cause

Both the **File Logs** and **Dataset Logs** tabs on the frontend called
the same backend endpoint `/datasets/{dataset_id}/ingestions`. However,
the backend only queried `get_dataset_logs_by_kb_id`, which
hard-filtered records by `document_id == GRAPH_RAPTOR_FAKE_DOC_ID`
(dataset-level logs). As a result, real file-level logs were never
returned, causing the table to appear empty.

## Changes

### Backend

- **`api/apps/restful_apis/dataset_api.py`**
  - Added two new query parameters to `list_ingestion_logs`:
    - `log_type` — `"file"` or `"dataset"` (default: `"dataset"`)
    - `keywords` — search keyword for filtering by document / task name

- **`api/apps/services/dataset_api_service.py`**
- Updated `list_ingestion_logs` signature to accept `log_type` and
`keywords`.
  - Added conditional routing:
- When `log_type == "file"`, call
`PipelineOperationLogService.get_file_logs_by_kb_id`
- Otherwise, call
`PipelineOperationLogService.get_dataset_logs_by_kb_id`

- **`api/db/services/pipeline_operation_log_service.py`**
- Extended `get_dataset_logs_by_kb_id` with an optional `keywords`
parameter so dataset logs can also be searched.

### Frontend

- **`web/src/pages/dataset/dataset-overview/hook.ts`**
- Removed the separate API function switching (`listPipelineDatasetLogs`
vs `listDataPipelineLogDocument`).
- Unified both tabs to call `listDataPipelineLogDocument` with the new
`log_type` query parameter (`"file"` or `"dataset"`).
  - Ensured `keywords` and filter values are passed through correctly.

## Behavior After Fix

| Tab | `log_type` | Returned Records | Searchable Field |
|---|---|---|---|
| File Logs | `file` | Real document-level logs | `document_name` (file
name) |
| Dataset Logs | `dataset` | GraphRAG / RAPTOR / MindMap logs |
`document_name` (task type) |
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>
2026-04-29 22:10:24 +08:00
18fbfafca6 Feat: enable sync deleted files for more connectors (#14353)
### What problem does this PR solve?

Feat: enable sync delted files for connectors

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 15:07:14 +08:00
343bda1119 Refactor: deco document upload_and_parse API (#14366)
### What problem does this PR solve?

remove unused "POST /v1/document/upload_and_parse"

### Type of change

- [x] Refactoring
2026-04-27 20:35:00 +08:00
2846a93998 Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports

Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000                        # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`,
`docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`,
`docx_parser.py` | `299`, `600`, `10**9`, `100000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`,
`manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`,
`email.py`, `table.py` | `100000`, `10000`, `10000000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`,
`document_service.py`, `file_service.py` | `100000000` |
`MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 14:57:20 +08:00
4dcc42e0e1 feat(api): add unified index API and dataset management endpoints (#14222)
### What problem does this PR solve?

## Summary

Refactor the dataset API layer into a clean service/REST separation
pattern, add a unified `/index` API for graph/raptor/mindmap operations,
and introduce several new dataset management endpoints with full test
coverage.

## Changes

### Service Layer (`dataset_api_service.py`)

- Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace
function for all index types
- Added `run_index`, `delete_index` service functions
- Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`,
`get_ingestion_log`
- Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`,
`rename_tag`
- Added `get_flattened_metadata`, `get_auto_metadata`,
`update_auto_metadata`

### REST API Layer (`dataset_api.py`)

**New unified routes:**

| Method | Route | Description |
|--------|-------|-------------|
| POST | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Run index
task |
| GET | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Trace index
task |
| DELETE | `/datasets/<id>/<index_type>` | Delete index |
| GET | `/datasets/<id>` | Get dataset details |
| GET | `/datasets/<id>/ingestions/summary` | Ingestion summary |
| GET | `/datasets/<id>/ingestions` | List ingestion logs |
| GET | `/datasets/<id>/ingestions/<log_id>` | Get single ingestion log
|
| POST | `/datasets/<id>/embedding` | Run embedding |
| GET | `/datasets/<id>/tags` | List tags |
| GET | `/datasets/tags/aggregation` | Aggregate tags across datasets |
| DELETE | `/datasets/<id>/tags` | Delete tags |
| PUT | `/datasets/<id>/tags` | Rename tag |
| GET | `/datasets/metadata/flattened` | Get flattened metadata |
| GET/PUT | `/datasets/<id>/metadata/config` | New metadata config path
|

**Removed routes (replaced by unified `/index`):**

- `POST /datasets/<id>/mindmap`
- `GET /datasets/<id>/mindmap`

**Preserved legacy routes (backward compatibility):**

- `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor`
- `/auto_metadata` GET/PUT

### Test Suite

- Updated `common.py` helpers: added `trace_index`, removed
`run_mindmap`/`trace_mindmap`
- Added 7 new test files with 39 test cases total:

| Test File | Cases |
|-----------|-------|
| `test_get_dataset.py` | 4 |
| `test_ingestion_summary.py` | 2 |
| `test_ingestion_logs.py` | 5 |
| `test_index_api.py` | 14 |
| `test_embedding.py` | 2 |
| `test_tags.py` | 8 |
| `test_flattened_metadata.py` | 4 |

- Deleted `test_mindmap_tasks.py` (covered by unified index tests)

## Design Decisions

1. **Unified `/index?type=...`** — single endpoint replaces 3 separate
route pairs for graph/raptor/mindmap
2. **Backward compatibility** — old routes (`/run_graphrag`,
`/run_raptor`, `/auto_metadata`) preserved alongside new paths
3. **`_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}`** — input
validation via constant set
4. **`_INDEX_TYPE_TO_TASK_ID_FIELD`** — maps index type to KB model task
ID field for clean dispatch

## Files Changed

- `api/apps/restful_apis/dataset_api.py`
- `api/apps/services/dataset_api_service.py`
- `sdk/python/ragflow_sdk/modules/dataset.py`
- `test/testcases/test_http_api/common.py`
- `test/testcases/test_http_api/test_dataset_management/` (7 new files)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 09:38:01 +08:00
fb95136f39 Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090)
### What problem does this PR solve?

The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.

This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-25 14:30:15 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install
`opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g.
`==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g.
`docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` /
`external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp
dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate
files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection
patterns from output |

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling            42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00
beb2406b86 Fix: allow use image2text as chat model (#14331)
### What problem does this PR solve?

Allow image2text models (multimodal) to be used as chat models.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 17:58:25 +08:00
c41b5e8a5d fix: migrate Langfuse integration from start_generation to start_obse… (#14205)
The Langfuse Python SDK v3+ removed `start_generation()` method.
RagFlow's code called this non-existent method, causing AttributeError
when Langfuse tracing is enabled.

Replace all `start_generation()` calls with
`start_observation(as_type="generation")` which is the correct v4 SDK
API.

Affected files:
- api/db/services/llm_service.py (12 occurrences)
- api/db/services/dialog_service.py (1 occurrence)

Fixes #14204
Related to #9243

### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 10:03:57 +08:00