Commit Graph

120 Commits

Author SHA1 Message Date
b2b63600f1 Adds gpt-5.4-mini and gpt-5.4-nano (#14908)
### What problem does this PR solve?
Includes gpt-5.4-mini and gpt-5.4-nano to the OpenAI model list

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-14 10:16:24 +08:00
dd76653dc1 feat: add tag management for Agents with filtering and sorting (#14774) (#14799)
## Summary
Closes #14774.

Adds free-form tags on agents (UserCanvas) with full UI + API:
- Stored as comma-separated `tags` column on `UserCanvas` with online
migration.
- New endpoints: `GET /v1/agents/tags` (aggregate counts) and `PUT
/v1/agent/<id>/tags` (write). `GET /v1/agents` accepts a `tags=` query.
- "Edit tags" item in agent dropdown opens a chip-style editor dialog;
tags render as badges on each agent card.
- New "Tags" facet in the agents filter bar, with counts.

## Implementation notes
- **Tag matching is exact-token**: the SQL filter wraps stored tags as
`,…,` and matches `,ml,` so `ml` doesn't match `ml-ops`.
- **Server-side normalization** in `UserCanvasService.update_tags`:
dedup (case-insensitive), per-tag cap of 64 chars, total length capped
at 512 chars to fit the column, commas inside tag values are replaced
with spaces.
- **Tenant authorization**: `PUT /v1/agent/<id>/tags` gates on
`UserCanvasService.accessible(canvas_id, tenant_id)`.
- **Tag listing scope**: `UserCanvasService.list_tags` follows the same
own + team-shared rule as `get_by_tenant_ids`.
- **i18n**: keys added to `en.ts` and `zh.ts` only (per project
convention; other locales fall back).
- **`HomeCard`** gets a non-breaking `extra?: ReactNode` slot for the
chip row; no `src/components/ui/` files modified.

## Test plan
- [ ] Backend boot runs `migrate_db` → confirm `user_canvas.tags` column
exists (`DESCRIBE user_canvas`).
- [ ] Agents page renders cards normally (no console error from missing
field).
- [ ] `⋯ → Edit tags` opens a dialog that stays open (regression: dialog
was unmounting with the dropdown).
- [ ] Typing a tag without pressing Enter and clicking Save persists it
(regression: last typed tag was being dropped).
- [ ] Chip input supports Enter/comma to commit, Backspace on empty to
remove, `×` to remove individual chip.
- [ ] Tag containing a comma sent via API is stored with the comma
replaced by a space.
- [ ] 20 long tags sent via API does not error (length cap silently
truncates).
- [ ] "Tags" filter in the filter bar shows counts and narrows the list.
- [ ] Filtering by `ml` does **not** return agents tagged `ml-ops`.
- [ ] UI in Chinese shows 编辑标签 / 添加标签以整理和筛选你的智能体 etc.
- [ ] `PUT /v1/agent/<other-tenant-id>/tags` returns `Agent not found or
no permission.`
2026-05-13 21:41:32 +08:00
5a5e766386 fix(api): authorize owner_ids for list chats and search apps (#14775)
Closes #14768
### What problem does this PR solve?
The `list_chats` and `list_searches` REST API endpoints did not enforce
authorization on the `owner_ids` query parameter. Any authenticated user
could pass arbitrary tenant IDs to `owner_ids` and retrieve chats or
search apps belonging to other tenants they are not a member of.

This PR resolves the issue by:
1. Looking up the current user's authorized tenants via
`TenantService.get_joined_tenants_by_user_id` and rejecting any
`owner_ids` that fall outside that set.
2. When no `owner_ids` are provided, scoping the query to only the
user's authorized tenants instead of returning an unfiltered result.
3. Adding unit tests that verify unauthorized `owner_ids` are rejected
with `OPERATING_ERROR`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-13 09:43:44 +08:00
127aeac4aa fix: expose gpt-5.5 and gpt-5.4 in OpenAI model list (#14828)
### What problem does this PR solve?

OpenAI model catalogs used in provider selection flows were missing the
latest GPT models (`gpt-5.5` and `gpt-5.4`).
Because model availability is driven by seeded catalog data
(`conf/llm_factories.json` → DB seed → API response), these models were
not selectable in the UI or `/llm/list` responses.

This PR updates and synchronizes the OpenAI catalog definitions across
configuration sources and ensures the new models are correctly exposed
through the API layer and validated in tests.

---

### Type of change

* [x] New Feature (non-breaking change which adds functionality)

---

### Changes Made

* Added `gpt-5.5` and `gpt-5.4` to OpenAI catalog definitions in:

  * `conf/llm_factories.json`
  * `conf/models/openai.json` (chat + vision support)
* Ensured consistency between DB-seeded factory config and provider
model configuration
* Updated test coverage in:

  * `test_llm_list_unit.py`

    * seeded OpenAI catalog entries
* added response-level assertion validating `/llm/list` includes both
new model IDs under OpenAI grouping

---

### Root Cause

OpenAI model listings in selection flows are generated from catalog data
seeded via `conf/llm_factories.json`.
The catalog had not been updated to include the latest GPT models,
resulting in missing availability in UI and API responses.

---

### Testing

* Created isolated test environment:

  * `python -m venv .venv-review`
  * installed `pytest`
* Ran targeted and full test suite:

  * `test_list_app_grouping_availability_and_merge`:  passed
  * Full `test_llm_list_unit.py`:  10 passed

---

### Risks / Limitations

* Adding models to the catalog does not guarantee upstream provider
availability or account entitlement.
* Environments with pre-seeded DB catalogs may require reseed or refresh
to reflect updated configuration.

---

### Notes

* Changes are minimal and scoped strictly to catalog configuration and
related test coverage.
* Ensures `/llm/list` API remains aligned with expected latest OpenAI
model availability.
* Closes #14827
2026-05-12 18:03:47 +08:00
02c2587ca4 fix(agent): support iteration item aliases in child nodes (#14146)
## Summary
This PR fixes the iteration variable mismatch reported in #14142.

Changes:
- restore compatibility for `IterationItem@result` by exposing `result`
alongside `item`
- support bare iteration aliases like `{item}`, `{index}`, and
`{result}` inside iteration child-node inputs
- add focused unit/runtime tests covering both alias styles and
multi-item iteration execution

## Validation
```bash
pytest -q --noconftest \
  test/testcases/test_web_api/test_canvas_app/test_iterationitem_unit.py \
  test/testcases/test_web_api/test_canvas_app/test_iteration_runtime_unit.py \
  test/testcases/test_web_api/test_canvas_app/test_invoke_component_unit.py
```

Result: `12 passed`

Closes #14142
2026-05-12 13:05:21 +08:00
daf8a58c4b Fix: add codeexec attachments output (#14787)
### What problem does this PR solve?

add codeexec attachments output

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-11 19:16:33 +08:00
292b0b8bce chore: fix some comments to improve readability (#14756)
### What problem does this PR solve?

fix some comments to improve readability

### Type of change

- [x] Documentation Update

---------

Signed-off-by: box4wangjing <box4wangjing@outlook.com>
2026-05-11 16:48:48 +08:00
a03b95f8c4 Fix: shared dataset chunk index lookup (#14764)
### What problem does this PR solve?

shared dataset chunk index lookup

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-11 13:50:08 +08:00
7ec87f7cb7 fix(auth): fall back to session-based auth in _load_user (#14569)
## Summary

Closes #13663.

OAuth / OIDC callbacks call `login_user(user)` which writes `_user_id`
into the session cookie, but `_load_user()` in `api/apps/__init__.py`
only ever looked at the `Authorization` header. The SPA's response
interceptor wipes the Authorization value from `localStorage` on the
first 401 it sees — meaning that during the post-redirect window after
an OAuth login, a single transient 401 sends every subsequent request
back to the login page even though `login_user()` had already
established a perfectly good server-side session.

The reporter's analysis traces this all the way through the redirect →
`navigate('/')` → first request → empty header → 401 → `removeAll()` →
infinite-redirect-to-login chain.

## What changed

- New `_load_user_from_session()` helper that reads
`session["_user_id"]`, looks up the user in `UserService` (with the same
`StatusEnum.VALID` and `access_token` checks already used elsewhere),
and assigns `g.user`.
- Every `return None` path in `_load_user()` now routes through that
helper before giving up:
  - missing `Authorization` header
  - malformed `bearer ` prefix
  - empty / too-short JWT payload
  - JWT signature failure
  - JWT-resolved user not found / has no `access_token`
  - `APIToken.query()` fallback exhausted

The JWT and API-token paths still take precedence — the session is only
consulted when those can't authenticate the request. So existing
local-login and SDK callers see no behaviour change; only OAuth / OIDC
users that hit the original race now stay logged in.

The Bearer-prefix issue called out in #13663 (lines 103-110) is already
handled in the current code, so this PR only addresses the second half
of the report.

## Test plan

- [ ] Configure OIDC under `oauth` in `service_conf.yaml`
- [ ] Click the OIDC login button, complete auth at the IdP
- [ ] Confirm that navigating between pages no longer bounces back to
`/login`
- [ ] Confirm local email/password login still issues + accepts JWTs
- [ ] Confirm SDK/API key callers still authenticate via `Authorization:
Bearer <api-token>`

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-11 09:59:52 +08:00
f4b8f53b6d Fix: restore embedding model switching for datasets with existing chunks (#14732)
### What problem does this PR solve?

## Problem

During the REST API refactoring (#13690), the
`/api/v2/kb/check_embedding` endpoint was removed and never migrated to
the new RESTful structure. The frontend was pointed to the
`/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a
completely different function). Additionally, a hard guard was
introduced that rejects any `embd_id` change when `chunk_num > 0`,
making it impossible to switch embedding models on datasets with
existing chunks.

## Root Cause

1. **Missing endpoint**: The old `check_embedding` logic (sample random
chunks, re-embed with the new model, compare cosine similarity) was not
carried over to the new REST API service layer.
2. **Wrong frontend URL**: `checkEmbedding` in `api.ts` pointed to
`/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated
check endpoint.
3. **Overly restrictive guard**: `dataset_api_service.py` line 310
blocked all `embd_id` updates when `chunk_num > 0`. This check did not
exist in the pre-refactor code — it was incorrectly introduced during
the refactor.

## Changes

### Backend

- **`api/apps/services/dataset_api_service.py`**
  - Remove the `chunk_num > 0` hard guard on `embd_id` updates
- Add `check_embedding()` service function: samples random chunks,
re-embeds them with the candidate model, computes cosine similarity,
returns compatibility result (avg ≥ 0.9 = compatible)
  - Add `import re` for the `_clean()` helper

- **`api/apps/restful_apis/dataset_api.py`**
- Add `POST /datasets/<dataset_id>/embedding/check` endpoint following
the new REST API conventions
  - Clean up unused top-level imports (`random`, `re`, `numpy`)

### Frontend

- **`web/src/utils/api.ts`**
- Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` →
`/datasets/${datasetId}/embedding/check`

### Tests

-
**`test/testcases/test_http_api/test_dataset_management/test_update_dataset.py`**
- Update `test_embedding_model_with_existing_chunks` to assert success
(`code == 0`) instead of expecting the old `102` error

-
**`test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`**
- Update `test_update_route_branch_matrix_unit` to assert
`RetCode.SUCCESS` when updating `embd_id` on a chunked dataset,
replacing the old `chunk_num` error assertion

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-05-09 18:48:57 +08:00
c11650bb4c Fix IDOR: Add permission checks to file ancestry endpoints (#14725)
Close #14292

## Issue

File ancestry endpoints return folder metadata without validating tenant
permissions, allowing any authenticated user to query arbitrary
`file_id` values across tenant boundaries.

## Affected Endpoints
- `GET /v1/file/parent_folder?file_id={file_id}`
- `GET /v1/file/all_parent_folder?file_id={file_id}`  
- `GET /api/v1/files/{id}/ancestors`

## Root Cause

These endpoints **skip the permission check** that other file operations
(Delete, Download, Move) perform.

## Expected Permission Check

All file operations should follow this 3-step validation:

- Check file.tenant_id
- Check if user_id belongs to this tenant (via user_tenant join table)
- Check KB permission type (team permission)


**Code reference:** This is implemented in `checkFileTeamPermission()`
and used by Delete/Download/Move, but **missing** from
GetParentFolder/GetAllParentFolders.

## Reproduction

```bash
# User B (tenant: BBB) accessing User A's file (tenant: AAA)
curl -H "Authorization: Bearer USER_B_TOKEN" \
  "http://localhost:9384/v1/file/parent_folder?file_id=AAA_FILE_123"

# Result: Returns User A's folder metadata 
# Expected: "No authorization." 
Fix
Pass userID from handler to service and call checkFileTeamPermission() — same as Download/Delete/Move handlers.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 16:03:23 +08:00
d51fb88573 Fix: enforce tenant authorization on document download endpoint (#14618) (#14625)
### What problem does this PR solve?

Closes #14618.

The `GET /v1/document/get/<doc_id>` endpoint in
`api/apps/document_app.py` was protected only by `@login_required` and
called `DocumentService.get_by_id(doc_id)` without verifying that the
document's knowledge base belonged to the requesting user's tenant. Any
authenticated user who knew (or guessed) a document ID could download
files belonging to any other tenant — a cross-tenant IDOR.

This PR adds a `DocumentService.accessible(doc_id, current_user.id)`
check before serving the file. The helper already exists and joins
`Document` → `Knowledgebase` → `UserTenant` to verify the requesting
user belongs to the tenant that owns the document's KB. The same pattern
is already used by `api/apps/restful_apis/document_api.py` and mirrors
the tenant scoping in the SDK route at `api/apps/sdk/doc.py`.

The check returns the existing `"Document not found!"` error for both
non-existent and inaccessible documents, so attackers cannot use the
response to enumerate valid doc IDs across tenants.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Other (please describe): Security fix (cross-tenant IDOR /
authorization bypass)
2026-05-08 14:24:03 +08:00
6547751936 Fix: missing authorization checks in /files/link-to-datasets (#14649)
### Related issues
Closes #14648

### What problem does this PR solve?

This PR fixes an authorization flaw in `POST /files/link-to-datasets`.

Before this change, the endpoint only checked whether the supplied
`file_ids` and `kb_ids` existed. It did not verify whether the
authenticated user was actually allowed to access those files or target
datasets. As a result, an authenticated user who knew valid IDs could
relink another user's files to arbitrary datasets.

This was especially risky because the relinking flow is state-changing:
the background worker removes existing file-document mappings and then
recreates documents under the attacker-supplied dataset IDs.

This change makes the route enforce the same permission model already
used by nearby file and document operations:

- each resolved file must pass `check_file_team_permission(...)`
- each target dataset must pass `check_kb_team_permission(...)`
- authorization is enforced before scheduling background relinking work

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Testing

- Added regression coverage in
`test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py`
- Covered:
  - unauthorized file access is rejected
  - unauthorized dataset access is rejected
- existing success path still returns immediately after scheduling
background work
- Attempted to run:
- `python -m pytest
test\\testcases\\test_web_api\\test_file_app\\test_file2document_routes_unit.py
-q`
- Local execution in this workspace is currently blocked by missing test
dependencies during bootstrap, including `ragflow_sdk`

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-08 13:49:23 +08:00
f703169117 Refa: migrate document preview/download to RESTful API (#14633)
### What problem does this PR solve?

migrate document preview/download to RESTful API

### Type of change
- [x] Refactoring
2026-05-08 13:26:13 +08:00
94324afee9 Go: fix auth issue in hybrid mode (#14611)
### What problem does this PR solve?

Since secret key get and set logic is updated, the go server also need
to update.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 17:14:22 +08:00
c50028b1f3 Fix team member cannot edit agent (#14612)
### What problem does this PR solve?

Follow on PR: https://github.com/infiniflow/ragflow/pull/14602
to fix: team member cannot edit agent.
new behavior: beside delete, everything is allowed for team member.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-07 15:09:13 +08:00
1d0519d025 Fix secret key inconsistency cross the RAGFlow servers (#14591)
### What problem does this PR solve?

A and B, two API servers and a REDIS server.
If A and REDIS restart, B will hold the obsolete secret key and will
lead to error.

TODO:
app.config['SECRET_KEY'] and app.secret_key still hold obsolete secret
key.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 10:10:02 +08:00
e8f19aa338 feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238)
This PR addresses three related GraphRAG reliability issues that
together allow long-running GraphRAG tasks (10+ hours of LLM extraction)
to be resumed after a crash or pause without re-doing completed work. It
builds on #14096 (per-doc subgraph cache) and extends the same idea to
the resolution and community-detection phases.

Fixes #14236.

## 1. Fix concurrent merge crash

Long GraphRAG runs would crash near the end of entity resolution with:
```
RuntimeError: dictionary keys changed during iteration
```
in `Extractor._merge_graph_nodes`. Two changes:

- `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)`
via `list(...)` before iterating, so concurrent `add_edge` /
`remove_node` mutations on the shared `nx.Graph` cannot invalidate the
iterator. Also tracks each redirected neighbour in `node0_neighbors` so
a later merged node sharing the same external neighbour takes the
edge-merge branch instead of overwriting via `add_edge`.
- `rag/graphrag/entity_resolution.py`: serialize the merge step with a
dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and
concurrent merges on overlapping neighbourhoods can produce incorrect
results even with the snapshot fix.

## 2. Don't wipe partial graph on pause

Previously the pause / cancel UI path called
`settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`,
destroying every subgraph, entity, relation, and graph row.
Re-triggering then started GraphRAG from scratch even though #14096 had
already added `load_subgraph_from_store`.

After main was merged in (which deleted `api/apps/kb_app.py` per
#14394), the pause path now lives on the new REST surface `DELETE
/v1/datasets/<id>/<index_type>`:

- `api/apps/services/dataset_api_service.py`: `delete_index` accepts a
`wipe: bool = True` parameter. When `False` the doc-store rows and
GraphRAG phase markers are left intact and only the running task is
cancelled. Default preserves historical behaviour.
- `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off`
from the query string and forwards it.
- `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`:
`unbindPipelineTask` appends `?wipe=false` when explicitly false.
- The GraphRAG pause action in
`web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe:
false` for `KnowledgeGraph`; raptor is unchanged.

**UX impact:** the pause icon next to a running GraphRAG task no longer
wipes graph data. The only path that still wipes is the explicit Delete
action in `GenerateLogButton` (trash icon behind a confirmation modal).

## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`)

A small Redis-backed marker layer at
`graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL).
`run_graphrag_for_kb` consults the markers on entry and skips phases
that already completed in a prior run. Markers are cleared automatically
when:
- new docs are merged into the graph (which invalidates prior resolution
and community results),
- `delete_index` wipes the graph, or
- `delete_knowledge_graph` is called.

Redis failures never block a run -- markers are an optimization, not a
gate.

## 4. Idempotent community detection

`extract_community` previously did `delete-then-insert` on
`community_report` rows; a crash mid-insert left the dataset with no
reports. Now report IDs are derived deterministically from `(kb_id,
community.title)`, the existing report IDs are snapshotted before
insert, new rows are written, then only stale rows are pruned. A failure
at any step leaves either the prior or the new report set intact --
never a partial mix.

## 5. Tunable doc-store insert pipeline

The GraphRAG insert loop in `rag/graphrag/utils.py` and the
`community_report` insert in `rag/graphrag/general/index.py` were both
hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real
KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure
round-trip overhead.

- New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py`
batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout
semantics as the prior loop.
- Defaults: 64 docs per batch, 4 batches in flight (matches the regular
ingest pipeline in `document_service.py`). Tunable per-deployment via
`GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`.
- Both `set_graph` and `extract_community` now use the helper.

This dropped the same 1077-chunk insert from minutes to seconds in local
testing without measurable extra pressure on Infinity (total in-flight
docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default).

## Tests

- `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests):
dense neighbourhood merge, neighbour-snapshot regression, concurrent
serialized merges.
- `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has
round-trip, kb-scoped clear, no-op on empty input, graceful Redis
failure.
-
`test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`:
new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both
GraphRAG and raptor on the new REST route, and confirms the default
still wipes and clears phase markers.

## Compatibility

- Backward compatible: tasks queued before this change behave
identically (default `wipe=true`, no markers expected).
- No schema/migration changes; all new state lives in Redis.
- New optional REST query param `wipe` on `DELETE
/v1/datasets/<id>/<index_type>`.
- New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and
`GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour.

## Example of resume

Screenshot below shows a test resuming knowledge graph generation after
applying the concurrency fix and re-deploying.

<img width="521" height="677" alt="image"
src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a"
/>

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-05-06 15:01:01 +08:00
05ee7f8bb6 Fix: remove delete_documents uuid validation (#14533)
### What problem does this PR solve?

remove delete_documents uuid validation

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 18:56:33 +08:00
06c6da5d94 Fix: add document delete permission check (#14472)
### What problem does this PR solve?

add document delete permission check

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:01:09 +08:00
47129fdd08 Fix: optimize file batch delete (#14473)
### What problem does this PR solve?

optimize file batch delete

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:00:39 +08:00
b684c89950 Add backward compat APIs (#14427)
### What problem does this PR solve?

Add backward compat APIs:

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 15:15:49 +08:00
35f6d81b73 Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402)
### What problem does this PR solve?

## Summary

Migrate two web API endpoints to REST-style HTTP API endpoints,
following the pattern established in #14222:

| Old Endpoint | New Endpoint |
|---|---|
| `POST /v1/chunk/retrieval_test` | `POST
/api/v1/datasets/<dataset_id>/search` |
| `GET /v1/chunk/knowledge_graph` | `GET
/api/v1/datasets/<dataset_id>/graph` |
2026-04-28 20:00:26 +08:00
85575259ac Fix: google authentication - gmail && google-drive (#14422)
### What problem does this PR solve?

Fix: google authentication - gmail && google-drive

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 18:09:02 +08:00
c81081f8ef Refactor: Doc change parser (#14327)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_parser
HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents

After consolidation, Restful API
PATCH /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-27 23:42:57 +08:00
c5116b90e5 Refactor: migrate document thumbnails API (#14344)
### What problem does this PR solve?

Before migration: GET /v1/document/thumbnails
After migration:  GET /api/v1/thumbnails

### Type of change

- [x] Refactoring
2026-04-27 21:29:09 +08:00
49912a156e Refactor: migrate document run api (#14351)
### What problem does this PR solve?

Before migration: POST /v1/document/run
After migration: POST /api/v1/documents/ingest/

### Type of change

- [x] Refactoring
2026-04-27 21:25:58 +08:00
343bda1119 Refactor: deco document upload_and_parse API (#14366)
### What problem does this PR solve?

remove unused "POST /v1/document/upload_and_parse"

### Type of change

- [x] Refactoring
2026-04-27 20:35:00 +08:00
a536980e22 Refactor: Doc batch change status (#14337)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_status

After consolidation, Restful API
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status 

### Type of change

- [x] Refactoring
2026-04-27 20:00:23 +08:00
82313020c7 Refa: align list operations and strict mode (#14387)
### What problem does this PR solve?

align list operations and strict mode

### Type of change
- [x] Refactoring
2026-04-27 19:13:00 +08:00
c1941fd503 Refactor: deco doc-parse API that is not used any more (#14367)
### What problem does this PR solve?

Delete un-used API "POST /v1/document/parse"

### Type of change

- [x] Refactoring
2026-04-27 18:54:49 +08:00
61a24a2c14 Refactor: migrate doc upload info used in chat (#14359)
### What problem does this PR solve?

Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documentss/upload/

### Type of change

- [x] Refactoring
2026-04-27 16:58:42 +08:00
d88f7ac8d2 Remove evaluation_app.py and kb_app.py (#14394)
### What problem does this PR solve?

Delete not used APIs

### Type of change

- [x] Refactoring
2026-04-27 16:08:54 +08:00
a9e5724b46 Refa: unify document create flows under REST documents API (#14345)
### What problem does this PR solve?

unify document create flows under REST documents API

### Type of change

- [x] Refactoring
2026-04-27 10:18:16 +08:00
4dcc42e0e1 feat(api): add unified index API and dataset management endpoints (#14222)
### What problem does this PR solve?

## Summary

Refactor the dataset API layer into a clean service/REST separation
pattern, add a unified `/index` API for graph/raptor/mindmap operations,
and introduce several new dataset management endpoints with full test
coverage.

## Changes

### Service Layer (`dataset_api_service.py`)

- Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace
function for all index types
- Added `run_index`, `delete_index` service functions
- Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`,
`get_ingestion_log`
- Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`,
`rename_tag`
- Added `get_flattened_metadata`, `get_auto_metadata`,
`update_auto_metadata`

### REST API Layer (`dataset_api.py`)

**New unified routes:**

| Method | Route | Description |
|--------|-------|-------------|
| POST | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Run index
task |
| GET | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Trace index
task |
| DELETE | `/datasets/<id>/<index_type>` | Delete index |
| GET | `/datasets/<id>` | Get dataset details |
| GET | `/datasets/<id>/ingestions/summary` | Ingestion summary |
| GET | `/datasets/<id>/ingestions` | List ingestion logs |
| GET | `/datasets/<id>/ingestions/<log_id>` | Get single ingestion log
|
| POST | `/datasets/<id>/embedding` | Run embedding |
| GET | `/datasets/<id>/tags` | List tags |
| GET | `/datasets/tags/aggregation` | Aggregate tags across datasets |
| DELETE | `/datasets/<id>/tags` | Delete tags |
| PUT | `/datasets/<id>/tags` | Rename tag |
| GET | `/datasets/metadata/flattened` | Get flattened metadata |
| GET/PUT | `/datasets/<id>/metadata/config` | New metadata config path
|

**Removed routes (replaced by unified `/index`):**

- `POST /datasets/<id>/mindmap`
- `GET /datasets/<id>/mindmap`

**Preserved legacy routes (backward compatibility):**

- `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor`
- `/auto_metadata` GET/PUT

### Test Suite

- Updated `common.py` helpers: added `trace_index`, removed
`run_mindmap`/`trace_mindmap`
- Added 7 new test files with 39 test cases total:

| Test File | Cases |
|-----------|-------|
| `test_get_dataset.py` | 4 |
| `test_ingestion_summary.py` | 2 |
| `test_ingestion_logs.py` | 5 |
| `test_index_api.py` | 14 |
| `test_embedding.py` | 2 |
| `test_tags.py` | 8 |
| `test_flattened_metadata.py` | 4 |

- Deleted `test_mindmap_tasks.py` (covered by unified index tests)

## Design Decisions

1. **Unified `/index?type=...`** — single endpoint replaces 3 separate
route pairs for graph/raptor/mindmap
2. **Backward compatibility** — old routes (`/run_graphrag`,
`/run_raptor`, `/auto_metadata`) preserved alongside new paths
3. **`_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}`** — input
validation via constant set
4. **`_INDEX_TYPE_TO_TASK_ID_FIELD`** — maps index type to KB model task
ID field for clean dispatch

## Files Changed

- `api/apps/restful_apis/dataset_api.py`
- `api/apps/services/dataset_api_service.py`
- `sdk/python/ragflow_sdk/modules/dataset.py`
- `test/testcases/test_http_api/common.py`
- `test/testcases/test_http_api/test_dataset_management/` (7 new files)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 09:38:01 +08:00
fb95136f39 Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090)
### What problem does this PR solve?

The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.

This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-25 14:30:15 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install
`opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g.
`==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g.
`docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` /
`external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp
dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate
files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection
patterns from output |

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling            42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00
9ad752f497 Refa:migrate agent webhook routes to REST APIs (#14330)
### What problem does this PR solve?

migrate agent webhook routes to REST APIs

### Type of change
- [x] Refactoring
2026-04-24 17:55:53 +08:00
199fbceb72 Refactor user REST API (#14334)
### What problem does this PR solve?
Refactor user REST API

### Type of change
- [x] Refactoring
2026-04-24 10:25:15 +08:00
c74aece63c Feat: Agent api (#14157)
### What problem does this PR solve?

1. **List agents**  
   **Prev API**:  
   - `/v1/canvas/list GET`  
   - `/api/v1/agents GET`  
   **Current API**: `/api/v2/agents GET`

2. **Get canvas template**  
   **Prev API**: `/v1/canvas/templates GET`  
   **Current API**: `/api/v2/agents/templates GET`

3. **Delete an agent**  
   **Prev API**: 
    - `/v1/canvas/rm POST`  
    - `/api/v1/agents/<agent_id> DELETE`
   **Current API**: `/api/v2/agents/<agent_id> DELETE`

4. **Update an agent**  
   **Prev API**: 
    - `/api/v1/agents/<agent_id> PUT`   
    - `/v1/canvas/setting POST `
   **Current API**: `/api/v2/agents/<agent_id> PATCH`


5. **Create an agent**  
   **Prev API**: 
    - `/v1/canvas/set POST`  
    - `/api/v1/agents POST`
   **Current API**: `/api/v2/agents POST`


6. **Get an agent**  
   **Prev API**: 
    - `/v1/canvas/get/<canvas_id> GET `  
   **Current API**: `/api/v2/agents/<agent_id> GET`


7. **Reset an agent**  
   **Prev API**: 
    - `/v1/canvas/reset POST`  
   **Current API**: `/api/v2/agents/<agent_id>/reset POST`


8. **Upload a file to an agent**  
   **Prev API**: 
    - `/v1/canvas/upload/<canvas_id> POST`  
   **Current API**: `/api/v2/agents/<agent_id>/upload POST`


9. **Input form**  
   **Prev API**: 
    - `/v1/canvas/input_form GET`  
**Current API**:
`/api/v2/agents/<agent_id>/components/<component_id>/input-form GET`


10. **Debug an agent**  
   **Prev API**: 
    - `/v1/canvas/debug POST`  
**Current API**:
`/api/v2/agents/<agent_id>/components/<component_id>/debug POST`


11. **Trace an agent**  
   **Prev API**: 
    - `/v1/canvas/trace GET`  
   **Current API**: `/api/v2/agents/<agent_id>/logs/<message_id> GET`


12. **Get an agent version list**  
   **Prev API**: 
    - `/v1/canvas/getlistversion/<canvas_id>`  
   **Current API**: `/api/v2/agents/<agent_id>/versions GET`


13. **Get a version of agent**  
   **Prev API**: 
    - `/v1/canvas/getversion/<version_id>`  
**Current API**: `/api/v2/agents/<agent_id>/versions/<version_id> GET`


14. **Test db connection**  
   **Prev API**: 
    - `/v1/canvas/test_db_connect POST`  
   **Current API**: `/api/v2/agents/test_db_connection`


15. **Rerun the agent**  
   **Prev API**: 
    - `/v1/canvas/rerun POST`  
   **Current API**: `/api/v2/agents/rerun POST`


16. **Get prompts**  
   **Prev API**: 
    - `/v1/canvas/prompts GET`  
   **Current API**: `/api/v2/agents/prompts GET`

### Type of change
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chanx <1243304602@qq.com>
2026-04-24 10:02:22 +08:00
ba47c13eb5 Fix commit override from #14298 of api-key to api_key (#14328)
### What problem does this PR solve?

Fix commit override from
https://github.com/infiniflow/ragflow/pull/14298/ of `api-key` to `api_key`

### Type of change

- [x] Refactoring
2026-04-23 17:16:32 +08:00
4458763a93 API refactor: stats_api and plugin_api (#14324)
### What problem does this PR solve?

API refactor: stats_api and plugin_api

### Type of change

- [x] Refactoring
2026-04-23 17:16:04 +08:00
7817b0d779 Refa: migrate chunk APIs to RESTful routes (#14291)
### What problem does this PR solve?

migrate chunk APIs to RESTful routes

### Type of change
- [x] Refactoring
2026-04-23 14:17:23 +08:00
76b017ca32 Refact: system apis (#14298)
### What problem does this PR solve?
Refact: system apis

### Type of change

- [x] Refactoring
2026-04-23 14:09:42 +08:00
aa4526266f Refa: migrate MCP APIs to RESTful api (#14317)
### What problem does this PR solve?

migrate MCP APIs to RESTful api

### Type of change

- [x] Refactoring
2026-04-23 12:51:27 +08:00
dbf8c6ed90 Refactor: Doc metadata update (#14289)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/metadata/update

After migration, Restful API
PATCH /api/v2/datasets/<dataset_id>/documents/metadatas 

### Type of change

- [x] Refactoring
2026-04-23 12:04:34 +08:00
aae45b959b Refactor: API file2document (#14306)
Refactor: API file2document
2026-04-23 11:40:45 +08:00
e79b896637 Refactor: REST API langfuse api-key (#14315)
REST API langfuse api-key
2026-04-23 11:36:16 +08:00
01753b8f31 Refactor: API connectors (#14228)
### What problem does this PR solve?

Refactor /api/v1/connectors to be more RESTful.

### Type of change
- [x] Refactoring
2026-04-22 20:42:41 +08:00
c08cd8e090 Refactor: Migrate document metadata config update API (#14286)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/update_metadata_setting

After consolidation, Restful API
PUT
/api/v1/datasets/<dataset_id>/documents/<document_id>/metadata/config

### Type of change

- [x] Refactoring
2026-04-22 20:01:31 +08:00