Commit Graph

1973 Commits

Author SHA1 Message Date
aa57b5bd8b Go: move logger to common module (#14545)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-06 10:41:58 +08:00
24af0875e5 Feat/configurable metadata display (#13464)
### What problem does this PR solve?

Currently, RAGFlow's Search and Chat interfaces display only raw
vectorized text chunks during retrieval, without contextual information
about their source documents. Users cannot see document titles, page
numbers, upload dates, or custom metadata fields that would help them
understand and trust the retrieved results.

This PR introduces an **optional metadata display feature** that
enriches retrieved chunks with document-level metadata in both the
Search tab and Chatbot interface.

**Key improvements:**
- **Search results**: Display document metadata as styled badges beneath
chunk snippets
- **Chat citations**: Show metadata in citation popovers and reference
lists for better source context
- **LLM context**: Metadata is injected into the LLM prompt to enable
more accurate, citation-aware responses
- **External API support**: Applications using RAGFlow's SDK retrieval
endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in
via request parameters
- **User control**: Multi-select dropdown UI allows users to choose
which metadata fields to display

**Implementation approach:**
-  Reuses existing `DocMetadataService` infrastructure (no new database
tables or indices)
-  Settings stored in existing JSON configuration fields
(`search_config.reference_metadata`, `prompt_config.reference_metadata`)
-  No database migrations required
-  Disabled by default (fully opt-in and backward-compatible)
-  Dynamic metadata field selection populated from actual document
metadata keys
-  Fixed critical bug where Python's builtin `set()` was shadowed by a
route handler function

**Modified endpoints (all backward-compatible):**
- `POST /v1/retrieval` (Public SDK)
- `POST /v1/searchbots/retrieval_test` (Searchbots)
- `POST /v1/chunk/retrieval_test` (UI/Internal)
- Chat completions endpoints (via `extra_body.reference_metadata` or
`prompt_config`)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)


###Images
-
<img width="879" height="1275" alt="image"
src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb"
/>
<br><br>
<br><br>

<img width="1532" height="362" alt="image"
src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e"
/>
<br><br>
<br><br>

<img width="2586" height="1320" alt="image"
src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776"
/>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
2026-04-30 23:13:27 +08:00
a69e0c73c7 feat(rss): support deleted-file sync (#14493)
### What problem does this PR solve?

Partially addresses #14362.

This PR enables syncing deleted files for RSS data sources.

Previously, RSS incremental sync only returned feed entries whose
timestamps were inside the poll window. If an entry was removed from the
RSS feed, RAGFlow had no full current RSS snapshot to pass into the
shared stale-document cleanup path, so the deleted remote entry could
remain in the knowledge base.

This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector`
- reuses the same `rss:<md5(stable_key)>` document ID derivation used by
normal RSS ingest
- returns `(document_generator, file_list)` for incremental RSS sync
when `sync_deleted_files` is enabled
- captures the poll end timestamp before snapshot/poll so cleanup does
not race against the same sync window
- adds start/end logs around RSS slim snapshot collection
- exposes the deleted-file sync toggle for RSS in the data source UI

Per maintainer request on related datasource PRs, this PR contains no
test-case changes. Local verification was run with an external script.

Validation:
- `uv run ruff check common/data_source/rss_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`
- `uv run python /tmp/verify_rss_deleted_sync.py --repo
/root/74/ragflow`

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 18:56:13 +08:00
bedf9592ef feat(webdav): support deleted-file sync via slim snapshot (#14491)
## What problem does this PR solve?

Incremental WebDAV sync only ingested files whose modification time fell
inside the poll window; documents removed on the WebDAV server were
never removed from the knowledge base. This aligns with
[#14362](https://github.com/infiniflow/ragflow/issues/14362)
(coordinated datasource “sync deleted files” work).

This PR adds a **full-tree slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that enumerates current remote
paths **without downloading file contents**, using the same logical
document IDs as full ingest (`webdav:{base_url}:{file_path}`). When
**`sync_deleted_files`** is enabled on incremental runs, sync returns
**`(document_generator, file_list)`** so **`SyncBase`** runs
**`cleanup_stale_documents_for_task`** and removes KB rows no longer
present remotely.

Design notes:

- **`_list_files_recursive`** gains **`filter_by_mtime`**: snapshot
passes **`filter_by_mtime=False`** (full tree under **`remote_path`**);
**`poll_source`** keeps mtime-window filtering as before.
- Slim snapshot applies the same **extension** and **`size_threshold`**
rules as **`_yield_webdav_documents`** so retain IDs match what would be
indexed.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with
Dropbox-style connectors).

## Type of change

- [x] New Feature (non-breaking change which adds functionality)

## Files changed

| Area | Change |
|------|--------|
| `common/data_source/webdav_connector.py` |
`SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`,
`filter_by_mtime` on `_list_files_recursive` |
| `rag/svr/sync_data_source.py` | WebDAV `_generate`: `file_list` +
tuple return; pass **`batch_size`** from connector config |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` |
2026-04-30 17:26:27 +08:00
00e03a1945 Fix: LaTeX formulas cannot be displayed on the chat page. (#14531)
### What problem does this PR solve?

Fix: LaTeX formulas cannot be displayed on the chat page.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 16:12:13 +08:00
17eda04b8d feat(zendesk): support deleted-file sync (#14487)
### What problem does this PR solve?

Refs #14362.

This PR enables syncing deleted files for Zendesk data sources.

Previously, Zendesk incremental sync never returned a slim remote
snapshot to the shared stale-document cleanup path, so deleted remote
Zendesk records could remain in RAGFlow. The existing Zendesk slim
snapshot also included records that ingestion intentionally skips, such
as draft articles, articles without bodies, skipped-label articles,
empty-body articles, and tickets with `status == "deleted"`.

This PR:
- exposes the deleted-file sync option for Zendesk in the data source UI
- returns Zendesk slim snapshots during incremental sync when
`sync_deleted_files` is enabled
- reuses Zendesk indexability rules so cleanup compares against the same
records ingestion can materialize
- adds start/end logs around Zendesk slim snapshot collection for
operational visibility

Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.

Validation:
- `uv run ruff check common/data_source/zendesk_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-04-30 14:44:05 +08:00
8f75e52bbf feat(asana): support deleted-file sync (#14468)
### What problem does this PR solve?

Partially addresses #14362.

Adds deleted-file sync support for the Asana data source. Asana already
indexes task attachments as documents, but it did not provide the slim
document snapshot required by stale-document reconciliation, and the
sync wrapper never returned a `file_list` for cleanup.

This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector`
- builds slim IDs with the same `asana:{task_id}:{attachment_gid}`
format used by indexed documents
- avoids downloading attachment blobs during the snapshot
- aborts the snapshot if Asana API errors occur, preventing partial
snapshots from deleting valid local docs
- captures the incremental poll end time before snapshotting and makes
`poll_source()` respect that boundary
- exposes the deleted-file sync toggle for Asana in the data source UI

Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.

Validation:
- `uv run ruff check common/data_source/asana_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`

### Type of change

- [x] New Feature
2026-04-30 14:41:36 +08:00
4ee0702aed Feat: add skills space to context engine (#13908)
### What problem does this PR solve?

issue #13714

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 12:36:03 +08:00
bb3b99f0a5 Feat: add button for remove header & footer in pipeline (#14486)
### What problem does this PR solve?

Feat: add button for remove header & footer in pipeline

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 12:30:41 +08:00
2932b65da6 feat(seafile): support deleted-file sync via slim snapshot (#14499)
### What problem does this PR solve?

Incremental Seafile sync only ingests files whose modification time
falls in the poll window; documents removed in Seafile were never
removed from the knowledge base. This contributes to
[#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource
“sync deleted files” coordination).

This PR adds a **slim snapshot** (`retrieve_all_slim_docs_perm_sync`)
that enumerates current remote file IDs **without downloading content**,
using the same logical IDs as full ingest
(`seafile:{repo_id}:{file_id}`). When **`sync_deleted_files`** is
enabled on incremental runs, **`SeaFile._generate`** returns
**`(document_generator, file_list)`** so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove stale KB documents.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
### What changed

- **`common/data_source/seafile_connector.py`**: `SeaFileConnector`
implements **`SlimConnectorWithPermSync`**;
**`_list_files_recursive(..., filter_by_mtime=...)`** supports full-tree
listing for snapshots; **`retrieve_all_slim_docs_perm_sync()`** reuses
the same library/root scan as ingest and applies the same **size**
ceiling; logging for snapshot start/end and counts.
- **`rag/svr/sync_data_source.py`**: **`SeaFile._generate`** validates
**`batch_size`**, captures **`end_ts`** before snapshot +
**`poll_source`**, wraps slim retrieval in **`try`/`except`** (
**`file_list = None`** on failure so ingest continues), returns
**`(generator, file_list)`**.
- **`web/src/pages/user-setting/data-source/constant/index.tsx`**:
**`syncDeletedFiles`** for Seafile in
**`DataSourceFeatureVisibilityMap`**.
2026-04-30 12:05:12 +08:00
47129fdd08 Fix: optimize file batch delete (#14473)
### What problem does this PR solve?

optimize file batch delete

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:00:39 +08:00
7c0584a2b7 Fix: The GraphRAG icon is not displaying. (#14514)
### What problem does this PR solve?

Fix: The GraphRAG icon is not displaying.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 10:44:05 +08:00
6dd38eca6a fix: file logs not displayed in dataset ingestion page (#14479)
### What problem does this PR solve?

## Summary

Fixed a bug where the **File Logs** tab in the dataset ingestion page
always showed "No logs" even after files were parsed successfully.

## Root Cause

Both the **File Logs** and **Dataset Logs** tabs on the frontend called
the same backend endpoint `/datasets/{dataset_id}/ingestions`. However,
the backend only queried `get_dataset_logs_by_kb_id`, which
hard-filtered records by `document_id == GRAPH_RAPTOR_FAKE_DOC_ID`
(dataset-level logs). As a result, real file-level logs were never
returned, causing the table to appear empty.

## Changes

### Backend

- **`api/apps/restful_apis/dataset_api.py`**
  - Added two new query parameters to `list_ingestion_logs`:
    - `log_type` — `"file"` or `"dataset"` (default: `"dataset"`)
    - `keywords` — search keyword for filtering by document / task name

- **`api/apps/services/dataset_api_service.py`**
- Updated `list_ingestion_logs` signature to accept `log_type` and
`keywords`.
  - Added conditional routing:
- When `log_type == "file"`, call
`PipelineOperationLogService.get_file_logs_by_kb_id`
- Otherwise, call
`PipelineOperationLogService.get_dataset_logs_by_kb_id`

- **`api/db/services/pipeline_operation_log_service.py`**
- Extended `get_dataset_logs_by_kb_id` with an optional `keywords`
parameter so dataset logs can also be searched.

### Frontend

- **`web/src/pages/dataset/dataset-overview/hook.ts`**
- Removed the separate API function switching (`listPipelineDatasetLogs`
vs `listDataPipelineLogDocument`).
- Unified both tabs to call `listDataPipelineLogDocument` with the new
`log_type` query parameter (`"file"` or `"dataset"`).
  - Ensured `keywords` and filter values are passed through correctly.

## Behavior After Fix

| Tab | `log_type` | Returned Records | Searchable Field |
|---|---|---|---|
| File Logs | `file` | Real document-level logs | `document_name` (file
name) |
| Dataset Logs | `dataset` | GraphRAG / RAPTOR / MindMap logs |
`document_name` (task type) |
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>
2026-04-29 22:10:24 +08:00
5018459112 Fix metadata config (#14480)
### What problem does this PR solve?

Fix metadata config

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 21:09:54 +08:00
c4d0b0ebcf Fix visit dataset error (#14490)
### What problem does this PR solve?

Fix visit dataset error

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 20:17:00 +08:00
1692f0928f Fix: The pipeline column header in the FileLogsTable is displaying incorrectly. (#14489)
### What problem does this PR solve?
Fix: The pipeline column header in the FileLogsTable is displaying
incorrectly.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 19:52:28 +08:00
9280c64518 Docs: Updated Title chunker references (#14483)
### What problem does this PR solve?

Updated Title chunker references

### Type of change

- [x] Documentation Update
2026-04-29 19:37:24 +08:00
de8c6ad0f3 Feat: enable sync deleted file for Discord (#14451)
### What problem does this PR solve?

Feat: enable sync deleted file for Discord

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:05:40 +08:00
2bc8c6d35e feat(dropbox): support deleted-file sync (#14476)
### What problem does this PR solve?

Partially addresses #14362 by adding deleted-file sync support for the
Dropbox data source.

Dropbox previously did not provide the slim current-file snapshot
required by stale document reconciliation, and its sync runner returned
only document batches. As a result, enabling deleted-file sync could not
remove local documents that had been deleted from Dropbox.

This PR:
- Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`.
- Reuses Dropbox metadata traversal to collect current remote file IDs
without downloading file contents.
- Wires incremental Dropbox sync to return `(document_generator,
file_list)` when `sync_deleted_files` is enabled.
- Enables the deleted-file sync toggle for Dropbox in the data source
settings UI.
- Adds regression coverage for slim snapshots, nested folders, paginated
listings, duplicate filenames, and full reindex behavior.

Tests:
- `uv run pytest test/unit_test/common/test_dropbox_connector.py -q`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `uv run pytest test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py -q`
- `uv run ruff check common/data_source/dropbox_connector.py
rag/svr/sync_data_source.py
test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:05:11 +08:00
db1a73b255 Feat: enable sync deleted files in gitlab (#14481)
### What problem does this PR solve?

Feat: enable sync deleted files in gitlab

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:04:10 +08:00
a0f9ae16d2 Fix: RAPTOR "Generation scope" reset to "Single file" when selecting "Dataset" (#14477)
## Problem
In the Dataset Configuration page, changing the RAPTOR **Generation
scope** from "Single file" to "Dataset" and clicking **Save** did not
persist the change. After refreshing or re-entering the page, the scope
always reverted to "Single file".

## Root Cause
1. **Backend**: The `RaptorConfig` Pydantic model in
`api/utils/validation_utils.py` was configured with `extra="forbid"` but
did not declare a `scope` field. When the frontend sent `"scope":
"dataset"`, Pydantic rejected the request.
2. **Frontend**: The `extractRaptorConfigExt` utility in
`web/src/hooks/parser-config-utils.ts` treated `scope` as an unknown
field and moved it into the nested `ext` object. Consequently, the
backend could not read `raptor_config.get("scope", "file")` correctly,
so the default `"file"` was always used.

## Changes
- Added `scope: Literal["file", "dataset"]` to the backend
`RaptorConfig` model with a default of `"file"`.
- Added `scope` to the known-field whitelist in the frontend
`extractRaptorConfigExt` helper so it is transmitted as a top-level
raptor field instead of being buried in `ext`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-29 18:46:28 +08:00
1b84892e3a Fix delete graph (#14484)
### What problem does this PR solve?

Fix delete graph

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 18:09:10 +08:00
3991bdfaf5 Fix graph task type (#14475)
### What problem does this PR solve?
Fix graph task type

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 17:05:56 +08:00
e0b3070012 Feat: enable sync deleted files for Gmail && fix google drive issues (#14462)
### What problem does this PR solve?

Feat: enable sync deleted files for Gmail && fix google drive issues

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
Co-authored-by: balibabu <assassin_cike@163.com>
2026-04-29 17:03:56 +08:00
a736948493 Fix: Clicking the button in the bottom-right corner of the /chats/widget page fails to display the dialog box. (#14465)
### What problem does this PR solve?

Fix: Clicking the button in the bottom-right corner of the
`/chats/widget` page fails to display the dialog box.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 17:03:33 +08:00
9690923516 Fix delete graphrag raptor (#14469)
### What problem does this PR solve?

Fix delete graphrag raptor

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 16:47:42 +08:00
ce933357c6 Fix: Dataset: When configuring the "general chunk method," options such as chunk size and parent-child slicing are unavailable. (#14459)
### What problem does this PR solve?

Fix: Dataset: When configuring the "general chunk method," options such
as chunk size and parent-child slicing are unavailable.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: balibabu <assassin_cike@163.com>
2026-04-29 14:37:48 +08:00
3b7a6eaa6c Feat: sync deleted files in Bitbucket (#14450)
### What problem does this PR solve?

Feat: sync deleted files in Bitbucket

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 11:29:17 +08:00
74fa54f122 feat(google-drive): optimize memory payload and enable sync deletion (#14372)
**Addresses the Google Drive integration for #14362**

This PR completely overhauls the Google Drive sync logic to accurately
detect remote deletions, while drastically reducing the memory footprint
during the snapshot phase.

### What changed under the hood:

* **Killed the memory bloat:** Swapped out the massive document
dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc
= namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during
`retrieve_all_slim_docs_perm_sync` on massive enterprise drives.
* **Flawless downstream integration:** The `SlimDoc` object relies on
simple duck typing. It perfectly delivers the `.id` attribute required
by `ConnectorService.cleanup_stale_documents_for_task`, meaning your
core `hash128` vector cleanup logic runs natively without modification.
* **Fixed the Shared Drive blindspot:** The standard API query was
missing team folders. Injected the `corpora="allDrives"` and
`includeItemsFromAllDrives=True` override flags so the connector now
accurately maps state across both personal workspaces and organizational
Shared Drives.

### Testing:
Isolated the Google API retrieval logic locally to prove the `SlimDoc`
mapping works and correctly registers state drops when a file is trashed
remotely.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Performance Improvement
2026-04-29 10:04:36 +08:00
0d18b293f5 Fix: enable sync deleted file in airtable (#14438)
### What problem does this PR solve?

Fix: enable sync deleted file in airtable

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 20:09:08 +08:00
35f6d81b73 Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402)
### What problem does this PR solve?

## Summary

Migrate two web API endpoints to REST-style HTTP API endpoints,
following the pattern established in #14222:

| Old Endpoint | New Endpoint |
|---|---|
| `POST /v1/chunk/retrieval_test` | `POST
/api/v1/datasets/<dataset_id>/search` |
| `GET /v1/chunk/knowledge_graph` | `GET
/api/v1/datasets/<dataset_id>/graph` |
2026-04-28 20:00:26 +08:00
d532151be0 Feat: more model for paddle (#14436)
### What problem does this PR solve?

Feat: more model for paddle
### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 18:07:00 +08:00
c330005659 Fix: document level auto metadata config missing after save (#14421)
### What problem does this PR solve?

Steps to re-produce (existing bug before API migration):

create a new dataset
upload a file 
click on "General" in "Parse" column and then click on "switch or
configure ingestion pipeline"
click on "Settings" (at right of "Auto metadata")
click "Add" to add new metadata
click on "Save"
re-open "Settings" and the newly added metadata is not there

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 17:09:23 +08:00
18fbfafca6 Feat: enable sync deleted files for more connectors (#14353)
### What problem does this PR solve?

Feat: enable sync delted files for connectors

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 15:07:14 +08:00
444e564329 Fix: align chat recommendation and thumbup APIs (#14413)
### What problem does this PR solve?
align chat recommendation and thumbup APIs
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 12:55:16 +08:00
2d522ccb36 Fix: thumbnails issue in chat (#14415)
[Uploading part_4-13.pdf…]()
### What problem does this PR solve?

In chat, the thumbnails didn't display correctly

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)

Steps to reproduce:
1. create dataset and upload a file (see attached)
2. parse the document
3. once parsing completed, create a chat and associate it with the
dataset
4. ask a question (DAP VS DAPE comparison)
5. check result
2026-04-28 11:39:29 +08:00
c81081f8ef Refactor: Doc change parser (#14327)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_parser
HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents

After consolidation, Restful API
PATCH /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-27 23:42:57 +08:00
c5116b90e5 Refactor: migrate document thumbnails API (#14344)
### What problem does this PR solve?

Before migration: GET /v1/document/thumbnails
After migration:  GET /api/v1/thumbnails

### Type of change

- [x] Refactoring
2026-04-27 21:29:09 +08:00
49912a156e Refactor: migrate document run api (#14351)
### What problem does this PR solve?

Before migration: POST /v1/document/run
After migration: POST /api/v1/documents/ingest/

### Type of change

- [x] Refactoring
2026-04-27 21:25:58 +08:00
a536980e22 Refactor: Doc batch change status (#14337)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_status

After consolidation, Restful API
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status 

### Type of change

- [x] Refactoring
2026-04-27 20:00:23 +08:00
488c3ef6a3 Add task API (#14393)
### What problem does this PR solve?

Add task API

### Type of change

- [x] Refactor
2026-04-27 19:16:37 +08:00
82313020c7 Refa: align list operations and strict mode (#14387)
### What problem does this PR solve?

align list operations and strict mode

### Type of change
- [x] Refactoring
2026-04-27 19:13:00 +08:00
4f6651968a Fix: prioritize explore session ID and reset default conversation variables (#14399)
### What problem does this PR solve?

 prioritize explore session ID and reset default conversation variables

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 18:52:40 +08:00
61a24a2c14 Refactor: migrate doc upload info used in chat (#14359)
### What problem does this PR solve?

Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documentss/upload/

### Type of change

- [x] Refactoring
2026-04-27 16:58:42 +08:00
290f0294d6 Refactor: migrate artifact API (#14348)
### What problem does this PR solve?

Before migration: GET /v1/document/artifact/<filename>
After migration:  GET /api/v1/documents/artifact/<filename>

### Type of change

- [x] Refactoring
2026-04-27 15:19:41 +08:00
6a23dfeec1 chore(CLAUDE.md): add shared UI component lock convention to CLAUDE.md (#14381)
### What problem does this PR solve?

AI coding agents (Claude, Copilot, etc.) tend to directly edit files in
`src/components/ui/` when asked to tweak styles or add props, treating
them like ordinary feature code. This silently breaks the shared
component library that both shadcn primitives and project-authored
common components live in.

This PR adds a `Shared UI Component Lock` convention to `web/CLAUDE.md`
to instruct AI agents to treat the entire `src/components/ui/` directory
as read-only. Any customization must be done via wrappers or composition
outside the directory; exceptions require explicit user approval.

### Type of change
- [x] Other (please describe): Update `CLAUDE.md`
2026-04-27 12:03:32 +08:00
33bb464ce3 fix: skip canvas SSE fetch in chat shared page to eliminate spurious 103 error (#14190)
## What does this PR do?

Fixes the `hint : 103 Only owner of canvas authorized for this
operation` error that appears when opening a **Chat** shared link
(`/chats/share?shared_id=...&from=chat`).

## Root Cause

The Chat shared page (`web/src/pages/next-chats/share/index.tsx`)
unconditionally calls `useFetchFlowSSE()`, which requests
`/api/canvas/getsse/{sharedId}`. This is an Agent Canvas endpoint that
validates canvas ownership. When sharing a **Chat** dialog (not an
Agent):

1. `sharedId` is a `dialog_id`, not a `canvas_id`
2. The API token's `tenant_id` doesn't match any canvas owner
3. The backend returns `code: 103, message: "Only owner of canvas
authorized for this operation."`
4. The global error interceptor in `request.ts` displays it as a
notification: `hint : 103 Only owner of canvas authorized for this
operation.`

## Changes

- **`web/src/hooks/use-agent-request.ts`**: Added an `enabled` parameter
to `useFetchFlowSSE` so callers can conditionally skip the query.
- **`web/src/pages/next-chats/share/index.tsx`**: Only enable
`useFetchFlowSSE` when `from === SharedFrom.Agent`. For Chat shares, the
hook is disabled, avoiding the unnecessary canvas API call entirely.

## Related Issue

Closes #14115

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 11:27:39 +08:00
a9e5724b46 Refa: unify document create flows under REST documents API (#14345)
### What problem does this PR solve?

unify document create flows under REST documents API

### Type of change

- [x] Refactoring
2026-04-27 10:18:16 +08:00
4dcc42e0e1 feat(api): add unified index API and dataset management endpoints (#14222)
### What problem does this PR solve?

## Summary

Refactor the dataset API layer into a clean service/REST separation
pattern, add a unified `/index` API for graph/raptor/mindmap operations,
and introduce several new dataset management endpoints with full test
coverage.

## Changes

### Service Layer (`dataset_api_service.py`)

- Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace
function for all index types
- Added `run_index`, `delete_index` service functions
- Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`,
`get_ingestion_log`
- Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`,
`rename_tag`
- Added `get_flattened_metadata`, `get_auto_metadata`,
`update_auto_metadata`

### REST API Layer (`dataset_api.py`)

**New unified routes:**

| Method | Route | Description |
|--------|-------|-------------|
| POST | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Run index
task |
| GET | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Trace index
task |
| DELETE | `/datasets/<id>/<index_type>` | Delete index |
| GET | `/datasets/<id>` | Get dataset details |
| GET | `/datasets/<id>/ingestions/summary` | Ingestion summary |
| GET | `/datasets/<id>/ingestions` | List ingestion logs |
| GET | `/datasets/<id>/ingestions/<log_id>` | Get single ingestion log
|
| POST | `/datasets/<id>/embedding` | Run embedding |
| GET | `/datasets/<id>/tags` | List tags |
| GET | `/datasets/tags/aggregation` | Aggregate tags across datasets |
| DELETE | `/datasets/<id>/tags` | Delete tags |
| PUT | `/datasets/<id>/tags` | Rename tag |
| GET | `/datasets/metadata/flattened` | Get flattened metadata |
| GET/PUT | `/datasets/<id>/metadata/config` | New metadata config path
|

**Removed routes (replaced by unified `/index`):**

- `POST /datasets/<id>/mindmap`
- `GET /datasets/<id>/mindmap`

**Preserved legacy routes (backward compatibility):**

- `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor`
- `/auto_metadata` GET/PUT

### Test Suite

- Updated `common.py` helpers: added `trace_index`, removed
`run_mindmap`/`trace_mindmap`
- Added 7 new test files with 39 test cases total:

| Test File | Cases |
|-----------|-------|
| `test_get_dataset.py` | 4 |
| `test_ingestion_summary.py` | 2 |
| `test_ingestion_logs.py` | 5 |
| `test_index_api.py` | 14 |
| `test_embedding.py` | 2 |
| `test_tags.py` | 8 |
| `test_flattened_metadata.py` | 4 |

- Deleted `test_mindmap_tasks.py` (covered by unified index tests)

## Design Decisions

1. **Unified `/index?type=...`** — single endpoint replaces 3 separate
route pairs for graph/raptor/mindmap
2. **Backward compatibility** — old routes (`/run_graphrag`,
`/run_raptor`, `/auto_metadata`) preserved alongside new paths
3. **`_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}`** — input
validation via constant set
4. **`_INDEX_TYPE_TO_TASK_ID_FIELD`** — maps index type to KB model task
ID field for clean dispatch

## Files Changed

- `api/apps/restful_apis/dataset_api.py`
- `api/apps/services/dataset_api_service.py`
- `sdk/python/ragflow_sdk/modules/dataset.py`
- `test/testcases/test_http_api/common.py`
- `test/testcases/test_http_api/test_dataset_management/` (7 new files)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 09:38:01 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install
`opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g.
`==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g.
`docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` /
`external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp
dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate
files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection
patterns from output |

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling            42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00