### What problem does this PR solve?
Currently, RAGFlow's Search and Chat interfaces display only raw
vectorized text chunks during retrieval, without contextual information
about their source documents. Users cannot see document titles, page
numbers, upload dates, or custom metadata fields that would help them
understand and trust the retrieved results.
This PR introduces an **optional metadata display feature** that
enriches retrieved chunks with document-level metadata in both the
Search tab and Chatbot interface.
**Key improvements:**
- **Search results**: Display document metadata as styled badges beneath
chunk snippets
- **Chat citations**: Show metadata in citation popovers and reference
lists for better source context
- **LLM context**: Metadata is injected into the LLM prompt to enable
more accurate, citation-aware responses
- **External API support**: Applications using RAGFlow's SDK retrieval
endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in
via request parameters
- **User control**: Multi-select dropdown UI allows users to choose
which metadata fields to display
**Implementation approach:**
- ✅ Reuses existing `DocMetadataService` infrastructure (no new database
tables or indices)
- ✅ Settings stored in existing JSON configuration fields
(`search_config.reference_metadata`, `prompt_config.reference_metadata`)
- ✅ No database migrations required
- ✅ Disabled by default (fully opt-in and backward-compatible)
- ✅ Dynamic metadata field selection populated from actual document
metadata keys
- ✅ Fixed critical bug where Python's builtin `set()` was shadowed by a
route handler function
**Modified endpoints (all backward-compatible):**
- `POST /v1/retrieval` (Public SDK)
- `POST /v1/searchbots/retrieval_test` (Searchbots)
- `POST /v1/chunk/retrieval_test` (UI/Internal)
- Chat completions endpoints (via `extra_body.reference_metadata` or
`prompt_config`)
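For example, a hedged sketch of an SDK client opting in; the exact request schema for `reference_metadata` is inferred from the config keys above and may differ:
```python
import requests

# Hypothetical opt-in call to the public retrieval endpoint. The
# `reference_metadata` parameter name follows the config keys above
# (search_config.reference_metadata); the field list is illustrative.
resp = requests.post(
    "http://localhost:9380/v1/retrieval",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "question": "What changed in Q3?",
        "dataset_ids": ["<dataset_id>"],
        # Opt in; omitting this keeps the default (disabled) behavior.
        "reference_metadata": ["title", "page_number", "upload_date"],
    },
)
for chunk in resp.json().get("data", {}).get("chunks", []):
    print(chunk.get("content"), chunk.get("doc_metadata"))
```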
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### Images
<img width="879" height="1275" alt="image"
src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb"
/>
<br><br>
<img width="1532" height="362" alt="image"
src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e"
/>
<br><br>
<img width="2586" height="1320" alt="image"
src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776"
/>
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
### What problem does this PR solve?
Remove the UUID validation from `delete_documents`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Partially addresses #14362.
This PR enables syncing deleted files for RSS data sources.
Previously, RSS incremental sync only returned feed entries whose
timestamps were inside the poll window. If an entry was removed from the
RSS feed, RAGFlow had no full current RSS snapshot to pass into the
shared stale-document cleanup path, so the deleted remote entry could
remain in the knowledge base.
This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector`
- reuses the same `rss:<md5(stable_key)>` document ID derivation used by
normal RSS ingest
- returns `(document_generator, file_list)` for incremental RSS sync
when `sync_deleted_files` is enabled
- captures the poll end timestamp before snapshot/poll so cleanup does
not race against the same sync window
- adds start/end logs around RSS slim snapshot collection
- exposes the deleted-file sync toggle for RSS in the data source UI
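For illustration, a minimal sketch of the snapshot shape, assuming the shared `SlimDocument` type and hypothetical feed helpers (`_fetch_feed_entries`, `_stable_key`):
```python
import hashlib
import logging

def _rss_doc_id(stable_key: str) -> str:
    # Same derivation as normal RSS ingest: rss:<md5(stable_key)>.
    return f"rss:{hashlib.md5(stable_key.encode('utf-8')).hexdigest()}"

def retrieve_all_slim_docs_perm_sync(self, *args, **kwargs):
    # Full current-feed snapshot, independent of the poll window, so the
    # shared cleanup path can diff the KB against the feed's present state.
    logging.info("RSS slim snapshot: start")
    batch = [SlimDocument(id=_rss_doc_id(self._stable_key(entry)))
             for entry in self._fetch_feed_entries()]
    logging.info("RSS slim snapshot: end (%d entries)", len(batch))
    yield batch
```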
Per maintainer request on related datasource PRs, this PR contains no
test-case changes. Local verification was run with an external script.
Validation:
- `uv run ruff check common/data_source/rss_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`
- `uv run python /tmp/verify_rss_deleted_sync.py --repo
/root/74/ragflow`
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Incremental WebDAV sync only ingested files whose modification time fell
inside the poll window; documents removed on the WebDAV server were
never removed from the knowledge base. This aligns with
[#14362](https://github.com/infiniflow/ragflow/issues/14362)
(coordinated datasource “sync deleted files” work).
This PR adds a **full-tree slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that enumerates current remote
paths **without downloading file contents**, using the same logical
document IDs as full ingest (`webdav:{base_url}:{file_path}`). When
**`sync_deleted_files`** is enabled on incremental runs, sync returns
**`(document_generator, file_list)`** so **`SyncBase`** runs
**`cleanup_stale_documents_for_task`** and removes KB rows no longer
present remotely.
Design notes:
- **`_list_files_recursive`** gains **`filter_by_mtime`**: snapshot
passes **`filter_by_mtime=False`** (full tree under **`remote_path`**);
**`poll_source`** keeps mtime-window filtering as before.
- Slim snapshot applies the same **extension** and **`size_threshold`**
rules as **`_yield_webdav_documents`** so retain IDs match what would be
indexed.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with
Dropbox-style connectors).
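A simplified sketch of the split; `self.client.list(...)` and `SlimDocument` stand in for the real WebDAV client and the shared connector interfaces, and the real snapshot additionally applies the extension and `size_threshold` rules noted above:
```python
def _list_files_recursive(self, path: str, filter_by_mtime: bool = True,
                          start: float | None = None, end: float | None = None):
    # Walk the remote tree; `self.client.list` is a placeholder for the
    # actual WebDAV listing call.
    for item in self.client.list(path, get_info=True):
        if item["isdir"]:
            yield from self._list_files_recursive(
                item["path"], filter_by_mtime, start, end)
        elif filter_by_mtime and not (start <= item["mtime"] <= end):
            continue  # poll_source keeps its mtime window, as before
        else:
            yield item

def retrieve_all_slim_docs_perm_sync(self, *args, **kwargs):
    # Snapshot: full tree under remote_path, no mtime filter, no downloads.
    yield [SlimDocument(id=f"webdav:{self.base_url}:{f['path']}")
           for f in self._list_files_recursive(
               self.remote_path, filter_by_mtime=False)]
```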
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### Files changed
| Area | Change |
|------|--------|
| `common/data_source/webdav_connector.py` | `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `filter_by_mtime` on `_list_files_recursive` |
| `rag/svr/sync_data_source.py` | WebDAV `_generate`: `file_list` + tuple return; pass `batch_size` from connector config |
| `web/src/pages/user-setting/data-source/constant/index.tsx` | `syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` |
### What problem does this PR solve?
Implement the vLLM model provider for RAGFlow to fully support local and
self-hosted open-source models (e.g., Qwen, GLM, Llama) via the vLLM
framework, and fix several critical bugs related to model instance
management and API requests.
**Key changes and fixes:**
1. **Added Standard vLLM Provider (`vllm.go`, `vllm.json`):**
- Implemented `VllmModel` driver strictly adhering to the OpenAI API
specification.
- Removed hardcoded and dangerous routing logic (e.g., forcing
`AsyncChat` for Qwen/GLM prefixes), ensuring standard
`/v1/chat/completions` compatibility.
- Refactored `ListModels` to use safe JSON parsing (resolving nil
pointer panics) and standard `GET` requests without bodies.
- Added `APIConfig.Region` fallback logic to prevent empty `base_url`
fetching when checking models.
2. **Fixed `ChatToModelStreamWithSender` Bug (`model_service.go`):**
- Resolved the `model is disabled` error when streaming chat with local
database-saved models.
- Added the missing `if modelInfo.Status == "active"` block to correctly
invoke `NewInstance` and inject the dynamic `base_url` into the provider
driver before starting the SSE stream.
3. **Fixed `ListSupportedModels` Bug (`model_service.go`):**
- Added dynamic `NewInstance` injection for `base_url`. Previously, the
list models function used the static JSON config without injecting
user-configured dynamic URLs from the database, resulting in an
`unsupported protocol scheme ""` error.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix: LaTeX formulas cannot be displayed on the chat page.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
id as "text", not a "keyword", order by it will cause error.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix docker image version info in comment
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Refs #14362.
This PR enables syncing deleted files for Zendesk data sources.
Previously, Zendesk incremental sync never returned a slim remote
snapshot to the shared stale-document cleanup path, so deleted remote
Zendesk records could remain in RAGFlow. The existing Zendesk slim
snapshot also included records that ingestion intentionally skips, such
as draft articles, articles without bodies, skipped-label articles,
empty-body articles, and tickets with `status == "deleted"`.
This PR:
- exposes the deleted-file sync option for Zendesk in the data source UI
- returns Zendesk slim snapshots during incremental sync when
`sync_deleted_files` is enabled
- reuses Zendesk indexability rules so cleanup compares against the same
records ingestion can materialize
- adds start/end logs around Zendesk slim snapshot collection for
operational visibility
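For illustration, the indexability reuse might look like the following sketch; `SKIP_LABELS` and the field names are assumptions, not the actual Zendesk schema:
```python
SKIP_LABELS = {"no-index"}  # illustrative; the real skip-label set is config-driven

def _is_indexable_article(article: dict) -> bool:
    # Mirror ingest-time skips so cleanup never retains records that
    # ingestion would not materialize (drafts, empty bodies, skip labels).
    if article.get("draft"):
        return False
    if not (article.get("body") or "").strip():
        return False
    if SKIP_LABELS.intersection(article.get("label_names") or []):
        return False
    return True

def _is_indexable_ticket(ticket: dict) -> bool:
    return ticket.get("status") != "deleted"
```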
Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.
Validation:
- `uv run ruff check common/data_source/zendesk_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Partially addresses #14362.
Adds deleted-file sync support for the Asana data source. Asana already
indexes task attachments as documents, but it did not provide the slim
document snapshot required by stale-document reconciliation, and the
sync wrapper never returned a `file_list` for cleanup.
This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector`
- builds slim IDs with the same `asana:{task_id}:{attachment_gid}`
format used by indexed documents
- avoids downloading attachment blobs during the snapshot
- aborts the snapshot if Asana API errors occur, preventing partial
snapshots from deleting valid local docs
- captures the incremental poll end time before snapshotting and makes
`poll_source()` respect that boundary
- exposes the deleted-file sync toggle for Asana in the data source UI
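A minimal sketch of the abort-on-error behavior, with hypothetical helpers (`_iter_tasks`, `_iter_attachments`) and the shared `SlimDocument` type assumed:
```python
def retrieve_all_slim_docs_perm_sync(self, *args, **kwargs):
    # Collect current attachment IDs without downloading blobs. API errors
    # are deliberately not swallowed here: a partial snapshot would make
    # cleanup treat still-existing remote attachments as stale and delete
    # their local documents.
    batch = []
    for task in self._iter_tasks():                # hypothetical helper
        for att in self._iter_attachments(task):   # hypothetical helper
            batch.append(SlimDocument(id=f"asana:{task['gid']}:{att['gid']}"))
    yield batch
```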
Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.
Validation:
- `uv run ruff check common/data_source/asana_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`
### Type of change
- [x] New Feature
## Summary
Fix critical severity security issue in `rag/utils/ob_conn.py`.
## Vulnerability
| Field | Value |
|-------|-------|
| **ID** | V-003 |
| **Severity** | CRITICAL |
| **Scanner** | multi_agent_ai |
| **Rule** | `V-003` |
| **File** | `rag/utils/ob_conn.py:691` |
**Description**: The OceanBase database connector constructs SQL WHERE
clauses by directly embedding user-controlled filter expressions using
Python f-strings at lines 726, 777, 781, 787, 793, 821, and 827. No
parameterization or allowlist validation is applied before the
expressions are incorporated into live SQL queries. This is the most
critical vulnerability in the codebase because it directly exposes the
RAG knowledge base — the platform's core business asset — to complete
compromise.
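For context, a generic sketch of the fix pattern (not the actual patch): bind user-supplied values through the driver and allowlist identifiers, since identifiers cannot be parameterized:
```python
ALLOWED_COLUMNS = {"doc_id", "kb_id", "create_time"}  # illustrative allowlist

def fetch_chunks(cursor, column: str, value: str):
    # Identifiers cannot be bound as parameters, so they are checked
    # against an allowlist; values go through driver-side binding.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported filter column: {column}")
    # SQL structure stays fixed; the driver escapes the value.
    cursor.execute(f"SELECT id, content FROM chunks WHERE {column} = %s",
                   (value,))
    return cursor.fetchall()

# The vulnerable pattern embedded raw user-controlled expressions:
#   cursor.execute(f"SELECT ... WHERE {user_filter_expression}")
```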
## Changes
- `rag/utils/ob_conn.py`
## Verification
- [x] Build passes
- [x] Scanner re-scan confirms fix
- [x] LLM code review passed
---
*Automated security fix by [OrbisAI Security](https://orbisappsec.com)*
### What problem does this PR solve?
Feat: add a button to remove header & footer in pipeline.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Incremental Seafile sync only ingests files whose modification time
falls in the poll window; documents removed in Seafile were never
removed from the knowledge base. This contributes to
[#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource
“sync deleted files” coordination).
This PR adds a **slim snapshot** (`retrieve_all_slim_docs_perm_sync`)
that enumerates current remote file IDs **without downloading content**,
using the same logical IDs as full ingest
(`seafile:{repo_id}:{file_id}`). When **`sync_deleted_files`** is
enabled on incremental runs, **`SeaFile._generate`** returns
**`(document_generator, file_list)`** so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove stale KB documents.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What changed
- **`common/data_source/seafile_connector.py`**: `SeaFileConnector`
implements **`SlimConnectorWithPermSync`**;
**`_list_files_recursive(..., filter_by_mtime=...)`** supports full-tree
listing for snapshots; **`retrieve_all_slim_docs_perm_sync()`** reuses
the same library/root scan as ingest and applies the same **size**
ceiling; logging for snapshot start/end and counts.
- **`rag/svr/sync_data_source.py`**: **`SeaFile._generate`** validates
**`batch_size`**, captures **`end_ts`** before snapshot +
**`poll_source`**, wraps slim retrieval in **`try`/`except`** (
**`file_list = None`** on failure so ingest continues), returns
**`(generator, file_list)`**.
- **`web/src/pages/user-setting/data-source/constant/index.tsx`**:
**`syncDeletedFiles`** for Seafile in
**`DataSourceFeatureVisibilityMap`**.
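Schematically, the `_generate` flow looks like the following sketch (names simplified; the real method also validates `batch_size`):
```python
import logging
import time

def _generate(self, connector, start_ts: float, sync_deleted_files: bool):
    # Capture the window end before snapshot + poll so both share the
    # same upper bound and cleanup cannot race the sync window.
    end_ts = time.time()
    file_list = None
    if sync_deleted_files:
        try:
            file_list = [doc.id
                         for batch in connector.retrieve_all_slim_docs_perm_sync()
                         for doc in batch]
        except Exception:
            logging.exception("Seafile slim snapshot failed; skipping cleanup")
            file_list = None  # ingest still runs; no deletions this cycle
    return connector.poll_source(start_ts, end_ts), file_list
```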
### What problem does this PR solve?
This fixes a crash in Manual and Naive parsing when PDF outlines include
page numbers as a third tuple value. It makes outline unpacking accept
extra values so parsing no longer fails. Fixes #14411.
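The fix is the standard starred-target unpacking pattern; a minimal sketch, assuming entries were previously two-tuples (`pdf_outlines` and `handle_outline` are placeholders):
```python
# Before the fix, strict two-target unpacking raised ValueError once a
# third value (the page number) appeared in an outline tuple:
#     level, title = entry
for entry in pdf_outlines:                      # placeholder iterable
    level, title, *extra = entry                # tolerate trailing values
    page_no = extra[0] if extra else None
    handle_outline(level, title, page_no)       # placeholder downstream use
```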
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Both the tokenizer (`rag/flow/tokenizer/tokenizer.py`) and `BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate embedding batches via `np.concatenate` inside the per-batch loop. `np.concatenate` allocates a new array and copies all existing data on every call, so accumulating N batches is O(N²) in both time and peak memory.
Replacing the incremental concatenate with a list of batches plus a single `np.vstack` at the end gives O(N) total work.
For the tokenizer, the title-vector broadcast `np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does the same job with a single contiguous allocation instead of building a Python list of references.
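For illustration, a minimal sketch of the pattern under assumed names (`encode` and `batches` stand in for the real loop variables):
```python
import numpy as np

def embed_all(batches, encode):
    # O(N): collect per-batch arrays, then copy once with a single vstack.
    # The replaced anti-pattern re-copied the whole accumulator per batch:
    #   vects = np.concatenate([vects, encode(b)])  # O(N^2) overall
    parts = [encode(b) for b in batches]
    return np.vstack(parts)

def broadcast_title(title_vec: np.ndarray, n: int) -> np.ndarray:
    # np.tile makes one contiguous (n, dim) allocation, replacing
    # np.concatenate([title_vec] * n) and its list of n references.
    return np.tile(title_vec, (n, 1))
```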
This is purely a CPU/memory optimisation — output shape and dtype are
unchanged. Measured impact grows with document size:
- 1k chunks (batch 512, 2 iters): ~negligible
- 10k chunks (20 iters): ~10× speedup on this stage
- 100k chunks (195 iters): ~100× speedup, and peak RAM
drops from O(N) extra to near-zero
### Type of change
- [x] Performance Improvement
Co-authored-by: Yoan Sapienza <yoan.sapienza@orange.fr>
Co-authored-by: Yoan Sapienza <zappy@macbookpro.home>
### What problem does this PR solve?
- Update version tags in README files (including translations) from
v0.25.0 to v0.25.1
- Modify Docker image references and documentation to reflect new
version
- Update version badges and image descriptions
- Maintain consistency across all language variants of README files
### Type of change
- [x] Documentation Update
### What problem does this PR solve?
Fix: The GraphRAG icon is not displaying.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
## Summary
Fixed a bug where the **File Logs** tab in the dataset ingestion page
always showed "No logs" even after files were parsed successfully.
## Root Cause
Both the **File Logs** and **Dataset Logs** tabs on the frontend called
the same backend endpoint `/datasets/{dataset_id}/ingestions`. However,
the backend only queried `get_dataset_logs_by_kb_id`, which
hard-filtered records by `document_id == GRAPH_RAPTOR_FAKE_DOC_ID`
(dataset-level logs). As a result, real file-level logs were never
returned, causing the table to appear empty.
## Changes
### Backend
- **`api/apps/restful_apis/dataset_api.py`**
- Added two new query parameters to `list_ingestion_logs`:
- `log_type` — `"file"` or `"dataset"` (default: `"dataset"`)
- `keywords` — search keyword for filtering by document / task name
- **`api/apps/services/dataset_api_service.py`**
- Updated `list_ingestion_logs` signature to accept `log_type` and
`keywords`.
- Added conditional routing:
- When `log_type == "file"`, call
`PipelineOperationLogService.get_file_logs_by_kb_id`
- Otherwise, call
`PipelineOperationLogService.get_dataset_logs_by_kb_id`
- **`api/db/services/pipeline_operation_log_service.py`**
- Extended `get_dataset_logs_by_kb_id` with an optional `keywords`
parameter so dataset logs can also be searched.
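A condensed sketch of the routing (simplified signatures; the real endpoints also carry paging and filter arguments):
```python
def list_ingestion_logs(kb_id: str, log_type: str = "dataset",
                        keywords: str | None = None, **page_args):
    # Route by log_type; both frontend tabs now hit this one endpoint.
    if log_type == "file":
        # Real document-level parse logs.
        return PipelineOperationLogService.get_file_logs_by_kb_id(
            kb_id, keywords=keywords, **page_args)
    # Dataset-level logs: GraphRAG / RAPTOR / MindMap entries
    # (document_id == GRAPH_RAPTOR_FAKE_DOC_ID).
    return PipelineOperationLogService.get_dataset_logs_by_kb_id(
        kb_id, keywords=keywords, **page_args)
```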
### Frontend
- **`web/src/pages/dataset/dataset-overview/hook.ts`**
- Removed the separate API function switching (`listPipelineDatasetLogs`
vs `listDataPipelineLogDocument`).
- Unified both tabs to call `listDataPipelineLogDocument` with the new
`log_type` query parameter (`"file"` or `"dataset"`).
- Ensured `keywords` and filter values are passed through correctly.
## Behavior After Fix
| Tab | `log_type` | Returned Records | Searchable Field |
|---|---|---|---|
| File Logs | `file` | Real document-level logs | `document_name` (file name) |
| Dataset Logs | `dataset` | GraphRAG / RAPTOR / MindMap logs | `document_name` (task type) |
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: noob <yixiao121314@outlook.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fix: The pipeline column header in the FileLogsTable is displaying
incorrectly.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. Add support for dropping instance models.
2. Fix the issue where dropping an instance did not also drop its models.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Implement the MiniMax provider.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: enable sync deleted files for Discord.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Partially addresses #14362 by adding deleted-file sync support for the
Dropbox data source.
Dropbox previously did not provide the slim current-file snapshot
required by stale document reconciliation, and its sync runner returned
only document batches. As a result, enabling deleted-file sync could not
remove local documents that had been deleted from Dropbox.
This PR:
- Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`.
- Reuses Dropbox metadata traversal to collect current remote file IDs
without downloading file contents.
- Wires incremental Dropbox sync to return `(document_generator,
file_list)` when `sync_deleted_files` is enabled.
- Enables the deleted-file sync toggle for Dropbox in the data source
settings UI.
- Adds regression coverage for slim snapshots, nested folders, paginated
listings, duplicate filenames, and full reindex behavior.
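A condensed sketch of the metadata-only traversal using the Dropbox SDK's cursor pagination; the `dropbox:{id}` prefix and `SlimDocument` type are assumptions mirroring the other connectors:
```python
import dropbox

def retrieve_all_slim_docs_perm_sync(self, *args, **kwargs):
    # files_list_folder / _continue return metadata only, so the snapshot
    # never downloads file contents.
    result = self.client.files_list_folder("", recursive=True)
    while True:
        yield [SlimDocument(id=f"dropbox:{entry.id}")
               for entry in result.entries
               if isinstance(entry, dropbox.files.FileMetadata)]
        if not result.has_more:
            break
        result = self.client.files_list_folder_continue(result.cursor)
```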
Tests:
- `uv run pytest test/unit_test/common/test_dropbox_connector.py -q`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `uv run pytest test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py -q`
- `uv run ruff check common/data_source/dropbox_connector.py
rag/svr/sync_data_source.py
test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: enable sync deleted files in GitLab.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Problem
In the Dataset Configuration page, changing the RAPTOR **Generation
scope** from "Single file" to "Dataset" and clicking **Save** did not
persist the change. After refreshing or re-entering the page, the scope
always reverted to "Single file".
## Root Cause
1. **Backend**: The `RaptorConfig` Pydantic model in
`api/utils/validation_utils.py` was configured with `extra="forbid"` but
did not declare a `scope` field. When the frontend sent `"scope":
"dataset"`, Pydantic rejected the request.
2. **Frontend**: The `extractRaptorConfigExt` utility in
`web/src/hooks/parser-config-utils.ts` treated `scope` as an unknown
field and moved it into the nested `ext` object. Consequently, the
backend could not read `raptor_config.get("scope", "file")` correctly,
so the default `"file"` was always used.
## Changes
- Added `scope: Literal["file", "dataset"]` to the backend
`RaptorConfig` model with a default of `"file"`.
- Added `scope` to the known-field whitelist in the frontend
`extractRaptorConfigExt` helper so it is transmitted as a top-level
raptor field instead of being buried in `ext`.
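For reference, a minimal sketch of the backend side of the fix; field names follow the PR, the sketch assumes Pydantic v2 syntax, and the model's other fields are elided:
```python
from typing import Literal

from pydantic import BaseModel, ConfigDict

class RaptorConfig(BaseModel):
    # extra="forbid" is why the missing field broke saves: any undeclared
    # key, including "scope", caused the whole request to be rejected.
    model_config = ConfigDict(extra="forbid")

    scope: Literal["file", "dataset"] = "file"
    # ... other raptor fields elided ...
```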
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: noob <yixiao121314@outlook.com>
### What problem does this PR solve?
1. Support the following commands:
```
RAGFlow(user)> create provider 'vllm' instance 'test' key 'test-key' url 'base-url' region 'abc';
SUCCESS
RAGFlow(user)> list instances from 'vllm';
+----------+----------------------------------------+----------------------------------+--------------+----------------------------------+--------+
| apiKey | extra | id | instanceName | providerID | status |
+----------+----------------------------------------+----------------------------------+--------------+----------------------------------+--------+
| test-key | {"base_url":"base-url","region":"abc"} | 40213c89430311f1a7cf38a74640adcc | test | b4d40e6142d311f1a4f938a74640adcc | enable |
+----------+----------------------------------------+----------------------------------+--------------+----------------------------------+--------+
```
2. Support adding a vLLM model:
```
RAGFlow(user)> add model 'Qwen/Qwen2-0.5B' to provider 'vllm' instance 'test' with tokens 131072 chat;
SUCCESS
```
3. Add vLLM chat support.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Port PR #14454 (`PruneDeletedChunks`) to Go.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: enable sync deleted files for Gmail and fix Google Drive issues.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: bill <yibie_jingnian@163.com>
Co-authored-by: balibabu <assassin_cike@163.com>
### What problem does this PR solve?
Fix: Clicking the button in the bottom-right corner of the
`/chats/widget` page fails to display the dialog box.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)