1539 Commits

Author SHA1 Message Date
24af0875e5 Feat/configurable metadata display (#13464)
### What problem does this PR solve?

Currently, RAGFlow's Search and Chat interfaces display only raw
vectorized text chunks during retrieval, without contextual information
about their source documents. Users cannot see document titles, page
numbers, upload dates, or custom metadata fields that would help them
understand and trust the retrieved results.

This PR introduces an **optional metadata display feature** that
enriches retrieved chunks with document-level metadata in both the
Search tab and Chatbot interface.

**Key improvements:**
- **Search results**: Display document metadata as styled badges beneath
chunk snippets
- **Chat citations**: Show metadata in citation popovers and reference
lists for better source context
- **LLM context**: Metadata is injected into the LLM prompt to enable
more accurate, citation-aware responses
- **External API support**: Applications using RAGFlow's SDK retrieval
endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in
via request parameters
- **User control**: Multi-select dropdown UI allows users to choose
which metadata fields to display

**Implementation approach:**
- Reuses existing `DocMetadataService` infrastructure (no new database tables or indices)
- Settings stored in existing JSON configuration fields (`search_config.reference_metadata`, `prompt_config.reference_metadata`)
- No database migrations required
- Disabled by default (fully opt-in and backward-compatible)
- Dynamic metadata field selection populated from actual document metadata keys
- Fixed a critical bug where Python's builtin `set()` was shadowed by a route handler function
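
The shadowing fix deserves a note: in Python, a module-level function named `set` silently hijacks every later call to the builtin in that module. A minimal sketch (the handler name and return value are hypothetical, not RAGFlow's actual route code):

```python
import builtins

def set(payload):
    # Hypothetical route handler whose name shadows the builtin set().
    return {"ok": True}

# Code that intends to deduplicate ids now calls the handler instead,
# silently returning the wrong value rather than raising an error:
result = set(["a", "b", "a"])
assert result == {"ok": True}   # not {"a", "b"}

# The fix is to rename the handler; builtins.set still reaches the original:
unique = builtins.set(["a", "b", "a"])
assert unique == {"a", "b"}
```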

**Modified endpoints (all backward-compatible):**
- `POST /v1/retrieval` (Public SDK)
- `POST /v1/searchbots/retrieval_test` (Searchbots)
- `POST /v1/chunk/retrieval_test` (UI/Internal)
- Chat completions endpoints (via `extra_body.reference_metadata` or
`prompt_config`)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)


### Images

<img width="879" height="1275" alt="image"
src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb"
/>
<br><br>
<br><br>

<img width="1532" height="362" alt="image"
src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e"
/>
<br><br>
<br><br>

<img width="2586" height="1320" alt="image"
src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776"
/>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
2026-04-30 23:13:27 +08:00
05ee7f8bb6 Fix: remove delete_documents uuid validation (#14533)
### What problem does this PR solve?

remove delete_documents uuid validation

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 18:56:33 +08:00
4ee0702aed Feat: add skills space to context engine (#13908)
### What problem does this PR solve?

issue #13714

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 12:36:03 +08:00
06c6da5d94 Fix: add document delete permission check (#14472)
### What problem does this PR solve?

add document delete permission check

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:01:09 +08:00
47129fdd08 Fix: optimize file batch delete (#14473)
### What problem does this PR solve?

optimize file batch delete

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:00:39 +08:00
6dd38eca6a fix: file logs not displayed in dataset ingestion page (#14479)
### What problem does this PR solve?

## Summary

Fixed a bug where the **File Logs** tab in the dataset ingestion page
always showed "No logs" even after files were parsed successfully.

## Root Cause

Both the **File Logs** and **Dataset Logs** tabs on the frontend called
the same backend endpoint `/datasets/{dataset_id}/ingestions`. However,
the backend only queried `get_dataset_logs_by_kb_id`, which
hard-filtered records by `document_id == GRAPH_RAPTOR_FAKE_DOC_ID`
(dataset-level logs). As a result, real file-level logs were never
returned, causing the table to appear empty.

## Changes

### Backend

- **`api/apps/restful_apis/dataset_api.py`**
  - Added two new query parameters to `list_ingestion_logs`:
    - `log_type` — `"file"` or `"dataset"` (default: `"dataset"`)
    - `keywords` — search keyword for filtering by document / task name

- **`api/apps/services/dataset_api_service.py`**
  - Updated `list_ingestion_logs` signature to accept `log_type` and `keywords`.
  - Added conditional routing:
    - When `log_type == "file"`, call `PipelineOperationLogService.get_file_logs_by_kb_id`
    - Otherwise, call `PipelineOperationLogService.get_dataset_logs_by_kb_id`

- **`api/db/services/pipeline_operation_log_service.py`**
  - Extended `get_dataset_logs_by_kb_id` with an optional `keywords` parameter so dataset logs can also be searched.

### Frontend

- **`web/src/pages/dataset/dataset-overview/hook.ts`**
  - Removed the separate API function switching (`listPipelineDatasetLogs` vs `listDataPipelineLogDocument`).
  - Unified both tabs to call `listDataPipelineLogDocument` with the new `log_type` query parameter (`"file"` or `"dataset"`).
  - Ensured `keywords` and filter values are passed through correctly.

## Behavior After Fix

| Tab | `log_type` | Returned Records | Searchable Field |
|---|---|---|---|
| File Logs | `file` | Real document-level logs | `document_name` (file name) |
| Dataset Logs | `dataset` | GraphRAG / RAPTOR / MindMap logs | `document_name` (task type) |

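
The routing described above can be sketched in a few lines (the function names follow the PR text; the stub query bodies are placeholders, not the real database queries):

```python
def get_file_logs_by_kb_id(kb_id, keywords=""):
    # Placeholder for PipelineOperationLogService.get_file_logs_by_kb_id.
    return [{"document_name": "report.pdf", "kb_id": kb_id}]

def get_dataset_logs_by_kb_id(kb_id, keywords=""):
    # Placeholder for PipelineOperationLogService.get_dataset_logs_by_kb_id.
    return [{"document_name": "GraphRAG", "kb_id": kb_id}]

def list_ingestion_logs(kb_id, log_type="dataset", keywords=""):
    # File-level requests go to the file-log query; everything else keeps
    # the old dataset-level behavior, so existing callers are unaffected.
    if log_type == "file":
        return get_file_logs_by_kb_id(kb_id, keywords=keywords)
    return get_dataset_logs_by_kb_id(kb_id, keywords=keywords)

assert list_ingestion_logs("kb1", log_type="file")[0]["document_name"] == "report.pdf"
assert list_ingestion_logs("kb1")[0]["document_name"] == "GraphRAG"
```
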
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>
2026-04-29 22:10:24 +08:00
5018459112 Fix metadata config (#14480)
### What problem does this PR solve?

Fix metadata config

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 21:09:54 +08:00
a0f9ae16d2 Fix: RAPTOR "Generation scope" reset to "Single file" when selecting "Dataset" (#14477)
## Problem
In the Dataset Configuration page, changing the RAPTOR **Generation
scope** from "Single file" to "Dataset" and clicking **Save** did not
persist the change. After refreshing or re-entering the page, the scope
always reverted to "Single file".

## Root Cause
1. **Backend**: The `RaptorConfig` Pydantic model in
`api/utils/validation_utils.py` was configured with `extra="forbid"` but
did not declare a `scope` field. When the frontend sent `"scope":
"dataset"`, Pydantic rejected the request.
2. **Frontend**: The `extractRaptorConfigExt` utility in
`web/src/hooks/parser-config-utils.ts` treated `scope` as an unknown
field and moved it into the nested `ext` object. Consequently, the
backend could not read `raptor_config.get("scope", "file")` correctly,
so the default `"file"` was always used.

## Changes
- Added `scope: Literal["file", "dataset"]` to the backend
`RaptorConfig` model with a default of `"file"`.
- Added `scope` to the known-field whitelist in the frontend
`extractRaptorConfigExt` helper so it is transmitted as a top-level
raptor field instead of being buried in `ext`.
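
The frontend helper is TypeScript, but the known-field-whitelist idea can be sketched in Python. Only `scope` is confirmed by this PR; the other field names below are illustrative assumptions:

```python
# Known top-level raptor fields; only "scope" is confirmed by the PR,
# the rest are illustrative stand-ins for the real whitelist.
KNOWN_RAPTOR_FIELDS = {"use_raptor", "prompt", "max_token", "threshold", "scope"}

def extract_raptor_config_ext(raw):
    # Keep whitelisted keys at the top level; move everything else to ext.
    known = {k: v for k, v in raw.items() if k in KNOWN_RAPTOR_FIELDS}
    ext = {k: v for k, v in raw.items() if k not in KNOWN_RAPTOR_FIELDS}
    if ext:
        known["ext"] = ext
    return known

cfg = extract_raptor_config_ext({"scope": "dataset", "use_raptor": True, "custom": 1})
assert cfg["scope"] == "dataset"    # stays top-level, so the backend can read it
assert cfg["ext"] == {"custom": 1}  # unknown fields still move to ext
```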

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-29 18:46:28 +08:00
6afb1957d8 Fix query param type (#14471)
### What problem does this PR solve?

Fix query param type

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 16:53:28 +08:00
b684c89950 Add backward compat APIs (#14427)
### What problem does this PR solve?

Add backward compat APIs.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 15:15:49 +08:00
ce933357c6 Fix: Dataset: When configuring the "general chunk method," options such as chunk size and parent-child slicing are unavailable. (#14459)
### What problem does this PR solve?

Fix: Dataset: When configuring the "general chunk method," options such
as chunk size and parent-child slicing are unavailable.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: balibabu <assassin_cike@163.com>
2026-04-29 14:37:48 +08:00
35f6d81b73 Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402)
### What problem does this PR solve?

## Summary

Migrate two web API endpoints to REST-style HTTP API endpoints,
following the pattern established in #14222:

| Old Endpoint | New Endpoint |
|---|---|
| `POST /v1/chunk/retrieval_test` | `POST /api/v1/datasets/<dataset_id>/search` |
| `GET /v1/chunk/knowledge_graph` | `GET /api/v1/datasets/<dataset_id>/graph` |
2026-04-28 20:00:26 +08:00
85575259ac Fix: google authentication - gmail && google-drive (#14422)
### What problem does this PR solve?

Fix: google authentication - gmail && google-drive

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 18:09:02 +08:00
18fbfafca6 Feat: enable sync deleted files for more connectors (#14353)
### What problem does this PR solve?

Feat: enable sync deleted files for more connectors

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 15:07:14 +08:00
5885691c68 Always return success if no such task id (#14417)
### What problem does this PR solve?

Always return success when there is no such task ID, to follow the existing code logic.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 12:55:24 +08:00
c81081f8ef Refactor: Doc change parser (#14327)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_parser
HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents

After consolidation, Restful API
PATCH /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-27 23:42:57 +08:00
c5116b90e5 Refactor: migrate document thumbnails API (#14344)
### What problem does this PR solve?

Before migration: GET /v1/document/thumbnails
After migration:  GET /api/v1/thumbnails

### Type of change

- [x] Refactoring
2026-04-27 21:29:09 +08:00
49912a156e Refactor: migrate document run api (#14351)
### What problem does this PR solve?

Before migration: POST /v1/document/run
After migration: POST /api/v1/documents/ingest/

### Type of change

- [x] Refactoring
2026-04-27 21:25:58 +08:00
343bda1119 Refactor: deco document upload_and_parse API (#14366)
### What problem does this PR solve?

Remove the unused `POST /v1/document/upload_and_parse` endpoint.

### Type of change

- [x] Refactoring
2026-04-27 20:35:00 +08:00
a536980e22 Refactor: Doc batch change status (#14337)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_status

After consolidation, Restful API
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status 

### Type of change

- [x] Refactoring
2026-04-27 20:00:23 +08:00
488c3ef6a3 Add task API (#14393)
### What problem does this PR solve?

Add task API

### Type of change

- [x] Refactoring
2026-04-27 19:16:37 +08:00
c1941fd503 Refactor: deco doc-parse API that is not used any more (#14367)
### What problem does this PR solve?

Delete the unused `POST /v1/document/parse` API.

### Type of change

- [x] Refactoring
2026-04-27 18:54:49 +08:00
0f2778efe7 Fix: support release in agent update api (#14396)
### What problem does this PR solve?

support release in agent update api

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 17:35:35 +08:00
61a24a2c14 Refactor: migrate doc upload info used in chat (#14359)
### What problem does this PR solve?

Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documents/upload/

### Type of change

- [x] Refactoring
2026-04-27 16:58:42 +08:00
d88f7ac8d2 Remove evaluation_app.py and kb_app.py (#14394)
### What problem does this PR solve?

Delete unused APIs.

### Type of change

- [x] Refactoring
2026-04-27 16:08:54 +08:00
290f0294d6 Refactor: migrate artifact API (#14348)
### What problem does this PR solve?

Before migration: GET /v1/document/artifact/<filename>
After migration:  GET /api/v1/documents/artifact/<filename>

### Type of change

- [x] Refactoring
2026-04-27 15:19:41 +08:00
2846a93998 Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports

Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000                        # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
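
The bug pattern is easy to reproduce in miniature: a wrapper that calls a paged function without forwarding the caller's range silently falls back to the restrictive default. A simplified sketch (not the actual parser code):

```python
MAXIMUM_PAGE_NUMBER = 100000  # sentinel meaning "parse all pages"

def render_pages(page_from=0, page_to=299):
    # Stands in for RAGFlowPdfParser.__images__(); pretend the PDF has
    # 1000 pages.
    return list(range(page_from, min(page_to, 1000)))

def parse_into_bboxes_before():
    return render_pages()  # no forwarding: silently capped at 299 pages

def parse_into_bboxes_after(from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
    return render_pages(from_page, to_page)  # forwards the caller's range

assert len(parse_into_bboxes_before()) == 299
assert len(parse_into_bboxes_after()) == 1000  # the whole document
```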

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 14:57:20 +08:00
0b46ab07c5 Refa: restore openai-compatible chat completions api (#14380)
### What problem does this PR solve?
restore openai-compatible chat completions api
### Type of change

- [x] Refactoring
2026-04-27 14:02:19 +08:00
f3b7d55a1e fix: handle Infinity table-not-exist error (3022) in update() methods (#14153)
### What problem does this PR solve?

## Summary

Closes #6102

When using Infinity as the document store engine (GPU version), calling
`update()` on a non-existent table throws an unhandled
`InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This
causes users to see a raw "3022" error when clicking on a parsed
document.

## Root Cause

The `update()` methods in both `rag/utils/infinity_conn.py` and
`memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)`
without catching `InfinityException`. In contrast, other CRUD methods
(`insert`, `delete`, `search`) all handle this exception gracefully:

| Method   | Handles table-not-exist? | Behavior |
|----------|--------------------------|----------|
| `insert` | Yes | Auto-creates the table |
| `search` | Yes | Skips the table |
| `delete` | Yes | Returns 0 |
| `update` | **No** | Crashes with 3022 |

Additionally, `api/apps/document_app.py` worked around this with a
fragile string match (`"3022" in msg`) to detect the error.

## Changes

- **`rag/utils/infinity_conn.py`**: Catch `InfinityException` in
`update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return
`False` — consistent with `delete()`.
- **`memory/utils/infinity_conn.py`**: Apply the same fix to its
`update()` method.
- **`api/apps/document_app.py`**: Remove the fragile `"3022"`
string-matching workaround. Table-not-exist is now handled by the `if
not ok` path with an improved error message.
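
The `update()` guard can be sketched as follows, with a stand-in exception class (the real code catches the Infinity SDK's `InfinityException` and its 3022 error code):

```python
import logging

TABLE_NOT_EXIST = 3022

class InfinityException(Exception):
    # Stand-in for the Infinity SDK exception; carries an error code.
    def __init__(self, error_code, msg):
        super().__init__(msg)
        self.error_code = error_code

def get_table(name):
    # Simulates db_instance.get_table() on a table that does not exist.
    raise InfinityException(TABLE_NOT_EXIST, f"table {name} does not exist")

def update(table_name, row):
    try:
        table = get_table(table_name)
    except InfinityException as e:
        if e.error_code == TABLE_NOT_EXIST:
            # Log a warning and return False, consistent with delete().
            logging.warning("update() on missing table %s", table_name)
            return False
        raise
    # The real code would apply the update through `table` here.
    return True

assert update("ragflow_chunks_xyz", {"id": 1}) is False
```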

### Type of change

- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 11:52:22 +08:00
a9e5724b46 Refa: unify document create flows under REST documents API (#14345)
### What problem does this PR solve?

unify document create flows under REST documents API

### Type of change

- [x] Refactoring
2026-04-27 10:18:16 +08:00
4dcc42e0e1 feat(api): add unified index API and dataset management endpoints (#14222)
### What problem does this PR solve?

## Summary

Refactor the dataset API layer into a clean service/REST separation
pattern, add a unified `/index` API for graph/raptor/mindmap operations,
and introduce several new dataset management endpoints with full test
coverage.

## Changes

### Service Layer (`dataset_api_service.py`)

- Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace
function for all index types
- Added `run_index`, `delete_index` service functions
- Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`,
`get_ingestion_log`
- Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`,
`rename_tag`
- Added `get_flattened_metadata`, `get_auto_metadata`,
`update_auto_metadata`

### REST API Layer (`dataset_api.py`)

**New unified routes:**

| Method | Route | Description |
|--------|-------|-------------|
| POST | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Run index task |
| GET | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Trace index task |
| DELETE | `/datasets/<id>/<index_type>` | Delete index |
| GET | `/datasets/<id>` | Get dataset details |
| GET | `/datasets/<id>/ingestions/summary` | Ingestion summary |
| GET | `/datasets/<id>/ingestions` | List ingestion logs |
| GET | `/datasets/<id>/ingestions/<log_id>` | Get single ingestion log |
| POST | `/datasets/<id>/embedding` | Run embedding |
| GET | `/datasets/<id>/tags` | List tags |
| GET | `/datasets/tags/aggregation` | Aggregate tags across datasets |
| DELETE | `/datasets/<id>/tags` | Delete tags |
| PUT | `/datasets/<id>/tags` | Rename tag |
| GET | `/datasets/metadata/flattened` | Get flattened metadata |
| GET/PUT | `/datasets/<id>/metadata/config` | New metadata config path |

**Removed routes (replaced by unified `/index`):**

- `POST /datasets/<id>/mindmap`
- `GET /datasets/<id>/mindmap`

**Preserved legacy routes (backward compatibility):**

- `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor`
- `/auto_metadata` GET/PUT

### Test Suite

- Updated `common.py` helpers: added `trace_index`, removed
`run_mindmap`/`trace_mindmap`
- Added 7 new test files with 39 test cases total:

| Test File | Cases |
|-----------|-------|
| `test_get_dataset.py` | 4 |
| `test_ingestion_summary.py` | 2 |
| `test_ingestion_logs.py` | 5 |
| `test_index_api.py` | 14 |
| `test_embedding.py` | 2 |
| `test_tags.py` | 8 |
| `test_flattened_metadata.py` | 4 |

- Deleted `test_mindmap_tasks.py` (covered by unified index tests)

## Design Decisions

1. **Unified `/index?type=...`** — single endpoint replaces 3 separate
route pairs for graph/raptor/mindmap
2. **Backward compatibility** — old routes (`/run_graphrag`,
`/run_raptor`, `/auto_metadata`) preserved alongside new paths
3. **`_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}`** — input
validation via constant set
4. **`_INDEX_TYPE_TO_TASK_ID_FIELD`** — maps index type to KB model task
ID field for clean dispatch
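
Design decisions 3 and 4 combine into a small validate-then-dispatch pattern. The two constant names come from the PR; the task-id field names below are illustrative assumptions:

```python
_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}

# Maps each index type to the KB-model field holding its task id; the
# field names here are assumptions, not confirmed by the PR.
_INDEX_TYPE_TO_TASK_ID_FIELD = {
    "graph": "graphrag_task_id",
    "raptor": "raptor_task_id",
    "mindmap": "mindmap_task_id",
}

def trace_index(dataset_id, tenant_id, index_type):
    # Validate against the constant set, then dispatch on the mapping.
    if index_type not in _VALID_INDEX_TYPES:
        raise ValueError(f"invalid index type: {index_type!r}")
    return _INDEX_TYPE_TO_TASK_ID_FIELD[index_type]

assert trace_index("ds1", "tenant1", "raptor") == "raptor_task_id"
```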

## Files Changed

- `api/apps/restful_apis/dataset_api.py`
- `api/apps/services/dataset_api_service.py`
- `sdk/python/ragflow_sdk/modules/dataset.py`
- `test/testcases/test_http_api/common.py`
- `test/testcases/test_http_api/test_dataset_management/` (7 new files)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 09:38:01 +08:00
fb95136f39 Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090)
### What problem does this PR solve?

The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.

This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.
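
The guard described above can be sketched with the standard library. This is a simplified version under assumptions, not the actual `_validate_url_for_crawl` implementation:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url_for_crawl(url):
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # allowlist of safe schemes
    if not parsed.hostname:
        return False
    try:
        # Resolve at validation time so the checked IPs match DNS answers.
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable host
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Rejects loopback, RFC 1918, link-local (incl. 169.254.169.254),
        # and other reserved ranges.
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

assert validate_url_for_crawl("ftp://example.com/") is False
assert validate_url_for_crawl("http://127.0.0.1/admin") is False
assert validate_url_for_crawl("http://169.254.169.254/latest/meta-data") is False
```

A production guard would also pin the validated IP for the subsequent request; validating and then re-resolving leaves a DNS-rebinding window.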

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-25 14:30:15 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install `opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g. `==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g. `docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` / `external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp dir used and cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection patterns from output |

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling          45.29       148      10
text-heavy.pdf            opendataloader    3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader    3.71        90       0
complex.pdf               docling          42.67       114       8
complex.pdf               opendataloader    3.51       180       0
```
2026-04-25 00:33:02 +08:00
beb2406b86 Fix: allow use image2text as chat model (#14331)
### What problem does this PR solve?

Allow image2text models (multimodal) to be used as chat models.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 17:58:25 +08:00
9ad752f497 Refa:migrate agent webhook routes to REST APIs (#14330)
### What problem does this PR solve?

migrate agent webhook routes to REST APIs

### Type of change
- [x] Refactoring
2026-04-24 17:55:53 +08:00
1473000135 Implement retrieval_test in GO (#14231)
### What problem does this PR solve?

Implement retrieval_test in GO

### Type of change

- [x] Refactoring
2026-04-24 15:30:14 +08:00
199fbceb72 Refactor user REST API (#14334)
### What problem does this PR solve?
Refactor user REST API

### Type of change
- [x] Refactoring
2026-04-24 10:25:15 +08:00
c41b5e8a5d fix: migrate Langfuse integration from start_generation to start_obse… (#14205)
The Langfuse Python SDK v3+ removed `start_generation()` method.
RagFlow's code called this non-existent method, causing AttributeError
when Langfuse tracing is enabled.

Replace all `start_generation()` calls with
`start_observation(as_type="generation")` which is the correct v4 SDK
API.

Affected files:
- api/db/services/llm_service.py (12 occurrences)
- api/db/services/dialog_service.py (1 occurrence)

Fixes #14204
Related to #9243


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 10:03:57 +08:00
c74aece63c Feat: Agent api (#14157)
### What problem does this PR solve?

1. **List agents**  
   **Prev API**:  
   - `/v1/canvas/list GET`  
   - `/api/v1/agents GET`  
   **Current API**: `/api/v2/agents GET`

2. **Get canvas template**  
   **Prev API**: `/v1/canvas/templates GET`  
   **Current API**: `/api/v2/agents/templates GET`

3. **Delete an agent**  
   **Prev API**: 
    - `/v1/canvas/rm POST`  
    - `/api/v1/agents/<agent_id> DELETE`
   **Current API**: `/api/v2/agents/<agent_id> DELETE`

4. **Update an agent**  
   **Prev API**: 
    - `/api/v1/agents/<agent_id> PUT`   
    - `/v1/canvas/setting POST `
   **Current API**: `/api/v2/agents/<agent_id> PATCH`


5. **Create an agent**  
   **Prev API**: 
    - `/v1/canvas/set POST`  
    - `/api/v1/agents POST`
   **Current API**: `/api/v2/agents POST`


6. **Get an agent**  
   **Prev API**: 
    - `/v1/canvas/get/<canvas_id> GET `  
   **Current API**: `/api/v2/agents/<agent_id> GET`


7. **Reset an agent**  
   **Prev API**: 
    - `/v1/canvas/reset POST`  
   **Current API**: `/api/v2/agents/<agent_id>/reset POST`


8. **Upload a file to an agent**  
   **Prev API**: 
    - `/v1/canvas/upload/<canvas_id> POST`  
   **Current API**: `/api/v2/agents/<agent_id>/upload POST`


9. **Input form**  
   **Prev API**: 
    - `/v1/canvas/input_form GET`  
**Current API**:
`/api/v2/agents/<agent_id>/components/<component_id>/input-form GET`


10. **Debug an agent**  
   **Prev API**: 
    - `/v1/canvas/debug POST`  
**Current API**:
`/api/v2/agents/<agent_id>/components/<component_id>/debug POST`


11. **Trace an agent**  
   **Prev API**: 
    - `/v1/canvas/trace GET`  
   **Current API**: `/api/v2/agents/<agent_id>/logs/<message_id> GET`


12. **Get an agent version list**  
   **Prev API**: 
    - `/v1/canvas/getlistversion/<canvas_id>`  
   **Current API**: `/api/v2/agents/<agent_id>/versions GET`


13. **Get a version of agent**  
   **Prev API**: 
    - `/v1/canvas/getversion/<version_id>`  
**Current API**: `/api/v2/agents/<agent_id>/versions/<version_id> GET`


14. **Test db connection**  
   **Prev API**: 
    - `/v1/canvas/test_db_connect POST`  
   **Current API**: `/api/v2/agents/test_db_connection POST`


15. **Rerun the agent**  
   **Prev API**: 
    - `/v1/canvas/rerun POST`  
   **Current API**: `/api/v2/agents/rerun POST`


16. **Get prompts**  
   **Prev API**: 
    - `/v1/canvas/prompts GET`  
   **Current API**: `/api/v2/agents/prompts GET`

### Type of change
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chanx <1243304602@qq.com>
2026-04-24 10:02:22 +08:00
d4fa57311c Refa: remove legacy MCP server web API (#14322)
### What problem does this PR solve?

remove legacy MCP server web API

### Type of change

- [x] Refactoring
2026-04-23 19:01:22 +08:00
4458763a93 API refactor: stats_api and plugin_api (#14324)
### What problem does this PR solve?

API refactor: stats_api and plugin_api

### Type of change

- [x] Refactoring
2026-04-23 17:16:04 +08:00
7817b0d779 Refa: migrate chunk APIs to RESTful routes (#14291)
### What problem does this PR solve?

migrate chunk APIs to RESTful routes

### Type of change
- [x] Refactoring
2026-04-23 14:17:23 +08:00
76b017ca32 Refact: system apis (#14298)
### What problem does this PR solve?
Refact: system apis

### Type of change

- [x] Refactoring
2026-04-23 14:09:42 +08:00
57f527eb02 Add missing timeout to ragflow server health check (#14311)
### What problem does this PR solve?

`check_ragflow_server_alive()` in `api/utils/health_utils.py` calls
`requests.get(url)` without a `timeout` parameter. Unlike
`check_minio_alive()` which correctly specifies `timeout=10`, this
health check can hang indefinitely if the server is unresponsive.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Changes

Added `timeout=10` to the `requests.get()` call, consistent with
`check_minio_alive()`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 14:08:52 +08:00
aa4526266f Refa: migrate MCP APIs to RESTful api (#14317)
### What problem does this PR solve?

migrate MCP APIs to RESTful api

### Type of change

- [x] Refactoring
2026-04-23 12:51:27 +08:00
dbf8c6ed90 Refactor: Doc metadata update (#14289)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/metadata/update

After migration, Restful API
PATCH /api/v2/datasets/<dataset_id>/documents/metadatas 

### Type of change

- [x] Refactoring
2026-04-23 12:04:34 +08:00
aae45b959b Refactor: API file2document (#14306)
Refactor: API file2document
2026-04-23 11:40:45 +08:00
e79b896637 Refactor: REST API langfuse api-key (#14315)
REST API langfuse api-key
2026-04-23 11:36:16 +08:00
01753b8f31 Refactor: API connectors (#14228)
### What problem does this PR solve?

Refactor /api/v1/connectors to be more RESTful.

### Type of change
- [x] Refactoring
2026-04-22 20:42:41 +08:00
c08cd8e090 Refactor: Migrate document metadata config update API (#14286)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/update_metadata_setting

After consolidation, Restful API
PUT
/api/v1/datasets/<dataset_id>/documents/<document_id>/metadata/config

### Type of change

- [x] Refactoring
2026-04-22 20:01:31 +08:00