Commit Graph

5922 Commits

Author SHA1 Message Date
2d522ccb36 Fix: thumbnails issue in chat (#14415)
### What problem does this PR solve?

In chat, thumbnails did not display correctly.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)

Steps to reproduce:
1. Create a dataset and upload a file (see attached).
2. Parse the document.
3. Once parsing has completed, create a chat and associate it with the
dataset.
4. Ask a question (DAP vs. DAPE comparison).
5. Check the result.
2026-04-28 11:39:29 +08:00
0cf105da8d Doc: Added a database schema and migration guide. (#14404)
### What problem does this PR solve?

Added a database schema and migration guide.

### Type of change


- [x] Documentation Update
2026-04-28 09:54:33 +08:00
c81081f8ef Refactor: Doc change parser (#14327)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_parser
HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents

After consolidation, RESTful API
PATCH /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-27 23:42:57 +08:00
872ff08304 Fix: add executor.shutdown (#14403)
### What problem does this PR solve?

Add executor shutdown in finally clause to free resources.
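
A minimal sketch of the pattern described above, assuming a standard-library `ThreadPoolExecutor`. The function name `run_tasks` and its arguments are hypothetical and are not taken from the code changed in this PR.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(funcs, max_workers=4):
    """Run callables on a pool and always release the worker threads."""
    executor = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [executor.submit(f) for f in funcs]
        # result() re-raises any exception raised inside a worker.
        return [fut.result() for fut in futures]
    finally:
        # Runs on success and on error, so threads are not leaked.
        executor.shutdown(wait=True)
```

Placing `shutdown` in `finally` means the pool is released even when one of the tasks raises.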

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 22:38:43 +08:00
c5116b90e5 Refactor: migrate document thumbnails API (#14344)
### What problem does this PR solve?

Before migration: GET /v1/document/thumbnails
After migration:  GET /api/v1/thumbnails

### Type of change

- [x] Refactoring
2026-04-27 21:29:09 +08:00
49912a156e Refactor: migrate document run api (#14351)
### What problem does this PR solve?

Before migration: POST /v1/document/run
After migration: POST /api/v1/documents/ingest/

### Type of change

- [x] Refactoring
2026-04-27 21:25:58 +08:00
965717c4fb Go: add new provider: google (#14395)
### What problem does this PR solve?

As title.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-27 20:35:47 +08:00
343bda1119 Refactor: deco document upload_and_parse API (#14366)
### What problem does this PR solve?

Remove the unused `POST /v1/document/upload_and_parse` endpoint.

### Type of change

- [x] Refactoring
2026-04-27 20:35:00 +08:00
d78013964a tests: add missing HTTP API tests for dataset management endpoints removed in #14222 (#14390)
### What problem does this PR solve?

### Summary

PR #14222 consolidated KB (web) API endpoints into RESTful Dataset
(HTTP) API endpoints and deleted the web API test suite under
`test_web_api/test_kb_app/` and `test_web_api/test_document_app/`. While
most test coverage was migrated to the HTTP API test suite, some tests
were not ported over. This PR adds back the missing coverage.

### Route migration reference

| Old Web API | New HTTP API | Missing tests |
|---|---|---|
| `POST /v1/kb/update_metadata_setting` | `PUT /api/v1/datasets/<id>/metadata/config` | auth & error paths |
| `GET /api/v1/datasets/<id>/auto_metadata` | `GET /api/v1/datasets/<id>/metadata/config` | auth & CRUD |
| `PUT /api/v1/datasets/<id>/auto_metadata` | `PUT /api/v1/datasets/<id>/metadata/config` | auth & CRUD |
| `GET /v1/kb/<kb_id>/basic_info` | `GET /api/v1/datasets/<id>/ingestions/summary` | covered |
| `POST /v1/kb/list_pipeline_logs` | `GET /api/v1/datasets/<id>/ingestions` | edge cases missing |

### Changes

#### `test_file_management_within_dataset/test_metadata_config.py` (new, 10 tests)

Covers `GET/PUT /datasets/<id>/metadata/config` (migrated from
`test_kb_tags_meta.py`'s `test_update_metadata_setting` and
`test_document_metadata.py`'s negative tests):
- Authorization for dataset metadata config GET/PUT
- Authorization for document metadata config PUT
- Success, invalid dataset, missing payload, not found scenarios

#### `test_dataset_management/test_ingestion_logs.py` (extended, +2 tests)

Covers `GET /datasets/<id>/ingestions` edge cases (migrated from
`test_kb_pipeline_tasks.py`):
- Missing dataset ID
- Abnormal date filter

### Type of change

- [x] Other: Test coverage improvement

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 20:01:28 +08:00
a536980e22 Refactor: Doc batch change status (#14337)
### What problem does this PR solve?

Before migration
Web API: POST /v1/document/change_status

After consolidation, RESTful API
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status 

### Type of change

- [x] Refactoring
2026-04-27 20:00:23 +08:00
c949096db0 Refactor: optimize agent reset conversation variable defaults (#14401)
### What problem does this PR solve?
optimize agent reset conversation variable defaults
### Type of change
- [x] Refactoring
2026-04-27 19:57:56 +08:00
488c3ef6a3 Add task API (#14393)
### What problem does this PR solve?

Add task API

### Type of change

- [x] Refactor
2026-04-27 19:16:37 +08:00
82313020c7 Refa: align list operations and strict mode (#14387)
### What problem does this PR solve?

align list operations and strict mode

### Type of change
- [x] Refactoring
2026-04-27 19:13:00 +08:00
c1941fd503 Refactor: deco doc-parse API that is not used any more (#14367)
### What problem does this PR solve?

Delete the unused API `POST /v1/document/parse`.

### Type of change

- [x] Refactoring
2026-04-27 18:54:49 +08:00
4f6651968a Fix: prioritize explore session ID and reset default conversation variables (#14399)
### What problem does this PR solve?

Prioritize the explore session ID and reset default conversation variables.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 18:52:40 +08:00
10e28e5c5f Helm template ragflow.yaml: fix nginx-config-volume mountPath according to Dockerfile v0.25.0 (#14361)
### What problem does this PR solve?

Dockerfile v0.25.0 expects nginx conf at path
/etc/nginx/ragflow.conf.python, see
[Dockerfile#L200](ca01c7a745/Dockerfile (L200))
However, the current Helm template mounts the conf at
/etc/nginx/ragflow.conf, causing a runtime error at startup.

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Mauro Gattari <mauro.gattari@infn.it>
2026-04-27 18:51:55 +08:00
0f2778efe7 Fix: support release in agent update api (#14396)
### What problem does this PR solve?

support release in agent update api

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 17:35:35 +08:00
61a24a2c14 Refactor: migrate doc upload info used in chat (#14359)
### What problem does this PR solve?

Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documents/upload/

### Type of change

- [x] Refactoring
2026-04-27 16:58:42 +08:00
c446c403de perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)
## Summary

- **Lazy img_np loading**: `np.array(img)` is now deferred until the
first OCR text extraction is actually needed, avoiding unnecessary
memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via
`PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's
boxes are normalized with `_to_global_boxes` to produce globally
consistent page numbers and position tags.
- **DLA early init**: Move remote-client initialization before model
loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy
`TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser
containers relying on remote inference.
- **Fix outline regression**: Restore `self.outlines =
extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped
during refactoring and is required by downstream `remove_toc` and
metadata handling in `rag/flow/parser/parser.py`.
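
A rough sketch of the batching idea in the second bullet, under stated assumptions: the helper names `iter_page_batches` and `to_global_boxes`, and the `page_number` key on each box, are hypothetical stand-ins and not the actual `_to_global_boxes` implementation.

```python
def iter_page_batches(total_pages, batch_size=50):
    """Yield (start, end) half-open page ranges covering total_pages."""
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def to_global_boxes(boxes, page_offset):
    """Shift batch-local page numbers so they are valid for the whole PDF."""
    return [{**box, "page_number": box["page_number"] + page_offset} for box in boxes]


all_boxes = []
for start, end in iter_page_batches(120, batch_size=50):
    # Stand-in for parsing one batch: one box per page, numbered from 0 in the batch.
    local_boxes = [{"page_number": i} for i in range(end - start)]
    all_boxes.extend(to_global_boxes(local_boxes, start))
```

Each batch numbers its pages from zero, so adding the batch's start offset is what keeps page numbers consistent across the whole document.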

## Test plan

- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines`
is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally
consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local
model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 16:52:43 +08:00
4303be223f Fix metadata parsing regression for upgraded v0.24 datasets (#14383)
### What problem does this PR solve?

This PR fixes issue #14371 where file parsing failed after upgrading
from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema
object but was handled like a list and later caused `KeyError:
'properties'`.
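
To illustrate the failure mode, here is a minimal sketch that accepts both shapes. The list layout (items with a `key` field) and the function name `metadata_field_names` are assumptions made for illustration; the PR text only states that the config may be a JSON Schema object or a list.

```python
def metadata_field_names(config):
    """Return field names from either assumed metadata config shape."""
    if isinstance(config, dict):
        # JSON Schema style: {"type": "object", "properties": {...}}.
        # .get() avoids a KeyError when "properties" is absent.
        return list(config.get("properties", {}).keys())
    if isinstance(config, list):
        # Assumed list style: [{"key": "author", ...}, ...]
        return [item["key"] for item in config if isinstance(item, dict) and "key" in item]
    return []
```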

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 16:18:06 +08:00
d88f7ac8d2 Remove evaluation_app.py and kb_app.py (#14394)
### What problem does this PR solve?

Delete not used APIs

### Type of change

- [x] Refactoring
2026-04-27 16:08:54 +08:00
290f0294d6 Refactor: migrate artifact API (#14348)
### What problem does this PR solve?

Before migration: GET /v1/document/artifact/<filename>
After migration:  GET /api/v1/documents/artifact/<filename>

### Type of change

- [x] Refactoring
2026-04-27 15:19:41 +08:00
2846a93998 Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports

Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
    ...  # body elided
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000                        # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
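
A simplified sketch of the forwarding fix, not the real parser. `SketchPdfParser` and `render_pages` are hypothetical names, and treating `page_to` as an exclusive bound is an assumption of this sketch.

```python
MAXIMUM_PAGE_NUMBER = 100000  # value quoted in this PR


class SketchPdfParser:
    """Hypothetical stand-in; not the real RAGFlowPdfParser."""

    def __init__(self, total_pages):
        self.total_pages = total_pages

    def render_pages(self, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
        # Clamp the requested range to the pages that actually exist.
        return list(range(page_from, min(page_to, self.total_pages)))

    def parse_into_bboxes(self, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
        # Forward the caller's range instead of relying on the callee's default.
        return self.render_pages(page_from=from_page, page_to=to_page)
```

With the range forwarded explicitly, a 1200-page document is no longer cut off by whatever default the inner method happens to declare.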

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 14:57:20 +08:00
c3eac4103a Go: aliyun model provider (#14379)
### What problem does this PR solve?

As title.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-27 14:53:33 +08:00
0b46ab07c5 Refa: restore openai-compatible chat completions api (#14380)
### What problem does this PR solve?
restore openai-compatible chat completions api
### Type of change

- [x] Refactoring
2026-04-27 14:02:19 +08:00
6a23dfeec1 chore(CLAUDE.md): add shared UI component lock convention to CLAUDE.md (#14381)
### What problem does this PR solve?

AI coding agents (Claude, Copilot, etc.) tend to directly edit files in
`src/components/ui/` when asked to tweak styles or add props, treating
them like ordinary feature code. This silently breaks the shared
component library that both shadcn primitives and project-authored
common components live in.

This PR adds a `Shared UI Component Lock` convention to `web/CLAUDE.md`
to instruct AI agents to treat the entire `src/components/ui/` directory
as read-only. Any customization must be done via wrappers or composition
outside the directory; exceptions require explicit user approval.

### Type of change
- [x] Other (please describe): Update `CLAUDE.md`
2026-04-27 12:03:32 +08:00
0d87cecae2 feat: persist PDF bookmark outline as document metadata (#13287)
## Summary

PDF files often contain a bookmark/outline tree (table of contents built
into the file by the authoring tool). RAGFlow's `pdf_parser.outlines`
already extracts these `(title, depth)` tuples via pypdf, but they are
used ephemerally during chunking (`manual` parser uses them for
hierarchy detection) and then discarded.

This PR persists the outline as `doc.meta_fields["outline"]` — a JSON
array of `{"title": str, "depth": int}` objects — so downstream features
can use the structural information.

### Why this matters

- **Complementary to `toc_extraction`** — the existing `toc_extraction`
feature uses LLM calls to generate a TOC and only works for the `naive`
parser. The raw PDF outline is free (already extracted by pypdf), works
for all parsers, and captures the author's original document structure.
- **Document navigation** — frontends can render a clickable TOC from
the outline
- **Entity extraction** — the outline provides a structural map for
identifying document sections and key topics
- **Search result context** — knowing which section a chunk belongs to
helps users evaluate relevance

### Changes

| File | Change | LOC |
|------|--------|-----|
| `rag/app/naive.py` | Attach `pdf_parser.outlines` as `__outline__` on first chunk dict | ~7 |
| `rag/app/manual.py` | Same for the manual parser | ~5 |
| `rag/svr/task_executor.py` | Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` | ~12 |

### Design decisions

- **Transient key pattern**: The outline is passed from parser →
task_executor via `__outline__` on the first chunk dict, then removed
before indexing. This follows the same pattern as `metadata_obj` for
LLM-generated metadata.
- **No schema changes**: Uses the existing `meta_fields` JSON column on
the document table.
- **Graceful degradation**: If a PDF has no outline (common for scanned
docs), nothing is stored. If persistence fails, it logs a warning and
continues — parsing is not interrupted.
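
The transient-key and graceful-degradation decisions above can be sketched roughly as follows. The `save` callback is a hypothetical stand-in for the persistence call, and both function names are illustrative only.

```python
import logging


def pop_outline(chunks):
    """Remove and return the transient "__outline__" entry from the first chunk."""
    if not chunks:
        return None
    return chunks[0].pop("__outline__", None)


def persist_outline(chunks, save):
    """Strip the transient key, then try to persist; never interrupt parsing."""
    outline = pop_outline(chunks)
    if not outline:
        return False  # no outline (for example a scanned PDF): store nothing
    try:
        save(outline)  # `save` is a hypothetical persistence callback
        return True
    except Exception:
        logging.warning("Failed to persist PDF outline", exc_info=True)
        return False
```

Popping the key before calling `save` ensures the chunk is clean for indexing even when persistence fails.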

### Backward compatibility

- **Fully backward compatible** — no existing fields, behavior, or
schemas changed
- PDFs without outlines are unaffected
- Existing `meta_fields` data is preserved (merged, not overwritten)

## Test plan

- [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document),
verify `meta_fields["outline"]` is populated
- [ ] Parse a PDF without bookmarks, verify no errors and no outline key
in meta_fields
- [ ] Verify existing `meta_fields` data is preserved (not overwritten)
when outline is added
- [ ] Verify `manual` parser also persists outlines
- [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth":
0}, ...]`

Related: #9921 (Deterministic Document Access Layer)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuch85 <yuch85.1@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-04-27 11:57:06 +08:00
f3b7d55a1e fix: handle Infinity table-not-exist error (3022) in update() methods (#14153)
### What problem does this PR solve?

## Summary

Closes #6102

When using Infinity as the document store engine (GPU version), calling
`update()` on a non-existent table throws an unhandled
`InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This
causes users to see a raw "3022" error when clicking on a parsed
document.

## Root Cause

The `update()` methods in both `rag/utils/infinity_conn.py` and
`memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)`
without catching `InfinityException`. In contrast, other CRUD methods
(`insert`, `delete`, `search`) all handle this exception gracefully:

| Method   | Handles table-not-exist? | Behavior |
|----------|--------------------------|----------|
| `insert` | Yes | Auto-creates the table |
| `search` | Yes | Skips the table |
| `delete` | Yes | Returns 0 |
| `update` | **No** | Crashes with 3022 |

Additionally, `api/apps/document_app.py` worked around this with a
fragile string match (`"3022" in msg`) to detect the error.

## Changes

- **`rag/utils/infinity_conn.py`**: Catch `InfinityException` in
`update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return
`False` — consistent with `delete()`.
- **`memory/utils/infinity_conn.py`**: Apply the same fix to its
`update()` method.
- **`api/apps/document_app.py`**: Remove the fragile `"3022"`
string-matching workaround. Table-not-exist is now handled by the `if
not ok` path with an improved error message.
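
A minimal sketch of the handling described in the changes above. `SketchInfinityError` is a hypothetical stand-in for `InfinityException`, and the `get_table` callable and `safe_update` name are assumptions; the real SDK's exception attributes may differ.

```python
import logging

TABLE_NOT_EXIST = 3022  # error code quoted in this PR


class SketchInfinityError(Exception):
    """Hypothetical stand-in for InfinityException, carrying an error code."""

    def __init__(self, error_code, message=""):
        super().__init__(message)
        self.error_code = error_code


def safe_update(get_table, table_name, new_values):
    """Return False instead of raising when the table does not exist."""
    try:
        table = get_table(table_name)
    except SketchInfinityError as exc:
        if exc.error_code == TABLE_NOT_EXIST:
            logging.warning("Table %s does not exist; skipping update", table_name)
            return False
        raise  # any other error still propagates
    table.update(new_values)
    return True
```

Comparing against the error code, rather than searching the message for `"3022"`, is what lets the caller drop the string-matching workaround.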

### Type of change

- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 11:52:22 +08:00
33bb464ce3 fix: skip canvas SSE fetch in chat shared page to eliminate spurious 103 error (#14190)
## What does this PR do?

Fixes the `hint : 103 Only owner of canvas authorized for this
operation` error that appears when opening a **Chat** shared link
(`/chats/share?shared_id=...&from=chat`).

## Root Cause

The Chat shared page (`web/src/pages/next-chats/share/index.tsx`)
unconditionally calls `useFetchFlowSSE()`, which requests
`/api/canvas/getsse/{sharedId}`. This is an Agent Canvas endpoint that
validates canvas ownership. When sharing a **Chat** dialog (not an
Agent):

1. `sharedId` is a `dialog_id`, not a `canvas_id`
2. The API token's `tenant_id` doesn't match any canvas owner
3. The backend returns `code: 103, message: "Only owner of canvas
authorized for this operation."`
4. The global error interceptor in `request.ts` displays it as a
notification: `hint : 103 Only owner of canvas authorized for this
operation.`

## Changes

- **`web/src/hooks/use-agent-request.ts`**: Added an `enabled` parameter
to `useFetchFlowSSE` so callers can conditionally skip the query.
- **`web/src/pages/next-chats/share/index.tsx`**: Only enable
`useFetchFlowSSE` when `from === SharedFrom.Agent`. For Chat shares, the
hook is disabled, avoiding the unnecessary canvas API call entirely.

## Related Issue

Closes #14115

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 11:27:39 +08:00
3ad3241ae0 feat: persist RAPTOR layer metadata on summary chunks (#13286)
## Summary

RAPTOR's recursive clustering builds a `layers` list tracking
`(start_idx, end_idx)` boundaries per level, but currently discards this
information — only the flat `chunks` list is returned. This makes it
impossible to distinguish leaf-level summaries from top-level ones.

This PR:
- Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__`
- Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first
summary level, 2 = summary-of-summaries, etc.)
- Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch
handles it via existing `*_int` dynamic template)
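
As an illustration of how per-level boundaries could be turned into per-chunk layer numbers, here is a small sketch. It assumes half-open `(start_idx, end_idx)` ranges and that the first entry of `layers` covers the original (non-summary) chunks; the function name `build_chunk_layers` is hypothetical.

```python
def build_chunk_layers(layers):
    """Map chunk index -> layer number from per-level (start_idx, end_idx) bounds.

    Assumes half-open ranges and that layers[0] holds the original chunks,
    so the first summary level gets 1, summaries of summaries get 2, and so on.
    """
    chunk_layer = {}
    for level, (start, end) in enumerate(layers):
        for idx in range(start, end):
            chunk_layer[idx] = level
    return chunk_layer
```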

### Why this matters

Downstream features need to know which RAPTOR layer a summary belongs
to:
- **Retrieving the top-level document summary** for entity extraction,
search snippets, or document comparison
- **Filtering by abstraction level** — users may want only high-level
summaries or only leaf-level cluster summaries
- **RAPTOR recall quality** — #10951 reports summaries not being
recalled for definition queries; layer metadata enables targeted
retrieval

### Changes

| File | Change | LOC |
|------|--------|-----|
| `rag/raptor.py` | Return `(chunks, layers)` tuple | ~3 |
| `rag/svr/task_executor.py` | Build `chunk_layer` mapping, set `raptor_layer_int` | ~12 |
| `conf/infinity_mapping.json` | Add `raptor_layer_int` integer field | ~1 |

### Backward compatibility

- **Additive only** — no existing fields or behavior changed
- Existing RAPTOR chunks continue to work (they'll have
`raptor_layer_int = 0` by default)
- New RAPTOR chunks get layer metadata automatically

## Test plan

- [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is
set on indexed chunks
- [ ] Verify `raptor_layer_int` values increase with abstraction level
(layer 1 < layer 2 < ...)
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still
works
- [ ] Verify Infinity backend accepts the new field

Fixes #7488
Related: #4104, #11191, #10951

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuch85 <yuch85.1@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-04-27 10:20:46 +08:00
a9e5724b46 Refa: unify document create flows under REST documents API (#14345)
### What problem does this PR solve?

unify document create flows under REST documents API

### Type of change

- [x] Refactoring
2026-04-27 10:18:16 +08:00
4dcc42e0e1 feat(api): add unified index API and dataset management endpoints (#14222)
### What problem does this PR solve?

## Summary

Refactor the dataset API layer into a clean service/REST separation
pattern, add a unified `/index` API for graph/raptor/mindmap operations,
and introduce several new dataset management endpoints with full test
coverage.

## Changes

### Service Layer (`dataset_api_service.py`)

- Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace
function for all index types
- Added `run_index`, `delete_index` service functions
- Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`,
`get_ingestion_log`
- Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`,
`rename_tag`
- Added `get_flattened_metadata`, `get_auto_metadata`,
`update_auto_metadata`

### REST API Layer (`dataset_api.py`)

**New unified routes:**

| Method | Route | Description |
|--------|-------|-------------|
| POST | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Run index task |
| GET | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Trace index task |
| DELETE | `/datasets/<id>/<index_type>` | Delete index |
| GET | `/datasets/<id>` | Get dataset details |
| GET | `/datasets/<id>/ingestions/summary` | Ingestion summary |
| GET | `/datasets/<id>/ingestions` | List ingestion logs |
| GET | `/datasets/<id>/ingestions/<log_id>` | Get single ingestion log |
| POST | `/datasets/<id>/embedding` | Run embedding |
| GET | `/datasets/<id>/tags` | List tags |
| GET | `/datasets/tags/aggregation` | Aggregate tags across datasets |
| DELETE | `/datasets/<id>/tags` | Delete tags |
| PUT | `/datasets/<id>/tags` | Rename tag |
| GET | `/datasets/metadata/flattened` | Get flattened metadata |
| GET/PUT | `/datasets/<id>/metadata/config` | New metadata config path |

**Removed routes (replaced by unified `/index`):**

- `POST /datasets/<id>/mindmap`
- `GET /datasets/<id>/mindmap`

**Preserved legacy routes (backward compatibility):**

- `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor`
- `/auto_metadata` GET/PUT

### Test Suite

- Updated `common.py` helpers: added `trace_index`, removed
`run_mindmap`/`trace_mindmap`
- Added 7 new test files with 39 test cases total:

| Test File | Cases |
|-----------|-------|
| `test_get_dataset.py` | 4 |
| `test_ingestion_summary.py` | 2 |
| `test_ingestion_logs.py` | 5 |
| `test_index_api.py` | 14 |
| `test_embedding.py` | 2 |
| `test_tags.py` | 8 |
| `test_flattened_metadata.py` | 4 |

- Deleted `test_mindmap_tasks.py` (covered by unified index tests)

## Design Decisions

1. **Unified `/index?type=...`** — single endpoint replaces 3 separate
route pairs for graph/raptor/mindmap
2. **Backward compatibility** — old routes (`/run_graphrag`,
`/run_raptor`, `/auto_metadata`) preserved alongside new paths
3. **`_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}`** — input
validation via constant set
4. **`_INDEX_TYPE_TO_TASK_ID_FIELD`** — maps index type to KB model task
ID field for clean dispatch

## Files Changed

- `api/apps/restful_apis/dataset_api.py`
- `api/apps/services/dataset_api_service.py`
- `sdk/python/ragflow_sdk/modules/dataset.py`
- `test/testcases/test_http_api/common.py`
- `test/testcases/test_http_api/test_dataset_management/` (7 new files)
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 09:38:01 +08:00
fb95136f39 Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090)
### What problem does this PR solve?

The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.

This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.
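
A minimal sketch of this kind of guard using only the Python standard library. It is not the actual `_validate_url_for_crawl`; the name `is_safe_crawl_url` and the choice to require every resolved address to be globally routable are assumptions of this sketch.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_crawl_url(url):
    """Return True only for http(s) URLs whose resolved IPs are all public."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, parsed.port or 80)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected
    for info in infos:
        # Drop any IPv6 scope suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        # Rejects loopback, RFC 1918, link-local (169.254.0.0/16) and similar ranges.
        if not ip.is_global or ip.is_multicast:
            return False
    return bool(infos)
```

A check at validation time does not by itself pin the address used for the later request, so a production guard may also need to handle DNS changes between validation and fetch.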

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-25 14:30:15 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.
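
For orientation, the position-tag layout quoted above could be produced and parsed as in this sketch. The helper names and the one-decimal coordinate formatting are assumptions; only the `@@`, tab-separated fields, and `##` layout come from the text.

```python
import re

# Matches: @@<page> TAB x0 TAB x1 TAB y0 TAB y1 ##
TAG_RE = re.compile(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##")


def position_tag(page, x0, x1, y0, y1):
    """Build a tag with the page number followed by x0, x1, y0, y1."""
    return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(page, x0, x1, y0, y1)


def parse_position_tag(tag):
    """Return (page, (x0, x1, y0, y1)), or None if the tag is malformed."""
    match = TAG_RE.fullmatch(tag)
    if not match:
        return None
    page, *coords = match.groups()
    return int(page), tuple(float(c) for c in coords)
```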

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install `opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g. `==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g. `docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` / `external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection patterns from output |
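
As a rough illustration of how the runtime variables in the table might be read with their listed defaults, consider the sketch below. The function name and the returned dictionary keys are hypothetical, and the parser's real handling of these variables may differ.

```python
import os


def read_opendataloader_env(env=None):
    """Read the variables from the table above, applying the listed defaults."""
    env = os.environ if env is None else env
    return {
        "enabled": env.get("USE_OPENDATALOADER", "false").lower() == "true",
        "hybrid": env.get("OPENDATALOADER_HYBRID") or None,
        "image_output": env.get("OPENDATALOADER_IMAGE_OUTPUT") or None,
        "output_dir": env.get("OPENDATALOADER_OUTPUT_DIR") or None,
        # Default "1" deletes intermediate files; "0" retains them.
        "delete_output": env.get("OPENDATALOADER_DELETE_OUTPUT", "1") != "0",
        "sanitize": env.get("OPENDATALOADER_SANITIZE") == "1",
    }
```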

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling            42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00
e22cf333ed Fix: allow search id or _id (#14356)
### What problem does this PR solve?

Allow searching by `id` or `_id` when Elasticsearch (`es`) is used as the `doc_engine`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 21:38:19 +08:00
25089600d0 Feat: introduce minimum type check for pipeline (#14354)
### What problem does this PR solve?

Feat: introduce minimum type check for pipeline

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-24 21:12:50 +08:00
1c244df90d Go: add gitee and siliconflow as model provider (#14336)
### What problem does this PR solve?

As title

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-04-24 20:59:30 +08:00
e5cfe7fb8f Doc: Updated a 0.25-specific faq (#14365)
### What problem does this PR solve?

Updated a 0.25 faq.

### Type of change


- [x] Documentation Update
2026-04-24 20:57:32 +08:00
7fb6a12067 Update API document (#14364)
### What problem does this PR solve?

Update API document

### Type of change

- [ ] Documentation Update
2026-04-24 20:36:47 +08:00
3ccd58f28c Fix: The button styles in the PaddleOCR dialog are not applying correctly. (#14350)
### What problem does this PR solve?

Fix: The button styles in the PaddleOCR dialog are not applying
correctly.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Copilot <copilot@github.com>
2026-04-24 20:17:01 +08:00
1870c934c6 Refact: Updated rootAsHeadingTip (#14363)
### What problem does this PR solve?

Updated rootAsHeadingTip.

### Type of change

- [x] Documentation Update
2026-04-24 20:08:44 +08:00
ca01c7a745 Fix blob sync: skip unsupported files before download (#14357)
### What problem does this PR solve?

Blob storage sync was downloading unsupported files first and rejecting
them later, which wasted bandwidth and made sync slower. This PR skips
unsupported extensions before download and applies `allow_images` in
blob sync. Fixes #14338.
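
A small sketch of filtering by extension before download. The extension sets here are illustrative only, since the real supported list is not given in this description, and `should_download` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Illustrative sets only; the real supported-extension list is not given here.
DOC_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def should_download(key, allow_images=False):
    """Decide from the object key alone, before any bytes are fetched."""
    ext = PurePosixPath(key).suffix.lower()
    if ext in DOC_EXTENSIONS:
        return True
    return allow_images and ext in IMAGE_EXTENSIONS
```

Because the decision uses only the object key, unsupported files cost no bandwidth.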

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 19:22:32 +08:00
620088be2f fix: check isinstance before len in VariableAssigner _remove_first/_remove_last (#14281)
fix: check isinstance before len in VariableAssigner _remove_first/_remove_last
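
A minimal sketch of the ordering issue named in the title, assuming the operation should leave non-list values untouched. `remove_first` here is an illustrative free function, not the actual `VariableAssigner._remove_first` method.

```python
def remove_first(value):
    """Drop the first item of a list; leave any other value untouched."""
    # Check the type first: len() raises TypeError on values such as None or int.
    if not isinstance(value, list) or len(value) == 0:
        return value
    return value[1:]
```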
2026-04-24 19:09:44 +08:00
eeb89d604e feat: route docling parsing through native chunking endpoints (#14218)
Resolves #14211

**Background:** Currently, RAGFlow routes all Docling parsing through
the standard `/convert/source` endpoint. For large documents, this
returns massive, unchunked text that exceeds RAGFlow's internal
embedding model context limits, causing pipeline failures.

**Solution:**
This PR updates the `_parse_pdf_remote` ingestion logic in
`docling_parser.py` to prioritize `docling-serve`'s native chunking
endpoints (`/v1/chunk/source` and `/v1alpha/chunk/source`).
- By receiving pre-sliced chunk objects directly from Docling, RAGFlow
natively bypasses token limit overflows.
- Included a graceful fallback mechanism to the standard
`/convert/source` endpoints to maintain backwards compatibility for
users running older versions of the Docling server that return 404s on
the new routes.
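
The fallback order described above can be sketched as follows. The `post(path, payload)` callable returning `(status, body)` is a hypothetical abstraction over the HTTP client, and the convert path is written as it appears in this description; the real route prefix may differ.

```python
CHUNK_ENDPOINTS = ["/v1/chunk/source", "/v1alpha/chunk/source"]
CONVERT_ENDPOINT = "/convert/source"  # path as written above; real prefix may differ


def parse_with_fallback(post, payload):
    """Try the chunking endpoints first; fall back to convert on 404.

    `post(path, payload)` is a hypothetical callable returning (status, body).
    """
    for path in CHUNK_ENDPOINTS:
        status, body = post(path, payload)
        if status == 200:
            return "chunks", body
        if status != 404:
            # Only "route not found" triggers the fallback; other errors surface.
            raise RuntimeError(f"{path} failed with HTTP {status}")
    status, body = post(CONVERT_ENDPOINT, payload)
    if status != 200:
        raise RuntimeError(f"{CONVERT_ENDPOINT} failed with HTTP {status}")
    return "document", body
```

Falling back only on 404 keeps genuine server errors visible instead of silently downgrading to unchunked output.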

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-24 19:03:19 +08:00
beb2406b86 Fix: allow use image2text as chat model (#14331)
### What problem does this PR solve?

Allow image2text models (multimodal) to be used as chat models.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 17:58:25 +08:00
9ad752f497 Refa:migrate agent webhook routes to REST APIs (#14330)
### What problem does this PR solve?

migrate agent webhook routes to REST APIs

### Type of change
- [x] Refactoring
2026-04-24 17:55:53 +08:00
b8d831c1c3 Fix api user patch verb does not work (#14358)
### What problem does this PR solve?

Fix the user API, where the PATCH verb did not work.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 17:27:41 +08:00
8a2f63e77d docs: fix API key guide typo (#14352)
Fixes a small typo in the RAGFlow API key guide: `This documents
provides` -> `This document provides`.
2026-04-24 16:59:25 +08:00
1473000135 Implement retrieval_test in GO (#14231)
### What problem does this PR solve?

Implement `retrieval_test` in Go.

### Type of change

- [x] Refactoring
2026-04-24 15:30:14 +08:00
aadd9a333f Feat: deepseek v4 (#14346)
### What problem does this PR solve?

Feat: deepseek v4
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-24 13:07:59 +08:00