Commit Graph

1515 Commits

Author SHA1 Message Date
d88f7ac8d2 Remove evaluation_app.py and kb_app.py (#14394)
### What problem does this PR solve?

Delete unused APIs.

### Type of change

- [x] Refactoring
2026-04-27 16:08:54 +08:00
290f0294d6 Refactor: migrate artifact API (#14348)
### What problem does this PR solve?

Before migration: GET /v1/document/artifact/<filename>
After migration:  GET /api/v1/documents/artifact/<filename>

### Type of change

- [x] Refactoring
2026-04-27 15:19:41 +08:00
2846a93998 Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports

Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000                        # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
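
The forwarding fix can be illustrated with a minimal sketch. `SketchPdfParser` and `render_pages` are hypothetical stand-ins, not the real `RAGFlowPdfParser` API, and treating `page_to` as an exclusive bound is an assumption made only for this example.

```python
MAXIMUM_PAGE_NUMBER = 100000  # value taken from the PR description


class SketchPdfParser:
    """Hypothetical stand-in for a PDF parser; not the real RAGFlowPdfParser."""

    def __init__(self, total_pages):
        self.total_pages = total_pages

    def render_pages(self, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
        # page_to is exclusive here (an assumption); clamp to the document length.
        return list(range(page_from, min(page_to, self.total_pages)))

    def parse_into_bboxes(self, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
        # Forward the caller's range instead of relying on render_pages' default.
        return self.render_pages(page_from=from_page, page_to=to_page)


parser = SketchPdfParser(total_pages=1200)
all_pages = parser.parse_into_bboxes()
first_ten = parser.parse_into_bboxes(from_page=0, to_page=10)
```

With a shared sentinel as the default, a caller that omits the range still gets every page, while an explicit range is passed through unchanged.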

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 14:57:20 +08:00
0b46ab07c5 Refa: restore openai-compatible chat completions api (#14380)
### What problem does this PR solve?

Restore the OpenAI-compatible chat completions API.

### Type of change

- [x] Refactoring
2026-04-27 14:02:19 +08:00
f3b7d55a1e fix: handle Infinity table-not-exist error (3022) in update() methods (#14153)
### What problem does this PR solve?

## Summary

Closes #6102

When using Infinity as the document store engine (GPU version), calling
`update()` on a non-existent table throws an unhandled
`InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This
causes users to see a raw "3022" error when clicking on a parsed
document.

## Root Cause

The `update()` methods in both `rag/utils/infinity_conn.py` and
`memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)`
without catching `InfinityException`. In contrast, other CRUD methods
(`insert`, `delete`, `search`) all handle this exception gracefully:

| Method   | Handles table-not-exist? | Behavior |
|----------|--------------------------|----------|
| `insert`  |  Yes | Auto-creates the table |
| `search`  |  Yes | Skips the table |
| `delete`  |  Yes | Returns 0 |
| `update`  |  **No** | Crashes with 3022 |

Additionally, `api/apps/document_app.py` worked around this with a
fragile string match (`"3022" in msg`) to detect the error.

## Changes

- **`rag/utils/infinity_conn.py`**: Catch `InfinityException` in
`update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return
`False` — consistent with `delete()`.
- **`memory/utils/infinity_conn.py`**: Apply the same fix to its
`update()` method.
- **`api/apps/document_app.py`**: Remove the fragile `"3022"`
string-matching workaround. Table-not-exist is now handled by the `if
not ok` path with an improved error message.
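
A minimal sketch of the handling described above. `FakeStoreError` and `FakeDatabase` are hypothetical stand-ins for the Infinity SDK types (the real exception class and its attributes may differ); only the error code 3022 comes from the description.

```python
import logging

TABLE_NOT_EXIST = 3022  # error code quoted in the PR description


class FakeStoreError(Exception):
    """Hypothetical stand-in for the store's exception type."""

    def __init__(self, error_code, message=""):
        super().__init__(message)
        self.error_code = error_code


class FakeDatabase:
    """Hypothetical in-memory database holding named tables."""

    def __init__(self, tables):
        self.tables = tables

    def get_table(self, name):
        if name not in self.tables:
            raise FakeStoreError(TABLE_NOT_EXIST, f"table {name} does not exist")
        return self.tables[name]


def update(db, table_name, row_id, new_values):
    """Return True on success, False when the table does not exist."""
    try:
        table = db.get_table(table_name)
    except FakeStoreError as exc:
        if exc.error_code == TABLE_NOT_EXIST:
            logging.warning("update skipped: table %s does not exist", table_name)
            return False
        raise  # any other store error is still surfaced
    table.setdefault(row_id, {}).update(new_values)
    return True


db = FakeDatabase({"chunks": {}})
```

Comparing a structured error code, rather than searching the message for `"3022"`, keeps the caller independent of how the error text is worded.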

### Type of change

- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 11:52:22 +08:00
a9e5724b46 Refa: unify document create flows under REST documents API (#14345)
### What problem does this PR solve?

Unify document creation flows under the REST documents API.

### Type of change

- [x] Refactoring
2026-04-27 10:18:16 +08:00
4dcc42e0e1 feat(api): add unified index API and dataset management endpoints (#14222)
### What problem does this PR solve?

## Summary

Refactor the dataset API layer into a clean service/REST separation
pattern, add a unified `/index` API for graph/raptor/mindmap operations,
and introduce several new dataset management endpoints with full test
coverage.

## Changes

### Service Layer (`dataset_api_service.py`)

- Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace
function for all index types
- Added `run_index`, `delete_index` service functions
- Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`,
`get_ingestion_log`
- Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`,
`rename_tag`
- Added `get_flattened_metadata`, `get_auto_metadata`,
`update_auto_metadata`

### REST API Layer (`dataset_api.py`)

**New unified routes:**

| Method | Route | Description |
|--------|-------|-------------|
| POST | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Run index task |
| GET | `/datasets/<id>/index?type=graph\|raptor\|mindmap` | Trace index task |
| DELETE | `/datasets/<id>/<index_type>` | Delete index |
| GET | `/datasets/<id>` | Get dataset details |
| GET | `/datasets/<id>/ingestions/summary` | Ingestion summary |
| GET | `/datasets/<id>/ingestions` | List ingestion logs |
| GET | `/datasets/<id>/ingestions/<log_id>` | Get single ingestion log |
| POST | `/datasets/<id>/embedding` | Run embedding |
| GET | `/datasets/<id>/tags` | List tags |
| GET | `/datasets/tags/aggregation` | Aggregate tags across datasets |
| DELETE | `/datasets/<id>/tags` | Delete tags |
| PUT | `/datasets/<id>/tags` | Rename tag |
| GET | `/datasets/metadata/flattened` | Get flattened metadata |
| GET/PUT | `/datasets/<id>/metadata/config` | New metadata config path |

**Removed routes (replaced by unified `/index`):**

- `POST /datasets/<id>/mindmap`
- `GET /datasets/<id>/mindmap`

**Preserved legacy routes (backward compatibility):**

- `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor`
- `/auto_metadata` GET/PUT

### Test Suite

- Updated `common.py` helpers: added `trace_index`, removed
`run_mindmap`/`trace_mindmap`
- Added 7 new test files with 39 test cases total:

| Test File | Cases |
|-----------|-------|
| `test_get_dataset.py` | 4 |
| `test_ingestion_summary.py` | 2 |
| `test_ingestion_logs.py` | 5 |
| `test_index_api.py` | 14 |
| `test_embedding.py` | 2 |
| `test_tags.py` | 8 |
| `test_flattened_metadata.py` | 4 |

- Deleted `test_mindmap_tasks.py` (covered by unified index tests)

## Design Decisions

1. **Unified `/index?type=...`** — single endpoint replaces 3 separate
route pairs for graph/raptor/mindmap
2. **Backward compatibility** — old routes (`/run_graphrag`,
`/run_raptor`, `/auto_metadata`) preserved alongside new paths
3. **`_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}`** — input
validation via constant set
4. **`_INDEX_TYPE_TO_TASK_ID_FIELD`** — maps index type to KB model task
ID field for clean dispatch
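
Design decisions 3 and 4 could look roughly like the sketch below. The two constant names come from the description; the field names in the mapping and the `resolve_task_id_field` helper are placeholders, not the real KB model columns or service code.

```python
_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}

# Field names below are placeholders, not the real KB model columns.
_INDEX_TYPE_TO_TASK_ID_FIELD = {
    "graph": "graph_task_id",
    "raptor": "raptor_task_id",
    "mindmap": "mindmap_task_id",
}


def resolve_task_id_field(index_type):
    """Validate the ?type= query value and return the task ID field to read."""
    if index_type not in _VALID_INDEX_TYPES:
        raise ValueError(f"invalid index type: {index_type!r}")
    return _INDEX_TYPE_TO_TASK_ID_FIELD[index_type]
```

A single lookup table lets the run, trace, and delete handlers share one code path instead of branching per index type.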

## Files Changed

- `api/apps/restful_apis/dataset_api.py`
- `api/apps/services/dataset_api_service.py`
- `sdk/python/ragflow_sdk/modules/dataset.py`
- `test/testcases/test_http_api/common.py`
- `test/testcases/test_http_api/test_dataset_management/` (7 new files)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 09:38:01 +08:00
fb95136f39 Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090)
### What problem does this PR solve?

The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.

This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.
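
A minimal sketch of such a guard, using only the standard library. This is not the PR's `_validate_url_for_crawl`; the function name, the injectable `resolver` parameter, and the use of `is_global` as the rejection test are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def resolve_host(hostname):
    """Return every IP address the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def validate_url_for_crawl(url, resolver=resolve_host):
    """Raise ValueError unless the URL is http(s) and resolves only to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    for addr in resolver(parsed.hostname):
        ip = ipaddress.ip_address(addr)
        # is_global is False for private, loopback, link-local and reserved ranges.
        if not ip.is_global:
            raise ValueError(f"address not allowed: {ip}")
    return url
```

Checking every resolved address matters because a hostname can map to several records. Validation-time resolution alone does not cover a host that resolves differently when the crawler later connects, so a production guard may need to pin the checked address as well.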

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-25 14:30:15 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.
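
For readers unfamiliar with the position-tag format mentioned above, here is a small sketch that builds and parses one tag. Only the `@@page\tx0\tx1\ty0\ty1##` layout comes from the description; the helper names, the single-page integer field, and the one-decimal coordinate formatting are assumptions.

```python
import re

# Layout from the PR description: @@page\tx0\tx1\ty0\ty1##
_TAG_RE = re.compile(r"@@(\d+)\t([\d.]+)\t([\d.]+)\t([\d.]+)\t([\d.]+)##")


def make_position_tag(page, x0, x1, y0, y1):
    """Serialize a bounding box; one-decimal precision is an assumption."""
    return "@@{}\t{:.1f}\t{:.1f}\t{:.1f}\t{:.1f}##".format(page, x0, x1, y0, y1)


def parse_position_tag(tag):
    """Return (page, x0, x1, y0, y1) or raise ValueError on a malformed tag."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"malformed position tag: {tag!r}")
    page = int(match.group(1))
    x0, x1, y0, y1 = (float(value) for value in match.groups()[1:])
    return page, x0, x1, y0, y1


tag = make_position_tag(3, 10, 200.5, 40, 60.5)
```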

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install `opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g. `==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g. `docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` / `external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection patterns from output |
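
As a rough illustration of how the runtime variables and their defaults could be read, consider the sketch below. The `load_opendataloader_settings` helper and the keys of the returned dict are hypothetical; the actual parser may read these variables differently. `OPENDATALOADER_VERSION` is omitted because it only affects installation.

```python
import os


def load_opendataloader_settings(env=None):
    """Collect the settings from the table above, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "enabled": env.get("USE_OPENDATALOADER", "false").lower() == "true",
        "hybrid": env.get("OPENDATALOADER_HYBRID"),  # None when unset
        "image_output": env.get("OPENDATALOADER_IMAGE_OUTPUT"),
        "output_dir": env.get("OPENDATALOADER_OUTPUT_DIR"),  # None means use a temp dir
        "delete_output": env.get("OPENDATALOADER_DELETE_OUTPUT", "1") != "0",
        "sanitize": env.get("OPENDATALOADER_SANITIZE") == "1",
    }


defaults = load_opendataloader_settings({})
debug = load_opendataloader_settings(
    {"USE_OPENDATALOADER": "true", "OPENDATALOADER_DELETE_OUTPUT": "0"}
)
```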

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling            7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling           42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00
beb2406b86 Fix: allow use image2text as chat model (#14331)
### What problem does this PR solve?

Allow image2text models (multimodal) to be used as chat models.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 17:58:25 +08:00
9ad752f497 Refa:migrate agent webhook routes to REST APIs (#14330)
### What problem does this PR solve?

migrate agent webhook routes to REST APIs

### Type of change
- [x] Refactoring
2026-04-24 17:55:53 +08:00
1473000135 Implement retrieval_test in GO (#14231)
### What problem does this PR solve?

Implement retrieval_test in Go

### Type of change

- [x] Refactoring
2026-04-24 15:30:14 +08:00
199fbceb72 Refactor user REST API (#14334)
### What problem does this PR solve?
Refactor user REST API

### Type of change
- [x] Refactoring
2026-04-24 10:25:15 +08:00
c41b5e8a5d fix: migrate Langfuse integration from start_generation to start_obse… (#14205)
The Langfuse Python SDK v3+ removed the `start_generation()` method.
RAGFlow's code called this non-existent method, causing an `AttributeError`
when Langfuse tracing is enabled.

Replace all `start_generation()` calls with
`start_observation(as_type="generation")` which is the correct v4 SDK
API.

Affected files:
- api/db/services/llm_service.py (12 occurrences)
- api/db/services/dialog_service.py (1 occurrence)

Fixes #14204
Related to #9243

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 10:03:57 +08:00
c74aece63c Feat: Agent api (#14157)
### What problem does this PR solve?

1. **List agents**  
   **Prev API**:  
   - `/v1/canvas/list GET`  
   - `/api/v1/agents GET`  
   **Current API**: `/api/v2/agents GET`

2. **Get canvas template**  
   **Prev API**: `/v1/canvas/templates GET`  
   **Current API**: `/api/v2/agents/templates GET`

3. **Delete an agent**  
   **Prev API**: 
    - `/v1/canvas/rm POST`  
    - `/api/v1/agents/<agent_id> DELETE`
   **Current API**: `/api/v2/agents/<agent_id> DELETE`

4. **Update an agent**  
   **Prev API**: 
    - `/api/v1/agents/<agent_id> PUT`   
    - `/v1/canvas/setting POST`
   **Current API**: `/api/v2/agents/<agent_id> PATCH`


5. **Create an agent**  
   **Prev API**: 
    - `/v1/canvas/set POST`  
    - `/api/v1/agents POST`
   **Current API**: `/api/v2/agents POST`


6. **Get an agent**  
   **Prev API**: 
    - `/v1/canvas/get/<canvas_id> GET`  
   **Current API**: `/api/v2/agents/<agent_id> GET`


7. **Reset an agent**  
   **Prev API**: 
    - `/v1/canvas/reset POST`  
   **Current API**: `/api/v2/agents/<agent_id>/reset POST`


8. **Upload a file to an agent**  
   **Prev API**: 
    - `/v1/canvas/upload/<canvas_id> POST`  
   **Current API**: `/api/v2/agents/<agent_id>/upload POST`


9. **Input form**  
   **Prev API**: 
    - `/v1/canvas/input_form GET`  
   **Current API**: `/api/v2/agents/<agent_id>/components/<component_id>/input-form GET`


10. **Debug an agent**  
   **Prev API**: 
    - `/v1/canvas/debug POST`  
   **Current API**: `/api/v2/agents/<agent_id>/components/<component_id>/debug POST`


11. **Trace an agent**  
   **Prev API**: 
    - `/v1/canvas/trace GET`  
   **Current API**: `/api/v2/agents/<agent_id>/logs/<message_id> GET`


12. **Get an agent version list**  
   **Prev API**: 
    - `/v1/canvas/getlistversion/<canvas_id>`  
   **Current API**: `/api/v2/agents/<agent_id>/versions GET`


13. **Get a version of agent**  
   **Prev API**: 
    - `/v1/canvas/getversion/<version_id>`  
   **Current API**: `/api/v2/agents/<agent_id>/versions/<version_id> GET`


14. **Test db connection**  
   **Prev API**: 
    - `/v1/canvas/test_db_connect POST`  
   **Current API**: `/api/v2/agents/test_db_connection`


15. **Rerun the agent**  
   **Prev API**: 
    - `/v1/canvas/rerun POST`  
   **Current API**: `/api/v2/agents/rerun POST`


16. **Get prompts**  
   **Prev API**: 
    - `/v1/canvas/prompts GET`  
   **Current API**: `/api/v2/agents/prompts GET`

### Type of change
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: chanx <1243304602@qq.com>
2026-04-24 10:02:22 +08:00
d4fa57311c Refa: remove legacy MCP server web API (#14322)
### What problem does this PR solve?

remove legacy MCP server web API

### Type of change

- [x] Refactoring
2026-04-23 19:01:22 +08:00
4458763a93 API refactor: stats_api and plugin_api (#14324)
### What problem does this PR solve?

API refactor: stats_api and plugin_api

### Type of change

- [x] Refactoring
2026-04-23 17:16:04 +08:00
7817b0d779 Refa: migrate chunk APIs to RESTful routes (#14291)
### What problem does this PR solve?

migrate chunk APIs to RESTful routes

### Type of change
- [x] Refactoring
2026-04-23 14:17:23 +08:00
76b017ca32 Refact: system apis (#14298)
### What problem does this PR solve?
Refact: system apis

### Type of change

- [x] Refactoring
2026-04-23 14:09:42 +08:00
57f527eb02 Add missing timeout to ragflow server health check (#14311)
### What problem does this PR solve?

`check_ragflow_server_alive()` in `api/utils/health_utils.py` calls
`requests.get(url)` without a `timeout` parameter. Unlike
`check_minio_alive()` which correctly specifies `timeout=10`, this
health check can hang indefinitely if the server is unresponsive.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Changes

Added `timeout=10` to the `requests.get()` call, consistent with
`check_minio_alive()`.
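
The effect of the timeout can be sketched as follows. The PR uses `requests.get(url, timeout=10)`; this example swaps in the standard library's `urllib.request.urlopen` so that it is self-contained, and the `check_server_alive` name and injectable `opener` parameter are hypothetical.

```python
import urllib.request


def check_server_alive(url, timeout=10, opener=urllib.request.urlopen):
    """Return True on a 2xx response, False on any connection error or timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Without an explicit timeout, a blocking HTTP call can wait indefinitely on an unresponsive peer, which is why the health check itself could hang.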

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-23 14:08:52 +08:00
aa4526266f Refa: migrate MCP APIs to RESTful api (#14317)
### What problem does this PR solve?

migrate MCP APIs to RESTful api

### Type of change

- [x] Refactoring
2026-04-23 12:51:27 +08:00
dbf8c6ed90 Refactor: Doc metadata update (#14289)
### What problem does this PR solve?

Before migration, web API: POST /v1/document/metadata/update

After migration, RESTful API: PATCH /api/v2/datasets/<dataset_id>/documents/metadatas

### Type of change

- [x] Refactoring
2026-04-23 12:04:34 +08:00
aae45b959b Refactor: API file2document (#14306)
Refactor: API file2document
2026-04-23 11:40:45 +08:00
e79b896637 Refactor: REST API langfuse api-key (#14315)
REST API langfuse api-key
2026-04-23 11:36:16 +08:00
01753b8f31 Refactor: API connectors (#14228)
### What problem does this PR solve?

Refactor /api/v1/connectors to be more RESTful.

### Type of change
- [x] Refactoring
2026-04-22 20:42:41 +08:00
c08cd8e090 Refactor: Migrate document metadata config update API (#14286)
### What problem does this PR solve?

Before migration, web API: POST /v1/document/update_metadata_setting

After migration, RESTful API: PUT /api/v1/datasets/<dataset_id>/documents/<document_id>/metadata/config

### Type of change

- [x] Refactoring
2026-04-22 20:01:31 +08:00
d1c62fc19d Refact: Tenant api (#14288)
### What problem does this PR solve?

Refact: Tenant api

### Type of change

- [x] Refactoring
2026-04-22 20:00:32 +08:00
61d756e1b5 Fix #14213 create folder does not accept FOLDER (#14276)
### What problem does this PR solve?

As described in the title.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-22 11:55:10 +08:00
ff29484d42 fix: normalize think tags in final chat answer (#14271)
### What problem does this PR solve?

normalize think tags in final chat answer

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-22 11:15:08 +08:00
3d8a82c0aa Refactor: Consolidation WEB API & HTTP API for document delete api (#14254)
### What problem does this PR solve?

Before consolidation:
- Web API: POST /v1/document/rm
- HTTP API: DELETE /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: DELETE /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-22 10:49:52 +08:00
6baf74afc1 Refa: align chat and search restful APIs (#14229)
### What problem does this PR solve?

Refactor /api/v1/chats to be more RESTful.

### Type of change

- [x] Refactoring

---------

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-04-22 10:49:11 +08:00
2d05475693 Refactor: Consolidation WEB API & HTTP API for document infos (#14239)
### What problem does this PR solve?

Before consolidation:
- Web API: POST /v1/document/infos
- HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: GET /api/v1/datasets/<dataset_id>/documents?ids=id1&ids=id2

### Type of change

- [ ] Refactoring
2026-04-21 19:35:11 +08:00
009e538a4e Refactor: Consolidation WEB API & HTTP API for document get_filter (#14248)
### What problem does this PR solve?

Before consolidation:
- Web API: POST /v1/document/filter
- HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: GET /api/v1/datasets/<dataset_id>/documents?type=filter

### Type of change

- [x] Refactoring
2026-04-21 18:55:30 +08:00
6e33d8722f Revert "Fix: forwarding highlight param" (#14249)
Reverts infiniflow/ragflow#14112
2026-04-21 15:23:18 +08:00
78c3583964 Fix memory resolution regression for multimodal Gemini models (#14209)
### What problem does this PR solve?

Fixes #14206.

This issue is a regression. PR #9520 previously changed Gemini models
from `image2text` to `chat` to fix chat-side resolution, but PR #13073
later restored those Gemini entries to `image2text` during model-list
updates, which reintroduced the bug.

The underlying problem is that Gemini models are multimodal and
advertise both `CHAT` and `IMAGE2TEXT`, while tenant model resolution
still depends on a single stored `model_type`. That makes chat-only
flows such as memory extraction fragile when a compatible model is
stored as `image2text`.

This PR fixes the issue at the model resolution layer instead of
changing `llm_factories.json` again:
- keep the stored tenant model type unchanged
- try exact `model_type` lookup first
- if no exact match is found, fall back only when the model metadata
shows the requested capability is supported
- coerce the runtime config to the requested type for chat callers
- fail fast in memory creation instead of silently persisting
`tenant_llm_id=0`

This preserves existing multimodal and `image2text` behavior while
restoring chat compatibility for memory-related flows.
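
The resolution order described above might be sketched like this. The `resolve_model` function and the dict schema (`name`, `model_type`, `capabilities`) are hypothetical; the real tenant model records and lookup functions are structured differently.

```python
def resolve_model(tenant_models, name, requested_type):
    """Resolve a model config: exact type first, then capability-based fallback."""
    candidates = [m for m in tenant_models if m["name"] == name]

    # 1. Exact match on the stored model_type.
    for model in candidates:
        if model["model_type"] == requested_type:
            return dict(model)

    # 2. Fallback only when the metadata advertises the requested capability.
    for model in candidates:
        if requested_type in model.get("capabilities", ()):
            runtime = dict(model)  # copy, so the stored record stays unchanged
            runtime["model_type"] = requested_type
            return runtime

    # 3. Fail fast instead of returning a placeholder.
    raise LookupError(f"no {requested_type} model named {name!r}")


stored = [
    {"name": "gemini-x", "model_type": "image2text", "capabilities": {"chat", "image2text"}},
    {"name": "vision-only", "model_type": "image2text", "capabilities": {"image2text"}},
]
chat_config = resolve_model(stored, "gemini-x", "chat")
```

Coercing only the runtime copy is what lets the stored `image2text` entry keep working for its original callers.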

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Testing

- Re-checked the current memory creation and memory message extraction
paths against the updated resolution logic
- Verified locally that a Gemini-style tenant model stored as
`image2text` but tagged with `CHAT` can still be resolved for `chat`
- Verified `get_model_config_by_type_and_name(..., CHAT, ...)` returns a
chat-compatible runtime config
- Verified `get_model_config_by_id(..., CHAT)` also returns a
chat-compatible runtime config
- Verified strict resolution still fails when the model metadata does
not advertise chat capability
2026-04-20 16:37:36 +08:00
939933649a Refactor: Consolidation WEB API & HTTP API for document list_docs (#14176)
### What problem does this PR solve?

Before consolidation:
- Web API: POST /v1/document/list
- HTTP API: GET /api/v1/datasets/<dataset_id>/documents

After consolidation:
- RESTful API: GET /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-20 14:54:40 +08:00
c3387cd5b8 Fix: parent child config (#14199)
### What problem does this PR solve?

Correctly set and display the parent-child config in `parser_config`, and
allow `tenant_id` to be passed in PATCH `/api/v1/chats`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-17 23:02:42 +08:00
22c6648348 Fix: forwarding highlight param (#14112)
Closes #9078

### What problem does this PR solve?

The `retrieval_test` endpoint in `chunk_app.py` never forwarded the
`highlight` request parameter to `retriever.retrieval()`, so the search
engine never produced highlight snippets. Additionally, the frontend
always rendered `content_with_weight` instead of preferring the
`highlight` field, and the CSS rule color `var(--accent-primary)` didn't
work because the variable stores an RGB triplet `(45,212,191)` requiring
the `rgb()` wrapper.

### Before

- Search page: displayed raw content_with_weight as a wall of plain
white text with no term highlighting, including markdown headings
rendered as literal text
- Retrieval testing page: showed `content_with_weight` in a plain `<p>`
tag, no `<em>` tags rendered, no highlight coloring
- Children chunks: when child chunks were consolidated into a parent via
`retrieval_by_children`, any highlight data from children was discarded
- TOC chunks: chunks fetched via `retrieval_by_toc` had no `highlight`
field, appearing as plain text while other chunks had highlights

**Retrieval testing**:
<img width="1449" height="1178"
alt="before-retrieval-no-highlight-cropped"
src="https://github.com/user-attachments/assets/5c6f5a5e-6c11-461a-bdb4-049d7dfb7a33"
/>

**Search**:
<img width="1378" height="711" alt="before-search-no-highlight-cropped"
src="https://github.com/user-attachments/assets/be7b5152-72ef-40da-a8fd-921e997ae7d3"
/>

### After

- Search page: displays the highlight field with search terms rendered
in teal/cyan color (`rgb(var(--accent-primary))`)
- Retrieval testing page: sends highlight: true in the request, uses
`HighLightMarkdown` component to render `<em>` tags with proper coloring
- Children chunks: highlights from child chunks are joined and preserved
on the parent
- TOC chunks: when other chunks have highlights, TOC-fetched chunks use
`content_with_weight` as a highlight fallback

**Retrieval testing**:
<img width="1410" height="1015" alt="05-retrieval-testing-results"
src="https://github.com/user-attachments/assets/f0cff8cf-0962-4320-b559-cd5037f622d2"
/>

**Search**:
<img width="1294" height="455" alt="03-search-highlight-results"
src="https://github.com/user-attachments/assets/a90e0e3e-3837-46be-8ddd-2412ff7cbc19"
/>

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-17 20:59:20 +08:00
28d8b1c883 [Fix] trivial fix log creation (#14181)
### What problem does this PR solve?

Trivial fix for log creation; follow-up to PR:
https://github.com/infiniflow/ragflow/pull/14136

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-17 13:13:41 +08:00
797aa6076a Fix: keyword extraction (#14177)
### What problem does this PR solve?

Fix: keyword extraction

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-17 11:32:48 +08:00
ea8de1bb47 Fix: different llm in chat (#14162)
### What problem does this PR solve?

Fix: different llm in chat

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-16 20:37:01 +08:00
655dd2f8c6 Fix: simplify _load_user (#14154)
### What problem does this PR solve?

Simplify _load_user, remove unused fallback.

### Type of change

- [x] Refactoring
2026-04-16 18:47:43 +08:00
9a785b26bd fix: change file size column from IntegerField to BigIntegerField to support files > 2GB (#14148)
### What problem does this PR solve?

Fixes #6034

Changes the `size` field in both `Document` and `File` models from
`IntegerField` (32-bit, max ~2GB) to `BigIntegerField` (64-bit, max
~9.2EB), and adds corresponding database migrations.

## Problem

When uploading a file larger than 2GB, the `size` value overflows a
32-bit signed integer (max 2,147,483,647). This causes:

- The stored `size` wraps around to an incorrect value (e.g., a 3GB file
shows as 2,097,152 KB in File Management).
- Subsequent file operations (e.g., download) fail because the corrupted
size leads to invalid storage lookups.
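
The size limits involved can be checked with a few lines of arithmetic. The `fits` helper is illustrative only and says nothing about how a particular database reacts to an out-of-range value.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647: largest signed 32-bit value
INT64_MAX = 2**63 - 1  # largest signed 64-bit value, roughly 9.2 EB in bytes


def fits(size_bytes, max_value):
    """Return True when a byte count can be stored without exceeding max_value."""
    return 0 <= size_bytes <= max_value


three_gib = 3 * 1024**3  # 3,221,225,472 bytes
```

Here `fits(three_gib, INT32_MAX)` is false while `fits(three_gib, INT64_MAX)` is true, which is the gap the column type change closes.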

## Changes

- `Document.size`: `IntegerField` → `BigIntegerField`
- `File.size`: `IntegerField` → `BigIntegerField`
- Added `alter_db_column_type` migrations in `migrate_db()` for both
`document.size` and `file.size` columns to ensure existing deployments
are upgraded automatically.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-16 15:43:29 +08:00
969ce3a79f [Bug fix #14133] fix graph rag, raptor, mindmap log cannot show correctly in UI (#14136)
### What problem does this PR solve?
Fixes #14133: knowledge graph, RAPTOR, and mindmap logs are not displayed
correctly in the UI.
<img width="1930" height="982" alt="Image"
src="https://github.com/user-attachments/assets/d2f8e6c1-d82d-4b00-a377-949aada545ca"
/>
After Fix:
<img width="2108" height="805" alt="image"
src="https://github.com/user-attachments/assets/b37426c1-83d3-4a32-a83c-9d340d69e0e6"
/>
<img width="2173" height="1067" alt="image"
src="https://github.com/user-attachments/assets/30105222-3310-43a0-9f83-1e320d05e413"
/>

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-16 13:08:36 +08:00
c93ec0a1f3 Fix: reject empty/space-only content in update_chunk API (#14082)
Closes #6541

### What problem does this PR solve?

Add content validation to `update_chunk` (SDK and non-SDK) to reject
empty or whitespace-only content before it reaches the embedding model.

**Before:** Calling `update_chunk` with empty or whitespace-only content
(such as `" "`, `""`, or `"\n"`) bypassed validation and was sent directly
to the embedding model, which returned an error. This was the same bug
previously fixed for `add_chunk` in #6390, but `update_chunk` was missed.

**After:** Empty/whitespace-only content is caught by validation and
returns an error: `` `content` is required ``
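
A minimal sketch of this kind of check. The `validate_chunk_content` helper is hypothetical and only illustrates the rule; the real handlers return an API error response rather than raising.

```python
def validate_chunk_content(content):
    """Reject missing, empty, or whitespace-only content before embedding."""
    if not isinstance(content, str) or not content.strip():
        raise ValueError("`content` is required")
    return content
```

Because `str.strip()` removes spaces, tabs, and newlines alike, the single check covers all three inputs listed above.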

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-15 18:43:53 +08:00
3364d86e6b Auto-inject knowledge parameter in async_chat when prompt_config is missing it (#14121)
### What problem does this PR solve?

Resolves #14115.

## Problem

On the shared chat link page (`/chats/share?shared_id=...`), querying
the knowledge base returns "no relevant information was found", while
the same query works correctly on the editor chat page.

## Root Cause

Knowledge base retrieval in `async_chat()` is gated by the check `if
"knowledge" in param_keys` (line 598), where `param_keys` is derived
from `prompt_config["parameters"]`. If `parameters` is empty or missing
the `{"key": "knowledge", "optional": false}` entry, retrieval is
entirely skipped.

This can happen because `_apply_prompt_defaults()` — which ensures
`parameters` contains the `knowledge` entry — is only called in the
`create` (POST) and `update_chat` (PUT) handlers, but **not** in
`patch_chat` (PATCH). If a chat's `prompt_config` was updated via PATCH
without including `parameters`, the `knowledge` entry would be absent.
Additionally, `prompt_config["parameters"]` would raise a `KeyError` if
the key was missing entirely.

## Fix

Added a safety net in `async_chat()`
(`api/db/services/dialog_service.py`) that auto-injects the `knowledge`
parameter when all of the following hold:
- `dialog.kb_ids` is set (knowledge bases are configured)
- `"knowledge"` is not already in `param_keys`
- `{knowledge}` placeholder exists in the system prompt

Also changed `prompt_config["parameters"]` to
`prompt_config.get("parameters", [])` to prevent `KeyError` when the key
is absent.
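
A minimal sketch of the guard described above, under stated assumptions: the function name `ensure_knowledge_param` is hypothetical, and reading the system prompt from `prompt_config["system"]` is an assumption about the config layout rather than something confirmed here.

```python
def ensure_knowledge_param(prompt_config, kb_ids):
    """Return param_keys, injecting "knowledge" when retrieval would otherwise be skipped."""
    # .get() avoids a KeyError when "parameters" is missing entirely.
    parameters = prompt_config.get("parameters", [])
    param_keys = [p["key"] for p in parameters]

    # Assumption: the system prompt is stored under the "system" key.
    system_prompt = prompt_config.get("system", "")

    if kb_ids and "knowledge" not in param_keys and "{knowledge}" in system_prompt:
        parameters = parameters + [{"key": "knowledge", "optional": False}]
        prompt_config["parameters"] = parameters
        param_keys.append("knowledge")
    return param_keys


config = {"system": "Answer using {knowledge}."}  # no "parameters" key at all
print(ensure_knowledge_param(config, ["kb-1"]))
```

All three conditions must hold before anything is injected, so chats without knowledge bases or without the placeholder keep their existing behaviour.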

## Files Changed

- `api/db/services/dialog_service.py` — added auto-injection of
`knowledge` parameter and safe `.get()` access for `parameters`


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-15 17:31:31 +08:00
38cefd88e2 Fix tag_feas code injection in retrieval ranking (#13923)
## Summary
- remove eval-based parsing from retrieval rank feature scoring
- validate `tag_feas` at write time in chunk APIs and SDK routes
- add regression tests for safe parsing and malicious payload rejection

## Details
`tag_feas` is intended to be structured rank-feature data, but the
retrieval ranking path was evaluating stored values as Python
expressions. This change treats `tag_feas` strictly as data.

### What changed
- replace `eval()` in `rag/nlp/search.py` with safe parsing via
`json.loads()` and optional `ast.literal_eval()` compatibility for
legacy Python-dict strings
- strictly filter parsed values down to `dict[str, finite number]`
- reject invalid `tag_feas` payloads at write time in web chunk routes
and SDK document chunk routes
- add focused regression tests to prove executable strings are ignored
and invalid payloads are rejected
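
The parsing strategy listed above can be sketched as below. This is an illustration under assumptions, not the code in `rag/nlp/search.py`: the function name `parse_tag_feas` is hypothetical, and returning an empty dict for unusable input is one possible policy.

```python
import ast
import json
import math


def parse_tag_feas(raw):
    """Parse stored rank-feature data as data only; never evaluate it as code."""
    if isinstance(raw, dict):
        parsed = raw
    elif isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except ValueError:
            # Legacy compatibility: Python-dict strings such as "{'a': 1}".
            # literal_eval accepts literals only, so calls and imports are rejected.
            try:
                parsed = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
                return {}
    else:
        return {}

    if not isinstance(parsed, dict):
        return {}

    clean = {}
    for key, value in parsed.items():
        # Keep only str -> finite number pairs; bool is excluded explicitly
        # because it is a subclass of int.
        if not isinstance(key, str):
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        try:
            number = float(value)
        except OverflowError:
            continue
        if math.isfinite(number):
            clean[key] = number
    return clean


print(parse_tag_feas('{"sports": 3, "news": 0.5}'))
print(parse_tag_feas("__import__('os').system('echo pwned')"))
```

An executable string fails both parsers and yields an empty result, so nothing from the stored value is ever run.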

## Validation
- `python -m pytest test/unit_test/common/test_tag_feature_utils.py
test/unit_test/rag/test_rank_feature_scores.py -q`

---------

Co-authored-by: unknown <zhenglinkai@CCN.Local>
Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>
2026-04-15 16:31:11 +08:00
1f33ca1099 fix(dialog): restore decorated answer in async_ask final SSE event (#13917)
## What's the problem

Both `async_chat()` and `async_ask()` call `decorate_answer()` to build
the final SSE payload — it inserts citation markers (`##N$$`) into the
answer text and prunes `doc_aggs` to only the cited documents.
Immediately after, both functions overwrite `final["answer"]` with `""`:

```python
# async_chat(), line ~774  (issue #13828)
final = decorate_answer(thought + full_answer)
final["final"] = True
final["audio_binary"] = None
final["answer"] = ""   # discards decorated text
yield final

# async_ask(), line ~1444  (same bug, different path)
final = decorate_answer(full_answer)
final["final"] = True
final["answer"] = ""   # discards decorated text
yield final
```

The client receives filtered references (built for a citation-decorated
answer it never sees) while displaying the raw, undecorated streaming
text. Citations can never match.

## Root cause

`final["answer"] = ""` was left over from an earlier design where
clients were meant to reconstruct the full answer purely from delta
events. Once `decorate_answer()` started placing citation markers, this
blank-out broke the contract: the final event is where the decorated
answer should land.

## Fix

Remove the two blank-override lines — one in `async_chat()`, one in
`async_ask()`:

```diff
-    final["answer"] = ""
```

`decorate_answer()` already sets `final["answer"]` to the correct
decorated string; there is nothing to override.
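
The fixed pattern can be illustrated with a small, self-contained sketch. Everything here is a stand-in: this `decorate_answer()` is a hypothetical simplification that appends a single citation marker and returns a flat `doc_aggs` list, and `final_event()` is not a function from the codebase.

```python
import re

# Matches citation markers of the form ##N$$ described above.
CITATION = re.compile(r"##(\d+)\$\$")


def decorate_answer(text):
    # Hypothetical stand-in: adds one citation marker and keeps only the cited document.
    return {"answer": f"{text} ##0$$", "doc_aggs": [{"doc_id": "doc-0"}]}


def final_event(full_answer):
    final = decorate_answer(full_answer)
    final["final"] = True
    # Deliberately no `final["answer"] = ""` here: the decorated text must reach the client.
    return final


event = final_event("RAGFlow parses PDFs.")
cited = CITATION.findall(event["answer"])
print(event["answer"])
print(cited, len(event["doc_aggs"]))
```

With the override removed, the markers in the final answer and the pruned `doc_aggs` come from the same `decorate_answer()` call, so they stay in step.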

## Relation to #13828

Issue #13828 and PR #13835 identify the bug in `async_chat()`. This PR
absorbs that fix and also corrects the identical pattern in
`async_ask()` (used by the `/retrieval` route in `chat_api.py`), which
PR #13835 does not touch.

## Regression test

Added
`test/unit_test/api/db/services/test_dialog_service_final_answer.py`
with three tests:

| Test | Purpose |
|------|---------|
| `test_buggy_pattern_drops_answer` | Documents the old behaviour:
blank-override empties the final answer |
| `test_fixed_pattern_preserves_decorated_answer` | Core invariant:
final event carries the decorated text from `decorate_answer()` |
| `test_final_event_reference_matches_decorated_result` | Citation
markers in the answer must match the pruned `doc_aggs` in the same event
|

Local run result:

```
test_dialog_service_final_answer.py::test_buggy_pattern_drops_answer         PASSED
test_dialog_service_final_answer.py::test_fixed_pattern_preserves_decorated_answer PASSED
test_dialog_service_final_answer.py::test_final_event_reference_matches_decorated_result PASSED

3 passed in 0.04s
```

`ruff check` passes with no issues on all changed files.

---------

Co-authored-by: edenfunf <edenfunf@gmail.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-04-15 14:10:36 +08:00
bc5f78996b Consolidation of document upload API (#14106)
### What problem does this PR solve?

Consolidate the Web API and the HTTP API for document upload.

Before consolidation:
Web API:  POST /v1/document/upload
HTTP API: POST /api/v1/datasets/<dataset_id>/documents

After consolidation:
RESTful API: POST /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring
2026-04-15 11:27:43 +08:00
a98b64326c Add warning log when metadata query hits 10000 result limit (#14109)
## What problem does this PR solve?

Add a warning log when `get_flatted_meta_by_kbs` returns 10,000 results,
which indicates the query limit has been reached and metadata may be
silently truncated.
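
A minimal sketch of this kind of guard, assuming the limit is a plain row count: the names `METADATA_QUERY_LIMIT` and `warn_if_truncated` are hypothetical and do not come from the PR.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: the metadata query is capped at 10,000 rows, as described above.
METADATA_QUERY_LIMIT = 10000


def warn_if_truncated(rows, limit=METADATA_QUERY_LIMIT):
    """Log a warning and return True when the result count reaches the limit."""
    if len(rows) >= limit:
        logger.warning(
            "Metadata query returned %d rows, reaching the limit of %d; "
            "results may be truncated.",
            len(rows),
            limit,
        )
        return True
    return False


print(warn_if_truncated([{}] * 10000))
print(warn_if_truncated([{}] * 42))
```

A result that lands exactly on the limit cannot be told apart from a truncated one, which is why the warning fires at the limit rather than above it.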


## Type of change
- [x] Improvement (non-breaking change which improves observability)
2026-04-14 20:04:32 +08:00