### What problem does this PR solve?
Always return success if no such task id to follow existing code logic.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Before migration
Web API: POST /v1/document/change_parser
HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents
After consolidation, Restful API
PATCH /api/v1/datasets/<dataset_id>/documents
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Before migration: GET /v1/document/thumbnails
After migration: GET /api/v1/thumbnails
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Before migration: POST /v1/document/run
After migration: POST /api/v1/documents/ingest/
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Before migration
Web API: POST /v1/document/change_status
After consolidation, Restful API
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documentss/upload/
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Before migration: GET /v1/document/artifact/<filename>
After migration: GET /api/v1/documents/artifact/<filename>
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fixes#14196
## Problem
When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:
- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports
Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.
## Root Cause
```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
# Only the first 300 pages were rendered; everything beyond was silently dropped
```
While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.
## Solution
### 1. Define constants in `common/constants.py`
```python
MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer
```
### 2. Replace all hardcoded sentinel values
| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`,
`docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`,
`docx_parser.py` | `299`, `600`, `10**9`, `100000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`,
`manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`,
`email.py`, `table.py` | `100000`, `10000`, `10000000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`,
`document_service.py`, `file_service.py` | `100000000` |
`MAXIMUM_TASK_PAGE_NUMBER` |
### 3. Fix `parse_into_bboxes()` missing parameters
Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
## Files Changed (22)
- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Signed-off-by: noob <yixiao121314@outlook.com>
### What problem does this PR solve?
## Summary
Closes#6102
When using Infinity as the document store engine (GPU version), calling
`update()` on a non-existent table throws an unhandled
`InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This
causes users to see a raw "3022" error when clicking on a parsed
document.
## Root Cause
The `update()` methods in both `rag/utils/infinity_conn.py` and
`memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)`
without catching `InfinityException`. In contrast, other CRUD methods
(`insert`, `delete`, `search`) all handle this exception gracefully:
| Method | Handles table-not-exist? | Behavior |
|----------|--------------------------|----------|
| `insert` | ✅ Yes | Auto-creates the table |
| `search` | ✅ Yes | Skips the table |
| `delete` | ✅ Yes | Returns 0 |
| `update` | ❌ **No** | Crashes with 3022 |
Additionally, `api/apps/document_app.py` worked around this with a
fragile string match (`"3022" in msg`) to detect the error.
## Changes
- **`rag/utils/infinity_conn.py`**: Catch `InfinityException` in
`update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return
`False` — consistent with `delete()`.
- **`memory/utils/infinity_conn.py`**: Apply the same fix to its
`update()` method.
- **`api/apps/document_app.py`**: Remove the fragile `"3022"`
string-matching workaround. Table-not-exist is now handled by the `if
not ok` path with an improved error message.
### Type of change
- [x] Refactoring
---------
Signed-off-by: noob <yixiao121314@outlook.com>
### What problem does this PR solve?
The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.
This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Allow image2text models (multimodal) to be used as chat models.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
The Langfuse Python SDK v3+ removed `start_generation()` method.
RagFlow's code called this non-existent method, causing AttributeError
when Langfuse tracing is enabled.
Replace all `start_generation()` calls with
`start_observation(as_type="generation")` which is the correct v4 SDK
API.
Affected files:
- api/db/services/llm_service.py (12 occurrences)
- api/db/services/dialog_service.py (1 occurrence)
Fixes#14204
Related to #9243
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What problem does this PR solve?
`check_ragflow_server_alive()` in `api/utils/health_utils.py` calls
`requests.get(url)` without a `timeout` parameter. Unlike
`check_minio_alive()` which correctly specifies `timeout=10`, this
health check can hang indefinitely if the server is unresponsive.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Changes
Added `timeout=10` to the `requests.get()` call, consistent with
`check_minio_alive()`.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
### What problem does this PR solve?
Before migration
Web API: POST /v1/document/metadata/update
After migration, Restful API
PATCH /api/v2/datasets/<dataset_id>/documents/metadatas
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Before migration
Web API: POST /v1/document/update_metadata_setting
After consolidation, Restful API
PUT
/api/v1/datasets/<dataset_id>/documents/<document_id>/metadata/config
### Type of change
- [x] Refactoring
### What problem does this PR solve?
normalize think tags in final chat answer
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Before consolidation
Web API: POST /v1/document/rm
Http API - DELETE /api/v1/datasets/<dataset_id>/documents
After consolidation, Restful API -- DELETE
/api/v1/datasets/<dataset_id>/documents
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Refactor /api/v1/chats to be more RESTful.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Before consolidation
Web API: POST /v1/document/infos
Http API - GET /api/v1/datasets/<dataset_id>/documents
After consolidation, Restful API -- GET
/api/v1/datasets/<dataset_id>/documents?ids=id1&ids=id2
### Type of change
- [ ] Refactoring
### What problem does this PR solve?
Before consolidation
Web API: POST /v1/document/filter
Http API - GET /api/v1/datasets/<dataset_id>/documents
After consolidation, Restful API -- GET
/api/v1/datasets/<dataset_id>/documents?type=filter
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fixes#14206.
This issue is a regression. PR #9520 previously changed Gemini models
from `image2text` to `chat` to fix chat-side resolution, but PR #13073
later restored those Gemini entries to `image2text` during model-list
updates, which reintroduced the bug.
The underlying problem is that Gemini models are multimodal and
advertise both `CHAT` and `IMAGE2TEXT`, while tenant model resolution
still depends on a single stored `model_type`. That makes chat-only
flows such as memory extraction fragile when a compatible model is
stored as `image2text`.
This PR fixes the issue at the model resolution layer instead of
changing `llm_factories.json` again:
- keep the stored tenant model type unchanged
- try exact `model_type` lookup first
- if no exact match is found, fall back only when the model metadata
shows the requested capability is supported
- coerce the runtime config to the requested type for chat callers
- fail fast in memory creation instead of silently persisting
`tenant_llm_id=0`
This preserves existing multimodal and `image2text` behavior while
restoring chat compatibility for memory-related flows.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Testing
- Re-checked the current memory creation and memory message extraction
paths against the updated resolution logic
- Verified locally that a Gemini-style tenant model stored as
`image2text` but tagged with `CHAT` can still be resolved for `chat`
- Verified `get_model_config_by_type_and_name(..., CHAT, ...)` returns a
chat-compatible runtime config
- Verified `get_model_config_by_id(..., CHAT)` also returns a
chat-compatible runtime config
- Verified strict resolution still fails when the model metadata does
not advertise chat capability
### What problem does this PR solve?
Before consolidation
Web API: POST /v1/document/list
Http API - GET /api/v1/datasets/<dataset_id>/documents
After consolidation, Restful API -- GET
/api/v1/datasets/<dataset_id>/documents
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Correctly set and display parent-child config in parser_config, and
allow to pass `tenant_id` in PATCH `/api/v1/chats`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Closes#9078
### What problem does this PR solve?
The `retrieval_test` endpoint in `chunk_app.py` never forwarded the
`highlight` request parameter to `retriever.retrieval()`, so the search
engine never produced highlight snippets. Additionally, the frontend
always rendered `content_with_weight` instead of preferring the
`highlight` field, and the CSS rule color `var(--accent-primary)` didn't
work because the variable stores an RGB triplet `(45,212,191)` requiring
the `rgb()` wrapper.
### Before
- Search page: displayed raw content_with_weight as a wall of plain
white text with no term highlighting, including markdown headings
rendered as literal text
- Retrieval testing page: showed `content_with_weight` in a plain `<p>`
tag, no `<em>` tags rendered, no highlight coloring
- Children chunks: when child chunks were consolidated into a parent via
`retrieval_by_children`, any highlight data from children was discarded
- TOC chunks: chunks fetched via `retrieval_by_toc` had no `highlight`
field, appearing as plain text while other chunks had highlights
**Retrieval testing**:
<img width="1449" height="1178"
alt="before-retrieval-no-highlight-cropped"
src="https://github.com/user-attachments/assets/5c6f5a5e-6c11-461a-bdb4-049d7dfb7a33"
/>
**Search**:
<img width="1378" height="711" alt="before-search-no-highlight-cropped"
src="https://github.com/user-attachments/assets/be7b5152-72ef-40da-a8fd-921e997ae7d3"
/>
### After
- Search page: displays the highlight field with search terms rendered
in teal/cyan color (`rgb(var(--accent-primary))`)
- Retrieval testing page: sends highlight: true in the request, uses
`HighLightMarkdown` component to render `<em>` tags with proper coloring
- Children chunks: highlights from child chunks are joined and preserved
on the parent
- TOC chunks: when other chunks have highlights, TOC-fetched chunks use
`content_with_weight` as a highlight fallback
**Retrieval testing**:
<img width="1410" height="1015" alt="05-retrieval-testing-results"
src="https://github.com/user-attachments/assets/f0cff8cf-0962-4320-b559-cd5037f622d2"
/>
**Search**:
<img width="1294" height="455" alt="03-search-highlight-results"
src="https://github.com/user-attachments/assets/a90e0e3e-3837-46be-8ddd-2412ff7cbc19"
/>
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Trivial fix log creation, follow on PR:
https://github.com/infiniflow/ragflow/pull/14136
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)