Commit Graph

191 Commits

Author SHA1 Message Date
59bb184e63 feat(moodle): support deleted-file sync (#14548)
Fixes #14551 

### What problem does this PR solve?

The Moodle connector did not let the sync runner clean up indexed
documents that were deleted from the source. Other connectors such as
dropbox, seafile, webdav, and rss already do this through a slim
snapshot pass. This PR adds the same support for Moodle.

When `sync_deleted_files` is on, the runner now asks the Moodle
connector for a lightweight list of every module id that could be
indexed. The runner then compares this list with the index and removes
any indexed document whose id is not in the list.

The slim pass does not download files. It only goes through courses and
modules and yields ids. The id format matches the ids that the loader
produces, so the match is exact.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Notes

- `MoodleConnector` now also implements `SlimConnectorWithPermSync`.
- New `retrieve_all_slim_docs_perm_sync` yields slim docs with the same
ids the loader uses (`moodle_resource_<id>`, `moodle_forum_<id>`,
`moodle_page_<id>`, `moodle_book_<id>`, `moodle_assign_<id>`,
`moodle_quiz_<id>`).
- The `Moodle` sync class now returns `(document_generator, file_list)`
so the runner can do the cleanup. If the slim snapshot fails,
`file_list` is set back to `None` and the run continues without cleanup.
- The web data source map exposes `syncDeletedFiles` for Moodle so the
option shows up in the UI.

### How was this tested?

- `ruff check` passes on the changed Python files.
- Manual review of the produced slim ids against the ids the loader
builds in `_process_resource`, `_process_forum`, `_process_page`,
`_process_book`, and `_process_activity`.
- Behavior parity with the merged dropbox (#14476), seafile (#14499),
webdav (#14491), and rss (#14493) PRs.
2026-05-07 17:44:46 +08:00
94324afee9 Go: fix auth issue in hybrid mode (#14611)
### What problem does this PR solve?

Since secret key get and set logic is updated, the go server also need
to update.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 17:14:22 +08:00
911671cef0 Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615)
### What problem does this PR solve?

Feat: enable sync deleted files for RDBMS & fix remove last file issue

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-05-07 13:31:05 +08:00
1d0519d025 Fix secret key inconsistency cross the RAGFlow servers (#14591)
### What problem does this PR solve?

A and B, two API servers and a REDIS server.
If A and REDIS restart, B will hold the obsolete secret key and will
lead to error.

TODO:
app.config['SECRET_KEY'] and app.secret_key still hold obsolete secret
key.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 10:10:02 +08:00
38f6484e98 Fix OpenDataLoader naive parsing by normalizing @OpenDataLoader and filtering unsupported parser kwargs (#14581)
### What problem does this PR solve?
This PR fixes a bug where `layout_recognize="<name>@OpenDataLoader"` was
misrouted and then failed during parsing in the naive parser path. It
now routes correctly to OpenDataLoader and avoids passing unsupported
arguments that caused runtime errors. fixes #14572

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-06 15:00:55 +08:00
5e01feb755 fix(connector_service): add TIMEZONE setting and correct interval log… (#14446)
### What problem does this PR solve?



### Type of change

- [v] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: wiratama <dafa.wiratama@bankraya.co.id>
2026-05-06 14:40:35 +08:00
406b36a452 fix(#14389): normalize list metadata values for in filters (#14410)
## Summary
- normalize string items for list-valued metadata filters in
`meta_filter`
- fix `in` / `not in` case asymmetry when document metadata is
lowercased but filter list values are not
- add regression tests that cover the original issue scenario using
uppercase list values

## Validation
- `PYTHONPATH=external/ragflow pytest
external/ragflow/test/unit_test/common/test_metadata_filter_operators.py
-q`

## Notes
- I commented on #14389 before opening this PR to claim the issue.
- The new tests use `value=["F2", "F11"]` so they fail on the old
implementation and pass with this fix.
- This also benefits other non-comparison operators that flow through
the same normalization path.

Co-authored-by: copizza <copizza@users.noreply.github.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-05-06 14:28:25 +08:00
5672be0652 Feat: add IMAP deleted document sync (#14539)
### What problem does this PR solve?

add IMAP deleted document sync

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-06 14:06:46 +08:00
89961962c0 feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525)
### What problem does this PR solve?

Incremental DingTalk AI Table (Notable) sync did not reconcile rows
removed on the remote side with documents already in the knowledge base.
This follows the coordinated datasource work in #14362 (“sync deleted
files”).

This PR adds a **full slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that lists **current record IDs for
all sheets** without building document blobs, using the same logical
document IDs as full ingest
(`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`). When
**`sync_deleted_files`** is enabled on incremental runs,
`DingTalkAITable._generate` returns **`(document_generator,
file_list)`** so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove KB rows that no longer
exist remotely.

Design notes:

- **`_document_id`** centralizes the ID string so slim snapshots and
**`_convert_record_to_document`** stay aligned with
**`hash128(doc.id)`** semantics used during ingestion/cleanup.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with other
Dropbox-style connectors).
- **`batch_size`** from connector config is coerced to a positive
**`int`** before constructing the connector.
- Slim snapshot failures are caught in **`_generate`**; **`file_list`**
is set to **`None`** so cleanup is skipped rather than running on
partial/error state.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Files changed (summary)

| Area | Change |
|------|--------|
| `common/data_source/dingtalk_ai_table_connector.py` |
`SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`,
`_document_id` shared with document conversion |
| `rag/svr/sync_data_source.py` | `DingTalkAITable._generate`: slim
snapshot + tuple return; `batch_size` validation; shared `end_ts` with
`poll_source` |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`syncDeletedFiles` for DingTalk AI Table in
`DataSourceFeatureVisibilityMap` |

Closes / relates to: #14362
2026-05-06 14:06:23 +08:00
c0fc8b32f2 Fix: retry RocksDB metadata contention on concurrent CREATE/DROP (#14563)
Concurrent CREATE TABLE / CREATE INDEX / DROP TABLE on the same Infinity
instance can race on the catalog counter (e.g. db|1|next_table_id) and
fail with error 9003 "Resource busy" instead of waiting on a lock. Two
users creating a knowledge base at the same instant, or any deployment
with multiple backend workers behind one Infinity, can hit it.

Wrap the metadata paths in create_idx, create_doc_meta_idx, and
delete_idx with exponential backoff + jitter (5 attempts, 50ms base).
The wrapped operations already use ConflictType.Ignore, so retrying is
idempotent — worst case the second attempt is a no-op against an
already-created table. Tunable via INFINITY_META_RETRY_MAX /
INFINITY_META_RETRY_BASE_DELAY_MS.

Repro: stress 30 concurrent POST /api/v1/datasets against a 4-worker
backend → ~50% of requests fail without the patch (Resource busy from
the second worker that hits the counter), 100% succeed with it. At 100
concurrent requests, all 100 succeed in ~1.2s; the retry budget never
exhausted in our tests.

Scope is limited to metadata paths only — data-path operations (INSERT
chunks, SELECT for retrieval) go through per-table code paths and don't
share the contended counter.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>
2026-05-06 13:32:20 +08:00
a69e0c73c7 feat(rss): support deleted-file sync (#14493)
### What problem does this PR solve?

Partially addresses #14362.

This PR enables syncing deleted files for RSS data sources.

Previously, RSS incremental sync only returned feed entries whose
timestamps were inside the poll window. If an entry was removed from the
RSS feed, RAGFlow had no full current RSS snapshot to pass into the
shared stale-document cleanup path, so the deleted remote entry could
remain in the knowledge base.

This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector`
- reuses the same `rss:<md5(stable_key)>` document ID derivation used by
normal RSS ingest
- returns `(document_generator, file_list)` for incremental RSS sync
when `sync_deleted_files` is enabled
- captures the poll end timestamp before snapshot/poll so cleanup does
not race against the same sync window
- adds start/end logs around RSS slim snapshot collection
- exposes the deleted-file sync toggle for RSS in the data source UI

Per maintainer request on related datasource PRs, this PR contains no
test-case changes. Local verification was run with an external script.

Validation:
- `uv run ruff check common/data_source/rss_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`
- `uv run python /tmp/verify_rss_deleted_sync.py --repo
/root/74/ragflow`

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 18:56:13 +08:00
bedf9592ef feat(webdav): support deleted-file sync via slim snapshot (#14491)
## What problem does this PR solve?

Incremental WebDAV sync only ingested files whose modification time fell
inside the poll window; documents removed on the WebDAV server were
never removed from the knowledge base. This aligns with
[#14362](https://github.com/infiniflow/ragflow/issues/14362)
(coordinated datasource “sync deleted files” work).

This PR adds a **full-tree slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that enumerates current remote
paths **without downloading file contents**, using the same logical
document IDs as full ingest (`webdav:{base_url}:{file_path}`). When
**`sync_deleted_files`** is enabled on incremental runs, sync returns
**`(document_generator, file_list)`** so **`SyncBase`** runs
**`cleanup_stale_documents_for_task`** and removes KB rows no longer
present remotely.

Design notes:

- **`_list_files_recursive`** gains **`filter_by_mtime`**: snapshot
passes **`filter_by_mtime=False`** (full tree under **`remote_path`**);
**`poll_source`** keeps mtime-window filtering as before.
- Slim snapshot applies the same **extension** and **`size_threshold`**
rules as **`_yield_webdav_documents`** so retain IDs match what would be
indexed.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with
Dropbox-style connectors).

## Type of change

- [x] New Feature (non-breaking change which adds functionality)

## Files changed

| Area | Change |
|------|--------|
| `common/data_source/webdav_connector.py` |
`SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`,
`filter_by_mtime` on `_list_files_recursive` |
| `rag/svr/sync_data_source.py` | WebDAV `_generate`: `file_list` +
tuple return; pass **`batch_size`** from connector config |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` |
2026-04-30 17:26:27 +08:00
17eda04b8d feat(zendesk): support deleted-file sync (#14487)
### What problem does this PR solve?

Refs #14362.

This PR enables syncing deleted files for Zendesk data sources.

Previously, Zendesk incremental sync never returned a slim remote
snapshot to the shared stale-document cleanup path, so deleted remote
Zendesk records could remain in RAGFlow. The existing Zendesk slim
snapshot also included records that ingestion intentionally skips, such
as draft articles, articles without bodies, skipped-label articles,
empty-body articles, and tickets with `status == "deleted"`.

This PR:
- exposes the deleted-file sync option for Zendesk in the data source UI
- returns Zendesk slim snapshots during incremental sync when
`sync_deleted_files` is enabled
- reuses Zendesk indexability rules so cleanup compares against the same
records ingestion can materialize
- adds start/end logs around Zendesk slim snapshot collection for
operational visibility

Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.

Validation:
- `uv run ruff check common/data_source/zendesk_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-04-30 14:44:05 +08:00
8f75e52bbf feat(asana): support deleted-file sync (#14468)
### What problem does this PR solve?

Partially addresses #14362.

Adds deleted-file sync support for the Asana data source. Asana already
indexes task attachments as documents, but it did not provide the slim
document snapshot required by stale-document reconciliation, and the
sync wrapper never returned a `file_list` for cleanup.

This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector`
- builds slim IDs with the same `asana:{task_id}:{attachment_gid}`
format used by indexed documents
- avoids downloading attachment blobs during the snapshot
- aborts the snapshot if Asana API errors occur, preventing partial
snapshots from deleting valid local docs
- captures the incremental poll end time before snapshotting and makes
`poll_source()` respect that boundary
- exposes the deleted-file sync toggle for Asana in the data source UI

Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.

Validation:
- `uv run ruff check common/data_source/asana_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`

### Type of change

- [x] New Feature
2026-04-30 14:41:36 +08:00
2932b65da6 feat(seafile): support deleted-file sync via slim snapshot (#14499)
### What problem does this PR solve?

Incremental Seafile sync only ingests files whose modification time
falls in the poll window; documents removed in Seafile were never
removed from the knowledge base. This contributes to
[#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource
“sync deleted files” coordination).

This PR adds a **slim snapshot** (`retrieve_all_slim_docs_perm_sync`)
that enumerates current remote file IDs **without downloading content**,
using the same logical IDs as full ingest
(`seafile:{repo_id}:{file_id}`). When **`sync_deleted_files`** is
enabled on incremental runs, **`SeaFile._generate`** returns
**`(document_generator, file_list)`** so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove stale KB documents.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
### What changed

- **`common/data_source/seafile_connector.py`**: `SeaFileConnector`
implements **`SlimConnectorWithPermSync`**;
**`_list_files_recursive(..., filter_by_mtime=...)`** supports full-tree
listing for snapshots; **`retrieve_all_slim_docs_perm_sync()`** reuses
the same library/root scan as ingest and applies the same **size**
ceiling; logging for snapshot start/end and counts.
- **`rag/svr/sync_data_source.py`**: **`SeaFile._generate`** validates
**`batch_size`**, captures **`end_ts`** before snapshot +
**`poll_source`**, wraps slim retrieval in **`try`/`except`** (
**`file_list = None`** on failure so ingest continues), returns
**`(generator, file_list)`**.
- **`web/src/pages/user-setting/data-source/constant/index.tsx`**:
**`syncDeletedFiles`** for Seafile in
**`DataSourceFeatureVisibilityMap`**.
2026-04-30 12:05:12 +08:00
de8c6ad0f3 Feat: enable sync deleted file for Discord (#14451)
### What problem does this PR solve?

Feat: enable sync deleted file for Discord

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:05:40 +08:00
2bc8c6d35e feat(dropbox): support deleted-file sync (#14476)
### What problem does this PR solve?

Partially addresses #14362 by adding deleted-file sync support for the
Dropbox data source.

Dropbox previously did not provide the slim current-file snapshot
required by stale document reconciliation, and its sync runner returned
only document batches. As a result, enabling deleted-file sync could not
remove local documents that had been deleted from Dropbox.

This PR:
- Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`.
- Reuses Dropbox metadata traversal to collect current remote file IDs
without downloading file contents.
- Wires incremental Dropbox sync to return `(document_generator,
file_list)` when `sync_deleted_files` is enabled.
- Enables the deleted-file sync toggle for Dropbox in the data source
settings UI.
- Adds regression coverage for slim snapshots, nested folders, paginated
listings, duplicate filenames, and full reindex behavior.

Tests:
- `uv run pytest test/unit_test/common/test_dropbox_connector.py -q`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `uv run pytest test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py -q`
- `uv run ruff check common/data_source/dropbox_connector.py
rag/svr/sync_data_source.py
test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:05:11 +08:00
db1a73b255 Feat: enable sync deleted files in gitlab (#14481)
### What problem does this PR solve?

Feat: enable sync deleted files in gitlab

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:04:10 +08:00
e0b3070012 Feat: enable sync deleted files for Gmail && fix google drive issues (#14462)
### What problem does this PR solve?

Feat: enable sync deleted files for Gmail && fix google drive issues

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
Co-authored-by: balibabu <assassin_cike@163.com>
2026-04-29 17:03:56 +08:00
3b7a6eaa6c Feat: sync deleted files in Bitbucket (#14450)
### What problem does this PR solve?

Feat: sync deleted files in Bitbucket

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 11:29:17 +08:00
0d18b293f5 Fix: enable sync deleted file in airtable (#14438)
### What problem does this PR solve?

Fix: enable sync deleted file in airtable

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 20:09:08 +08:00
18fbfafca6 Feat: enable sync deleted files for more connectors (#14353)
### What problem does this PR solve?

Feat: enable sync delted files for connectors

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 15:07:14 +08:00
0df65d358a Fix case-insensitive matching for manual meta_data_filter in / not in list values (#14397)
## Summary

Fixes case-asymmetric matching for manual `meta_data_filter` when using
**`in`** / **`not in`** with a **list** `value`. Document metadata
strings were lowercased, but list elements were not, so values like
`"F2"` failed to match `["F2", "F11"]` even though **`=`** behaved
correctly.

Closes #14389

## Changes

- **`common/metadata_utils.py`**: For **`in`** / **`not in`**, normalize
string elements when `value` and/or `input` is a list, consistent with
scalar string lowercasing.
- **`test/unit_test/common/test_metadata_filter_operators.py`**:
Regression tests for list `value` case-insensitivity and **`not in`**.

## Type of change

- [x] Bug fix (non-breaking)
2026-04-28 14:51:48 +08:00
7a70a0fd85 Fix: preserve infinity available_int zero filter (#14416)
### What problem does this PR solve?

preserve infinity available_int zero filter

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 12:54:32 +08:00
2846a93998 Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?

Fixes #14196

## Problem

When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:

- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports

Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.

## Root Cause

```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
```

While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.

## Solution

### 1. Define constants in `common/constants.py`

```python
MAXIMUM_PAGE_NUMBER = 100000                        # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # Used by the task/DB layer
```

### 2. Replace all hardcoded sentinel values

| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`,
`docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`,
`docx_parser.py` | `299`, `600`, `10**9`, `100000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`,
`manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`,
`email.py`, `table.py` | `100000`, `10000`, `10000000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`,
`document_service.py`, `file_service.py` | `100000000` |
`MAXIMUM_TASK_PAGE_NUMBER` |

### 3. Fix `parse_into_bboxes()` missing parameters

Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.

## Files Changed (22)

- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 14:57:20 +08:00
fb95136f39 Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090)
### What problem does this PR solve?

The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.

This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-25 14:30:15 +08:00
78188ce9e9 Feat: add OpenDataLoader PDF parser backend (#14058) (#14097)
### What problem does this PR solve?

Closes #14058.

RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU,
Docling, TCADP, PaddleOCR). This PR adds **OpenDataLoader**
([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf))
as a new optional backend, giving users a deterministic, local-first
alternative with competitive table extraction accuracy.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---

### Changes

#### Backend
- `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser`
class inheriting `RAGFlowPdfParser`. Implements `check_installation()`
(guards Python package + Java 11+ runtime), `parse_pdf()` with
JSON-first extraction (heading/paragraph/table/list/image/formula) and
Markdown fallback, position-tag generation compatible with the shared
`@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup.
- `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in
`PARSERS` dict, added to `chunk_token_num=0` override list.
- `rag/flow/parser/parser.py` — `"opendataloader"` branch in the
pipeline PDF handler + check validation list.

#### Infrastructure
- `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in
via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH.

#### Frontend
- `web/src/components/layout-recognize-form-field.tsx` —
`OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown.
Cascades automatically to the pipeline editor's Parser component.

#### Docs
- `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader
entry and full env-var reference.

---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install
`opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g.
`==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g.
`docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` /
`external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp
dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate
files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection
patterns from output |

---

### Dependencies

- **Runtime**: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not
added to `pyproject.toml` core deps. Installed by
`ensure_opendataloader()` at container startup when
`USE_OPENDATALOADER=true`.
- **System**: Java 11+ on PATH (JVM is the underlying engine). The
installer skips with a warning if `java` is not found.

---

### How to test

**Standalone parser:**
```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"

```
### Benchmark vs Docling
```
file                      parser            secs  sections  tables
----------------------------------------------------------------------
text-heavy.pdf            docling           45.29       148      10
text-heavy.pdf            opendataloader     3.14       559       0
table-heavy.pdf           docling           7.05        76       3
table-heavy.pdf           opendataloader     3.71        90       0
complex.pdf               docling            42.67       114       8
complex.pdf               opendataloader     3.51       180       0
```
2026-04-25 00:33:02 +08:00
ca01c7a745 Fix blob sync: skip unsupported files before download (#14357)
### What problem does this PR solve?

Blob storage sync was downloading unsupported files first and rejecting
them later, which wasted bandwidth and made sync slower. This PR skips
unsupported extensions before download and applies `allow_images` in
blob sync. fixes #14338

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 19:22:32 +08:00
84b6069ec7 fix: escape single quotes in Infinity SQL filter conditions (#14186)
### What problem does this PR solve?

## Summary

Fixes #5939

Entity names containing single quotes (e.g., `投影直线L'`) caused SQL syntax
errors when building filter conditions for Infinity queries, due to
unescaped string interpolation in `equivalent_condition_to_str`.

## Changes

In `common/doc_store/infinity_conn_base.py`, added `.replace("'", "''")`
escaping for string values in two branches of
`equivalent_condition_to_str` where it was missing:

1. **`field_keyword` branch with non-list value** (line 190): The list
branch already escaped single quotes on line 183, but the single-string
branch did not.
2. **Plain string value branch** (line 209): Direct f-string
interpolation `{k}='{v}'` was vulnerable to unescaped quotes.

Both fixes use the same SQL-standard escape pattern (`'` → `''`) already
applied elsewhere in this method.

## How to Test

1. Upload a document containing entity names with single quotes.
2. Enable Knowledge Graph (GraphRAG) in the parsing configuration.
3. Initiate document parsing — it should complete without SQL syntax
errors.

## Note

The original issue also reported a typo (`dge_graph_kwd` instead of
`knowledge_graph_kwd`), which has already been fixed in the current
codebase.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-20 10:04:07 +08:00
96a23d2fd0 [Bug fix] fix bug found in regression when view chunks for document that not parsed in infinity, it would fail in UI (#14168)
### What problem does this PR solve?
See title, the fail image:
<img width="2667" height="915" alt="20260416-205718"
src="https://github.com/user-attachments/assets/0c564237-5ed0-49af-bf4c-d3b5519abc6e"
/>

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-17 09:51:23 +08:00
0cd49e14dd fix: make Infinity connection pool size configurable and add retry logic for GraphRAG write bursts (#14143)
### What problem does this PR solve?

Resolve #14137 .

### Problem

Graph resolution succeeds (nodes/edges merged, pagerank updated), but
the subsequent burst of Infinity write operations in `set_graph`
exhausts the connection pool with `TOO_MANY_CONNECTIONS` errors. Root
causes:

1. **Hardcoded pool size** — `infinity_conn_pool.py` hardcoded
`ConnectionPool(max_size=4)` on initial creation and `max_size=32` on
refresh. Operators cannot tune this without patching code.
2. **No retry on transient failures** — a single `TOO_MANY_CONNECTIONS`
on edge deletes or chunk inserts kills the entire resolution+community
pipeline with no retry.

### Changes

#### `common/doc_store/infinity_conn_pool.py`

- Read `ConnectionPool` `max_size` from the `INFINITY_POOL_MAX_SIZE`
environment variable (default: `4`), applied consistently to both
initial creation and refresh paths.
- Log the actual pool size on startup for easier debugging.

#### `rag/graphrag/utils.py` — `set_graph()`

- **Edge deletes**: add exponential-backoff retry (3 attempts, 1s/2s/4s
delays) so transient `TOO_MANY_CONNECTIONS` errors are retried instead
of failing the entire job. Concurrency continues to be gated by the
existing `chat_limiter`.
- **Batch inserts**: add exponential-backoff retry (3 attempts, 1s/2s/4s
delays) for the same reason.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-16 15:40:54 +08:00
c93ec0a1f3 Fix: reject empty/space-only content in update_chunk API (#14082)
Closes #6541

### What problem does this PR solve?

Add content validation to `update_chunk` (SDK and non-SDK) to reject
empty or whitespace-only content before it reaches the embedding model.

**Before:** Calling `update_chunk` with space-only content (like `" "`,
`""`, `"\n"`) bypassed validation and was sent directly to the embedding
model, which returned an error. This was the same bug previously fixed
for `add_chunk` in #6390, but `update_chunk` was missed.

**After:** Empty/whitespace-only content is caught by validation and
returns an error: `` `content` is required ``

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-15 18:43:53 +08:00
38cefd88e2 Fix tag_feas code injection in retrieval ranking (#13923)
## Summary
- remove eval-based parsing from retrieval rank feature scoring
- validate `tag_feas` at write time in chunk APIs and SDK routes
- add regression tests for safe parsing and malicious payload rejection

## Details
`tag_feas` is intended to be structured rank-feature data, but the
retrieval ranking path was evaluating stored values as Python
expressions. This change treats `tag_feas` strictly as data.

### What changed
- replace `eval()` in `rag/nlp/search.py` with safe parsing via
`json.loads()` and optional `ast.literal_eval()` compatibility for
legacy Python-dict strings
- strictly filter parsed values down to `dict[str, finite number]`
- reject invalid `tag_feas` payloads at write time in web chunk routes
and SDK document chunk routes
- add focused regression tests to prove executable strings are ignored
and invalid payloads are rejected

## Validation
- `python -m pytest test/unit_test/common/test_tag_feature_utils.py
test/unit_test/rag/test_rank_feature_scores.py -q`

---------

Co-authored-by: unknown <zhenglinkai@CCN.Local>
Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>
2026-04-15 16:31:11 +08:00
aa92abe73c fix: close file handles properly in json.load() calls (#13997)
## Summary

Fixes #13996

Replace `json.load(open(...))` with `with open(...) as f: json.load(f)`
in two files to ensure file descriptors are properly closed.

**Affected files:**
- `common/doc_store/infinity_conn_base.py` — schema loading for Infinity
doc store
- `api/db/init_data.py` — agent template loading at startup

## Why this matters

In a long-running server process like RAGFlow, leaked file descriptors
from `json.load(open(...))` can accumulate over time. While CPython's
refcounting usually cleans these up, it's not guaranteed (especially
under memory pressure or with alternative Python runtimes), and can lead
to `OSError: [Errno 24] Too many open files`.

## Test plan

- [ ] Verify Infinity doc store schema loading still works correctly
- [ ] Verify agent templates load correctly on startup

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Improved file handling in internal data processing to ensure proper
resource cleanup.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: easonysliu <easonysliu@tencent.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 12:16:49 +08:00
e7d044413f Fix: Google Drive connector missing new files after initial sync (#13943)
Closes https://github.com/infiniflow/ragflow/issues/13939

## What problem does this PR solve?

The Google Drive connector fails to detect new files after the initial
sync (#13939). The root cause is that `generate_time_range_filter()`
applies a strict `modifiedTime > poll_range_start` cutoff when querying
the Google Drive API. Files uploaded to Google Drive that retain their
original `modifiedTime` (common behavior) get silently excluded if their
timestamp predates the last sync's cutoff.

Unlike the Confluence and Jira connectors which use a configurable time
buffer (`CONFLUENCE_SYNC_TIME_BUFFER_SECONDS`) to offset
`poll_range_start` backward, the Google Drive connector had no such
mechanism — resulting in a razor-sharp timestamp boundary with zero
tolerance for overlap.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)


## Summary

* **New Features**
* Added a configurable time buffer for Google Drive synchronization to
address timing delays and improve sync reliability.
* Improved file detection logic to include recently created files
alongside modified ones, reducing missed synchronizations.
2026-04-10 11:39:19 +08:00
c2ce49e037 fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969)
Fixes #13823

## Problem

When querying with words like `cat`, RAGFlow's query expansion system
looks up synonyms via WordNet, which can return terms containing single
quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document
store, these unescaped single quotes in the query string cause a
`TokenError` because Infinity's lexer treats `'` as a string delimiter.

```
TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531
```

## Solution

Strip single quotes from synonym terms before they are inserted into
query expressions, consistent with how single quotes are already
stripped from the input query text (line 51 of `query.py`):

- **`common/query_base.py`**: In `sub_special_char()`, strip `'` before
escaping other special characters. This fixes the Chinese text
processing path and the `paragraph()` method.
- **`rag/nlp/query.py`**: In the English text path, strip `'` from
tokenized synonym terms.
- **`memory/services/query.py`**: Same fix for the memory query English
text path.

## Testing

The fix can be verified by:
1. Using Infinity as the document store (`DOC_ENGINE=infinity`)
2. Creating a dataset and running a retrieval test with the keyword
`cat`
3. Confirming no `TokenError` is raised and results are returned
normally

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced special character handling in query processing and synonym
expansion by properly sanitizing single quotes before text processing.
* Simplified OCR detection output by removing timing metadata while
preserving core detection accuracy.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ximi <octo-patch@github.com>
2026-04-09 19:10:34 +08:00
1e83c8c051 Fix: align MCP tool call timeout and handle empty content (#13899)
### What problem does this PR solve?
Resolves #12105
This PR fixes two MCP tool call issues in
`common/mcp_tool_call_conn.py`.
First, the timeout passed to `tool_call(..., timeout=...)` was only
applied to the outer `future.result(...)` wait, but was not forwarded to
the internal MCP request. As a result, callers could pass a longer
timeout while the actual MCP request still failed after the default
internal timeout.
Second, the MCP tool call result handling assumed `result.content[0]`
always existed. If an MCP server returned an empty content list, this
could raise an exception unexpectedly.
This PR fixes both issues by:
- forwarding the external `timeout` value to the internal MCP request
timeout
- returning a clear message when the MCP server returns empty content
instead of indexing into an empty list

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update                                             
- [ ] Refactoring                               
- [ ] Performance Improvement
- [ ] Other (please describe)
2026-04-09 18:44:04 +08:00
8d52ef2893 Feat: enable sync deleted files for connector (#14000)
### What problem does this PR solve?

Feat: enable sync deleted files for connector
1. first comes with github

### Type of change

- [x] New Feature (non-breaking change which adds functionality)



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added "sync deleted files" feature for data sources, enabling
automatic removal of files deleted from the source system.
* Added multilingual support for the new sync deleted files setting
across multiple languages.

* **UI Improvements**
  * Improved checkbox form field rendering and layout.
  * Enhanced full-width display for authentication token input fields.
2026-04-09 16:40:14 +08:00
424aee5bec fix: correct typos in code comments, docstrings and docs (#13931)
## Summary
- Fix `a image` → `an image` in README and log message
- Fix `colomn` → `column` in table structure recognizer comment
- Fix `formated` → `formatted` in confluence connector docstring
- Fix `tabel of content` → `table of contents` in TOC prompt

## Test plan
- [ ] Documentation and comment changes, no functional impact

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuj <yuj@ztjzsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-04-07 13:05:39 +08:00
69264b3a70 Feat: Refact pipeline (#13826)
### What problem does this PR solve?

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 19:26:45 +08:00
6b7989b4b4 Add file type validation (#13802)
### What problem does this PR solve?

This PR fixes WebDAV sync behavior for unsupported file types
([#13795](https://github.com/infiniflow/ragflow/issues/13795)).

Previously, the WebDAV connector selected files primarily by modified
time (and size threshold) and could still pass unsupported extensions
into the download/document-generation path. This caused unnecessary
processing and inconsistent behavior compared with connectors that
validate file type earlier.

This change adds extension validation in two places:

1. **Early filter during recursive listing** to skip unsupported files
before they enter the download flow.
2. **Defensive filter before download/document creation** to prevent
unsupported files from being processed if any listing edge case slips
through.

It also wires `allow_images` into the WebDAV sync path so image
extension handling follows connector policy.

Scope is intentionally limited to WebDAV for a focused bug-fix PR.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### How was this tested?

- Manual verification with mixed file types under the configured WebDAV
path:
  - supported: `.pdf`, `.txt`, `.md`
  - unsupported: `.exe`, `.bin`, `.dat`
- Triggered full sync and polling sync.
- Confirmed unsupported files are skipped before download.
- Confirmed supported files are still indexed normally.
- Confirmed image handling follows `allow_images` setting.

Fixes: #13795
2026-04-02 14:12:27 +08:00
0462c20113 Fix special characters in matching text of search() (#13852)
### What problem does this PR solve?

Fix special characters in matching text of search(). We should escape
some special characters(such as ?, *,:) before passing to matching_text
of search()

Fix https://github.com/infiniflow/ragflow/issues/13729

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-30 18:47:10 +08:00
0d85a8e7aa feat: add dynamic log level adjustment APIs (#13850)
Add REST APIs to dynamically query and modify log levels at runtime for
both Python (Flask) and Go servers.

Changes:
- common/log_utils.py: add set_log_level() and get_log_levels()
functions
- admin/server/routes.py: add GET/PUT /api/v1/admin/log_levels endpoints
- api/apps/system_app.py: add GET/PUT /api/{version}/system/log_levels
endpoints
- internal/logger/logger.go: add GetLevel() and SetLevel() with atomic
level support
- internal/handler/system.go: add GetLogLevel, SetLogLevel, Health
handlers
- internal/router/router.go: route /health to systemHandler
- internal/admin/handler.go: add GetLogLevel, SetLogLevel handlers
- internal/admin/router.go: add /api/v1/admin/log_level routes

### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 18:40:58 +08:00
cb78ce0a7b feat: support rss datasource (#13721)
### What problem does this PR solve?

Supporting public RSS/Atom feed URLs as data sources for RagFlow.

link https://github.com/infiniflow/ragflow/issues/12313

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-03-27 22:58:44 +08:00
840cc8fbe9 fix(asana): use project memberships endpoint for project IDs in connector (#13746)
### What problem does this PR solve?

Fixes a bug in the Asana connector where providing `Project IDs` caused
sync to fail with:

`project_membership: Not a recognized ID: <PROJECT_GID>`

Root cause: the connector called `get_project_membership(project_gid)`,
but that API expects a **project membership gid**, not a **project
gid**.
This PR switches to the correct project-scoped API and adds regression
tests.

Fixes: [#13669](https://github.com/infiniflow/ragflow/issues/13669)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Changes made

- Updated `common/data_source/asana_connector.py`:
- Replaced `get_project_membership(pid, ...)` with
`get_project_memberships_for_project(pid, ...)`
- Trimmed and filtered `asana_project_ids` parsing to avoid
empty/whitespace IDs
  - Normalized `asana_team_id` by trimming whitespace
  - Used safer access for membership email extraction (`m.get("user")`)
- Added `test/unit_test/common/test_asana_connector.py`:
  - Verifies the correct project-membership API method is called
  - Verifies empty `project_ids` path returns workspace emails
  - Verifies project/team input normalization behavior

### Compatibility / risk

- Non-breaking bug fix
- No API contract changes
- Existing behavior for empty `Project IDs` remains unchanged
2026-03-24 20:21:31 +08:00
dd839f30e8 Fix: code supports matplotlib (#13724)
### What problem does this PR solve?

Code as "final" node: 

![img_v3_02vs_aece4caf-8403-4939-9e68-9845a22c2cfg](https://github.com/user-attachments/assets/9d87b8df-da6b-401c-bf6d-8b807fe92c22)

Code as "mid" node:

![img_v3_02vv_f74f331f-d755-44ab-a18c-96fff8cbd34g](https://github.com/user-attachments/assets/c94ef3f9-2a6c-47cb-9d2b-19703d2752e4)


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-03-20 20:32:00 +08:00
c3f79dbcb0 fix(jira): prevent missed incremental updates after issue edits (#13674)
### What problem does this PR solve?

Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira
incremental sync could miss updated issues after initial sync,
especially near time boundaries.

Root cause:
- Jira JQL uses minute-level precision for `updated` filters.
- Incremental windows had no overlap buffer, so boundary updates could
be skipped.
- Sync log cursor tracking used a backward-facing update for
`poll_range_start`.
- Existing-doc updates in `upload_document` lacked a KB ownership guard
for doc-id collisions.

What changed:
- Added Jira incremental overlap buffer (`time_buffer_seconds`,
defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL
lower-bound time.
- Preserved second-level post-filtering to avoid duplicate reprocessing
while still catching boundary updates.
- Improved Jira sync logging to include start/end window and overlap
configuration.
- Updated sync cursor tracking in `increase_docs` to keep
`poll_range_start` moving forward with max update time.
- Added KB ID safety check before updating existing document records in
`upload_document`.

Verification performed:
- Python syntax compile checks passed for modified files.
- Manual verification flow:
  1. Run full Jira sync.
  2. Edit an already-indexed Jira issue.
  3. Run next incremental sync.
  4. Confirm updated content is re-ingested into KB.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-18 23:31:05 +08:00
387b0b27c4 feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527)
### What problem does this PR solve?

This PR adds support for parsing PDFs through an external Docling
server, so RAGFlow can connect to remote `docling serve` deployments
instead of relying only on local in-process Docling.

It addresses the feature request in
[#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns
with the external-server usage pattern already used by MinerU.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### What is changed?

- Add external Docling server support in `DoclingParser`:
  - Use `DOCLING_SERVER_URL` to enable remote parsing mode.
- Try `POST /v1/convert/source` first, and fallback to
`/v1alpha/convert/source`.
- Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not
set.
- Wire Docling env settings into parser invocation paths:
  - `rag/app/naive.py`
  - `rag/flow/parser/parser.py`
- Add Docling env hints in constants and update docs:
  - `docs/guides/dataset/select_pdf_parser.md`
  - `docs/guides/agent/agent_component_reference/parser.md`
  - `docs/faq.mdx`

### Why this approach?

This keeps the change focused on one issue and one capability (external
Docling connectivity), without introducing unrelated provider-model
plumbing.

### Validation

- Static checks:
  - `python -m py_compile` on changed Python files
  - `python -m ruff check` on changed Python files
- Functional checks:
  - Remote v1 endpoint path works
  - v1alpha fallback works
  - Local Docling path remains available when server URL is unset

### Related links

- Feature request: [Support external Docling server (issue
#13426)](https://github.com/infiniflow/ragflow/issues/13426)
- Compare view for this branch:
[main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1)

##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)
2026-03-12 17:09:03 +08:00
e1b632a7bb Feat: add delete all support for delete operations (#13530)
### What problem does this PR solve?

Add delete all support for delete operations.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update

---------

Co-authored-by: writinwaters <cai.keith@gmail.com>
2026-03-12 09:47:42 +08:00
675810e0cf Refact: optimize confluence performance (#13497)
### What problem does this PR solve?

Refact: optimize confluence performance #13494

### Type of change

- [x] Refactoring
2026-03-10 15:02:24 +08:00