mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-05-22 00:50:10 +08:00
## What problem does this PR solve?

Incremental WebDAV sync only ingested files whose modification time fell inside the poll window; documents removed on the WebDAV server were never removed from the knowledge base. This aligns with [#14362](https://github.com/infiniflow/ragflow/issues/14362) (coordinated datasource “sync deleted files” work).

This PR adds a **full-tree slim snapshot** (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote paths **without downloading file contents**, using the same logical document IDs as full ingest (`webdav:{base_url}:{file_path}`). When **`sync_deleted_files`** is enabled on incremental runs, sync returns **`(document_generator, file_list)`** so **`SyncBase`** runs **`cleanup_stale_documents_for_task`** and removes KB rows no longer present remotely.

Design notes:

- **`_list_files_recursive`** gains **`filter_by_mtime`**: snapshot passes **`filter_by_mtime=False`** (full tree under **`remote_path`**); **`poll_source`** keeps mtime-window filtering as before.
- Slim snapshot applies the same **extension** and **`size_threshold`** rules as **`_yield_webdav_documents`** so retain IDs match what would be indexed.
- **`end_ts`** is captured before building **`file_list`**, then **`poll_source`** uses the same upper bound (consistent with Dropbox-style connectors).

## Type of change

- [x] New Feature (non-breaking change which adds functionality)

## Files changed

| Area | Change |
|------|--------|
| `common/data_source/webdav_connector.py` | `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `filter_by_mtime` on `_list_files_recursive` |
| `rag/svr/sync_data_source.py` | WebDAV `_generate`: `file_list` + tuple return; pass **`batch_size`** from connector config |
| `web/src/pages/user-setting/data-source/constant/index.tsx` | `syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` |
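
To illustrate how the slim snapshot and the mtime-filtered poll fit together, here is a minimal, self-contained sketch. It is not the code in `webdav_connector.py`: `RemoteFile`, `select_files`, `stale_ids`, the example URL, and the exact boundary semantics of the window and size checks are all assumptions made for illustration. Only the ID format `webdav:{base_url}:{file_path}` is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RemoteFile:
    """Hypothetical stand-in for one entry of a WebDAV listing (metadata only)."""
    path: str
    size: int
    mtime: float  # seconds since the epoch


def doc_id(base_url: str, file_path: str) -> str:
    # Logical document ID format quoted in the PR description.
    return f"webdav:{base_url}:{file_path}"


def select_files(
    files: list[RemoteFile],
    allowed_exts: tuple[str, ...],
    size_threshold: int,
    window: Optional[tuple[float, float]] = None,
) -> list[RemoteFile]:
    """Apply extension and size rules; filter by mtime only if a window is given.

    window=None plays the role of filter_by_mtime=False (full-tree snapshot).
    The (start, end] boundary and the "> threshold is skipped" rule are assumptions.
    """
    selected = []
    for f in files:
        if not f.path.lower().endswith(allowed_exts):
            continue
        if f.size > size_threshold:
            continue
        if window is not None and not (window[0] < f.mtime <= window[1]):
            continue
        selected.append(f)
    return selected


def stale_ids(indexed_ids: set[str], retained_ids: set[str]) -> set[str]:
    # Anything indexed earlier but absent from the snapshot is a cleanup candidate.
    return indexed_ids - retained_ids


BASE_URL = "https://dav.example.com"  # hypothetical server
remote = [
    RemoteFile("/docs/a.md", 100, 1_000.0),
    RemoteFile("/docs/b.pdf", 200, 5_000.0),
    RemoteFile("/docs/huge.pdf", 10_000_000, 5_000.0),  # over the size threshold
]
indexed = {
    doc_id(BASE_URL, p) for p in ("/docs/a.md", "/docs/b.pdf", "/docs/deleted.md")
}

exts = (".md", ".pdf")
threshold = 1_000_000
end_ts = 6_000.0  # captured once, before the snapshot is built

# Full-tree snapshot: same extension/size rules, no mtime filter.
snapshot = select_files(remote, exts, threshold)
retain = {doc_id(BASE_URL, f.path) for f in snapshot}

# Incremental poll: same rules plus the mtime window ending at the same end_ts.
changed = select_files(remote, exts, threshold, window=(4_000.0, end_ts))

to_delete = stale_ids(indexed, retain)
```

Because the snapshot and the poll share one filter, a file that would never be indexed (such as the oversized PDF here) never appears in the retain set either, so the cleanup step compares like with like.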