### What problem does this PR solve?
fix some comments to improve readability
### Type of change
- [x] Documentation Update
---------
Signed-off-by: box4wangjing <box4wangjing@outlook.com>
## Summary
- Wrap 2 `ThreadPoolExecutor` instances in `file_service.py` with `with`
statement
- Ensures threads are properly shut down after all futures complete
## Problem
`parse_docs()` (line 532) and the file processing method (line 694)
create `ThreadPoolExecutor` instances that are never shut down. In a
long-running server process, this leaks thread resources on every
invocation — threads remain alive consuming memory even after all
submitted work is complete.
## Fix
Replace bare `ThreadPoolExecutor()` with `with ThreadPoolExecutor() as
exe:` context manager, which calls `executor.shutdown(wait=True)` on
exit.
## Test plan
- [x] Verified both call sites use `with` statement after fix
- [x] No remaining bare `ThreadPoolExecutor` in `file_service.py`
- [x] `document_service.py:1066` is a module-level executor (different
pattern, not changed in this PR)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
## Summary
- Wrap the `ThreadPoolExecutor` instances in `FileService.parse_docs`
and `FileService.get_files` with `with ... as exe:` blocks for
deterministic cleanup
- Replace the `concurrent.futures.ThreadPoolExecutor` in
`do_handle_task` with `asyncio.create_task(asyncio.to_thread(build_TOC,
...))`, preserving the existing parallelism with chunk insertion while
leveraging the surrounding async context
- Drop the now-unused `import concurrent` and the
`executor.shutdown(wait=False)` call in the `finally` block
Closes#14622.
No behavioral change, no public API change. Net diff: ~19 insertions /
25 deletions across two files.
## Test plan
- [ ] `uv run ruff check api/db/services/file_service.py
rag/svr/task_executor.py` passes
- [ ] Upload a multi-file batch through the chat/file endpoint and
confirm `FileService.parse_docs` still returns combined parsed text
- [ ] Trigger `FileService.get_files` via the chat reference flow with a
mix of image and non-image files; verify both `raw=True` and `raw=False`
paths return correctly
- [ ] Run a `naive`-parser document task with `toc_extraction: true` and
confirm the TOC chunk is generated and inserted exactly as before
- [ ] Run a `naive`-parser document task with `toc_extraction: false`
and confirm the path with `toc_thread = None` is unaffected
- [ ] Cancel a running task to exercise the `finally` block and confirm
cleanup still works without the executor shutdown call
---------
Co-authored-by: web-dev0521 <jasonpette1783@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
### What problem does this PR solve?
S3-family connector syncs currently re-download every in-window object
just so we can compute `xxhash128(blob)` and compare against
`Document.content_hash`. Anything that bumps `LastModified` without
changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays
full bandwidth and re-parses files that didn't actually change. #14628
covers the broader incremental-ingestion redesign; this PR is the first
slice.
The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2
/ GCS / OCI / S3-compat) now implements a new `FingerprintConnector`
interface: `list_keys()` paginates `list_objects_v2` and yields
`KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The
orchestrator joins those against the connector's existing `{doc_id:
content_hash}` map and only calls `get_value(key)` when the fingerprint
differs. Unchanged keys are skipped entirely — no `GetObject`, no
re-parse.
No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing
`Document.content_hash` column per @yingfeng's suggestion; the connector
decides at listing time whether to populate it. Local uploads and
connectors that don't opt in fall through to the existing post-download
`xxhash128(blob)` path with no behavior change.
This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent
PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS
(PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint`
(PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation
(PR-4). Holding those back keeps this PR additive and reviewable on its
own.
#### Files touched
- `common/data_source/models.py` — new `KeyRecord`; optional
`fingerprint` on `Document`
- `common/data_source/interfaces.py` — `IncrementalCapability` enum,
`FingerprintConnector` ABC
- `common/data_source/blob_connector.py` — `BlobStorageConnector`
implements `FingerprintConnector`; per-object download factored into
`_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`,
`get_value` all share it
- `rag/svr/sync_data_source.py` —
`_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop;
`_run_task_logic` plumbs `doc.fingerprint` into the upload dict
- `api/db/services/document_service.py` —
`list_id_content_hash_map_by_kb_and_source_type()` helper
- `api/db/services/connector_service.py` + `file_service.py` —
fingerprint flows through `duplicate_and_parse → upload_document` and
lands in `content_hash`
- `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests
covering ETag normalization (single-part, multipart, quoted, empty),
`list_keys()` not calling `GetObject`, `get_value()` materializing with
fingerprint, deterministic/stable fingerprints, and the bypass loop
asserting `GetObject` is *not* called on a match
#### Worth flagging for review
Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a
`LastModified` window when `poll_range_start` was set. New code uses
`_fingerprint_filtered_generator` (full bucket listing + fingerprint
compare) outside of explicit `reindex=1`. Strictly better for
unchanged-bucket cases since it skips `GetObject`, but it does mean
every sync now does a full `list_objects_v2` paginate. Should still be
cheap for most buckets — flagging in case anyone has a very large bucket
where the time-window filter was meaningful.
On migration: existing rows have `content_hash = xxhash128(blob)` from
the old code. The first sync after this lands sees ETag-derived
fingerprints that don't match, re-fetches every object once, and writes
the new fingerprint. From the second sync onward the bypass works as
expected. "Slow day one, fast every day after." A `fingerprint_backfill:
trust` opt-out is sketched in the design doc but not in this PR.
#### Test plan
- [x] `uv run ruff check` — clean on all 8 touched files
- [x] `uv run pytest
test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed
- [x] Broader unit-test suite — no regressions in anything I touched
- [ ] Manual smoke against a real S3 bucket — configure a connector, run
sync twice, expect the second sync to log `bypassed=N, fetched=0` and no
`GetObject` calls in CloudTrail / bucket access logs
- [ ] Manual smoke with `reindex=1` — confirm the full re-download path
still works
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fixes#14196
## Problem
When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:
- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports
Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.
## Root Cause
```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
# Only the first 300 pages were rendered; everything beyond was silently dropped
```
While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.
## Solution
### 1. Define constants in `common/constants.py`
```python
MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer
```
### 2. Replace all hardcoded sentinel values
| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`,
`docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`,
`docx_parser.py` | `299`, `600`, `10**9`, `100000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`,
`manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`,
`email.py`, `table.py` | `100000`, `10000`, `10000000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`,
`document_service.py`, `file_service.py` | `100000000` |
`MAXIMUM_TASK_PAGE_NUMBER` |
### 3. Fix `parse_into_bboxes()` missing parameters
Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
## Files Changed (22)
- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Signed-off-by: noob <yixiao121314@outlook.com>
### What problem does this PR solve?
The POST /upload_info?url=<url> endpoint accepted a user-supplied URL
and passed it directly to AsyncWebCrawler without any validation. There
were no restrictions on URL scheme, destination hostname, or resolved IP
address. This allowed any authenticated user to instruct the server to
make outbound HTTP requests to internal infrastructure — including RFC
1918 private networks, loopback addresses, and cloud metadata services
such as http://169.254.169.254 — effectively using the server as a proxy
for internal network reconnaissance or credential theft.
This PR adds an SSRF guard (_validate_url_for_crawl) that runs before
any crawl is initiated. It enforces an allowlist of safe schemes
(http/https), resolves the hostname at validation time, and rejects any
URL whose resolved IP falls within a private or reserved network range.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira
incremental sync could miss updated issues after initial sync,
especially near time boundaries.
Root cause:
- Jira JQL uses minute-level precision for `updated` filters.
- Incremental windows had no overlap buffer, so boundary updates could
be skipped.
- Sync log cursor tracking used a backward-facing update for
`poll_range_start`.
- Existing-doc updates in `upload_document` lacked a KB ownership guard
for doc-id collisions.
What changed:
- Added Jira incremental overlap buffer (`time_buffer_seconds`,
defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL
lower-bound time.
- Preserved second-level post-filtering to avoid duplicate reprocessing
while still catching boundary updates.
- Improved Jira sync logging to include start/end window and overlap
configuration.
- Updated sync cursor tracking in `increase_docs` to keep
`poll_range_start` moving forward with max update time.
- Added KB ID safety check before updating existing document records in
`upload_document`.
Verification performed:
- Python syntax compile checks passed for modified files.
- Manual verification flow:
1. Run full Jira sync.
2. Edit an already-indexed Jira issue.
3. Run next incremental sync.
4. Confirm updated content is re-ingested into KB.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Closes: #12889
### What problem does this PR solve?
When syncing external data sources (e.g., Jira, Confluence, Google
Drive), updated documents were not being re-chunked. The raw content was
correctly updated in blob storage, but the vector database retained
stale chunks, causing search results to return outdated information.
**Root cause:** The task digest used for chunk reuse optimization was
calculated only from parser configuration fields (`parser_id`,
`parser_config`, `kb_id`, etc.), without any content-dependent fields.
When a document's content changed but the parser configuration remained
the same, the system incorrectly reused old chunks instead of
regenerating new ones.
**Example scenario:**
1. User syncs a Jira issue: "Meeting scheduled for Monday"
2. User updates the Jira issue to: "Meeting rescheduled to Friday"
3. User triggers sync again
4. Raw content panel shows updated text ✓
5. Chunk panel still shows old text "Monday" ✗
**Solution:**
1. Include `update_time` and `size` in the chunking config, so the task
digest changes when document content is updated
2. Track updated documents separately in `upload_document()` and return
them for processing
3. Process updated documents through the re-parsing pipeline to
regenerate chunks
[1.webm](https://github.com/user-attachments/assets/d21d4dcd-e189-4d39-8700-053bae0ca5a0)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Issue: #12756
### What problem does this PR solve?
When users upload files through Agent's Begin or Await Response
components, the parsing is hardcoded to "Plain Text", ignoring all other
available parsers (DeepDOC, TCADP, Docling, MinerU, PaddleOCR). This PR
adds a PDF parser dropdown to these components so users can select the
appropriate parser for their file inputs.
### Changes
**Backend**
- `agent/component/fillup.py` - Added `layout_recognize` param to
`UserFillUpParam`, forwarded to `FileService.get_files()`
- `agent/component/begin.py` - Same forwarding in `Begin._invoke()`
- `agent/canvas.py` - Extract Begin's `layout_recognize` for `sys.files`
parsing, added param to `get_files_async()` / `get_files()`
- `api/db/services/file_service.py` - Added `layout_recognize` param to
`parse()` and `get_files()`, replacing hardcoded `"Plain Text"`
- `rag/app/naive.py` - Added `"plain text"` and `"tcadp parser"` aliases
to PARSERS dict to match dropdown values after `.lower()`
**Frontend**
- `web/src/pages/agent/form/begin-form/index.tsx` - Show
`LayoutRecognizeFormField` dropdown when file inputs exist
- `web/src/pages/agent/form/begin-form/schema.ts` - Added
`layout_recognize` to Zod schema
- `web/src/pages/agent/form/user-fill-up-form/index.tsx` - Same dropdown
for Await Response component
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: Hash doc id to avoid duplicate name.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
After a file in the file list is associated with a knowledge base, the
knowledge base document ID is returned
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Try to make this more asynchronous. Verified in chat and agent
scenarios, reducing blocking behavior. #11551, #11579.
However, the impact of these changes still requires further
investigation to ensure everything works as expected.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Add the specified parent_path to the document upload api interface
(#11230)
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: virgilwong <hyhvirgil@gmail.com>
### What problem does this PR solve?
Add get_uuid, download_img and hash_str2int into misc_utils.py
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Improve file management. #10287.
Passed tests:
1. Create folder `A` and `B`.
2. Upload a file inside `A`, called `file`.
3. Create a KB, called `K`.
3. Link `file` to `K`.
4. Parse `file` inside of `K`. (OK)
5. Move `file` from `A` to `B`.
6. Parse `file` inside of `K`. (OK)
7. Move `file` from `B` to `A`.
8. Parse `file` inside of `K`. (OK)
9. Move entire folder `A` into `B`. (B -> A -> file)
10. Parse `file` inside of `K`. (OK)
11. Delete folder `B`.
12. All clear. (There is no document inside of `K`)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
- Admin client support drop user.
Issue: #10241
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
KB folder may not there while creating virtual file. #9423
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
#9082#6365
<u> **WARNING: it's not compatible with the older version of `Agent`
module, which means that `Agent` from older versions can not work
anymore.**</u>
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Replace hardcoded 255-byte file name length checks with
FILE_NAME_LEN_LIMIT constant
- Update error messages to show the actual limit value
- #8290
### Type of change
- [x] Refactoring
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Change filename length limit from 128 to 256
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Try the best to repair corrupted PDF files on upload automatically.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- The API documentation lacks detailed error code explanations. Added
error code tables to `python_api_reference.md` and
`http_api_reference.md` to clarify possible error codes and their
meanings.
- Error handling in the codebase is inconsistent. Standardized error
handling logic in `sdk/python/ragflow_sdk/modules/chunk.py`.
- Improved API comments by adding standardized docstrings to enhance
code readability and maintainability.
### Type of change
- [x] Documentation Update
- [x] Refactoring
### What problem does this PR solve?
Use consistent log file names, introduced initLogger
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
1. fix white screen issue when chat response
2. thumbnail bug when document not support
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
---------
Co-authored-by: chongchuanbing <chongchuanbing@gmail.com>
### What problem does this PR solve?
In MySQL, when the thumbnail base64 of a document is relatively large,
the display of the document's thumbnail fails.
Now, I put the document thumbnail into MiniIO storage.
### Type of change
- [✓] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: chongchuanbing <chongchuanbing@gmail.com>
### What problem does this PR solve?
issue: https://github.com/infiniflow/ragflow/issues/2277
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>