## Summary
- scope normal document-list metadata lookups to the current page's
document IDs
- keep the `return_empty_metadata=True` path dataset-wide because it
needs full knowledge of docs that already have metadata
- add unit tests for both paged listing paths and the unchanged
empty-metadata behavior
## Why
`DocumentService.get_list()` and the normal `get_by_kb_id()` path were
calling `DocMetadataService.get_metadata_for_documents(None, kb_id)`,
which loads metadata for the entire dataset on every page request.
That becomes especially problematic on large datasets. The metadata scan
path paginates through the full metadata index without an explicit sort,
while the ES helper only switches to `search_after` beyond `10000`
results when a sort is present. In practice this can lead to unnecessary
full-dataset metadata work, slower document-list loading, and unreliable
`meta_fields` in list responses for large KBs.
This change keeps the existing empty-metadata filter behavior intact,
but scopes normal list responses to metadata for the current page only.
### What problem does this PR solve?
Support getting aggregated parsing status to dataset via the API
Issue: #12810
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>
### What problem does this PR solve?
Avoid getting doc in function delete_document_metadata as the doc might
have been removed.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Closes: #12889
### What problem does this PR solve?
When syncing external data sources (e.g., Jira, Confluence, Google
Drive), updated documents were not being re-chunked. The raw content was
correctly updated in blob storage, but the vector database retained
stale chunks, causing search results to return outdated information.
**Root cause:** The task digest used for chunk reuse optimization was
calculated only from parser configuration fields (`parser_id`,
`parser_config`, `kb_id`, etc.), without any content-dependent fields.
When a document's content changed but the parser configuration remained
the same, the system incorrectly reused old chunks instead of
regenerating new ones.
**Example scenario:**
1. User syncs a Jira issue: "Meeting scheduled for Monday"
2. User updates the Jira issue to: "Meeting rescheduled to Friday"
3. User triggers sync again
4. Raw content panel shows updated text ✓
5. Chunk panel still shows old text "Monday" ✗
**Solution:**
1. Include `update_time` and `size` in the chunking config, so the task
digest changes when document content is updated
2. Track updated documents separately in `upload_document()` and return
them for processing
3. Process updated documents through the re-parsing pipeline to
regenerate chunks
[1.webm](https://github.com/user-attachments/assets/d21d4dcd-e189-4d39-8700-053bae0ca5a0)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add id for table tenant_llm and apply in LLMBundle.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
test_update_document.py failed as metadata is not included in the
response of get_list(), fix the issue.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Put document metadata in ES/Infinity.
Index name of meta data: ragflow_doc_meta_{tenant_id}
### Type of change
- [x] Refactoring
### What problem does this PR solve?
When deleting the knowledge base, the records in the Document and
Knowledgebase tables are immediately deleted
But there are still a large number of pending task messages in the Redis
queue (asynchronous queue) if you did not click on stopping tasks before
deleting knowledge base.
TaskService.get_task() uses a JOIN query to associate three tables (Task
← Document ← Knowledgebase)
Since Document/Knowledgebase have been deleted, the JOIN returns an
empty result, even though the Task records still exist
task-executor considers the task does not exist ("collect task xxx is
unknown"), can only skip and warn
log:2026-01-23 16:43:21,716 WARNING 1190179 collect task
110fbf70f5bd11f0945a23b0930487df is unknown
2026-01-23 16:43:21,818 WARNING 1190179 collect task
11146bc4f5bd11f0945a23b0930487df is unknown
2026-01-23 16:43:21,918 WARNING 1190179 collect task
111c3336f5bd11f0945a23b0930487df is unknown
2026-01-23 16:43:22,021 WARNING 1190179 collect task
112471b8f5bd11f0945a23b0930487df is unknown
2026-01-23 16:43:26,719 WARNING 1190179 collect task
112e855ef5bd11f0945a23b0930487df is unknown
2026-01-23 16:43:26,734 WARNING 1190179 collect task
1134380af5bd11f0945a23b0930487df is unknown
2026-01-23 16:43:26,834 WARNING 1190179 collect task
1138cb2cf5bd11f0945a23b0930487df is unknown
As a consequence, a large number of such tasks occupy the queue
processing capacity, causing new tasks to queue and wait
<img width="1910" height="947"
alt="9a00f2e0-9112-4dbb-b357-7f66b8eb5acf"
src="https://github.com/user-attachments/assets/0e1227c2-a2df-4ef3-ba8f-e04c3f6ef0e1"
/>
Solution
Add logic to stop all ongoing tasks before deleting the knowledge base
and Tasks
### Type of change
- Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR adds missing web API tests (system, search, KB, LLM, plugin,
connector). It also addresses a contract mismatch that was causing test
failures: metadata updates did not persist new keys (update‑only
behavior).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): Test coverage expansion and test helper
instrumentation
### What problem does this PR solve?
1) Create dataset using table parser for infinity
2) Answer questions in chat using SQL
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes web API behavior mismatches that caused test failures by
normalizing error responses, tightening validations, correcting error
messages, and closing upload file handles.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
Fixes#12520 - Deleted chunks should not appear in retrieval/reference
results.
## Changes
### Core Fix
- **api/apps/chunk_app.py**: Include \doc_id\ in delete condition to
properly scope the delete operation
### Improved Error Handling
- **api/db/services/document_service.py**: Better separation of concerns
with individual try-catch blocks and proper logging for each cleanup
operation
### Doc Store Updates
- **rag/utils/es_conn.py**: Updated delete query construction to support
compound conditions
- **rag/utils/opensearch_conn.py**: Same updates for OpenSearch
compatibility
### Tests
- **test/testcases/.../test_retrieval_chunks.py**: Added
\TestDeletedChunksNotRetrievable\ class with regression tests
- **test/unit/test_delete_query_construction.py**: Unit tests for delete
query construction
## Testing
- Added regression tests that verify deleted chunks are not returned by
retrieval API
- Tests cover single chunk deletion and batch deletion scenarios
### What problem does this PR solve?
Modifying a document’s parser config previously left behind obsolete
chunk images. If the dataset isn’t manually deleted, these images
accumulate and waste storage. This PR fixes the issue by automatically
removing associated images when the parser config changes.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Manage message and use in agent.
Issue #4213
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Supports filter documents by empty metadata
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Deduplicate metadata lists during updates.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Document list and filter supports metadata filtering.
**OR within the same field, AND across different fields**
Example 1 (multi-field AND):
```markdown
Doc1 metadata: { "a": "b", "as": ["a", "b", "c"] }
Doc2 metadata: { "a": "x", "as": ["d"] }
Query:
metadata = {
"a": ["b"],
"as": ["d"]
}
Result:
Doc1 matches a=b but not as=d → excluded
Doc2 matches as=d but not a=b → excluded
Final result: empty
```
Example 2 (same field OR):
```markdown
Doc1 metadata: { "as": ["a", "b", "c"] }
Doc2 metadata: { "as": ["d"] }
Query:
metadata = {
"as": ["a", "d"]
}
Result:
Doc1 matches as=a → included
Doc2 matches as=d → included
Final result: Doc1 + Doc2
```
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Correct metadata update behavior. #11912
When update `value` is omitted, the corresponding keys are updated to
`"value"` regardless of their current values.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add metadata condition in document list.
Add metadata bulk update.
Add metadata summary.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
### What problem does this PR solve?
Make RAGFlow more asynchronous 2. #11551, #11579, #11619.
### Type of change
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
Make RAGFlow more asynchronous 2. #11551, #11579, #11619.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
GraphRAG and RAPTOR tasks do not affect document status.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. Rename identifier name
2. Fix some return statement
3. Fix some typos
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add get_uuid, download_img and hash_str2int into misc_utils.py
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
- Add time utilities and unit tests
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: Support attribute filtering #8703
### Type of change
- [X] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
Co-authored-by: writinwaters <cai.keith@gmail.com>