Closes: #12889

### What problem does this PR solve?

When syncing external data sources (e.g., Jira, Confluence, Google Drive), updated documents were not being re-chunked. The raw content was correctly updated in blob storage, but the vector database retained stale chunks, causing search results to return outdated information.

**Root cause:** The task digest used for the chunk-reuse optimization was calculated only from parser configuration fields (`parser_id`, `parser_config`, `kb_id`, etc.), without any content-dependent fields. When a document's content changed but its parser configuration stayed the same, the system incorrectly reused the old chunks instead of regenerating them.

**Example scenario:**

1. User syncs a Jira issue: "Meeting scheduled for Monday"
2. User updates the Jira issue to: "Meeting rescheduled to Friday"
3. User triggers sync again
4. Raw content panel shows updated text ✓
5. Chunk panel still shows old text "Monday" ✗

**Solution:**

1. Include `update_time` and `size` in the chunking config, so the task digest changes when document content is updated
2. Track updated documents separately in `upload_document()` and return them for processing
3. Process updated documents through the re-parsing pipeline to regenerate chunks

[1.webm](https://github.com/user-attachments/assets/d21d4dcd-e189-4d39-8700-053bae0ca5a0)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
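To illustrate the digest fix, here is a minimal sketch of how folding `update_time` and `size` into the digest input makes a content update invalidate the old chunks. The function and field names are illustrative assumptions, not ragflow's actual implementation:

```python
import hashlib
import json

def task_digest(doc: dict) -> str:
    """Hypothetical chunk-reuse digest (sketch, not ragflow's real code)."""
    fields = {
        # Parser-configuration fields alone miss content changes:
        "parser_id": doc["parser_id"],
        "parser_config": doc["parser_config"],
        "kb_id": doc["kb_id"],
        # Fix: include content-dependent fields so the digest differs
        # whenever the synced document's content is updated.
        "update_time": doc["update_time"],
        "size": doc["size"],
    }
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Same parser config, but the sync bumped update_time and size:
before = task_digest({"parser_id": "naive", "parser_config": {}, "kb_id": "kb1",
                      "update_time": 1700000000, "size": 120})
after = task_digest({"parser_id": "naive", "parser_config": {}, "kb_id": "kb1",
                     "update_time": 1700000500, "size": 118})
assert before != after  # digest mismatch forces re-chunking
```

With the old digest (parser fields only), `before` and `after` would be equal and the stale chunks would be reused; adding the content-dependent fields breaks that equality.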