### What problem does this PR solve?
Consolidate "set_meta" API into "update_document" .
Before consolidation
Web API: POST /api/v1/document/set_meta
Http API - PUT /v1/datasets/<dataset_id>/document/<document_id>
After consolidation, Restful API -- PUT
/v1/datasets/<dataset_id>/document/<document_id>
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Consolidation WEB API & HTTP API for document metadata summary
Before consolidation
Web API: POST /api/v1/document/metadata/summary
Http API - GET /v1/datasets/<dataset_id>/metadata/summary
After consolidation, Restful API -- GET
/v1/datasets/<dataset_id>/metadata/summary
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Visit
`http://127.0.0.1:9381/?__debugger__=yes&cmd=resource&f=debugger.js`
will expose the flask code:
```
docReady(() => {
if (!EVALEX_TRUSTED) {
initPinBox();
}
// if we are in console mode, show the console.
if (CONSOLE_MODE && EVALEX) {
createInteractiveConsole();
}
const frames = document.querySelectorAll("div.traceback div.frame");
if (EVALEX) {
addConsoleIconToFrames(frames);
}
addEventListenersToElements(document.querySelectorAll("div.detail"), "click", () =>
document.querySelector("div.traceback").scrollIntoView(false)
);
addToggleFrameTraceback(frames);
addToggleTraceTypesOnClick(document.querySelectorAll("h2.traceback"));
addInfoPrompt(document.querySelectorAll("span.nojavascript"));
wrapPlainTraceback();
});
function addToggleFrameTraceback(frames) {
frames.forEach((frame) => {
frame.addEventListener("click", () => {
frame.getElementsByTagName("pre")[0].parentElement.classList.toggle("expanded");
});
})
}
```
### Type of change
- [x] Other (please describe): Fix security risk
## Summary
Fixes#13996
Replace `json.load(open(...))` with `with open(...) as f: json.load(f)`
in two files to ensure file descriptors are properly closed.
**Affected files:**
- `common/doc_store/infinity_conn_base.py` — schema loading for Infinity
doc store
- `api/db/init_data.py` — agent template loading at startup
## Why this matters
In a long-running server process like RAGFlow, leaked file descriptors
from `json.load(open(...))` can accumulate over time. While CPython's
refcounting usually cleans these up, it's not guaranteed (especially
under memory pressure or with alternative Python runtimes), and can lead
to `OSError: [Errno 24] Too many open files`.
## Test plan
- [ ] Verify Infinity doc store schema loading still works correctly
- [ ] Verify agent templates load correctly on startup
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* Improved file handling in internal data processing to ensure proper
resource cleanup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: easonysliu <easonysliu@tencent.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What problem does this PR solve?
As title
### Type of change
- [x] Refactoring
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Chores**
* Improved authentication error logging to better distinguish between
JWT and API token failures.
* Enhanced code documentation with clarifying comments for better
maintainability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: enable sync deleted files for connector
1. first comes with github
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added "sync deleted files" feature for data sources, enabling
automatic removal of files deleted from the source system.
* Added multilingual support for the new sync deleted files setting
across multiple languages.
* **UI Improvements**
* Improved checkbox form field rendering and layout.
* Enhanced full-width display for authentication token input fields.
### What problem does this PR solve?
Refactor: merge document.rename into document.update_document
### Type of change
- [x] Refactoring
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added a unified document update API (PUT) supporting name, metadata,
parser/chunk settings, and status changes.
* **Breaking Changes**
* Legacy single-parameter rename endpoint removed; renames now require
dataset + document identifiers.
* `/list` now reads dataset id from a different query parameter.
* **Validation / Bug Fixes**
* Stricter meta_fields and parser-config validation; unauthenticated
requests return 401.
* **Frontend**
* UI now sends dataset id when saving document names.
* **Tests**
* Numerous unit and HTTP tests adjusted or removed to match new API and
validations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: MkDev11 <94194147+MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
Co-authored-by: Qi Wang <wangq8@outlook.com>
Co-authored-by: dataCenter430 <161712630+dataCenter430@users.noreply.github.com>
Co-authored-by: balibabu <cike8899@users.noreply.github.com>
### What problem does this PR solve?
- ping
- token
- log level
### Type of change
- [x] Refactoring
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* System endpoints consolidated under /api/v1/system: ping, health
check, and token management moved to the centralized API surface.
* Token management unified at /api/v1/system/tokens with
list/create/delete behavior.
* **Documentation**
* API reference updated to reflect the new /api/v1/system paths.
* **Tests**
* Client fixtures and test utilities updated to use
/api/v1/system/tokens; one unit test for health/oceanbase status
removed.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
As title.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Standardized the query parameter used when listing documents so
listings behave consistently across the web and client interfaces.
* Clarified the error message shown when a required dataset ID is
missing to give clearer guidance to users.
* **Tests**
* Updated test coverage to reflect the standardized dataset identifier
usage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
…
### What problem does this PR solve?
Closes#13857
Parent-child chunking was introduced in v0.23.0 but is only configurable
through the web UI. Users managing datasets programmatically cannot
enable it via the HTTP API or Python SDK because `ParserConfig` uses
`extra="forbid"`, rejecting the `children_delimiter` field at
validation.
### What does this PR change?
Adds a `parent_child` nested config to `ParserConfig`, following the
same pattern as `raptor` and `graphrag`:
```json
"parser_config": {
"parent_child": {
"use_parent_child": true,
"children_delimiter": "\n"
}
}
```
- api/utils/validation_utils.py — new ParentChildConfig model, added to
ParserConfig
- api/utils/api_utils.py — naive defaults + flatten to
children_delimiter for the execution layer
- api/apps/services/dataset_api_service.py — flatten on the update path
- test/testcases/configs.py — updated DEFAULT_PARSER_CONFIG
-
test/testcases/test_http_api/test_dataset_management/test_create_dataset.py
— 4 valid + 2 invalid test cases
No changes to the execution layer (rag/app/naive.py, rag/nlp/search.py).
Existing UI flow via ext is unaffected.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added parent-child chunking configuration for dataset creation and
updates with new `use_parent_child` toggle and customizable
`children_delimiter` setting to specify how parent chunks are split into
child chunks.
* **Documentation**
* Updated HTTP and Python API references with parent-child chunking
configuration details and examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
### What problem does this PR solve?
Implements automatic adjustment of knowledge base chunk recall weights
based on user feedback (upvotes/downvotes). When users upvote or
downvote a response, the system locates the corresponding knowledge
snippets and adjusts their recall weight to improve future retrieval
quality.
**Closes #12670**
**How it works:**
1. User upvotes/downvotes a response via `POST /thumbup`
2. System extracts chunk IDs from the conversation reference
3. For each referenced chunk:
- Reads current `pagerank_fea` value from document store
- Increments (+1) for upvote or decrements (-1) for downvote
- Clamps weight to [0, 100] range
- Updates chunk in ES/Infinity/OceanBase
4. Future retrievals score these chunks higher/lower based on
accumulated feedback
**Files changed:**
- `api/db/services/chunk_feedback_service.py` - New service for updating
chunk pagerank weights
- `api/apps/conversation_app.py` - Integrated feedback service into
thumbup endpoint
- `test/testcases/test_web_api/test_chunk_feedback/` - Unit tests
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Chat message feedback now updates per-chunk relevance weights
(feature-flag gated), with configurable weighting and atomic updates
across storage backends.
* **Bug Fixes**
* Stricter validation for message feedback inputs and more robust
handling of feedback transitions.
* **Tests**
* Expanded test coverage for chunk-feedback behavior, weighting
strategies, storage backends, and thumb-flip scenarios.
* **Chores**
* CI workflow extended to run the new chunk-feedback web API tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>
Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>
### What problem does this PR solve?
Refactor version API to RESTful style. Python and go server API also
updated.
### Type of change
- [x] Refactoring
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **Refactor**
* Migrated core API endpoints to the `/api/v1/` namespace for improved
consistency and organization.
* Standardized system version, search, and chat list endpoints under the
new API versioning structure.
* **New Features**
* Added MinIO region configuration support, allowing specification of
storage engine regional settings via environment variables or
configuration files.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add validation logic for parser_config.
Refactor the processing flow. Before change, validation logics and
update logics are mixed up - some validation logis executes followed by
some update logic executes and then another such
"validation-and-then-update" which is not good. After change, all
validation logic executes firstly. Update logic will be executed after
ALL validation logic executed.
Validation logic for parameters (that come from front end) will be
checked using Pydantic. For validation logic that depends on data from
DB, they will be in separate methods.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
### What problem does this PR solve?
Implement UpdateDataset and UpdateMetadata in GO
Add cli:
UPDATE CHUNK <chunk_id> OF DATASET <dataset_name> SET <update_fields>
REMOVE TAGS 'tag1', 'tag2' from DATASET 'dataset_name';
SET METADATA OF DOCUMENT <doc_id> TO <meta>
### Type of change
- [ ] Refactoring
### What problem does this PR solve?
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
---------
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
This PR fixes a race in batch document parsing where overlapping parse
requests for the same document could clear/rewrite chunk state and make
previously parsed content appear lost. It adds an atomic per-document
parse guard so only one parse can run at a time for that document (Fixes
#13864 ).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Previously, `apikey_required` called
`request.headers.get('Authorization').split()[1]` without checking for
None or insufficient parts, causing an unhandled AttributeError or
IndexError (500) instead of a proper 403 JSON response.
This applies the same guarding pattern already used by `token_required`
in the same file.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
Add REST APIs to dynamically query and modify log levels at runtime for
both Python (Flask) and Go servers.
Changes:
- common/log_utils.py: add set_log_level() and get_log_levels()
functions
- admin/server/routes.py: add GET/PUT /api/v1/admin/log_levels endpoints
- api/apps/system_app.py: add GET/PUT /api/{version}/system/log_levels
endpoints
- internal/logger/logger.go: add GetLevel() and SetLevel() with atomic
level support
- internal/handler/system.go: add GetLogLevel, SetLogLevel, Health
handlers
- internal/router/router.go: route /health to systemHandler
- internal/admin/handler.go: add GetLogLevel, SetLogLevel handlers
- internal/admin/router.go: add /api/v1/admin/log_level routes
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
Enable reading Tag Set tags via API (expose tag_kwd field). The result
of the queried list chunks is as shown below:
<img width="1422" height="818" alt="image"
src="https://github.com/user-attachments/assets/abd1960a-fe34-489e-9d72-525f8e574938"
/>
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>
### What problem does this PR solve?
fixes issue #13799 where team members get model not authorized when
running RAG on an admin-shared knowledge base after the admin changes
the KB embedding model (for example to bge-m3).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Allow create datasets with parse_type == 1/None and chunk_method, or
parse_type == 2 and pipeline_id.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Allow create dataset with resume chunk_method.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Problem
The /file2document/convert endpoint ran all file lookups, document
deletions, and insertions synchronously inside the
request cycle. Linking a large folder (~1.7GB with many files) caused
504 Gateway Timeout because the blocking DB loop
held the HTTP connection open for too long.
Fix
- Extracted the heavy DB work into a plain sync function _convert_files
- Inputs are validated and folder file IDs expanded upfront (fast path)
- The blocking work is dispatched to a thread pool via
get_running_loop().run_in_executor() and the endpoint returns 200
immediately
- Frontend only checks data.code === 0 so the response change
(file2documents list → True) has no impact
Fixes#13781
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
Searches /search API to RESTFul
### Type of change
- [x] Documentation Update
- [x] Refactoring
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
CI isn't stable, try to fix it.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Fix
migrate_add_unique_email-silently-skips-unique-constraint-when-non-unique-user_email-index-exists.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Files /file API to RESTFul style.
### Type of change
- [x] Documentation Update
- [x] Refactoring
---------
Co-authored-by: writinwaters <cai.keith@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
Saving dataset settings failed with validation error 101 (Extra inputs
are not permitted)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. Split dataset api to gateway and service, and modify web UI to use
restful http api.
2. Old KB releated APIs are commented.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira
incremental sync could miss updated issues after initial sync,
especially near time boundaries.
Root cause:
- Jira JQL uses minute-level precision for `updated` filters.
- Incremental windows had no overlap buffer, so boundary updates could
be skipped.
- Sync log cursor tracking used a backward-facing update for
`poll_range_start`.
- Existing-doc updates in `upload_document` lacked a KB ownership guard
for doc-id collisions.
What changed:
- Added Jira incremental overlap buffer (`time_buffer_seconds`,
defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL
lower-bound time.
- Preserved second-level post-filtering to avoid duplicate reprocessing
while still catching boundary updates.
- Improved Jira sync logging to include start/end window and overlap
configuration.
- Updated sync cursor tracking in `increase_docs` to keep
`poll_range_start` moving forward with max update time.
- Added KB ID safety check before updating existing document records in
`upload_document`.
Verification performed:
- Python syntax compile checks passed for modified files.
- Manual verification flow:
1. Run full Jira sync.
2. Edit an already-indexed Jira issue.
3. Run next incremental sync.
4. Confirm updated content is re-ingested into KB.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Closes#1398
### What problem does this PR solve?
Adds native support for EPUB files. EPUB content is extracted in spine
(reading) order and parsed using the existing HTML parser. No new
dependencies required.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
To check this parser manually:
```python
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser
with open('$HOME/some_epub_book.epub', 'rb') as f:
data = f.read()
sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
print(f'\n--- Section {i} ---')
print(s[:200])
"
```
### What problem does this PR solve?
using builtin model when parsing gave an error because it expects
fid==builtin. split_model_name_and_factory returns id=None. pr allows
the model to be accepted wheter with or without @Builtin
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: Export Agent Logs.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: balibabu <assassin_cike@163.com>
### What problem does this PR solve?
Follow-up expose agent structured outputs in non-stream completions
#13389.
### Type of change
- [x] Documentation Update
- [x] Refactoring
---------
Co-authored-by: writinwaters <cai.keith@gmail.com>
### What problem does this PR solve?
1. Split dataset api to gateway and service, and modify web UI to use
restful http api.
2. Old KB releated APIs are commented.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Add_chunk supports add image.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
1. Split dataset api to gateway and service, and modify web UI to use
restful http api.
2. Old KB releated APIs are commented.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fix: model selecton rule in get_model_config_by_type_and_name
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: Modify the style of the release confirmation box.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: balibabu <assassin_cike@163.com>
Co-authored-by: 6ba3i <isbaaoui09@gmail.com>
### What problem does this PR solve?
Get user_id from canvas and record it.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary
Fixes#13544: PostgreSQL startup crash because
`update_tenant_llm_to_id_primary_key()` unconditionally uses
MySQL-specific SQL.
- Split `update_tenant_llm_to_id_primary_key()` into
`_update_tenant_llm_to_id_primary_key_mysql()` and
`_update_tenant_llm_to_id_primary_key_postgres()`, dispatching on
`settings.DATABASE_TYPE`
- MySQL path: unchanged (existing `DATABASE()`, `SET @row = 0`,
`AUTO_INCREMENT`, `DROP PRIMARY KEY` logic)
- PostgreSQL path: uses `current_database()`, `ROW_NUMBER() OVER (ORDER
BY ...)` for sequential IDs, `CREATE SEQUENCE` + `nextval()` for
auto-increment, and `information_schema.table_constraints` to find the
PK constraint name
- Also fix `migrate_add_unique_email()`: MySQL-only
`information_schema.statistics` is replaced with `pg_indexes` on
PostgreSQL
## Test plan
- [ ] Start RAGFlow with `DB_TYPE=postgres` — startup should complete
without `function database() does not exist` error
- [ ] Start RAGFlow with `DB_TYPE=mysql` (default) — existing behaviour
unchanged, migration runs as before
- [ ] Fresh PostgreSQL install: verify `tenant_llm.id` column is created
as a serial primary key after migration
- [ ] Idempotency: running migration twice on PostgreSQL should be a
no-op (column already exists check passes)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: gambletan <gambletan@github>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>