Feat: expose parent-child chunking configuration via HTTP API and Python SDK (#13940)

… ### What problem does this PR solve?

Closes #13857

Parent-child chunking was introduced in v0.23.0 but is only configurable through the web UI. Users managing datasets programmatically cannot enable it via the HTTP API or Python SDK because `ParserConfig` uses `extra="forbid"`, rejecting the `children_delimiter` field at validation.

### What does this PR change?

Adds a `parent_child` nested config to `ParserConfig`, following the same pattern as `raptor` and `graphrag`:

```json
"parser_config": {
  "parent_child": {
    "use_parent_child": true,
    "children_delimiter": "\n"
  }
}
```

- api/utils/validation_utils.py: new ParentChildConfig model, added to ParserConfig
- api/utils/api_utils.py: naive defaults + flatten to children_delimiter for the execution layer
- api/apps/services/dataset_api_service.py: flatten on the update path
- test/testcases/configs.py: updated DEFAULT_PARSER_CONFIG
- test/testcases/test_http_api/test_dataset_management/test_create_dataset.py: 4 valid + 2 invalid test cases

No changes to the execution layer (rag/app/naive.py, rag/nlp/search.py). Existing UI flow via ext is unaffected.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

## Summary by CodeRabbit

* **New Features**
  * Added parent-child chunking configuration for dataset creation and updates with new `use_parent_child` toggle and customizable `children_delimiter` setting to specify how parent chunks are split into child chunks.
* **Documentation**
  * Updated HTTP and Python API references with parent-child chunking configuration details and examples.
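The flattening step described above (nested `parent_child` config reduced to a top-level `children_delimiter` for the execution layer) might look roughly like the sketch below. This is not the code from api/utils/api_utils.py: the helper name `flatten_parent_child` is hypothetical, and two behaviors are assumptions made for illustration, namely that the nested key is left in place and that no top-level key is written when `use_parent_child` is false.

```python
import json
from copy import deepcopy


def flatten_parent_child(parser_config: dict) -> dict:
    """Hypothetical helper: copy the nested delimiter to a top-level key.

    Assumption: the execution layer reads `children_delimiter` at the top
    level of the parser config, as the PR description suggests.
    """
    flat = deepcopy(parser_config)  # leave the caller's dict untouched
    parent_child = flat.get("parent_child") or {}
    if parent_child.get("use_parent_child"):
        # Assumed fallback of "\n" when the delimiter is omitted.
        flat["children_delimiter"] = parent_child.get("children_delimiter", "\n")
    return flat


# Config in the nested shape shown in the PR description.
request_config = {
    "chunk_token_num": 512,
    "parent_child": {"use_parent_child": True, "children_delimiter": "\n"},
}
flat_config = flatten_parent_child(request_config)
print(json.dumps(flat_config, indent=2))
```

Working on a deep copy keeps the validated request payload unchanged, which makes the flattening safe to apply on both the create and the update path.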
2026-07-14 23:36:42 +08:00 · 2026-04-07 20:36:57 -07:00
parent 0ced071a0b
commit 62a1333cf2
7 changed files with 58 additions and 11 deletions
--- a/docs/references/python_api_reference.md
+++ b/docs/references/python_api_reference.md
@@ -187,7 +187,7 @@ The chunking method of the dataset to create. Available options:
 The parser configuration of the dataset. A `ParserConfig` object's attributes vary based on the selected `chunk_method`:

 - `chunk_method`=`"naive"`:  
-  `{"chunk_token_num":512,"delimiter":"\\n","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`.
+  `{"chunk_token_num":512,"delimiter":"\\n","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False},"parent_child":{"use_parent_child":False,"children_delimiter":"\\n"}}`.
 - `chunk_method`=`"qa"`:  
  `{"raptor": {"use_raptor": False}}`
 - `chunk_method`=`"manuel"`:  
@@ -480,7 +480,7 @@ A dictionary representing the attributes to update, with the following keys:
  - `"email"`: Email
 - `"parser_config"`: `dict[str, Any]` The parsing configuration for the document. Its attributes vary based on the selected `"chunk_method"`:
  - `"chunk_method"`=`"naive"`:  
-    `{"chunk_token_num":128,"delimiter":"\\n","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False}}`.
+    `{"chunk_token_num":128,"delimiter":"\\n","html4excel":False,"layout_recognize":True,"raptor":{"use_raptor":False},"parent_child":{"use_parent_child":False,"children_delimiter":"\\n"}}`.
  - `chunk_method`=`"qa"`:  
    `{"raptor": {"use_raptor": False}}`
  - `chunk_method`=`"manuel"`: