fix(infinity): declare extra field + serialize dict on write to unbreak RAPTOR (#14998)

### What problem does this PR solve?

Fixes #14997.

RAPTOR builds on the Infinity backend have been broken since v0.25.2
introduced the `extra` field in code (`rag/svr/task_executor.py:1011`)
without declaring it in `conf/infinity_mapping.json`. Every RAPTOR job
fails with:

```
infinity.common.InfinityException: (3013, 'Fail to bind the expression: extra@src/planner/expression_binder_impl.cpp:99')
```

The auto-migration in
`common/doc_store/infinity_conn_base.py:_migrate_db()` adds any column
declared in the mapping JSON that is missing from an existing table, so
the only thing standing between users and a working RAPTOR build is that
one missing declaration. OceanBase, ES, and OpenSearch were unaffected
because they store `extra` as a native JSON type; only Infinity (which
has a strict `varchar`/`integer`/`float` schema) needed the addition.
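
To show why the single declaration is enough, here is a minimal sketch of the comparison a migration step has to make. It is not the actual `_migrate_db()` code: the helper `missing_columns`, the trimmed mapping, and the set of existing columns are all hypothetical and chosen only for illustration.

```python
import json


def missing_columns(mapping: dict, existing: set) -> dict:
    """Return the mapping entries whose column is absent from the table.

    Hypothetical helper; the real migration logic may differ.
    """
    return {name: spec for name, spec in mapping.items() if name not in existing}


# Trimmed, illustrative excerpt of the mapping after this PR.
mapping = json.loads("""
{
  "raptor_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
  "raptor_layer_int": {"type": "integer", "default": 0},
  "extra": {"type": "varchar", "default": ""}
}
""")

# Columns of a table created before the PR (assumed for this example).
existing = {"raptor_kwd", "raptor_layer_int"}

to_add = missing_columns(mapping, existing)
print(to_add)
```

Under these assumptions only `extra` is reported as missing, which is the one column a migration pass would need to add.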

### The fix

Two-part change:

1. **`conf/infinity_mapping.json`**: declare `"extra": {"type":
"varchar", "default": ""}`. On next startup, `_migrate_db()` adds the
column to all existing chunk tables — no manual DDL needed for upgrading
installations.
2. **`rag/utils/infinity_conn.py` `insert()`**: serialize the `extra`
dict to a JSON string at write time, since Infinity's `varchar` can't
store a Python dict directly. Modelled on the existing `chunk_data`
handling a few lines above.
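
The write-side rule in step 2 can be sketched in isolation as follows. The function name `serialize_extra` and the sample row are hypothetical; the branching mirrors the `elif k == "extra"` block in the diff, but this is an illustration rather than the code in `insert()`.

```python
import json


def serialize_extra(value):
    """Coerce an `extra` value into something a varchar column can hold.

    Hypothetical stand-alone version of the branch added to insert().
    """
    if isinstance(value, dict):
        # Dicts become a JSON string.
        return json.dumps(value)
    # Strings pass through; None and other falsy values become "".
    return value if value else ""


# Illustrative row; the "raptor_method" value is a placeholder.
row = {"id": "chunk-1", "extra": {"raptor_method": "example"}}
row["extra"] = serialize_extra(row["extra"])
print(row["extra"])
```

Falling back to `""` for empty values keeps the written value consistent with the column's declared default.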

The read path (`rag/utils/raptor_utils.py:_as_extra_dict`) already
normalises both dict and JSON-string inputs, so no read-side change is
needed. Other backends are untouched: `task_executor.py` still writes
the dict, and the OceanBase/ES/OpenSearch insert paths handle dicts
natively.
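
A reader that accepts both shapes might look like the sketch below. `as_extra_dict` here is a hypothetical stand-in written for this example; the real `_as_extra_dict` in `raptor_utils.py` may behave differently, particularly on malformed input.

```python
import json


def as_extra_dict(value) -> dict:
    """Normalise an `extra` value to a dict (illustrative stand-in)."""
    if isinstance(value, dict):
        # Backends with native JSON storage hand back a dict.
        return value
    if isinstance(value, str) and value:
        # A varchar backend hands back the serialized form.
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    # Empty string, None, or any other type.
    return {}


original = {"raptor_method": "example"}  # placeholder value
stored = json.dumps(original)            # what a varchar column would hold

print(as_extra_dict(stored) == original)
print(as_extra_dict(original) == original)
```

With a reader of this shape, a dict written as a JSON string round-trips to the same dict, which is why the fix can stay on the write path.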

### Verification

Tested on a v0.25.4 deployment with the Infinity backend by applying the
same two changes via mounted-volume override:

- Confirmed `_migrate_db()` adds the `extra` column to all pre-existing
chunk tables on startup (column visible via Infinity's
`show_columns()`).
- Triggered RAPTOR builds on four datasets (~21k chunks total) via `POST
/api/v1/datasets/<id>/index?type=raptor`.
- All four progressed past the previously-failing
`get_raptor_chunk_methods()` call into actual entity-extraction and
clustering work without the (3013) error.
- GraphRAG builds (which can trigger the same path indirectly via
`task_executor.py:857`) also progressed cleanly.

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)

Prateek Jain
2026-05-19 15:10:03 +05:30
committed by GitHub
parent f6537ae4ce
commit eacec86500
2 changed files with 12 additions and 1 deletions


@@ -39,5 +39,6 @@
   "doc_type_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
   "toc_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
   "raptor_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
-  "raptor_layer_int": {"type": "integer", "default": 0}
+  "raptor_layer_int": {"type": "integer", "default": 0},
+  "extra": {"type": "varchar", "default": ""}
 }


@@ -438,6 +438,16 @@ class InfinityConnection(InfinityConnectionBase):
                         d[k] = json.dumps(v)
                     else:
                         d[k] = v
+                elif k == "extra":
+                    # RAPTOR writes {"raptor_method": ...} as a dict; Infinity's
+                    # `extra` column is varchar so we serialize on the write path.
+                    # The read path (raptor_utils._as_extra_dict) already accepts
+                    # both dict and JSON-string. Other backends (OceanBase JSON
+                    # column, ES/OpenSearch) keep dict shape; this is Infinity-only.
+                    if isinstance(v, dict):
+                        d[k] = json.dumps(v)
+                    else:
+                        d[k] = v if v else ""
                 elif k == "kb_id":
                     if isinstance(d[k], list):
                         d[k] = d[k][0]  # since d[k] is a list, but we need a str