ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-08 08:07:21 +08:00

Author	SHA1	Message	Date
Wang Qi	13b422037f	Refactor: enhance graphrag - part 2 (#14972) ### What problem does this PR solve? 1. Expose batch_chunk_token_size for configuration. 2. Retrieve chunks when building the subgraph for each document, rather than retrieving all documents' chunks at the beginning. 3. Get all chunks for a document; the limit used to be hard-coded to 10000. 4. Delete the unused method run_graphrag. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring Follow on: #14617	2026-05-18 16:10:21 +08:00
Kevin Hu	7cdc74bbe5	Refactor: Drop the vector fetch for ES (#14970 ) ## Summary - Stop pulling chunk vectors (`q__vec`) back from Elasticsearch in the main retrieval path. ES already knows them; shipping them was pure bandwidth/memory overhead. - Recover the per-chunk cosine similarity via a second KNN-only ES call filtered by the candidate chunk ids. The new `_score` is merged with locally computed term similarity using the user-configured `vector_similarity_weight`. - Lazily fetch the chunk embedding only for the chunks `insert_citations` actually needs. ## Details `rag/nlp/search.py`* - `Dealer.search`: no longer appends `q__vec` to the ES select list. OceanBase still gets it (its rerank path is unchanged). - New `Dealer._knn_scores(sres, idx_names, kb_ids)`: a `MatchDenseExpr` over the cached query vector filtered by `id IN sres.ids`, returning `{chunk_id: cosine_score}` via ES `_score`. - New `Dealer.rerank_with_knn(...)`: term similarity from `qryr.token_similarity` plus the ES-supplied KNN score, combined with `tkweight`/`vtweight` and the existing rank-feature bonus. - New `Dealer.fetch_chunk_vectors(chunk_ids, tenant_ids, kb_ids, dim)`: on-demand vector fetch for citation use. - `Dealer.retrieval` routes Infinity → unchanged, OceanBase → existing local `rerank`, ES → new KNN-score path. `common/doc_store/es_conn_base.py`* - New `get_scores(res)` helper returning `{_id: _score}` directly from hit headers (ES doesn't surface `_score` through `get_fields`). `api/db/services/dialog_service.py` - New top-level `_hydrate_chunk_vectors(...)` helper. On ES it back-fills `ck["vector"]` from `fetch_chunk_vectors` right before `insert_citations`. No-op on Infinity / OB (their chunks already carry vectors). - Both `decorate_answer` closures became `async` and are `await`-ed at all call sites in `async_chat` and `async_ask`. 
## Backend behavior

| Backend | Returns chunk vec in main search | Sim source | Vectors for citations |
|---|---|---|---|
| ES | No | second KNN call (`_score`) merged with term sim | fetched on demand |
| Infinity | No (unchanged) | normalized `_score` | already on chunks |
| OceanBase | Yes (kept) | local hybrid rerank | already on chunks |

## Test plan	2026-05-18 14:21:56 +08:00
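The score merge described in commit 7cdc74bbe5 (term similarity plus the ES-supplied KNN `_score`, weighted by `tkweight`/`vtweight`, plus a rank-feature bonus) might look roughly like the sketch below. The function name `fuse_scores`, its signature, and the choice `tkweight = 1 - vtweight` are assumptions made for illustration; they are not taken from `rag/nlp/search.py`.

```python
def fuse_scores(term_sim, knn_scores, chunk_ids,
                vector_similarity_weight=0.3, rank_bonus=None):
    """Hypothetical helper: combine per-chunk term and vector similarity.

    term_sim   -- {chunk_id: locally computed term similarity}
    knn_scores -- {chunk_id: cosine score taken from the KNN-only call}
    rank_bonus -- optional {chunk_id: rank-feature bonus}
    """
    vtweight = vector_similarity_weight
    tkweight = 1.0 - vtweight  # assumption: the two weights sum to 1
    rank_bonus = rank_bonus or {}
    fused = {}
    for cid in chunk_ids:
        # Chunks missing from either map contribute 0 for that component.
        fused[cid] = (tkweight * term_sim.get(cid, 0.0)
                      + vtweight * knn_scores.get(cid, 0.0)
                      + rank_bonus.get(cid, 0.0))
    return fused
```

Treating a chunk that is absent from the KNN response as having vector similarity 0 is one possible design choice; the actual code may handle that case differently.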
Ramin M.	765cdc2ec2	[Bug]: REDIS error #12870 (#13875 ) Fix for: [Bug]: REDIS error #12870	2026-05-12 09:31:47 +08:00
VincentLambert	b83e2ae5a2	fix: handle missing parent chunk in retrieval_by_children (#14556 ) ### What problem does this PR solve? `retrieval_by_children()` in `rag/nlp/search.py` crashes with a `TypeError: 'NoneType' object is not subscriptable` when a parent ("mom") chunk referenced by child chunks is missing from the index. This happens when the index is in an inconsistent state — for example after a partial re-index, a document deletion that didn't clean up all children, or a race condition during ingestion. `dataStore.get()` returns `None` for the missing parent, and the subsequent access to `chunk["content_with_weight"]` raises a `TypeError`. Stack trace: ``` TypeError: 'NoneType' object is not subscriptable File "rag/nlp/search.py", line 792, in retrieval_by_children "content_with_weight": chunk["content_with_weight"], ``` ### Type of change - [x] Bug Fix ### Fix When `dataStore.get()` returns `None` for a parent chunk, fall back to using the child chunks directly and continue processing the remaining parents. This preserves retrieval results for all other chunks rather than aborting the entire query with an exception. ```python chunk = self.dataStore.get(id, idx_nms[0], [ck["kb_id"] for ck in cks]) if chunk is None: chunks.extend(cks) continue ``` --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 11:55:44 +08:00
qinling0210	4d6e8dffac	Do not bypass threshold for rerank when metadata filter is enabled (#14684 ) ### What problem does this PR solve? Do not bypass threshold for rerank when metadata filter is enabled ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-08 17:48:30 +08:00
buua436	c08ced09a7	Fix: add retrieval fallback comments (#14457 ) ### What problem does this PR solve? add retrieval fallback comments ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 14:44:31 +08:00
buua436	a7ce1b1677	Fix: prune deleted doc chunks from retrieval (#14454 ) ### What problem does this PR solve? prune deleted doc chunks from retrieval ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 13:03:09 +08:00
qinling0210	1473000135	Implement retrieval_test in GO (#14231 ) ### What problem does this PR solve? Implement retrieval_test in GO ### Type of change - [x] Refactoring	2026-04-24 15:30:14 +08:00
Liu An	6e33d8722f	Revert "Fix: forwarding highlight param" (#14249 ) Reverts infiniflow/ragflow#14112	2026-04-21 15:23:18 +08:00
Daniil Sivak	22c6648348	Fix: forwarding highlight param (#14112 ) Closes #9078 ### What problem does this PR solve? The `retrieval_test` endpoint in `chunk_app.py` never forwarded the `highlight` request parameter to `retriever.retrieval()`, so the search engine never produced highlight snippets. Additionally, the frontend always rendered `content_with_weight` instead of preferring the `highlight` field, and the CSS rule color `var(--accent-primary)` didn't work because the variable stores an RGB triplet `(45,212,191)` requiring the `rgb()` wrapper. ### Before - Search page: displayed raw content_with_weight as a wall of plain white text with no term highlighting, including markdown headings rendered as literal text - Retrieval testing page: showed `content_with_weight` in a plain `<p>` tag, no `<em>` tags rendered, no highlight coloring - Children chunks: when child chunks were consolidated into a parent via `retrieval_by_children`, any highlight data from children was discarded - TOC chunks: chunks fetched via `retrieval_by_toc` had no `highlight` field, appearing as plain text while other chunks had highlights Retrieval testing: <img width="1449" height="1178" alt="before-retrieval-no-highlight-cropped" src="https://github.com/user-attachments/assets/5c6f5a5e-6c11-461a-bdb4-049d7dfb7a33" /> Search: <img width="1378" height="711" alt="before-search-no-highlight-cropped" src="https://github.com/user-attachments/assets/be7b5152-72ef-40da-a8fd-921e997ae7d3" /> ### After - Search page: displays the highlight field with search terms rendered in teal/cyan color (`rgb(var(--accent-primary))`) - Retrieval testing page: sends highlight: true in the request, uses `HighLightMarkdown` component to render `<em>` tags with proper coloring - Children chunks: highlights from child chunks are joined and preserved on the parent - TOC chunks: when other chunks have highlights, TOC-fetched chunks use `content_with_weight` as a highlight fallback Retrieval testing: <img width="1410" 
height="1015" alt="05-retrieval-testing-results" src="https://github.com/user-attachments/assets/f0cff8cf-0962-4320-b559-cd5037f622d2" /> Search: <img width="1294" height="455" alt="03-search-highlight-results" src="https://github.com/user-attachments/assets/a90e0e3e-3837-46be-8ddd-2412ff7cbc19" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-17 20:59:20 +08:00
Ea001	38cefd88e2	Fix tag_feas code injection in retrieval ranking (#13923 ) ## Summary - remove eval-based parsing from retrieval rank feature scoring - validate `tag_feas` at write time in chunk APIs and SDK routes - add regression tests for safe parsing and malicious payload rejection ## Details `tag_feas` is intended to be structured rank-feature data, but the retrieval ranking path was evaluating stored values as Python expressions. This change treats `tag_feas` strictly as data. ### What changed - replace `eval()` in `rag/nlp/search.py` with safe parsing via `json.loads()` and optional `ast.literal_eval()` compatibility for legacy Python-dict strings - strictly filter parsed values down to `dict[str, finite number]` - reject invalid `tag_feas` payloads at write time in web chunk routes and SDK document chunk routes - add focused regression tests to prove executable strings are ignored and invalid payloads are rejected ## Validation - `python -m pytest test/unit_test/common/test_tag_feature_utils.py test/unit_test/rag/test_rank_feature_scores.py -q` --------- Co-authored-by: unknown <zhenglinkai@CCN.Local> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-15 16:31:11 +08:00
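The safe-parsing approach from commit 38cefd88e2 (try `json.loads()`, fall back to `ast.literal_eval()` for legacy Python-dict strings, then keep only `dict[str, finite number]`) can be sketched as follows. The function name `parse_tag_feas` and the exact filtering rules (for example, rejecting booleans) are assumptions for illustration, not the project's actual helper.

```python
import ast
import json
import math


def parse_tag_feas(raw):
    """Hypothetical sketch: treat tag_feas strictly as data, never as code."""
    if isinstance(raw, dict):
        parsed = raw
    elif isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            try:
                # literal_eval only accepts Python literals, so strings such
                # as "__import__('os').system(...)" are rejected, not run.
                parsed = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError,
                    MemoryError, RecursionError):
                return {}
    else:
        return {}

    if not isinstance(parsed, dict):
        return {}

    clean = {}
    for key, value in parsed.items():
        if not isinstance(key, str):
            continue
        # bool is a subclass of int; exclude it explicitly (assumption).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        try:
            number = float(value)
        except OverflowError:  # integers too large for a float
            continue
        if math.isfinite(number):
            clean[key] = number
    return clean
```

Returning an empty dict for anything unparseable means a malformed stored value simply contributes no rank-feature score instead of raising during retrieval.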
Idriss Sbaaoui	de6a8e789a	Fix: rerank overflow by enforcing top_k and 64 cap (#14084 ) ### What problem does this PR solve? This fixes rerank overflow where retrieval could send more documents than allowed (for example 66 when `page_size=6`), causing provider 400 errors and bypassing the user’s `top_k` intent in rerank-enabled paths. this pr fixes #14081 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 10:47:25 +08:00
Octopus	c2ce49e037	fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969 ) Fixes #13823 ## Problem When querying with words like `cat`, RAGFlow's query expansion system looks up synonyms via WordNet, which can return terms containing single quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document store, these unescaped single quotes in the query string cause a `TokenError` because Infinity's lexer treats `'` as a string delimiter. ``` TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531 ``` ## Solution Strip single quotes from synonym terms before they are inserted into query expressions, consistent with how single quotes are already stripped from the input query text (line 51 of `query.py`): - `common/query_base.py`: In `sub_special_char()`, strip `'` before escaping other special characters. This fixes the Chinese text processing path and the `paragraph()` method. - `rag/nlp/query.py`: In the English text path, strip `'` from tokenized synonym terms. - `memory/services/query.py`: Same fix for the memory query English text path. ## Testing The fix can be verified by: 1. Using Infinity as the document store (`DOC_ENGINE=infinity`) 2. Creating a dataset and running a retrieval test with the keyword `cat` 3. Confirming no `TokenError` is raised and results are returned normally <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Bug Fixes * Enhanced special character handling in query processing and synonym expansion by properly sanitizing single quotes before text processing. * Simplified OCR detection output by removing timing metadata while preserving core detection accuracy. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ximi <octo-patch@github.com>	2026-04-09 19:10:34 +08:00
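As a rough illustration of the fix in commit c2ce49e037, the sketch below strips single quotes from synonym terms before they are placed into a query expression. The helper name `strip_single_quotes` is hypothetical and does not correspond to a function in `rag/nlp/query.py`.

```python
def strip_single_quotes(terms):
    """Hypothetical helper: drop single quotes from synonym terms.

    Terms that become empty after stripping are discarded so they do not
    produce empty clauses in the query expression.
    """
    cleaned = []
    for term in terms:
        term = term.replace("'", "")
        if term:
            cleaned.append(term)
    return cleaned
```

For example, the WordNet synonym `cat-o'-nine-tails` becomes `cat-o-nine-tails`, so a lexer that treats `'` as a string delimiter no longer sees an unterminated string.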
Jack	c4b0aaa874	Fix: #6098 - Add validation logic for parser_config when update document (#13911) ### What problem does this PR solve? Add validation logic for parser_config and refactor the processing flow. Before this change, validation logic and update logic were interleaved: some validation ran, then some update logic, then another round of validate-then-update. After this change, all validation logic runs first, and update logic runs only after ALL validation has completed. Parameters that come from the front end are validated with Pydantic; validation that depends on data from the DB lives in separate methods. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-04-07 11:33:05 +08:00
Magicbook1108	69264b3a70	Feat: Refact pipeline (#13826 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 19:26:45 +08:00
qinling0210	f02f5fa435	Get ROW_ID from search() in Infinity (#13901 ) ### What problem does this PR solve? 1. Search() in Infinity can return row_id now 2. To Get ROW_ID from search(), refer to handling of retrieval_test. example ``` $ curl -s -X POST "http://localhost:$PORT/v1/chunk/retrieval_test" -H "Authorization: $TOKEN" -H "Content-Type: application/json" -d '{"kb_id": "4fcd01582ca911f1954184ba59049aa3", "question": "曹操"}' ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-02 18:56:43 +08:00
qinling0210	0462c20113	Fix special characters in matching text of search() (#13852) ### What problem does this PR solve? Fix special characters in the matching text of search(). Some special characters (such as ?, *, and :) must be escaped before being passed to matching_text of search(). Fix https://github.com/infiniflow/ragflow/issues/13729 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 18:47:10 +08:00
Heyang Wang	641b319647	feat: support reading tags via API (#12891 ) (#13732 ) ### What problem does this PR solve? Enable reading Tag Set tags via API (expose tag_kwd field). The result of the queried list chunks is as shown below: <img width="1422" height="818" alt="image" src="https://github.com/user-attachments/assets/abd1960a-fe34-489e-9d72-525f8e574938" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>	2026-03-29 20:17:01 +08:00
Stephen Hu	d32967eda8	refactor: let excel use lazy image loader (#13558 ) ### What problem does this PR solve? let excel use lazy image loader ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-23 21:24:40 +08:00
apps-lycusinc	8b984c9d5f	Fixing WordNetCorpusReader object has no attribute _LazyCorpusLoader_… (#13600) ### What problem does this PR solve? Forces NLTK to load the corpus synchronously once, so that concurrent tasks cannot trigger the lazy-loading race condition that causes the error "WordNetCorpusReader object has no attribute _LazyCorpusLoader_…" (#13590). ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: shakeel <shakeel@lollylaw.com>	2026-03-13 19:55:01 +08:00
Ethan T.	1cee8b1a7b	fix: use context managers for file handles to prevent resource leaks (#13514 ) ## Summary - Convert bare `open()` calls to `with` context managers or `Path.read_text()` - File handles leak if not properly closed, especially on exceptions - Fixes in crypt.py, sequence2txt_model.py, term_weight.py, deepdoc/vision/__init__.py ## Test plan - [x] File operations work correctly with context managers - [x] Resources properly cleaned up on exceptions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 16:47:06 +08:00
eviaaaaa	d0ca388bec	Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329 ) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * Centralized Abstraction (`docx_parser.py`): Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * Robust Corrupted Image Fallback (`docx_parser.py`): Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * Legacy Code & Redundancy Purge: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. 
## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * Output Consistency: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * Memory Footprint Drop: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.	2026-03-11 10:00:07 +08:00
qinling0210	7c92f51133	Fix retrieval function when metadata_condtion is specified in retrieval API (#13473) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13388 The following command returns an empty result even though a document with the matching metadata exists: ``` curl --request POST \ --url http://localhost:9222/api/v1/retrieval \ --header 'Content-Type: application/json' \ --header 'Authorization: Bearer ragflow-fO3mPFePfLgUYg8-9gjBVVXbvHqrvMPLGaW0P86PvAk' \ --data '{ "question": "any question", "dataset_ids": ["9bb4f0591b8811f18a4a84ba59049aa3"], "metadata_condition": { "logic": "and", "conditions": [ { "name": "character", "comparison_operator": "is", "value": "刘备" } ] } }' ``` When metadata_condition is specified in the retrieval API, it is converted to doc_ids, and doc_ids is passed to the retrieval function. In the retrieval function, when doc_ids is explicitly provided, the similarity threshold should be bypassed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 11:57:32 +08:00
Magicbook1108	daec36e935	Fix: add soft limit for graph rag size (#13252 ) ### What problem does this PR solve? Fix: add soft limit for graph rag size #13258 Q2 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 14:02:36 +08:00
Attili-sys	21bc1ab7ec	Feature rtl support (#13118 ) ### What problem does this PR solve? This PR adds comprehensive Right-to-Left (RTL) language support, primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu, etc.). Previously, RTL content had multiple rendering issues: - Incorrect sentence splitting for Arabic punctuation in citation logic - Misaligned text in chat messages and markdown components - Improper positioning of blockquotes and “think” sections - Incorrect table alignment - Citation placement ambiguity in RTL prompts - UI layout inconsistencies when mixing LTR and RTL text This PR introduces backend and frontend improvements to properly detect, render, and style RTL content while preserving existing LTR behavior. #### Backend - Updated sentence boundary regex in `rag/nlp/search.py` to include Arabic punctuation: - `،` (comma) - `؛` (semicolon) - `؟` (question mark) - `۔` (Arabic full stop) - Ensures citation insertion works correctly in RTL sentences. - Updated citation prompt instructions to clarify citation placement rules for RTL languages. #### Frontend - Introduced a new utility: `text-direction.ts` - Detects text direction based on Unicode ranges. - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts. - Provides `getDirAttribute()` for automatic `dir` assignment. - Applied dynamic `dir` attributes across: - Markdown rendering - Chat messages - Search results - Tables - Hover cards and reference popovers - Added proper RTL styling in LESS: - Text alignment adjustments - Blockquote border flipping - Section indentation correction - Table direction switching - Use of `<bdi>` for figure labels to prevent bidirectional conflicts #### DevOps / Environment - Added Windows backend launch script with retry handling. - Updated dependency metadata. - Adjusted development-only React debugging behavior. 
--- ### Type of change - [x] Bug Fix (non-breaking change which fixes RTL rendering and citation issues) - [x] New Feature (non-breaking change which adds RTL detection and dynamic direction handling) --------- Co-authored-by: 6ba3i <isbaaoui09@gmail.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:03:44 +08:00
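The backend part of commit 21bc1ab7ec extends the sentence-boundary regex with the Arabic marks `،`, `؛`, `؟`, and `۔`. A minimal sketch of that idea follows; the pattern `SENTENCE_END`, the Latin punctuation it also contains, and the function `split_sentences` are assumptions for illustration and are not copied from `rag/nlp/search.py`.

```python
import re

# Hypothetical boundary pattern: a few Latin marks (assumed) plus the four
# Arabic marks named in the commit message:
# U+060C comma, U+061B semicolon, U+061F question mark, U+06D4 full stop.
SENTENCE_END = re.compile(r"([.!?;\u060C\u061B\u061F\u06D4]\s*)")


def split_sentences(text):
    """Split text into sentences, keeping each delimiter with its sentence."""
    pieces = SENTENCE_END.split(text)
    sentences = []
    # With one capture group, split() alternates body, delimiter, body, ...
    for i in range(0, len(pieces), 2):
        body = pieces[i]
        delim = pieces[i + 1] if i + 1 < len(pieces) else ""
        sentence = (body + delim).strip()
        if sentence:
            sentences.append(sentence)
    return sentences
```

Keeping the delimiter attached to the preceding sentence gives citation insertion a well-defined position at the end of each sentence, for RTL text as well as LTR text.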
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. 
Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
qinling0210	4bc622b409	Fix parameter of calling self.dataStore.get() and warning info during parser (#13068 ) ### What problem does this PR solve? Fix parameter of calling self.dataStore.get() and warning info during parser https://github.com/infiniflow/ragflow/issues/13036 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-09 17:56:59 +08:00
6ba3i	fabbfcab90	Fix: failing p3 test for SDK/HTTP APIs (#13062 ) ### What problem does this PR solve? Adjust highlight parsing, add row-count SQL override, tweak retrieval thresholding, and update tests with engine-aware skips/utilities. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-09 14:56:10 +08:00
Kevin Hu	1262533b74	Feat: support verify to set llm key and boost bigrams. (#12980 ) #12863 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-05 19:19:09 +08:00
Magicbook1108	f11ca54e0e	Fix: docx parser output consistent (#12965 ) ### What problem does this PR solve? Fix: docx parser output consistent > File "/home/bxy/ragflow/rag/flow/parser/parser.py", line 506, in _word > sections, tbls = docx_parser(name, binary=blob) > ^^^^^^^^^^^^^^ > ValueError: too many values to unpack (expected 2) > ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-03 15:36:58 +08:00
Philipp Heyken Soares	ad06c042c4	Support operator constraints in semi-automatic metadata filtering (#12956 ) ### What problem does this PR solve? #### Summary This PR enhances the Semi-automatic metadata filtering mode by allowing users to explicitly pre-define operators (e.g., contains, =, >, etc.) for selected metadata keys. While the LLM still dynamically extracts the filter value from the user's query, it is now strictly constrained to use the operator specified in the UI configuration. Using this feature is optional. By default the operator selection is set to "automatic" resulting in the LLM choosing the operator (as presently). #### Rationale & Use Case This enhancement was driven by a concrete challenge I encountered while working with technical documentation. In my specific use case, I was trying to filter for software versions within a technical manual. In this dataset, a single document chunk often applies to multiple software versions. These versions are stored as a combined string within the metadata for each chunk. When using the standard semi-automatic filter, the LLM would inconsistently choose between the contains and equals operators. When it chose equals, it would exclude every chunk that applied to more than one version, even if the version I was searching for was clearly included in that metadata string. This led to incomplete and frustrating retrieval results. By extending the semi-automatic filter to allow pre-defining the operator for a specific key, I was able to force the use of contains for the version field. This change immediately led to significantly improved and more reliable results in my case. I believe this functionality will be equally useful for others dealing with "tagged" or multi-value metadata where the relationship between the query and the field is known, but the specific value needs to remain dynamic. 
#### Key Changes ##### Backend & Core Logic - `common/metadata_utils.py`: Updated apply_meta_data_filter to support a mixed data structure for semi_auto (handling both legacy string arrays and the new object-based format {"key": "...", "op": "..."}). - `rag/prompts/generator.py`: Extended gen_meta_filter to accept and pass operator constraints to the LLM. - `rag/prompts/meta_filter.md`: Updated the system prompt to instruct the LLM to strictly respect provided operator constraints. ##### Frontend - `web/src/components/metadata-filter/metadata-semi-auto-fields.tsx`: Enhanced the UI to include an operator dropdown for each selected metadata key, utilizing existing operator constants. - `web/src/components/metadata-filter/index.tsx`: Updated the validation schema to accommodate the new state structure. #### Test Plan - Backward Compatibility: Verified that existing semi-auto filters stored as simple strings still function correctly. - Prompt Verification: Confirmed that constraints are correctly rendered in the LLM system prompt when specified. - Added unit tests as `test/unit_test/common/test_apply_semi_auto_meta_data_filter.py` - Manual End-to-End: - Configured a "Semi-automatic" filter for a "Version" key with the "contains" operator. - Asked a version-specific query. - Result <img width="1173" height="704" alt="Screenshot 2026-02-02 145359" src="https://github.com/user-attachments/assets/510a6a61-a231-4dc2-a7fe-cdfc07219132" /> ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>	2026-02-03 11:11:34 +08:00
Magicbook1108	7be3dacdaa	Fix: custom delimeter in docx (#12946) ### What problem does this PR solve? Fix: custom delimiter in docx ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-03 09:43:18 +08:00
Yongteng Lei	13076bb87b	Fix: Parent chunking fails on DOCX files (#12822 ) ### What problem does this PR solve? Fixes parent chunking fails on DOCX files. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-26 17:55:09 +08:00
Kevin Hu	927db0b373	Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716 ) ### Type of change - [x] Refactoring	2026-01-20 13:29:37 +08:00
qinling0210	828ae1e82f	Round float value of minimum_should_match (#12688) ### What problem does this PR solve? In paragraph() of class FulltextQueryer, the value "len(keywords) / 10" should be rounded to an integer before it is assigned to minimum_should_match. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-19 11:39:33 +08:00
Pegasus	d8192f8f17	Fix: validate regex pattern in split_with_pattern to prevent crash (#12633 ) ### What problem does this PR solve? Fix regex pattern validation in split_with_pattern (#12605) - Add try-except block to validate user-provided regex patterns before use - Gracefully fallback to single chunk when invalid regex is provided - Prevent server crash during DOCX parsing with malformed delimiters ## Problem Parsing DOCX files with custom regex delimiters crashes with `re.error: nothing to repeat at position 9` when users provide invalid regex patterns. Closes #12605 ## Solution Validate and compile regex pattern before use. On invalid pattern, log warning and return content as single chunk instead of crashing. ## Changes - `rag/nlp/__init__.py`: Add regex validation in `split_with_pattern()` function ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=42954461	2026-01-15 14:24:51 +08:00
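The validate-then-fall-back behavior described in commit d8192f8f17 can be sketched as below. The function name `split_with_pattern_safe` and its return shape are hypothetical; the real `split_with_pattern()` in `rag/nlp/__init__.py` may take different arguments and return different structures.

```python
import logging
import re


def split_with_pattern_safe(text, pattern):
    """Hypothetical sketch: split text on a user-supplied regex delimiter.

    If the pattern does not compile, log a warning and return the whole
    text as a single chunk instead of raising re.error.
    """
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        logging.warning("Invalid delimiter regex %r: %s", pattern, exc)
        return [text]
    # split() can yield empty strings (or None for unmatched groups);
    # drop those and keep only non-blank pieces.
    return [piece for piece in compiled.split(text) if piece and piece.strip()]
```

Compiling the pattern up front keeps the failure in one place, so the parsing code that follows never sees an invalid regex.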
Kevin Hu	9a10558f80	Refa: async retrieval process. (#12629 ) ### Type of change - [x] Refactoring - [x] Performance Improvement	2026-01-15 12:28:49 +08:00
Kevin Hu	44bada64c9	Feat: support tree structured deep-research policy. (#12559 ) ### What problem does this PR solve? #12558 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-13 09:41:35 +08:00
lys1313013	b226e06e2d	refactor: remove debug print statements (#12534 ) ### What problem does this PR solve? refactor: remove debug print statements ### Type of change - [x] Refactoring	2026-01-09 19:23:50 +08:00
Kevin Hu	23a9544b73	Fix: toc async issue. (#12485 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-07 15:35:30 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Yongteng Lei	4cd4526492	Feat: PDF vision figure parser supports reading context (#12416 ) ### What problem does this PR solve? PDF vision figure parser supports reading context. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-05 09:55:43 +08:00
Kevin Hu	52f91c2388	Refine: image/table context. (#12336 ) ### What problem does this PR solve? #12303 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-30 20:24:27 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Yongteng Lei	51bc41b2e8	Refa: improve image table context (#12244 ) ### What problem does this PR solve? Improve image table context. Current strategy in attach_media_context: - Order by position when possible: if any chunk has page/position info, sort by (page, top, left), otherwise keep original order. - Apply only to media chunks: images use image_context_size, tables use table_context_size. - Primary matching: on the same page, choose a text chunk whose vertical span overlaps the media, then pick the one with the closest vertical midpoint. - Fallback matching: if no overlap on that page, choose the nearest text chunk on the same page (page-head uses the next text; page-tail uses the previous text). - Context extraction: inside the chosen text chunk, find a mid-sentence boundary near the text midpoint, then take context_size tokens split before/after (total budget). - No multi-chunk stitching: context comes from a single text chunk to avoid mixing unrelated segments. ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-26 17:55:32 +08:00
Lynn	6e9691a419	Feat: message manage (#12196 ) ### What problem does this PR solve? Manage message and use in agent. Issue #4213 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 21:18:13 +08:00
Kevin Hu	8cbfb5aef6	Fix: toc no chunk found issue. (#12197 ) ### What problem does this PR solve? #12170 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 14:06:20 +08:00
Kevin Hu	8197f9a873	Fix: table tag on chunks. (#12126 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:25:38 +08:00
Kevin Hu	8e4d011b15	Fix: parent-children chunking method. (#11997 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-12-17 16:50:36 +08:00
Kevin Hu	ea4a5cd665	Fix: tokenizer issue. (#11902 ) #11786 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 17:38:17 +08:00


288 Commits