ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-04-23 04:06:21 +08:00

Author	SHA1	Message	Date
Jin Hai	d688b72dff	Go: Add admin server status checking (#13571 ) ### What problem does this PR solve? RAGFlow server isn't available when admin server isn't connected. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-12 20:02:50 +08:00
chanx	1df804a14a	Feature (System Settings): Implemented system settings management functionality (#13556 ) ### What problem does this PR solve? Feature (System Settings): Implemented system settings management functionality - Added a new SystemSettings model, including creation and update time fields. - Implemented SystemSettingsDAO, providing CRUD operations and transaction support. - Implemented management interfaces for variables, configurations, and environment variables in the admin service. ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-12 19:06:20 +08:00
guptas6est	7c79602c77	fix(web): upgrade lodash to 4.17.23 and dompurify to 3.3.2 to fix CVE-2026-0540 and CVE-2025-13465 (#13488 ) ### What problem does this PR solve? This PR fixes two security vulnerabilities in web dependencies identified by Trivy: 1. CVE-2025-13465 (lodash): Prototype pollution vulnerability in _.unset and _.omit functions 2. CVE-2026-0540 (dompurify): Cross-site scripting (XSS) vulnerability Changes: - Upgraded lodash from 4.17.21 to 4.17.23 - Upgraded dompurify from 3.3.1 to 3.3.2 - Added npm override to force monaco-editor's transitive dependency on dompurify to use 3.3.2 (monaco-editor still depends on vulnerable 3.2.7) Both upgrades are backward-compatible patch versions. Build verified successfully with no breaking changes. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 19:04:26 +08:00
Ray Zhang	375f62a6c3	docs(migration): add project name (-p) usage to backup & migration guide (#13565 ) ## Summary - Add documentation for the `-p project_name` flag in the migration script, covering all steps (stop, backup, restore, start) - Add a note explaining how Docker volume name prefixes relate to the Compose project name - Update `docker-compose` to `docker compose` (Compose V2 syntax) for consistency - Fix `sh` to `bash` to match the script's shebang line This is the documentation follow-up to #12187 which added `-p` project name support to `docker/migration.sh`. ## Test plan - [ ] Verify the documentation renders correctly on the docs site - [ ] Confirm all example commands are accurate against the current `migration.sh`	2026-03-12 19:01:25 +08:00
qinling0210	1be07a0a34	Fix "Result window is too large" during meta data search (#13521 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13210#issuecomment-3982878498 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 18:59:56 +08:00
Jin Hai	cebf5892ec	Create go version storage component, but not used (#13561 ) ### What problem does this PR solve? Implement: minio, s3, oss, azure_sas, azure_spn, gcs, opendal ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-12 18:58:25 +08:00
Jinghan Xu	f6b06fab72	Fix: allow document parsing status recovery after transient errors (#13341 ) ### What problem does this PR solve? Fixes #13285 When an LLM returns a transient error (e.g. overloaded) during parsing, the task progress is set to -1. Previously, the progress could never be updated again, leaving the document permanently stuck in FAIL status even after the task successfully recovered and completed. Three coordinated changes address this: 1. task_service.update_progress: relax the progress update guard to accept prog >= 1 even when current progress is -1, so a task that recovers from a transient failure can report completion. 2. document_service.get_unfinished_docs: include documents that are marked FAIL (progress == -1) but still have at least one non-failed task (task.progress >= 0) in the polling set, so their status can be re-synced once a task recovers. Documents where all tasks have permanently failed are excluded to avoid unnecessary polling. 3. document_service.update_progress: explicitly set document status to RUNNING when not all tasks have finished, instead of preserving whatever stale status (potentially FAIL) the document previously had. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 18:02:12 +08:00
Yongteng Lei	13a34d7689	Feat: inject sys.date into canvas (#13567 ) ### What problem does this PR solve? Inject sys.date into canvas. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-12 17:49:13 +08:00
Magicbook1108	eda7835d47	Fix: image pdf in ingestion pipeline (#13563 ) ### What problem does this PR solve? Fix: image pdf in ingestion pipeline #13550 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 17:49:02 +08:00
NeedmeFordev	387b0b27c4	feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527 ) ### What problem does this PR solve? This PR adds support for parsing PDFs through an external Docling server, so RAGFlow can connect to remote `docling serve` deployments instead of relying only on local in-process Docling. It addresses the feature request in [#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns with the external-server usage pattern already used by MinerU. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What is changed? - Add external Docling server support in `DoclingParser`: - Use `DOCLING_SERVER_URL` to enable remote parsing mode. - Try `POST /v1/convert/source` first, and fallback to `/v1alpha/convert/source`. - Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not set. - Wire Docling env settings into parser invocation paths: - `rag/app/naive.py` - `rag/flow/parser/parser.py` - Add Docling env hints in constants and update docs: - `docs/guides/dataset/select_pdf_parser.md` - `docs/guides/agent/agent_component_reference/parser.md` - `docs/faq.mdx` ### Why this approach? This keeps the change focused on one issue and one capability (external Docling connectivity), without introducing unrelated provider-model plumbing. ### Validation - Static checks: - `python -m py_compile` on changed Python files - `python -m ruff check` on changed Python files - Functional checks: - Remote v1 endpoint path works - v1alpha fallback works - Local Docling path remains available when server URL is unset ### Related links - Feature request: [Support external Docling server (issue #13426)](https://github.com/infiniflow/ragflow/issues/13426) - Compare view for this branch: [main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1) ##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)	2026-03-12 17:09:03 +08:00
Josh	a353c7bdd7	Fix: avoid empty doc filter in knowledge retrieval (#13484 ) ## Summary Fix knowledge-base chat retrieval when no individual document IDs are selected. ## Root Cause `async_chat()` initialized `doc_ids` as an empty list when the request did not explicitly select documents. That empty list was then forwarded into retrieval as an active `doc_id` filter, effectively becoming `doc_id IN []` and suppressing all chunk matches. ## Changes - treat missing selected document IDs as `None` instead of `[]` - keep explicit document filtering when IDs are actually provided - add regression coverage for the shared chat retrieval path ## Validation - `python3 -m py_compile api/db/services/dialog_service.py test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py` - `.venv/bin/python -m pytest test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py` - manually verified that chat completions again inject retrieved knowledge into the prompt --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-12 16:03:30 +08:00
cambrianlee	227c852e67	Fix typo: documnet_keyword -> document_keyword in Chunk class (#13531 ) ### What problem does this PR solve? The Chunk class had a typo in the attribute name 'documnet_keyword', which caused the document_name field to remain empty when retrieving chunks via the SDK. This fix corrects the spelling to 'document_keyword'. Changes: - Line 36: Changed self.documnet_keyword to self.document_keyword - Line 52: Updated backward compatibility code to use self.document_keyword ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 15:23:55 +08:00
Jin Hai	e78938c72c	Update go admin server default port to 9383 (#13559 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-12 13:41:08 +08:00
Jimmy Ben Klieve	31a8184f63	refactor(ui): update ui for user settings, etc. (#13532 ) ### What problem does this PR solve? Update UI styles: - User settings - Component styles: - `ui/button.tsx` - `ui/checkbox.tsx` - `avatar-upload.tsx` - `file-uploader.tsx` - `icon-font.tsx` ### Type of change - [x] Refactoring	2026-03-12 13:33:36 +08:00
chanx	0da9c4618d	feat(cli): Enhance CLI functionality and add administrator mode support (#13539 ) ### What problem does this PR solve? feat(cli): Enhance CLI functionality and add administrator mode support - Modify `parseActivateUser` in `parser.go` to support 'on'/'off' states - Add administrator mode switching and host port settings functionality to `cli.go` - Implement user management API calls in `client.go` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-12 13:33:13 +08:00
chanx	4bd5bb141d	Fix: data-source-detail page style (#13507 ) ### What problem does this PR solve? Fix: data-source-detail page style ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 13:32:39 +08:00
Jin Hai	5cbdfc5f17	Fix Gitee embedding model URL error (#13553 ) ### What problem does this PR solve? As title ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-12 13:13:06 +08:00
Yongteng Lei	375a910bcf	Fix: add deadlock retry (#13552 ) ### What problem does this PR solve? Add deadlock retry. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 12:39:01 +08:00
Jin Hai	90afce192c	Add license and fingerprint API hook (#13548 ) ### What problem does this PR solve? For EE ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-12 11:52:39 +08:00
Jin Hai	2fb1360d9d	Add command line parameter and fix error message (#13526 ) ### What problem does this PR solve? `./server_main -p 9380` `./server_main -h` ### Type of change - [x] New Feature (non-breaking change which adds functionality) Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-12 09:50:57 +08:00
Yongteng Lei	e1b632a7bb	Feat: add delete all support for delete operations (#13530 ) ### What problem does this PR solve? Add delete all support for delete operations. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-12 09:47:42 +08:00
qinling0210	d201a81db7	Add command history in ragflow cli (#13538 ) ### What problem does this PR solve? In the ragflow CLI, use the Up/Down arrows to navigate command history. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-11 19:14:18 +08:00
Liu An	852393c114	Test: Lower priority of chat assistant and chunk list API tests (#13540 ) ### What problem does this PR solve? Mark test cases as lower priority (p3) for: - Creating chat assistants - Deleting chat assistants - Listing chat assistants - Listing chunks within datasets ### Type of change - [x] Update testcases	2026-03-11 19:00:18 +08:00
foyou	f75dc6a452	Docs: Fix normalization of case and some code blocks (#13520 ) ### What problem does this PR solve? Standardize term capitalization in `deploy_local_llm.mdx` and improve code block formatting. ### Type of change - [x] Documentation Update	2026-03-11 17:51:13 +08:00
Ethan T.	1cee8b1a7b	fix: use context managers for file handles to prevent resource leaks (#13514 ) ## Summary - Convert bare `open()` calls to `with` context managers or `Path.read_text()` - File handles leak if not properly closed, especially on exceptions - Fixes in crypt.py, sequence2txt_model.py, term_weight.py, deepdoc/vision/__init__.py ## Test plan - [x] File operations work correctly with context managers - [x] Resources properly cleaned up on exceptions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 16:47:06 +08:00
Attili-sys	6afd13ff29	Feat/arabic language support (#13516 ) ### What problem does this PR solve? This PR implements comprehensive Arabic language support for the RAGFlow application. The changes include: - Complete Arabic translation of all UI text elements in the web interface - RTL (right-to-left) layout support for Arabic content - Localization updates for all supported languages (ar, bg, de, en, es, fr, id, it, ja, pt-br, ru, vi, zh-traditional, zh) - UI component adjustments to properly display Arabic text and support RTL layout The implementation ensures that Arabic-speaking users can fully interact with the application in their native language with proper text rendering and layout direction. ### Type of change - [x] New Feature (non-breaking change which adds functionality) <img width="2866" height="1617" alt="image" src="https://github.com/user-attachments/assets/f2751b34-1b65-4867-b81d-a1068c17b9b7" /> --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-11 15:06:07 +08:00
chanx	9ca2bac984	Feat: Implement user creation, deletion, and permission management functionality. (#13519 ) ### What problem does this PR solve? Feat: Implement user creation, deletion, and permission management functionality. - Added the `ListByEmail` method to `user.go` to query users by email address. - Updated the user activation status handling logic in `handler.go`, adding input validation. - Added RSA password decryption functionality to `password.go`. - Implemented complete user management functionality in `service.go`, including user creation, deletion, password modification, activation status, and permission management. - Added input validation and error handling logic. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-11 14:04:00 +08:00
Jin Hai	2028e895fd	Add license and time record DAO (#13522 ) ### What problem does this PR solve? 1. Change go server default port to 9382 2. Compatible with EE data model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-11 14:02:24 +08:00
qinling0210	1815f5950b	Call get_flatted_meta_by_kbs in dify retrieval (#13509 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13388 Call get_flatted_meta_by_kbs in dify retrieval. Remove get_meta_by_kbs. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-11 13:42:24 +08:00
Josh	2d2d3cdbcf	Fix document metadata loading for paged listings (#13515 ) ## Summary - scope normal document-list metadata lookups to the current page's document IDs - keep the `return_empty_metadata=True` path dataset-wide because it needs full knowledge of docs that already have metadata - add unit tests for both paged listing paths and the unchanged empty-metadata behavior ## Why `DocumentService.get_list()` and the normal `get_by_kb_id()` path were calling `DocMetadataService.get_metadata_for_documents(None, kb_id)`, which loads metadata for the entire dataset on every page request. That becomes especially problematic on large datasets. The metadata scan path paginates through the full metadata index without an explicit sort, while the ES helper only switches to `search_after` beyond `10000` results when a sort is present. In practice this can lead to unnecessary full-dataset metadata work, slower document-list loading, and unreliable `meta_fields` in list responses for large KBs. This change keeps the existing empty-metadata filter behavior intact, but scopes normal list responses to metadata for the current page only.	2026-03-11 13:42:16 +08:00
Jimmy Ben Klieve	507ba4ea20	refactor(ui): update knowledge graph, chunk, metadata, agent log styles (#13518 ) ### What problem does this PR solve? Update UI styles: - Dataset > Knowledge graph tooltip - Dataset > Files > Manage metadata modal - Dataset > Files > Modify Chunking Method > Auto metadata > Manage generation settings modal - Agent > Canvas (Ingestion pipeline) > Dataflow result ### Type of change - [x] Refactoring	2026-03-11 11:27:20 +08:00
Jin Hai	2133fd76a8	Add auth middleware (#13506 ) ### What problem does this PR solve? Use auth middleware to check authorization. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-11 11:23:13 +08:00
eviaaaaa	d0ca388bec	Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329 ) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * Centralized Abstraction (`docx_parser.py`): Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * Robust Corrupted Image Fallback (`docx_parser.py`): Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * Legacy Code & Redundancy Purge: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. ## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * Output Consistency: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * Memory Footprint Drop: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.	2026-03-11 10:00:07 +08:00
balibabu	d36e3c97d1	Feat: Add a user_id field to the message and retrieval operators. (#13508 ) ### What problem does this PR solve? Feat: Add a user_id field to the message and retrieval operators. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-10 22:18:27 +08:00
Yongteng Lei	3c80a0ae09	Fix: support vLLM's new reasoning field (#13493 ) ### What problem does this PR solve? Support vLLM's new reasoning field ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 21:13:14 +08:00
yzy	07c9cf6cbe	Fix: return structured JSON output for non-streaming agent API (#13389 ) ### What problem does this PR solve? Previously, when an Agent component was configured with structured output, the non-streaming /agents/{agent_id}/completions API never returned the structured field in its response. The root cause: the non-streaming code path only collected message events to build full_content, then returned the workflow_finished payload — which only contains the output of the last component in the execution path (typically a Message component). Any structured output set by upstream components (e.g., Agent or LLM) was silently discarded. This PR fixes the non-streaming handler to iterate node_finished events and collect structured output from intermediate components. If any component produced a non-empty structured value, it is included in the final response under data.structured. The streaming path is unaffected, as it already exposes node_finished events to the caller. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 19:22:04 +08:00
Heyang Wang	08f83ff331	Feat: Support get aggregated parsing status to dataset via the API (#13481 ) ### What problem does this PR solve? Support getting aggregated parsing status to dataset via the API Issue: #12810 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>	2026-03-10 18:05:45 +08:00
Liu An	68a623154a	Fix: bin directory cannot be copied to docker image introduced by #13444 (#13502 ) ### What problem does this PR solve? bin directory cannot be copied to docker image introduced by ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 17:31:20 +08:00
chanx	f14b53c764	feat(admin): Implemented default administrator initialization and login functionality. (#13504 ) ### What problem does this PR solve? feat(admin): Implemented default administrator initialization and login functionality. Added support for default administrator configuration, including super user nickname, email, and password. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 17:30:21 +08:00
balibabu	81461b4505	Fix: The number of deleted session prompts is displayed incorrectly. #13499 (#13500 ) ### What problem does this PR solve? Fix: The number of deleted session prompts is displayed incorrectly. #13499 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 16:01:31 +08:00
Magicbook1108	675810e0cf	Refact: optimize confluence performance (#13497 ) ### What problem does this PR solve? Refact: optimize confluence performance #13494 ### Type of change - [x] Refactoring	2026-03-10 15:02:24 +08:00
Alexander Vostres	9ba43ae4ee	Fix "Coordinate lower is less than upper" error with MinerU (#13483 ) ### What problem does this PR solve? Fixes #6004 #7142 #11959 Unlike #9207 we actually normalize the coordinates here ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 15:02:01 +08:00
balibabu	aaf900cf16	Feat: Display release status in agent version history. (#13479 ) ### What problem does this PR solve? Feat: Display release status in agent version history. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: balibabu <assassin_cike@163.com>	2026-03-10 14:25:27 +08:00
Idriss Sbaaoui	249b78561b	Fix mismatched docnm_kwd in raptor chunks (#13451 ) ### What problem does this PR solve? issue #13393 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 14:24:33 +08:00
qinling0210	185ab0d4ef	Fix delete_document_metadata (#13496 ) ### What problem does this PR solve? Avoid getting doc in function delete_document_metadata as the doc might have been removed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 13:44:24 +08:00
Magicbook1108	7143954b48	Fix: chats_openai in non-stream condition (#13495 ) ### What problem does this PR solve? Fix: chats_openai in non-stream condition #13453 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 13:44:17 +08:00
qinling0210	7c92f51133	Fix retrieval function when metadata_condition is specified in retrieval API (#13473 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13388 The following command returns an empty result even when a document with the metadata exists: ``` curl --request POST \ --url http://localhost:9222/api/v1/retrieval \ --header 'Content-Type: application/json' \ --header 'Authorization: Bearer ragflow-fO3mPFePfLgUYg8-9gjBVVXbvHqrvMPLGaW0P86PvAk' \ --data '{ "question": "any question", "dataset_ids": ["9bb4f0591b8811f18a4a84ba59049aa3"], "metadata_condition": { "logic": "and", "conditions": [ { "name": "character", "comparison_operator": "is", "value": "刘备" } ] } }' ``` When metadata_condition is specified in the retrieval API, it is converted to doc_ids, which are passed to the retrieval function. In the retrieval function, when doc_ids is explicitly provided, the threshold should be bypassed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 11:57:32 +08:00
tunsuy	292a1a8566	fix: detect and fallback garbled PDF text to OCR (#13366 ) (#13404 ) ## Problem When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer) cannot map CIDs to correct Unicode characters, outputting PUA characters (U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully trusted pdfplumber text without any garbled detection, causing garbled output in the final parsed result. Relates to #13366 ## Solution ### 1. Garbled text detection functions - `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16), replacement character U+FFFD, control characters, and unassigned/surrogate codepoints - `_is_garbled_text(text, threshold)`: Calculates garbled ratio and detects `(cid:xxx)` patterns ### 2. Box-level fallback (in `__ocr()`) When a text box has ≥50% garbled characters, discard pdfplumber text and fallback to OCR recognition. ### 3. Page-level detection (in `__images__()`) Sample characters from each page; if garbled rate ≥30%, clear all pdfplumber characters for that page, forcing full OCR. ### 4. Layout recognizer CID filtering Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text processing to prevent them from polluting layout analysis. ## Testing - 29 unit tests covering: normal CJK/English text, PUA characters, CID patterns, mixed text, boundary thresholds, edge cases - All 85 existing project unit tests pass without regression	2026-03-10 11:20:31 +08:00
Jin Hai	7f6a9e8ee9	Update ext field type of heartbeat message (#13490 ) ### What problem does this PR solve? As title ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-10 10:49:39 +08:00
chanx	02108772d8	refactor: Moves the LLM factory initialization logic to the `dao` package. (#13476 ) ### What problem does this PR solve? refactor: Moves the LLM factory initialization logic to the `dao` package. Removes the `init_data` package and integrates the LLM factory initialization functionality into the `dao` package. Adds a `utility` package to provide general utility functions. Updates `server_main.go` to use the new initialization path. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-03-10 10:35:55 +08:00
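
Illustration for commit 292a1a8566 (garbled PDF text detection). The commit message describes a per-character check and a ratio-based text check; the following is a minimal sketch of that idea, not the code from the PR. The function names echo the ones in the message, but the bodies, the `CID_PATTERN` constant, and the choice to treat any `(cid:xxx)` occurrence as garbled are assumptions made for illustration.

```python
import re
import unicodedata

# Assumed pattern for pdfminer-style placeholders such as "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def is_garbled_char(ch: str) -> bool:
    """Return True for characters that suggest a failed CID-to-Unicode mapping."""
    code = ord(ch)
    # Private Use Areas: BMP, Plane 15, and Plane 16.
    if (0xE000 <= code <= 0xF8FF
            or 0xF0000 <= code <= 0xFFFFD
            or 0x100000 <= code <= 0x10FFFD):
        return True
    if code == 0xFFFD:  # Unicode replacement character
        return True
    category = unicodedata.category(ch)
    # Cn: unassigned, Cs: surrogate.
    if category in ("Cn", "Cs"):
        return True
    # Cc: control characters, except ordinary whitespace controls.
    return category == "Cc" and ch not in "\t\n\r"


def is_garbled_text(text: str, threshold: float = 0.5) -> bool:
    """Return True if text contains a CID placeholder or too many garbled characters."""
    if not text or not text.strip():
        return False
    # Assumption: a single "(cid:xxx)" placeholder is enough to flag the text.
    if CID_PATTERN.search(text):
        return True
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    garbled = sum(is_garbled_char(ch) for ch in chars)
    return garbled / len(chars) >= threshold
```

With the thresholds quoted in the commit message, a caller could use `threshold=0.5` for a single text box and `threshold=0.3` for characters sampled from a whole page, falling back to OCR whenever the check returns `True`.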


5505 Commits
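
Illustration for commit a353c7bdd7 (empty document filter). The commit message explains that an empty `doc_ids` list was forwarded as an active filter, effectively `doc_id IN []`, which matched nothing. The sketch below reproduces that behaviour on an in-memory list of chunks; `normalize_doc_ids` and `retrieve` are hypothetical helpers written for this example and do not correspond to functions in RAGFlow.

```python
from __future__ import annotations

from typing import Optional


def normalize_doc_ids(selected: Optional[list[str]]) -> Optional[list[str]]:
    """Map a missing or empty selection to None, meaning 'no document filter'."""
    return list(selected) if selected else None


def retrieve(chunks: list[dict], doc_ids: Optional[list[str]]) -> list[dict]:
    """Return chunks, applying the doc_id filter only when one is active.

    None means "do not filter"; a list (even an empty one) is an active filter,
    which is why an empty list suppresses every match.
    """
    if doc_ids is None:
        return list(chunks)
    allowed = set(doc_ids)
    return [chunk for chunk in chunks if chunk["doc_id"] in allowed]
```

Passing `[]` straight to `retrieve` returns no chunks, while passing `normalize_doc_ids([])` returns all of them; explicit selections such as `["a"]` still filter as before.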