ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-30 04:27:30 +08:00

Author	SHA1	Message	Date
Jin Hai	24fcd6bbc7	Update CI (#13774 ) ### What problem does this PR solve? CI isn't stable, try to fix it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-25 18:17:52 +08:00
Stephen Hu	d32967eda8	refactor: let excel use lazy image loader (#13558 ) ### What problem does this PR solve? let excel use lazy image loader ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-23 21:24:40 +08:00
Magicbook1108	f991cd362e	Fix: type check in resume parsing method (#13740 ) ### What problem does this PR solve? Fix: type check in resume parsing method ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-23 21:19:09 +08:00
Yongteng Lei	dd839f30e8	Fix: code supports matplotlib (#13724 ) ### What problem does this PR solve? Code as "final" node: ![img_v3_02vs_aece4caf-8403-4939-9e68-9845a22c2cfg](https://github.com/user-attachments/assets/9d87b8df-da6b-401c-bf6d-8b807fe92c22) Code as "mid" node: ![img_v3_02vv_f74f331f-d755-44ab-a18c-96fff8cbd34g](https://github.com/user-attachments/assets/c94ef3f9-2a6c-47cb-9d2b-19703d2752e4) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-20 20:32:00 +08:00
tmimmanuel	13d0df1562	feat: add Perplexity contextualized embeddings API as a new model provider (#13709 ) ### What problem does this PR solve? Adds Perplexity contextualized embeddings API as a new model provider, as requested in #13610. - `PerplexityEmbed` provider in `rag/llm/embedding_model.py` supporting both standard (`/v1/embeddings`) and contextualized (`/v1/contextualizedembeddings`) endpoints - All 4 Perplexity embedding models registered in `conf/llm_factories.json`: `pplx-embed-v1-0.6b`, `pplx-embed-v1-4b`, `pplx-embed-context-v1-0.6b`, `pplx-embed-context-v1-4b` - Frontend entries (enum, icon mapping, API key URL) in `web/src/constants/llm.ts` - Updated `docs/guides/models/supported_models.mdx` - 22 unit tests in `test/unit_test/rag/llm/test_perplexity_embed.py` Perplexity's API returns `base64_int8` encoded embeddings (not OpenAI-compatible), so this uses a custom `requests`-based implementation. Contextualized vs standard model is auto-detected from the model name. Closes #13610 ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2026-03-20 10:47:48 +08:00
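The `base64_int8` encoding mentioned in this commit can be decoded with the standard library alone. The following is a minimal sketch, assuming each embedding arrives as a base64 string whose decoded bytes are signed 8-bit integers, one per dimension. The function name `decode_base64_int8` is hypothetical and is not taken from `rag/llm/embedding_model.py`.

```python
import base64
import struct


def decode_base64_int8(encoded: str) -> list[int]:
    """Decode a base64 string into a list of signed 8-bit integers."""
    raw = base64.b64decode(encoded)
    # "b" is the struct format code for a signed char (int8).
    return list(struct.unpack(f"{len(raw)}b", raw))


# Round-trip check with a hand-built payload (not real API output).
sample = base64.b64encode(struct.pack("4b", -128, -1, 0, 127)).decode("ascii")
decoded = decode_base64_int8(sample)
```

A provider would typically convert the resulting integers to floats or a NumPy array before use; that step is left out here.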
yH	757d8d42dd	Fix: use configured OrderByExpr in _community_retrieval_ (#13683 ) The `odr` variable was configured with `desc("weight_flt")` but a new empty `OrderByExpr()` was passed to `dataStore.search()` instead, causing the descending sort to have no effect. ### What problem does this PR solve? In `_community_retrieval_`, the configured `OrderByExpr` with `desc("weight_flt")` was discarded — a new empty `OrderByExpr()` was passed to `dataStore.search()` instead, so community reports were never sorted by weight. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-19 17:55:40 +08:00
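The bug pattern described in this commit (configuring an ordering object, then passing a fresh empty one) can be reproduced in isolation. This is a sketch only: `OrderByExpr` and `search` below are simplified stand-ins written for illustration and do not reflect the real classes' signatures.

```python
class OrderByExpr:
    """Hypothetical stand-in for the real ordering expression class."""

    def __init__(self):
        self.fields = []

    def desc(self, field):
        self.fields.append((field, True))
        return self


def search(rows, order_by):
    """Toy search: apply the requested ordering, or none if it is empty."""
    result = list(rows)
    # Sort by the last key first so earlier keys take precedence.
    for field, descending in reversed(order_by.fields):
        result.sort(key=lambda row: row[field], reverse=descending)
    return result


reports = [{"weight_flt": 0.2}, {"weight_flt": 0.9}, {"weight_flt": 0.5}]

odr = OrderByExpr()
odr.desc("weight_flt")

buggy = search(reports, OrderByExpr())  # configured `odr` is discarded
fixed = search(reports, odr)            # configured `odr` is passed through
```

With the empty expression the rows come back in their original order; with `odr` they are sorted by descending weight.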
Idriss Sbaaoui	7827f0fce5	fix : empty mind map (#13693 ) ### What problem does this PR solve? Fix graphrag extractor chat response parsing and skip truncated cache values ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-19 13:53:06 +08:00
NeedmeFordev	c3f79dbcb0	fix(jira): prevent missed incremental updates after issue edits (#13674 ) ### What problem does this PR solve? Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira incremental sync could miss updated issues after initial sync, especially near time boundaries. Root cause: - Jira JQL uses minute-level precision for `updated` filters. - Incremental windows had no overlap buffer, so boundary updates could be skipped. - Sync log cursor tracking used a backward-facing update for `poll_range_start`. - Existing-doc updates in `upload_document` lacked a KB ownership guard for doc-id collisions. What changed: - Added Jira incremental overlap buffer (`time_buffer_seconds`, defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL lower-bound time. - Preserved second-level post-filtering to avoid duplicate reprocessing while still catching boundary updates. - Improved Jira sync logging to include start/end window and overlap configuration. - Updated sync cursor tracking in `increase_docs` to keep `poll_range_start` moving forward with max update time. - Added KB ID safety check before updating existing document records in `upload_document`. Verification performed: - Python syntax compile checks passed for modified files. - Manual verification flow: 1. Run full Jira sync. 2. Edit an already-indexed Jira issue. 3. Run next incremental sync. 4. Confirm updated content is re-ingested into KB. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 23:31:05 +08:00
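The overlap-buffer idea from this commit can be sketched as two small functions: widen the minute-precision JQL lower bound, then post-filter at second precision. The helper names, the 60-second default, and the use of naive datetimes in a single timezone are assumptions made for illustration; the connector's real code may differ.

```python
from datetime import datetime, timedelta


def jql_updated_clause(poll_range_start: datetime, time_buffer_seconds: int = 60) -> str:
    """Build a minute-precision JQL lower bound, widened by an overlap buffer."""
    lower = poll_range_start - timedelta(seconds=time_buffer_seconds)
    stamp = lower.strftime("%Y-%m-%d %H:%M")  # JQL drops the seconds
    return f'updated >= "{stamp}"'


def keep_new_updates(issues, poll_range_start):
    """Second-level post-filter: drop issues already covered by the last window."""
    return [issue for issue in issues if issue["updated"] >= poll_range_start]


start = datetime(2026, 3, 18, 10, 5, 40)
clause = jql_updated_clause(start)

fetched = [
    {"key": "OLD-1", "updated": datetime(2026, 3, 18, 10, 4, 50)},
    {"key": "NEW-1", "updated": datetime(2026, 3, 18, 10, 5, 45)},
]
fresh = keep_new_updates(fetched, start)
```

The wider JQL window fetches `OLD-1` again, and the post-filter discards it, so only `NEW-1` is reprocessed.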
Idriss Sbaaoui	9070408b04	Fix : model-specific handling (#13675 ) ### What problem does this PR solve? Add a handler for GPT-5 models that drops the parameters those models do not accept, and centralize all model-specific parameter handling into a single helper. Solves issue #13639 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-03-18 17:28:20 +08:00
Daniil Sivak	60ad32a0c2	Feat: support epub parsing (#13650 ) Closes #1398 ### What problem does this PR solve? Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required. ### Type of change - [x] New Feature (non-breaking change which adds functionality) To check this parser manually: ```python uv run --python 3.12 python -c " from deepdoc.parser import EpubParser with open('$HOME/some_epub_book.epub', 'rb') as f: data = f.read() sections = EpubParser()(None, binary=data, chunk_token_num=512) print(f'Got {len(sections)} sections') for i, s in enumerate(sections[:5]): print(f'\n--- Section {i} ---') print(s[:200]) " ```	2026-03-17 20:14:06 +08:00
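Reading an EPUB in spine order, as this commit describes, needs only the standard library, because an EPUB is a ZIP archive with an XML package file. The sketch below is an independent illustration and not the `EpubParser` implementation; the function name `spine_documents` is hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub_bytes: bytes) -> list[str]:
    """Return the archive paths of content documents in spine (reading) order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        # META-INF/container.xml points at the package (OPF) file.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).attrib["full-path"]
        package = ET.fromstring(archive.read(opf_path))

    # The manifest maps item ids to hrefs, relative to the OPF file's directory.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.attrib["id"]: posixpath.join(base, item.attrib["href"])
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine lists item ids in reading order.
    return [
        manifest[ref.attrib["idref"]]
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
```

Each returned path can then be read from the archive and handed to an HTML parser, which matches the approach the commit message outlines.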
Stephen Hu	77483b1e58	refactor: remove useless variable in raptor (#13648 ) ### What problem does this PR solve? remove useless variable in raptor ### Type of change - [x] Refactoring	2026-03-17 15:56:51 +08:00
Yingfeng	b686a60713	Switch from demo.ragflow.io to cloud.ragflow.io (#13624 ) ### What problem does this PR solve? Switch from demo.ragflow.io to cloud.ragflow.io ### Type of change - [x] Documentation Update	2026-03-16 14:44:39 +08:00
apps-lycusinc	8b984c9d5f	Fixing WordNetCorpusReader object has no attribute _LazyCorpusLoader_… (#13600 ) ### What problem does this PR solve? Forces NLTK to load the corpus synchronously once, preventing concurrent tasks from triggering the lazy-loading race condition that causes "Fixing WordNetCorpusReader object has no attribute _LazyCorpusLoader_…" (#13590). ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: shakeel <shakeel@lollylaw.com>	2026-03-13 19:55:01 +08:00
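The general remedy described in this commit, loading a lazy resource once before any concurrent work starts, can be shown without NLTK. `LazyResource` below is a hypothetical stand-in, not NLTK code, and the sketch does not claim to match the exact call the PR uses.

```python
import threading


class LazyResource:
    """Hypothetical stand-in for a lazily loaded corpus; not NLTK code."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.load_count = 0

    def load_once(self):
        if self._data is None:
            self.load_count += 1
            self._data = self._loader()
        return self._data

    def lookup(self, word):
        return word in self.load_once()


corpus = LazyResource(lambda: {"dog", "cat"})

# Load on the main thread, before any worker can race on first access.
corpus.load_once()

results = []
workers = [
    threading.Thread(target=lambda: results.append(corpus.lookup("dog")))
    for _ in range(8)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Because the resource is already loaded when the workers start, none of them ever enters the unsynchronized loading branch.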
Idriss Sbaaoui	810692dfa3	fix: restore cross_languages default chat-model fallback for retrieval (#13471 ) ### What problem does this PR solve? issue #13465 POST /api/v1/retrieval failed with {"code":100,...,"message":"Exception('Model Name is required')"} when cross_languages was provided and no explicit llm_id was passed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-13 10:52:37 +08:00
Ethan Clarke	35cd56f990	feat: add MiniMax-M2.5 and M2.5-highspeed models (#13557 ) ## Summary Add MiniMax's latest M2.5 model family to the model registry and update the default API base URL to the international endpoint for broader accessibility. ## Changes - Add MiniMax-M2.5 models to `conf/llm_factories.json`: - `MiniMax-M2.5` — Peak Performance. Ultimate Value. Master the Complex. - `MiniMax-M2.5-highspeed` — Same performance, faster and more agile. - Both support 204,800 token context window and tool calling (`is_tools: true`). - Update default MiniMax API base URL in `rag/llm/__init__.py`: - From `https://api.minimaxi.com/v1` (domestic) to `https://api.minimax.io/v1` (international). - Chinese users can still override via the Base URL field in the UI settings (as documented in existing i18n strings). ## Supported Models \| Model \| Context Window \| Tool Calling \| Description \| \|-------\|---------------\|-------------\|-------------\| \| `MiniMax-M2.5` \| 204,800 tokens \| Yes \| Peak Performance. Ultimate Value. \| \| `MiniMax-M2.5-highspeed` \| 204,800 tokens \| Yes \| Same performance, faster and more agile. \| ## API Documentation - OpenAI Compatible API: https://platform.minimax.io/docs/api-reference/text-openai-api ## Testing - [x] JSON validation passes - [x] Python syntax validation passes - [x] Ruff lint passes - [x] MiniMax-M2.5 API call verified (returns valid response) - [x] MiniMax-M2.5-highspeed API call verified (returns valid response) Co-authored-by: PR Bot <pr-bot@minimaxi.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-12 20:41:46 +08:00
qinling0210	1be07a0a34	Fix "Result window is too large" during meta data search (#13521 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13210#issuecomment-3982878498 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 18:59:56 +08:00
Magicbook1108	eda7835d47	Fix: image pdf in ingestion pipeline (#13563 ) ### What problem does this PR solve? Fix: image pdf in ingestion pipeline #13550 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 17:49:02 +08:00
NeedmeFordev	387b0b27c4	feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527 ) ### What problem does this PR solve? This PR adds support for parsing PDFs through an external Docling server, so RAGFlow can connect to remote `docling serve` deployments instead of relying only on local in-process Docling. It addresses the feature request in [#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns with the external-server usage pattern already used by MinerU. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What is changed? - Add external Docling server support in `DoclingParser`: - Use `DOCLING_SERVER_URL` to enable remote parsing mode. - Try `POST /v1/convert/source` first, and fallback to `/v1alpha/convert/source`. - Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not set. - Wire Docling env settings into parser invocation paths: - `rag/app/naive.py` - `rag/flow/parser/parser.py` - Add Docling env hints in constants and update docs: - `docs/guides/dataset/select_pdf_parser.md` - `docs/guides/agent/agent_component_reference/parser.md` - `docs/faq.mdx` ### Why this approach? This keeps the change focused on one issue and one capability (external Docling connectivity), without introducing unrelated provider-model plumbing. 
### Validation - Static checks: - `python -m py_compile` on changed Python files - `python -m ruff check` on changed Python files - Functional checks: - Remote v1 endpoint path works - v1alpha fallback works - Local Docling path remains available when server URL is unset ### Related links - Feature request: [Support external Docling server (issue #13426)](https://github.com/infiniflow/ragflow/issues/13426) - Compare view for this branch: [main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1) ##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)	2026-03-12 17:09:03 +08:00
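The endpoint fallback described in this commit (`/v1/convert/source` first, then `/v1alpha/convert/source`) can be sketched with an injected HTTP function so that it runs without a server. The `post` callable, the function name, and the choice to fall back only on HTTP 404 are assumptions for illustration; they are not taken from `DoclingParser`.

```python
from typing import Callable, Tuple

ENDPOINTS = ("/v1/convert/source", "/v1alpha/convert/source")


def convert_with_fallback(
    base_url: str,
    payload: dict,
    post: Callable[[str, dict], Tuple[int, dict]],
) -> dict:
    """Try each endpoint in order; fall back when the server answers 404."""
    last_status = None
    for path in ENDPOINTS:
        status, body = post(base_url.rstrip("/") + path, payload)
        if status == 200:
            return body
        last_status = status
        if status != 404:
            break  # a real error; do not mask it by retrying elsewhere
    raise RuntimeError(f"Docling server request failed with status {last_status}")
```

In real use, `post` would wrap an HTTP client call against the URL configured in `DOCLING_SERVER_URL`.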
Yongteng Lei	e1b632a7bb	Feat: add delete all support for delete operations (#13530 ) ### What problem does this PR solve? Add delete all support for delete operations. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-12 09:47:42 +08:00
Ethan T.	1cee8b1a7b	fix: use context managers for file handles to prevent resource leaks (#13514 ) ## Summary - Convert bare `open()` calls to `with` context managers or `Path.read_text()` - File handles leak if not properly closed, especially on exceptions - Fixes in crypt.py, sequence2txt_model.py, term_weight.py, deepdoc/vision/__init__.py ## Test plan - [x] File operations work correctly with context managers - [x] Resources properly cleaned up on exceptions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 16:47:06 +08:00
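The two replacements this commit names for a bare `open()` call look like the following. The helper names and the temporary file are illustrative only and do not come from the modified modules.

```python
import os
import tempfile
from pathlib import Path


def read_with_context(path):
    # The with block closes the handle even when read() raises.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def read_with_pathlib(path):
    # Path.read_text() opens, reads, and closes in one call.
    return Path(path).read_text(encoding="utf-8")


# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("stopword\n")

via_with = read_with_context(tmp.name)
via_pathlib = read_with_pathlib(tmp.name)
os.remove(tmp.name)
```

By contrast, `open(path).read()` leaves closing the handle to the garbage collector, which is the leak the commit removes.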
eviaaaaa	d0ca388bec	Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329 ) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * Centralized Abstraction (`docx_parser.py`): Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * Robust Corrupted Image Fallback (`docx_parser.py`): Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * Legacy Code & Redundancy Purge: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. 
## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * Output Consistency: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * Memory Footprint Drop: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.	2026-03-11 10:00:07 +08:00
Yongteng Lei	3c80a0ae09	Fix: support vLLM's new reasoning field (#13493 ) ### What problem does this PR solve? Support vLLM's new reasoning field ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 21:13:14 +08:00
Magicbook1108	675810e0cf	Refact: optimize confluence performance (#13497 ) ### What problem does this PR solve? Refact: optimize confluence performance #13494 ### Type of change - [x] Refactoring	2026-03-10 15:02:24 +08:00
Idriss Sbaaoui	249b78561b	Fix missmatch docnm_kwd in raptor chunks (#13451 ) ### What problem does this PR solve? issue #13393 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 14:24:33 +08:00
Magicbook1108	7143954b48	Fix: chats_openai in none stream condition (#13495 ) ### What problem does this PR solve? Fix: chats_openai in none stream condition #13453 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 13:44:17 +08:00
qinling0210	7c92f51133	Fix retrieval function when metadata_condtion is specified in retrieval API (#13473 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13388 The following command returns an empty result even when a document with matching metadata exists: ``` curl --request POST \ --url http://localhost:9222/api/v1/retrieval \ --header 'Content-Type: application/json' \ --header 'Authorization: Bearer ragflow-fO3mPFePfLgUYg8-9gjBVVXbvHqrvMPLGaW0P86PvAk' \ --data '{ "question": "any question", "dataset_ids": ["9bb4f0591b8811f18a4a84ba59049aa3"], "metadata_condition": { "logic": "and", "conditions": [ { "name": "character", "comparison_operator": "is", "value": "刘备" } ] } }' ``` When `metadata_condition` is specified in the retrieval API, it is converted to `doc_ids`, and `doc_ids` is passed to the retrieval function. In the retrieval function, when `doc_ids` is explicitly provided, the similarity threshold should be bypassed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 11:57:32 +08:00
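The rule stated in this commit, skip the similarity threshold when `doc_ids` is given explicitly, can be sketched as below. The function name, the chunk dictionary shape, and the field names are hypothetical and do not mirror the real retrieval function.

```python
def filter_chunks(chunks, similarity_threshold, doc_ids=None):
    """Keep chunks above the threshold, unless doc_ids were given explicitly."""
    if doc_ids:
        # The caller already narrowed the search to specific documents,
        # so low-scoring chunks from those documents are still returned.
        wanted = set(doc_ids)
        return [c for c in chunks if c["doc_id"] in wanted]
    return [c for c in chunks if c["similarity"] >= similarity_threshold]


chunks = [
    {"doc_id": "a", "similarity": 0.05},
    {"doc_id": "b", "similarity": 0.80},
]
by_threshold = filter_chunks(chunks, 0.2)
by_doc_ids = filter_chunks(chunks, 0.2, doc_ids=["a"])
```

Without the bypass, the chunk from document `a` would be dropped by the threshold even though the metadata condition selected that document.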
guptas6est	32d31284cc	Fix: upgrade pypdf to 6.7.5 and migrate from deprecated pypdf2 to fix CVE-2026-28804 and CVE-2023-36464 (#13454 ) ### What problem does this PR solve? This PR addresses security vulnerabilities in PDF processing dependencies identified by Trivy security scan: 1. CVE-2026-28804 (MEDIUM): pypdf 6.7.4 vulnerable to inefficient decoding of ASCIIHexDecode streams 2. CVE-2023-36464 (MEDIUM): pypdf2 3.0.1 susceptible to infinite loop when parsing malformed comments Since pypdf2 is deprecated with no available fixes, this PR migrates all pypdf2 usage to the actively maintained pypdf library (version 6.7.5), which resolves both vulnerabilities. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-09 12:06:00 +08:00
Heyang Wang	c217b8f3d8	Feat: add DingTalk AI Table connector and integration for data synch… (#13413 ) ### What problem does this PR solve? Add DingTalk AI Table connector and integration for data synchronization Issue #13400 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: wangheyang <wangheyang@corp.netease.com>	2026-03-06 21:13:23 +08:00
Jonah Hartmann	6023eb27ac	feat: add Ragcon provider (#13425 ) ### What problem does this PR solve? This PR extends the list of available providers. It adds a new provider, "RAGcon", within the Ollama modal. RAGcon provides all model types except OCR via OpenAI-compatible endpoints. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Jakob <16180662+hauberj@users.noreply.github.com>	2026-03-06 09:37:27 +08:00
Yongteng Lei	d9785ea2ce	Fix: Alibaba cloud OSS config issue (#13406 ) ### What problem does this PR solve? Alibaba Cloud OSS config issue #13390. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-05 18:13:45 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Yao Wei	c99b53064d	fix: remove company info from resume_summary to prevent over-retrieval (#13358 ) ### What problem does this PR solve? Problem: When searching for a specific company name such as "Daofeng Technology", the search would incorrectly return unrelated resumes whose company names contain generic terms such as "Technology". Root Cause: The `corporation_name_tks` field was included in the identity fields that are redundantly written to every chunk. This caused common words like "科技" ("technology") to match across all chunks, leading to over-retrieval of irrelevant resumes. Solution: Remove `corporation_name_tks` from the `_IDENTITY_FIELDS` list. Company information is still preserved in the "Work Overview" chunk where it belongs, allowing proper company-based searches while preventing false positives from generic terms. --------- Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local> Co-authored-by: Liu An <asiro@qq.com>	2026-03-04 19:24:49 +08:00
Jin Hai	b9ad014f63	Supports login cross multiple RAGFlow servers (#13322 ) ### What problem does this PR solve? 1. Use Redis to store the secret key. 2. During startup, the API server reads the secret key from Redis. If no secret key exists, it generates one and stores it in Redis atomically. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-04 13:07:45 +08:00
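One common way to get the atomic "read or create" behaviour this commit describes is Redis `SET` with the `NX` option, which writes only if the key is absent. The sketch below uses an in-memory stand-in so it runs without a server. `FakeRedis`, the key name `ragflow:secret_key`, and the helper name are all hypothetical, and the PR may implement atomicity differently.

```python
import secrets


class FakeRedis:
    """In-memory stand-in mimicking a Redis client's set(..., nx=True) and get."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, nx=False):
        if nx and key in self._store:
            return None  # key already exists; nothing written
        self._store[key] = value
        return True

    def get(self, key):
        return self._store.get(key)


def load_or_create_secret(client, key="ragflow:secret_key"):
    """Return the shared secret, creating it only if no server has done so yet."""
    candidate = secrets.token_hex(32)
    # SET with NX writes only when the key is absent, so exactly one server wins.
    client.set(key, candidate, nx=True)
    return client.get(key)


shared = FakeRedis()
secret_on_server_a = load_or_create_secret(shared)
secret_on_server_b = load_or_create_secret(shared)
```

Both servers end up with the same secret, which is what allows a login issued by one server to be accepted by another. A real Redis client may return bytes from `get` unless response decoding is enabled.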
Magicbook1108	93d621a666	Fix: Correct PDF chunking parameter name in naive (#13357 ) ### What problem does this PR solve? Fix: Correct PDF chunking parameter name in naive #13325 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-04 11:51:10 +08:00
Yao Wei	48755a3352	Fix: (resume) Cross-verify project experience and work experience, and remove duplicate text (#13323 ) Cross-verify project experience and work experience, and remove duplicate text --------- Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>	2026-03-03 14:53:46 +08:00
Yongteng Lei	707de2461a	Fix: use async_chat with sync wrapper in resume parser (#13320 ) ### What problem does this PR solve? Fix AttributeError when calling llm.chat() in resume parser. LLMBundle only has async_chat method, not chat method. Use `_run_coroutine_sync` wrapper to call async_chat synchronously. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 19:51:06 +08:00
Yao Wei	f8c91e8854	Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255 ) Core optimizations (refer to arXiv:2510.09722): 1. PDF text fusion: Metadata + OCR dual-path extraction and fusion 2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical sorting + line number indexing 3. Parallel task decomposition: Three-way parallel LLM extraction of basic information, work experience, and educational background 4. Index pointer mechanism: The LLM returns a range of line numbers instead of regenerating the full text, which reduces hallucination in the extracted text. --------- Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local> Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 19:05:50 +08:00
Magicbook1108	5fc3bd38b0	Feat: Support siliconflow.com (#13308 ) ### What problem does this PR solve? Feat: Support siliconflow.com ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 15:37:42 +08:00
liuxiaoyusky	8ba66dd62a	Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234 ) ## Summary When using MinerU, docling, TCADP, or paddleocr as the PDF parser with the General (naive) chunk method, the user-configured `chunk_token_num` is unconditionally overwritten to 0 at [rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859), effectively disabling chunk merging regardless of what the user sets in the UI. ### Problem A user sets `chunk_token_num = 2048` in the dataset configuration UI, expecting small parser blocks to be merged into larger chunks. However, this line: ```python if name in ["tcadp", "docling", "mineru", "paddleocr"]: parser_config["chunk_token_num"] = 0 ``` silently overrides the user's setting. As a result, every MinerU output block becomes its own chunk. For short documents (e.g. a 3-page PDF fund factsheet parsed by MinerU), this produces 47 tiny chunks — some as small as 11 characters (`"July 2025"`) or 15 characters (`"CIES Eligible"`). This severely degrades retrieval quality: vector embeddings of such short fragments have minimal semantic value, and keyword search produces excessive noise. ### Fix Only apply the `chunk_token_num = 0` override when the user has not explicitly configured a positive value: ```python if name in ["tcadp", "docling", "mineru", "paddleocr"]: if int(parser_config.get("chunk_token_num", 0)) <= 0: parser_config["chunk_token_num"] = 0 ``` This preserves the original default behavior (no merging) while respecting the user's explicit configuration. 
### Before / After (MinerU, 3-page PDF, chunk_token_num=2048) \| \| Before \| After \| \|---\|---\|---\| \| Chunks produced \| 47 \| ~8 (merged by token limit) \| \| Smallest chunk \| 11 chars \| ~500 chars \| \| User setting respected \| No \| Yes \| ## Test plan - [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify chunks are merged up to token limit - [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) → verify original behavior (no merging) - [ ] Parse a PDF with DeepDOC parser → verify no change in behavior (not affected by this code path) - [ ] Repeat with docling/paddleocr if available	2026-03-02 15:31:40 +08:00
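The guarded override quoted in this commit can be wrapped in a function to check its behaviour on a few configurations. The wrapper name `apply_chunk_token_override` and the constant `EXTERNAL_PARSERS` are hypothetical; the condition itself is the one shown in the PR description.

```python
EXTERNAL_PARSERS = ["tcadp", "docling", "mineru", "paddleocr"]


def apply_chunk_token_override(name, parser_config):
    """Apply the guarded override; returns the same dict for convenience."""
    if name in EXTERNAL_PARSERS:
        # Only force 0 when the user has not set a positive value.
        if int(parser_config.get("chunk_token_num", 0)) <= 0:
            parser_config["chunk_token_num"] = 0
    return parser_config


explicit = apply_chunk_token_override("mineru", {"chunk_token_num": 2048})
missing = apply_chunk_token_override("mineru", {})
negative = apply_chunk_token_override("docling", {"chunk_token_num": -1})
other = apply_chunk_token_override("deepdoc", {"chunk_token_num": 128})
```

An explicit positive value survives, while a missing or non-positive value is normalized to 0, which keeps the original no-merging default.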
Magicbook1108	daec36e935	Fix: add soft limit for graph rag size (#13252 ) ### What problem does this PR solve? Fix: add soft limit for graph rag size #13258 Q2 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 14:02:36 +08:00
huber	8a6b5ced6b	fix: add missing chunk_data column to OceanBase schema migration (#13306 ) ### What problem does this PR solve? When using OceanBase as the document storage engine, parsing and inserting chunks with chunk_data (e.g., table parser row data) fails with the following error: ``` [ERROR][Exception]: Insert chunk error: ['Unconsumed column names: chunk_data'] ``` This happens because the `chunk_data` column was recently introduced but was omitted from the `EXTRA_COLUMNS` list in `rag/utils/ob_conn.py`. As a result, the automatic schema migration for existing OceanBase tables does not append the missing chunk_data column, causing the underlying pyobvector or SQLAlchemy to raise an unconsumed column names error during data insertion. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### What is the solution? Added `column_chunk_data` to the `EXTRA_COLUMNS` list in `rag/utils/ob_conn.py`. This ensures that the OceanBase connection wrapper can correctly detect the missing column and automatically alter existing chunk tables to include the chunk_data field during initialization.	2026-03-02 13:25:11 +08:00
Magicbook1108	f0dd12289c	Feat: add preprocess parameters for ingestion pipeline (#13300 ) ### What problem does this PR solve? Feat: add preprocess parameters for ingestion pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 13:18:57 +08:00
Attili-sys	21bc1ab7ec	Feature rtl support (#13118 ) ### What problem does this PR solve? This PR adds comprehensive Right-to-Left (RTL) language support, primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu, etc.). Previously, RTL content had multiple rendering issues: - Incorrect sentence splitting for Arabic punctuation in citation logic - Misaligned text in chat messages and markdown components - Improper positioning of blockquotes and “think” sections - Incorrect table alignment - Citation placement ambiguity in RTL prompts - UI layout inconsistencies when mixing LTR and RTL text This PR introduces backend and frontend improvements to properly detect, render, and style RTL content while preserving existing LTR behavior. #### Backend - Updated sentence boundary regex in `rag/nlp/search.py` to include Arabic punctuation: - `،` (comma) - `؛` (semicolon) - `؟` (question mark) - `۔` (Arabic full stop) - Ensures citation insertion works correctly in RTL sentences. - Updated citation prompt instructions to clarify citation placement rules for RTL languages. #### Frontend - Introduced a new utility: `text-direction.ts` - Detects text direction based on Unicode ranges. - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts. - Provides `getDirAttribute()` for automatic `dir` assignment. - Applied dynamic `dir` attributes across: - Markdown rendering - Chat messages - Search results - Tables - Hover cards and reference popovers - Added proper RTL styling in LESS: - Text alignment adjustments - Blockquote border flipping - Section indentation correction - Table direction switching - Use of `<bdi>` for figure labels to prevent bidirectional conflicts #### DevOps / Environment - Added Windows backend launch script with retry handling. - Updated dependency metadata. - Adjusted development-only React debugging behavior. 
--- ### Type of change - [x] Bug Fix (non-breaking change which fixes RTL rendering and citation issues) - [x] New Feature (non-breaking change which adds RTL detection and dynamic direction handling) --------- Co-authored-by: 6ba3i <isbaaoui09@gmail.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:03:44 +08:00
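The two mechanisms this commit describes, direction detection from Unicode ranges and sentence splitting that recognizes Arabic punctuation, can be sketched in Python. This is an analogue written for illustration, not a port of `text-direction.ts` or of the regex in `rag/nlp/search.py`. Deciding direction from the first alphabetic character is an assumption, and the function names are hypothetical.

```python
import re

# Unicode blocks treated as right-to-left in this sketch.
RTL_RANGES = (
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0700, 0x074F),  # Syriac
    (0x0750, 0x077F),  # Arabic Supplement
    (0x0780, 0x07BF),  # Thaana
)

# Latin sentence punctuation plus the Arabic comma (U+060C), semicolon
# (U+061B), question mark (U+061F), and full stop (U+06D4).
SENTENCE_END = re.compile(r"(?<=[.!?\u060C\u061B\u061F\u06D4])\s+")


def is_rtl_char(ch):
    return any(low <= ord(ch) <= high for low, high in RTL_RANGES)


def get_dir(text):
    """Return "rtl" if the first letter is from an RTL script, else "ltr"."""
    for ch in text:
        if ch.isalpha():
            return "rtl" if is_rtl_char(ch) else "ltr"
    return "ltr"


def split_sentences(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return [part for part in SENTENCE_END.split(text.strip()) if part]
```

The value returned by `get_dir` corresponds to what a frontend would place in an element's `dir` attribute.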
Yongteng Lei	c91e803a38	Fix: close detached PIL image on JPEG save failure in encode_image (#13278 ) ### What problem does this PR solve? Properly close detached PIL image on JPEG save failure in encode_image. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 14:43:35 +08:00
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. 
Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
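The lazy-loading idea in this commit, keeping the raw blob and decoding only at the point of use, can be sketched without any imaging library by injecting the decoder. `LazyBlobImage` is a hypothetical illustration and is not the `LazyDocxImage` class from `lazy_image.py`.

```python
class LazyBlobImage:
    """Hypothetical sketch of lazy decoding; not the LazyDocxImage class."""

    def __init__(self, blob, decoder):
        self._blob = blob        # raw bytes, cheap to hold
        self._decoder = decoder  # e.g. a function that opens the bytes as an image
        self._decoded = None

    def get(self):
        # Decode on first use only, then reuse the result.
        if self._decoded is None:
            self._decoded = self._decoder(self._blob)
        return self._decoded

    def close(self):
        # Drop the decoded object; the raw blob can be decoded again later.
        self._decoded = None


decode_calls = []


def fake_decoder(blob):
    decode_calls.append(len(blob))
    return blob.upper()  # placeholder for an expensive image decode


image = LazyBlobImage(b"raw-bytes", fake_decoder)
calls_before_use = len(decode_calls)
first = image.get()
second = image.get()
calls_after_use = len(decode_calls)
```

Holding many such objects costs only the size of the compressed blobs until `get()` is called, which is the source of the peak-memory reduction the commit reports.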
Yesid Cano Castro	d1afcc9e71	feat(seafile): add library and directory sync scope support (#13153 ) ### What problem does this PR solve? The SeaFile connector currently synchronises the entire account — every library visible to the authenticated user. This is impractical for users who only need a subset of their data indexed, especially on large SeaFile instances with many shared libraries. This PR introduces granular sync scope support, allowing users to choose between syncing their entire account, a single library, or a specific directory within a library. It also adds support for SeaFile library-scoped API tokens (`/api/v2.1/via-repo-token/` endpoints), enabling tighter access control without exposing account-level credentials. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Test
```
from seafile_connector import SeaFileConnector
import logging
import os

logging.basicConfig(level=logging.DEBUG)

# Connection settings are read from the environment.
URL = os.environ.get("SEAFILE_URL", "https://seafile.example.com")
TOKEN = os.environ.get("SEAFILE_TOKEN", "")
REPO_ID = os.environ.get("SEAFILE_REPO_ID", "")
SYNC_PATH = os.environ.get("SEAFILE_SYNC_PATH", "/Documents")
REPO_TOKEN = os.environ.get("SEAFILE_REPO_TOKEN", "")


def _test_scope(scope, repo_id=None, sync_path=None):
    print(f"\n{'=' * 50}")
    print(f"Testing scope: {scope}")
    print(f"{'=' * 50}")

    # Library and directory scopes may also use a library-scoped token.
    creds = {"seafile_token": TOKEN} if TOKEN else {}
    if REPO_TOKEN and scope in ("library", "directory"):
        creds["repo_token"] = REPO_TOKEN

    connector = SeaFileConnector(
        seafile_url=URL,
        batch_size=5,
        sync_scope=scope,
        include_shared=False,
        repo_id=repo_id,
        sync_path=sync_path,
    )
    connector.load_credentials(creds)
    connector.validate_connector_settings()

    # Count and list every document yielded for this scope.
    count = 0
    for batch in connector.load_from_state():
        for doc in batch:
            count += 1
            print(f"  [{count}] {doc.semantic_identifier} "
                  f"({doc.size_bytes} bytes, {doc.extension})")

    print(f"\n-> {scope} scope: {count} document(s) found.\n")


# 1. Account scope
if TOKEN:
    _test_scope("account")
else:
    print("\nSkipping account scope (set SEAFILE_TOKEN)")

# 2. Library scope
if REPO_ID and (TOKEN or REPO_TOKEN):
    _test_scope("library", repo_id=REPO_ID)
else:
    print("\nSkipping library scope (set SEAFILE_REPO_ID + token)")

# 3. Directory scope
if REPO_ID and SYNC_PATH and (TOKEN or REPO_TOKEN):
    _test_scope("directory", repo_id=REPO_ID, sync_path=SYNC_PATH)
else:
    print("\nSkipping directory scope (set SEAFILE_REPO_ID + SEAFILE_SYNC_PATH + token)")
```
	2026-02-28 10:24:28 +08:00
Stephen Hu	aec2ef4232	refactor:improve tts model's codes (#13137 ) ### What problem does this PR solve? improve tts model's codes ### Type of change - [x] Refactoring	2026-02-28 10:18:00 +08:00
Yuxing Deng	51b180d991	fix: adding GPUStack chat model requires v1 suffix (#13237 ) ### What problem does this PR solve? Refer to issue: #13236 The base URL for a GPUStack chat model currently requires the `/v1` suffix. For other model types such as `Embedding` or `Rerank`, the `/v1` suffix is not required and is appended in code. This change applies the same logic to chat models. ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 20:13:07 +08:00
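The behaviour described in #13237, where a `/v1` suffix is appended in code when the user leaves it off, might look roughly like the sketch below. The helper name `ensure_v1_suffix` is hypothetical and is not taken from the RAGFlow source; it only illustrates the normalization.

```python
def ensure_v1_suffix(base_url: str) -> str:
    """Return base_url ending in exactly one '/v1' (hypothetical helper)."""
    trimmed = base_url.rstrip("/")   # tolerate trailing slashes from user input
    if trimmed.endswith("/v1"):
        return trimmed               # already has the suffix, leave it alone
    return trimmed + "/v1"


# Usage: both spellings resolve to the same endpoint root.
print(ensure_v1_suffix("http://gpustack.example.com"))
print(ensure_v1_suffix("http://gpustack.example.com/v1/"))
```

Making the helper idempotent means users who already include `/v1` are not broken by the change.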
avianion	5f53fbe0f1	feat: Add Avian as an LLM provider (#13256 ) ### What problem does this PR solve? This PR adds [Avian](https://avian.io) as a new LLM provider to RAGFlow. Avian provides an OpenAI-compatible API with competitive pricing, offering access to models like DeepSeek V3.2, Kimi K2.5, GLM-5, and MiniMax M2.5. Provider details: - API Base URL: `https://api.avian.io/v1` - Auth: Bearer token via API key - OpenAI-compatible (chat completions, streaming, function calling) - Models: - `deepseek/deepseek-v3.2` — 164K context, $0.26/$0.38 per 1M tokens - `moonshotai/kimi-k2.5` — 131K context, $0.45/$2.20 per 1M tokens - `z-ai/glm-5` — 131K context, $0.30/$2.55 per 1M tokens - `minimax/minimax-m2.5` — 1M context, $0.30/$1.10 per 1M tokens Changes: - `rag/llm/chat_model.py` — Add `AvianChat` class extending `Base` - `rag/llm/__init__.py` — Register in `SupportedLiteLLMProvider`, `FACTORY_DEFAULT_BASE_URL`, `LITELLM_PROVIDER_PREFIX` - `conf/llm_factories.json` — Add Avian factory with model definitions - `web/src/constants/llm.ts` — Add to `LLMFactory` enum, `IconMap`, `APIMapUrl` - `web/src/components/svg-icon.tsx` — Register SVG icon - `web/src/assets/svg/llm/avian.svg` — Provider icon - `docs/references/supported_models.mdx` — Add to supported models table This follows the same pattern as other OpenAI-compatible providers (e.g., n1n #12680, TokenPony). cc @KevinHuSh @JinHai-CN ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2026-02-27 17:36:55 +08:00
Magicbook1108	158503a1aa	Feat: optimize ingestion pipeline with preprocess (#13211 ) ### What problem does this PR solve? Feat: optimize ingestion pipeline with preprocess ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-26 10:24:13 +08:00


1326 Commits