ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-28 19:53:06 +08:00

Author	SHA1	Message	Date
balibabu	eca60208e3	Fix: The document generation node cannot generate the output content of a large model to a file. #13321 (#13326 ) ### What problem does this PR solve? Fix: The document generation node cannot generate the output content of a large model to a file. #13321 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-03 11:05:24 +08:00
Magicbook1108	4f09b3e2a4	Fix: pipeline canvas category (#13319 ) ### What problem does this PR solve? Fix: pipeline canvas category ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 20:27:36 +08:00
Yongteng Lei	707de2461a	Fix: use async_chat with sync wrapper in resume parser (#13320 ) ### What problem does this PR solve? Fix AttributeError when calling llm.chat() in resume parser. LLMBundle only has async_chat method, not chat method. Use `_run_coroutine_sync` wrapper to call async_chat synchronously. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 19:51:06 +08:00
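A minimal sketch of the sync-over-async pattern that 707de2461a describes. The names `run_coroutine_sync` and `fake_async_chat` are hypothetical and exist only for illustration; they are not RAGFlow's `_run_coroutine_sync` or `LLMBundle.async_chat`, whose real implementations do not appear in this log.

```python
import asyncio
import threading


def run_coroutine_sync(coro):
    """Run a coroutine to completion from synchronous code and return its result."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here; asyncio.run would raise, so use a helper thread.
    outcome = {}

    def worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the caller below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def fake_async_chat(prompt: str) -> str:
    """Hypothetical stand-in for an async-only chat method."""
    await asyncio.sleep(0)
    return prompt.upper()


reply = run_coroutine_sync(fake_async_chat("hello"))
```

The helper thread blocks the caller until the coroutine finishes, which is acceptable for a synchronous parser but would stall an event loop if called from async code on a hot path.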
chanx	ef264b52c7	Fix: Fixed some errors in the console (#13317 ) ### What problem does this PR solve? Fix: Fixed some errors in the console ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 19:19:15 +08:00
Yingfeng	a806f7b707	Potential fix for code scanning alert no. 71: Incomplete URL substring sanitization (#13318 ) Potential fix for [https://github.com/infiniflow/ragflow/security/code-scanning/71](https://github.com/infiniflow/ragflow/security/code-scanning/71) In general, instead of using `String.prototype.includes` on the entire URL string, parse the URL and make decisions based on its `host` (or `hostname`) field. This avoids cases where the trusted domain appears in the path, query, or as part of a different hostname. Here, `payload.source_fid` is set to `'siliconflow_intl'` if `postBody.base_url` “contains” `api.siliconflow.com`. To keep behavior for correct inputs but close the hole, we should: 1. Safely parse `postBody.base_url` using the standard `URL` class. 2. Extract the hostname (`url.hostname`). 3. Compare it appropriately: - If we only want the exact host `api.siliconflow.com`, use strict equality. - If international endpoints may include subdomains like `foo.api.siliconflow.com`, allow those via suffix check on the hostname. 4. Fall back to `LLMFactory.SILICONFLOW` if parsing fails or the host does not match. Concretely, in `web/src/pages/user-setting/setting-model/hooks.tsx`, in the `onApiKeySavingOk` callback where `payload.source_fid` is set, replace the `toLowerCase().includes('api.siliconflow.com')` logic with a small block that: - Initializes a local `let sourceFid = LLMFactory.SILICONFLOW;` - If `postBody.base_url` is present, attempts `new URL(postBody.base_url)` inside a `try/catch`, lowercases `url.hostname`, and checks whether it equals `api.siliconflow.com` or ends with `.api.siliconflow.com`. - Assigns `payload.source_fid = sourceFid`. No new external dependencies are required; `URL` is available in modern browsers and Node, and TypeScript understands it. _Suggested fixes powered by Copilot Autofix. 
Review carefully before merging._ Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>	2026-03-02 19:11:52 +08:00
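The hostname check described in a806f7b707 targets TypeScript in `hooks.tsx`; the same idea can be sketched in Python with the standard library. The function name `is_trusted_host` is an assumption made for this example and does not come from the commit.

```python
from urllib.parse import urlsplit

TRUSTED_HOST = "api.siliconflow.com"


def is_trusted_host(base_url: str, trusted: str = TRUSTED_HOST) -> bool:
    """Return True only if the URL's hostname is the trusted host or one of its subdomains."""
    try:
        host = urlsplit(base_url).hostname  # parsed hostname, already lowercased
    except ValueError:
        return False  # malformed URL: fall back to "not trusted"
    if not host:
        return False
    return host == trusted or host.endswith("." + trusted)
```

Comparing the parsed hostname rather than the raw string rejects URLs such as `https://evil.example/?u=api.siliconflow.com`, where the trusted domain only appears in the query.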
Idriss Sbaaoui	b0ace2c5d0	feat: enable Arabic in production UI and add complete Arabic documentation (#13315 ) ### What problem does this PR solve? This PR adds end-to-end Arabic support in production. It also adds a full Arabic README ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2026-03-02 19:10:11 +08:00
Yao Wei	f8c91e8854	Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255) Core optimizations (see arXiv:2510.09722): 1. PDF text fusion: dual-path extraction (metadata + OCR) followed by fusion 2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical sorting + line number indexing 3. Parallel task decomposition: three-way parallel LLM extraction of basic information, work experience, and educational background 4. Index pointer mechanism: the LLM returns a range of line numbers instead of generating the full text, which reduces hallucination in the extracted text. --------- Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local> Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 19:05:50 +08:00
balibabu	7d6f20585f	Feat: Modify the style of the classification operator and fix some console errors. (#13314 ) ### What problem does this PR solve? Feat: Modify the style of the classification operator and fix some console errors. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 16:53:24 +08:00
Magicbook1108	5fc3bd38b0	Feat: Support siliconflow.com (#13308 ) ### What problem does this PR solve? Feat: Support siliconflow.com ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 15:37:42 +08:00
Magicbook1108	1db221f19e	Feat: add more models for siliconflow and tongyi-qwen (#13311 ) ### What problem does this PR solve? Feat: add more models for siliconflow and tongyi-qwen ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 15:37:08 +08:00
liuxiaoyusky	8ba66dd62a	Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234)

## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with the General (naive) chunk method, the user-configured `chunk_token_num` is unconditionally overwritten to 0 at [rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859), effectively disabling chunk merging regardless of what the user sets in the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI, expecting small parser blocks to be merged into larger chunks. However, this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output block becomes its own chunk. For short documents (e.g. a 3-page PDF fund factsheet parsed by MinerU), this produces 47 tiny chunks, some as small as 11 characters (`"July 2025"`) or 15 characters (`"CIES Eligible"`). This severely degrades retrieval quality: vector embeddings of such short fragments have minimal semantic value, and keyword search produces excessive noise.

### Fix

Only apply the `chunk_token_num = 0` override when the user has not explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

This preserves the original default behavior (no merging) while respecting the user's explicit configuration.

### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) → verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior (not affected by this code path)
- [ ] Repeat with docling/paddleocr if available
	2026-03-02 15:31:40 +08:00
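The override rule from 8ba66dd62a can be exercised in isolation. This is a sketch under stated assumptions: `resolve_chunk_token_num` and `EXTERNAL_PARSERS` are hypothetical names, and treating a missing or `None` value as 0 is a choice made here, not something the commit specifies.

```python
EXTERNAL_PARSERS = {"tcadp", "docling", "mineru", "paddleocr"}


def resolve_chunk_token_num(parser_name: str, parser_config: dict) -> dict:
    """Return a copy of parser_config with the chunk_token_num override applied."""
    config = dict(parser_config)  # do not mutate the caller's dict
    if parser_name.lower() in EXTERNAL_PARSERS:
        # Only force 0 (no merging) when the user has not set a positive value.
        if int(config.get("chunk_token_num") or 0) <= 0:
            config["chunk_token_num"] = 0
    return config
```

With this rule a user value of 2048 survives for MinerU, while an unset or non-positive value still falls back to the original no-merge default.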
少卿	d430446e69	fix: absolute page index mix-up in DeepDoc PDF parser (#12848)

### What problem does this PR solve?

Summary: This PR addresses critical indexing issues in deepdoc/parser/pdf_parser.py that occur when parsing long PDFs with chunk-based pagination:

- Normalize rotated table page numbering: Rotated-table re-OCR now writes page_number in chunk-local 1-based form, eliminating double-addition of page_from offset that caused misalignment between table positions and document boxes.
- Convert absolute positions to chunk-local coordinates: When inserting tables/figures extracted via _extract_table_figure, positions are now converted from absolute (0-based) to chunk-local indices before distance matching and box insertion. This prevents IndexError and out-of-range accesses during paged parsing of long documents.

Root Cause: The parser mixed absolute (0-based, document-global) and relative (1-based, chunk-local) page numbering systems. Table/figure positions from layout extraction carried absolute page numbers, but insertion logic expected chunk-local coordinates aligned with self.boxes and page_cum_height.

Testing (done by the author):

Manual verification: Parse a 200+ page PDF with from_page > 0 and table rotation enabled. Confirm that:

- Tables and figures appear on correct pages
- No IndexError or position mismatches occur
- Page numbers in output match expected chunk-local offsets

Automated testing: not done.

## Separate Discussion: Memory Optimization Strategy (from codex-5.2-max, claude 4.5 opus, and me)

### Context

The current implementation loads entire page ranges into memory (`__images__`, `page_chars`, intermediates), which can cause RAM exhaustion on large documents. While the page numbering fix resolves correctness issues, scalability remains a concern.

### Proposed Architecture

Pipeline-Driven Chunking with Explicit Resource Management:

1. Authoritative chunk planning: Accept page-range specifications from upstream pipeline as the single source of truth. The parser should be a stateless worker that processes assigned chunks without making independent pagination decisions.
2. Granular memory lifecycle:

```python
import gc

for chunk_spec in chunk_plan:
    # Load only chunk_spec.pages into __images__
    page_images = load_page_range(chunk_spec.start, chunk_spec.end)
    # Process with offset tracking
    results = process_chunk(page_images, offset=chunk_spec.start)
    # Explicit cleanup before next iteration
    del page_images, page_chars, layout_intermediates
    gc.collect()  # Force collection of large objects
```

3. Persistent lightweight state: Keep model instances (layout detector, OCR engine), document metadata (outlines, PDF structure), and configuration across chunks to avoid reinitialization overhead (~2-5s per chunk for model loading).
4. Adaptive fallback: Provide `max_pages_per_chunk` (default: 50) only when pipeline doesn't supply a plan. Never exceed pipeline-specified ranges to maintain predictable memory bounds.
5. Optional: Dynamic budgeting: Expose a memory budget parameter that adjusts chunk size based on observed image dimensions and format (e.g., reduce chunk size for high-DPI scanned documents).

### Benefits

- Predictable memory footprint: RAM usage bounded by `chunk_size × avg_page_size` rather than total document size
- Horizontal scalability: Enables parallel chunk processing across workers
- Failure isolation: Page extraction errors affect only current chunk, not entire document
- Cloud-friendly: Works within container memory limits (e.g., 2-4GB per worker)

### Trade-offs

- Increased I/O: Re-opening PDF for each chunk vs. keeping file handle (mitigated by page-range seeks)
- Complexity: Requires careful offset tracking and stateful coordination between pipeline and parser
- Warmup cost: Model initialization overhead amortized across chunks (acceptable for documents >100 pages)

### Implementation Priority

This optimization should be deferred to a separate PR after the current correctness fix is merged, as:

1. It requires broader architectural changes across the pipeline
2. Current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on representative document corpus

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
	2026-03-02 14:58:37 +08:00
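The index conversion at the heart of d430446e69 can be illustrated with a small helper. This is only a sketch: `to_chunk_local_page` is a hypothetical function, and the convention shown (absolute pages 0-based, chunk-local pages 1-based, out-of-range pages reported as `None`) is an assumption based on the commit's description rather than the parser's actual code.

```python
from typing import Optional


def to_chunk_local_page(absolute_page: int, page_from: int, page_count: int) -> Optional[int]:
    """Map a 0-based document-global page index to a 1-based chunk-local page number.

    Returns None when the page lies outside the chunk, so callers can skip it
    instead of indexing out of range.
    """
    local_zero_based = absolute_page - page_from  # subtract the chunk offset exactly once
    if not 0 <= local_zero_based < page_count:
        return None
    return local_zero_based + 1
```

For a chunk starting at `page_from=100` with 50 pages, absolute page 100 maps to local page 1 and absolute page 150 is rejected, which is the kind of bound check that avoids an `IndexError` on chunk-sized lists.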
Ahmad Intisar	184388879d	feat: Add `disable_password_login` configuration to support SSO-only authentication (#13151)

### What problem does this PR solve?

Enterprise deployments that use an external Identity Provider (e.g., Microsoft Entra ID, Okta, Keycloak) need the ability to enforce SSO-only authentication by hiding the email/password login form. Currently, the login page always shows the password form alongside OAuth buttons, with no way to disable it.

This PR adds a `disable_password_login` configuration option under the existing `authentication` section in `service_conf.yaml`. When set to `true`, the login page only displays configured OAuth/SSO buttons and hides the email/password form, "Remember me" checkbox, and "Sign up" link.

The flag can be set via:

- `service_conf.yaml` (`authentication.disable_password_login: true`)
- Environment variable (`DISABLE_PASSWORD_LOGIN=true`)

Default behavior is unchanged (`false`).

### Behavior

| `disable_password_login` | OAuth configured | Result |
|---|---|---|
| `false` (default) | No | Standard email/password form |
| `false` | Yes | Email/password form + SSO buttons below |
| `true` | Yes | SSO buttons only (no form, no sign up link) |
| `true` | No | Empty card (admin should configure OAuth first) |

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Files changed (5)

1. `docker/service_conf.yaml.template`: added `disable_password_login: false` under authentication
2. `common/settings.py`: added `DISABLE_PASSWORD_LOGIN` global variable and loader in `init_settings()`
3. `common/config_utils.py`: fixed `TypeError` in `show_configs()` when authentication section contains non-dict values (e.g., booleans)
4. `api/apps/system_app.py`: exposed `disablePasswordLogin` flag in `/config` endpoint
5. `web/src/pages/login/index.tsx`: conditionally render password form based on config flag; OAuth buttons always render when channels exist

---------

Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
	2026-03-02 14:06:03 +08:00
Magicbook1108	daec36e935	Fix: add soft limit for graph rag size (#13252 ) ### What problem does this PR solve? Fix: add soft limit for graph rag size #13258 Q2 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 14:02:36 +08:00
huber	8a6b5ced6b	fix: add missing chunk_data column to OceanBase schema migration (#13306)

### What problem does this PR solve?

When using OceanBase as the document storage engine, parsing and inserting chunks with chunk_data (e.g., table parser row data) fails with the following error:

```
[ERROR][Exception]: Insert chunk error: ['Unconsumed column names: chunk_data']
```

This happens because the chunk_data column was recently introduced but was omitted from the EXTRA_COLUMNS list in rag/utils/ob_conn.py. As a result, the automatic schema migration for existing OceanBase tables does not append the missing chunk_data column, causing the underlying pyobvector or SQLAlchemy to raise an unconsumed column names error during data insertion.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### What is the solution?

Added column_chunk_data to the EXTRA_COLUMNS list in `rag/utils/ob_conn.py`. This ensures that the OceanBase connection wrapper can correctly detect the missing column and automatically alter existing chunk tables to include the chunk_data field during initialization.
	2026-03-02 13:25:11 +08:00
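The "detect missing columns, then alter the table" migration that 8a6b5ced6b relies on can be sketched with the standard-library `sqlite3` module. SQLite is swapped in here purely so the example runs anywhere; it is not OceanBase, and `add_missing_columns` and the `EXTRA_COLUMNS` mapping below are hypothetical names that do not reflect `rag/utils/ob_conn.py`.

```python
import sqlite3

# Hypothetical mapping of late-added column names to their SQL types.
EXTRA_COLUMNS = {"chunk_data": "TEXT"}


def add_missing_columns(conn: sqlite3.Connection, table: str, extra_columns: dict) -> list:
    """Add every column from extra_columns that the table lacks; return the names added."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in extra_columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT)")
first_run = add_missing_columns(conn, "chunks", EXTRA_COLUMNS)
second_run = add_missing_columns(conn, "chunks", EXTRA_COLUMNS)
```

A column left out of the mapping is never added, so inserts that reference it fail, which mirrors the "Unconsumed column names" symptom reported in the commit. The table and column names are interpolated directly into SQL, so this sketch assumes they come from trusted code, not user input.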
Magicbook1108	f0dd12289c	Feat: add preprocess parameters for ingestion pipeline (#13300 ) ### What problem does this PR solve? Feat: add preprocess parameters for ingestion pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 13:18:57 +08:00
Yihang Wang	7fc97da610	security: Adopt Jinja2 SandboxedEnvironment for template rendering. (#13305 )	2026-03-02 13:17:29 +08:00
Idriss Sbaaoui	860c4bd0bb	Feat: UI testing automation with playwright (#12749 ) ### What problem does this PR solve? This PR helps automate the testing of the ui interface using pytest Playwright ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test automation infrastructure --------- Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:04:08 +08:00
Attili-sys	21bc1ab7ec	Feature rtl support (#13118 ) ### What problem does this PR solve? This PR adds comprehensive Right-to-Left (RTL) language support, primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu, etc.). Previously, RTL content had multiple rendering issues: - Incorrect sentence splitting for Arabic punctuation in citation logic - Misaligned text in chat messages and markdown components - Improper positioning of blockquotes and “think” sections - Incorrect table alignment - Citation placement ambiguity in RTL prompts - UI layout inconsistencies when mixing LTR and RTL text This PR introduces backend and frontend improvements to properly detect, render, and style RTL content while preserving existing LTR behavior. #### Backend - Updated sentence boundary regex in `rag/nlp/search.py` to include Arabic punctuation: - `،` (comma) - `؛` (semicolon) - `؟` (question mark) - `۔` (Arabic full stop) - Ensures citation insertion works correctly in RTL sentences. - Updated citation prompt instructions to clarify citation placement rules for RTL languages. #### Frontend - Introduced a new utility: `text-direction.ts` - Detects text direction based on Unicode ranges. - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts. - Provides `getDirAttribute()` for automatic `dir` assignment. - Applied dynamic `dir` attributes across: - Markdown rendering - Chat messages - Search results - Tables - Hover cards and reference popovers - Added proper RTL styling in LESS: - Text alignment adjustments - Blockquote border flipping - Section indentation correction - Table direction switching - Use of `<bdi>` for figure labels to prevent bidirectional conflicts #### DevOps / Environment - Added Windows backend launch script with retry handling. - Updated dependency metadata. - Adjusted development-only React debugging behavior. 
--- ### Type of change - [x] Bug Fix (non-breaking change which fixes RTL rendering and citation issues) - [x] New Feature (non-breaking change which adds RTL detection and dynamic direction handling) --------- Co-authored-by: 6ba3i <isbaaoui09@gmail.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:03:44 +08:00
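Two ideas from 21bc1ab7ec, direction detection and Arabic-aware sentence splitting, can be sketched in Python. This is not the commit's `text-direction.ts` or the regex in `rag/nlp/search.py`: the direction check below uses a "first strong character" heuristic via `unicodedata.bidirectional` instead of explicit Unicode ranges, and the names `detect_direction` and `split_sentences` are assumptions for illustration.

```python
import re
import unicodedata

# Arabic comma, semicolon, question mark, and full stop, as listed in the commit.
ARABIC_PUNCT = "\u060C\u061B\u061F\u06D4"
SENTENCE_BOUNDARY = re.compile(rf"(?<=[.!?{ARABIC_PUNCT}])\s+")


def detect_direction(text: str) -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"  # default when only digits, spaces, or punctuation are present


def split_sentences(text: str) -> list:
    """Split on whitespace that follows Latin or Arabic sentence punctuation."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]
```

The result of `detect_direction` could feed an HTML `dir` attribute, in the spirit of the `getDirAttribute()` helper the commit mentions.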
balibabu	a897aedea9	Feat: Modify the form styles for retrieval and conditional operators. (#13299 ) ### What problem does this PR solve? Feat: Modify the form styles for retrieval and conditional operators. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 12:05:27 +08:00
chanx	0cdddea59a	feat: pipeline add preprocess (#13302 ) ### What problem does this PR solve? feat: pipeline add preprocess ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 11:50:48 +08:00
balibabu	cf3d3c7c89	Feat: When exporting the agent DSL, the tailkey, password, and history fields need to be cleared. #13281 (#13282 ) ### What problem does this PR solve? Feat: When exporting the agent DSL, the tailkey, password, and history fields need to be cleared. #13281 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 11:41:38 +08:00
dependabot[bot]	b956ad180c	Build(deps): Bump pypdf from 6.7.3 to 6.7.4 (#13298) Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.7.3 to 6.7.4. Release notes, sourced from [pypdf's releases](https://github.com/py-pdf/pypdf/releases) and [pypdf's changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md): Version 6.7.4, 2026-02-27. Security (SEC): Allow limiting output length for RunLengthDecode filter ([#3664](https://redirect.github.com/py-pdf/pypdf/issues/3664)) by @stefan6419846. Robustness (ROB): Deal with invalid annotations in extract_links ([#3659](https://redirect.github.com/py-pdf/pypdf/issues/3659)) by @stefan6419846. [Full Changelog](https://github.com/py-pdf/pypdf/compare/6.7.3...6.7.4). Commits: `1650bc3` REL: 6.7.4; `f309c60` SEC: Allow limiting output length for RunLengthDecode filter (#3664); `993f052` DEV: Bump actions/upload-artifact from 6 to 7 (#3662); `a3c996b` DEV: Bump actions/download-artifact from 7 to 8 (#3663); `37de320` ROB: Deal with invalid annotations in extract_links (#3659). Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 11:32:12 +08:00
Idriss Sbaaoui	9d78d3ddb1	Tests: fix failing http in CI (#13301) ### What problem does this PR solve? test_doc_sdk_routes_unit had two flaky/incorrect branch assumptions: 1. The production logic for parse/stop_parsing gates on doc.run, but the tests used progress, causing a branch mismatch and unintended fallthrough into mutation/DB paths. 2. The stop_parsing invalid-state test asserted an outdated message fragment, making the contract brittle. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 10:44:33 +08:00
Jimmy Ben Klieve	7e0dd906f2	refactor: update admin ui (#13280 ) ### What problem does this PR solve? Update for Admin UI: - Update file picker input in Registration whitelist > Import from Excel modal - Modify DOM structure of Sandbox Settings and move several hardcoded texts into translation files ### Type of change - [x] Refactoring	2026-02-28 19:21:51 +08:00
Idriss Sbaaoui	e62552d482	Added some React IDs for playwright e2e tests (#13265 ) ### What problem does this PR solve? Necessary ids for implementing the new testing suite with playwright for UI ### Type of change - [x] Other (please describe): Testing IDs Co-authored-by: Liu An <asiro@qq.com>	2026-02-28 15:13:47 +08:00
Magicbook1108	1027916bfe	Fix: inconsistent state handling for multi-user single-canvas access (#13267 ) ### What problem does this PR solve? <img width="700" alt="image" src="https://github.com/user-attachments/assets/1db7412e-4554-44bc-84ba-16421949aacc" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-02-28 15:09:21 +08:00
Yongteng Lei	c91e803a38	Fix: close detached PIL image on JPEG save failure in encode_image (#13278 ) ### What problem does this PR solve? Properly close detached PIL image on JPEG save failure in encode_image. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 14:43:35 +08:00
天海蒼灆	983150b936	Fix (api): fix the document parsing status check logic (#12504) ### What problem does this PR solve? When a parsing task is terminated halfway, its progress may be neither 0 nor 1, which makes it impossible to call the parse interface again. - Replace the document parsing progress check with a task status check that compares against TaskStatus.RUNNING.value - Update the condition for stopping document parsing to check whether the task is running instead ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 14:38:55 +08:00
Jin Hai	32ec950ca8	Fix create / drop chat session syntax (#13279 ) ### What problem does this PR solve? This pull request refactors the chat session creation and deletion logic in both the parser and client code to use unique session IDs instead of session names. It also updates the corresponding command syntax and payloads, ensuring more robust and unambiguous session management. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-02-28 14:18:21 +08:00
Jin Hai	d9d4825079	Add chat sessions related command (#13268 ) ### What problem does this PR solve? 1. Create / Drop / List chat sessions 2. Chat with LLM and datasets ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-02-28 12:52:45 +08:00
Jin Hai	54094771a3	Fix streaming chat on web API (#13275 ) ### What problem does this PR solve? This pull request makes a small but important fix to how streaming requests are handled in the `completion` endpoint of `conversation_app.py`. The main change ensures that the `stream` argument is not passed twice, which could cause errors. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-02-28 12:16:38 +08:00
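The failure mode that 54094771a3 describes, passing `stream` twice, is easy to reproduce in plain Python. The sketch below uses hypothetical `chat` and `call_chat` functions; it does not reproduce the `completion` endpoint in `conversation_app.py`, whose code is not shown in this log.

```python
def chat(messages, stream=True, **options):
    """Hypothetical downstream function that accepts stream as a named parameter."""
    return {"messages": messages, "stream": stream, "options": options}


def call_chat(request: dict):
    """Forward a request dict, making sure 'stream' is passed exactly once."""
    params = dict(request)  # copy so the caller's dict is left untouched
    messages = params.pop("messages")
    stream = params.pop("stream", True)  # remove it before expanding **params
    return chat(messages, stream=stream, **params)


result = call_chat({"messages": ["hi"], "stream": False, "temperature": 0.1})
```

Without the `pop`, a call such as `chat(messages, stream=stream, **request)` raises `TypeError` because `stream` arrives both as an explicit keyword and inside the expanded dict.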
Yongteng Lei	0110151e12	Fix: document remove race condition (#13242 ) ### What problem does this PR solve? Fix document remove race condition. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 11:23:24 +08:00
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. 
Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
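The lazy-loading idea in fa71f8d0c7 (keep the raw blob, decode only at the point of use) can be sketched without any imaging library. `LazyBlob` is a hypothetical class for illustration and is not the commit's `LazyDocxImage`; the `decoder` callable stands in for something like opening a PIL image from bytes.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyBlob(Generic[T]):
    """Hold raw bytes and decode them only when a consumer asks."""

    def __init__(self, blob: bytes, decoder: Callable[[bytes], T]):
        self._blob: Optional[bytes] = blob
        self._decoder = decoder
        self.decode_count = 0  # exposed so callers can observe when decoding happens

    def decode(self) -> T:
        """Return a fresh decoded object on each call; the caller owns it."""
        if self._blob is None:
            raise ValueError("blob has been closed")
        self.decode_count += 1
        return self._decoder(self._blob)

    def close(self) -> None:
        """Drop the raw bytes so they can be garbage collected."""
        self._blob = None


# Stand-in decoder: upper-casing bytes plays the role of an expensive image decode.
lazy = LazyBlob(b"raw image bytes", bytes.upper)
before = lazy.decode_count
decoded = lazy.decode()
```

Returning a fresh object per `decode()` call mirrors the commit's "detached" access: consumers cannot affect each other through a shared instance, at the cost of decoding again on each use.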
SFL79	4f0c892b32	feat(ui): add individual model delete buttons across all providers (#13271 ) ### What problem does this PR solve? Added the option to delete models individually from providers. For additional context, see [issue-13184](https://github.com/infiniflow/ragflow/issues/13184) ### Type of change - [x] New Feature (non-breaking change which adds functionality) Note: when deleting a selected model, it leaves the full model name as text as seen here: <img width="676" height="90" alt="image" src="https://github.com/user-attachments/assets/c11c7c1b-3f2a-4119-b20c-bb8148a8ad16" /> If attempting to use ragflow with that deleted model, ragflow will throw an unauthorized model error as expected. I left it like that on purpose, so it's easier for the user to understand what he deleted and that he needs to replace it with another model. Co-authored-by: Shahar Flumin <shahar@Shahars-MacBook-Air.local>	2026-02-28 10:51:39 +08:00
Yesid Cano Castro	d1afcc9e71	feat(seafile): add library and directory sync scope support (#13153)

### What problem does this PR solve?

The SeaFile connector currently synchronises the entire account, that is, every library visible to the authenticated user. This is impractical for users who only need a subset of their data indexed, especially on large SeaFile instances with many shared libraries.

This PR introduces granular sync scope support, allowing users to choose between syncing their entire account, a single library, or a specific directory within a library. It also adds support for SeaFile library-scoped API tokens (`/api/v2.1/via-repo-token/` endpoints), enabling tighter access control without exposing account-level credentials.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Test

```python
from seafile_connector import SeaFileConnector
import logging
import os

logging.basicConfig(level=logging.DEBUG)

# Connection settings are read from the environment.
URL = os.environ.get("SEAFILE_URL", "https://seafile.example.com")
TOKEN = os.environ.get("SEAFILE_TOKEN", "")
REPO_ID = os.environ.get("SEAFILE_REPO_ID", "")
SYNC_PATH = os.environ.get("SEAFILE_SYNC_PATH", "/Documents")
REPO_TOKEN = os.environ.get("SEAFILE_REPO_TOKEN", "")


def _test_scope(scope, repo_id=None, sync_path=None):
    print(f"\n{'=' * 50}")
    print(f"Testing scope: {scope}")
    print(f"{'=' * 50}")

    creds = {"seafile_token": TOKEN} if TOKEN else {}
    if REPO_TOKEN and scope in ("library", "directory"):
        creds["repo_token"] = REPO_TOKEN

    connector = SeaFileConnector(
        seafile_url=URL,
        batch_size=5,
        sync_scope=scope,
        include_shared=False,
        repo_id=repo_id,
        sync_path=sync_path,
    )
    connector.load_credentials(creds)
    connector.validate_connector_settings()

    count = 0
    for batch in connector.load_from_state():
        for doc in batch:
            count += 1
            print(f"  [{count}] {doc.semantic_identifier} "
                  f"({doc.size_bytes} bytes, {doc.extension})")

    print(f"\n-> {scope} scope: {count} document(s) found.\n")


# 1. Account scope
if TOKEN:
    _test_scope("account")
else:
    print("\nSkipping account scope (set SEAFILE_TOKEN)")

# 2. Library scope
if REPO_ID and (TOKEN or REPO_TOKEN):
    _test_scope("library", repo_id=REPO_ID)
else:
    print("\nSkipping library scope (set SEAFILE_REPO_ID + token)")

# 3. Directory scope
if REPO_ID and SYNC_PATH and (TOKEN or REPO_TOKEN):
    _test_scope("directory", repo_id=REPO_ID, sync_path=SYNC_PATH)
else:
    print("\nSkipping directory scope (set SEAFILE_REPO_ID + SEAFILE_SYNC_PATH + token)")
```
	2026-02-28 10:24:28 +08:00
Stephen Hu	aec2ef4232	refactor:improve tts model's codes (#13137 ) ### What problem does this PR solve? Improves the TTS model code. ### Type of change - [x] Refactoring	2026-02-28 10:18:00 +08:00
Stephen Hu	9577753c10	Refactor: improve the logic about docling parser extract box (#13215 ) ### What problem does this PR solve? improve the logic about docling parser extract box ### Type of change - [x] Refactoring	2026-02-28 10:05:24 +08:00
chanx	510ff89661	Fix: remove unused files (#13232 ) ### What problem does this PR solve? Fix: remove unused files ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 23:05:40 +08:00
Jimmy Ben Klieve	c0823e8d6d	refactor: update chat ui (#13269 ) ### What problem does this PR solve? Update Chat UI: - Align to the design. - Update `<AudioButton>` visualizer logic. - Fix keyboard navigation issue. ### Type of change - [x] Refactoring	2026-02-27 22:26:19 +08:00
Enes Delibalta	4e48aba5c4	fix: update DoclingParser return type hint (#13243 ) ### What problem does this PR solve? The _transfer_to_sections method was throwing a type hint violation because it occasionally returns 3-item tuples instead of 2. Adjusted to list[tuple[str, ...]] to prevent runtime crashes. Error: 20:53:21 Page(1~10): [ERROR]Internal server error while chunking: Method deepdoc.parser.docling_parser.DoclingParser._transfer_to_sections() return [(1. JIRA Nasıl Kullanılır?, text, @@1\t70.8\t194.9\t70.9\t85.5##), (1.1. Proje O...##)] violates type hint list[tuple[str, str]], as list index 15 item tuple tuple (Gelen ekran üzerinden alanları isterlerine göre doldurduğunuz taktirde Create düğmesi i...##) length 3 != 2. 20:53:21 [ERROR][Exception]: Method deepdoc.parser.docling_parser.DoclingParser._transfer_to_sections() return [('1. JIRA Nasıl Kullanılır?', 'text', '@@1\t70.8\t194.9\t70.9\t85.5##'), ('1.1. Proje O...##')] violates type hint list[tuple[str, str]], as list index 15 item tuple tuple ('Gelen ekran üzerinden alanları isterlerine göre doldurduğunuz taktirde Create düğmesi i...##') length 3 != 2. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Enes Delibalta <enes.delibalta@pentanom.com>	2026-02-27 20:13:50 +08:00
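A minimal sketch of why the widened hint helps, assuming a hypothetical stand-in function rather than the real `DoclingParser._transfer_to_sections`: `tuple[str, ...]` describes a tuple of strings of any length, so a list that mixes 2-item and 3-item tuples matches `list[tuple[str, ...]]`, whereas `list[tuple[str, str]]` only admits pairs.

```python
# Hypothetical stand-in for illustration; this is not the actual parser code.
def transfer_to_sections(rows: list[dict]) -> list[tuple[str, ...]]:
    """Return (text, tag) or (text, type, tag) tuples, depending on the row."""
    sections: list[tuple[str, ...]] = []
    for row in rows:
        if "type" in row:
            # 3-item form: text, layout type, position tag
            sections.append((row["text"], row["type"], row["tag"]))
        else:
            # 2-item form: text, position tag
            sections.append((row["text"], row["tag"]))
    return sections


sections = transfer_to_sections([
    {"text": "Heading", "type": "text", "tag": "@@1\t70.8\t194.9\t70.9\t85.5##"},
    {"text": "Body", "tag": "@@1\t70.8\t194.9\t90.0\t120.0##"},
])
```

Both tuple lengths are covered by the single return annotation, so a runtime type checker has no length to reject.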
Yuxing Deng	51b180d991	fix: adding GPUStack chat model requires v1 suffix (#13237 ) ### What problem does this PR solve? Refer to issue: #13236 The base URL for the GPUStack chat model requires the `/v1` suffix. For other model types such as `Embedding` or `Rerank`, the `/v1` suffix is not required and is appended in code. This change applies the same logic to the chat model. ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 20:13:07 +08:00
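Appending the suffix in code could look roughly like the sketch below. The helper name `ensure_v1_suffix` and the example host are assumptions for illustration, not names taken from the RAGFlow code base.

```python
def ensure_v1_suffix(base_url: str) -> str:
    """Return base_url with exactly one trailing '/v1' (hypothetical helper)."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/v1"):
        # The user already supplied the suffix; do not append it twice.
        return trimmed
    return trimmed + "/v1"


with_suffix = ensure_v1_suffix("http://gpustack.example.com")
unchanged = ensure_v1_suffix("http://gpustack.example.com/v1/")
```

Normalizing this way accepts both forms of input, so users who already typed `/v1` are not broken by the change.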
as-ondewo	194e076e26	Fix: init superuser can create duplicate users (#13221 ) ### What problem does this PR solve? This PR fixes 2 bugs related to RAGFlow's init superuser functionality. #### Bug 1 When the RAGFlow server was started with the `--init-superuser` option, it always created a new admin user even if one already existed, resulting in duplicate users. To fix this, I added a check before creating the superuser and added a unique constraint to the email column of the database to mitigate potential TOCTOU race conditions. Since existing databases could contain duplicate emails, I added email de-duplication to the database migration. #### Bug 2 When the RAGFlow server was started with the `--init-superuser` option but without configured default LLM and embedding models, it failed to start because the `init_superuser` function always made test requests to the models even if they were not set. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 19:55:51 +08:00
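The three pieces of the Bug 1 fix (de-duplicate, add a unique constraint, check before creating) can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumptions: the `users` table, the index name, and this `init_superuser` signature are hypothetical, and RAGFlow's real schema and migration code are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# Simulate an existing database that already contains a duplicate admin.
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("admin@example.com",), ("admin@example.com",)],
)

# Migration step 1: keep only the lowest id for each email.
conn.execute(
    "DELETE FROM users WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY email)"
)
# Migration step 2: the unique index closes the check-then-insert race.
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")


def init_superuser(conn: sqlite3.Connection, email: str) -> bool:
    """Create the superuser only if missing; return True if a row was created."""
    if conn.execute("SELECT 1 FROM users WHERE email = ?", (email,)).fetchone():
        return False
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    except sqlite3.IntegrityError:
        # Another process inserted the same email between the check and the insert.
        return False
    return True


created_again = init_superuser(conn, "admin@example.com")
created_new = init_superuser(conn, "root@example.com")
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The explicit check handles the common case, while the unique index is what protects against two processes racing past the check at the same time.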
balibabu	6d0100ca67	Fix: The output content of the multi-model comparison will disappear. #13227 (#13241 ) ### What problem does this PR solve? Fix: The output content of the multi-model comparison will disappear. #13227 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 19:18:40 +08:00
balibabu	861ebfc6e1	Feat: Make the embedded page of chat compatible with mobile devices. (#13262 ) ### What problem does this PR solve? Feat: Make the embedded page of chat compatible with mobile devices. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-27 19:17:41 +08:00
avianion	5f53fbe0f1	feat: Add Avian as an LLM provider (#13256 ) ### What problem does this PR solve? This PR adds [Avian](https://avian.io) as a new LLM provider to RAGFlow. Avian provides an OpenAI-compatible API with competitive pricing, offering access to models like DeepSeek V3.2, Kimi K2.5, GLM-5, and MiniMax M2.5. Provider details: - API Base URL: `https://api.avian.io/v1` - Auth: Bearer token via API key - OpenAI-compatible (chat completions, streaming, function calling) - Models: - `deepseek/deepseek-v3.2` — 164K context, $0.26/$0.38 per 1M tokens - `moonshotai/kimi-k2.5` — 131K context, $0.45/$2.20 per 1M tokens - `z-ai/glm-5` — 131K context, $0.30/$2.55 per 1M tokens - `minimax/minimax-m2.5` — 1M context, $0.30/$1.10 per 1M tokens Changes: - `rag/llm/chat_model.py` — Add `AvianChat` class extending `Base` - `rag/llm/__init__.py` — Register in `SupportedLiteLLMProvider`, `FACTORY_DEFAULT_BASE_URL`, `LITELLM_PROVIDER_PREFIX` - `conf/llm_factories.json` — Add Avian factory with model definitions - `web/src/constants/llm.ts` — Add to `LLMFactory` enum, `IconMap`, `APIMapUrl` - `web/src/components/svg-icon.tsx` — Register SVG icon - `web/src/assets/svg/llm/avian.svg` — Provider icon - `docs/references/supported_models.mdx` — Add to supported models table This follows the same pattern as other OpenAI-compatible providers (e.g., n1n #12680, TokenPony). cc @KevinHuSh @JinHai-CN ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2026-02-27 17:36:55 +08:00
6ba3i	bb59a27e55	Doc : Add french Readme (#13254 ) ### What problem does this PR solve? Add French README. ### Type of change - [x] Documentation Update	2026-02-27 11:34:13 +08:00
qinling0210	8b6d363a98	Use pagination in _search_metadata (#13238 ) ### What problem does this PR solve? Fixes [#13210](https://github.com/infiniflow/ragflow/issues/13210). Removes the limit in `_search_metadata` and uses pagination instead. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 11:24:49 +08:00
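Replacing a fixed limit with pagination generally follows the pattern sketched below. The names `iter_all_pages` and `fake_fetch` and the in-memory record list are assumptions for illustration; the actual `_search_metadata` implementation and its storage backend are not shown in this log.

```python
from typing import Callable, Iterator


def iter_all_pages(
    fetch_page: Callable[[int, int], list[dict]], page_size: int = 100
) -> Iterator[dict]:
    """Yield every record by requesting consecutive pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
        offset += page_size


# Hypothetical backend: 250 records served in slices.
RECORDS = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> list[dict]:
    return RECORDS[offset:offset + limit]


results = list(iter_all_pages(fake_fetch, page_size=100))
```

With a hard limit of 100 only the first 100 records would be returned; the loop above retrieves all 250 in three requests.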
Jin Hai	a1549c0fdc	Fix UI (#13239 ) ### What problem does this PR solve? This pull request makes a minor update to the English locale strings for the Table of Contents toggle buttons, changing the labels from "Show TOC"/"Hide TOC" to "Show content"/"Hide content" for improved clarity. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-02-26 19:21:08 +08:00
Magicbook1108	c03c537bf8	Feat: optimize gmail/google-drive (#13230 ) ### What problem does this PR solve? Feat: optimize gmail/google-drive Now: <img width="700" alt="image" src="https://github.com/user-attachments/assets/0c4b6044-7209-4c4f-ac0c-32070b79daf7" /> <img width="700" alt="image" src="https://github.com/user-attachments/assets/406f93d8-9b0f-4f5a-b8bb-3936990f558c" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-26 19:19:40 +08:00

5378 Commits