Commit Graph

5378 Commits

Author SHA1 Message Date
eca60208e3 Fix: The document generation node cannot generate the output content of a large model to a file. #13321 (#13326)
### What problem does this PR solve?

Fix: The document generation node cannot generate the output content of
a large model to a file. #13321
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-03 11:05:24 +08:00
4f09b3e2a4 Fix: pipeline canvas category (#13319)
### What problem does this PR solve?

Fix: pipeline canvas category

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-02 20:27:36 +08:00
707de2461a Fix: use async_chat with sync wrapper in resume parser (#13320)
### What problem does this PR solve?

Fix AttributeError when calling llm.chat() in resume parser. LLMBundle
only has async_chat method, not chat method. Use `_run_coroutine_sync`
wrapper to call async_chat synchronously.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-02 19:51:06 +08:00
ef264b52c7 Fix: Fixed some errors in the console (#13317)
### What problem does this PR solve?

Fix: Fixed some errors in the console
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-02 19:19:15 +08:00
a806f7b707 Potential fix for code scanning alert no. 71: Incomplete URL substring sanitization (#13318)
Potential fix for
[https://github.com/infiniflow/ragflow/security/code-scanning/71](https://github.com/infiniflow/ragflow/security/code-scanning/71)

In general, instead of using `String.prototype.includes` on the entire
URL string, parse the URL and make decisions based on its `host` (or
`hostname`) field. This avoids cases where the trusted domain appears in
the path, query, or as part of a different hostname.

Here, `payload.source_fid` is set to `'siliconflow_intl'` if
`postBody.base_url` “contains” `api.siliconflow.com`. To keep behavior
for correct inputs but close the hole, we should:

1. Safely parse `postBody.base_url` using the standard `URL` class.
2. Extract the hostname (`url.hostname`).
3. Compare it appropriately:
- If we only want the exact host `api.siliconflow.com`, use strict
equality.
- If international endpoints may include subdomains like
`foo.api.siliconflow.com`, allow those via suffix check on the hostname.
4. Fall back to `LLMFactory.SILICONFLOW` if parsing fails or the host
does not match.

Concretely, in `web/src/pages/user-setting/setting-model/hooks.tsx`, in
the `onApiKeySavingOk` callback where `payload.source_fid` is set,
replace the `toLowerCase().includes('api.siliconflow.com')` logic with a
small block that:

- Initializes a local `let sourceFid = LLMFactory.SILICONFLOW;`
- If `postBody.base_url` is present, attempts `new
URL(postBody.base_url)` inside a `try/catch`, lowercases `url.hostname`,
and checks whether it equals `api.siliconflow.com` or ends with
`.api.siliconflow.com`.
- Assigns `payload.source_fid = sourceFid`.

No new external dependencies are required; `URL` is available in modern
browsers and Node, and TypeScript understands it.


_Suggested fixes powered by Copilot Autofix. Review carefully before
merging._

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-03-02 19:11:52 +08:00
b0ace2c5d0 feat: enable Arabic in production UI and add complete Arabic documentation (#13315)
### What problem does this PR solve?

This PR adds end-to-end Arabic support in production. It also adds a
full Arabic README

### Type of change

 - [x] New Feature (non-breaking change which adds functionality)
 - [x] Documentation Update
2026-03-02 19:10:11 +08:00
f8c91e8854 Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255)
Core optimizations (refer to arXiv:2510.09722):

1. PDF text fusion: Metadata + OCR dual-path extraction and fusion

2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical
sorting + line number indexing

3. Parallel task decomposition: Basic information/work
experience/educational background three-way parallel LLM extraction

4. Index pointer mechanism: LLM returns a range of line numbers instead
of generating the full text, reducing the illusion of full text.

---------

Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-02 19:05:50 +08:00
7d6f20585f Feat: Modify the style of the classification operator and fix some console errors. (#13314)
### What problem does this PR solve?

Feat: Modify the style of the classification operator and fix some
console errors.

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2026-03-02 16:53:24 +08:00
5fc3bd38b0 Feat: Support siliconflow.com (#13308)
### What problem does this PR solve?

Feat: Support siliconflow.com

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-03-02 15:37:42 +08:00
1db221f19e Feat: add more models for siliconflow and tongyi-qwen (#13311)
### What problem does this PR solve?

Feat: add more models for siliconflow and tongyi-qwen

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2026-03-02 15:37:08 +08:00
8ba66dd62a Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234)
## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with
the General (naive) chunk method, the user-configured `chunk_token_num`
is **unconditionally overwritten to 0** at
[rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859),
effectively disabling chunk merging regardless of what the user sets in
the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI,
expecting small parser blocks to be merged into larger chunks. However,
this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output
block becomes its own chunk. For short documents (e.g. a 3-page PDF fund
factsheet parsed by MinerU), this produces **47 tiny chunks** — some as
small as 11 characters (`"July 2025"`) or 15 characters (`"CIES
Eligible"`).

This severely degrades retrieval quality: vector embeddings of such
short fragments have minimal semantic value, and keyword search produces
excessive noise.

### Fix

Only apply the `chunk_token_num = 0` override when the user has **not**
explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

This preserves the original default behavior (no merging) while
respecting the user's explicit configuration.

### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify
chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) →
verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior
(not affected by this code path)
- [ ] Repeat with docling/paddleocr if available
2026-03-02 15:31:40 +08:00
d430446e69 fix:absolute page index mix-up in DeepDoc PDF parser (#12848)
### What problem does this PR solve?

Summary:
This PR addresses critical indexing issues in
deepdoc/parser/pdf_parser.py that occur when parsing long PDFs with
chunk-based pagination:

Normalize rotated table page numbering: Rotated-table re-OCR now writes
page_number in chunk-local 1-based form, eliminating double-addition of
page_from offset that caused misalignment between table positions and
document boxes.
Convert absolute positions to chunk-local coordinates: When inserting
tables/figures extracted via _extract_table_figure, positions are now
converted from absolute (0-based) to chunk-local indices before distance
matching and box insertion. This prevents IndexError and out-of-range
accesses during paged parsing of long documents.

Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.


Testing(I do):

Manual verification: Parse a 200+ page PDF with from_page > 0 and table
rotation enabled. Confirm that:

Tables and figures appear on correct pages
No IndexError or position mismatches occur
Page numbers in output match expected chunk-local offsets


Automated testing: 我没做


## Separate Discussion: Memory Optimization Strategy(from codex-5.2-max
and claude 4.5 opus and me)

### Context

The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.

### Proposed Architecture

**Pipeline-Driven Chunking with Explicit Resource Management:**

1. **Authoritative chunk planning**: Accept page-range specifications
from upstream pipeline as the single source of truth. The parser should
be a stateless worker that processes assigned chunks without making
independent pagination decisions.

2. **Granular memory lifecycle**:
   ```python
   for chunk_spec in chunk_plan:
       # Load only chunk_spec.pages into __images__
       page_images = load_page_range(chunk_spec.start, chunk_spec.end)
       
       # Process with offset tracking
       results = process_chunk(page_images, offset=chunk_spec.start)
       
       # Explicit cleanup before next iteration
       del page_images, page_chars, layout_intermediates
       gc.collect()  # Force collection of large objects
   ```

3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).

4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50)
only when pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges to maintain predictable memory bounds.

5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).

### Benefits

- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)

### Trade-offs

- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)

### Implementation Priority

This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. Current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on
representative document corpus


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-02 14:58:37 +08:00
184388879d feat: Add disable_password_login configuration to support SSO-only authentication (#13151)
### What problem does this PR solve?

Enterprise deployments that use an external Identity Provider (e.g.,
Microsoft Entra ID, Okta, Keycloak) need the ability to enforce SSO-only
authentication by hiding the email/password login form. Currently, the
login page always shows the password form alongside OAuth buttons, with
no way to disable it.

This PR adds a `disable_password_login` configuration option under the
existing `authentication` section in `service_conf.yaml`. When set to
`true`, the login page only displays configured OAuth/SSO buttons and
hides the email/password form, "Remember me" checkbox, and "Sign up"
link.

The flag can be set via:
- `service_conf.yaml` (`authentication.disable_password_login: true`)
- Environment variable (`DISABLE_PASSWORD_LOGIN=true`)

Default behavior is unchanged (`false`).

### Behavior

| `disable_password_login` | OAuth configured | Result |
|---|---|---|
| `false` (default) | No | Standard email/password form |
| `false` | Yes | Email/password form + SSO buttons below |
| `true` | Yes | **SSO buttons only** (no form, no sign up link) |
| `true` | No | Empty card (admin should configure OAuth first) |

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Files changed (5)

1. `docker/service_conf.yaml.template` — added `disable_password_login:
false` under authentication
2. `common/settings.py` — added `DISABLE_PASSWORD_LOGIN` global variable
and loader in `init_settings()`
3. `common/config_utils.py` — fixed `TypeError` in `show_configs()` when
authentication section contains non-dict values (e.g., booleans)
4. `api/apps/system_app.py` — exposed `disablePasswordLogin` flag in
`/config` endpoint
5. `web/src/pages/login/index.tsx` — conditionally render password form
based on config flag; OAuth buttons always render when channels exist

---------

Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
2026-03-02 14:06:03 +08:00
daec36e935 Fix: add soft limit for graph rag size (#13252)
### What problem does this PR solve?

Fix: add soft limit for graph rag size #13258 Q2

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-02 14:02:36 +08:00
8a6b5ced6b fix: add missing chunk_data column to OceanBase schema migration (#13306)
### What problem does this PR solve?

When using OceanBase as the document storage engine, parsing and
inserting chunks with chunk_data (e.g., table parser row data) fails
with the following error:
```
[ERROR][Exception]: Insert chunk error: ['Unconsumed column names: chunk_data']
This happens because the chunk_data column was recently introduced but was omitted from the EXTRA_COLUMNS list in 
rag/utils/ob_conn.py
```
As a result, the automatic schema migration for existing OceanBase
tables does not append the missing chunk_data column, causing the
underlying pyobvector or SQLAlchemy to raise an unconsumed column names
error during data insertion.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### What is the solution?
Added column_chunk_data to the EXTRA_COLUMNS list in 
```
rag/utils/ob_conn.py
```
This ensures that the OceanBase connection wrapper can correctly detect
the missing column and automatically alter existing chunk tables to
include the chunk_data field during initialization.
2026-03-02 13:25:11 +08:00
f0dd12289c Feat: add preprocess parameters for ingestion pipeline (#13300)
### What problem does this PR solve?
Feat: add preprocess parameters for ingestion pipeline

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-03-02 13:18:57 +08:00
7fc97da610 security: Adopt Jinja2 SandboxedEnvironment for template rendering. (#13305) 2026-03-02 13:17:29 +08:00
860c4bd0bb Feat: UI testing automation with playwright (#12749)
### What problem does this PR solve?

This PR helps automate the testing of the ui interface using pytest
Playwright

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test automation infrastructure

---------

Co-authored-by: Liu An <asiro@qq.com>
2026-03-02 13:04:08 +08:00
21bc1ab7ec Feature rtl support (#13118)
### What problem does this PR solve?

This PR adds comprehensive **Right-to-Left (RTL) language support**,
primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu,
etc.).

Previously, RTL content had multiple rendering issues:

- Incorrect sentence splitting for Arabic punctuation in citation logic
- Misaligned text in chat messages and markdown components  
- Improper positioning of blockquotes and “think” sections  
- Incorrect table alignment  
- Citation placement ambiguity in RTL prompts  
- UI layout inconsistencies when mixing LTR and RTL text  

This PR introduces backend and frontend improvements to properly detect,
render, and style RTL content while preserving existing LTR behavior.

#### Backend
- Updated sentence boundary regex in `rag/nlp/search.py` to include
Arabic punctuation:
  - `،` (comma)
  - `؛` (semicolon)
  - `؟` (question mark)
  - `۔` (Arabic full stop)
- Ensures citation insertion works correctly in RTL sentences.
- Updated citation prompt instructions to clarify citation placement
rules for RTL languages.

#### Frontend
- Introduced a new utility: `text-direction.ts`
  - Detects text direction based on Unicode ranges.
  - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts.
  - Provides `getDirAttribute()` for automatic `dir` assignment.

- Applied dynamic `dir` attributes across:
  - Markdown rendering
  - Chat messages
  - Search results
  - Tables
  - Hover cards and reference popovers

- Added proper RTL styling in LESS:
  - Text alignment adjustments
  - Blockquote border flipping
  - Section indentation correction
  - Table direction switching
  - Use of `<bdi>` for figure labels to prevent bidirectional conflicts

#### DevOps / Environment
- Added Windows backend launch script with retry handling.
- Updated dependency metadata.
- Adjusted development-only React debugging behavior.

---

### Type of change

- [x] Bug Fix (non-breaking change which fixes RTL rendering and
citation issues)
- [x] New Feature (non-breaking change which adds RTL detection and
dynamic direction handling)

---------

Co-authored-by: 6ba3i <isbaaoui09@gmail.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com>
Co-authored-by: Liu An <asiro@qq.com>
2026-03-02 13:03:44 +08:00
a897aedea9 Feat: Modify the form styles for retrieval and conditional operators. (#13299)
### What problem does this PR solve?

Feat: Modify the form styles for retrieval and conditional operators.

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2026-03-02 12:05:27 +08:00
0cdddea59a feat: pipeline add preprocess (#13302)
### What problem does this PR solve?

feat: pipeline add preprocess

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-02 11:50:48 +08:00
cf3d3c7c89 Feat: When exporting the agent DSL, the tailkey, password, and history fields need to be cleared. #13281 (#13282)
### What problem does this PR solve?
Feat: When exporting the agent DSL, the tailkey, password, and history
fields need to be cleared. #13281

### Type of change


- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-02 11:41:38 +08:00
b956ad180c Build(deps): Bump pypdf from 6.7.3 to 6.7.4 (#13298)
Bumps [pypdf](https://github.com/py-pdf/pypdf) from 6.7.3 to 6.7.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/py-pdf/pypdf/releases">pypdf's
releases</a>.</em></p>
<blockquote>
<h2>Version 6.7.4, 2026-02-27</h2>
<h2>What's new</h2>
<h3>Security (SEC)</h3>
<ul>
<li>Allow limiting output length for RunLengthDecode filter (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3664">#3664</a>)
by <a
href="https://github.com/stefan6419846"><code>@​stefan6419846</code></a></li>
</ul>
<h3>Robustness (ROB)</h3>
<ul>
<li>Deal with invalid annotations in extract_links (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3659">#3659</a>)
by <a
href="https://github.com/stefan6419846"><code>@​stefan6419846</code></a></li>
</ul>
<p><a href="https://github.com/py-pdf/pypdf/compare/6.7.3...6.7.4">Full
Changelog</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md">pypdf's
changelog</a>.</em></p>
<blockquote>
<h2>Version 6.7.4, 2026-02-27</h2>
<h3>Security (SEC)</h3>
<ul>
<li>Allow limiting output length for RunLengthDecode filter (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3664">#3664</a>)</li>
</ul>
<h3>Robustness (ROB)</h3>
<ul>
<li>Deal with invalid annotations in extract_links (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3659">#3659</a>)</li>
</ul>
<p><a href="https://github.com/py-pdf/pypdf/compare/6.7.3...6.7.4">Full
Changelog</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="1650bc31e8"><code>1650bc3</code></a>
REL: 6.7.4</li>
<li><a
href="f309c60037"><code>f309c60</code></a>
SEC: Allow limiting output length for RunLengthDecode filter (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3664">#3664</a>)</li>
<li><a
href="993f052748"><code>993f052</code></a>
DEV: Bump actions/upload-artifact from 6 to 7 (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3662">#3662</a>)</li>
<li><a
href="a3c996bffc"><code>a3c996b</code></a>
DEV: Bump actions/download-artifact from 7 to 8 (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3663">#3663</a>)</li>
<li><a
href="37de32022e"><code>37de320</code></a>
ROB: Deal with invalid annotations in extract_links (<a
href="https://redirect.github.com/py-pdf/pypdf/issues/3659">#3659</a>)</li>
<li>See full diff in <a
href="https://github.com/py-pdf/pypdf/compare/6.7.3...6.7.4">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pypdf&package-manager=uv&previous-version=6.7.3&new-version=6.7.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/infiniflow/ragflow/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-02 11:32:12 +08:00
9d78d3ddb1 Tests: fix failling http in CI (#13301)
### What problem does this PR solve?
test_doc_sdk_routes_unit had two flaky/incorrect branch assumptions:

1. parse/stop_parsing production logic gates on doc.run, but tests used
progress, causing branch mismatch and unintended fallthrough into
mutation/DB paths.
2. stop_parsing invalid-state test asserted an outdated message
fragment, making the contract brittle.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-02 10:44:33 +08:00
7e0dd906f2 refactor: update admin ui (#13280)
### What problem does this PR solve?

Update for Admin UI:
- Update file picker input in **Registration whitelist** > **Import from
Excel** modal
- Modify DOM structure of **Sandbox Settings** and move several
hardcoded texts into translation files

### Type of change

- [x] Refactoring
2026-02-28 19:21:51 +08:00
e62552d482 Added some React IDs for playwright e2e tests (#13265)
### What problem does this PR solve?

Necessary ids for implementing the new testing suite with playwright for
UI

### Type of change

- [x] Other (please describe): Testing IDs

Co-authored-by: Liu An <asiro@qq.com>
2026-02-28 15:13:47 +08:00
1027916bfe Fix: inconsistent state handling for multi-user single-canvas access (#13267)
### What problem does this PR solve?

<img width="700" alt="image"
src="https://github.com/user-attachments/assets/1db7412e-4554-44bc-84ba-16421949aacc"
/>

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-02-28 15:09:21 +08:00
c91e803a38 Fix: close detached PIL image on JPEG save failure in encode_image (#13278)
### What problem does this PR solve?

Properly close detached PIL image on JPEG save failure in encode_image.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-28 14:43:35 +08:00
983150b936 Fix (api): fix the document parsing status check logic (#12504)
### What problem does this PR solve?
When the original code terminates the parsing task halfway, the progress
may not be 0 or 1, which will result in the inability to call the
interface to parse again

-Change the document parsing progress check to task status check, and
use TaskStatus.RUNNING.value to judge
-Update the condition judgment for stopping parsing documents, and check
whether the task is running instead


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-28 14:38:55 +08:00
32ec950ca8 Fix create / drop chat session syntax (#13279)
### What problem does this PR solve?

This pull request refactors the chat session creation and deletion logic
in both the parser and client code to use unique session IDs instead of
session names. It also updates the corresponding command syntax and
payloads, ensuring more robust and unambiguous session management.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-02-28 14:18:21 +08:00
d9d4825079 Add chat sessions related command (#13268)
### What problem does this PR solve?

1. Create / Drop / List chat sessions
2. Chat with LLM and datasets

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-02-28 12:52:45 +08:00
54094771a3 Fix streaming chat on web API (#13275)
### What problem does this PR solve?

This pull request makes a small but important fix to how streaming
requests are handled in the `completion` endpoint of
`conversation_app.py`. The main change ensures that the `stream`
argument is not passed twice, which could cause errors.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-02-28 12:16:38 +08:00
0110151e12 Fix: document remove race condition (#13242)
### What problem does this PR solve?

Fix document remove race condition.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-28 11:23:24 +08:00
fa71f8d0c7 refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.

This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.

**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:

* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.

**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.

**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.

**Validation & Testing**
I've tested this to ensure no regressions:

* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.

**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
4f0c892b32 feat(ui): add individual model delete buttons across all providers (#13271)
### What problem does this PR solve?

Added the option to delete models individually from providers.
For additional context, see
[issue-13184](https://github.com/infiniflow/ragflow/issues/13184)

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Note: when deleting a selected model, it leaves the full model name as
text as seen here:
<img width="676" height="90" alt="image"
src="https://github.com/user-attachments/assets/c11c7c1b-3f2a-4119-b20c-bb8148a8ad16"
/>

If attempting to use ragflow with that deleted model, ragflow will throw
an unauthorized model error as expected.
I left it like that on purpose, so it's easier for the user to
understand what he deleted and that he needs to replace it with another
model.

Co-authored-by: Shahar Flumin <shahar@Shahars-MacBook-Air.local>
2026-02-28 10:51:39 +08:00
d1afcc9e71 feat(seafile): add library and directory sync scope support (#13153)
### What problem does this PR solve?

The SeaFile connector currently synchronises the entire account — every
library
visible to the authenticated user. This is impractical for users who
only need
a subset of their data indexed, especially on large SeaFile instances
with many
shared libraries.

This PR introduces granular sync scope support, allowing users to choose
between
syncing their entire account, a single library, or a specific directory
within a
library. It also adds support for SeaFile library-scoped API tokens
(`/api/v2.1/via-repo-token/` endpoints), enabling tighter access control
without
exposing account-level credentials.


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Test

```
from seafile_connector import SeaFileConnector
import logging
import os

logging.basicConfig(level=logging.DEBUG)

URL = os.environ.get("SEAFILE_URL", "https://seafile.example.com")
TOKEN = os.environ.get("SEAFILE_TOKEN", "")
REPO_ID = os.environ.get("SEAFILE_REPO_ID", "")
SYNC_PATH = os.environ.get("SEAFILE_SYNC_PATH", "/Documents")
REPO_TOKEN = os.environ.get("SEAFILE_REPO_TOKEN", "")

def _test_scope(scope, repo_id=None, sync_path=None):
    print(f"\n{'='*50}")
    print(f"Testing scope: {scope}")
    print(f"{'='*50}")

    creds = {"seafile_token": TOKEN} if TOKEN else {}
    if REPO_TOKEN and scope in ("library", "directory"):
        creds["repo_token"] = REPO_TOKEN

    connector = SeaFileConnector(
        seafile_url=URL,
        batch_size=5,
        sync_scope=scope,
        include_shared = False,
        repo_id=repo_id,
        sync_path=sync_path,
    )
    connector.load_credentials(creds)
    connector.validate_connector_settings()

    count = 0
    for batch in connector.load_from_state():
        for doc in batch:
            count += 1
            print(f"  [{count}] {doc.semantic_identifier} "
                  f"({doc.size_bytes} bytes, {doc.extension})")

    print(f"\n-> {scope} scope: {count} document(s) found.\n")

# 1. Account scope
if TOKEN:
    _test_scope("account")
else:
    print("\nSkipping account scope (set SEAFILE_TOKEN)")

# 2. Library scope
if REPO_ID and (TOKEN or REPO_TOKEN):
    _test_scope("library", repo_id=REPO_ID)
else:
    print("\nSkipping library scope (set SEAFILE_REPO_ID + token)")

# 3. Directory scope
if REPO_ID and SYNC_PATH and (TOKEN or REPO_TOKEN):
    _test_scope("directory", repo_id=REPO_ID, sync_path=SYNC_PATH)
else:
    print("\nSkipping directory scope (set SEAFILE_REPO_ID + SEAFILE_SYNC_PATH + token)")
```
2026-02-28 10:24:28 +08:00
aec2ef4232 refactor:improve tts model's codes (#13137)
### What problem does this PR solve?

improve tts model's codes

### Type of change

- [x] Refactoring
2026-02-28 10:18:00 +08:00
9577753c10 Refactor: improve the logic about docling parser extract box (#13215)
### What problem does this PR solve?
 improve the logic about docling parser extract box

### Type of change
- [x] Refactoring
2026-02-28 10:05:24 +08:00
510ff89661 Fix: remove unused files (#13232)
### What problem does this PR solve?

Fix: remove unused files

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-27 23:05:40 +08:00
c0823e8d6d refactor: update chat ui (#13269)
### What problem does this PR solve?

Update **Chat** UI:
- Align to the design.
- Update `<AudioButton>` visualizer logic.
- Fix keyboard navigation issue.

### Type of change

- [x] Refactoring
2026-02-27 22:26:19 +08:00
4e48aba5c4 fix: update DoclingParser return type hint (#13243)
### What problem does this PR solve?

The _transfer_to_sections method was throwing a type hint violation
because it occasionally returns 3-item tuples instead of 2. Adjusted to
list[tuple[str, ...]] to prevent runtime crashes.

Error: 

20:53:21 Page(1~10): [ERROR]Internal server error while chunking:
Method
deepdoc.parser.docling_parser.DoclingParser._transfer_to_sections()
return [(1. JIRA Nasıl Kullanılır?, text,
@@1\t70.8\t194.9\t70.9\t85.5##), (1.1. Proje O...##)] violates type
hint list[tuple[str, str]], as list index
15 item tuple tuple (Gelen ekran
üzerinden alanları isterlerine göre doldurduğunuz taktirde Create
düğmesi i...##) length 3 != 2.
20:53:21 [ERROR][Exception]: Method
deepdoc.parser.docling_parser.DoclingParser._transfer_to_sections()
return [('1. JIRA Nasıl Kullanılır?', 'text',
'@@1\t70.8\t194.9\t70.9\t85.5##'), ('1.1. Proje O...##')] violates
type hint list[tuple[str, str]], as list index
15 item tuple tuple ('Gelen ekran
üzerinden alanları isterlerine göre doldurduğunuz taktirde Create
düğmesi i...##') length 3 != 2.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Enes Delibalta <enes.delibalta@pentanom.com>
2026-02-27 20:13:50 +08:00
51b180d991 fix: adding GPUStack chat model requires v1 suffix (#13237)
### What problem does this PR solve?

Refer to issue: #13236
The base url for GPUStack chat model requires `/v1` suffix. For the
other model type like `Embedding` or `Rerank`, the `/v1` suffix is not
required and will be appended in code.
So keep the same logic for chat model as other model type.

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
2026-02-27 20:13:07 +08:00
194e076e26 Fix: init superuser can create duplicate users (#13221)
### What problem does this PR solve?

This PR fixes 2 bugs related to RAGFlow's init superuser functionality.

#### Bug 1

When the RAGFlow server was started with the `--init-superuser` option
it would always create a new admin user even if it already exists
resulting in duplicate users.

To fix this, I added an additional check before create the superuser and
added the *unique* constraint to the email column of the database, to
mitigate potential TOCTOU race conditions. Since existing databases
could contain duplicate emails I added email de-duplication to the
database migration.

#### Bug 2

When the RAGFlow server was started with the `--init-superuser` option
but without configured default LLM and embedding models it would fail to
start because the `init_superuser` function would always make test
request to the models even if they were not set.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-27 19:55:51 +08:00
6d0100ca67 Fix: The output content of the multi-model comparison will disappear. #13227 (#13241)
### What problem does this PR solve?

Fix: The output content of the multi-model comparison will disappear.
#13227
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-27 19:18:40 +08:00
861ebfc6e1 Feat: Make the embedded page of chat compatible with mobile devices. (#13262)
### What problem does this PR solve?
Feat: Make the embedded page of chat compatible with mobile devices.

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2026-02-27 19:17:41 +08:00
5f53fbe0f1 feat: Add Avian as an LLM provider (#13256)
### What problem does this PR solve?

This PR adds [Avian](https://avian.io) as a new LLM provider to RAGFlow.
Avian provides an OpenAI-compatible API with competitive pricing,
offering access to models like DeepSeek V3.2, Kimi K2.5, GLM-5, and
MiniMax M2.5.

**Provider details:**
- API Base URL: `https://api.avian.io/v1`
- Auth: Bearer token via API key
- OpenAI-compatible (chat completions, streaming, function calling)
- Models:
  - `deepseek/deepseek-v3.2` — 164K context, $0.26/$0.38 per 1M tokens
  - `moonshotai/kimi-k2.5` — 131K context, $0.45/$2.20 per 1M tokens
  - `z-ai/glm-5` — 131K context, $0.30/$2.55 per 1M tokens
  - `minimax/minimax-m2.5` — 1M context, $0.30/$1.10 per 1M tokens

**Changes:**
- `rag/llm/chat_model.py` — Add `AvianChat` class extending `Base`
- `rag/llm/__init__.py` — Register in `SupportedLiteLLMProvider`,
`FACTORY_DEFAULT_BASE_URL`, `LITELLM_PROVIDER_PREFIX`
- `conf/llm_factories.json` — Add Avian factory with model definitions
- `web/src/constants/llm.ts` — Add to `LLMFactory` enum, `IconMap`,
`APIMapUrl`
- `web/src/components/svg-icon.tsx` — Register SVG icon
- `web/src/assets/svg/llm/avian.svg` — Provider icon
- `docs/references/supported_models.mdx` — Add to supported models table

This follows the same pattern as other OpenAI-compatible providers
(e.g., n1n #12680, TokenPony).

cc @KevinHuSh @JinHai-CN

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
2026-02-27 17:36:55 +08:00
bb59a27e55 Doc : Add french Readme (#13254)
### What problem does this PR solve?

Add fench Readme

### Type of change

- [x] Documentation Update
2026-02-27 11:34:13 +08:00
8b6d363a98 Use pagination in _search_metadata (#13238)
### What problem does this PR solve?

Fix [#13210](https://github.com/infiniflow/ragflow/issues/13210)

Remove limit in _search_metadata, use pagination in _search_metadata.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-27 11:24:49 +08:00
a1549c0fdc Fix UI (#13239)
### What problem does this PR solve?

This pull request makes a minor update to the English locale strings for
the Table of Contents toggle buttons, changing the labels from "Show
TOC"/"Hide TOC" to "Show content"/"Hide content" for improved clarity.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-02-26 19:21:08 +08:00
c03c537bf8 Feat: optimize gmail/google-drive (#13230)
### What problem does this PR solve?

Feat: optimize gmail/google-drive

Now:
<img width="700" alt="image"
src="https://github.com/user-attachments/assets/0c4b6044-7209-4c4f-ac0c-32070b79daf7"
/>
<img width="700" alt="image"
src="https://github.com/user-attachments/assets/406f93d8-9b0f-4f5a-b8bb-3936990f558c"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-02-26 19:19:40 +08:00