### What problem does this PR solve?
Forces NLTK to load the corpus synchronously once, preventing concurrent
tasks from triggering the lazy-loading race condition that cause Fixing
WordNetCorpusReader object has no attribute _LazyCorpusLoader_… #13590
### Type of change
- [X] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: shakeel <shakeel@lollylaw.com>
### What problem does this PR solve?
issue #13465
POST /api/v1/retrieval failed with
{"code":100,...,"message":"Exception('Model Name is required')"} when
cross_languages was provided and no explicit llm_id was passed.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
Add MiniMax's latest M2.5 model family to the model registry and update
the default API base URL to the international endpoint for broader
accessibility.
## Changes
- **Add MiniMax-M2.5 models** to `conf/llm_factories.json`:
- `MiniMax-M2.5` — Peak Performance. Ultimate Value. Master the Complex.
- `MiniMax-M2.5-highspeed` — Same performance, faster and more agile.
- Both support 204,800 token context window and tool calling (`is_tools:
true`).
- **Update default MiniMax API base URL** in `rag/llm/__init__.py`:
- From `https://api.minimaxi.com/v1` (domestic) to
`https://api.minimax.io/v1` (international).
- Chinese users can still override via the Base URL field in the UI
settings (as documented in existing i18n strings).
## Supported Models
| Model | Context Window | Tool Calling | Description |
|-------|---------------|-------------|-------------|
| `MiniMax-M2.5` | 204,800 tokens | Yes | Peak Performance. Ultimate
Value. |
| `MiniMax-M2.5-highspeed` | 204,800 tokens | Yes | Same performance,
faster and more agile. |
## API Documentation
- OpenAI Compatible API:
https://platform.minimax.io/docs/api-reference/text-openai-api
## Testing
- [x] JSON validation passes
- [x] Python syntax validation passes
- [x] Ruff lint passes
- [x] MiniMax-M2.5 API call verified (returns valid response)
- [x] MiniMax-M2.5-highspeed API call verified (returns valid response)
Co-authored-by: PR Bot <pr-bot@minimaxi.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fix: image pdf in ingestion pipeline #13550
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR adds support for parsing PDFs through an external Docling
server, so RAGFlow can connect to remote `docling serve` deployments
instead of relying only on local in-process Docling.
It addresses the feature request in
[#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns
with the external-server usage pattern already used by MinerU.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What is changed?
- Add external Docling server support in `DoclingParser`:
- Use `DOCLING_SERVER_URL` to enable remote parsing mode.
- Try `POST /v1/convert/source` first, and fallback to
`/v1alpha/convert/source`.
- Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not
set.
- Wire Docling env settings into parser invocation paths:
- `rag/app/naive.py`
- `rag/flow/parser/parser.py`
- Add Docling env hints in constants and update docs:
- `docs/guides/dataset/select_pdf_parser.md`
- `docs/guides/agent/agent_component_reference/parser.md`
- `docs/faq.mdx`
### Why this approach?
This keeps the change focused on one issue and one capability (external
Docling connectivity), without introducing unrelated provider-model
plumbing.
### Validation
- Static checks:
- `python -m py_compile` on changed Python files
- `python -m ruff check` on changed Python files
- Functional checks:
- Remote v1 endpoint path works
- v1alpha fallback works
- Local Docling path remains available when server URL is unset
### Related links
- Feature request: [Support external Docling server (issue
#13426)](https://github.com/infiniflow/ragflow/issues/13426)
- Compare view for this branch:
[main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1)
##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)
### What problem does this PR solve?
Add delete all support for delete operations.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
---------
Co-authored-by: writinwaters <cai.keith@gmail.com>
## Summary
- Convert bare `open()` calls to `with` context managers or
`Path.read_text()`
- File handles leak if not properly closed, especially on exceptions
- Fixes in crypt.py, sequence2txt_model.py, term_weight.py,
deepdoc/vision/__init__.py
## Test plan
- [x] File operations work correctly with context managers
- [x] Resources properly cleaned up on exceptions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
This PR is the direct successor to the previous `docx` lazy-loading
implementation. It addresses the technical debt intentionally left out
in the last PR by fully migrating the `qa` and `manual` parsing
strategies to the new lazy-loading model.
Additionally, this PR comprehensively refactors the underlying `docx`
parsing pipeline to eliminate significant code redundancy and introduces
robust fallback mechanisms to handle completely corrupted image streams
safely.
## What's Changed
* **Centralized Abstraction (`docx_parser.py`)**: Moved the
`get_picture` extraction logic up to the `RAGFlowDocxParser` base class.
Previously, `naive`, `qa`, and `manual` parsers maintained separate,
redundant copies of this method. All downstream strategies now natively
gather raw blobs and return `LazyDocxImage` objects automatically.
* **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge
cases where `python-docx` encounters critically malformed magic headers.
Implemented an explicit `try-except` structure that safely intercepts
`UnrecognizedImageError` (and similar exceptions) and seamlessly falls
back to retrieving the raw binary via `getattr(related_part, "blob",
None)`, preventing parser crashes on damaged documents.
* **Legacy Code & Redundancy Purge**:
* Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`,
and `manual.py`.
* Removed the standalone, immediate-decoding `concat_img` method in
`manual.py`. It has been completely replaced by the globally unified,
lazy-loading-compatible `rag.nlp.concat_img`.
* Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception
packages) across all updated strategy files.
## Scope
To keep this PR focused, I have restricted these changes strictly to the
unification of `docx` extraction logic and the lazy-load migration of
`qa` and `manual`.
## Validation & Testing
I've tested this to ensure no regressions and validated the fallback
logic:
* **Output Consistency**: Compared identical `.docx` inputs using `qa`
and `manual` strategies before and after this branch: chunk counts,
extracted text, table HTML, and attached images match perfectly.
* **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory
usage when processing image-dense documents through the `qa` and
`manual` pipelines, bringing them up to parity with the `naive`
strategy's performance gains.
## Breaking Changes
* None.
### What problem does this PR solve?
Fix: chats_openai in none stream condition #13453
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix https://github.com/infiniflow/ragflow/issues/13388
The following command returns empty when there is doc with the meta data
```
curl --request POST \
--url http://localhost:9222/api/v1/retrieval \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ragflow-fO3mPFePfLgUYg8-9gjBVVXbvHqrvMPLGaW0P86PvAk' \
--data '{
"question": "any question",
"dataset_ids": ["9bb4f0591b8811f18a4a84ba59049aa3"],
"metadata_condition": {
"logic": "and",
"conditions": [
{
"name": "character",
"comparison_operator": "is",
"value": "刘备"
}
]
}
}'
```
When metadata_condtion is specified in the retrieval API, it is
converted to doc_ids and doc_ids is passed to retrieval function.
In retrieval funciton, when doc_ids is explicitly provided , we should
bypass threshold.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR addresses security vulnerabilities in PDF processing
dependencies identified by Trivy security scan:
1. CVE-2026-28804 (MEDIUM): pypdf 6.7.4 vulnerable to inefficient
decoding of ASCIIHexDecode streams
2. CVE-2023-36464 (MEDIUM): pypdf2 3.0.1 susceptible to infinite loop
when parsing malformed comments
Since pypdf2 is deprecated with no available fixes, this PR migrates all
pypdf2 usage to the actively maintained pypdf library (version 6.7.5),
which resolves
both vulnerabilities.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add DingTalk AI Table connector and integration for data synchronization
Issue #13400
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: wangheyang <wangheyang@corp.netease.com>
### What problem does this PR solve?
This PR aims to extend the list of possible providers. Adds new Provider
"RAGcon" within the Ollama Modal. It provides all model types except OCR
via Openai-compatible endpoints.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Jakob <16180662+hauberj@users.noreply.github.com>
### What problem does this PR solve?
Alibaba Could OSS config issue #13390.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add id for table tenant_llm and apply in LLMBundle.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
Problem: When searching for a specific company name like(Daofeng
Technology), the search would incorrectly return unrelated resumes
containing generic terms like (Technology) in their company names
Root Cause: The `corporation_name_tks` field was included in the
identity fields that are redundantly written to every chunk. This caused
common words like "科技" to match across all chunks, leading to
over-retrieval of irrelevant resumes.
Solution: Remove `corporation_name_tks` from the `_IDENTITY_FIELDS`
list. Company information is still preserved in the "Work Overview"
chunk where it belongs, allowing proper company-based searches while
preventing false positives from generic terms.
---------
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
1. Use redis to store the secret key.
2. During startup API server will read the secret from redis. If no such
secret key, generate one and store it into redis, atomically.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Fix: Correct PDF chunking parameter name in naive #13325
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Cross-verify project experience and work experience, and remove
duplicate text
---------
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
### What problem does this PR solve?
Fix AttributeError when calling llm.chat() in resume parser. LLMBundle
only has async_chat method, not chat method. Use `_run_coroutine_sync`
wrapper to call async_chat synchronously.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Core optimizations (refer to arXiv:2510.09722):
1. PDF text fusion: Metadata + OCR dual-path extraction and fusion
2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical
sorting + line number indexing
3. Parallel task decomposition: Basic information/work
experience/educational background three-way parallel LLM extraction
4. Index pointer mechanism: LLM returns a range of line numbers instead
of generating the full text, reducing the illusion of full text.
---------
Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
## Summary
When using MinerU, docling, TCADP, or paddleocr as the PDF parser with
the General (naive) chunk method, the user-configured `chunk_token_num`
is **unconditionally overwritten to 0** at
[rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859),
effectively disabling chunk merging regardless of what the user sets in
the UI.
### Problem
A user sets `chunk_token_num = 2048` in the dataset configuration UI,
expecting small parser blocks to be merged into larger chunks. However,
this line:
```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
parser_config["chunk_token_num"] = 0
```
silently overrides the user's setting. As a result, every MinerU output
block becomes its own chunk. For short documents (e.g. a 3-page PDF fund
factsheet parsed by MinerU), this produces **47 tiny chunks** — some as
small as 11 characters (`"July 2025"`) or 15 characters (`"CIES
Eligible"`).
This severely degrades retrieval quality: vector embeddings of such
short fragments have minimal semantic value, and keyword search produces
excessive noise.
### Fix
Only apply the `chunk_token_num = 0` override when the user has **not**
explicitly configured a positive value:
```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
if int(parser_config.get("chunk_token_num", 0)) <= 0:
parser_config["chunk_token_num"] = 0
```
This preserves the original default behavior (no merging) while
respecting the user's explicit configuration.
### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)
| | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |
## Test plan
- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify
chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) →
verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior
(not affected by this code path)
- [ ] Repeat with docling/paddleocr if available
### What problem does this PR solve?
Fix: add soft limit for graph rag size #13258 Q2
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
When using OceanBase as the document storage engine, parsing and
inserting chunks with chunk_data (e.g., table parser row data) fails
with the following error:
```
[ERROR][Exception]: Insert chunk error: ['Unconsumed column names: chunk_data']
This happens because the chunk_data column was recently introduced but was omitted from the EXTRA_COLUMNS list in
rag/utils/ob_conn.py
```
As a result, the automatic schema migration for existing OceanBase
tables does not append the missing chunk_data column, causing the
underlying pyobvector or SQLAlchemy to raise an unconsumed column names
error during data insertion.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What is the solution?
Added column_chunk_data to the EXTRA_COLUMNS list in
```
rag/utils/ob_conn.py
```
This ensures that the OceanBase connection wrapper can correctly detect
the missing column and automatically alter existing chunk tables to
include the chunk_data field during initialization.
### What problem does this PR solve?
Feat: add preprocess parameters for ingestion pipeline
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
This PR adds comprehensive **Right-to-Left (RTL) language support**,
primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu,
etc.).
Previously, RTL content had multiple rendering issues:
- Incorrect sentence splitting for Arabic punctuation in citation logic
- Misaligned text in chat messages and markdown components
- Improper positioning of blockquotes and “think” sections
- Incorrect table alignment
- Citation placement ambiguity in RTL prompts
- UI layout inconsistencies when mixing LTR and RTL text
This PR introduces backend and frontend improvements to properly detect,
render, and style RTL content while preserving existing LTR behavior.
#### Backend
- Updated sentence boundary regex in `rag/nlp/search.py` to include
Arabic punctuation:
- `،` (comma)
- `؛` (semicolon)
- `؟` (question mark)
- `۔` (Arabic full stop)
- Ensures citation insertion works correctly in RTL sentences.
- Updated citation prompt instructions to clarify citation placement
rules for RTL languages.
#### Frontend
- Introduced a new utility: `text-direction.ts`
- Detects text direction based on Unicode ranges.
- Supports Arabic, Hebrew, Syriac, Thaana, and related scripts.
- Provides `getDirAttribute()` for automatic `dir` assignment.
- Applied dynamic `dir` attributes across:
- Markdown rendering
- Chat messages
- Search results
- Tables
- Hover cards and reference popovers
- Added proper RTL styling in LESS:
- Text alignment adjustments
- Blockquote border flipping
- Section indentation correction
- Table direction switching
- Use of `<bdi>` for figure labels to prevent bidirectional conflicts
#### DevOps / Environment
- Added Windows backend launch script with retry handling.
- Updated dependency metadata.
- Adjusted development-only React debugging behavior.
---
### Type of change
- [x] Bug Fix (non-breaking change which fixes RTL rendering and
citation issues)
- [x] New Feature (non-breaking change which adds RTL detection and
dynamic direction handling)
---------
Co-authored-by: 6ba3i <isbaaoui09@gmail.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
Properly close detached PIL image on JPEG save failure in encode_image.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.
This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.
**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:
* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.
**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.
**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.
**Validation & Testing**
I've tested this to ensure no regressions:
* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.
**Breaking Changes**
None.
### What problem does this PR solve?
The SeaFile connector currently synchronises the entire account — every
library
visible to the authenticated user. This is impractical for users who
only need
a subset of their data indexed, especially on large SeaFile instances
with many
shared libraries.
This PR introduces granular sync scope support, allowing users to choose
between
syncing their entire account, a single library, or a specific directory
within a
library. It also adds support for SeaFile library-scoped API tokens
(`/api/v2.1/via-repo-token/` endpoints), enabling tighter access control
without
exposing account-level credentials.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### Test
```
from seafile_connector import SeaFileConnector
import logging
import os
logging.basicConfig(level=logging.DEBUG)
URL = os.environ.get("SEAFILE_URL", "https://seafile.example.com")
TOKEN = os.environ.get("SEAFILE_TOKEN", "")
REPO_ID = os.environ.get("SEAFILE_REPO_ID", "")
SYNC_PATH = os.environ.get("SEAFILE_SYNC_PATH", "/Documents")
REPO_TOKEN = os.environ.get("SEAFILE_REPO_TOKEN", "")
def _test_scope(scope, repo_id=None, sync_path=None):
print(f"\n{'='*50}")
print(f"Testing scope: {scope}")
print(f"{'='*50}")
creds = {"seafile_token": TOKEN} if TOKEN else {}
if REPO_TOKEN and scope in ("library", "directory"):
creds["repo_token"] = REPO_TOKEN
connector = SeaFileConnector(
seafile_url=URL,
batch_size=5,
sync_scope=scope,
include_shared = False,
repo_id=repo_id,
sync_path=sync_path,
)
connector.load_credentials(creds)
connector.validate_connector_settings()
count = 0
for batch in connector.load_from_state():
for doc in batch:
count += 1
print(f" [{count}] {doc.semantic_identifier} "
f"({doc.size_bytes} bytes, {doc.extension})")
print(f"\n-> {scope} scope: {count} document(s) found.\n")
# 1. Account scope
if TOKEN:
_test_scope("account")
else:
print("\nSkipping account scope (set SEAFILE_TOKEN)")
# 2. Library scope
if REPO_ID and (TOKEN or REPO_TOKEN):
_test_scope("library", repo_id=REPO_ID)
else:
print("\nSkipping library scope (set SEAFILE_REPO_ID + token)")
# 3. Directory scope
if REPO_ID and SYNC_PATH and (TOKEN or REPO_TOKEN):
_test_scope("directory", repo_id=REPO_ID, sync_path=SYNC_PATH)
else:
print("\nSkipping directory scope (set SEAFILE_REPO_ID + SEAFILE_SYNC_PATH + token)")
```
### What problem does this PR solve?
Refer to issue: #13236
The base url for GPUStack chat model requires `/v1` suffix. For the
other model type like `Embedding` or `Rerank`, the `/v1` suffix is not
required and will be appended in code.
So keep the same logic for chat model as other model type.
### Type of change
- [X] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR adds [Avian](https://avian.io) as a new LLM provider to RAGFlow.
Avian provides an OpenAI-compatible API with competitive pricing,
offering access to models like DeepSeek V3.2, Kimi K2.5, GLM-5, and
MiniMax M2.5.
**Provider details:**
- API Base URL: `https://api.avian.io/v1`
- Auth: Bearer token via API key
- OpenAI-compatible (chat completions, streaming, function calling)
- Models:
- `deepseek/deepseek-v3.2` — 164K context, $0.26/$0.38 per 1M tokens
- `moonshotai/kimi-k2.5` — 131K context, $0.45/$2.20 per 1M tokens
- `z-ai/glm-5` — 131K context, $0.30/$2.55 per 1M tokens
- `minimax/minimax-m2.5` — 1M context, $0.30/$1.10 per 1M tokens
**Changes:**
- `rag/llm/chat_model.py` — Add `AvianChat` class extending `Base`
- `rag/llm/__init__.py` — Register in `SupportedLiteLLMProvider`,
`FACTORY_DEFAULT_BASE_URL`, `LITELLM_PROVIDER_PREFIX`
- `conf/llm_factories.json` — Add Avian factory with model definitions
- `web/src/constants/llm.ts` — Add to `LLMFactory` enum, `IconMap`,
`APIMapUrl`
- `web/src/components/svg-icon.tsx` — Register SVG icon
- `web/src/assets/svg/llm/avian.svg` — Provider icon
- `docs/references/supported_models.mdx` — Add to supported models table
This follows the same pattern as other OpenAI-compatible providers
(e.g., n1n #12680, TokenPony).
cc @KevinHuSh @JinHai-CN
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
### What problem does this PR solve?
Feat: optimize ingestion pipeline with preprocess
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Actual behavior
When using OceanBase as storage, the list_chunk sorting is abnormal. The
following is the SQL statement.
SELECT id, content_with_weight, important_kwd, question_kwd, img_id,
available_int, position_int, doc_type_kwd, create_timestamp_flt,
create_time, array_to_string(page_num_int, ',') AS page_num_int_sort,
array_to_string(top_int, ',') AS top_int_sort FROM
rag_store_284250730805059584 WHERE doc_id = '' AND kb_id IN ('') ORDER
BY page_num_int_sort ASC, top_int_sort ASC, create_timestamp_flt DESC
LIMIT 0, 20
<img width="1610" height="740" alt="image"
src="https://github.com/user-attachments/assets/84e14c30-a97f-4e8f-8c8c-6ccac915d97d"
/>
Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
## Summary
Fixes MinIO SSL/TLS support in two places: the MinIO **client**
connection and the **health check** used by the Admin/Service Health
dashboard. Both now respect the `secure` and `verify` settings from the
MinIO configuration.
Closes#13158Closes#13159
---
## Problem
**#13158 – MinIO client:** The client in `rag/utils/minio_conn.py` was
hardcoded with `secure=False`, so RAGFlow could not connect to MinIO
over HTTPS even when `secure: true` was set in config. There was also no
way to disable certificate verification for self-signed certs.
**#13159 – MinIO health check:** In `api/utils/health_utils.py`, the
MinIO liveness check always used `http://` for the health URL. When
MinIO was configured with SSL, the health check failed and the dashboard
showed "timeout" even though MinIO was reachable over HTTPS.
---
## Solution
### MinIO client (`rag/utils/minio_conn.py`)
- Read `MINIO.secure` (default `false`) and pass it into the `Minio()`
constructor so HTTPS is used when configured.
- Add `_build_minio_http_client()` that reads `MINIO.verify` (default
`true`). When `verify` is false, return an `urllib3.PoolManager` with
`cert_reqs=ssl.CERT_NONE` and pass it as `http_client` to `Minio()` so
self-signed certificates are accepted.
- Support string values for `secure` and `verify` (e.g. `"true"`,
`"false"`).
### MinIO health check (`api/utils/health_utils.py`)
- Add `_minio_scheme_and_verify()` to derive URL scheme (http/https) and
the `verify` flag from `MINIO.secure` and `MINIO.verify`.
- Update `check_minio_alive()` to use the correct scheme, pass `verify`
into `requests.get(..., verify=verify)`, and use `timeout=10`.
### Config template (`docker/service_conf.yaml.template`)
- Add commented optional MinIO keys `secure` and `verify` (and env vars
`MINIO_SECURE`, `MINIO_VERIFY`) so deployers know they can enable HTTPS
and optional cert verification.
### Tests
- **`test/unit_test/utils/test_health_utils_minio.py`** – Tests for
`_minio_scheme_and_verify()` and `check_minio_alive()` (scheme, verify,
status codes, timeout, errors).
- **`test/unit_test/utils/test_minio_conn_ssl.py`** – Tests for
`_build_minio_http_client()` (verify true/false/missing, string values,
`CERT_NONE` when verify is false).
---
## Testing
- Unit tests added/updated as above; run with the project's test runner.
- Manually: configure MinIO with HTTPS and `secure: true` (and
optionally `verify: false` for self-signed); confirm client operations
work and the Service Health dashboard shows MinIO as alive instead of
timeout.
### What problem does this PR solve?
Refact: switch from oogle-generativeai to google-genai #13132
Refact: commnet out unused pywencai.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fix error when extracting the graph.
A string is expected, but a tuple was provided.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
When match_expressions contains coroutine objects (from GraphRAG's
Dealer.get_vector()), the code cannot identify this type because it only
checks for MatchTextExpr, MatchDenseExpr, or FusionExpr.
As a result:
score_func remains initialized as an empty string ""
This empty string is appended to the output list
The output list is passed to Infinity SDK's table_instance.output()
method
Infinity's SQL parser (via sqlglot) fails to parse the empty string,
throwing a ParseError
### What problem does this PR solve?
Fix parameter of calling self.dataStore.get() and warning info during
parser
https://github.com/infiniflow/ragflow/issues/13036
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Adjust highlight parsing, add row-count SQL override, tweak retrieval
thresholding, and update tests with engine-aware skips/utilities.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)