Commit Graph

291 Commits

Author SHA1 Message Date
387b0b27c4 feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527)
### What problem does this PR solve?

This PR adds support for parsing PDFs through an external Docling
server, so RAGFlow can connect to remote `docling serve` deployments
instead of relying only on local in-process Docling.

It addresses the feature request in
[#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns
with the external-server usage pattern already used by MinerU.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### What is changed?

- Add external Docling server support in `DoclingParser`:
  - Use `DOCLING_SERVER_URL` to enable remote parsing mode.
- Try `POST /v1/convert/source` first, and fallback to
`/v1alpha/convert/source`.
- Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not
set.
- Wire Docling env settings into parser invocation paths:
  - `rag/app/naive.py`
  - `rag/flow/parser/parser.py`
- Add Docling env hints in constants and update docs:
  - `docs/guides/dataset/select_pdf_parser.md`
  - `docs/guides/agent/agent_component_reference/parser.md`
  - `docs/faq.mdx`

### Why this approach?

This keeps the change focused on one issue and one capability (external
Docling connectivity), without introducing unrelated provider-model
plumbing.

### Validation

- Static checks:
  - `python -m py_compile` on changed Python files
  - `python -m ruff check` on changed Python files
- Functional checks:
  - Remote v1 endpoint path works
  - v1alpha fallback works
  - Local Docling path remains available when server URL is unset

### Related links

- Feature request: [Support external Docling server (issue
#13426)](https://github.com/infiniflow/ragflow/issues/13426)
- Compare view for this branch:
[main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1)

##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)
2026-03-12 17:09:03 +08:00
d0ca388bec Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329)
## Summary
This PR is the direct successor to the previous `docx` lazy-loading
implementation. It addresses the technical debt intentionally left out
in the last PR by fully migrating the `qa` and `manual` parsing
strategies to the new lazy-loading model.

Additionally, this PR comprehensively refactors the underlying `docx`
parsing pipeline to eliminate significant code redundancy and introduces
robust fallback mechanisms to handle completely corrupted image streams
safely.


## What's Changed

* **Centralized Abstraction (`docx_parser.py`)**: Moved the
`get_picture` extraction logic up to the `RAGFlowDocxParser` base class.
Previously, `naive`, `qa`, and `manual` parsers maintained separate,
redundant copies of this method. All downstream strategies now natively
gather raw blobs and return `LazyDocxImage` objects automatically.
* **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge
cases where `python-docx` encounters critically malformed magic headers.
Implemented an explicit `try-except` structure that safely intercepts
`UnrecognizedImageError` (and similar exceptions) and seamlessly falls
back to retrieving the raw binary via `getattr(related_part, "blob",
None)`, preventing parser crashes on damaged documents.

* **Legacy Code & Redundancy Purge**:
* Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`,
and `manual.py`.
* Removed the standalone, immediate-decoding `concat_img` method in
`manual.py`. It has been completely replaced by the globally unified,
lazy-loading-compatible `rag.nlp.concat_img`.
* Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception
packages) across all updated strategy files.

## Scope
To keep this PR focused, I have restricted these changes strictly to the
unification of `docx` extraction logic and the lazy-load migration of
`qa` and `manual`.

## Validation & Testing
I've tested this to ensure no regressions and validated the fallback
logic:

* **Output Consistency**: Compared identical `.docx` inputs using `qa`
and `manual` strategies before and after this branch: chunk counts,
extracted text, table HTML, and attached images match perfectly.
* **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory
usage when processing image-dense documents through the `qa` and
`manual` pipelines, bringing them up to parity with the `naive`
strategy's performance gains.

## Breaking Changes
* None.
2026-03-11 10:00:07 +08:00
32d31284cc Fix: upgrade pypdf to 6.7.5 and migrate from deprecated pypdf2 to fix CVE-2026-28804 and CVE-2023-36464 (#13454)
### What problem does this PR solve?

This PR addresses security vulnerabilities in PDF processing
dependencies identified by Trivy security scan:

1. CVE-2026-28804 (MEDIUM): pypdf 6.7.4 vulnerable to inefficient
decoding of ASCIIHexDecode streams
2. CVE-2023-36464 (MEDIUM): pypdf2 3.0.1 susceptible to infinite loop
when parsing malformed comments

Since pypdf2 is deprecated with no available fixes, this PR migrates all
pypdf2 usage to the actively maintained pypdf library (version 6.7.5),
which resolves
both vulnerabilities.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-09 12:06:00 +08:00
62cb292635 Feat/tenant model (#13072)
### What problem does this PR solve?

Add id for table tenant_llm and apply in LLMBundle.

### Type of change

- [x] Refactoring

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
2026-03-05 17:27:17 +08:00
c99b53064d fix: remove company info from resume_summary to prevent over-retrieval (#13358)
### What problem does this PR solve?

Problem: When searching for a specific company name like(Daofeng
Technology), the search would incorrectly return unrelated resumes
containing generic terms like (Technology) in their company names

Root Cause: The `corporation_name_tks` field was included in the
identity fields that are redundantly written to every chunk. This caused
common words like "科技" to match across all chunks, leading to
over-retrieval of irrelevant resumes.

Solution: Remove `corporation_name_tks` from the `_IDENTITY_FIELDS`
list. Company information is still preserved in the "Work Overview"
chunk where it belongs, allowing proper company-based searches while
preventing false positives from generic terms.

---------

Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Liu An <asiro@qq.com>
2026-03-04 19:24:49 +08:00
93d621a666 Fix: Correct PDF chunking parameter name in naive (#13357)
### What problem does this PR solve?

Fix: Correct PDF chunking parameter name in naive #13325

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-04 11:51:10 +08:00
48755a3352 Fix: (resume) Cross-verify project experience and work experience, and remove duplicate text (#13323)
Cross-verify project experience and work experience, and remove
duplicate text

---------

Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
2026-03-03 14:53:46 +08:00
707de2461a Fix: use async_chat with sync wrapper in resume parser (#13320)
### What problem does this PR solve?

Fix AttributeError when calling llm.chat() in resume parser. LLMBundle
only has async_chat method, not chat method. Use `_run_coroutine_sync`
wrapper to call async_chat synchronously.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-02 19:51:06 +08:00
f8c91e8854 Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255)
Core optimizations (refer to arXiv:2510.09722):

1. PDF text fusion: Metadata + OCR dual-path extraction and fusion

2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical
sorting + line number indexing

3. Parallel task decomposition: Basic information/work
experience/educational background three-way parallel LLM extraction

4. Index pointer mechanism: LLM returns a range of line numbers instead
of generating the full text, reducing the illusion of full text.

---------

Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>
Co-authored-by: Aron.Yao <yaowei@192.168.1.68>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-03-02 19:05:50 +08:00
8ba66dd62a Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234)
## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with
the General (naive) chunk method, the user-configured `chunk_token_num`
is **unconditionally overwritten to 0** at
[rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859),
effectively disabling chunk merging regardless of what the user sets in
the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI,
expecting small parser blocks to be merged into larger chunks. However,
this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output
block becomes its own chunk. For short documents (e.g. a 3-page PDF fund
factsheet parsed by MinerU), this produces **47 tiny chunks** — some as
small as 11 characters (`"July 2025"`) or 15 characters (`"CIES
Eligible"`).

This severely degrades retrieval quality: vector embeddings of such
short fragments have minimal semantic value, and keyword search produces
excessive noise.

### Fix

Only apply the `chunk_token_num = 0` override when the user has **not**
explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

This preserves the original default behavior (no merging) while
respecting the user's explicit configuration.

### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify
chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) →
verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior
(not affected by this code path)
- [ ] Repeat with docling/paddleocr if available
2026-03-02 15:31:40 +08:00
21bc1ab7ec Feature rtl support (#13118)
### What problem does this PR solve?

This PR adds comprehensive **Right-to-Left (RTL) language support**,
primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu,
etc.).

Previously, RTL content had multiple rendering issues:

- Incorrect sentence splitting for Arabic punctuation in citation logic
- Misaligned text in chat messages and markdown components  
- Improper positioning of blockquotes and “think” sections  
- Incorrect table alignment  
- Citation placement ambiguity in RTL prompts  
- UI layout inconsistencies when mixing LTR and RTL text  

This PR introduces backend and frontend improvements to properly detect,
render, and style RTL content while preserving existing LTR behavior.

#### Backend
- Updated sentence boundary regex in `rag/nlp/search.py` to include
Arabic punctuation:
  - `،` (comma)
  - `؛` (semicolon)
  - `؟` (question mark)
  - `۔` (Arabic full stop)
- Ensures citation insertion works correctly in RTL sentences.
- Updated citation prompt instructions to clarify citation placement
rules for RTL languages.

#### Frontend
- Introduced a new utility: `text-direction.ts`
  - Detects text direction based on Unicode ranges.
  - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts.
  - Provides `getDirAttribute()` for automatic `dir` assignment.

- Applied dynamic `dir` attributes across:
  - Markdown rendering
  - Chat messages
  - Search results
  - Tables
  - Hover cards and reference popovers

- Added proper RTL styling in LESS:
  - Text alignment adjustments
  - Blockquote border flipping
  - Section indentation correction
  - Table direction switching
  - Use of `<bdi>` for figure labels to prevent bidirectional conflicts

#### DevOps / Environment
- Added Windows backend launch script with retry handling.
- Updated dependency metadata.
- Adjusted development-only React debugging behavior.

---

### Type of change

- [x] Bug Fix (non-breaking change which fixes RTL rendering and
citation issues)
- [x] New Feature (non-breaking change which adds RTL detection and
dynamic direction handling)

---------

Co-authored-by: 6ba3i <isbaaoui09@gmail.com>
Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com>
Co-authored-by: Liu An <asiro@qq.com>
2026-03-02 13:03:44 +08:00
fa71f8d0c7 refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.

This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.

**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:

* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.

**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.

**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.

**Validation & Testing**
I've tested this to ensure no regressions:

* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.

**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
158503a1aa Feat: optimize ingestion pipeline with preprocess (#13211)
### What problem does this PR solve?

Feat: optimize ingestion pipeline with preprocess

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-02-26 10:24:13 +08:00
109441628b Fix: upload image files (#13071)
### What problem does this PR solve?

Fix: upload image files

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-11 09:47:33 +08:00
4bc622b409 Fix parameter of calling self.dataStore.get() and warning info during parser (#13068)
### What problem does this PR solve?

Fix parameter of calling self.dataStore.get() and warning info during
parser

https://github.com/infiniflow/ragflow/issues/13036

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-09 17:56:59 +08:00
yH
5333e764fc fix: optimize Excel row counting for files with abnormal max_row (#13018)
### What problem does this PR solve?

Some Excel files have abnormal `max_row` metadata (e.g.,
`max_row=1,048,534` with only 300 actual data rows). This causes:
- `row_number()` returns incorrect count, creating 350+ tasks instead of
1
- `list(ws.rows)` iterates through millions of empty rows, causing
system hang

This PR uses binary search to find the actual last row with data.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-06 14:43:52 +08:00
11703d957d Refactor: Improve Picture.py resource usage (#13011)
### What problem does this PR solve?

Improve Picture.py resource usage

### Type of change


- [x] Refactoring
2026-02-06 09:50:53 +08:00
6c9ca45b30 Refactor: improve close for presentation (#12957)
### What problem does this PR solve?

improve close for presentation

### Type of change

- [x] Refactoring
2026-02-03 10:24:27 +08:00
1a2d69edc4 feat: Implement legacy .ppt parsing via Tika (alternative to Aspose) (#12932)
## What problem does this PR solve?
This PR implements parsing support for legacy PowerPoint files (`.ppt`,
97-2003 format).
Currently, parsing these files fails because `python-pptx` **natively
lacks support** for the legacy OLE2 binary format.

## **Context:**
I originally using `aspose-slides` for this purpose. However, since
`aspose-slides` is **no longer a project dependency**, I implemented a
fallback mechanism using the existing `tika-server` to ensure
compatibility and stability.

## **Key Changes:**
- **Fallback Logic**: Modified `rag/app/presentation.py` to catch
`python-pptx` failures and automatically fall back to Tika parsing.
- **No New Dependencies**: Utilizes the `tika` service that is already
part of the RAGFlow stack.
- **Note**: Since Tika focuses on text extraction, this implementation
extracts text content but does not generate slide thumbnails .
## 🧪 Test / Verification Results

### 1. Before (The Issue)
I have verified the fix using a legacy `.ppt` file (`math(1).ppt`,
~8MB).
<img width="963" height="970" alt="image"
src="https://github.com/user-attachments/assets/468c4ba8-f90b-4d7b-b969-9c5f5e42c474"
/>

### 2. After (The Fix)
With this PR, the system detects the failure in python-pptx and
successfully falls back to Tika. The text is extracted correctly.
<img width="1467" height="1121" alt="image"
src="https://github.com/user-attachments/assets/fa0fed3b-b923-4c86-ba2c-24b3ce6ee7a6"
/>


**Type of change**
- [x] New Feature (non-breaking change which adds functionality)

Signed-off-by: evilhero <2278596667@qq.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-02-02 13:40:51 +08:00
23bdf25a1f feature:Add OceanBase Storage Support for Table Parser (#12923)
### What problem does this PR solve?

close #12770 

This PR adds OceanBase as a storage backend for the Table Parser. It
enables dynamic table schema storage via JSON and implements OceanBase
SQL execution for text-to-SQL retrieval.


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Changes
- Table Parser stores row data into `chunk_data` when doc engine is
OceanBase. (table.py)
- OceanBase table schema adds `chunk_data` JSON column and migrates if
needed.
- Implemented OceanBase `sql()` to execute text-to-SQL results.
(ob_conn.py)
- Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py)

### Test
1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`)
<img width="1290" height="783" alt="doc_engine_ob"
src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1"
/>

2. Upload an Excel file to Knowledge Base.(for test, we use as below)
<img width="786" height="930" alt="excel"
src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a"
/>

3. Choose **Table** as parsing method.
<img width="2550" height="1134" alt="parse_excel"
src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0"
/>

4.Ask a natural language query in chat.
<img width="2550" height="1134" alt="query"
src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91"
/>
2026-01-31 15:11:54 +08:00
f262d416fe Refa: remove aspose dependency. (#12910)
### Type of change

- [x] Refactoring
2026-01-30 14:06:19 +08:00
f1c2fac03e Refa: remove ppt image. (#12909)
### What problem does this PR solve?

remove `aspose`

### Type of change

- [x] Refactoring
2026-01-30 13:35:42 +08:00
32c0161ff1 Refa: Clean the folders. (#12890)
### Type of change

- [x] Refactoring
2026-01-29 14:23:26 +08:00
c8bd413e4c Fixed bug: Prevent 400 errors from Image2Text providers by skipping images smaller than 11px on any side during figure enhancement. (#12868)
### What problem does this PR solve?
During figure enhancement, some cropped figure images are extremely
small. Sending these to the Image2Text/VLM provider fails with a 400
invalid_parameter_error because the image width/height must

be >10px. This aborts the enhancement step. This PR adds a minimal size
guard to skip tiny crops and continue processing.
<img width="1084" height="494" alt="image"
src="https://github.com/user-attachments/assets/ad074270-94e6-4571-91c8-37df85212639"
/>

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-28 14:59:02 +08:00
f096917eeb Fix: overlap cannot be properly applied (#12828)
### What problem does this PR solve?

Overlap cannot be properly applied.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-27 12:43:01 +08:00
b40d639fdb Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541)
### What problem does this PR solve?

1) Create  dataset using table parser for infinity
2) Answer questions in chat using SQL

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-19 19:35:14 +08:00
678a4f959c Fix: skip internal bookmark references in DOCX parsing (#12604) (#12611)
### What problem does this PR solve?

Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks
(e.g., `#_文档目录`) cause a `KeyError` during parsing:

```
KeyError: "There is no item named 'word/#_文档目录' in the archive"
```

This happens because python-docx incorrectly tries to read internal
bookmark references as files from the ZIP archive. Internal bookmarks
are relationship targets starting with `#` and are not actual files.

This PR extends the existing `load_from_xml_v2` workaround (which
already handles `NULL` targets) to also skip relationship targets
starting with `#`.

Related upstream issue:
https://github.com/python-openxml/python-docx/issues/902

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=94194147
2026-01-14 19:08:46 +08:00
b226e06e2d refactor: remove debug print statements (#12534)
### What problem does this PR solve?

refactor: remove debug print statements

### Type of change

- [x] Refactoring
2026-01-09 19:23:50 +08:00
2e09db02f3 feat: add paddleocr parser (#12513)
### What problem does this PR solve?

Add PaddleOCR as a new PDF parser.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-09 17:48:45 +08:00
011bbe9556 Feat: support context window for docx (#12455)
### What problem does this PR solve?

Feat: support context window for docx

#12303

Done:
- [x] naive.py
- [x] one.py

TODO:
- [ ] book.py
- [ ] manual.py

Fix: incorrect image position
Fix: incorrect chunk type tag

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-01-07 15:08:17 +08:00
4cd4526492 Feat: PDF vision figure parser supports reading context (#12416)
### What problem does this PR solve?

PDF vision figure parser supports reading context.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-05 09:55:43 +08:00
52f91c2388 Refine: image/table context. (#12336)
### What problem does this PR solve?

#12303

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-30 20:24:27 +08:00
f0392e7501 Fix IDE warnings (#12315)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 15:04:09 +08:00
df3cbb9b9e Refactor code (#12305)
### What problem does this PR solve?

as title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 11:09:18 +08:00
37e4485415 feat: add MDX file support (#12261)
Feat: add MDX file support  #12057 
### What problem does this PR solve?

<img width="1055" height="270" alt="image"
src="https://github.com/user-attachments/assets/a0ab49f9-7806-41cd-8a96-f593591ab36b"
/>

The page states that MDX files are supported, but uploading fails with
the error: "x.mdx: This type of file has not been supported yet!"
<img width="381" height="110" alt="image"
src="https://github.com/user-attachments/assets/4bbb7d08-cb47-416a-95fc-bc90b90fcc39"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-29 12:54:31 +08:00
01f0ced1e6 Fix IDE warnings (#12281)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-29 12:01:18 +08:00
bd76b8ff1a Fix: Tika server upgrades. (#12073)
### What problem does this PR solve?

#12037

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-23 09:35:52 +08:00
b49eb6826b Feat: enhance Excel image extraction with vision-based descriptions (#12054)
### What problem does this PR solve?
issue:
[#11618](https://github.com/infiniflow/ragflow/issues/11618)
change:
enhance Excel image extraction with vision-based descriptions

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-22 10:17:44 +08:00
4dd8cdc38b task executor issues (#12006)
### What problem does this PR solve?

**Fixes #8706** - `InfinityException: TOO_MANY_CONNECTIONS` when running
multiple task executor workers

### Problem Description

When running RAGFlow with 8-16 task executor workers, most workers fail
to start properly. Checking logs revealed that workers were
stuck/hanging during Infinity connection initialization - only 1-2
workers would successfully register in Redis while the rest remained
blocked.

### Root Cause

The Infinity SDK `ConnectionPool` pre-allocates all connections in
`__init__`. With the default `max_size=32` and multiple workers (e.g.,
16), this creates 16×32=512 connections immediately on startup,
exceeding Infinity's default 128 connection limit. Workers hang while
waiting for connections that can never be established.

### Changes

1. **Prevent Infinity connection storm** (`rag/utils/infinity_conn.py`,
`rag/svr/task_executor.py`)
- Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since
operations are synchronous)
- Added staggered startup delay (2s per worker) to spread connection
initialization

2. **Handle None children_delimiter** (`rag/app/naive.py`)
   - Use `or ""` to handle explicitly set None values from parser config

3. **MinerU parser robustness** (`deepdoc/parser/mineru_parser.py`)
   - Use `.get()` for optional output fields that may be missing
- Fix DISCARDED block handling: change `pass` to `continue` to skip
discarded blocks entirely

### Why `max_size=4` is sufficient

| Workers | Pool Size | Total Connections | Infinity Limit |
|---------|-----------|-------------------|----------------|
| 16      | 32        | 512               | 128          |
| 16      | 4         | 64                | 128          |
| 32      | 4         | 128               | 128          |

- All RAGFlow operations are synchronous: `get_conn()` → operation →
`release_conn()`
- No parallel `docStoreConn` operations in the codebase
- Maximum 1-2 concurrent connections needed per worker; 4 provides
safety margin

### MinerU DISCARDED block bug

When MinerU returns blocks with `type: "discarded"` (headers, footers,
watermarks, page numbers, artifacts), the previous code used `pass`
which left the `section` variable undefined, causing:

- **UnboundLocalError** if DISCARDED is the first block
- **Duplicate content** if DISCARDED follows another block (stale value
from previous iteration)

**Root cause confirmed via MinerU source code:**

From
[`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14):
```python
class BlockType:
    DISCARDED = 'discarded'
    # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE
```

Per [MinerU
documentation](https://opendatalab.github.io/MinerU/reference/output_files/),
discarded blocks contain content that should be filtered out for clean
text extraction.

**Fix:** Changed `pass` to `continue` to skip discarded blocks entirely.

### Testing

- Verified all 16 workers now register successfully in Redis
- All workers heartbeating correctly
- Document parsing works as expected
- MinerU parsing with DISCARDED blocks no longer crashes

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: user210 <user210@rt>
2025-12-18 10:03:30 +08:00
672958a192 Fix: model not authorized (#12001)
### What problem does this PR solve?

Fix model not authorized. #11973.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-17 19:48:24 +08:00
8e4d011b15 Fix: parent-children chunking method. (#11997)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-12-17 16:50:36 +08:00
0e8b9588ba Fix error and format issue (#11975)
### What problem does this PR solve?

1. Fix error of book chunking.
2. Fix format issues.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-16 19:29:37 +08:00
49c74d08e8 Feature/mineru improvements (#11938)
我已在下面的评论中用中文重复说明。

### What problem does this PR solve?

## Summary
This PR enhances the MinerU document parser with additional
configuration options, giving users more control over PDF parsing
behavior and improving support for multilingual documents.

## Changes

### Backend (`deepdoc/parser/mineru_parser.py`)
- Added configurable parsing options:
- **Parse Method**: `auto`, `txt`, or `ocr` — allows users to choose the
extraction strategy
- **Formula Recognition**: Toggle for enabling/disabling formula
extraction (useful to disable for Cyrillic documents where it may cause
issues)
- **Table Recognition**: Toggle for enabling/disabling table extraction
- Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate
RAGFlow language settings to MinerU-compatible language codes for better
OCR accuracy
- Improved parser configuration handling to pass these options through
the processing pipeline

### Frontend (`web/`)
- Created new `MinerUOptionsFormField` component that conditionally
renders when MinerU is selected as the layout recognition engine
- Added UI controls for:
  - Parse method selection (dropdown)
  - Formula recognition toggle (switch)
  - Table recognition toggle (switch)
- Added i18n translations for English and Chinese
- Integrated the options into both the dataset creation dialog and
dataset settings page

### Integration
- Updated `rag/app/naive.py` to forward MinerU options to the parser
- Updated task service to handle the new configuration parameters

## Why
MinerU is a powerful document parser, but the default settings don't
work well for all document types. This PR allows users to:
1. Choose the best parsing method for their documents
2. Disable formula recognition for Cyrillic/non-Latin scripts where it
causes issues
3. Control table extraction based on document needs
4. Benefit from automatic language detection for better OCR results

## Testing
- [x] Tested MinerU parsing with different parse methods
- [x] Verified UI renders correctly when MinerU is selected/deselected
- [x] Confirmed settings persist correctly in dataset configuration

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---------

Co-authored-by: user210 <user210@rt>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-16 13:15:25 +08:00
7d23c3aed0 Fix: presentation parsing & Embedding encode exception handling (#11933)
### What problem does this PR solve?

Fix: presentation parsing #11920
Fix: Embeddin encode exception handling
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-13 11:37:42 +08:00
e9710b7aa9 Refa: treat MinerU as an OCR model 2 (#11905)
### What problem does this PR solve?

Treat MinerU as an OCR model 2. #11903

### Type of change

- [x] Refactoring
2025-12-11 17:33:12 +08:00
ab4b62031f Fix:csv parse in Table (#11870)
### What problem does this PR solve?

change:
csv parse in Table

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-10 16:44:06 +08:00
ca2d6f3301 Fix: duplicate output by async_chat_streamly (#11842)
### What problem does this PR solve?

Fix: duplicate output by async_chat_streamly
Refact: revert manual modification

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-09 19:21:52 +08:00
a94b3b9df2 Refa: treat MinerU as an OCR model (#11849)
### What problem does this PR solve?

 Treat MinerU as an OCR model.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-12-09 18:54:14 +08:00
c51e6b2a58 Refa: migrate CV model chat to Async (#11828)
### What problem does this PR solve?

Migrate CV model chat to Async. #11750

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2025-12-09 13:08:37 +08:00
09a3854ed8 Fix: chunk method error. (#11807)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-08 14:28:23 +08:00