Commit Graph

280 Commits

Author SHA1 Message Date
fa71f8d0c7 refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.

This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.

**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:

* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.
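
The lazy-loading idea described above can be sketched in a few lines. This is a minimal illustration, not the PR's `LazyDocxImage`: the class name `LazyBlob`, the injected `decoder` callable, and the `decode_count` counter are all hypothetical, and a stand-in decoder is used so the sketch runs without Pillow.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyBlob(Generic[T]):
    """Hold raw bytes and decode them only when a consumer asks (hypothetical)."""

    def __init__(self, blob: bytes, decoder: Callable[[bytes], T]):
        self._blob: Optional[bytes] = blob
        self._decoder = decoder
        self.decode_count = 0  # exposed only to make the laziness observable

    def decode(self) -> T:
        """Return a freshly decoded object; nothing is cached or shared."""
        if self._blob is None:
            raise ValueError("image has been closed")
        self.decode_count += 1
        return self._decoder(self._blob)

    def close(self) -> None:
        """Drop the raw bytes so they can be garbage collected."""
        self._blob = None


# Stand-in decoder. With Pillow installed, one could pass something like
# `lambda b: Image.open(io.BytesIO(b))` instead (an assumption, not the PR's code).
lazy = LazyBlob(b"\x89PNG...", decoder=lambda b: len(b))
```

Constructing `lazy` does no decoding; each `decode()` call produces an independent result, which mirrors the "detached" access the PR describes.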

**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.

**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.

**Validation & Testing**
I've tested this to ensure no regressions:

* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.

**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
158503a1aa Feat: optimize ingestion pipeline with preprocess (#13211)
### What problem does this PR solve?

Feat: optimize ingestion pipeline with preprocess

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-02-26 10:24:13 +08:00
109441628b Fix: upload image files (#13071)
### What problem does this PR solve?

Fix: upload image files

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-11 09:47:33 +08:00
4bc622b409 Fix parameter of calling self.dataStore.get() and warning info during parser (#13068)
### What problem does this PR solve?

Fix parameter of calling self.dataStore.get() and warning info during
parser

https://github.com/infiniflow/ragflow/issues/13036

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-02-09 17:56:59 +08:00
5333e764fc fix: optimize Excel row counting for files with abnormal max_row (#13018)
### What problem does this PR solve?

Some Excel files have abnormal `max_row` metadata (e.g.,
`max_row=1,048,534` with only 300 actual data rows). This causes:
- `row_number()` returns incorrect count, creating 350+ tasks instead of
1
- `list(ws.rows)` iterates through millions of empty rows, causing
system hang

This PR uses binary search to find the actual last row with data.
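
A rough sketch of such a binary search follows. It assumes data rows are contiguous from row 1, so a "row has data" predicate is true up to some row and false afterwards; the function name and the `has_data` callable are hypothetical and stand in for whatever worksheet access the PR actually uses.

```python
from typing import Callable


def last_data_row(has_data: Callable[[int], bool], max_row: int) -> int:
    """Return the largest 1-based row index with data, or 0 if there is none.

    Assumes has_data(r) is True for every row up to the last data row and
    False for every row after it.
    """
    lo, hi = 0, max_row  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if has_data(mid):
            lo = mid  # the last data row is at mid or later
        else:
            hi = mid - 1  # the last data row is before mid
    return lo


# 300 real rows hidden behind an abnormal max_row, as in the PR description.
actual_rows = last_data_row(lambda row: row <= 300, max_row=1_048_534)
```

With this shape, roughly 20 probes are enough for a million-row `max_row`, instead of iterating every empty row.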

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-06 14:43:52 +08:00
11703d957d Refactor: Improve Picture.py resource usage (#13011)
### What problem does this PR solve?

Improve Picture.py resource usage

### Type of change


- [x] Refactoring
2026-02-06 09:50:53 +08:00
6c9ca45b30 Refactor: improve close for presentation (#12957)
### What problem does this PR solve?

improve close for presentation

### Type of change

- [x] Refactoring
2026-02-03 10:24:27 +08:00
1a2d69edc4 feat: Implement legacy .ppt parsing via Tika (alternative to Aspose) (#12932)
## What problem does this PR solve?
This PR implements parsing support for legacy PowerPoint files (`.ppt`,
97-2003 format).
Currently, parsing these files fails because `python-pptx` **natively
lacks support** for the legacy OLE2 binary format.

## **Context:**
I was originally using `aspose-slides` for this purpose. However, since
`aspose-slides` is **no longer a project dependency**, I implemented a
fallback mechanism using the existing `tika-server` to ensure
compatibility and stability.

## **Key Changes:**
- **Fallback Logic**: Modified `rag/app/presentation.py` to catch
`python-pptx` failures and automatically fall back to Tika parsing.
- **No New Dependencies**: Utilizes the `tika` service that is already
part of the RAGFlow stack.
- **Note**: Since Tika focuses on text extraction, this implementation
extracts text content but does not generate slide thumbnails.
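
The fallback logic can be pictured with a small sketch. The function and parser names here are hypothetical stand-ins; the real change lives in `rag/app/presentation.py` and calls `python-pptx` and Tika rather than these toy callables.

```python
from typing import Callable, List


def parse_with_fallback(
    data: bytes,
    primary: Callable[[bytes], List[str]],
    fallback: Callable[[bytes], List[str]],
) -> List[str]:
    """Try the primary parser; if it raises, use the fallback parser."""
    try:
        return primary(data)
    except Exception:
        # The primary parser cannot read this input (e.g. a legacy format).
        return fallback(data)


def pptx_only(data: bytes) -> List[str]:
    """Stand-in for a parser that only understands ZIP-based .pptx files."""
    if not data.startswith(b"PK"):
        raise ValueError("not a .pptx package")
    return ["slide text from primary parser"]


def text_extractor(data: bytes) -> List[str]:
    """Stand-in for a text-only extractor such as a Tika call."""
    return ["text from fallback parser"]


legacy_result = parse_with_fallback(b"\xd0\xcf\x11\xe0", pptx_only, text_extractor)
modern_result = parse_with_fallback(b"PK\x03\x04", pptx_only, text_extractor)
```
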
## 🧪 Test / Verification Results

### 1. Before (The Issue)
I have verified the fix using a legacy `.ppt` file (`math(1).ppt`,
~8MB).
<img width="963" height="970" alt="image"
src="https://github.com/user-attachments/assets/468c4ba8-f90b-4d7b-b969-9c5f5e42c474"
/>

### 2. After (The Fix)
With this PR, the system detects the failure in python-pptx and
successfully falls back to Tika. The text is extracted correctly.
<img width="1467" height="1121" alt="image"
src="https://github.com/user-attachments/assets/fa0fed3b-b923-4c86-ba2c-24b3ce6ee7a6"
/>


**Type of change**
- [x] New Feature (non-breaking change which adds functionality)

Signed-off-by: evilhero <2278596667@qq.com>
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-02-02 13:40:51 +08:00
23bdf25a1f feature:Add OceanBase Storage Support for Table Parser (#12923)
### What problem does this PR solve?

close #12770 

This PR adds OceanBase as a storage backend for the Table Parser. It
enables dynamic table schema storage via JSON and implements OceanBase
SQL execution for text-to-SQL retrieval.


### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Changes
- Table Parser stores row data into `chunk_data` when doc engine is
OceanBase. (table.py)
- OceanBase table schema adds `chunk_data` JSON column and migrates if
needed.
- Implemented OceanBase `sql()` to execute text-to-SQL results.
(ob_conn.py)
- Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py)

### Test
1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`)
<img width="1290" height="783" alt="doc_engine_ob"
src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1"
/>

2. Upload an Excel file to the Knowledge Base (for this test, we used the file below).
<img width="786" height="930" alt="excel"
src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a"
/>

3. Choose **Table** as parsing method.
<img width="2550" height="1134" alt="parse_excel"
src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0"
/>

4. Ask a natural language query in chat.
<img width="2550" height="1134" alt="query"
src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91"
/>
2026-01-31 15:11:54 +08:00
f262d416fe Refa: remove aspose dependency. (#12910)
### Type of change

- [x] Refactoring
2026-01-30 14:06:19 +08:00
f1c2fac03e Refa: remove ppt image. (#12909)
### What problem does this PR solve?

remove `aspose`

### Type of change

- [x] Refactoring
2026-01-30 13:35:42 +08:00
32c0161ff1 Refa: Clean the folders. (#12890)
### Type of change

- [x] Refactoring
2026-01-29 14:23:26 +08:00
c8bd413e4c Fixed bug: Prevent 400 errors from Image2Text providers by skipping images smaller than 11px on any side during figure enhancement. (#12868)
### What problem does this PR solve?
During figure enhancement, some cropped figure images are extremely
small. Sending these to the Image2Text/VLM provider fails with a 400
`invalid_parameter_error` because the image width and height must both
be >10px, which aborts the enhancement step. This PR adds a minimal size
guard to skip tiny crops and continue processing.
<img width="1084" height="494" alt="image"
src="https://github.com/user-attachments/assets/ad074270-94e6-4571-91c8-37df85212639"
/>
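
A size guard of this kind might look like the sketch below. The constant and function names are hypothetical; only the 11px threshold comes from the PR title.

```python
from typing import List, Tuple

MIN_SIDE_PX = 11  # from the PR title: skip images smaller than 11px on any side


def is_large_enough(width: int, height: int, min_side: int = MIN_SIDE_PX) -> bool:
    """True when both sides meet the provider's minimum size."""
    return width >= min_side and height >= min_side


def filter_crops(sizes: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the crops that are safe to send for description."""
    return [(w, h) for (w, h) in sizes if is_large_enough(w, h)]


kept = filter_crops([(5, 200), (11, 11), (10, 300), (640, 480)])
```

Tiny crops are dropped before the provider call, so one undersized figure no longer aborts the whole enhancement step.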

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-28 14:59:02 +08:00
f096917eeb Fix: overlap cannot be properly applied (#12828)
### What problem does this PR solve?

Overlap cannot be properly applied.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-27 12:43:01 +08:00
b40d639fdb Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541)
### What problem does this PR solve?

1) Create a dataset using the table parser for Infinity
2) Answer questions in chat using SQL

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-19 19:35:14 +08:00
678a4f959c Fix: skip internal bookmark references in DOCX parsing (#12604) (#12611)
### What problem does this PR solve?

Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks
(e.g., `#_文档目录`) cause a `KeyError` during parsing:

```
KeyError: "There is no item named 'word/#_文档目录' in the archive"
```

This happens because python-docx incorrectly tries to read internal
bookmark references as files from the ZIP archive. Internal bookmarks
are relationship targets starting with `#` and are not actual files.

This PR extends the existing `load_from_xml_v2` workaround (which
already handles `NULL` targets) to also skip relationship targets
starting with `#`.
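
The filtering rule can be sketched as a plain predicate. This is an illustration only: the helper names are hypothetical, and the exact set of `NULL` spellings the existing workaround checks is an assumption.

```python
from typing import Iterable, List


def is_loadable_target(target_ref: str) -> bool:
    """Return False for relationship targets that are not files in the archive."""
    if target_ref in ("NULL", "../NULL"):  # assumed spellings of the NULL target
        return False
    if target_ref.startswith("#"):  # internal bookmark, e.g. "#_文档目录"
        return False
    return True


def loadable_targets(target_refs: Iterable[str]) -> List[str]:
    """Drop targets that would otherwise be looked up as ZIP members."""
    return [ref for ref in target_refs if is_loadable_target(ref)]


targets = loadable_targets(["media/image1.png", "#_文档目录", "NULL"])
```
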

Related upstream issue:
https://github.com/python-openxml/python-docx/issues/902

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=94194147
2026-01-14 19:08:46 +08:00
b226e06e2d refactor: remove debug print statements (#12534)
### What problem does this PR solve?

refactor: remove debug print statements

### Type of change

- [x] Refactoring
2026-01-09 19:23:50 +08:00
2e09db02f3 feat: add paddleocr parser (#12513)
### What problem does this PR solve?

Add PaddleOCR as a new PDF parser.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-09 17:48:45 +08:00
011bbe9556 Feat: support context window for docx (#12455)
### What problem does this PR solve?

Feat: support context window for docx

#12303

Done:
- [x] naive.py
- [x] one.py

TODO:
- [ ] book.py
- [ ] manual.py

Fix: incorrect image position
Fix: incorrect chunk type tag

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-01-07 15:08:17 +08:00
4cd4526492 Feat: PDF vision figure parser supports reading context (#12416)
### What problem does this PR solve?

PDF vision figure parser supports reading context.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-05 09:55:43 +08:00
52f91c2388 Refine: image/table context. (#12336)
### What problem does this PR solve?

#12303

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-30 20:24:27 +08:00
f0392e7501 Fix IDE warnings (#12315)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 15:04:09 +08:00
df3cbb9b9e Refactor code (#12305)
### What problem does this PR solve?

as title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 11:09:18 +08:00
37e4485415 feat: add MDX file support (#12261)
Feat: add MDX file support  #12057 
### What problem does this PR solve?

<img width="1055" height="270" alt="image"
src="https://github.com/user-attachments/assets/a0ab49f9-7806-41cd-8a96-f593591ab36b"
/>

The page states that MDX files are supported, but uploading fails with
the error: "x.mdx: This type of file has not been supported yet!"
<img width="381" height="110" alt="image"
src="https://github.com/user-attachments/assets/4bbb7d08-cb47-416a-95fc-bc90b90fcc39"
/>


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-29 12:54:31 +08:00
01f0ced1e6 Fix IDE warnings (#12281)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-29 12:01:18 +08:00
bd76b8ff1a Fix: Tika server upgrades. (#12073)
### What problem does this PR solve?

#12037

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-23 09:35:52 +08:00
b49eb6826b Feat: enhance Excel image extraction with vision-based descriptions (#12054)
### What problem does this PR solve?
issue:
[#11618](https://github.com/infiniflow/ragflow/issues/11618)
change:
enhance Excel image extraction with vision-based descriptions

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-22 10:17:44 +08:00
4dd8cdc38b task executor issues (#12006)
### What problem does this PR solve?

**Fixes #8706** - `InfinityException: TOO_MANY_CONNECTIONS` when running
multiple task executor workers

### Problem Description

When running RAGFlow with 8-16 task executor workers, most workers fail
to start properly. Checking logs revealed that workers were
stuck/hanging during Infinity connection initialization - only 1-2
workers would successfully register in Redis while the rest remained
blocked.

### Root Cause

The Infinity SDK `ConnectionPool` pre-allocates all connections in
`__init__`. With the default `max_size=32` and multiple workers (e.g.,
16), this creates 16×32=512 connections immediately on startup,
exceeding Infinity's default 128 connection limit. Workers hang while
waiting for connections that can never be established.

### Changes

1. **Prevent Infinity connection storm** (`rag/utils/infinity_conn.py`,
   `rag/svr/task_executor.py`)
   - Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since
     operations are synchronous)
   - Added staggered startup delay (2s per worker) to spread connection
     initialization

2. **Handle None children_delimiter** (`rag/app/naive.py`)
   - Use `or ""` to handle explicitly set None values from parser config

3. **MinerU parser robustness** (`deepdoc/parser/mineru_parser.py`)
   - Use `.get()` for optional output fields that may be missing
   - Fix DISCARDED block handling: change `pass` to `continue` to skip
     discarded blocks entirely
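
The `or ""` detail is easy to miss, so here is a small illustration. The dictionary contents are made up for the example; the point is that a default passed to `dict.get` applies only when the key is absent, not when it is present with a value of `None`.

```python
# A parser config where the key exists but was explicitly set to None.
parser_config = {"children_delimiter": None}

# The default is ignored because the key is present: the result is still None.
delimiter_with_default = parser_config.get("children_delimiter", "")

# `or ""` also replaces an explicit None (and any other falsy value) with "".
delimiter_with_or = parser_config.get("children_delimiter") or ""

# For a key that is missing entirely, both spellings give the same result.
missing_with_default = parser_config.get("layout", "")
missing_with_or = parser_config.get("layout") or ""
```
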

### Why `max_size=4` is sufficient

| Workers | Pool Size | Total Connections | Infinity Limit |
|---------|-----------|-------------------|----------------|
| 16      | 32        | 512               | 128          |
| 16      | 4         | 64                | 128          |
| 32      | 4         | 128               | 128          |

- All RAGFlow operations are synchronous: `get_conn()` → operation →
`release_conn()`
- No parallel `docStoreConn` operations in the codebase
- Maximum 1-2 concurrent connections needed per worker; 4 provides
safety margin
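
The arithmetic behind the table can be checked directly. The helper name below is hypothetical; the worker counts, pool sizes, and the 128-connection limit are the figures quoted in this description.

```python
INFINITY_LIMIT = 128  # Infinity's default connection limit, per the description


def total_connections(workers: int, pool_size: int) -> int:
    """Connections opened at startup when every pool pre-allocates fully."""
    return workers * pool_size


# (workers, pool_size) rows from the table above.
scenarios = [(16, 32), (16, 4), (32, 4)]
report = [
    (w, p, total_connections(w, p), total_connections(w, p) <= INFINITY_LIMIT)
    for (w, p) in scenarios
]
```

Only the first scenario (16 workers with a pool of 32) exceeds the limit; with `max_size=4`, even 32 workers sit exactly at 128.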

### MinerU DISCARDED block bug

When MinerU returns blocks with `type: "discarded"` (headers, footers,
watermarks, page numbers, artifacts), the previous code used `pass`
which left the `section` variable undefined, causing:

- **UnboundLocalError** if DISCARDED is the first block
- **Duplicate content** if DISCARDED follows another block (stale value
from previous iteration)

**Root cause confirmed via MinerU source code:**

From
[`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14):
```python
class BlockType:
    DISCARDED = 'discarded'
    # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE
```

Per [MinerU
documentation](https://opendatalab.github.io/MinerU/reference/output_files/),
discarded blocks contain content that should be filtered out for clean
text extraction.

**Fix:** Changed `pass` to `continue` to skip discarded blocks entirely.
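
The two failure modes are straightforward to reproduce in isolation. The loop below is a simplified stand-in for the parser code, with hypothetical function names and block dictionaries; it only mirrors the `pass` versus `continue` control flow.

```python
from typing import Dict, List


def collect_with_pass(blocks: List[Dict[str, str]]) -> List[str]:
    """Buggy shape: `pass` falls through to code that reads `section`."""
    out = []
    for block in blocks:
        if block["type"] == "discarded":
            pass  # does nothing; `section` is unset or left over from before
        else:
            section = block["text"]
        out.append(section)
    return out


def collect_with_continue(blocks: List[Dict[str, str]]) -> List[str]:
    """Fixed shape: `continue` skips the rest of the loop body."""
    out = []
    for block in blocks:
        if block["type"] == "discarded":
            continue
        section = block["text"]
        out.append(section)
    return out


blocks = [
    {"type": "text", "text": "Intro"},
    {"type": "discarded", "text": "Page 1"},
    {"type": "text", "text": "Body"},
]
```

With `blocks` as given, the `pass` version emits `"Intro"` twice; if the list starts with the discarded block, it raises `UnboundLocalError` instead.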

### Testing

- Verified all 16 workers now register successfully in Redis
- All workers heartbeating correctly
- Document parsing works as expected
- MinerU parsing with DISCARDED blocks no longer crashes

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: user210 <user210@rt>
2025-12-18 10:03:30 +08:00
672958a192 Fix: model not authorized (#12001)
### What problem does this PR solve?

Fix model not authorized. #11973.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-17 19:48:24 +08:00
8e4d011b15 Fix: parent-children chunking method. (#11997)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2025-12-17 16:50:36 +08:00
0e8b9588ba Fix error and format issue (#11975)
### What problem does this PR solve?

1. Fix error of book chunking.
2. Fix format issues.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-16 19:29:37 +08:00
49c74d08e8 Feature/mineru improvements (#11938)
I have repeated this description in Chinese in the comments below.

### What problem does this PR solve?

## Summary
This PR enhances the MinerU document parser with additional
configuration options, giving users more control over PDF parsing
behavior and improving support for multilingual documents.

## Changes

### Backend (`deepdoc/parser/mineru_parser.py`)
- Added configurable parsing options:
  - **Parse Method**: `auto`, `txt`, or `ocr`, which lets users choose the
    extraction strategy
  - **Formula Recognition**: Toggle for enabling/disabling formula
    extraction (useful to disable for Cyrillic documents where it may cause
    issues)
  - **Table Recognition**: Toggle for enabling/disabling table extraction
- Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate
  RAGFlow language settings to MinerU-compatible language codes for better
  OCR accuracy
- Improved parser configuration handling to pass these options through
  the processing pipeline
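
A language mapping of this kind is essentially a dictionary lookup with a default. The sketch below is illustrative only: the entries, the default code, and the helper name are assumptions, not the contents of the PR's `LANGUAGE_TO_MINERU_MAP`.

```python
# Illustrative entries only; the real table in the PR may differ.
EXAMPLE_LANGUAGE_MAP = {
    "English": "en",
    "Chinese": "ch",
}


def to_parser_language(language: str, default: str = "en") -> str:
    """Translate a UI language name to a parser language code (hypothetical)."""
    return EXAMPLE_LANGUAGE_MAP.get(language, default)


known = to_parser_language("Chinese")
unknown = to_parser_language("Klingon")
```

Falling back to a default keeps parsing working for languages the table does not list, at the cost of less accurate OCR for them.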

### Frontend (`web/`)
- Created new `MinerUOptionsFormField` component that conditionally
renders when MinerU is selected as the layout recognition engine
- Added UI controls for:
  - Parse method selection (dropdown)
  - Formula recognition toggle (switch)
  - Table recognition toggle (switch)
- Added i18n translations for English and Chinese
- Integrated the options into both the dataset creation dialog and
dataset settings page

### Integration
- Updated `rag/app/naive.py` to forward MinerU options to the parser
- Updated task service to handle the new configuration parameters

## Why
MinerU is a powerful document parser, but the default settings don't
work well for all document types. This PR allows users to:
1. Choose the best parsing method for their documents
2. Disable formula recognition for Cyrillic/non-Latin scripts where it
causes issues
3. Control table extraction based on document needs
4. Benefit from automatic language detection for better OCR results

## Testing
- [x] Tested MinerU parsing with different parse methods
- [x] Verified UI renders correctly when MinerU is selected/deselected
- [x] Confirmed settings persist correctly in dataset configuration

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---------

Co-authored-by: user210 <user210@rt>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-16 13:15:25 +08:00
7d23c3aed0 Fix: presentation parsing & Embedding encode exception handling (#11933)
### What problem does this PR solve?

Fix: presentation parsing #11920
Fix: Embedding encode exception handling
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-13 11:37:42 +08:00
e9710b7aa9 Refa: treat MinerU as an OCR model 2 (#11905)
### What problem does this PR solve?

Treat MinerU as an OCR model 2. #11903

### Type of change

- [x] Refactoring
2025-12-11 17:33:12 +08:00
ab4b62031f Fix:csv parse in Table (#11870)
### What problem does this PR solve?

change:
csv parse in Table

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-10 16:44:06 +08:00
ca2d6f3301 Fix: duplicate output by async_chat_streamly (#11842)
### What problem does this PR solve?

Fix: duplicate output by async_chat_streamly
Refact: revert manual modification

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-09 19:21:52 +08:00
a94b3b9df2 Refa: treat MinerU as an OCR model (#11849)
### What problem does this PR solve?

 Treat MinerU as an OCR model.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2025-12-09 18:54:14 +08:00
c51e6b2a58 Refa: migrate CV model chat to Async (#11828)
### What problem does this PR solve?

Migrate CV model chat to Async. #11750

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2025-12-09 13:08:37 +08:00
09a3854ed8 Fix: chunk method error. (#11807)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-08 14:28:23 +08:00
43f51baa96 Fix errors (#11804)
### What problem does this PR solve?

1. typos
2. grammar errors.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-08 12:21:18 +08:00
b66881a371 Refactor:book parser use with to handle bytesIO (#11800)
### What problem does this PR solve?

The book parser now uses a `with` statement to manage the `BytesIO` buffer.
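
As a generic illustration of the pattern (not the book parser's actual code), `io.BytesIO` supports the context-manager protocol, so the buffer is closed as soon as the block exits, even if parsing raises. The byte string here is a made-up placeholder.

```python
from io import BytesIO

data = b"PK\x03\x04 rest of the document bytes"

# The buffer is closed automatically when the block exits.
with BytesIO(data) as buffer:
    header = buffer.read(4)

closed_after_block = buffer.closed
```
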

### Type of change

- [x] Refactoring
2025-12-08 10:18:46 +08:00
7719fd6350 Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702)
### What problem does this PR solve?

This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:

1. **MinerU API output path mismatch for non-ASCII filenames**
   MinerU sanitizes the root directory name inside the returned ZIP when
   the original filename contains non-ASCII characters (e.g., Chinese).
   Ragflow's client-side unzip logic assumed the original filename stem and
   therefore failed to locate `_content_list.json`.
   This PR adds:

   * root-directory detection
   * fallback lookup using sanitized names
   * a broadened `_read_output` search with a glob fallback

   ensuring output files are consistently located regardless of filename
   encoding.

2. **Chunker crash due to tuple-structure mismatch in manual mode**
   Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
   Ragflow’s chunker expects **3-tuple sections**, leading to:
   `ValueError: not enough values to unpack (expected 3, got 2)`
   This PR normalizes all sections to a uniform structure `(text, layout,
   positions)`:

   * parse position tags when present
   * default to empty positions when missing

   preserving backward compatibility and preventing crashes.
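
The tuple normalization can be sketched as follows. The function name is hypothetical, and the sketch deliberately leaves out position-tag parsing, since the tag format is not described here; it only shows the "default to empty positions" half.

```python
from typing import List, Sequence, Tuple

Section = Tuple[str, str, list]


def normalize_section(section: Sequence) -> Section:
    """Return a (text, layout, positions) 3-tuple for 2- or 3-element input."""
    if len(section) == 3:
        text, layout, positions = section
        return (text, layout, list(positions))
    if len(section) == 2:
        text, layout = section
        return (text, layout, [])  # no position info: default to empty
    raise ValueError(f"unexpected section length: {len(section)}")


def normalize_sections(sections: Sequence[Sequence]) -> List[Section]:
    """Normalize every section so downstream code can always unpack three values."""
    return [normalize_section(s) for s in sections]


normalized = normalize_sections([
    ("Title", "title"),
    ("Body text", "text", [(1, 0, 100, 0, 20)]),
])
```

After normalization, a loop such as `for text, layout, positions in normalized` no longer fails on parsers that return 2-tuples.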

### Type of change

* [x] Bug Fix (non-breaking change which fixes an issue)


[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)

we need your help [yongtenglei](https://github.com/yongtenglei)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-05 19:25:45 +08:00
257af75ece Fix: relative page_number in boxes (#11712)
`page_number` in boxes is a relative page number, so `from_page` must be added to it.
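
A minimal sketch of the offset, assuming each box is a dictionary with a `page_number` key; the function name and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def to_absolute_pages(boxes: List[Dict], from_page: int) -> List[Dict]:
    """Return copies of the boxes with page_number shifted by from_page."""
    return [{**box, "page_number": box["page_number"] + from_page} for box in boxes]


relative_boxes = [
    {"text": "a", "page_number": 0},
    {"text": "b", "page_number": 2},
]
absolute_boxes = to_absolute_pages(relative_boxes, from_page=10)
```

The input list is left untouched, so the relative numbers remain available if other code still needs them.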

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-04 11:23:34 +08:00
4ba17361e9 feat: improve presentation PdfParser (#11639)
The old presentation PdfParser lost table formatting after parsing.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-02 17:35:14 +08:00
14616cf845 Feat: add child parent chunking method in backend. (#11598)
### What problem does this PR solve?

#7996

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-28 19:25:32 +08:00
9d8b96c1d0 Feat: add context for figure and table (#11547)
### What problem does this PR solve?

Add context for figures and tables.



![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e)


`==================()` is shown for demonstration purposes only.
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-27 10:21:44 +08:00
74e0b58d89 Fix: excel default optimization. (#11519)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:20 +08:00
7c20c964b4 Fix: incorrect image merging for naive markdown parser (#11520)
### What problem does this PR solve?

Fix incorrect image merging for naive markdown parser. #9349 


[ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc)

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-25 19:54:06 +08:00
41665b0865 Refactor: Email parser use with to handle buffer (#11496)
### What problem does this PR solve?
The email parser now uses a `with` statement to manage the buffer.

### Type of change

- [x] Refactoring
2025-11-25 10:03:37 +08:00
971c1bcba7 Fix: missing parameters in by_plaintext method for PDF naive mode (#11408)
### What problem does this PR solve?

Fix: missing parameters in by_plaintext method for PDF naive mode

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: lih <dev_lih@139.com>
2025-11-21 09:33:36 +08:00