Rene Arredondo f58e0b3eca Feat: VLM image descriptions in MinerU parser (#14869) (#14946)
## Summary

Closes #14869.

Adds VLM-based semantic descriptions to **image chunks produced by the
MinerU parser**, closing a long-standing parity gap with the deepdoc
parser's `VisionFigureParser`. A maintainer flagged this in #13342
("We may add the VLM enhancement to MinerU parser as well") and an
earlier proposal exists in #13824; this PR lands the change end-to-end
inside the existing parser plumbing.

## Why

Today the MinerU parser returns image chunks containing only the
native `image_caption` and `image_footnote` strings from MinerU's
JSON. When neither is present (or when both are sparse), the chunk
carries effectively no searchable content for the figure and
retrieval misses it entirely. Users who configured a local VLM
(reporter's case: Gemma-4-31B) had to post-process MinerU's
`tmp/*.json` themselves.

The deepdoc parser already solves this via
[`VisionFigureParser`](deepdoc/parser/figure_parser.py): when the
tenant has an `IMAGE2TEXT` model configured, each figure gets a
semantic description merged into its chunk. This PR brings the same
behavior to MinerU.

## What changed

### `deepdoc/parser/mineru_parser.py`

- **New method `_enhance_images_with_vlm(outputs, vision_model,
callback=None)`** —
  collects every `IMAGE` block with a readable `img_path`, runs
  `rag.app.picture.vision_llm_chunk` in a 10-worker
  `ThreadPoolExecutor` using the existing
  `vision_llm_figure_describe_prompt`, and writes the result back as
  `vlm_description`. Per-image failures are logged and skipped — they
  never abort the run.
- **`_transfer_to_sections` (IMAGE branch)** — folds
  `vlm_description` into the section text alongside caption +
  footnote, so the description becomes part of the chunk and is
  searchable / retrievable.
- **`parse_pdf`** — after `_read_output`, calls
  `_enhance_images_with_vlm(outputs, vision_model, callback=callback)`
  when a `vision_model` kwarg is supplied. Wrapped in `try / except`
  so a VLM outage cannot break parsing.
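
The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in `mineru_parser.py`: the standalone function `enhance_images_with_vlm`, the `describe` callable (a stand-in for the `vision_llm_chunk` call), and the `"type": "image"` block layout are all assumptions made for the example.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed


def enhance_images_with_vlm(outputs, describe, max_workers=10):
    """Attach a `vlm_description` to each image block that has an `img_path`.

    `describe` is a hypothetical callable (image path -> str) standing in
    for the real VLM call; the block layout is assumed for illustration.
    """
    targets = [b for b in outputs if b.get("type") == "image" and b.get("img_path")]
    if not targets:
        return outputs

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the block it describes.
        futures = {pool.submit(describe, b["img_path"]): b for b in targets}
        for future in as_completed(futures):
            block = futures[future]
            try:
                text = future.result()
            except Exception:
                # A single failed image is logged and skipped; the run continues.
                logging.exception("VLM description failed for %s", block["img_path"])
                continue
            if text and text.strip():
                block["vlm_description"] = text.strip()
    return outputs
```

Because each future's exception is caught individually, one unreadable image or one timed-out request leaves the remaining blocks unaffected, which matches the "logged and skipped" behavior described in the list.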

### `rag/app/naive.py` (`by_mineru`)

After successfully resolving the MinerU OCR parser, `by_mineru` also
resolves the tenant's default `LLMType.IMAGE2TEXT` model via
`get_tenant_default_model_by_type`, wraps it in an `LLMBundle`, and
injects it as `kwargs["vision_model"]` before delegating to
`parse_pdf`.
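
A rough sketch of that injection step, under stated assumptions: `inject_vision_model` and `resolve_default_model` are hypothetical names used only here, with `resolve_default_model` standing in for the tenant lookup plus `LLMBundle` wrapping. The real signatures in `rag/app/naive.py` are not reproduced.

```python
import logging


def inject_vision_model(kwargs, resolve_default_model):
    """Add kwargs["vision_model"] unless the caller already supplied one.

    `resolve_default_model` is a hypothetical zero-argument callable that
    returns a model object, returns None, or raises when none is configured.
    """
    if "vision_model" in kwargs:
        # Respect an explicit choice made by the caller.
        return kwargs
    try:
        model = resolve_default_model()
    except Exception as exc:
        # A missing model is not an error: log at info level and carry on.
        logging.info("No IMAGE2TEXT model available, skipping VLM enhancement: %s", exc)
        return kwargs
    if model is not None:
        kwargs["vision_model"] = model
    return kwargs
```

With this shape, a tenant without an `IMAGE2TEXT` model gets the same `kwargs` back, so the downstream call behaves as it did before the change.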

## Behavior

| Tenant config | Behavior |
|---|---|
| `IMAGE2TEXT` model configured | MinerU image chunks contain `caption + footnote + VLM description`, so figures become retrievable by their described content. |
| No `IMAGE2TEXT` model configured | Exact same output as today (caption + footnote only). The failed lookup is recorded in an info log; no error, no regression. |
| VLM call fails for a single image | The failure is logged and that image falls back to caption + footnote; other images proceed. |
| Caller already passes `vision_model` in kwargs | The caller's model is kept: an `if "vision_model" not in kwargs` check guards the lookup. |
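
To make the first and third rows concrete, here is a small sketch of how the three text sources might be folded into one section string. The helper `image_section_text`, the newline separator, and the handling of captions as either a string or a list of strings are assumptions for illustration; the actual `_transfer_to_sections` logic may differ.

```python
def image_section_text(block):
    """Combine caption, footnote and optional VLM description into one string.

    The key names follow the PR text; the list-or-string handling and the
    newline separator are assumptions made for this example.
    """
    parts = []
    for key in ("image_caption", "image_footnote"):
        value = block.get(key) or []
        if isinstance(value, str):
            value = [value]
        parts.extend(s.strip() for s in value if s and s.strip())

    # The description is only present when the VLM call succeeded.
    description = (block.get("vlm_description") or "").strip()
    if description:
        parts.append(description)
    return "\n".join(parts)
```

A block without `vlm_description` yields only caption and footnote text, which is the fallback the table describes.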

## Files

- `deepdoc/parser/mineru_parser.py` (+56)
- `rag/app/naive.py` (+13)
2026-05-19 16:08:10 +08:00