mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-05-20 16:26:42 +08:00
## Summary

Closes #14869.

Adds VLM-based semantic descriptions to **image chunks produced by the MinerU parser**, closing a long-standing parity gap with the deepdoc parser's `VisionFigureParser`. A maintainer flagged this in #13342 ("We may add the VLM enhancement to MinerU parser as well") and an earlier proposal exists in #13824; this PR lands the change end-to-end inside the existing parser plumbing.

## Why

Today the MinerU parser returns image chunks containing only the native `image_caption` and `image_footnote` strings from MinerU's JSON. When neither is present (or when both are sparse), the chunk carries effectively no searchable content for the figure and retrieval misses it entirely. Users who configured a local VLM (reporter's case: Gemma-4-31B) had to post-process MinerU's `tmp/*.json` themselves.

The deepdoc parser already solves this via [`VisionFigureParser`](deepdoc/parser/figure_parser.py): when the tenant has an `IMAGE2TEXT` model configured, each figure gets a semantic description merged into its chunk. This PR brings the same behavior to MinerU.

## What changed

### `deepdoc/parser/mineru_parser.py`

- **New method `_enhance_images_with_vlm(outputs, vision_model, callback=None)`**: collects every `IMAGE` block with a readable `img_path`, runs `rag.app.picture.vision_llm_chunk` in a 10-worker `ThreadPoolExecutor` using the existing `vision_llm_figure_describe_prompt`, and writes the result back as `vlm_description`. Per-image failures are logged and skipped; they never abort the run.
- **`_transfer_to_sections` (IMAGE branch)**: folds `vlm_description` into the section text alongside caption + footnote, so the description becomes part of the chunk and is searchable / retrievable.
- **`parse_pdf`**: after `_read_output`, calls `_enhance_images_with_vlm(outputs, vision_model, callback=callback)` when a `vision_model` kwarg is supplied. Wrapped in `try / except` so a VLM outage cannot break parsing.

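
The per-image fan-out and the caption + footnote + description merge described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `describe` is a hypothetical stand-in for the `vision_llm_chunk` call, the block layout (`type`, `img_path`, `image_caption`, `image_footnote`) is an assumption about MinerU's JSON, and the sketch only checks that `img_path` is present rather than that the file is readable.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def enhance_images(blocks: list[dict],
                   describe: Callable[[str], str],
                   max_workers: int = 10) -> int:
    """Attach 'vlm_description' to image blocks; return how many succeeded.

    `describe` is a hypothetical callable (image path -> description text).
    """
    # Assumed block shape: image blocks have type == "image" and an img_path.
    targets = [b for b in blocks
               if b.get("type") == "image" and b.get("img_path")]
    if not targets:
        return 0

    succeeded = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(describe, b["img_path"]): b for b in targets}
        for future in as_completed(futures):
            block = futures[future]
            try:
                text = future.result()
            except Exception:
                # One failing image is logged and skipped, never fatal.
                logging.exception("VLM description failed for %s",
                                  block["img_path"])
                continue
            if text and text.strip():
                block["vlm_description"] = text.strip()
                succeeded += 1
    return succeeded


def image_section_text(block: dict) -> str:
    """Join caption, footnote and VLM description into one section string."""
    pieces: list[str] = []
    for key in ("image_caption", "image_footnote", "vlm_description"):
        value = block.get(key)
        if isinstance(value, list):
            pieces.extend(v for v in value if v)  # tolerate list-valued fields
        elif value:
            pieces.append(value)
    return "\n".join(pieces)
```

Writing the description back onto the block, instead of returning a separate mapping, keeps the later section-building step unchanged for images that have no description: they simply contribute caption and footnote as before.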
### `rag/app/naive.py` (`by_mineru`)

After successfully resolving the MinerU OCR parser, also resolves the tenant's default `LLMType.IMAGE2TEXT` model via `get_tenant_default_model_by_type`, wraps it in an `LLMBundle`, and injects it as `kwargs["vision_model"]` before delegating to `parse_pdf`.

## Behavior

| Tenant config | Behavior |
|---|---|
| `IMAGE2TEXT` model configured | MinerU image chunks contain `caption + footnote + VLM description`. Retrieval against figures now actually works. |
| No `IMAGE2TEXT` model configured | Exact same output as today (caption + footnote only). Lookup fails silently with an info log; no error, no regression. |
| VLM call fails for a single image | That image silently falls back to caption + footnote; other images proceed. |
| Caller already passes `vision_model` in kwargs | We don't override it: `if "vision_model" not in kwargs` guards the lookup. |

## Files

- `deepdoc/parser/mineru_parser.py` (+56)
- `rag/app/naive.py` (+13)
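
For reviewers, the lookup guard summarized in the Behavior table (caller-supplied model wins, a missing tenant model degrades to an info log) can be sketched as follows. This is an illustration under assumptions, not the code in `by_mineru`: `inject_vision_model` and `resolve_default` are hypothetical names, with `resolve_default` standing in for the tenant model lookup plus `LLMBundle` wrapping.

```python
import logging
from typing import Any, Callable, Optional


def inject_vision_model(kwargs: dict[str, Any],
                        resolve_default: Callable[[], Optional[Any]]) -> dict[str, Any]:
    """Add kwargs['vision_model'] unless the caller already supplied one.

    `resolve_default` is a hypothetical zero-argument callable that returns
    a model object, returns None, or raises when no model is configured.
    """
    # A caller-supplied model is never overridden.
    if "vision_model" in kwargs:
        return kwargs
    try:
        model = resolve_default()
    except Exception as exc:
        # Missing configuration is not an error: log and keep today's output.
        logging.info("No IMAGE2TEXT model available, skipping VLM step: %s", exc)
        return kwargs
    if model is not None:
        kwargs["vision_model"] = model
    return kwargs
```

With this shape, the parser only needs to check whether `vision_model` is present in its keyword arguments; every "no model" path leaves `kwargs` unchanged.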