Commit Graph

97 Commits

Author SHA1 Message Date
292a1a8566 fix: detect and fallback garbled PDF text to OCR (#13366) (#13404)
## Problem

When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer)
cannot map CIDs to the correct Unicode characters and instead outputs PUA
characters (U+E000 to U+F8FF) or `(cid:xxx)` placeholders. The original
code trusted pdfplumber text without checking for garbled characters, so
the garbled text ended up in the final parsed result.

Relates to #13366

## Solution

### 1. Garbled text detection functions
- `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16),
replacement character U+FFFD, control characters, and
unassigned/surrogate codepoints
- `_is_garbled_text(text, threshold)`: Calculates garbled ratio and
detects `(cid:xxx)` patterns
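
As a rough illustration of the two checks described above, here is a minimal sketch. It is not the PR's implementation: the function names (without the leading underscore), the choice to exempt tab/newline characters, and the 0.5 default threshold are assumptions made for this example.

```python
import re
import unicodedata

# Matches pdfminer-style placeholders such as "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def is_garbled_char(ch: str) -> bool:
    """Return True for characters that usually signal a failed CID-to-Unicode mapping."""
    if ch in "\t\n\r":
        return False  # assumption: ordinary whitespace controls are not garbage
    if ch == "\ufffd":
        return True  # replacement character (its category is So, so check it explicitly)
    # Co = private use (BMP and planes 15/16), Cn = unassigned,
    # Cs = surrogate, Cc = control
    return unicodedata.category(ch) in {"Co", "Cn", "Cs", "Cc"}


def is_garbled_text(text: str, threshold: float = 0.5) -> bool:
    """Return True if text contains a CID placeholder or its garbled ratio >= threshold."""
    if CID_PATTERN.search(text):
        return True
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    garbled = sum(is_garbled_char(c) for c in chars)
    return garbled / len(chars) >= threshold
```

Whitespace is excluded from the ratio so that padding between glyphs does not dilute the garbled share.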

### 2. Box-level fallback (in `__ocr()`)
When a text box has ≥50% garbled characters, discard the pdfplumber text
and fall back to OCR.

### 3. Page-level detection (in `__images__()`)
Sample characters from each page; if the garbled rate is ≥30%, clear all
pdfplumber characters for that page, forcing full OCR.
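
The page-level decision could look roughly like the sketch below. The sample size, the fixed random seed, and the simplified BMP-only PUA predicate are assumptions for illustration; the PR does not state how characters are sampled.

```python
import random
from typing import Callable, Sequence


def is_bmp_pua(ch: str) -> bool:
    """Simplified stand-in predicate: BMP private-use characters only."""
    return "\ue000" <= ch <= "\uf8ff"


def page_needs_full_ocr(
    chars: Sequence[str],
    is_garbled: Callable[[str], bool] = is_bmp_pua,
    sample_size: int = 200,   # hypothetical sample size
    threshold: float = 0.3,   # page-level threshold from the description
    seed: int = 0,
) -> bool:
    """Return True when the sampled garbled rate reaches the threshold."""
    if not chars:
        return False
    if len(chars) <= sample_size:
        sample = list(chars)
    else:
        sample = random.Random(seed).sample(list(chars), sample_size)
    bad = sum(1 for c in sample if is_garbled(c))
    return bad / len(sample) >= threshold
```

A caller would clear the page's pdfplumber characters when this returns `True`, so that every box on the page goes through OCR.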

### 4. Layout recognizer CID filtering
Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text
processing to prevent them from polluting layout analysis.
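
A CID filter of this kind can be as small as one regular expression. The helper name below is hypothetical, and the exact pattern used in `layout_recognizer.py` may differ.

```python
import re

# Matches placeholders such as "(cid:42)" emitted for unmapped glyphs.
CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def strip_cid_placeholders(text: str) -> str:
    """Remove every "(cid:N)" placeholder, leaving the rest of the text untouched."""
    return CID_PLACEHOLDER.sub("", text)
```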

## Testing
- 29 unit tests covering: normal CJK/English text, PUA characters, CID
patterns, mixed text, boundary thresholds, edge cases
- All 85 existing project unit tests pass without regression
2026-03-10 11:20:31 +08:00
fa71f8d0c7 refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.

This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.

**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:

* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.
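
To make the lazy-loading idea concrete, here is a minimal sketch of a blob wrapper that decodes on first use. It is not the `LazyDocxImage` class from this PR: the class name `LazyBlob`, its methods, and the pluggable `decoder` argument are assumptions for illustration. With Pillow, the decoder could be something like `lambda b: Image.open(io.BytesIO(b))`.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyBlob(Generic[T]):
    """Hold raw bytes and decode them only when first requested."""

    def __init__(self, blob: bytes, decoder: Callable[[bytes], T]) -> None:
        self._blob: Optional[bytes] = blob
        self._decoder = decoder
        self._decoded: Optional[T] = None

    def get(self) -> T:
        """Return the cached decoded object, decoding it on first access."""
        if self._decoded is None:
            if self._blob is None:
                raise ValueError("image has been closed")
            self._decoded = self._decoder(self._blob)
        return self._decoded

    def detached(self) -> T:
        """Decode a fresh object owned by the caller, leaving the cache untouched."""
        if self._blob is None:
            raise ValueError("image has been closed")
        return self._decoder(self._blob)

    def close(self) -> None:
        """Drop both the cached object and the raw bytes."""
        self._decoded = None
        self._blob = None
```

Handing out detached copies at the use site means one consumer cannot mutate or close the instance another consumer is still reading.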

**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.

**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.

**Validation & Testing**
I've tested this to ensure no regressions:

* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.

**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
678392c040 feat(deepdoc): add configurable ONNX thread counts and GPU memory shrinkage (#12777)
### What problem does this PR solve?

This PR addresses critical memory and CPU resource management issues in
high-concurrency environments (multi-worker setups):

GPU Memory Exhaustion (OOM): Currently, onnxruntime-gpu uses an
aggressive memory arena that does not effectively release VRAM back to
the system after a task completes. In multi-process worker setups ($WS >
4), this leads to BFCArena allocation failures and OOM errors as workers
"hoard" VRAM even when idle. This PR introduces an optional GPU Memory
Arena Shrinkage toggle to mitigate this issue.

CPU Oversubscription: ONNX intra_op and inter_op thread counts are
currently hardcoded to 2. When running many workers, this causes
significant CPU context-switching overhead and degrades performance.
This PR makes these values configurable to match the host's actual CPU
core density.

Multi-GPU Support: The memory management logic has been improved to
dynamically target the correct device_id, ensuring stability on systems
with multiple GPUs.

Transparency: Added detailed initialization logs to help administrators
verify and troubleshoot their ONNX session configurations.
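
A minimal sketch of how such settings might be read from the environment follows. The variable names (`ONNX_INTRA_OP_THREADS`, `ONNX_INTER_OP_THREADS`, `ONNX_SHRINK_GPU_ARENA`, `ONNX_DEVICE_ID`) are hypothetical and not taken from this PR; only the default of 2 threads comes from the description above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OnnxSettings:
    intra_op_threads: int
    inter_op_threads: int
    shrink_gpu_arena: bool
    device_id: int


def _int_at_least(env: Mapping[str, str], name: str, default: int, minimum: int) -> int:
    """Read an integer setting, falling back to the default on bad or too-small values."""
    try:
        value = int(env.get(name, default))
    except (TypeError, ValueError):
        return default
    return value if value >= minimum else default


def load_settings(env: Optional[Mapping[str, str]] = None) -> OnnxSettings:
    """Build settings from the environment (hypothetical variable names)."""
    env = os.environ if env is None else env
    flag = env.get("ONNX_SHRINK_GPU_ARENA", "0").strip().lower()
    return OnnxSettings(
        intra_op_threads=_int_at_least(env, "ONNX_INTRA_OP_THREADS", 2, 1),
        inter_op_threads=_int_at_least(env, "ONNX_INTER_OP_THREADS", 2, 1),
        shrink_gpu_arena=flag in {"1", "true", "yes"},
        device_id=_int_at_least(env, "ONNX_DEVICE_ID", 0, 0),
    )


def arena_shrinkage_target(settings: OnnxSettings) -> Optional[str]:
    """Return a device string such as "gpu:1" when shrinkage is enabled, else None."""
    return f"gpu:{settings.device_id}" if settings.shrink_gpu_arena else None
```

With onnxruntime, the thread counts would typically be assigned to `SessionOptions.intra_op_num_threads` and `SessionOptions.inter_op_num_threads`, and the device string would be the value of the `memory.enable_memory_arena_shrinkage` run config entry; how this PR wires them up is not shown here.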

 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: shakeel <shakeel@lollylaw.com>
2026-01-23 11:36:28 +08:00
927db0b373 Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716)
### Type of change

- [x] Refactoring
2026-01-20 13:29:37 +08:00
Rin
651d9fff9f security: replace unsafe eval with ast.literal_eval in vision operators (#12236)
Addresses a potential RCE vulnerability in NormalizeImage by using
ast.literal_eval for safer string parsing.
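
The difference between the two calls can be sketched as follows. `parse_scale` is a hypothetical helper written for this example, not the code in `NormalizeImage`. Note that `ast.literal_eval` accepts only literals, so arithmetic expressions are rejected along with function calls.

```python
import ast
from typing import Union


def parse_scale(value: Union[str, int, float]) -> float:
    """Parse a numeric setting without executing arbitrary code."""
    if isinstance(value, str):
        # literal_eval only accepts Python literals; calls and names raise ValueError.
        value = ast.literal_eval(value)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"expected a number, got {type(value).__name__}")
    return float(value)


def is_rejected(text: str) -> bool:
    """Return True when parse_scale refuses the input."""
    try:
        parse_scale(text)
    except (ValueError, SyntaxError, TypeError):
        return True
    return False
```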

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-29 13:28:09 +08:00
65a5a56d95 Refa:replace trio with asyncio (#11831)
### What problem does this PR solve?

Change: replace trio with asyncio.

### Type of change
- [x] Refactoring
2025-12-09 19:23:14 +08:00
43f51baa96 Fix errors (#11804)
### What problem does this PR solve?

1. typos
2. grammar errors.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-08 12:21:18 +08:00
6546f86b4e Fix errors (#11795)
### What problem does this PR solve?

- typos
- IDE warnings

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-08 09:42:10 +08:00
a674338c21 Fix: remove garbage filtering rules (#11567)
### What problem does this PR solve?
Change: remove the garbage filtering rules.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 17:54:49 +08:00
ba71160b14 Refa: rm useless code. (#11238)
### Type of change

- [x] Refactoring
2025-11-13 09:59:55 +08:00
f98b24c9bf Move api.settings to common.settings (#11036)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-06 09:36:38 +08:00
1284647694 Refactor file utils (#10970)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 18:54:55 +08:00
78631a3fd3 Move some functions out of 'api/utils/common.py' (#10948)
### What problem does this PR solve?

as title.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-03 12:34:47 +08:00
44f2d6f5da Move 'get_project_base_directory' to common directory (#10940)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-02 21:05:28 +08:00
60a6cf7c7a Fix:remove unexpected keyword argument in table_structure_recognizer logging (#10831)
### What problem does this PR solve?
Issue: #10825

Change: remove the unexpected keyword argument in the
table_structure_recognizer logging call.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-28 11:02:43 +08:00
73144e278b Don't release full image (#10654)
### What problem does this PR solve?

- Introduced a GPU profile in `.env`.
- Added `Dockerfile_tei`.
- Fixed datrie.
- Removed the `LIGHTEN` flag.

### Type of change

- [x] Documentation Update
- [x] Refactoring
2025-10-23 23:02:27 +08:00
f631073ac2 Fix OCR GPU provider mem limit handling (#10407)
### What problem does this PR solve?

- Running DeepDoc OCR on large PDFs inside the GPU docker-compose setup
would intermittently fail with
`[ONNXRuntimeError] ... p2o.Clip.6 ... Available memory of 0 is smaller than requested bytes ...`
- Root cause: `load_model()` in `deepdoc/vision/ocr.py` treated
`device_id=None` as-is. `torch.cuda.device_count() > device_id` then
raised a `TypeError`, the helper returned `False`, and ONNXRuntime quietly
fell back to `CPUExecutionProvider` with the hard-coded 512 MB limit,
which then triggered the allocator failure.
- Environment where this reproduces: Windows 11, AMD 5900x, 64 GB RAM,
RTX 3090 (24 GB), `docker-compose-gpu.yml` from upstream, default DeepDoc
+ GraphRAG parser settings, ingesting a heavy PDF such as
《内科学》(第10版).pdf (~180 MB).

Fixes:

- Normalize `device_id` to 0 when it is `None` before calling any CUDA
APIs, so the GPU path is considered available.
- Allow configuring the CUDA provider's memory cap via
`OCR_GPU_MEM_LIMIT_MB` (default 2048 MB) and expose
`OCR_ARENA_EXTEND_STRATEGY`; the calculated byte limit is logged to
confirm the effective settings.

After the change, `ragflow_server.log` shows, for example,
`load_model ... uses GPU (device 0, gpu_mem_limit=21474836480, arena_strategy=kNextPowerOfTwo)`,
and the same document finishes OCR without allocator errors.
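
The two fixes can be sketched as below. The helper names are hypothetical and the `kNextPowerOfTwo` fallback for `OCR_ARENA_EXTEND_STRATEGY` is an assumption; the option keys `device_id`, `gpu_mem_limit`, and `arena_extend_strategy` are the ones the ONNX Runtime CUDA provider accepts.

```python
import os
from typing import Any, Dict, Mapping, Optional


def normalize_device_id(device_id: Optional[int]) -> int:
    """Treat a missing device id as GPU 0 so later comparisons never see None."""
    return 0 if device_id is None else int(device_id)


def gpu_mem_limit_bytes(env: Mapping[str, str], default_mb: int = 2048) -> int:
    """Convert OCR_GPU_MEM_LIMIT_MB (default 2048 MB) to bytes."""
    return int(env.get("OCR_GPU_MEM_LIMIT_MB", default_mb)) * 1024 * 1024


def cuda_provider_options(
    device_id: Optional[int], env: Optional[Mapping[str, str]] = None
) -> Dict[str, Any]:
    """Assemble CUDA provider options from the environment."""
    env = os.environ if env is None else env
    return {
        "device_id": normalize_device_id(device_id),
        "gpu_mem_limit": gpu_mem_limit_bytes(env),
        # Assumed fallback value; the PR's actual default is not stated.
        "arena_extend_strategy": env.get("OCR_ARENA_EXTEND_STRATEGY", "kNextPowerOfTwo"),
    }
```

A limit of 20480 MB yields 21474836480 bytes, which matches the `gpu_mem_limit` value in the log line quoted above.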

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-10 11:03:12 +08:00
b0b866c8fd Refactor: move some functions out of api/utils/__init__.py (#10216)
### What problem does this PR solve?

Refactor import modules.

### Type of change

- [x] Refactoring

---------

Signed-off-by: jinhai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-09-25 18:04:49 +08:00
86f6da2f74 Feat: add support for the Ascend table structure recognizer (#10110)
### What problem does this PR solve?

Add support for the Ascend table structure recognizer.

Use the environment variable `TABLE_STRUCTURE_RECOGNIZER_TYPE=ascend` to
enable the Ascend table structure recognizer.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-16 13:57:06 +08:00
bc0281040b Feat: add support for the Ascend layout recognizer (#10105)
### What problem does this PR solve?

Supports Ascend layout recognizer.

Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable
the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n`
(for example, n=0) to specify the Ascend device ID.

Ensure that you have installed the [ais
tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)
properly.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-16 09:51:15 +08:00
341a7b1473 Fix: judge not empty before delete (#10099)
### What problem does this PR solve?

Check that the session is not empty before deleting it.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-15 17:49:52 +08:00
2a88ce6be1 Fix: terminate onnx inference session manually (#10076)
### What problem does this PR solve?

Terminate the ONNX inference session and release its memory manually.

Issue #5050 
Issue #9992 
Issue #8805

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-09-12 17:18:26 +08:00
5abd0bbac1 Fix typo (#9766)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-08-27 18:56:40 +08:00
2ae8f2cf00 Fix: exception layout_type in is_caption (#9028)
### What problem does this PR solve?

Exception layout_type in is_caption.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-24 17:06:56 +08:00
e470645efd Refactor code (#8341)
### What problem does this PR solve?

1. rename var
2. update if statement

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-06-18 16:40:30 +08:00
4a2ff633e0 Fix typo in code (#8327)
### What problem does this PR solve?

Fix typo in code

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-06-18 09:41:09 +08:00
e6d36f3a3a Improve image rotation logic for text recognition (#8167)
### What problem does this PR solve?

Enhanced the image rotation handling by evaluating the original
orientation, clockwise 90°, and counter-clockwise 90° rotations. The
image with the highest text recognition score is now selected, improving
accuracy for text detection in images with aspect ratios >= 1.5.
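
The selection step can be sketched as follows, using nested lists in place of real image arrays. The function names, the `score` callback, and the reading of "aspect ratio" as height divided by width are assumptions for this example.

```python
from typing import Callable, List

Image = List[List[int]]


def rotate_cw(img: Image) -> Image:
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_ccw(img: Image) -> Image:
    """Rotate 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def best_orientation(
    img: Image, score: Callable[[Image], float], min_aspect_ratio: float = 1.5
) -> Image:
    """Pick the orientation with the highest recognition score for tall images."""
    height, width = len(img), len(img[0])
    if height / width < min_aspect_ratio:
        return img  # not tall enough to suspect a rotated text line
    candidates = [img, rotate_cw(img), rotate_ccw(img)]
    # max() returns the first maximum, so the original wins ties.
    return max(candidates, key=score)
```

In the real pipeline the `score` callback would be the text recognizer's confidence for each candidate crop.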

#8166

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: wenrui.cao <wenrui.cao@univers.com>
2025-06-11 09:20:30 +08:00
6ba5a4348a set PARALLEL_DEVICES default value= 0 (#7935)
### What problem does this PR solve?


Initialization fails if `PARALLEL_DEVICES` is `None` in the OCR class,
because it passes 0 to the `TextDetector` and `TextRecognizer` init
methods.

It is simpler to set 0 as the default value for `PARALLEL_DEVICES`.

### Type of change

- [x] Refactoring
2025-05-29 13:32:16 +08:00
ed5f81b02e Fix: abnormal cell mergeing. (#6991)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-14 11:00:11 +08:00
3bb1e012e6 Fix: assistant deleteion issue. (#6906)
### What problem does this PR solve?

#6875

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-09 20:29:40 +08:00
2caf15b24c Refa: trival. (#6802)
### What problem does this PR solve?


### Type of change


- [x] Refactoring
2025-04-03 19:01:24 +08:00
b0b4b7ba33 Feat: Improve Recognizer.py performance (#6185)
### What problem does this PR solve?

Replace the for loop in the `create_inputs` method with NumPy operations.

### Type of change

- [x] Performance Improvement
2025-03-18 09:39:49 +08:00
3a99c2b5f4 Refa: PARALLEL_DEVICES is a static parameter. (#6168)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-03-17 16:49:54 +08:00
3e19044dee Feat: add OCR's muti-gpus and parallel processing support (#5972)
### What problem does this PR solve?

Add multi-GPU and parallel processing support for OCR.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

@yuzhichang I've tried to resolve the comments in #5697. OCR jobs can
now be done on both CPU and GPU. (By the way, I've encountered a
“Generate embedding error” issue, #5954, that might be due to my outdated
GPUs; I'm not sure.) Please review it and give me suggestions.

GPU:

![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e)

![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d)

CPU:

![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)
2025-03-17 11:58:40 +08:00
4ff609b6a8 Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027)
### What problem does this PR solve?

Optimize OCR garbage identification to reduce unnecessary filtering.
#5713

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-13 18:48:32 +08:00
4326873af6 refactor: no need to inherit in python3 clean the code (#5659)
### What problem does this PR solve?

As title

### Type of change


- [x] Refactoring

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-05 18:03:53 +08:00
ca04ae9540 Minor: improve doc and rm unused file (#5634)
### What problem does this PR solve?

The `ocr.res` file is already included in the model directory
`rag/res/deepdoc`, but it doesn't seem to be utilized here.

### Type of change

- [x] Documentation Update
2025-03-05 12:59:54 +08:00
c813c1ff4c Made task_executor async to speedup parsing (#5530)
### What problem does this PR solve?

Made task_executor async to speedup parsing

### Type of change

- [x] Performance Improvement
2025-03-03 18:59:49 +08:00
8a2542157f Fix: possible memory leaks close #5277 (#5500)
### What problem does this PR solve?

Closes #5277 by making sure the file is closed.

### Type of change

- [x] Performance Improvement

---------

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-03 10:26:45 +08:00
37aacb3960 Refa: drop useless fasttext (#5470)
### What problem does this PR solve?

This patch drops fasttext, which seems to be unused in the code base and
is quite hard to install.
It should close #4498.


### Type of change

- [x] Refactoring

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-02-28 14:30:56 +08:00
db42d0e0ae Optimize ocr (#5297)
### What problem does this PR solve?

Introduced OCR.recognize_batch

### Type of change

- [x] Performance Improvement
2025-02-24 16:21:55 +08:00
0151d42156 Reuse loaded modules if possible (#5231)
### What problem does this PR solve?

Reuse loaded modules if possible

### Type of change

- [x] Refactoring
2025-02-21 17:21:01 +08:00
c326f14fed Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly (#5182)
### What problem does this PR solve?

Optimized Recognizer.sort_X_firstly and Recognizer.sort_Y_firstly

### Type of change

- [x] Performance Improvement
2025-02-20 15:41:12 +08:00
b08bb56f6c Display thinking for deepseek r1 (#4904)
### What problem does this PR solve?
#4903
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-02-12 15:43:13 +08:00
6b389e01b5 Remove use of eval() from operators.py (#4888)
Use `np.float32()` instead.

### What problem does this PR solve?

Using `eval()` can lead to code injections.

I think `eval()` is only used to parse a floating point number here.
This change preserves the correct behavior if the string `"None"` is
supplied. But if that behavior isn't intended then this part could be
just deleted instead, since `np.float32()` is parsing strings anyway:

```Python
        if isinstance(scale, str):
            scale = eval(scale)
```

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-12 12:53:42 +08:00
3411d0a2ce Added cuda_is_available (#4725)
### What problem does this PR solve?

Added cuda_is_available

### Type of change

- [x] Refactoring
2025-02-05 18:01:23 +08:00
e1526846da Fixed GPU detection on CPU only environment (#4711)
### What problem does this PR solve?

Fixed GPU detection on CPU only environment. Close #4692

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-05 12:02:43 +08:00
1bff6b7333 Fix t_ocr.py for PNG image. (#4625)
### What problem does this PR solve?
#4586

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-01-24 11:47:27 +08:00
4230402fbb deepdoc use GPU if possible (#4618)
### What problem does this PR solve?

DeepDoc uses the GPU if possible.

### Type of change

- [x] Refactoring
2025-01-24 09:48:02 +08:00
1a367664f1 Remove usage of eval() from postprocess.py (#4571)
Remove usage of `eval()` from postprocess.py

### What problem does this PR solve?

The use of `eval()` is a potential security risk. While the use of
`eval()` is guarded and thus not a security risk normally, `assert`s
aren't run if `-O` or `-OO` is passed to the interpreter, and as such
then the guard would not apply. In any case there is no reason to use
`eval()` here at all.
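
One way to avoid both `eval()` and an `assert`-based guard is an explicit registry lookup, sketched below. The class names and the `build` helper are hypothetical stand-ins, not the contents of postprocess.py.

```python
from typing import Any, Callable, Dict, Mapping


class UpperCase:
    """Toy post-processor used only for this example."""

    def __init__(self, suffix: str = "") -> None:
        self.suffix = suffix

    def __call__(self, text: str) -> str:
        return text.upper() + self.suffix


class Reverse:
    """Second toy post-processor."""

    def __call__(self, text: str) -> str:
        return text[::-1]


# Explicit name-to-class mapping replaces eval(name).
REGISTRY: Dict[str, Callable[..., Any]] = {cls.__name__: cls for cls in (UpperCase, Reverse)}


def build(config: Mapping[str, Any]) -> Any:
    """Instantiate a registered class; unknown names raise even under -O."""
    options = dict(config)
    name = options.pop("name")
    if name not in REGISTRY:
        raise ValueError(f"unsupported post-process: {name!r}")
    return REGISTRY[name](**options)
```

Because the check is an ordinary `if` followed by `raise`, it still runs when the interpreter is started with `-O` or `-OO`, unlike an `assert`.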

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Other (please describe):

Potential security fix if somehow the passed `modul_name` could be user
controlled.
2025-01-22 19:37:24 +08:00