Commit Graph

29 Commits

Author SHA1 Message Date
c446c403de perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)
## Summary

- **Lazy img_np loading**: `np.array(img)` is now deferred until the
first OCR text extraction is actually needed, avoiding unnecessary
memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via
`PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's
boxes are normalized with `_to_global_boxes` to produce globally
consistent page numbers and position tags.
- **DLA early init**: Move remote-client initialization before model
loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy
`TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser
containers relying on remote inference.
- **Fix outline regression**: Restore `self.outlines =
extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped
during refactoring and is required by downstream `remove_toc` and
metadata handling in `rag/flow/parser/parser.py`.

## Test plan

- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines`
is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally
consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local
model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 16:52:43 +08:00
292a1a8566 fix: detect and fallback garbled PDF text to OCR (#13366) (#13404)
## Problem

When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer)
cannot map CIDs to correct Unicode characters, outputting PUA characters
(U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully
trusted pdfplumber text without any garbled detection, causing garbled
output in the final parsed result.

Relates to #13366

## Solution

### 1. Garbled text detection functions
- `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16),
replacement character U+FFFD, control characters, and
unassigned/surrogate codepoints
- `_is_garbled_text(text, threshold)`: Calculates garbled ratio and
detects `(cid:xxx)` patterns

### 2. Box-level fallback (in `__ocr()`)
When a text box has ≥50% garbled characters, discard pdfplumber text and
fallback to OCR recognition.

### 3. Page-level detection (in `__images__()`)
Sample characters from each page; if garbled rate ≥30%, clear all
pdfplumber characters for that page, forcing full OCR.

### 4. Layout recognizer CID filtering
Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text
processing to prevent them from polluting layout analysis.

## Testing
- 29 unit tests covering: normal CJK/English text, PUA characters, CID
patterns, mixed text, boundary thresholds, edge cases
- All 85 existing project unit tests pass without regression
2026-03-10 11:20:31 +08:00
a674338c21 Fix: remove garbage filtering rules (#11567)
### What problem does this PR solve?
change:

remove garbage filtering rules

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-11-27 17:54:49 +08:00
44f2d6f5da Move 'get_project_base_directory' to common directory (#10940)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-02 21:05:28 +08:00
bc0281040b Feat: add support for the Ascend layout recognizer (#10105)
### What problem does this PR solve?

Supports Ascend layout recognizer.

Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable
the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n`
(for example, n=0) to specify the Ascend device ID.

Ensure that you have installed the [ais
tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)
properly.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-16 09:51:15 +08:00
4a2ff633e0 Fix typo in code (#8327)
### What problem does this PR solve?

Fix typo in code

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-06-18 09:41:09 +08:00
3bb1e012e6 Fix: assistant deleteion issue. (#6906)
### What problem does this PR solve?

#6875

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-04-09 20:29:40 +08:00
4ff609b6a8 Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027)
### What problem does this PR solve?

Optimize OCR garbage identification to reduce unnecessary filtering.
#5713

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-13 18:48:32 +08:00
3894de895b Update comments (#4569)
### What problem does this PR solve?

Add license statement.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-01-21 20:52:28 +08:00
c852a6dfbf Accelerate titles' embeddings. (#4492)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2025-01-15 15:20:29 +08:00
b7ce4e7e62 fix:t_recognizer TypeError: 'super' object is not callable (#4404)
### What problem does this PR solve?

[Bug]: layout recognizer failed for wrong boxes class type #4230
(https://github.com/infiniflow/ragflow/issues/4230)

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: youzhiqiang <zhiqiang.you@aminer.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-01-08 10:59:35 +08:00
ce1e855328 Upgrades Document Layout Analysis model. (#4054)
### What problem does this PR solve?

#4052

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-12-17 11:27:19 +08:00
0d68a6cd1b Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
6b3a40be5c Format file format from Windows/dos to Unix (#1949)
### What problem does this PR solve?

Related source file is in Windows/DOS format, they are format to Unix
format.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-08-15 09:17:36 +08:00
H
c943517932 Fix pdfparser error (#1707)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-07-25 18:54:36 +08:00
453c29170f make sure the models will not be load twice (#422)
### What problem does this PR solve?

#381 
### Type of change

- [x] Refactoring
2024-04-18 09:37:23 +08:00
a5384446e3 let's load model from local (#163) 2024-03-28 16:10:47 +08:00
fd7fcb5baf apply pep8 formalize (#155) 2024-03-27 11:33:46 +08:00
979b3a5b4b support snapshot download from local (#153)
* support snapshot download from local

* let snapshot download from local
2024-03-27 09:53:42 +08:00
71fe314955 refine page ranges (#147) 2024-03-25 13:11:57 +08:00
6c6b144de2 refine manual parser (#140) 2024-03-21 18:17:32 +08:00
bcb58b7e71 layout refine (#115) 2024-03-08 18:59:53 +08:00
8f86ab9f7f refine pdf parser, add time zone to userinfo (#112) 2024-03-08 11:24:24 +08:00
b89ac3c4be chage tas execution logic (#103) 2024-03-06 19:16:31 +08:00
8a726fb04b solve task execution issues (#90) 2024-03-01 19:48:01 +08:00
0429107e80 fix user login issue (#85) 2024-02-29 14:03:07 +08:00
4568a4b2cb refine admin initialization (#75) 2024-02-27 14:57:34 +08:00
d1c600d5d3 add ocr and recognizer demo, update README (#74) 2024-02-26 19:51:35 +08:00
d32322c081 rename vision, add layour and tsr recognizer (#70)
* rename vision, add layour and tsr recognizer

* trivial fixing
2024-02-22 19:11:37 +08:00