ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-27 19:25:58 +08:00

Author	SHA1	Message	Date
Zhichang Yu	c446c403de	perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385 ) ## Summary - Lazy img_np loading: `np.array(img)` is now deferred until the first OCR text extraction is actually needed, avoiding unnecessary memory allocation for pages that already have text. - Chunked parse_into_bboxes: Large PDFs (>50 pages, configurable via `PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's boxes are normalized with `_to_global_boxes` to produce globally consistent page numbers and position tags. - DLA early init: Move remote-client initialization before model loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy `TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser containers relying on remote inference. - Fix outline regression: Restore `self.outlines = extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped during refactoring and is required by downstream `remove_toc` and metadata handling in `rag/flow/parser/parser.py`. ## Test plan - [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines` is populated - [ ] Large PDF (>50 pages): verify chunked processing produces globally consistent page numbers - [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local model is not downloaded - [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-27 16:52:43 +08:00
tunsuy	292a1a8566	fix: detect and fallback garbled PDF text to OCR (#13366 ) (#13404 ) ## Problem When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer) cannot map CIDs to correct Unicode characters, outputting PUA characters (U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully trusted pdfplumber text without any garbled-text detection, causing garbled output in the final parsed result. Relates to #13366 ## Solution ### 1. Garbled text detection functions - `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16), replacement character U+FFFD, control characters, and unassigned/surrogate codepoints - `_is_garbled_text(text, threshold)`: Calculates garbled ratio and detects `(cid:xxx)` patterns ### 2. Box-level fallback (in `__ocr()`) When a text box has ≥50% garbled characters, discard the pdfplumber text and fall back to OCR recognition. ### 3. Page-level detection (in `__images__()`) Sample characters from each page; if the garbled rate is ≥30%, clear all pdfplumber characters for that page, forcing full OCR. ### 4. Layout recognizer CID filtering Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text processing to prevent them from polluting layout analysis. ## Testing - 29 unit tests covering: normal CJK/English text, PUA characters, CID patterns, mixed text, boundary thresholds, edge cases - All 85 existing project unit tests pass without regression	2026-03-10 11:20:31 +08:00
buua436	a674338c21	Fix: remove garbage filtering rules (#11567 ) ### What problem does this PR solve? change: remove garbage filtering rules ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-27 17:54:49 +08:00
Jin Hai	44f2d6f5da	Move 'get_project_base_directory' to common directory (#10940 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-02 21:05:28 +08:00
Yongteng Lei	bc0281040b	Feat: add support for the Ascend layout recognizer (#10105 ) ### What problem does this PR solve? Supports Ascend layout recognizer. Use the environment variable `LAYOUT_RECOGNIZER_TYPE=ascend` to enable the Ascend layout recognizer, and `ASCEND_LAYOUT_RECOGNIZER_DEVICE_ID=n` (for example, n=0) to specify the Ascend device ID. Ensure that you have installed the [ais tools](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench) properly. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-16 09:51:15 +08:00
Jin Hai	4a2ff633e0	Fix typo in code (#8327 ) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-06-18 09:41:09 +08:00
Kevin Hu	3bb1e012e6	Fix: assistant deletion issue. (#6906 ) ### What problem does this PR solve? #6875 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-04-09 20:29:40 +08:00
Yongteng Lei	4ff609b6a8	Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027 ) ### What problem does this PR solve? Optimize OCR garbage identification to reduce unnecessary filtering. #5713 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-13 18:48:32 +08:00
Jin Hai	3894de895b	Update comments (#4569 ) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-01-21 20:52:28 +08:00
Kevin Hu	c852a6dfbf	Accelerate titles' embeddings. (#4492 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-01-15 15:20:29 +08:00
Zhi-Qiang You	b7ce4e7e62	fix:t_recognizer TypeError: 'super' object is not callable (#4404 ) ### What problem does this PR solve? [Bug]: layout recognizer failed for wrong boxes class type #4230 (https://github.com/infiniflow/ragflow/issues/4230) ### Type of change - [✅ ] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: youzhiqiang <zhiqiang.you@aminer.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-01-08 10:59:35 +08:00
Kevin Hu	ce1e855328	Upgrades Document Layout Analysis model. (#4054 ) ### What problem does this PR solve? #4052 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-12-17 11:27:19 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Jin Hai	6b3a40be5c	Format file format from Windows/dos to Unix (#1949 ) ### What problem does this PR solve? The related source files are in Windows/DOS format; they are converted to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-08-15 09:17:36 +08:00
H	c943517932	Fix pdfparser error (#1707 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-07-25 18:54:36 +08:00
KevinHuSh	453c29170f	make sure the models will not be loaded twice (#422 ) ### What problem does this PR solve? #381 ### Type of change - [x] Refactoring	2024-04-18 09:37:23 +08:00
KevinHuSh	a5384446e3	let's load model from local (#163 )	2024-03-28 16:10:47 +08:00
KevinHuSh	fd7fcb5baf	apply pep8 formalize (#155 )	2024-03-27 11:33:46 +08:00
KevinHuSh	979b3a5b4b	support snapshot download from local (#153 ) * support snapshot download from local * let snapshot download from local	2024-03-27 09:53:42 +08:00
KevinHuSh	71fe314955	refine page ranges (#147 )	2024-03-25 13:11:57 +08:00
KevinHuSh	6c6b144de2	refine manual parser (#140 )	2024-03-21 18:17:32 +08:00
KevinHuSh	bcb58b7e71	layout refine (#115 )	2024-03-08 18:59:53 +08:00
KevinHuSh	8f86ab9f7f	refine pdf parser, add time zone to userinfo (#112 )	2024-03-08 11:24:24 +08:00
KevinHuSh	b89ac3c4be	change task execution logic (#103 )	2024-03-06 19:16:31 +08:00
KevinHuSh	8a726fb04b	solve task execution issues (#90 )	2024-03-01 19:48:01 +08:00
KevinHuSh	0429107e80	fix user login issue (#85 )	2024-02-29 14:03:07 +08:00
KevinHuSh	4568a4b2cb	refine admin initialization (#75 )	2024-02-27 14:57:34 +08:00
KevinHuSh	d1c600d5d3	add ocr and recognizer demo, update README (#74 )	2024-02-26 19:51:35 +08:00
KevinHuSh	d32322c081	rename vision, add layout and tsr recognizer (#70 ) * rename vision, add layout and tsr recognizer * trivial fixing	2024-02-22 19:11:37 +08:00

29 Commits
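
The garbled-text detection described in commit 292a1a8566 can be illustrated with a minimal sketch. This is not the project's code: the names `is_garbled_char`, `is_garbled_text`, and `CID_PATTERN` are hypothetical stand-ins, the exact set of control characters exempted is an assumption, and treating any `(cid:xxx)` match as garbled is one possible reading of "detects `(cid:xxx)` patterns". Only the character classes and the 50% box-level threshold come from the commit message.

```python
import re
import unicodedata

# Hypothetical pattern for pdfminer-style "(cid:123)" placeholders.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def is_garbled_char(ch: str) -> bool:
    """Return True if a single character looks like extraction debris."""
    cp = ord(ch)
    # Private Use Area in the BMP (U+E000..U+F8FF).
    if 0xE000 <= cp <= 0xF8FF:
        return True
    # Supplementary Private Use Areas (Planes 15 and 16).
    if 0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD:
        return True
    # Unicode replacement character.
    if cp == 0xFFFD:
        return True
    category = unicodedata.category(ch)
    # Unassigned (Cn) and surrogate (Cs) code points.
    if category in ("Cn", "Cs"):
        return True
    # Control characters, except common whitespace (an assumption).
    if category == "Cc" and ch not in "\t\n\r":
        return True
    return False


def is_garbled_text(text: str, threshold: float = 0.5) -> bool:
    """Return True if text contains a CID placeholder or if the share of
    garbled non-whitespace characters reaches the threshold."""
    if not text or not text.strip():
        return False
    # Assumption: any "(cid:xxx)" placeholder marks the text as garbled.
    if CID_PATTERN.search(text):
        return True
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    garbled = sum(1 for c in chars if is_garbled_char(c))
    return garbled / len(chars) >= threshold
```

Under these assumptions, a caller would pass the default threshold of 0.5 for the box-level check and 0.3 for the page-level sample, falling back to OCR whenever the function returns True.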