ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-04-30 23:37:49 +08:00

Author	SHA1	Message	Date
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
Magicbook1108	158503a1aa	Feat: optimize ingestion pipeline with preprocess (#13211 ) ### What problem does this PR solve? Feat: optimize ingestion pipeline with preprocess ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-26 10:24:13 +08:00
Magicbook1108	109441628b	Fix: upload image files (#13071 ) ### What problem does this PR solve? Fix: upload image files ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-11 09:47:33 +08:00
qinling0210	4bc622b409	Fix parameter of calling self.dataStore.get() and warning info during parser (#13068 ) ### What problem does this PR solve? Fix parameter of calling self.dataStore.get() and warning info during parser https://github.com/infiniflow/ragflow/issues/13036 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-09 17:56:59 +08:00
yH	5333e764fc	fix: optimize Excel row counting for files with abnormal max_row (#13018 ) ### What problem does this PR solve? Some Excel files have abnormal `max_row` metadata (e.g., `max_row=1,048,534` with only 300 actual data rows). This causes: - `row_number()` returns incorrect count, creating 350+ tasks instead of 1 - `list(ws.rows)` iterates through millions of empty rows, causing system hang This PR uses binary search to find the actual last row with data. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-06 14:43:52 +08:00
Stephen Hu	11703d957d	Refactor: Improve Picture.py resource usage (#13011 ) ### What problem does this PR solve? Improve Picture.py resource usage ### Type of change - [x] Refactoring	2026-02-06 09:50:53 +08:00
Stephen Hu	6c9ca45b30	Refactor: improve close for presentation (#12957 ) ### What problem does this PR solve? improve close for presentation ### Type of change - [x] Refactoring	2026-02-03 10:24:27 +08:00
eviaaaaa	1a2d69edc4	feat: Implement legacy .ppt parsing via Tika (alternative to Aspose) (#12932 ) ## What problem does this PR solve? This PR implements parsing support for legacy PowerPoint files (`.ppt`, 97-2003 format). Currently, parsing these files fails because `python-pptx` natively lacks support for the legacy OLE2 binary format. ## Context: I originally using `aspose-slides` for this purpose. However, since `aspose-slides` is no longer a project dependency, I implemented a fallback mechanism using the existing `tika-server` to ensure compatibility and stability. ## Key Changes: - Fallback Logic: Modified `rag/app/presentation.py` to catch `python-pptx` failures and automatically fall back to Tika parsing. - No New Dependencies: Utilizes the `tika` service that is already part of the RAGFlow stack. - Note: Since Tika focuses on text extraction, this implementation extracts text content but does not generate slide thumbnails . ## 🧪 Test / Verification Results ### 1. Before (The Issue) I have verified the fix using a legacy `.ppt` file (`math(1).ppt`, ~8MB). <img width="963" height="970" alt="image" src="https://github.com/user-attachments/assets/468c4ba8-f90b-4d7b-b969-9c5f5e42c474" /> ### 2. After (The Fix) With this PR, the system detects the failure in python-pptx and successfully falls back to Tika. The text is extracted correctly. <img width="1467" height="1121" alt="image" src="https://github.com/user-attachments/assets/fa0fed3b-b923-4c86-ba2c-24b3ce6ee7a6" /> Type of change - [x] New Feature (non-breaking change which adds functionality) Signed-off-by: evilhero <2278596667@qq.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-02-02 13:40:51 +08:00
Carve_	23bdf25a1f	feature:Add OceanBase Storage Support for Table Parser (#12923 ) ### What problem does this PR solve? close #12770 This PR adds OceanBase as a storage backend for the Table Parser. It enables dynamic table schema storage via JSON and implements OceanBase SQL execution for text-to-SQL retrieval. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes - Table Parser stores row data into `chunk_data` when doc engine is OceanBase. (table.py) - OceanBase table schema adds `chunk_data` JSON column and migrates if needed. - Implemented OceanBase `sql()` to execute text-to-SQL results. (ob_conn.py) - Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py) ### Test 1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`) <img width="1290" height="783" alt="doc_engine_ob" src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1" /> 2. Upload an Excel file to Knowledge Base.(for test, we use as below) <img width="786" height="930" alt="excel" src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a" /> 3. Choose Table as parsing method. <img width="2550" height="1134" alt="parse_excel" src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0" /> 4.Ask a natural language query in chat. <img width="2550" height="1134" alt="query" src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91" />	2026-01-31 15:11:54 +08:00
Kevin Hu	f262d416fe	Refa: remove aspose dependency. (#12910 ) ### Type of change - [x] Refactoring	2026-01-30 14:06:19 +08:00
Kevin Hu	f1c2fac03e	Refa: remove ppt image. (#12909 ) ### What problem does this PR solve? remove `aspose` ### Type of change - [x] Refactoring	2026-01-30 13:35:42 +08:00
Kevin Hu	32c0161ff1	Refa: Clean the folders. (#12890 ) ### Type of change - [x] Refactoring	2026-01-29 14:23:26 +08:00
LIRUI YU	c8bd413e4c	Fixed bug: Prevent 400 errors from Image2Text providers by skipping images smaller than 11px on any side during figure enhancement. (#12868 ) ### What problem does this PR solve? During figure enhancement, some cropped figure images are extremely small. Sending these to the Image2Text/VLM provider fails with a 400 invalid_parameter_error because the image width/height must be >10px. This aborts the enhancement step. This PR adds a minimal size guard to skip tiny crops and continue processing. <img width="1084" height="494" alt="image" src="https://github.com/user-attachments/assets/ad074270-94e6-4571-91c8-37df85212639" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-28 14:59:02 +08:00
Yongteng Lei	f096917eeb	Fix: overlap cannot be properly applied (#12828 ) ### What problem does this PR solve? Overlap cannot be properly applied. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-27 12:43:01 +08:00
qinling0210	b40d639fdb	Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541 ) ### What problem does this PR solve? 1) Create dataset using table parser for infinity 2) Answer questions in chat using SQL ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-19 19:35:14 +08:00
MkDev11	678a4f959c	Fix: skip internal bookmark references in DOCX parsing (#12604 ) (#12611 ) ### What problem does this PR solve? Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks (e.g., `#_文档目录`) cause a `KeyError` during parsing: ``` KeyError: "There is no item named 'word/#_文档目录' in the archive" ``` This happens because python-docx incorrectly tries to read internal bookmark references as files from the ZIP archive. Internal bookmarks are relationship targets starting with `#` and are not actual files. This PR extends the existing `load_from_xml_v2` workaround (which already handles `NULL` targets) to also skip relationship targets starting with `#`. Related upstream issue: https://github.com/python-openxml/python-docx/issues/902 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --- Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=94194147	2026-01-14 19:08:46 +08:00
lys1313013	b226e06e2d	refactor: remove debug print statements (#12534 ) ### What problem does this PR solve? refactor: remove debug print statements ### Type of change - [x] Refactoring	2026-01-09 19:23:50 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Yongteng Lei	4cd4526492	Feat: PDF vision figure parser supports reading context (#12416 ) ### What problem does this PR solve? PDF vision figure parser supports reading context. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-05 09:55:43 +08:00
Kevin Hu	52f91c2388	Refine: image/table context. (#12336 ) ### What problem does this PR solve? #12303 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-30 20:24:27 +08:00
Jin Hai	f0392e7501	Fix IDE warnings (#12315 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 15:04:09 +08:00
Jin Hai	df3cbb9b9e	Refactor code (#12305 ) ### What problem does this PR solve? as title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 11:09:18 +08:00
lys1313013	37e4485415	feat: add MDX file support (#12261 ) Feat: add MDX file support #12057 ### What problem does this PR solve? <img width="1055" height="270" alt="image" src="https://github.com/user-attachments/assets/a0ab49f9-7806-41cd-8a96-f593591ab36b" /> The page states that MDX files are supported, but uploading fails with the error: "x.mdx: This type of file has not been supported yet!" <img width="381" height="110" alt="image" src="https://github.com/user-attachments/assets/4bbb7d08-cb47-416a-95fc-bc90b90fcc39" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-29 12:54:31 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Kevin Hu	bd76b8ff1a	Fix: Tika server upgrades. (#12073 ) ### What problem does this PR solve? #12037 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-23 09:35:52 +08:00
buua436	b49eb6826b	Feat: enhance Excel image extraction with vision-based descriptions (#12054 ) ### What problem does this PR solve? issue: [#11618](https://github.com/infiniflow/ragflow/issues/11618) change: enhance Excel image extraction with vision-based descriptions ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-22 10:17:44 +08:00
concertdictate	4dd8cdc38b	task executor issues (#12006 ) ### What problem does this PR solve? Fixes #8706 - `InfinityException: TOO_MANY_CONNECTIONS` when running multiple task executor workers ### Problem Description When running RAGFlow with 8-16 task executor workers, most workers fail to start properly. Checking logs revealed that workers were stuck/hanging during Infinity connection initialization - only 1-2 workers would successfully register in Redis while the rest remained blocked. ### Root Cause The Infinity SDK `ConnectionPool` pre-allocates all connections in `__init__`. With the default `max_size=32` and multiple workers (e.g., 16), this creates 16×32=512 connections immediately on startup, exceeding Infinity's default 128 connection limit. Workers hang while waiting for connections that can never be established. ### Changes 1. Prevent Infinity connection storm (`rag/utils/infinity_conn.py`, `rag/svr/task_executor.py`) - Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since operations are synchronous) - Added staggered startup delay (2s per worker) to spread connection initialization 2. Handle None children_delimiter (`rag/app/naive.py`) - Use `or ""` to handle explicitly set None values from parser config 3. MinerU parser robustness (`deepdoc/parser/mineru_parser.py`) - Use `.get()` for optional output fields that may be missing - Fix DISCARDED block handling: change `pass` to `continue` to skip discarded blocks entirely ### Why `max_size=4` is sufficient \| Workers \| Pool Size \| Total Connections \| Infinity Limit \| \|---------\|-----------\|-------------------\|----------------\| \| 16 \| 32 \| 512 \| 128 ❌ \| \| 16 \| 4 \| 64 \| 128 ✅ \| \| 32 \| 4 \| 128 \| 128 ✅ \| - All RAGFlow operations are synchronous: `get_conn()` → operation → `release_conn()` - No parallel `docStoreConn` operations in the codebase - Maximum 1-2 concurrent connections needed per worker; 4 provides safety margin ### MinerU DISCARDED block bug When MinerU returns blocks with `type: "discarded"` (headers, footers, watermarks, page numbers, artifacts), the previous code used `pass` which left the `section` variable undefined, causing: - UnboundLocalError if DISCARDED is the first block - Duplicate content if DISCARDED follows another block (stale value from previous iteration) Root cause confirmed via MinerU source code: From [`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14): ```python class BlockType: DISCARDED = 'discarded' # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE ``` Per [MinerU documentation](https://opendatalab.github.io/MinerU/reference/output_files/), discarded blocks contain content that should be filtered out for clean text extraction. Fix: Changed `pass` to `continue` to skip discarded blocks entirely. ### Testing - Verified all 16 workers now register successfully in Redis - All workers heartbeating correctly - Document parsing works as expected - MinerU parsing with DISCARDED blocks no longer crashes ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: user210 <user210@rt>	2025-12-18 10:03:30 +08:00
Yongteng Lei	672958a192	Fix: model not authorized (#12001 ) ### What problem does this PR solve? Fix model not authorized. #11973. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-17 19:48:24 +08:00
Kevin Hu	8e4d011b15	Fix: parent-children chunking method. (#11997 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-12-17 16:50:36 +08:00
Jin Hai	0e8b9588ba	Fix error and format issue (#11975 ) ### What problem does this PR solve? 1. Fix error of book chunking. 2. Fix format issues. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-16 19:29:37 +08:00
concertdictate	49c74d08e8	Feature/mineru improvements (#11938 ) 我已在下面的评论中用中文重复说明。 ### What problem does this PR solve? ## Summary This PR enhances the MinerU document parser with additional configuration options, giving users more control over PDF parsing behavior and improving support for multilingual documents. ## Changes ### Backend (`deepdoc/parser/mineru_parser.py`) - Added configurable parsing options: - Parse Method: `auto`, `txt`, or `ocr` — allows users to choose the extraction strategy - Formula Recognition: Toggle for enabling/disabling formula extraction (useful to disable for Cyrillic documents where it may cause issues) - Table Recognition: Toggle for enabling/disabling table extraction - Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate RAGFlow language settings to MinerU-compatible language codes for better OCR accuracy - Improved parser configuration handling to pass these options through the processing pipeline ### Frontend (`web/`) - Created new `MinerUOptionsFormField` component that conditionally renders when MinerU is selected as the layout recognition engine - Added UI controls for: - Parse method selection (dropdown) - Formula recognition toggle (switch) - Table recognition toggle (switch) - Added i18n translations for English and Chinese - Integrated the options into both the dataset creation dialog and dataset settings page ### Integration - Updated `rag/app/naive.py` to forward MinerU options to the parser - Updated task service to handle the new configuration parameters ## Why MinerU is a powerful document parser, but the default settings don't work well for all document types. This PR allows users to: 1. Choose the best parsing method for their documents 2. Disable formula recognition for Cyrillic/non-Latin scripts where it causes issues 3. Control table extraction based on document needs 4. Benefit from automatic language detection for better OCR results ## Testing - [x] Tested MinerU parsing with different parse methods - [x] Verified UI renders correctly when MinerU is selected/deselected - [x] Confirmed settings persist correctly in dataset configuration ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: user210 <user210@rt> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-16 13:15:25 +08:00
Magicbook1108	7d23c3aed0	Fix: presentation parsing & Embedding encode exception handling (#11933 ) ### What problem does this PR solve? Fix: presentation parsing #11920 Fix: Embeddin encode exception handling ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-13 11:37:42 +08:00
Yongteng Lei	e9710b7aa9	Refa: treat MinerU as an OCR model 2 (#11905 ) ### What problem does this PR solve? Treat MinerU as an OCR model 2. #11903 ### Type of change - [x] Refactoring	2025-12-11 17:33:12 +08:00
buua436	ab4b62031f	Fix:csv parse in Table (#11870 ) ### What problem does this PR solve? change: csv parse in Table ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-10 16:44:06 +08:00
Magicbook1108	ca2d6f3301	Fix: duplicate output by async_chat_streamly (#11842 ) ### What problem does this PR solve? Fix: duplicate output by async_chat_streamly Refact: revert manual modification ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-09 19:21:52 +08:00
Yongteng Lei	a94b3b9df2	Refa: treat MinerU as an OCR model (#11849 ) ### What problem does this PR solve? Treat MinerU as an OCR model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-09 18:54:14 +08:00
Yongteng Lei	c51e6b2a58	Refa: migrate CV model chat to Async (#11828 ) ### What problem does this PR solve? Migrate CV model chat to Async. #11750 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2025-12-09 13:08:37 +08:00
Kevin Hu	09a3854ed8	Fix: chunk method error. (#11807 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 14:28:23 +08:00
Jin Hai	43f51baa96	Fix errors (#11804 ) ### What problem does this PR solve? 1. typos 2. grammar errors. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-08 12:21:18 +08:00
Stephen Hu	b66881a371	Refactor:book parser use with to handle bytesIO (#11800 ) ### What problem does this PR solve? book parser use with to handle bytesIO ### Type of change - [x] Refactoring	2025-12-08 10:18:46 +08:00
少卿	7719fd6350	Fix MinerU API sanitized-output lookup and manual chunk tuple handling (#11702 ) ### What problem does this PR solve? This PR addresses two independent issues encountered when using the MinerU engine in Ragflow: 1. MinerU API output path mismatch for non-ASCII filenames MinerU sanitizes the root directory name inside the returned ZIP when the original filename contains non-ASCII characters (e.g., Chinese). Ragflow's client-side unzip logic assumed the original filename stem and therefore failed to locate `_content_list.json`. This PR adds: * root-directory detection * fallback lookup using sanitized names * a broadened `_read_output` search with a glob fallback ensuring output files are consistently located regardless of filename encoding. 2. Chunker crash due to tuple-structure mismatch in manual mode Some parsers (e.g., MinerU / Docling) return 2-tuple sections, but Ragflow’s chunker expects 3-tuple sections, leading to: `ValueError: not enough values to unpack (expected 3, got 2)` This PR normalizes all sections to a uniform structure `(text, layout, positions)`: * parse position tags when present * default to empty positions when missing preserving backward compatibility and preventing crashes. ### Type of change * [x] Bug Fix (non-breaking change which fixes an issue) [#11136](https://github.com/infiniflow/ragflow/issues/11136) [#11700](https://github.com/infiniflow/ragflow/issues/11700) [#11620](https://github.com/infiniflow/ragflow/issues/11620) [#11701](https://github.com/infiniflow/ragflow/pull/11701) we need your help [yongtenglei](https://github.com/yongtenglei) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-05 19:25:45 +08:00
rommy2017	257af75ece	Fix: relative page_number in boxes (#11712 ) page_number in boxes is relative page number，must + from_page ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-04 11:23:34 +08:00
rommy2017	4ba17361e9	feat: improve presentation PdfParser (#11639 ) The old presentation PdfParser lost table format after parse ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-02 17:35:14 +08:00
Kevin Hu	14616cf845	Feat: add child parent chunking method in backend. (#11598 ) ### What problem does this PR solve? #7996 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-28 19:25:32 +08:00
Yongteng Lei	9d8b96c1d0	Feat: add context for figure and table (#11547 ) ### What problem does this PR solve? Add context for figure table. ![demo_figure_table_context](https://github.com/user-attachments/assets/61b37fac-e22e-40a4-9665-9396c7b4103e) `==================()` for demonstrating purpose. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-27 10:21:44 +08:00
Kevin Hu	74e0b58d89	Fix: excel default optimization. (#11519 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:20 +08:00
Yongteng Lei	7c20c964b4	Fix: incorrect image merging for naive markdown parser (#11520 ) ### What problem does this PR solve? Fix incorrect image merging for naive markdown parser. #9349 [ragflow_readme.webm](https://github.com/user-attachments/assets/ca3f1e18-72b6-4a4c-80db-d03da9adf8dc) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-11-25 19:54:06 +08:00
Stephen Hu	41665b0865	Refactor: Email parser use with to handle buffer (#11496 ) ### What problem does this PR solve? Email parser use with to handle buffer ### Type of change - [x] Refactoring	2025-11-25 10:03:37 +08:00
coding	971c1bcba7	Fix: missing parameters in by_plaintext method for PDF naive mode (#11408 ) ### What problem does this PR solve? FIx: missing parameters in by_plaintext method for PDF naive mode ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: lih <dev_lih@139.com>	2025-11-21 09:33:36 +08:00

1 2 3 4 5 ...

280 Commits