### What problem does this PR solve?
Skip duplicate errors to avoid `create_idx` failures caused by slow
metadata refresh or external modifications.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fixes Infinity-specific API regressions: preserves the `important_kwd`
round-trip for `[""]`, restores the required highlight key in retrieval
responses, and enforces Infinity guards for unsupported
`parser_id=tag` and pagerank in `/v1/kb/update`. Also removes a
slow/buggy pandas row-wise apply that was throwing `ValueError` and
causing flakiness.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Problem
The `important_kwd` field in the Infinity connector was using mismatched
separators:
- **Storage**: `list2str(v)` uses space as the default separator
- **Reading**: `v.split()` splits on all whitespace
This causes multi-word keywords like `"Senior Fund Manager"` to be
incorrectly split into `["Senior", "Fund", "Manager"]`.
## Solution
Use a comma `,` as the separator for both storing and reading, consistent
with:
1. The LLM output format in `keyword_prompt.md` ("delimited by ENGLISH
COMMA")
2. The `cached.split(",")` in `task_executor.py`
## Changes
- `insert()`: `list2str(v)` → `list2str(v, ",")`
- `update()`: `list2str(v)` → `list2str(v, ",")`
- `get_fields()`: `v.split()` → `v.split(",") if v else []`
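A minimal sketch of the round-trip before and after the change, using a
simplified stand-in for `list2str`:
```python
def list2str(values, sep=" "):  # stand-in for the real helper
    return sep.join(values)

kwds = ["Senior Fund Manager", "ESG"]

# Before: stored with spaces, read back with a whitespace split.
stored = list2str(kwds)
assert stored.split() == ["Senior", "Fund", "Manager", "ESG"]  # fragmented

# After: comma on both sides round-trips multi-word keywords intact.
stored = list2str(kwds, ",")
assert (stored.split(",") if stored else []) == kwds
```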
## Impact
This bug affects:
- Python-level reranking weight calculation (`important_kwd * 5`)
- API response keyword display
- Search precision due to fragmented keywords
## Summary
Fixes #12520 - Deleted chunks should not appear in retrieval/reference
results.
## Changes
### Core Fix
- **api/apps/chunk_app.py**: Include `doc_id` in the delete condition to
properly scope the delete operation
### Improved Error Handling
- **api/db/services/document_service.py**: Better separation of concerns
with individual try-catch blocks and proper logging for each cleanup
operation
### Doc Store Updates
- **rag/utils/es_conn.py**: Updated delete query construction to support
compound conditions (see the sketch after this list)
- **rag/utils/opensearch_conn.py**: Same updates for OpenSearch
compatibility
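A minimal sketch of the compound delete condition, assuming an
Elasticsearch 8.x client and illustrative index/field values:
```python
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")    # illustrative endpoint
chunk_ids, doc_id = ["chunk_1", "chunk_2"], "doc_42"  # illustrative values

# Scope the delete by both chunk ids and doc_id, so the operation cannot
# touch chunks belonging to other documents.
es_client.delete_by_query(
    index="ragflow_chunks",  # assumed index name
    query={
        "bool": {
            "must": [
                {"ids": {"values": chunk_ids}},
                {"term": {"doc_id": doc_id}},
            ]
        }
    },
)
```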
### Tests
- **test/testcases/.../test_retrieval_chunks.py**: Added
`TestDeletedChunksNotRetrievable` class with regression tests
- **test/unit/test_delete_query_construction.py**: Unit tests for delete
query construction
## Testing
- Added regression tests that verify deleted chunks are not returned by
retrieval API
- Tests cover single chunk deletion and batch deletion scenarios
### What problem does this PR solve?
Fix regex pattern validation in `split_with_pattern` (#12605)
- Add a try-except block to validate user-provided regex patterns before
use
- Gracefully fall back to a single chunk when an invalid regex is
provided
- Prevent server crashes during DOCX parsing with malformed delimiters
## Problem
Parsing DOCX files with custom regex delimiters crashes with `re.error:
nothing to repeat at position 9` when users provide invalid regex
patterns.
Closes #12605
## Solution
Validate and compile regex pattern before use. On invalid pattern, log
warning and return content as single chunk instead of crashing.
## Changes
- `rag/nlp/__init__.py`: Add regex validation in `split_with_pattern()`
function
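A minimal sketch of the guarded split, assuming a simplified signature
for `split_with_pattern`:
```python
import logging
import re

def split_with_pattern(content: str, pattern: str) -> list[str]:
    # Validate/compile the user-provided pattern first; on failure, log
    # a warning and return the whole content as a single chunk instead
    # of letting re.error crash the parser.
    try:
        compiled = re.compile(pattern)
    except re.error as e:
        logging.warning("Invalid delimiter pattern %r: %s", pattern, e)
        return [content]
    return compiled.split(content)
```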
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=42954461
### What problem does this PR solve?
Feat: Hash doc id to avoid duplicate names.
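A minimal sketch of the idea, assuming the id is derived by hashing the
knowledge-base id and file name (helper name is illustrative, not the
actual implementation):
```python
import hashlib

# Illustrative helper: derive a stable doc id by hashing, so documents
# with the same name get distinct, deterministic ids per knowledge base.
def hashed_doc_id(kb_id: str, doc_name: str) -> str:
    return hashlib.md5(f"{kb_id}/{doc_name}".encode("utf-8")).hexdigest()

print(hashed_doc_id("kb_1", "report.pdf"))
```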
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks
(e.g., `#_文档目录`) cause a `KeyError` during parsing:
```
KeyError: "There is no item named 'word/#_文档目录' in the archive"
```
This happens because python-docx incorrectly tries to read internal
bookmark references as files from the ZIP archive. Internal bookmarks
are relationship targets starting with `#` and are not actual files.
This PR extends the existing `load_from_xml_v2` workaround (which
already handles `NULL` targets) to also skip relationship targets
starting with `#`.
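A minimal, runnable sketch of the guard, with relationship targets
modeled as plain pairs rather than python-docx objects:
```python
# Hypothetical relationship records as (rel_id, target_ref) pairs; in the
# real patch these come from the document's relationships XML.
relationships = [
    ("rId1", "word/media/image1.png"),
    ("rId2", "NULL"),       # already skipped by load_from_xml_v2
    ("rId3", "#_文档目录"),  # internal bookmark: skip it too
]

loadable = [
    target for _, target in relationships
    if target and target != "NULL" and not target.startswith("#")
]
print(loadable)  # ['word/media/image1.png']
```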
Related upstream issue:
https://github.com/python-openxml/python-docx/issues/902
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=94194147
### Issue
When using Qwen3 models (`qwen3-32b`, `qwen3-max`) through the
Tongyi-Qianwen provider for non-streaming calls (e.g., knowledge graph
generation), the API fails with:
```
parameter.enable_thinking must be set to false for non-streaming calls
```
Closes #12424
### Root Cause
In `LiteLLMBase.async_chat()`, the `extra_body={"enable_thinking":
False}` was set in `kwargs` but never forwarded to
`_construct_completion_args()`.
### Solution
Pass merged kwargs to `_construct_completion_args()` using
`**{**gen_conf, **kwargs}` to safely handle potential duplicate
parameters.
### Changes
- `rag/llm/chat_model.py`: Forward kwargs containing `extra_body` to
`_construct_completion_args()` in `async_chat()`
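A minimal sketch of the merge, with a stand-in for
`_construct_completion_args`:
```python
# Hedged, simplified shape of the call site (the real method carries
# more state).
def _construct_completion_args(**params):
    return params  # stand-in for the real argument builder

gen_conf = {"temperature": 0.2}
kwargs = {"extra_body": {"enable_thinking": False}}  # required by Qwen3 non-streaming

# kwargs is merged last, so it wins on duplicate keys and extra_body is
# no longer dropped on the way to litellm.
completion_args = _construct_completion_args(**{**gen_conf, **kwargs})
print(completion_args)
```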
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=42954461
### What problem does this PR solve?
This PR eliminates unnecessary debug print statements that were left in
hot paths of the codebase.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
When there are multiple users, parsing a document for a new user can
trigger the reuse of column objects, leading to the error
`sqlalchemy.exc.ArgumentError: Column object 'id' already assigned to
Table xxx`.
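A minimal reproduction of the underlying SQLAlchemy constraint (table
names are illustrative):
```python
from sqlalchemy import Column, Integer, MetaData, Table

metadata = MetaData()
shared_id = Column("id", Integer, primary_key=True)

# A Column instance may belong to exactly one Table; reusing it for a
# second (e.g. per-user) table raises the ArgumentError above.
Table("docs_user_a", metadata, shared_id)
Table("docs_user_b", metadata, shared_id)  # sqlalchemy.exc.ArgumentError
```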
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix image thumbnails not displaying when using the pipeline.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. PaddleOCR PDF parser supports thumbnails and positions.
2. Add FAQ documentation for PaddleOCR PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- API server
- Ingestion server
- Data sync server
- Admin server
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Add PaddleOCR as a new PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
If we delete the password from kwargs, `init_db_config` will fail, so we
need to keep this field.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: support context window for docx
#12303
Done:
- [x] naive.py
- [x] one.py
TODO:
- [ ] book.py
- [ ] manual.py
Fix: incorrect image position
Fix: incorrect chunk type tag
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Fixes the health check failure in multi-bucket MinIO environments.
Previously, health checks would fail because the default
"ragflow-bucket" did not exist. This caused false negatives for system
health.
- Also removes the `_health_check` write in single-bucket mode to avoid
side effects (minor optimization).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Refactor TOC building logic to use enumerate instead of while loop, add
comprehensive error handling for missing/invalid chunk_id values, and
improve logging with more specific error messages. The changes make the
code more robust against malformed TOC data while maintaining the same
functionality for valid inputs.
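A minimal sketch of the enumerate-based loop with the added guards (the
item shape is assumed):
```python
import logging

raw_toc_items = [{"chunk_id": "c1"}, {}, {"chunk_id": None}]  # illustrative

toc = []
for idx, item in enumerate(raw_toc_items):  # replaces the manual while loop
    chunk_id = item.get("chunk_id")
    if not chunk_id or not isinstance(chunk_id, str):
        logging.warning("TOC item %d has a missing or invalid chunk_id: %r",
                        idx, item)
        continue
    toc.append({"index": idx, "chunk_id": chunk_id})
```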
### Type of change
- [x] Refactoring
### What problem does this PR solve?
PDF vision figure parser supports reading context.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Improve task executor heartbeat handling and cleanup.
### What problem does this PR solve?
- **Reduce lock contention during executor cleanup**: The cleanup lock
is acquired only when removing expired executors, not during regular
heartbeat reporting, reducing potential lock contention.
- **Optimize own heartbeat cleanup**: Each executor removes its own
expired heartbeat using `zremrangebyscore` instead of `zcount` +
`zpopmin`, reducing Redis operations and improving efficiency (see the
sketch after this list).
- **Improve cleanup of other executors' heartbeats**: Expired executors
are detected by checking their latest heartbeat, and stale entries are
removed safely.
- **Other improvements**: IP address and PID are captured once at
startup, and unnecessary global declarations are removed.
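A minimal sketch of the own-heartbeat cleanup, assuming a per-executor
sorted set with heartbeat timestamps as scores (the key format and
expiry window are illustrative):
```python
import time

import redis

r = redis.Redis()
key = "TASKEXE:task_executor_0"  # assumed key format
EXPIRY_SECONDS = 60 * 30         # illustrative expiry window

# A single zremrangebyscore drops every entry older than the window,
# replacing the previous zcount + zpopmin sequence.
r.zremrangebyscore(key, "-inf", time.time() - EXPIRY_SECONDS)
```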
### Type of change
- [x] Performance Improvement
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Eliminates SQL injection vectors in the OpenDAL MySQL initialization
logic by implementing strict input validation and explicit type casting.
**Modifications:**
1. **`init_db_config`**: Enforced integer casting for
`max_allowed_packet` before formatting it into the SQL string.
2. **`init_opendal_mysql_table`**: Implemented regex-based validation
for `table_name` to ensure only alphanumeric characters and underscores
are permitted, preventing arbitrary SQL command injection through
configuration parameters.
These changes ensure that even if configuration values are sourced from
untrusted environments, the database initialization remains secure.
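A minimal sketch of the two guards, with an illustrative config source:
```python
import re

def validated_table_name(table_name: str) -> str:
    # Whitelist alphanumerics and underscores so a configuration value
    # cannot smuggle SQL into the statement it is formatted into.
    if not re.fullmatch(r"\w+", table_name, flags=re.ASCII):
        raise ValueError(f"invalid table name: {table_name!r}")
    return table_name

config = {"max_allowed_packet": "1073741824"}           # illustrative source
max_allowed_packet = int(config["max_allowed_packet"])  # explicit integer cast
sql = f"SET GLOBAL max_allowed_packet={max_allowed_packet}"
print(validated_table_name("opendal_storage"), sql)
```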
### What problem does this PR solve?
Feat: Bitbucket connector NOT READY TO MERGE
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Issue: #12313
Change: add Zendesk data source integration with configuration and sync
capabilities.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Simplified and consolidated extraction rules
- Emphasized strict evidence-based extraction only
- Strengthened enum handling and hallucination prevention
- Clarified output requirements for empty results
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Issues: #12217, [#12313](https://github.com/infiniflow/ragflow/issues/12313)
Change: add IMAP data source integration with configuration and sync
capabilities.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Use async task to save memory.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: Gitlab connector
Fix: submit button in dark mode
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Change: add Asana data source integration and configuration options.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Improve image and table context.
Current strategy in `attach_media_context` (sketched after the list):
- Order by position when possible: if any chunk has page/position info,
sort by (page, top, left), otherwise keep original order.
- Apply only to media chunks: images use image_context_size, tables use
table_context_size.
- Primary matching: on the same page, choose a text chunk whose vertical
span overlaps the media, then pick the one with the closest vertical
midpoint.
- Fallback matching: if no overlap on that page, choose the nearest text
chunk on the same page (page-head uses the next text; page-tail uses the
previous text).
- Context extraction: inside the chosen text chunk, find a mid-sentence
boundary near the text midpoint, then take context_size tokens split
before/after (total budget).
- No multi-chunk stitching: context comes from a single text chunk to
avoid mixing unrelated segments.
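A minimal sketch of the primary/fallback matching steps, assuming chunks
expose `page`, `top`, and `bottom` (the context-extraction step is
elided):
```python
from dataclasses import dataclass

@dataclass
class Chunk:
    page: int
    top: float
    bottom: float

def pick_context_chunk(media, text_chunks):
    same_page = [t for t in text_chunks if t.page == media.page]
    media_mid = (media.top + media.bottom) / 2

    def closest(chunks):
        return min(chunks, key=lambda t: abs((t.top + t.bottom) / 2 - media_mid))

    # Primary: text chunks whose vertical span overlaps the media,
    # preferring the closest vertical midpoint.
    overlapping = [t for t in same_page
                   if t.top < media.bottom and t.bottom > media.top]
    if overlapping:
        return closest(overlapping)
    # Fallback: the nearest text chunk on the same page, if any.
    return closest(same_page) if same_page else None
```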
### Type of change
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
Manage messages and use them in the agent.
Issue #4213
### Type of change
- [x] New Feature (non-breaking change which adds functionality)