ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-08 08:07:21 +08:00

Author	SHA1	Message	Date
buua436	7a70a0fd85	Fix: preserve infinity available_int zero filter (#14416 ) ### What problem does this PR solve? preserve infinity available_int zero filter ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 12:54:32 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/*.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. 
Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `10**9`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
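
The failure mode described in #14382 (a callee default silently capping the page range because callers did not forward it) can be illustrated with a small sketch. Only the two constant values come from the PR description; `select_pages` and `parse_into_bboxes_sketch` are hypothetical stand-ins, and the half-open range semantics are an assumption of this sketch, not a statement about the real parser.

```python
# Constant values as quoted in the PR description.
MAXIMUM_PAGE_NUMBER = 100000
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000


def select_pages(total_pages, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
    """Hypothetical helper: indices of the pages that would be rendered.

    Assumes a half-open range [page_from, page_to), clipped to the document.
    """
    return list(range(page_from, min(page_to, total_pages)))


def parse_into_bboxes_sketch(total_pages, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
    """Hypothetical caller that forwards its range instead of relying on
    whatever default the callee happens to have."""
    return select_pages(total_pages, page_from=from_page, page_to=to_page)
```

With a small literal default such as `page_to=299`, a 1200-page document would be cut short whenever a caller forgot to pass the range; a shared sentinel constant plus explicit forwarding removes both hazards.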
Xing Hong	fb95136f39	Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090 ) ### What problem does this PR solve? The POST /upload_info?url=<url> endpoint accepted a user-supplied URL and passed it directly to AsyncWebCrawler without any validation. There were no restrictions on URL scheme, destination hostname, or resolved IP address. This allowed any authenticated user to instruct the server to make outbound HTTP requests to internal infrastructure — including RFC 1918 private networks, loopback addresses, and cloud metadata services such as http://169.254.169.254 — effectively using the server as a proxy for internal network reconnaissance or credential theft. This PR adds an SSRF guard (_validate_url_for_crawl) that runs before any crawl is initiated. It enforces an allowlist of safe schemes (http/https), resolves the hostname at validation time, and rejects any URL whose resolved IP falls within a private or reserved network range. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-25 14:30:15 +08:00
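
A minimal sketch of the kind of check #14090 describes (scheme allowlist, hostname resolution, rejection of non-public addresses), using only the Python standard library. This is not the project's `_validate_url_for_crawl`; the function name, the injectable `resolver` parameter, and the use of `is_global` as the rejection criterion are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def resolve_host(hostname):
    """Resolve a hostname to the set of IP address strings it maps to."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def validate_url_for_crawl(url, resolver=resolve_host):
    """Raise ValueError unless the URL uses http/https and every resolved
    address is globally routable. `resolver` is injectable for testing."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    for addr in resolver(parsed.hostname):
        # Drop any IPv6 zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(addr.split("%")[0])
        if not ip.is_global:
            raise ValueError(f"address not allowed: {ip}")
    return url
```

Note that validating at resolution time does not by itself pin the address used by the later request, so a production guard may also need to defend against DNS rebinding.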
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. 
--- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
Idriss Sbaaoui	ca01c7a745	Fix blob sync: skip unsupported files before download (#14357 ) ### What problem does this PR solve? Blob storage sync was downloading unsupported files first and rejecting them later, which wasted bandwidth and made sync slower. This PR skips unsupported extensions before download and applies `allow_images` in blob sync. fixes #14338 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 19:22:32 +08:00
euvre	84b6069ec7	fix: escape single quotes in Infinity SQL filter conditions (#14186 ) ### What problem does this PR solve? ## Summary Fixes #5939 Entity names containing single quotes (e.g., `投影直线L'`) caused SQL syntax errors when building filter conditions for Infinity queries, due to unescaped string interpolation in `equivalent_condition_to_str`. ## Changes In `common/doc_store/infinity_conn_base.py`, added `.replace("'", "''")` escaping for string values in two branches of `equivalent_condition_to_str` where it was missing: 1. `field_keyword` branch with non-list value (line 190): The list branch already escaped single quotes on line 183, but the single-string branch did not. 2. Plain string value branch (line 209): Direct f-string interpolation `{k}='{v}'` was vulnerable to unescaped quotes. Both fixes use the same SQL-standard escape pattern (`'` → `''`) already applied elsewhere in this method. ## How to Test 1. Upload a document containing entity names with single quotes. 2. Enable Knowledge Graph (GraphRAG) in the parsing configuration. 3. Initiate document parsing — it should complete without SQL syntax errors. ## Note The original issue also reported a typo (`dge_graph_kwd` instead of `knowledge_graph_kwd`), which has already been fixed in the current codebase. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-20 10:04:07 +08:00
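
The escape pattern named in #14186 (`'` becomes `''`) can be sketched as follows. `sql_quote` and `equals_condition` are hypothetical helpers written for this illustration; they are not the body of `equivalent_condition_to_str`.

```python
def sql_quote(value: str) -> str:
    """Wrap a string in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def equals_condition(field: str, value: str) -> str:
    """Build a `field='value'` filter fragment with the value escaped."""
    return f"{field}={sql_quote(value)}"
```

For example, `equals_condition("entity_kwd", "L'")` yields `entity_kwd='L'''`, which a SQL lexer reads as the two-character string `L'` rather than an unterminated literal. The field name `entity_kwd` is only a placeholder here.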
Wang Qi	96a23d2fd0	[Bug fix] Fix regression where viewing chunks of a document not yet parsed in Infinity failed in the UI (#14168 ) ### What problem does this PR solve? See title. A screenshot of the failure was attached to the PR. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-17 09:51:23 +08:00
euvre	0cd49e14dd	fix: make Infinity connection pool size configurable and add retry logic for GraphRAG write bursts (#14143 ) ### What problem does this PR solve? Resolve #14137 . ### Problem Graph resolution succeeds (nodes/edges merged, pagerank updated), but the subsequent burst of Infinity write operations in `set_graph` exhausts the connection pool with `TOO_MANY_CONNECTIONS` errors. Root causes: 1. Hardcoded pool size — `infinity_conn_pool.py` hardcoded `ConnectionPool(max_size=4)` on initial creation and `max_size=32` on refresh. Operators cannot tune this without patching code. 2. No retry on transient failures — a single `TOO_MANY_CONNECTIONS` on edge deletes or chunk inserts kills the entire resolution+community pipeline with no retry. ### Changes #### `common/doc_store/infinity_conn_pool.py` - Read `ConnectionPool` `max_size` from the `INFINITY_POOL_MAX_SIZE` environment variable (default: `4`), applied consistently to both initial creation and refresh paths. - Log the actual pool size on startup for easier debugging. #### `rag/graphrag/utils.py` — `set_graph()` - Edge deletes: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) so transient `TOO_MANY_CONNECTIONS` errors are retried instead of failing the entire job. Concurrency continues to be gated by the existing `chat_limiter`. - Batch inserts: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) for the same reason. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-16 15:40:54 +08:00
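
The two ideas in #14143, an environment-driven pool size and exponential-backoff retries, can be sketched generically. The environment variable name and its default of `4` come from the PR description; `pool_max_size` and `retry_with_backoff` are hypothetical helpers, and this sketch sleeps only between attempts (1s, then 2s for three attempts), which may differ from the exact schedule in `set_graph()`.

```python
import os
import time


def pool_max_size(env=os.environ, default=4):
    """Read the pool size from INFINITY_POOL_MAX_SIZE, falling back to `default`."""
    return int(env.get("INFINITY_POOL_MAX_SIZE", default))


def retry_with_backoff(op, attempts=3, base_delay=1.0,
                       retriable=(Exception,), sleep=time.sleep):
    """Call `op()` up to `attempts` times, doubling the delay after each failure.

    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retriable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in practice `retriable` would be narrowed to the transient error type rather than left as `Exception`.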
Daniil Sivak	c93ec0a1f3	Fix: reject empty/space-only content in update_chunk API (#14082 ) Closes #6541 ### What problem does this PR solve? Add content validation to `update_chunk` (SDK and non-SDK) to reject empty or whitespace-only content before it reaches the embedding model. Before: Calling `update_chunk` with space-only content (like `" "`, `""`, `"\n"`) bypassed validation and was sent directly to the embedding model, which returned an error. This was the same bug previously fixed for `add_chunk` in #6390, but `update_chunk` was missed. After: Empty/whitespace-only content is caught by validation and returns an error: `` `content` is required `` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-15 18:43:53 +08:00
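
A minimal sketch of the validation #14082 describes: reject empty or whitespace-only content before it reaches the embedding model. `validate_chunk_content` is a hypothetical name; only the error text is taken from the PR description.

```python
def validate_chunk_content(content):
    """Return `content` unchanged, or raise ValueError if it is missing,
    not a string, or consists only of whitespace."""
    if not isinstance(content, str) or not content.strip():
        raise ValueError("`content` is required")
    return content
```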
Ea001	38cefd88e2	Fix tag_feas code injection in retrieval ranking (#13923 ) ## Summary - remove eval-based parsing from retrieval rank feature scoring - validate `tag_feas` at write time in chunk APIs and SDK routes - add regression tests for safe parsing and malicious payload rejection ## Details `tag_feas` is intended to be structured rank-feature data, but the retrieval ranking path was evaluating stored values as Python expressions. This change treats `tag_feas` strictly as data. ### What changed - replace `eval()` in `rag/nlp/search.py` with safe parsing via `json.loads()` and optional `ast.literal_eval()` compatibility for legacy Python-dict strings - strictly filter parsed values down to `dict[str, finite number]` - reject invalid `tag_feas` payloads at write time in web chunk routes and SDK document chunk routes - add focused regression tests to prove executable strings are ignored and invalid payloads are rejected ## Validation - `python -m pytest test/unit_test/common/test_tag_feature_utils.py test/unit_test/rag/test_rank_feature_scores.py -q` --------- Co-authored-by: unknown <zhenglinkai@CCN.Local> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-15 16:31:11 +08:00
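
The approach outlined in #13923 (parse `tag_feas` as data via `json.loads()`, fall back to `ast.literal_eval()` for legacy dict strings, then keep only string keys with finite numeric values) might look roughly like this. `parse_tag_feas` is a hypothetical name and this is a sketch under those stated assumptions, not the code in `rag/nlp/search.py`.

```python
import ast
import json
import math


def parse_tag_feas(raw):
    """Parse rank-feature data without evaluating it as code.

    Returns a dict[str, float] containing only finite numbers; anything
    that cannot be parsed safely yields an empty dict.
    """
    parsed = raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except ValueError:
            try:
                # literal_eval accepts only Python literals, never calls.
                parsed = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
                return {}
    if not isinstance(parsed, dict):
        return {}
    clean = {}
    for key, value in parsed.items():
        if not isinstance(key, str) or isinstance(value, bool):
            continue
        if not isinstance(value, (int, float)):
            continue
        try:
            number = float(value)
        except OverflowError:
            continue
        if math.isfinite(number):
            clean[key] = number
    return clean
```

An executable-looking payload such as `__import__('os').system(...)` is neither valid JSON nor a Python literal, so it falls through to the empty result instead of being run.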
eason	aa92abe73c	fix: close file handles properly in json.load() calls (#13997 ) ## Summary Fixes #13996 Replace `json.load(open(...))` with `with open(...) as f: json.load(f)` in two files to ensure file descriptors are properly closed. Affected files: - `common/doc_store/infinity_conn_base.py` — schema loading for Infinity doc store - `api/db/init_data.py` — agent template loading at startup ## Why this matters In a long-running server process like RAGFlow, leaked file descriptors from `json.load(open(...))` can accumulate over time. While CPython's refcounting usually cleans these up, it's not guaranteed (especially under memory pressure or with alternative Python runtimes), and can lead to `OSError: [Errno 24] Too many open files`. ## Test plan - [ ] Verify Infinity doc store schema loading still works correctly - [ ] Verify agent templates load correctly on startup <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Refactor * Improved file handling in internal data processing to ensure proper resource cleanup. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: easonysliu <easonysliu@tencent.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 12:16:49 +08:00
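
The replacement pattern from #13997 in isolation, as a small sketch. `load_json` is a hypothetical helper name, and the explicit UTF-8 encoding is an assumption of this example.

```python
import json


def load_json(path):
    """Load a JSON file, closing the file handle even if parsing fails."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Compared with `json.load(open(path))`, the `with` block closes the descriptor deterministically instead of leaving it to garbage collection.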
corevibe555	e7d044413f	Fix: Google Drive connector missing new files after initial sync (#13943 ) Closes https://github.com/infiniflow/ragflow/issues/13939 ## What problem does this PR solve? The Google Drive connector fails to detect new files after the initial sync (#13939). The root cause is that `generate_time_range_filter()` applies a strict `modifiedTime > poll_range_start` cutoff when querying the Google Drive API. Files uploaded to Google Drive that retain their original `modifiedTime` (common behavior) get silently excluded if their timestamp predates the last sync's cutoff. Unlike the Confluence and Jira connectors which use a configurable time buffer (`CONFLUENCE_SYNC_TIME_BUFFER_SECONDS`) to offset `poll_range_start` backward, the Google Drive connector had no such mechanism — resulting in a razor-sharp timestamp boundary with zero tolerance for overlap. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ## Summary * New Features * Added a configurable time buffer for Google Drive synchronization to address timing delays and improve sync reliability. * Improved file detection logic to include recently created files alongside modified ones, reducing missed synchronizations.	2026-04-10 11:39:19 +08:00
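
The time-buffer idea from #13943 can be sketched as shifting the lower bound of the poll window backward before building the query. `buffered_poll_start` and `time_range_filter` below are hypothetical helpers, and the exact query string (including the `createdTime` clause) is an assumption based on the PR summary, not the connector's actual `generate_time_range_filter()` output.

```python
from datetime import datetime, timedelta, timezone


def buffered_poll_start(poll_range_start, buffer_seconds):
    """Move the window start back by a non-negative overlap buffer."""
    return poll_range_start - timedelta(seconds=max(buffer_seconds, 0))


def time_range_filter(poll_range_start, buffer_seconds=0):
    """Build a Drive-style filter matching files modified or created after
    the buffered start time. Expects a timezone-aware datetime."""
    start = buffered_poll_start(poll_range_start, buffer_seconds)
    stamp = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"(modifiedTime > '{stamp}' or createdTime > '{stamp}')"
```

Because the buffered window overlaps the previous one, the same file can be returned twice, so downstream processing needs to tolerate or de-duplicate repeats.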
Octopus	c2ce49e037	fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969 ) Fixes #13823 ## Problem When querying with words like `cat`, RAGFlow's query expansion system looks up synonyms via WordNet, which can return terms containing single quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document store, these unescaped single quotes in the query string cause a `TokenError` because Infinity's lexer treats `'` as a string delimiter. ``` TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531 ``` ## Solution Strip single quotes from synonym terms before they are inserted into query expressions, consistent with how single quotes are already stripped from the input query text (line 51 of `query.py`): - `common/query_base.py`: In `sub_special_char()`, strip `'` before escaping other special characters. This fixes the Chinese text processing path and the `paragraph()` method. - `rag/nlp/query.py`: In the English text path, strip `'` from tokenized synonym terms. - `memory/services/query.py`: Same fix for the memory query English text path. ## Testing The fix can be verified by: 1. Using Infinity as the document store (`DOC_ENGINE=infinity`) 2. Creating a dataset and running a retrieval test with the keyword `cat` 3. Confirming no `TokenError` is raised and results are returned normally ## Summary by CodeRabbit * Bug Fixes * Enhanced special character handling in query processing and synonym expansion by properly sanitizing single quotes before text processing. * Simplified OCR detection output by removing timing metadata while preserving core detection accuracy. --------- Co-authored-by: ximi <octo-patch@github.com>	2026-04-09 19:10:34 +08:00
eviaaaaa	1e83c8c051	Fix: align MCP tool call timeout and handle empty content (#13899 ) ### What problem does this PR solve? Resolves #12105 This PR fixes two MCP tool call issues in `common/mcp_tool_call_conn.py`. First, the timeout passed to `tool_call(..., timeout=...)` was only applied to the outer `future.result(...)` wait, but was not forwarded to the internal MCP request. As a result, callers could pass a longer timeout while the actual MCP request still failed after the default internal timeout. Second, the MCP tool call result handling assumed `result.content[0]` always existed. If an MCP server returned an empty content list, this could raise an exception unexpectedly. This PR fixes both issues by: - forwarding the external `timeout` value to the internal MCP request timeout - returning a clear message when the MCP server returns empty content instead of indexing into an empty list ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe)	2026-04-09 18:44:04 +08:00
Magicbook1108	8d52ef2893	Feat: enable sync deleted files for connector (#14000 ) ### What problem does this PR solve? Enables syncing deleted files for connectors. 1. GitHub is supported first. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Added "sync deleted files" feature for data sources, enabling automatic removal of files deleted from the source system. * Added multilingual support for the new sync deleted files setting across multiple languages. * UI Improvements * Improved checkbox form field rendering and layout. * Enhanced full-width display for authentication token input fields.	2026-04-09 16:40:14 +08:00
Ricardo-M-L	424aee5bec	fix: correct typos in code comments, docstrings and docs (#13931 ) ## Summary - Fix `a image` → `an image` in README and log message - Fix `colomn` → `column` in table structure recognizer comment - Fix `formated` → `formatted` in confluence connector docstring - Fix `tabel of content` → `table of contents` in TOC prompt ## Test plan - [ ] Documentation and comment changes, no functional impact 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-04-07 13:05:39 +08:00
Magicbook1108	69264b3a70	Feat: Refact pipeline (#13826 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 19:26:45 +08:00
NeedmeFordev	6b7989b4b4	Add file type validation (#13802 ) ### What problem does this PR solve? This PR fixes WebDAV sync behavior for unsupported file types ([#13795](https://github.com/infiniflow/ragflow/issues/13795)). Previously, the WebDAV connector selected files primarily by modified time (and size threshold) and could still pass unsupported extensions into the download/document-generation path. This caused unnecessary processing and inconsistent behavior compared with connectors that validate file type earlier. This change adds extension validation in two places: 1. Early filter during recursive listing to skip unsupported files before they enter the download flow. 2. Defensive filter before download/document creation to prevent unsupported files from being processed if any listing edge case slips through. It also wires `allow_images` into the WebDAV sync path so image extension handling follows connector policy. Scope is intentionally limited to WebDAV for a focused bug-fix PR. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### How was this tested? - Manual verification with mixed file types under the configured WebDAV path: - supported: `.pdf`, `.txt`, `.md` - unsupported: `.exe`, `.bin`, `.dat` - Triggered full sync and polling sync. - Confirmed unsupported files are skipped before download. - Confirmed supported files are still indexed normally. - Confirmed image handling follows `allow_images` setting. Fixes: #13795	2026-04-02 14:12:27 +08:00
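
The early extension filter described in #13802 can be sketched as a pure function applied to the listing before any download starts. The extension sets, `is_supported`, and `filter_listing` are assumptions chosen for this illustration; the connector's real supported-type list is not shown in the log.

```python
from pathlib import PurePosixPath

# Illustrative sets only; the real connector policy may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def is_supported(path, allow_images=False):
    """True if the file extension is indexable under the given policy."""
    ext = PurePosixPath(path).suffix.lower()
    if ext in SUPPORTED_EXTENSIONS:
        return True
    return allow_images and ext in IMAGE_EXTENSIONS


def filter_listing(paths, allow_images=False):
    """Drop unsupported files from a listing before they enter the download flow."""
    return [p for p in paths if is_supported(p, allow_images)]
```

Calling the same `is_supported` check again immediately before download gives the defensive second filter the PR mentions without duplicating the policy.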
qinling0210	0462c20113	Fix special characters in matching text of search() (#13852 ) ### What problem does this PR solve? Fix special characters in the matching text of search(). Some special characters (such as ?, *, :) should be escaped before being passed to matching_text of search(). Fix https://github.com/infiniflow/ragflow/issues/13729 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 18:47:10 +08:00
Zhichang Yu	0d85a8e7aa	feat: add dynamic log level adjustment APIs (#13850 ) Add REST APIs to dynamically query and modify log levels at runtime for both Python (Flask) and Go servers. Changes: - common/log_utils.py: add set_log_level() and get_log_levels() functions - admin/server/routes.py: add GET/PUT /api/v1/admin/log_levels endpoints - api/apps/system_app.py: add GET/PUT /api/{version}/system/log_levels endpoints - internal/logger/logger.go: add GetLevel() and SetLevel() with atomic level support - internal/handler/system.go: add GetLogLevel, SetLogLevel, Health handlers - internal/router/router.go: route /health to systemHandler - internal/admin/handler.go: add GetLogLevel, SetLogLevel handlers - internal/admin/router.go: add /api/v1/admin/log_level routes ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-30 18:40:58 +08:00
KeJun	cb78ce0a7b	feat: support rss datasource (#13721 ) ### What problem does this PR solve? Support public RSS/Atom feed URLs as data sources for RAGFlow. Related issue: https://github.com/infiniflow/ragflow/issues/12313 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-27 22:58:44 +08:00
NeedmeFordev	840cc8fbe9	fix(asana): use project memberships endpoint for project IDs in connector (#13746 ) ### What problem does this PR solve? Fixes a bug in the Asana connector where providing `Project IDs` caused sync to fail with: `project_membership: Not a recognized ID: <PROJECT_GID>` Root cause: the connector called `get_project_membership(project_gid)`, but that API expects a project membership gid, not a project gid. This PR switches to the correct project-scoped API and adds regression tests. Fixes: [#13669](https://github.com/infiniflow/ragflow/issues/13669) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Changes made - Updated `common/data_source/asana_connector.py`: - Replaced `get_project_membership(pid, ...)` with `get_project_memberships_for_project(pid, ...)` - Trimmed and filtered `asana_project_ids` parsing to avoid empty/whitespace IDs - Normalized `asana_team_id` by trimming whitespace - Used safer access for membership email extraction (`m.get("user")`) - Added `test/unit_test/common/test_asana_connector.py`: - Verifies the correct project-membership API method is called - Verifies empty `project_ids` path returns workspace emails - Verifies project/team input normalization behavior ### Compatibility / risk - Non-breaking bug fix - No API contract changes - Existing behavior for empty `Project IDs` remains unchanged	2026-03-24 20:21:31 +08:00
Yongteng Lei	dd839f30e8	Fix: code supports matplotlib (#13724 ) ### What problem does this PR solve? Code as "final" node: ![img_v3_02vs_aece4caf-8403-4939-9e68-9845a22c2cfg](https://github.com/user-attachments/assets/9d87b8df-da6b-401c-bf6d-8b807fe92c22) Code as "mid" node: ![img_v3_02vv_f74f331f-d755-44ab-a18c-96fff8cbd34g](https://github.com/user-attachments/assets/c94ef3f9-2a6c-47cb-9d2b-19703d2752e4) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-20 20:32:00 +08:00
NeedmeFordev	c3f79dbcb0	fix(jira): prevent missed incremental updates after issue edits (#13674 ) ### What problem does this PR solve? Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira incremental sync could miss updated issues after initial sync, especially near time boundaries. Root cause: - Jira JQL uses minute-level precision for `updated` filters. - Incremental windows had no overlap buffer, so boundary updates could be skipped. - Sync log cursor tracking used a backward-facing update for `poll_range_start`. - Existing-doc updates in `upload_document` lacked a KB ownership guard for doc-id collisions. What changed: - Added Jira incremental overlap buffer (`time_buffer_seconds`, defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL lower-bound time. - Preserved second-level post-filtering to avoid duplicate reprocessing while still catching boundary updates. - Improved Jira sync logging to include start/end window and overlap configuration. - Updated sync cursor tracking in `increase_docs` to keep `poll_range_start` moving forward with max update time. - Added KB ID safety check before updating existing document records in `upload_document`. Verification performed: - Python syntax compile checks passed for modified files. - Manual verification flow: 1. Run full Jira sync. 2. Edit an already-indexed Jira issue. 3. Run next incremental sync. 4. Confirm updated content is re-ingested into KB. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 23:31:05 +08:00
NeedmeFordev	387b0b27c4	feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527 ) ### What problem does this PR solve? This PR adds support for parsing PDFs through an external Docling server, so RAGFlow can connect to remote `docling serve` deployments instead of relying only on local in-process Docling. It addresses the feature request in [#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns with the external-server usage pattern already used by MinerU. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What is changed? - Add external Docling server support in `DoclingParser`: - Use `DOCLING_SERVER_URL` to enable remote parsing mode. - Try `POST /v1/convert/source` first, and fallback to `/v1alpha/convert/source`. - Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not set. - Wire Docling env settings into parser invocation paths: - `rag/app/naive.py` - `rag/flow/parser/parser.py` - Add Docling env hints in constants and update docs: - `docs/guides/dataset/select_pdf_parser.md` - `docs/guides/agent/agent_component_reference/parser.md` - `docs/faq.mdx` ### Why this approach? This keeps the change focused on one issue and one capability (external Docling connectivity), without introducing unrelated provider-model plumbing. 
### Validation - Static checks: - `python -m py_compile` on changed Python files - `python -m ruff check` on changed Python files - Functional checks: - Remote v1 endpoint path works - v1alpha fallback works - Local Docling path remains available when server URL is unset ### Related links - Feature request: [Support external Docling server (issue #13426)](https://github.com/infiniflow/ragflow/issues/13426) - Compare view for this branch: [main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1) ##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)	2026-03-12 17:09:03 +08:00
Yongteng Lei	e1b632a7bb	Feat: add delete all support for delete operations (#13530 ) ### What problem does this PR solve? Add delete all support for delete operations. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-12 09:47:42 +08:00
Magicbook1108	675810e0cf	Refact: optimize confluence performance (#13497 ) ### What problem does this PR solve? Refact: optimize confluence performance #13494 ### Type of change - [x] Refactoring	2026-03-10 15:02:24 +08:00
Yongteng Lei	7484298c82	Refa: convert download_img to async (#13477 ) ### What problem does this PR solve? Convert download_img to async. ### Type of change - [x] Refactoring - [x] Performance Improvement	2026-03-09 19:00:17 +08:00
Heyang Wang	c217b8f3d8	Feat: add DingTalk AI Table connector and integration for data synch… (#13413 ) ### What problem does this PR solve? Add DingTalk AI Table connector and integration for data synchronization Issue #13400 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: wangheyang <wangheyang@corp.netease.com>	2026-03-06 21:13:23 +08:00
tunsuy	020068dd16	Fix: preserve field boundaries in chunked documents from MySQL… (#13369 ) ### What problem does this PR solve? When multiple columns are used as content columns in RDBMS connector, the generated document text gets chunked by TxtParser which strips newline delimiters during merge. This causes field names and values from different columns to be concatenated without any separator, making the content unreadable. Changes: - txt_parser.py: restore newline separator when merging adjacent text segments within a chunk, so that split sections are not directly concatenated - rdbms_connector.py: use double newline between fields and place field value on a new line after the field name bracket, giving TxtParser clearer boundaries to work with Closes #13001 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tunsuytang <tunsuytang@tencent.com>	2026-03-04 21:42:02 +08:00
Jin Hai	b9ad014f63	Support login across multiple RAGFlow servers (#13322 ) ### What problem does this PR solve? 1. Use Redis to store the secret key. 2. During startup, the API server reads the secret key from Redis. If no secret key exists, it generates one and stores it in Redis atomically. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-04 13:07:45 +08:00
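A minimal sketch of the "read the secret, or create it atomically" step, assuming a conditional write with the semantics of Redis `SET ... NX`. The `InMemoryStore` class is a stand-in so the example runs without a server, and `load_or_create_secret` and the key name are hypothetical, not names from the PR.

```python
import secrets


class InMemoryStore:
    """Stand-in for a Redis client, exposing only get() and set(nx=True)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, nx=False):
        # Mirror SET ... NX semantics: write only if the key is absent.
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True


def load_or_create_secret(store, key="hypothetical:secret_key"):
    """Return the shared secret, creating it if no server has done so yet."""
    candidate = secrets.token_hex(32)
    # Only one caller wins the conditional write; every caller then reads
    # back whatever value ended up stored.
    store.set(key, candidate, nx=True)
    return store.get(key)


store = InMemoryStore()
first = load_or_create_secret(store)   # simulates the first server starting up
second = load_or_create_secret(store)  # simulates a second server starting later
print(first == second)  # True
```

Writing conditionally and then reading back avoids the check-then-set race in which two servers both see a missing key and each stores a different secret.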
Ahmad Intisar	184388879d	feat: Add `disable_password_login` configuration to support SSO-only authentication (#13151 ) ### What problem does this PR solve? Enterprise deployments that use an external Identity Provider (e.g., Microsoft Entra ID, Okta, Keycloak) need the ability to enforce SSO-only authentication by hiding the email/password login form. Currently, the login page always shows the password form alongside OAuth buttons, with no way to disable it. This PR adds a `disable_password_login` configuration option under the existing `authentication` section in `service_conf.yaml`. When set to `true`, the login page only displays configured OAuth/SSO buttons and hides the email/password form, "Remember me" checkbox, and "Sign up" link. The flag can be set via: - `service_conf.yaml` (`authentication.disable_password_login: true`) - Environment variable (`DISABLE_PASSWORD_LOGIN=true`) Default behavior is unchanged (`false`). ### Behavior \| `disable_password_login` \| OAuth configured \| Result \| \|---\|---\|---\| \| `false` (default) \| No \| Standard email/password form \| \| `false` \| Yes \| Email/password form + SSO buttons below \| \| `true` \| Yes \| SSO buttons only (no form, no sign up link) \| \| `true` \| No \| Empty card (admin should configure OAuth first) \| ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Files changed (5) 1. `docker/service_conf.yaml.template` — added `disable_password_login: false` under authentication 2. `common/settings.py` — added `DISABLE_PASSWORD_LOGIN` global variable and loader in `init_settings()` 3. `common/config_utils.py` — fixed `TypeError` in `show_configs()` when authentication section contains non-dict values (e.g., booleans) 4. `api/apps/system_app.py` — exposed `disablePasswordLogin` flag in `/config` endpoint 5. 
`web/src/pages/login/index.tsx` — conditionally render password form based on config flag; OAuth buttons always render when channels exist --------- Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-03-02 14:06:03 +08:00
Attili-sys	21bc1ab7ec	Feature rtl support (#13118 ) ### What problem does this PR solve? This PR adds comprehensive Right-to-Left (RTL) language support, primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu, etc.). Previously, RTL content had multiple rendering issues: - Incorrect sentence splitting for Arabic punctuation in citation logic - Misaligned text in chat messages and markdown components - Improper positioning of blockquotes and “think” sections - Incorrect table alignment - Citation placement ambiguity in RTL prompts - UI layout inconsistencies when mixing LTR and RTL text This PR introduces backend and frontend improvements to properly detect, render, and style RTL content while preserving existing LTR behavior. #### Backend - Updated sentence boundary regex in `rag/nlp/search.py` to include Arabic punctuation: - `،` (comma) - `؛` (semicolon) - `؟` (question mark) - `۔` (Arabic full stop) - Ensures citation insertion works correctly in RTL sentences. - Updated citation prompt instructions to clarify citation placement rules for RTL languages. #### Frontend - Introduced a new utility: `text-direction.ts` - Detects text direction based on Unicode ranges. - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts. - Provides `getDirAttribute()` for automatic `dir` assignment. - Applied dynamic `dir` attributes across: - Markdown rendering - Chat messages - Search results - Tables - Hover cards and reference popovers - Added proper RTL styling in LESS: - Text alignment adjustments - Blockquote border flipping - Section indentation correction - Table direction switching - Use of `<bdi>` for figure labels to prevent bidirectional conflicts #### DevOps / Environment - Added Windows backend launch script with retry handling. - Updated dependency metadata. - Adjusted development-only React debugging behavior. 
--- ### Type of change - [x] Bug Fix (non-breaking change which fixes RTL rendering and citation issues) - [x] New Feature (non-breaking change which adds RTL detection and dynamic direction handling) --------- Co-authored-by: 6ba3i <isbaaoui09@gmail.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:03:44 +08:00
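To illustrate the backend change, the sketch below splits text on sentence-ending punctuation that includes the four Arabic marks listed in the commit. The pattern and the `split_sentences` helper are assumptions made for illustration; they are not the actual regex in `rag/nlp/search.py`.

```python
import re

# Latin sentence punctuation plus the Arabic marks listed in the commit:
# U+060C (comma), U+061B (semicolon), U+061F (question mark), U+06D4 (full stop).
SENTENCE_END = re.compile("([.!?;\u060c\u061b\u061f\u06d4])")


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    # With one capturing group, re.split alternates text and delimiter.
    parts = SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        piece = (parts[i] + parts[i + 1]).strip()
        if piece:
            sentences.append(piece)
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "First clause\u060c second clause\u061f Trailing text"
for sentence in split_sentences(sample):
    print(repr(sentence))
```

Without the Arabic marks in the character class, the whole sample would come back as one sentence, which is the situation the commit describes for citation insertion.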
Yesid Cano Castro	d1afcc9e71	feat(seafile): add library and directory sync scope support (#13153 ) ### What problem does this PR solve? The SeaFile connector currently synchronises the entire account: every library visible to the authenticated user. This is impractical for users who only need a subset of their data indexed, especially on large SeaFile instances with many shared libraries. This PR introduces granular sync scope support, allowing users to choose between syncing their entire account, a single library, or a specific directory within a library. It also adds support for SeaFile library-scoped API tokens (`/api/v2.1/via-repo-token/` endpoints), enabling tighter access control without exposing account-level credentials. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Test ``` from seafile_connector import SeaFileConnector import logging import os logging.basicConfig(level=logging.DEBUG) URL = os.environ.get("SEAFILE_URL", "https://seafile.example.com") TOKEN = os.environ.get("SEAFILE_TOKEN", "") REPO_ID = os.environ.get("SEAFILE_REPO_ID", "") SYNC_PATH = os.environ.get("SEAFILE_SYNC_PATH", "/Documents") REPO_TOKEN = os.environ.get("SEAFILE_REPO_TOKEN", "") def _test_scope(scope, repo_id=None, sync_path=None): print(f"\n{'='*50}") print(f"Testing scope: {scope}") print(f"{'='*50}") creds = {"seafile_token": TOKEN} if TOKEN else {} if REPO_TOKEN and scope in ("library", "directory"): creds["repo_token"] = REPO_TOKEN connector = SeaFileConnector( seafile_url=URL, batch_size=5, sync_scope=scope, include_shared=False, repo_id=repo_id, sync_path=sync_path, ) connector.load_credentials(creds) connector.validate_connector_settings() count = 0 for batch in connector.load_from_state(): for doc in batch: count += 1 print(f" [{count}] {doc.semantic_identifier} " 
f"({doc.size_bytes} bytes, {doc.extension})") print(f"\n-> {scope} scope: {count} document(s) found.\n") # 1. Account scope if TOKEN: _test_scope("account") else: print("\nSkipping account scope (set SEAFILE_TOKEN)") # 2. Library scope if REPO_ID and (TOKEN or REPO_TOKEN): _test_scope("library", repo_id=REPO_ID) else: print("\nSkipping library scope (set SEAFILE_REPO_ID + token)") # 3. Directory scope if REPO_ID and SYNC_PATH and (TOKEN or REPO_TOKEN): _test_scope("directory", repo_id=REPO_ID, sync_path=SYNC_PATH) else: print("\nSkipping directory scope (set SEAFILE_REPO_ID + SEAFILE_SYNC_PATH + token)") ```	2026-02-28 10:24:28 +08:00
He Wang	394ff16b66	fix: OceanBase metadata not returned in document list API (#13209 ) ### What problem does this PR solve? Fix #13144. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-25 15:29:17 +08:00
Phives	4ceb668d40	feat(api/utils): Harden file_utils for robustness and edge cases (#12915 ) ## Summary Improves robustness and edge-case handling in `api.utils.file_utils` to avoid crashes, DoS/OOM risks, and timeouts when processing user-provided filenames, paths, and file blobs. ## Changes ### Resource limits & timeouts - `MAX_BLOB_SIZE_THUMBNAIL` (50 MiB) and `MAX_BLOB_SIZE_PDF` (100 MiB) to reject oversized inputs before thumbnail/PDF processing. - `GHOSTSCRIPT_TIMEOUT_SEC` (120 s) for `repair_pdf_with_ghostscript` subprocess to avoid hangs on malicious or broken PDFs. ### `filename_type` - Handles `None`, empty string, non-string (e.g. int/list), and path-only input via new `_normalize_filename_for_type()`. - Uses basename for type detection (e.g. `a/b/c.pdf` → PDF). - Enforces `FILE_NAME_LEN_LIMIT`; invalid input returns `FileType.OTHER`. ### `thumbnail_img` - Rejects `None`/empty/oversized blob and invalid filename; returns `None` instead of raising. - Wraps PDF, image, and PPT handling in try/except so corrupt or malformed files return `None`. - Ensures PDF has pages and PPT has slides before use. - Normalizes PIL image mode (RGBA/P/LA → RGB) for safe PNG export. ### `repair_pdf_with_ghostscript` - Handles `None`/empty input; skips repair when input size exceeds limit. - Uses `subprocess.run(..., timeout=GHOSTSCRIPT_TIMEOUT_SEC)` and catches `TimeoutExpired`. - Returns original bytes when Ghostscript output is empty. ### `read_potential_broken_pdf` - `None` → `b""`; non–sequence-like (no `len`) → `b""`; empty → return as-is. - Oversized blob returned as-is (no repair) to avoid DoS. ### `sanitize_path` - Explicit `None` and non-string check; strips whitespace before normalizing. ## Testing - `test/unit_test/utils/test_api_file_utils.py` added with 36 unit tests covering the above behavior (filename_type, sanitize_path, read_potential_broken_pdf, thumbnail_img, thumbnail, repair_pdf_with_ghostscript, constants). - All tests pass. 
--------- Co-authored-by: Gittensor Miner <miner@gittensor.io>	2026-02-25 14:34:47 +08:00
Ahmad Intisar	99d1c9725c	Bug mysql connector empty content resolved: Semantic ID Issue (#13206 ) The RDBMS (MySQL/PostgreSQL) connector generates document filenames using the first 100 characters of the content column (semantic_identifier). When the content contains newline characters (\n), the resulting filename includes those newlines, for example: Category: غير صحيح كليًا\nTitle: تفنيد حقائق....txt RAGFlow's filename_type() function uses re.match(r".*\.txt$", filename) to detect file types, but . does not match newline characters by default in Python regex. This causes the regex to fail, returning FileType.OTHER, which triggers: raise RuntimeError("This type of file has not been supported yet!") As a result, all documents synced via the MySQL/PostgreSQL connector are silently discarded. The sync logs report success (e.g., "399 docs synchronized"), but zero documents actually appear in the dataset. This is the root cause of issue #13001. Root cause trace: rdbms_connector.py → _row_to_document() sets semantic_identifier from raw content (may contain \n) connector_service.py → duplicate_and_parse() uses semantic_identifier as the filename file_service.py → upload_document() calls filename_type(filename) file_utils.py → filename_type() regex .*\.txt$ fails on newlines → returns FileType.OTHER upload_document() raises "This type of file has not been supported yet!" Fix: Sanitize the semantic_identifier in _row_to_document() by replacing newlines and carriage returns with spaces before truncating to 100 characters. Relates to: #13001, #12817 Type of change Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-02-25 12:55:04 +08:00
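The regex behaviour behind this bug is easy to reproduce. In the sketch below, `sanitize_identifier` is a hypothetical helper that mirrors the described fix (replace newlines and carriage returns with spaces, then truncate to 100 characters); it is not the code from `rdbms_connector.py`, and the sample text is made up.

```python
import re

TXT_PATTERN = r".*\.txt$"

raw_identifier = "Category: A\nTitle: B"

# "." does not match "\n" unless re.DOTALL is set, so ".*" cannot cross the
# newline and the match anchored at the start of the string fails.
print(bool(re.match(TXT_PATTERN, raw_identifier + ".txt")))  # False


def sanitize_identifier(text, limit=100):
    """Replace line breaks with spaces, then truncate (hypothetical helper)."""
    return text.replace("\r", " ").replace("\n", " ")[:limit]


clean_name = sanitize_identifier(raw_identifier) + ".txt"
print(clean_name)                                # Category: A Title: B.txt
print(bool(re.match(TXT_PATTERN, clean_name)))   # True
```

Sanitizing at the source keeps every downstream filename check working unchanged, whereas adding `re.DOTALL` to the type check would still leave newlines inside stored filenames.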
ksufer	5a8fa7cf31	Fix #13119 : Use email.utils to fix IMAP parsing for names with commas (#13120 ) ## Type of Change - [x] Bug fix ## Description Closes #13119 The current IMAP connector uses `split(',')` to parse email headers, which crashes when a sender's display name contains a comma inside quotes (e.g., `"Doe, John" <john@example.com>`). This PR replaces the manual string splitting with Python's standard `email.utils.getaddresses`. This correctly handles RFC 5322 quoted strings and prevents the `RuntimeError: Expected a singular address`. ## Checklist - [x] I have checked the code and it works as expected. --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-02-24 19:18:55 +08:00
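The difference between the two parsing approaches can be seen with the example header from the commit message, extended here with a second, made-up recipient. This is a standalone demonstration using the standard library, not the connector's code.

```python
from email.utils import getaddresses

header = '"Doe, John" <john@example.com>, jane@example.com'

# Naive splitting cuts the quoted display name in half.
naive = [part.strip() for part in header.split(",")]
print(len(naive))  # 3

# getaddresses respects quoted strings and returns (name, address) pairs.
parsed = getaddresses([header])
print(parsed)  # [('Doe, John', 'john@example.com'), ('', 'jane@example.com')]
```

`getaddresses` takes a list of header values, so several `To` or `Cc` headers can be passed in one call.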
Ahmad Intisar	5885f150ab	fix: register WebDAVConnector in data_source __init__.py (#13121 ) What problem does this PR solve? The sync_data_source.py module imports WebDAVConnector from common.data_source, but WebDAVConnector was never registered in the package's __init__.py. This causes an ImportError at startup, crashing the data sync service: ImportError: cannot import name 'WebDAVConnector' from 'common.data_source' The webdav_connector.py file already exists in the common/data_source/ directory — it just wasn't exported. This PR adds the import and registers it in __all__. Type of change Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-02-12 16:05:58 +08:00
Magicbook1108	e89fd686e2	Improve: optimize file name (with path) in box container. (#13124 ) ### What problem does this PR solve? Refact: optimize file name (with path) in box container. ### Type of change - [x] Performance Improvement	2026-02-12 15:40:55 +08:00
Lynn	30d5fc1a07	Refactor: split memory API into gateway and service layers (#13111 ) ### What problem does this PR solve? Decouple the memory API into a gateway layer (for routing/param parse) and a service layer (for business logic). ### Type of change - [x] Refactoring	2026-02-12 10:11:50 +08:00
Lynn	d938b47877	Fix: judge table name prefix before migrate (#13094 ) ### What problem does this PR solve? Judge table created with current infinity mapping before migrate db. #13089 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-10 17:05:34 +08:00
MkDev11	13a6545e48	fix(rdbms): use brackets around field names to preserve distinction after chunking (#13010 ) Fix RDBMS field separation after chunking by wrapping field names in brackets (【field】: value). This ensures fields remain distinguishable even when TxtParser strips newline delimiters during chunk merging. Closes #13001 Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>	2026-02-06 14:44:58 +08:00
Clint-chan	a68c56def7	fix: ensure all metadata filters are processed in AND logic (#13019 ) ### What problem does this PR solve? Bug: When a filter key doesn't exist in metas or has no matching values, the filter was skipped entirely, causing AND logic to fail. Example: - Filter 1: meeting_series = '宏观早8点' (matches doc1, doc2, doc3) - Filter 2: date = '2026-03-05' (no matches) - Expected: [] (AND should return empty) - Actual: [doc1, doc2, doc3] (Filter 2 was skipped) Root cause: Old logic iterated metas.items() first, then filters. If a filter's key wasn't in metas, it was never processed. Fix: Iterate filters first, then look up in metas. If key not found, treat as no match (empty result), which correctly applies AND logic. Changes: - Changed loop order from 'for k in metas: for f in filters' to 'for f in filters: if f.key in metas' - Explicitly handle missing keys as empty results ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Clint-chan <Clint-chan@users.noreply.github.com>	2026-02-06 12:57:27 +08:00
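The loop-order fix can be sketched as below, using the example from the commit message. The `metas` layout (`{key: {value: [doc_ids]}}`), the `(key, value)` filter tuples, and the function name are simplifying assumptions for illustration; the real structures in RAGFlow may differ.

```python
def filter_doc_ids(metas, filters):
    """AND together equality filters over metas = {key: {value: [doc_id, ...]}}."""
    result = None
    # Iterate the filters first, so a filter on a missing key is still applied.
    for key, wanted in filters:
        # A missing key or value yields an empty match set instead of a skip.
        matched = set(metas.get(key, {}).get(wanted, []))
        result = matched if result is None else result & matched
        if not result:
            return []  # one empty filter empties the whole conjunction
    return sorted(result or [])


metas = {"meeting_series": {"宏观早8点": ["doc1", "doc2", "doc3"]}}

print(filter_doc_ids(metas, [("meeting_series", "宏观早8点")]))
# ['doc1', 'doc2', 'doc3']
print(filter_doc_ids(metas, [("meeting_series", "宏观早8点"), ("date", "2026-03-05")]))
# []
```

Iterating `metas` first, as the old logic did, never visits the `date` filter because `date` is not a key in `metas`, which is why the second query used to return all three documents.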
Clint-chan	90b726c988	fix: support date comparison operators (>=, <=, >, <) in metadata filtering (#12982 ) ## Description This PR fixes the issue where date metadata conditions with comparison operators (`>=`, `<=`, `>`, `<`) did not work correctly in the `/api/v1/retrieval` endpoint. ## Problem When using metadata conditions like: ```json { "metadata_condition": { "conditions": [ { "name": "date", "comparison_operator": ">=", "value": "2027-01-13" } ] } } ``` The filtering did not work as expected because: 1. Operators >= and <= were not mapped to internal symbols ≥ and ≤ 2. Date strings like "2027-01-13" failed to parse with ast.literal_eval() 3. Non-standard date formats were incorrectly compared as strings Solution Changes in common/metadata_utils.py: 1. Added operator mapping in convert_conditions(): - >= → ≥ - <= → ≤ - != → ≠ 2. Implemented strict date format detection in meta_filter(): - Only processes dates in YYYY-MM-DD format (10 characters, properly formatted) - When query value is a date, only matches data in the same standard format - Non-standard formats (e.g., "2026年1月13日", "2026-1-22") are skipped 3. 
Maintained backward compatibility: - Numeric comparisons still work - String comparisons still work - Only affects date-formatted queries Testing All test cases pass (8/8): - ✅ Date >= comparison - ✅ Date > comparison - ✅ Date < comparison - ✅ Date <= comparison - ✅ Date = comparison - ✅ Date range queries - ✅ Non-date string comparison (backward compatibility) - ✅ Numeric comparison (backward compatibility) Example Usage { "dataset_ids": ["xxx"], "question": "test", "metadata_condition": { "conditions": [ { "name": "date", "comparison_operator": ">=", "value": "2027-01-13" } ] } } Notes - Only supports standard YYYY-MM-DD format - Non-standard date formats in data are treated as data quality issues and will not match - Users should ensure their date metadata is in the correct format --------- Co-authored-by: Clint-chan <Clint-chan@users.noreply.github.com>	2026-02-05 13:52:51 +08:00
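A rough sketch of the two ideas in this commit: mapping ASCII operators to the internal symbols, and comparing only values that are strictly in YYYY-MM-DD form. The function names and the `OPS` table are assumptions for illustration and do not reproduce `convert_conditions()` or `meta_filter()`.

```python
import operator
import re
from datetime import date

# ASCII operators mapped to the internal symbols named in the commit.
ALIASES = {">=": "≥", "<=": "≤", "!=": "≠"}
OPS = {
    "≥": operator.ge,
    "≤": operator.le,
    "≠": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}
STRICT_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_strict_date(value):
    """Return a date for a well-formed YYYY-MM-DD string, else None."""
    if not isinstance(value, str) or not STRICT_DATE.fullmatch(value):
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:  # right shape but not a real date, e.g. "2027-13-40"
        return None


def date_matches(stored, op, query):
    """Compare two date strings; non-standard formats never match."""
    symbol = ALIASES.get(op, op)
    stored_date = parse_strict_date(stored)
    query_date = parse_strict_date(query)
    if stored_date is None or query_date is None:
        return False
    return OPS[symbol](stored_date, query_date)


print(date_matches("2027-02-01", ">=", "2027-01-13"))  # True
print(date_matches("2026-1-22", ">=", "2026-01-01"))   # False
```

Comparing parsed `date` objects rather than raw strings is what makes the ordering operators safe; skipping values that fail the strict check matches the note that non-standard formats are treated as data quality issues.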
Magicbook1108	1349e6b7d1	Fix: addressing style without a default value (#13009 ) ### What problem does this PR solve? Fix: addressing style without a default value #12396 #11510 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-05 13:52:23 +08:00
MkDev11	6f31c5fed2	feat/add MySQL and PostgreSQL data source connectors (#12817 ) ### What problem does this PR solve? This PR adds MySQL and PostgreSQL as data source connectors, allowing users to import data directly from relational databases into RAGFlow for RAG workflows. Many users store their knowledge in databases (product catalogs, documentation, FAQs, etc.) and currently have no way to sync this data into RAGFlow without exporting to files first. This feature lets them connect directly to their databases, run SQL queries, and automatically create documents from the results. Closes #763 Closes #11560 ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What this PR does New capabilities: - Connect to MySQL and PostgreSQL databases - Run custom SQL queries to extract data - Map database columns to document content (vectorized) and metadata (searchable) - Support incremental sync using a timestamp column - Full frontend UI with connection form and tooltips Files changed: Backend: - `common/constants.py` - Added MYSQL/POSTGRESQL to FileSource enum - `common/data_source/config.py` - Added to DocumentSource enum - `common/data_source/rdbms_connector.py` - New connector (368 lines) - `common/data_source/__init__.py` - Exported the connector - `rag/svr/sync_data_source.py` - Added MySQL and PostgreSQL sync classes - `pyproject.toml` - Added mysql-connector-python dependency Frontend: - `web/src/pages/user-setting/data-source/constant/index.tsx` - Form fields - `web/src/locales/en.ts` - English translations - `web/src/assets/svg/data-source/mysql.svg` - MySQL icon - `web/src/assets/svg/data-source/postgresql.svg` - PostgreSQL icon ### Testing done Tested with MySQL 8.0 and PostgreSQL 16: - Connection validation works correctly - Full sync imports all query results as 
documents - Incremental sync only fetches rows updated since last sync - Custom SQL queries filter data as expected - Invalid credentials show clear error messages - Lint checks pass (`ruff check` returns no errors) --------- Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>	2026-02-04 10:14:32 +08:00
He Wang	ff7afcbe5f	feat: add OceanBase memory store (#12955 ) ### What problem does this PR solve? Add OceanBase memory store and extract the base class `OBConnectionBase`. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-03 16:46:17 +08:00
Yesid Cano Castro	deeae8dba4	feat(connector): add Seafile as data source (#12945 ) ### What problem does this PR solve? This PR adds Seafile as a new data source connector for RAGFlow. [Seafile](https://www.seafile.com/) is an open-source, self-hosted file sync and share platform widely used by enterprises, universities, and organizations that require data sovereignty and privacy. Users who store documents in Seafile currently have no way to index and search their content through RAGFlow. This connector enables RAGFlow users to: - Connect to self-hosted Seafile servers via API token - Index documents from personal and shared libraries - Support incremental polling for updated files - Seamlessly integrate Seafile-stored documents into their RAG pipelines ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Changes included - `SeaFileConnector` implementing `LoadConnector` and `PollConnector` interfaces - Support for API token - Recursive file traversal across libraries - Time-based filtering for incremental updates - Seafile logo (sourced from Simple Icons, CC0) - Connector configuration and registration ### Testing - Tested against self-hosted Seafile Community Edition - Verified authentication (token) - Verified document ingestion from personal and shared libraries - Verified incremental polling with time filters	2026-02-03 13:42:05 +08:00
Philipp Heyken Soares	ad06c042c4	Support operator constraints in semi-automatic metadata filtering (#12956 ) ### What problem does this PR solve? #### Summary This PR enhances the Semi-automatic metadata filtering mode by allowing users to explicitly pre-define operators (e.g., contains, =, >, etc.) for selected metadata keys. While the LLM still dynamically extracts the filter value from the user's query, it is now strictly constrained to use the operator specified in the UI configuration. Using this feature is optional. By default the operator selection is set to "automatic" resulting in the LLM choosing the operator (as presently). #### Rationale & Use Case This enhancement was driven by a concrete challenge I encountered while working with technical documentation. In my specific use case, I was trying to filter for software versions within a technical manual. In this dataset, a single document chunk often applies to multiple software versions. These versions are stored as a combined string within the metadata for each chunk. When using the standard semi-automatic filter, the LLM would inconsistently choose between the contains and equals operators. When it chose equals, it would exclude every chunk that applied to more than one version, even if the version I was searching for was clearly included in that metadata string. This led to incomplete and frustrating retrieval results. By extending the semi-automatic filter to allow pre-defining the operator for a specific key, I was able to force the use of contains for the version field. This change immediately led to significantly improved and more reliable results in my case. I believe this functionality will be equally useful for others dealing with "tagged" or multi-value metadata where the relationship between the query and the field is known, but the specific value needs to remain dynamic. 
#### Key Changes ##### Backend & Core Logic - `common/metadata_utils.py`: Updated apply_meta_data_filter to support a mixed data structure for semi_auto (handling both legacy string arrays and the new object-based format {"key": "...", "op": "..."}). - `rag/prompts/generator.py`: Extended gen_meta_filter to accept and pass operator constraints to the LLM. - `rag/prompts/meta_filter.md`: Updated the system prompt to instruct the LLM to strictly respect provided operator constraints. ##### Frontend - `web/src/components/metadata-filter/metadata-semi-auto-fields.tsx`: Enhanced the UI to include an operator dropdown for each selected metadata key, utilizing existing operator constants. - `web/src/components/metadata-filter/index.tsx`: Updated the validation schema to accommodate the new state structure. #### Test Plan - Backward Compatibility: Verified that existing semi-auto filters stored as simple strings still function correctly. - Prompt Verification: Confirmed that constraints are correctly rendered in the LLM system prompt when specified. - Added unit tests as `test/unit_test/common/test_apply_semi_auto_meta_data_filter.py` - Manual End-to-End: - Configured a "Semi-automatic" filter for a "Version" key with the "contains" operator. - Asked a version-specific query. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>	2026-02-03 11:11:34 +08:00
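The mixed `semi_auto` structure described above (legacy strings alongside `{"key": ..., "op": ...}` objects) could be normalized along these lines. `normalize_semi_auto` is a hypothetical helper written for illustration; it is not taken from `apply_meta_data_filter`.

```python
def normalize_semi_auto(entries):
    """Return (key, op) pairs; op is None when the LLM picks the operator."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            # Legacy format: a bare key with no operator constraint.
            normalized.append((entry, None))
        elif isinstance(entry, dict) and "key" in entry:
            # New format: the operator is optional and may be absent.
            normalized.append((entry["key"], entry.get("op")))
    return normalized


config = ["author", {"key": "version", "op": "contains"}, {"key": "year"}]
print(normalize_semi_auto(config))
# [('author', None), ('version', 'contains'), ('year', None)]
```

Normalizing once at the boundary lets the rest of the filtering code handle a single shape while older stored configurations keep working.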


168 Commits