ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-25 10:26:59 +08:00

Author	SHA1	Message	Date
Octopus	5c9124c3ef	fix: prepend bucket prefix in Azure Blob (SAS/SPN) to prevent cross-dataset file overwrites (#14174 ) Fixes #14159 ## Problem The `put()`, `get()`, `rm()`, and `obj_exist()` methods in both `azure_spn_conn.py` and `azure_sas_conn.py` ignore the `bucket` parameter entirely, storing all files flat using only the filename. This causes files from different datasets to overwrite each other when they share the same filename. By contrast, the MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix, creating logical folder isolation like `{kb_id}/{filename}`. ## Solution Prepend the `bucket` parameter as a path prefix to all file operations in both Azure storage implementations: - `azure_spn_conn.py`: `create_file`, `delete_file`, `get_file_client` now use `f"{bucket}/{fnm}"` - `azure_sas_conn.py`: `upload_blob`, `delete_blob`, `download_blob`, `get_blob_client` now use `f"{bucket}/{fnm}"` This matches the behavior of all other storage backends (MinIO, S3) and prevents filename collisions across knowledge bases. ## Testing - Verified the fix aligns with how MinIO/S3 connectors handle the bucket parameter - The `health()` method is left unchanged as it uses a fixed test path for connectivity checks only Co-authored-by: octo-patch <octo-patch@github.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:13:43 +08:00
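The collision described in this commit, and the effect of the `f"{bucket}/{fnm}"` prefix, can be illustrated with a small in-memory sketch. `InMemoryBlobStore` and `blob_key` are hypothetical names used only for illustration; they are not the Azure SDK or the RAGFlow connector classes.

```python
from typing import Dict, Optional


def blob_key(bucket: str, fnm: str) -> str:
    """Prefix the filename with the bucket (typically the knowledge base ID)."""
    return f"{bucket}/{fnm}"


class InMemoryBlobStore:
    """Hypothetical stand-in for a single Azure container (illustration only)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, bucket: str, fnm: str, binary: bytes) -> None:
        self._blobs[blob_key(bucket, fnm)] = binary

    def get(self, bucket: str, fnm: str) -> Optional[bytes]:
        return self._blobs.get(blob_key(bucket, fnm))

    def obj_exist(self, bucket: str, fnm: str) -> bool:
        return blob_key(bucket, fnm) in self._blobs

    def rm(self, bucket: str, fnm: str) -> None:
        self._blobs.pop(blob_key(bucket, fnm), None)


store = InMemoryBlobStore()
store.put("kb_a", "report.pdf", b"dataset A")
store.put("kb_b", "report.pdf", b"dataset B")  # same filename, different dataset
```

With the prefix in place, both uploads survive side by side; keyed on the filename alone, the second `put` would have replaced the first.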
Wang Qi	f45ce00347	Not allow to sort by id (#14526 ) ### What problem does this PR solve? The id field is "text", not "keyword", so ordering by it causes an error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 14:52:43 +08:00
orbisai0security	e992fe39b2	fix: the oceanbase database connector constructs sql... in ob_conn.py (#14470 ) ## Summary Fix critical severity security issue in `rag/utils/ob_conn.py`. ## Vulnerability \| Field \| Value \| \|-------\|-------\| \| ID \| V-003 \| \| Severity \| CRITICAL \| \| Scanner \| multi_agent_ai \| \| Rule \| `V-003` \| \| File \| `rag/utils/ob_conn.py:691` \| Description: The OceanBase database connector constructs SQL WHERE clauses by directly embedding user-controlled filter expressions using Python f-strings at lines 726, 777, 781, 787, 793, 821, and 827. No parameterization or allowlist validation is applied before the expressions are incorporated into live SQL queries. This is the most critical vulnerability in the codebase because it directly exposes the RAG knowledge base — the platform's core business asset — to complete compromise. ## Changes - `rag/utils/ob_conn.py` ## Verification - [x] Build passes - [x] Scanner re-scan confirms fix - [x] LLM code review passed --- Automated security fix by [OrbisAI Security](https://orbisappsec.com)	2026-04-30 14:25:17 +08:00
euvre	f3b7d55a1e	fix: handle Infinity table-not-exist error (3022) in update() methods (#14153 ) ### What problem does this PR solve? ## Summary Closes #6102 When using Infinity as the document store engine (GPU version), calling `update()` on a non-existent table throws an unhandled `InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This causes users to see a raw "3022" error when clicking on a parsed document. ## Root Cause The `update()` methods in both `rag/utils/infinity_conn.py` and `memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)` without catching `InfinityException`. In contrast, other CRUD methods (`insert`, `delete`, `search`) all handle this exception gracefully: \| Method \| Handles table-not-exist? \| Behavior \| \|----------\|--------------------------\|----------\| \| `insert` \| ✅ Yes \| Auto-creates the table \| \| `search` \| ✅ Yes \| Skips the table \| \| `delete` \| ✅ Yes \| Returns 0 \| \| `update` \| ❌ No \| Crashes with 3022 \| Additionally, `api/apps/document_app.py` worked around this with a fragile string match (`"3022" in msg`) to detect the error. ## Changes - `rag/utils/infinity_conn.py`: Catch `InfinityException` in `update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return `False` — consistent with `delete()`. - `memory/utils/infinity_conn.py`: Apply the same fix to its `update()` method. - `api/apps/document_app.py`: Remove the fragile `"3022"` string-matching workaround. Table-not-exist is now handled by the `if not ok` path with an improved error message. ### Type of change - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 11:52:22 +08:00
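The pattern this commit describes (catch the exception in `update()`, return `False` on table-not-exist, let everything else propagate) might look roughly like the sketch below. `TableError` and `FakeDatabase` are hypothetical stand-ins; the real Infinity SDK classes and the actual `update()` signature are not shown in the commit message.

```python
import logging

TABLE_NOT_EXIST = 3022  # error code quoted in the commit message


class TableError(Exception):
    """Hypothetical stand-in for InfinityException, carrying an error code."""

    def __init__(self, error_code: int, message: str = "") -> None:
        super().__init__(message)
        self.error_code = error_code


class FakeDatabase:
    """Hypothetical database handle: table name -> dict of rows."""

    def __init__(self, tables):
        self._tables = tables

    def get_table(self, name):
        if name not in self._tables:
            raise TableError(TABLE_NOT_EXIST, f"table {name} does not exist")
        return self._tables[name]


def update(db, table_name, row_id, new_value) -> bool:
    """Return False instead of crashing when the table is missing."""
    try:
        table = db.get_table(table_name)
    except TableError as e:
        if e.error_code == TABLE_NOT_EXIST:
            logging.warning("update skipped: table %s does not exist", table_name)
            return False
        raise  # any other error still propagates to the caller
    table[row_id] = new_value
    return True


db = FakeDatabase({"chunks": {}})
```

Returning a boolean lets the caller take its ordinary `if not ok` path rather than string-matching on "3022".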
Lynn	e22cf333ed	Fix: allow search id or _id (#14356 ) ### What problem does this PR solve? Allow search id or _id when using es as doc_engine. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 21:38:19 +08:00
newyangyang	d84438fd53	fix azure blob put method param (#14329 ) ### What problem does this PR solve? When Azure Blob is used as the file container, clicking "parse" on a file calls: ```python partial(settings.STORAGE_IMPL.put, tenant_id=task["tenant_id"]) ``` So any storage backend used there must accept `tenant_id` as a keyword argument. `RAGFlowAzureSasBlob.put()` did not, causing: ``` TypeError: ... got an unexpected keyword argument 'tenant_id' ``` Now it does, so parsing should proceed past this point. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-23 20:40:54 +08:00
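The failure mode is easy to reproduce with `functools.partial` alone. In this sketch, `put_without_kwarg`, `put_with_kwarg`, and `call_put` are hypothetical names, not the RAGFlow storage classes; only the `partial(..., tenant_id=...)` call shape comes from the commit message.

```python
from functools import partial


def put_without_kwarg(bucket, fnm, binary):
    """Old-style signature: no tenant_id parameter."""
    return (bucket, fnm, None)


def put_with_kwarg(bucket, fnm, binary, tenant_id=None):
    """Fixed signature: tenant_id is accepted as an optional keyword."""
    return (bucket, fnm, tenant_id)


def call_put(put_func):
    """Bind tenant_id the way the parsing task does, then invoke the backend."""
    bound = partial(put_func, tenant_id="tenant-1")
    return bound("kb_a", "report.pdf", b"data")
```

Calling `call_put(put_without_kwarg)` raises `TypeError`, while `call_put(put_with_kwarg)` goes through.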
Wang Qi	224574831c	Add REDIS zcard (#14316 ) ### What problem does this PR solve? As description. ### Type of change - [x] Refactoring	2026-04-23 12:51:55 +08:00
Zhichang Yu	b7744e053e	fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972 ) - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Configuration Chore (non-breaking change which updates configuration) ## Summary by CodeRabbit * Bug Fixes * More accurate handling and unwrapping of dense-vector fields so returned values have correct shapes. * Field selection reliably limits returned data and falls back to alternate result locations when needed. * Use of consistent result IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased build memory and adjusted build-time flags for the frontend build. * Simplified runtime model/GPU checks and removed an automated runtime GPU-install attempt. * Build Fixes * `web/vite.config.ts`: make `build.minify` and `build.sourcemap` respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from Dockerfile instead of hardcoding `terser` and `true`. * Environment * Allow stack version override and default the runtime image tag to "latest". ## Summary by CodeRabbit * Bug Fixes * Correct unwrapping of dense-vector fields and reliable field selection with fallback locations. * Consistent use of hit-level IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased frontend build memory and added build-time minify/sourcemap flags; build minification and sourcemap now configurable. * Removed runtime GPU detection for model initialization; force CPU initialization. * Environment * Allow stack version override and default runtime image tag to "latest". --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 17:44:13 +08:00
MkDev11	cfee2bc9db	feat: Auto-adjust chunk recall weights based on user feedback (#12689 ) ### What problem does this PR solve? Implements automatic adjustment of knowledge base chunk recall weights based on user feedback (upvotes/downvotes). When users upvote or downvote a response, the system locates the corresponding knowledge snippets and adjusts their recall weight to improve future retrieval quality. Closes #12670 How it works: 1. User upvotes/downvotes a response via `POST /thumbup` 2. System extracts chunk IDs from the conversation reference 3. For each referenced chunk: - Reads current `pagerank_fea` value from document store - Increments (+1) for upvote or decrements (-1) for downvote - Clamps weight to [0, 100] range - Updates chunk in ES/Infinity/OceanBase 4. Future retrievals score these chunks higher/lower based on accumulated feedback Files changed: - `api/db/services/chunk_feedback_service.py` - New service for updating chunk pagerank weights - `api/apps/conversation_app.py` - Integrated feedback service into thumbup endpoint - `test/testcases/test_web_api/test_chunk_feedback/` - Unit tests ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Chat message feedback now updates per-chunk relevance weights (feature-flag gated), with configurable weighting and atomic updates across storage backends. * Bug Fixes * Stricter validation for message feedback inputs and more robust handling of feedback transitions. * Tests * Expanded test coverage for chunk-feedback behavior, weighting strategies, storage backends, and thumb-flip scenarios. * Chores * CI workflow extended to run the new chunk-feedback web API tests. 
--------- Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com> Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>	2026-04-08 09:52:18 +08:00
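Step 3 of "How it works" (increment or decrement, then clamp to [0, 100]) reduces to a few lines of arithmetic. The function name `adjust_weight` is hypothetical; the actual service code in `chunk_feedback_service.py` is not shown in the commit message.

```python
def adjust_weight(current: int, upvote: bool, lo: int = 0, hi: int = 100) -> int:
    """Apply +1 for an upvote or -1 for a downvote, then clamp to [lo, hi]."""
    delta = 1 if upvote else -1
    return max(lo, min(hi, current + delta))
```

Clamping keeps repeated votes from pushing a chunk's weight outside the range, so a chunk already at 100 stays at 100 after another upvote.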
Yang_Ming	bc8d67ce78	feat: add region parameter support to MinIO connection (#13954 ) ## Summary - Add optional `region` parameter to `Minio()` client constructor in `rag/utils/minio_conn.py` - Reads from `MINIO.region` in settings, defaults to `None` when not configured - Required by some S3-compatible storage services (e.g., AWS S3, Tencent COS) for proper bucket access ## Motivation When using RAGFlow with S3-compatible storage that requires a region (such as AWS S3 or Tencent Cloud COS), the MinIO client fails to access buckets because the `region` parameter is not passed through. The `Minio()` Python client already supports the `region` parameter natively — this PR simply wires it up from the RAGFlow configuration. ## Changes - `rag/utils/minio_conn.py`: Pass `region=settings.MINIO.get("region", None) or None` to `Minio()` constructor ## Backward Compatibility - No breaking changes. When `region` is not configured, it defaults to `None`, preserving the existing behavior exactly. ## Test Plan - [ ] Verified with MinIO (no region set) — works as before - [x] Verified with S3-compatible storage requiring region — bucket access succeeds ## Summary by CodeRabbit * Bug Fixes * Enhanced MinIO client initialization with regional configuration support for improved compatibility with region-specific deployments. Co-authored-by: Jarry Wang <code-better-life@users.noreply.github.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-04-07 16:38:23 +08:00
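The expression `settings.MINIO.get("region", None) or None` does slightly more than a plain lookup: the trailing `or None` also maps an empty string to `None`. A minimal sketch, where `resolve_region` is a hypothetical helper and `minio_conf` stands in for `settings.MINIO`:

```python
from typing import Any, Mapping, Optional


def resolve_region(minio_conf: Mapping[str, Any]) -> Optional[str]:
    """Return the configured region, treating a missing or empty value as None."""
    return minio_conf.get("region", None) or None
```

This matters when the value comes from a templated config file, where an unset variable may plausibly render as `""` rather than being absent.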
qinling0210	49386bc1b5	Implement UpdateDataset and UpdateMetadata in GO (#13928 ) ### What problem does this PR solve? Implement UpdateDataset and UpdateMetadata in GO Add cli: UPDATE CHUNK <chunk_id> OF DATASET <dataset_name> SET <update_fields> REMOVE TAGS 'tag1', 'tag2' from DATASET 'dataset_name'; SET METADATA OF DOCUMENT <doc_id> TO <meta> ### Type of change - [ ] Refactoring	2026-04-07 09:44:51 +08:00
Zhichang Yu	ab358fe949	feat: make Azure cloud authority configurable for SPN auth (#13898 ) ## Summary - The Azure SPN storage handler hardcoded `AzureAuthorityHosts.AZURE_CHINA`, preventing users in Azure Public Cloud regions (UK-South, EU, US, etc.) from authenticating - Add a `cloud` config option (env: `AZURE_CLOUD`) supporting all four Azure sovereignties: `public`, `china`, `government`, `germany` - Defaults to `public` (global Azure) — the most common international use case Closes #13259 ## Test plan - [ ] Verify default (`cloud: public`) connects to Azure Public Cloud endpoints - [ ] Verify `cloud: china` retains existing behavior for Azure China users - [ ] Verify `AZURE_CLOUD` env var overrides the config file value 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 12:51:26 +08:00
qinling0210	f02f5fa435	Get ROW_ID from search() in Infinity (#13901 ) ### What problem does this PR solve? 1. Search() in Infinity can return row_id now 2. To Get ROW_ID from search(), refer to handling of retrieval_test. example ``` $ curl -s -X POST "http://localhost:$PORT/v1/chunk/retrieval_test" -H "Authorization: $TOKEN" -H "Content-Type: application/json" -d '{"kb_id": "4fcd01582ca911f1954184ba59049aa3", "question": "曹操"}' ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-02 18:56:43 +08:00
qinling0210	bb4a06f759	Implement InsertDataset and InsertMetadata in GO (#13883 ) ### What problem does this PR solve? Implement InsertDataset and InsertMetadata in GO new internal cli for go: INSERT DATASET FROM FILE "file_name" INSERT METADATA FROM FILE "file_name" ### Type of change - [x] Refactoring	2026-04-01 16:16:25 +08:00
qinling0210	620fe215a4	Fix python metadata search (#13727 ) ### What problem does this PR solve? Fix python metadata search ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 19:37:19 +08:00
qinling0210	0462c20113	Fix special characters in matching text of search() (#13852 ) ### What problem does this PR solve? Fix special characters in matching text of search(). We should escape some special characters(such as ?, *,:) before passing to matching_text of search() Fix https://github.com/infiniflow/ragflow/issues/13729 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 18:47:10 +08:00
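A minimal sketch of the escaping step, under two assumptions that the commit message does not confirm: that the set of special characters is exactly `?`, `*`, and `:` (the message lists them only as examples), and that a backslash is the escape character. `escape_matching_text` is a hypothetical name.

```python
import re

# Assumed character set: the commit message names ?, * and : only as examples.
_SPECIAL_CHARS = re.compile(r"([?*:])")


def escape_matching_text(text: str) -> str:
    """Put a backslash in front of each assumed special character."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)
```

Escaping before the text reaches `search()` means user input such as a question ending in `?` is matched literally instead of being read as query syntax.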
Stephen Hu	d32967eda8	refactor: let excel use lazy image loader (#13558 ) ### What problem does this PR solve? let excel use lazy image loader ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-23 21:24:40 +08:00
qinling0210	1be07a0a34	Fix "Result window is too large" during meta data search (#13521 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13210#issuecomment-3982878498 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 18:59:56 +08:00
Yongteng Lei	e1b632a7bb	Feat: add delete all support for delete operations (#13530 ) ### What problem does this PR solve? Add delete all support for delete operations. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-12 09:47:42 +08:00
eviaaaaa	d0ca388bec	Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329 ) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * Centralized Abstraction (`docx_parser.py`): Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * Robust Corrupted Image Fallback (`docx_parser.py`): Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * Legacy Code & Redundancy Purge: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. 
## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * Output Consistency: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * Memory Footprint Drop: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.	2026-03-11 10:00:07 +08:00
guptas6est	32d31284cc	Fix: upgrade pypdf to 6.7.5 and migrate from deprecated pypdf2 to fix CVE-2026-28804 and CVE-2023-36464 (#13454 ) ### What problem does this PR solve? This PR addresses security vulnerabilities in PDF processing dependencies identified by Trivy security scan: 1. CVE-2026-28804 (MEDIUM): pypdf 6.7.4 vulnerable to inefficient decoding of ASCIIHexDecode streams 2. CVE-2023-36464 (MEDIUM): pypdf2 3.0.1 susceptible to infinite loop when parsing malformed comments Since pypdf2 is deprecated with no available fixes, this PR migrates all pypdf2 usage to the actively maintained pypdf library (version 6.7.5), which resolves both vulnerabilities. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-09 12:06:00 +08:00
Yongteng Lei	d9785ea2ce	Fix: Alibaba cloud OSS config issue (#13406 ) ### What problem does this PR solve? Alibaba Cloud OSS config issue #13390. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-05 18:13:45 +08:00
Jin Hai	b9ad014f63	Supports login cross multiple RAGFlow servers (#13322 ) ### What problem does this PR solve? 1. Use redis to store the secret key. 2. During startup API server will read the secret from redis. If no such secret key, generate one and store it into redis, atomically. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-04 13:07:45 +08:00
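One common way to get the "generate and store atomically" behavior described here is Redis `SET` with the `NX` option, which redis-py exposes as `set(name, value, nx=True)`. Whether the PR uses exactly this mechanism is an assumption. The sketch below runs against `FakeRedis`, a hypothetical in-memory stand-in, and the key name is also made up.

```python
import secrets
import threading
from typing import Dict, Optional


class FakeRedis:
    """Hypothetical in-memory stand-in exposing only get() and set(..., nx=...)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            return self._data.get(key)

    def set(self, key: str, value: str, nx: bool = False) -> bool:
        with self._lock:
            if nx and key in self._data:
                return False  # key already set: leave it untouched
            self._data[key] = value
            return True


def get_or_create_secret(client, key: str = "hypothetical:secret_key") -> str:
    """Read the shared secret, creating it only if no server has done so yet."""
    existing = client.get(key)
    if existing is not None:
        return existing
    candidate = secrets.token_hex(32)
    client.set(key, candidate, nx=True)  # only the first writer wins
    return client.get(key)  # every server reads back the winning value
```

Because each server reads the stored value back after the conditional write, all servers end up with the same secret even if several start at once.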
Magicbook1108	daec36e935	Fix: add soft limit for graph rag size (#13252 ) ### What problem does this PR solve? Fix: add soft limit for graph rag size #13258 Q2 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 14:02:36 +08:00
huber	8a6b5ced6b	fix: add missing chunk_data column to OceanBase schema migration (#13306 ) ### What problem does this PR solve? When using OceanBase as the document storage engine, parsing and inserting chunks with chunk_data (e.g., table parser row data) fails with the following error: ``` [ERROR][Exception]: Insert chunk error: ['Unconsumed column names: chunk_data'] ``` This happens because the chunk_data column was recently introduced but was omitted from the EXTRA_COLUMNS list in `rag/utils/ob_conn.py`. As a result, the automatic schema migration for existing OceanBase tables does not append the missing chunk_data column, causing the underlying pyobvector or SQLAlchemy to raise an unconsumed column names error during data insertion. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### What is the solution? Added column_chunk_data to the EXTRA_COLUMNS list in `rag/utils/ob_conn.py`. This ensures that the OceanBase connection wrapper can correctly detect the missing column and automatically alter existing chunk tables to include the chunk_data field during initialization.	2026-03-02 13:25:11 +08:00
Yongteng Lei	c91e803a38	Fix: close detached PIL image on JPEG save failure in encode_image (#13278 ) ### What problem does this PR solve? Properly close detached PIL image on JPEG save failure in encode_image. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 14:43:35 +08:00
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. 
Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
He Wang	394ff16b66	fix: OceanBase metadata not returned in document list API (#13209 ) ### What problem does this PR solve? Fix #13144. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-25 15:29:17 +08:00
Yao Wei	cf6fd6f115	fix: When using OceanBase as storage, the list_chunk sorting is abnormal. #13198 (#13208 ) Actual behavior When using OceanBase as storage, the list_chunk sorting is abnormal. The following is the SQL statement. SELECT id, content_with_weight, important_kwd, question_kwd, img_id, available_int, position_int, doc_type_kwd, create_timestamp_flt, create_time, array_to_string(page_num_int, ',') AS page_num_int_sort, array_to_string(top_int, ',') AS top_int_sort FROM rag_store_284250730805059584 WHERE doc_id = '' AND kb_id IN ('') ORDER BY page_num_int_sort ASC, top_int_sort ASC, create_timestamp_flt DESC LIMIT 0, 20 <img width="1610" height="740" alt="image" src="https://github.com/user-attachments/assets/84e14c30-a97f-4e8f-8c8c-6ccac915d97d" /> Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>	2026-02-25 13:36:18 +08:00
PandaMan	f4cbdc3a3b	fix(api): MinIO health check use dynamic scheme and verify (Closes #13159 and #13158 ) (#13197 ) ## Summary Fixes MinIO SSL/TLS support in two places: the MinIO client connection and the health check used by the Admin/Service Health dashboard. Both now respect the `secure` and `verify` settings from the MinIO configuration. Closes #13158 Closes #13159 --- ## Problem #13158 – MinIO client: The client in `rag/utils/minio_conn.py` was hardcoded with `secure=False`, so RAGFlow could not connect to MinIO over HTTPS even when `secure: true` was set in config. There was also no way to disable certificate verification for self-signed certs. #13159 – MinIO health check: In `api/utils/health_utils.py`, the MinIO liveness check always used `http://` for the health URL. When MinIO was configured with SSL, the health check failed and the dashboard showed "timeout" even though MinIO was reachable over HTTPS. --- ## Solution ### MinIO client (`rag/utils/minio_conn.py`) - Read `MINIO.secure` (default `false`) and pass it into the `Minio()` constructor so HTTPS is used when configured. - Add `_build_minio_http_client()` that reads `MINIO.verify` (default `true`). When `verify` is false, return an `urllib3.PoolManager` with `cert_reqs=ssl.CERT_NONE` and pass it as `http_client` to `Minio()` so self-signed certificates are accepted. - Support string values for `secure` and `verify` (e.g. `"true"`, `"false"`). ### MinIO health check (`api/utils/health_utils.py`) - Add `_minio_scheme_and_verify()` to derive URL scheme (http/https) and the `verify` flag from `MINIO.secure` and `MINIO.verify`. - Update `check_minio_alive()` to use the correct scheme, pass `verify` into `requests.get(..., verify=verify)`, and use `timeout=10`. ### Config template (`docker/service_conf.yaml.template`) - Add commented optional MinIO keys `secure` and `verify` (and env vars `MINIO_SECURE`, `MINIO_VERIFY`) so deployers know they can enable HTTPS and optional cert verification. 
### Tests - `test/unit_test/utils/test_health_utils_minio.py` – Tests for `_minio_scheme_and_verify()` and `check_minio_alive()` (scheme, verify, status codes, timeout, errors). - `test/unit_test/utils/test_minio_conn_ssl.py` – Tests for `_build_minio_http_client()` (verify true/false/missing, string values, `CERT_NONE` when verify is false). --- ## Testing - Unit tests added/updated as above; run with the project's test runner. - Manually: configure MinIO with HTTPS and `secure: true` (and optionally `verify: false` for self-signed); confirm client operations work and the Service Health dashboard shows MinIO as alive instead of timeout.	2026-02-25 09:47:12 +08:00
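The scheme-and-verify derivation described for `_minio_scheme_and_verify()` could be sketched as follows. The function names `to_bool` and `scheme_and_verify` are hypothetical, and the set of strings accepted as true is an assumption; only the defaults (`secure` false, `verify` true) and the support for string values come from the PR description.

```python
from typing import Any, Mapping, Tuple


def to_bool(value: Any, default: bool) -> bool:
    """Coerce config values such as "true"/"false" or real booleans."""
    if value is None:
        return default
    if isinstance(value, str):
        # Assumed spellings of "true"; everything else counts as false.
        return value.strip().lower() in ("true", "1", "yes")
    return bool(value)


def scheme_and_verify(minio_conf: Mapping[str, Any]) -> Tuple[str, bool]:
    """Derive the URL scheme and the TLS verify flag from MinIO settings."""
    secure = to_bool(minio_conf.get("secure"), default=False)
    verify = to_bool(minio_conf.get("verify"), default=True)
    return ("https" if secure else "http", verify)
```

A health check can then build its URL from the returned scheme and pass the flag through as `requests.get(..., verify=verify)`, as the PR description states.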
akie	6f785e06a4	Fix issue #13084 (#13088 ) When match_expressions contains coroutine objects (from GraphRAG's Dealer.get_vector()), the code cannot identify this type because it only checks for MatchTextExpr, MatchDenseExpr, or FusionExpr. As a result: 1. score_func remains initialized as an empty string "". 2. This empty string is appended to the output list. 3. The output list is passed to Infinity SDK's table_instance.output() method. 4. Infinity's SQL parser (via sqlglot) fails to parse the empty string, throwing a ParseError.	2026-02-10 17:04:45 +08:00
He Wang	ff7afcbe5f	feat: add OceanBase memory store (#12955 ) ### What problem does this PR solve? Add OceanBase memory store and extracting base class `OBConnectionBase`. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-03 16:46:17 +08:00
Paul Y Hui	f028f74883	Fixed 12787 with syntax error in generated MySql json path expression (#12929 ) ### What problem does this PR solve? Fixed 12787 with syntax error in generated MySql json path expression https://github.com/infiniflow/ragflow/issues/12787 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) #### What was fixed: - Changed line 237 in ob_conn.py from value_str = get_value_str(value) if value else "" to value_str = get_value_str(value) - This fixes the bug where falsy but valid values (0, False, "", [], {}) were being converted to empty strings, causing invalid SQL syntax #### What was tested: - Comprehensive unit tests covering all edge cases - Regression tests specifically for the bug scenario --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-02-03 09:50:14 +08:00
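The difference between the two forms of the line is a truthiness check. In this sketch `get_value_str` is a hypothetical serializer built on `json.dumps` (the real helper in `ob_conn.py` is not shown in the commit message); the two wrapper functions mirror the before and after expressions.

```python
import json


def get_value_str(value) -> str:
    """Hypothetical serializer standing in for the real helper."""
    return json.dumps(value)


def value_str_before(value) -> str:
    # Old form: any falsy value (0, False, "", [], {}) collapses to "".
    return get_value_str(value) if value else ""


def value_str_after(value) -> str:
    # Fixed form: every value is serialized, falsy or not.
    return get_value_str(value)
```

With the old form, a filter on the value `0` produced an empty string where a literal was expected, which is how the invalid SQL arose.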
Carve_	23bdf25a1f	feature:Add OceanBase Storage Support for Table Parser (#12923 ) ### What problem does this PR solve? close #12770 This PR adds OceanBase as a storage backend for the Table Parser. It enables dynamic table schema storage via JSON and implements OceanBase SQL execution for text-to-SQL retrieval. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes - Table Parser stores row data into `chunk_data` when doc engine is OceanBase. (table.py) - OceanBase table schema adds `chunk_data` JSON column and migrates if needed. - Implemented OceanBase `sql()` to execute text-to-SQL results. (ob_conn.py) - Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py) ### Test 1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`) <img width="1290" height="783" alt="doc_engine_ob" src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1" /> 2. Upload an Excel file to Knowledge Base.(for test, we use as below) <img width="786" height="930" alt="excel" src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a" /> 3. Choose Table as parsing method. <img width="2550" height="1134" alt="parse_excel" src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0" /> 4.Ask a natural language query in chat. <img width="2550" height="1134" alt="query" src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91" />	2026-01-31 15:11:54 +08:00
Phives	87305cb08c	fix: close file handles when loading JSON mapping in doc store connectors (#12904 ) What problem does this PR solve? When loading JSON mapping/schema files, the code used json.load(open(path)) without closing the file. The file handle stayed open until garbage collection, which can leak file descriptors under load (e.g. repeated reconnects or migrations). Type of change [x] Bug Fix (non-breaking change which fixes an issue) Change Replaced json.load(open(...)) with a context manager so the file is closed after loading: with open(fp_mapping, "r") as f: ... = json.load(f) Files updated rag/utils/opensearch_conn.py – mapping load (1 place) common/doc_store/es_conn_base.py – mapping load + doc_meta_mapping load (2 places) common/doc_store/infinity_conn_base.py – schema loads in _migrate_db, doc metadata table creation, and SQL field mapping (4 places) Behavior is unchanged; only resource handling is fixed. Co-authored-by: Gittensor Miner <miner@gittensor.io>	2026-01-30 14:07:51 +08:00
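Written out as a function, the context-manager form from this commit looks like the sketch below. `load_mapping` is a hypothetical wrapper name; `fp_mapping` is the variable name used in the commit message.

```python
import json


def load_mapping(fp_mapping: str):
    """Load a JSON mapping file and close the handle as soon as parsing ends."""
    with open(fp_mapping, "r") as f:
        return json.load(f)
```

The `with` block closes the file even if `json.load` raises, whereas `json.load(open(path))` leaves the handle open until the garbage collector reclaims it.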
Angel98518	98b6a0e6d1	feat: Add OceanBase Performance Monitoring and Health Check Integration (#12886 ) ## Description This PR implements comprehensive OceanBase performance monitoring and health check functionality as requested in issue #12772. The implementation follows the existing ES/Infinity health check patterns and provides detailed metrics for operations teams. ## Problem Currently, RAGFlow lacks detailed health monitoring for OceanBase when used as the document engine. Operations teams need visibility into: - Connection status and latency - Storage space usage - Query throughput (QPS) - Slow query statistics - Connection pool utilization ## Solution ### 1. Enhanced OBConnection Class (`rag/utils/ob_conn.py`) Added comprehensive performance monitoring methods: - `get_performance_metrics()` - Main method returning all performance metrics - `_get_storage_info()` - Retrieves database storage usage - `_get_connection_pool_stats()` - Gets connection pool statistics - `_get_slow_query_count()` - Counts queries exceeding threshold - `_estimate_qps()` - Estimates queries per second - Enhanced `health()` method with connection status ### 2. Health Check Utilities (`api/utils/health_utils.py`) Added two new functions following ES/Infinity patterns: - `get_oceanbase_status()` - Returns OceanBase status with health and performance metrics - `check_oceanbase_health()` - Comprehensive health check with detailed metrics ### 3. API Endpoint (`api/apps/system_app.py`) Added new endpoint: - `GET /v1/system/oceanbase/status` - Returns OceanBase health status and performance metrics ### 4. Comprehensive Unit Tests (`test/unit_test/utils/test_oceanbase_health.py`) Added 340+ lines of unit tests covering: - Health check success/failure scenarios - Performance metrics retrieval - Error handling and edge cases - Connection pool statistics - Storage information retrieval - QPS estimation - Slow query detection ## Metrics Provided - Connection Status: connected/disconnected - Latency: Query latency in milliseconds - Storage: Used and total storage space - QPS: Estimated queries per second - Slow Queries: Count of queries exceeding threshold - Connection Pool: Active connections, max connections, pool size ## Testing - All unit tests pass - Error handling tested for connection failures - Edge cases covered (missing tables, connection errors) - Follows existing code patterns and conventions ## Code Statistics - Total Lines Changed: 665+ lines - New Code: ~600 lines - Test Coverage: 340+ lines of comprehensive tests - Files Modified: 3 - Files Created: 1 (test file) ## Acceptance Criteria Met ✅ `/system/oceanbase/status` API returns OceanBase health status ✅ Monitoring metrics accurately reflect OceanBase running status ✅ Clear error messages when health checks fail ✅ Response time optimized (metrics cached where possible) ✅ Follows existing ES/Infinity health check patterns ✅ Comprehensive test coverage ## Related Files - `rag/utils/ob_conn.py` - OceanBase connection class - `api/utils/health_utils.py` - Health check utilities - `api/apps/system_app.py` - System API endpoints - `test/unit_test/utils/test_oceanbase_health.py` - Unit tests Fixes #12772 --------- Co-authored-by: Daniel <daniel@example.com>	2026-01-30 09:44:42 +08:00
akie	d86b7f9721	Remove filter (kb_id) in infinity (#12853 ) Secondary indexes in infinity do not support IN expr --------- Signed-off-by: zpf121 <1219290549@qq.com>	2026-01-29 11:04:25 +08:00
dive2tech	15a534909f	fix: avoid ZeroDivisionError when fulltext column weights sum to zero (#12856 ) ### What problem does this PR solve? When all fulltext_search_columns use explicit weight 0 (e.g. "col^0"), weight_sum is 0 and dividing by it raises ZeroDivisionError. Use equal weights 1/n when weight_sum <= 0 and n > 0; otherwise normalize as before. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [x] Refactoring	2026-01-28 14:38:03 +08:00
qinling0210	9a5208976c	Put document metadata in ES/Infinity (#12826 ) ### What problem does this PR solve? Put document metadata in ES/Infinity. Index name of meta data: ragflow_doc_meta_{tenant_id} ### Type of change - [x] Refactoring	2026-01-28 13:29:34 +08:00
Stephen Hu	3a8c848af5	Fix:OSConnection.create_idx 4 arguments (#12862 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/12858 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-28 12:41:01 +08:00
Liu An	c2e8f90023	feat(ci): Add Redis service port configuration to test environment (#12855 ) ### What problem does this PR solve? Added Redis port calculation and environment variable export to support the Redis service in the test environment. The port is dynamically assigned based on the runner number to prevent conflicts during parallel test execution. This configuration had been removed by #12685. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-28 09:27:47 +08:00
Stephen Hu	52da81cf9e	Fix:Redis configuration template error in v0.22.1 (#12685 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/12674 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-27 12:47:46 +08:00
会敲代码的喵	2d9e7b4acd	Fix: aliyun oss need to use s3 signature_version (#12766 ) ### What problem does this PR solve? Aliyun OSS does not support boto's `s3v4` signature_version, which leads to an error: ``` botocore.exceptions.ClientError: An error occurred (InvalidArgument) when calling the PutObject operation: aws-chunked encoding is not supported with the specified x-amz-content-sha256 value ``` According to the Aliyun OSS docs, oss_conn needs to use the `s3` signature_version. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-22 11:43:55 +08:00
Kevin Hu	927db0b373	Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716 ) ### Type of change - [x] Refactoring	2026-01-20 13:29:37 +08:00
qinling0210	b40d639fdb	Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541 ) ### What problem does this PR solve? 1) Create dataset using table parser for infinity 2) Answer questions in chat using SQL ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-19 19:35:14 +08:00
He Wang	bd9163904a	fix(ob_conn): ignore duplicate errors when executing 'create_idx' (#12661 ) ### What problem does this PR solve? Skip duplicate errors to avoid 'create_idx' failures caused by slow metadata refresh or external modifications. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-16 20:46:37 +08:00
6ba3i	4f036a881d	Fix: Infinity keyword round-trip, highlight fallback, and KB update guards (#12660 ) ### What problem does this PR solve? Fixes Infinity-specific API regressions: preserves ```important_kwd``` round‑trip for ```[""]```, restores required highlight key in retrieval responses, and enforces Infinity guards for unsupported ```parser_id=tag``` and pagerank in ```/v1/kb/update```. Also removes a slow/buggy pandas row-wise apply that was throwing ```ValueError``` and causing flakiness. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-16 20:03:52 +08:00
liuxiaoyusky	2ea8dddef6	fix(infinity): Use comma separator for important_kwd to preserve mult… (#12618 ) ## Problem The `important_kwd` field in Infinity connector was using mismatched separators: - Storage: `list2str(v)` uses space as default separator - Reading: `v.split()` splits by all whitespace This causes multi-word keywords like `"Senior Fund Manager"` to be incorrectly split into `["Senior", "Fund", "Manager"]`. ## Solution Use comma `,` as separator for both storing and reading, consistent with: 1. The LLM output format in `keyword_prompt.md` ("delimited by ENGLISH COMMA") 2. The `cached.split(",")` in `task_executor.py` ## Changes - `insert()`: `list2str(v)` → `list2str(v, ",")` - `update()`: `list2str(v)` → `list2str(v, ",")` - `get_fields()`: `v.split()` → `v.split(",") if v else []` ## Impact This bug affects: - Python-level reranking weight calculation (`important_kwd * 5`) - API response keyword display - Search precision due to fragmented keywords	2026-01-15 15:32:40 +08:00
Vedant Madane	ac936005e6	fix: ensure deleted chunks are not returned in retrieval (#12520 ) (#12546 ) ## Summary Fixes #12520 - Deleted chunks should not appear in retrieval/reference results. ## Changes ### Core Fix - api/apps/chunk_app.py: Include `doc_id` in delete condition to properly scope the delete operation ### Improved Error Handling - api/db/services/document_service.py: Better separation of concerns with individual try-catch blocks and proper logging for each cleanup operation ### Doc Store Updates - rag/utils/es_conn.py: Updated delete query construction to support compound conditions - rag/utils/opensearch_conn.py: Same updates for OpenSearch compatibility ### Tests - test/testcases/.../test_retrieval_chunks.py: Added `TestDeletedChunksNotRetrievable` class with regression tests - test/unit/test_delete_query_construction.py: Unit tests for delete query construction ## Testing - Added regression tests that verify deleted chunks are not returned by retrieval API - Tests cover single chunk deletion and batch deletion scenarios	2026-01-15 14:45:55 +08:00
He Wang	360114ed42	fix(ob_conn): avoid reusing SQLAlchemy Column objects in DDL (#12588 ) ### What problem does this PR solve? When there are multiple users, parsing a document for a new user can trigger the reuse of column objects, leading to the error `sqlalchemy.exc.ArgumentError: Column object 'id' already assigned to Table xxx`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-13 17:39:20 +08:00
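
Note on 15a534909f (#12856): the equal-weight fallback described in that commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the function name `normalize_weights` and its list-of-floats signature are hypothetical, and only the rule itself (use 1/n when the sum is <= 0 and n > 0, otherwise normalize) comes from the message.

```python
from typing import List


def normalize_weights(weights: List[float]) -> List[float]:
    """Normalize column weights so they sum to 1.

    Hypothetical helper: mirrors the rule stated in the commit message,
    not the actual implementation in the repository.
    """
    n = len(weights)
    if n == 0:
        # Nothing to normalize; also avoids dividing by n below.
        return []
    weight_sum = sum(weights)
    if weight_sum <= 0:
        # All weights are zero (e.g. every column given as "col^0"):
        # fall back to equal weights instead of dividing by zero.
        return [1.0 / n] * n
    # Usual case: scale each weight by the total.
    return [w / weight_sum for w in weights]


if __name__ == "__main__":
    print(normalize_weights([0.0, 0.0]))  # equal-weight fallback
    print(normalize_weights([1.0, 3.0]))  # ordinary normalization
```

The guard on `n == 0` is kept separate from the zero-sum branch so that an empty column list never reaches the `1.0 / n` division.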

272 Commits
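
Note on 2ea8dddef6 (#12618): the separator mismatch described in that commit message can be reproduced with plain string operations. The sketch below is an assumption-laden illustration, not RAGFlow code: `join_keywords` and `split_keywords` are hypothetical stand-ins for the project's `list2str(v, ",")` call and the `v.split(",") if v else []` expression quoted in the message.

```python
from typing import List


def join_keywords(keywords: List[str], sep: str = ",") -> str:
    """Hypothetical stand-in for storing a keyword list as one string."""
    return sep.join(keywords)


def split_keywords(stored: str, sep: str = ",") -> List[str]:
    """Hypothetical stand-in for reading the stored string back.

    An empty string maps to an empty list rather than [""].
    """
    return stored.split(sep) if stored else []


if __name__ == "__main__":
    keywords = ["Senior Fund Manager", "ESG"]

    # Old behaviour per the commit message: space-joined on write,
    # whitespace-split on read, so multi-word keywords fragment.
    fragmented = " ".join(keywords).split()
    print(fragmented)

    # Comma on both sides preserves each keyword as written.
    restored = split_keywords(join_keywords(keywords))
    print(restored)
```

A comma separator still assumes that no individual keyword contains a comma; the commit message points to the LLM prompt ("delimited by ENGLISH COMMA") as the reason that assumption is expected to hold.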