ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-03-15 20:07:30 +08:00

Author	SHA1	Message	Date
Yongteng Lei	e1b632a7bb	Feat: add delete all support for delete operations (#13530 ) ### What problem does this PR solve? Add delete all support for delete operations. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-12 09:47:42 +08:00
Liu An	852393c114	Test: Lower priority of chat assistant and chunk list API tests (#13540 ) ### What problem does this PR solve? Mark test cases as lower priority (p3) for: - Creating chat assistants - Deleting chat assistants - Listing chat assistants - Listing chunks within datasets ### Type of change - [x] Update testcases	2026-03-11 19:00:18 +08:00
Jin Hai	2028e895fd	Add license and time record DAO (#13522 ) ### What problem does this PR solve? 1. Change go server default port to 9382 2. Compatible with EE data model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-11 14:02:24 +08:00
qinling0210	1815f5950b	Call get_flatted_meta_by_kbs in dify retrieval (#13509 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13388 Call get_flatted_meta_by_kbs in dify retrieval. Remove get_meta_by_kbs. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-11 13:42:24 +08:00
Josh	2d2d3cdbcf	Fix document metadata loading for paged listings (#13515 ) ## Summary - scope normal document-list metadata lookups to the current page's document IDs - keep the `return_empty_metadata=True` path dataset-wide because it needs full knowledge of docs that already have metadata - add unit tests for both paged listing paths and the unchanged empty-metadata behavior ## Why `DocumentService.get_list()` and the normal `get_by_kb_id()` path were calling `DocMetadataService.get_metadata_for_documents(None, kb_id)`, which loads metadata for the entire dataset on every page request. That becomes especially problematic on large datasets. The metadata scan path paginates through the full metadata index without an explicit sort, while the ES helper only switches to `search_after` beyond `10000` results when a sort is present. In practice this can lead to unnecessary full-dataset metadata work, slower document-list loading, and unreliable `meta_fields` in list responses for large KBs. This change keeps the existing empty-metadata filter behavior intact, but scopes normal list responses to metadata for the current page only.	2026-03-11 13:42:16 +08:00
Heyang Wang	08f83ff331	Feat: Support get aggregated parsing status to dataset via the API (#13481 ) ### What problem does this PR solve? Support getting aggregated parsing status to dataset via the API Issue: #12810 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>	2026-03-10 18:05:45 +08:00
Magicbook1108	7143954b48	Fix: chats_openai in none stream condition (#13495 ) ### What problem does this PR solve? Fix: chats_openai in none stream condition #13453 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 13:44:17 +08:00
qinling0210	7c92f51133	Fix retrieval function when metadata_condtion is specified in retrieval API (#13473 ) ### What problem does this PR solve? Fix https://github.com/infiniflow/ragflow/issues/13388 The following command returns empty when there is doc with the meta data ``` curl --request POST \ --url http://localhost:9222/api/v1/retrieval \ --header 'Content-Type: application/json' \ --header 'Authorization: Bearer ragflow-fO3mPFePfLgUYg8-9gjBVVXbvHqrvMPLGaW0P86PvAk' \ --data '{ "question": "any question", "dataset_ids": ["9bb4f0591b8811f18a4a84ba59049aa3"], "metadata_condition": { "logic": "and", "conditions": [ { "name": "character", "comparison_operator": "is", "value": "刘备" } ] } }' ``` When metadata_condtion is specified in the retrieval API, it is converted to doc_ids and doc_ids is passed to retrieval function. In retrieval funciton, when doc_ids is explicitly provided , we should bypass threshold. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 11:57:32 +08:00
tunsuy	292a1a8566	fix: detect and fallback garbled PDF text to OCR (#13366 ) (#13404 ) ## Problem When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer) cannot map CIDs to correct Unicode characters, outputting PUA characters (U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully trusted pdfplumber text without any garbled detection, causing garbled output in the final parsed result. Relates to #13366 ## Solution ### 1. Garbled text detection functions - `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16), replacement character U+FFFD, control characters, and unassigned/surrogate codepoints - `_is_garbled_text(text, threshold)`: Calculates garbled ratio and detects `(cid:xxx)` patterns ### 2. Box-level fallback (in `__ocr()`) When a text box has ≥50% garbled characters, discard pdfplumber text and fallback to OCR recognition. ### 3. Page-level detection (in `__images__()`) Sample characters from each page; if garbled rate ≥30%, clear all pdfplumber characters for that page, forcing full OCR. ### 4. Layout recognizer CID filtering Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text processing to prevent them from polluting layout analysis. ## Testing - 29 unit tests covering: normal CJK/English text, PUA characters, CID patterns, mixed text, boundary thresholds, edge cases - All 85 existing project unit tests pass without regression	2026-03-10 11:20:31 +08:00
Yongteng Lei	7484298c82	Refa: convert download_img to async (#13477 ) ### What problem does this PR solve? Convert download_img to async. ### Type of change - [x] Refactoring - [x] Performance Improvement	2026-03-09 19:00:17 +08:00
Liu An	7166a7e50e	Test: adjust test priority markers for API tests (#13450 ) ### What problem does this PR solve? Changed test priority markers from p1/p2 to p3 in three test files: - test_table_parser_dataset_chat.py: Adjusted priority for table parser dataset chat test - test_delete_chunks.py: Updated priority for chunk deletion test with invalid IDs - test_retrieval_chunks.py: Modified priority for chunks retrieval pagination test These changes demote the priority of specific test cases to p3, indicating they are lower priority tests that can run later in the test suite execution. ### Type of change - [x] Test update	2026-03-06 20:17:39 +08:00
Achieve3318	45cf24cd2f	feat(memory): implement get_highlight for OceanBase memory (#13449 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-06 20:17:11 +08:00
OliverW	3ed91345aa	fix(auth): return HTTP 401 for token-auth failures (#13420 ) Follow-up to #12488 #13386 ### What problem does this PR solve? Previously, token authentication failures returned HTTP 200 with an error code in the response body. This PR updates `token_required` to raise `Unauthorized` and relies on the global error handler to return a structured JSON response with HTTP 401 status. The response body structure (`code`, `message`, `data`) remains unchanged to preserve compatibility with the official SDK. Frontend logic has been updated to handle HTTP 401 responses in addition to checking `data.code`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-06 18:18:14 +08:00
Yongteng Lei	51be1f1442	Refa: empty ids means no-op operation (#13439 ) ### What problem does this PR solve? Empty ids means no-op operation. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Documentation Update - [x] Refactoring --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-06 18:16:42 +08:00
Achieve3318	37eb533fea	Feat(memory): implement get_aggregation for OceanBase memory (#13428 ) ### What problem does this PR solve? - Add aggregation_utils.aggregate_by_field for pure aggregation logic - Wire OBConnection.get_aggregation to use it (unwrap tuple, pass messages) - Add unit tests for aggregate_by_field (no DB/heavy deps) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-06 12:51:22 +08:00
Idriss Sbaaoui	d90d6026af	Playwright : new chat multi model test (#13402 ) ### What problem does this PR solve? new test for chat multiple model and other chat parameters under playwright ### Type of change - [x] Other (please describe): new test/ data-testid	2026-03-05 18:51:57 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Yongteng Lei	f13a1fb007	Refa: improve model verification ux (#13392 ) ### What problem does this PR solve? Improve model verification UX. #13395 ### Type of change - [x] Refactoring --------- Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:23:47 +08:00
tunsuy	e1f1184b01	test: add unit tests for graphrag/utils.py (87 test cases) (#13328 ) Add comprehensive unit tests for `graphrag/utils.py`, covering 15 functions/classes with 87 test cases. Tested functions: - clean_str, dict_has_keys_with_types, perform_variable_replacements - get_from_to, compute_args_hash, is_float_regex - GraphChange dataclass - handle_single_entity_extraction, handle_single_relationship_extraction - graph_merge, tidy_graph - split_string_by_multi_markers, pack_user_ass_to_openai_messages - is_continuous_subsequence, merge_tuples, flat_uniq_list All 327 existing + new tests pass with no regressions.	2026-03-05 15:30:43 +08:00
Idriss Sbaaoui	2508c46c8f	Playwright : add new test for configuration tab in datasets (#13365 ) ### What problem does this PR solve? this pr adds new tests, for the full configuration tab in datasests ### Type of change - [x] Other (please describe): new tests	2026-03-04 19:10:06 +08:00
Good0987	8a7272f423	Test: add scenario for embedding_model update when chunk_count > 0 (#13351 ) ### What problem does this PR solve? Guard embedding_model change when dataset has existing chunks. API must return code 102 with message 'When chunk_num (N) > 0, embedding_model must remain <current_model>' to prevent silent embedding drift. ### Type of change - [x] Add Testcases Co-authored-by: Liu An <asiro@qq.com>	2026-03-04 17:41:35 +08:00
Liu An	7715bad04e	refactor: reorganize unit test files into appropriate directories (#13343 ) ### What problem does this PR solve? Move test files from utils/ to their corresponding functional directories: - api/db/ for database related tests - api/utils/ for API utility tests - rag/utils/ for RAG utility tests ### Type of change - [x] Refactoring	2026-03-04 11:02:56 +08:00
Idriss Sbaaoui	2f4ca38adf	Fix : make playwright tests idempotent (#13332 ) ### What problem does this PR solve? Playwright tests previously depended on cross-file execution order (`auth -> provider -> dataset -> chat`). This change makes setup explicit and idempotent via fixtures so tests can run independently. - Added/standardized prerequisite fixtures in `test/playwright/conftest.py`: - `ensure_auth_context`, `ensure_model_provider_configured`, `ensure_dataset_ready`, `ensure_chat_ready` - Made provisioning reusable/idempotent with `RUN_ID`-based resource naming. - Synced auth envs (`E2E_ADMIN_EMAIL`, `E2E_ADMIN_PASSWORD`) into seeded creds. - Fixed provider cache freshness (`auth_header`/`page` refresh on cache hit). Also included minimal stability fixes: - dataset create stale-element click handling, - search wait logic for results/empty-state, - agent create-menu handling, - agent run-step retry when run UI doesn’t open first click. ### Type of change - [x] Test fix - [x] Refactoring --------- Co-authored-by: Liu An <asiro@qq.com>	2026-03-04 10:07:14 +08:00
Magicbook1108	4f09b3e2a4	Fix: pipeline canvas category (#13319 ) ### What problem does this PR solve? Fix: pipeline canvas category ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 20:27:36 +08:00
Ahmad Intisar	184388879d	feat: Add `disable_password_login` configuration to support SSO-only authentication (#13151 ) ### What problem does this PR solve? Enterprise deployments that use an external Identity Provider (e.g., Microsoft Entra ID, Okta, Keycloak) need the ability to enforce SSO-only authentication by hiding the email/password login form. Currently, the login page always shows the password form alongside OAuth buttons, with no way to disable it. This PR adds a `disable_password_login` configuration option under the existing `authentication` section in `service_conf.yaml`. When set to `true`, the login page only displays configured OAuth/SSO buttons and hides the email/password form, "Remember me" checkbox, and "Sign up" link. The flag can be set via: - `service_conf.yaml` (`authentication.disable_password_login: true`) - Environment variable (`DISABLE_PASSWORD_LOGIN=true`) Default behavior is unchanged (`false`). ### Behavior \| `disable_password_login` \| OAuth configured \| Result \| \|---\|---\|---\| \| `false` (default) \| No \| Standard email/password form \| \| `false` \| Yes \| Email/password form + SSO buttons below \| \| `true` \| Yes \| SSO buttons only (no form, no sign up link) \| \| `true` \| No \| Empty card (admin should configure OAuth first) \| ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Files changed (5) 1. `docker/service_conf.yaml.template` — added `disable_password_login: false` under authentication 2. `common/settings.py` — added `DISABLE_PASSWORD_LOGIN` global variable and loader in `init_settings()` 3. `common/config_utils.py` — fixed `TypeError` in `show_configs()` when authentication section contains non-dict values (e.g., booleans) 4. `api/apps/system_app.py` — exposed `disablePasswordLogin` flag in `/config` endpoint 5. `web/src/pages/login/index.tsx` — conditionally render password form based on config flag; OAuth buttons always render when channels exist --------- Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-03-02 14:06:03 +08:00
Idriss Sbaaoui	860c4bd0bb	Feat: UI testing automation with playwright (#12749 ) ### What problem does this PR solve? This PR helps automate the testing of the ui interface using pytest Playwright ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test automation infrastructure --------- Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:04:08 +08:00
Idriss Sbaaoui	9d78d3ddb1	Tests: fix failling http in CI (#13301 ) ### What problem does this PR solve? test_doc_sdk_routes_unit had two flaky/incorrect branch assumptions: 1. parse/stop_parsing production logic gates on doc.run, but tests used progress, causing branch mismatch and unintended fallthrough into mutation/DB paths. 2. stop_parsing invalid-state test asserted an outdated message fragment, making the contract brittle. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 10:44:33 +08:00
Magicbook1108	1027916bfe	Fix: inconsistent state handling for multi-user single-canvas access (#13267 ) ### What problem does this PR solve? <img width="700" alt="image" src="https://github.com/user-attachments/assets/1db7412e-4554-44bc-84ba-16421949aacc" /> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-02-28 15:09:21 +08:00
天海蒼灆	983150b936	Fix (api): fix the document parsing status check logic (#12504 ) ### What problem does this PR solve? When the original code terminates the parsing task halfway, the progress may not be 0 or 1, which will result in the inability to call the interface to parse again -Change the document parsing progress check to task status check, and use TaskStatus.RUNNING.value to judge -Update the condition judgment for stopping parsing documents, and check whether the task is running instead ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-28 14:38:55 +08:00
qinling0210	8b6d363a98	Use pagination in _search_metadata (#13238 ) ### What problem does this PR solve? Fix [#13210](https://github.com/infiniflow/ragflow/issues/13210) Remove limit in _search_metadata, use pagination in _search_metadata. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-27 11:24:49 +08:00
6ba3i	22c4d72891	tests: improve RAGFlow coverage based on Codecov report (#13219 ) ### What problem does this PR solve? Codecov’s coverage report shows that several RAGFlow code paths are currently untested or under-tested. This makes it easier for regressions to slip in during refactors and feature work. This PR adds targeted automated tests to cover the files and branches highlighted by Codecov, improving confidence in core behavior while keeping runtime functionality unchanged. ### Type of change - [x] Other (please describe): Test coverage improvement (adds/extends unit and integration tests to address Codecov-reported gaps)	2026-02-26 19:03:26 +08:00
PandaMan	d43aebe701	Fix/13142 auto metadata (#13217 ) ### What problem does this PR solve? Close #13142 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-26 10:25:48 +08:00
6ba3i	38011f2c16	tests: improve RAGFlow coverage based on Codecov report (#13200 ) ### What problem does this PR solve? Codecov’s coverage report shows that several RAGFlow code paths are currently untested or under-tested. This makes it easier for regressions to slip in during refactors and feature work. This PR adds targeted automated tests to cover the files and branches highlighted by Codecov, improving confidence in core behavior while keeping runtime functionality unchanged. ### Type of change - [x] Other (please describe): Test coverage improvement (adds/extends unit and integration tests to address Codecov-reported gaps)	2026-02-25 19:12:11 +08:00
Phives	4ceb668d40	feat(api/utils): Harden file_utils for robustness and edge cases (#12915 ) ## Summary Improves robustness and edge-case handling in `api.utils.file_utils` to avoid crashes, DoS/OOM risks, and timeouts when processing user-provided filenames, paths, and file blobs. ## Changes ### Resource limits & timeouts - `MAX_BLOB_SIZE_THUMBNAIL` (50 MiB) and `MAX_BLOB_SIZE_PDF` (100 MiB) to reject oversized inputs before thumbnail/PDF processing. - `GHOSTSCRIPT_TIMEOUT_SEC` (120 s) for `repair_pdf_with_ghostscript` subprocess to avoid hangs on malicious or broken PDFs. ### `filename_type` - Handles `None`, empty string, non-string (e.g. int/list), and path-only input via new `_normalize_filename_for_type()`. - Uses basename for type detection (e.g. `a/b/c.pdf` → PDF). - Enforces `FILE_NAME_LEN_LIMIT`; invalid input returns `FileType.OTHER`. ### `thumbnail_img` - Rejects `None`/empty/oversized blob and invalid filename; returns `None` instead of raising. - Wraps PDF, image, and PPT handling in try/except so corrupt or malformed files return `None`. - Ensures PDF has pages and PPT has slides before use. - Normalizes PIL image mode (RGBA/P/LA → RGB) for safe PNG export. ### `repair_pdf_with_ghostscript` - Handles `None`/empty input; skips repair when input size exceeds limit. - Uses `subprocess.run(..., timeout=GHOSTSCRIPT_TIMEOUT_SEC)` and catches `TimeoutExpired`. - Returns original bytes when Ghostscript output is empty. ### `read_potential_broken_pdf` - `None` → `b""`; non–sequence-like (no `len`) → `b""`; empty → return as-is. - Oversized blob returned as-is (no repair) to avoid DoS. ### `sanitize_path` - Explicit `None` and non-string check; strips whitespace before normalizing. ## Testing - `test/unit_test/utils/test_api_file_utils.py` added with 36 unit tests covering the above behavior (filename_type, sanitize_path, read_potential_broken_pdf, thumbnail_img, thumbnail, repair_pdf_with_ghostscript, constants). - All tests pass. --------- Co-authored-by: Gittensor Miner <miner@gittensor.io>	2026-02-25 14:34:47 +08:00
PandaMan	f4cbdc3a3b	fix(api): MinIO health check use dynamic scheme and verify (Closes #13159 and #13158 ) (#13197 ) ## Summary Fixes MinIO SSL/TLS support in two places: the MinIO client connection and the health check used by the Admin/Service Health dashboard. Both now respect the `secure` and `verify` settings from the MinIO configuration. Closes #13158 Closes #13159 --- ## Problem #13158 – MinIO client: The client in `rag/utils/minio_conn.py` was hardcoded with `secure=False`, so RAGFlow could not connect to MinIO over HTTPS even when `secure: true` was set in config. There was also no way to disable certificate verification for self-signed certs. #13159 – MinIO health check: In `api/utils/health_utils.py`, the MinIO liveness check always used `http://` for the health URL. When MinIO was configured with SSL, the health check failed and the dashboard showed "timeout" even though MinIO was reachable over HTTPS. --- ## Solution ### MinIO client (`rag/utils/minio_conn.py`) - Read `MINIO.secure` (default `false`) and pass it into the `Minio()` constructor so HTTPS is used when configured. - Add `_build_minio_http_client()` that reads `MINIO.verify` (default `true`). When `verify` is false, return an `urllib3.PoolManager` with `cert_reqs=ssl.CERT_NONE` and pass it as `http_client` to `Minio()` so self-signed certificates are accepted. - Support string values for `secure` and `verify` (e.g. `"true"`, `"false"`). ### MinIO health check (`api/utils/health_utils.py`) - Add `_minio_scheme_and_verify()` to derive URL scheme (http/https) and the `verify` flag from `MINIO.secure` and `MINIO.verify`. - Update `check_minio_alive()` to use the correct scheme, pass `verify` into `requests.get(..., verify=verify)`, and use `timeout=10`. ### Config template (`docker/service_conf.yaml.template`) - Add commented optional MinIO keys `secure` and `verify` (and env vars `MINIO_SECURE`, `MINIO_VERIFY`) so deployers know they can enable HTTPS and optional cert verification. ### Tests - `test/unit_test/utils/test_health_utils_minio.py` – Tests for `_minio_scheme_and_verify()` and `check_minio_alive()` (scheme, verify, status codes, timeout, errors). - `test/unit_test/utils/test_minio_conn_ssl.py` – Tests for `_build_minio_http_client()` (verify true/false/missing, string values, `CERT_NONE` when verify is false). --- ## Testing - Unit tests added/updated as above; run with the project's test runner. - Manually: configure MinIO with HTTPS and `secure: true` (and optionally `verify: false` for self-signed); confirm client operations work and the Service Health dashboard shows MinIO as alive instead of timeout.	2026-02-25 09:47:12 +08:00
Liu An	65ebc06956	Refa: test file location for better organization (#13107 ) ### What problem does this PR solve? Renamed test/unit/test_delete_query_construction.py to test/unit_test/common/test_delete_query_construction.py to align with the project's directory structure and improve test categorization. ### Type of change - [x] Refactoring	2026-02-12 10:15:09 +08:00
Jim Smith	7029b8ca81	Fix: Make time_utils tests timezone-independent (#13100 ) ## Summary - Replace hardcoded CST (UTC+8) expected values in `test_time_utils.py` with dynamically computed local-time expectations using `time.localtime()` and `time.mktime()` - Tests previously failed in any timezone other than UTC+8; they now pass regardless of the system's local timezone ## Test plan - [x] `uv run pytest test/unit_test/ -v` — 317 passed, 25 skipped 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Jim Smith <jhsmith0@me.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 10:51:53 +08:00
Liu An	392ec99651	Docs: Update version references to v0.24.0 in READMEs and docs (#13095 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.23.1 to v0.24.0 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-02-10 17:24:03 +08:00
6ba3i	fabbfcab90	Fix: failing p3 test for SDK/HTTP APIs (#13062 ) ### What problem does this PR solve? Adjust highlight parsing, add row-count SQL override, tweak retrieval thresholding, and update tests with engine-aware skips/utilities. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-09 14:56:10 +08:00
Carve_	7b230aadf4	chore(tests): move oceanbase peewee test under test/ and fix enum check (#12969 ) ### What problem does this PR solve? This mistake was made by PR #12926 This PR makes the OceanBase peewee unit test discoverable by the default unit test runner/CI (by moving it under test/), so it’s included in the unified unit test suite. It also fixes `test_database_lock_enum_values` to correctly handle Enum alias members (DatabaseLock uses the same value for MYSQL and OCEANBASE). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Screenshots The original `test_oceanbase_peewee.py` was placed under tests/, which isn’t included in the default unit test runner’s testpaths, so it wasn’t picked up by the unit test suite. So we need to move it to correct path. <img width="670" height="540" alt="image" src="https://github.com/user-attachments/assets/69d39346-450f-46dc-8965-29c3d7b32bc9" /> When using old version in `test_oceanbase_peewee.py`: ``` def test_database_lock_enum_values(self): """Test DatabaseLock enum has all expected values.""" expected = {'MYSQL', 'OCEANBASE', 'POSTGRES'} actual = {e.name for e in DatabaseLock} assert expected.issubset(actual), f"Missing: {expected - actual}" ``` The old check iterated Enum members, so alias values were skipped and only `MYSQL/POSTGRES` were seen, making OCEANBASE appear missing. <img width="1998" height="931" alt="65e2837f23b7b298980a410c7d5c2f09" src="https://github.com/user-attachments/assets/d8e98c5a-2cfa-4182-ae35-a3ef03554a27" /> and new version uses `DatabaseLock.__members__` and passes: <img width="2024" height="1170" alt="1aa8c6facb28d24149270fe1bc4a9dd9" src="https://github.com/user-attachments/assets/d8688936-ccac-4a39-a389-23dc6f0fe276" />	2026-02-03 17:28:53 +08:00
Philipp Heyken Soares	ad06c042c4	Support operator constraints in semi-automatic metadata filtering (#12956 ) ### What problem does this PR solve? #### Summary This PR enhances the Semi-automatic metadata filtering mode by allowing users to explicitly pre-define operators (e.g., contains, =, >, etc.) for selected metadata keys. While the LLM still dynamically extracts the filter value from the user's query, it is now strictly constrained to use the operator specified in the UI configuration. Using this feature is optional. By default the operator selection is set to "automatic" resulting in the LLM choosing the operator (as presently). #### Rationale & Use Case This enhancement was driven by a concrete challenge I encountered while working with technical documentation. In my specific use case, I was trying to filter for software versions within a technical manual. In this dataset, a single document chunk often applies to multiple software versions. These versions are stored as a combined string within the metadata for each chunk. When using the standard semi-automatic filter, the LLM would inconsistently choose between the contains and equals operators. When it chose equals, it would exclude every chunk that applied to more than one version, even if the version I was searching for was clearly included in that metadata string. This led to incomplete and frustrating retrieval results. By extending the semi-automatic filter to allow pre-defining the operator for a specific key, I was able to force the use of contains for the version field. This change immediately led to significantly improved and more reliable results in my case. I believe this functionality will be equally useful for others dealing with "tagged" or multi-value metadata where the relationship between the query and the field is known, but the specific value needs to remain dynamic. #### Key Changes ##### Backend & Core Logic - `common/metadata_utils.py`: Updated apply_meta_data_filter to support a mixed data structure for semi_auto (handling both legacy string arrays and the new object-based format {"key": "...", "op": "..."}). - `rag/prompts/generator.py`: Extended gen_meta_filter to accept and pass operator constraints to the LLM. - `rag/prompts/meta_filter.md`: Updated the system prompt to instruct the LLM to strictly respect provided operator constraints. ##### Frontend - `web/src/components/metadata-filter/metadata-semi-auto-fields.tsx`: Enhanced the UI to include an operator dropdown for each selected metadata key, utilizing existing operator constants. - `web/src/components/metadata-filter/index.tsx`: Updated the validation schema to accommodate the new state structure. #### Test Plan - Backward Compatibility: Verified that existing semi-auto filters stored as simple strings still function correctly. - Prompt Verification: Confirmed that constraints are correctly rendered in the LLM system prompt when specified. - Added unit tests as `test/unit_test/common/test_apply_semi_auto_meta_data_filter.py` - Manual End-to-End: - Configured a "Semi-automatic" filter for a "Version" key with the "contains" operator. - Asked a version-specific query. - Result <img width="1173" height="704" alt="Screenshot 2026-02-02 145359" src="https://github.com/user-attachments/assets/510a6a61-a231-4dc2-a7fe-cdfc07219132" /> ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>	2026-02-03 11:11:34 +08:00
Paul Y Hui	f028f74883	Fixed 12787 with syntax error in generated MySql json path expression (#12929 ) ### What problem does this PR solve? Fixed 12787 with syntax error in generated MySql json path expression https://github.com/infiniflow/ragflow/issues/12787 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) #### What was fixed: - Changed line 237 in ob_conn.py from value_str = get_value_str(value) if value else "" to value_str = get_value_str(value) - This fixes the bug where falsy but valid values (0, False, "", [], {}) were being converted to empty strings, causing invalid SQL syntax #### What was tested: - Comprehensive unit tests covering all edge cases - Regression tests specifically for the bug scenario --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-02-03 09:50:14 +08:00
Liu An	c4f60b349d	Fix(test): downgrade test priorities (#12913 ) ### What problem does this PR solve? Changed test priorities in multiple test files, downgrading from p1 to p2 and p2 to p3. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-30 20:02:56 +08:00
Liu An	4947e9473a	Fix(test): Update error message assertions for unsupported content type tests (#12901 ) ### What problem does this PR solve? This commit updates test cases for create, delete, and update dataset endpoints to expect consistent error messages when an unsupported content type is provided. ### Type of change - [x] Bug Fix (test)	2026-01-30 09:45:04 +08:00
Angel98518	98b6a0e6d1	feat: Add OceanBase Performance Monitoring and Health Check Integration (#12886 ) ## Description This PR implements comprehensive OceanBase performance monitoring and health check functionality as requested in issue #12772. The implementation follows the existing ES/Infinity health check patterns and provides detailed metrics for operations teams. ## Problem Currently, RAGFlow lacks detailed health monitoring for OceanBase when used as the document engine. Operations teams need visibility into: - Connection status and latency - Storage space usage - Query throughput (QPS) - Slow query statistics - Connection pool utilization ## Solution ### 1. Enhanced OBConnection Class (`rag/utils/ob_conn.py`) Added comprehensive performance monitoring methods: - `get_performance_metrics()` - Main method returning all performance metrics - `_get_storage_info()` - Retrieves database storage usage - `_get_connection_pool_stats()` - Gets connection pool statistics - `_get_slow_query_count()` - Counts queries exceeding threshold - `_estimate_qps()` - Estimates queries per second - Enhanced `health()` method with connection status ### 2. Health Check Utilities (`api/utils/health_utils.py`) Added two new functions following ES/Infinity patterns: - `get_oceanbase_status()` - Returns OceanBase status with health and performance metrics - `check_oceanbase_health()` - Comprehensive health check with detailed metrics ### 3. API Endpoint (`api/apps/system_app.py`) Added new endpoint: - `GET /v1/system/oceanbase/status` - Returns OceanBase health status and performance metrics ### 4. Comprehensive Unit Tests (`test/unit_test/utils/test_oceanbase_health.py`) Added 340+ lines of unit tests covering: - Health check success/failure scenarios - Performance metrics retrieval - Error handling and edge cases - Connection pool statistics - Storage information retrieval - QPS estimation - Slow query detection ## Metrics Provided - Connection Status: connected/disconnected - Latency: Query latency in milliseconds - Storage: Used and total storage space - QPS: Estimated queries per second - Slow Queries: Count of queries exceeding threshold - Connection Pool: Active connections, max connections, pool size ## Testing - All unit tests pass - Error handling tested for connection failures - Edge cases covered (missing tables, connection errors) - Follows existing code patterns and conventions ## Code Statistics - Total Lines Changed: 665+ lines - New Code: ~600 lines - Test Coverage: 340+ lines of comprehensive tests - Files Modified: 3 - Files Created: 1 (test file) ## Acceptance Criteria Met ✅ `/system/oceanbase/status` API returns OceanBase health status ✅ Monitoring metrics accurately reflect OceanBase running status ✅ Clear error messages when health checks fail ✅ Response time optimized (metrics cached where possible) ✅ Follows existing ES/Infinity health check patterns ✅ Comprehensive test coverage ## Related Files - `rag/utils/ob_conn.py` - OceanBase connection class - `api/utils/health_utils.py` - Health check utilities - `api/apps/system_app.py` - System API endpoints - `test/unit_test/utils/test_oceanbase_health.py` - Unit tests Fixes #12772 --------- Co-authored-by: Daniel <daniel@example.com>	2026-01-30 09:44:42 +08:00
Philipp Heyken Soares	6305c7e411	Fix metadata filter (#12861 ) ### What problem does this PR solve? ##### Summary This PR fixes a bug in the metadata filtering logic where the contains and not contains operators were behaving identically to the in and not in operators. It also standardizes the syntax for string-based operators. ##### The Issue On the main branch, the contains operator was implemented as: `matched = input in value if not isinstance(input, list) else all(i in value for i in input)` This logic is identical to the `in` operator. It checks if the metadata (`input`) exists within the filter (`value`). For a "contains" search, the logic should be reversed: _we want to check if the filter value exists within the metadata input_. ##### Solution Presented Here The operators have been rewritten using str.find(): Contains: `str(input).find(value) >= 0` Not Contains: `str(input).find(value) == -1` ##### Advantage This approach places the metadata (input) on the left side of the expression. This maintains stylistic consistency with the existing start with and end with operators in the same file, which also place the input on the left (e.g., str(input).lower().startswith(...)). ##### Considered Alternative In a previous PR we considered using the standard Python `in` operator: `value in str(input)`. The `in` operator is approximately 15% faster because it uses optimized Python bytecode (CONTAINS_OP) and avoids an attribute lookup. However following rejection of this PR we now propose the change presented here. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Philipp Heyken Soares <philipp.heyken-soares@am.ai>	2026-01-29 09:59:48 +08:00
qinling0210	9a5208976c	Put document metadata in ES/Infinity (#12826 ) ### What problem does this PR solve? Put document metadata in ES/Infinity. Index name of meta data: ragflow_doc_meta_{tenant_id} ### Type of change - [x] Refactoring	2026-01-28 13:29:34 +08:00
Stephen Hu	52da81cf9e	Fix:Redis configuration template error in v0.22.1 (#12685 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/12674 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-27 12:47:46 +08:00
Julien Deveaux	6be197cbb6	Fix: Use tiktoken for proper token counting in OpenAI-compatible endpoint #7850 (#12760 ) ### What problem does this PR solve? The OpenAI-compatible chat endpoint (`/chats_openai/<chat_id>/chat/completions`) was not returning accurate token usage in streaming responses. The token counts were either missing or inaccurate because the underlying LLM API responses weren't being properly parsed for usage data. This PR adds proper token counting using tiktoken (cl100k_base encoding) as a fallback when the LLM API doesn't provide usage data in streaming chunks. This ensures clients always receive token usage information in the response, which is essential for billing and quota management. Changes: - Add tiktoken-based token counting for streaming responses in OpenAI-compatible endpoint - Ensure `usage` field is always populated in the final streaming chunk - Add unit tests for token usage calculation Fixes #7850 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-23 09:36:21 +08:00
Kevin Hu	3beb85efa0	Feat: enhance metadata arranging. (#12745 ) ### What problem does this PR solve? #11564 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-22 15:34:08 +08:00

1 2 3

122 Commits