ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-05-22 17:08:23 +08:00

Author	SHA1	Message	Date
Xing Hong	fb95136f39	Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090 ) ### What problem does this PR solve? The POST /upload_info?url=<url> endpoint accepted a user-supplied URL and passed it directly to AsyncWebCrawler without any validation. There were no restrictions on URL scheme, destination hostname, or resolved IP address. This allowed any authenticated user to instruct the server to make outbound HTTP requests to internal infrastructure — including RFC 1918 private networks, loopback addresses, and cloud metadata services such as http://169.254.169.254 — effectively using the server as a proxy for internal network reconnaissance or credential theft. This PR adds an SSRF guard (_validate_url_for_crawl) that runs before any crawl is initiated. It enforces an allowlist of safe schemes (http/https), resolves the hostname at validation time, and rejects any URL whose resolved IP falls within a private or reserved network range. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-25 14:30:15 +08:00
Eden	1f33ca1099	fix(dialog): restore decorated answer in async_ask final SSE event (#13917 ) ## What's the problem Both `async_chat()` and `async_ask()` call `decorate_answer()` to build the final SSE payload — it inserts citation markers (`##N$$`) into the answer text and prunes `doc_aggs` to only the cited documents. Immediately after, both functions overwrite `final["answer"]` with `""`: ```python # async_chat(), line ~774 (issue #13828) final = decorate_answer(thought + full_answer) final["final"] = True final["audio_binary"] = None final["answer"] = "" # discards decorated text yield final # async_ask(), line ~1444 (same bug, different path) final = decorate_answer(full_answer) final["final"] = True final["answer"] = "" # discards decorated text yield final ``` The client receives filtered references (built for a citation-decorated answer it never sees) while displaying the raw, undecorated streaming text. Citations can never match. ## Root cause `final["answer"] = ""` was left over from an earlier design where clients were meant to reconstruct the full answer purely from delta events. Once `decorate_answer()` started placing citation markers, this blank-out broke the contract: the final event is where the decorated answer should land. ## Fix Remove the two blank-override lines — one in `async_chat()`, one in `async_ask()`: ```diff - final["answer"] = "" ``` `decorate_answer()` already sets `final["answer"]` to the correct decorated string; there is nothing to override. ## Relation to #13828 Issue #13828 and PR #13835 identify the bug in `async_chat()`. This PR absorbs that fix and also corrects the identical pattern in `async_ask()` (used by the `/retrieval` route in `chat_api.py`), which PR #13835 does not touch. ## Regression test Added `test/unit_test/api/db/services/test_dialog_service_final_answer.py` with three tests: \| Test \| Purpose \| \|------\|---------\| \| `test_buggy_pattern_drops_answer` \| Documents the old behaviour: blank-override empties the final answer \| \| `test_fixed_pattern_preserves_decorated_answer` \| Core invariant: final event carries the decorated text from `decorate_answer()` \| \| `test_final_event_reference_matches_decorated_result` \| Citation markers in the answer must match the pruned `doc_aggs` in the same event \| Local run result: ``` test_dialog_service_final_answer.py::test_buggy_pattern_drops_answer PASSED test_dialog_service_final_answer.py::test_fixed_pattern_preserves_decorated_answer PASSED test_dialog_service_final_answer.py::test_final_event_reference_matches_decorated_result PASSED 3 passed in 0.04s ``` `ruff check` passes with no issues on all changed files. --------- Co-authored-by: edenfunf <edenfunf@gmail.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-04-15 14:10:36 +08:00
bitloi	853021ff2a	feat: support multiple canvas_types for agent templates and remove duplicate files (#14030 ) ### What problem does this PR solve? Closes #13907 The template catalog had duplicate files (e.g. `*_r.json`) only to place the same template into multiple sidebar groups. This increases maintenance cost and makes template updates error-prone. This PR adds first-class support for multiple template categories in a single file via `canvas_types`, then removes duplicate template files. What changed: - Added `canvas_types` to `CanvasTemplate` model and DB migration. - Added normalization logic when loading templates: - accepts legacy `canvas_type` - accepts new `canvas_types` - merges/deduplicates values - preserves backward compatibility by keeping `canvas_type` as first normalized value. - Updated template import flow to load only `.json` files and in stable sorted order. - Updated frontend template filtering to match on `canvas_types` first, with fallback to legacy `canvas_type`. - Consolidated duplicated template pairs into single files and removed: - `deep_search_r.json` - `reflective_academic_paper_generator_r.json` - `seo_article_writer_r.json` - Added regression/edge-case tests for category normalization and route serialization expectations. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-04-13 20:26:30 +08:00
Jack	c4b0aaa874	Fix: #6098 - Add validation logic for parser_config when update document (#13911 ) ### What problem does this PR solve? Add validation logic for parser_config. Refactor the processing flow. Before change, validation logics and update logics are mixed up - some validation logis executes followed by some update logic executes and then another such "validation-and-then-update" which is not good. After change, all validation logic executes firstly. Update logic will be executed after ALL validation logic executed. Validation logic for parameters (that come from front end) will be checked using Pydantic. For validation logic that depends on data from DB, they will be in separate methods. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2026-04-07 11:33:05 +08:00
NeedmeFordev	c3f79dbcb0	fix(jira): prevent missed incremental updates after issue edits (#13674 ) ### What problem does this PR solve? Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira incremental sync could miss updated issues after initial sync, especially near time boundaries. Root cause: - Jira JQL uses minute-level precision for `updated` filters. - Incremental windows had no overlap buffer, so boundary updates could be skipped. - Sync log cursor tracking used a backward-facing update for `poll_range_start`. - Existing-doc updates in `upload_document` lacked a KB ownership guard for doc-id collisions. What changed: - Added Jira incremental overlap buffer (`time_buffer_seconds`, defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL lower-bound time. - Preserved second-level post-filtering to avoid duplicate reprocessing while still catching boundary updates. - Improved Jira sync logging to include start/end window and overlap configuration. - Updated sync cursor tracking in `increase_docs` to keep `poll_range_start` moving forward with max update time. - Added KB ID safety check before updating existing document records in `upload_document`. Verification performed: - Python syntax compile checks passed for modified files. - Manual verification flow: 1. Run full Jira sync. 2. Edit an already-indexed Jira issue. 3. Run next incremental sync. 4. Confirm updated content is re-ingested into KB. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 23:31:05 +08:00
Daniil Sivak	60ad32a0c2	Feat: support epub parsing (#13650 ) Closes #1398 ### What problem does this PR solve? Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required. ### Type of change - [x] New Feature (non-breaking change which adds functionality) To check this parser manually: ```python uv run --python 3.12 python -c " from deepdoc.parser import EpubParser with open('$HOME/some_epub_book.epub', 'rb') as f: data = f.read() sections = EpubParser()(None, binary=data, chunk_token_num=512) print(f'Got {len(sections)} sections') for i, s in enumerate(sections[:5]): print(f'\n--- Section {i} ---') print(s[:200]) " ```	2026-03-17 20:14:06 +08:00
Josh	a353c7bdd7	Fix: avoid empty doc filter in knowledge retrieval (#13484 ) ## Summary Fix knowledge-base chat retrieval when no individual document IDs are selected. ## Root Cause `async_chat()` initialized `doc_ids` as an empty list when the request did not explicitly select documents. That empty list was then forwarded into retrieval as an active `doc_id` filter, effectively becoming `doc_id IN []` and suppressing all chunk matches. ## Changes - treat missing selected document IDs as `None` instead of `[]` - keep explicit document filtering when IDs are actually provided - add regression coverage for the shared chat retrieval path ## Validation - `python3 -m py_compile api/db/services/dialog_service.py test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py` - `.venv/bin/python -m pytest test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py` - manually verified that chat completions again inject retrieved knowledge into the prompt --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-12 16:03:30 +08:00
Josh	2d2d3cdbcf	Fix document metadata loading for paged listings (#13515 ) ## Summary - scope normal document-list metadata lookups to the current page's document IDs - keep the `return_empty_metadata=True` path dataset-wide because it needs full knowledge of docs that already have metadata - add unit tests for both paged listing paths and the unchanged empty-metadata behavior ## Why `DocumentService.get_list()` and the normal `get_by_kb_id()` path were calling `DocMetadataService.get_metadata_for_documents(None, kb_id)`, which loads metadata for the entire dataset on every page request. That becomes especially problematic on large datasets. The metadata scan path paginates through the full metadata index without an explicit sort, while the ES helper only switches to `search_after` beyond `10000` results when a sort is present. In practice this can lead to unnecessary full-dataset metadata work, slower document-list loading, and unreliable `meta_fields` in list responses for large KBs. This change keeps the existing empty-metadata filter behavior intact, but scopes normal list responses to metadata for the current page only.	2026-03-11 13:42:16 +08:00
Heyang Wang	08f83ff331	Feat: Support get aggregated parsing status to dataset via the API (#13481 ) ### What problem does this PR solve? Support getting aggregated parsing status to dataset via the API Issue: #12810 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: heyang.why <heyang.why@alibaba-inc.com>	2026-03-10 18:05:45 +08:00
Idriss Sbaaoui	2508c46c8f	Playwright : add new test for configuration tab in datasets (#13365 ) ### What problem does this PR solve? this pr adds new tests, for the full configuration tab in datasests ### Type of change - [x] Other (please describe): new tests	2026-03-04 19:10:06 +08:00
Liu An	7715bad04e	refactor: reorganize unit test files into appropriate directories (#13343 ) ### What problem does this PR solve? Move test files from utils/ to their corresponding functional directories: - api/db/ for database related tests - api/utils/ for API utility tests - rag/utils/ for RAG utility tests ### Type of change - [x] Refactoring	2026-03-04 11:02:56 +08:00

11 Commits