ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-03-19 13:47:43 +08:00

Author	SHA1	Message	Date
tunsuy	020068dd16	Fix: preserve field boundaries in chunked documents from MySQL… (#13369) ### What problem does this PR solve? When multiple columns are used as content columns in the RDBMS connector, the generated document text gets chunked by TxtParser, which strips newline delimiters during merge. This causes field names and values from different columns to be concatenated without any separator, making the content unreadable. Changes: - txt_parser.py: restore the newline separator when merging adjacent text segments within a chunk, so that split sections are not directly concatenated - rdbms_connector.py: use a double newline between fields and place the field value on a new line after the field name bracket, giving TxtParser clearer boundaries to work with Closes #13001 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tunsuytang <tunsuytang@tencent.com>	2026-03-04 21:42:02 +08:00
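The two changes described in 020068dd16 can be illustrated with a minimal sketch. It is not the code from txt_parser.py or rdbms_connector.py: the helper names `format_row` and `merge_segments` are hypothetical, and the exact field layout (bracketed name, colon, value on the next line) is an assumption based on the commit messages.

```python
def format_row(row: dict, content_columns: list[str]) -> str:
    """Render selected columns as bracketed fields separated by blank lines.

    Hypothetical helper: the layout below is an assumption, not the
    connector's actual output format.
    """
    parts = []
    for col in content_columns:
        value = row.get(col)
        if value is None:
            continue  # skip columns with no value
        # Field name in brackets, value on its own line.
        parts.append(f"【{col}】:\n{value}")
    # A double newline gives a text splitter a clear boundary between fields.
    return "\n\n".join(parts)


def merge_segments(segments: list[str], sep: str = "\n") -> str:
    """Join adjacent text segments of a chunk, keeping a separator between them."""
    return sep.join(s for s in segments if s)


segments = ["Title: Intro", "Body: Hello"]
without_separator = merge_segments(segments, sep="")  # fields run together
with_separator = merge_segments(segments)             # fields stay on separate lines
document_text = format_row({"title": "Intro", "body": "Hello"}, ["title", "body"])
```

Joining with an empty separator reproduces the symptom from the commit message (`Title: IntroBody: Hello`), while the default newline separator keeps the fields apart.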
Ahmad Intisar	99d1c9725c	Bug mysql connector empty content resolved: Semantic ID Issue (#13206) The RDBMS (MySQL/PostgreSQL) connector generates document filenames using the first 100 characters of the content column (semantic_identifier). When the content contains newline characters (\n), the resulting filename includes those newlines, for example: Category: غير صحيح كليًا\nTitle: تفنيد حقائق....txt. RAGFlow's filename_type() function uses re.match(r".*\.txt$", filename) to detect file types, but . does not match newline characters by default in Python regex. This causes the regex to fail, returning FileType.OTHER, which triggers: raise RuntimeError("This type of file has not been supported yet!"). As a result, all documents synced via the MySQL/PostgreSQL connector are silently discarded. The sync logs report success (e.g., "399 docs synchronized"), but zero documents actually appear in the dataset. This is the root cause of issue #13001. Root cause trace: rdbms_connector.py → _row_to_document() sets semantic_identifier from raw content (may contain \n); connector_service.py → duplicate_and_parse() uses semantic_identifier as the filename; file_service.py → upload_document() calls filename_type(filename); file_utils.py → filename_type() regex .*\.txt$ fails on newlines → returns FileType.OTHER; upload_document() raises "This type of file has not been supported yet!" Fix: Sanitize the semantic_identifier in _row_to_document() by replacing newlines and carriage returns with spaces before truncating to 100 characters. Relates to: #13001, #12817 Type of change: Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-02-25 12:55:04 +08:00
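The regex behaviour behind 99d1c9725c is easy to reproduce in isolation. The sketch below uses only the standard `re` module. `filename_matches_txt` and `sanitize_identifier` are hypothetical stand-ins for the check in filename_type() and the fix in _row_to_document(), not the project's actual code.

```python
import re


def filename_matches_txt(filename: str) -> bool:
    # Same pattern as quoted in the commit message. Without re.DOTALL,
    # "." does not match "\n", so a newline in the name blocks the match.
    return re.match(r".*\.txt$", filename) is not None


def sanitize_identifier(content: str, max_len: int = 100) -> str:
    # Replace carriage returns and newlines with spaces, then truncate,
    # following the order described in the commit message.
    return content.replace("\r", " ").replace("\n", " ")[:max_len]


raw_content = "Category: A\nTitle: B"
broken_name = raw_content[:100] + ".txt"                # newline survives
fixed_name = sanitize_identifier(raw_content) + ".txt"  # newline replaced

broken_ok = filename_matches_txt(broken_name)
fixed_ok = filename_matches_txt(fixed_name)
```

Here `broken_ok` is `False` and `fixed_ok` is `True`. Passing `re.DOTALL` to the match would also make the pattern accept the unsanitized name, but the commit chose to clean the identifier instead, which additionally keeps newlines out of the filename.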
MkDev11	13a6545e48	fix(rdbms): use brackets around field names to preserve distinction after chunking (#13010) Fix RDBMS field separation after chunking by wrapping field names in brackets (【field】: value). This ensures fields remain distinguishable even when TxtParser strips newline delimiters during chunk merging. Closes #13001 Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>	2026-02-06 14:44:58 +08:00
MkDev11	6f31c5fed2	feat/add MySQL and PostgreSQL data source connectors (#12817) ### What problem does this PR solve? This PR adds MySQL and PostgreSQL as data source connectors, allowing users to import data directly from relational databases into RAGFlow for RAG workflows. Many users store their knowledge in databases (product catalogs, documentation, FAQs, etc.) and currently have no way to sync this data into RAGFlow without exporting to files first. This feature lets them connect directly to their databases, run SQL queries, and automatically create documents from the results. Closes #763 Closes #11560 ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What this PR does New capabilities: - Connect to MySQL and PostgreSQL databases - Run custom SQL queries to extract data - Map database columns to document content (vectorized) and metadata (searchable) - Support incremental sync using a timestamp column - Full frontend UI with connection form and tooltips Files changed: Backend: - `common/constants.py` - Added MYSQL/POSTGRESQL to FileSource enum - `common/data_source/config.py` - Added to DocumentSource enum - `common/data_source/rdbms_connector.py` - New connector (368 lines) - `common/data_source/__init__.py` - Exported the connector - `rag/svr/sync_data_source.py` - Added MySQL and PostgreSQL sync classes - `pyproject.toml` - Added mysql-connector-python dependency Frontend: - `web/src/pages/user-setting/data-source/constant/index.tsx` - Form fields - `web/src/locales/en.ts` - English translations - `web/src/assets/svg/data-source/mysql.svg` - MySQL icon - `web/src/assets/svg/data-source/postgresql.svg` - PostgreSQL icon ### Testing done Tested with MySQL 8.0 and PostgreSQL 16: - Connection validation works correctly - Full sync imports all query results as documents - Incremental sync only fetches rows updated since last sync - Custom SQL queries filter data as expected - Invalid credentials show clear error messages - Lint checks pass (`ruff check` returns no errors) --------- Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>	2026-02-04 10:14:32 +08:00
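The incremental sync mentioned in 6f31c5fed2 (fetch only rows updated since the last sync, keyed on a timestamp column) can be sketched as follows. This is an illustration under stated assumptions, not the connector's implementation: it uses an in-memory SQLite database from the standard library instead of MySQL or PostgreSQL, and the function `fetch_rows`, the table `faq`, and the column `updated_at` are all hypothetical.

```python
import sqlite3


def fetch_rows(conn, base_query, timestamp_column, last_sync=None):
    """Run the user's query; if last_sync is given, keep only newer rows.

    timestamp_column is interpolated into the SQL text, so this sketch
    assumes it comes from trusted configuration, not from end users.
    """
    if last_sync is None:
        sql, params = base_query, ()  # full sync: every row of the query
    else:
        # Wrap the user's query so the filter applies to its result set.
        sql = f"SELECT * FROM ({base_query}) AS q WHERE q.{timestamp_column} > ?"
        params = (last_sync,)
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faq (id INTEGER, title TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO faq VALUES (?, ?, ?)",
    [
        (1, "Old entry", "2026-01-01T00:00:00"),
        (2, "New entry", "2026-02-01T00:00:00"),
    ],
)

query = "SELECT id, title, updated_at FROM faq"
full = fetch_rows(conn, query, "updated_at")
incremental = fetch_rows(conn, query, "updated_at", last_sync="2026-01-15T00:00:00")
```

ISO 8601 timestamp strings compare correctly as text, which is why the sketch can store them in a TEXT column. A real connector would use the database's native timestamp type and its driver's parameter style.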

4 Commits