1222 Commits

Author SHA1 Message Date
bd9163904a fix(ob_conn): ignore duplicate errors when executing 'create_idx' (#12661)
### What problem does this PR solve?

Skip duplicate errors to avoid 'create_idx' failures caused by slow
metadata refresh or external modifications.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-16 20:46:37 +08:00
4f036a881d Fix: Infinity keyword round-trip, highlight fallback, and KB update guards (#12660)
### What problem does this PR solve?

Fixes Infinity-specific API regressions: preserves `important_kwd`
round-trip for `[""]`, restores required highlight key in retrieval
responses, and enforces Infinity guards for unsupported
`parser_id=tag` and pagerank in `/v1/kb/update`. Also removes a
slow/buggy pandas row-wise apply that was throwing `ValueError` and
causing flakiness.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-16 20:03:52 +08:00
cec06bfb5d Fix: empty chunk issue. (#12638)
#12570

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-15 17:46:21 +08:00
2ea8dddef6 fix(infinity): Use comma separator for important_kwd to preserve mult… (#12618)
## Problem

The `important_kwd` field in Infinity connector was using mismatched
separators:
- **Storage**: `list2str(v)` uses space as default separator
- **Reading**: `v.split()` splits by all whitespace

This causes multi-word keywords like `"Senior Fund Manager"` to be
incorrectly split into `["Senior", "Fund", "Manager"]`.

## Solution

Use comma `,` as separator for both storing and reading, consistent
with:
1. The LLM output format in `keyword_prompt.md` ("delimited by
ENGLISH COMMA")
2. The `cached.split(",")` in `task_executor.py`

## Changes

- `insert()`: `list2str(v)` → `list2str(v, ",")`
- `update()`: `list2str(v)` → `list2str(v, ",")`
- `get_fields()`: `v.split()` → `v.split(",") if v else []`

## Impact

This bug affects:
- Python-level reranking weight calculation (`important_kwd * 5`)
- API response keyword display
- Search precision due to fragmented keywords
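
The mismatch can be reproduced in a few lines. This is only a sketch: `list2str` and `str2list` below are simplified, hypothetical stand-ins for the connector helpers, and the real signatures may differ.

```python
def list2str(values, sep=" "):
    """Hypothetical stand-in for the connector helper: join a list into one string."""
    return sep.join(values)


def str2list(stored, sep=","):
    """Read side after the fix: split on the same separator; empty string gives []."""
    return stored.split(sep) if stored else []


keywords = ["Senior Fund Manager", "ESG"]

# Before the fix: stored with spaces, read back with a whitespace split,
# so the multi-word keyword is fragmented.
broken = list2str(keywords).split()

# After the fix: comma on both sides, so the list survives the round trip.
fixed = str2list(list2str(keywords, ","))
```

Note that a keyword which itself contains a comma would still be split under this scheme.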
2026-01-15 15:32:40 +08:00
ac936005e6 fix: ensure deleted chunks are not returned in retrieval (#12520) (#12546)
## Summary
Fixes #12520 - Deleted chunks should not appear in retrieval/reference
results.

## Changes

### Core Fix
- **api/apps/chunk_app.py**: Include `doc_id` in delete condition to
properly scope the delete operation

### Improved Error Handling
- **api/db/services/document_service.py**: Better separation of concerns
with individual try-catch blocks and proper logging for each cleanup
operation

### Doc Store Updates
- **rag/utils/es_conn.py**: Updated delete query construction to support
compound conditions
- **rag/utils/opensearch_conn.py**: Same updates for OpenSearch
compatibility

### Tests
- **test/testcases/.../test_retrieval_chunks.py**: Added
`TestDeletedChunksNotRetrievable` class with regression tests
- **test/unit/test_delete_query_construction.py**: Unit tests for delete
query construction

## Testing
- Added regression tests that verify deleted chunks are not returned by
retrieval API
- Tests cover single chunk deletion and batch deletion scenarios
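
As an illustration of what a compound delete condition might look like, the sketch below builds an Elasticsearch-style `bool` query from a flat condition dict. The function name `build_delete_query` and the `id` field are hypothetical; the actual construction in `es_conn.py` is not shown in this commit message and may differ.

```python
def build_delete_query(condition):
    """Turn a flat condition dict into an Elasticsearch-style bool query body.

    Hypothetical helper: scalar values become `term` filters and
    list/tuple values become `terms` filters, all combined with AND semantics.
    """
    filters = []
    for field, value in condition.items():
        if isinstance(value, (list, tuple)):
            filters.append({"terms": {field: list(value)}})
        else:
            filters.append({"term": {field: value}})
    return {"query": {"bool": {"filter": filters}}}


# Scoping the delete by both document and chunk ids (field names assumed).
delete_body = build_delete_query({"doc_id": "doc-1", "id": ["chunk-a", "chunk-b"]})
```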
2026-01-15 14:45:55 +08:00
d8192f8f17 Fix: validate regex pattern in split_with_pattern to prevent crash (#12633)
### What problem does this PR solve?

Fix regex pattern validation in split_with_pattern (#12605)

- Add try-except block to validate user-provided regex patterns before
use
- Gracefully fall back to a single chunk when an invalid regex is provided
- Prevent server crash during DOCX parsing with malformed delimiters

## Problem

Parsing DOCX files with custom regex delimiters crashes with `re.error:
nothing to repeat at position 9` when users provide invalid regex
patterns.

Closes #12605 

## Solution

Validate and compile regex pattern before use. On invalid pattern, log
warning and return content as single chunk instead of crashing.
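
A minimal sketch of this validate-then-fall-back approach, assuming a simplified signature (the real `split_with_pattern()` takes different arguments and does more work):

```python
import logging
import re


def split_with_pattern_sketch(text, pattern):
    """Split text on a user-supplied regex; return one chunk if the regex is invalid."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Invalid user input should degrade gracefully instead of crashing the parser.
        logging.warning("Invalid delimiter pattern %r: %s", pattern, exc)
        return [text]
    # Drop empty pieces produced by adjacent or leading/trailing delimiters.
    return [part for part in compiled.split(text) if part]


chunks_ok = split_with_pattern_sketch("a;b;c", ";")
chunks_bad = split_with_pattern_sketch("a;b;c", "*bad")
```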

## Changes

- `rag/nlp/__init__.py`: Add regex validation in `split_with_pattern()`
function

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=42954461
2026-01-15 14:24:51 +08:00
b40a7b2e7d Feat: Hash doc id to avoid duplicate name. (#12573)
### What problem does this PR solve?

Feat: Hash doc id to avoid duplicate name. 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-15 14:02:15 +08:00
9a10558f80 Refa: async retrieval process. (#12629)
### Type of change

- [x] Refactoring
- [x] Performance Improvement
2026-01-15 12:28:49 +08:00
678a4f959c Fix: skip internal bookmark references in DOCX parsing (#12604) (#12611)
### What problem does this PR solve?

Fixes #12604 - DOCX files containing hyperlinks to internal bookmarks
(e.g., `#_文档目录`) cause a `KeyError` during parsing:

```
KeyError: "There is no item named 'word/#_文档目录' in the archive"
```

This happens because python-docx incorrectly tries to read internal
bookmark references as files from the ZIP archive. Internal bookmarks
are relationship targets starting with `#` and are not actual files.

This PR extends the existing `load_from_xml_v2` workaround (which
already handles `NULL` targets) to also skip relationship targets
starting with `#`.
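
The filtering rule can be sketched on plain strings. This is an assumption-laden illustration: `keep_relationship` is a hypothetical name, and the real workaround operates on python-docx relationship elements rather than bare target strings.

```python
def keep_relationship(target_ref):
    """Return False for relationship targets that are not real archive members."""
    # "NULL" targets were already skipped by the existing workaround.
    if target_ref == "NULL":
        return False
    # Internal bookmark references such as "#_文档目录" point inside the
    # document itself, so there is no file to load from the ZIP archive.
    if target_ref.startswith("#"):
        return False
    return True


targets = ["media/image1.png", "NULL", "#_文档目录", "styles.xml"]
loadable = [t for t in targets if keep_relationship(t)]
```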

Related upstream issue:
https://github.com/python-openxml/python-docx/issues/902

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=94194147
2026-01-14 19:08:46 +08:00
b091ff2730 Fix enable_thinking parameter for Qwen3 models (#12603)
### Issue

When using Qwen3 models (`qwen3-32b`, `qwen3-max`) through the
Tongyi-Qianwen provider for non-streaming calls (e.g., knowledge graph
generation), the API fails with:

```
parameter.enable_thinking must be set to false for non-streaming calls
```

Closes #12424

### Root Cause

In `LiteLLMBase.async_chat()`, the `extra_body={"enable_thinking":
False}` was set in `kwargs` but never forwarded to
`_construct_completion_args()`.

### What problem does this PR solve?

Pass merged kwargs to `_construct_completion_args()` using
`**{**gen_conf, **kwargs}` to safely handle potential duplicate
parameters.

### Changes

- `rag/llm/chat_model.py`: Forward kwargs containing `extra_body` to
`_construct_completion_args()` in `async_chat()`
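
The merge idiom matters when both mappings carry the same key. In the sketch below, `construct_completion_args_sketch` is a hypothetical stand-in for `_construct_completion_args()`, and the configuration values are made up for illustration.

```python
def construct_completion_args_sketch(**params):
    """Hypothetical stand-in: collect keyword arguments into a request dict."""
    return dict(params)


gen_conf = {"temperature": 0.2, "max_tokens": 512}
kwargs = {"extra_body": {"enable_thinking": False}, "temperature": 0.0}

# Unpacking both mappings directly fails when a key appears in both.
try:
    construct_completion_args_sketch(**gen_conf, **kwargs)
    duplicate_raised = False
except TypeError:
    duplicate_raised = True

# Merging into one dict first is safe; later entries (kwargs) take precedence.
merged_args = construct_completion_args_sketch(**{**gen_conf, **kwargs})
```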



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=42954461
2026-01-14 16:35:46 +08:00
f72a35188d refactor: remove debug print statements (#12598)
### What problem does this PR solve?

This PR eliminates unnecessary debug print statements that were left in
hot paths of the codebase.

### Type of change

- [x] Refactoring
2026-01-14 10:05:34 +08:00
360114ed42 fix(ob_conn): avoid reusing SQLAlchemy Column objects in DDL (#12588)
### What problem does this PR solve?

When there are multiple users, parsing a document for a new user can
trigger the reuse of column objects, leading to the error
`sqlalchemy.exc.ArgumentError: Column object 'id' already assigned to
Table xxx`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-13 17:39:20 +08:00
68e5c86e9c Fix: image not displaying thumbnails when using pipeline (#12574)
### What problem does this PR solve?

Fix image not displaying thumbnails when using pipeline.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-13 12:54:13 +08:00
4fe3c24198 feat: PaddleOCR PDF parser supports thumbnails and positions (#12565)
### What problem does this PR solve?

1. PaddleOCR PDF parser supports thumbnails and positions.
2. Add FAQ documentation for PaddleOCR PDF parser.


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-13 09:51:08 +08:00
44bada64c9 Feat: support tree structured deep-research policy. (#12559)
### What problem does this PR solve?

#12558
### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-13 09:41:35 +08:00
a7dd3b7e9e Add time cost when start servers (#12552)
### What problem does this PR solve?

- API server
- Ingestion server
- Data sync server
- Admin server

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-01-12 12:48:23 +08:00
638c510468 refactor: introduce common normalize method in rerank base class (#12550)
### What problem does this PR solve?

introduce common normalize method in rerank base class

### Type of change

- [x] Refactoring
2026-01-12 11:07:11 +08:00
b226e06e2d refactor: remove debug print statements (#12534)
### What problem does this PR solve?

refactor: remove debug print statements

### Type of change

- [x] Refactoring
2026-01-09 19:23:50 +08:00
2e09db02f3 feat: add paddleocr parser (#12513)
### What problem does this PR solve?

Add PaddleOCR as a new PDF parser.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-09 17:48:45 +08:00
fbe55cef05 fix: keep password in opendal config to fix connection initialization (#12529)
### What problem does this PR solve?

Deleting the password from kwargs causes `init_db_config` to fail, so
this field must be kept.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-09 14:19:32 +08:00
f522391d1e Fix: "AttributeError(\"'list' object has no attribute 'get'\")" (#12518)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/12515

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-09 10:19:51 +08:00
f1dc2df23c Fix: Bedrock assume_role auth mode fails with LiteLLM "Extra inputs are not permitted" error (#12495)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/12489

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-08 12:53:41 +08:00
23a9544b73 Fix: toc async issue. (#12485)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-07 15:35:30 +08:00
011bbe9556 Feat: support context window for docx (#12455)
### What problem does this PR solve?

Feat: support context window for docx

#12303

Done:
- [x] naive.py
- [x] one.py

TODO:
- [ ] book.py
- [ ] manual.py

Fix: incorrect image position
Fix: incorrect chunk type tag

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-01-07 15:08:17 +08:00
8d406bd2e6 fix: prevent MinIO health check failure in multi-bucket mode (#12446)
### What problem does this PR solve?

- Fixes the health check failure in multi-bucket MinIO environments.
Previously, health checks would fail because the default
"ragflow-bucket" did not exist. This caused false negatives for system
health.

- Also removes the _health_check write in single-bucket mode to avoid
side effects (minor optimization).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-07 10:07:18 +08:00
55c9fc0017 fix: add 'mom_id' column to OBConnection chunk table (#12444)
### What problem does this PR solve?

Fix #12428

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-05 19:31:44 +08:00
606f4e6c9e Refa: improve TOC building with better error handling (#12427)
### What problem does this PR solve?

Refactor TOC building logic to use enumerate instead of while loop, add
comprehensive error handling for missing/invalid chunk_id values, and
improve logging with more specific error messages. The changes make the
code more robust against malformed TOC data while maintaining the same
functionality for valid inputs.

### Type of change

- [x] Refactoring
2026-01-05 10:02:42 +08:00
4cd4526492 Feat: PDF vision figure parser supports reading context (#12416)
### What problem does this PR solve?

PDF vision figure parser supports reading context.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-01-05 09:55:43 +08:00
d6e006f086 Improve task executor heartbeat handling and cleanup (#12390)
Improve task executor heartbeat handling and cleanup.

### What problem does this PR solve?

- **Reduce lock contention during executor cleanup**: The cleanup lock
is acquired only when removing expired executors, not during regular
heartbeat reporting, reducing potential lock contention.

- **Optimize own heartbeat cleanup**: Each executor removes its own
expired heartbeat using `zremrangebyscore` instead of `zcount` +
`zpopmin`, reducing Redis operations and improving efficiency.

- **Improve cleanup of other executors' heartbeats**: Expired executors
are detected by checking their latest heartbeat, and stale entries are
removed safely.

- **Other improvements**: IP address and PID are captured once at
startup, and unnecessary global declarations are removed.
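
The own-heartbeat cleanup can be illustrated without a Redis server. The sketch below uses a small in-memory stand-in for one sorted set; `SortedSetSketch` and `report_heartbeat` are hypothetical names, and storing the heartbeat timestamp as the score is an assumption.

```python
class SortedSetSketch:
    """In-memory stand-in for a single Redis sorted set (member -> score)."""

    def __init__(self):
        self._scores = {}

    def zadd(self, member, score):
        self._scores[member] = score

    def zremrangebyscore(self, low, high):
        """Remove members with low <= score <= high; return how many were removed."""
        expired = [m for m, s in self._scores.items() if low <= s <= high]
        for member in expired:
            del self._scores[member]
        return len(expired)

    def members(self):
        return sorted(self._scores, key=self._scores.get)


def report_heartbeat(heartbeats, payload, now, ttl_seconds):
    """Record a heartbeat, then drop expired ones in a single range removal."""
    heartbeats.zadd(payload, now)
    return heartbeats.zremrangebyscore(float("-inf"), now - ttl_seconds)


hb = SortedSetSketch()
removed = [report_heartbeat(hb, f"beat@{t}", t, ttl_seconds=30) for t in (0, 20, 40)]
```

One range removal replaces a count followed by repeated pops, which is where the reduction in Redis operations comes from.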

### Type of change

- [x] Performance Improvement

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-01-04 11:24:05 +08:00
f56bceb2a9 Fix: remove async wrappers (#12405)
### What problem does this PR solve?

Fix: remove async wrappers #12396

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-04 11:19:48 +08:00
Rin
bbaf918d74 security: harden OpenDAL SQL initialization against injection (#12393)
Eliminates SQL injection vectors in the OpenDAL MySQL initialization
logic by implementing strict input validation and explicit type casting.

**Modifications:**
1. **`init_db_config`**: Enforced integer casting for
`max_allowed_packet` before formatting it into the SQL string.
2. **`init_opendal_mysql_table`**: Implemented regex-based validation
for `table_name` to ensure only alphanumeric characters and underscores
are permitted, preventing arbitrary SQL command injection through
configuration parameters.

These changes ensure that even if configuration values are sourced from
untrusted environments, the database initialization remains secure.
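
The two checks can be sketched as follows. The helper names and the exact SQL statement are illustrative assumptions, not the code from this commit.

```python
import re

# Only ASCII letters, digits, and underscores are accepted.
_TABLE_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def validate_table_name(table_name):
    """Reject any table name that could smuggle SQL syntax into a statement."""
    if not isinstance(table_name, str) or not _TABLE_NAME_RE.fullmatch(table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    return table_name


def build_packet_sql(max_allowed_packet):
    """Format the setting only after forcing it through int()."""
    # int() raises ValueError for strings such as "1; DROP TABLE users".
    return f"SET GLOBAL max_allowed_packet={int(max_allowed_packet)}"


safe_name = validate_table_name("opendal_storage")
packet_sql = build_packet_sql("4194304")
```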
2026-01-04 11:19:26 +08:00
6f2fc2f1cb refactor: reorder logic in clean_gen_conf (#12391)
### What problem does this PR solve?

Reorder logic in clean_gen_conf
#12388

### Type of change
- [x] Refactoring
2026-01-04 10:31:56 +08:00
96810b7d97 Fix: webdav connector (#12380)
### What problem does this PR solve?

fix webdav #11422

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-31 19:00:00 +08:00
7d4d687dde Feat: Bitbucket connector (#12332)
### What problem does this PR solve?

Feat: Bitbucket connector NOT READY TO MERGE

### Type of change


- [x] New Feature (non-breaking change which adds functionality)
2025-12-31 17:18:30 +08:00
c2ee2bf7fe Feat: add Zendesk data source integration with configuration and sync capabilities (#12344)
### What problem does this PR solve?
issue:
#12313
change:
add Zendesk data source integration with configuration and sync
capabilities

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-31 14:40:49 +08:00
ae7c623a35 fix(rag/prompts): Restructure metadata extraction rules for precision (#12360)
### What problem does this PR solve?

- Simplified and consolidated extraction rules
- Emphasized strict evidence-based extraction only
- Strengthened enum handling and hallucination prevention
- Clarified output requirements for empty results

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-31 13:52:33 +08:00
1a4a7d1705 Fix: apply kb configured llm issue. (#12354)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-31 12:40:28 +08:00
52f91c2388 Refine: image/table context. (#12336)
### What problem does this PR solve?

#12303

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-30 20:24:27 +08:00
bffdb5fb11 Feat: add IMAP data source integration with configuration and sync capabilities (#12316)
### What problem does this PR solve?
issue:
#12217 [#12313](https://github.com/infiniflow/ragflow/issues/12313)
change:
add IMAP data source integration with configuration and sync
capabilities

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-30 17:09:13 +08:00
5903d1c8f1 Feat: GitHub connector (#12314)
### What problem does this PR solve?

Feat: GitHub connector

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-30 15:09:52 +08:00
f0392e7501 Fix IDE warnings (#12315)
### What problem does this PR solve?

As title.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 15:04:09 +08:00
4a6d37f0e8 Fix: use async task to save memory (#12308)
### What problem does this PR solve?

Use async task to save memory.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 11:41:38 +08:00
df3cbb9b9e Refactor code (#12305)
### What problem does this PR solve?

as title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-30 11:09:18 +08:00
c3ae1aaecd Feat: Gitlab connector (#12248)
### What problem does this PR solve?

Feat: Gitlab connector
Fix: submit button in dark mode

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-29 17:05:20 +08:00
a764f0a5b2 Feat: Add Asana data source integration and configuration options (#12239)
### What problem does this PR solve?

change: Add Asana data source integration and configuration options

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-29 13:28:37 +08:00
37e4485415 feat: add MDX file support (#12261)
Feat: add MDX file support  #12057 
### What problem does this PR solve?


The page states that MDX files are supported, but uploading fails with
the error: "x.mdx: This type of file has not been supported yet!"


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-29 12:54:31 +08:00
01f0ced1e6 Fix IDE warnings (#12281)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-29 12:01:18 +08:00
bc9e1e3b9a Fix: parent-children pipeline bad case. (#12246)
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-12-26 18:57:16 +08:00
51bc41b2e8 Refa: improve image table context (#12244)
### What problem does this PR solve?

Improve image table context.

Current strategy in attach_media_context:

- Order by position when possible: if any chunk has page/position info,
sort by (page, top, left), otherwise keep original order.
- Apply only to media chunks: images use image_context_size, tables use
table_context_size.
- Primary matching: on the same page, choose a text chunk whose vertical
span overlaps the media, then pick the one with the closest vertical
midpoint.
- Fallback matching: if no overlap on that page, choose the nearest text
chunk on the same page (page-head uses the next text; page-tail uses the
previous text).
- Context extraction: inside the chosen text chunk, find a mid-sentence
boundary near the text midpoint, then take context_size tokens split
before/after (total budget).
- No multi-chunk stitching: context comes from a single text chunk to
avoid mixing unrelated segments.
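
The ordering and matching steps above can be sketched roughly as below. The `Chunk` type and function names are hypothetical, and the fallback is simplified to "nearest vertical midpoint on the same page" rather than the page-head/page-tail rule described in the list.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str        # "text", "image", or "table"
    page: int
    top: float
    bottom: float
    left: float = 0.0
    text: str = ""


def order_chunks(chunks):
    """Sort by (page, top, left), mirroring the position-based ordering."""
    return sorted(chunks, key=lambda c: (c.page, c.top, c.left))


def pick_context_chunk(media, chunks):
    """Choose the single text chunk that supplies context for a media chunk."""
    same_page = [c for c in chunks if c.kind == "text" and c.page == media.page]
    if not same_page:
        return None
    media_mid = (media.top + media.bottom) / 2

    def mid_distance(c):
        return abs((c.top + c.bottom) / 2 - media_mid)

    # Primary: text whose vertical span overlaps the media chunk.
    overlapping = [c for c in same_page if c.top < media.bottom and c.bottom > media.top]
    # Fallback (simplified): nearest text chunk on the same page.
    return min(overlapping or same_page, key=mid_distance)


chunks = order_chunks([
    Chunk("text", 1, 400, 500, text="after"),
    Chunk("text", 1, 0, 100, text="intro"),
    Chunk("image", 1, 200, 300),
    Chunk("text", 1, 180, 320, text="beside"),
    Chunk("text", 2, 0, 50, text="next page"),
])
image = next(c for c in chunks if c.kind == "image")
context = pick_context_chunk(image, chunks)
```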

### Type of change

- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-12-26 17:55:32 +08:00
6e9691a419 Feat: message manage (#12196)
### What problem does this PR solve?

Manage message and use in agent.

Issue #4213 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-25 21:18:13 +08:00