ragflow/rag
shawnxiao105-afk 8b6dd6a5c2 fix: guard whitespace-only chunks before embedding (#13938)
## Problem

When parsing DOCX files with many tables, DeepDOC generates chunks
containing only empty HTML table tags, such as:

```html
<table><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr></table>
```

After the regex cleanup at `task_executor.py:584`, which replaces each tag
with a space, this becomes a whitespace-only string such as `" "`.

The guard at line 585 (`if not c`) only catches the empty string `""`;
whitespace-only strings are truthy in Python and pass through. When such a
chunk is sent to the Zhipu `embedding-3` API, the API rejects it with error
1213: `未正常接收到prompt参数` ("the prompt parameter was not received correctly").

## Root Cause

```python
c = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", c)
if not c:       # ← only catches "", not "   " / "\n" / "\t"
    c = "None"
```

Verified with Zhipu `embedding-3`:

| Input | Result |
|---|---|
| `""` | error 1213 |
| `" "` | error 1213 |
| `"\n"` | error 1213 |
| `"None"` | OK |
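
The truthiness issue can be reproduced without calling any embedding API. The sketch below is illustrative only: it reuses the regex from the snippet above and the empty table from the Problem section, while the names `TAG_RE`, `empty_table`, and `cleaned` are assumptions made for this example and do not come from `task_executor.py`.

```python
import re

# Regex copied from the root-cause snippet; TAG_RE is a name chosen for this sketch.
TAG_RE = re.compile(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>")

# The empty four-row table from the Problem section.
empty_table = "<table>" + "<tr><td></td></tr>" * 4 + "</table>"

# Every tag is replaced by a single space, so only spaces remain.
cleaned = TAG_RE.sub(" ", empty_table)

print(repr(cleaned))          # a run of spaces, nothing else
print(bool(cleaned))          # True: `if not c` does not fire
print(bool(cleaned.strip()))  # False: `if not c.strip()` does fire
```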

## Fix

```diff
- if not c:
+ if not c.strip():
      c = "None"
```
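
For illustration, the same guard can be written as a standalone function. This is a minimal sketch under stated assumptions: `normalize_chunk_text` and its `placeholder` parameter are hypothetical and are not part of the actual change, which is the one-line edit shown in the diff.

```python
import re

# Same regex as in the Root Cause section; _TAG_RE is a name chosen for this sketch.
_TAG_RE = re.compile(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>")


def normalize_chunk_text(text: str, placeholder: str = "None") -> str:
    """Hypothetical helper: strip table tags, then guard against blank text.

    Returns `placeholder` when nothing but whitespace is left, so the
    embedding request never carries an empty or whitespace-only input.
    """
    cleaned = _TAG_RE.sub(" ", text)
    if not cleaned.strip():  # catches "", " ", "\n", "\t", ...
        return placeholder
    return cleaned


print(normalize_chunk_text("<table><tr><td></td></tr></table>"))
print(repr(normalize_chunk_text("<td>a</td>")))
```

Only the check uses `strip()`; non-blank text is returned unchanged, so the surrounding spaces left by the regex are preserved, which matches the behavior of the one-line fix.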

## Testing

Reproduced with a 678 KB DOCX file (166 tables, 270 chunks). Chunk #89 is
the empty table above. After the fix, `"None"` is sent instead and the
embedding call succeeds.

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-13 11:47:50 +08:00