mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-01-19 03:35:11 +08:00
Fix: validate regex pattern in split_with_pattern to prevent crash (#12633)
### What problem does this PR solve? Fix regex pattern validation in split_with_pattern (#12605) - Add try-except block to validate user-provided regex patterns before use - Gracefully fallback to single chunk when invalid regex is provided - Prevent server crash during DOCX parsing with malformed delimiters ## Problem Parsing DOCX files with custom regex delimiters crashes with `re.error: nothing to repeat at position 9` when users provide invalid regex patterns. Closes #12605 ## Solution Validate and compile regex pattern before use. On invalid pattern, log warning and return content as single chunk instead of crashing. ## Changes - `rag/nlp/__init__.py`: Add regex validation in `split_with_pattern()` function ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=42954461
This commit is contained in:
@ -275,7 +275,18 @@ def tokenize(d, txt, eng):
|
||||
|
||||
def split_with_pattern(d, pattern: str, content: str, eng) -> list:
|
||||
docs = []
|
||||
txts = [txt for txt in re.split(r"(%s)" % pattern, content, flags=re.DOTALL)]
|
||||
|
||||
# Validate and compile regex pattern before use
|
||||
try:
|
||||
compiled_pattern = re.compile(r"(%s)" % pattern, flags=re.DOTALL)
|
||||
except re.error as e:
|
||||
logging.warning(f"Invalid delimiter regex pattern '{pattern}': {e}. Falling back to no split.")
|
||||
# Fallback: return content as single chunk
|
||||
dd = copy.deepcopy(d)
|
||||
tokenize(dd, content, eng)
|
||||
return [dd]
|
||||
|
||||
txts = [txt for txt in compiled_pattern.split(content)]
|
||||
for j in range(0, len(txts), 2):
|
||||
txt = txts[j]
|
||||
if not txt:
|
||||
|
||||
Reference in New Issue
Block a user