ragflow/test
Phives 4ceb668d40 feat(api/utils): Harden file_utils for robustness and edge cases (#12915)
## Summary
Improves robustness and edge-case handling in `api.utils.file_utils` to
avoid crashes, DoS/OOM risks, and timeouts when processing user-provided
filenames, paths, and file blobs.

## Changes

### Resource limits & timeouts
- **`MAX_BLOB_SIZE_THUMBNAIL`** (50 MiB) and **`MAX_BLOB_SIZE_PDF`**
(100 MiB) to reject oversized inputs before thumbnail/PDF processing.
- **`GHOSTSCRIPT_TIMEOUT_SEC`** (120 s) for
`repair_pdf_with_ghostscript` subprocess to avoid hangs on malicious or
broken PDFs.
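
A minimal sketch of how such limits could be applied before any heavy processing starts. The constant values follow the list above; the helper `blob_within_limit` is a hypothetical name used only for illustration and is not taken from `file_utils`.

```python
# Values as stated in the change description above.
MAX_BLOB_SIZE_THUMBNAIL = 50 * 1024 * 1024  # 50 MiB
MAX_BLOB_SIZE_PDF = 100 * 1024 * 1024       # 100 MiB
GHOSTSCRIPT_TIMEOUT_SEC = 120               # seconds


def blob_within_limit(blob, limit):
    """Hypothetical guard: True only for a non-empty bytes blob of at most `limit` bytes."""
    if not isinstance(blob, (bytes, bytearray)):
        return False
    return 0 < len(blob) <= limit
```

Checking the size first keeps the cost of rejecting a bad input constant, regardless of how large the upload is.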

### `filename_type`
- Handles `None`, empty string, non-string (e.g. int/list), and
path-only input via new **`_normalize_filename_for_type()`**.
- Uses basename for type detection (e.g. `a/b/c.pdf` → PDF).
- Enforces **`FILE_NAME_LEN_LIMIT`**; invalid input returns
`FileType.OTHER`.
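
The behavior described above can be sketched roughly as follows. This is an illustration, not the project's code: the `FileType` enum is a reduced stand-in, the extension table is deliberately tiny, and the value of `FILE_NAME_LEN_LIMIT` as well as the choice to measure the basename rather than the full path are assumptions.

```python
import os
from enum import Enum

FILE_NAME_LEN_LIMIT = 255  # assumed value, for illustration only


class FileType(Enum):  # reduced stand-in for the project's enum
    PDF = "pdf"
    DOC = "doc"
    VISUAL = "visual"
    OTHER = "other"


# Tiny illustrative mapping; the real table covers many more extensions.
_EXT_TO_TYPE = {
    ".pdf": FileType.PDF,
    ".docx": FileType.DOC,
    ".txt": FileType.DOC,
    ".png": FileType.VISUAL,
    ".jpg": FileType.VISUAL,
}


def _normalize_filename_for_type(filename):
    """Return a lowercased basename, or None when the input is unusable."""
    if not isinstance(filename, str):
        return None
    # Treat backslashes as separators so Windows-style paths behave the same.
    name = os.path.basename(filename.strip().replace("\\", "/"))
    if not name or len(name) > FILE_NAME_LEN_LIMIT:
        return None
    return name.lower()


def filename_type(filename):
    """Map a user-provided filename to a FileType without ever raising."""
    name = _normalize_filename_for_type(filename)
    if name is None:
        return FileType.OTHER
    ext = os.path.splitext(name)[1]
    return _EXT_TO_TYPE.get(ext, FileType.OTHER)
```

With this shape, `filename_type("a/b/c.pdf")` resolves through the basename `c.pdf`, while `None`, `123`, `""`, and a path-only value such as `"a/b/"` all fall through to `FileType.OTHER`.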

### `thumbnail_img`
- Rejects `None`/empty/oversized blob and invalid filename; returns
`None` instead of raising.
- Wraps PDF, image, and PPT handling in try/except so corrupt or
malformed files return `None`.
- Ensures PDF has pages and PPT has slides before use.
- Normalizes PIL image mode (RGBA/P/LA → RGB) for safe PNG export.

### `repair_pdf_with_ghostscript`
- Handles `None`/empty input; skips repair when input size exceeds
limit.
- Uses `subprocess.run(..., timeout=GHOSTSCRIPT_TIMEOUT_SEC)` and
catches `TimeoutExpired`.
- Returns original bytes when Ghostscript output is empty.

### `read_potential_broken_pdf`
- `None` → `b""`; objects without `len` → `b""`; empty input is returned
as-is.
- Oversized blob returned as-is (no repair) to avoid DoS.
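
The guard order listed above might look like the sketch below. The `is_readable` and `repair` callables are hypothetical injection points standing in for the real PDF parsing and Ghostscript repair steps, and `max_size` is exposed only to make the example easy to exercise.

```python
MAX_BLOB_SIZE_PDF = 100 * 1024 * 1024


def read_potential_broken_pdf(blob, is_readable, repair, max_size=MAX_BLOB_SIZE_PDF):
    """Return usable PDF bytes, attempting repair only for plausible inputs."""
    if blob is None:
        return b""
    try:
        size = len(blob)
    except TypeError:
        return b""  # not sequence-like, e.g. an int
    if size == 0 or size > max_size:
        return blob  # empty or oversized: return as-is, never repair
    if is_readable(blob):
        return blob
    return repair(blob)
```

For example, `read_potential_broken_pdf(42, ...)` yields `b""`, and a blob larger than `max_size` comes back untouched without `repair` being called.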

### `sanitize_path`
- Explicit `None` and non-string check; strips whitespace before
normalizing.
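
As an illustration only, a path sanitizer with these checks could be written as below. The `None`/non-string check and the whitespace strip come from the description above; the normalization rules themselves (backslash conversion, dropping `.` and `..` segments, the allowed character set) and the empty-string return for invalid input are assumptions of this sketch.

```python
import re

_BACKSLASHES = re.compile(r"\\+")
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_\-/]")  # assumed allow-list


def sanitize_path(raw_path):
    """Return a normalized relative path, or "" for unusable input."""
    if raw_path is None or not isinstance(raw_path, str):
        return ""
    stripped = raw_path.strip()  # strip whitespace before normalizing
    if not stripped:
        return ""
    normalized = _BACKSLASHES.sub("/", stripped).strip("/")
    # Drop empty, current-directory, and parent-directory segments.
    parts = [seg for seg in normalized.split("/") if seg and seg not in (".", "..")]
    return _UNSAFE_CHARS.sub("", "/".join(parts))
```

Under these assumptions, `sanitize_path("  a\\b/../c  ")` gives `"a/b/c"`, and `sanitize_path(None)` or `sanitize_path(123)` gives `""`.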

## Testing
- **`test/unit_test/utils/test_api_file_utils.py`** added with 36 unit
tests covering the above behavior (filename_type, sanitize_path,
read_potential_broken_pdf, thumbnail_img, thumbnail,
repair_pdf_with_ghostscript, constants).
- All tests pass.

---------

Co-authored-by: Gittensor Miner <miner@gittensor.io>
2026-02-25 14:34:47 +08:00


(1). Deploy RAGFlow services and images

https://ragflow.io/docs/build_docker_image

(2). Configure the required environment for testing

Install Python dependencies (including test dependencies):

uv sync --python 3.12 --only-group test --no-default-groups --frozen

Activate the environment:

source .venv/bin/activate

Install SDK:

uv pip install sdk/python 

Modify the .env file by adding the following lines:

COMPOSE_PROFILES=${COMPOSE_PROFILES},tei-cpu
TEI_MODEL=BAAI/bge-small-en-v1.5
RAGFLOW_IMAGE=infiniflow/ragflow:v0.24.0 # Replace with the image you are using

Start the containers, then wait about two minutes for the services to come up:

docker compose -f docker/docker-compose.yml up -d
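
Instead of a fixed two-minute wait, the API port can be polled until it answers. The following is a small sketch under assumptions: `http_probe` and `wait_until` are hypothetical helpers that are not part of the RAGFlow test suite, the default address matches the `HOST_ADDRESS` used in the test steps, and any HTTP response (even an error status) is treated as "the server is up".

```python
import os
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=3.0):
    """Return True if the server answers at all, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded; the status code is irrelevant here
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, or reset


def wait_until(probe, timeout=120.0, interval=2.0):
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    address = os.environ.get("HOST_ADDRESS", "http://127.0.0.1:9380")
    ready = wait_until(lambda: http_probe(address))
    print("ready" if ready else "timed out waiting for " + address)
```

A responding port does not guarantee that every backing service has finished initializing, so a short extra pause before running the tests may still be needed.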


(3). Test Elasticsearch

a) Run SDK tests against Elasticsearch:

export HTTP_API_TEST_LEVEL=p2
export HOST_ADDRESS=http://127.0.0.1:9380  # Ensure that this port is the API port mapped to your localhost
pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_sdk_api 

b) Run HTTP API tests against Elasticsearch:

pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_http_api 


(4). Test Infinity

Modify the .env file:

DOC_ENGINE=${DOC_ENGINE:-infinity}

Recreate the containers (`down -v` also removes the existing volumes):

docker compose -f docker/docker-compose.yml down -v 
docker compose -f docker/docker-compose.yml up -d

a) Run SDK tests against Infinity:

DOC_ENGINE=infinity pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_sdk_api 

b) Run HTTP API tests against Infinity:

DOC_ENGINE=infinity pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_http_api