4ceb668d40
feat(api/utils): Harden file_utils for robustness and edge cases ( #12915 )
...
## Summary
Improves robustness and edge-case handling in `api.utils.file_utils` to
avoid crashes, DoS/OOM risks, and timeouts when processing user-provided
filenames, paths, and file blobs.
## Changes
### Resource limits & timeouts
- **`MAX_BLOB_SIZE_THUMBNAIL`** (50 MiB) and **`MAX_BLOB_SIZE_PDF`**
(100 MiB) to reject oversized inputs before thumbnail/PDF processing.
- **`GHOSTSCRIPT_TIMEOUT_SEC`** (120 s) for
`repair_pdf_with_ghostscript` subprocess to avoid hangs on malicious or
broken PDFs.
### `filename_type`
- Handles `None`, empty string, non-string (e.g. int/list), and
path-only input via new **`_normalize_filename_for_type()`**.
- Uses basename for type detection (e.g. `a/b/c.pdf` → PDF).
- Enforces **`FILE_NAME_LEN_LIMIT`**; invalid input returns
`FileType.OTHER`.
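The `filename_type` rules above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `normalize_filename`, the limit value of 255, and the two-member `FileType` enum are assumptions made for the example.

```python
import os
from enum import Enum
from typing import Any, Optional

FILE_NAME_LEN_LIMIT = 255  # assumed value, for illustration only


class FileType(Enum):
    # Reduced, hypothetical enum; the real one has more members.
    PDF = "pdf"
    OTHER = "other"


def normalize_filename(filename: Any) -> Optional[str]:
    """Return a lower-cased basename, or None when the input is unusable."""
    if not isinstance(filename, str):
        return None  # covers None, int, list, ...
    name = os.path.basename(filename.strip().replace("\\", "/"))
    if not name or len(name) > FILE_NAME_LEN_LIMIT:
        return None  # empty, path-only ("a/b/"), or too long
    return name.lower()


def filename_type(filename: Any) -> FileType:
    """Detect the type from the basename; invalid input maps to OTHER."""
    name = normalize_filename(filename)
    if name is None:
        return FileType.OTHER
    return FileType.PDF if name.endswith(".pdf") else FileType.OTHER
```

Returning `FileType.OTHER` for every invalid input keeps callers free of `None` checks and exception handling.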
### `thumbnail_img`
- Rejects `None`/empty/oversized blob and invalid filename; returns
`None` instead of raising.
- Wraps PDF, image, and PPT handling in try/except so corrupt or
malformed files return `None`.
- Ensures PDF has pages and PPT has slides before use.
- Normalizes PIL image mode (RGBA/P/LA → RGB) for safe PNG export.
### `repair_pdf_with_ghostscript`
- Handles `None`/empty input; skips repair when input size exceeds
limit.
- Uses `subprocess.run(..., timeout=GHOSTSCRIPT_TIMEOUT_SEC)` and
catches `TimeoutExpired`.
- Returns original bytes when Ghostscript output is empty.
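A rough sketch of the timeout and fallback behavior described above, under stated assumptions: the helper name `run_with_timeout` is hypothetical, the external command is passed in as a parameter rather than hard-coding a Ghostscript invocation, and data is piped through stdin/stdout, which may differ from how the PR feeds the PDF to Ghostscript.

```python
import subprocess
import sys
from typing import Optional, Sequence

GHOSTSCRIPT_TIMEOUT_SEC = 120            # value stated in the PR summary
MAX_BLOB_SIZE_PDF = 100 * 1024 * 1024    # 100 MiB, as stated in the PR summary


def run_with_timeout(cmd: Sequence[str], data: Optional[bytes],
                     timeout: float = GHOSTSCRIPT_TIMEOUT_SEC) -> Optional[bytes]:
    """Pipe data through an external command; fall back to the input on failure."""
    if not data:
        return data  # None or empty: nothing to repair
    if len(data) > MAX_BLOB_SIZE_PDF:
        return data  # oversized: skip the repair entirely
    try:
        proc = subprocess.run(cmd, input=data, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return data  # hung process or missing binary
    if proc.returncode != 0 or not proc.stdout:
        return data  # failed run or empty output: keep the original bytes
    return proc.stdout


# Stand-in commands (hypothetical) that exercise the success and timeout paths.
upper = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
```

`subprocess.run` kills the child process when the timeout expires before raising `TimeoutExpired`, so a hung process does not linger after the fallback.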
### `read_potential_broken_pdf`
- `None` → `b""`; non–sequence-like (no `len`) → `b""`; empty → return
as-is.
- Oversized blob returned as-is (no repair) to avoid DoS.
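The guard order for `read_potential_broken_pdf` might look like the sketch below. The function name `guard_blob` and the `repair` callback are hypothetical stand-ins for the real read-and-repair logic; only the guards listed above are modeled.

```python
from typing import Any, Callable

MAX_BLOB_SIZE_PDF = 100 * 1024 * 1024  # 100 MiB, as stated in the PR summary


def guard_blob(blob: Any, repair: Callable[[bytes], bytes],
               max_size: int = MAX_BLOB_SIZE_PDF) -> Any:
    """Apply the cheap checks before any expensive repair work."""
    if blob is None:
        return b""
    try:
        size = len(blob)
    except TypeError:
        return b""  # not sequence-like (no len)
    if size == 0 or size > max_size:
        return blob  # empty or oversized: return as-is, no repair
    return repair(blob)
```

Checking the size before calling `repair` means an oversized upload costs only a `len()` call.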
### `sanitize_path`
- Explicit `None` and non-string check; strips whitespace before
normalizing.
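One possible shape for these `sanitize_path` checks is sketched below. The empty-string return value for invalid input and the dropping of `.` and `..` segments are assumptions for illustration; the PR summary only states the `None`/non-string check and the whitespace stripping.

```python
import re
from typing import Any


def sanitize_path(raw_path: Any) -> str:
    """Return a relative, forward-slash path, or "" for unusable input."""
    if raw_path is None or not isinstance(raw_path, str):
        return ""
    stripped = raw_path.strip()
    if not stripped:
        return ""
    # Split on runs of either separator, then drop empty and traversal segments.
    parts = re.split(r"[\\/]+", stripped)
    safe = [p for p in parts if p not in ("", ".", "..")]
    return "/".join(safe)
```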
## Testing
- **`test/unit_test/utils/test_api_file_utils.py`** added with 36 unit
tests covering the above behavior (filename_type, sanitize_path,
read_potential_broken_pdf, thumbnail_img, thumbnail,
repair_pdf_with_ghostscript, constants).
- All tests pass.
---------
Co-authored-by: Gittensor Miner <miner@gittensor.io >
2026-02-25 14:34:47 +08:00
f1c2fac03e
Refa: remove ppt image. ( #12909 )
...
### What problem does this PR solve?
Remove `aspose`.
### Type of change
- [x] Refactoring
2026-01-30 13:35:42 +08:00
37e4485415
feat: add MDX file support ( #12261 )
...
Feat: add MDX file support #12057
### What problem does this PR solve?
The page states that MDX files are supported, but uploading fails with
the error: "x.mdx: This type of file has not been supported yet!"
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-12-29 12:54:31 +08:00
bd5dda6b10
Feature/doc upload api add parent path 20251112 ( #11231 )
...
### What problem does this PR solve?
Add a `parent_path` parameter to the document upload API so that callers
can specify the parent path (#11230).
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Co-authored-by: virgilwong <hyhvirgil@gmail.com >
2025-11-13 09:59:39 +08:00
9a486e0f51
Move some funcs from api to rag module ( #10972 )
...
### What problem does this PR solve?
As title
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2025-11-03 19:26:09 +08:00
1284647694
Refactor file utils ( #10970 )
...
### What problem does this PR solve?
As title.
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2025-11-03 18:54:55 +08:00
d008a4df9f
Move base64_image related functions to common directory ( #10957 )
...
### What problem does this PR solve?
As title
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2025-11-03 15:20:46 +08:00
fa210e7c58
Feat: parsing hyperlinks in docx and pdf & Fix: default parser config of toc extraction ( #10877 )
...
### What problem does this PR solve?
Feat: parsing hyperlinks in docx and pdf #10848
Fix: default parser config of toc extraction
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-11-03 09:34:12 +08:00
44f2d6f5da
Move 'get_project_base_directory' to common directory ( #10940 )
...
### What problem does this PR solve?
As title
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2025-11-02 21:05:28 +08:00
57a83eca8a
Remove unused code ( #10938 )
...
### What problem does this PR solve?
As title
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2025-11-02 16:25:16 +08:00
f24d464a53
Fix: video file suffix ( #10740 )
...
### What problem does this PR solve?
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-23 11:13:09 +08:00
d956a442ce
Fix: Remove pdf embed support, update based on #10635 ( #10663 )
...
### What problem does this PR solve?
Fix: Remove pdf embed support, update based on #10635
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2025-10-20 13:45:53 +08:00
8ee0b6ea54
File: Now parsing support all types of embedded documents, solved #10059 ( #10635 )
...
### What problem does this PR solve?
File: parsing now supports all types of embedded documents; resolves #10059
Fix: Incomplete words in chat #10530
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-10-17 18:46:47 +08:00
cbf04ee470
Feat: Use data pipeline to visualize the parsing configuration of the knowledge base ( #10423 )
...
### What problem does this PR solve?
#9869
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Signed-off-by: dependabot[bot] <support@github.com >
Signed-off-by: jinhai <haijin.chn@gmail.com >
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
Co-authored-by: chanx <1243304602@qq.com >
Co-authored-by: balibabu <cike8899@users.noreply.github.com >
Co-authored-by: Lynn <lynn_inf@hotmail.com >
Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com >
Co-authored-by: huangzl <huangzl@shinemo.com >
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com >
Co-authored-by: Wilmer <33392318@qq.com >
Co-authored-by: Adrian Weidig <adrianweidig@gmx.net >
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Yongteng Lei <yongtengrey@outlook.com >
Co-authored-by: Liu An <asiro@qq.com >
Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com >
Co-authored-by: BadwomanCraZY <511528396@qq.com >
Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com >
Co-authored-by: Russell Valentine <russ@coldstonelabs.org >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Billy Bao <newyorkupperbay@gmail.com >
Co-authored-by: Zhedong Cen <cenzhedong2@126.com >
Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com >
Co-authored-by: TensorNull <tensor.null@gmail.com >
Co-authored-by: TeslaZY <TeslaZY@outlook.com >
Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com >
Co-authored-by: AB <aj@Ajays-MacBook-Air.local >
Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com >
Co-authored-by: He Wang <wanghechn@qq.com >
Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com >
Co-authored-by: Jin Hai <haijin.chn@gmail.com >
Co-authored-by: Mohamed Mathari <155896313+melmathari@users.noreply.github.com >
Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box >
Co-authored-by: Stephen Hu <stephenhu@seismic.com >
Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com >
Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com >
Co-authored-by: mxc <mxc@example.com >
Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com >
Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com >
Co-authored-by: mcoder6425 <mcoder64@gmail.com >
Co-authored-by: lemsn <lemsn@msn.com >
Co-authored-by: lemsn <lemsn@126.com >
Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com >
Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com >
Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com >
2025-10-09 12:36:19 +08:00
39ef2ffba9
Feat: parsing supports jsonl or ldjson format ( #9087 )
...
### What problem does this PR solve?
Supports jsonl or ldjson format. Feature request from
[discussion](https://github.com/orgs/infiniflow/discussions/8774 ).
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-07-30 09:48:20 +08:00
4a2ff633e0
Fix typo in code ( #8327 )
...
### What problem does this PR solve?
Fix typo in code
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2025-06-18 09:41:09 +08:00
0ebf05440e
Feat: repair corrupted PDF files on upload automatically ( #7693 )
...
### What problem does this PR solve?
Make a best-effort attempt to repair corrupted PDF files automatically on upload.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2025-05-19 14:54:06 +08:00
c813c1ff4c
Made task_executor async to speedup parsing ( #5530 )
...
### What problem does this PR solve?
Made task_executor async to speedup parsing
### Type of change
- [x] Performance Improvement
2025-03-03 18:59:49 +08:00
8a2542157f
Fix: possible memory leaks close #5277 ( #5500 )
...
### What problem does this PR solve?
Close #5277 by making sure files are closed after use.
### Type of change
- [x] Performance Improvement
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-03 10:26:45 +08:00
112ef42a19
Ensure thumbnail be smaller than 64K ( #3722 )
...
### What problem does this PR solve?
Ensure the thumbnail is smaller than 64K. Close #1443
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com >
2024-11-28 19:15:31 +08:00
2249d5d413
Always open text file for write with UTF-8 ( #3688 )
...
### What problem does this PR solve?
Always open text files for writing with UTF-8 encoding. Close #932
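A minimal sketch of the pattern, assuming a hypothetical helper named `write_text`: passing `encoding="utf-8"` explicitly avoids falling back to the platform's locale encoding, which on some systems cannot represent all characters.

```python
import os
import tempfile


def write_text(path: str, text: str) -> None:
    """Write text as UTF-8 regardless of the platform default encoding."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


# Usage: write a string with non-ASCII characters and read the raw bytes back.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "note.txt")
    write_text(target, "naïve café 文件")
    with open(target, "rb") as fh:
        raw = fh.read()
```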
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-27 16:24:16 +08:00
7c486ee3f9
Fix typo ( #3319 )
...
### What problem does this PR solve?
Fix typo
### Type of change
- [x] Refactoring
2024-11-11 09:36:39 +08:00
b2524eec49
fix sequence2txt error and usage total token issue ( #2961 )
...
### What problem does this PR solve?
#1363
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-10-22 11:38:37 +08:00
6a4858a7ee
Fix thumbnail_img NoneType error ( #2941 )
...
### What problem does this PR solve?
fix thumbnail_img NoneType error
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-10-22 09:21:05 +08:00
485bfd6c08
fix: Large document thumbnail display failed ( #2763 )
...
### What problem does this PR solve?
When a document's base64 thumbnail stored in MySQL is relatively large,
the thumbnail fails to display.
Document thumbnails are now stored in MinIO instead.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: chongchuanbing <chongchuanbing@gmail.com >
2024-10-10 09:09:29 +08:00
6b3a40be5c
Format file format from Windows/dos to Unix ( #1949 )
...
### What problem does this PR solve?
The related source files were in Windows/DOS format; they have been
converted to Unix format.
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com >
2024-08-15 09:17:36 +08:00
cafdee536f
add sql to naive parser ( #1908 )
...
### What problem does this PR solve?
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2024-08-12 15:29:33 +08:00
ede733e130
add support for eml file parser ( #1768 )
...
### What problem does this PR solve?
add support for eml file parser
#1363
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Zhedong Cen <cenzhedong2@126.com >
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com >
2024-08-06 16:42:14 +08:00
0171082cc5
fix create dialog bug ( #982 )
...
### What problem does this PR solve?
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-30 09:25:05 +08:00
8dd45459be
Add support for HTML file ( #973 )
...
### What problem does this PR solve?
Add support for HTML file
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2024-05-30 09:12:55 +08:00
c3bc72dfd9
fix too large thumbnail issue ( #817 )
...
### What problem does this PR solve?
#709
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-17 14:04:21 +08:00
6ff63ee2ba
Support for code files parse ( #789 )
...
### What problem does this PR solve?
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2024-05-15 16:34:28 +08:00
cab274f560
remove PyMuPDF ( #618 )
...
### What problem does this PR solve?
#613
### Type of change
- [x] Other (please describe):
2024-04-30 12:38:09 +08:00
674b3aeafd
fix disable and enable llm setting in dialog ( #616 )
...
### What problem does this PR solve?
#614
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2024-04-30 11:04:14 +08:00
2af74cc494
refine docker layers ( #606 )
...
### What problem does this PR solve?
### Type of change
- [x] Performance Improvement
2024-04-29 17:57:40 +08:00
f69ff39fa0
add file management feature ( #560 )
...
### What problem does this PR solve?
### Type of change
- [x] Documentation Update
2024-04-26 17:21:53 +08:00
72384b191d
Add .doc file parser. ( #497 )
...
### What problem does this PR solve?
Add `.doc` file parser, using tika.
```
pip install tika
```
```
from io import BytesIO

from tika import parser


def extract_text_from_doc_bytes(doc_bytes):
    # Wrap the raw bytes in a file-like object for Tika.
    file_like_object = BytesIO(doc_bytes)
    parsed = parser.from_buffer(file_like_object)
    # "content" holds the extracted text.
    return parsed["content"]
```
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: chrysanthemum-boy <fannc@qq.com >
2024-04-23 15:31:43 +08:00
b8e58fe27a
add redis to accelerate access of minio ( #482 )
...
### What problem does this PR solve?
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2024-04-22 14:11:09 +08:00
fd7fcb5baf
apply pep8 formalize ( #155 )
2024-03-27 11:33:46 +08:00
d1c600d5d3
add ocr and recognizer demo, update README ( #74 )
2024-02-26 19:51:35 +08:00
7fd1eca582
init README of deepdoc, add picture processor. ( #71 )
...
* init README of deepdoc, add picture processor.
* add resume parsing
2024-02-23 18:28:12 +08:00
a8294f2168
Refine resume parts and fix bugs in retrieval using sql ( #66 )
2024-02-19 19:22:17 +08:00
c5ea37cd30
Add resume parser and fix bugs ( #59 )
...
* Update .gitignore
* Update .gitignore
* Add resume parser and fix bugs
2024-02-07 19:27:23 +08:00
e6acaf6738
Add Q&A and Book, fix task running bugs ( #50 )
2024-02-01 18:53:56 +08:00
072f9dd5bc
Add app to rag module: presentation & laws ( #43 )
2024-01-25 18:57:39 +08:00
e32ef75e99
Test chat API and refine ppt chunker ( #42 )
2024-01-23 19:45:36 +08:00
34b2ab3b2f
Test APIs and fix bugs ( #41 )
2024-01-22 19:51:38 +08:00
484e5abc1f
llm configuration refine and trievalTest API refine ( #40 )
2024-01-19 19:51:57 +08:00
9bf75d4511
add dialog api ( #33 )
2024-01-17 20:20:42 +08:00
6be3dd56fa
rename web_server to api ( #29 )
...
* add front end code
* change licence
* rename web_server to API
* change name to web_server
2024-01-17 09:43:27 +08:00