Commit Graph

102 Commits

Author SHA1 Message Date
1a36287b89 [Bugfix] Fix xpu build (#7644) 2024-08-18 22:00:09 -07:00
386087970a [CI/Build] build on empty device for better dev experience (#4773) 2024-08-11 13:09:44 -07:00
80cbe10c59 [OpenVINO] migrate to latest dependencies versions (#7251) 2024-08-07 09:49:10 -07:00
a8d604ca2a [Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
b482b9a5b1 [CI/Build] Add support for Python 3.12 (#7035) 2024-08-02 13:51:22 -07:00
7ecee34321 [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
dbfe254eda [Feature] vLLM CLI (#5090)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
ccd3c04571 [ci][build] fix commit id (#6420)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-07-14 22:16:21 +08:00
111fc6e7ec [Misc] Add generated git commit hash as vllm.__commit__ (#6386) 2024-07-12 22:52:15 +00:00
57f09a419c [Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
728c4c8a06 [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
03dccc886e [Misc] Add vLLM version getter to utils (#5098) 2024-06-13 11:21:39 -07:00
916d219d62 [ci] Use sccache to build images (#5419)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-12 17:58:12 -07:00
1a8bfd92d5 [Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
8bab4959be [Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389) 2024-06-11 00:37:56 -07:00
5467ac3196 [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
a66cf40b20 [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927)
This PR enables the fused topk_softmax kernel used in moe layer for HIP
2024-06-02 14:13:26 -07:00
a360ff80bb [CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034) 2024-05-31 22:06:45 -06:00
5bd3c65072 [Core][Optimization] remove vllm-nccl (#5091) 2024-05-29 05:13:52 +00:00
8bc68e198c [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 (#4208) 2024-05-13 14:57:07 -07:00
ff5abcd746 [ROCm] Add support for Punica kernels on AMD GPUs (#3140)
Co-authored-by: miloice <jeffaw99@hotmail.com>
2024-05-09 09:19:50 -07:00
89579a201f [Misc] Use vllm-flash-attn instead of flash-attn (#4686) 2024-05-08 13:15:34 -07:00
344bf7cd2d [Misc] add installation time env vars (#4574) 2024-05-03 15:55:56 -07:00
5ad60b0cbd [Misc] Exclude the tests directory from being packaged (#4552) 2024-05-02 10:50:25 -07:00
8b798eec75 [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-05-01 18:01:50 +00:00
715c2d854d [Frontend] [Core] Tensorizer: support dynamic num_readers, update version (#4467) 2024-04-30 16:32:13 -07:00
a88081bf76 [CI] Disable non-lazy string operation on logging (#4326)
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
cd2f63fb36 [CI/CD] add neuron docker and ci test scripts (#3571) 2024-04-18 15:26:01 -07:00
563c54f760 [BugFix] Fix tensorizer extra in setup.py (#4072) 2024-04-14 14:12:42 -07:00
711a000255 [Frontend] [Core] feat: Add model loading using tensorizer (#3476) 2024-04-13 17:13:01 -07:00
c2b4a1bce9 [Doc] Add typing hints / mypy types cleanup (#3816)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-11 17:17:21 -07:00
cfaf49a167 [Misc] Define common requirements (#3841) 2024-04-05 00:39:17 -07:00
ca81ff5196 [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) 2024-04-04 10:26:19 -07:00
0e3f06fe9c [Hardware][Intel] Add CPU inference backend (#3634)
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
3492859b68 [CI/Build] update default number of jobs and nvcc threads to avoid overloading the system (#3675) 2024-03-28 00:18:54 -04:00
8f44facddd [Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
01bfb22b41 [CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
42bc386129 [CI/Build] respect the common environment variable MAX_JOBS (#3600) 2024-03-24 17:04:00 -07:00
523e30ea0c [BugFix] Hot fix in setup.py for neuron build (#3537) 2024-03-20 17:59:52 -07:00
ba8ae1d84f Check for _is_cuda() in compute_num_jobs (#3481) 2024-03-20 10:06:56 -07:00
9fdf3de346 Cmake based build system (#2830) 2024-03-18 15:38:33 -07:00
abfc4f3387 [Misc] Use dataclass for InputMetadata (#3452)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-03-17 10:02:46 +00:00
6b78837b29 Fix setup.py neuron-ls issue (#2671) 2024-03-16 16:00:25 -07:00
8e67598aa6 [Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
604f235937 [Misc] add error message in non linux platform (#3438) 2024-03-15 21:21:37 +00:00
739c350c19 [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) 2024-03-13 09:43:24 -07:00
2f8844ba08 Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
1cb0cc2975 [FIX] Make flash_attn optional (#3269) 2024-03-08 10:52:20 -08:00
2daf23ab0c Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
c0c2335ce0 Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00