Commit Graph

78 Commits

SHA1 Message Date
d4adf92beb Merge branch 'main' into woosuk-tpu 2024-04-16 07:56:53 +00:00
563c54f760 [BugFix] Fix tensorizer extra in setup.py (#4072) 2024-04-14 14:12:42 -07:00
711a000255 [Frontend] [Core] feat: Add model loading using tensorizer (#3476) 2024-04-13 17:13:01 -07:00
c2b4a1bce9 [Doc] Add typing hints / mypy types cleanup (#3816) 2024-04-11 17:17:21 -07:00
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
696b653193 yapf 2024-04-10 08:02:21 +00:00
60ff6b8c5c Merge branch 'main' into woosuk-tpu 2024-04-10 07:51:35 +00:00
cfaf49a167 [Misc] Define common requirements (#3841) 2024-04-05 00:39:17 -07:00
ca81ff5196 [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) 2024-04-04 10:26:19 -07:00
0e3f06fe9c [Hardware][Intel] Add CPU inference backend (#3634) 2024-04-01 22:07:30 -07:00
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
4cdb732cef Add TPU to setup 2024-04-01 07:07:38 +00:00
3492859b68 [CI/Build] update default number of jobs and nvcc threads to avoid overloading the system (#3675) 2024-03-28 00:18:54 -04:00
8f44facddd [Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
01bfb22b41 [CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
42bc386129 [CI/Build] respect the common environment variable MAX_JOBS (#3600) 2024-03-24 17:04:00 -07:00
523e30ea0c [BugFix] Hot fix in setup.py for neuron build (#3537) 2024-03-20 17:59:52 -07:00
ba8ae1d84f Check for _is_cuda() in compute_num_jobs (#3481) 2024-03-20 10:06:56 -07:00
9fdf3de346 Cmake based build system (#2830) 2024-03-18 15:38:33 -07:00
abfc4f3387 [Misc] Use dataclass for InputMetadata (#3452) 2024-03-17 10:02:46 +00:00
Co-authored-by: youkaichao <youkaichao@126.com>
6b78837b29 Fix setup.py neuron-ls issue (#2671) 2024-03-16 16:00:25 -07:00
8e67598aa6 [Misc] fix line length for entire codebase (#3444) 2024-03-16 00:36:29 -07:00
604f235937 [Misc] add error message in non linux platform (#3438) 2024-03-15 21:21:37 +00:00
739c350c19 [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) 2024-03-13 09:43:24 -07:00
2f8844ba08 Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
1cb0cc2975 [FIX] Make flash_attn optional (#3269) 2024-03-08 10:52:20 -08:00
2daf23ab0c Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
c0c2335ce0 Integrate Marlin Kernels for Int4 GPTQ inference (#2497) 2024-03-01 12:47:51 -08:00
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2c08ff23c0 Fix building from source on WSL (#3112) 2024-02-29 11:13:58 -08:00
cfc15a1031 Optimize Triton MoE Kernel (#2979) 2024-02-26 13:48:56 -08:00
Co-authored-by: Cade Daniel <edacih@gmail.com>
264017a2bf [ROCm] include gfx908 as supported (#2792) 2024-02-19 17:58:59 -08:00
0580aab02f [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
931746bc6d Add documentation on how to do incremental builds (#2796) 2024-02-07 14:42:02 -08:00
f0d4e14557 Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
2ccee3def6 [ROCm] Fixup arch checks for ROCM (#2627) 2024-02-05 14:59:09 -08:00
5d60def02c DeepseekMoE support with Fused MoE kernel (#2453) 2024-01-29 21:19:48 -08:00
Co-authored-by: roy <jasonailu87@gmail.com>
ea8489fce2 ROCm: Allow setting compilation target (#2581) 2024-01-29 10:52:31 -08:00
9090bf02e7 Support FP8-E5M2 KV Cache (#2279) 2024-01-28 16:43:54 -08:00
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
380170038e Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
390b495ff3 Don't build punica kernels by default (#2605) 2024-01-26 15:19:19 -08:00
6b7de1a030 [ROCm] add support to ROCm 6.0 and MI300 (#2274) 2024-01-26 12:41:10 -08:00
9b945daaf1 [Experimental] Add multi-LoRA support (#1804) 2024-01-23 15:26:37 -08:00
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
18473cf498 [Neuron] Add an option to build with neuron (#2065) 2024-01-18 10:58:50 -08:00
6e01e8c1c8 [CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
1b7c791d60 [ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
2acd76f346 [ROCm] Temporarily remove GPTQ ROCm support (#2138) 2023-12-15 17:13:58 -08:00
0fbfc4b81b Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
6ccc0bfffb Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836) 2023-12-07 23:16:52 -08:00
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
c8e7eb1eb3 fix typo in getenv call (#1972) 2023-12-07 16:04:41 -08:00
24f60a54f4 [Docker] Adding number of nvcc_threads during build as envar (#1893) 2023-12-07 11:00:32 -08:00
e0c6f556e8 [Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
5ffc0d13a2 Migrate linter from pylint to ruff (#1665) 2023-11-20 11:58:01 -08:00