youngkingdom/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Tyler Michael Smith	6e588da0f4	[Build/CI] Fix CUDA 11.8 build (#17679 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-05-22 12:13:54 -07:00
bnellnm	92247c522e	[Bug] Fix moe_sum signature (#18440 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2025-05-20 22:37:08 -07:00
Jinzhen Lin	e73b7dfd69	[Bugfix] fix `an illegal memory access was encountered` of marlin kernel + act_order (#18245 )	2025-05-16 16:02:44 -07:00
bnellnm	f9c069c85e	Modularize fused experts and integrate PPLX kernels (#15956 )	2025-05-14 13:11:54 -07:00
Jinzhen Lin	d4154c35a2	[Bugfix] fix moe marlin `topk_weight` loading (#18080 ) Co-authored-by: mgoin <mgoin64@gmail.com>	2025-05-13 23:31:57 -07:00
Jinzhen Lin	d74e5f37bc	[Kernel] fp4 marlin kernel (#17687 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-05-10 19:58:49 -07:00
Michael Goin	a17cef70ea	Removed unused marlin cuda code (#17684 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-05-06 17:59:47 -07:00
Jinzhen Lin	1d0c9d6b2d	[Kernel] some optimizations for dense marlin and moe marlin (#16850 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-05-05 09:39:30 -07:00
Caleb_Du	3e887d2e0c	permute/unpermute kernel for moe optimization (#14568 ) Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>	2025-05-02 11:31:55 -07:00
Harry Mellor	40896bdf3f	`pre-commit autoupdate` (#17380 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-04-29 06:46:55 -07:00
Lucas Wilkinson	7eb4255628	[BugFix] Accuracy fix for llama4 int4 - improperly casted scales (#16801 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-04-17 22:13:29 -07:00
Jinzhen Lin	d06ba4ed3f	[Kernel] moe wna16 marlin kernel (#14447 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-04-14 20:05:22 -07:00
TJian	916836bbfb	[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. (#14664 ) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-03-12 09:31:19 -07:00
Sage Moore	45f3f3f59e	[ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. (#14629 ) Signed-off-by: Sage Moore <sage@neuralmagic.com>	2025-03-12 08:00:28 -04:00
Jinzhen Lin	90e88ab756	[Kernel] moe wna16 cuda kernel (#13321 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-03-10 20:12:40 -04:00
Michael Goin	2344192a55	Optimize moe_align_block_size for deepseek_v3 (#12850 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-13 18:43:37 -05:00
Shiyan Deng	f1042e86f0	[Misc] AMD Build Improvements (#12923 )	2025-02-12 02:36:10 -08:00
Gregory Shtrasberg	5b19b93082	[ROCm][Kernel] Using the correct warp_size value	2025-02-05 19:15:08 -08:00
Yang Chen	95460fc513	[Kernel] port sgl moe_align_block_size kernels (#12574 ) sgl_moe_align_block_size is based on: `ded9fcd09a` moe_align_block_size is based on: `ba5112ff69` Signed-off-by: Yang Chen <yangche@fb.com>	2025-02-03 13:09:50 +08:00
Harry Mellor	823ab79633	Update `pre-commit` hooks (#12475 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-27 17:23:08 -07:00
ElizaWszola	221d388cc5	[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413 )	2025-01-25 01:49:28 +00:00
Jinzhen Lin	1e60f87bb3	[Kernel] fix moe_align_block_size error condition (#12239 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-01-21 10:30:28 -08:00
Jinzhen Lin	750f4cabfa	[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (#12222 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-01-20 16:42:16 -08:00
Simon Mo	f49777ba62	Deepseek v3 (#11502 ) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com>	2024-12-26 16:09:44 -08:00
Charlie Fu	59449095ab	[Performance][Kernel] Fused_moe Performance Improvement (#9384 ) Signed-off-by: charlifu <charlifu@amd.com>	2024-10-24 15:37:52 -07:00
bnellnm	eca2c5f7c0	[Bugfix] Fix support for dimension like integers and ScalarType (#9299 )	2024-10-17 19:08:34 +00:00
ElizaWszola	05d686432f	[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973 ) Co-authored-by: Dipika <dipikasikka1@gmail.com> Co-authored-by: Dipika Sikka <ds3822@columbia.edu>	2024-10-04 12:34:44 -06:00
Lucas Wilkinson	aeb37c2a72	[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845 )	2024-10-03 22:55:25 -04:00
ElizaWszola	d081da0064	[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741 ) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-09-28 18:19:40 -07:00
ElizaWszola	a928ded995	[Kernel] Split Marlin MoE kernels into multiple files (#8661 ) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-09-24 09:31:42 -07:00
Tyler Michael Smith	d66ac62854	[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (#8643 )	2024-09-21 23:45:02 +00:00
Tyler Michael Smith	4c34ce8916	[Kernel] Remove marlin moe templating on thread_m_blocks (#8573 ) Co-authored-by: lwilkinson@neuralmagic.com	2024-09-19 01:42:49 +00:00
ElizaWszola	a091e2da3e	[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032 ) Co-authored-by: Dipika <dipikasikka1@gmail.com>	2024-09-16 09:47:19 -06:00
Dipika Sikka	6cd5e5b07e	[Misc] Fused MoE Marlin support for GPTQ (#8217 )	2024-09-09 23:02:52 -04:00
Dipika Sikka	fc911880cc	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-27 15:07:09 -07:00
Michael Goin	aae74ef95c	Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527 )" (#7764 )	2024-08-22 03:42:14 +00:00
Dipika Sikka	8678a69ab5	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-21 16:17:10 -07:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396 )	2024-08-02 13:51:58 -07:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )	2024-06-09 16:23:30 -04:00
Divakar Verma	a66cf40b20	[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927 ) This PR enables the fused topk_softmax kernel used in moe layer for HIP	2024-06-02 14:13:26 -07:00
Michael Goin	5f6d10c14c	[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722 )	2024-05-22 07:18:41 +00:00
Woosuk Kwon	f0d4e14557	Add fused top-K softmax kernel for MoE (#2769 )	2024-02-05 17:38:02 -08:00

42 Commits