Commit Graph

18 Commits

Author SHA1 Message Date
e73b7dfd69 [Bugfix] fix an illegal memory access was encountered of marlin kernel + act_order (#18245) 2025-05-16 16:02:44 -07:00
e60f550b38 [v1] Support multiple KV cache groups in GPU model runner (#17945)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-14 18:54:54 -07:00
cabaf4eff3 [Attention] MLA decode optimizations (#12528)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-30 23:49:37 -08:00
bbe5f9de7d [Model] Support for fairseq2 Llama (#11442)
Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
2025-01-19 10:40:40 -08:00
60508ffda9 [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
b00b33d77e [Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
ce00231a8b [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models (#9213) 2024-10-10 14:15:40 +08:00
6cd5e5b07e [Misc] Fused MoE Marlin support for GPTQ (#8217) 2024-09-09 23:02:52 -04:00
c7cb5c3335 [Misc] GPTQ Activation Ordering (#8135) 2024-09-09 16:27:26 -04:00
2188a60c7e [Misc] Update GPTQ to use vLLMParameters (#7976) 2024-09-03 17:21:44 -04:00
fc911880cc [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
665304092d [Misc] Update qqq to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
f1df5dbfd6 [Misc] Update marlin to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
955b5191c9 [Misc] update fp8 to use vLLMParameter (#7437) 2024-08-22 08:36:18 -04:00
aae74ef95c Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) 2024-08-22 03:42:14 +00:00
8678a69ab5 [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
b1e5afc3e7 [Misc] Update awq and awq_marlin to use vLLMParameters (#7422) 2024-08-13 17:08:20 -04:00
fb377d7e74 [Misc] Update gptq_marlin to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00