55dcce91df
Upstream Llama4 Support to Main (#16113)
...
Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com>
Signed-off-by: Chris Thi <chris.c.thi@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Jon Swenson <jmswen@gmail.com>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Xiaodong Wang <xdwang@meta.com>
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-04-07 08:06:27 -07:00
e59ca942f5
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-04-01 12:07:43 -04:00
9239bf718e
[Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972)
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>
2025-03-27 00:54:44 +00:00
23114d3364
[Misc] Warn about v0 in benchmark_paged_attn.py (#15495)
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-25 20:31:04 -07:00
f90d34b498
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 (#15322)
...
Signed-off-by: DefTruth <qiustudent_r@163.com>
2025-03-23 01:10:10 -07:00
400d483e87
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-18 09:47:53 +00:00
a73122de96
[Bugfix] fix benchmark moe (#14653)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-13 16:12:42 +08:00
a1c8f3796c
dynamic dispatch of fp8 kernels (#14245)
...
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
2025-03-11 10:54:56 -04:00
5ff0d32580
[V1] LoRA - Add triton kernels for V1 (#13096)
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-10 17:27:53 -04:00
3b352a2f92
Correct capitalisation: VLLM -> vLLM (#14562)
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-10 16:36:21 +00:00
c34eeec58d
[Bugfix] Correctly call cudaProfilerStop in benchmarks script (#14183)
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-07 00:42:49 +00:00
ca100c90fe
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 17:08:51 -08:00
7bab4bb048
[Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-05 23:11:29 +08:00
f78c0be80a
Fix benchmark_moe.py tuning for CUDA devices (#14164)
2025-03-03 21:11:03 -08:00
bb5b640359
[core] moe fp8 block quant tuning support (#14068)
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-03-04 01:30:23 +00:00
848a6438ae
[ROCm] Faster Custom Paged Attention kernels (#12348)
2025-03-03 09:24:45 -08:00
cf069aa8aa
Update deprecated Python 3.8 typing (#13971)
2025-03-02 17:34:51 -08:00
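As an illustrative sketch of the typing cleanup above (hypothetical code, not the actual diff from this commit): the deprecated `typing.List`/`typing.Optional` aliases give way to PEP 585 built-in generics and PEP 604 `X | None` unions, the latter needing Python 3.10 or `from __future__ import annotations`:

```python
# Illustrative sketch only -- the function below is hypothetical.
from __future__ import annotations  # lets `int | None` parse on Python 3.9

from typing import List, Optional  # Python 3.8-era aliases, deprecated by PEP 585/604


def first_old(xs: List[int], default: Optional[int] = None) -> Optional[int]:
    return xs[0] if xs else default


# Modern spelling once Python 3.8 support is gone:
def first(xs: list[int], default: int | None = None) -> int | None:
    return xs[0] if xs else default
```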
6a92ff93e1
[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931)
2025-02-28 22:30:59 -08:00
5157338ed9
[Misc] Improve LoRA spelling (#13831)
2025-02-25 23:43:01 -08:00
781096e385
Expert Parallelism (EP) Support for DeepSeek V2 (#12583)
2025-02-24 07:33:20 -08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files (#12628)
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com>
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com>
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
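For reference, the header described in the commit above is a single machine-readable comment at the top of each Python source file. Assuming vLLM's Apache-2.0 license (the exact identifier depends on the file's license), it reads:

```python
# SPDX-License-Identifier: Apache-2.0
```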
cfa134d247
[Bugfix/CI] Fixup benchmark_moe.py (#12562)
...
Fixes `is_marlin` not being passed into `get_default_config`
Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-01 13:41:35 +08:00
1c1bb0bbf2
[Misc][MoE] add Deepseek-V3 moe tuning support (#12558)
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-30 00:47:30 +00:00
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2025-01-23 18:04:03 +00:00
2acba47d9b
[bugfix] moe tuning. rm is_navi() (#12273)
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-21 22:47:32 +00:00
8027a72461
[ROCm][MoE] moe tuning support for rocm (#12049)
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-17 14:49:16 +08:00
5fd24ec02e
[misc] Add LoRA kernel micro benchmarks (#11579)
2025-01-16 15:51:40 +00:00
02222a0256
[Misc] Kernel Benchmark for RMSNorm (#11241)
...
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com>
2024-12-17 06:57:02 +00:00
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors (#9855)
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL (#8464)
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
622b7ab955
[Hardware] using current_platform.seed_everything (#9785)
...
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
32176fee73
[torch.compile] support moe models (#9632)
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL (#9250)
2024-10-16 13:56:17 +08:00
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
...
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
9d104b5beb
[CI/Build] Update Ruff version (#8469)
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534)
2024-09-18 10:38:11 +00:00
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
2024-08-20 07:09:33 -06:00
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415)
2024-08-16 10:06:51 -07:00
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396)
2024-08-02 13:51:58 -07:00
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795)
2024-07-27 17:52:33 -04:00
14dbd5a767
[Model] H2O Danube3-4b (#6451)
2024-07-26 20:47:50 -07:00
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081)
2024-07-16 15:31:32 -07:00
b675069d74
[Misc] Refactor Marlin Python Utilities (#6082)
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718)
2024-06-20 17:00:13 -06:00
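A minimal sketch of the idea behind the parser above, assuming the real vLLM `FlexibleArgumentParser` differs in details: normalize `--underscore_style` flags to their `--dash-style` form before handing them to `argparse`:

```python
# Minimal sketch, assuming the real vLLM implementation differs in details.
import argparse
import sys


class FlexibleArgumentParser(argparse.ArgumentParser):
    """ArgumentParser that accepts both --some_flag and --some-flag."""

    def parse_args(self, args=None, namespace=None):
        if args is None:
            args = sys.argv[1:]
        processed = []
        for arg in args:
            if arg.startswith("--"):
                # Rewrite '--some_flag' and '--some_flag=value' to dashed form.
                key, sep, value = arg.partition("=")
                arg = key.replace("_", "-") + sep + value
            processed.append(arg)
        return super().parse_args(processed, namespace)


parser = FlexibleArgumentParser()
parser.add_argument("--block-size", type=int, default=16)
print(parser.parse_args(["--block_size", "32"]).block_size)  # prints 32
```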
0e9164b40a
[mypy] Enable type checking for test directory (#5017)
2024-06-15 04:45:31 +00:00
d74674bbd9
[Misc] Fix arg names (#5524)
2024-06-14 09:47:44 -07:00
51a08e7d8f
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238)
2024-06-05 10:59:14 -07:00
27208be66e
[Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242)
2024-06-04 09:58:47 -07:00