d0feea31c7
[Kernel] optimize performance of gptq marlin kernel when n is small ( #14138 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-07 11:53:38 -05:00
e5e03c2c1b
[BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs ( #14396 )
2025-03-06 21:56:06 -08:00
99b0915d3b
[Kernel] Add needs_fixed_stride_order tag to most GEMMs ( #14306 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 14:17:09 -08:00
ada19210a3
Adding cpu inference with VXE ISA for s390x architecture ( #12613 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com >
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com >
2025-03-06 08:40:53 -08:00
f89978ad7c
add cutlass support for blackwell fp8 gemm ( #13798 )
2025-03-04 07:55:07 -08:00
848a6438ae
[ROCm] Faster Custom Paged Attention kernels ( #12348 )
2025-03-03 09:24:45 -08:00
09e56f9262
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure ( #14051 )
2025-03-02 17:35:01 -08:00
cf069aa8aa
Update deprecated Python 3.8 typing ( #13971 )
2025-03-02 17:34:51 -08:00
6a92ff93e1
[Misc][Kernel]: Add GPTQAllSpark Quantization ( #12931 )
2025-02-28 22:30:59 -08:00
378b3ef6f8
[ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding ( #13922 )
2025-02-26 20:04:12 -08:00
a31614e386
[ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined ( #13851 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-27 10:39:10 +08:00
094b7d9496
[Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues ( #13797 )
2025-02-25 18:52:03 -08:00
aabeb2688f
[ROCm][Quantization][Kernel] Using HIP FP8 header ( #12593 )
2025-02-25 00:39:59 -08:00
82e0d601fc
[CI/Build] Fix pre-commit errors from #13571 ( #13709 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-22 16:50:38 -08:00
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm ( #13571 )
2025-02-22 05:24:05 -08:00
288cc6c234
[Attention] MLA with chunked prefill ( #12639 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Patrick Horn <patrick.horn@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-21 15:30:12 -08:00
839b27c6cc
[Kernel]Add streamK for block-quantized CUTLASS kernels ( #12978 )
2025-02-20 22:14:24 -08:00
1cdc88614a
Missing comment explaining VDR variable in GGUF kernels ( #13290 )
2025-02-20 22:06:54 -08:00
27a09dc52c
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant ( #13632 )
2025-02-20 22:01:48 -08:00
0023cd2b9d
[ROCm] MI300A compile targets deprecation ( #13560 )
2025-02-19 23:05:00 -08:00
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm ( #13235 )
2025-02-14 20:30:42 -08:00
8c32b08a86
[Kernel] Fix awq error when n is not divisable by 128 ( #13227 )
2025-02-13 20:07:05 -08:00
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels ( #13198 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-14 00:01:14 +00:00
2344192a55
Optimize moe_align_block_size for deepseek_v3 ( #12850 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-13 18:43:37 -05:00
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization ( #12784 )
2025-02-12 19:51:51 -08:00
f1042e86f0
[Misc] AMD Build Improvements ( #12923 )
2025-02-12 02:36:10 -08:00
59fff4a01a
[core] improve error handling when wake up from sleep mode ( #12981 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 09:38:57 +08:00
8a69e0e20e
[CI/Build] Auto-fix Markdown files ( #12941 )
2025-02-08 04:25:15 -08:00
85ac82d228
[Kernel] Make rotary_embedding ops more flexible with input shape ( #12777 )
2025-02-06 08:46:13 -08:00
5b19b93082
[ROCm][Kernel] Using the correct warp_size value
2025-02-05 19:15:08 -08:00
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) ( #12676 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-02-04 18:22:24 -08:00
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm ( #12696 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 13:04:59 -08:00
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
also be easily used by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure
we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral ( #12413 )
2025-01-25 01:49:28 +00:00
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-01-23 18:04:03 +00:00
68ad4e3a8d
[Core] Support fully transparent sleep mode ( #11743 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:39:32 +08:00
1e60f87bb3
[Kernel] fix moe_align_block_size error condition ( #12239 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-01-21 10:30:28 -08:00
750f4cabfa
[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) ( #12222 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-20 16:42:16 -08:00
3ea7b94523
Move linting to pre-commit ( #11975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-20 14:58:01 +08:00
0794e7446e
[Misc] Add multipstep chunked-prefill support for FlashInfer ( #10467 )
2025-01-15 12:47:49 +08:00
42f5e7c52a
[Kernel] Support MulAndSilu ( #11624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 02:29:53 +00:00
cfd3219f58
[Hardware][Apple] Native support for macOS Apple Silicon ( #11696 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-08 16:35:49 +08:00
4068f4b5b5
[MISC] Replace c10::optional with std::optional ( #11730 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-05 10:20:34 +09:00
5dba257506
Resolve race conditions in Marlin kernel ( #11493 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-01-02 22:58:56 +00:00
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00