82e0d601fc
[CI/Build] Fix pre-commit errors from #13571 ( #13709 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-22 16:50:38 -08:00
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm ( #13571 )
2025-02-22 05:24:05 -08:00
288cc6c234
[Attention] MLA with chunked prefill ( #12639 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Patrick Horn <patrick.horn@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-21 15:30:12 -08:00
839b27c6cc
[Kernel]Add streamK for block-quantized CUTLASS kernels ( #12978 )
2025-02-20 22:14:24 -08:00
1cdc88614a
Missing comment explaining VDR variable in GGUF kernels ( #13290 )
2025-02-20 22:06:54 -08:00
27a09dc52c
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant ( #13632 )
2025-02-20 22:01:48 -08:00
0023cd2b9d
[ROCm] MI300A compile targets deprecation ( #13560 )
2025-02-19 23:05:00 -08:00
8c32b08a86
[Kernel] Fix awq error when n is not divisable by 128 ( #13227 )
2025-02-13 20:07:05 -08:00
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels ( #13198 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-14 00:01:14 +00:00
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization ( #12784 )
2025-02-12 19:51:51 -08:00
8a69e0e20e
[CI/Build] Auto-fix Markdown files ( #12941 )
2025-02-08 04:25:15 -08:00
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm ( #12696 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 13:04:59 -08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
also be easily used by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure
we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
4068f4b5b5
[MISC] Replace c10::optional with std::optional ( #11730 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-05 10:20:34 +09:00
5dba257506
Resolve race conditions in Marlin kernel ( #11493 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-01-02 22:58:56 +00:00
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00
8936316d58
[Kernel] Refactor Cutlass c3x ( #10049 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 07:00:18 +00:00
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com >
Co-authored-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-18 09:57:16 -05:00
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-13 03:19:23 +00:00
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 ( #10687 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-26 22:55:32 -08:00
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm ( #10254 )
2024-11-22 21:14:49 -08:00
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-19 19:40:33 -08:00
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-19 13:31:12 -08:00
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-18 12:59:29 -07:00
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing ( #9487 )
2024-10-22 15:41:13 -07:00
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels ( #9391 )
2024-10-17 01:34:06 +00:00
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel ( #9425 )
2024-10-16 23:46:06 +00:00
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-23 13:46:26 -04:00
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 ( #7757 )
2024-08-22 08:28:52 -04:00
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-21 20:18:00 -04:00
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels ( #7323 )
2024-08-12 14:40:13 -04:00