f33ec6d2d2
Remove unneeded changes
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-04 18:13:50 +00:00
c33aeecf24
simplify - get rid of tokenshape
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 22:42:09 +00:00
230730c34d
Updates for FA3 and other changes
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 20:58:48 +00:00
d151b63b8b
Merge branch 'main' into hacking_full_cudagraph
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 14:05:45 -05:00
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
also be easily used by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure
we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral ( #12413 )
2025-01-25 01:49:28 +00:00
bed9efafeb
Merge branch 'main' into hacking_full_cudagraph
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-24 16:08:18 -05:00
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-01-23 18:04:03 +00:00
68ad4e3a8d
[Core] Support fully transparent sleep mode ( #11743 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:39:32 +08:00
1e60f87bb3
[Kernel] fix moe_align_block_size error condition ( #12239 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-01-21 10:30:28 -08:00
750f4cabfa
[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) ( #12222 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-20 16:42:16 -08:00
3ea7b94523
Move linting to pre-commit ( #11975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-20 14:58:01 +08:00
d4c9448b26
WIP initial working version
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-17 13:54:51 -05:00
0794e7446e
[Misc] Add multipstep chunked-prefill support for FlashInfer ( #10467 )
2025-01-15 12:47:49 +08:00
42f5e7c52a
[Kernel] Support MulAndSilu ( #11624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 02:29:53 +00:00
cfd3219f58
[Hardware][Apple] Native support for macOS Apple Silicon ( #11696 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-08 16:35:49 +08:00
4068f4b5b5
[MISC] Replace c10::optional with std::optional ( #11730 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-05 10:20:34 +09:00
5dba257506
Resolve race conditions in Marlin kernel ( #11493 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-01-02 22:58:56 +00:00
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00
f49777ba62
Deepseek v3 ( #11502 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com >
2024-12-26 16:09:44 -08:00
8936316d58
[Kernel] Refactor Cutlass c3x ( #10049 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 07:00:18 +00:00
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) ( #11311 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-19 02:43:30 +00:00
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com >
Co-authored-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-18 09:57:16 -05:00
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-13 03:19:23 +00:00
3b61cb450d
[V1] Further reduce CPU overheads in flash-attn ( #10989 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 12:38:46 -08:00
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None ( #10928 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2024-12-08 01:21:18 +08:00
f13cf9ad50
[Build] Fix for the Wswitch-bool clang warning ( #10060 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-07 09:03:44 +00:00
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 ( #10687 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-26 22:55:32 -08:00
a6760f6456
[Feature] vLLM ARM Enablement for AARCH64 CPUs ( #9228 )
...
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-11-25 18:32:39 -08:00
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm ( #10254 )
2024-11-22 21:14:49 -08:00
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-19 19:40:33 -08:00
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-19 13:31:12 -08:00
1ea291a417
Fix: Build error seen on Power Architecture ( #10421 )
...
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: ismael-dm <ismaeldm99@gmail.com >
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Angus Wang <wangjadehao@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Ricky Xu <rickyx@anyscale.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:34:57 -08:00
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-18 12:59:29 -07:00
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-14 21:23:29 +00:00
b6dde33019
[Core] Flashinfer - Remove advance step size restriction ( #10282 )
2024-11-13 16:29:32 +08:00
812c981fa0
Splitting attention kernel file ( #10091 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2024-11-11 22:55:07 -08:00
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target ( #10108 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 18:42:50 +08:00
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com >
2024-11-06 23:50:47 -08:00
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 04:43:08 +00:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-10-31 20:06:25 +00:00
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 00:19:28 -07:00
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2024-10-24 15:37:52 -07:00