youngkingdom/vllm - vllm

Author	SHA1	Message	Date
Tyler Michael Smith	cbbc904470	[Kernel] Squash a few more warnings (#6914 )	2024-07-30 13:50:42 -04:00
Varun Sundar Rabindranath	af647fb8b3	[Kernel] Tuned int8 kernels for Ada Lovelace (#6848 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-29 20:24:58 -06:00
Tyler Michael Smith	61a97c32f6	[Kernel] Fix marlin divide-by-zero warnings (#6904 )	2024-07-30 01:26:07 +00:00
Tyler Michael Smith	aae6d36f7e	[Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908 )	2024-07-29 18:01:17 -06:00
Tyler Michael Smith	60d1c6e584	[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (#6901 )	2024-07-29 09:59:02 -07:00
Varun Sundar Rabindranath	766435e660	[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-29 09:42:35 -06:00
Alexander Matveev	75acdaa4b6	[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795 )	2024-07-27 17:52:33 -04:00
Lucas Wilkinson	55712941e5	[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b (#6852 )	2024-07-27 02:27:44 +00:00
Tyler Michael Smith	50704f52c4	[Bugfix][Kernel] Promote another index to int64_t (#6838 )	2024-07-26 18:41:04 +00:00
Tyler Michael Smith	fea59c7712	[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649 )	2024-07-22 14:08:30 -06:00
Alexander Matveev	396d92d5e0	[Kernel][Core] Add AWQ support to the Marlin kernel (#6612 )	2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath	2e26564259	[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-19 18:15:26 -07:00
Varun Sundar Rabindranath	b5241e41d9	[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-18 01:38:35 +00:00
Tyler Michael Smith	9dad5cc859	[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384 )	2024-07-14 13:37:19 +00:00
Michael Goin	47f0954af0	[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975 )	2024-07-03 17:38:00 +00:00
Tyler Michael Smith	6a2d659d28	[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931 )	2024-06-28 17:10:34 +00:00
Luka Govedič	5bfd1bbc98	[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560 ) Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-06-26 15:16:00 +00:00
Varun Sundar Rabindranath	6c916ac8a8	[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-23 21:07:11 +00:00
Tyler Michael Smith	3f3b6b2150	[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715 )	2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath	a7dcc62086	[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-20 13:33:21 +00:00
Varun Sundar Rabindranath	111af1fa2c	[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-20 06:37:08 +00:00
Tyler Michael Smith	b23ce92032	[Bugfix] Fix CUDA version check for mma warning suppression (#5642 )	2024-06-18 23:48:49 +00:00
Tyler Michael Smith	348616ac4b	[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401 )	2024-06-14 10:02:00 -07:00
Tyler Michael Smith	703475f6c2	[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516 )	2024-06-14 09:30:15 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Cody Yu	5985e3427d	[Kernel] Vectorized FP8 quantize kernel (#5396 ) Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmarks show that the improved kernel can achieve a 1.0x-1.5x speedup (especially when the hidden size is large). In detail, we applied 3 optimizations: - Use an inverted scale so that most divisions become multiplications. - Unroll the loop 4 times to improve ILP. - Use 4-element vectorized accesses to transfer data between HBM and SRAM.	2024-06-12 14:07:26 -07:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )	2024-06-09 16:23:30 -04:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00
Tyler Michael Smith	ccd4f129e8	[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157 ) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-06-05 10:44:15 -07:00
Tyler Michael Smith	cbb2f59cc8	[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159 )	2024-06-03 09:52:30 -07:00
Varun Sundar Rabindranath	f081c3ce4b	[Kernel] Update Cutlass fp8 configs (#5144 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-01 08:46:07 +00:00
Tyler Michael Smith	260d119e86	[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137 )	2024-06-01 06:45:32 +00:00
Tyler Michael Smith	1197e02141	[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168 )	2024-05-31 17:21:38 -07:00
Simon Mo	e9d3aa04f6	Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149 )	2024-05-30 22:00:26 -07:00
Alexander Matveev	6d21fa1cad	[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136 )	2024-05-30 21:02:11 -05:00
Dipika Sikka	a1242324c9	[Kernel] Initial Activation Quantization Support (#4525 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-05-23 21:29:18 +00:00
Alexander Matveev	6066253296	Marlin 24 prefill performance improvement (about 25% better on average) (#4983 )	2024-05-23 02:39:27 -04:00
Tyler Michael Smith	8674f9880e	[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954 ) Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs	2024-05-22 14:10:43 +00:00
Michael Goin	5f6d10c14c	[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722 )	2024-05-22 07:18:41 +00:00
Alexander Matveev	da5a0b539d	Remove marlin warning (#4918 )	2024-05-20 14:55:34 +00:00
Tyler Michael Smith	2060e93659	[Kernel] Add w8a8 CUTLASS kernels (#4749 )	2024-05-16 18:32:50 -04:00
Alexander Matveev	6979ade384	Add GPTQ Marlin 2:4 sparse structured support (#4790 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-05-16 12:56:15 -04:00
Jinzhen Lin	99caa49106	[Kernel] add bfloat16 support for gptq marlin kernel (#4788 )	2024-05-16 09:55:29 -04:00
Cody Yu	c833101740	[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535 )	2024-05-09 18:04:17 -06:00
alexm-nm	e288df0632	[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626 )	2024-05-08 17:14:31 -07:00
Philipp Moritz	a98187cf72	[Kernel] Make static FP8 scaling more robust (#4570 ) Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU: \| Groups \|Version\|Filter\|n-shot\|Metric\|Value \| \|Stderr\| \|------------------\|-------\|------\|-----:\|------\|-----:\|---\|-----:\| \|mmlu \|N/A \|none \| 0\|acc \|0.2295\|± \|0.0035\| \| - humanities \|N/A \|none \| 5\|acc \|0.2421\|± \|0.0062\| \| - other \|N/A \|none \| 5\|acc \|0.2398\|± \|0.0076\| \| - social_sciences\|N/A \|none \| 5\|acc \|0.2171\|± \|0.0074\| \| - stem \|N/A \|none \| 5\|acc \|0.2125\|± \|0.0073\| With the fix in this PR, where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is \| Groups \|Version\|Filter\|n-shot\|Metric\|Value \| \|Stderr\| \|------------------\|-------\|------\|-----:\|------\|-----:\|---\|-----:\| \|mmlu \|N/A \|none \| 0\|acc \|0.7008\|± \|0.0036\| \| - humanities \|N/A \|none \| 5\|acc \|0.6453\|± \|0.0065\| \| - other \|N/A \|none \| 5\|acc \|0.7692\|± \|0.0072\| \| - social_sciences\|N/A \|none \| 5\|acc \|0.8083\|± \|0.0070\| \| - stem \|N/A \|none \| 5\|acc \|0.6115\|± \|0.0083\| This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.	2024-05-06 17:39:28 -07:00
alexm-nm	7038e8b803	[Kernel] Support running GPTQ 8-bit models in Marlin (#4533 )	2024-05-02 12:56:22 -04:00
Robert Shaw	73c8d677e5	[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922 ) Co-authored-by: alexm <alexm@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-29 09:35:34 -07:00
Philipp Moritz	12628d3c78	[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-27 04:49:59 +00:00
alexm-nm	aae08249ac	[Bugfix] Fix marlin kernel crash on H100 (#4218 ) This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187. The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.	2024-04-24 10:35:01 -07:00
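
Note on commit 5985e3427d (#5396): the loop structure that message describes can be sketched roughly as below. This is a plain-Python illustration, not the CUDA kernel from the PR. The function name `quantize_inverted_scale` is hypothetical, the final cast to FP8 is omitted, and Python cannot show the memory-bandwidth effect of vectorized loads. It only shows the inverse scale being computed once and the elements being handled four at a time, with a scalar tail loop for leftovers.

```python
def quantize_inverted_scale(xs, scale):
    """Scale a list of floats using a precomputed inverse scale.

    Hypothetical helper for illustration only; the real kernel also
    converts each result to an FP8 type, which is omitted here.
    """
    inv_scale = 1.0 / scale          # one division, reused for every element
    out = [0.0] * len(xs)

    n_vec = len(xs) // 4             # number of complete groups of 4
    for i in range(n_vec):
        base = 4 * i
        # Stand-in for one 4-wide vectorized load.
        a, b, c, d = xs[base:base + 4]
        # Four independent multiplies (the "unroll by 4" step).
        out[base:base + 4] = (a * inv_scale, b * inv_scale,
                              c * inv_scale, d * inv_scale)

    # Scalar tail for lengths that are not a multiple of 4.
    for j in range(4 * n_vec, len(xs)):
        out[j] = xs[j] * inv_scale
    return out


if __name__ == "__main__":
    print(quantize_inverted_scale([0.5, -1.0, 2.0, 3.0, 4.0, 5.0], 0.5))
```

Multiplying by a precomputed inverse can differ from a true division in the last bit for some scales, which is usually acceptable before a cast to a format as coarse as FP8.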

71 Commits
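
Note on commit a98187cf72 (#4570): the clamping fix described in that message can be illustrated with a small toy model. This is a sketch under stated assumptions, not the kernel code from the PR. The helper names `to_fp8_toy` and `quantize_static` are hypothetical, the toy cast does no rounding, and it simply assumes that a value outside the finite range turns into NaN, which is the failure the commit message says the clamp prevents. The constant 448.0 is the largest finite value of the float8 e4m3fn format.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def to_fp8_toy(v):
    """Toy stand-in for an FP8 cast (assumption: out-of-range becomes NaN)."""
    return v if abs(v) <= FP8_E4M3FN_MAX else math.nan


def quantize_static(xs, scale, clamp):
    """Divide by a fixed scale, optionally clamp, then apply the toy cast."""
    out = []
    for x in xs:
        v = x / scale
        if clamp:
            # Saturate to the representable range before the cast.
            v = max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, v))
        out.append(to_fp8_toy(v))
    return out


if __name__ == "__main__":
    acts = [1.0, -500.0, 1000.0]   # two values exceed what scale=1.0 covers
    print(quantize_static(acts, 1.0, clamp=False))
    print(quantize_static(acts, 1.0, clamp=True))
```

With a static scale that underestimates the activation range, the unclamped path produces NaNs for the two large inputs, while the clamped path saturates them to plus or minus 448.0.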