21 Commits

SHA1 Message Date
2ff767b513 Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
d6e4a130b0 [Minor] Remove gather_cached_kv kernel (#3043) 2024-02-26 15:00:54 -08:00
923797fea4 Fix compile error when using rocm (#2648) 2024-02-01 09:35:09 -08:00
9090bf02e7 Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
5265631d15 use a correct device when creating OptionalCUDAGuard (#2583) 2024-01-25 23:48:17 -08:00
77af974b40 [FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
6ccc0bfffb Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
0ce8647dc5 Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
c1376e0f82 Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
8ce9c50d40 Avoid compiling kernels for double data type (#933) 2023-09-02 14:59:47 +09:00
0b98ba15c7 Change the name to vLLM (#150) 2023-06-17 03:07:40 -07:00
e070829ae8 Support bfloat16 data type (#54) 2023-05-03 14:09:44 -07:00
e3cec88aa5 Memcpy kernel for flash attention (#29)
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
0f40557af6 Implement block copy kernel to optimize beam search (#32) 2023-04-07 17:45:07 -07:00
897cb2ae28 Optimize data movement (#20) 2023-04-02 00:30:17 -07:00
88c0268a18 Implement custom kernel for LLaMA rotary embedding (#14) 2023-03-30 11:04:21 -07:00
cfae35b861 Add miscellaneous updates (#8) 2023-03-13 13:48:38 -07:00
1a7eb7da61 Support beam search & parallel generation (#7) 2023-03-10 09:58:21 -08:00
0deacbce6e Implement single_query_cached_kv_attention kernel (#3) 2023-03-01 15:02:19 -08:00
c413c41cda Add reshape_and_cache op 2023-02-18 19:22:57 +00:00
ffad4e1e03 cache_kernel -> cache_kernels 2023-02-16 20:05:45 +00:00