[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)

Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 15:47:41 -07:00
parent 4ddc4743d7
commit a046f86397
10 changed files with 208 additions and 47 deletions
--- a/docs/source/quantization/fp8_e4m3_kvcache.rst
+++ b/docs/source/quantization/fp8_e4m3_kvcache.rst
@@ -45,5 +45,3 @@ Here is an example of how to enable this feature:
        # output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
        # output w/o scaling factors:  England, located in the southeastern part of the country. It is known 

-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
-
--- a/docs/source/quantization/fp8_e5m2_kvcache.rst
+++ b/docs/source/quantization/fp8_e5m2_kvcache.rst
@@ -32,5 +32,3 @@ Here is an example of how to enable this feature:
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
-