10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 ( #3834 )
2024-06-03 13:37:11 -07:00
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
dfbe60dc62
[Misc] Simplify code and fix type annotations in conftest.py ( #5118 )
2024-06-02 16:05:50 -07:00
ed59a7ed23
Update test_ignore_eos ( #4898 )
2024-06-02 02:21:53 +00:00
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-01 08:46:07 +00:00
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-30 19:24:41 -07:00
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-05-30 02:52:14 -07:00
b1c255630d
[Core] Avoid the need to pass None values to Sequence.inputs ( #5099 )
2024-05-29 16:05:01 -07:00
eecd864388
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators ( #5096 )
2024-05-29 16:02:25 -07:00
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
18c1f16d86
[Bugfix] Fix arguments passed to Sequence in stop checker test ( #5092 )
2024-05-29 07:16:41 +00:00
5bd3c65072
[Core][Optimization] remove vllm-nccl ( #5091 )
2024-05-29 05:13:52 +00:00
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified ( #5077 )
2024-05-28 17:15:35 -07:00
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-28 13:29:31 -07:00
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local >
2024-05-28 11:07:07 +09:00
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) ( #4846 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-27 15:18:17 -07:00
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding ( #5000 )
2024-05-25 10:00:14 -07:00
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com >
Co-authored-by: bapatra <bapatra@microsoft.com >
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-24 22:00:52 -07:00
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-24 10:07:09 -07:00
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-24 12:28:27 +00:00
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-05-23 21:29:18 +00:00
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-05-23 09:54:48 -07:00
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
2024-05-20 11:29:28 -07:00
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through.
It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors.
Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net >
2024-05-16 11:16:09 -07:00
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-05-16 12:56:15 -04:00
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cade Daniel <cade@anyscale.com >
2024-05-16 00:53:51 -07:00
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-05-15 22:11:54 -07:00
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-15 19:13:36 -04:00