Commit Graph

2014 Commits

Author SHA1 Message Date
fb3db61688 [CI/Build] Remove sparseml requirement from testing (#7037) 2024-08-01 12:00:51 -07:00
c8a7e93273 [core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
3c10591ef2 [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) 2024-07-31 21:13:34 -07:00
7ecee34321 [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
460c1884e3 [Bugfix] Support cpu offloading with fp8 quantization (#6960) 2024-07-31 12:47:46 -07:00
bd70013407 [MISC] Introduce pipeline parallelism partition strategies (#6920)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
daed30c4a9 [Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) 2024-07-31 23:46:17 +08:00
6512937de1 Support W4A8 quantization for vllm (#5218) 2024-07-31 07:55:21 -06:00
f230cc2ca6 [Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) 2024-07-31 10:38:45 +08:00
d7a299edaa [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) 2024-07-30 16:37:01 -04:00
052b6f8ca4 [Bugfix] Fix tensorizer memory profiling bug during testing (#6881) 2024-07-30 11:48:50 -07:00
5cf9254a9c [BugFix] Fix use of per-request seed with pipeline parallel (#6698) 2024-07-30 10:40:08 -07:00
af647fb8b3 [Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
9f69d8245a [Frontend] New allowed_token_ids decoding request parameter (#6753) 2024-07-29 23:37:27 +00:00
9a7e2d0534 [Bugfix] Allow vllm to still work if triton is not installed. (#6786)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
db9e5708a9 [Core] Reduce unnecessary compute when logprobs=None (#6532) 2024-07-29 16:47:31 +00:00
766435e660 [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
7cbd9ec7a9 [Model] Initialize support for InternVL2 series models (#6514)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-29 10:16:30 +00:00
75acdaa4b6 [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
1ad86acf17 [Model] Initial support for BLIP-2 (#5920)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-07-27 11:53:07 +00:00
Joe
14dbd5a767 [Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
969d032265 [Bugfix]: Fix Tensorizer test failures (#6835) 2024-07-26 20:02:25 -07:00
443c7cf4cf [ci][distributed] fix flaky tests (#6806) 2024-07-25 17:44:09 -07:00
65b1f121c8 [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) 2024-07-25 09:46:15 -07:00
316a41ac1d [Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) 2024-07-24 22:48:07 -07:00
309aaef825 [Bugfix] Fix decode tokens w. CUDA graph (#6757) 2024-07-24 22:33:56 -07:00
9e169a4c61 [Model] Adding support for MiniCPM-V (#4087) 2024-07-24 20:59:30 -07:00
5689e256ba [Frontend] Represent tokens with identifiable strings (#6626) 2024-07-25 09:51:00 +08:00
421e218b37 [Bugfix] Bump transformers to 4.43.2 (#6752) 2024-07-24 13:22:16 -07:00
0e63494cf3 Add fp8 support to reshape_and_cache_flash (#6667) 2024-07-24 18:36:52 +00:00
2cf0df3381 [Bugfix] Fix speculative decode seeded test (#6743) 2024-07-24 08:58:31 -07:00
c882a7f5b3 [SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714) 2024-07-24 07:34:22 +00:00
5e8ca973eb [Bugfix] fix flashinfer cudagraph capture for PP (#6708) 2024-07-24 01:49:44 +00:00
87525fab92 [bitsandbytes]: support read bnb pre-quantized model (#5753)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
2f808e69ab [Bugfix] StatLoggers: cache spec decode metrics when they get collected. (#6645)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-23 23:05:05 +00:00
01c16ede6b [CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) 2024-07-23 22:45:12 +00:00
1bedf210e3 Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) 2024-07-23 13:47:48 -07:00
58f53034ad [Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) 2024-07-23 11:41:55 -07:00
22fa2e35cb [VLM][Model] Support image input for Chameleon (#6633) 2024-07-22 23:50:48 -07:00
97234be0ec [Misc] Manage HTTP connections in one place (#6600) 2024-07-22 21:32:02 -07:00
9e0b558a09 [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) 2024-07-23 04:11:50 +00:00
42c7f66a38 [Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
fea59c7712 [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) 2024-07-22 14:08:30 -06:00
739b61a348 [Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
396d92d5e0 [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
14f91fe67c [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) 2024-07-20 23:58:58 -07:00
d7f4178dd9 [Frontend] Move chat utils (#6602)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-21 08:38:17 +08:00
06d6c5fe9f [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) 2024-07-20 09:39:07 -07:00
9042d68362 [Misc] Consolidate and optimize logic for building padded tensors (#6541) 2024-07-20 04:17:24 +00:00
7bd82002ae [Core] Allow specifying custom Executor (#6557) 2024-07-20 01:25:06 +00:00