youngkingdom/vllm

Author	SHA1	Message	Date
Steve Grubb	dac6a3f6ed	[Misc] Apply a couple g++ cleanups (#4719 )	2024-05-10 13:37:05 +00:00
Kunshang Ji	64b77dfd7e	[Core]fix type annotation for `swap_blocks` (#4726 )	2024-05-10 21:52:48 +09:00
Simon Mo	51d4094fda	chunked-prefill-doc-syntax (#4603 ) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com>	2024-05-10 14:13:23 +09:00
Allen.Dou	e965d46184	[Misc] Keep only one implementation of the create_dummy_prompt function. (#4716 )	2024-05-09 21:42:38 -07:00
youkaichao	208b71bcc1	[Core][Distributed] refactor pynccl (#4591 ) [Core][Distributed] refactor pynccl to hold multiple communicators (#4591)	2024-05-09 19:48:43 -07:00
Cody Yu	c833101740	[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535 )	2024-05-09 18:04:17 -06:00
Philipp Moritz	379da6dcb5	[Kernel] [FP8] Improve FP8 linear layer performance (#4691 ) This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance. Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization: qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16); qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16); qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16); qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16); qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)	2024-05-09 16:38:07 -07:00
Hao Zhang	ebce310b74	[Model] Snowflake arctic model implementation (#4652 ) Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com> Co-authored-by: Aurick Qiao <qiao@aurick.net> Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com> Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-09 22:37:14 +00:00
Michael Goin	be0c5180ac	[Bugfix] Add logs for all model dtype casting (#4717 )	2024-05-09 18:36:25 +00:00
Robert Shaw	cea64430f6	[Bugfix] Update grafana.json (#4711 )	2024-05-09 10:10:13 -07:00
Cyrus Leung	a3c124570a	[Bugfix] Fix CLI arguments in OpenAI server docs (#4709 )	2024-05-09 09:53:14 -07:00
kliuae	ff5abcd746	[ROCm] Add support for Punica kernels on AMD GPUs (#3140 ) Co-authored-by: miloice <jeffaw99@hotmail.com>	2024-05-09 09:19:50 -07:00
Woosuk Kwon	0ee535b294	[Misc] Set block size at initialization & Fix test_model_runner (#4705 )	2024-05-09 09:04:59 -07:00
Woosuk Kwon	190bc838e1	[Misc] Remove unnecessary ModelRunner imports (#4703 )	2024-05-09 00:17:17 -07:00
Cyrus Leung	f12b20decc	[Frontend] Move async logic outside of constructor (#4674 )	2024-05-08 22:48:33 -07:00
Mahmoud Ashraf	16bc0a098f	[Frontend] add tok/s speed metric to llm class when using tqdm (#4400 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-05-08 22:02:31 -07:00
alexm-nm	e288df0632	[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626 )	2024-05-08 17:14:31 -07:00
Cade Daniel	8b9241be3a	[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (#4672 )	2024-05-08 23:24:46 +00:00
Cody Yu	f942efb5a3	[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-08 21:44:00 +00:00
Woosuk Kwon	89579a201f	[Misc] Use vllm-flash-attn instead of flash-attn (#4686 )	2024-05-08 13:15:34 -07:00
youkaichao	230c4b38c1	[CI/Test] fix swap test for multi gpu (#4689 )	2024-05-08 13:14:02 -07:00
youkaichao	20cfcdec99	[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659 )	2024-05-08 12:07:05 -07:00
Antoni Baum	ad932a221d	[Core] Faster startup for LoRA enabled models (#4634 )	2024-05-08 10:33:18 -07:00
Woosuk Kwon	5510cf0e8a	[Misc] Add `get_name` method to attention backends (#4685 )	2024-05-08 09:59:31 -07:00
DefTruth	0f9a6e3d22	[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573 )	2024-05-08 09:19:58 -07:00
SangBin Cho	f6a593093a	[CI] Make mistral tests pass (#4596 )	2024-05-08 08:44:35 -07:00
SangBin Cho	d7740ea4dc	[Core] Optimize sampler get_logprobs (#4594 )	2024-05-08 08:42:28 -07:00
youkaichao	cc466a3290	[Core][Distributed] support cpu&device in broadcast tensor dict (#4660 ) [Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)	2024-05-07 19:34:47 -07:00
leiwen83	8344f7742b	[Bug fix][Core] fixup ngram not setup correctly (#4551 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-07 11:40:18 -07:00
youkaichao	469f85c782	[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648 )	2024-05-07 11:06:32 -07:00
Austin Veselka	10760da800	[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609 )	2024-05-07 10:59:07 -07:00
Alexei-V-Ivanov-AMD	478aed5827	[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642 )	2024-05-07 09:23:17 -07:00
youkaichao	63575bc2e1	[Core][Optimization] change python dict to pytorch tensor (#4607 )	2024-05-06 21:30:27 -07:00
Philipp Moritz	a98187cf72	[Kernel] Make static FP8 scaling more robust (#4570 ) Previously FP8 static scaling works if the scales are overestimating the maxima of all activation tensors during computation. However this will not always be the case even if the scales were calibrated very carefully. For example, with the activations in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU: \| Groups \|Version\|Filter\|n-shot\|Metric\|Value \| \|Stderr\| \|------------------\|-------\|------\|-----:\|------\|-----:\|---\|-----:\| \|mmlu \|N/A \|none \| 0\|acc \|0.2295\|± \|0.0035\| \| - humanities \|N/A \|none \| 5\|acc \|0.2421\|± \|0.0062\| \| - other \|N/A \|none \| 5\|acc \|0.2398\|± \|0.0076\| \| - social_sciences\|N/A \|none \| 5\|acc \|0.2171\|± \|0.0074\| \| - stem \|N/A \|none \| 5\|acc \|0.2125\|± \|0.0073\| With the fix in this PR where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is \| Groups \|Version\|Filter\|n-shot\|Metric\|Value \| \|Stderr\| \|------------------\|-------\|------\|-----:\|------\|-----:\|---\|-----:\| \|mmlu \|N/A \|none \| 0\|acc \|0.7008\|± \|0.0036\| \| - humanities \|N/A \|none \| 5\|acc \|0.6453\|± \|0.0065\| \| - other \|N/A \|none \| 5\|acc \|0.7692\|± \|0.0072\| \| - social_sciences\|N/A \|none \| 5\|acc \|0.8083\|± \|0.0070\| \| - stem \|N/A \|none \| 5\|acc \|0.6115\|± \|0.0083\| This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.	2024-05-06 17:39:28 -07:00
Noam Gat	bd99d22629	Update lm-format-enforcer to 0.10.1 (#4631 )	2024-05-06 23:51:59 +00:00
Cade Daniel	19cb4716ee	[CI] Add retry for agent lost (#4633 )	2024-05-06 23:18:57 +00:00
Simon Mo	e186d37cb1	[CI] use ccache actions properly in release workflow (#4629 )	2024-05-06 22:23:36 +00:00
Cyrus Leung	323f27b904	[Bugfix] Fix `asyncio.Task` not being subscriptable (#4623 )	2024-05-06 09:31:05 -07:00
zhaoyang-star	0650e5935b	Disable cuda version check in vllm-openai image (#4530 )	2024-05-05 16:58:55 -07:00
Simon Mo	c7f2cf2b7f	[CI] Reduce wheel size by not shipping debug symbols (#4602 ) v0.4.2	2024-05-04 21:28:58 -07:00
Simon Mo	8d8357c8ed	bump version to v0.4.2 (#4600 )	2024-05-04 17:09:49 -07:00
DearPlanet	4302987069	[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937 )	2024-05-04 15:39:34 -07:00
Simon Mo	021b1a2ab7	[CI] check size of the wheels (#4319 )	2024-05-04 20:44:36 +00:00
Michael Goin	2a052011ca	[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527 ) Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436. This PR enables the following checkpoint loading features for Mixtral: Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model; Supports static or dynamic activation quantization with static weight quantization (all per tensor); Supports different scales for each expert weight; Supports Fp8 in QKV layer. Notes: The Expert Gate/Router always runs at half / full precision for now. If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.	2024-05-04 11:45:16 -07:00
SangBin Cho	36fb68f947	[Doc] Chunked Prefill Documentation (#4580 )	2024-05-04 00:18:00 -07:00
Cody Yu	bc8ad68455	[Misc][Refactor] Introduce ExecuteModelData (#4540 )	2024-05-03 17:47:07 -07:00
youkaichao	344bf7cd2d	[Misc] add installation time env vars (#4574 )	2024-05-03 15:55:56 -07:00
Cade Daniel	ab50275111	[Speculative decoding] Support target-model logprobs (#4378 )	2024-05-03 15:52:01 -07:00
Lily Liu	43c413ec57	[Kernel] Use flashinfer for decoding (#4353 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>	2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck	f8e7adda21	Fix/async chat serving (#2727 )	2024-05-03 11:04:14 -07:00
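
Commit a98187cf72 in the list above describes clamping scaled activations to the finite range of Float8_e4m3fn before the cast, which its message says avoids NaNs when a static scale underestimates the activation maximum. The following is a minimal pure-Python sketch of that idea, not the vLLM kernel: the function name and the list-based representation are hypothetical, and the only outside fact assumed is that 448.0 is the largest finite float8_e4m3fn value.

```python
# Illustrative sketch only; the names here are hypothetical, not vLLM APIs.
FP8_E4M3FN_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def scale_and_clamp(activations, scale):
    """Divide by a static scale, then clamp into the finite FP8 range.

    Without the clamp, a scaled value whose magnitude exceeds the FP8
    maximum has no finite FP8 representation once it is cast.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    clamped = []
    for x in activations:
        scaled = x / scale
        clamped.append(max(-FP8_E4M3FN_MAX, min(FP8_E4M3FN_MAX, scaled)))
    return clamped


if __name__ == "__main__":
    # The scale here underestimates the true maximum, so two values saturate.
    print(scale_and_clamp([0.5, -1000.0, 1000.0], scale=1.0))
```

Values already inside the range pass through unchanged, so the clamp only affects activations that the calibrated scale failed to cover.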

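Commit 2a052011ca in the list above notes that when separate Q, K and V weights carry different scales, they are re-quantized using layer.weight_scale.max() so that a single GEMM can be used. A rough sketch of that step, assuming per-tensor scales and ignoring FP8 rounding entirely; the helper name and the list-of-lists layout are hypothetical and do not mirror vLLM's code.

```python
# Illustrative sketch only; requantize_to_max_scale is a hypothetical helper.
def requantize_to_max_scale(quantized_shards, scales):
    """Re-express each shard using the largest per-tensor scale.

    quantized_shards: list of lists of quantized values (plain floats here).
    scales: one positive per-tensor scale per shard.
    A real FP8 path would also round to representable values; omitted here.
    """
    if len(quantized_shards) != len(scales):
        raise ValueError("need exactly one scale per shard")
    max_scale = max(scales)
    requantized = []
    for shard, scale in zip(quantized_shards, scales):
        # Dequantize with the shard's own scale, then quantize with the shared one.
        requantized.append([q * scale / max_scale for q in shard])
    return requantized, max_scale


if __name__ == "__main__":
    shards, shared = requantize_to_max_scale([[8.0, -4.0], [2.0]], [0.5, 1.0])
    print(shards, shared)
```

Because the shared scale is the maximum, each re-quantized magnitude is no larger than the original quantized value, so the result stays inside whatever range the original values occupied.
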

1314 Commits