Commit Graph

1284 Commits

Author SHA1 Message Date
10760da800 [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609) 2024-05-07 10:59:07 -07:00
478aed5827 [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642) 2024-05-07 09:23:17 -07:00
63575bc2e1 [Core][Optimization] change python dict to pytorch tensor (#4607) 2024-05-06 21:30:27 -07:00
a98187cf72 [Kernel] Make static FP8 scaling more robust (#4570)
Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors encountered during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activation scales in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following essentially chance-level performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs. With that change, the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet, but it gets very close to the FP16 / dynamic-activation-scale performance.
2024-05-06 17:39:28 -07:00
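For reference, a minimal PyTorch sketch of the saturating quantization described in the commit above. This is an illustration only, not vLLM's actual implementation (the real fix is in the CUDA quantization kernel); it assumes torch >= 2.1 for float8 support, and the helper name is made up:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def static_scaled_fp8_quant(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Divide by the pre-calibrated (static) scale, then clamp to the
    # representable FP8 range. Without the clamp, activations that exceed
    # the calibrated maximum overflow and poison the downstream GEMM with
    # NaNs, which is what produced the chance-level MMLU numbers above.
    scaled = x.float() / scale
    return scaled.clamp(min=-FP8_MAX, max=FP8_MAX).to(torch.float8_e4m3fn)
```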
bd99d22629 Update lm-format-enforcer to 0.10.1 (#4631) 2024-05-06 23:51:59 +00:00
19cb4716ee [CI] Add retry for agent lost (#4633) 2024-05-06 23:18:57 +00:00
e186d37cb1 [CI] use ccache actions properly in release workflow (#4629) 2024-05-06 22:23:36 +00:00
323f27b904 [Bugfix] Fix asyncio.Task not being subscriptable (#4623) 2024-05-06 09:31:05 -07:00
0650e5935b Disable cuda version check in vllm-openai image (#4530) 2024-05-05 16:58:55 -07:00
c7f2cf2b7f [CI] Reduce wheel size by not shipping debug symbols (#4602) v0.4.2 2024-05-04 21:28:58 -07:00
8d8357c8ed bump version to v0.4.2 (#4600) 2024-05-04 17:09:49 -07:00
4302987069 [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) 2024-05-04 15:39:34 -07:00
021b1a2ab7 [CI] check size of the wheels (#4319) 2024-05-04 20:44:36 +00:00
2a052011ca [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
Follow-on to #4332, enabling FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Static or dynamic activation quantization combined with static weight quantization (all per-tensor)
- Different scales for each expert weight
- FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half / full precision for now.
- If the separate QKV weights have different weight scales, they are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance (see the sketch after this entry).
2024-05-04 11:45:16 -07:00
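A hedged sketch of the QKV re-quantization note above, assuming each weight is stored per-tensor in FP8 with its own scale (all names here are illustrative, not vLLM's API):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def merge_qkv_scales(weights, scales):
    # weights: FP8 tensors for Q, K, V; scales: their per-tensor scales.
    # Re-quantize everything onto max(scales) so the fused QKV projection
    # can run as a single FP8 GEMM with one shared weight scale.
    max_scale = torch.stack(scales).max()
    merged = []
    for w, s in zip(weights, scales):
        dequant = w.to(torch.float32) * s            # back to full precision
        requant = (dequant / max_scale).clamp(-FP8_MAX, FP8_MAX)
        merged.append(requant.to(torch.float8_e4m3fn))
    return torch.cat(merged, dim=0), max_scale
```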
36fb68f947 [Doc] Chunked Prefill Documentation (#4580) 2024-05-04 00:18:00 -07:00
bc8ad68455 [Misc][Refactor] Introduce ExecuteModelData (#4540) 2024-05-03 17:47:07 -07:00
344bf7cd2d [Misc] add installation time env vars (#4574) 2024-05-03 15:55:56 -07:00
ab50275111 [Speculative decoding] Support target-model logprobs (#4378) 2024-05-03 15:52:01 -07:00
43c413ec57 [Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
f8e7adda21 Fix/async chat serving (#2727) 2024-05-03 11:04:14 -07:00
7e65477e5e [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) 2024-05-03 10:32:21 -07:00
3521ba4f25 [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
2d7bce9cd5 [Doc] add env vars to the doc (#4572) 2024-05-03 05:13:49 +00:00
ce3f1eedf8 [Misc] remove chunk detected debug logs (#4571) 2024-05-03 04:48:08 +00:00
808632d3b4 [BugFix] Prevent the task of _force_log from being garbage collected (#4567) 2024-05-03 01:35:18 +00:00
344a5d0c33 [Core][Distributed] enable allreduce for multiple tp groups (#4566) 2024-05-02 17:32:33 -07:00
0f8a91401c [Core] Ignore infeasible swap requests. (#4557) 2024-05-02 14:31:20 -07:00
9b5c9f9484 [CI/Build] AMD CI pipeline with extended set of tests. (#4267)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-05-02 12:29:07 -07:00
32881f3f31 [kernel] fix sliding window in prefix prefill Triton kernel (#4405)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
5b8a7c1cb0 [Misc] centralize all usage of environment variables (#4548) 2024-05-02 11:13:25 -07:00
1ff0c73a79 [BugFix] Include target-device specific requirements.txt in sdist (#4559) 2024-05-02 10:52:51 -07:00
5ad60b0cbd [Misc] Exclude the tests directory from being packaged (#4552) 2024-05-02 10:50:25 -07:00
fb087af52e [mypy][7/N] Cover all directories (#4555) 2024-05-02 10:47:41 -07:00
7038e8b803 [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) 2024-05-02 12:56:22 -04:00
2a85f93007 [Core][Distributed] enable multiple tp group (#4512)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
cf8cac8c70 [mypy][6/N] Fix all the core subdirectory typing (#4450)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-02 03:01:00 +00:00
5e401bce17 [CI]Add regression tests to ensure the async engine generates metrics (#4524) 2024-05-01 19:57:12 -07:00
0d62fe58db [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451) 2024-05-01 19:24:13 -07:00
b8afa8b95a [MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) 2024-05-01 17:34:40 -07:00
826b82a260 [Misc] Fix expert_ids shape in MoE (#4517) 2024-05-01 23:47:59 +00:00
c9d852d601 [Misc] Remove Mixtral device="cuda" declarations (#4543)
Remove the device="cuda" declarations in mixtral as promised in #4343
2024-05-01 16:30:52 -07:00
6ef09b08f8 [Core][Distributed] fix pynccl del error (#4508) 2024-05-01 15:23:06 -07:00
Roy 3a922c1e7e [Bugfix][Core] Fix and refactor logging stats (#4336) 2024-05-01 20:08:14 +00:00
c47ba4aaa9 [Bugfix] Add validation for seed (#4529) 2024-05-01 19:31:22 +00:00
24bb4fe432 [Kernel] Update fused_moe tuning script for FP8 (#4457)
This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. Note that I removed num_warps and num_stages from the configurations for small batch sizes, since that improved performance and brought the benchmarks in that regime on par with the previous numbers, making this a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency 
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
2024-05-01 11:47:38 -07:00
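For context, the tuning script emits per-batch-size Triton launch parameters for the fused_moe kernel. A hedged sketch of what one tuned configuration might look like (values are illustrative, not the actual tuned numbers; the real configs live under vllm/model_executor/layers/fused_moe/configs/):

```python
# Keyed by batch size. Per the commit message above, the small-batch
# entries deliberately omit num_warps/num_stages.
FUSED_MOE_CONFIG = {
    1: {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32,
        "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 1},
    64: {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64,
         "BLOCK_SIZE_K": 64, "GROUP_SIZE_M": 8,
         "num_warps": 4, "num_stages": 4},
}
```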
a657bfc48a [Core] Add multiproc_worker_utils for multiprocessing-based workers (#4357) 2024-05-01 18:41:59 +00:00
24750f4cad [Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
b38e42fbca [Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
8b798eec75 [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-05-01 18:01:50 +00:00
69909126a7 [Bugfix] Use random seed if seed is -1 (#4531) 2024-05-01 10:41:17 -07:00