youngkingdom/vllm - vllm

Author	SHA1	Message	Date
Robert Shaw	c777df79f7	[BugFix] Fix Memory Leak (#17567 ) Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>	2025-05-02 01:07:03 -07:00
Chen Zhang	81ecf425f0	[v1][Spec Decode] Make sliding window compatible with eagle prefix caching (#17398 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-04-30 18:25:53 +00:00
Alec	0be6d05b5e	[V1][Metrics] add support for kv event publishing (#16750 ) Signed-off-by: alec-flowers <aflowers@nvidia.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com>	2025-04-30 07:44:45 -07:00
Marko Rosenmueller	77073c77bc	[Core] Prevent side-channel attacks via cache salting (#17045 ) Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>	2025-04-30 20:27:21 +08:00
Lily Liu	20e489eaa1	[V1][Spec Decode] Make eagle compatible with prefix caching. (#17137 ) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2025-04-27 09:29:43 -07:00
Ning Xie	fd11a325b8	[MISC] rename interval to max_recent_requests (#14285 )	2025-04-26 16:59:18 +00:00
Nick Hill	df6f3ce883	[Core] Remove prompt string from engine core data structures (#17214 ) Signed-off-by: Nick Hill <nhill@redhat.com>	2025-04-25 23:41:05 -07:00
Mark McLoughlin	340d7b1b21	[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics (#16665 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-04-24 08:57:40 -07:00
Rui Qiao	c0dfd97519	[V1][PP] Optimization: continue scheduling prefill chunks (#17080 ) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2025-04-24 05:27:08 -07:00
Woosuk Kwon	c4ab9f3e71	[V1] Remove pre-allocation for KV cache (#16941 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-04-22 00:52:18 -07:00
Woosuk Kwon	3a0fba5cf4	[V1][Spec Decode] Handle draft tokens beyond max_model_len (#16087 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-04-21 12:38:50 -07:00
vie-serendipity	d9737ca1c6	[V1][Misc] stop update prefix cache stats when logs_stats is disabled (#16460 ) Signed-off-by: vie-serendipity <2733147505@qq.com>	2025-04-19 02:25:19 -07:00
Yihua Cheng	3408e47159	[P/D][V1] KV Connector API V1 (#15960 ) Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Signed-off-by: remi <remi@mistral.ai> Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>	2025-04-17 13:22:40 -07:00
Lily Liu	f49e5aff11	[V1][Spec Decode] KV cache slots for eagle heads (#16370 ) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2025-04-12 19:42:51 -07:00
Michael Goin	aa3b3d76e0	Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (#16447 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-04-11 08:09:52 +00:00
rongfu.leng	4716377fbc	[Feature] Estimate max-model-len use available KV cache memory (#16168 ) Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>	2025-04-08 19:12:51 -07:00
Michael Goin	8e5314a468	[V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill (#15837 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-04-07 23:24:07 -07:00
Roger Wang	f2ebb6f541	[V1] Scatter and gather placeholders in the model runner (#16076 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>	2025-04-08 10:43:41 +08:00
Roger Wang	af51d80fa1	Revert "[V1] Scatter and gather placeholders in the model runner" (#16075 )	2025-04-04 14:50:57 -07:00
Cyrus Leung	f5722a5052	[V1] Scatter and gather placeholders in the model runner (#15712 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-04-04 21:26:44 +00:00
Mark McLoughlin	a35a8a8392	[V1][Spec Decode] Avoid logging useless nan metrics (#16023 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-04-04 08:52:41 -07:00
Mark McLoughlin	a79cc68b3a	[V1][Metrics] Initial speculative decoding metrics (#15151 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-04-01 10:45:04 -07:00
Chen Zhang	3a5f0afcd2	[V1] Implement sliding window attention in kv_cache_manager (#14097 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-04-01 00:33:17 -07:00
Mark McLoughlin	f98a4920f9	[V1][Core] Remove unused speculative config from scheduler (#15818 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-03-31 19:15:21 +00:00
Cody Yu	54aa619459	[V1] Refactor num_computed_tokens logic (#15307 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-03-27 04:54:36 +00:00
marko	27df5199d9	Support SHA256 as hash function in prefix caching (#15297 ) Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>	2025-03-26 11:11:28 -07:00
Lu Fang	082ab86f5f	[V1] Support long_prefill_token_threshold in v1 scheduler (#15419 ) Signed-off-by: Lu Fang <lufang@fb.com>	2025-03-25 14:22:26 -07:00
Chen Zhang	93a00d7dde	[v1] Refactor KVCacheConfig (#14079 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-03-21 04:56:27 -07:00
Woosuk Kwon	0c6f5023c3	[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-03-20 17:50:43 -07:00
afeldman-nm	ef64044079	[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (#13949 )	2025-03-08 01:48:12 +00:00
Aaron Pham	80e9afb5bc	[V1][Core] Support for Structured Outputs (#12388 ) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-03-07 07:19:11 -08:00
Harry Mellor	cf069aa8aa	Update deprecated Python 3.8 typing (#13971 )	2025-03-02 17:34:51 -08:00
Chen Zhang	28943d36ce	[v1] Move block pool operations to a separate class (#13973 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-28 20:53:31 +00:00
Woosuk Kwon	cd4a72a28d	[V1][Spec decode] Move drafter to model runner (#13363 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-17 15:40:12 -08:00
Lily Liu	80f63a3966	[V1][Spec Decode] Ngram Spec Decode (#12193 ) Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2025-02-15 18:05:11 -08:00
Cody Yu	9206b3d7ec	[V1][PP] Run engine busy loop with batch queue (#13064 )	2025-02-15 03:59:01 -08:00
Mark McLoughlin	75e6e14516	[V1][Metrics] Add several request timing histograms (#12644 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-02-11 10:14:00 -05:00
Cody Yu	41c5dd45b9	[V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592 )	2025-02-11 08:27:25 +00:00
Woosuk Kwon	3243158336	[V1] Move KV block hashes from Request to KVCacheManager (#12922 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-07 19:14:10 -08:00
afeldman-nm	0630d4537a	[V1] Logprobs and prompt logprobs support (#9880 ) This PR is adding support for sample logprobs & prompt logprobs to vLLM v1. New behavior: - During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order. - In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized. - During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.) - Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer. Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2025-02-07 07:26:20 -08:00
Varun Sundar Rabindranath	467a96a541	[V1] LoRA Support (#10957 ) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2025-02-06 09:32:51 -08:00
Woosuk Kwon	18a88fcccc	[V1] Remove scheduling constraint on partial requests (#12674 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-04 02:43:58 -08:00
Cody Yu	5095e96606	[V1] Revert `uncache_blocks` and support recaching full blocks (#12415 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-03 15:04:53 -08:00
Russell Bryant	e489ad7a21	[Misc] Add SPDX-License-Identifier headers to python source files (#12628 ) - Add SPDX license headers to python source files - Check for SPDX headers using pre-commit commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745 Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:18:24 2025 -0500 Add SPDX license headers to python source files This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance. The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job. More information can be found on the SPDX site: - https://spdx.dev/learn/handling-license-info/ Signed-off-by: Russell Bryant <rbryant@redhat.com> commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:36:32 2025 -0500 Check for SPDX headers using pre-commit Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-02 11:58:18 -08:00
Shawn Du	f8ece6e17f	[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608 ) As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>	2025-02-02 16:40:58 +08:00
Chen Zhang	89003c4082	[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603 ) This pr adds extra key to block hash, to generate different hash value for two blocks with the same token string but different extra_keys in their parent blocks. For example, it can generate different hash value for the second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
--------- Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-31 13:13:04 -08:00
Cody Yu	f0ef37233e	[V1] Add `uncache_blocks` (#12333 )	2025-01-23 04:19:21 +00:00
Cody Yu	7206ce4ce1	[Core] Support `reset_prefix_cache` (#12284 )	2025-01-22 18:52:27 +00:00
Chen Zhang	994fc655b7	[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003 )	2025-01-15 07:55:30 +00:00
Roger Wang	91b361ae89	[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (#11685 ) Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-01-06 19:58:16 +00:00

60 Commits