youngkingdom/vllm - vllm

Author	SHA1	Message	Date
Aaron Pham	21063c11c7	[CI/Build] drop support for Python 3.8 EOL (#8464 ) Signed-off-by: Aaron Pham <contact@aarnphm.xyz>	2024-11-06 07:11:55 +00:00
Cyrus Leung	bbc3619dc8	[Core] Make encoder-decoder inputs a nested structure to be more composable (#9604 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-11-05 10:07:31 +08:00
youkaichao	4fdc581f9e	[core] simplify seq group code (#9569 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-10-24 00:16:44 -07:00
Cody Yu	d11bf435a0	[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510 )	2024-10-18 14:30:55 -07:00
Cyrus Leung	051eaf6db3	[Model] Add user-configurable task for models that support both generation and embedding (#9424 )	2024-10-18 11:31:58 -07:00
Kuntai Du	81ede99ca4	[Core] Deprecating block manager v1 and make block manager v2 default (#8704 ) Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).	2024-10-17 11:38:15 -05:00
sroy745	f3a507f1d3	[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149 )	2024-10-10 14:17:17 +08:00
youkaichao	18b296fdb2	[core] remove beam search from the core (#9105 )	2024-10-07 05:47:04 +00:00
Varun Sundar Rabindranath	cb3b2b9ba4	[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-10-06 12:48:11 -07:00
sroy745	5bf8789b2a	[Bugfix] Block manager v2 with preemption and lookahead slots (#8824 )	2024-09-29 09:17:45 +08:00
sroy745	fc3afc20df	Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (#8752 )	2024-09-24 21:26:36 -07:00
sroy745	ee777d9c30	Fix test_schedule_swapped_simple in test_scheduler.py (#8780 )	2024-09-24 21:26:18 -07:00
sroy745	88577ac928	Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728 )	2024-09-24 04:43:13 +00:00
Cody Yu	e3580537a4	[Performance] Enable chunked prefill and prefix caching together (#7753 )	2024-08-28 00:36:31 -07:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049 ) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
Cody Yu	2deb029d11	[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822 )	2024-08-26 11:24:53 -07:00
Cody Yu	3ac50b47d0	[MISC] Add prefix cache hit rate to metrics (#7606 )	2024-08-19 11:52:07 -07:00
SangBin Cho	ff7ec82c4d	[Core] Optimize SPMD architecture with delta + serialization optimization (#7109 )	2024-08-18 17:57:20 -07:00
Cade Daniel	baa240252e	[Core] Fix edge case in chunked prefill + block manager v2 (#7380 )	2024-08-09 23:48:49 +00:00
Zach Zheng	782e53ab59	[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849 )	2024-08-08 10:43:30 -07:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942 ) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
youkaichao	c8a7e93273	[core][scheduler] simplify and improve scheduler (#6867 )	2024-07-31 23:51:09 -07:00
Jiaxin Shan	42c7f66a38	[Core] Support dynamically loading Lora adapter from HuggingFace (#6234 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-22 15:42:40 -07:00
Antoni Baum	9ed82e7074	[Misc] Small perf improvements (#6520 )	2024-07-19 12:10:56 -07:00
Alexander Matveev	3476ed0809	[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602 )	2024-07-01 20:10:37 -07:00
Cyrus Leung	0e9164b40a	[mypy] Enable type checking for test directory (#5017 )	2024-06-15 04:45:31 +00:00
leiwen83	1b8a0d71cf	[Core][Bugfix]: fix prefix caching for blockv2 (#5364 ) Signed-off-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-06-14 17:23:56 -07:00
SangBin Cho	847cdcca1c	[CI] Upgrade codespell version. (#5381 )	2024-06-12 10:06:14 -07:00
Kaiyang Chen	10c38e3e46	[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834 )	2024-06-03 13:37:11 -07:00
Cyrus Leung	b1c255630d	[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099 )	2024-05-29 16:05:01 -07:00
afeldman-nm	4238bc82f2	[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837 )	2024-05-29 16:09:13 +00:00
Cyrus Leung	5ae5ed1e60	[Core] Consolidate prompt arguments to LLM engines (#4328 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-05-28 13:29:31 -07:00
Michał Moskal	d4f3985907	[Core] Sliding window for block manager v2 (#4545 ) Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>	2024-05-28 11:07:07 +09:00
leiwen83	e64fde4b01	[Core][Bugfix]: fix prefix caching for blockv2 (#4764 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-05-24 10:07:09 -07:00
SangBin Cho	e7c46b9527	[Scheduler] Warning upon preemption and Swapping (#4647 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-13 23:50:44 +09:00
Cyrus Leung	350f9e107f	[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425 ) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.	2024-05-13 23:50:09 +09:00
Robert Shaw	fcc2994be6	[CI] Nits for bad initialization of SeqGroup in testing (#4748 )	2024-05-10 18:01:01 -04:00
youkaichao	20cfcdec99	[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659 )	2024-05-08 12:07:05 -07:00
youkaichao	469f85c782	[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648 )	2024-05-07 11:06:32 -07:00
youkaichao	63575bc2e1	[Core][Optimization] change python dict to pytorch tensor (#4607 )	2024-05-06 21:30:27 -07:00
SangBin Cho	0f8a91401c	[Core] Ignore infeasible swap requests. (#4557 )	2024-05-02 14:31:20 -07:00
leiwen83	24750f4cad	[Core] Enable prefix caching with block manager v2 enabled (#4142 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Sage Moore <sagemoore@utexas.edu>	2024-05-01 11:20:32 -07:00
SangBin Cho	050f285ff6	[Core] Scheduling optimization 2 (#4280 )	2024-04-23 08:02:11 +00:00
SangBin Cho	ad8d696a99	[Core] Scheduler perf fix (#4270 )	2024-04-22 21:11:06 +00:00
Cade Daniel	e95cd87959	[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894 )	2024-04-16 13:09:21 -07:00
SangBin Cho	67b4221a61	[Core][5/N] Fully working chunked prefill e2e (#3884 )	2024-04-10 17:56:48 -07:00
Cade Daniel	e7c7067b45	[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837 )	2024-04-09 11:44:15 -07:00
SangBin Cho	18de883489	[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853 )	2024-04-05 10:17:58 -07:00
SangBin Cho	3dcb3e8b98	[3/N] Refactor scheduler for chunked prefill scheduling (#3550 )	2024-04-03 14:13:49 -07:00
Cade Daniel	eb69d68804	[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783 )	2024-04-02 00:49:51 +00:00

61 Commits