Commit Graph

184 Commits

Author SHA1 Message Date
21063c11c7 [CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
d2e80332a7 [Feature] Update benchmark_throughput.py to support image input (#9851)
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net>
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net>
2024-11-05 19:30:02 +00:00
9a5664d4a4 [Misc] Refactor benchmark_throughput.py (#9779)
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net>
Co-authored-by: Linkun Chen <lkchen@github.com>
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net>
2024-11-04 14:32:16 -08:00
ea4adeddc1 [Bugfix] Fix E2EL mean and median stats (#9984)
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com>
2024-11-04 09:37:58 +00:00
abbfb6134d [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint (#9837) 2024-10-30 18:15:56 -07:00
622b7ab955 [Hardware] using current_platform.seed_everything (#9785)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
32176fee73 [torch.compile] support moe models (#9632)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
fd0e2cfdb2 [Misc] Separate total and output tokens in benchmark_throughput.py (#8914) 2024-10-23 16:47:20 +00:00
65050a40e6 [Bugfix] Generate exactly input_len tokens in benchmark_throughput (#9592) 2024-10-22 17:45:35 -07:00
cb6fdaa0a0 [Misc] Make benchmarks use EngineArgs (#9529) 2024-10-22 15:40:38 -07:00
855e0e6f97 [Frontend][Misc] Goodput metric support (#9338) 2024-10-20 18:39:32 +00:00
7dbe738d65 [Misc] benchmark: Add option to set max concurrency (#9390)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-18 11:15:28 -07:00
d65049daab [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script (#9013)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-17 21:11:11 +00:00
81ede99ca4 [Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
7e7eae338d [Misc] Standardize RoPE handling for Qwen2-VL (#9250) 2024-10-16 13:56:17 +08:00
5d264f4ab8 pass ignore_eos parameter to all benchmark_serving calls (#9349) 2024-10-15 13:30:44 -07:00
94bf9ae4e9 [Misc] Fix sampling from sonnet for long context case (#9235) 2024-10-11 00:33:16 +00:00
f3a507f1d3 [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
18b296fdb2 [core] remove beam search from the core (#9105) 2024-10-07 05:47:04 +00:00
168cab6bbf [Frontend] API support for beam search (#9087)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
27302dd584 [Misc] Fix CI lint (#9085) 2024-10-04 16:07:54 -07:00
0cc566ca8f [Misc] Add random seed for prefix cache benchmark (#9081) 2024-10-04 21:58:57 +00:00
fbb74420e7 [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412) 2024-10-04 14:01:44 -07:00
22f5851b80 Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997) 2024-10-01 11:07:06 -07:00
e585b583a9 [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891) 2024-09-28 18:51:22 +00:00
0e088750af [MISC] Fix invalid escape sequence '\' (#8830)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2024-09-27 01:13:25 -07:00
3b00b9c26c [Core] renamePromptInputs and inputs (#8876) 2024-09-26 20:35:15 -07:00
4f1ba0844b Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810) 2024-09-25 10:36:26 -07:00
28e1299e60 rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
6da1ab6b41 [Core] Adding Priority Scheduling (#5958) 2024-09-24 19:50:50 -07:00
3185fb0cca Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" (#8750) 2024-09-24 05:45:20 +00:00
0250dd68c5 re-implement beam search on top of vllm core (#8726)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-09-23 22:08:12 -07:00
86e9c8df29 [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
0057894ef7 [Core] Rename PromptInputs and inputs(#8673) 2024-09-20 19:00:54 -07:00
855c8ae2c9 [MISC] remove engine_use_ray in benchmark_throughput.py (#8615) 2024-09-18 22:33:20 -07:00
c52ec5f034 [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) 2024-09-19 05:24:24 +00:00
9d104b5beb [CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
6ffa3f314c [CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
1b6de8352b [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) 2024-09-17 07:34:27 +00:00
8baa454937 [Misc] Move device options to a single place (#8322) 2024-09-11 13:25:58 -07:00
795b662cff Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241) 2024-09-06 20:18:16 -07:00
e5cab71531 [Frontend] Add --logprobs argument to benchmark_serving.py (#8191) 2024-09-06 09:01:14 -07:00
77d9e514a2 [MISC] Replace input token throughput with total token throughput (#8164)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-04 20:23:22 +00:00
d4db9f53c8 [Benchmark] Add --async-engine option to benchmark_throughput.py (#7964) 2024-09-03 20:57:41 -04:00
0c785d344d Add more percentiles and latencies (#7759) 2024-08-29 16:48:11 -07:00
345be0e244 [benchmark] Update TGI version (#7917) 2024-08-27 15:07:53 -07:00
2eedede875 [Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
9db93de20c [Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
d3b5b98021 [Misc] Enhance prefix-caching benchmark tool (#6568) 2024-08-22 09:32:02 -07:00
7937009a7e [Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00