8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-13 23:50:44 +09:00
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the ServerRunner class defined in the tests is identical to the one in the OpenAI API test. I have moved the class into the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)
Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relatively.
2024-05-13 23:50:09 +09:00
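The relative-import arrangement described above can be illustrated with a small, self-contained sketch. The file and class names below (ServerRunner, tests/entrypoints/test_server.py) are hypothetical stand-ins, not the actual vLLM test files; the point is that the __init__.py files are what make the `from ..utils import ...` relative import resolve.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate the described layout in a temporary directory:
#   tests/__init__.py             (empty; makes `tests` a package)
#   tests/utils.py                (shared test helpers, e.g. ServerRunner)
#   tests/entrypoints/__init__.py (makes the subdirectory a subpackage)
#   tests/entrypoints/test_server.py
root = Path(tempfile.mkdtemp())
(root / "tests").mkdir()
(root / "tests" / "__init__.py").write_text("")
(root / "tests" / "utils.py").write_text("class ServerRunner:\n    pass\n")
(root / "tests" / "entrypoints").mkdir()
(root / "tests" / "entrypoints" / "__init__.py").write_text("")
# The relative import below only works because both __init__.py files exist.
(root / "tests" / "entrypoints" / "test_server.py").write_text(
    "from ..utils import ServerRunner\n"
)

sys.path.insert(0, str(root))
mod = importlib.import_module("tests.entrypoints.test_server")
assert mod.ServerRunner.__name__ == "ServerRunner"
```

Without the __init__.py files, the `from ..utils import ServerRunner` line would raise an ImportError, since Python would not treat the directories as packages.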
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-10 17:36:25 +00:00
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing an exception frame is extremely prone to circular references, because the frame holds references to live objects.
When tensorizer is not installed, the llm instance leaks: the error frame references various modules, which creates a circular reference problem.
I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
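The weakref.proxy technique mentioned in this fix can be sketched minimally as follows. The Engine/Worker classes are hypothetical stand-ins for the mutually-referencing objects, not the actual vLLM classes:

```python
import weakref

class Worker:
    def __init__(self, engine):
        # A strong back-reference here would create a cycle
        # (Engine -> Worker -> Engine), so the engine could only be
        # reclaimed by the cyclic garbage collector, or never if the
        # cycle involves uncollectable state.
        # weakref.proxy behaves like the engine but does not keep it alive.
        self.engine = weakref.proxy(engine)

class Engine:
    def __init__(self):
        self.worker = Worker(self)

engine = Engine()
ref = weakref.ref(engine)  # observer to check whether the engine was freed
del engine
# With no strong cycle, plain reference counting frees the engine
# immediately; no gc.collect() is needed.
assert ref() is None
```

Accessing `worker.engine` after the engine has been freed raises ReferenceError, which is usually the desired behavior for a back-reference that should never outlive its owner.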
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-08 21:44:00 +00:00
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-07 11:40:18 -07:00
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics ( #3937 )
2024-05-04 15:39:34 -07:00
2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) ( #4527 )
...
Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.
This PR enables the following checkpoint loading features for Mixtral:
- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer
Notes:
- The expert gate/router always runs at half/full precision for now.
- If the separate Q, K, and V weights have different weight scales, they are re-quantized using layer.weight_scale.max(), so a single GEMM can be used for performance.
2024-05-04 11:45:16 -07:00
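The re-quantization note above can be sketched with toy numbers. This is a simplified illustration, not the vLLM kernel code: the scales and integer weight codes are made up, and real FP8 tensors are handled on the GPU, but the arithmetic of dequantizing with each tensor's own scale and re-quantizing with the shared maximum scale is the same idea:

```python
# Hypothetical per-tensor scales for separately quantized Q, K, V weights.
q_scale, k_scale, v_scale = 0.02, 0.01, 0.04
max_scale = max(q_scale, k_scale, v_scale)

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3

def requantize(quantized_weights, old_scale, new_scale):
    """Dequantize with the tensor's own scale, re-quantize with the shared one."""
    out = []
    for w_q in quantized_weights:
        w = w_q * old_scale                # dequantize to a real value
        w_new = round(w / new_scale)       # re-quantize with the shared scale
        # Clamp to the representable FP8 E4M3 range.
        out.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w_new)))
    return out

q_weights = [100, -36, 12]                 # toy integer weight codes
q_requant = requantize(q_weights, q_scale, max_scale)
# Doubling the scale (0.02 -> 0.04) halves the codes: [50, -18, 6].
```

Using the maximum scale preserves the dynamic range of every tensor, at the cost of some precision for the tensors whose original scales were smaller; in exchange, the fused QKV weight needs only one scale and therefore one GEMM.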
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
ab50275111
[Speculative decoding] Support target-model logprobs ( #4378 )
2024-05-03 15:52:01 -07:00
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
2024-05-03 15:51:27 -07:00
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel ( #4405 )
...
Co-authored-by: SangBin Cho <rkooo567@gmail.com >
2024-05-02 11:23:37 -07:00
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
2a85f93007
[Core][Distributed] enable multiple tp group ( #4512 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-02 04:28:21 +00:00
5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics ( #4524 )
2024-05-01 19:57:12 -07:00
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Sage Moore <sagemoore@utexas.edu >
2024-05-01 11:20:32 -07:00
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-01 11:13:03 -07:00
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-04-30 16:28:46 -07:00
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-04-30 21:46:12 +00:00