Commit Graph

477 Commits

Author SHA1 Message Date
06e9ebebd5 Add instructions to install vLLM+cu118 (#1717) 2023-11-18 23:48:58 -08:00
c5f7740d89 Bump up to v0.2.2 (#1689) v0.2.2 2023-11-18 21:57:07 -08:00
be66d9b125 Fix warning msg on quantization (#1715) 2023-11-18 21:49:55 -08:00
e1054247ba [Optimization] Implement fused add rmsnorm (#1667) 2023-11-18 18:18:02 -08:00
8d17774f92 Add AWQ support for all models (#1714) 2023-11-18 17:56:47 -08:00
e946260cf3 use get_tensor in safe_open (#1696) 2023-11-18 16:45:18 -08:00
edb305584b Support download models from www.modelscope.cn (#1588) 2023-11-17 20:38:31 -08:00
bb00f66e19 Use quantization_config in hf config (#1695) 2023-11-17 16:23:49 -08:00
Roy
e87557b069 Support Min P Sampler (#1642) 2023-11-17 16:20:49 -08:00
dcc543a298 [Minor] Fix comment (#1704) 2023-11-17 09:42:49 -08:00
0fc280b06c Update the adding-model doc according to the new refactor (#1692) 2023-11-16 18:46:26 -08:00
20d0699d49 [Fix] Fix comm test (#1691) 2023-11-16 16:28:39 -08:00
686f5e3210 Return usage for openai streaming requests (#1663) 2023-11-16 15:28:36 -08:00
415d109527 [Fix] Update Supported Models List (#1690) 2023-11-16 14:47:26 -08:00
521b35f799 Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
cb08cd0d75 [Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
2a2c135b41 Fix loading error when safetensors contains empty tensor (#1687) 2023-11-16 10:38:10 -08:00
65ea2ddf17 feat(config): support parsing torch.dtype (#1641)
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2023-11-16 01:31:06 -08:00
b514d3c496 Revert MptConfig to MPTConfig (#1668) 2023-11-16 01:19:39 -08:00
7076fa1c9f TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
660a7fcfa4 Add DeepSpeed MII backend to benchmark script (#1649) 2023-11-14 12:35:30 -08:00
054072bee5 [Minor] Move RoPE selection logic to get_rope (#1633) 2023-11-12 16:04:50 -08:00
eb825c1e74 Fix #1474 - AssertionError:assert param_slice.shape == loaded_weight.shape (#1631) 2023-11-12 15:53:12 -08:00
1b290ace4f Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628) 2023-11-11 14:50:44 -08:00
Sin
0d578228ca config parser: add ChatGLM2 seq_length to _get_and_verify_max_len (#1617) 2023-11-09 19:29:51 -08:00
aebfcb262a Dockerfile: Upgrade Cuda to 12.1 (#1609) 2023-11-09 11:49:02 -08:00
ab9e8488d5 Add Yi model to quantization support (#1600) 2023-11-09 11:47:14 -08:00
fd58b73a40 Build CUDA11.8 wheels for release (#1596) 2023-11-09 03:52:29 -08:00
8efe23f150 Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) 2023-11-08 14:19:12 -08:00
06458a0b42 Upgrade to CUDA 12 (#1527)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-08 14:17:49 -08:00
1a2bbc9301 ChatGLM Support (#1261) 2023-11-06 16:09:33 -08:00
Roy
e7f579eb97 Support Yi model (#1567) 2023-11-06 15:26:03 -08:00
8516999495 Add Quantization and AutoAWQ to docs (#1235) 2023-11-04 22:43:39 -07:00
9f669a9a7c Support YaRN models (#1264)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
555bdcc5a3 Added logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
54ca1ba71d docs: add description (#1553) 2023-11-03 09:14:52 -07:00
9738b84a08 Force paged attention v2 for long contexts (#1510) 2023-11-01 16:24:32 -07:00
1fe0990023 Remove MPTConfig (#1529) 2023-11-01 15:29:05 -07:00
7e90a2d117 Add /health Endpoint for both Servers (#1540) 2023-11-01 10:29:44 -07:00
5687d584fe [BugFix] Set engine_use_ray=True when TP>1 (#1531) 2023-11-01 02:14:18 -07:00
cf8849f2d6 Add MptForCausalLM key in model_loader (#1526) 2023-10-31 15:46:53 -07:00
e575df33b1 [Small] Formatter only checks lints in changed files (#1528) 2023-10-31 15:39:38 -07:00
0ce8647dc5 Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
9cabcb7645 Add Dockerfile (#1350) 2023-10-31 12:36:47 -07:00
7b895c5976 [Fix] Fix duplicated logging messages (#1524) 2023-10-31 09:04:47 -07:00
7013a80170 Add support for spaces_between_special_tokens 2023-10-30 16:52:56 -07:00
79a30912b8 Add py.typed so consumers of vLLM can get type checking (#1509)
* Add py.typed so consumers of vLLM can get type checking

* Update py.typed

---------
Co-authored-by: aarnphm <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-30 14:50:47 -07:00
2f3d36a8a1 Fix logging so we actually get info level entries in the log. (#1494) 2023-10-30 10:02:21 -07:00
ac8d36f3e5 Refactor LLMEngine demo script for clarity and modularity (#1413)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-30 09:14:37 -07:00
15f5632365 Delay GPU->CPU sync in sampling (#1337) 2023-10-30 09:01:34 -07:00