f1e15da6fe  2024-07-05 10:37:09 -07:00
    [Frontend] Continuous usage stats in OpenAI completion API (#5742)

69ec3ca14c  2024-07-04 16:35:51 -07:00
    [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
    Co-authored-by: Simon Mo <simon.mo@hey.com>

3dd507083f  2024-07-03 18:58:18 -07:00
    [CI/Build] Cleanup VLM tests (#6107)

62963d129e  2024-07-03 22:50:08 +00:00
    [ Misc ] Clean Up CompressedTensorsW8A8 (#6113)

d9e98f42e4  2024-07-03 22:14:16 +00:00
    [vlm] Remove vision language config. (#6089)
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>

47f0954af0  2024-07-03 17:38:00 +00:00
    [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975)

d18bab3587  2024-07-02 21:31:25 -07:00
    [CI] Fix base url doesn't strip "/" (#6087)

9831aec49f  2024-07-02 20:34:00 -07:00
    [Core] Dynamic image size support for VLMs (#5276)
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
    Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
    Co-authored-by: ywang96 <ywang@roblox.com>
    Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

482045ee77  2024-07-02 20:12:22 -07:00
    [hardware][misc] introduce platform abstraction (#6080)

9d6a8daa87  2024-07-02 23:11:29 +00:00
    [Model] Jamba support (#4115)
    Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
    Co-authored-by: Erez Schwartz <erezs@ai21.com>
    Co-authored-by: Mor Zusman <morz@ai21.com>
    Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
    Co-authored-by: Tomer Asida <tomera@ai21.com>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
    Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>

ee93f4f92a  2024-07-02 22:25:17 +00:00
    [CORE] Quantized lm-head Framework (#4442)
    Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
    Co-authored-by: ZX <zx@lbx.dev>

7c008c51a9  2024-07-02 21:54:35 +00:00
    [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970)
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

4d26d806e1  2024-07-02 20:14:22 +00:00
    Update conftest.py (#6076)

c5832d2ae9  2024-07-02 10:58:08 -07:00
    [Core] Pipeline Parallel Support (#4412)
    Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>

15aba081f3  2024-07-02 07:20:29 -07:00
    [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050)
    Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
    Co-authored-by: Sirej Dua <Sirej Dua>

98d6682cd1  2024-07-02 07:57:09 +00:00
    [VLM] Remove image_input_type from VLM config (#5852)
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>

3476ed0809  2024-07-01 20:10:37 -07:00
    [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602)

12a59959ed  2024-07-01 21:08:29 +00:00
    [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029)

80ca1e6a3a  2024-07-01 00:33:05 -07:00
    [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348)

614aa51203  2024-06-30 20:07:34 -07:00
    [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)

af9ad46fca  2024-06-30 23:06:27 +00:00
    [ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (#5940)
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>

f5e73c9f1b  2024-06-30 17:11:15 +00:00
    [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909)
    Co-authored-by: sang <sangcho@anyscale.com>

c6c240aa0a  2024-06-30 23:53:00 +08:00
    [Frontend]: Support base64 embedding (#5935)
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

2be6955a3f  2024-06-30 08:06:13 +00:00
    [ci][distributed] fix device count call
    [ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991)

9d47f64eb6  2024-06-30 12:58:49 +08:00
    [CI/Build] [3/3] Reorganize entrypoints tests (#5966)

cff6a1fec1  2024-06-30 11:44:25 +08:00
    [CI/Build] Reuse code for checking output consistency (#5988)

9def10664e  2024-06-29 12:47:58 -07:00
    [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949)

99397da534  2024-06-29 15:45:54 +00:00
    [CI/Build] Add TP test for vision models (#5892)

8dbfcd35bf  2024-06-29 21:12:58 +08:00
    [ CI/Build ] Added E2E Test For Compressed Tensors (#5839)
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>

51e971d39e  2024-06-29 11:19:02 +00:00
    [Bugfix] Support eos_token_id from config.json (#5954)

580353da93  2024-06-29 03:10:21 +00:00
    [Bugfix] Fix precisions in Gemma 1 (#5913)

ba4994443a  2024-06-29 10:48:25 +08:00
    [Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
    Signed-off-by: Joe Runde <joe@joerun.de>

906a19cdb0  2024-06-29 10:36:06 +08:00
    [Misc] Extend vLLM Metrics logging API (#5925)
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

7041de4384  2024-06-28 15:28:49 -07:00
    [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628)
    Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>

6a2d659d28  2024-06-28 17:10:34 +00:00
    [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)

b2c620230a  2024-06-28 09:17:51 -07:00
    [Spec Decode] Introduce DraftModelRunner (#5799)

b90d8cd832  2024-06-28 15:20:22 +00:00
    [Distributed] Make it clear that % should not be in tensor dict keys. (#5927)
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>

3b752a6555  2024-06-28 07:59:18 -07:00
    [CI/Build] [2/3] Reorganize entrypoints tests (#5904)

57f09a419c  2024-06-28 13:50:16 +00:00
    [Hardware][Intel] OpenVINO vLLM backend (#5379)

5cbe8d155c  2024-06-28 12:09:56 +00:00
    [Core] Registry for processing model inputs (#5214)
    Co-authored-by: ywang96 <ywang@roblox.com>

736ed38849  2024-06-27 11:43:04 -07:00
    [CI/Build] Fix Args for _get_logits_warper in Sampler Test (#5922)

e9d32d077d  2024-06-27 12:43:17 +00:00
    [CI/Build] [1/3] Reorganize entrypoints tests (#5526)

d12af207d2  2024-06-27 15:15:24 +08:00
    [VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly (#5880)
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>

c54269d967  2024-06-26 16:54:22 +00:00
    [Frontend] Add tokenize/detokenize endpoints (#5054)

5bfd1bbc98  2024-06-26 15:16:00 +00:00
    [Kernel] Adding bias epilogue support for cutlass_scaled_mm (#5560)
    Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
    Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

6984c02a27  2024-06-26 01:02:34 -07:00
    [CI/Build] Refactor image test assets (#5821)

515080ad2f  2024-06-25 21:56:02 -07:00
    [bugfix][distributed] fix shm broadcast when the queue size is full (#5801)

dda4811591  2024-06-25 20:30:03 -07:00
    [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408)
    Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
    Signed-off-by: Stephanie <swang@anyscale.com>
    Co-authored-by: Stephanie <swang@anyscale.com>

c2a8ac75e0  2024-06-26 00:04:08 +00:00
    [CI/Build] Add E2E tests for MLPSpeculator (#5791)
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

dd793d1de5  2024-06-25 15:56:15 -07:00
    [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422)