1863 Commits

SHA1 Message Date
fc17110bbe [BugFix]: set outlines pkg version (#6262) 2024-07-11 04:37:11 +00:00
439c84581a [Doc] Update description of vLLM support for CPUs (#6003) 2024-07-10 21:15:29 -07:00
99ded1e1c4 [Doc] Remove comments incorrectly copied from another project (#6286) 2024-07-10 17:05:26 -07:00
997df46a32 [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor (#6313) 2024-07-10 16:39:02 -07:00
ae151d73be [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765) 2024-07-10 16:02:47 -07:00
44cc76610d [Bugfix] Fix OpenVINOExecutor abstractmethod error (#6296)
Signed-off-by: sangjune.park <sangjune.park@navercorp.com>
2024-07-10 10:03:32 -07:00
b422d4961a [CI/Build] Enable mypy typing for remaining folders (#6268) 2024-07-10 22:15:55 +08:00
c38eba3046 [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. (#6303)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-10 09:04:07 -04:00
e72ae80b06 [Bugfix] Support 2D input shape in MoE layer (#6287) 2024-07-10 09:03:16 -04:00
8a924d2248 [Doc] Guide for adding multi-modal plugins (#6205) 2024-07-10 14:55:34 +08:00
5ed3505d82 [Bugfix][TPU] Add prompt adapter methods to TPUExecutor (#6279) 2024-07-09 19:30:56 -07:00
da78caecfa [core][distributed] zmq fallback for broadcasting large objects (#6183)
[core][distributed] add zmq fallback for broadcasting large objects (#6183)
2024-07-09 18:49:11 -07:00
2416b26e11 [Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) 2024-07-09 18:34:02 -07:00
d3a245138a [Bugfix]fix and needs_scalar_to_array logic check (#6238)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-07-09 23:43:24 +00:00
673dd4cae9 [Docs] Docs update for Pipeline Parallel (#6222)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-09 16:24:58 -07:00
4d6ada947c [CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
a0550cbc80 Add support for multi-node on CI (#5955)
Signed-off-by: kevin <kevin@anyscale.com>
2024-07-09 12:56:56 -07:00
08c5bdecae [Bugfix][TPU] Fix outlines installation in TPU Dockerfile (#6256) 2024-07-09 02:56:06 -07:00
5d5b4c5fe5 [Bugfix][TPU] Add missing None to model input (#6245) 2024-07-09 00:21:37 -07:00
70c232f85a [core][distributed] fix ray worker rank assignment (#6235) 2024-07-08 21:31:44 -07:00
a3c9435d93 [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability (#6216) 2024-07-08 20:02:15 -07:00
4f0e0ea131 Add FlashInfer to default Dockerfile (#6172) 2024-07-08 13:38:03 -07:00
ddc369fba1 [Bugfix] Mamba cache Cuda Graph padding (#6214) 2024-07-08 11:25:51 -07:00
185ad31f37 [Bugfix] use diskcache in outlines _get_guide #5436 (#6203) 2024-07-08 11:23:24 -07:00
543aa48573 [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
f7a8fa39d8 [Kernel] reloading fused_moe config on the last chunk (#6210) 2024-07-08 08:00:38 -07:00
717f4bcea0 Feature/add benchmark testing (#5947)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-08 07:52:06 +00:00
16620f439d do not exclude object field in CompletionStreamResponse (#6196) 2024-07-08 10:32:57 +08:00
3b08fe2b13 [misc][frontend] log all available endpoints (#6195)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-07-07 15:11:12 -07:00
abfe705a02 [ Misc ] Support Fp8 via llm-compressor (#6110)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
333306a252 add benchmark for fix length input and output (#5857)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-07 07:42:13 +00:00
6206dcb29e [Model] Add PaliGemma (#5189)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
9389380015 [Doc] Move guide for multimodal model and other improvements (#6168) 2024-07-06 17:18:59 +08:00
175c43eca4 [Doc] Reorganize Supported Models by Type (#6167) 2024-07-06 05:59:36 +00:00
bc96d5c330 Move release wheel env var to Dockerfile instead (#6163) 2024-07-05 17:19:53 -07:00
f0250620dd Fix release wheel build env var (#6162) 2024-07-05 16:24:31 -07:00
2de490d60f Update wheel builds to strip debug (#6161) 2024-07-05 14:51:25 -07:00
79d406e918 [Docs] Fix readthedocs for tag build (#6158) (tag: v0.5.1) 2024-07-05 12:44:40 -07:00
abad5746a7 bump version to v0.5.1 (#6157) 2024-07-05 12:04:51 -07:00
e58294ddf2 [Bugfix] Add verbose error if scipy is missing for blocksparse attention (#5695) 2024-07-05 10:41:01 -07:00
f1e15da6fe [Frontend] Continuous usage stats in OpenAI completion API (#5742) 2024-07-05 10:37:09 -07:00
0097bb1829 [Bugfix] Use templated datasource in grafana.json to allow automatic imports (#6136)
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
2024-07-05 09:49:47 -07:00
ea4b570483 [VLM] Cleanup validation and update docs (#6149) 2024-07-05 05:49:38 +00:00
a41357e941 [VLM] Improve consistency between feature size calculation and dummy data for profiling (#6146) 2024-07-05 09:29:47 +08:00
ae96ef8fbd [VLM] Calculate maximum number of multi-modal tokens by model (#6121) 2024-07-04 16:37:23 -07:00
69ec3ca14c [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
81d7a50f24 [Hardware][Intel CPU] Adding intel openmp tunings in Docker file (#6008)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2024-07-04 15:22:12 -07:00
27902d42be [misc][doc] try to add warning for latest html (#5979) 2024-07-04 09:57:09 -07:00
56b325e977 [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention (#6043)
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
2024-07-03 22:19:38 -07:00
3dd507083f [CI/Build] Cleanup VLM tests (#6107) 2024-07-03 18:58:18 -07:00