Commit Graph

1705 Commits

Author SHA1 Message Date
82079729cc [Bugfix] Fix assertion in NeuronExecutor (#5841) 2024-06-25 19:52:10 -07:00
c2a8ac75e0 [CI/Build] Add E2E tests for MLPSpeculator (#5791)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-26 00:04:08 +00:00
f178e56c68 [Hardware][TPU] Raise errors for unsupported sampling params (#5850) 2024-06-25 16:58:23 -07:00
dd793d1de5 [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) 2024-06-25 15:56:15 -07:00
bc34937d68 [Hardware][TPU] Refactor TPU backend (#5831) 2024-06-25 15:25:52 -07:00
dd248f7675 [Misc] Update w4a16 compressed-tensors support to include w8a16 (#5794) 2024-06-25 19:23:35 +00:00
d9b34baedd [CI/Build] Add unit testing for FlexibleArgumentParser (#5798) 2024-06-25 12:18:03 -07:00
c18ebfdd71 [doc][distributed] add both gloo and nccl tests (#5834) 2024-06-25 15:10:28 -04:00
67882dbb44 [Core] Add fault tolerance for RayTokenizerGroupPool (#5748) 2024-06-25 10:15:10 -07:00
7b99314301 [Misc] Remove useless code in cpu_worker (#5824) 2024-06-25 09:41:36 -07:00
2ce5d6688b [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) 2024-06-25 09:56:06 +00:00
f23871e9ee [Doc] Add notice about breaking changes to VLMs (#5818) 2024-06-25 01:25:03 -07:00
e9de9dd551 [ci] Remove aws template (#5757)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-24 21:09:02 -07:00
ba991d5c84 [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (#5795) 2024-06-24 17:01:19 -06:00
1744cc99ba [Doc] Add Phi-3-medium to list of supported models (#5788) 2024-06-24 10:48:55 -07:00
e72dc6cb35 [Doc] Add "Suggest edit" button to doc pages (#5789) 2024-06-24 10:26:17 -07:00
c246212952 [doc][faq] add warning to download models for every nodes (#5783) 2024-06-24 15:37:42 +08:00
edd5fe5fa2 [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772) 2024-06-24 12:11:53 +08:00
5d4d90536f [Distributed] Add send and recv helpers (#5719) 2024-06-23 14:42:28 -07:00
6c916ac8a8 [BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-23 21:07:11 +00:00
832ea88fcb [core][distributed] improve shared memory broadcast (#5754) 2024-06-22 10:00:43 -07:00
8c00f9c15d [Docs][TPU] Add installation tip for TPU (#5761) 2024-06-21 23:09:40 -07:00
0cbc1d2b4f [Bugfix] Fix pin_lora error in TPU executor (#5760) 2024-06-21 22:25:14 -07:00
ff9ddbceee [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py (#5756) 2024-06-22 03:33:12 +00:00
9c62db07ed [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (#5710)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-22 02:07:08 +00:00
cf90ae0123 [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616) 2024-06-21 17:09:34 -07:00
f5dda63eb5 [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) 2024-06-21 15:42:46 -07:00
7187507301 [ci][test] fix ca test in main (#5746) 2024-06-21 14:04:26 -07:00
f1e72cc19a [BugFix] exclude version 1.15.0 for modelscope (#5668) 2024-06-21 13:15:48 -06:00
5b15bde539 [Doc] Documentation on supported hardware for quantization methods (#5745) 2024-06-21 12:44:29 -04:00
bd620b01fb [Kernel][CPU] Add Quick gelu to CPU (#5717) 2024-06-21 06:39:40 +00:00
d9a252bc8e [Core][Distributed] add shm broadcast (#5399)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
67005a07bc [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
c35e4a3dd7 [BugFix] Fix test_phi3v.py (#5725) 2024-06-21 04:45:34 +00:00
1f5674218f [Kernel] Add punica dimension for Qwen2 LoRA (#5441) 2024-06-20 17:55:41 -07:00
b12518d3cf [Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
6c5b7af152 [distributed][misc] use fork by default for mp (#5669) 2024-06-20 17:06:34 -07:00
8065a7e220 [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
3f3b6b2150 [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) 2024-06-20 18:36:10 +00:00
a7dcc62086 [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 13:33:21 +00:00
ad137cd111 [Model] Port over CLIPVisionModel for VLMs (#5591) 2024-06-20 11:52:09 +00:00
111af1fa2c [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 06:37:08 +00:00
1b2eaac316 [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703) 2024-06-19 23:10:47 -07:00
3730a1c832 [Misc] Improve conftest (#5681) 2024-06-19 19:09:21 -07:00
949e49a685 [ci] Limit num gpus if specified for A100 (#5694)
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-19 16:30:03 -07:00
4a30d7e3cc [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) 2024-06-19 18:06:44 -04:00
e83db9e7e3 [Doc] Update docker references (#5614)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-06-19 15:01:45 -07:00
78687504f7 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
d571ca0108 [ci][distributed] add tests for custom allreduce (#5689) 2024-06-19 20:16:04 +00:00
afed90a034 [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) 2024-06-19 14:41:42 -04:00