Commit Graph

21 Commits

Author SHA1 Message Date
e46a60aa4c [BugFix] Fix handling of stop strings and stop token ids (#3672) 2024-04-11 15:34:12 -07:00
63e7176f26 [Core][Refactor] move parallel_utils into vllm/distributed (#3950)
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
2024-04-10 15:33:30 -07:00
e5043a3e75 [Misc] Add pytest marker to opt-out of global test cleanup (#3863) 2024-04-04 21:54:16 -07:00
eb69d68804 [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783) 2024-04-02 00:49:51 +00:00
26422e477b [Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
b51c1cc9d2 [2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
64172a976c [Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
01bfb22b41 [CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
fb96c1e98c Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
c0c2335ce0 Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
e433c115bc Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
a61f0521b8 [Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
4aaafdd289 [BUGFIX] Fix the path of test prompts (#2273) 2023-12-26 10:37:21 -08:00
f1c8520146 [BugFix] Fix input positions for long context with sliding window (#2088) 2023-12-13 12:28:13 -08:00
5ffc0d13a2 Migrate linter from pylint to ruff (#1665) 2023-11-20 11:58:01 -08:00
9d9072a069 Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
ba0bfd40e2 TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
c9927c1a6a Use queue for finished requests (#957) 2023-09-05 19:27:23 -07:00
002800f081 Align vLLM's beam search implementation with HF generate (#857) 2023-09-04 17:29:42 -07:00
32b6816e55 Add tests for models (#922) 2023-09-01 11:19:43 +09:00