youngkingdom/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Roy	c1c0d00b88	Don't use cupy when `enforce_eager=True` (#3037 )	2024-02-26 17:33:38 -08:00
Roy	d9f726c4d0	[Minor] Remove unused config files (#3039 )	2024-02-26 17:25:22 -08:00
Woosuk Kwon	d6e4a130b0	[Minor] Remove gather_cached_kv kernel (#3043 )	2024-02-26 15:00:54 -08:00
Philipp Moritz	cfc15a1031	Optimize Triton MoE Kernel (#2979 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-02-26 13:48:56 -08:00
Jared Moore	70f3e8e3a1	Add LogProbs for Chat Completions in OpenAI (#2918 )	2024-02-26 10:39:34 +08:00
Harry Mellor	ef978fe411	Port metrics from `aioprometheus` to `prometheus_client` (#2730 )	2024-02-25 11:54:00 -08:00
Woosuk Kwon	f7c1234990	[Fix] Fix assertion on YaRN model len (#2984 )	2024-02-23 12:57:48 -08:00
zhaoyang-star	57f044945f	Fix nvcc not found in vllm-openai image (#2781 )	2024-02-22 14:25:07 -08:00
Ronen Schaffer	4caf7044e0	Include tokens from prompt phase in `counter_generation_tokens` (#2802 )	2024-02-22 14:00:12 -08:00
Woosuk Kwon	6f32cddf1c	Remove Flash Attention in test env (#2982 )	2024-02-22 09:58:29 -08:00
44670	c530e2cfe3	[FIX] Fix a bug in initializing Yarn RoPE (#2983 )	2024-02-22 01:40:05 -08:00
Woosuk Kwon	fd5dcc5c81	Optimize GeGLU layer in Gemma (#2975 )	2024-02-21 20:17:52 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820 )	2024-02-21 18:56:01 -08:00
Woosuk Kwon	95529e3253	Use Llama RMSNorm custom op for Gemma (#2974 )	2024-02-21 18:28:23 -08:00
Roy	344020c926	Migrate MistralForCausalLM to LlamaForCausalLM (#2868 )	2024-02-21 18:25:05 -08:00
Mustafa Eyceoz	5574081c49	Added early stopping to completion APIs (#2939 )	2024-02-21 18:24:01 -08:00
Ronen Schaffer	d7f396486e	Update comment (#2934 )	2024-02-21 18:18:37 -08:00
Zhuohan Li	8fbd84bf78	Bump up version to v0.3.2 (#2968 ) This version is for more model support. Add support for Gemma models (#2964) and OLMo models (#2832). v0.3.2	2024-02-21 11:47:25 -08:00
Nick Hill	7d2dcce175	Support per-request seed (#2514 )	2024-02-21 11:47:00 -08:00
Woosuk Kwon	dc903e70ac	[ROCm] Upgrade transformers to v4.38.0 (#2967 )	2024-02-21 09:46:57 -08:00
Zhuohan Li	a9c8212895	[FIX] Add Gemma model to the doc (#2966 )	2024-02-21 09:46:15 -08:00
Woosuk Kwon	c20ecb6a51	Upgrade transformers to v4.38.0 (#2965 )	2024-02-21 09:38:03 -08:00
Xiang Xu	5253edaacb	Add Gemma model (#2964 )	2024-02-21 09:34:30 -08:00
Antoni Baum	017d9f1515	Add metrics to RequestOutput (#2876 )	2024-02-20 21:55:57 -08:00
Antoni Baum	181b27d881	Make vLLM logging formatting optional (#2877 )	2024-02-20 14:38:55 -08:00
Zhuohan Li	63e2a6419d	[FIX] Fix beam search test (#2930 )	2024-02-20 14:37:39 -08:00
James Whedbee	264017a2bf	[ROCm] include gfx908 as supported (#2792 )	2024-02-19 17:58:59 -08:00
Ronen Schaffer	e433c115bc	Fix `vllm:prompt_tokens_total` metric calculation (#2869 )	2024-02-18 23:55:41 -08:00
Simon Mo	86fd8bb0ac	Add warning to prevent changes to benchmark api server (#2858 )	2024-02-18 21:36:19 -08:00
Isotr0py	ab3a5a8259	Support OLMo models. (#2832 )	2024-02-18 21:05:15 -08:00
Zhuohan Li	a61f0521b8	[Test] Add basic correctness test (#2908 )	2024-02-18 16:44:50 -08:00
Zhuohan Li	537c9755a7	[Minor] Small fix to make distributed init logic in worker look cleaner (#2905 )	2024-02-18 14:39:00 -08:00
Mark Mozolewski	786b7f18a5	Add code-revision config argument for Hugging Face Hub (#2892 )	2024-02-17 22:36:53 -08:00
jvmncs	8f36444c4f	multi-LoRA as extra models in OpenAI server (#2775 ) How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)): ```terminal $ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/ $ python -m vllm.entrypoints.api_server \ --model meta-llama/Llama-2-7b-hf \ --enable-lora \ --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH ``` The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs. No work has been done here to scope client permissions to specific models.	2024-02-17 12:00:48 -08:00
Nick Hill	185b2c29e2	Defensively copy `sampling_params` (#2881 ) If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request. Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059	2024-02-17 11:18:04 -08:00
Woosuk Kwon	5f08050d8d	Bump up to v0.3.1 (#2887 ) v0.3.1	2024-02-16 15:05:18 -08:00
shiyi.c_98	64da65b322	Prefix Caching- fix t4 triton error (#2517 )	2024-02-16 14:17:55 -08:00
Hongxia Yang	5255d99dc5	[ROCm] Dockerfile fix for flash-attention build (#2885 )	2024-02-15 10:22:39 -08:00
Philipp Moritz	4f2ad11135	Fix DeciLM (#2883 )	2024-02-14 22:29:57 -08:00
Woosuk Kwon	d7afab6d3a	[BugFix] Fix GC bug for `LLM` class (#2882 )	2024-02-14 22:17:44 -08:00
Philipp Moritz	31348dff03	Align LoRA code between Mistral and Mixtral (fixes #2875 ) (#2880 ) * Fix AttributeError: MixtralModel object has no attribute org_vocab_size. * Make LoRA logic for Mistral and Mixtral the same --------- Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>	2024-02-15 01:00:43 +01:00
Woosuk Kwon	25e86b6a61	Don't use cupy NCCL for AMD backends (#2855 )	2024-02-14 12:30:44 -08:00
Roy	4efbac6d35	Migrate AquilaForCausalLM to LlamaForCausalLM (#2867 )	2024-02-14 12:30:24 -08:00
Nikola Borisov	87069ccf68	Fix docker python version (#2845 )	2024-02-14 10:17:57 -08:00
Woosuk Kwon	7e45107f51	[Fix] Fix memory profiling when GPU is used by multiple processes (#2863 )	2024-02-13 19:52:34 -08:00
Philipp Moritz	0c48b37c31	Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2861 )	2024-02-13 18:01:15 -08:00
Philipp Moritz	7eacffd951	Migrate InternLMForCausalLM to LlamaForCausalLM (#2860 ) Co-authored-by: Roy <jasonailu87@gmail.com>	2024-02-13 17:12:05 -08:00
Terry	2a543d6efe	Add LoRA support for Mixtral (#2831 ) * add mixtral lora support * formatting * fix incorrectly ported logic * polish tests * minor fixes and refactoring * minor fixes * formatting * rename and remove redundant logic * refactoring * refactoring * minor fix * minor refactoring * fix code smell	2024-02-14 00:55:45 +01:00
Philipp Moritz	317b29de0f	Remove Yi model definition, please use `LlamaForCausalLM` instead (#2854 ) Co-authored-by: Roy <jasonailu87@gmail.com>	2024-02-13 14:22:22 -08:00
Woosuk Kwon	a463c333dd	Use CuPy for CUDA graphs (#2811 )	2024-02-13 11:32:06 -08:00

787 Commits