264017a2bf
[ROCm] Include gfx908 as supported ( #2792 )
2024-02-19 17:58:59 -08:00
e433c115bc
Fix vllm:prompt_tokens_total metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
86fd8bb0ac
Add warning to prevent changes to benchmark API server ( #2858 )
2024-02-18 21:36:19 -08:00
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
537c9755a7
[Minor] Small fix to make distributed init logic in worker look cleaner ( #2905 )
2024-02-18 14:39:00 -08:00
786b7f18a5
Add code-revision config argument for Hugging Face Hub ( #2892 )
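A hedged sketch of how the new argument might be used from the Python API (the `code_revision` keyword and the model name are assumptions, mirroring the `--code-revision` CLI flag):
```python
# Sketch: pin the revision of remote modeling code fetched from the Hub.
# Assumes the option is exposed as `code_revision` on LLM/EngineArgs.
from vllm import LLM

llm = LLM(
    model="mosaicml/mpt-7b",   # hypothetical model that ships remote code
    trust_remote_code=True,
    code_revision="main",      # pin the *code* (not weights) to a branch/tag/commit
)
```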
2024-02-17 22:36:53 -08:00
8f36444c4f
Multi-LoRA as extra models in OpenAI server ( #2775 )
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py )):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.
No work has been done here to scope client permissions to specific models.
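A minimal sketch of querying such a server, assuming the OpenAI-compatible routes are exposed under `/v1` on the default port 8000:
```python
# Sketch: list the served models and send a completion to one LoRA module.
# Assumes the server above is running locally on the default port 8000.
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])  # base model, sql-lora, sql-lora2

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "sql-lora", "prompt": "SELECT ", "max_tokens": 16},
)
print(resp.json()["choices"][0]["text"])
```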
2024-02-17 12:00:48 -08:00
185b2c29e2
Defensively copy sampling_params ( #2881 )
...
If the SamplingParams object passed to LLMEngine.add_request() is mutated after the call returns, it could affect the async sampling process for that request.
Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
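A minimal sketch of the defensive-copy pattern (the names below are illustrative, not the exact vLLM internals):
```python
# Sketch: copy caller-supplied parameters at the API boundary so that
# later mutation by the caller cannot race the async sampling loop.
import copy
from dataclasses import dataclass

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 16

_pending: dict[str, SamplingParams] = {}

def add_request(request_id: str, sampling_params: SamplingParams) -> None:
    # Deep-copy before storing; the engine now owns an independent object.
    _pending[request_id] = copy.deepcopy(sampling_params)

params = SamplingParams(temperature=0.7)
add_request("req-0", params)
params.temperature = 0.0  # safe: the queued request is unaffected
```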
2024-02-17 11:18:04 -08:00
5f08050d8d
Bump up to v0.3.1 ( #2887 )
v0.3.1
2024-02-16 15:05:18 -08:00
64da65b322
Prefix Caching - fix T4 Triton error ( #2517 )
2024-02-16 14:17:55 -08:00
5255d99dc5
[ROCm] Dockerfile fix for flash-attention build ( #2885 )
2024-02-15 10:22:39 -08:00
4f2ad11135
Fix DeciLM ( #2883 )
2024-02-14 22:29:57 -08:00
d7afab6d3a
[BugFix] Fix GC bug for LLM class ( #2882 )
2024-02-14 22:17:44 -08:00
31348dff03
Align LoRA code between Mistral and Mixtral ( fixes #2875 ) ( #2880 )
...
* Fix `AttributeError: MixtralModel object has no attribute org_vocab_size`.
* Make LoRA logic for Mistral and Mixtral the same
---------
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-02-15 01:00:43 +01:00
25e86b6a61
Don't use CuPy NCCL for AMD backends ( #2855 )
2024-02-14 12:30:44 -08:00
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM ( #2867 )
2024-02-14 12:30:24 -08:00
87069ccf68
Fix Docker Python version ( #2845 )
2024-02-14 10:17:57 -08:00
7e45107f51
[Fix] Fix memory profiling when GPU is used by multiple processes ( #2863 )
2024-02-13 19:52:34 -08:00
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 ( #2861 )
2024-02-13 18:01:15 -08:00
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM ( #2860 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
2024-02-13 17:12:05 -08:00
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
317b29de0f
Remove Yi model definition; use LlamaForCausalLM instead ( #2854 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
2024-02-13 14:22:22 -08:00
a463c333dd
Use CuPy for CUDA graphs ( #2811 )
2024-02-13 11:32:06 -08:00
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
...
This reverts commit 5c976a7e1a .
2024-02-13 09:24:59 -08:00
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
f964493274
[CI] Ensure documentation build is checked in CI ( #2842 )
2024-02-12 22:53:07 -08:00
a4211a4dc3
Serving Benchmark Refactoring ( #2433 )
2024-02-12 22:53:00 -08:00
563836496a
Refactor 2 AWQ GEMM kernels into m16nXk32 ( #2723 )
...
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net >
2024-02-12 11:02:17 -08:00
4ca2c358b1
Add documentation section about LoRA ( #2834 )
2024-02-12 17:24:45 +01:00
0580aab02f
[ROCm] Support Radeon™ 7900 series (gfx1100) without using flash-attention ( #2768 )
2024-02-10 23:14:37 -08:00
3711811b1d
Disable custom all reduce by default ( #2808 )
2024-02-08 09:58:03 -08:00
65b89d16ee
[Ray] Compiled DAG integration off by default ( #2471 )
2024-02-08 09:57:25 -08:00
931746bc6d
Add documentation on how to do incremental builds ( #2796 )
2024-02-07 14:42:02 -08:00
c81dddb45c
[ROCm] Fix build problem resulting from previous commit related to FP8 KV-cache support ( #2790 )
2024-02-06 22:36:59 -08:00
fe6d09ae61
[Minor] More fixes for the test_cache.py CI test failure ( #2750 )
2024-02-06 11:38:38 -08:00
ed70c70ea3
modelscope: fix issue when the model parameter is not a model ID but a path to the model ( #2489 )
2024-02-06 09:57:15 -08:00
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
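For reference, an unfused PyTorch version of what the kernel computes (a sketch; shapes and renormalization details are illustrative):
```python
# Sketch: unfused reference for MoE routing that the kernel fuses into one pass.
# For each token: softmax over expert logits, then keep the top-k experts.
import torch

def topk_softmax_reference(router_logits: torch.Tensor, k: int):
    # router_logits: [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, k, dim=-1)
    return topk_weights, topk_ids

logits = torch.randn(4, 8)  # 4 tokens routed over 8 experts
weights, ids = topk_softmax_reference(logits, k=2)
```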
2024-02-05 17:38:02 -08:00
2ccee3def6
[ROCm] Fix up arch checks for ROCm ( #2627 )
2024-02-05 14:59:09 -08:00
b92adec8e8
Set local logging level via env variable ( #2774 )
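A sketch of the intended usage, assuming the variable is named `VLLM_LOGGING_LEVEL` (check the PR for the exact name):
```python
# Sketch: raise vLLM's log verbosity for this process only.
# Assumes the environment variable is named VLLM_LOGGING_LEVEL.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # must be set before importing vllm

from vllm import LLM  # the logging level is picked up at import time
```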
2024-02-05 14:26:50 -08:00
56f738ae9b
[ROCm] Fix failing unit tests for some kernels ( #2498 )
2024-02-05 14:25:36 -08:00
72d3a30c63
[Minor] Fix benchmark_latency script ( #2765 )
2024-02-05 12:45:37 -08:00
c9b45adeeb
Require triton >= 2.1.0 ( #2746 )
...
Co-authored-by: yangrui1 <yangrui@lanjingren.com >
2024-02-04 23:07:36 -08:00
5a6c81b051
Remove EOS tokens from output by default ( #2611 )
2024-02-04 14:32:42 -08:00
51cd22ce56
Set & get the LLM's internal tokenizer instead of the TokenizerGroup ( #2741 )
...
Co-authored-by: shujunhua1 <shujunhua1@jd.com >
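A hedged sketch of the resulting API (method names assumed from the title):
```python
# Sketch: read and replace the LLM's underlying Hugging Face tokenizer
# directly, rather than going through the internal TokenizerGroup wrapper.
from transformers import AutoTokenizer
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
tokenizer = llm.get_tokenizer()  # plain HF tokenizer, not a TokenizerGroup

# Swap in a customized tokenizer, e.g. one with added special tokens.
llm.set_tokenizer(AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"))
```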
2024-02-04 14:25:36 -08:00
5ed704ec8c
docs: fix langchain ( #2736 )
2024-02-03 18:17:55 -08:00
4abf6336ec
Add one example to run batch inference distributed on Ray ( #2696 )
2024-02-02 15:41:42 -08:00
0e163fce18
Fix default length_penalty to 1.0 ( #2667 )
2024-02-01 15:59:39 -08:00
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
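A sketch of the pattern (illustrative, not the actual worker code): thread a device argument through instead of hardcoding `"cuda"`:
```python
# Sketch: accept a device at construction time instead of hardcoding "cuda",
# so the same code path can run on CPU, XPU, and other backends.
import torch

class Worker:
    def __init__(self, device: str = "cuda") -> None:
        self.device = torch.device(device)

    def allocate(self, num_tokens: int, hidden_size: int) -> torch.Tensor:
        return torch.empty(num_tokens, hidden_size, device=self.device)
```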
2024-02-01 15:46:39 -08:00
c410f5d020
Use revision when downloading the quantization config file ( #2697 )
...
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
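A sketch of the underlying idea using `huggingface_hub` directly (the repo and filename below are hypothetical; this is not vLLM's internal code path):
```python
# Sketch: pass the requested revision through when fetching the quantization
# config, instead of always reading from the repo's default branch.
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-AWQ",  # hypothetical quantized model repo
    filename="quantize_config.json",    # hypothetical config filename
    revision="main",                    # the fix: honor the revision argument
)
```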
2024-02-01 15:41:58 -08:00