[Lora] Support long context lora (#4787)

Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through.

It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files
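
Below is a minimal sketch of the idea, assuming simple linear RoPE scaling and hypothetical class/parameter names (this is not the actual vLLM kernel or layer API): one cos-sin cache is built per scaling factor, the caches are concatenated into a single buffer, and each token carries an offset into that buffer, so a single batched call can rotate queries and keys for requests that use different long-context LoRA scaling factors.

```python
# Hypothetical sketch, not the vLLM implementation: all names below are
# illustrative. One cos-sin cache per RoPE scaling factor, concatenated into
# a single buffer and indexed per token via an offset.
import torch


def build_cos_sin_cache(head_dim: int, max_len: int, base: float,
                        scaling_factor: float) -> torch.Tensor:
    """Linear RoPE scaling: positions are divided by the scaling factor."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    t = torch.arange(int(max_len * scaling_factor), dtype=torch.float32) / scaling_factor
    freqs = torch.outer(t, inv_freq)
    return torch.cat([freqs.cos(), freqs.sin()], dim=-1)  # [seq_len, head_dim]


class BatchedRotaryEmbedding(torch.nn.Module):
    def __init__(self, head_dim: int, max_len: int, base: float,
                 scaling_factors: list[float]):
        super().__init__()
        caches, self.offsets, offset = [], {}, 0
        for factor in scaling_factors:
            cache = build_cos_sin_cache(head_dim, max_len, base, factor)
            self.offsets[factor] = offset  # where this factor's cache starts
            offset += cache.shape[0]
            caches.append(cache)
        # All per-factor caches live in one buffer, so one lookup serves them all.
        self.register_buffer("cos_sin_cache", torch.cat(caches, dim=0))

    @staticmethod
    def _rotate(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

    def forward(self, positions: torch.Tensor, query: torch.Tensor,
                key: torch.Tensor, factors: list[float]):
        # Each token indexes cos/sin at (its factor's cache offset + position),
        # so tokens from different LoRAs share one batched lookup.
        off = torch.tensor([self.offsets[f] for f in factors],
                           dtype=torch.long, device=positions.device)
        cos, sin = self.cos_sin_cache[off + positions].chunk(2, dim=-1)
        return self._rotate(query, cos, sin), self._rotate(key, cos, sin)


# Example: two tokens, each belonging to a LoRA with a different scaling factor.
rope = BatchedRotaryEmbedding(head_dim=8, max_len=16, base=10000.0,
                              scaling_factors=[1.0, 4.0])
q = k = torch.randn(2, 8)
positions = torch.tensor([3, 5])
q_rot, k_rot = rope(positions, q, k, factors=[1.0, 4.0])
```

In the actual change, the batched kernel performs this indexed rotation on the GPU, and the replacement rotary embedding layer owns the per-scaling-factor caches.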
Author: SangBin Cho
Date: 2024-05-18 16:05:23 +09:00 (committed by GitHub)
Commit: 2e9a2227ec (parent c0724fc915)
25 changed files with 998 additions and 71 deletions


@@ -112,7 +112,7 @@ mypy vllm/model_executor --config-file pyproject.toml
CODESPELL_EXCLUDES=(
-'--skip' '*docs/source/_build/**'
+'--skip' '*docs/source/_build/**,./tests/lora/data'
)
# check spelling of specified files
@@ -133,10 +133,9 @@ spell_check_changed() {
# `diff-filter=ACM` and $MERGEBASE is to ensure we only lint files that
# exist on both branches.
MERGEBASE="$(git merge-base origin/main HEAD)"
if ! git diff --diff-filter=ACM --quiet --exit-code "$MERGEBASE" -- '*.py' '*.pyi' &>/dev/null; then
git diff --name-only --diff-filter=ACM "$MERGEBASE" -- '*.py' '*.pyi' | xargs \
codespell "${CODESPELL_EXCLUDES[@]}"
codespell "${CODESPELL_EXCLUDES[@]}"
fi
}