| Commit | Message | Date |
| --- | --- | --- |
| 1ef0d2efd0 | [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) | 2024-09-13 17:01:11 -07:00 |
| 18e9e1f7b3 | [HotFix] Fix final output truncation with stop string + streaming (#8468) | 2024-09-13 11:31:12 -07:00 |
| a84e598e21 | [CI/Build] Reorganize models tests (#7820) | 2024-09-13 10:20:06 -07:00 |
| a2469127db | [misc][ci] fix quant test (#8449) | 2024-09-13 17:20:14 +08:00 |
| 9b4a3b235e | [CI/Build] Enable InternVL2 PP test only on single node (#8437) | 2024-09-13 06:35:20 +00:00 |
| 6821020109 | [Bugfix] Fix async log stats (#8417) | 2024-09-12 20:48:59 -07:00 |
| 8427550488 | [CI/Build] Update pixtral tests to use JSON (#8436) | 2024-09-13 03:47:52 +00:00 |
| 40c396533d | [Bugfix] Mapping physical device indices for e2e test utils (#8290) | 2024-09-13 11:06:28 +08:00 |
| 5ec9c0fb3c | [Core] Factor out input preprocessing to a separate class (#7329) | 2024-09-13 02:56:13 +00:00 |
| d31174a4e1 | [Hotfix][Pixtral] Fix multiple images bugs (#8415) | 2024-09-12 15:21:51 -07:00 |
| b61bd98f90 | [CI/Build] Disable multi-node test for InternVL2 (#8428) | 2024-09-12 15:05:35 -07:00 |
| 551ce01078 | [Core] Add engine option to return only deltas or final output (#7381) | 2024-09-12 12:02:00 -07:00 |
| a6c0f3658d | [multi-step] add flashinfer backend (#7928) | 2024-09-12 11:16:22 -07:00 |
| f2e263b801 | [Bugfix] Offline mode fix (#8376); Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> | 2024-09-12 11:11:57 -07:00 |
| c6202daeed | [Model] Support multiple images for qwen-vl (#8247); Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>; Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>; Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2024-09-12 10:10:54 -07:00 |
| e56bf27741 | [Bugfix] Fix InternVL2 inference with various num_patches (#8375); Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2024-09-12 10:10:35 -07:00 |
| 7de49aa86c | [torch.compile] hide slicing under custom op for inductor (#8384) | 2024-09-12 00:11:55 -07:00 |
| f842a7aff1 | [misc] remove engine_use_ray (#8126) | 2024-09-11 18:23:36 -07:00 |
| a65cb16067 | [MISC] Dump model runner inputs when crashing (#8305) | 2024-09-12 01:12:25 +00:00 |
| d394787e52 | Pixtral (#8377); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-09-11 14:41:55 -07:00 |
| 775f00f81e | [Speculative Decoding] Test refactor (#8317); Co-authored-by: youkaichao <youkaichao@126.com> | 2024-09-11 14:07:34 -07:00 |
| 73202dbe77 | [Kernel][Misc] register ops to prevent graph breaks (#6917); Co-authored-by: Sage Moore <sage@neuralmagic.com> | 2024-09-11 12:52:19 -07:00 |
| 0b952af458 | [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) | 2024-09-11 09:46:46 -07:00 |
| 3b7fea770f | [Model][VLM] Add Qwen2-VL model support (#7905); Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>; Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2024-09-11 09:31:19 -07:00 |
| cea95dfb94 | [Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) | 2024-09-11 05:30:11 +00:00 |
| 6a512a00df | [model] Support for Llava-Next-Video model (#7559); Co-authored-by: Roger Wang <ywang@roblox.com>; Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>; Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> | 2024-09-10 22:21:36 -07:00 |
| efcf946a15 | [Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) | 2024-09-11 00:38:40 -04:00 |
| 1230263e16 | [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (#8299) | 2024-09-11 10:11:01 +08:00 |
| 8c054b7a62 | [Frontend] Clean up type annotations for mistral tokenizer (#8314) | 2024-09-10 16:49:11 +00:00 |
| 6cd5e5b07e | [Misc] Fused MoE Marlin support for GPTQ (#8217) | 2024-09-09 23:02:52 -04:00 |
| c7cb5c3335 | [Misc] GPTQ Activation Ordering (#8135) | 2024-09-09 16:27:26 -04:00 |
| 08287ef675 | [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272) | 2024-09-09 10:45:11 -04:00 |
| cfe712bf1a | [CI/Build] Use python 3.12 in cuda image (#8133); Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> | 2024-09-07 13:03:16 -07:00 |
| e807125936 | [Model][VLM] Support multi-images inputs for InternVL2 models (#8201) | 2024-09-07 16:38:23 +08:00 |
| 9f68e00d27 | [Bugfix] Fix broken OpenAI tensorizer test (#8258) | 2024-09-07 08:02:39 +00:00 |
| ce2702a923 | [tpu][misc] fix typo (#8260) | 2024-09-06 22:40:46 -07:00 |
| 2f707fcb35 | [Model] Multi-input support for LLaVA (#8238) | 2024-09-07 02:57:24 +00:00 |
| 29f49cd6e3 | [Model] Allow loading from original Mistral format (#8168); Co-authored-by: Michael Goin <michael@neuralmagic.com> | 2024-09-06 17:02:05 -06:00 |
| 1447c97e75 | [CI/Build] Increasing timeout for multiproc worker tests (#8203) | 2024-09-06 11:51:03 -07:00 |
| e5cab71531 | [Frontend] Add --logprobs argument to benchmark_serving.py (#8191) | 2024-09-06 09:01:14 -07:00 |
| db3bf7c991 | [Core] Support load and unload LoRA in api server (#6566); Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> | 2024-09-05 18:10:33 -07:00 |
| 9da25a88aa | [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029); Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>; Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2024-09-05 12:48:10 +00:00 |
| 8685ba1a1e | Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) (#7860) | 2024-09-05 11:33:37 +00:00 |
| e39ebf5cf5 | [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) | 2024-09-05 05:12:26 +00:00 |
| e02ce498be | [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649); Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>; Co-authored-by: Kyle Mistele <kyle@constellate.ai> | 2024-09-04 13:18:13 -07:00 |
| 561d6f8077 | [CI] Change test input in Gemma LoRA test (#8163) | 2024-09-04 13:05:50 -07:00 |
| d1dec64243 | [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-09-04 11:57:54 -07:00 |
| 2ad2e5608e | [MISC] Consolidate FP8 kv-cache tests (#8131) | 2024-09-04 18:53:25 +00:00 |
| 855c262a6b | [Frontend] Multimodal support in offline chat (#8098) | 2024-09-04 05:22:17 +00:00 |
| 2be8ec6e71 | [Model] Add Ultravox support for multiple audio chunks (#7963) | 2024-09-04 04:38:21 +00:00 |