[ChatQnA] Switch to vLLM as default llm backend on Xeon#1403

Merged

chensuyue merged 24 commits into

opea-project:mainfrom

wangkl2:vllm-default

Jan 17, 2025

Collaborator

wangkl2 commented Jan 16, 2025 •

edited

Loading

Description

Switch to vLLM as the default LLM backend on Xeon for ChatQnA pipeline.

Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance the perf. Via benchmarking on Xeon server with vLLM and TGI backend for LLM component for different ISL/OSL and various number of queries and concurrency, the geomean of measured LLMServe perf on a 7B model shows perf improvement of vLLM over TGI on several metrics including average total latency, average TTFT, average TPOT and throughput. TGI is still offered as an option to deploy for LLM serving. Besides, vLLM LLM also replaces TGI LLM for other provided E2E ChatQnA pipelines including without-rerank pipeline, pinecone as the vectorDB, qdrant as the vectorDB. This PR also aligns the parameters of llm service in all chatqna test scripts with what in readme file.

Issues

Type of change

New feature (non-breaking change which adds new functionality)
Others (enhancement, documentation, validation, etc.)

Dependencies

n/a

Tests

TGI version: 2.4.0
vLLM version: 0.6.6.post2.dev151+gbd828722

Benchmark and compare the LLMServe perf on GNR server with OOB-vLLM and OOB-TGI backend via GenAIEval. The geomean perf of vLLM performs better than TGI for average total latency, average TTFT, average TPOT and throughput on 7B LLM with 4 sets of ISL/OSL (128/128, 128/1024, 1024/128, 1024/1024), measured on different num_queries and concurrency, including 32/8, 128/32.

wangkl2 and others added 15 commits

January 15, 2025 06:20


          [ChatQnA] Switch to vLLM as default llm backend on Xeon

725106a

Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance the perf. Via benchmarking on Xeon server with vLLM and TGI backend for LLM component for different ISL/OSL and various number of queries and concurrency, the geomean of measured LLMServe perf on a 7B model shows perf improvement of vLLM over TGI on several metrics including average total latency, average TTFT, average TPOT and throughput. TGI is still offered as an option to deploy for LLM serving. Besides, vLLM LLM also replaces TGI LLM for other provided E2E ChatQnA pipelines including without-rerank pipeline, pinecone as the vectorDB, qdrant as the vectorDB.

Implement opea-project#1213

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Use vllm llm backend for pinecone eg

f108812

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Update readme

31c11c6

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Use vllm llm backend for qdrant eg

9ba9897

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Update names of ut scripts

c319bb0

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Fix the vllm test script

937375a

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Update the wo-rerank test script

58f611a

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Update the pinecone test script

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Update the qdrant test script

df03057

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Align the function names and llm svc val parameters in all test scripts

29816b2

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Merge branch 'opea-project:main' into vllm-default

34a918a


          Update readme for descriptions of several deployment variants

a9dd2ae

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Merge branch 'vllm-default' of https://github.com/wangkl2/GenAIExamples…

09b6a7d

… into vllm-default


          fix test script isssue for docker start and stop

b268cd2

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          solve conflicts

1fec5ab

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>

wangkl2 requested review from letonghan and lvliang-intel as code owners

January 16, 2025 10:24

github-actions Bot commented Jan 16, 2025 •

edited

Loading

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

pre-commit-ci Bot and others added 9 commits

January 16, 2025 10:24


          [pre-commit.ci] auto fixes from pre-commit.com hooks

490026f

for more information, see https://pre-commit.ci


          Merge branch 'main' into vllm-default

4c27467


          Fix ci issues

b0bcc46

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Merge branch 'vllm-default' of https://github.com/wangkl2/GenAIExamples…

65c59b5

… into vllm-default


          Merge branch 'main' into vllm-default

e790569


          Fix ci issues

2156bd3

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Merge branch 'vllm-default' of https://github.com/wangkl2/GenAIExamples…

dbe5d35

… into vllm-default


          Fix ci issues

4d9c6dc

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>


          Merge branch 'main' into vllm-default

d4938e1

wangkl2 requested review from XinyaoWa and chensuyue

January 16, 2025 15:44

joshuayao requested a review from yao531441

January 17, 2025 08:13

joshuayao requested a review from XinyuYe-Intel

January 17, 2025 08:13

XinyuYe-Intel approved these changes

View reviewed changes

yao531441 approved these changes

View reviewed changes

chensuyue merged commit 742cb6d into opea-project:main

lianhao mentioned this pull request

[Bug] default ChatQnA's model Intel/neural-chat-7b-v3-3 is extremely slow with vLLM on ICX cpu #1420

Closed

6 tasks

wangkl2 mentioned this pull request

[Bug]OPEA vllm image issue, only on CPU is busy, other are idle for vllm inference #1519

Closed

8 tasks

joshuayao mentioned this pull request

[Feature] vLLM enablement for 8 GenAI examples #1436

Closed

21 tasks

chyundunovDatamonsters pushed a commit to chyundunovDatamonsters/OPEA-GenAIExamples that referenced this pull request


          [ChatQnA] Switch to vLLM as default llm backend on Xeon (opea-project…

7e693e1

…#1403)

Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance the perf.

opea-project#1213
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Signed-off-by: Chingis Yundunov <YundunovCN@sibedge.com>

joshuayao added this to the v1.3 milestone

cogniware-devops pushed a commit to Cogniware-Inc/GenAIExamples that referenced this pull request


          [ChatQnA] Switch to vLLM as default llm backend on Xeon (opea-project…

e795504

…#1403)

Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance the perf.

opea-project#1213
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Signed-off-by: cogniware-devops <ambarish.desai@cogniware.ai>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

yao531441 yao531441 approved these changes

XinyuYe-Intel XinyuYe-Intel approved these changes

lvliang-intel Awaiting requested review from lvliang-intel lvliang-intel is a code owner

letonghan Awaiting requested review from letonghan

chensuyue Awaiting requested review from chensuyue

XinyaoWa Awaiting requested review from XinyaoWa

Labels

None yet