[ChatQnA] Switch to vLLM as default llm backend on Xeon#1403

Merged
chensuyue merged 24 commits into
opea-project:mainfrom
wangkl2:vllm-default
Jan 17, 2025
wangkl2 (Collaborator) commented Jan 16, 2025

Description

Switch to vLLM as the default LLM backend on Xeon for the ChatQnA pipeline.

This PR switches the default LLM serving backend on Xeon for the ChatQnA example from TGI to vLLM to improve performance. The LLM component was benchmarked on a Xeon server with both the vLLM and TGI backends, across different ISL/OSL (input/output sequence length) settings and various numbers of queries and concurrency levels. On a 7B model, the geomean of the measured LLMServe performance shows vLLM improving on TGI for several metrics: average total latency, average TTFT (time to first token), average TPOT (time per output token), and throughput. TGI is still offered as a deployment option for LLM serving. vLLM also replaces TGI as the LLM backend in the other provided E2E ChatQnA pipelines: the without-rerank pipeline, the pipeline with Pinecone as the vector DB, and the pipeline with Qdrant as the vector DB. This PR also aligns the LLM service parameters in all ChatQnA test scripts with those in the README.
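For readers unfamiliar with the metrics named above, here is a minimal sketch of how they are commonly derived from per-token arrival timestamps of a streamed response. The `RequestTrace` structure and function names are hypothetical and are not taken from GenAIEval or this PR; the TPOT formula shown (decode time divided by the number of tokens after the first) is one common convention, and the benchmark tool may define it differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def total_latency(trace: RequestTrace) -> float:
    """End-to-end latency: delay between sending and the last output token."""
    return trace.token_times[-1] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token, averaged over the tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # assumption: no decode phase to average over
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the wall-clock span of a batch of requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    tokens = sum(len(t.token_times) for t in traces)
    return tokens / (end - start)
```

Lower is better for TTFT, TPOT, and total latency; higher is better for throughput.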

Issues

#1213

Type of change

  • New feature (non-breaking change which adds new functionality)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

n/a

Tests

TGI version: 2.4.0
vLLM version: 0.6.6.post2.dev151+gbd828722

Benchmarked and compared LLMServe performance on a GNR server with out-of-the-box (OOB) vLLM and OOB TGI backends via GenAIEval. In geomean, vLLM outperforms TGI on average total latency, average TTFT, average TPOT, and throughput for a 7B LLM across four ISL/OSL settings (128/128, 128/1024, 1024/128, 1024/1024), measured at several num_queries/concurrency settings, including 32/8 and 128/32.
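As an illustration of the geomean comparison described above, the following sketch aggregates a per-configuration metric into a single vLLM-to-TGI ratio. The helper names and the dictionary layout are assumptions made for this example; they are not part of GenAIEval, and no measured values from this PR are reproduced here.

```python
import math
from typing import Dict, Iterable, Tuple

# (ISL, OSL, num_queries, concurrency) identifies one benchmark configuration.
Config = Tuple[int, int, int, int]

# The ISL/OSL pairs and num_queries/concurrency pairs named in the PR.
ISL_OSL = [(128, 128), (128, 1024), (1024, 128), (1024, 1024)]
LOADS = [(32, 8), (128, 32)]
CONFIGS = [(isl, osl, nq, c) for (isl, osl) in ISL_OSL for (nq, c) in LOADS]


def geomean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values."""
    vals = list(values)
    if not vals or any(v <= 0 for v in vals):
        raise ValueError("geomean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def geomean_ratio(baseline: Dict[Config, float],
                  candidate: Dict[Config, float]) -> float:
    """Geomean of candidate/baseline over the configurations both runs cover."""
    shared = sorted(baseline.keys() & candidate.keys())
    return geomean(candidate[k] / baseline[k] for k in shared)
```

For a latency-style metric (total latency, TTFT, TPOT), a ratio below 1 with TGI as `baseline` and vLLM as `candidate` would mean vLLM is faster; for throughput, a ratio above 1 would. The geometric mean is used instead of the arithmetic mean so that no single configuration with large absolute numbers dominates the summary.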

wangkl2 and others added 15 commits January 15, 2025 06:20
Implement opea-project#1213

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>

github-actions Bot commented Jan 16, 2025

Dependency Review

✅ No vulnerabilities or license issues found.

@wangkl2 wangkl2 requested review from XinyaoWa and chensuyue January 16, 2025 15:44
@joshuayao joshuayao requested a review from yao531441 January 17, 2025 08:13
@chensuyue chensuyue merged commit 742cb6d into opea-project:main Jan 17, 2025
chyundunovDatamonsters pushed a commit to chyundunovDatamonsters/OPEA-GenAIExamples that referenced this pull request Mar 4, 2025
…#1403)

Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance the perf.

opea-project#1213
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Signed-off-by: Chingis Yundunov <YundunovCN@sibedge.com>
@joshuayao joshuayao added this to the v1.3 milestone Mar 7, 2025
cogniware-devops pushed a commit to Cogniware-Inc/GenAIExamples that referenced this pull request Dec 19, 2025
…#1403)

Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance the perf.

opea-project#1213
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
Signed-off-by: cogniware-devops <ambarish.desai@cogniware.ai>
5 participants