Priority
P4-Low
OS type
Ubuntu
Hardware type
Xeon-ICX
Installation method
Deploy method
Running nodes
Single Node
What's the version?
docker compose commit id: 742cb6d
docker image opea/vllm info:
- repoDigest: opea/vllm@sha256:61760224596acb8fbce25dfd4942049263363764f327d6d6ea8e1e69c0799988
- "Created": "2025-01-17T03:57:18.181292698Z"
Description
PR #1403 switched the default inference backend for ChatQnA to vLLM in the Xeon CPU environment. However, the Intel/neural-chat-7b-v3-3 model used by ChatQnA is extremely slow on vLLM on a Xeon ICX CPU.
Running the following curl command against vLLM directly to generate 32 tokens takes more than 4 minutes to complete:
curl http://localhost:9009/v1/completions -H "Content-Type: application/json" -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'
We should either switch to another model or revert PR #1403 to use TGI as the default inference backend on Xeon CPU.
Reproduce steps
docker compose up -d
curl http://localhost:9009/v1/completions -H "Content-Type: application/json" -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'
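To put a number on the slowdown rather than eyeballing the curl run, the same request can be timed from a small script. This is only a sketch: the helper names (`build_payload`, `time_completion`, `tokens_per_second`) are made up for illustration, and reading `usage.completion_tokens` assumes the endpoint returns an OpenAI-style `usage` object in its response.

```python
import json
import time
import urllib.request


def build_payload(model, prompt, max_tokens):
    """Build the same JSON body that the curl command sends."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def time_completion(url, payload, timeout=600.0):
    """POST the payload and return (elapsed_seconds, parsed_response).

    The generous default timeout is an assumption chosen so that a
    multi-minute generation is not cut off by the client.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return time.perf_counter() - start, body


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Average generation throughput over the whole request."""
    return completion_tokens / elapsed_seconds


# Example usage (requires the ChatQnA stack to be running locally):
#   payload = build_payload("Intel/neural-chat-7b-v3-3", "What is Deep Learning?", 32)
#   elapsed, body = time_completion("http://localhost:9009/v1/completions", payload)
#   generated = body["usage"]["completion_tokens"]  # assumes an OpenAI-style usage field
#   print(f"{elapsed:.1f} s, {tokens_per_second(generated, elapsed):.3f} tokens/s")
```

For reference, 32 tokens in 240 seconds works out to about 0.13 tokens per second, so the same script pointed at a TGI deployment would give a directly comparable figure.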
Raw log