[Bug] default ChatQnA's model Intel/neural-chat-7b-v3-3 is extremely slow with vLLM on ICX cpu #1420

@lianhao

Priority

P4-Low

OS type

Ubuntu

Hardware type

Xeon-ICX

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

docker compose commit id: 742cb6d
docker image opea/vllm info:
- repoDigest: opea/vllm@sha256:61760224596acb8fbce25dfd4942049263363764f327d6d6ea8e1e69c0799988
- "Created": "2025-01-17T03:57:18.181292698Z"

Description

PR #1403 switched to vLLM as the default inference backend for ChatQnA in the Xeon CPU environment. However, ChatQnA's default model, Intel/neural-chat-7b-v3-3, is extremely slow with vLLM on a Xeon ICX CPU.

The following curl command sends an inference request directly to vLLM to generate 32 tokens. It takes more than 4 minutes to complete:
curl http://localhost:9009/v1/completions -H "Content-Type: application/json" -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'
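
To put a number on the slowdown, the request above can be timed and converted into a rough throughput figure. This is a minimal sketch and not part of the original report: the helper names `tokens_per_second` and `time_completion` and the `VLLM_URL` variable are hypothetical, and it assumes the model actually generates all 32 requested tokens (it may stop earlier). The timing uses curl's `--write-out` variable `time_total`.

```shell
# Sketch: time one completion request and convert it to tokens per second.
# Endpoint and payload are copied from the issue; VLLM_URL is a hypothetical override.
VLLM_URL="${VLLM_URL:-http://localhost:9009/v1/completions}"
PAYLOAD='{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'

# tokens_per_second TOKENS SECONDS
# Prints TOKENS / SECONDS with two decimals, or "nan" if SECONDS is not positive.
tokens_per_second() {
  awk -v t="$1" -v s="$2" 'BEGIN {
    if (s <= 0) { print "nan"; exit 1 }
    printf "%.2f\n", t / s
  }'
}

# time_completion
# Sends the request and prints only curl's total transfer time in seconds.
time_completion() {
  curl -s -o /dev/null -w '%{time_total}' "$VLLM_URL" \
    -H "Content-Type: application/json" -d "$PAYLOAD"
}

# Usage (requires the running service):
#   secs=$(time_completion) && tokens_per_second 32 "$secs"
```

With the reported lower bound of 4 minutes (240 s) for 32 tokens, `tokens_per_second 32 240` gives an upper bound of about 0.13 tokens per second.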

We should either switch to another model or revert PR #1403 so that TGI is again the default inference backend on Xeon CPU.

Reproduce steps

docker compose up -d
curl http://localhost:9009/v1/completions -H "Content-Type: application/json" -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'
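
When reproducing, it may help to wait until the server has finished loading the model before sending the timed request, so that startup time is not counted as generation time. The sketch below is an assumption-laden illustration rather than part of the report: `retry_until` and `vllm_ready` are hypothetical helpers, and it assumes the vLLM OpenAI-compatible server's `/v1/models` route is reachable on the same port 9009 used above.

```shell
# retry_until ATTEMPTS DELAY CMD...
# Runs CMD until it succeeds or ATTEMPTS is exhausted, sleeping DELAY seconds
# after each failed attempt. Returns 0 on success, 1 otherwise.
retry_until() {
  ru_attempts="$1"
  ru_delay="$2"
  shift 2
  ru_i=0
  while [ "$ru_i" -lt "$ru_attempts" ]; do
    "$@" && return 0
    ru_i=$((ru_i + 1))
    sleep "$ru_delay"
  done
  return 1
}

# vllm_ready
# Succeeds once the model listing endpoint answers with a 2xx status.
# Assumption: /v1/models is served on the same host port as /v1/completions.
vllm_ready() {
  curl -sf -o /dev/null "http://localhost:9009/v1/models"
}

# Usage (requires docker and the ChatQnA compose file):
#   docker compose up -d
#   retry_until 60 5 vllm_ready && echo "vLLM is ready"
```

The retry count and delay (60 attempts, 5 seconds apart) are arbitrary illustrative values; model loading on CPU can take longer.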

Raw log
