[Bug] ChatQnA's default model Intel/neural-chat-7b-v3-3 is extremely slow with vLLM on ICX CPU #1420

### Priority

P4-Low

### OS type

Ubuntu

### Hardware type

Xeon-ICX

### Installation method

- [x] Pull docker images from hub.docker.com
- [ ] Build docker images from source

### Deploy method

- [x] Docker compose
- [ ] Docker
- [ ] Kubernetes
- [ ] Helm

### Running nodes

Single Node

### What's the version?

- docker compose commit id: 742cb6d
- docker image `opea/vllm` info:
    - repoDigest: `opea/vllm@sha256:61760224596acb8fbce25dfd4942049263363764f327d6d6ea8e1e69c0799988`
    - Created: `2025-01-17T03:57:18.181292698Z`

### Description

PR #1403 switched the default ChatQnA inference backend to vLLM for Xeon CPU environments. However, ChatQnA's default model, `Intel/neural-chat-7b-v3-3`, is extremely slow with vLLM on a Xeon ICX CPU.

Sending the following curl request directly to vLLM to generate 32 tokens takes more than 4 minutes to complete:
`curl http://localhost:9009/v1/completions   -H "Content-Type: application/json"   -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'`

We should either switch to another model or revert PR #1403 so that TGI is again the default inference backend on Xeon CPU.

### Reproduce steps

1. Start the services: `docker compose up -d`
2. Send a completion request directly to vLLM: `curl http://localhost:9009/v1/completions   -H "Content-Type: application/json"   -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'`
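
To put a number on the slowdown, the request above can be timed and converted to a rough throughput figure. The following is only a sketch: the helper names `tokens_per_sec` and `time_completion` are hypothetical and not part of the ChatQnA repository, and the calculation assumes the server actually generates all `max_tokens` tokens (it may stop earlier, which would make the real throughput lower still).

```shell
#!/bin/sh
# Hypothetical helpers for quantifying the slowdown; names are illustrative only.

# tokens_per_sec TOKENS SECONDS
# Prints the throughput in tokens per second, rounded to two decimals.
tokens_per_sec() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        if (t > 0) printf "%.2f\n", n / t
        else print "n/a"
    }'
}

# time_completion BASE_URL MODEL MAX_TOKENS
# Sends the same completion request as in the reproduce steps and prints
# the total request time in seconds as reported by curl.
time_completion() {
    curl -s -o /dev/null -w '%{time_total}\n' "$1/v1/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"$2\", \"prompt\": \"What is Deep Learning?\", \"max_tokens\": $3}"
}
```

With the containers running, `elapsed=$(time_completion http://localhost:9009 Intel/neural-chat-7b-v3-3 32)` followed by `tokens_per_sec 32 "$elapsed"` prints the measured rate. For the "more than 4 minutes" observed here, `tokens_per_sec 32 240` gives an upper bound of about 0.13 tokens per second.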


### Raw log

```shell

```
