Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion AudioQnA/audioqna.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@
SPEECHT5_SERVER_PORT = int(os.getenv("SPEECHT5_SERVER_PORT", 7055))
LLM_SERVER_HOST_IP = os.getenv("LLM_SERVER_HOST_IP", "0.0.0.0")
LLM_SERVER_PORT = int(os.getenv("LLM_SERVER_PORT", 3006))
LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "Intel/neural-chat-7b-v3-3")
LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")


def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):
Expand Down
2 changes: 1 addition & 1 deletion AudioQnA/audioqna_multilang.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
GPT_SOVITS_SERVER_PORT = int(os.getenv("GPT_SOVITS_SERVER_PORT", 9088))
LLM_SERVER_HOST_IP = os.getenv("LLM_SERVER_HOST_IP", "0.0.0.0")
LLM_SERVER_PORT = int(os.getenv("LLM_SERVER_PORT", 8888))
LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "Intel/neural-chat-7b-v3-3")
LLM_MODEL_ID = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")


def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):
Expand Down
111 changes: 86 additions & 25 deletions AudioQnA/docker_compose/intel/cpu/xeon/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,10 @@

This document outlines the deployment process for a AudioQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server.

The default pipeline deploys with vLLM as the LLM serving component. It also provides options of using TGI backend for LLM microservice, please refer to [Start the MegaService](#-start-the-megaservice) section in this page.

Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure either you've requested and been granted the access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or you've downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## 🚀 Build Docker images

### 1. Source Code install GenAIComps
Expand All @@ -17,9 +21,15 @@ cd GenAIComps
docker build -t opea/whisper:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/src/integrations/dependency/whisper/Dockerfile .
```

### 3. Build LLM Image
### 3. Build vLLM Image

Intel Xeon optimized image hosted in huggingface repo will be used for TGI service: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu (https://github.com/huggingface/text-generation-inference)
```bash
git clone https://github.com/vllm-project/vllm.git
cd ./vllm/
VLLM_VER="$(git describe --tags "$(git rev-list --tags --max-count=1)" )"
git checkout ${VLLM_VER}
docker build --no-cache --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile.cpu -t opea/vllm:latest --shm-size=128g .
```

### 4. Build TTS Image

Expand All @@ -43,9 +53,10 @@ docker build --no-cache -t opea/audioqna:latest --build-arg https_proxy=$https_p
Then run the command `docker images`, you will have following images ready:

1. `opea/whisper:latest`
2. `opea/speecht5:latest`
3. `opea/audioqna:latest`
4. `opea/gpt-sovits:latest` (optional)
2. `opea/vllm:latest`
3. `opea/speecht5:latest`
4. `opea/audioqna:latest`
5. `opea/gpt-sovits:latest` (optional)

## 🚀 Set the environment variables

Expand All @@ -55,7 +66,7 @@ Before starting the services with `docker compose`, you have to recheck the foll
export host_ip=<your External Public IP> # export host_ip=$(hostname -I | awk '{print $1}')
export HUGGINGFACEHUB_API_TOKEN=<your HF token>

export LLM_MODEL_ID=Intel/neural-chat-7b-v3-3
export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"

export MEGA_SERVICE_HOST_IP=${host_ip}
export WHISPER_SERVER_HOST_IP=${host_ip}
Expand All @@ -73,40 +84,90 @@ export BACKEND_SERVICE_ENDPOINT=http://${host_ip}:3008/v1/audioqna

or use set_env.sh file to setup environment variables.

Note: Please replace with host_ip with your external IP address, do not use localhost.
Note:

- Please replace with host_ip with your external IP address, do not use localhost.
- If you are in a proxy environment, also set the proxy-related environment variables:

```
export http_proxy="Your_HTTP_Proxy"
export https_proxy="Your_HTTPs_Proxy"
# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
export no_proxy="Your_No_Proxy",${host_ip},whisper-service,speecht5-service,gpt-sovits-service,tgi-service,vllm-service,audioqna-xeon-backend-server,audioqna-xeon-ui-server
```

## 🚀 Start the MegaService

```bash
cd GenAIExamples/AudioQnA/docker_compose/intel/cpu/xeon/
```

If use vLLM as the LLM serving backend:

```
docker compose up -d

# multilang tts (optional)
docker compose -f compose_multilang.yaml up -d
```

If use TGI as the LLM serving backend:

```
docker compose -f compose_tgi.yaml up -d
```

## 🚀 Test MicroServices

```bash
# whisper service
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav
curl http://${host_ip}:7066/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-F file="@./sample.wav" \
-F model="openai/whisper-small"

# tgi service
curl http://${host_ip}:3006/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
1. Whisper Service

# speecht5 service
curl http://${host_ip}:7055/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
```bash
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav
curl http://${host_ip}:${WHISPER_SERVER_PORT}/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-F file="@./sample.wav" \
-F model="openai/whisper-small"
```

# gpt-sovits service (optional)
curl http://${host_ip}:9880/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
```
2. LLM backend Service

In the first startup, this service will take more time to download, load and warm up the model. After it's finished, the service will be ready and the container (`vllm-service` or `tgi-service`) status shown via `docker ps` will be `healthy`. Before that, the status will be `health: starting`.

Or try the command below to check whether the LLM serving is ready.

```bash
# vLLM service
docker logs vllm-service 2>&1 | grep complete
# If the service is ready, you will get the response like below.
INFO: Application startup complete.
```

```bash
# TGI service
docker logs tgi-service | grep Connected
# If the service is ready, you will get the response like below.
2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
```

Then try the `cURL` command below to validate services.

```bash
# either vLLM or TGI service
curl http://${host_ip}:${LLM_SERVER_PORT}/v1/chat/completions \
-X POST \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```

3. TTS Service

```
# speecht5 service
curl http://${host_ip}:${SPEECHT5_SERVER_PORT}/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3

# gpt-sovits service (optional)
curl http://${host_ip}:${GPT_SOVITS_SERVER_PORT}/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3
```

## 🚀 Test MegaService

Expand Down
25 changes: 14 additions & 11 deletions AudioQnA/docker_compose/intel/cpu/xeon/compose.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ services:
image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
container_name: whisper-service
ports:
- "7066:7066"
- ${WHISPER_SERVER_PORT:-7066}:7066
ipc: host
environment:
no_proxy: ${no_proxy}
Expand All @@ -17,38 +17,41 @@ services:
image: ${REGISTRY:-opea}/speecht5:${TAG:-latest}
container_name: speecht5-service
ports:
- "7055:7055"
- ${SPEECHT5_SERVER_PORT:-7055}:7055
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
restart: unless-stopped
tgi-service:
image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
container_name: tgi-service
vllm-service:
image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
container_name: vllm-service
ports:
- "3006:80"
- ${LLM_SERVER_PORT:-3006}:80
volumes:
- "${MODEL_CACHE:-./data}:/data"
shm_size: 1g
- "${MODEL_CACHE:-./data}:/root/.cache/huggingface/hub"
shm_size: 128g
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
LLM_MODEL_ID: ${LLM_MODEL_ID}
VLLM_TORCH_PROFILER_DIR: "/mnt"
LLM_SERVER_PORT: ${LLM_SERVER_PORT}
healthcheck:
test: ["CMD-SHELL", "curl -f http://$host_ip:3006/health || exit 1"]
test: ["CMD-SHELL", "curl -f http://$host_ip:${LLM_SERVER_PORT}/health || exit 1"]
interval: 10s
timeout: 10s
retries: 100
command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
audioqna-xeon-backend-server:
image: ${REGISTRY:-opea}/audioqna:${TAG:-latest}
container_name: audioqna-xeon-backend-server
depends_on:
- whisper-service
- tgi-service
- vllm-service
- speecht5-service
ports:
- "3008:8888"
Expand Down
26 changes: 17 additions & 9 deletions AudioQnA/docker_compose/intel/cpu/xeon/compose_multilang.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ services:
image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
container_name: whisper-service
ports:
- "7066:7066"
- ${WHISPER_SERVER_PORT:-7066}:7066
ipc: host
environment:
no_proxy: ${no_proxy}
Expand All @@ -18,27 +18,35 @@ services:
image: ${REGISTRY:-opea}/gpt-sovits:${TAG:-latest}
container_name: gpt-sovits-service
ports:
- "9880:9880"
- ${GPT_SOVITS_SERVER_PORT:-9880}:9880
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
restart: unless-stopped
tgi-service:
image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
container_name: tgi-service
vllm-service:
image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
container_name: vllm-service
ports:
- "3006:80"
- ${LLM_SERVER_PORT:-3006}:80
volumes:
- "${MODEL_CACHE:-./data}:/data"
shm_size: 1g
- "${MODEL_CACHE:-./data}:/root/.cache/huggingface/hub"
shm_size: 128g
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
LLM_MODEL_ID: ${LLM_MODEL_ID}
VLLM_TORCH_PROFILER_DIR: "/mnt"
LLM_SERVER_PORT: ${LLM_SERVER_PORT}
healthcheck:
test: ["CMD-SHELL", "curl -f http://$host_ip:${LLM_SERVER_PORT}/health || exit 1"]
interval: 10s
timeout: 10s
retries: 100
command: --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
audioqna-xeon-backend-server:
image: ${REGISTRY:-opea}/audioqna-multilang:${TAG:-latest}
container_name: audioqna-xeon-backend-server
Expand Down
87 changes: 87 additions & 0 deletions AudioQnA/docker_compose/intel/cpu/xeon/compose_tgi.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
whisper-service:
image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
container_name: whisper-service
ports:
- ${WHISPER_SERVER_PORT:-7066}:7066
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
restart: unless-stopped
speecht5-service:
image: ${REGISTRY:-opea}/speecht5:${TAG:-latest}
container_name: speecht5-service
ports:
- ${SPEECHT5_SERVER_PORT:-7055}:7055
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
restart: unless-stopped
tgi-service:
image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
container_name: tgi-service
ports:
- ${LLM_SERVER_PORT:-3006}:80
volumes:
- "${MODEL_CACHE:-./data}:/data"
shm_size: 1g
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
LLM_SERVER_PORT: ${LLM_SERVER_PORT}
healthcheck:
test: ["CMD-SHELL", "curl -f http://$host_ip:${LLM_SERVER_PORT}/health || exit 1"]
interval: 10s
timeout: 10s
retries: 100
command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
audioqna-xeon-backend-server:
image: ${REGISTRY:-opea}/audioqna:${TAG:-latest}
container_name: audioqna-xeon-backend-server
depends_on:
- whisper-service
- tgi-service
- speecht5-service
ports:
- "3008:8888"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
- WHISPER_SERVER_HOST_IP=${WHISPER_SERVER_HOST_IP}
- WHISPER_SERVER_PORT=${WHISPER_SERVER_PORT}
- LLM_SERVER_HOST_IP=${LLM_SERVER_HOST_IP}
- LLM_SERVER_PORT=${LLM_SERVER_PORT}
- LLM_MODEL_ID=${LLM_MODEL_ID}
- SPEECHT5_SERVER_HOST_IP=${SPEECHT5_SERVER_HOST_IP}
- SPEECHT5_SERVER_PORT=${SPEECHT5_SERVER_PORT}
ipc: host
restart: always
audioqna-xeon-ui-server:
image: ${REGISTRY:-opea}/audioqna-ui:${TAG:-latest}
container_name: audioqna-xeon-ui-server
depends_on:
- audioqna-xeon-backend-server
ports:
- "5173:5173"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- CHAT_URL=${BACKEND_SERVICE_ENDPOINT}
ipc: host
restart: always

networks:
default:
driver: bridge
2 changes: 1 addition & 1 deletion AudioQnA/docker_compose/intel/cpu/xeon/set_env.sh
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ export host_ip=$(hostname -I | awk '{print $1}')
export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
# <token>

export LLM_MODEL_ID=Intel/neural-chat-7b-v3-3
export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"

export MEGA_SERVICE_HOST_IP=${host_ip}
export WHISPER_SERVER_HOST_IP=${host_ip}
Expand Down
Loading