mhbuehler · dmsuehir · Feb 20, 2025 · Feb 19, 2025 · Feb 19, 2025 · Feb 20, 2025
diff --git a/MultimodalQnA/README.md b/MultimodalQnA/README.md
@@ -41,12 +41,14 @@ flowchart LR
         UI([UI server<br>]):::orchid
     end
 
+    ASR{{Whisper service <br>}}
     TEI_EM{{Embedding service <br>}}
     VDB{{Vector DB<br><br>}}
     R_RET{{Retriever service <br>}}
     DP([Data Preparation<br>]):::blue
     LVM_gen{{LVM Service <br>}}
     GW([MultimodalQnA GateWay<br>]):::orange
+    TTS{{SpeechT5 service <br>}}
 
     %% Data Preparation flow
     %% Ingest data flow
@@ -74,25 +76,42 @@ flowchart LR
     R_RET <-.->VDB
     DP <-.->VDB
 
+    %% Automatic speech recognition (ASR) converts audio queries to text
+    GW <-.-> ASR
 
+    %% Generate spoken responses with text-to-speech using the SpeechT5 model
+    GW <-.-> TTS
 
 ```
 
 This MultimodalQnA use case performs Multimodal-RAG using LangChain, Redis VectorDB and Text Generation Inference on [Intel Gaudi2](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html) and [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html), and we invite contributions from other hardware vendors to expand the example.
 
+MultimodalQnA uses the [Whisper Service](https://github.com/opea-project/GenAIComps/blob/main/comps/asr/src/README.md)
+to convert audio queries to text. If a spoken response is requested, the
+[SpeechT5 Service](https://github.com/opea-project/GenAIComps/blob/main/comps/tts/src/README.md) converts the text
+response from the LVM to a speech audio file.
+
 The Intel Gaudi2 accelerator supports both training and inference for deep learning models in particular for LLMs. Visit [Habana AI products](https://habana.ai/products) for more details.
 
 In the below, we provide a table that describes for each microservice component in the MultimodalQnA architecture, the default configuration of the open source project, hardware, port, and endpoint.
 
 <details>
-<summary><b>Gaudi default compose.yaml</b></summary>
+<summary><b>Gaudi and Xeon default compose.yaml settings</b></summary>
 
 | MicroService | Open Source Project   | HW    | Port | Endpoint                                                    |
 | ------------ | --------------------- | ----- | ---- | ----------------------------------------------------------- |
+| Dataprep     | Redis, Langchain, TGI | Xeon  | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
 | Embedding    | Langchain             | Xeon  | 6000 | /v1/embeddings                                              |
-| Retriever    | Langchain, Redis      | Xeon  | 7000 | /v1/multimodal_retrieval                                    |
-| LVM          | Langchain, TGI        | Gaudi | 9399 | /v1/lvm                                                     |
+| LVM          | Langchain, Transformers | Xeon | 9399 | /v1/lvm                                                    |
+| Retriever    | Langchain, Redis      | Xeon  | 7000 | /v1/retrieval                                               |
+| SpeechT5     | Transformers          | Xeon  | 7055 | /v1/tts                                                     |
+| Whisper      | Transformers          | Xeon  | 7066 | /v1/asr                                                     |
 | Dataprep     | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
+| Embedding    | Langchain             | Gaudi | 6000 | /v1/embeddings                                              |
+| LVM          | Langchain, TGI        | Gaudi | 9399 | /v1/lvm                                                     |
+| Retriever    | Langchain, Redis      | Gaudi | 7000 | /v1/retrieval                                               |
+| SpeechT5     | Transformers          | Gaudi | 7055 | /v1/tts                                                     |
+| Whisper      | Transformers          | Gaudi | 7066 | /v1/asr                                                     |
 
 </details>
 
@@ -104,18 +123,41 @@ By default, the embedding and LVM models are set to a default value as listed be
 | --------- | ----- | ----------------------------------------- |
 | embedding | Xeon  | BridgeTower/bridgetower-large-itm-mlm-itc |
 | LVM       | Xeon  | llava-hf/llava-1.5-7b-hf                  |
+| SpeechT5  | Xeon  | microsoft/speecht5_tts                    |
+| Whisper   | Xeon  | openai/whisper-small                      |
 | embedding | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
 | LVM       | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf         |
+| SpeechT5  | Gaudi | microsoft/speecht5_tts                    |
+| Whisper   | Gaudi | openai/whisper-small                      |
 
 You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf ` and `llava-hf/llava-1.5-13b-hf`, as needed.
 
 ## Deploy MultimodalQnA Service
 
 The MultimodalQnA service can be effortlessly deployed on either Intel Gaudi2 or Intel XEON Scalable Processors.
 
-Currently we support deploying MultimodalQnA services with docker compose.
+Currently we support deploying MultimodalQnA services with docker compose. The [`docker_compose`](docker_compose)
+directory contains a folder with a `compose.yaml` file for each supported hardware type:
+
+```
+📂 docker_compose
+├── 📂 amd
+│   └── 📂 gpu
+│       └── 📂 rocm
+│           ├── 📄 compose.yaml
+│           └── ...
+└── 📂 intel
+    ├── 📂 cpu
+    │   └── 📂 xeon
+    │       ├── 📄 compose.yaml
+    │       └── ...
+    └── 📂 hpu
+        └── 📂 gaudi
+            ├── 📄 compose.yaml
+            └── ...
+```
 
-### Setup Environment Variable
+### Setup Environment Variables
 
 To set up environment variables for deploying MultimodalQnA services, follow these steps:
 
@@ -124,8 +166,10 @@ To set up environment variables for deploying MultimodalQnA services, follow the
    ```bash
    # Example: export host_ip=$(hostname -I | awk '{print $1}')
    export host_ip="External_Public_IP"
+
+   # Append the host_ip to the no_proxy list to allow container communication
    # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
-   export no_proxy="Your_No_Proxy"
+   export no_proxy="${no_proxy},${host_ip}"
    ```
 
 2. If you are in a proxy environment, also set the proxy-related environment variables:
@@ -137,36 +181,41 @@ To set up environment variables for deploying MultimodalQnA services, follow the
 
 3. Set up other environment variables:
 
-   > Notice that you can only choose **one** command below to set up envs according to your hardware. Other that the port numbers may be set incorrectly.
+   > Run only **one** of the commands below, matching your hardware, to set the environment variables. Otherwise, the port numbers may be set incorrectly.
 
    ```bash
    # on Gaudi
-   source ./docker_compose/intel/hpu/gaudi/set_env.sh
+   cd docker_compose/intel/hpu/gaudi
+   source ./set_env.sh
+
    # on Xeon
-   source ./docker_compose/intel/cpu/xeon/set_env.sh
+   cd docker_compose/intel/cpu/xeon
+   source ./set_env.sh
    ```
 
 ### Deploy MultimodalQnA on Gaudi
 
-Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) to build docker images from source.
+Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) if you would like to build docker images from
+source; otherwise, images will be pulled from Docker Hub.
 
 Find the corresponding [compose.yaml](./docker_compose/intel/hpu/gaudi/compose.yaml).
 
 ```bash
-cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi/
+# While still in the docker_compose/intel/hpu/gaudi directory, use docker compose to bring up the services
 docker compose -f compose.yaml up -d
 ```
 
-> Notice: Currently only the **Habana Driver 1.17.x** is supported for Gaudi.
+> Notice: Currently only the **Habana Driver 1.18.x** is supported for Gaudi.
 
 ### Deploy MultimodalQnA on Xeon
 
-Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) for more instructions on building docker images from source.
+Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) if you would like to build docker images from
+source; otherwise, images will be pulled from Docker Hub.
 
 Find the corresponding [compose.yaml](./docker_compose/intel/cpu/xeon/compose.yaml).
 
 ```bash
-cd GenAIExamples/MultimodalQnA/docker_compose/intel/cpu/xeon/
+# While still in the docker_compose/intel/cpu/xeon directory, use docker compose to bring up the services
 docker compose -f compose.yaml up -d
 ```
 

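The README change above appends `host_ip` to `no_proxy` with a plain string concatenation. As a minimal sketch of a more defensive variant, the helper below skips the append when the address is already listed and avoids a leading comma when `no_proxy` starts out empty. The function name `append_no_proxy` is hypothetical and is not part of the repository or of `set_env.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not part of the repository):
# add an address to no_proxy only if it is not already listed.
append_no_proxy() {
    local addr="$1"
    case ",${no_proxy:-}," in
        *",${addr},"*)
            # Address is already in the comma-separated list; leave it unchanged.
            ;;
        *)
            # Insert a comma separator only when no_proxy is non-empty.
            no_proxy="${no_proxy:+${no_proxy},}${addr}"
            ;;
    esac
    export no_proxy
}

# Usage, assuming host_ip was exported as described in the README:
# append_no_proxy "${host_ip}"
```

Because the function checks for an existing entry, sourcing it and calling it repeatedly in the same shell should not grow the list.
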
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
@@ -63,7 +63,7 @@ Since the `compose.yaml` will consume some environment variables, you need to se
 
 **Export the value of the public IP address of your Xeon server to the `host_ip` environment variable**
 
-> Change the External_Public_IP below with the actual IPV4 value
+> Replace `External_Public_IP` below with the actual IPv4 address when setting `host_ip` (do not use localhost).
 
 ```
 export host_ip="External_Public_IP"
@@ -72,13 +72,10 @@ export host_ip="External_Public_IP"
 **Append the value of the public IP address to the no_proxy list**
 
 ```bash
-export your_no_proxy=${your_no_proxy},"External_Public_IP"
+export no_proxy="${no_proxy},${host_ip}"
 ```
 
 ```bash
-export no_proxy=${your_no_proxy}
-export http_proxy=${your_http_proxy}
-export https_proxy=${your_http_proxy}
 export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
 export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
@@ -114,8 +111,6 @@ export UI_PORT=5173
 export UI_TIMEOUT=200
 ```
 
-Note: Please replace with `host_ip` with you external IP address, do not use localhost.
-
 > Note: The `MAX_IMAGES` environment variable is used to specify the maximum number of images that will be sent from the LVM service to the LLaVA server.
 > If an image list longer than `MAX_IMAGES` is sent to the LVM server, a shortened image list will be sent to the LLaVA service. If the image list
 > needs to be shortened, the most recent images (the ones at the end of the list) are prioritized to send to the LLaVA service. Some LLaVA models have not

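The Xeon README now tells readers not to use localhost for `host_ip`. A minimal sketch of how that rule could be checked before bringing up the services is shown below. The function name `is_usable_host_ip` is hypothetical, and the check only covers the obvious loopback spellings; it does not verify that the address is actually reachable.

```bash
#!/usr/bin/env bash
# Hypothetical check (an illustration, not part of set_env.sh):
# succeed only when the given value is non-empty and not a loopback name or address.
is_usable_host_ip() {
    case "$1" in
        ""|localhost|127.*|::1)
            # Inside a container, loopback refers to the container itself, not the host.
            return 1
            ;;
        *)
            return 0
            ;;
    esac
}

# Usage, assuming host_ip has been exported:
# is_usable_host_ip "${host_ip}" || echo "host_ip must be an external IP address" >&2
```
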
diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -8,10 +8,6 @@ popd > /dev/null
 
 export host_ip=$(hostname -I | awk '{print $1}')
 
-export no_proxy=${your_no_proxy}
-export http_proxy=${your_http_proxy}
-export https_proxy=${your_http_proxy}
-
 export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
 export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}

diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -8,7 +8,7 @@ Since the `compose.yaml` will consume some environment variables, you need to se
 
 **Export the value of the public IP address of your Gaudi server to the `host_ip` environment variable**
 
-> Change the External_Public_IP below with the actual IPV4 value
+> Replace `External_Public_IP` below with the actual IPv4 address when setting `host_ip` (do not use localhost).
 
 ```
 export host_ip="External_Public_IP"
@@ -17,13 +17,10 @@ export host_ip="External_Public_IP"
 **Append the value of the public IP address to the no_proxy list**
 
 ```bash
-export your_no_proxy=${your_no_proxy},"External_Public_IP"
+export no_proxy="${no_proxy},${host_ip}"
 ```
 
 ```bash
-export no_proxy=${your_no_proxy}
-export http_proxy=${your_http_proxy}
-export https_proxy=${your_http_proxy}
 export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
 export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
@@ -60,8 +57,6 @@ export UI_PORT=5173
 export UI_TIMEOUT=200
 ```
 
-Note: Please replace with `host_ip` with you external IP address, do not use localhost.
-
 > Note: The `MAX_IMAGES` environment variable is used to specify the maximum number of images that will be sent from the LVM service to the LLaVA server.
 > If an image list longer than `MAX_IMAGES` is sent to the LVM server, a shortened image list will be sent to the LLaVA service. If the image list
 > needs to be shortened, the most recent images (the ones at the end of the list) are prioritized to send to the LLaVA service. Some LLaVA models have not

diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -13,10 +13,6 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
 export LVM_SERVICE_HOST_IP=${host_ip}
 export MEGA_SERVICE_HOST_IP=${host_ip}
 
-export no_proxy=${your_no_proxy}
-export http_proxy=${your_http_proxy}
-export https_proxy=${your_http_proxy}
-
 export REDIS_DB_PORT=6379
 export REDIS_INSIGHTS_PORT=8001
 export REDIS_URL="redis://${host_ip}:${REDIS_DB_PORT}"