Closed
65 commits
f076c55
Added tests + updated docs for asr mp3 change (#51)
okhleif-10 Feb 5, 2025
11f797e
first commit for tts addition
okhleif-10 Feb 5, 2025
05ddb11
added TTS linkage to backend
okhleif-10 Feb 7, 2025
ee62b73
removed unused import
okhleif-10 Feb 7, 2025
4f99bd1
Merge branch 'opea-project:main' into mmqna-phase3
okhleif-10 Feb 10, 2025
0f4e77d
added necessary env vars
okhleif-10 Feb 10, 2025
fc99972
Merge remote-tracking branch 'origin/mmqna-phase3' into omar/tts-mmqna
okhleif-10 Feb 10, 2025
66a89d6
MMQnA UI option to show/delete files from vector store (#52)
HarshaRamayanam Feb 10, 2025
5527bc3
Configurable UI timeout (#54)
mhbuehler Feb 10, 2025
e500c10
reworked temp tts toggle logic
okhleif-10 Feb 11, 2025
aafee33
added modalities as a toggle
okhleif-10 Feb 11, 2025
e686ec3
removed print statement
okhleif-10 Feb 12, 2025
cd0a364
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Feb 12, 2025
e4ae51d
removed gaudi from tts
okhleif-10 Feb 12, 2025
0818fff
Merge remote-tracking branch 'origin/mmqna-phase3' into omar/tts-mmqna
okhleif-10 Feb 12, 2025
e145e93
Update retrieval endpoints in READMEs (#55)
mhbuehler Feb 12, 2025
a1c7adb
doc updates and code refactor
okhleif-10 Feb 13, 2025
0c056a4
Merge remote-tracking branch 'origin/mmqna-phase3' into omar/tts-mmqna
okhleif-10 Feb 13, 2025
632a60b
added tts test to megaservice tests
okhleif-10 Feb 13, 2025
08ab760
remove log diles
okhleif-10 Feb 13, 2025
220096e
addressed recent review comments
okhleif-10 Feb 13, 2025
311e6b6
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Feb 19, 2025
fc4f46c
Modifies a megaservice test to verify correct apple color (#57)
mhbuehler Feb 20, 2025
6faf0bb
Documentation and diagram update for MultimodalQnA phase 3 enhancemen…
dmsuehir Feb 20, 2025
aa41588
Backend Implementation for Text to Speech (#53)
okhleif-10 Feb 20, 2025
51c6ec5
Merge branch 'main_upstream' into mmqna-phase3
HarshaRamayanam Feb 20, 2025
f84d52d
Test and documentation updates for image and audio data ingestion (#56)
dmsuehir Feb 20, 2025
186f7a8
Merge branch 'mmqna-phase3' into hramayan/tts-mmqna-ui
HarshaRamayanam Feb 20, 2025
a0b8177
Added TTS validation test (#59)
okhleif-10 Feb 26, 2025
2137998
Added Logic for audio responses & refactored code to align with new g…
HarshaRamayanam Mar 4, 2025
a575dd3
Merge branch 'mmqna-phase3' into hramayan/tts-mmqna-ui
HarshaRamayanam Mar 4, 2025
59fb709
Minr bug fixes and UI changes
HarshaRamayanam Mar 5, 2025
4013a0d
UI layout update & handling empty text with spaces
HarshaRamayanam Mar 5, 2025
cd4c645
Updates on review comments
HarshaRamayanam Mar 5, 2025
a2cf4dd
Update on review comments
HarshaRamayanam Mar 5, 2025
6f39ff1
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Mar 6, 2025
b5a0e27
Merge branch 'mmqna-phase3' into hramayan/tts-mmqna-ui
HarshaRamayanam Mar 6, 2025
58734e9
Update MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py
HarshaRamayanam Mar 7, 2025
1e09283
Some updates to review comments. More to come after testing
HarshaRamayanam Mar 7, 2025
2c4ead5
Restrict file media types to known/working formats
HarshaRamayanam Mar 7, 2025
1ce67e2
Remove extra whitespace
HarshaRamayanam Mar 7, 2025
e9f0cd0
Fix test_compose_on_gaudi.sh script's diff not syncing with phase3
HarshaRamayanam Mar 7, 2025
5ad1c18
Changes per review comments
HarshaRamayanam Mar 10, 2025
4954f77
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Mar 12, 2025
3a34ec2
Added single space to the pload
HarshaRamayanam Mar 13, 2025
5b47407
Added logic to flush chatbot assistant's voice reponse .wav
HarshaRamayanam Mar 13, 2025
9189732
Merge branch 'mmqna-phase3' into hramayan/tts-mmqna-ui
HarshaRamayanam Mar 13, 2025
943fa9e
Enable audio caption upload in the UI (#61)
mhbuehler Mar 14, 2025
dc5065d
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Mar 17, 2025
dea974b
Merge branch 'mmqna-phase3' into hramayan/tts-mmqna-ui
HarshaRamayanam Mar 17, 2025
401570f
Fix spoken audio responses on Gaudi and add tests (#63)
dmsuehir Mar 17, 2025
4aa013c
Clearing PDF clears textbox and version upgrade fixes interactive=Fal…
mhbuehler Mar 17, 2025
d2a2bc4
Fixed issue where assistant's image is not sent
HarshaRamayanam Mar 18, 2025
c1843f7
Merge branch 'mmqna-phase3' into hramayan/tts-mmqna-ui
HarshaRamayanam Mar 18, 2025
4ed2117
Revert build yaml
HarshaRamayanam Mar 18, 2025
b4ba36c
Clear diff
HarshaRamayanam Mar 18, 2025
8d5690e
Add missing env vars for MMQnA UI data prep endpoints (#62)
dmsuehir Mar 18, 2025
36ee073
Fix first query test (#65)
mhbuehler Mar 18, 2025
bc43cc1
changes per review
HarshaRamayanam Mar 18, 2025
9aad174
small change
HarshaRamayanam Mar 18, 2025
abf0200
Update Dockerfile
HarshaRamayanam Mar 18, 2025
e24bddf
Merge pull request #60 from mhbuehler/hramayan/tts-mmqna-ui
HarshaRamayanam Mar 18, 2025
7005119
Update Dockerfile
HarshaRamayanam Mar 18, 2025
a5e201f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 19, 2025
e63fee0
Merge branch 'main' into mmqna-phase3
mhbuehler Mar 19, 2025
3 changes: 2 additions & 1 deletion MultimodalQnA/Dockerfile
@@ -20,7 +20,8 @@ WORKDIR $HOME
FROM base AS git

RUN apt-get update && apt-get install -y --no-install-recommends git
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git
# RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git
RUN git clone --single-branch --branch="mmqna-phase3" https://github.com/mhbuehler/GenAIComps.git
Review comment (Collaborator, Author): This change is for testing purposes and has to be reverted before merging.


# Stage 3: common layer shared by services using GenAIComps
FROM base AS comps-base
89 changes: 69 additions & 20 deletions MultimodalQnA/README.md
@@ -2,7 +2,7 @@

Suppose you possess a set of videos, images, audio files, PDFs, or some combination thereof and wish to perform question-answering to extract insights from these documents. To respond to your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.

`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user, which can be text or audio.
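
The retrieval step described in this paragraph can be illustrated with a toy sketch. It is not the project's implementation: the hand-written 3-dimensional vectors stand in for BridgeTower embeddings, the in-memory list stands in for the vector database, and cosine similarity is an assumed ranking metric (the README does not say which metric the vector store uses). The names `cosine_similarity`, `retrieve`, and `store` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, store, top_k=1):
    """Return the top_k (item, score) pairs most similar to the query embedding."""
    scored = [(item, cosine_similarity(query_embedding, emb)) for item, emb in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "vector store": made-up vectors, not real model output.
store = [
    ("video frame + caption", [0.9, 0.1, 0.0]),
    ("audio transcript", [0.0, 1.0, 0.0]),
    ("pdf page", [0.1, 0.0, 0.95]),
]

# The best-matching item would be passed to the LVM as context for the answer.
best = retrieve([1.0, 0.2, 0.0], store, top_k=1)
```

In the real pipeline the query and the stored items are embedded into the same semantic space, which is what makes a single similarity ranking across images, transcripts, and captions possible.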

The MultimodalQnA architecture is shown below:

@@ -41,12 +41,14 @@ flowchart LR
UI([UI server<br>]):::orchid
end

ASR{{Whisper service <br>}}
TEI_EM{{Embedding service <br>}}
VDB{{Vector DB<br><br>}}
R_RET{{Retriever service <br>}}
DP([Data Preparation<br>]):::blue
LVM_gen{{LVM Service <br>}}
GW([MultimodalQnA GateWay<br>]):::orange
TTS{{SpeechT5 service <br>}}

%% Data Preparation flow
%% Ingest data flow
@@ -74,25 +76,42 @@ flowchart LR
R_RET <-.->VDB
DP <-.->VDB

%% Audio speech recognition used for translating audio queries to text
GW <-.-> ASR

%% Generate spoken responses with text-to-speech using the SpeechT5 model
GW <-.-> TTS

```

This MultimodalQnA use case performs Multimodal-RAG using LangChain, Redis VectorDB and Text Generation Inference on [Intel Gaudi2](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html) and [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html), and we invite contributions from other hardware vendors to expand the example.

The [Whisper Service](https://github.com/opea-project/GenAIComps/blob/main/comps/asr/src/README.md)
is used by MultimodalQnA for converting audio queries to text. If a spoken response is requested, the
[SpeechT5 Service](https://github.com/opea-project/GenAIComps/blob/main/comps/tts/src/README.md) translates the text
response from the LVM to a speech audio file.
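
The routing this paragraph describes (speech-to-text only for audio queries, text-to-speech only when a spoken response is requested) can be sketched as follows. This is a hedged illustration, not the gateway's code: `answer_query` and its parameters are hypothetical names, the three services are passed in as plain callables, and the retrieval step is omitted for brevity.

```python
def answer_query(query, *, is_audio, want_audio, asr, lvm, tts):
    """Route one query through optional ASR, the LVM, and optional TTS.

    asr, lvm, and tts are caller-supplied callables standing in for the
    Whisper, LVM, and SpeechT5 services (an assumption of this sketch).
    """
    # Audio queries are first converted to text.
    text_query = asr(query) if is_audio else query
    text_answer = lvm(text_query)
    # A spoken response is produced only when requested.
    if want_audio:
        return {"text": text_answer, "audio": tts(text_answer)}
    return {"text": text_answer}


# Stand-in callables so the sketch runs without any services.
def fake_asr(audio_bytes):
    return "what color is the apple?"


def fake_lvm(prompt):
    return f"answer to: {prompt}"


def fake_tts(text):
    return b"WAV:" + text.encode("utf-8")


result = answer_query(
    b"\x00\x01", is_audio=True, want_audio=True,
    asr=fake_asr, lvm=fake_lvm, tts=fake_tts,
)
```

Passing the services in as callables keeps the control flow testable on its own; a real client would replace the stand-ins with HTTP calls to the `/v1/asr` and `/v1/tts` endpoints.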

The Intel Gaudi2 accelerator supports both training and inference for deep learning models in particular for LLMs. Visit [Habana AI products](https://habana.ai/products) for more details.

The table below describes, for each microservice component in the MultimodalQnA architecture, the default open source project, hardware, port, and endpoint.

<details>
<summary><b>Gaudi default compose.yaml</b></summary>

| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | --------------------- | ----- | ---- | ----------------------------------------------------------- |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
<summary><b>Gaudi and Xeon default compose.yaml settings</b></summary>

| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | ----------------------- | ----- | ---- | ----------------------------------------------------------- |
| Dataprep | Redis, Langchain, TGI | Xeon | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| LVM | Langchain, Transformers | Xeon | 9399 | /v1/lvm |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/retrieval |
| SpeechT5 | Transformers | Xeon | 7055 | /v1/tts |
| Whisper | Transformers | Xeon | 7066 | /v1/asr |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
| Embedding | Langchain | Gaudi | 6000 | /v1/embeddings |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Retriever | Langchain, Redis | Gaudi | 7000 | /v1/retrieval |
| SpeechT5 | Transformers | Gaudi | 7055 | /v1/tts |
| Whisper | Transformers | Gaudi | 7066 | /v1/asr |

</details>
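
As a small illustration of how the table can be used, the sketch below builds service URLs from a host address. The ports and paths are copied from the updated rows of the table (including `/v1/retrieval`); the `SERVICES` mapping, its key names, and the `service_url` helper are hypothetical and not part of the project.

```python
# Ports and paths copied from the table above; key names are illustrative only.
SERVICES = {
    "dataprep_ingest": (6007, "/v1/ingest"),
    "embedding": (6000, "/v1/embeddings"),
    "lvm": (9399, "/v1/lvm"),
    "retriever": (7000, "/v1/retrieval"),
    "tts": (7055, "/v1/tts"),
    "asr": (7066, "/v1/asr"),
}


def service_url(name, host_ip):
    """Return the default URL for a service, given the host's IP address."""
    port, path = SERVICES[name]
    return f"http://{host_ip}:{port}{path}"


retriever_url = service_url("retriever", "192.168.1.1")
```

If a deployment overrides the default ports through environment variables, the mapping would need to be updated to match.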

@@ -104,18 +123,41 @@ By default, the embedding and LVM models are set to a default value as listed be
| --------- | ----- | ----------------------------------------- |
| embedding | Xeon | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM | Xeon | llava-hf/llava-1.5-7b-hf |
| SpeechT5 | Xeon | microsoft/speecht5_tts |
| Whisper | Xeon | openai/whisper-small |
| embedding | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf |
| SpeechT5 | Gaudi | microsoft/speecht5_tts |
| Whisper | Gaudi | openai/whisper-small |

You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.

## Deploy MultimodalQnA Service

The MultimodalQnA service can be effortlessly deployed on either Intel Gaudi2 or Intel XEON Scalable Processors.

Currently we support deploying MultimodalQnA services with docker compose.
Currently we support deploying MultimodalQnA services with docker compose. The [`docker_compose`](docker_compose)
directory has folders which include `compose.yaml` files for different hardware types:

### Setup Environment Variable
```
📂 docker_compose
├── 📂 amd
│   └── 📂 gpu
│   └── 📂 rocm
│   ├── 📄 compose.yaml
│   └── ...
└── 📂 intel
├── 📂 cpu
│   └── 📂 xeon
│   ├── 📄 compose.yaml
│   └── ...
└── 📂 hpu
└── 📂 gaudi
├── 📄 compose.yaml
└── ...
```

### Setup Environment Variables

To set up environment variables for deploying MultimodalQnA services, follow these steps:

@@ -124,8 +166,10 @@ To set up environment variables for deploying MultimodalQnA services, follow the
```bash
# Example: export host_ip=$(hostname -I | awk '{print $1}')
export host_ip="External_Public_IP"

# Append the host_ip to the no_proxy list to allow container communication
# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
export no_proxy="Your_No_Proxy"
export no_proxy="${no_proxy},${host_ip}"
```

2. If you are in a proxy environment, also set the proxy-related environment variables:
@@ -137,36 +181,41 @@ To set up environment variables for deploying MultimodalQnA services, follow the

3. Set up other environment variables:

> Notice that you can only choose **one** command below to set up envs according to your hardware. Other that the port numbers may be set incorrectly.
> Choose **one** command below to set env vars according to your hardware. Otherwise, the port numbers may be set incorrectly.

```bash
# on Gaudi
source ./docker_compose/intel/hpu/gaudi/set_env.sh
cd docker_compose/intel/hpu/gaudi
source ./set_env.sh

# on Xeon
source ./docker_compose/intel/cpu/xeon/set_env.sh
cd docker_compose/intel/cpu/xeon
source ./set_env.sh
```
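
One detail of step 1 is worth spelling out: `export no_proxy="${no_proxy},${host_ip}"` produces a value with a leading comma when `no_proxy` was previously unset or empty, and repeats the address if the line is run twice. Whether that matters depends on the tool reading the variable. The sketch below shows one way to append the address defensively; `append_no_proxy` is a hypothetical helper, not something the project provides.

```python
def append_no_proxy(no_proxy, host_ip):
    """Append host_ip to a comma-separated no_proxy value.

    Drops empty entries and surrounding spaces, and avoids adding
    host_ip a second time if it is already listed.
    """
    entries = [entry.strip() for entry in no_proxy.split(",") if entry.strip()]
    if host_ip not in entries:
        entries.append(host_ip)
    return ",".join(entries)


updated = append_no_proxy("localhost, 127.0.0.1", "10.0.0.5")
```

The result can then be exported in the shell that runs `set_env.sh` and `docker compose`.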

### Deploy MultimodalQnA on Gaudi

Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) to build docker images from source.
Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) if you would like to build docker images from
source, otherwise images will be pulled from Docker Hub.

Find the corresponding [compose.yaml](./docker_compose/intel/hpu/gaudi/compose.yaml).

```bash
cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi/
# While still in the docker_compose/intel/hpu/gaudi directory, use docker compose to bring up the services
docker compose -f compose.yaml up -d
```

> Notice: Currently only the **Habana Driver 1.17.x** is supported for Gaudi.
> Notice: Currently only the **Habana Driver 1.18.x** is supported for Gaudi.

### Deploy MultimodalQnA on Xeon

Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) for more instructions on building docker images from source.
Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) if you would like to build docker images from
source, otherwise images will be pulled from Docker Hub.

Find the corresponding [compose.yaml](./docker_compose/intel/cpu/xeon/compose.yaml).

```bash
cd GenAIExamples/MultimodalQnA/docker_compose/intel/cpu/xeon/
# While still in the docker_compose/intel/cpu/xeon directory, use docker compose to bring up the services
docker compose -f compose.yaml up -d
```

2 changes: 1 addition & 1 deletion MultimodalQnA/docker_compose/amd/gpu/rocm/README.md
@@ -178,7 +178,7 @@ curl http://${host_ip}:$MM_EMBEDDING_PORT_MICROSERVICE/v1/embeddings \

```bash
export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(512)]; print(embedding)")
curl http://${host_ip}:7000/v1/multimodal_retrieval \
curl http://${host_ip}:7000/v1/retrieval \
-X POST \
-H "Content-Type: application/json" \
-d "{\"text\":\"test\",\"embedding\":${your_embedding}}"
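
For readers who prefer Python to `curl`, the request in this hunk can be sketched with the standard library. The URL path, port, 512-element random embedding, and JSON fields mirror the updated `curl` command; the helper name `build_retrieval_request` and the `127.0.0.1` address are assumptions for illustration. The sketch only builds the request; sending it requires a running retriever service.

```python
import json
import random
import urllib.request


def build_retrieval_request(host_ip, text, dim=512, port=7000):
    """Build (but do not send) a POST request for the /v1/retrieval endpoint."""
    # Random values stand in for a real embedding, as in the curl example.
    embedding = [random.uniform(-1, 1) for _ in range(dim)]
    body = json.dumps({"text": text, "embedding": embedding}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host_ip}:{port}/v1/retrieval",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_retrieval_request("127.0.0.1", "test")
# To send it against a running service: urllib.request.urlopen(request)
```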
2 changes: 2 additions & 0 deletions MultimodalQnA/docker_compose/amd/gpu/rocm/compose.yaml
@@ -175,6 +175,8 @@ services:
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
- DATAPREP_GET_FILE_ENDPOINT=${DATAPREP_GET_FILE_ENDPOINT}
- DATAPREP_DELETE_FILE_ENDPOINT=${DATAPREP_DELETE_FILE_ENDPOINT}
ipc: host
restart: always
