Document / support for using BFLOAT16 with (Xeon) TGI service

The model used for ChatQnA supports BFLOAT16, in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3

TGI memory usage halves from 30GB to 15GB (and also its perf increases somewhat) if one tells it to use BFLOAT16:
```
--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
         args:
         - --model-id
         - $(LLM_MODEL_ID)
+        - --dtype
+        - bfloat16
         #- "/data/Llama-2-7b-hf"
         # - "/data/Mistral-7B-Instruct-v0.2"
         # - --quantize
```

However, only newer Xeons support BFLOAT16.  Therefore, if user' cluster has heterogeneous nodes, TGI service needs a node selector that schedules it on a node with BFLOAT16 support.

This can be automated by using `node-feature-discovery` and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu

It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document / support for using BFLOAT16 with (Xeon) TGI service #330

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Document / support for using BFLOAT16 with (Xeon) TGI service #330

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions