Priority
P1-Stopper
OS type
Ubuntu
Hardware type
Gaudi/AMD GPU
Running nodes
Single Node
Description
Feature Objective:
Make vLLM the default serving framework on Gaudi and AMD GPU for all remaining GenAI examples, with the goal of higher inference throughput and lower latency than the current TGI default.
Feature Details:
Replace TGI with vLLM as the default serving backend for inference on Xeon/Gaudi/AMD GPU devices.
Update serving configurations to align with vLLM's architecture for inference.
Benchmark vLLM against TGI on Xeon/Gaudi/AMD GPU hardware, comparing time to first token (TTFT), time per output token (TPOT), and scalability.
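For reference, TTFT (time to first token) and TPOT (time per output token) can be derived from per-token arrival timestamps recorded by a streaming client. The following is a minimal sketch under that assumption; the function names and the timestamp-collection approach are illustrative and are not part of vLLM, TGI, or any existing benchmark tooling in this project.

```python
from typing import Sequence


def ttft(request_start: float, token_times: Sequence[float]) -> float:
    """Time to first token: delay between sending the request and the first token."""
    if not token_times:
        raise ValueError("no tokens were received")
    return token_times[0] - request_start


def tpot(token_times: Sequence[float]) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def output_throughput(request_start: float, token_times: Sequence[float]) -> float:
    """Output tokens per second over the whole request, including the TTFT wait."""
    if not token_times:
        raise ValueError("no tokens were received")
    return len(token_times) / (token_times[-1] - request_start)


# Hypothetical timestamps in seconds, relative to the moment the request was sent.
start = 0.0
arrivals = [0.25, 0.5, 0.75, 1.0]

print(ttft(start, arrivals))               # 0.25
print(tpot(arrivals))                      # 0.25
print(output_throughput(start, arrivals))  # 4.0
```

Running the same measurement against both backends with identical prompts, output lengths, and concurrency gives directly comparable numbers for the TGI and vLLM defaults.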
Expected Outcome:
With vLLM as the default framework, users should see lower latency and throughput above current TGI levels on Xeon/Gaudi/AMD GPU.
Feature Scope:
Intel Xeon and Gaudi
Execution plan:
In v1.3, upgrade vllm-fork to v0.6.6.post1+Gaudi-1.20.0.
Examples:
Components:
AMD/GPU ROCm
Examples:
Components: