This guide is preparation for a Solutions Architect (IC) role at Baseten. It covers the end-to-end customer journey, the core platform pillars, POC methodology, hardware selection, scripting, and inference engineering fundamentals.
Work through each section in order. Each builds on the last:
- 00: Inference Engineering Foundations. The mental models you need before touching any platform. Covers TTFT, throughput, batching, quantization, KV cache, and prefill vs decode: the physics of inference.
- 01: The Customer Journey (End-to-End). What a customer experiences from first contact through production deployment. Maps to the SA role: where you add value at each stage.
- 02: Baseten Core Pillars. Deep dive into the four product areas: Model Performance, MCM/Infra, DevUI & Truss, and Post-Training. What differentiates each from competitors and on-prem alternatives.
- 03: Hardware Selection Guide. H100 vs B200 vs A100: how hardware choice affects throughput, latency, and cost-per-token, and what a customer ultimately selects.
- 04: The POC Playbook. How a Solutions Architect runs a proof-of-concept: model selection, deploy, optimize, benchmark, and present findings that prove better throughput/latency than what the customer has today.
- 05: Scripting for SAs. Python and bash scripts for benchmarking, deployment automation, metrics collection, and customer-facing reporting. Hands-on examples.
- 06: Competitive Landscape & On-Prem. How Baseten compares to running vLLM on your own GPUs or using Replicate/Modal/RunPod/Together, and what arguments win deals.
- 07: Blind Spots & Glossary. Things that trip up people new to inference engineering. Terminology, common misconceptions, and interview-relevant gotchas.
- 08: Book Corrections & Additions (READ THIS FIRST). Cross-referenced against Philip Kiely's "Inference Engineering" (Baseten Books, 2026). 20 gaps, corrections, and key additions. Covers disaggregation, EAGLE speculation, ops:byte ratio, SGLang, NVIDIA Dynamo, quantization sensitivity, cache-aware routing, H200/B300 GPUs, MIG, distillation, and more. This is the errata sheet: read it alongside the originals.
All scripts live in scripts/. They are runnable examples that demonstrate SA-relevant workflows:
- deploy_model.py: Deploy a model via Truss programmatically
- benchmark.py: Load test an endpoint, measure TTFT/throughput/p95
- compare_quantizations.py: Deploy same model at fp16/fp8/fp4, compare
- cost_calculator.py: Calculate cost-per-token for different GPU configs
- generate_report.py: Generate a customer-facing POC report from benchmark data
- health_check.sh: Quick endpoint health/latency check (bash)
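The cost-per-token comparison mentioned above comes down to simple arithmetic: hourly GPU spend divided by tokens generated per hour. The sketch below is not the contents of cost_calculator.py; it is a minimal illustration of that arithmetic. The function name, the config names, and every price and throughput figure are hypothetical placeholders, not real Baseten prices or benchmark results.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float, gpu_count: int, tokens_per_second: float
) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    hourly_spend = gpu_hourly_usd * gpu_count
    return hourly_spend / tokens_per_hour * 1_000_000


# Placeholder inputs for illustration only (assumed, not measured).
configs = {
    "config-a": {"gpu_hourly_usd": 3.60, "gpu_count": 1, "tokens_per_second": 1000.0},
    "config-b": {"gpu_hourly_usd": 7.20, "gpu_count": 1, "tokens_per_second": 3000.0},
}

for name, cfg in configs.items():
    print(f"{name}: ${cost_per_million_tokens(**cfg):.2f} per 1M tokens")
```

With these made-up inputs, the configuration with the higher hourly price comes out cheaper per token because its throughput rises faster than its price. That is the kind of trade-off a hardware comparison needs to surface.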
# Install dependencies
pip install truss openai requests numpy tabulate matplotlib
# Authenticate with Baseten
uvx truss login
# Run your first benchmark against a pre-optimized model
python scripts/benchmark.py --model "deepseek-v3" --requests 100
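The internals of benchmark.py are not shown here. As a rough sketch of how a load test's raw samples could be reduced to the TTFT, throughput, and p95 figures it reports, the following uses only the standard library. The function names, the nearest-rank percentile method, and the input shapes are all assumptions made for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ttft_seconds, output_tokens, wall_seconds):
    """Reduce per-request measurements to headline benchmark numbers.

    ttft_seconds:  time to first token for each request, in seconds
    output_tokens: number of tokens generated by each request
    wall_seconds:  total wall-clock duration of the whole run
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return {
        "requests": len(ttft_seconds),
        "ttft_p50_s": percentile(ttft_seconds, 50),
        "ttft_p95_s": percentile(ttft_seconds, 95),
        # Aggregate throughput across all concurrent requests.
        "tokens_per_second": sum(output_tokens) / wall_seconds,
    }
```

Reporting p95 alongside the median matters in a POC because tail latency, not the average, is usually what a customer's users notice under load.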