Metaphor is a tensor computation library built around lazy graph construction and deferred execution. Operations are recorded as a symbolic graph and compiled into fused kernels at execution time, enabling automatic operator fusion, memory lifecycle management, and device-agnostic code.
Tensors are lightweight handles. Operations on them produce graph nodes, not immediate results. When a value is needed, the graph is compiled into an execution sequence that:
- Fuses compatible element-wise operations into single kernel launches
- Manages tensor memory lifetimes via reference-counted liveness analysis
- Supports automatic differentiation through the same graph infrastructure
- Compiles once and re-executes via fingerprint-based change detection
Three backends are supported:
- Host: JIT-compiled kernels with BLAS acceleration (MKL/OpenBLAS)
- HIP: AMD GPUs via ROCm, hipTensor, and runtime kernel compilation
- CUDA: NVIDIA GPUs via cuTENSOR, cuDNN, and NVRTC
Backend selection is transparent to user code. The same model definition runs on any supported device.
Backward passes are constructed by composing forward operations — the same mechanism used for forward computation. Gradient kernels are fused and scheduled through the execution graph like any other operation.
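The idea of building gradients out of ordinary forward operations can be sketched as follows. This is an illustration in plain Python on scalars, not Metaphor's implementation: `Node`, `constant`, `evaluate`, and `grad` are hypothetical names, and the sketch uses simple recursive symbolic differentiation rather than the reverse-mode scheduling a tensor library would normally use. The point it shows is that the derivative is itself a graph of `add` and `mul` nodes, so the same evaluator (or compiler) can run it.

```python
class Node:
    """Graph node for a scalar expression built from constants, add, and mul."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def constant(v):
    return Node("const", value=float(v))


def evaluate(node):
    """Execute a graph; used unchanged for forward and gradient graphs."""
    if node.op == "const":
        return node.value
    a, b = (evaluate(n) for n in node.inputs)
    return a + b if node.op == "add" else a * b


def grad(output, wrt):
    """Return a graph for d(output)/d(wrt), composed only of forward ops."""
    if output is wrt:
        return constant(1.0)
    if output.op == "const":
        return constant(0.0)
    a, b = output.inputs
    da, db = grad(a, wrt), grad(b, wrt)
    if output.op == "add":
        return da + db          # sum rule, recorded as an add node
    return da * b + a * db      # product rule, recorded as mul and add nodes


x = constant(3.0)
w = constant(2.0)
y = x * w + x * x               # y = x*w + x^2

dy_dx = grad(y, x)              # a graph, not a number
dy_dw = grad(y, w)

y_val = evaluate(y)             # 3*2 + 3*3
dy_dx_val = evaluate(dy_dx)     # w + 2x
dy_dw_val = evaluate(dy_dw)     # x
```

Because `dy_dx` is an ordinary graph, anything that applies to forward graphs (fusion, scheduling, caching) would apply to it without a separate code path.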
/// Apply rotary position embeddings to q or k (batched prefill).
/// x: [B, S, H, head_dim], cos_pos: [1, S, 1, head_dim/2], sin_pos: [1, S, 1, head_dim/2]
/// Returns [B, S, H, head_dim] with RoPE applied.
fn Tensor? apply_rope_batched(Tensor x, Tensor cos_pos, Tensor sin_pos)
{
    ulong half = x.shape().get(3) / 2;
    Tensor x1 = x.slice({ { 0, half, 1 } }, offset: 3)!;
    Tensor x2 = x.slice({ { half, 0, 1 } }, offset: 3)!;
    Tensor rot1 = x1 * cos_pos - x2 * sin_pos;
    Tensor rot2 = x2 * cos_pos + x1 * sin_pos;
    return tensor::cat({ rot1, rot2 }, 3);
}

// 10. Output: y = C * h + D * x
Tensor y = c_proj.einsum(h, "bld,bled->ble")!!
    .add(x_silu.einsum(self.d_vec, "ble,e->ble")!!)!!;
// 11. Gate with z_branch
Tensor gated = y.mul(z_branch.silu()!!)!!;
// 12. Output projection with optional LoRA
return self.has_out_proj_lora
    ? lora_forward(&self.out_proj_lora, gated, self.out_proj, dev)!!.stable()
    : functional::linear(gated, self.out_proj)!!.stable();

host_device::HostDevice cpu;
cpu.init();
DeviceReference dev = cpu.reference();
graph::@subgraph()
{
    Tensor a = tensor::constant(F32, dev, { 2, 3 }, 2.0)!!;
    Tensor b = tensor::constant(F32, dev, { 2, 3 }, 3.0)!!;
    Tensor c = (a + b).collect();
    dev.sync();
    float[6] result;
    c.get(&result);
    for (usz i = 0; i < 6; i++)
    {
        assert(math::abs(result[i] - 5.0f) < TOL, "add: expected 5.0");
    }
};

c3c build # build the static library
cmake --build build -j$(nproc) # build GPU shared libraries
timeout 60 c3c test metaphor # run the full test suite