andrewCodeDev/metaphor


Metaphor

Metaphor is a tensor computation library built around lazy graph construction and deferred execution. Operations are recorded as a symbolic graph and compiled into fused kernels at execution time, enabling automatic operator fusion, memory lifecycle management, and device-agnostic code.

Design

Tensors are lightweight handles. Operations on them produce graph nodes, not immediate results. When a value is needed, the graph is compiled into an execution sequence that:

  • Fuses compatible element-wise operations into single kernel launches
  • Manages tensor memory lifetimes via reference-counted liveness analysis
  • Supports automatic differentiation through the same graph infrastructure
  • Compiles once and re-executes via fingerprint-based change detection
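
As a concrete sketch (using the `tensor::constant` and `collect` API from the Snippets section below), no kernels run while operations are composed; a later `collect()` triggers compilation and dispatch:

```c3
// Nothing executes here: each op only appends a node to the graph.
Tensor a = tensor::constant(F32, dev, { 2, 3 }, 1.0)!!;
Tensor b = tensor::constant(F32, dev, { 2, 3 }, 2.0)!!;
Tensor c = a * b + a; // recorded as mul and add nodes

// collect() compiles the graph -- the element-wise mul and add are
// candidates for fusion into a single kernel -- and schedules it.
Tensor d = c.collect();
```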

Backends

  • Host — JIT-compiled kernels with BLAS acceleration (MKL/OpenBLAS)
  • HIP — AMD GPUs via ROCm, hipTENSOR, and runtime kernel compilation
  • CUDA — NVIDIA GPUs via cuTENSOR, cuDNN, and NVRTC

Backend selection is transparent to user code. The same model definition runs on any supported device.
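
The swap point is the `DeviceReference` handle: tensors are created against a device reference, so pointing `dev` at a different backend leaves the rest of the code unchanged. A sketch, where `hip_device::HipDevice` is a hypothetical name modeled on `host_device::HostDevice` from the snippet below:

```c3
// Host backend:
host_device::HostDevice cpu;
cpu.init();
DeviceReference dev = cpu.reference();

// Hypothetical HIP backend -- only these three lines would change:
// hip_device::HipDevice gpu;
// gpu.init();
// DeviceReference dev = gpu.reference();

// Everything downstream is written against dev and stays identical.
Tensor w = tensor::constant(F32, dev, { 4, 4 }, 0.5)!!;
```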

Autodiff

Backward passes are constructed by composing forward operations — the same mechanism used for forward computation. Gradient kernels are fused and scheduled through the execution graph like any other operation.
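
A hedged sketch of what this looks like in use; `backward()` and `grad()` are hypothetical entry points, since the README does not show the autodiff API:

```c3
// Hypothetical autodiff usage -- backward()/grad() names are assumed.
Tensor x = tensor::constant(F32, dev, { 2, 2 }, 3.0)!!;
Tensor y = (x * x).collect(); // forward graph

// Constructing the backward pass composes forward ops: d(x*x)/dx = 2*x
// is itself recorded as graph nodes, then fused and scheduled like
// any other operation.
y.backward();
Tensor dx = x.grad();
```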

Snippets

```c3
/// Apply rotary position embeddings to q or k (batched prefill).
/// x: [B, S, H, head_dim], cos_pos: [1, S, 1, head_dim/2], sin_pos: [1, S, 1, head_dim/2]
/// Returns [B, S, H, head_dim] with RoPE applied.
fn Tensor? apply_rope_batched(Tensor x, Tensor cos_pos, Tensor sin_pos)
{
	ulong half = x.shape().get(3) / 2;
	Tensor x1 = x.slice({ { 0, half, 1 } }, offset: 3)!;
	Tensor x2 = x.slice({ { half, 0, 1 } }, offset: 3)!;

	Tensor rot1 = x1 * cos_pos - x2 * sin_pos;
	Tensor rot2 = x2 * cos_pos + x1 * sin_pos;
	return tensor::cat({ rot1, rot2 }, 3);
}
```

Excerpt from a state-space model block's forward pass (output, gating, and output projection):

```c3
// 10. Output: y = C * h + D * x
Tensor y = c_proj.einsum(h, "bld,bled->ble")!!
	.add(x_silu.einsum(self.d_vec, "ble,e->ble")!!)!!;

// 11. Gate with z_branch
Tensor gated = y.mul(z_branch.silu()!!)!!;

// 12. Output projection with optional LoRA
return self.has_out_proj_lora
	? lora_forward(&self.out_proj_lora, gated, self.out_proj, dev)!!.stable()
	: functional::linear(gated, self.out_proj)!!.stable();
```
A minimal end-to-end example on the host backend (`TOL` was undefined in the original snippet; a tolerance constant is added here):

```c3
const float TOL = 1e-6f; // float comparison tolerance

host_device::HostDevice cpu;
cpu.init();
DeviceReference dev = cpu.reference();

graph::@subgraph()
{
	Tensor a = tensor::constant(F32, dev, { 2, 3 }, 2.0)!!;
	Tensor b = tensor::constant(F32, dev, { 2, 3 }, 3.0)!!;
	Tensor c = (a + b).collect();

	dev.sync();

	float[6] result;
	c.get(&result);

	for (usz i = 0; i < 6; i++)
	{
		assert(math::abs(result[i] - 5.0f) < TOL, "add: expected 5.0");
	}
};
```

Building

```shell
c3c build                          # build the static library
cmake --build build -j$(nproc)     # build GPU shared libraries
```

Testing

```shell
timeout 60 c3c test metaphor       # run the full test suite
```
