KFabrik Documentation¶

Welcome to KFabrik, an integrated platform for deploying and managing Large Language Models (LLMs) on local Kubernetes clusters.

What is KFabrik?¶

KFabrik enables ML developers to deploy LLM inference servers on minikube for local development and testing with a single command. A typical deployment takes under 10 minutes and requires no manual configuration of Kubernetes resources, service mesh routing, or monitoring infrastructure.

Note: KFabrik is designed for development and testing purposes only. It is not intended for production deployments.

The Problem We Solve¶

ML developers face significant friction when testing LLM deployments locally:

Configuration Burden: Deploying a single model requires creating and maintaining dozens of Kubernetes manifests across multiple namespaces
Dependency Management: The ML serving stack has strict ordering requirements that frequently fail due to race conditions
Observability Gap: Understanding model performance requires correlating inference latency metrics with GPU utilization
Reproducibility: Development environments vary between team members, making it hard to share and reproduce issues

KFabrik solves these problems by providing an opinionated, fully-integrated ML inference platform that deploys with a single command.

Core Components¶

KFabrik comprises a CLI and three minikube addons:

kfabrik CLI¶

Command-line interface for model deployment, querying, and management. Provides commands for:

Starting/stopping clusters with GPU support
Deploying and managing models
Querying models via OpenAI-compatible API
Viewing logs and status

kfabrik-bootstrap Addon¶

Installs foundational infrastructure:

Component	Purpose
Cert-Manager	TLS certificate management
Istio	Service mesh and ingress routing
KServe	Model serving platform
NVIDIA Device Plugin	GPU resource scheduling

kfabrik-model Addon¶

Provides pre-configured model definitions optimized for consumer GPUs (6GB VRAM):

Model	Parameters	VRAM	Description
qwen-small	0.5B	~1GB	Qwen 2.5 0.5B Instruct
qwen-medium	1.5B	~3GB	Qwen 2.5 1.5B Instruct
tinyllama	1.1B	~2.5GB	TinyLlama 1.1B Chat
smollm2	1.7B	~3.5GB	SmolLM2 1.7B Instruct
phi2	2.7B	~5.5GB	Microsoft Phi-2

kfabrik-monitoring Addon¶

Deploys observability stack:

Component	Purpose
Prometheus	Metrics collection
Grafana	Visualization
DCGM Exporter	GPU metrics

Quick Start¶

# Clone the custom minikube repository
git clone https://github.com/kfabrik/minikube.git
cd minikube

# Build minikube with kfabrik addons
make build

# Install kfabrik CLI and minikube
./scripts/install.sh

# Start cluster with GPU support
kfabrik cluster start

# List available models
kfabrik list

# Deploy a model
kfabrik deploy --models qwen-small --wait

# Query the model
kfabrik query --model qwen-small --prompt "What is Kubernetes?"

# Clean up
kfabrik delete --model qwen-small
kfabrik cluster stop

Design Principles¶

Simplicity over flexibility: Optimized for the common case of deploying HuggingFace models on GPU-enabled minikube clusters.

Consistent environments: All developers get identical local configurations using standard KServe InferenceService specifications and Istio routing rules.

Explicit over implicit: All configuration is visible and auditable. Standard Kubernetes resources that can be inspected, modified, and versioned.

Fast feedback: Deployment speed optimized through parallel installations, aggressive health checks, and clear progress reporting.

GPU-first: Assumes GPU workloads are the primary use case. Resource defaults are tuned for NVIDIA GPUs with 6GB+ VRAM.

Quick Links¶

Getting Started - Installation and setup guide
CLI Reference - Complete command documentation
Architecture - Detailed design documentation
Addons - Addon configuration and troubleshooting
Contributing - How to contribute to the project