Architecture

This document describes the architecture, component interactions, and operational characteristics of the KFabrik platform.

Overview

KFabrik consists of four primary components that work together to provide a complete ML inference platform:

flowchart TB
    subgraph Workstation["User Workstation"]
        CLI["kfabrik CLI"]
    end

    CLI -->|"kubectl / Kubernetes API"| Cluster

    subgraph Cluster["Minikube Cluster (Docker Driver)"]
        subgraph Bootstrap["kfabrik-bootstrap addon"]
            CertManager["Cert-Manager"]
            Istio["Istio"]
            KServe["KServe"]
            NvidiaPlugin["NVIDIA Device Plugin"]
        end

        subgraph Model["kfabrik-model addon"]
            subgraph ModelServing["model-serving namespace"]
                ConfigMap["ConfigMap: model-config"]
                InferenceServices["InferenceServices"]
                Predictors["Predictor Pods"]
            end
        end

        subgraph Monitoring["kfabrik-monitoring addon"]
            Prometheus["Prometheus"]
            Grafana["Grafana"]
            DCGM["DCGM Exporter"]
        end
    end

Component Responsibilities

kfabrik CLI

Provides a command-line interface for deploying models, querying inference endpoints, and managing the model lifecycle. The CLI communicates with the Kubernetes API to create InferenceService resources and uses kubectl port-forwarding to access inference endpoints.

Kubernetes Client Architecture:

  • Typed Clientset (kubernetes.Clientset) for standard Kubernetes resources (ConfigMaps, Pods, Services)
  • Dynamic Client (dynamic.Interface) for custom resources (InferenceServices)

kfabrik-bootstrap Addon

Installs the foundational infrastructure required for model serving:

  • Cert-Manager for TLS certificate management
  • Istio for service mesh routing
  • KServe for model lifecycle management
  • NVIDIA Device Plugin for GPU resource scheduling

The addon uses a Kubernetes Job to orchestrate Helm-based installations in dependency order.
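A minimal sketch of what such an installer Job could look like; the image, script, and resource names here are illustrative assumptions, not the manifest shipped with the addon:

apiVersion: batch/v1
kind: Job
metadata:
  name: kfabrik-bootstrap-installer   # assumed name
  namespace: kserve
spec:
  ttlSecondsAfterFinished: 300        # auto-delete after completion (see Security Considerations)
  template:
    spec:
      serviceAccountName: kfabrik-bootstrap   # assumed name; needs cluster-admin (see below)
      restartPolicy: Never
      containers:
        - name: installer
          image: alpine/helm:3.14.0           # assumed installer image with helm available
          command: ["/bin/sh", "-c"]
          args:
            - |
              helm repo add jetstack https://charts.jetstack.io
              helm upgrade --install cert-manager jetstack/cert-manager \
                --namespace cert-manager --create-namespace \
                --set installCRDs=true --wait
              # ...Istio and KServe phases follow in dependency order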

kfabrik-model Addon

Creates the model-serving namespace and deploys a ConfigMap containing pre-configured model definitions. These definitions specify HuggingFace model URIs, resource requirements, and inference server parameters optimized for 6GB VRAM GPUs.
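The ConfigMap schema is defined by the addon; a hedged sketch of what a single entry might look like, reusing the values from the InferenceService example later in this document:

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
  namespace: model-serving
data:
  qwen-small: |                  # illustrative entry; the real schema may differ
    storageUri: hf://Qwen/Qwen2.5-0.5B-Instruct
    resources:
      requests: { cpu: "1", memory: 2Gi, nvidia.com/gpu: "1" }
      limits: { cpu: "2", memory: 4Gi, nvidia.com/gpu: "1" }
    env:
      MAX_MODEL_LEN: "2048"      # bounds context length for a 6GB VRAM budget
      DTYPE: float16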

kfabrik-monitoring Addon

Deploys the observability stack:

  • Prometheus for metrics collection
  • Grafana for visualization
  • DCGM Exporter for GPU-specific metrics

Prometheus is pre-configured to scrape KServe inference metrics and DCGM GPU metrics.
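A rough sketch of the corresponding scrape_configs entries (relabeling rules abbreviated; the pod label used for filtering is an assumption):

scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [monitoring]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: dcgm-exporter     # assumes the exporter pods carry app=dcgm-exporter
        action: keep
  - job_name: kserve-inferenceservices
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [model-serving]
    # relabeling to keep predictor pods and target the :9090 metrics port omitted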

Bootstrap Installation Sequence

The bootstrap addon installs components through a sequenced process with explicit health checks:

flowchart TD
    subgraph Phase1["Phase 1: Cert-Manager Installation"]
        P1A["Add Jetstack Helm repository"] --> P1B["Helm install cert-manager with CRDs"]
        P1B --> P1C["Wait for deployments"]
        P1C --> P1D["Verify CRD exists"]
    end

    subgraph Phase2["Phase 2: Istio Installation"]
        P2A["Add Istio Helm repository"] --> P2B["Helm install istio-base"]
        P2B --> P2C["Helm install istiod"]
        P2C --> P2D["Wait for istiod deployment"]
        P2D --> P2E["Create IngressClass"]
    end

    subgraph Phase3["Phase 3: KServe Cleanup"]
        P3A["Delete existing ClusterRoles"] --> P3B["Delete webhooks & CRDs"]
        P3B --> P3C["Wait for cleanup"]
    end

    subgraph Phase4["Phase 4: KServe Installation"]
        P4A["Helm install kserve-crd"] --> P4B["Helm install kserve controller"]
        P4B --> P4C["Wait for controller-manager"]
        P4C --> P4D["Verify CRD exists"]
    end

    subgraph Phase5["Phase 5: GPU Node Labeling"]
        P5A["Query nodes for GPU capacity"] --> P5B["Label GPU nodes"]
        P5B --> P5C["Log warning if no GPUs"]
    end

    Phase1 --> Phase2 --> Phase3 --> Phase4 --> Phase5
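Phase 2 ends by creating an IngressClass so KServe can route external traffic through Istio. The manifest for that step is small; the class name here is an assumption and must match KServe's ingress configuration:

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: istio                    # assumed name
spec:
  controller: istio.io/ingress-controller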

Design Decisions

Why RawDeployment Mode?

KServe supports two deployment modes: Serverless (using Knative) and RawDeployment (using standard Kubernetes Deployments). KFabrik currently uses RawDeployment for simplicity—it's easier to set up, debug, and operate for local development.

Knative support is planned for a future release.
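KServe chooses the mode per InferenceService via the serving.kserve.io/deploymentMode annotation, or cluster-wide via the inferenceservice-config ConfigMap in the kserve namespace. A sketch of the cluster-wide default; whether KFabrik sets this through Helm values or by patching the ConfigMap is an implementation detail:

apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  deploy: |
    {"defaultDeploymentMode": "RawDeployment"}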

Why Sequential Installation?

Parallel installation would be faster but would create race conditions: Istio requires Cert-Manager for TLS certificates, and KServe requires Istio for ingress routing. The sequential approach adds approximately 2 minutes to installation time but eliminates entire categories of failure modes.

Why Cleanup Before KServe Installation?

KServe's webhook configurations and CRDs can become orphaned when previous installations fail mid-process. The cleanup phase ensures idempotent behavior: running the installer multiple times produces the same result.

Model Deployment Flow

When a user runs kfabrik deploy --models qwen-small:

sequenceDiagram
    participant CLI as kfabrik CLI
    participant K8s as Kubernetes API
    participant KServe as KServe Controller
    participant Pod as Predictor Pod

    CLI->>K8s: Read model-config ConfigMap
    K8s-->>CLI: Return configuration
    CLI->>CLI: Parse YAML, apply defaults
    CLI->>K8s: Check if InferenceService exists

    alt Already exists
        K8s-->>CLI: Skip (idempotent)
    else Does not exist
        CLI->>K8s: Create InferenceService
        K8s->>KServe: Watch event
        KServe->>K8s: Create Deployment
        KServe->>K8s: Create Service
        KServe->>K8s: Create VirtualService
        KServe->>K8s: Create DestinationRule
        K8s->>Pod: Schedule pod
        Pod->>Pod: Download model from HuggingFace
        Pod->>Pod: Initialize inference server
        Pod-->>K8s: Readiness probe succeeds
        KServe->>K8s: Update status Ready=True
        K8s-->>CLI: Ready status (if --wait)
    end

InferenceService Specification

The CLI generates InferenceService resources with the following structure:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-small
  namespace: model-serving
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: hf://Qwen/Qwen2.5-0.5B-Instruct
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "2"
          memory: "4Gi"
          nvidia.com/gpu: "1"
      env:
        - name: HF_MODEL_ID
          value: "Qwen/Qwen2.5-0.5B-Instruct"
        - name: MAX_MODEL_LEN
          value: "2048"
        - name: DTYPE
          value: "float16"

Monitoring Architecture

flowchart TB
    subgraph monitoring["monitoring namespace"]
        Prometheus["Prometheus<br/>• Scrape jobs<br/>• Alert rules<br/>• TSDB storage"]
        Grafana["Grafana<br/>• Dashboards<br/>• Data sources<br/>• Alerting"]
        DCGM["DCGM Exporter<br/>(DaemonSet)<br/>GPU nodes only"]

        DCGM -->|"scrape :9400"| Prometheus
        Prometheus -->|"query :9090"| Grafana
    end

    subgraph model-serving["model-serving namespace"]
        qwen-small["qwen-small<br/>predictor :9090"]
        qwen-medium["qwen-medium<br/>predictor :9090"]
        phi2["phi2<br/>predictor :9090"]
    end

    qwen-small -->|scrape| Prometheus
    qwen-medium -->|scrape| Prometheus
    phi2 -->|scrape| Prometheus

Prometheus Scrape Jobs

| Scrape Job | Targets | Metrics |
|---|---|---|
| prometheus | Self (localhost:9090) | Prometheus internal metrics |
| kubernetes-apiservers | API server | API request latency, etcd metrics |
| kubernetes-nodes | All nodes | Node CPU, memory, disk |
| kubernetes-nodes-cadvisor | All nodes | Container resource usage |
| kubernetes-service-endpoints | Annotated services | Custom service metrics |
| kubernetes-pods | Annotated pods | Custom pod metrics |
| dcgm-exporter | GPU node pods | GPU utilization, temperature, memory |
| kserve-inferenceservices | Model predictor pods | Inference latency, throughput |
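The kubernetes-pods and kubernetes-service-endpoints jobs conventionally discover targets through the prometheus.io annotations; assuming KFabrik follows that convention, a predictor pod would opt in like this:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"   # the predictor metrics port (see Port Assignments)
    prometheus.io/path: "/metrics"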

GPU Metrics (DCGM Exporter)

| Metric | Type | Description |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization percentage |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory copy engine utilization |
| DCGM_FI_DEV_FB_FREE | gauge | Free framebuffer memory (MiB) |
| DCGM_FI_DEV_FB_USED | gauge | Used framebuffer memory (MiB) |
| DCGM_FI_DEV_FB_TOTAL | gauge | Total framebuffer memory (MiB) |
| DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (Celsius) |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | Memory temperature (Celsius) |
| DCGM_FI_DEV_POWER_USAGE | gauge | Power consumption (Watts) |
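These gauges can feed alert rules directly. A hypothetical Prometheus rule file, not shipped with KFabrik, that fires on sustained high temperature or near-full framebuffer memory:

groups:
  - name: gpu-alerts             # hypothetical rule group
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} above 85C for 5 minutes"
      - alert: GPUMemoryNearlyFull
        expr: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU framebuffer memory over 95% used"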

Security Considerations

Privilege Requirements

The kfabrik-bootstrap installer Job requires cluster-admin privileges for:

  • Creating CustomResourceDefinitions (cluster-scoped)
  • Creating ClusterRoles and ClusterRoleBindings
  • Creating ValidatingWebhookConfigurations
  • Installing Helm charts that modify cluster-wide resources

Mitigations:

  • Installer Job has TTL of 300 seconds (auto-deleted after completion)
  • ServiceAccount is scoped to kserve namespace
  • Installer runs once during addon enablement, not continuously
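In RBAC terms this typically amounts to a ServiceAccount bound to cluster-admin for the lifetime of the Job. A hedged sketch with assumed resource names:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kfabrik-bootstrap        # assumed name; referenced by the installer Job
  namespace: kserve
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kfabrik-bootstrap-admin  # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: kfabrik-bootstrap
    namespace: kserve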

Network Exposure

By default, KFabrik does not expose services outside the cluster. All access occurs through:

  • kubectl port-forward for CLI queries
  • minikube service for Grafana/Prometheus dashboards

Dependencies

External Dependencies

| Component | Source | Purpose |
|---|---|---|
| Cert-Manager | Jetstack Helm repo | TLS certificate management |
| Istio | Istio Helm repo | Service mesh, ingress |
| KServe | KServe OCI registry | Model serving platform |
| NVIDIA Device Plugin | NVIDIA | GPU resource scheduling |
| Prometheus | Docker Hub | Metrics collection |
| Grafana | Docker Hub | Visualization |
| DCGM Exporter | NVIDIA | GPU metrics |

Runtime Dependencies

| Dependency | Required By | Purpose |
|---|---|---|
| minikube | kfabrik CLI | Cluster management |
| kubectl | kfabrik CLI | Port-forwarding, API access |
| Docker | minikube | Container runtime |
| NVIDIA Driver | Host | GPU access |
| NVIDIA Container Toolkit | minikube | GPU passthrough |

Resource Requirements

Default Resource Requirements

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| KServe Controller | 100m | 500m | 256Mi | 512Mi |
| Prometheus | 250m | 500m | 512Mi | 1Gi |
| Grafana | 100m | 200m | 256Mi | 512Mi |
| DCGM Exporter | 100m | 500m | 512Mi | 1Gi |
| Model Predictor (small) | 1 | 2 | 2Gi | 4Gi |
| Model Predictor (medium) | 2 | 4 | 6Gi | 10Gi |

Port Assignments

| Service | Port | Protocol | Purpose |
|---|---|---|---|
| Prometheus | 9090 | HTTP | Metrics API, Web UI |
| Grafana | 3000 | HTTP | Dashboard UI |
| DCGM Exporter | 9400 | HTTP | GPU metrics endpoint |
| Model Predictor | 80 | HTTP | Inference API |
| Model Predictor Metrics | 9090 | HTTP | Prometheus metrics |