Building a Grafana monitoring stack for my homelab
How I set up a full observability stack across 30+ self-hosted services using Prometheus, Grafana, and Node Exporter — and what I learned about dashboards, alerting, and not drowning in metrics.
Running a homelab with 30+ services is genuinely fun right up until something quietly breaks at 2am and you have no idea when it started or why. That was me six months ago — I had Proxmox, TrueNAS SCALE, a handful of Docker containers, and absolutely zero visibility into any of it. I could tell you if a service was currently down, but not when it went down, how long it had been slow, or whether my NAS was quietly running out of disk space for the third time that month.
The fix was obvious: proper monitoring. I knew about Prometheus and Grafana but had always put it off because it seemed like a lot of infrastructure for a personal homelab. Turns out it’s not that bad, and now that it’s running, it’s genuinely one of the most satisfying parts of my setup.
The stack
The core is straightforward: Prometheus scrapes metrics from exporters, Grafana visualizes them, and Alertmanager routes notifications when something is actually wrong. For host-level metrics I use Node Exporter on every machine. For Docker containers, cAdvisor gives me per-container CPU, memory, and network. TrueNAS exposes metrics natively via its Prometheus endpoint, and Proxmox has a dedicated exporter that gives you VM and node-level stats.
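Wired together, the scrape side looks roughly like this. A sketch of my `prometheus.yml`, where the hostnames, ports, and job names are placeholders for my network, not something you should copy verbatim (the `pve` job uses prometheus-pve-exporter's usual `/pve` endpoint pattern):

```yaml
# prometheus.yml (sketch -- hostnames and ports are placeholders)
global:
  scrape_interval: 30s        # homelab scale; the 15s default is finer than I need

scrape_configs:
  - job_name: node            # Node Exporter on every machine
    static_configs:
      - targets:
          - proxmox1.lan:9100
          - truenas.lan:9100

  - job_name: cadvisor        # per-container CPU/memory/network from the Docker host
    static_configs:
      - targets: ['docker1.lan:8080']

  - job_name: pve             # prometheus-pve-exporter for Proxmox VM and node stats
    metrics_path: /pve
    params:
      target: ['proxmox1.lan']
    static_configs:
      - targets: ['monitor.lan:9221']
```

The TrueNAS endpoint gets its own job in the same style; I've left it out since the path depends on your SCALE version.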
Everything runs in Docker on a small dedicated Proxmox VM (2 cores, 4 GB RAM) that does nothing but monitoring. Keeping it isolated means the monitoring stack stays up even if I'm doing something reckless on the main hosts.
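The whole stack on that VM is one compose file. A minimal sketch, assuming local `./prometheus` and `./alertmanager` config directories; retention and ports are my choices, not defaults you're stuck with:

```yaml
# docker-compose.yml on the monitoring VM (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus      # prometheus.yml + rule files
      - prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=90d # three months of history is plenty for me
    ports: ["9090:9090"]
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana     # dashboards and datasources survive restarts
    ports: ["3000:3000"]
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports: ["9093:9093"]
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:
```

Pinning image tags instead of `latest` is the more careful move; I accept the tradeoff because this VM is trivial to rebuild.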
What I actually dashboard
The temptation is to add everything and end up with a wall of numbers that nobody reads. I’ve tried to be disciplined: each dashboard has a clear purpose, and if I find myself never looking at a panel, I remove it. The dashboards I actually use day-to-day are a host overview (CPU, memory, disk I/O per machine), a network overview (bandwidth per interface, DNS query latency from Pi-hole), and a storage dashboard for TrueNAS that tracks pool usage, SMART stats, and replication job status.
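Most of the host-overview panels boil down to a handful of standard Node Exporter queries. I keep the ones I reuse as recording rules so every panel references the same definition; the metric names below are stock Node Exporter, but the rule names are just my convention:

```yaml
# rules/host-overview.yml -- the PromQL behind the host dashboard (illustrative)
groups:
  - name: host_overview
    rules:
      # Fraction of CPU busy per machine, averaged over all cores
      - record: instance:cpu_busy:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Fraction of memory in use (MemAvailable accounts for reclaimable caches)
      - record: instance:memory_used:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Disk read throughput per machine, bytes/sec
      - record: instance:disk_read_bytes:rate5m
        expr: sum by (instance) (rate(node_disk_read_bytes_total[5m]))
```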
Alerting is where monitoring earns its keep. I have alerts for disk usage over 80%, any VM in a stopped state that should be running, and certificate expiry within 30 days. These go to a Discord webhook, which I already have open all the time. Simple and it works.
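The rules behind two of those alerts, as a sketch. The disk threshold matches the 80% above; the certificate check assumes something like blackbox_exporter is probing my HTTPS endpoints, since Prometheus doesn't see cert expiry on its own:

```yaml
# rules/alerts.yml (sketch -- thresholds from the post, exporter setup assumed)
groups:
  - name: homelab
    rules:
      - alert: DiskUsageHigh
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.80
        for: 15m                      # ignore short-lived spikes
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is over 80% full"

      - alert: CertExpiringSoon
        # probe_ssl_earliest_cert_expiry comes from blackbox_exporter HTTPS probes
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires within 30 days"
```

On the delivery side, Alertmanager 0.25+ has native `discord_configs` with a `webhook_url` field, so the Discord hookup is a one-receiver `alertmanager.yml` rather than a custom relay.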
The part nobody tells you
The hardest part isn’t the technical setup — it’s the label cardinality problem. When you first set up Prometheus, it’s tempting to add tons of labels to your metrics, but every distinct label combination is a separate time series, so this quickly inflates storage and drags down query performance. I learned this the hard way with cAdvisor: the default configuration exports per-container CPU metrics with the full image name as a label, and I had so many containers that queries slowed to a crawl. The fix was to use metric_relabel_configs to drop high-cardinality labels I didn’t care about and limit which containers got full metrics coverage.
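Concretely, the fix lives on the cAdvisor job in `prometheus.yml`. The label and metric names here are standard cAdvisor output; which ones to keep or drop is just my judgment call for this homelab:

```yaml
# cAdvisor scrape job with the cardinality fixes (sketch)
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['docker1.lan:8080']
    metric_relabel_configs:
      # Keep only named containers; system/pause cgroups have an empty name label
      - source_labels: [name]
        regex: .+
        action: keep
      # Drop the full image name/tag label -- otherwise one series per image tag
      - action: labeldrop
        regex: image
      # Drop whole metric families I never query (per-cgroup filesystem stats)
      - source_labels: [__name__]
        regex: container_fs_.*
        action: drop
```

metric_relabel_configs runs after the scrape but before ingestion, so the dropped series never hit the TSDB at all, which is exactly what you want for cardinality control.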
If you’re running a homelab and you’re not monitoring it, I’d genuinely recommend starting with just Node Exporter and a single Grafana dashboard. The feedback loop of being able to see what your machines are doing — in real time, historically, with context — changes how you think about your infrastructure.