Monitoring with InfluxDB and Grafana on Kubernetes
Four years later, I still have not gotten the hang of Telegraf: I'm still running my own home-made, detailed system and process monitoring, reporting to an InfluxDB instance running container-less on lexicon, and I feel the time has come to move these services into the Kubernetes cluster. Besides keeping them updated, what I'm most looking forward to is leveraging the cluster's infrastructure to expose these services (only) over HTTPS, with automatically renewed SSL certificates.
Current Setup
Continuous Monitoring describes the current, complete setup with the OSS versions of InfluxDB and Grafana.
Kubernetes Deployment
There are quite a few articles out there explaining how to
run all three components (Telegraf, InfluxDB, Grafana) on
Docker; the following monitoring.yaml deployment is
(very loosely) based on some of them:
Kubernetes deployment: monitoring.yaml
(The full 264-line manifest is not reproduced here.)
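Its InfluxDB portion looks roughly like the following sketch (resource names match the kubectl output below; the image tag, storage size, storage class and labels are assumptions; the Telegraf, Grafana and Service sections are omitted, and the Service is sketched later under Pod Hostname DNS):

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: influxdb-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: manual
  hostPath:
    path: /home/k8s/influxdb
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: influxdb-pv-claim
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: manual
  volumeName: influxdb-pv    # bind explicitly to the PV above (see Persistent Volumes below)
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: influxdb
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      securityContext:
        runAsUser: 114    # the dedicated influxdb user that owns /home/k8s/influxdb
        fsGroup: 114
      containers:
        - name: influxdb
          image: influxdb:1.8
          ports:
            - containerPort: 8086
          volumeMounts:
            - name: influxdb-storage
              mountPath: /var/lib/influxdb
      volumes:
        - name: influxdb-storage
          persistentVolumeClaim:
            claimName: influxdb-pv-claim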
This setup reuses the existing dedicated users
influxdb (114) and grafana (115) and requires
new directories owned by these users:
$ ls -ld /home/k8s/influxdb/ /home/k8s/grafana/
drwxr-xr-x 1 grafana grafana 0 Mar 28 23:25 /home/k8s/grafana/
drwxr-xr-x 1 influxdb influxdb 0 Mar 28 21:54 /home/k8s/influxdb/
$ kubectl apply -f monitoring.yaml
namespace/monitoring created
persistentvolume/influxdb-pv created
persistentvolumeclaim/influxdb-pv-claim created
deployment.apps/influxdb created
service/influxdb-svc created
configmap/telegraf created
daemonset.apps/telegraf created
persistentvolume/grafana-pv created
persistentvolumeclaim/grafana-pv-claim created
deployment.apps/grafana created
service/grafana-svc created
$ kubectl -n monitoring get all
NAME READY STATUS RESTARTS AGE
pod/grafana-6c49f96c47-hx7kd 1/1 Running 0 73s
pod/influxdb-6c86444bb7-kt4sx 1/1 Running 0 73s
pod/telegraf-pwtkh 1/1 Running 0 73s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/grafana NodePort 10.111.217.211 <none> 3000:30300/TCP 73s
service/influxdb NodePort 10.109.61.156 <none> 8086:30086/TCP 73s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/telegraf 1 1 1 1 1 <none> 73s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/grafana 1/1 1 1 73s
deployment.apps/influxdb 1/1 1 1 73s
NAME DESIRED CURRENT READY AGE
replicaset.apps/grafana-6c49f96c47 1 1 1 73s
replicaset.apps/influxdb-6c86444bb7 1 1 1 73s
Grafana is able to query InfluxDB at http://influxdb:8086/, and the next steps will be to enable access to both Grafana and InfluxDB over HTTPS externally.
Grafana Setup
Once InfluxDB is ready and Telegraf is feeding data into it, set up Grafana by creating a Data source:
- Type: InfluxDB
- Name: telegraf
- Query language: InfluxQL
- URL: http://influxdb-svc:18086 (see Pod Hostname DNS)
- Database: telegraf
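As an aside, the same data source could be provisioned from a file instead of through the UI; a minimal sketch, assuming such a file were mounted under /etc/grafana/provisioning/datasources/ in the grafana container (not something the current deployment does):

apiVersion: 1
datasources:
  - name: telegraf
    type: influxdb
    access: proxy
    url: http://influxdb-svc:18086
    database: telegraf    # newer Grafana versions prefer jsonData.dbName instead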
Secure InfluxDB
The next step is to add authentication to InfluxDB. This will require updating Telegraf and Grafana.
InfluxDB Authentication
Authentication and authorization in InfluxDB start with enabling authentication in the server configuration.
Create at least one admin user:
$ influx -host localhost -port 30086
Connected to http://localhost:30086 version 1.8.10
InfluxDB shell version: 1.6.7~rc0
> USE telegraf
Using database telegraf
> CREATE USER admin WITH PASSWORD '**********' WITH ALL PRIVILEGES
Warning
The password must be enclosed in single quotes (').
Enable authentication in the deployment
by setting the INFLUXDB_HTTP_AUTH_ENABLED variable:
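The original excerpt is not reproduced here; in the influxdb container spec of monitoring.yaml the change looks roughly like this:

          env:
            - name: INFLUXDB_HTTP_AUTH_ENABLED    # maps to [http] auth-enabled in the official influxdb:1.x image
              value: "true"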
Restart InfluxDB:
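The exact command is not shown here; re-applying the updated monitoring.yaml rolls the pod, as does an explicit rollout restart:

$ kubectl -n monitoring rollout restart deployment/influxdb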
The result is not that connections are rejected, but that access to the database is denied:
$ kubectl -n monitoring logs telegraf-2k8hb | tail -1
2024-04-20T18:24:16Z E! [outputs.influxdb] E! [outputs.influxdb] Failed to write metric (will be dropped: 401 Unauthorized): unable to parse authentication credentials
$ influx -host localhost -port 30086
Connected to http://localhost:30086 version 1.8.10
InfluxDB shell version: 1.6.7~rc0
> USE telegraf
ERR: unable to parse authentication credentials
DB does not exist!
Update Grafana
Updating the InfluxDB connection under Data sources
in Grafana, by adding the username (admin) and
password, is enough to get the connection restored.
Update Telegraf
To restore Telegraf's access, add the credentials
to the ConfigMap as follows:
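The original excerpt is not reproduced here; the relevant fragment of the telegraf ConfigMap would look roughly like this (the URL matches the service name and port used later in this setup; the password is masked as in the influx session above):

  telegraf.conf: |
    # (agent and inputs sections unchanged)
    [[outputs.influxdb]]
      urls = ["http://influxdb-svc:18086"]
      database = "telegraf"
      username = "admin"
      password = "**********"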
And restart telegraf:
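The exact command is not shown here either; a rollout restart of the DaemonSet is one way:

$ kubectl -n monitoring rollout restart daemonset/telegraf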
HTTPS Access
To expose both services over HTTPS, add an Ingress pointing to each service:
$ kubectl apply -f monitoring.yaml
...
ingress.networking.k8s.io/grafana-ingress created
ingress.networking.k8s.io/influxdb-ingress created
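Each Ingress looks roughly like the following sketch (the Grafana one is shown; the ingress class, the cert-manager issuer name and the TLS secret name are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt    # assumed issuer name
spec:
  ingressClassName: nginx                          # assumed ingress class
  tls:
    - hosts:
        - gra.ssl.uu.am
      secretName: grafana-tls    # must differ per Ingress (see Ingress Multiple SSL Certificates)
  rules:
    - host: gra.ssl.uu.am
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana-svc
                port:
                  number: 13000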
Each Ingress will need to obtain its own certificate, which requires patching each ACME solver to listen on port 32080 (as set up in the router), leveraging the script from Monthly renewal of certificates (automated):
This script runs every 30 minutes via crontab
but can also be run manually to speed things up.
If the external port 80 seems to be timing out, it may be necessary to remove the port forwarding rule in the router and add it again.
Once DNS records are updated, Grafana should be available at https://gra.ssl.uu.am and InfluxDB should be available at https://inf.ssl.uu.am
Troubleshooting
Persistent Volumes
Because there are multiple PersistentVolumes and
PersistentVolumeClaims, it is necessary to link them
explicitly by adding volumeName to each
PersistentVolumeClaim; otherwise volatile volumes
are used, which are then discarded each time the
deployment is deleted.
Ingress Multiple SSL Certificates
When multiple Ingresses are created in the same
namespace, it is also necessary to give each one a
different tls.secretName for its tls.hosts
value; otherwise only one SSL certificate will be
created (and signed), and it won't be valid for all
the subdomains in tls.hosts.
Pod Hostname DNS
For this setup to work, telegraf and grafana need
to send HTTP requests to influxdb. At first, it was
enough to set the hostname value in the influxdb
deployment, so that other services could connect to it
via the internal port and DNS: http://influxdb:8086
However, after the deployment was deleted and applied
a few times, telegraf and grafana were no longer
able to connect; the internal DNS would no longer
return an IP address for the influxdb hostname:
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 kube-dns.kube-system.svc.cluster.local
PING kube-dns.kube-system.svc.cluster.local (10.96.0.10) 56(84) bytes of data.
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 influxdb.monitoring.svc.cluster.local
ping: influxdb.monitoring.svc.cluster.local: Name or service not known
command terminated with exit code 2
Telegraf cannot write to InfluxDB:
$ kubectl -n monitoring logs telegraf-5rv6z | tail -2
2024-04-20T16:05:50Z E! [outputs.influxdb] When writing to [http://influxdb:8086/]: failed doing req: Post "http://influxdb:8086/write?db=telegraf": dial tcp: lookup influxdb on 10.96.0.10:53: no such host
2024-04-20T16:05:50Z E! [agent] Error writing to outputs.influxdb: could not write any address
Grafana cannot query InfluxDB:
Get "http://influxdb:8086/query?db=telegraf&epoch=ms&q=SELECT++FROM+%22%22+WHERE+time+%3E%3D+1713606668691ms+and+time+%3C%3D+1713628268691ms": dial tcp: lookup influxdb on 10.96.0.10:53: no such host
After getting tired of not finding relevant information
to troubleshoot this, I updated the deployment and
Grafana to query InfluxDB via its NodePort at
10.0.0.6:30086.
After updating the deployment, restart telegraf as before.
I still went on to check whether DNS queries were being received and processed at all, which first meant enabling query logging in CoreDNS.
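Query logging is not enabled by default; a sketch of the relevant change, adding the log plugin to the Corefile in the coredns ConfigMap (the rest of the kubeadm default stanza is abbreviated):

  Corefile: |
    .:53 {
        log    # log every query CoreDNS handles
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        # (remaining default plugins unchanged)
    }

$ kubectl -n kube-system edit configmap coredns
$ kubectl -n kube-system rollout restart deployment/coredns

With query logging enabled, trying again: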
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 kube-dns.kube-system.svc.cluster.local
PING kube-dns.kube-system.svc.cluster.local (10.96.0.10) 56(84) bytes of data.
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 kube-dns.kube-system
PING kube-dns.kube-system.svc.cluster.local (10.96.0.10) 56(84) bytes of data.
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 influxdb.monitoring
ping: influxdb.monitoring: Name or service not known
command terminated with exit code 2
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 influxdb.monitoring.svc.cluster.local
ping: influxdb.monitoring.svc.cluster.local: Name or service not known
command terminated with exit code 2
$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
[INFO] 10.244.0.216:34419 - 28080 "AAAA IN kube-dns.kube-system.monitoring.svc.cluster.local. udp 67 false 512" NXDOMAIN qr,aa,rd 160 0.00013723s
[INFO] 10.244.0.216:34419 - 8637 "A IN kube-dns.kube-system.monitoring.svc.cluster.local. udp 67 false 512" NXDOMAIN qr,aa,rd 160 0.000129401s
[INFO] 10.244.0.216:43649 - 36293 "A IN kube-dns.kube-system.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000073868s
[INFO] 10.244.0.216:43649 - 62915 "AAAA IN kube-dns.kube-system.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 149 0.000107266s
[INFO] 10.244.0.216:42170 - 53476 "A IN influxdb.monitoring.svc.cluster.local. udp 55 false 512" NXDOMAIN qr,aa,rd 148 0.000116052s
[INFO] 10.244.0.216:42170 - 27874 "AAAA IN influxdb.monitoring.svc.cluster.local. udp 55 false 512" NXDOMAIN qr,aa,rd 148 0.000141714s
[INFO] 10.244.0.216:51203 - 31196 "A IN influxdb.monitoring.cluster.local. udp 51 false 512" NXDOMAIN qr,aa,rd 144 0.000056868s
[INFO] 10.244.0.216:51203 - 64478 "AAAA IN influxdb.monitoring.cluster.local. udp 51 false 512" NXDOMAIN qr,aa,rd 144 0.000128173s
[INFO] 10.244.0.216:51687 - 28937 "A IN influxdb.monitoring.v.cablecom.net. udp 52 false 512" NXDOMAIN qr,rd,ra 139 0.020606519s
[INFO] 10.244.0.216:51687 - 65290 "AAAA IN influxdb.monitoring.v.cablecom.net. udp 52 false 512" NXDOMAIN qr,rd,ra 139 0.021962102s
NXDOMAIN means that the domain is non-existent: it is the DNS error returned to the client by the (recursive) DNS server when a requested name cannot be resolved to a valid IP address. All in all, these NXDOMAIN messages simply mean that the name does not exist.
This only happens with the pod's hostname, but we can
resolve its Service:
$ kubectl -n monitoring exec -i -t telegraf-zzs52 -- ping -c 1 grafana-svc.monitoring
PING grafana-svc.monitoring.svc.cluster.local (10.109.127.41) 56(84) bytes of data.
All this is because
there's no A record for a Pod born of a Deployment:
the DNS service resolves Services, not Pods, and the
port exposed is that of the Service (port) rather than
that of the Pod (targetPort), so the correct URL to
reach InfluxDB is http://influxdb-svc:18086
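For reference, here is roughly what the influxdb-svc Service ends up looking like (the selector label is an assumption; the port numbers match the kubectl output below):

apiVersion: v1
kind: Service
metadata:
  name: influxdb-svc
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: influxdb
  ports:
    - port: 18086        # reachable inside the cluster as http://influxdb-svc:18086
      targetPort: 8086   # the port the InfluxDB container listens on
      nodePort: 30086    # reachable on any node from outside the cluster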
The change of service name and port happened while
creating the Ingress for HTTPS access:
$ kubectl -n monitoring get all
NAME READY STATUS RESTARTS AGE
pod/grafana-7647f97d64-k8h5m 1/1 Running 0 19h
pod/influxdb-84dd8bc664-5m9fx 1/1 Running 0 17h
pod/telegraf-c59gg 1/1 Running 0 17h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/grafana-svc NodePort 10.109.127.41 <none> 13000:30300/TCP 19h
service/influxdb-svc NodePort 10.109.191.140 <none> 18086:30086/TCP 19h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/telegraf 1 1 1 1 1 <none> 17h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/grafana 1/1 1 1 19h
deployment.apps/influxdb 1/1 1 1 19h
NAME DESIRED CURRENT READY AGE
replicaset.apps/grafana-7647f97d64 1 1 1 19h
replicaset.apps/influxdb-6c86444bb7 0 0 0 19h
replicaset.apps/influxdb-84dd8bc664 1 1 1 17h
replicaset.apps/influxdb-87c66ff6 0 0 0 17h
Further reading: Connecting the Dots: Understanding How Pods Talk in Kubernetes Networks.
Conmon Migration
Continuous Monitoring can now be migrated to report metrics to a (new) database in the new InfluxDB and serve dashboards securely over HTTPS from the new Grafana.
Create a monitoring database (separate from the
telegraf database) and set its retention period to
30 days:
$ influx -host localhost -port 30086
Connected to http://localhost:30086 version 1.8.10
InfluxDB shell version: 1.6.7~rc0
> auth
username: admin
password:
> CREATE DATABASE monitoring
> USE monitoring
Using database monitoring
> CREATE RETENTION POLICY "30_days" ON "monitoring" DURATION 30d REPLICATION 1
> ALTER RETENTION POLICY "30_days" on "monitoring" DURATION 30d REPLICATION 1 DEFAULT
In Grafana, create a new InfluxDB connection under Data sources pointing to this database.
In the conmon scripts, update the curl command
to use HTTP Basic authentication, and update
the TARGET to point at the new InfluxDB service;
the value can be any of the following:
- http://localhost:30086 only on the server itself
- http://lexicon:30086 when running in the same LAN
- https://inf.ssl.uu.am when running out of LAN
To accommodate for a transition period, scripts can report to both InfluxDB instances until the migration is over.
First, store the InfluxDB credentials in
/etc/conmon/influxdb-auth and make the
file readable only to the root user:
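The exact commands and file format are not reproduced here; one plausible sketch stores the credentials in curl's config-file syntax, so the scripts can pass the file with -K instead of hard-coding the password (the measurement written at the end is just a hypothetical test point):

# install -m 600 -o root -g root /dev/null /etc/conmon/influxdb-auth
# cat > /etc/conmon/influxdb-auth <<'EOF'
user = "admin:**********"
EOF
# curl -s -K /etc/conmon/influxdb-auth \
    -XPOST "http://lexicon:30086/write?db=monitoring" \
    --data-binary "conmon_test,host=$(hostname) value=1"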
The conmon scripts will need to be updated later to
optionally send authentication (user and password)
only when necessary, and store the password somewhere
safe (definitely outside of the running script).
For each host reporting metrics, migrating the dashboard(s) from the old Grafana to the new one should be as easy as:
- In (each one of) the old dashboards, go to Dashboard settings > JSON Model and copy the JSON model into a local file.
- In (any of) the dashboards already in the new Grafana, go to Dashboard settings > JSON Model and copy the uid of (any) datasource object.
- In the local file, replace the uid of the influxdb datasource objects (one per panel) with the value copied in the previous step (a one-liner for this is sketched after the list).
- In the new Grafana, go to Home > Dashboards, then New > Import, and upload the local file.
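For the replacement step, a simple search-and-replace over the local file does the job (OLD_UID and NEW_UID are placeholders for the two uid values just copied, and dashboard.json is a hypothetical name for the local file):

$ sed -i 's/OLD_UID/NEW_UID/g' dashboard.json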
Finally, once all the dashboards are working, one can go to Administration > General > Default preferences and set a specific dashboard as Home Dashboard.
Clean Up
After some time of dual-reporting with no regressions observed, reporting to the old InfluxDB was removed, and a few days later the services could be disabled:
# systemctl stop grafana-server.service
# systemctl stop influxdb.service
# systemctl disable grafana-server.service
Synchronizing state of grafana-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable grafana-server
Removed /etc/systemd/system/multi-user.target.wants/grafana-server.service.
# systemctl disable influxdb.service
Synchronizing state of influxdb.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable influxdb
Removed /etc/systemd/system/multi-user.target.wants/influxdb.service.
Removed /etc/systemd/system/influxd.service.
This was rather necessary to keep the server cooler and quieter, if not completely cool and quiet.
