Detailed system and process monitoring
Never got the hang of telegraf; it was all too easy to cook up my own monitoring...
Humble Beginnings
In fact, when I started building detailed process monitoring I knew nothing about telegraf, influxdb, grafana or even Raspberry Pi computers.
It was back in 2017, when pondering whether to build my next PC around an Intel Core i7-6950X or an AMD Ryzen 5 1600X, that I started looking into measuring CPU usage of a specific process. I wanted to better see and understand whether more (but slower) CPU cores would be a better investment than faster (but fewer) CPU cores.
At the time my PC had an
AMD Phenom II X4 965 BE C3
with 4 cores at 3.4GHz, and I had no idea how often those CPU
cores were all used to their full extent. To learn more about
the possibilities (and limitations) of fully
multi-threading CPU-bound applications, I started running
top commands in a loop and dumping lines in .csv files to
then plot charts in Google Sheets. This was very crude, but
it did show the difference between rendering a video in
Blender (not multi-threaded) and using the
pulverize tool to
fully multi-thread the same task:
This early ad-hoc effort resulted in a few scripts to measure per-process CPU usage, overall CPU with thermals, and even GPU usage.
mt-top
This script measures only CPU usage for a single process:
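The script itself isn't reproduced here; as a minimal sketch of the idea, assuming the process name is passed as the first argument and samples are appended to a TSV file:

#!/bin/bash
# Sketch of mt-top: sample the CPU usage of a single process once per second.
# Assumes the process name is passed as the first argument, e.g. "mt-top blender".
proc="$1"
out="${proc}.tsv"
while true; do
  # top in batch mode, one iteration; column 9 is %CPU for the matching process
  top -b -n 1 | grep -w "$proc" \
    | awk -v now="$(date +%H:%M:%S)" '{print now "\t" $9}' >> "$out"
  sleep 1
done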
mt-top-temp
This measures the overall CPU usage along with thermals:
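Again only a sketch rather than the original script, assuming the CPU temperature can be read from the first thermal zone under /sys:

#!/bin/bash
# Sketch of mt-top-temp: record overall CPU usage and CPU temperature once per second.
# Assumes /sys/class/thermal/thermal_zone0/temp reports millidegrees Celsius.
out="cpu-temp.tsv"
while true; do
  # Overall CPU usage = 100 - idle, taken from the "%Cpu(s)" summary line of top
  cpu=$(top -b -n 1 | awk '/^%Cpu/ {print 100 - $8}')
  # Temperature in degrees Celsius
  temp=$(awk '{print $1 / 1000}' /sys/class/thermal/thermal_zone0/temp)
  echo -e "$(date +%H:%M:%S)\t$cpu\t$temp" >> "$out"
  sleep 1
done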
mt-top-gpu
This script measures (overall) GPU usage:
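The GPU part depends entirely on the vendor tooling, so this is only one possible sketch, assuming an NVIDIA card with nvidia-smi installed (an AMD or Intel GPU would need something like radeontop or intel_gpu_top instead):

#!/bin/bash
# Sketch of mt-top-gpu: record overall GPU utilization once per second (NVIDIA only).
out="gpu.tsv"
while true; do
  gpu=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
  echo -e "$(date +%H:%M:%S)\t$gpu" >> "$out"
  sleep 1
done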
Enter InfluxDB & Grafana
A few days ago someone shared a screenshot of their good-looking weather monitoring and mentioned two things I had never seen before: influxdb and grafana. They also mentioned they were running their monitoring on a Raspberry Pi computer.
Now, Raspberry Pi computers I do know: I have a Raspberry Pi 3 Model B running CUPS to share the printer over WiFi; much cheaper than even the cheapest WiFi-enabled printers.
InfluxDB
Installing InfluxDB couldn't be easier, if you don't mind running a fairly old version:
# apt install influxdb influxdb-client -y
# dpkg -l influxdb | grep influxdb
ii influxdb 1.6.4-1+deb10u1 armhf Scalable datastore for metrics, events, and real-time analytics
For a more recent version, one can install InfluxDB OSS 1.7 or InfluxDB 2.7.
Once installed, one or more databases need to be created to
start collecting data.
Follow Get started with InfluxDB OSS
to create a database (e.g. monitoring) and set a
retention policy:
# influx
Connected to http://localhost:8086 version 1.6.7~rc0
InfluxDB shell version: 1.6.7~rc0
> CREATE DATABASE monitoring
> CREATE RETENTION POLICY "30_days" ON "monitoring" DURATION 30d REPLICATION 1
> ALTER RETENTION POLICY "30_days" on "monitoring" DURATION 30d REPLICATION 1 DEFAULT
As soon as the database is created, data can be inserted. There is no need to define columns; instead, just Write data with the InfluxDB API to feed simple data such as CPU load and temperature:
curl -i -XPOST \
--data-binary "cpu,host=$host value=$cpu" \
'http://localhost:8086/write?db=monitoring'
curl -i -XPOST \
--data-binary "temperature,host=$host value=$temp" \
'http://localhost:8086/write?db=monitoring'
The body of the POST or InfluxDB line protocol contains the time series data that you want to store. Data includes:
- Measurement (required): the thing to measure, e.g. cpu in this case to measure global CPU load.
- Tags: Strictly speaking, tags are optional but most series include tags to differentiate data sources and to make querying both easy and efficient. Both tag keys and tag values are strings.
- Fields (required): Field keys are required and are always strings, and, by default, field values are floats.
- Timestamp (optional): supplied at the end of the line in Unix time in nanoseconds since January 1, 1970 UTC. If you do not specify a timestamp, InfluxDB uses the server's local nanosecond timestamp in Unix epoch. Time in InfluxDB is in UTC format by default.
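Putting those pieces together, a single measurement with an explicit timestamp is just one line of text; the host tag and timestamp below are purely illustrative values:

cpu,host=raspberrypi value=12.5 1584712931000000000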
Minimal update to post to InfluxDB
The scripts above can now feed data to InfluxDB in addition to
producing TSV files, e.g. mt-top-temp
can be updated as follows:
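The change is essentially one POST per sample inside the existing loop. A sketch, reusing the loop from above and assuming InfluxDB runs on localhost with the monitoring database created earlier:

#!/bin/bash
# Sketch: same sampling as before, now also posted to InfluxDB on every iteration.
host=$(hostname)
out="cpu-temp.tsv"
while true; do
  cpu=$(top -b -n 1 | awk '/^%Cpu/ {print 100 - $8}')
  temp=$(awk '{print $1 / 1000}' /sys/class/thermal/thermal_zone0/temp)
  echo -e "$(date +%H:%M:%S)\t$cpu\t$temp" >> "$out"
  # Feed the same values to InfluxDB (timestamp is added server-side)
  curl -s -XPOST --data-binary "cpu,host=$host value=$cpu" \
    'http://localhost:8086/write?db=monitoring'
  curl -s -XPOST --data-binary "temperature,host=$host value=$temp" \
    'http://localhost:8086/write?db=monitoring'
  sleep 1
done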
Grafana
The next step is visualizing these time series in fancy charts, and that's where Grafana comes in.
Install Grafana OSS,
start the server with systemd
and reset the Admin password:
# echo "deb https://packages.grafana.com/oss/deb stable main" \
| tee /etc/apt/sources.list.d/grafana.list
# curl https://packages.grafana.com/gpg.key | sudo apt-key add -
# apt update
# apt install grafana
# systemctl daemon-reload
# systemctl start grafana-server
# grafana-cli admin reset-admin-password \
PLEASE_CHOOSE_A_SENSIBLE_PASSWORD
INFO[03-20|15:02:11] Connecting to DB logger=sqlstore dbtype=sqlite3
INFO[03-20|15:02:11] Starting DB migrations logger=migrator
Admin password changed successfully ✔
At this point Grafana is available on http://localhost:3000/ for the Admin user.
Add your InfluxDB data source to Grafana,
create a new Dashboard and Add > Visualization for each
measurement (cpu, temp, gpu, etc.).
Tweak /etc/grafana/grafana.ini as follows to enable
anonymous authentication
and allow anonymous users to view dashboards in the default org:
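The relevant keys ship commented out in the stock grafana.ini; enabling them looks roughly like this:

[auth.anonymous]
# enable anonymous access
enabled = true
# organization used for unauthenticated users
org_name = Main Org.
# role given to unauthenticated users
org_role = Viewer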
Continuous Monitoring
With InfluxDB and Grafana in place, collection and reporting of metrics can be done continuously, rather than having to run the scripts each time.
For a minimal start, create a script that reports total CPU
usage every second, e.g. as /usr/local/bin/conmon.
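A minimal sketch of such a script, posting only the total CPU usage and structured so more report functions can be added later (the report_cpu and influx_url names are just illustrative; it assumes the local InfluxDB and monitoring database from above):

#!/bin/bash
# Minimal sketch of /usr/local/bin/conmon: post total CPU usage to InfluxDB every second.
host=$(hostname)
influx_url='http://localhost:8086/write?db=monitoring'

report_cpu() {
  local cpu
  cpu=$(top -b -n 1 | awk '/^%Cpu/ {print 100 - $8}')
  curl -s -XPOST --data-binary "cpu,host=$host value=$cpu" "$influx_url"
}

while true; do
  report_cpu
  sleep 1
done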
Create a new service to run this upon reboot, e.g. as
/etc/systemd/system/conmon.service
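A sketch of what the unit file could contain (adjust the description and dependencies to taste):

[Unit]
Description=Continuous monitoring (conmon)
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/conmon
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target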
Then load and start the new service:
# systemctl daemon-reload
# systemctl enable conmon.service
Created symlink /etc/systemd/system/multi-user.target.wants/conmon.service → /etc/systemd/system/conmon.service.
# systemctl start conmon.service
From here on, new functions can be added to the conmon
script to collect additional metrics, which will then be
posted by the script periodically. Many other metrics can
be added later; here are some ideas:
- Total RAM usage
- CPU usage per process
- RAM usage per process
- Network I/O per interface
- Disk I/O per disk
- Disk I/O per process
- Disk used/free (per partition)
- GPU load, VRAM, temperature, fan speed, power draw
These and more may be added later; keep an eye on the full scripts on the Continuous Monitoring project page.
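As an illustration, a hypothetical report_ram function added to the conmon sketch above could post total RAM usage as a percentage using free, and then be called from the main loop alongside report_cpu (it reuses the $host and $influx_url variables defined there):

# Hypothetical addition to conmon: post total RAM usage to InfluxDB.
report_ram() {
  local ram
  # free prints "Mem: total used free ..."; report used/total as a percentage
  ram=$(free | awk '/^Mem:/ {printf "%.1f", $3 / $2 * 100}')
  curl -s -XPOST --data-binary "ram,host=$host value=$ram" "$influx_url"
}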

