1. Introduction

I’ve set up a monitoring system for my home lab. This has been a fun project, but as I worked with it, I realized that there wasn’t really a single resource available that would take someone through how to actually do it. So here I present to you: How to self-host a monitoring system based around the Grafana stack. If you follow the steps I describe here, you should end up with a working solution that may or may not fit your needs.

So what’s the point of monitoring - other than the obvious, that we all love Blinkenlichts?

Reliability and predictability. During my first years as a professional system administrator - that is: one who earns money doing it; not necessarily, as I soon realized, one who is great at it - the most common cause for systems being unavailable at inconvenient times was that some important part of the system had gone down at a more convenient time but nobody knew. Back in those times, the direct cause was mostly that the software we ran wasn’t very well written. Most services I encountered needed to be restarted now and again to avoid crashing at some point. But the second most common cause for downtime was that we failed to see that some disk somewhere was filling up, or that a program leaked memory and slowly outgrew the available RAM in the server hosting it. That kind of downtime is entirely preventable, provided you make it easy enough to see that it’s about to happen, and especially if you have a useful system to notify you not only when things have gone wrong - the users of your service are great for that - but already when there’s a heightened risk of things going wrong.

Of course once you have the basics in place, you can use the data you’re collecting to draw other conclusions about your environment: How many hits does this web server see, and from where? Are my hypervisors working too hard? Would that specific server benefit from being moved from spinning disks to solid state storage?

As Adam Savage said: The (…) difference between screwing around and science is writing it down. A good monitoring system writes things down for you, so you can make science-based choices.

So much for the “why”. The next question is how to do it. I want a single system that can do two main things: I obviously want to pick up metrics from my servers, but I also want automated log analysis in the same interface, as that brings an extra dimension to the monitoring. The current go-to tool for that is the Grafana stack. If you don’t want to spend the time managing it, you can pay Grafana Labs to do it for you. If you have a huge system and need a fully redundant self-managed solution, you can set it up in a Kubernetes cluster backed by clustered databases and cloud object storage. That is absolutely not the scope of this exercise in my home lab. My objective is to see what I can get away with in a single VM, while still ending up with something good enough for my needs.

I have previous experience consuming a Grafana solution at work, but until now, I’ve never really taken the time to understand how the components fit together: A complete Grafana stack consists of so much more than just the web interface, so that will also be part of this exercise.

Please note that I’m heavily relying on IPv6 in this solution as it greatly simplifies anything related to Internet accessible endpoints. If you’re still stuck with IPv4, first of all you should complain to your ISP. As for this guide, it should be relatively simple to convert it to use IPv4 only, but you will most likely have to find an alternative way of managing your certificates; for example via an Internet connected reverse proxy from which you copy the relevant certificates to the Grafana server.

2. Initial experimentation phase

A simple proof-of-concept will consist of the following parts:

  • A Mimir server that can collect time series data.
  • A Loki server to collect logs.
  • A Grafana instance that can be used to view said data.
  • A locally installed Alloy agent that will send time series data and logs to Mimir and Loki, respectively, for further perusal in Grafana.

The first step is to deploy a server - in my case a KVM guest, running Ubuntu 24.04. With that done we need to add the Grafana Labs repository as per their instructions and install the programs:

sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install mimir loki grafana alloy

Installing the software this way ensures we get updates to our monitoring software the same way we get our other system updates, which I strongly prefer over managing software manually like some sort of animal.

2.1. Mimir PoC installation and configuration

Out of the box, Mimir is not configured at all. Helpfully, there are some configuration examples in /etc/mimir/config.example.yaml. I’ve opened that up for inspiration, and pulled over some config options into /etc/mimir/config.yml:

---
multitenancy_enabled: false

server:
  http_listen_port: 9009

  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  ring:
    kvstore:
      store: memberlist
  pool:
    health_check_ingesters: true

ingester:
  ring:
    # We want to start immediately and flush on shutdown.
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    # Use an in memory ring store, so we don't need to launch a Consul.
    kvstore:
      store: inmemory
    replication_factor: 1

blocks_storage:
  tsdb:
    dir: /tmp/mimir/tsdb

  bucket_store:
    sync_dir: /tmp/mimir/tsdb-sync

  backend: filesystem

  filesystem:
    dir: ./data/tsdb

compactor:
  data_dir: /tmp/mimir/compactor

ruler_storage:
  backend: local
  local:
    directory: /tmp/mimir/rules

After saving the configuration file, we can enable the service and (re-)start it in the usual fashion:

sudo systemctl enable mimir && sudo systemctl restart mimir

Checking the service status, it should at least theoretically be working:

$ sudo systemctl status mimir
● mimir.service - Horizontally scalable, highly available, multi-tenant, long term Prometheus.
     Loaded: loaded (/usr/lib/systemd/system/mimir.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-12-31 11:05:46 UTC; 1 day 23h ago
       Docs: https://grafana.com/oss/mimir/
   Main PID: 963 (mimir)
      Tasks: 9 (limit: 4579)
     Memory: 197.9M (peak: 249.4M)
        CPU: 22min 17.891s
     CGroup: /system.slice/mimir.service
             └─963 /usr/local/bin/mimir --config.file=/etc/mimir/config.yml --runtime-config.file=/etc/mimir/runtime_config.yml --log.level info
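
If we want a sanity check beyond systemd, Mimir exposes a simple readiness endpoint on the HTTP port we configured. A quick, hedged probe, assuming the configuration above with Mimir listening locally on port 9009:

# Should eventually respond with HTTP 200 and the text "ready" once Mimir's internal modules are up:
curl -s "http://[::1]:9009/ready"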

2.2. Grafana PoC configuration

On Ubuntu, Grafana pretty much works out of the box, at least for a proof of concept. Let’s just ensure the service is enabled and started:

sudo systemctl enable grafana-server.service && sudo systemctl start grafana-server.service

If we open up a browser tab over plain HTTP to the Grafana server’s port 3000, we should see the Grafana login screen at this point.

After logging in using the default admin username and password combination, the first-use wizard demands that we set a new secret. I’m a strong proponent of password managers, so I recommend generating and storing any secrets that way.

Next we’ll want to make Grafana see Mimir’s data. That’s done through Connections -> Data sources in Grafana’s left-hand menu, and then selecting Add data source. A thing that’s not immediately obvious is that there is no choice of Mimir as a data source. But Mimir stores Prometheus data, so by choosing Prometheus as the source we can proceed through the rest of the flow as expected. For this proof of concept let’s just change the data source name to Mimir, and then update the connection URL to http://[::1]:9009/prometheus, where ::1 is the IPv6 equivalent of the localhost name or the IPv4 address 127.0.0.1 (note that IPv6 literals in URLs need to be wrapped in square brackets).

(If you wonder how I arrived at this conclusion, the endpoint path is documented on the Mimir documentation’s Get Started page, and the address/port is of course extrapolated from the Mimir configuration file we created earlier.)

Pressing the big blue Save & test button at the bottom of the page tells us whether Grafana was able to connect to Mimir. If it worked as expected, we just need a way to get metrics data into our Mimir instance. For that, we’ll use Alloy.

2.3. Alloy PoC configuration

The default Alloy configuration in /etc/alloy/config.alloy is more or less fine to start with, but I recommend making a couple of changes already in this early PoC stage: On Linux servers, I like to configure the unix exporter to enable the systemd data collector, which will provide us with information about services running on our server. We also need to tell the agent to pass the gathered metrics to Mimir. This is done by creating a prometheus.remote_write configuration block, and telling the existing prometheus.scrape configuration to forward its data there. Finally, metrics are useless if we can’t attribute them properly, so I recommend adding some labels to all metrics, so that once we begin sending data from additional servers, we’ll be able to easily differentiate their data points in Grafana. After some research in the official documentation I concluded this is done by adding an external_labels block inside the remote_write section. It also took me longer than I want to confess to find out how to dynamically use the local system variables to populate the node_name label’s value.

Here’s my initial config file:

logging {
  level = "warn"
}

prometheus.exporter.unix "default" {
  include_exporter_metrics = true
  disable_collectors       = ["mdadm"]
  enable_collectors        = ["systemd"]

}

prometheus.scrape "default" {
  targets = array.concat(
    prometheus.exporter.unix.default.targets,
    [{
      // Self-collect metrics
      job         = "alloy",
      __address__ = "127.0.0.1:12345",
    }],
  )

  forward_to = [
    prometheus.remote_write.default.receiver,
  ]
}

prometheus.remote_write "default" {
  external_labels = {
    node_name   = sys.env("HOSTNAME"),
    env         = "prod",
  }
  endpoint {
    url = "http://grafanasrv1:9009/api/v1/push"
  }
}

You may be wondering why I connect directly to localhost from Grafana to Mimir, but use the hostname when creating the corresponding connection from the Alloy client. There are two reasons: First of all, I’m lazy. My finished skeleton config will be re-used across all my servers, so I want to have to change as little as possible between machines. Second, this forces traffic from the Alloy agent to hit many of the same parts of the network stack that external requests to the server would, so it’s a slightly better test of my server configuration before we’ve reached the point where other servers start sending data to our Grafana server.

With the config file in place, let’s enable and start Alloy:

sudo systemctl enable alloy && sudo systemctl restart alloy

After a while, we should see data begin trickling into the Grafana Drilldown/Metrics view. We can also try to use the labels we added to search for metrics in Grafana’s Explore view.

At this stage, I have to confess that there’s already a pretty glaring problem with our configuration. I realized this because I’m jaded and decided to reboot my Grafana server to see what broke. With the current configuration, data apparently isn’t being stored permanently: Rebooting the Grafana server didn’t just leave a gap in the data, but left no data at all from before the reboot. This of course is unacceptable for a monitoring system.

My initial suspicion was that this was related to what can be seen in the original Mimir configuration; that several of the files were being stored in the /tmp directory structure. We will fix this by updating our configuration to instead use persistent storage for that data.

This is a good time to apply some experience to the server setup: I recommend adding a second disk to the server and formatting it with a file system of your choice. There are two good reasons to do this. First, I have no idea how much data will ultimately be ingested into Mimir. We don’t want this data to fill the system drive of our monitoring server, risking disruption to the system that’s supposed to help us avoid disruption. Second, related to the first, separating I/O operations for our monitoring from that of the system helps us mitigate a potential bottleneck by spreading our disk traffic across additional queues.

After adding the disk, I moved all contents from /var/lib/mimir/data into a temporary directory, edited my /etc/fstab to ensure the new disk would get mounted to Mimir’s data directory, repopulated it with its original contents, and finally created the directories I could see from the configuration file that I would be needing, this time under a tmp folder structure inside the Mimir data directory:

sudo mkdir -p /var/lib/mimir/data/tmp/{tsdb,tsdb-sync,compactor,rules}
sudo chown -R mimir:mimir /var/lib/mimir
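
For reference, the disk preparation steps described above might look roughly like the sketch below. The device name /dev/vdb and the ext4 file system are assumptions; adjust them to your own environment:

# Stop Mimir and stash its current data out of the way
sudo systemctl stop mimir
sudo mv /var/lib/mimir/data /var/lib/mimir/data.bak
sudo mkdir /var/lib/mimir/data

# Format the new disk and mount it on Mimir's data directory
sudo mkfs.ext4 /dev/vdb
echo '/dev/vdb /var/lib/mimir/data ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /var/lib/mimir/data

# Repopulate the data directory with its original contents
sudo cp -a /var/lib/mimir/data.bak/. /var/lib/mimir/data/
sudo rm -rf /var/lib/mimir/data.bak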

Then I updated /etc/mimir/config.yml accordingly:

---
multitenancy_enabled: false

server:
  http_listen_port: 9009

  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  ring:
    kvstore:
      store: memberlist
  pool:
    health_check_ingesters: true

ingester:
  ring:
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    kvstore:
      store: memberlist
    replication_factor: 1

blocks_storage:
  tsdb:
    dir: ./data/tmp/tsdb

  bucket_store:
    sync_dir: ./data/tmp/tsdb-sync

  backend: filesystem

  filesystem:
    dir: ./data/tsdb

compactor:
  data_dir: ./data/tmp/compactor

ruler_storage:
  backend: local
  local:
    directory: ./data/tmp/rules

After changing the directories and restarting Mimir, wait a bit for some more data to be ingested, then reboot and confirm that Grafana really is able to visualize the metrics from before the reboot.
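
One way to convince ourselves, besides eyeballing a Grafana graph, is to query Mimir’s Prometheus-compatible HTTP API directly after the reboot. A hedged example, using the up series that the Alloy scrape produces and an arbitrary six-hour window:

# Samples with timestamps from before the reboot should still be present in the response:
curl -s "http://[::1]:9009/prometheus/api/v1/query_range?query=up&start=$(date -d '-6 hours' +%s)&end=$(date +%s)&step=300"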

To sum it up: once we’re here, our Grafana server is able to reliably receive metrics over the network, store them persistently in a metrics database, and visualize them in Grafana. Next up: Logs.

2.4. Loki PoC configuration

The default Loki configuration in /etc/loki/config.yml looks more or less fine for a proof of concept, but seeing the /tmp directory references in the configuration should raise our suspicions after the previous chapter. Spoiler: We’ll need to switch to more permanent storage here too, before taking the server into production. But let’s confirm the most basic functionality first.

---
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

Enable and restart the Loki service as usual:

sudo systemctl enable loki && sudo systemctl restart loki

Loki needs to be added to Grafana, in a similar way to what I already did for Mimir:

Under Connections -> Data sources, we’ll add a new data source of the type Loki. Again we can extrapolate from the config file and simply point Grafana at http://[::1]:3100. Here too, the Save & test button will give us a good indication of whether Grafana is able to speak to Loki.
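
As with Mimir, Loki has a readiness endpoint on its HTTP port that we can poke at directly if we want a second opinion:

# Should answer "ready" once Loki has finished starting up:
curl -s "http://[::1]:3100/ready"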

And just like with Mimir earlier, we need to push logs to Loki. In our case it means adding some log scraping to our Alloy client configuration. The following config blocks can be added right below the existing configuration in /etc/alloy/config.alloy:

local.file_match "node_logs" {
  path_targets = [{
      __path__  = "/var/log/syslog",
      job       = "node/syslog",
      node_name = sys.env("HOSTNAME"),
      env       = "prod"
  }]
}

loki.source.file "node_logs" {
  targets    = local.file_match.node_logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://grafanasrv1:3100/loki/api/v1/push"
  }
}

What happens here is that we’re telling the Alloy agent to look at local files - specifically my system’s default syslog file - and then pass those logs on to the Loki instance running on our server. A difference from the Mimir configuration is that we’re adding labels already when scraping the file, instead of as part of sending data to the Loki endpoint. Both methods are valid, but as we’re already adding some labels in the scraping stage, I see no immediate reason to add an additional step.

After restarting the Alloy agent, we should be able to explore the Loki data in Grafana and see that it starts filling up with logs - mostly from Loki itself at this point. Changing the log_level parameter from debug to info in the Loki configuration file and then restarting the service will help lower the traffic to a more manageable level.
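
To verify from the command line that log streams really are being ingested, we can also ask Loki’s HTTP API which label names it has seen so far; with the Alloy configuration above I would expect to see at least job and node_name in the response:

# List the label names Loki currently knows about:
curl -s "http://[::1]:3100/loki/api/v1/labels"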

As expected, Loki, like Mimir, suffers from reboot amnesia with the default configuration. Again I recommend adding a third data disk for this system to use. The disk should be mounted to /var/lib/loki. Then stop the Loki service, move the contents of /tmp/loki into /var/lib/loki/ - double-check the file permissions! - and finally change the Loki configuration similarly to what we did for Mimir:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

With this change, older log data should remain visible in Grafana even after reboots.

So now our Grafana server is able to receive both metrics and logs, and can visualize these ad hoc. We have two more steps before I would consider this a successful proof of concept: Let’s create a somewhat useful dashboard, and let’s test our ability to trigger and send alerts.

2.5. Dashboards PoC

2.5.1. Selecting data to visualize

To create a dashboard we first need to think about some data that would be useful in a visualization. Grafana’s Explore view is nice for this purpose. Browsing the metrics available, I saw a couple of good candidates: node_filesystem_size_bytes and node_filesystem_avail_bytes. Switching from the builder to the code view, I entered the query (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 to see what percentage of free space I had on each mountpoint. At least on an Ubuntu server, this results in a slightly messy view. Looking at the returned series, we’ll notice that a number of mountpoints use the fstype tmpfs. As tmpfs basically is a RAM disk, we can ignore those mountpoints for our purpose. We can add a filter to our query like this: (node_filesystem_avail_bytes {fstype !="tmpfs"} / node_filesystem_size_bytes {fstype !="tmpfs"}) * 100. Now we see our root file system, we see our two mounts for Mimir and Loki, respectively, and we see our /boot/efi mount. This is useful data.

We can start a dashboard directly from the Explore view by clicking the Add button, selecting Add to dashboard, ensuring New dashboard is selected, and then clicking Open dashboard.

2.5.2. Visualization basics

We now have a nearly empty dashboard with just a single panel, somewhat underwhelmingly called “New panel”, showing the graphs we generated earlier. Let’s prettify it a bit:

Hovering over the panel shows three dots in its upper right corner. Click them and select Edit.

On the right-hand-side toolbar, change the Title to Free space.

Under the Legend heading, change Mode to Table, and in Values select Min and Last. This lets us clearly see the minimum value recorded during the currently selected timespan, and the last value during the same span.

A bit further down, in the Standard options section, change Unit to Misc -> Percent (0-100). Also update the Min and Max fields to 0 and 100 respectively, to give a better sense of actual free space.

Under Thresholds, let’s make some changes to make the graph more scannable: Change the Base color to red. Change the value that by default is 80 to 10, and select a yellow color for it. Finally press the + Add threshold button, which will generate a threshold of 20 - that value is fine, but change its color to green.

Under Show thresholds select As filled regions. That gives us some visual cues as to how good or bad our storage situation is.

Before we return to our dashboard view, let’s do something to clarify the legend under the graphs:

Under our query, there’s an Options menu. Open it, then change Legend from Auto to Custom. It will show label_name in double curly braces. Change that to mountpoint within the same curly braces, and then change Type from Both to Range.

Now click the Back to dashboard button near the top of the window.

2.5.3. Using variables and repetition in dashboards

What if we want to automate visualization of free disk space for all mount points on all servers? Variables to the rescue!

In the main Dashboard view, select Settings near the top right corner of the window, then click Variables. Press the big Add variable button.

Under General, change Name to Host. Then under Query options, change Query type to Label values. For Label select node_name. Change Sort to Alphabetical (case-insensitive, asc). Then in Selection options, check the boxes for Multi-value and Include All option, and uncheck Allow custom values. In the Preview of values we should see the only node_name currently reporting in to our Grafana instance, namely the Grafana server itself. Click the Back to list button.

Click the + New variable button to create a second variable for our dashboard. This one should be called Mountpoint. Again the Query type should be Label values. The Label is mountpoint, but this time we’re adding a Label filter for node_name = $Host. To avoid seeing data for disks we’re not interested in, we’ll add a second filter: Press the + button next to the Label filter we just created and add a second filter for fstype != tmpfs.

Again, set the Sort to Alphabetical (case-insensitive, asc) and under Selection options check Multi-value, uncheck Allow custom values and check Include All option.

The Preview of values shows us the mount points we saw in our graphs earlier.

Press Back to dashboard near the upper right corner.

We now see our previous panel, but we also have dropdown boxes for Host and Mountpoint.

To make our graph actually utilize them, we need to edit the panel again by pressing the three dots in the upper right corner.

Under Panel options -> Repeat options, look at Repeat by variable and select Mountpoint from the dropdown. Then edit the graph query by replacing our label filter. To make it more readable, now that we’re starting to stack filters, we can break it apart into multiple lines:

(
  node_filesystem_avail_bytes{
    node_name="$Host",
    mountpoint="$Mountpoint",
    fstype!="tmpfs"
  } /
  node_filesystem_size_bytes{
    node_name="$Host",
    mountpoint="$Mountpoint",
    fstype!="tmpfs"
  }
) * 100

Running the query from within the editor will switch to showing the graph for only one of the mount points. But if we return to the Dashboard view, we can select All for the Mountpoint value, and suddenly we’re greeted with a number of graphs, each showing the current free space for an individual mount point.

For this to look good once we add more hosts, we’ll need to make one more change to the dashboard: Find the Add button to the left of the Settings button, and click Row. If our existing panel set doesn’t immediately become part of this row, grab the first panel and pull it in under the row.

Then hover the mouse over the row name until a cogwheel icon appears, and click it. For the Title, enter $Host, and under Repeat for, select Host; then click Update. We’ll now be able to easily see the status of all mounted disks belonging to any server connected to our Grafana server.

This was a quick introduction to some of the possibilities we have when creating dashboards. Let’s continue to the final important part of our proof of concept.

2.6. Alerting PoC

To paraphrase the philosophical question: If the amount of free disk space becomes critically low and there’s no way to inform anybody, is it still low? Of course it is. So we’ll start out by making our Grafana instance able to send us alerts via email.

We’ll edit /etc/grafana/grafana.ini, and locate the [smtp] section. Here we can provide values for things like our mail server name, what originating email address we should use for alerts, whether and how to authenticate, and so on. When done, it should look similar to this:

[smtp]
enabled = true
host = mail.example.net:587
user = alerts@example.net
password = <REDACTED>
from_address = alerts@example.net
from_name = example.net Notifications
ehlo_identity = monitoring.example.net

After we’re happy with the configuration, restart Grafana to make the new settings take effect.

With the Grafana server understanding how to send email, we can go into the web UI, locate Alerting/Contact points and edit grafana-default-email to send email to an email account of our choosing. There’s a test button in there, just don’t forget to also save your changes once you’re happy with the result.

Now let’s create an alert trigger. We already know the query required to check the percentage of free disk space on a volume. In my case I can see that the root file system / on my Grafana server has about 47% of free disk space, so creating an alert for anything below 50% free disk space should give me a good test of the alert system.

Under Alerting / Alert rules, let’s create a new rule. We’ll call the alert “Low disk space”. For the query, I’m pasting the query I tested earlier: (node_filesystem_avail_bytes {fstype !="tmpfs"} / node_filesystem_size_bytes {fstype !="tmpfs"}) * 100. For the evaluation expression, let’s say that the alert condition happens when Input A is below 50.

The alert must be stored in a folder. This is a general rule we’ll want to apply to all of our servers, down the line, so let’s create the folder Homelab. We’ll also create an evaluation group Common, for which all evaluations will be synchronized.

For the notification contact point, we’ll choose the grafana-default-email connector that we modified with our email address earlier.

In the notification message, we can add some relevant information to make it more readable, by using Grafana’s built-in variable substitution, like this:

Mount {{ $labels.mountpoint }} on {{ $labels.node_name }} is below threshold 50% of free space.

Current value: {{ $values.A.Value }}

All of this information is available in table form in the email too, but by adding a custom message this way, we can immediately see the most relevant information at a glance.

After saving the alert, we should see that our newly generated alert is firing, and we should receive an email with the relevant information within a few minutes. Once we’ve seen that alerting works, we can update the threshold and the message to a more sensible value, like 20%, after which we should see the alert backing down to “normal” state in Grafana’s alerts interface, and a few minutes later we should receive a “RESOLVED” email, telling us that we have exited the alert state.

With this, I consider the proof of concept done: We now have a useful skeleton for a monitoring service. We can receive data, we can evaluate it, and we can react to it reaching trigger thresholds. But with the service configuration in its current state, I wouldn’t allow this system anywhere near a production network: Our solution is not hardened in any way, save for the protection provided by our network’s perimeter firewalls. Metrics, logs and even credentials are being sent across our network in plaintext for anybody to read. In the next chapter we’re doing something about that.

3. Turning experiment into practice

With our proof of concept considered successfully completed, what changes do we need to make to take such a solution into active duty?

First and foremost, we can’t go around spewing unencrypted metrics - and worse: credentials - over our network. We need to configure our solution for transport layer security, or TLS. This may turn out to be the longest chapter yet: even though the principles involved aren’t very hard to grasp, there’s a multi-step process to get it right. The good thing is that it will all be automated, so once we’ve set it up, we shouldn’t have to think about it again in the foreseeable future. Once TLS is available to us, we’ll need to update the configuration of our Alloy agent and of Grafana to make use of it. It will be a minimal set of changes - I promise - and we’ll make them one by one so we don’t mess anything up.

Second, we don’t want unauthorized and unwanted guests to be able to connect to our Grafana, Mimir or Loki instances, so we’ll add a layer of security to make it harder to abuse our server.

3.1. Reverse proxy

Grafana, Mimir and Loki are all able to present themselves over TLS (actually HTTPS). The only thing required is an appropriate set of TLS certificates. But as we’re presenting them all from the same machine, we can save some work by installing a reverse proxy service in front of them, and let that take care of TLS for us. As the traffic from the reverse proxy to the actual services only ever touches the insides of our single server, there’s not really any meaningful risk to this strategy. Doing it this way significantly simplifies the process of automatic renewal of certificates, and it provides us with a single ingress point to secure. A second benefit is that all of these HTTPS services can be made accessible over the default HTTPS port, since the reverse proxy “knows” where to send the traffic for each service.

TLS certificates used to carry a very real cost, but for some years now, bona fide CA-signed certificates can be had for free, courtesy of Let’s Encrypt. We’ll set our solution up to use their service.

My favorite reverse proxy (and load balancer) is HAProxy. We’ll attach the latest long term support (LTS) repo as per this handy wizard and install it:

sudo apt install --no-install-recommends software-properties-common
sudo add-apt-repository ppa:vbernat/haproxy-3.2
sudo apt install haproxy=3.2.\*

3.1.1. TLS certificate automation

For encryption in transit, as mentioned we’ll use TLS in the form of HTTPS. That requires us to have certificates in place. There’s a good writeup about that on the official HAProxy site, so I’ll basically follow that one, using the subdomains metrics and logs for data ingestion, and monitoring for the Grafana instance. It’s fully possible to create these subdomains only internally, if you already have a working public key infrastructure and private certificate authority in place, but as mentioned, we’re outsourcing our certificate management to a public CA. This will be combined with firewall and reverse proxy rules that improve security compared to just letting any traffic in.

With the DNS configuration in place, and the necessary firewall pinholes opened toward port 80 in my Grafana server, I can basically follow the instructions from the mentioned post to ensure I get the proper certificates.
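
Before moving on, it’s worth double-checking that the three names actually resolve to the Grafana server. A quick sketch, using the example host names from this guide (dig lives in the dnsutils package if it’s missing):

# All three should return the Grafana server's public IPv6 address:
dig +short AAAA monitoring.example.net
dig +short AAAA metrics.example.net
dig +short AAAA logs.example.net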

3.1.1.1. Acme setup

The first steps are to create a user in whose context acme.sh can run, and add it to the haproxy group so it can write certificates to a place where the reverse proxy can read them:

sudo adduser \
   --system \
   --disabled-password \
   --disabled-login \
   --home /var/lib/acme \
   --quiet \
   --force-badname \
   --group \
   acme
sudo adduser acme haproxy

Next, we need to install acme.sh. This shell script recommends having socat installed - a tool we will use later - so we’ll do that too:

sudo apt install socat
sudo mkdir /usr/local/share/acme.sh/
git clone https://github.com/acmesh-official/acme.sh.git
cd acme.sh/
sudo ./acme.sh \
   --install \
   --no-cron \
   --no-profile \
   --home /usr/local/share/acme.sh
sudo ln -s /usr/local/share/acme.sh/acme.sh /usr/local/bin/
sudo chmod 755 /usr/local/share/acme.sh/

The script is installed. Next, we need to create a Let’s Encrypt account, on the LE test environment, to start with:

sudo -u acme -s
acme.sh --register-account \
   --server letsencrypt_test \
   -m youremail@example.com

At this point it’s important to take note of the output: the value of ACCOUNT_THUMBPRINT needs to be retained for the HAProxy configuration.

3.1.1.2. HAProxy Acme configuration

Our HAProxy instance needs somewhere from which to read TLS certificates. The acme user is in the haproxy group, so we can create a directory both accounts are able to utilize:

sudo mkdir /etc/haproxy/certs
sudo chown haproxy:haproxy /etc/haproxy/certs
sudo chmod 770 /etc/haproxy/certs

After this, /etc/haproxy/haproxy.cfg needs to be updated so our reverse proxy can respond sensibly to Acme authentication requests. We’ll update the default Ubuntu configuration with the necessary configuration lines:

global
	log /dev/log	local0
	log /dev/log	local1 notice
	chroot /var/lib/haproxy
	stats socket /run/haproxy/admin.sock mode 660 level admin
	stats timeout 30s
	user haproxy
	group haproxy
	daemon

    # *** This is where we add the Acme ACCOUNT_THUMBPRINT value: ***
	setenv ACCOUNT_THUMBPRINT '<REDACTED>'

	# Default SSL material locations
	ca-base /etc/ssl/certs
	crt-base /etc/ssl/private

	# See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
    ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
    ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

defaults
	log	global
	mode	http
	option	httplog
	option	dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000
	errorfile 400 /etc/haproxy/errors/400.http
	errorfile 403 /etc/haproxy/errors/403.http
	errorfile 408 /etc/haproxy/errors/408.http
	errorfile 500 /etc/haproxy/errors/500.http
	errorfile 502 /etc/haproxy/errors/502.http
	errorfile 503 /etc/haproxy/errors/503.http
	errorfile 504 /etc/haproxy/errors/504.http

frontend http
	bind :::80 # The triple colons indicate that we're listening on all IP addresses, on this port

    # We want to allow Acme challenges but no other external traffic for now.
    # We also want plain HTTP traffic from internal clients to be redirected to our HTTPS listener.
	acl acme_challenge 	path_beg	'/.well-known/acme-challenge/'
	acl internal_origin	src		10.0.0.0/8
	acl internal_origin	src		2001:db8::/32

	http-request deny unless internal_origin or acme_challenge
    # The following line is what actually responds to the ACME challenge
    http-request return status 200 content-type text/plain lf-string "%[path,field(-1,/)].${ACCOUNT_THUMBPRINT}\n" if acme_challenge
	redirect scheme https code 301 if internal_origin !{ ssl_fc }

frontend https
	bind :::443 ssl crt /etc/haproxy/certs/ strict-sni

    # Defence in depth: Only respond to internal HTTPS requests, even though external ones should already be blocked in the firewall.
	acl internal_origin	src		10.0.0.0/8
	acl internal_origin	src		2001:db8::/32

	http-request deny if !internal_origin

With this in place, let’s verify the configuration is sound, restart HAProxy, and validate its status:

$ sudo haproxy -c -f /etc/haproxy/haproxy.cfg
[NOTICE]   (70280) : haproxy version is 3.2.9-1ppa1~noble
[NOTICE]   (70280) : path to executable is /usr/sbin/haproxy
[WARNING]  (70280) : Proxy 'https': no SSL certificate specified for bind ':443' at [/etc/haproxy/haproxy.cfg:47], ssl connections will fail (use 'crt').
$ sudo systemctl restart haproxy.service
$ sudo systemctl status haproxy.service
● haproxy.service - HAProxy Load Balancer
     Loaded: loaded (/usr/lib/systemd/system/haproxy.service; enabled; preset: enabled)
     Active: active (running) since Sun 2025-12-14 11:58:05 UTC; 4s ago
       Docs: man:haproxy(1)
             file:/usr/share/doc/haproxy/configuration.txt.gz
   Main PID: 70319 (haproxy)
     Status: "Ready."
      Tasks: 3 (limit: 4579)
     Memory: 47.6M (peak: 47.8M)
        CPU: 208ms
     CGroup: /system.slice/haproxy.service
             ├─70319 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -S /run/haproxy-master.sock
             └─70321 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -S /run/haproxy-master.sock

Dec 14 11:58:05 grafanasrv1 systemd[1]: Starting haproxy.service - HAProxy Load Balancer...
Dec 14 11:58:05 grafanasrv1 haproxy[70319]: [NOTICE]   (70319) : Initializing new worker (70321)
Dec 14 11:58:05 grafanasrv1 haproxy[70321]: [NOTICE]   (70321) : haproxy version is 3.2.9-1ppa1~noble
Dec 14 11:58:05 grafanasrv1 haproxy[70321]: [NOTICE]   (70321) : path to executable is /usr/sbin/haproxy
Dec 14 11:58:05 grafanasrv1 haproxy[70321]: [WARNING]  (70321) : Proxy 'https': no SSL certificate specified for bind ':443' at [/etc/haproxy/haproxy.cfg:47], ssl connections will fail (use 'crt').
Dec 14 11:58:05 grafanasrv1 haproxy[70319]: [NOTICE]   (70319) : Loading success.
Dec 14 11:58:05 grafanasrv1 systemd[1]: Started haproxy.service - HAProxy Load Balancer.

There’s a warning about missing certificates. We can ignore that for now, as we haven’t generated any certificates yet. The important part here is that HAProxy has a syntactically valid configuration and has started, so we have reason to believe it will be able to respond to the Acme authentication challenge.
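
We can even test the stateless challenge responder by hand before involving Let’s Encrypt: any path under /.well-known/acme-challenge/ should come back as the last path element joined with our account thumbprint. The token below is just an arbitrary test value:

# Expected response body: faketoken.<our ACCOUNT_THUMBPRINT>
curl -s http://monitoring.example.net/.well-known/acme-challenge/faketoken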

3.1.1.3. Requesting the certificate

With the previous steps sorted, we can now request the TLS certificate that our server will be using: A single certificate that’s valid for the subdomains metrics.example.net (for Prometheus data), logs.example.net (for Loki ingestion), and monitoring.example.net (for the Grafana web interface):

sudo -su acme
acme.sh --issue \
-d monitoring.example.net \
-d metrics.example.net \
-d logs.example.net \
--stateless \
--server letsencrypt_test

The output indicates all went well:

[Thu Dec 25 07:05:55 PM UTC 2025] Using CA: https://acme-staging-v02.api.letsencrypt.org/directory
[Thu Dec 25 07:05:55 PM UTC 2025] Multi domain='DNS:monitoring.example.net,DNS:metrics.example.net,DNS:logs.example.net'
[Thu Dec 25 07:06:00 PM UTC 2025] Getting webroot for domain='monitoring.example.net'
[Thu Dec 25 07:06:00 PM UTC 2025] Getting webroot for domain='metrics.example.net'
[Thu Dec 25 07:06:00 PM UTC 2025] Getting webroot for domain='logs.example.net'
[Thu Dec 25 07:06:00 PM UTC 2025] Verifying: monitoring.example.net
[Thu Dec 25 07:06:00 PM UTC 2025] Stateless mode for domain: monitoring.example.net
[Thu Dec 25 07:06:03 PM UTC 2025] Pending. The CA is processing your order, please wait. (1/30)
[Thu Dec 25 07:06:07 PM UTC 2025] Success
[Thu Dec 25 07:06:07 PM UTC 2025] Verifying: metrics.example.net
[Thu Dec 25 07:06:07 PM UTC 2025] Stateless mode for domain: metrics.example.net
[Thu Dec 25 07:06:10 PM UTC 2025] Pending. The CA is processing your order, please wait. (1/30)
[Thu Dec 25 07:06:14 PM UTC 2025] Success
[Thu Dec 25 07:06:14 PM UTC 2025] Verifying: logs.example.net
[Thu Dec 25 07:06:14 PM UTC 2025] Stateless mode for domain: logs.example.net
[Thu Dec 25 07:06:17 PM UTC 2025] Pending. The CA is processing your order, please wait. (1/30)
[Thu Dec 25 07:06:21 PM UTC 2025] Success
[Thu Dec 25 07:06:21 PM UTC 2025] Verification finished, beginning signing.
[Thu Dec 25 07:06:21 PM UTC 2025] Let's finalize the order.
[Thu Dec 25 07:06:21 PM UTC 2025] Le_OrderFinalize='https://acme-staging-v02.api.letsencrypt.org/acme/finalize/250417403/29886035973'
[Thu Dec 25 07:06:22 PM UTC 2025] Order status is 'processing', let's sleep and retry.
[Thu Dec 25 07:06:22 PM UTC 2025] Sleeping for 3 seconds then retrying
[Thu Dec 25 07:06:26 PM UTC 2025] Polling order status: https://acme-staging-v02.api.letsencrypt.org/acme/order/250417403/29886035973
[Thu Dec 25 07:06:26 PM UTC 2025] Downloading cert.
[Thu Dec 25 07:06:26 PM UTC 2025] Le_LinkCert='https://acme-staging-v02.api.letsencrypt.org/acme/cert/2c10e72a5c13da18803090a3ae62b64b7ee8'
[Thu Dec 25 07:06:27 PM UTC 2025] Cert success.
-----BEGIN CERTIFICATE-----
<REDACTED>
-----END CERTIFICATE-----
[Thu Dec 25 07:06:27 PM UTC 2025] Your cert is in: /var/lib/acme/.acme.sh/monitoring.example.net_ecc/monitoring.example.net.cer
[Thu Dec 25 07:06:27 PM UTC 2025] Your cert key is in: /var/lib/acme/.acme.sh/monitoring.example.net_ecc/monitoring.example.net.key
[Thu Dec 25 07:06:27 PM UTC 2025] The intermediate CA cert is in: /var/lib/acme/.acme.sh/monitoring.example.net_ecc/ca.cer
[Thu Dec 25 07:06:27 PM UTC 2025] And the full-chain cert is in: /var/lib/acme/.acme.sh/monitoring.example.net_ecc/fullchain.cer

A certificate valid for all three subdomain names has been generated for me. Now we need to make sure it gets deployed to where HAProxy can pick it up and use it.

3.1.1.4. Deploying the certificate to HAProxy

The acme.sh script contains what we need to make HAProxy use the certificate without needing to reload its configuration. We’ll need to trigger the command with a few environment variables set, to make sure we speak to the proxy server’s administrative socket in the correct way, and that we give it the information it needs. There’s no reason to specify more than the first of the domains for which the certificate is valid.

sudo -u acme -s
DEPLOY_HAPROXY_HOT_UPDATE=yes \
DEPLOY_HAPROXY_STATS_SOCKET=/run/haproxy/admin.sock \
DEPLOY_HAPROXY_PEM_PATH=/etc/haproxy/certs \
acme.sh --deploy -d monitoring.example.net --deploy-hook haproxy

Again, the output looks nice:

[Thu Dec 25 07:25:52 PM UTC 2025] The domain 'monitoring.example.net' seems to already have an ECC cert, let's use it.
[Thu Dec 25 07:25:53 PM UTC 2025] Deploying PEM file
[Thu Dec 25 07:25:53 PM UTC 2025] Moving new certificate into place
[Thu Dec 25 07:25:53 PM UTC 2025] Creating new certificate '/etc/haproxy/certs/monitoring.example.net.pem' over HAProxy stats socket.
[Thu Dec 25 07:25:53 PM UTC 2025] Success

Let’s verify that we can see the correct certificate when polling the listener. Still as the acme user, we’ll use socat to send a management command to HAProxy:

echo "show ssl cert /etc/haproxy/certs/monitoring.example.net.pem" | socat /var/run/haproxy/admin.sock -

Again, the output looks good:

Filename: /etc/haproxy/certs/monitoring.example.net.pem
Status: Used
Serial: 2C10E72A5C13DA18803090A3AE62B64B7EE8
notBefore: Dec 25 18:07:51 2025 GMT
notAfter: Mar 25 18:07:50 2026 GMT
Subject Alternative Name: DNS:logs.example.net, DNS:metrics.example.net, DNS:monitoring.example.net
Algorithm: EC256
SHA1 FingerPrint: 55051A4B4E6E2099C6E5457C5AF7C43F56D3A8B6
Subject: /CN=monitoring.example.net
Issuer: /C=US/O=(STAGING) Let's Encrypt/CN=(STAGING) Mysterious Mulberry E8
Chain Subject: /C=US/O=(STAGING) Let's Encrypt/CN=(STAGING) Mysterious Mulberry E8
Chain Issuer: /C=US/O=(STAGING) Internet Security Research Group/CN=(STAGING) Pretend Pear X1
OCSP Response Key:

This was basically the proof of concept step for our certificate automation. Let’s do it for real now.

3.1.1.5. Switching to the production Let’s Encrypt servers

Now that everything seems to work, we need to switch to the production Let’s Encrypt servers. This requires us to repeat a couple of the steps above.

First we need a new account thumbprint, which of course will replace the one in our /etc/haproxy/haproxy.cfg:

sudo -u acme -s
acme.sh --register-account \
   --server letsencrypt \
   -m youremail@example.com

After updating the HAProxy configuration to use the new thumbprint and restarting the service, we’ll request a new certificate from the Let’s Encrypt production servers. Note the added --force flag, which is required since there’s already a certificate in place that won’t expire for a good while:

acme.sh --issue \
-d monitoring.example.net \
-d metrics.example.net \
-d logs.example.net \
--stateless \
--server letsencrypt \
--force

The new certificate needs to be deployed, of course:

sudo -u acme -s
DEPLOY_HAPROXY_HOT_UPDATE=yes \
DEPLOY_HAPROXY_STATS_SOCKET=/run/haproxy/admin.sock \
DEPLOY_HAPROXY_PEM_PATH=/etc/haproxy/certs \
acme.sh --deploy -d monitoring.example.net --deploy-hook haproxy

The certificate and its validity can be checked using socat as above, and of course by using curl or a regular web browser and pointing it at one of the domain names.
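
For example, a quick hedged check with curl, which prints the certificate subject, issuer and expiry date as part of its verbose TLS handshake output:

curl -sv -o /dev/null https://monitoring.example.net 2>&1 | grep -iE 'subject:|issuer:|expire date:'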

The final step for encrypting our network traffic is ensuring that it stays secure.

3.1.1.6. Making the certificate autorenew

Let’s Encrypt certificates expire quickly, and it’s not feasible to keep updating them manually. The necessary automation can be achieved by using a systemd timer, but since the acme.sh script contains functionality to use cron, let’s just run that, again as the acme user:

# Initialize a crontab by saving the file generated by the following command:
crontab -e
# Then set up the cron job:
acme.sh --install-cronjob
# Finally confirm that the cron job is listed:
crontab -l

3.1.2. Encrypting and forwarding Prometheus traffic

3.1.2.1. HAProxy side Prometheus TLS config

HAProxy works with the concept of “frontends”, which define how to listen for incoming traffic, and rules that direct traffic to “backends”, which define how to reach the actual server handling the traffic in question. Since our Mimir service listens on port 9009, what’s required here is to make sure that traffic to https://metrics.example.net gets sent to that service.

In the frontend https section of the configuration, we’ll add the following logic:

    use_backend bk_mimir if { hdr(Host) -i metrics.example.net }

Right below the frontend section, we’ll create our first backend section:

backend bk_mimir
        server grafanasrv1 ::1:9009 check

Together, these lines tell HAProxy to send traffic bound for metrics.example.net to port 9009 on localhost. The check argument at the end of the server line asks the reverse proxy to perform a very rudimentary health check on the endpoint by attempting to create a TCP connection to it. If this test stops succeeding for any reason, HAProxy will respond to the client with an HTTP 503 Service Unavailable error rather than let traffic simply disappear or queue up indefinitely.

Confirm that HAProxy accepts our configuration and, if so, reload the service by running the following two commands:

sudo haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl reload haproxy

3.1.2.2. Alloy side Prometheus TLS config

With HAProxy knowing how to manage traffic to our Mimir backend, we can test it out by pointing the Metrics sender of our local Alloy agent at our new listener. Locate the prometheus.remote_write "default" block in /etc/alloy/config.alloy and change the url line:

    url = "https://metrics.example.net/api/v1/push"

If, after restarting the Alloy agent, we still receive metrics to our Grafana instance, everything worked as expected.
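
If we want to verify the HAProxy routing itself, independently of Alloy, we can hit Mimir’s readiness endpoint through the new listener from a machine on our internal network (remember that external requests are blocked by our frontend rules):

# Routed via HAProxy to Mimir on localhost port 9009; should answer "ready":
curl -s https://metrics.example.net/ready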

3.1.3. Encrypting and forwarding Loki traffic

3.1.3.1. HAProxy side Loki TLS config

Now that we’ve done it once, the next step is simple: we’re basically making the same configuration changes as for Prometheus/Mimir above.

In the frontend https section, we’ll add a selection line like this:

    use_backend bk_loki if { hdr(Host) -i logs.example.net }

Below the Mimir backend configuration, we’re adding a second backend block:

backend bk_loki
        server grafanasrv1 ::1:3100 check

Again, HAProxy needs to be reloaded. If it doesn’t complain, we’ll move on to updating the Agent config.

3.1.3.2. Alloy side Loki TLS config

Again, just like for our Prometheus/Mimir configuration, we’ll update the url configuration in the loki.write "default" block, in /etc/alloy/config.alloy:

    url = "https://logs.example.net/loki/api/v1/push"

Restart the Alloy agent and confirm that we keep seeing new log entries, as confirmation that our change succeeded.
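
The same kind of spot check works here, this time against the Loki backend behind the logs host name:

# Routed via HAProxy to Loki on localhost port 3100; should answer "ready":
curl -s https://logs.example.net/ready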

3.1.4. HTTPS access to Grafana

This is the final step in terms of putting network traffic to our Grafana server under the protection of TLS.

3.1.4.1. HAProxy Grafana TLS configuration

Just like earlier, we’ll tell HAProxy to forward traffic for https://monitoring.example.net to the appropriate web service:

    use_backend bk_grafana if { hdr(Host) -i monitoring.example.net }

The configuration lines for the backend:

backend bk_grafana
        server grafanasrv1 ::1:3000 check

Reload the HAProxy instance, and already a lot of the Grafana services should seem to work. We do want to make some changes to Grafana’s configuration, though:

3.1.4.2. Grafana TLS configuration

Grafana doesn’t yet know about its new URL, and we need to fix that to avoid some issues with the web service.

In /etc/grafana/grafana.ini, the following changes need to be made in the [server] section:

protocol = http
domain = monitoring.example.net
enforce_domain = true
root_url = https://monitoring.example.net

(Note: Since our reverse proxy handles HTTPS decryption and speaks plain HTTP with Grafana, the protocol from Grafana’s point of view actually is HTTP even though agents and clients speak HTTPS with the server. The mismatch is managed by explicitly setting the value of the root_url.)

In the [security] section of the same file:

cookie_secure = true
cookie_samesite = strict

There are some additional security options available, but I prefer to handle them in HAProxy, as that configuration will apply to any https service presented through the reverse proxy. But first, restart the grafana-server service and verify the web service can be reached via a regular browser.
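
A command line sanity check is possible here too: Grafana exposes a small health endpoint that should now be reachable over the proxied HTTPS name, from a machine on our internal network:

# Should return a small JSON document including "database": "ok":
curl -s https://monitoring.example.net/api/health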

If this works: Congratulations! We now have a centralized service for monitoring metrics and logs from our internal servers, and we’ve made it hard for someone on our local network to eavesdrop on the status of our servers.

I’m not completely happy about our security yet, though. In the final chapter, I’ll go through the minimum hardening configuration I would recommend before publishing this service on any production network.

4. Hardening our Grafana stack

There are a couple of main reasons for wanting to harden our environment: The first, of course, is that everything connected to a network can be expected to be attacked at some point. A second reason is that we might want to be able to access our dashboards even when not connected to our own network, or we may have services hosted elsewhere, for which we want to collect metrics and logs in the same way we do with our on-premises servers. We may have a VPN, which mitigates a lot of the risks, but if we do elect to present the services to the Internet, we’d better be pretty sure we’re not being naïve about it. So let’s start making it a bit harder for the bad guys.

4.1. Hardening Grafana

4.1.1. Replacing the administrative account

This time we’ll start with the Grafana frontend.

The first thing we’ll want to do, is to log in to Grafana, create a new administrative user, and assign it Grafana Admin permissions, plus the Admin role for your main org. As usual it’s prudent to let this account be otherwise unused, and set up a separate account with appropriately lower permissions for your day-to-day tasks.

After creating the new admin account, verify you have the permissions you need, and then disable or outright delete the original admin account. This is the one that will be hammered by bots, and if it doesn’t exist it can’t be abused.

4.1.2. Adding HAProxy security headers

Second, we’ll add some security headers to our HAProxy instance, which will make it harder to trick users into divulging their credentials.

In the https frontend section I like to add the following lines. The Strict-Transport-Security, or HSTS, header tells any compliant client to not even try to connect unless it can safely be done over HTTPS. This is a cheap way of making it harder to accidentally spread authentication credentials in unencrypted form over a public network. The “max-age” is set to half a year, counted in seconds, which is a commonly recommended minimum for this header. We’re also making it harder to open this site inside a frame, which could be used for clickjacking, and we’re telling compliant clients not to leak full referrer information to third-party domains when visiting sites behind this reverse proxy. The final line is just good form for any service on the Internet.

    http-response set-header Strict-Transport-Security "max-age=15768000; includeSubDomains; preload"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-Content-Type-Options "nosniff"
    http-response set-header X-Xss-Protection "1; mode=block"
    http-response set-header Referrer-Policy "strict-origin-when-cross-origin"
    http-response set-header X-Clacks-Overhead "GNU Terry Pratchett"

After storing the configuration file, reload the HAProxy service, and confirm that the headers appear in your web browser’s network monitoring tools (or with curl -v) when visiting the site.
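
A hedged example of such a check from the command line:

# The security headers should show up on every response from the proxy:
curl -sI https://monitoring.example.net | grep -iE 'strict-transport-security|x-frame-options|x-content-type-options|referrer-policy'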

4.2. Hardening Mimir

Mimir apparently doesn’t contain functionality for authentication and authorization. The Alloy client is able to utilize standards like OAuth2, but setting up the necessary infrastructure is out of scope for this document. Instead we’ll configure HAProxy to authenticate calls to Mimir and drop them if invalid credentials are used. For this, we’ll need to generate a secure password and tell HAProxy how to use it.

4.2.1. Generating a password

For this, we’ll need the mkpasswd tool, which for some reason is packaged with whois in Ubuntu:

sudo apt install whois

Then we’ll generate a secure password. If we have a password manager, it’s easiest to generate a password inside of it, but we can also create a good one by reading from the operating system’s pseudo-random number generator:

head -c 20 /dev/urandom | base64 | tr -dc 'a-zA-Z0-9'

This reads 20 bytes of randomness, base64-encodes them, and strips out any non-alphanumeric characters, leaving a random string that will be our secret.

Next, let’s feed this secret into mkpasswd to hash it with the Blowfish algorithm:

mkpasswd -m bcrypt oursecret

The resulting string is hashed, not encrypted. This means that we shouldn’t need to worry about someone being able to descramble the string and end up with our original secret. Store it somewhere safe: We’ll use it soon.

4.2.2. Preparing HAProxy for Mimir authentication

HAProxy supports what’s called “Basic Auth” - username and password based authentication transferred in HTTP(S) headers. It’s not really the state of the art, but as we expect only a very limited number of user accounts with randomized and long passwords, it should suffice for our needs.

Before the frontend sections in our config file, we’ll add a userlist section, like this:

userlist alloy
    user alloyclient password <REDACTED>

Reminder: For the password value, enter the bcrypt hash of your generated password rather than the password itself!

Then in the frontend https section, we’ll make some changes. Under the lines starting with acl internal_origin, add a couple of ACLs:

    acl authenticated       http_auth(alloy)
    acl to_mimir            hdr(Host) -i metrics.example.net

The first access control list matches requests that have successfully authenticated as a user in the alloy userlist; the second matches requests aimed at our Mimir listener.

Under the existing http-request deny rule, add the following rule:

    http-request deny if to_mimir !authenticated

If anybody tries to reach metrics.example.net and does not provide the correct username/password, HAProxy will block the request. Note that the userlist here really is a list: If we wanted to, we could create individual user/password combinations to separate non-production from production accounts, or let separate teams authenticate against our solution with individual secrets, or similar. If this solution grows, I strongly recommend setting up proper identity and access management (IAM) infrastructure, but I would not hesitate to use this solution for a smaller-scale deployment.
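
As a sketch of what such a list could look like, here is the userlist with a second, made-up account added for a test environment (each password value is a bcrypt hash, just like before):

userlist alloy
    user alloyclient        password <REDACTED>
    user alloyclient-test   password <REDACTED>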

Finally we’ll update the backend selector for Mimir. We used to check for the hostname inline here, but now we have ACLs that can be used for the same purpose:

    use_backend bk_mimir    if to_mimir authenticated

There is an implicit logical AND between the two ACLs on this line, so both need to be fulfilled for traffic to be allowed through to the Mimir service, which gives us a degree of defence in depth.
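
Putting it together, the authentication-related parts of the configuration should now look something like this, with unrelated lines omitted (the complete file is reproduced in Appendix i):

userlist alloy
    user alloyclient password <REDACTED>

frontend https
    ...
    acl authenticated       http_auth(alloy)
    acl to_mimir            hdr(Host) -i metrics.example.net

    http-request deny if to_mimir !authenticated

    use_backend bk_mimir    if to_mimir authenticated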

If we save the config file and reload HAProxy, new metrics should stop arriving in Grafana. That is expected: Alloy doesn’t yet know how to authenticate, so it’s time to update the prometheus.remote_write section in our Alloy config.

4.2.3. Switching Alloy to Mimir authentication

Instead of storing the secret in plaintext in our Alloy configuration, we’ll keep it in a separate secrets file, which we can create like this:

echo -n "oursecret" | sudo tee /etc/alloy/secretsfile
sudo chmod 600 /etc/alloy/secretsfile
sudo chown alloy:alloy /etc/alloy/secretsfile

This sequence of commands creates the file we need in the correct location and then makes it readable only by its owner, the alloy account. The -n parameter to echo makes the file contain only the literal string we want; without it, echo appends a newline character, which would break the authentication.
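
A quick sanity check is to count the bytes in the file; the number should exactly match the length of the secret, with no extra byte for a trailing newline:

sudo wc -c /etc/alloy/secretsfile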

Next we’ll edit /etc/alloy/config.alloy. To avoid having the agent re-read the secrets file every time it wants to send anything to Mimir, we’ll expose it as a local.file component. Below the logging section near the top of the file, create a new block like this:

local.file "authsecret" {
  filename      = "/etc/alloy/secretsfile"
  is_secret     = true
}

Next we’re telling the prometheus.remote_write process how to authenticate against our service. Inside the endpoint block under the url definition, add the following block:

    basic_auth {
      username          = "alloyclient"
      password          = local.file.authsecret.content
    }

After restarting the Alloy service, we should start seeing new metrics in Grafana again.
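
Assuming Alloy was installed from Grafana’s package repository and runs as the systemd unit alloy, the restart and a quick scan of its log for errors could look like this:

sudo systemctl restart alloy
# No output from grep means no errors have been logged recently
sudo journalctl -u alloy --since "10 minutes ago" | grep -i error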

4.2.4. Troubleshooting

We can confirm that HAProxy behaves the way we expect by using curl from a client on an allowed network:

curl -u alloyclient:oursecret https://metrics.example.net

This should result in a 404 (“Not found”) error if we’ve provided the correct username/password, since the request is passed through to Mimir, which has nothing to serve at the root path. With the wrong credentials we instead get a 403 (“Forbidden”) error, because HAProxy denies the request before it ever reaches Mimir.
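
If you prefer to see just the status codes, curl can print them on their own; with the setup above, the first call should print 404 and the second 403 (the wrong password is, of course, made up):

curl -s -o /dev/null -w '%{http_code}\n' -u alloyclient:oursecret https://metrics.example.net
curl -s -o /dev/null -w '%{http_code}\n' -u alloyclient:wrongsecret https://metrics.example.net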

4.3. Hardening Loki

Loki does ship with a multi-tenancy switch of its own (auth_enabled), but just like Mimir it expects an authenticating reverse proxy in front of it, and as we already have a working framework in HAProxy I see little reason to change a winning concept. We’ll apply the same principle to this traffic as for our Prometheus/Mimir connection.

4.3.1. Preparing HAProxy for Loki authentication

Under the to_mimir ACL, we’ll add a to_loki ACL of the same form:

    acl to_loki             hdr(Host) -i logs.example.net

The authenticated ACL is already in place, so under the denial rule for Mimir, we’ll add an almost identical rule for Loki:

    http-request deny if to_loki !authenticated

And just like we did for the backend selector for Mimir, we’ll do a very similar thing for Loki:

    use_backend bk_loki     if to_loki authenticated

Store the configuration file and reload the HAProxy service, and we should temporarily see a gap in our logs. Let’s update Alloy.

4.3.2. Switching Alloy to Loki authentication

Just like in HAProxy, the bulk of the work is already done, so we’ll just copy the basic_auth section from the prometheus.remote_write block into the endpoint section of the loki.write block:

    basic_auth {
      username          = "alloyclient"
      password          = local.file.authsecret.content
    }

After restarting the Alloy service, we should start seeing logs again.
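
The same curl trick as for Mimir works here too: with valid credentials Loki should answer the root path with a 404, while bad credentials should get a 403 straight from HAProxy:

curl -s -o /dev/null -w '%{http_code}\n' -u alloyclient:oursecret https://logs.example.net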

4.4. Chapter summary

By this point, we have done a lot to make it harder for third parties to abuse our service:

  • All traffic is protected by transport layer security (TLS), meaning it’s hard to eavesdrop on our traffic.
  • The default administrative account in Grafana is removed and therefore no longer a viable target for the inevitable password-spraying attacks. By now we should have replaced it with individual accounts that adhere to the principle of least privilege and have unique, unguessable passwords.
  • We demand that anything that tries to access our Mimir and Loki endpoints is authenticated using a strong secret.
  • We’ve added some basic security headers that make it harder to trick a user of our Grafana instance into accidentally revealing their credentials to a bad guy.

I would say this solution is ready to be deployed into active use, where we of course should complement it with additional data points, additional dashboards, and additional alert rules. A good exercise would be to enable a Prometheus endpoint in HAProxy, have the Alloy agent scrape it, and then create a dashboard that shows traffic statistics and, say, counts of HTTP 4xx errors for our http and https frontends respectively.
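
As a starting point for that exercise: HAProxy has a built-in Prometheus exporter (bundled by default since version 2.4), and a minimal sketch could be a separate frontend bound to loopback only, with an arbitrarily chosen port:

frontend prometheus
    bind ::1:8405
    # Serve HAProxy's own metrics in Prometheus format on /metrics
    http-request use-service prometheus-exporter if { path /metrics }
    no log

On the Alloy side, [::1]:8405 would then be added as an extra target in the existing prometheus.scrape block.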

4.5. (OPTIONAL!) Presenting our server to the Internet

If this Grafana server needs to be Internet accessible, the following two steps need to be taken:

  1. Open our firewall for HTTPS traffic to the Grafana server.
  2. Comment out the following line in our HAProxy configuration:
    http-request deny unless internal_ipv6

It is, of course, possible to use this rule in combination with our existing ACLs, to allow traffic to specific services and deny it to others; for example opening Grafana for Internet access while denying access to Mimir and Loki, or any combination of the three.
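
As a sketch of one such combination, assuming we only want Grafana itself reachable from the Internet, the blanket deny could be replaced with a host-aware one (the to_grafana ACL is new; everything else, including the authentication rules for Mimir and Loki, stays as before):

    acl to_grafana          hdr(Host) -i monitoring.example.net

    # Grafana may be reached from anywhere; everything else stays internal-only
    http-request deny if !internal_ipv6 !to_grafana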

Appendices: The finished configuration files

Appendix i. HAProxy configuration

File location: /etc/haproxy/haproxy.cfg

global
	log /dev/log	local0
	log /dev/log	local1 notice
	chroot /var/lib/haproxy
	stats socket /run/haproxy/admin.sock mode 660 level admin
	stats timeout 30s
	user haproxy
	group haproxy
	daemon
	setenv ACCOUNT_THUMBPRINT '<REDACTED>'
	# Default SSL material locations
	ca-base /etc/ssl/certs
	crt-base /etc/ssl/private

	# See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
	ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
	ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
	ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

defaults
	log	global
	mode	http
	option	httplog
	option	dontlognull
	timeout connect 5000
	timeout client  50000
	timeout server  50000
	errorfile 400 /etc/haproxy/errors/400.http
	errorfile 403 /etc/haproxy/errors/403.http
	errorfile 408 /etc/haproxy/errors/408.http
	errorfile 500 /etc/haproxy/errors/500.http
	errorfile 502 /etc/haproxy/errors/502.http
	errorfile 503 /etc/haproxy/errors/503.http
	errorfile 504 /etc/haproxy/errors/504.http

userlist alloy
	user alloyclient password <REDACTED>

frontend acme
	bind :::80

	acl acme_challenge 	path_beg	'/.well-known/acme-challenge/'
	acl internal_ipv6	src		2001:db8::/32

	http-request deny unless acme_challenge or internal_ipv6

	http-request return status 200 content-type text/plain lf-string "%[path,field(-1,/)].${ACCOUNT_THUMBPRINT}\n" if acme_challenge

	redirect scheme https code 301 if internal_ipv6 !{ ssl_fc }

frontend https
	bind :::443 ssl crt /etc/haproxy/certs/ strict-sni
	http-response set-header Strict-Transport-Security "max-age=15768000; includeSubDomains; preload"
	http-response set-header X-Frame-Options "SAMEORIGIN"
	http-response set-header X-Content-Type-Options "nosniff"
	http-response set-header X-Xss-Protection "1; mode=block"
	http-response set-header Referrer-Policy "strict-origin-when-cross-origin"
	http-response set-header X-Clacks-Overhead "GNU Terry Pratchett"
	
	acl internal_ipv6	src		2001:db8::/32
	acl authenticated	http_auth(alloy)
	acl to_mimir		hdr(Host) -i metrics.example.net
	acl to_loki			hdr(Host) -i logs.example.net
	
	http-request deny unless internal_ipv6
	http-request deny if to_mimir !authenticated
	http-request deny if to_loki !authenticated

	use_backend bk_mimir    if to_mimir authenticated
	use_backend bk_loki     if to_loki authenticated
	use_backend bk_grafana  if { hdr(Host) -i monitoring.example.net }

backend bk_mimir
	server grafanasrv1 ::1:9009 check

backend bk_loki
	server grafanasrv1 ::1:3100 check

backend bk_grafana
	server grafanasrv1 ::1:3000 check

Appendix ii. Grafana configuration

File location: /etc/grafana/grafana.ini
Note: Only including changed sections of the configuration file.

[server]
protocol = http
domain = monitoring.example.net
enforce_domain = true
root_url = https://monitoring.example.net

[security]
cookie_secure = true
cookie_samesite = strict

[smtp]
enabled = true
host = mail.example.net:587
user = alerts@example.net
password = <REDACTED>
from_address = alerts@example.net
from_name = example.net Notifications
ehlo_identity = monitoring.example.net

Appendix iii. Mimir configuration

File location: /etc/mimir/config.yml

---
multitenancy_enabled: false

server:
  http_listen_port: 9009

  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  ring:
    kvstore:
      store: memberlist
  pool:
    health_check_ingesters: true

ingester:
  ring:
    # We want to start immediately and flush on shutdown.
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    # Use an in memory ring store, so we don't need to launch a Consul.
    kvstore:
      store: memberlist
        #      store: inmemory
    replication_factor: 1

blocks_storage:
  tsdb:
    dir: ./data/tmp/tsdb

  bucket_store:
    sync_dir: ./data/tmp/tsdb-sync

  backend: filesystem 

  filesystem:
    dir: ./data/tsdb

compactor:
  data_dir: ./data/tmp/compactor

ruler_storage:
  backend: local
  local:
    directory: ./data/rules

Appendix iv. Loki configuration

File location: /etc/loki/config.yml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 0.0.0.0
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  enable_multi_variant_queries: true

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf


# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

Appendix v. Alloy configuration

File location: /etc/alloy/config.alloy
Note: Depends on /etc/alloy/secretsfile containing the secret string that corresponds to the hash stored for the alloyclient user account in the HAProxy configuration.

logging {
  level = "warn"
}

local.file "authsecret" {
  filename	= "/etc/alloy/secretsfile"
  is_secret	= true
}

prometheus.exporter.unix "default" {
  include_exporter_metrics = true
  disable_collectors       = ["mdadm"]
  enable_collectors        = ["systemd"]
}

prometheus.scrape "default" {
  targets = array.concat(
    prometheus.exporter.unix.default.targets,
    [{
      // Self-collect metrics
      job         = "alloy",
      __address__ = "127.0.0.1:12345",
    }],
  )

  forward_to = [
    prometheus.remote_write.default.receiver,
  ]
}

prometheus.remote_write "default" {
  external_labels = {
    node_name		= sys.env("HOSTNAME"),
    env			= "prod",
  }
  endpoint {
    url = "https://metrics.example.net/api/v1/push"
    basic_auth {
      username          = "alloyclient"
      password		= local.file.authsecret.content
    }
  }
}

local.file_match "node_logs" {
  path_targets = [{
      __path__  = "/var/log/syslog",
      job       = "node/syslog",
      node_name = sys.env("HOSTNAME"),
	  env		= "prod",
  }]
}

loki.source.file "node_logs" {
  targets    = local.file_match.node_logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://logs.example.net/loki/api/v1/push"
    basic_auth {
      username          = "alloyclient"
      password		= local.file.authsecret.content
    }
  }
}