Diary entry 02/01
Diving into observability
Context
During a discussion with another engineer, I mentioned connection errors on a production database even though there weren’t many concurrent users. He asked me what instrumentation tooling I was using.
The answer: nothing really specific, apart from Sentry, which I use to track application error logs for the .NET apps I develop.
Operations
0. Analyzing existing tools
Datadog, Sentry, Grafana, etc. I decided to go with the cheapest option in terms of money, but the most expensive in terms of time: Grafana.
1. What do I actually want to “observe”?
- my .NET apps
- my Nginx proxy
- my databases
So I start by installing libraries to configure observability for my .NET web app and my console app.
For the web part, it’s easy: I just need to expose an HTTP endpoint that Prometheus can scrape to retrieve the metrics (with a strong warning about securing it). For now, I’ve simply generated a random URL as the scrape path.
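For reference, the wiring looks roughly like this; a minimal sketch, assuming the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore, and OpenTelemetry.Exporter.Prometheus.AspNetCore NuGet packages, with a made-up random path:

```csharp
// Minimal sketch: expose a Prometheus scrape endpoint from an ASP.NET Core app.
// Assumes the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore
// and OpenTelemetry.Exporter.Prometheus.AspNetCore NuGet packages.
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()  // request duration, active requests, etc.
        .AddPrometheusExporter());

var app = builder.Build();

// The "random URL" trick: the scrape path is configurable,
// so it doesn't have to be the well-known /metrics.
app.MapPrometheusScrapingEndpoint("/metrics-3f9c2a7e");

app.Run();
```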
For the console app, I don’t want to turn it into a web app just to expose an endpoint. So I need to push the data to a collector.
There is OpenTelemetry Collector, which I would need to install on a node of my PaaS.
I give it a try, but it fails: the OS version of the Docker image is not supported by my PaaS. I’d probably need to switch to an older version. At this point, it doesn’t feel like a good move.
Maybe turning the console app into a web app would make more sense. I’ll see later, depending on RAM usage and image size. For now, I stay focused on the web app part, which would already be a big step forward.
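For later, the push variant for the console app would look roughly like this; a sketch assuming the OpenTelemetry and OpenTelemetry.Exporter.OpenTelemetryProtocol packages, with a placeholder collector address:

```csharp
// Sketch: a console app pushing metrics to an OpenTelemetry Collector over OTLP,
// no HTTP endpoint required. Assumes the OpenTelemetry and
// OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet packages.
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

var meter = new Meter("MyConsoleApp");
var processed = meter.CreateCounter<long>("jobs_processed");

// The endpoint is a placeholder; 4317 is the default OTLP/gRPC port.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyConsoleApp")
    .AddOtlpExporter(options => options.Endpoint = new Uri("http://collector.internal:4317"))
    .Build();

processed.Add(1); // batched in memory and pushed periodically by the exporter
```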
2. Trying to install the “Grafana + Prometheus” extension from my PaaS marketplace
This fails for the Prometheus instance, which returns a 500 error. I don’t spend too much time on it — better to start from scratch and follow Grafana’s documentation.
3. Using the free version of Grafana Cloud
During setup, I choose to monitor a PostgreSQL installation. Grafana directly proposes code to install Alloy, which is a kind of global “monitoring station” that allows collecting more than just metrics from the PostgreSQL exporter (which is embedded in Alloy).
I run the automatically generated install script on my PostgreSQL instance, but it fails because I need sudo rights. And that’s not possible on my PaaS when installing apps via the marketplace.
So I need to use a VPS and install Alloy there, then configure it so it can connect to PostgreSQL.
4. Installing a Debian-based VPS
Nothing special here.
5. Configuring Alloy on the new VPS
I install Alloy using the provided script. I try a connection test via the Grafana Cloud interface… but it doesn’t work.
Looking into Alloy logs:
```
journalctl -u alloy.service -n 100 --no-pager
```

```
Jan 02 08:46:15 myhostname systemd[1]: Started alloy.service - Vendor-agnostic OpenTelemetry Collector distribution with programmable pipelines.
Jan 02 08:46:15 myhostname alloy[5850]: Error: /etc/alloy/config.alloy:75:1: expected identifier, got .
Jan 02 08:46:15 myhostname alloy[5850]: 74 | prometheus.relabel "integrations_postgres_exporter" {
Jan 02 08:46:15 myhostname alloy[5850]: 75 | .forward_to = [prometheus.remote_write.metrics_service.receiver]
Jan 02 08:46:15 myhostname alloy[5850]: | ^
Jan 02 08:46:15 myhostname alloy[5850]: 76 |
Jan 02 08:46:15 myhostname alloy[5850]: Error: /etc/alloy/config.alloy:82:1: expected identifier, got .
Jan 02 08:46:15 myhostname alloy[5850]: 81 |
Jan 02 08:46:15 myhostname alloy[5850]: 82 | .rule {
Jan 02 08:46:15 myhostname alloy[5850]: | ^
Jan 02 08:46:15 myhostname alloy[5850]: 83 | source_labels = ["__name__"]
Jan 02 08:46:15 myhostname alloy[5850]: interrupt received
Jan 02 08:46:15 myhostname alloy[5850]: Error: could not perform the initial load successfully
Jan 02 08:46:15 myhostname systemd[1]: alloy.service: Main process exited, code=exited, status=1/FAILURE
Jan 02 08:46:15 myhostname systemd[1]: alloy.service: Failed with result 'exit-code'.
Jan 02 08:46:15 myhostname systemd[1]: alloy.service: Scheduled restart job, restart counter is at 4.
Jan 02 08:46:15 myhostname systemd[1]: Stopped alloy.service - Vendor-agnostic OpenTelemetry Collector distribution with programmable pipelines.
```
The errors point to invalid identifiers in the config (an invalid DSN format will show up later).
With the help of an AI tool, I understand that the issue comes from the leading dots in the config: `.forward_to` and `.rule` are not valid Alloy syntax, since attribute and block names are written without a dot. I remove them, and things get a bit better.
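For reference, the corrected block, reconstructed from the log excerpt above (the rest of the rule body isn’t visible in the log, so it’s elided here too):

```alloy
prometheus.relabel "integrations_postgres_exporter" {
  forward_to = [prometheus.remote_write.metrics_service.receiver]

  rule {
    source_labels = ["__name__"]
    // ...rest of the rule as generated by Grafana Cloud
  }
}
```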
```
Jan 02 10:36:49 myhostname alloy[6154]: Error: /etc/alloy/config.alloy:62:1: Failed to build component: building component: cannot parse DSN: invalid connection protocol: observability
Jan 02 10:36:49 myhostname alloy[6154]: 61 |
Jan 02 10:36:49 myhostname alloy[6154]: 62 | prometheus.exporter.postgres "integrations_postgres_exporter" {
Jan 02 10:36:49 myhostname alloy[6154]: | _^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jan 02 10:36:49 myhostname alloy[6154]: 63 | | data_source_names = ["username:password@tcp(url:portNumber)/databaseName"]
Jan 02 10:36:49 myhostname alloy[6154]: 64 | | }
Jan 02 10:36:49 myhostname alloy[6154]: | |_^
Jan 02 10:36:49 myhostname alloy[6154]: 65 | discovery.relabel "integrations_postgres_exporter" {
Jan 02 10:36:49 myhostname alloy[6154]: interrupt received
Jan 02 10:36:49 myhostname alloy[6154]: Error: could not perform the initial load successfully
Jan 02 10:36:49 myhostname systemd[1]: alloy.service: Main process exited, code=exited, status=1/FAILURE
Jan 02 10:36:49 myhostname systemd[1]: alloy.service: Failed with result 'exit-code'.
Jan 02 10:36:50 myhostname systemd[1]: alloy.service: Scheduled restart job, restart counter is at 5.
Jan 02 10:36:50 myhostname systemd[1]: Stopped alloy.service - Vendor-agnostic OpenTelemetry Collector distribution with programmable pipelines.
Jan 02 10:36:50 myhostname systemd[1]: alloy.service: Start request repeated too quickly.
Jan 02 10:36:50 myhostname systemd[1]: alloy.service: Failed with result 'exit-code'.
Jan 02 10:36:50 myhostname systemd[1]: Failed to start alloy.service - Vendor-agnostic OpenTelemetry Collector distribution with programmable pipelines.
```
Now there’s an error with the DSN format used for the connection. What’s funny is that this DSN was provided directly by Grafana Cloud when I said I wanted to monitor a PostgreSQL instance. Looking closer, the `tcp(host:port)` form is MySQL-style DSN syntax; the PostgreSQL exporter expects a libpq-style URI instead.
This is the moment when I realize it would be really useful to track the changes I make on this Debian server.
So I ask my favorite AI what would be best to track my commands and file modifications (like config files in this case).
It suggests:
- Git
- auditd (Linux Audit Framework)
- Tlog / Sudo
- termtosvg
- asciinema2md
- etckeeper
I take a look at auditd, land on this article (https://goteleport.com/blog/linux-audit/), and quickly think: this might be too much.
On the blog of Stéphane Robert, a well-known SRE, I discover that etckeeper is in his list of essential tools. That convinces me enough to go with this solution.
6. Detour with etckeeper
I follow Stéphane Robert’s etckeeper tutorial step by step.
I don’t find the PRESERVE_METADATA option in the config file, so I add it manually, just in case.
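For the record, the basic flow is short; a sketch of the usual Debian setup (the commit message is just an example):

```bash
sudo apt install etckeeper     # initializes a Git repository in /etc on install
sudo etckeeper commit "before editing /etc/alloy/config.alloy"
# later, to see what changed:
sudo git -C /etc log --oneline
sudo git -C /etc diff HEAD~1 -- alloy/config.alloy
```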
7. Back to Alloy configuration
Based on the error logs, I need to modify the PostgreSQL connection string.
I have to specify that SSL should not be used, drop the `tcp(...)` wrapper, and avoid the DNS alias provided by the PaaS: my node is in the same environment as the database, and alias resolution only works between environments or over the internet.
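The working shape looks roughly like this (a sketch; credentials, IP, and database name are placeholders, and `sslmode=disable` is the standard libpq way to turn SSL off):

```alloy
prometheus.exporter.postgres "integrations_postgres_exporter" {
  // libpq-style URI: no tcp(...) wrapper, direct node IP instead of the PaaS alias
  data_source_names = ["postgresql://username:password@10.0.0.12:5432/databaseName?sslmode=disable"]
}
```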
Metrics integration works, but logs integration doesn’t. This is probably related to the fact that I’m running PostgreSQL v18; there must be a parameter to tweak. Not a big deal for now; I’ll focus on my .NET web app next.
8. Integrating metrics from a .NET 8 Web App
In this case, Grafana Cloud offers an integration via a dedicated NuGet library.
I’d rather not have my .NET app depend directly on Grafana, but only on the OpenTelemetry standard.
That means I need to integrate a Collector into my environment. It will scrape the custom metrics endpoint of my web app, and could also receive data pushed from my console app (if I eventually wire it in).
I install OpenTelemetry Collector Contrib instead of OpenTelemetry Collector Core. The Contrib version includes custom exporters and log collection, which the Core version does not support.
I can’t install it directly on a PaaS node because the OS isn’t supported.
So I install Docker Engine CE on a node and deploy a new container based on OpenTelemetry Collector Contrib.
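The collector config I’m aiming for looks roughly like this; a sketch, where the scrape target, path, remote-write URL, and credentials are all placeholders:

```yaml
# Sketch: scrape the web app's Prometheus endpoint and forward the metrics
# to Grafana Cloud. All URLs, paths, and credentials are placeholders.
extensions:
  basicauth/grafana:
    client_auth:
      username: "123456"                      # Grafana Cloud instance ID
      password: "${env:GRAFANA_CLOUD_API_KEY}"

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "dotnet-webapp"
          metrics_path: "/metrics-3f9c2a7e"   # the randomized path from step 1
          static_configs:
            - targets: ["webapp.internal:8080"]

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus-prod-XX.grafana.net/api/prom/push"
    auth:
      authenticator: basicauth/grafana

service:
  extensions: [basicauth/grafana]
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```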
At that point… my Grafana Cloud instance decides to crash. I can no longer configure new data sources.
To be continued, then? 😅