Note that for this to work we will use the following components:
These components will be connected as described in the diagram below.
We will deploy the RabbitMQ Prometheus exporter using Helm with the following command:
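As a sketch, such a command could look like the following, assuming Helm 2 and the stable/prometheus-rabbitmq-exporter chart (the release name, RabbitMQ management URL and credentials are placeholders, and the value names may differ between chart versions):

helm install --name rabbitmq-exporter stable/prometheus-rabbitmq-exporter \
  --set rabbitmq.url=http://rabbitmq:15672 \
  --set rabbitmq.user=guest \
  --set rabbitmq.password=guest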
This Helm chart deploys:
You can check that they are available:
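For example, with a release named rabbitmq-exporter:

kubectl get pods | grep rabbitmq-exporter
kubectl get svc | grep rabbitmq-exporter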
And then locally expose the service in order to check that the metrics are available:
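For example, assuming the chart created a Service named rabbitmq-exporter-prometheus-rabbitmq-exporter listening on port 9419 (adapt the name to whatever the previous command returned):

kubectl port-forward svc/rabbitmq-exporter-prometheus-rabbitmq-exporter 9419:9419 &
curl -s http://localhost:9419/metrics | head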
This will also give you an idea of which metrics are available (http://localhost:9419/metrics).
Note: The latest versions of RabbitMQ also offer a plugin that automatically exposes these metrics. If you use this plugin then you shouldn’t need the exporter.
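For reference, on RabbitMQ 3.8 and later this built-in plugin can be enabled with the command below; it then exposes metrics on port 15692 by default:

rabbitmq-plugins enable rabbitmq_prometheus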
These metrics are exposed in the format Prometheus expects, so let's deploy Prometheus next.
Prometheus can be deployed in many ways (e.g. with the Prometheus Operator), but in this case we will use a simple Deployment.
First, before actually deploying Prometheus, we need to deploy its configuration.
And since we will use the ability of Prometheus to perform Kubernetes Service discovery, we will need two things:
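The exact manifests are specific to each setup, but as a sketch: Prometheus needs RBAC permissions to list Services (a ServiceAccount bound to a suitable ClusterRole, omitted here), and a configuration that uses kubernetes_sd_configs together with relabel_configs keyed on the usual prometheus.io/* annotations. A minimal ConfigMap along those lines could look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
    scrape_configs:
      - job_name: kubernetes-services
        kubernetes_sd_configs:
          - role: service
        relabel_configs:
          # Only scrape Services annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Allow overriding the metrics path via prometheus.io/path
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # Allow overriding the scrape port via prometheus.io/port
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          # Keep some useful Kubernetes metadata as labels
          - source_labels: [__meta_kubernetes_namespace]
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            target_label: kubernetes_name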
Now we can deploy all the above-mentioned resources and check their availability:
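For example, assuming the resources above were saved as configmap.yaml, rbac.yaml and deployment.yaml, and that the Prometheus Pods carry an app=prometheus label:

kubectl apply -f configmap.yaml -f rbac.yaml -f deployment.yaml
kubectl get pods,svc -l app=prometheus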
Now make your Prometheus deployment locally available and check the metrics with your browser:
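For example (the Deployment name is an assumption):

kubectl port-forward deploy/prometheus 9090:9090

Then open http://localhost:9090 in your browser.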
Now go to http://localhost:9090/service-discovery.
You will find out that Prometheus has discovered a lot of Kubernetes Services (that's a good start!) but that none of them are active. Why?
Because if you have a look at the Prometheus ConfigMap above, you will see that a Kubernetes Service discovered by Prometheus needs a few annotations in order for Prometheus to scrape it ("scraping" meaning getting metrics from a service):
This means that we need to patch the RabbitMQ exporter Service before its metrics show up in Prometheus, for example with the following method:
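A sketch of such a patch, assuming the exporter Service is named rabbitmq-exporter-prometheus-rabbitmq-exporter and that the annotation names match the relabel configuration sketched earlier:

kubectl patch service rabbitmq-exporter-prometheus-rabbitmq-exporter \
  -p '{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"9419","prometheus.io/path":"/metrics"}}}'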
If you restart the Prometheus Pod, restart the port-forward, and then reload the URL (http://localhost:9090/service-discovery), things should look better now:
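For example (again assuming an app=prometheus label on the Prometheus Pod):

kubectl delete pod -l app=prometheus
kubectl port-forward deploy/prometheus 9090:9090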
Now that we have Prometheus collecting RabbitMQ metrics, we can use the integration between Prometheus and Stackdriver to make those metrics available in Stackdriver. This integration is performed using a sidecar container called stackdriver-prometheus-sidecar.
Note: you need to be really careful in this section, because the following steps are harder to debug.
So we will use the Prometheus deployment and add the sidecar definition to it:
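A sketch of what the added container could look like, under spec.template.spec.containers of the Prometheus Deployment, assuming Prometheus writes its data to a shared volume mounted at /prometheus (the project ID, region and cluster name are placeholders):

      # Added alongside the existing prometheus container
      - name: stackdriver-prometheus-sidecar
        image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.6.3
        args:
          - --stackdriver.project-id=my-gcp-project
          - --prometheus.wal-directory=/prometheus/wal
          - --stackdriver.kubernetes.location=europe-west1
          - --stackdriver.kubernetes.cluster-name=my-gke-cluster
          - --stackdriver.generic.location=europe-west1
          - --stackdriver.generic.namespace=my-gke-cluster
        volumeMounts:
          # Must be the same volume the prometheus container uses for its data directory,
          # so that the sidecar can read the WAL files
          - name: prometheus-data
            mountPath: /prometheus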
There are two things to notice here:
--stackdriver.project-id: GCP project ID
--prometheus.wal-directory: will contain the WAL files written by Prometheus
--stackdriver.kubernetes.location: GCP region
--stackdriver.kubernetes.cluster-name: GKE cluster name
--stackdriver.generic.location:
--stackdriver.generic.namespace: here we put the name of the cluster because, unfortunately, this is the only field that will allow us to distinguish metrics from different clusters

You can now update the Prometheus deployment with:
kubectl apply -f deployment.yaml
And then go to Stackdriver to see if new metrics have arrived:
Note: using Prometheus v2.11.2 + stackdriver-prometheus-sidecar v0.6.3 with this configuration, RabbitMQ metrics will be recognised as "Generic_Task".
You can see your metrics with some filters, as in the example below.
Now let’s imagine that along with RabbitMQ you have another service that exposes Prometheus-compliant metrics through a Kubernetes Service. Maybe it’s your own application.
Then, if you want those metrics to be available in Stackdriver, you just need to patch your Service using the same kind of patch we used before.
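For example, with kubectl annotate on a hypothetical my-app Service that exposes metrics on port 8080:

kubectl annotate service my-app \
  prometheus.io/scrape=true \
  prometheus.io/port=8080 \
  prometheus.io/path=/metrics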
At this stage, exploiting the metrics is easy. You just need to go to the Stackdriver Metrics Explorer:
Then configure a chart with the following:
And when your chart is ready, save it to a new or existing dashboard.
Be careful when sending external metrics to Stackdriver, even if it's just for a test, because this is an expensive feature.
Stackdriver monitoring is not that expensive. Even native Kubernetes monitoring with Stackdriver is not expensive. But using external metrics is.
So my advice here would be to add monitored services one by one, and maybe not to send non-essential metrics to Stackdriver at all.
As of today, I wouldn't say that getting external metrics associated with Kubernetes Services into Stackdriver is smooth. It requires some work, it doesn't exactly fit the data model, and if debugging is needed it may prove difficult.
Yet it exists, it works and will hopefully get easier to use.
And it saves you from having to deal with storage (amount of storage, resiliency, …) and cluster aggregation, which are much trickier problems than the ones we faced here.
If, however, you still want to try out other tools, you can read these posts about Kubernetes monitoring or Kubernetes productivity tips.