How to configure Prometheus/Alertmanager on OCP 3
With OpenShift 3.11, Prometheus Cluster Monitoring is fully supported.
Below I explain how to customize the alertmanager.yaml with your own receivers.
Please keep in mind that this solution is primarily meant for the platform and not for the applications themselves.
Prerequisites
Before you start, let me explain what setup and assumptions the solution below expects.
- a [bastion] host for your OCP environment where you can run the playbook
- three [infrastructure] nodes where the Prometheus/Alertmanager pods are running
- a prepared alertmanager.yaml
- a user/SA that has the permissions to modify the openshift-monitoring project
- the oc tool is installed and a user with the right permissions is logged in.
I personally prefer to use the oc tool, as it is compatible out of the box with the OCP version that was set up.
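As a quick sanity check before running anything, you can verify that the logged-in user is allowed to touch the monitoring secrets. This is only a minimal sketch; the exact RBAC you need may be granted through a different role in your environment.
# Show who is logged in and against which cluster
oc whoami
oc whoami --show-server
# Check that the current user may read and replace secrets in openshift-monitoring
oc auth can-i get secrets -n openshift-monitoring
oc auth can-i update secrets -n openshift-monitoring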
You should read and understand how the setup works in OpenShift, as described in the Prometheus Cluster Monitoring documentation, what the default configuration of the OCP Alertmanager looks like, and how the Alertmanager can be configured.
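To see what you are starting from, you can dump the currently deployed alertmanager.yaml from the alertmanager-main secret. This is the same command the playbook below uses to create its backup.
oc get secrets -n openshift-monitoring alertmanager-main \
  -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d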
Ansible solution
Run on bastion host
The following command executes the playbook that creates the new Alertmanager configuration. The new configuration is deployed automatically once the secret has been replaced.
ANSIBLE_LOG_PATH=ansible_log_$(date +%Y_%m_%d-%H_%M) ansible-playbook \
-e ocp_env=dev \
-e webhook_endpoint=http://127.0.0.1:5001/ \
-e email_receivers=operators@MYDomain.com \
alertman-conf.yaml
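The playbook expects an inventory that provides a bastion group (the host the playbook runs on) and an infranodes group (used for the webhook reachability check, referenced as groups['infranodes'] in the playbook). A minimal sketch of such an inventory; the host names are placeholders and have to match your environment:
[bastion]
bastion.ocpdev.cloud.internal

[infranodes]
infra-node-1.ocpdev.cloud.internal
infra-node-2.ocpdev.cloud.internal
infra-node-3.ocpdev.cloud.internal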
Playbook
The playbook that handles the modification of the alertmanager.yaml.
---
- name: Add webhook receiver to alertmanager
  hosts: bastion
  tasks:
    - name: Check that webhook receiver is reachable
      uri:
        body: '{"NodeAlias":"nodetest","Identifier":"MYID"}'
        body_format: json
        method: POST
        url: "{{ webhook_endpoint }}"
      with_items: "{{ groups['infranodes'] }}"
      delegate_to: "{{ item }}"
      changed_when: False

    - name: ALERTS | Create alertman backup tmpfile
      tempfile:
        prefix: "ocp{{ ocp_env }}_alertman_backup"
        suffix: ".tmp"
      register: alertman_back_tmp

    - name: ALERTS | Create alertman all backup tmpfile
      tempfile:
        prefix: "ocp{{ ocp_env }}_alertman_all_backup"
        suffix: ".tmp"
      register: alertman_all_back_tmp

    # Create backup from current setup
    - name: ALERTS | Get alertman secret
      shell: |
        {% raw %}
        oc get secrets -n openshift-monitoring \
          -o go-template='{{ index .data "alertmanager.yaml" }}' alertmanager-main \
          | base64 -d > {% endraw %}{{ alertman_back_tmp.path }}

    - name: ALERTS | Create receiver snippet
      template:
        dest: /tmp/alert-man-snippet
        src: templates/alert-receiver.j2

    - name: ALERTS | Get alertman secret as yaml
      shell: |
        oc get secrets -n openshift-monitoring -o yaml alertmanager-main > {{ alertman_all_back_tmp.path }}

    - name: ALERTS | Replace alertman config with new value
      replace:
        path: "{{ alertman_all_back_tmp.path }}"
        regexp: "^  alertmanager.yaml:.*$"
        replace: "  alertmanager.yaml: {{ lookup('file', '/tmp/alert-man-snippet') | b64encode }}"

    - name: ALERTS | Replace alertman secret
      shell: |
        oc replace -n openshift-monitoring -f {{ alertman_all_back_tmp.path }}

    - name: ALERTS | Remove receiver snippet tmpfile
      file:
        path: /tmp/alert-man-snippet
        state: absent

    - name: ALERTS | Remove alertman backup tmpfile
      file:
        path: "{{ alertman_all_back_tmp.path }}"
        state: absent
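Once the playbook has finished, you can verify that the secret really contains the new configuration and watch Alertmanager pick it up. The commands below assume the default alertmanager-main-0 replica of the cluster monitoring stack and that the main container is named alertmanager.
# Decode the replaced secret and confirm your receivers are present
oc get secrets -n openshift-monitoring alertmanager-main \
  -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 -d

# Follow the Alertmanager log to see the configuration being reloaded
oc logs -n openshift-monitoring alertmanager-main-0 -c alertmanager -f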
Template
The Alertmanager template alert-receiver.j2.
global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
    - receiver: myconf
    - match:
        alertname: DeadMansSwitch
      repeat_interval: 5m
      receiver: deadmansswitch
receivers:
  - name: default
  - name: deadmansswitch
  - name: myconf
    email_configs:
      - to: '{{ email_receivers }}'
        from: 'admin@ocp{{ ocp_env }}.cloud.internal'
        smarthost: 'SMTPRelay.MyDomain:25'
        send_resolved: true
        # require_tls: false
    webhook_configs:
      - url: "{{ webhook_endpoint }}"
        send_resolved: true
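If you have the Alertmanager amtool binary available, you can validate the rendered configuration before it is deployed. This is a sketch and assumes you validate the already rendered file, for example the /tmp/alert-man-snippet that the playbook writes on the bastion host, since the raw Jinja2 template is not valid Alertmanager YAML.
# Validate the rendered Alertmanager configuration
amtool check-config /tmp/alert-man-snippet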
Update
- 17.05.2019 - added catch-all receiver and require_tls