Configuring alerts with Prometheus, AlertManager and Telegram
1./ Install AlertManager on the monitoring server
Reference:
Installing Docker on Ubuntu 18
https://fixloinhanh.com/script-cai-dat-docker-container-tu-dong-tren-ubuntu-18/
Create a Docker Compose file for the alertmanager container:
cat docker-compose.yml
version: '2'
services:
  alertmanager:
    image: prom/alertmanager
    privileged: true
    volumes:
      - /alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
    command:
      - '--config.file=/alertmanager/alertmanager.yml'
    ports:
      - '9093:9093'
      - '9094:9094'
Then start the container:
cd /opt/docker
docker-compose up
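If you are not using Docker to deploy alertmanager, you can download the release archive, extract it, and run the binary directly:
https://prometheus.io/download/
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar xzf alertmanager-0.21.0.linux-amd64.tar.gz
Create the alertmanager.yml file (the non-Docker commands below expect it at /etc/alertmanager/alertmanager.yml):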
global:
  resolve_timeout: 1m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  #- name: 'alertmanager-bot'
  - name: 'web.hook'
    webhook_configs:
      - send_resolved: true
        url: 'http://127.0.0.1:8080'
Run command:
cd alertmanager-0.21.0.linux-amd64
./alertmanager --config.file /etc/alertmanager/alertmanager.yml
Or create a systemd service:
cat /etc/systemd/system/alertmanager.service
[Unit]
Description=Alert Manager Service
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--web.listen-address=":9093" \
--storage.path=/etc/alertmanager/data
SyslogIdentifier=alertmanager
Restart=always
[Install]
WantedBy=multi-user.target
Start the service:
systemctl daemon-reload
systemctl restart alertmanager.service
systemctl enable alertmanager.service
Once it is running, the server listens on two additional ports: 9093 (web UI and API) and 9094 (cluster communication).
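A quick way to confirm the two ports is a plain TCP connection test. The following is a minimal sketch, not part of AlertManager itself; the helper name port_open and the host 127.0.0.1 are assumptions made for illustration.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling port_open("127.0.0.1", 9093) and port_open("127.0.0.1", 9094) on the monitoring server should both return True once the service or container is up.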
2./ Install alertmanager-bot to send notifications to Telegram
Download alertmanager-bot from the following link:
https://github.com/metalmatze/alertmanager-bot
cd /opt/setup
git clone https://github.com/metalmatze/alertmanager-bot.git
cd alertmanager-bot
docker pull metalmatze/alertmanager-bot:0.4.2
Create a docker-compose.yml file:
nano docker-compose.yml
# Paste the following
networks:
  alertmanager-bot: {}
services:
  alertmanager-bot:
    command:
      # Note: inside this container, localhost is the container itself.
      # Use an address where AlertManager is reachable from the container.
      - --alertmanager.url=http://localhost:9093
      - --log.level=info
      - --store=bolt
      - --bolt.path=/data/bot.db
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    image: metalmatze/alertmanager-bot:0.4.2
    networks:
      - alertmanager-bot
    ports:
      - 8080:8080
    restart: always
    volumes:
      - ./data:/data
version: "3"
Replace the values below with your own:
TELEGRAM_ADMIN: "1234"
TELEGRAM_TOKEN: XXXXXXX
ENV Variable | Description |
ALERTMANAGER_URL | Address of the alertmanager, default: http://localhost:9093 |
BOLT_PATH | Path on disk to the file where the boltdb is stored, default: /tmp/bot.db |
CONSUL_URL | The URL to use to connect with Consul, default: localhost:8500 |
LISTEN_ADDR | Address that the bot listens for webhooks, default: 0.0.0.0:8080 |
STORE | The type of the store to use, choose from bolt (local) or consul (distributed) |
TELEGRAM_ADMIN | The Telegram user id for the admin. The bot will only reply to messages sent from an admin. All other messages are dropped and logged on the bot's console. |
TELEGRAM_TOKEN | Token you get from @botfather |
TEMPLATE_PATHS | Path to custom message templates, default template is ./default.tmpl, in docker - /templates/default.tmpl |
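Before starting the bot, the TELEGRAM_TOKEN value can be sanity-checked against the Telegram Bot API method getMe, which returns the bot's own account details for a valid token. This is only a sketch: the function names are made up for illustration, and check_token needs outbound HTTPS access to api.telegram.org.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def get_me_url(token: str) -> str:
    """Build the Bot API getMe URL for a token issued by @BotFather."""
    return f"{TELEGRAM_API}/bot{token}/getMe"


def parse_get_me(body: str) -> str:
    """Return the bot username from a getMe JSON response, or raise ValueError."""
    data = json.loads(body)
    if not data.get("ok"):
        raise ValueError(data.get("description", "token rejected"))
    return data["result"]["username"]


def check_token(token: str, timeout: float = 10.0) -> str:
    """Call getMe over HTTPS and return the bot username.

    A rejected token typically surfaces as urllib.error.HTTPError here.
    """
    with urllib.request.urlopen(get_me_url(token), timeout=timeout) as resp:
        return parse_get_me(resp.read().decode("utf-8"))
```

If check_token returns the expected bot username, the token is usable. TELEGRAM_ADMIN is a separate value, the numeric Telegram user id of the admin, and is not verified by this sketch.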
docker-compose up
You can check the container logs:
docker logs <container-id>
If the metalmatze/alertmanager-bot container fails with the following error:
component=telegram msg="failed to create bot" err="http.Post failed: Post dial tcp: lookup api.telegram.org on 127.0.0.11:53: read udp 127.0.0.1:41034->127.0.0.11:53: i/o timeout"
you can fix it by pointing the DNS servers in /etc/resolv.conf at Cloudflare (1.1.1.1) and Google (8.8.8.8), or at any resolver that works from this server. Then reboot the server.
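To see whether the Docker host itself can resolve the Telegram API host name after changing /etc/resolv.conf, a small resolution check can help. This is a sketch under assumptions: the helper name can_resolve is hypothetical, and it tests the host's resolver, not the resolver inside the container.

```python
import socket
from typing import Callable


def can_resolve(host: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        return len(resolver(host, 443)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

Calling can_resolve("api.telegram.org") on the server should return True once a working DNS server is configured. The resolver parameter exists only so the function can be exercised without network access.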
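3./ Create a rule to test sending vCenter alerts to Telegram
Create a default.tmpl file and change its initial default content; with the compose file below it must be placed in ./data so the bot sees it at /data/default.tmpl. The docker-compose file then looks like the one below. (Prometheus is written in Go, and the template uses Go template syntax.) Use PromQL queries to select the data you want to alert on.
Create a docker-compose.yml file:
nano docker-compose.yml
# Paste the following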
networks:
  alertmanager-bot: {}
services:
  alertmanager-bot:
    command:
      # Note: inside this container, localhost is the container itself.
      # Use an address where AlertManager is reachable from the container.
      - --alertmanager.url=http://localhost:9093
      - --log.level=info
      - --store=bolt
      - --bolt.path=/data/bot.db
      - --template.paths=/data/default.tmpl
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    image: metalmatze/alertmanager-bot:0.4.2
    networks:
      - alertmanager-bot
    ports:
      - 8080:8080
    restart: always
    volumes:
      - ./data:/data
version: "3"
Replace the values below with your own:
TELEGRAM_ADMIN: "1234"
TELEGRAM_TOKEN: XXXXXXX
Edit the default.tmpl file:
{{ define "telegram.default" }}
{{ range .Alerts }}
{{ if eq .Status "firing"}}🔥 <b>{{ .Status | toUpper }}</b> 🔥{{ else }}<b>{{ .Status | toUpper }}</b>{{ end }}
{{ if .Labels.severity }}<b>===== {{ .Labels.severity }} =====</b> {{ end }}<b>{{ .Labels.alertname }}</b>
{{ if .Annotations.Summary }} {{ .Annotations.Summary }}
{{ end }}
{{ if .Annotations.description }} {{ .Annotations.description }}
{{ end }}
<b>Duration:</b> {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
<b>Ended:</b> {{ .EndsAt | since }}{{ end }}
{{ end }}
{{ end }}
The original default file looks like this:
# https://github.com/metalmatze/alertmanager-bot/blob/master/default.tmpl
{{ define "telegram.default" }}
{{ range .Alerts }}
{{ if eq .Status "firing"}}🔥 <b>{{ .Status | toUpper }}</b> 🔥{{ else }}<b>{{ .Status | toUpper }}</b>{{ end }}
<b>{{ .Labels.alertname }}</b>
{{ if .Annotations.message }}
{{ .Annotations.message }}
{{ end }}
{{ if .Annotations.summary }}
{{ .Annotations.summary }}
{{ end }}
{{ if .Annotations.description }}
{{ .Annotations.description }}
{{ end }}
<b>Duration:</b> {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
<b>Ended:</b> {{ .EndsAt | since }}{{ end }}
{{ end }}
{{ end }}
Create a rule file such as vcenter.yml, then point to it from prometheus.yml (the rule_files section) so that Prometheus loads it.
groups:
  - name: Vcenter ESXi Host
    rules:
      - alert: Connect Vcenter Failed
        expr: up{job="dc_vcenter"} == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.instance }}" failed.'
      - alert: Connect ESXi Host Failed
        expr: vmware_host_connection_state == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.host_name }}" failed.'
      - alert: ESXi Host Maintenance
        expr: vmware_host_maintenance_mode == 1
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" maintenance.'
      - alert: Esxi Host Reboot
        # The boot timestamp is assumed to be a Unix time, so compare its age, not the raw value.
        expr: time() - vmware_host_boot_timestamp_seconds < 300
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" rebooted within the last 5 minutes.'
      # The 15% CPU threshold is deliberately low so that a test alert fires.
      - alert: Esxi Host CPU Used
        expr: (vmware_host_cpu_usage{host_name=~".*"} / vmware_host_cpu_max) * 100 > 15
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 15% CPU.'
      - alert: Esxi Host RAM Used
        expr: (vmware_host_memory_usage{host_name=~".*"} / vmware_host_memory_max) * 100 > 80
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 80% RAM.'
      - alert: Esxi Host Disk Used
        expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 80
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Storage "{{ $labels.ds_name }}" use over 80%.'
#https://toivietblog.com/linux/monitoring/monitor-vmware-voi-prometheus-vmware_exporter/
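The percentage expressions in the CPU, RAM and datastore rules are plain arithmetic, which can be checked by hand before choosing thresholds. The sketch below mirrors them in Python; the function names and the sample numbers are hypothetical and are not taken from a real vCenter.

```python
def used_percent(capacity: float, free: float) -> float:
    """Mirror ((capacity - free) / capacity) * 100 from the datastore rule."""
    return (capacity - free) / capacity * 100


def ratio_percent(usage: float, maximum: float) -> float:
    """Mirror (usage / max) * 100 from the CPU and RAM rules."""
    return usage / maximum * 100


# Hypothetical sample values for illustration only.
capacity_bytes = 2_000_000_000_000  # a 2 TB datastore
free_bytes = 300_000_000_000        # 300 GB still free
disk = used_percent(capacity_bytes, free_bytes)
print(f"datastore used: {disk:.1f}%, above the 80% threshold: {disk > 80}")
```

With these sample values the datastore is 85% full, so a rule with a threshold of 80 would fire once the condition has held for the configured "for" duration.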
Then restart Prometheus so that it loads the rule file.
Once the rule is loaded, alerts are sent to Telegram.
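To test the path from AlertManager to Telegram without waiting for a real vCenter condition, a hand-made alert can be posted to AlertManager's v2 API endpoint /api/v2/alerts. The sketch below is an assumption-laden example: the function names, the alert name and the base URL passed by the caller are all illustrative.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_test_alert(alertname: str, severity: str, summary: str,
                     now: Optional[datetime] = None, minutes: int = 5) -> list:
    """Build a one-alert payload for AlertManager's POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        # "Summary" is capitalised to match the annotation key used in the rules.
        "annotations": {"Summary": summary},
        "startsAt": now.isoformat(),
        # AlertManager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(base_url: str, alerts: list, timeout: float = 10.0) -> int:
    """POST the alerts as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

For example, send_alerts("http://127.0.0.1:9093", build_test_alert("TestAlert", "Warning", "manual test")) should return 200 on a reachable AlertManager, and the message should arrive in Telegram after the group_wait delay.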
4./ Advanced: create a recording rule to store a result and reuse it in place of variables
Customize the vCenter rule file as follows:
nano vcenter.yml
groups:
  - name: Vcenter ESXi Host
    rules:
      # Optional recording rule: uncomment to store the datastore usage percentage under a reusable name.
      # - record: dungluong_hientai_esxi
      #   expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100
      - alert: Connect Vcenter Failed
        expr: up{job="dc_vcenter"} == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.instance }}" failed.'
      - alert: Connect ESXi Host Failed
        expr: vmware_host_connection_state == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.host_name }}" failed.'
      - alert: ESXi Host Maintenance
        expr: vmware_host_maintenance_mode == 1
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" maintenance.'
      - alert: Esxi Host Reboot
        # The boot timestamp is assumed to be a Unix time, so compare its age, not the raw value.
        expr: time() - vmware_host_boot_timestamp_seconds < 300
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" rebooted within the last 5 minutes.'
      - alert: Esxi Host CPU Used
        expr: (vmware_host_cpu_usage{host_name=~".*"} / vmware_host_cpu_max) * 100 > 85
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 85% CPU.'
          description: '(current value: {{ $value | printf "%.2f" }}% of CPU)'
      - alert: Esxi Host RAM Used
        expr: (vmware_host_memory_usage{host_name=~".*"} / vmware_host_memory_max) * 100 > 85
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 85% RAM.'
          description: '(current value: {{ $value | printf "%.2f" }}% of RAM)'
      - alert: Esxi Host Disk Used
        expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 90
        for: 60s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Storage "{{ $labels.ds_name }}" use over 90%.'
          description: '(current value: {{ $value | printf "%.2f" }}% of Disk)'
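The description annotations format $value with printf "%.2f", which rounds the computed percentage to two decimal places. Python's % formatting uses the same verb, so the rendered text can be previewed with a small sketch; render_description is a hypothetical helper, not something Prometheus provides.

```python
def render_description(value: float, resource: str) -> str:
    """Preview a description annotation with the value rounded by "%.2f"."""
    # "%%" produces a literal percent sign in Python's % formatting.
    return "(current value: %.2f%% of %s)" % (value, resource)


# Hypothetical sample value for illustration only.
print(render_description(87.4567, "CPU"))
```

A value of 87.4567 is shown as 87.46, which keeps the Telegram message short while still showing how far past the threshold the host is.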
Adjust the default.tmpl file of the metalmatze/alertmanager-bot container as follows:
{{ define "telegram.default" }}
{{ range .Alerts }}
{{ if eq .Status "firing"}}🔥 <b>{{ .Status | toUpper }}</b> 🔥{{ else }}<b>{{ .Status | toUpper }}</b>{{ end }}
{{ if .Labels.severity }}<b>===== {{ .Labels.severity }} =====</b> {{ end }}<b>{{ .Labels.alertname }}</b>
{{ if .Annotations.Summary }} {{ .Annotations.Summary }}
{{ end }}
{{ if .Annotations.description }}
{{ .Annotations.description }}
{{ end }}
<b>Duration:</b> {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
<b>Ended:</b> {{ .EndsAt | since }}{{ end }}
{{ end }}
{{ end }}
Then restart the prometheus and alertmanager-bot containers.
Result:
5./ References
https://github.com/metalmatze/alertmanager-bot
https://toivietblog.com/linux/monitoring/monitor-vmware-voi-prometheus-vmware_exporter
https://softwareadept.xyz/2018/01/how-to-write-rules-for-prometheus/
=> Done