Configuring alerts with Prometheus - AlertManager - Telegram

1./ Install AlertManager on the monitoring server

Reference: installing Docker on Ubuntu 18

https://fixloinhanh.com/script-cai-dat-docker-container-tu-dong-tren-ubuntu-18/

Create a Docker Compose file for the alertmanager container:

cat docker-compose.yml

version: '2'
services:
  alertmanager:
    image: prom/alertmanager
    privileged: true
    volumes:
      - /alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
    command:
      - '--config.file=/alertmanager/alertmanager.yml'
    ports:
      - '9093:9093'
      - '9094:9094'

Then start the container:

cd /opt/docker
docker-compose up

If you are not deploying alertmanager with Docker, you can download the release archive from the page below, extract it, and run the binary directly:

https://prometheus.io/download/

wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.21.0.linux-amd64.tar.gz

Create the alertmanager.yml file (the Docker setup above mounts it from /alertmanager/alertmanager.yml on the host; the binary and systemd setup below reads it from /etc/alertmanager/alertmanager.yml):

global:
  resolve_timeout: 1m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
#- name: 'alertmanager-bot'
- name: 'web.hook'
  webhook_configs:
  - send_resolved: true
    url: 'http://127.0.0.1:8080'

Run it:

cd alertmanager-0.21.0.linux-amd64
./alertmanager --config.file=/etc/alertmanager/alertmanager.yml

Or create a systemd service:

cat /etc/systemd/system/alertmanager.service

[Unit]
Description=Alert Manager Service
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --web.listen-address=":9093" \
  --storage.path=/etc/alertmanager/data
SyslogIdentifier=alertmanager
Restart=always

[Install]
WantedBy=multi-user.target
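The unit file expects a few things that the steps so far do not create: an alertmanager user and group, the binary at /usr/local/bin/alertmanager, and a writable /etc/alertmanager/data directory. The sketch below shows one way to prepare them. The helper name stage_alertmanager and its "root prefix" argument are made up for illustration (the prefix only exists so the copy step can be tried in a scratch directory first), and the useradd options are an assumption about a typical Ubuntu setup.

```shell
#!/bin/sh
# Sketch: put the binary and config where the systemd unit expects them.
# stage_alertmanager is an illustrative helper, not part of Alertmanager.

# Usage: stage_alertmanager ROOT BINARY CONFIG
# ROOT is a path prefix; pass "" to install into the real filesystem.
stage_alertmanager() {
    root=$1
    binary=$2
    config=$3
    mkdir -p "$root/usr/local/bin" "$root/etc/alertmanager/data" || return 1
    cp "$binary" "$root/usr/local/bin/alertmanager" || return 1
    cp "$config" "$root/etc/alertmanager/alertmanager.yml" || return 1
    chmod 755 "$root/usr/local/bin/alertmanager"
}

# On the real server, as root, the sequence would be roughly:
#   useradd --system --no-create-home --shell /usr/sbin/nologin alertmanager
#   stage_alertmanager "" ./alertmanager-0.21.0.linux-amd64/alertmanager ./alertmanager.yml
#   chown -R alertmanager:alertmanager /etc/alertmanager
```

The release archive also ships amtool, so running ./amtool check-config /etc/alertmanager/alertmanager.yml before starting the service is a cheap way to catch YAML mistakes.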

Start the service:

systemctl daemon-reload
systemctl restart alertmanager.service
systemctl enable alertmanager.service

Once it is running, the server listens on two additional ports: 9093 (web UI and API) and 9094 (cluster communication).
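Before wiring up Prometheus, it can help to confirm that Alertmanager accepts alerts by pushing a hand-made one to its v2 API. This is only a sketch: it assumes Alertmanager is reachable at http://localhost:9093 and that curl is installed, and the alert name TestAlert and the helper names build_test_alert and send_test_alert are invented for the example.

```shell
#!/bin/sh
# Sketch: push a hand-made alert to Alertmanager's v2 API.
# Assumptions: Alertmanager at http://localhost:9093, curl installed.
# build_test_alert and send_test_alert are illustrative helper names.

# Print a JSON array holding one alert with the given name and severity.
build_test_alert() {
    name=$1
    severity=$2
    printf '[{"labels":{"alertname":"%s","severity":"%s"},' "$name" "$severity"
    printf '"annotations":{"Summary":"Manual test alert"}}]\n'
}

# POST the alert; the first argument overrides the base URL.
send_test_alert() {
    url=${1:-http://localhost:9093}
    build_test_alert "TestAlert" "Warning" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
            --data-binary @- "$url/api/v2/alerts"
}
```

After calling send_test_alert, the alert should appear in the web UI on port 9093, and once the webhook receiver is running it should be forwarded there as well.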

2./ Install alertmanager-bot to send notifications to Telegram

Download alertmanager-bot from the following link:

https://github.com/metalmatze/alertmanager-bot

cd /opt/setup
git clone https://github.com/metalmatze/alertmanager-bot.git
cd alertmanager-bot
docker pull metalmatze/alertmanager-bot:0.4.2

Create a docker-compose.yml file:

nano docker-compose.yml

#Paste
networks:
  alertmanager-bot: {}
services:
  alertmanager-bot:
    command:
    # Inside the container, localhost is the container itself. If Alertmanager
    # runs elsewhere, use an address that is reachable from the container.
    - --alertmanager.url=http://localhost:9093
    - --log.level=info
    - --store=bolt
    - --bolt.path=/data/bot.db
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    image: metalmatze/alertmanager-bot:0.4.2
    networks:
    - alertmanager-bot
    ports:
    - 8080:8080
    restart: always
    volumes:
    - ./data:/data
version: "3"

Replace the following values with your own Telegram user ID and bot token:

      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX

 

ENV Variable      Description
ALERTMANAGER_URL  Address of the alertmanager, default: http://localhost:9093
BOLT_PATH         Path on disk to the file where the boltdb is stored, default: /tmp/bot.db
CONSUL_URL        The URL to use to connect with Consul, default: localhost:8500
LISTEN_ADDR       Address that the bot listens for webhooks, default: 0.0.0.0:8080
STORE             The type of the store to use, choose from bolt (local) or consul (distributed)
TELEGRAM_ADMIN    The Telegram user id for the admin. The bot will only reply to messages sent from an admin. All other messages are dropped and logged on the bot's console. You can get your user id from @userinfobot.
TELEGRAM_TOKEN    Token you get from @botfather
TEMPLATE_PATHS    Path to custom message templates, default template is ./default.tmpl, in docker - /templates/default.tmpl
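A wrong TELEGRAM_TOKEN only shows up once the container is already running, so it may be worth checking the token first with the Telegram Bot API's getMe method. A minimal sketch, assuming curl is installed and the host can reach api.telegram.org; telegram_api_url and check_telegram_token are illustrative helper names, not part of alertmanager-bot.

```shell
#!/bin/sh
# Sketch: confirm a bot token works before starting the container.
# Assumptions: curl is installed and api.telegram.org is reachable.
# telegram_api_url and check_telegram_token are illustrative helper names.

# Build the Bot API URL for a token and a method name.
telegram_api_url() {
    printf 'https://api.telegram.org/bot%s/%s\n' "$1" "$2"
}

# Call getMe; the Bot API answers with JSON containing "ok":true for a valid token.
check_telegram_token() {
    curl -fsS "$(telegram_api_url "$1" getMe)"
}
```

Running check_telegram_token XXXXXXX with the real token in place of the placeholder should print the bot's name; an HTTP error suggests the token was copied incorrectly.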

docker-compose up

You can check the container logs:

docker logs container-id

If the metalmatze/alertmanager-bot container fails with an error like this:

component=telegram msg="failed to create bot" err="http.Post failed: Post  dial tcp: lookup api.telegram.org on 127.0.0.11:53: read udp 127.0.0.1:41034->127.0.0.11:53: i/o timeout"

you can fix it by pointing the DNS servers in /etc/resolv.conf at Cloudflare and Google DNS, or at any DNS server that resolves correctly, and then rebooting the server.
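The address 127.0.0.11 in that error is Docker's embedded DNS server, which forwards lookups to the resolvers configured on the host, so the host is the place to check. The sketch below tests resolution on the host and prints resolver lines for Cloudflare (1.1.1.1) and Google (8.8.8.8). It assumes getent is available, as it is on typical Ubuntu installs; can_resolve and public_resolvers are illustrative helper names.

```shell
#!/bin/sh
# Sketch: check name resolution on the host and print replacement resolver lines.
# Assumption: getent is available. Helper names are illustrative.

# Succeed if the host name resolves through the system resolver.
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1
}

# Print resolv.conf lines for the Cloudflare and Google public resolvers.
public_resolvers() {
    printf 'nameserver 1.1.1.1\nnameserver 8.8.8.8\n'
}
```

For example, `can_resolve api.telegram.org || public_resolvers` prints the two nameserver lines only when the lookup fails; they can then be placed in /etc/resolv.conf by hand.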

3./ Create a rule to test sending vCenter alerts to Telegram

You need to create a default.tmpl file and change its original default content. The docker-compose file then looks like the one below (Prometheus is written in Go, and the template uses Go template syntax). Use PromQL queries to retrieve the data you need.

Create a docker-compose.yml file:

nano docker-compose.yml

#Paste
networks:
  alertmanager-bot: {}
services:
  alertmanager-bot:
    command:
    - --alertmanager.url=http://localhost:9093
    - --log.level=info
    - --store=bolt
    - --bolt.path=/data/bot.db
    - --template.paths=/data/default.tmpl
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    image: metalmatze/alertmanager-bot:0.4.2
    networks:
    - alertmanager-bot
    ports:
    - 8080:8080
    restart: always
    volumes:
    - ./data:/data
version: "3"

Replace the following values with your own Telegram user ID and bot token:

      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX

Edit default.tmpl (place it in ./data so that the container sees it at /data/default.tmpl). The severity header is wrapped in its own if block so that the <b> tags stay balanced even when an alert has no severity label:

{{ define "telegram.default" }}
{{ range .Alerts }}
{{ if eq .Status "firing"}}🔥 <b>{{ .Status | toUpper }}</b> 🔥{{ else }}<b>{{ .Status | toUpper }}</b>{{ end }}
{{ if .Labels.severity }}<b>===== {{ .Labels.severity }} =====</b> {{ end }}<b>{{ .Labels.alertname }}</b>
{{ if .Annotations.Summary }} {{ .Annotations.Summary }}
{{ end }}
{{ if .Annotations.description }} {{ .Annotations.description }}
{{ end }}
<b>Duration:</b> {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
<b>Ended:</b> {{ .EndsAt | since }}{{ end }}
{{ end }}
{{ end }}

The original default file looks like this:

# https://github.com/metalmatze/alertmanager-bot/blob/master/default.tmpl

{{ define "telegram.default" }}
{{ range .Alerts }}
{{ if eq .Status "firing"}}🔥 <b>{{ .Status | toUpper }}</b> 🔥{{ else }}<b>{{ .Status | toUpper }}</b>{{ end }}
<b>{{ .Labels.alertname }}</b>
{{ if .Annotations.message }}
{{ .Annotations.message }}
{{ end }}
{{ if .Annotations.summary }}
{{ .Annotations.summary }}
{{ end }}
{{ if .Annotations.description }}
{{ .Annotations.description }}
{{ end }}
<b>Duration:</b> {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
<b>Ended:</b> {{ .EndsAt | since }}{{ end }}
{{ end }}
{{ end }}

Create a rule file such as vcenter.yml, then reference it from prometheus.yml. The thresholds here are deliberately low (for example 15% CPU) so that the alerts fire easily during testing.

groups:
- name: Vcenter ESXi Host
  rules:
  - alert: Connect Vcenter Failed
    expr: up{job="dc_vcenter"} == 0
    for: 10s
    labels:
      severity: "Critical"
    annotations:
      Summary: 'Connect "{{ $labels.instance }}" failed.'

  - alert: Connect ESXi Host Failed
    expr: vmware_host_connection_state == 0
    for: 10s
    labels:
      severity: "Critical"
    annotations:
      Summary: 'Connect "{{ $labels.host_name }}" failed.'

  - alert: ESXi Host Maintenance
    expr: vmware_host_maintenance_mode == 1
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" maintenance.'

  - alert: Esxi Host Reboot
    # The boot timestamp is a Unix time, so compare the time elapsed since boot.
    expr: time() - vmware_host_boot_timestamp_seconds < 300
    for: 10s
    labels:
      severity: "Critical"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" reboot 5 minutes ago.'

  - alert: Esxi Host CPU Used
    expr: (vmware_host_cpu_usage{host_name=~".*"} / vmware_host_cpu_max) * 100 > 15
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" use over 15% CPU.'

  - alert: Esxi Host RAM Used
    expr: (vmware_host_memory_usage{host_name=~".*"} / vmware_host_memory_max) * 100 > 80
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" use over 80% RAM.'

  - alert: Esxi Host Disk Used
    expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 80
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Storage "{{ $labels.ds_name }}" use over 80%.'

# https://toivietblog.com/linux/monitoring/monitor-vmware-voi-prometheus-vmware_exporter/

Then restart Prometheus so that it loads the rule file.

Once the rules are loaded, alerts are sent to Telegram.
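The prometheus.yml side of this step is not shown, so here is a minimal sketch of the two sections it needs: rule_files to load vcenter.yml and alerting to tell Prometheus where Alertmanager is. It assumes vcenter.yml sits in the same directory as prometheus.yml and that Alertmanager is reachable at localhost:9093; adjust both to your layout. The function name prometheus_alerting_fragment is just for illustration.

```shell
#!/bin/sh
# Sketch: print the prometheus.yml sections that load the rule file and
# point Prometheus at Alertmanager.
# Assumptions: vcenter.yml is next to prometheus.yml; Alertmanager is at localhost:9093.

prometheus_alerting_fragment() {
    cat <<'EOF'
rule_files:
  - vcenter.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
EOF
}

prometheus_alerting_fragment
```

Merge the printed sections into the existing prometheus.yml rather than appending them blindly, since a file may contain each top-level key only once. If promtool is installed alongside Prometheus, `promtool check config prometheus.yml` and `promtool check rules vcenter.yml` can validate both files before the restart.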

4./ Advanced: create a recording rule to store a query result and reuse it, instead of variables

Customize the vCenter rule file as follows:

nano vcenter.yml

groups:
- name: Vcenter ESXi Host
  rules:
  # - record: dungluong_hientai_esxi
  #   expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100
  - alert: Connect Vcenter Failed
    expr: up{job="dc_vcenter"} == 0
    for: 10s
    labels:
      severity: "Critical"
    annotations:
      Summary: 'Connect "{{ $labels.instance }}" failed.'

  - alert: Connect ESXi Host Failed
    expr: vmware_host_connection_state == 0
    for: 10s
    labels:
      severity: "Critical"
    annotations:
      Summary: 'Connect "{{ $labels.host_name }}" failed.'

  - alert: ESXi Host Maintenance
    expr: vmware_host_maintenance_mode == 1
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" maintenance.'

  - alert: Esxi Host Reboot
    # The boot timestamp is a Unix time, so compare the time elapsed since boot.
    expr: time() - vmware_host_boot_timestamp_seconds < 300
    for: 10s
    labels:
      severity: "Critical"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" reboot 5 minutes ago.'

  - alert: Esxi Host CPU Used
    expr: (vmware_host_cpu_usage{host_name=~".*"} / vmware_host_cpu_max) * 100 > 85
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" use over 85% CPU.'
      description: '(current value: {{ $value | printf "%.2f" }}% of CPU)'

  - alert: Esxi Host RAM Used
    expr: (vmware_host_memory_usage{host_name=~".*"} / vmware_host_memory_max) * 100 > 85
    for: 10s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Esxi "{{ $labels.host_name }}" use over 85% RAM.'
      description: '(current value: {{ $value | printf "%.2f" }}% of RAM)'

  - alert: Esxi Host Disk Used
    expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 90
    for: 60s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Storage "{{ $labels.ds_name }}" use over 90%.'
      description: '(current value: {{ $value | printf "%.2f" }}% of Disk)'

Adjust the default.tmpl file of the metalmatze/alertmanager-bot container as follows (the severity header sits in its own if block so that the <b> tags stay balanced):

{{ define "telegram.default" }}
{{ range .Alerts }}
{{ if eq .Status "firing"}}🔥 <b>{{ .Status | toUpper }}</b> 🔥{{ else }}<b>{{ .Status | toUpper }}</b>{{ end }}
{{ if .Labels.severity }}<b>===== {{ .Labels.severity }} =====</b> {{ end }}<b>{{ .Labels.alertname }}</b>
{{ if .Annotations.Summary }} {{ .Annotations.Summary }}
{{ end }}
{{ if .Annotations.description }}
{{ .Annotations.description }}
{{ end }}
<b>Duration:</b> {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
<b>Ended:</b> {{ .EndsAt | since }}{{ end }}
{{ end }}
{{ end }}

Then restart the prometheus and alertmanager-bot containers.
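The record lines in vcenter.yml are commented out, so the file never actually reuses a stored result. As a rough sketch of what enabling them could look like, the group below records datastore usage once and lets the disk alert compare against the recorded series. The record name dungluong_hientai_esxi is taken from those commented-out lines; the group name and the function name recording_rule_example are made up for the example.

```shell
#!/bin/sh
# Sketch: a rule group that records datastore usage once and reuses it in an alert.
# Assumptions: the record name comes from the commented-out lines in vcenter.yml;
# the group name and recording_rule_example are illustrative.

recording_rule_example() {
    cat <<'EOF'
groups:
- name: Vcenter Datastore Usage
  rules:
  - record: dungluong_hientai_esxi
    expr: ((vmware_datastore_capacity_size - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100
  - alert: Esxi Host Disk Used
    expr: dungluong_hientai_esxi > 90
    for: 60s
    labels:
      severity: "Warning"
    annotations:
      Summary: 'Storage "{{ $labels.ds_name }}" use over 90%.'
      description: '(current value: {{ $value | printf "%.2f" }}% of Disk)'
EOF
}

recording_rule_example
```

Rules inside a group are evaluated in order, so the alert sees the value the record rule has just written. If you adopt this, replace the existing disk alert with it instead of keeping both, otherwise the same condition is reported twice.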

Result:

 

5./ References

https://fixloinhanh.com

https://github.com/metalmatze/alertmanager-bot

https://toivietblog.com/linux/monitoring/monitor-vmware-voi-prometheus-vmware_exporter

https://softwareadept.xyz/2018/01/how-to-write-rules-for-prometheus/

 

=> Done

SaKuRai

Hello, I'm Sakurai. This blog is where my teammates and I note down and share our knowledge and experience. Thank you for following!
