Configuring Alerts with Prometheus – AlertManager – Telegram

by SaKuRai · 19/11/2020

1./ Install AlertManager on the monitoring server

Reference:

Installing Docker on Ubuntu 18

https://fixloinhanh.com/script-cai-dat-docker-container-tu-dong-tren-ubuntu-18/

Create a Docker Compose file for the alertmanager container

cat docker-compose.yml

version: '2'
services:
  alertmanager:
    image: prom/alertmanager
    privileged: true
    volumes:
      - /alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
    command:
      - '--config.file=/alertmanager/alertmanager.yml'
    ports:
      - '9093:9093'
      - '9094:9094'

Then start the container

cd /opt/docker

docker-compose up

If you are not deploying alertmanager with Docker, you can download the binary release, extract it, and run it directly.

https://prometheus.io/download/

wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz

tar -xzf alertmanager-0.21.0.linux-amd64.tar.gz

Create the alertmanager.yml file (the commands below expect it at /etc/alertmanager/alertmanager.yml)

global:
  resolve_timeout: 1m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  #- name: 'alertmanager-bot'
  - name: 'web.hook'
    webhook_configs:
      - send_resolved: true
        url: 'http://127.0.0.1:8080'

Run it with

cd alertmanager-0.21.0.linux-amd64

./alertmanager --config.file /etc/alertmanager/alertmanager.yml

Or create a systemd service

cat /etc/systemd/system/alertmanager.service

[Unit]
Description=Alert Manager Service
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --web.listen-address=":9093" \
  --storage.path=/etc/alertmanager/data
SyslogIdentifier=alertmanager
Restart=always

[Install]
WantedBy=multi-user.target

Reload systemd, then start and enable the service.

systemctl daemon-reload

systemctl restart alertmanager.service

systemctl enable alertmanager.service

Once it is running, the server listens on two additional ports: 9093 (web UI and API) and 9094 (clustering).

2./ Install alertmanager-bot to send notifications to Telegram

Download alertmanager-bot from the following link:

https://github.com/metalmatze/alertmanager-bot

cd /opt/setup

git clone https://github.com/metalmatze/alertmanager-bot.git

cd alertmanager-bot

docker pull metalmatze/alertmanager-bot:0.4.2

Create a docker-compose.yml file

nano docker-compose.yml

# Paste the following

networks:
  alertmanager-bot: {}
services:
  alertmanager-bot:
    command:
      - --alertmanager.url=http://localhost:9093
      - --log.level=info
      - --store=bolt
      - --bolt.path=/data/bot.db
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    image: metalmatze/alertmanager-bot:0.4.2
    networks:
      - alertmanager-bot
    ports:
      - 8080:8080
    restart: always
    volumes:
      - ./data:/data
version: "3"

Replace the following values with your own

TELEGRAM_ADMIN: "1234"

TELEGRAM_TOKEN: XXXXXXX

ENV Variable	Description
ALERTMANAGER_URL	Address of the alertmanager, default: http://localhost:9093
BOLT_PATH	Path on disk to the file where the boltdb is stored, default: /tmp/bot.db
CONSUL_URL	The URL to use to connect with Consul, default: localhost:8500
LISTEN_ADDR	Address that the bot listens for webhooks, default: 0.0.0.0:8080
STORE	The type of the store to use, choose from bolt (local) or consul (distributed)
TELEGRAM_ADMIN	The Telegram user id for the admin. The bot will only reply to messages sent from an admin. All other messages are dropped and logged on the bot's console. Your user id you can get from @userinfobot.
TELEGRAM_TOKEN	Token you get from @botfather
TEMPLATE_PATHS	Path to custom message templates, default template is ./default.tmpl, in docker - /templates/default.tmpl

docker-compose up

You can check the container logs with

docker logs container-id
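One thing to watch for when Alertmanager and the bot both run as containers: inside a container, localhost refers to that container itself, so http://localhost:9093 and http://127.0.0.1:8080 may not reach the other service. The fragment below is a minimal sketch, assuming both services are merged into one compose file on a shared network. That merged layout is an assumption for illustration, not the setup used in this article. The service names are the ones already used above.

```yaml
version: "3"
networks:
  alertmanager-bot: {}
services:
  alertmanager:
    image: prom/alertmanager
    command:
      - '--config.file=/alertmanager/alertmanager.yml'
    volumes:
      - /alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
    networks:
      - alertmanager-bot
    ports:
      - '9093:9093'
  alertmanager-bot:
    image: metalmatze/alertmanager-bot:0.4.2
    command:
      # Reach Alertmanager by its compose service name instead of localhost.
      - --alertmanager.url=http://alertmanager:9093
      - --log.level=info
      - --store=bolt
      - --bolt.path=/data/bot.db
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    networks:
      - alertmanager-bot
    restart: always
    volumes:
      - ./data:/data
```

With this layout, the webhook url in alertmanager.yml would point at the bot's service name as well, that is 'http://alertmanager-bot:8080'.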

If the metalmatze/alertmanager-bot container reports the following error:

component=telegram msg="failed to create bot" err="http.Post failed: Post dial tcp: lookup api.telegram.org on 127.0.0.11:53: read udp 127.0.0.1:41034->127.0.0.11:53: i/o timeout"

you can fix it by changing the DNS servers in /etc/resolv.conf to the Cloudflare and Google resolvers, or to any DNS server that resolves names correctly. Then reboot the server.
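As an alternative to editing the host's resolver, Docker Compose lets a service set its own DNS servers with the dns key. The fragment below is a sketch only: the two resolver addresses (Cloudflare and Google public DNS) are assumptions, and whether this clears the timeout depends on what the host network allows.

```yaml
services:
  alertmanager-bot:
    image: metalmatze/alertmanager-bot:0.4.2
    # Assumed resolvers; use any DNS server that can resolve api.telegram.org.
    dns:
      - 1.1.1.1
      - 8.8.8.8
```

After adding the key, recreate the container with docker-compose up so the new resolver settings take effect.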

3./ Create a rule to test sending vCenter alerts to Telegram

Create a default.tmpl file and replace its default content with your own message template (Prometheus is written in Go, and the template uses Go template syntax). In the alert rules, use PromQL queries to select the data you need. The docker-compose file then looks like this.

Create a docker-compose.yml file

nano docker-compose.yml

# Paste the following

networks:
  alertmanager-bot: {}
services:
  alertmanager-bot:
    command:
      - --alertmanager.url=http://localhost:9093
      - --log.level=info
      - --store=bolt
      - --bolt.path=/data/bot.db
      - --template.paths=/data/default.tmpl
    environment:
      TELEGRAM_ADMIN: "1234"
      TELEGRAM_TOKEN: XXXXXXX
    image: metalmatze/alertmanager-bot:0.4.2
    networks:
      - alertmanager-bot
    ports:
      - 8080:8080
    restart: always
    volumes:
      - ./data:/data
version: "3"

Replace the following values with your own

TELEGRAM_ADMIN: "1234"

TELEGRAM_TOKEN: XXXXXXX

Edit the default.tmpl file as follows

{{ if eq .Status "firing"}}🔥 {{ .Status | toUpper }} 🔥{{ else }}{{ .Status | toUpper }}{{ end }}
{{ if .Labels.severity }}===== {{ .Labels.severity }} ====={{ end }} {{ .Labels.alertname }}
Duration: {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
Ended: {{ .EndsAt | since }}{{ end }}

For comparison, the original default file looks like this.

# https://github.com/metalmatze/alertmanager-bot/blob/master/default.tmpl

{{ if eq .Status "firing"}}🔥 {{ .Status | toUpper }} 🔥{{ else }}{{ .Status | toUpper }}{{ end }}
{{ .Labels.alertname }}
Duration: {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
Ended: {{ .EndsAt | since }}{{ end }}

Create a rule file named vcenter.yml, then configure prometheus.yml to point to that file.

groups:
  - name: Vcenter ESXi Host
    rules:
      - alert: Connect Vcenter Failed
        expr: up{job="dc_vcenter"} == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.instance }}" failed.'
      - alert: Connect ESXi Host Failed
        expr: vmware_host_connection_state == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.host_name }}" failed.'
      - alert: ESXi Host Maintenance
        expr: vmware_host_maintenance_mode == 1
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" maintenance.'
      - alert: Esxi Host Reboot
        # The metric is a Unix timestamp, so compare its age (now - boot time) against 300 s.
        expr: time() - vmware_host_boot_timestamp_seconds < 300
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" reboot 5 minutes ago.'
      - alert: Esxi Host CPU Used
        # 15% is a deliberately low threshold so the test alert fires easily.
        expr: (vmware_host_cpu_usage{host_name=~".*"} / vmware_host_cpu_max) * 100 > 15
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 15% CPU.'
      - alert: Esxi Host RAM Used
        expr: (vmware_host_memory_usage{host_name=~".*"} / vmware_host_memory_max) * 100 > 80
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 80% RAM.'
      - alert: Esxi Host Disk Used
        expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 80
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Storage "{{ $labels.ds_name }}" use over 80%.'

#https://toivietblog.com/linux/monitoring/monitor-vmware-voi-prometheus-vmware_exporter/
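The prometheus.yml change is not shown in this article, so here is a minimal sketch of the two sections involved. The rule file path /etc/prometheus/rules/vcenter.yml and the Alertmanager target localhost:9093 are assumptions; adjust them to where the file is stored and where Alertmanager listens. Existing sections such as scrape_configs stay as they are.

```yaml
# Load the alert rules from the rule file (assumed path).
rule_files:
  - /etc/prometheus/rules/vcenter.yml

# Tell Prometheus where to send firing alerts (assumed address).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'
```

If Prometheus itself runs in a container, localhost would need to be replaced by an address that reaches Alertmanager from inside that container.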

Then restart Prometheus.

Once the rules are loaded, alerts are sent to Telegram when a condition is met.
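A rule can also be checked offline, before waiting for a real outage, with Prometheus rule unit tests (run with promtool test rules). The file below is a sketch under a few assumptions: the file name vcenter_test.yml and the instance value vcenter.example.local are hypothetical, and vcenter.yml is expected to sit in the same directory.

```yaml
# vcenter_test.yml (hypothetical name), run with: promtool test rules vcenter_test.yml
rule_files:
  - vcenter.yml

evaluation_interval: 10s

tests:
  - interval: 10s
    # Simulate a vCenter target that is down for every sample.
    input_series:
      - series: 'up{job="dc_vcenter", instance="vcenter.example.local"}'
        values: '0 0 0 0'
    alert_rule_test:
      # The rule uses "for: 10s", so the alert should be firing at 30s.
      - eval_time: 30s
        alertname: Connect Vcenter Failed
        exp_alerts:
          - exp_labels:
              severity: Critical
              job: dc_vcenter
              instance: vcenter.example.local
            exp_annotations:
              Summary: 'Connect "vcenter.example.local" failed.'
```

This only verifies the rule logic. Delivery to Telegram still has to be tested end to end through Alertmanager and the bot.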

4./ Advanced: create a recording rule to store a result and reuse it in place of variables

Customize the vCenter rule file as follows

nano vcenter.yml

groups:
  - name: Vcenter ESXi Host
    rules:
      # - record: dungluong_hientai_esxi
      #   expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100
      - alert: Connect Vcenter Failed
        expr: up{job="dc_vcenter"} == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.instance }}" failed.'
      - alert: Connect ESXi Host Failed
        expr: vmware_host_connection_state == 0
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Connect "{{ $labels.host_name }}" failed.'
      - alert: ESXi Host Maintenance
        expr: vmware_host_maintenance_mode == 1
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" maintenance.'
      - alert: Esxi Host Reboot
        # The metric is a Unix timestamp, so compare its age (now - boot time) against 300 s.
        expr: time() - vmware_host_boot_timestamp_seconds < 300
        for: 10s
        labels:
          severity: "Critical"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" reboot 5 minutes ago.'
      - alert: Esxi Host CPU Used
        expr: (vmware_host_cpu_usage{host_name=~".*"} / vmware_host_cpu_max) * 100 > 85
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 85% CPU.'
          description: '(current value: {{ $value | printf "%.2f" }}% of CPU)'
      - alert: Esxi Host RAM Used
        expr: (vmware_host_memory_usage{host_name=~".*"} / vmware_host_memory_max) * 100 > 85
        for: 10s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Esxi "{{ $labels.host_name }}" use over 85% RAM.'
          description: '(current value: {{ $value | printf "%.2f" }}% of RAM)'
      - alert: Esxi Host Disk Used
        expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100 > 90
        for: 60s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Storage "{{ $labels.ds_name }}" use over 90%.'
          description: '(current value: {{ $value | printf "%.2f" }}% of Disk)'
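The recording rule is commented out in the file above, so the fragment below sketches how it could be enabled and reused. It assumes the record name dungluong_hientai_esxi from the commented lines and shows only the disk alert. Rules within a group are evaluated in order, so the recorded series is available to the alert that follows it.

```yaml
groups:
  - name: Vcenter ESXi Host
    rules:
      # Store the datastore usage percentage under a reusable name.
      - record: dungluong_hientai_esxi
        expr: ((vmware_datastore_capacity_size{ds_name=~".*"} - vmware_datastore_freespace_size) / vmware_datastore_capacity_size) * 100
      # Reuse the recorded series instead of repeating the full expression.
      - alert: Esxi Host Disk Used
        expr: dungluong_hientai_esxi > 90
        for: 60s
        labels:
          severity: "Warning"
        annotations:
          Summary: 'Storage "{{ $labels.ds_name }}" use over 90%.'
          description: '(current value: {{ $value | printf "%.2f" }}% of Disk)'
```

The recorded series keeps the ds_name label, so the annotation templates work unchanged, and the same series can be queried from dashboards.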

Adjust the default.tmpl file of the metalmatze/alertmanager-bot container as follows:

{{ if eq .Status "firing"}}🔥 {{ .Status | toUpper }} 🔥{{ else }}{{ .Status | toUpper }}{{ end }}
{{ if .Labels.severity }}===== {{ .Labels.severity }} ====={{ end }} {{ .Labels.alertname }}
Duration: {{ duration .StartsAt .EndsAt }}{{ if ne .Status "firing"}}
Ended: {{ .EndsAt | since }}{{ end }}

Then restart the prometheus and alertmanager-bot containers.

5./ References

https://fixloinhanh.com

https://github.com/metalmatze/alertmanager-bot

https://toivietblog.com/linux/monitoring/monitor-vmware-voi-prometheus-vmware_exporter

https://softwareadept.xyz/2018/01/how-to-write-rules-for-prometheus/

=> Done
