Monitoring and maintenance are essential to keeping a system running reliably. This document describes the Tagtag Starter project's monitoring stack, log management, performance optimization, and maintenance procedures in detail.
| Technology | Purpose |
|---|---|
| Prometheus | Time-series metric collection and storage |
| Grafana | Monitoring-data visualization |
| Node Exporter | Host-level server metrics |
| JMX Exporter | Java application performance metrics |
| MySQL Exporter | MySQL database metrics |
| Redis Exporter | Redis cache metrics |
| Loki | Log aggregation and querying |
| Promtail | Log shipping |
| Alertmanager | Alert routing and management |
| ELK Stack | Log analysis and visualization (optional) |
| Jaeger | Distributed tracing |
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   App Service   │    │    Database     │    │      Cache      │
│  (Spring Boot)  │    │     (MySQL)     │    │     (Redis)     │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  JMX Exporter   │    │ MySQL Exporter  │    │ Redis Exporter  │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│                          Prometheus                           │
└────────────────┬─────────────────────────────┬────────────────┘
                 │                             │
                 ▼                             ▼
    ┌─────────────────────────┐   ┌─────────────────────────┐
    │      Alertmanager       │   │         Grafana         │
    └─────────────────────────┘   └─────────────────────────┘
Application logs are written in JSON so they are easy to collect and analyze:
{
  "timestamp": "2023-01-01T12:00:00.123Z",
  "level": "INFO",
  "thread": "http-nio-8080-exec-1",
  "logger": "com.tagtag.controller.UserController",
  "message": "User login succeeded",
  "traceId": "1234567890abcdef",
  "spanId": "abcdef1234567890",
  "userId": 1,
  "ip": "192.168.1.100",
  "method": "POST",
  "path": "/api/auth/login",
  "status": 200,
  "duration": 123
}
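In the Java service this JSON is emitted by the Logback configuration shown later. As a sanity check of the schema, the same shape of entry can be produced with Python's stdlib logging; a minimal sketch (the `JsonFormatter` class and the `ctx` extra field are illustrative, not part of the project):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records in the JSON schema shown above."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z",
            "level": record.levelname,
            "thread": record.threadName,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Request-scoped fields (traceId, userId, ...) are attached via `extra=`
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry, ensure_ascii=False)

logger = logging.getLogger("com.tagtag.controller.UserController")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login succeeded",
            extra={"ctx": {"traceId": "1234567890abcdef", "userId": 1,
                           "path": "/api/auth/login", "status": 200, "duration": 123}})
```

Because every line is a single JSON object, Promtail/Loki and Filebeat/Elasticsearch can parse fields without custom grok patterns.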
Nginx access log format:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for" '
                '$request_time $upstream_response_time';
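Lines in this format can be split back into fields with a regular expression. A sketch that mirrors the `log_format main` directive above (`parse_access_line` is a hypothetical helper, not part of the project):

```python
import re

# One named group per variable in the log_format directive; $request_time and
# $upstream_response_time are the two trailing numeric fields.
NGINX_MAIN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'"(?P<http_x_forwarded_for>[^"]*)" '
    r'(?P<request_time>[\d.]+) (?P<upstream_response_time>[\d.-]+)'
)

def parse_access_line(line):
    """Return the access-log fields as a dict, or None if the line doesn't match."""
    m = NGINX_MAIN.match(line)
    return m.groupdict() if m else None
```

Useful for ad-hoc analysis (slowest endpoints, top client IPs) when the full Loki/ELK pipeline is not at hand.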
Promtail configuration:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          __path__: /var/log/nginx/*.log
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: application
          __path__: /opt/tagtag/logs/*.log
Filebeat configuration:
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /opt/tagtag/logs/*.log
    fields:
      application: tagtag
    fields_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "tagtag-%{+yyyy.MM.dd}"

setup.ilm.enabled: true
setup.ilm.rollover_alias: "tagtag"
setup.ilm.pattern: "000001"
Grafana can query logs through the Loki data source using LogQL, for example:

# All application logs
{job="application"}

# Filter by log level
{job="application"} |= "ERROR"

# Filter on fields parsed from the JSON payload
{job="application"} | json | level="ERROR" and status=500

Note that LogQL has no in-query time operator; the time range comes from Grafana's time picker (or the start/end parameters of Loki's query API).
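When calling Loki's HTTP API directly rather than through Grafana, the time range is passed as `start`/`end` nanosecond timestamps on the `/loki/api/v1/query_range` endpoint. A sketch of building such a request URL (`loki_query_url` is a hypothetical helper; the base URL matches the Promtail client configuration above):

```python
from urllib.parse import urlencode

def loki_query_url(base, logql, start_ns, end_ns, limit=100):
    """Build a Loki /loki/api/v1/query_range request URL.

    start_ns/end_ns are Unix timestamps in nanoseconds, as Loki expects.
    """
    params = urlencode({"query": logql, "start": start_ns, "end": end_ns, "limit": limit})
    return f"{base}/loki/api/v1/query_range?{params}"

# Query ERROR lines for 2023-01-01 (UTC), expressed in nanoseconds
url = loki_query_url("http://loki:3100", '{job="application"} |= "ERROR"',
                     1672531200000000000, 1672617600000000000)
```

The returned URL can be fetched with any HTTP client; the response is JSON containing matching log streams.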
Kibana offers rich log querying and visualization on top of Elasticsearch, using Lucene query syntax:

# All logs for the application
application:tagtag

# Filter by log level
application:tagtag AND level:ERROR

# Filter by time range
application:tagtag AND @timestamp:[2023-01-01T00:00:00.000Z TO 2023-01-02T00:00:00.000Z]

# Filter by field
application:tagtag AND level:ERROR AND status:500
Logback configuration:
<configuration>
    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/opt/tagtag/logs/tagtag.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- One file per day; roll within the day once maxFileSize is reached -->
            <fileNamePattern>/opt/tagtag/logs/tagtag.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <!-- Maximum size of a single log file -->
            <maxFileSize>100MB</maxFileSize>
            <!-- Keep 30 days of history -->
            <maxHistory>30</maxHistory>
            <!-- Cap the total size of archived logs at 10 GB -->
            <totalSizeCap>10GB</totalSizeCap>
        </rollingPolicy>
        <!-- JSON output matching the log schema above (logstash-logback-encoder) -->
        <encoder class="net.logstash.logback.encoder.LogstashEncoder" />
    </appender>

    <root level="INFO">
        <appender-ref ref="FILE" />
    </root>
</configuration>
Node Exporter installation:
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
# Unpack and install
tar xvfz node_exporter-1.6.0.linux-amd64.tar.gz
mv node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
# Create the systemd service
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
Prometheus configuration:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
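Each exporter serves its metrics over HTTP in Prometheus' plain-text exposition format, which Prometheus scrapes at the interval above. To make that format concrete, here is a minimal sketch of a parser for the common `name{labels} value` lines (a simplification for illustration; it skips HELP/TYPE comments and ignores optional timestamps):

```python
import re

def parse_exposition(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    line_re = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
    label_re = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = line_re.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(label_re.findall(raw_labels)) if raw_labels else {}
        samples.append((name, labels, float(value)))
    return samples
```

Handy for quick checks against `curl localhost:9100/metrics` output when debugging a scrape target.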
JMX Exporter configuration (jmx_exporter.yaml):
startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
  - pattern: "java.lang<type=Memory><HeapMemoryUsage>(.+)"
    name: jvm_memory_heap_usage_$1
    type: GAUGE
  - pattern: "java.lang<type=Memory><NonHeapMemoryUsage>(.+)"
    name: jvm_memory_nonheap_usage_$1
    type: GAUGE
  - pattern: "java.lang<type=GarbageCollector,name=(.+)><(.+)>(.+)"
    name: jvm_gc_$2_$3
    labels:
      gc: $1
    type: GAUGE
  - pattern: "java.lang<type=Threading><>(.+)"
    name: jvm_threading_$1
    type: GAUGE
Modify the startup command to attach the agent:
java -javaagent:jmx_prometheus_javaagent-0.19.0.jar=9404:jmx_exporter.yaml -jar tagtag-backend.jar
MySQL Exporter installation:
# Download MySQL Exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
# Unpack and install
tar xvfz mysqld_exporter-0.15.0.linux-amd64.tar.gz
mv mysqld_exporter-0.15.0.linux-amd64/mysqld_exporter /usr/local/bin/
# Create a dedicated MySQL monitoring user
mysql -u root -p -e "CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password' WITH MAX_USER_CONNECTIONS 3;"
mysql -u root -p -e "GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';"
# Create the credentials file
cat > /etc/.mysqld_exporter.cnf << EOF
[client]
user=exporter
password=password
EOF
# Create the systemd service
cat > /etc/systemd/system/mysqld_exporter.service << EOF
[Unit]
Description=MySQL Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/.mysqld_exporter.cnf
Restart=always

[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl start mysqld_exporter
systemctl enable mysqld_exporter
Redis Exporter installation:
# Download Redis Exporter
wget https://github.com/oliver006/redis_exporter/releases/download/v1.53.0/redis_exporter-v1.53.0.linux-amd64.tar.gz
# Unpack and install
tar xvfz redis_exporter-v1.53.0.linux-amd64.tar.gz
mv redis_exporter-v1.53.0.linux-amd64/redis_exporter /usr/local/bin/
# Create the systemd service
cat > /etc/systemd/system/redis_exporter.service << EOF
[Unit]
Description=Redis Exporter
After=network.target

[Service]
User=root
ExecStart=/usr/local/bin/redis_exporter --redis.addr redis://localhost:6379 --redis.password password
Restart=always

[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl start redis_exporter
systemctl enable redis_exporter
Host-level alert rules:

groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5 minutes"
      - alert: HighMemoryUsage
        expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% for 5 minutes"
      - alert: HighDiskUsage
        expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}% for 5 minutes"
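To make the threshold arithmetic concrete, the HighMemoryUsage expression can be mirrored in plain code. This is an illustrative sketch only (in production Prometheus evaluates the PromQL itself; `memory_usage_percent` is a hypothetical helper):

```python
def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Same arithmetic as the HighMemoryUsage alert expression:
    100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
    """
    return 100 - (mem_available_bytes / mem_total_bytes) * 100

# 1.5 GiB available out of 10 GiB total sits exactly at the 85% threshold,
# so the alert would not yet fire (the expression requires strictly > 85).
pct = memory_usage_percent(1.5 * 2**30, 10 * 2**30)
```

Note the metric is MemAvailable, not MemFree: the kernel counts reclaimable caches as available, which is what matters for alerting.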
Application-level alert rules:

groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        expr: sum by(instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by(instance) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 minutes"
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, endpoint)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response time on {{ $labels.endpoint }}"
          description: "95th percentile response time is {{ $value }}s for 5 minutes"
      - alert: HighJvmMemoryUsage
        expr: sum by(instance) (jvm_memory_used_bytes{area="heap"}) / sum by(instance) (jvm_memory_max_bytes{area="heap"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High JVM memory usage on {{ $labels.instance }}"
          description: "JVM heap memory usage is {{ $value | humanizePercentage }} for 5 minutes"
Alertmanager configuration:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'sms'

receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true
  - name: 'sms'
    webhook_configs:
      - url: 'https://sms-gateway.example.com/send'
        send_resolved: true
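The sms receiver above POSTs Alertmanager's standard webhook JSON (top-level `status` plus an `alerts` array, each alert carrying `labels` and `annotations`) to the gateway URL. A sketch of how the receiving side might turn that payload into an SMS text (`format_sms` is a hypothetical helper; only the payload field names come from Alertmanager's webhook schema):

```python
def format_sms(payload):
    """Render an Alertmanager webhook payload into a short SMS text."""
    status = payload.get("status", "firing").upper()
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {name}: {summary}")
    return "\n".join(lines)

payload = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "HighCpuUsage", "severity": "critical"},
         "annotations": {"summary": "High CPU usage on web-1"}},
    ],
}
text = format_sms(payload)
```

Because `send_resolved: true` is set, the gateway also receives a payload with `status: resolved` when the alert clears, which the same formatter renders as `[RESOLVED] ...`.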
# Start the Jaeger all-in-one container
docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
jaegertracing/all-in-one:1.48
Add the Maven dependency:
<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-jaeger-cloud-starter</artifactId>
    <version>3.3.1</version>
</dependency>
Configuration file:
opentracing:
  jaeger:
    enabled: true
    udp-sender:
      host: jaeger
      port: 6831
    log-spans: true
    service-name: tagtag-backend
Open the Jaeger UI at http://localhost:16686 to browse the collected traces.
| Task | Frequency | Owner | Description |
|---|---|---|---|
| System updates | Monthly | Ops engineer | Update the OS and software packages |
| Database backup | Daily | DBA | Back up the database; retain for 30 days |
| Log cleanup | Weekly | Ops engineer | Remove expired logs |
| Performance tuning | Monthly | Ops engineer | Analyze system performance and optimize |
| Security audit | Monthly | Security engineer | Run security scans and audits |
| Backup verification | Monthly | Ops engineer | Verify backup integrity and restorability |
| Capacity planning | Quarterly | Architect | Assess capacity and plan scaling |
Full backup:

# Back up with mysqldump
mysqldump -u root -p --all-databases --single-transaction --routines --triggers > /backup/mysql/full_backup_$(date +%Y%m%d_%H%M%S).sql
# Back up with xtrabackup
xtrabackup --backup --target-dir=/backup/mysql/full_backup_$(date +%Y%m%d_%H%M%S)

Incremental backup:

xtrabackup --backup --target-dir=/backup/mysql/incremental_$(date +%Y%m%d_%H%M%S) --incremental-basedir=/backup/mysql/full_backup_20230101_000000
# Back up application configuration and data
BACKUP_DIR="/backup/app"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR/$DATE
# Back up configuration files
cp -r /opt/tagtag/backend/config $BACKUP_DIR/$DATE/
# Back up logs (optional)
cp -r /opt/tagtag/backend/logs $BACKUP_DIR/$DATE/
# Back up static assets
cp -r /usr/share/nginx/html/tagtag $BACKUP_DIR/$DATE/
# Compress the backup
tar -czf $BACKUP_DIR/app_backup_$DATE.tar.gz -C $BACKUP_DIR $DATE
# Remove the temporary directory
rm -rf $BACKUP_DIR/$DATE
# Delete backups older than 7 days
find $BACKUP_DIR -name "app_backup_*.tar.gz" -mtime +7 -delete
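The maintenance plan above calls for monthly backup verification. A minimal sketch of an automated check (the function name and the `config` member check are illustrative, matching the archive layout produced by the script above):

```python
import tarfile

def verify_backup(archive_path, required_member="config"):
    """Sanity-check an application backup archive.

    Returns True when the tar.gz is readable end-to-end and contains a member
    whose path includes `required_member` (the config directory backed up above).
    Unreadable, truncated, or missing archives return False.
    """
    try:
        with tarfile.open(archive_path, "r:gz") as tar:
            return any(required_member in m.name for m in tar.getmembers())
    except (tarfile.TarError, OSError):
        return False
```

Reading the member list forces decompression of the archive index, so truncated gzip files are caught; a full restore drill into a scratch environment is still the only complete test of restorability.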
The application fails to start:

# Check the application log
tail -f /opt/tagtag/logs/tagtag.log
# Check whether port 8080 is already in use
netstat -tlnp | grep 8080
# Verify the database is reachable
mysql -u root -p -h localhost
# Verify the cache is reachable
redis-cli -h localhost -p 6379 -a password ping

Slow API responses: use the performance dashboards and tracing described above to locate the slow component.

Database connection failures:

# Check the MySQL service status
systemctl status mysqld
# List active connections (run inside the mysql client)
show processlist;
# Check the MySQL error log
tail -f /var/log/mysql/error.log

Database recovery:
# Restore with mysqldump
mysql -u root -p < /backup/mysql/full_backup_20230101_000000.sql
# Restore with xtrabackup
xtrabackup --prepare --target-dir=/backup/mysql/full_backup_20230101_000000
xtrabackup --copy-back --target-dir=/backup/mysql/full_backup_20230101_000000
chown -R mysql:mysql /var/lib/mysql
Application recovery:

# Stop the running application
systemctl stop tagtag-backend
# Extract the backup archive and restore the configuration
tar -xzf /backup/app/app_backup_20230101_000000.tar.gz -C /tmp
cp -r /tmp/20230101_000000/config /opt/tagtag/backend/
# Start the application
systemctl start tagtag-backend
Monitoring and maintenance are essential to stable operation. A complete monitoring stack, disciplined log management, a performance optimization strategy, and regular maintenance procedures together raise the system's availability, reliability, and security.

This document covered the Tagtag Starter project's monitoring architecture, log management, performance monitoring, alerting, distributed tracing, and maintenance plan, as a foundation for building out your own monitoring and maintenance practice.

In day-to-day operation, adjust the monitoring strategy and maintenance procedures to your workload and system characteristics, and keep iterating on performance and reliability.