A First Look at Nightingale Monitoring V6

Source: cnblogs (博客园)

Goals



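Customers choose a product either because its features are well designed or because it is fast, stable, and reliable. When DiDi is unavailable, people switch to Amap (高德); brokerages have been penalized for app outages; WeChat has had early-morning service crashes. The value of stability engineering is therefore to protect the customer experience and to avoid financial loss and negative public opinion.

Handling the Fault Life Cycle

The fault-location system can be organized around the fault life cycle. Before a fault starts, in the contingency-planning stage, do quantitative analysis to find latent risks. After a fault starts, detect it and locate its direct cause as quickly as possible; the direct cause is what you need in order to stop the loss, and the root cause can be investigated afterwards. After recovery, hold a post-mortem, produce a TODO list, and make targeted improvements.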

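Contingency-planning stage

1. Observability system

Even when the infrastructure and the software architecture are fairly mature, you cannot assume everything will be fine; production problems are impossible to fully prevent, so building an observability system is a must. The prevention stage comes down to two things: instrumenting and collecting data, and organizing that data so later troubleshooting is easier. Observability data usually falls into four categories: metrics, traces, logs, and events.

Metrics are the cheapest to store and the most widely used. Well-known products include Prometheus, which has become the industry standard for monitoring systems; the time-series databases VictoriaMetrics and OpenTSDB; and the collectors Telegraf, Categraf, Grafana-agent, and Datadog-agent.

Traces become necessary when there are many services with complex relationships and service faults are hard to track down. The industry has produced the observability framework OpenTelemetry, which covers most tracing needs, but it only pays off once every module has been instrumented.

Logs are the most important troubleshooting tool, but they are expensive to store, so log management needs to be fine-grained. Older data is almost never queried; recent data goes into ES for troubleshooting; numeric values extracted from logs go into a time-series database, where they can be kept a little longer. In a typical investigation a metric flags the anomaly first, then you look at the logs for that time window, the logs may contain a traceid, and you use it to query the trace data, which gets you to the direct cause faster.

Events usually include alert events and change events. For fault location, events also need to be collected in one place so they can be correlated along the time dimension.

2. Risk quantification system

This mainly evaluates the results of the observability system and whether it is complete. It can also quantify the change process and show whether each team's change operations can be trusted, for example releasing to 100% without a canary stage, frequently deploying at peak hours, or frequently rolling back. A per-team health score can then be used to push the weaker business lines to improve.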
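The metrics, then logs, then traces flow described above can be sketched with a small shell helper. This is only an illustration: the function name trace_ids_between and the log line layout (date, time, level, then a traceid=<id> token somewhere in the line) are assumptions of mine, not a format used by any tool in this article.

```shell
# Print the unique trace IDs found in log lines whose time field falls
# between two HH:MM:SS bounds (inclusive). Log lines are read from stdin.
# Assumed (hypothetical) line layout:
#   <date> <HH:MM:SS> <level> ... traceid=<id> ...
trace_ids_between() {
    start="$1"
    end="$2"
    awk -v s="$start" -v e="$end" '
        # Field 2 is the time; compare it as a string against the window.
        $2 >= s && $2 <= e {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^traceid=/) {
                    sub(/^traceid=/, "", $i)
                    if (!seen[$i]++) print $i
                }
            }
        }'
}
```

Once a metric has pointed at a time window, something like `trace_ids_between 11:00:00 11:01:00 < app.log` (app.log being a placeholder file name) would list the trace IDs to look up in the tracing system.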

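The content above is excerpted from the article "Stability System Construction White Paper" (稳定性体系建设白皮书) by Qin Xiaohui (秦晓辉) of Flashcat (快猫星云).

The Nightingale Approach

Nightingale (夜莺) is a visual monitoring product. The open-source edition originated in DiDi's operations team; it is among the most active enterprise-grade cloud-native monitoring projects in China and has been deployed and proven in production by many teams. With Categraf, VictoriaMetrics, and Nightingale we can quickly put together an observability system. Nightingale is a server-side component, similar to Grafana: it connects to different data sources and then analyzes, alerts on, and visualizes their data, and it also covers downstream event handling and alert self-healing. Like Grafana it provides visualization; unlike Prometheus, where alert rules are managed through configuration files, Nightingale manages them collaboratively through the WebUI. The commercial edition adds more enterprise features for unified monitoring and fault location.

Getting Started with Nightingale V6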

Operating system

HECS cloud server (云耀云服务器, Hyper Elastic Cloud Server)

cat /etc/centos-release
AlmaLinux release 8.4 (Electric Cheetah)

lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: AlmaLinux
Description: AlmaLinux release 8.4 (Electric Cheetah)
Release: 8.4
Codename: ElectricCheetah

uname -a
Linux hecs-34116 4.18.0-372.26.1.el8_6.x86_64 #1 SMP Tue Sep 13 06:07:14 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

CPU information

lscpu|grep CPU
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 4
On-line CPU(s) list: 0-3
CPU family: 6
Model name: Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz
CPU MHz: 2600.000
NUMA node0 CPU(s): 0-3

x86_64, x64, and AMD64 are essentially the same thing; the desktop-class Intel/AMD CPUs in use today are almost all x86_64.

Installing MariaDB

Default repository:

yum info mariadb
Last metadata expiration check: 3:17:56 ago on Thu 13 Apr 2023 07:04:57 AM CST.
Available Packages
Name         : mariadb
Epoch        : 3
Version      : 10.3.35
Release      : 1.module_el8.6.0+3265+230ed96b
Architecture : x86_64
Size         : 6.0 M
Source       : mariadb-10.3.35-1.module_el8.6.0+3265+230ed96b.src.rpm
Repository   : appstream
Summary      : A very fast and robust SQL database server
URL          : http://mariadb.org
License      : GPLv2 with exceptions and LGPLv2 and BSD
Description  : MariaDB is a community developed branch of MySQL - a multi-user, multi-threaded
             : SQL database server. It is a client/server implementation consisting of
             : a server daemon (mysqld) and many different client programs and libraries.
             : The base package contains the standard MariaDB/MySQL client programs and
             : generic MySQL files.

Add the MariaDB yum repository, choosing the one you need on the official site:

vim /etc/yum.repos.d/MariaDB.repo

# MariaDB 11.0 [RC] CentOS repository list - created 2023-04-13 02:37 UTC
# https://mariadb.org/download/
[mariadb]
name = MariaDB
# rpm.mariadb.org is a dynamic mirror if your preferred mirror goes offline. See https://mariadb.org/mirrorbits/ for details.
# baseurl = https://rpm.mariadb.org/11.0/centos/$releasever/$basearch
baseurl = https://mirrors.neusoft.edu.cn/mariadb/yum/11.0/centos/$releasever/$basearch
module_hotfixes = 1
# gpgkey = https://rpm.mariadb.org/RPM-GPG-KEY-MariaDB
gpgkey = https://mirrors.neusoft.edu.cn/mariadb/yum/RPM-GPG-KEY-MariaDB
gpgcheck = 1

Rebuild the cache:

yum clean all
yum makecache

Remove the old version:

yum remove mariadb-server mariadb mariadb-libs
yum clean all

Find and delete any leftover directories:

find / -name mariadb
find / -name mysql
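Before deleting anything, it may help to review the leftovers in one sorted list. The helper below is a minimal sketch; the name list_db_leftovers is my own, and unlike the two find commands above it only reports directories and never removes anything.

```shell
# List directories named "mariadb" or "mysql" under a given root so they
# can be reviewed by hand. Nothing is deleted here.
# $1 = directory to search (defaults to /)
list_db_leftovers() {
    root="${1:-/}"
    # Permission errors on unreadable paths are silenced.
    find "$root" -type d \( -name mariadb -o -name mysql \) 2>/dev/null | sort
}
```

Running `list_db_leftovers /` prints one path per line; each path can then be checked and removed manually once you are sure it holds no data you still need.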

Install the new version and start the database

yum install MariaDB-server

Answer y at every prompt, then check the status:

systemctl status mariadb

● mariadb.service - MariaDB 11.0.1 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: inactive (dead)
     Docs: man:mariadbd(8)
           https://mariadb.com/kb/en/library/systemd/

Start it:

systemctl start mariadb

Unexpectedly, this failed:

Job for mariadb.service failed because the control process exited with error code.
See "systemctl status mariadb.service" and "journalctl -xe" for details.

Following the hint and looking at the error shows that the port is already in use:

Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [Note] Server socket created on IP: '0.0.0.0'.
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [ERROR] Can't start server: Bind on TCP/IP port. Got error: 98: Address already in use
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [ERROR] Do you already have another server running on port: 3306 ?
Apr 13 10:59:59 hecs-34116 mariadbd[2579389]: 2023-04-13 10:59:59 0 [ERROR] Aborting

On this machine nginx is the process already listening on 3306:

netstat -ntuap
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 232910/nginx: maste
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 232910/nginx: maste
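To pick the owner of a single port out of a long netstat listing, a small filter can help. The sketch below assumes the column layout shown in the output above (local address in column 4, state in column 6, PID/program from column 7 on); the function name port_owner is made up for this example.

```shell
# Read netstat-style lines from stdin and print the PID/program that is
# listening on the given TCP port.
# $1 = port number to look for
port_owner() {
    port="$1"
    awk -v p="$port" '
        $6 == "LISTEN" {
            # The port is whatever follows the last ":" of the local address.
            n = split($4, parts, ":")
            if (parts[n] == p) {
                owner = $7
                for (i = 8; i <= NF; i++) owner = owner " " $i
                print owner
            }
        }'
}
```

For example, `netstat -ntlp | port_owner 3306` would print the nginx entry on the machine used in this article.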

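Change the port

vi /etc/my.cnf.d/server.cnf

Find the line that begins with [mysqld] and put the following port directive under it, as in the excerpt below. Change the port value to suit your environment.

[mysqld]
port = 12345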
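The same edit can be scripted instead of done in vi. This is a minimal sketch under my own assumptions: the helper name set_mysqld_port is hypothetical, it writes the result to stdout rather than changing the file in place, and it leaves the file unchanged if there is no [mysqld] section.

```shell
# Print an ini-style file with "port = <value>" set in the [mysqld] section.
# The port line is written right after the [mysqld] header, and any other
# port line inside that section is dropped. Other sections are untouched.
# $1 = path to the config file, $2 = new port
set_mysqld_port() {
    file="$1"
    port="$2"
    awk -v p="$port" '
        /^\[/ {
            in_section = ($0 ~ /^\[mysqld\]/)
            print
            if (in_section) print "port = " p
            next
        }
        in_section && /^[ \t]*port[ \t]*=/ { next }
        { print }
    ' "$file"
}
```

A cautious way to use it would be `set_mysqld_port /etc/my.cnf.d/server.cnf 12345 > /tmp/server.cnf.new`, then compare the two files and copy the new one over only after checking it.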

Start it again and check the status:

systemctl start mariadb
systemctl status mariadb

● mariadb.service - MariaDB 11.0.1 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: active (running) since Thu 2023-04-13 11:21:07 CST; 2min 44s ago
     Docs: man:mariadbd(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 2579593 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 2579562 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment>
  Process: 2579560 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 2579576 (mariadbd)
   Status: "Taking your SQL requests now..."
    Tasks: 9 (limit: 49448)
   Memory: 169.3M
   CGroup: /system.slice/mariadb.service
           └─2579576 /usr/sbin/mariadbd

Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] InnoDB: log sequence number 47295; transaction id 14
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Plugin 'FEEDBACK' is disabled.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Plugin 'wsrep-provider' is disabled.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Server socket created on IP: '0.0.0.0'.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] Server socket created on IP: '::'.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] InnoDB: Buffer pool(s) load completed at 230413 11:21:07
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: 2023-04-13 11:21:07 0 [Note] /usr/sbin/mariadbd: ready for connections.
Apr 13 11:21:07 hecs-34116 mariadbd[2579576]: Version: '11.0.1-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 12345 MariaDB Server
Apr 13 11:21:07 hecs-34116 systemd[1]: Started MariaDB 11.0.1 database server.

That worked. Next, set up a user and password for MariaDB.

Connect to the database:

mysql

select user, host, plugin from mysql.user;
+-------------+------------+-----------------------+
| User        | Host       | plugin                |
+-------------+------------+-----------------------+
| mariadb.sys | localhost  | mysql_native_password |
| root        | localhost  | mysql_native_password |
| mysql       | localhost  | mysql_native_password |
| PUBLIC      |            |                       |
|             | localhost  |                       |
|             | hecs-34116 |                       |
+-------------+------------+-----------------------+

Set the privileges and password (123456 is only a demo value; use a strong password on any real system):

GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED BY '123456';

Exit, then log in to mysql again:

mysql -uroot -p123456 -e "show databases"

mysql: Deprecated program name. It will be removed in a future release, use '/usr/bin/mariadb' instead
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| test               |
+--------------------+

Connect to the test database:

mysql -uroot -p -D test

Installing Redis

Check which packages the repositories offer:

dnf list|grep redis
hiredis.x86_64               0.13.3-13.el8                        epel
hiredis-devel.x86_64         0.13.3-13.el8                        epel
pcp-pmda-redis.x86_64        5.3.7-7.el8                          appstream
perl-RDF-Trine-redis.noarch  1.019-8.el8                          epel
python3-redis.noarch         3.5.3-1.el8                          epel
redis.x86_64                 5.0.3-5.module_el8.4.0+2583+b9845322 appstream
redis-devel.x86_64           5.0.3-5.module_el8.4.0+2583+b9845322 appstream
redis-doc.noarch             5.0.3-5.module_el8.4.0+2583+b9845322 appstream
syslog-ng-redis.x86_64       3.23.1-3.el8                         epel
uwsgi-logger-redis.x86_64    2.0.21-2.el8                         epel
uwsgi-router-redis.x86_64    2.0.21-2.el8                         epel

The packaged version is 5.0.3, so download a newer release from upstream instead:

cd /home/tarball/
wget https://codeload.github.com/redis/redis/tar.gz/refs/tags/7.0.10 -O redis-7.0.10.tar.gz
tar zxvf redis-7.0.10.tar.gz
cd redis-7.0.10

make

On success it prints:

Hint: It's a good idea to run 'make test' ;)

make[1]: Leaving directory '/home/tarball/redis-7.0.10/src'

make PREFIX=/home/tarball/redis-7.0.10 install

On success it prints:

cd src && make install
make[1]: Entering directory '/home/tarball/redis-7.0.10/src'
    CC Makefile.dep

Hint: It's a good idea to run 'make test' ;)

    INSTALL redis-server
    INSTALL redis-benchmark
    INSTALL redis-cli
make[1]: Leaving directory '/home/tarball/redis-7.0.10/src'

Start Redis in the background with its configuration file:

./bin/redis-server ./redis.conf &

Check the Redis status:

ps -aux | grep redis
root 2584263 0.0 0.1 62644 10196 pts/0 Sl 12:51 0:00 ./bin/redis-server *:6379
root 2584270 0.0 0.0 12140 1116 pts/1 S+ 12:51 0:00 grep --color=auto redis

netstat -ntuap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address   Foreign Address State  PID/Program name
tcp   0      0      0.0.0.0:22      0.0.0.0:*       LISTEN 2487/sshd
tcp   0      0      0.0.0.0:12345   0.0.0.0:*       LISTEN 2579576/mariadbd
tcp   0      0      0.0.0.0:6379    0.0.0.0:*       LISTEN 2584263/./bin/redis

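Installing the TSDB

According to the official documentation: 'For small-scale use, say fewer than 1,000 machines, Prometheus is enough as the storage; above 1,000 machines, VictoriaMetrics may be a better fit. VictoriaMetrics comes in a single-node version and a cluster version. If you write fewer than 1 million data points per second (to put that number in perspective: if you only monitor machines, each machine reports roughly 200 metrics, and the collection interval is 10 seconds, then each machine produces about 20 data points per second, and 1,000,000 / 20 = 50,000 machines), VictoriaMetrics officially recommends the single-node version by default. The single-node version scales close to linearly as you add CPU cores, memory, and IOPS, and it is easy to configure and operate.' I have also heard from an expert that at large scale Nightingale's main bottleneck is the TSDB, so this time I chose single-node VictoriaMetrics.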
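The sizing arithmetic in the quoted passage can be replayed with shell integer arithmetic. This is just a sketch of that back-of-the-envelope estimate; the function name machines_supported is mine, and real capacity depends on hardware and on the actual number of series per machine.

```shell
# Estimate how many machines fit within a given write capacity.
#   $1 = write capacity in data points per second
#   $2 = metrics collected per machine
#   $3 = collection interval in seconds
machines_supported() {
    capacity="$1"
    metrics="$2"
    interval="$3"
    # Points per second per machine = metrics / interval, so
    # machines = capacity / (metrics / interval) = capacity * interval / metrics.
    echo $(( capacity * interval / metrics ))
}

machines_supported 1000000 200 10   # prints 50000
```

With the figures from the documentation (1,000,000 points per second, 200 metrics per machine, a 10-second interval) this reproduces the 50,000-machine estimate.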

Download VictoriaMetrics

wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.90.0/victoria-metrics-linux-amd64-v1.90.0.tar.gz

The download is rather slow.

mkdir victoria-metrics
tar xf victoria-metrics-linux-amd64-v1.90.0.tar.gz -C victoria-metrics
cd victoria-metrics

Start it:

nohup ./victoria-metrics-prod &>victoria.log &

Check that it is listening on the default port, 8428:

ss -ntpl
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0      128    0.0.0.0:22         0.0.0.0:*         users:(("sshd",pid=2487,fd=5))
LISTEN 0      80     0.0.0.0:12345      0.0.0.0:*         users:(("mariadbd",pid=2579576,fd=21))
LISTEN 0      511    0.0.0.0:6379       0.0.0.0:*         users:(("redis-server",pid=2584263,fd=6))
LISTEN 0      1024   0.0.0.0:8428       0.0.0.0:*         users:(("victoria-metric",pid=2584341,fd=10))

Installing Nightingale

Download it from the official site:

wget https://download.flashcat.cloud/n9e-v6.0.0-ga.3-linux-amd64.tar.gz

mkdir n9e
tar zxvf n9e-v6.0.0-ga.3-linux-amd64.tar.gz -C n9e

Import the database:

mysql -uroot -p

Edit the N9e configuration file (before going to production, remember to change the secret key and the Auth-related fields):

vim etc/config.toml

[DB]
# postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
DSN="root:123456@tcp(127.0.0.1:12345)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true" # changed the DB password and port

[[Pushgw.Writers]]
# Url = "http://127.0.0.1:8480/insert/0/prometheus/api/v1/write"
Url = "http://127.0.0.1:8428/api/v1/write" # changed to the VictoriaMetrics port
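Since the DSN packs user, password, host, port, and database name into one string, it is easy to mistype when the port is non-default. The sketch below assembles the same string from its parts; the helper name n9e_mysql_dsn is hypothetical, the query parameters are copied from the config line above, and no escaping of special characters in the password is attempted.

```shell
# Build the MySQL DSN string used in the [DB] section of etc/config.toml.
#   $1 = user, $2 = password, $3 = host, $4 = port, $5 = database name
# Note: values are inserted as-is, with no escaping.
n9e_mysql_dsn() {
    user="$1"; pass="$2"; host="$3"; port="$4"; db="$5"
    printf '%s:%s@tcp(%s:%s)/%s?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true\n' \
        "$user" "$pass" "$host" "$port" "$db"
}
```

Calling `n9e_mysql_dsn root 123456 127.0.0.1 12345 n9e_v6` yields the value shown in the config excerpt, which can then be pasted between the quotes after DSN=.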

Start the n9e service:

nohup ./n9e &>n9e.log &

Confirm that it is listening on port 17000:

ss -ntlp | grep 17000
LISTEN 0 1024 *:17000 *:* users:(("n9e",pid=2584479,fd=9))
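At this point four services should be listening: MariaDB on 12345, Redis on 6379, VictoriaMetrics on 8428, and n9e on 17000. Rather than grepping for each port, a single filter can report which ones are missing. This is a rough sketch that assumes the ss column layout shown earlier (state in column 1, local address in column 4); the name missing_ports is invented for the example.

```shell
# Read "ss -ntl" style output from stdin and print every expected port
# that has no LISTEN entry. Prints nothing when all ports are listening.
# Arguments: the port numbers that should be listening.
missing_ports() {
    awk -v want="$*" '
        BEGIN { n = split(want, ports, " ") }
        $1 == "LISTEN" {
            # The port is whatever follows the last ":" of the local address.
            k = split($4, parts, ":")
            listening[parts[k]] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(ports[i] in listening)) print ports[i]
        }'
}
```

A check such as `ss -ntl | missing_ports 12345 6379 8428 17000` should print nothing once the whole stack is up.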

Configure nginx

server {
    listen 80;
    server_name xxxx.xxxx.com;
    location / {
        proxy_pass http://localhost:17000;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Open the site in a browser and log in with the username root and the password root.2020.

Download Categraf

wget https://download.flashcat.cloud/categraf-v0.2.38-linux-amd64.tar.gz
tar xf categraf-v0.2.38-linux-amd64.tar.gz
cd categraf-v0.2.38-linux-amd64/
vim conf/config.toml

Change these two settings:

[[writers]]
url = "http://127.0.0.1:17000/prometheus/v1/write"

[heartbeat]
enable = true

Start Categraf:

nohup ./categraf &>categraf.log &

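This completes a minimal, centralized deployment of Nightingale. From here, work in the WebUI: add data sources under System Configuration > Data Sources, define alert rules for alert evaluation, and so on.

The V6 architecture diagram and deployment modes are described on the official blog. In short, n9e uses MySQL to store alert and configuration data; Redis stores authentication information, metadata, and heartbeat information; the TSDB stores the metrics that alerts are evaluated against; and Categraf collects the data. n9e can run as a cluster, with multiple n9e instances sharing the alert-rule processing load (in that mode n9e is a stateful service).

Finally, thank you for reading. My experience is limited and I am not familiar with many of these tools, so corrections for any errors or omissions are welcome. Thank you for your understanding.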

References:
https://flashcat.cloud/blog/sre-practice-white-paper/
https://mp.weixin.qq.com/s/5Ik-Kk1_B7jjgXLxHH1Oug
https://blog.csdn.net/m0_61323675/article/details/130114281
https://flashcat.cloud/blog/nightingale-v6-arch/
https://blog.csdn.net/wf19930209/article/details/79536506
https://www.cnblogs.com/xunzhiyou/p/16365158.html
https://blog.csdn.net/sqlquan/article/details/122093702
https://blog.csdn.net/w892824196/article/details/107062729
https://www.cnblogs.com/pxyblog/p/mysql.html
https://www.cnblogs.com/hunanzp/p/12304622.html
https://developer.aliyun.com/article/789869
https://flashcat.cloud/docs/content/flashcat-monitor/nightingale/install/victoriametrics/
https://blog.csdn.net/qihoo_tech/article/details/120558834
