Prometheus Overview

What Is Prometheus

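Prometheus is a monitoring system originally built at SoundCloud. Development began in 2012, and it has since become a community open-source project with a very active developer and user community. To emphasize that it is open source and independently maintained, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project after Kubernetes.
Prometheus stores time series data: sequences of timestamped values that belong to the same series (same metric name and same labels), stored along the time dimension.
A time series is identified by its metric name and a set of key/value labels. Samples with the same name and the same labels belong to the same time series.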

Prometheus Features

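What distinguishes Prometheus as a new-generation monitoring system? The main points are:

A multi-dimensional data model: time series data identified by a metric name and key/value pairs;
PromQL: a flexible query language that uses this dimensionality to express complex queries;
No dependency on distributed storage: a single server node works on its own;
Time series are collected with a pull model over HTTP;
Pushing time series is supported through the PushGateway component;
Targets are discovered through service discovery or static configuration, with several discovery mechanisms available;
Multiple modes of graphing and dashboard support (Grafana).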

Prometheus Architecture

The main architecture of Prometheus is shown in the figure below:
[Figure: Prometheus architecture diagram]
The main components are:
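Prometheus Server: collects metrics, stores them as time series data, and provides a query interface.

Retrieval (inside Prometheus Server): fetches the monitoring data from the monitored targets.
TSDB (inside Prometheus Server): the built-in time series database that stores the monitoring data locally. It is mainly suited to small and medium setups on a single node. When monitoring a large number of servers or keeping data for a long time, the local storage is usually complemented by an external store such as InfluxDB through the remote storage (remote write/read) integration.
HTTP Server (inside Prometheus Server): exposes the API that the web UI and other services call.
PushGateway: a standalone service that sits between applications that send metrics and the Prometheus server. The Pushgateway receives metrics and is then scraped by the Prometheus server as a target. You can think of it as a proxy, or as the opposite of the blackbox exporter: it receives metrics instead of probing for them. It only holds the pushed metrics temporarily and is mainly used for short-lived jobs.
Exporters: collect monitoring metrics from existing third-party services and expose them on a metrics endpoint.
AlertManager: alerting is an important part of any monitoring system. In Prometheus, collection and alerting are separated. Alerting rules are defined in Prometheus, and only after a rule fires is the alert forwarded to the standalone Alertmanager component. Alertmanager processes the alert and finally sends it to the specified users through a receiver.
WEB UI: provides a simple web console. In most cases full data visualization is done by adding Grafana.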
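The pull model and the role of an exporter can be sketched with the Python standard library alone. This is only an illustration, not how real exporters are usually written (they normally use an official client library). The metric name demo_uptime_seconds, the handler class, and port 8000 are made up for this example.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def render_metrics() -> str:
    """Render one gauge in the Prometheus text exposition format."""
    uptime = time.time() - START_TIME
    lines = [
        "# HELP demo_uptime_seconds Seconds since this demo exporter started.",
        "# TYPE demo_uptime_seconds gauge",
        f"demo_uptime_seconds {uptime:.3f}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever and let a Prometheus server scrape http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. Prometheus would then pull from it on every scrape interval, which is the opposite direction from a push-based agent.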

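Data Model

Prometheus uses a multi-dimensional data model whose underlying storage unit is the time series. A time series is identified by a metric name and a set of key/value labels, and all samples of one time series share the same metric name and label combination. Each sample consists of a float64 value and a Unix timestamp with millisecond precision.
The notation for a time series is:

<metric_name>{<label name>=<label value>, …}

For example, the time series with metric name http_requests_total and labels method="POST" and handler="/messages" (here together with the instance and job labels, and followed by the sample value 100) can be written as:
http_requests_total{method="POST", handler="/messages", instance="webserver", job="web"} 100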
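Expressed as JSON, one raw sample in a time series database looks like this:

## One time series sample expressed as JSON
{
  "timestamp": 1346846400,            // timestamp
  "metric": "total_website_visits",   // metric name
  "tags": {                           // label set
    "instance": "aaa",
    "job": "job001"
  },
  "value": 18                         // sample value
}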
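The rule that a metric name plus a label set identifies one time series can be sketched as follows. The helper names series_key and format_series are invented for this illustration and are not part of any Prometheus library.

```python
from typing import Dict, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def series_key(metric: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity: label order must not matter, so sort the pairs."""
    return metric, tuple(sorted(labels.items()))


def format_series(metric: str, labels: Dict[str, str]) -> str:
    """Render the identity in the <metric_name>{<label>=<value>, ...} notation."""
    body = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{body}}}"


# A toy in-memory store: series identity -> list of (timestamp_ms, value) samples.
store: Dict[SeriesKey, list] = {}
key = series_key("http_requests_total", {"method": "POST", "handler": "/messages"})
store.setdefault(key, []).append((1346846400000, 100.0))
```

Two samples end up in the same list only when both the name and every label match. Changing a single label value creates a new series.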

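Metric Types

There are four main metric types: 1. Counter 2. Gauge 3. Histogram 4. Summary
Each type is described below.

1. Counter is a cumulative metric that represents a monotonically increasing value: it can only go up, or be reset to zero when the process restarts. Use a counter for things like the number of requests served, tasks completed, or errors. Do not use a counter for a value that can decrease.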

# HELP http_requests_total Counter of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="/api/v1/query"} 7
http_requests_total{code="400",handler="/api/v1/query"} 2
http_requests_total{code="200",handler="/metrics"} 32
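To show how such exposition text maps onto name, labels, and value, here is a deliberately simplified parser sketch. It assumes label values contain no escaped quotes and that lines carry no trailing timestamp. The function name parse_samples is made up, and real scrapers use a full parser instead.

```python
import re
from typing import Dict, List, Tuple

# metric name, optional {label set}, then a single value token
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Return (metric name, labels, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not understood by this simplified sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


EXAMPLE = """\
# HELP http_requests_total Counter of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="/api/v1/query"} 7
http_requests_total{code="400",handler="/api/v1/query"} 2
http_requests_total{code="200",handler="/metrics"} 32
"""

for name, labels, value in parse_samples(EXAMPLE):
    print(name, labels, value)
```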

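Counting total requests
Counters can be aggregated. For example, to compute the total number of requests for http_requests_total{handler="/api/v1/query"} over the last 5 minutes, grouped by handler: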

sum(increase(http_requests_total{handler="/api/v1/query"}[5m])) by (handler)

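In this expression, [] is the range vector selector, and [5m] selects the samples from the last 5 minutes. http_requests_total{handler="/api/v1/query"}[5m] therefore returns all samples collected for http_requests_total{handler="/api/v1/query"} during the last 5 minutes:
http_requests_total{handler="/api/v1/query"}[5m]
The result looks like this: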

http_requests_total{code="200",handler="/api/v1/query"}
51 @1598945717.441
51 @1598945777.374
51 @1598945837.307
51 @1598945897.239
52 @1598945957.171
http_requests_total{code="400",handler="/api/v1/query"}
4 @1598945717.441
4 @1598945777.374
4 @1598945837.307
4 @1598945897.239
4 @1598945957.171

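Applying increase to a range vector
increase() computes how much each time series in the range vector has increased. increase(http_requests_total{handler="/api/v1/query"}[5m]) gives the increase in requests over the last 5 minutes.
increase(http_requests_total{handler="/api/v1/query"}[5m])
The result: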

{code="200",handler="/api/v1/query"} 6.2570652695335145
{code="400",handler="/api/v1/query"} 0
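The core idea behind increase() can be sketched as below: add up the deltas between consecutive samples and treat any drop as a counter reset. This is a simplification. Prometheus additionally extrapolates the result to the boundaries of the time window, which is why real results are often non-integers. The name raw_increase is invented here.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def raw_increase(samples: List[Sample]) -> float:
    """Sum of increases between consecutive samples, tolerating counter resets.

    No extrapolation to the window edges is done, unlike PromQL's increase().
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter restarted from zero, so everything it shows now is new.
            total += current
        else:
            total += current - previous
    return total


code_200 = [
    (1598945717.441, 51), (1598945777.374, 51), (1598945837.307, 51),
    (1598945897.239, 51), (1598945957.171, 52),
]
print(raw_increase(code_200))  # the range vector listed earlier grows by one request
```

Because resets are handled per pair of samples, a process restart in the middle of the window does not produce a negative increase.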

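Aggregating by dimension with sum
sum() is an aggregation operator. sum(…) by (handler) keeps the handler dimension, while the values across the code dimension are added up into a single value.
sum(increase(http_requests_total{handler="/api/v1/query"}[5m])) by (handler)
The result: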

{handler="/api/v1/query"} 7.508415682577556
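What sum(…) by (handler) does to the label sets can be sketched like this. The helper sum_by and the input values 6.0 and 1.5 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

LabelGroup = Tuple[Tuple[str, str], ...]


def sum_by(samples: List[Tuple[Dict[str, str], float]],
           keep: Iterable[str]) -> Dict[LabelGroup, float]:
    """Group samples by the kept labels and add up the values in each group."""
    kept = sorted(keep)
    totals: Dict[LabelGroup, float] = defaultdict(float)
    for labels, value in samples:
        # Labels that are not kept (here: code) simply disappear from the key.
        group = tuple((name, labels.get(name, "")) for name in kept)
        totals[group] += value
    return dict(totals)


per_code = [
    ({"code": "200", "handler": "/api/v1/query"}, 6.0),
    ({"code": "400", "handler": "/api/v1/query"}, 1.5),
]
print(sum_by(per_code, ["handler"]))
```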

2. Gauge represents a value that can go up and down. It is typically used for measurements such as temperature or current memory usage.

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 61

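This means there are currently 61 goroutines.
3. Histogram produces histogram data. The buckets must be configured in advance, and a histogram usually exposes the following time series:
Cumulative counters for the observation buckets, _bucket{le="<upper bound>"}
The total sum of all observed values, _sum
The count of events that have been observed, _count
Example: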

# HELP prometheus_http_response_size_bytes Histogram of response size for HTTP requests.
# TYPE prometheus_http_response_size_bytes histogram
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="100"} 0
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="1000"} 0
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="10000"} 5
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="100000"} 5
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="1e+06"} 5
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="1e+07"} 5
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="1e+08"} 5
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="1e+09"} 5
prometheus_http_response_size_bytes_bucket{handler="/api/v1/label/:name/values",le="+Inf"} 5
prometheus_http_response_size_bytes_sum{handler="/api/v1/label/:name/values"} 8440
prometheus_http_response_size_bytes_count{handler="/api/v1/label/:name/values"} 5

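prometheus_http_response_size_bytes is configured with the following buckets: 100, 1000, 10000, 100000, 1e+06, 1e+07, 1e+08, 1e+09, +Inf. The total count is 5 and the sum is 8440. The detailed distribution is in the prometheus_http_response_size_bytes_bucket series. Histograms can be aggregated on the server side to compute distribution quantiles such as P90 or P99, but the accuracy depends on the configured buckets.
To compute the P95 of prometheus_http_response_size_bytes over the last hour: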

histogram_quantile(0.95, rate(prometheus_http_response_size_bytes_bucket[1h]))
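Why the accuracy depends on the buckets becomes clearer with a sketch of the estimation: find the bucket that contains the requested rank, then interpolate linearly inside it. The function below follows that idea on plain cumulative counts instead of rates. It is an approximation written for this article and not the Prometheus source code.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper bound, cumulative count) buckets.

    Assumes the list contains the +Inf bucket and that counts are cumulative.
    """
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        return math.nan
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    if idx == len(ordered) - 1:
        # The rank falls into +Inf: report the highest finite upper bound.
        return ordered[-2][0]
    upper, count = ordered[idx]
    if idx == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    else:
        lower, below = ordered[idx - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return math.nan
    # Linear interpolation: observations are assumed evenly spread in the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


response_size = [
    (100, 0), (1000, 0), (10000, 5), (100000, 5), (1e6, 5),
    (1e7, 5), (1e8, 5), (1e9, 5), (math.inf, 5),
]
print(histogram_quantile(0.95, response_size))
```

With the bucket counts from the example, all five observations sit between 1000 and 10000, so the estimate lands near 9550 even though the average response size is 8440 / 5 = 1688. Finer buckets in that range would give a tighter estimate.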

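4. Summary computes quantiles on the client side, and the quantiles to report must be configured in advance. Like a histogram, a summary usually exposes several time series:
① φ-quantiles (0 ≤ φ ≤ 1), {quantile="<φ>"}
② The total sum of all observed values, _sum
③ The count of events that have been observed, _count
Example: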

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.43e-05
go_gc_duration_seconds{quantile="0.25"} 3.97e-05
go_gc_duration_seconds{quantile="0.5"} 5.91e-05
go_gc_duration_seconds{quantile="0.75"} 0.0001494
go_gc_duration_seconds{quantile="1"} 0.0016843
go_gc_duration_seconds_sum 0.1152793
go_gc_duration_seconds_count 1189
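The client-side nature of a summary can be illustrated with a naive sketch that keeps every observation and picks quantiles by nearest rank. This is an assumption made for clarity: real client libraries typically use streaming estimators rather than storing all values. The class NaiveSummary is made up for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


class NaiveSummary:
    """Keeps all observations in memory; fine for a demo, not for production."""

    def __init__(self, quantiles: Sequence[float]):
        self.quantiles = list(quantiles)
        self.observations: List[float] = []

    def observe(self, value: float) -> None:
        self.observations.append(value)

    def collect(self) -> Tuple[Dict[float, float], float, int]:
        """Return ({quantile: value}, _sum, _count) as a summary would expose them."""
        ordered = sorted(self.observations)
        values: Dict[float, float] = {}
        for q in self.quantiles:
            if ordered:
                # Nearest-rank method, clamped to valid list indexes.
                index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
                values[q] = ordered[index]
        return values, math.fsum(ordered), len(ordered)


summary = NaiveSummary([0.0, 0.5, 1.0])
for duration in range(1, 11):
    summary.observe(float(duration))
print(summary.collect())
```

Because the quantiles are already computed per process, they cannot be meaningfully aggregated across instances afterwards, which is the main trade-off against histograms.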

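Jobs and Instances

Instance: a target that can be scraped is called an instance.
Job: a collection of instances with the same purpose is called a job, for example: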
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']
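When Prometheus scrapes a target, it attaches a job label and an instance label to every scraped series. The sketch below mirrors the configuration above as a Python structure and derives those two labels. The function target_labels is hypothetical and ignores relabeling and service discovery.

```python
from typing import Dict, List

# The scrape configuration from above, written as plain Python data.
scrape_configs = [
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "node", "static_configs": [{"targets": ["192.168.1.10:9090"]}]},
]


def target_labels(configs: List[dict]) -> List[Dict[str, str]]:
    """List the job/instance label pair that each configured target would get."""
    result = []
    for job in configs:
        for group in job.get("static_configs", []):
            for target in group.get("targets", []):
                # instance defaults to the <host>:<port> of the scraped target
                result.append({"job": job["job_name"], "instance": target})
    return result


for labels in target_labels(scrape_configs):
    print(labels)
```

A job with several targets simply yields several instances that share the same job label, which is what makes queries such as sum by (job) useful.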
