Prometheus Metric Description

Service Metrics

Service metrics are generated from server-side events and are used to show the quality of service.

Metrics List

Metric Name | Type | Description
kindling_entity_request_total | Counter | Total number of requests
kindling_entity_request_duration_nanoseconds_total | Counter | Total duration of requests
kindling_entity_request_send_bytes_total | Counter | Total size of payload sent
kindling_entity_request_receive_bytes_total | Counter | Total size of payload received
kindling_entity_request_average_duration_nanoseconds_count | Histogram | Count of average duration of requests. Disabled by default. See Note 3 for how to enable it.
kindling_entity_request_average_duration_nanoseconds_sum | Histogram | Sum of average duration of requests. Disabled by default. See Note 3 for how to enable it.
kindling_entity_request_average_duration_nanoseconds_bucket | Histogram | Histogram buckets of average duration of requests. Disabled by default. See Note 3 for how to enable it.

Labels List

Label Name | Example | Notes
node | worker-1 | Node name as represented in the Kubernetes cluster
namespace | default | Namespace of the pod
workload_kind | daemonset | K8sResourceType
workload_name | api-ds | K8sResourceName
service | api | One of the services that target this pod
pod | api-ds-xxxx | The name of the pod
container | api-container | The name of the container
container_id | 1a2b3c4d5e6f | The shortened container ID, which contains 12 characters
ip | 10.1.11.23 | The IP address of the entity
port | 80 | The listening port of the entity
protocol | http | The application layer protocol the requests use
request_content | /test/api | The request content of the requests
response_content | 200 | The response content of the requests
is_slow | false | (Only applicable to kindling_entity_request_total) Whether the requests are considered slow

Notes

Note 1: The label namespace holds the value NOT_FOUND_INTERNAL when neither the container_id nor the IP can be found in the current Kubernetes cluster, in which case the entity is not maintained by the current Kubernetes cluster.

Note 2: The labels request_content and response_content hold different values depending on the protocol.

  • When protocol is http:
Label | Example | Notes
request_content | /test/api | Endpoint of the HTTP request. The URL is truncated to avoid high cardinality.
response_content | 200 | Status code of the HTTP response.
  • When protocol is dns:
Label | Example | Notes
request_content | www.google.com | Domain to be queried.
response_content | 0 | rcode of the DNS response. Possible values include 0, 1, 2, 3, and 4.
  • When protocol is mysql:
Label | Example | Notes
request_content | select employee | SQL of the MySQL request. The SQL is truncated to avoid high cardinality. The format is [‘operation’ ‘space’ ‘table’ ‘*’].
response_content | 1064 | MySQL error code. Only applicable when the response is an error. See codes introduction.
  • When protocol is kafka:
Label | Example | Notes
request_content | user-msg-topic | Topic of the Kafka request.
response_content | (empty) | Empty temporarily.
  • When protocol is dubbo:
Label | Example | Notes
request_content | io.kindling.dubbo.api.service.OrderService#order | Service info. The format of the service is package.class#method.
response_content | 20 | “error_code” of Dubbo. 20 means OK; more details at the docs.
  • For other cases, request_content and response_content are both empty.
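The per-protocol meanings in Note 2 can be folded into one helper when post-processing query results. The following is a minimal sketch, not Kindling's own logic: the function name is_error_response is hypothetical, and treating HTTP 4xx/5xx as errors is an assumption. The remaining rules follow the tables above (DNS rcode 0 means no error, a MySQL value is present only for error responses, Dubbo 20 means OK).

```python
def is_error_response(protocol: str, response_content: str) -> bool:
    """Guess whether a (protocol, response_content) label pair denotes an error.

    Hypothetical helper for illustration; the rules are assumptions drawn
    from the label tables, not code taken from Kindling.
    """
    if response_content == "":
        # Kafka and unrecognized protocols leave response_content empty,
        # and MySQL only fills it in for error responses.
        return False
    if protocol == "http":
        # Assumption: 4xx and 5xx status codes count as errors.
        return response_content[0] in ("4", "5")
    if protocol == "dns":
        # rcode 0 means the query succeeded.
        return response_content != "0"
    if protocol == "mysql":
        # A non-empty value is a MySQL error code.
        return True
    if protocol == "dubbo":
        # 20 means OK.
        return response_content != "20"
    return False
```

For example, is_error_response("dns", "3") is True, while is_error_response("dubbo", "20") is False.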

Note 3: The histogram metric kindling_entity_request_average_duration_nanoseconds_* is disabled by default because it can have high cardinality. If this metric is needed, add a new line to the exporters.otelexporter.metric_aggregation_map section of the configuration file.

exporters:
  otelexporter:
    metric_aggregation_map:
      # add the following line
      kindling_entity_request_average_duration_nanoseconds: histogram 

Topology Metrics

Topology metrics are typically generated from client-side events and are used to show the service dependency map, hence the name “topology”. Some time series may instead be generated from server-side events; these carry a non-empty dst_container_id label. They are generated only when the source IP is not a pod IP inside the Kubernetes cluster, which is useful when no agent is installed on the client side.

Metrics List

Metric Name | Type | Description
kindling_topology_request_total | Counter | Total number of requests
kindling_topology_request_duration_nanoseconds_total | Counter | Total duration of requests
kindling_topology_request_request_bytes_total | Counter | Total size of payload sent
kindling_topology_request_response_bytes_total | Counter | Total size of payload received
kindling_topology_request_average_duration_nanoseconds_count | Histogram | Count of average duration of requests. Disabled by default. See Note 3 for how to enable it.
kindling_topology_request_average_duration_nanoseconds_sum | Histogram | Sum of average duration of requests. Disabled by default. See Note 3 for how to enable it.
kindling_topology_request_average_duration_nanoseconds_bucket | Histogram | Histogram buckets of average duration of requests. Disabled by default. See Note 3 for how to enable it.

Labels List

Label Name | Example | Notes
src_node | slave-node1 | Which node the source pod is on
src_namespace | default | Namespace of the source pod
src_workload_kind | deployment | Workload kind of the source pod
src_workload_name | business1 | Workload name of the source pod
src_service | business1-svc | One of the services that target the source pod
src_pod | business1-0 | The name of the source pod
src_container | business-container | The name of the source container
src_container_id | 1a2b3c4d5e6f | The shortened container ID, which contains 12 characters
src_ip | 10.1.11.23 | The IP address of the source
dst_node | slave-node2 | Which node the destination pod is on
dst_namespace | default | Namespace of the destination pod
dst_workload_kind | deployment | Workload kind of the destination pod
dst_workload_name | business2 | Workload name of the destination pod
dst_service | business2-svc | One of the services that target the destination pod
dst_pod | business2-0 | The name of the destination pod
dst_container | business-container | The name of the destination container
dst_container_id | 2b3c4d5e6f7e | (Only applicable to the time series generated from the server side) The shortened container ID, which contains 12 characters
dst_ip | 10.1.11.24 | The IP address of the destination
dst_port | 80 | The listening port of the destination container
protocol | http | The application layer protocol the requests use
status_code | 200 | Different values for different protocols

Notes

Note 1: We define two custom values for the labels src_namespace and dst_namespace: NOT_FOUND_INTERNAL and NOT_FOUND_EXTERNAL. Their meanings are described below. These values also apply to other metrics in this doc.

Each value is composed of two parts.

  1. NOT_FOUND: The IP belongs to neither a pod nor a service in the current Kubernetes cluster. The IP could belong to a host or an external service.
  2. INTERNAL or EXTERNAL: INTERNAL is set in two cases. The first is when the IP belongs to a node that resides in the current Kubernetes cluster. The second is when the source or destination runs on the same host as the Kindling agent, which generally applies to non-Kubernetes clusters. EXTERNAL is set in all other NOT_FOUND cases. Note that another Kubernetes cluster is also considered “external”.
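The two-part rule in Note 1 can be read as a small decision procedure. The sketch below is illustrative only: the function name and its three boolean inputs are hypothetical, and how Kindling actually determines them is not described here.

```python
from typing import Optional


def namespace_placeholder(
    ip_is_pod_or_service: bool,
    ip_is_cluster_node: bool,
    on_agent_host: bool,
) -> Optional[str]:
    """Return the placeholder namespace described in Note 1, if any.

    Hypothetical helper: the three flags stand in for lookups that the
    agent performs against the current Kubernetes cluster.
    """
    if ip_is_pod_or_service:
        # The IP is known to the cluster, so a real namespace is expected.
        return None
    if ip_is_cluster_node or on_agent_host:
        # NOT_FOUND, but still inside the cluster or on the agent's host.
        return "NOT_FOUND_INTERNAL"
    # Everything else, including other Kubernetes clusters.
    return "NOT_FOUND_EXTERNAL"
```

Under these assumptions, a request to a public API endpoint gives namespace_placeholder(False, False, False), which is "NOT_FOUND_EXTERNAL".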

Note 2: The label status_code holds different values depending on the protocol.

  • HTTP: ‘Status Code’ of HTTP response.
  • DNS: rcode of DNS response.
  • MySQL: Error code of the error response.
  • DUBBO: ‘Error Code’ of Dubbo request.
  • Others: empty temporarily.

Note 3: The histogram metric kindling_topology_request_average_duration_nanoseconds_* is disabled by default because it can have high cardinality. If this metric is needed, add a new line to the exporters.otelexporter.metric_aggregation_map section of the configuration file.

exporters:
  otelexporter:
    metric_aggregation_map:
      # add the following line
      kindling_topology_request_average_duration_nanoseconds: histogram 

Trace As Metric

We defined some rules for deciding whether a request is abnormal. For an abnormal request, the detailed request information is useful for debugging or profiling. We call this kind of data a “trace”. Storing such data in Prometheus is not good practice because some labels have high cardinality, so we selected a subset of the original labels to generate a new kind of metric, called “Trace As Metric”. The following tables show the metric and the labels it contains.

Metrics List

Metric Name | Type | Description
kindling_trace_request_duration_nanoseconds | Gauge | The specific request duration

Labels List

Label Name | Example | Notes
src_node | slave-node1 | Which node the source pod is on
src_namespace | default | Namespace of the source pod
src_workload_kind | deployment | Workload kind of the source pod
src_workload_name | business1 | Workload name of the source pod
src_service | business1-svc | One of the services that target the source pod
src_pod | business1-0 | The name of the source pod
src_container | business-container | The name of the source container
src_container_id | 1a2b3c4d5e6f | (Only applicable when is_server is false) The shortened container ID, which contains 12 characters
src_ip | 10.1.11.23 | The IP address of the source
dst_node | slave-node2 | Which node the destination pod is on
dst_namespace | default | Namespace of the destination pod
dst_workload_kind | deployment | Workload kind of the destination pod
dst_workload_name | business2 | Workload name of the destination pod
dst_service | business2-svc | One of the services that target the destination pod
dst_pod | business2-0 | The name of the destination pod
dst_container | business-container | The name of the destination container
dst_container_id | 2b3c4d5e6f7e | (Only applicable when is_server is true) The shortened container ID, which contains 12 characters
dst_ip | 10.1.11.24 | The IP address of the destination. This is the original IP before DNAT
dst_port | 80 | The listening port of the destination container
dnat_ip | 192.168.12.3 | The IP address of the destination after DNAT, if applicable
dnat_port | 80 | The listening port of the destination container after DNAT, if applicable
protocol | http | The application layer protocol the requests use
is_server | true | True if the data is from the server side, false otherwise
request_content | /test/api | Different values when protocol is different. Refer to service metric
response_content | 200 | Different values when protocol is different. Refer to service metric
request_duration_status | 1 | The total duration spent sending the request and receiving the response. 1 (green): latency <= 800 ms; 2 (yellow): 800 ms < latency < 1500 ms; 3 (red): latency >= 1500 ms
request_reqxfer_status | 2 | ReqXfer indicates the duration for transferring the request payload. 1 (green): latency <= 200 ms; 2 (yellow): 200 ms < latency < 1000 ms; 3 (red): latency >= 1000 ms
request_processing_status | 3 | Processing indicates the duration until receiving the first byte. 1 (green): latency <= 200 ms; 2 (yellow): 200 ms < latency < 1000 ms; 3 (red): latency >= 1000 ms
response_rspxfer_status | 1 | RspXfer indicates the duration for transferring the response payload. 1 (green): latency <= 200 ms; 2 (yellow): 200 ms < latency < 1000 ms; 3 (red): latency >= 1000 ms
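The four *_status labels share the same three-level scheme and differ only in their thresholds. The sketch below shows that mapping, assuming all thresholds are in milliseconds (the table states the unit only for the green bound). The names THRESHOLDS_MS, latency_status, and status_for are hypothetical and not part of Kindling.

```python
# (green upper bound, red lower bound) per label, in milliseconds.
# Assumption: every threshold in the table is in milliseconds.
THRESHOLDS_MS = {
    "request_duration_status": (800, 1500),
    "request_reqxfer_status": (200, 1000),
    "request_processing_status": (200, 1000),
    "response_rspxfer_status": (200, 1000),
}


def latency_status(latency_ms: float, green_max_ms: float, red_min_ms: float) -> int:
    """Map a latency to 1 (green), 2 (yellow), or 3 (red)."""
    if latency_ms <= green_max_ms:
        return 1
    if latency_ms >= red_min_ms:
        return 3
    return 2


def status_for(label: str, latency_ms: float) -> int:
    """Look up the thresholds for a status label and classify the latency."""
    green_max_ms, red_min_ms = THRESHOLDS_MS[label]
    return latency_status(latency_ms, green_max_ms, red_min_ms)
```

Both bounds are inclusive on the outer side, so a total duration of exactly 800 ms is green and exactly 1500 ms is red.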

TCP Status Metrics

Metrics List

Metric Name | Type | Description
kindling_tcp_srtt_microseconds | Gauge | Smoothed round trip time of the TCP socket
kindling_tcp_packet_loss_total | Counter | Total number of dropped packets
kindling_tcp_retransmit_total | Counter | Total number of times a retransmission happens (not the number of packets)

Labels List

Label Name | Example | Notes
src_node | slave-node1 | Which node the source pod is on
src_namespace | default | Namespace of the source pod
src_workload_kind | deployment | Workload kind of the source pod
src_workload_name | business1 | Workload name of the source pod
src_service | business1-svc | One of the services that target the source pod
src_pod | business1-0 | The name of the source pod
src_container | business-container | The name of the source container
src_ip | 10.1.11.23 | Pod’s IP by default. If the source is not a pod in Kubernetes, this is the IP address of an external entity
src_port | 80 | The listening port of the source container, if applicable
dst_node | slave-node2 | Which node the destination pod is on
dst_namespace | default | Namespace of the destination pod
dst_workload_kind | deployment | Workload kind of the destination pod
dst_workload_name | business2 | Workload name of the destination pod
dst_service | business2-svc | One of the services that target the destination pod
dst_pod | business2-0 | The name of the destination pod
dst_container | business-container | The name of the destination container
dst_ip | 10.1.11.24 | Pod’s IP by default. If the destination is not a pod in Kubernetes, this is the IP address of an external entity
dst_port | 80 | The listening port of the destination container, if applicable

TCP Socket Connects Metrics

Metrics List

Metric Name | Type | Description
kindling_tcp_connect_total | Counter | Total number of successfully and unsuccessfully established TCP connections
kindling_tcp_connect_duration_nanoseconds_total | Counter | Total duration of the successfully established TCP connections

Labels List

Label Name | Example | Notes
pid | 1024 | The client’s process ID
comm | java | The client’s process command
src_node | slave-node1 | Which node the source pod is on
src_namespace | default | Namespace of the source pod
src_workload_kind | deployment | Workload kind of the source pod
src_workload_name | business1 | Workload name of the source pod
src_service | business1-svc | One of the services that target the source pod
src_pod | business1-0 | The name of the source pod
src_container | business-container | The name of the source container
src_container_id | 1a2b3c4d5e6f | The shortened container ID, which contains 12 characters
src_ip | 10.1.11.23 | Pod’s IP by default. If the source is not a pod in Kubernetes, this is the IP address of an external entity
dst_node | slave-node2 | Which node the destination pod is on
dst_namespace | default | Namespace of the destination pod
dst_workload_kind | deployment | Workload kind of the destination pod
dst_workload_name | business2 | Workload name of the destination pod
dst_service | business2-svc | One of the services that target the destination pod
dst_pod | business2-0 | The name of the destination pod
dst_container | business-container | The name of the destination container
dst_ip | 10.1.11.24 | Pod’s IP by default. If the destination is not a pod in Kubernetes, this is the IP address of an external entity
dst_port | 80 | The listening port of the destination container, if applicable
dnat_ip | 192.168.12.3 | The IP address of the destination after DNAT, if applicable
dnat_port | 80 | The listening port of the destination container after DNAT, if applicable
success | true | Whether the TCP connection is successfully established
errno | 0 | The error number of the TCP connection. 0 if no error. Note it could also be 0 even if there is an error.

Notes

Note 1: The label success for kindling_tcp_connect_duration_nanoseconds_total is always true.

Note 2: The label errno is non-zero only if the TCP socket is blocking and an error occurred. It can hold multiple possible values. See the ERRORS section of the connect(2) manual for more details.

Note 3: The labels pid and comm are not present when need_process_info is set to false (the default), which reduces the load on Prometheus.
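When reading the errno label described in Note 2, it can help to translate the number into its symbolic name. The sketch below uses Python's standard errno and os modules; describe_errno is a hypothetical helper. Error numbers are platform-specific, so the script should run on the same kind of system (typically Linux) as the host that produced the label.

```python
import errno
import os


def describe_errno(value: int) -> str:
    """Render an errno label value as 'number (NAME: message)'.

    Hypothetical helper; the number is interpreted with the tables of the
    machine running this code, not those of the monitored host.
    """
    if value == 0:
        # 0 means no error was reported; per the labels table, the
        # connection may still have failed.
        return "0 (no error reported)"
    name = errno.errorcode.get(value)
    if name is None:
        return f"{value} (unknown errno on this platform)"
    return f"{value} ({name}: {os.strerror(value)})"
```

For instance, describe_errno(errno.ECONNREFUSED) yields a string containing ECONNREFUSED together with the platform's message for it.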

PromQL Example

Here are some examples of how to use these metrics in Prometheus, which can help you understand them faster.

Description | PromQL
Request counts | sum(increase(kindling_entity_request_total{namespace="$namespace",workload_name="$workload"}[5m])) by(namespace, workload_name)
DNS request counts | sum(increase(kindling_topology_request_total{src_namespace="$namespace",src_workload_name="$workload", protocol="dns"}[5m])) by (src_workload_name)
Latency | sum by(namespace, workload_name) (increase(kindling_entity_request_duration_nanoseconds_total{namespace="$namespace", workload_name="$workload"}[5m])) / sum by(namespace, workload_name) (increase(kindling_entity_request_total{namespace="$namespace", workload_name="$workload"}[5m]))
Error ratio of HTTP requests | sum (increase(kindling_entity_request_total{namespace="$namespace",workload_name="$workload",protocol="http",response_content=~"4..|5.."}[5m])) / sum (increase(kindling_entity_request_total{namespace="$namespace",workload_name="$workload",protocol="http"}[5m])) * 100
Request latency quantile | histogram_quantile(0.99, rate(kindling_topology_request_average_duration_nanoseconds_bucket{dst_namespace="$namespace", dst_workload_name="$workload",protocol="http"}[5m]))
Retransmit times | sum(increase(kindling_tcp_retransmit_total{src_workload_name=~"$source", dst_workload_name=~"$destination"}[5m]))
Packets lost count | sum(increase(kindling_tcp_packet_loss_total{src_workload_name=~"$source", dst_workload_name=~"$destination"}[5m]))
Network sent bytes | sum(increase(kindling_topology_request_request_bytes_total{src_workload_name=~"$source", dst_workload_name=~"$destination"}[5m]))
Network received bytes | sum(increase(kindling_topology_request_response_bytes_total{src_workload_name=~"$source", dst_workload_name=~"$destination"}[5m]))
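The Latency query divides the increase of the duration counter by the increase of the request counter over the same window, which gives an average in nanoseconds. The sketch below reproduces that arithmetic outside Prometheus and converts the result to milliseconds. The function name is hypothetical, and returning NaN for an empty window is an illustrative choice.

```python
def average_latency_ms(duration_ns_increase: float, request_count_increase: float) -> float:
    """Average request latency in milliseconds over one query window.

    duration_ns_increase: increase of
        kindling_entity_request_duration_nanoseconds_total in the window.
    request_count_increase: increase of kindling_entity_request_total
        in the same window.
    """
    if request_count_increase <= 0:
        # No requests in the window, so the average is undefined.
        return float("nan")
    # Nanoseconds per request, converted to milliseconds.
    return duration_ns_increase / request_count_increase / 1_000_000
```

With 1500 requests taking 3,000,000,000 ns in total, average_latency_ms(3_000_000_000, 1500) is 2.0 ms.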