Description of monitoring alarm items

Overview

The Rainbond monitoring service is provided by the rbd-monitor component. rbd-monitor uses the Sidecar design pattern to integrate Prometheus, dynamically discovers the targets to monitor through etcd, and automatically configures and manages the Prometheus service. It scrapes metric data from each target at regular intervals, persists the data locally, and supports flexible PromQL queries and RESTful API queries.

Architecture Diagram:

Access method

rbd-monitor listens on port 9999 by default, and the default installation already creates a Service object for it. After obtaining the Service IP from the cluster, add a third-party service on the platform and open an external port to access it.

How to get the Service IP

$ kubectl get service rbd-monitor -n rbd-system
NAME          TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
rbd-monitor   ClusterIP   10.68.140.5   <none>        9999/TCP   7h11m
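Once the Service IP is known, metrics can be queried over HTTP. The following is a minimal sketch, assuming rbd-monitor exposes the standard Prometheus HTTP API path `/api/v1/query` on port 9999; the source only states that PromQL and RESTful queries are supported, so verify the path against your installation. The helper names `build_query_url` and `instant_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression.

    Assumes the standard Prometheus path /api/v1/query (not confirmed by the source).
    """
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run an instant query and return the list of result series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # Prometheus-style responses carry "status": "success" on success.
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, `instant_query("http://10.68.140.5:9999", "up")` would return one series per scraped target, where `10.68.140.5` is the CLUSTER-IP from the sample output and must be replaced with your own.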

For the exact monitoring and alarm items, check rbd-monitor itself; the lists below are for reference only.

Monitoring items

Node resource monitoring items

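| Monitoring item | Component | Description |
| --- | --- | --- |
| cadvisor_version_info | cadvisor | Node system information |
| machine_memory_bytes | cadvisor | Memory size of the current host |
| machine_cpu_cores | cadvisor | Number of CPU cores on the current node |
| node_filesystem_size | node | Storage (filesystem size) |
| node_load1 | node | Load average over 1 minute |
| node_load5 | node | Load average over 5 minutes |
| node_load15 | node | Load average over 15 minutes |
| node_memory_MemTotal | node | Total node memory |
| node_memory_MemFree | node | Free node memory |
| node_uname_info | node | Node information |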
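The raw node metrics are usually combined into derived values before they are read or alerted on. As a rough sketch (not Rainbond's own calculation), the hypothetical helper `node_summary` below takes one sample per metric name and derives a memory usage percentage and a per-core load. Note that `node_memory_MemFree` excludes buffers and page cache, so this simplification overstates the memory that is truly in use.

```python
def node_summary(samples: dict) -> dict:
    """Derive simple health figures from one sample per node metric.

    `samples` maps metric names from the table above to their latest values.
    """
    total = samples["node_memory_MemTotal"]
    free = samples["node_memory_MemFree"]
    cores = samples["machine_cpu_cores"]
    return {
        # Share of memory not reported as free (buffers/cache count as used here).
        "memory_used_percent": 100.0 * (total - free) / total,
        # A 1-minute load near or above 1.0 per core suggests CPU saturation.
        "load1_per_core": samples["node_load1"] / cores,
    }
```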

Rainbond Service Component Monitoring Items

| Monitoring item | Component | Description |
| --- | --- | --- |
| acp_mq_dequeue_number | rbd-mq | |
| acp_mq_enqueue_number | rbd-mq | |
| acp_mq_exporter_health_status | rbd-mq | |
| acp_mq_exporter_last_scrape_error | rbd-mq | |
| acp_mq_exporter_scrapes_total | rbd-mq | |
| builder_exporter_builder_task_error | rbd-chaos | Number of failed source code build tasks |
| builder_exporter_builder_task_number | rbd-chaos | Number of source code build tasks |
| builder_exporter_health_status | rbd-chaos | Component status; 1 means healthy |
| event_log_exporter_chan_cache_size | rbd-eventlog | |
| event_log_exporter_collector_duration_seconds | rbd-eventlog | |
| event_log_exporter_container_log_store_cache_barrel_count | rbd-eventlog | |
| event_log_exporter_container_log_store_log_count | rbd-eventlog | |
| event_log_exporter_event_store_barrel_count | rbd-eventlog | |
| event_log_exporter_event_store_cache_barrel_count | rbd-eventlog | |
| event_log_exporter_event_store_log_count | rbd-eventlog | |
| event_log_exporter_health_status | rbd-eventlog | |
| event_log_exporter_last_scrape_error | rbd-eventlog | |
| event_log_exporter_monitor_store_barrel_count | rbd-eventlog | |
| event_log_exporter_monitor_store_log_count | rbd-eventlog | |
| event_log_exporter_scrapes_total | rbd-eventlog | |
| gateway_request_duration_seconds_bucket | rbd-gateway | Number of client requests within the specified request duration (bucket) |
| gateway_request_duration_seconds_count | rbd-gateway | Total number of client requests |
| gateway_request_duration_seconds_sum | rbd-gateway | Total client request time |
| gateway_request_size_bucket | rbd-gateway | Number of requests within the specified request size (bucket) |
| gateway_request_size_count | rbd-gateway | Total number of client requests |
| gateway_request_size_sum | rbd-gateway | Total size of client requests |
| gateway_requests | rbd-gateway | Number of client visits |
| gateway_response_duration_seconds_bucket | rbd-gateway | Number of responses within the specified response duration (bucket) |
| gateway_response_duration_seconds_count | rbd-gateway | Total number of responses |
| gateway_response_duration_seconds_sum | rbd-gateway | Total response time |
| gateway_response_size_bucket | rbd-gateway | Number of responses within the specified response size (bucket) |
| gateway_response_size_count | rbd-gateway | Total number of responses |
| gateway_response_size_sum | rbd-gateway | Total size of responses |
| gateway_upstream_latency_seconds | rbd-gateway | Number of latency observations within the specified latency (bucket) |
| gateway_upstream_latency_seconds_count | rbd-gateway | Total number of latency observations |
| gateway_upstream_latency_seconds_sum | rbd-gateway | Sum of latency times |
| worker_exporter_health_status | rbd-worker | |
| worker_exporter_worker_task_number | rbd-worker | |
| worker_exporter_collector_duration_seconds | rbd-worker | |
| worker_exporter_last_scrape_error | rbd-worker | |
| worker_exporter_scrapes_total | rbd-worker | |
| worker_exporter_worker_task_error | rbd-worker | |
| worker_up | rbd-worker | |
| scrape_samples_scraped | | |
| scrape_samples_post_metric_relabeling | | |
| scrape_duration_seconds | | |
| statsd_exporter_build_info | | |
| statsd_exporter_events_total | | |
| statsd_exporter_lines_total | | |
| statsd_exporter_loaded_mappings | | |
| statsd_exporter_samples_total | | |
| statsd_exporter_tag_errors_total | | |
| statsd_exporter_tags_total | | |
| statsd_exporter_tcp_connection_errors_total | | |
| statsd_exporter_tcp_connections_total | | |
| statsd_exporter_tcp_too_long_lines_total | | |
| statsd_exporter_udp_packets_total | | |
| up | | Component status |
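The gateway `_bucket`, `_sum`, and `_count` series follow the usual histogram layout, so an average and an approximate percentile can be derived from them. The sketch below is illustrative only: it assumes the buckets are cumulative and end with a `+Inf` bucket, as in standard Prometheus histograms, and it mimics the linear interpolation idea behind PromQL's `histogram_quantile` rather than reproducing it exactly. The function names are hypothetical.

```python
import math


def average_duration(duration_sum: float, duration_count: float) -> float:
    """Mean request time from the _sum and _count series."""
    return duration_sum / duration_count if duration_count else math.nan


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    The pair with upper bound +Inf must be present and holds the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the rank for the 95th percentile is 95, which falls halfway through the 0.5 to 1.0 bucket, giving an estimate of 0.75 seconds.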

Application-level monitoring items

| Monitoring item | Description |
| --- | --- |
| app_resource_appmemory | Application memory; filter by service_id and tenant_id |
| app_resource_appfs | Application |
| app_client_request | Application |
| app_client_requesttime | Application |
| app_request | Application |
| app_request_unusual | Application |
| app_requestclient | Application |
| app_requesttime | Application |

Typical application-level metrics collected through cAdvisor

| Monitoring item | Type | Description |
| --- | --- | --- |
| container_cpu_load_average_10s | gauge | Average CPU load of the container over the past 10 seconds |
| container_cpu_usage_seconds_total | counter | Cumulative CPU time consumed by the container on each CPU core (unit: seconds) |
| container_cpu_system_seconds_total | counter | Cumulative system CPU time (unit: seconds) |
| container_cpu_user_seconds_total | counter | Cumulative user CPU time (unit: seconds) |
| container_fs_usage_bytes | gauge | File system usage in the container (unit: bytes) |
| container_fs_limit_bytes | gauge | Total file system space the container can use (unit: bytes) |
| container_fs_reads_bytes_total | counter | Cumulative amount of data read by the container (unit: bytes) |
| container_fs_writes_bytes_total | counter | Cumulative amount of data written by the container (unit: bytes) |
| container_memory_max_usage_bytes | gauge | Maximum memory usage of the container (unit: bytes) |
| container_memory_usage_bytes | gauge | Current memory usage of the container (unit: bytes) |
| container_spec_memory_limit_bytes | gauge | Memory usage limit of the container |
| container_network_receive_bytes_total | counter | Cumulative amount of data received over the container network (unit: bytes) |
| container_network_transmit_bytes_total | counter | Cumulative amount of data transmitted over the container network (unit: bytes) |
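The type column matters when reading these values: a gauge can be used directly, while a counter only becomes meaningful as a rate between two samples. The sketch below shows both cases with hypothetical helper names. It treats a decreasing counter as a restart from zero, and it assumes that a non-positive `container_spec_memory_limit_bytes` means no limit is set.

```python
from typing import Optional


def counter_rate(t0: float, v0: float, t1: float, v1: float) -> float:
    """Per-second increase of a counter between two (timestamp, value) samples."""
    if t1 <= t0:
        raise ValueError("t1 must be later than t0")
    # A drop in value means the counter was reset and restarted from 0.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


def memory_limit_ratio(usage_bytes: float, limit_bytes: float) -> Optional[float]:
    """Ratio of container_memory_usage_bytes to container_spec_memory_limit_bytes."""
    if limit_bytes <= 0:
        # Assumption: a non-positive limit means the container is unlimited.
        return None
    return usage_bytes / limit_bytes
```

For instance, two samples of `container_cpu_usage_seconds_total` taken 60 seconds apart with values 100 and 130 give a rate of 0.5, which means the container used half a CPU core on average during that minute.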

Other monitoring items

| Monitoring item | Description |
| --- | --- |
| process_cpu_seconds_total | |
| process_max_fds | |
| process_open_fds | |
| process_virtual_memory_bytes | |
| process_start_time_seconds | |
| process_resident_memory_bytes | |

Description of alarm rules

Component monitoring alarm

| Alarm item | Alarm information |
| --- | --- |
| api service is offline | APIDown |
| chaos service is offline | BuilderDown |
| chaos component status is abnormal | BuilderUnhealthy |
| Number of failed source code build tasks is greater than 30 | BuilderTaskError |
| etcd service is offline | EtcdDown |
| etcd leader node is offline | EtcdLoseLeader |
| etcd cluster member exception | InsufficientMembers |
| etcd cluster leader changes | HighNumberOfLeaderChanges |
| etcd gRPC failed requests are greater than 0.05 | HighNumberOfFailedGRPCRequests |
| etcd HTTP request failures within 1 minute are greater than 0.05 | HighNumberOfFailedHTTPRequests |
| etcd gRPC slow queries within 1 minute are greater than 0.15 | GRPCRequestsSlow |
| etcd disk space usage exceeds 80% | DatabaseSpaceExceeded |
| eventlog component status is abnormal | EventLogUnhealthy |
| eventlog service is offline | EventLogDown |
| gateway service is offline | GatewayDown |
| gateway request size exceeds 10M | RequestSizeTooMuch |
| gateway requests per second exceed 200 | RequestMany |
| Number of bad gateway requests within 10s is greater than 5 | FailureRequestMany |
| mq service is offline | MqDown |
| mq component status is abnormal | MqUnhealthy |
| A task has been in the mq message queue for longer than 1 minute | MqMessageQueueBlock |
| webcli service is offline | WebcliDown |
| webcli component status is abnormal | WebcliUnhealthy |
| Errors while executing commands from webcli exceed 5 per second | WebcliUnhealthy |
| worker service is offline | WorkerDown |
| worker component status is abnormal | WorkerUnhealthy |
| Number of worker task execution errors is greater than 50 | WorkerTaskError |

Cluster monitoring alarm

Alarm itemAlarm information
| Alarm item | Alarm information |
| --- | --- |
| A Rainbond cluster node is unhealthy | RbdNodeUnhealth |
| A K8s cluster node is unhealthy | KubeNodeUnhealth |
| Collecting cluster information takes more than 10s | ClusterCollectorTimeout |
| The tenant's resource usage exceeds the resource limit | InsufficientTenantResources |
| A node is offline | NodeDown |
| Node CPU usage is greater than 70% within 5 minutes | HighCpuUsageOnNode |
| Available cluster memory is less than 2GB | InsufficientClusterMemoryResources |
| Available cluster CPU is less than 500m | InsufficientClusterCPUResources |
| Node load is greater than 5 within 5 minutes | HighLoadOnNode |
| The free inode ratio on a node is less than 0.3 | InodeFreerateLow |
| Disk usage of the node's root partition is greater than 85% | HighRootdiskUsageOnNode |
| Disk usage of the node's Docker partition is greater than 85% | HighDockerdiskUsageOnNode |
| Node memory usage is greater than 80% | HighMemoryUsageOnNode |
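Several of the node-level rows are plain threshold comparisons, which can be sketched as a small lookup table. This is only an illustration of how to read the thresholds, not the rule definitions shipped with rbd-monitor. The alarm names and thresholds come from the list above, while the input keys such as `memory_used_percent` are hypothetical.

```python
import operator

# (alarm name, hypothetical input key, comparison, threshold from the list above)
RULES = [
    ("HighMemoryUsageOnNode", "memory_used_percent", operator.gt, 80),
    ("HighRootdiskUsageOnNode", "rootdisk_used_percent", operator.gt, 85),
    ("HighDockerdiskUsageOnNode", "dockerdisk_used_percent", operator.gt, 85),
    ("InodeFreerateLow", "inode_free_ratio", operator.lt, 0.3),
]


def firing_alarms(node_values: dict) -> list:
    """Return the names of alarms whose threshold is crossed, in table order."""
    return [
        name
        for name, key, compare, threshold in RULES
        # Missing inputs are skipped rather than treated as violations.
        if key in node_values and compare(node_values[key], threshold)
    ]
```

A node reporting 91% memory usage, 40% root disk usage, and a free inode ratio of 0.1 would trigger `HighMemoryUsageOnNode` and `InodeFreerateLow` under this reading.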

For cluster monitoring and alarm configuration, see Monitoring and Alarm Deployment