Xiaomi Galaxy Talos Book

Talos monitoring and alarm


To give a full picture of system status and user activity, Talos provides a fairly complete Counter system. The Counter system was designed to expose the following information:

  • System performance and load; for example, latency and QPS

  • Data storage; for example, a Topic's data volume and a Partition's current offset range: [start, end]

  • Data consumption; for example, a consumer group's committed checkpoint and its backlog of unconsumed data

For this purpose, the Metrics provided by the Talos system fall roughly into five categories. This section mainly describes the Topic-level and Consumer-level information that users care about; users can configure alarms on these Metrics to monitor the status of their Topics and their data consumption.

Metrics

1) Topic/Partition:
  • StartOffset: the start offset of the topic/partition; it increases as data expires under the retention policy
  • EndOffset: the end offset of the topic/partition
  • MessageBytes: the total size, in bytes, of the Messages currently stored in the topic/partition
  • MessageNumber: the number of Messages currently stored in the topic/partition

  • In addition, every Operation (API) has both Latency and QPS Metrics. The corresponding MetricNames are as follows:

Latency:
API         MetricName Set
putMessage  putMessage.Time.75thPercentile
--          putMessage.Time.95thPercentile
--          putMessage.Time.98thPercentile
--          putMessage.Time.999thPercentile
getMessage  getMessage.Time.75thPercentile
--          getMessage.Time.95thPercentile
--          getMessage.Time.98thPercentile
--          getMessage.Time.999thPercentile

Note: all Percentiles are computed over samples from the last 5 minutes.

QPS:
API         MetricName Set
putMessage  putMessage.60sRate
--          putMessage.300sRate
--          putMessage.900sRate
getMessage  getMessage.60sRate
--          getMessage.300sRate
--          getMessage.900sRate

Here, 60s/300s/900s in a QPS MetricName is the time window over which QPS is calculated: QPS over the last minute, the last 5 minutes, and the last 15 minutes, respectively.
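The naming scheme in the Latency and QPS tables is regular enough to generate or parse in a script, for example when preparing a list of Metrics to request alarms for. The following is a minimal sketch, not part of any Talos SDK: the helper names `latency_metric_names`, `qps_metric_names`, and `rate_window_seconds` are hypothetical, and it assumes every QPS MetricName ends in `<seconds>sRate`.

```python
import re

# Suffixes copied from the Latency and QPS tables in this section.
PERCENTILES = ("75th", "95th", "98th", "999th")
RATE_WINDOWS_SECONDS = (60, 300, 900)

# Assumed pattern for QPS MetricNames: "<api>.<seconds>sRate".
_RATE_PATTERN = re.compile(r"^(?P<api>\w+)\.(?P<window>\d+)sRate$")


def latency_metric_names(api):
    """Return the latency MetricNames for an API such as 'putMessage'."""
    return [f"{api}.Time.{p}Percentile" for p in PERCENTILES]


def qps_metric_names(api):
    """Return the QPS MetricNames for an API such as 'getMessage'."""
    return [f"{api}.{w}sRate" for w in RATE_WINDOWS_SECONDS]


def rate_window_seconds(metric_name):
    """Extract the QPS time window, in seconds, from a QPS MetricName."""
    match = _RATE_PATTERN.match(metric_name)
    if match is None:
        raise ValueError(f"not a QPS MetricName: {metric_name!r}")
    return int(match.group("window"))
```

For instance, `rate_window_seconds("putMessage.300sRate")` yields the 5-minute window expressed in seconds, and a latency name such as `putMessage.Time.95thPercentile` is rejected with a `ValueError`.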

2) Consumer
  • ConsumerCommitOffset: the offset most recently consumed and successfully committed by a consumerGroup
  • ConsumerOffsetLag: the difference between the consumer's commitOffset and the latest message offset of the Topic/Partition; it is used to detect message accumulation during consumption
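The relationship between the two Consumer Metrics can be illustrated with a little arithmetic. The sketch below is an assumption about how the lag could be derived, not the Talos implementation: it treats the lag as the latest offset minus the committed offset, floors it at zero, and sums it over partitions. The function names and the `(end_offset, commit_offset)` tuple layout are hypothetical.

```python
def consumer_offset_lag(end_offset, commit_offset):
    """Lag for one partition, assuming lag = latest offset - committed offset.

    The result is floored at 0 so that a commit offset at or past the
    latest offset is reported as "no backlog" rather than a negative number.
    """
    return max(end_offset - commit_offset, 0)


def group_lag(partition_offsets):
    """Total lag for a consumer group across partitions.

    partition_offsets maps a partition id to an
    (end_offset, commit_offset) tuple; this layout is illustrative only.
    """
    return sum(
        consumer_offset_lag(end_offset, commit_offset)
        for end_offset, commit_offset in partition_offsets.values()
    )
```

With partition 0 at `(15000, 5000)` and partition 1 fully caught up at `(8000, 8000)`, `group_lag` returns 10000, which is the kind of value an alarm on ConsumerOffsetLag would be compared against.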

Alert

We are in the process of turning the alarm system into a web service. At present, alarms are configured in the backend. To set up a monitoring alarm (email/SMS) on any of the Metrics above, please send an email to talos-help@xiaomi.com

Please fill out the application form. The following is an example (for layout reasons, the form is split into two parts):

Cluster        TopicName  MetricName                      Alert-Value
azbjsrv-talos  testTopic  putMessage.Time.95thPercentile  100ms
azbjsrv-talos  testTopic  ConsumerOffsetLag               10000

Email             Phone       ConsumerGroup
alert@xiaomi.com  1877777777  None
alert@xiaomi.com  1877777777  myConsumerGroupName
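To make the meaning of each form column concrete, the sketch below models one row of the application form and checks an observed Metric value against its Alert-Value. It is illustrative only: the `AlertRule` class, the `threshold` and `should_alert` helpers, and the rule that an alarm fires when the observed value is strictly greater than the Alert-Value are all assumptions, not a description of how the Talos alarm backend evaluates rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """One row of the application form (hypothetical representation)."""
    cluster: str
    topic: str
    metric_name: str
    alert_value: str          # as written in the form, e.g. "100ms" or "10000"
    email: str
    phone: str
    consumer_group: Optional[str] = None  # only needed for Consumer Metrics


def threshold(alert_value):
    """Convert an Alert-Value such as '100ms' or '10000' to a number."""
    text = alert_value.strip()
    if text.endswith("ms"):
        text = text[:-2]  # drop the millisecond unit used for latency values
    return float(text)


def should_alert(rule, observed):
    """Assumed semantics: alert when the observed value exceeds the threshold."""
    return observed > threshold(rule.alert_value)


latency_rule = AlertRule(
    cluster="azbjsrv-talos",
    topic="testTopic",
    metric_name="putMessage.Time.95thPercentile",
    alert_value="100ms",
    email="alert@xiaomi.com",
    phone="1877777777",
)
```

Under these assumptions, `should_alert(latency_rule, 120.0)` is true and `should_alert(latency_rule, 80.0)` is false. A ConsumerOffsetLag rule would additionally set `consumer_group`, matching the third column of the second table.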