
Ping monitoring between Kubernetes nodes

Published: 2019-10-20 07:32 Category: [Optimization] Source: dummy
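When diagnosing problems in a Kubernetes cluster, we often notice that one of the nodes is flapping*, usually at random and in odd ways. That is why we have long needed a tool that tests the reachability of one node from another and presents the results as Prometheus metrics. With such a tool we also wanted to build charts in Grafana and quickly locate the failed node (and, if necessary, reschedule all Pods from that node and carry out the required maintenance).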

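* By "flapping" I mean behavior where a node randomly becomes "NotReady" and later returns to normal. For example, part of the traffic may fail to reach Pods on neighboring nodes.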

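Why does this happen? One common cause is connectivity problems at the data center switch. For example, we once set up a vswitch in Hetzner, where one of the nodes became unavailable through that vswitch port and happened to be completely unreachable on the local network.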

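Our last requirement was that the service run directly in Kubernetes, so that we could deploy everything via a Helm chart. (With Ansible, for example, we would have to define roles for each of the various environments: AWS, GCE, bare metal, and so on.) Since we had not found a ready-made solution for this, we decided to implement our own.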

Script and configuration

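The main component of our solution is a script that watches the .status.addresses value of each node. If this value changes for some node (for example, a new node has been added), the script uses Helm values to pass the list of nodes to the Helm chart in the form of a ConfigMap: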

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ping-exporter-config
      namespace: d8-system
    data:
      nodes.json: >
        {{ .Values.pingExporter.targets | toJson }}

.Values.pingExporter.targets looks like the following:

    {"cluster_targets":[{"ipAddress":"192.168.191.11","name":"kube-a-3"},
                        {"ipAddress":"192.168.191.12","name":"kube-a-2"},
                        {"ipAddress":"192.168.191.22","name":"kube-a-1"},
                        {"ipAddress":"192.168.191.23","name":"kube-db-1"},
                        {"ipAddress":"192.168.191.9","name":"kube-db-2"},
                        {"ipAddress":"51.75.130.47","name":"kube-a-4"}],
     "external_targets":[{"host":"8.8.8.8","name":"google-dns"},
                         {"host":"youtube.com"}]}
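The article does not show the script that turns node objects into this structure, so the following is only a minimal sketch of the idea. It assumes the input has the shape of `kubectl get nodes -o json` and that the `InternalIP` entry of `.status.addresses` is the address to ping; the function name `build_cluster_targets` and the sample data are hypothetical.

```python
import json


def build_cluster_targets(node_list, address_type="InternalIP"):
    """Turn a Kubernetes NodeList dict into a list of ping targets.

    Assumption: one address of the given type per node is enough.
    """
    targets = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for addr in node.get("status", {}).get("addresses", []):
            if addr.get("type") == address_type:
                targets.append({"ipAddress": addr["address"], "name": name})
                break  # take the first matching address only
    # Sort so that the rendered ConfigMap is stable between runs.
    return sorted(targets, key=lambda t: t["name"])


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "kube-a-2"},
            "status": {"addresses": [
                {"type": "Hostname", "address": "kube-a-2"},
                {"type": "InternalIP", "address": "192.168.191.12"},
            ]},
        },
        {
            "metadata": {"name": "kube-a-1"},
            "status": {"addresses": [
                {"type": "InternalIP", "address": "192.168.191.22"},
            ]},
        },
    ]
}

cluster_targets = build_cluster_targets(sample_nodes)
print(json.dumps({"cluster_targets": cluster_targets, "external_targets": []}))
```

Sorting the targets is a design choice of this sketch: it keeps the generated values identical when the node set has not changed, so the ConfigMap is only updated when an address really differs.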

Here is the Python script:

    #!/usr/bin/env python3

    import subprocess
    import prometheus_client
    import re
    import statistics
    import os
    import json
    import glob
    import better_exchook
    import datetime

    better_exchook.install()

    # fping sends 30 pings per target (-C 30), 1000 ms apart (-p 1000), quietly (-q)
    FPING_CMDLINE = "/usr/sbin/fping -p 1000 -C 30 -B 1 -q -r 1".split(" ")
    # One line of fping -C output looks like "host : rtt rtt - rtt ..."
    FPING_REGEX = re.compile(r"^(\S*)\s*: (.*)$", re.MULTILINE)
    CONFIG_PATH = "/config/targets.json"

    registry = prometheus_client.CollectorRegistry()

    prometheus_exceptions_counter = \
        prometheus_client.Counter('kube_node_ping_exceptions', 'Total number of exceptions', [], registry=registry)

    # Metrics for cluster nodes, labeled by node name and IP address
    prom_metrics_cluster = {"sent": prometheus_client.Counter('kube_node_ping_packets_sent_total',
                                                      'ICMP packets sent',
                                                      ['destination_node', 'destination_node_ip_address'],
                                                      registry=registry),
                            "received": prometheus_client.Counter('kube_node_ping_packets_received_total',
                                                          'ICMP packets received',
                                                          ['destination_node', 'destination_node_ip_address'],
                                                          registry=registry),
                            "rtt": prometheus_client.Counter('kube_node_ping_rtt_milliseconds_total',
                                                     'round-trip time',
                                                     ['destination_node', 'destination_node_ip_address'],
                                                     registry=registry),
                            "min": prometheus_client.Gauge('kube_node_ping_rtt_min', 'minimum round-trip time',
                                                   ['destination_node', 'destination_node_ip_address'],
                                                   registry=registry),
                            "max": prometheus_client.Gauge('kube_node_ping_rtt_max', 'maximum round-trip time',
                                                   ['destination_node', 'destination_node_ip_address'],
                                                   registry=registry),
                            "mdev": prometheus_client.Gauge('kube_node_ping_rtt_mdev',
                                                    'mean deviation of round-trip times',
                                                    ['destination_node', 'destination_node_ip_address'],
                                                    registry=registry)}

    # Metrics for external targets, labeled by target name and host
    prom_metrics_external = {"sent": prometheus_client.Counter('external_ping_packets_sent_total',
                                                       'ICMP packets sent',
                                                       ['destination_name', 'destination_host'],
                                                       registry=registry),
                             "received": prometheus_client.Counter('external_ping_packets_received_total',
                                                           'ICMP packets received',
                                                           ['destination_name', 'destination_host'],
                                                           registry=registry),
                             "rtt": prometheus_client.Counter('external_ping_rtt_milliseconds_total',
                                                      'round-trip time',
                                                      ['destination_name', 'destination_host'],
                                                      registry=registry),
                             "min": prometheus_client.Gauge('external_ping_rtt_min', 'minimum round-trip time',
                                                    ['destination_name', 'destination_host'],
                                                    registry=registry),
                             "max": prometheus_client.Gauge('external_ping_rtt_max', 'maximum round-trip time',
                                                    ['destination_name', 'destination_host'],
                                                    registry=registry),
                             "mdev": prometheus_client.Gauge('external_ping_rtt_mdev',
                                                     'mean deviation of round-trip times',
                                                     ['destination_name', 'destination_host'],
                                                     registry=registry)}


    def validate_envs():
        envs = {"MY_NODE_NAME": os.getenv("MY_NODE_NAME"), "PROMETHEUS_TEXTFILE_DIR": os.getenv("PROMETHEUS_TEXTFILE_DIR"),
                "PROMETHEUS_TEXTFILE_PREFIX": os.getenv("PROMETHEUS_TEXTFILE_PREFIX")}

        for k, v in envs.items():
            if not v:
                raise ValueError("{} environment variable is empty".format(k))

        return envs


    @prometheus_exceptions_counter.count_exceptions()
    def compute_results(results):
        computed = {}

        matches = FPING_REGEX.finditer(results)
        for match in matches:
            host = match.group(1)
            ping_results = match.group(2)
            if "duplicate" in ping_results:
                continue
            splitted = ping_results.split(" ")
            if len(splitted) != 30:
                raise ValueError("ping returned wrong number of results: \"{}\"".format(splitted))

            # "-" marks a lost packet; everything else is a round-trip time
            positive_results = [float(x) for x in splitted if x != "-"]
            if len(positive_results) > 0:
                computed[host] = {"sent": 30, "received": len(positive_results),
                                  "rtt": sum(positive_results),
                                  "max": max(positive_results), "min": min(positive_results),
                                  "mdev": statistics.pstdev(positive_results)}
            else:
                computed[host] = {"sent": 30, "received": len(positive_results), "rtt": 0,
                                  "max": 0, "min": 0, "mdev": 0}
        if not len(computed):
            raise ValueError("regex match\"{}\" found nothing in fping output \"{}\"".format(FPING_REGEX, results))
        return computed


    @prometheus_exceptions_counter.count_exceptions()
    def call_fping(ips):
        cmdline = FPING_CMDLINE + ips
        process = subprocess.run(cmdline, stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT, universal_newlines=True)
        if process.returncode == 3:
            raise ValueError("invalid arguments: {}".format(cmdline))
        if process.returncode == 4:
            # stderr is merged into stdout above, so process.stderr would be None
            raise OSError("fping reported syscall error: {}".format(process.stdout))

        return process.stdout


    envs = validate_envs()

    # Remove stale metric files left over from a previous run
    files = glob.glob(envs["PROMETHEUS_TEXTFILE_DIR"] + "*")
    for f in files:
        os.remove(f)

    labeled_prom_metrics = {"cluster_targets": [], "external_targets": []}

    while True:
        # Re-read the target list on every iteration to pick up ConfigMap changes
        with open(CONFIG_PATH, "r") as f:
            config = json.loads(f.read())
            config["external_targets"] = [] if config["external_targets"] is None else config["external_targets"]
            for target in config["external_targets"]:
                target["name"] = target["host"] if "name" not in target.keys() else target["name"]

        # Drop metrics of targets that are no longer in the config
        if labeled_prom_metrics["cluster_targets"]:
            for metric in labeled_prom_metrics["cluster_targets"]:
                if (metric["node_name"], metric["ip"]) not in [(node["name"], node["ipAddress"]) for node in config['cluster_targets']]:
                    for k, v in prom_metrics_cluster.items():
                        v.remove(metric["node_name"], metric["ip"])

        if labeled_prom_metrics["external_targets"]:
            for metric in labeled_prom_metrics["external_targets"]:
                if (metric["target_name"], metric["host"]) not in [(target["name"], target["host"]) for target in config['external_targets']]:
                    for k, v in prom_metrics_external.items():
                        v.remove(metric["target_name"], metric["host"])

        labeled_prom_metrics = {"cluster_targets": [], "external_targets": []}

        for node in config["cluster_targets"]:
            metrics = {"node_name": node["name"], "ip": node["ipAddress"], "prom_metrics": {}}

            for k, v in prom_metrics_cluster.items():
                metrics["prom_metrics"][k] = v.labels(node["name"], node["ipAddress"])

            labeled_prom_metrics["cluster_targets"].append(metrics)

        for target in config["external_targets"]:
            metrics = {"target_name": target["name"], "host": target["host"], "prom_metrics": {}}

            for k, v in prom_metrics_external.items():
                metrics["prom_metrics"][k] = v.labels(target["name"], target["host"])

            labeled_prom_metrics["external_targets"].append(metrics)

        out = call_fping([prom_metric["ip"] for prom_metric in labeled_prom_metrics["cluster_targets"]] +
                         [prom_metric["host"] for prom_metric in labeled_prom_metrics["external_targets"]])
        computed = compute_results(out)

        for dimension in labeled_prom_metrics["cluster_targets"]:
            result = computed[dimension["ip"]]
            dimension["prom_metrics"]["sent"].inc(result["sent"])
            dimension["prom_metrics"]["received"].inc(result["received"])
            dimension["prom_metrics"]["rtt"].inc(result["rtt"])
            dimension["prom_metrics"]["min"].set(result["min"])
            dimension["prom_metrics"]["max"].set(result["max"])
            dimension["prom_metrics"]["mdev"].set(result["mdev"])

        for dimension in labeled_prom_metrics["external_targets"]:
            result = computed[dimension["host"]]
            dimension["prom_metrics"]["sent"].inc(result["sent"])
            dimension["prom_metrics"]["received"].inc(result["received"])
            dimension["prom_metrics"]["rtt"].inc(result["rtt"])
            dimension["prom_metrics"]["min"].set(result["min"])
            dimension["prom_metrics"]["max"].set(result["max"])
            dimension["prom_metrics"]["mdev"].set(result["mdev"])

        prometheus_client.write_to_textfile(
            envs["PROMETHEUS_TEXTFILE_DIR"] + envs["PROMETHEUS_TEXTFILE_PREFIX"] + envs["MY_NODE_NAME"] + ".prom", registry)

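The script runs on every Kubernetes node and sends ICMP packets to all instances of the Kubernetes cluster twice per second. The collected results are stored in text files.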
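To make the parsing step easier to follow, here is a small standalone sketch of how one batch of fping output can be reduced to per-host figures. It reuses the same regular expression as the script, but the `summarize` function, the `loss` field, and the four-ping sample output are hypothetical and only illustrate the approach (the script itself expects exactly 30 results per host).

```python
import re
import statistics

# Same pattern as in the script: "host : rtt rtt - rtt ..."
FPING_REGEX = re.compile(r"^(\S*)\s*: (.*)$", re.MULTILINE)


def summarize(fping_output):
    """Reduce fping -C style output to per-host counts and loss ratio."""
    summary = {}
    for match in FPING_REGEX.finditer(fping_output):
        host = match.group(1)
        fields = match.group(2).split(" ")
        # "-" marks a lost packet; anything else is an RTT in milliseconds.
        rtts = [float(x) for x in fields if x != "-"]
        summary[host] = {
            "sent": len(fields),
            "received": len(rtts),
            "loss": 1 - len(rtts) / len(fields),
            "mdev": statistics.pstdev(rtts) if rtts else 0.0,
        }
    return summary


# Hypothetical output for two targets and four pings each.
sample_output = (
    "192.168.191.11 : 0.25 0.75 - 0.50\n"
    "8.8.8.8        : - - - -\n"
)

result = summarize(sample_output)
for host, stats in result.items():
    print(host, stats)
```

A host that answers no pings still gets an entry, with `received` set to 0, which mirrors how the script keeps reporting unreachable nodes instead of dropping them.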
