博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Understanding, Operating and Monitoring Apache Kafka
阅读量:5174 次
发布时间:2019-06-13

本文共 8564 字,大约阅读时间需要 28 分钟。

Apache Kafka is an attractive service because it’s conceptually simple and powerful. It’s easy to understand writing messages to a log in one place, then reading messages from that log in another place. This simplicity not only allows for a nice separation of concerns, but also is relatively easier to understand than more complex alternatives. Plus, Kafka is powerful enough to pass messages at wire speed, going as fast as your network allows.

Understanding Apache Kafka

At the same time, Apache Kafka has a surprising number of moving parts that can be monitored, and it can be a little hard to know even where to start. In broad strokes, there are two categories to monitor: infrastructure and application performance. Most of the metric space for Kafka centers on measuring the movement of messages over a network, and the underlying mechanisms that support correct passage/delivery of those messages. Kafka labels a subset of the metrics involved as “topics” and “consumer groups,” which typically address application performance, providing answers to questions such as “how many messages have been sent to topic X,” or, “how far behind is consumer group Y?”

The Big Picture

To make things a bit more complicated, Kafka provides node-level metrics, and doesn’t provide a coordinated rollup across the logical service. So, you may want to monitor “active.controller.count” but each node tells you “zero” and only the one coordinator node tells you “one,” and that might move around from node to node as instances are cycled. Or, you may want to monitor “messages.in” but each node only tells you the number of messages it’s seen, rather than the sum total across the brokers. In these cases, it is necessary to create an aggregation across the nodes, to get the sum total of messages or partitions or other kinds of counts.

Now that we have the big picture, let’s get specific. There are lots of good sources for comprehensive monitoring and laundry lists of metrics, so instead I’ll give a couple of embarrassing stories from our own experience when we were first adopting this wonderful technology.

Embarrassing story #1:

It doesn’t work when the disk is full. I think this one must happen to everyone at some point, because you sort of hope that Kafka will manage disk space holistically. Instead, Kafka has a fine-grained policy for the limits of a partition. It doesn’t have an overall limit and it cannot work backward  to keep the disk from becoming full by throwing away older data. Instead, it just fills the disk and tips over, as if that’s what you asked it to do, even though you didn’t mean to really. So, we had just rolled out a small staging cluster with mostly default settings, and within a few days the disks were full and everything failed. Luckily, it failed fast.

If you tell this embarrassing story to an experienced Kafka dev, they might say “oh, you didn’t configure things properly.” And, yes, it is possible to take a guess at how many topics you need to support at most, and then work backward from a number of partitions and configure settings that might keep the disk from filling so long as you stay under that number of partitions. However,  the number of partitions is a moving target; topics can be made on demand, topics with more partitions than the default can be made, etc. You can make an imaginary partition budget, but there’s no way to enforce that budget. What you really want, in the general case, is to use the disk up to a point, then hope Kafka acts like a FIFO and throws away old things to make room for new things. Until that’s available, I don’t think there’s a robust way to defend against filling the disk with Kafka configuration alone, so monitoring disk utilization is essential.

We use a nice machine learning model in our product that learns disk utilization trends and warns us if the trend predicts a full disk in the near future. We get a warning if the full disk is three days out (the long weekend), and a critical alert if eight hours out. What’s great about this approach is that it adapts to real utilization automatically, pays attention to the velocity of data rather than some arbitrary threshold, and is independent of the number of active topics, and independent of the activity level in different topics. See how I turned our embarrassment into a positive? Pretty slick, right?

Embarrassing story #2:

Broker down is bad. We deployed three Kafka brokers in staging, and the story we told ourselves was that one broker could go down and the other two would pick up the slack. Like any quorum-based service, we hoped, two out of three would mean a majority would always be up and running. So, sure enough a broker went down, and we were like “cool, we’ll fix that later, but leave it for now because it might offer good clues we’ll want to investigate.” Then, much later, one of the producers wedged. Then another. Then it was an outage. It turns out we had configured those Kafka Producers to require acks from each of the replicas, and there were three replicas. So, when one of the brokers went down, there was no fourth broker to take up the missing replica and the Producers were only able to get two acks. Everything had seemed to keep working for a long while. But, while the broker was down, the Producers kept resources for each message hoping for the last ack, then eventually crashed the jvm when it was out of resources. This one was a little tricky to figure out because there were no logged failures on the Producers, and the Consumers were happy to read from the replicas that were working, so it wasn’t obvious that there was a problem with Kafka.

The fix was to configure the Producers to only require one acknowledgement before considering the message delivered. Also, we had to stop thinking about Kafka as if it were another quorum-type service. A better mental image is to think of Kafka brokers as partition servers, and the number of brokers should be the number of partitions you need to serve in parallel at full speed. It won’t quite work out that way in practice, but that mental model is better than thinking about quorum. Then, once you have a number of brokers established, think about replication as providing failure tolerance, but work backward. If one broker goes down, the replication factor should be equal or less than the number of remaining brokers, because otherwise there’s nowhere to move the replicas. Three replicas on three brokers means when a broker goes down, there’s nothing to take on the third replica and you have a policy violation / outage.

Also, note that when the broker is down, it naturally doesn’t have any metrics to report. You can’t tell from inspecting the down node that something is wrong with the cluster. You have to go add up all the metrics from the other brokers. So, it is important to have service-wide aggregations for key performance indicators, such as  under.replicated.partitions, offline.partition.count, bytes.in, bytes.out, messages.in, etc. Adding up all the active.controller.counts will tell you that there is no leader if the sum is zero, and some split-brain catastrophe has happened if the sum is more than one. With service aggregations, you can assess the impact of a broker going down: either the cluster recovers and is fine, or it gets into trouble pretty quickly.

 

Lastly, I’d like to point out that it may be possible in some environments to have a strategy for dealing with the large list of metrics in Kafka, but still collect everything that’s interesting. Whenever a topic or partition is created, data about those are reported essentially forever. A topic may be abandoned and all the message data expired, but all the metrics around that topic will continue to report values. In staging and testing-type environments, especially,  we’ve seen large collections of abandoned topics. But even in production, given enough time, the collection of inactive topics can grow substantially. Kafka doesn’t particularly encourage cleaning up old topics, so eventually you have a long list of all the topics ever made. As an adaptive filter, one can track message counts and ignore all the metrics related to topics that had no increase in the number of messages. That way, you can reasonably “watch everything live” and avoid having to maintain a list of topics to be  monitored, or otherwise manually tune your data collection.

In the next part of Monitoring Kafka series we will discuss Kafka consumer lag and some other concerns in a little more detail.

转载于:https://www.cnblogs.com/felixzh/p/6035658.html

你可能感兴趣的文章
注意java的对象引用
查看>>
C++ 面向对象 类成员函数this指针
查看>>
NSPredicate的使用,超级强大
查看>>
自动分割mp3等音频视频文件的脚本
查看>>
判断字符串是否为空的注意事项
查看>>
布兰诗歌
查看>>
(转)Tomcat 8 安装和配置、优化
查看>>
(转)Linxu磁盘体系知识介绍及磁盘介绍
查看>>
跨域问题整理
查看>>
[Linux]文件浏览
查看>>
获取国内随机IP的函数
查看>>
Spring Mvc模式下Jquery Ajax 与后台交互操作
查看>>
(转)matlab练习程序(HOG方向梯度直方图)
查看>>
tableView
查看>>
Happy Great BG-卡精度
查看>>
TCP/IP 邮件的原理
查看>>
原型设计工具
查看>>
windows下的C++ socket服务器(4)
查看>>
css3 2d转换3d转换以及动画的知识点汇总
查看>>
计算机改名导致数据库链接的诡异问题
查看>>