Kafka Official Documentation Notes and Best Practices

2021/05/01

Kafka reacts to streams of data events; the docs liken it to a human central nervous system. It offers high throughput, data persistence, low end-to-end latency, fault tolerance, and high availability through replication (for production, 3 replicas are recommended). It is not picky about hardware: commodity machines perform well. Performance is constant with respect to the amount of stored data, effectively O(1), so retaining data for long periods is perfectly fine. Data can be consumed repeatedly; how long it stays available depends on the event retention time configured per topic, and expired events are discarded.
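For example (my own sketch, not from the docs), retention can be adjusted per topic with the kafka-configs.sh tool that ships with Kafka; the topic name and the 7-day value are placeholders:

    # Set retention for a hypothetical topic "my-topic" to 7 days (in ms);
    # events older than this become eligible for deletion.
    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
        --entity-type topics --entity-name my-topic \
        --alter --add-config retention.ms=604800000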

Kafka can collect data from databases, IoT devices, mobile devices, cloud services, sensors, vehicles, industrial equipment, application systems, and similar sources.

From the Kafka official documentation (Quick Start): 60% of Fortune 100 companies use Kafka. Their use cases are covered below.

Five core APIs, for Java and Scala:

  • The Admin API to manage and inspect topics, brokers, and other Kafka objects.
  • The Producer API to publish (write) a stream of events to one or more Kafka topics.
  • The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
  • The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.
  • The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don’t need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.

A rough sketch based on my own understanding:

                 Broker
  Topic          Topic           ...              A topic is like a folder in a filesystem;
|----------|   |----------|    |----------|       events are the files inside it.
| P1|P2|Pn |   | P1|P2|Pn |    | P1|P2|Pn |       P = partition: a sequentially written data
|----------|   |----------|    |----------|       block; partitions can be spread across brokers.


  Consumer group         Consumer group        ...
|-------------------|  |-------------------|  |-------------------|
| consumer|consumer |  | consumer|consumer |  | consumer|consumer |
|-------------------|  |-------------------|  |-------------------|

There can be multiple brokers, multiple topics, and multiple consumer groups; within a single consumer group, a topic is effectively consumed as a queue (each event is handled by only one consumer in the group).
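To make the sketch concrete, here is a minimal Admin API example that creates a topic. The topic name, partition count, and broker address are placeholders; replication factor 3 follows the production recommendation above:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions spread across brokers, replication factor 3 for HA
                NewTopic topic = new NewTopic("page-views", 3, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }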

Use Cases

1. Stream Processing

Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an “articles” topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.

A common pattern: a crawler scrapes data from many pages, ETL steps run across several different topics, and the final, user-ready data is loaded into Elasticsearch and served through an API.
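Below is a minimal Kafka Streams sketch of such a multi-stage pipeline. The topic names raw-articles and clean-articles and the trivial "normalize" step are assumptions for illustration:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ArticleCleansingPipeline {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleanser");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Read crawled articles, normalize them, and publish to a new topic
            // for the next stage (e.g. deduplication or recommendation).
            KStream<String, String> raw = builder.stream("raw-articles");
            raw.mapValues(v -> v.trim().toLowerCase())   // placeholder "normalize" step
               .to("clean-articles");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }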

2. Message Queue

Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications. In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

3. User Web Activity Tracking

The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting. Activity tracking is often very high volume as many activity messages are generated for each user page view.

4. Metrics

Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

5. Log Aggregation

Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

By analyzing user behavior, e-commerce platforms can build targeted push notifications, data pre-warming, and the like; in back-office systems, Kafka works well for collecting logs from distributed microservices, making troubleshooting and log archiving easier.

APIs

To use the Producer, Consumer, or Admin API, add the following dependency:

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>2.8.0</version>
    </dependency>
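A minimal send-and-receive sketch against this dependency; the broker address, topic name, and group id are placeholders:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProduceConsumeExample {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", StringSerializer.class.getName());
            p.put("value.serializer", StringSerializer.class.getName());
            // Send one event; the producer batches and retries internally.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("page-views", "user-1", "/index.html"));
            }

            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "demo-group"); // consumers in one group share the partitions
            c.put("key.deserializer", StringDeserializer.class.getName());
            c.put("value.deserializer", StringDeserializer.class.getName());
            c.put("auto.offset.reset", "earliest");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singleton("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("%s -> %s%n", r.key(), r.value());
                }
            }
        }
    }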

To use the Streams API, add the following dependency:

    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.8.0</version>
    </dependency>

Using Connectors

Running Kafka Connect
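My notes stop here; per the official documentation, Connect can be started in standalone mode roughly like this (which connector .properties files you pass depends on the connectors you want to run):

    # Standalone mode: a single worker process, configured from local property files
    bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]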