What is Apache Kafka?
Apache Kafka is a distributed streaming system that can publish and subscribe to streams of records. Kafka can be used for a number of purposes: messaging, real-time website activity tracking, monitoring operational metrics of distributed applications, log aggregation from numerous servers, event sourcing where state changes in a database are logged and ordered, commit logs where distributed systems sync data, and restoring data from failed systems.
Why Apache Kafka?
Kafka is a messaging system that is highly scalable and highly durable. It is used for building real-time streaming data pipelines that reliably get data between systems or applications, and for building real-time streaming applications that transform or react to streams of data. The core reasons for choosing Kafka are its high reliability and its ability to handle data with very low latency.
What are the different components that are available in Kafka?
Topics: Kafka maintains feeds of messages in categories called topics. A topic is a user-defined category (or feed name) to which messages are published.
Producers: Producers are processes that publish messages to one or more Kafka topics. The producer is responsible for choosing which message to assign to which partition within a topic.
Consumers: Consumers are processes that subscribe to one or more topics and process the feeds of published messages from those topics.
Brokers: A Kafka cluster consists of one or more servers, each of which is called a broker. Producers send messages to the Kafka cluster, which in turn serves them to consumers.
What are the core APIs in Kafka?
Kafka has four core APIs:
Producer API: The Producer API allows an application to publish a stream of records to one or more Kafka topics.
Consumer API: The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
Streams API: The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams (a minimal sketch follows this list).
Connector API: The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
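As a hedged illustration of the Streams API (the topic names, application id, and transformation below are hypothetical), this minimal sketch reads records from an input topic, upper-cases each value, and writes the results to an output topic:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");   // consume the input stream
        input.mapValues(v -> v.toUpperCase()).to("output-topic");        // transform and produce

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                                  // run the topology
    }
}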
Explain the role of the Kafka Producer API?
The role of Kafka’s Producer API is to wrap the two producers, kafka.producer.SyncProducer and kafka.producer.async.AsyncProducer. The aim is to expose all of the producer functionality to the client through a single API.
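The classes named above belong to the legacy Scala client; in the current Java client the same idea survives as a single KafkaProducer whose send() call can be used either asynchronously (with a callback) or synchronously (by blocking on the returned Future). A minimal sketch, with a hypothetical topic name and broker address:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class ProducerModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "k", "v");

            // Asynchronous use: send() returns immediately, the callback fires later.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();
            });

            // Synchronous use: block on the returned Future until the broker acknowledges.
            RecordMetadata md = producer.send(record).get();
            System.out.println("written to partition " + md.partition() + " at offset " + md.offset());
        }
    }
}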
What is a Kafka Topic?
A Kafka topic consists of several partitions. A topic can be parallelized through these partitions by splitting its data across multiple brokers. For multiple consumers to read a topic in parallel, each partition should be placed on a separate machine. Multiple consumers reading from multiple partitions of a topic allows for phenomenally high message-processing throughput. Every message within a partition has an identifier called its offset. The offset is simply the immutable sequence in which the messages are ordered, and Kafka maintains this ordering. Consumers can read messages starting from any offset they choose, which allows them to join the cluster at any time. Each specific message in a Kafka cluster can therefore be uniquely identified by a tuple of its topic, partition, and offset within the partition.
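To make the (topic, partition, offset) tuple concrete, here is a minimal sketch using the Java consumer; the topic name, partition number, and starting offset are hypothetical. It assigns itself one partition and starts reading from a chosen offset:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "seek-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0);   // topic + partition
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L);                                    // start reading at offset 42
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                        r.topic(), r.partition(), r.offset(), r.value());
            }
        }
    }
}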
What is a broker?
An instance in a Kafka cluster is called a broker. In a Kafka cluster, if you connect to any one broker, you will be able to access the entire cluster. The broker instance that we connect to in order to access the cluster is known as a bootstrap server. Each broker is identified by a numeric ID in the cluster. Three brokers is a good number to start a Kafka cluster with, but there are clusters with hundreds of brokers.
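A rough sketch of the bootstrap-server idea (the broker address is hypothetical): an AdminClient given a single bootstrap broker can discover and list every broker in the cluster.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;
import java.util.Properties;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One (or a few) bootstrap brokers are enough to discover the whole cluster.
        props.put("bootstrap.servers", "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.println("broker id=" + node.id() + " at " + node.host() + ":" + node.port());
            }
        }
    }
}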
What are consumers or users?
Kafka provides a single consumer abstraction, the consumer group, that generalizes both queuing and publish-subscribe. Consumers label themselves with a consumer group name, and every message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can run in separate processes or on separate machines. The messaging model of the consumers is determined by how they are arranged into consumer groups.
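A small sketch of how the group.id setting selects between the two models (the topic and group names are hypothetical):

import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Queuing: run several copies of this process with the SAME group.id and the
        // topic's partitions are shared among them (each message handled once per group).
        // Publish-subscribe: give each application its OWN group.id and every group
        // receives a full copy of the stream.
        props.put("group.id", "billing-service");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"));
        consumer.poll(Duration.ofSeconds(1));
    }
}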
How are messages consumed by a consumer in Kafka?
Kafka transfers messages using the sendfile API. This enables the transfer of bytes from disk to the socket entirely within kernel space, saving the extra copies and the calls back and forth between kernel and user space.
How to start a Kafka server?
Given that Kafka uses ZooKeeper, we first have to start the ZooKeeper server.
One can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance:
bin/zookeeper-server-start.sh config/zookeeper.properties
Now the Kafka server can be started:
bin/kafka-server-start.sh config/server.properties
Elaborate Kafka architecture?
A Kafka cluster contains multiple brokers, since it is a distributed system. Each topic in the system is divided into a number of partitions, and each broker stores one or more of those partitions so that multiple producers and consumers can publish and retrieve messages at the same time.
What happens if the preferred replica is not in the ISR?
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
What is a partitioning key?
In general terms, a partitioning key consists of one or more fields that determine the partition where each record is stored. Within the producer, the main function of the partitioning key (the message key) is to validate and direct the message to its destination partition. Normally, a hash-based partitioner is used to derive the partition id when a key is provided.
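A minimal producer sketch (the topic name and key are hypothetical) showing a keyed send; with the default hash-based partitioner, every message carrying the same key lands in the same partition and therefore stays ordered:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key "user-42" is hashed by the default partitioner, so all
            // messages with this key go to the same partition of "user-events".
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged_in"));
            // A record can also name an explicit partition instead of relying on the key:
            // new ProducerRecord<>("user-events", 3, "user-42", "logged_in")
        }
    }
}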
What is an Offset?
The messages in the partitions are each given a sequential id number known as the offset, which is used to uniquely identify each message within its partition. With the aid of ZooKeeper, Kafka stores the offsets of messages consumed for a specific topic and partition by a given consumer group.
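A minimal sketch of how a consumer group records its consumed offsets (the topic and group names are hypothetical); here auto-commit is disabled and offsets are committed explicitly after processing:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payments");
        props.put("enable.auto.commit", "false");   // commit offsets manually
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r);                     // application-specific handling
                }
                consumer.commitSync();              // record the consumed offsets for this group
            }
        }
    }
    static void process(ConsumerRecord<String, String> r) { System.out.println(r.value()); }
}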
What is Replication in Kafka?
Kafka replicates the log for each topic’s partitions across a configurable number of servers. This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures. The purpose of adding replication in Kafka is for stronger durability and higher availability.
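The replication factor is chosen when a topic is created. As a rough illustration using the Java AdminClient (the topic name and counts are hypothetical), this sketch creates a topic whose partitions are each replicated on three brokers:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, each replicated on 3 brokers.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}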
What are Kafka logs?
An important concept in Apache Kafka is the “log”. This is not related to an application log or system log; it is a log of the data. It provides a loose structure for the data consumed by Kafka. The notion of a “log” here is an ordered, append-only sequence of data. The data can be anything, because to Kafka it is just an array of bytes.
What is ISR?
ISR refers to the in-sync replicas. A follower is considered in-sync only if it satisfies the following two conditions:
It must send fetch requests within a certain time, configurable via replica.lag.time.max.ms.
It must not lag too far behind the leader, configurable via replica.lag.max.messages (see the configuration sketch below).
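For illustration only, these follower-lag settings live in the broker configuration; the values below are illustrative, not recommendations:

# server.properties (illustrative values)
# A follower must issue fetch requests within this window to stay in the ISR:
replica.lag.time.max.ms=10000
# In older Kafka releases, the maximum message lag before a follower is dropped:
replica.lag.max.messages=4000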
What does it indicate if a replica stays out of the ISR for a long time?
If a replica remains out of the ISR for an extended time, it indicates that the follower is unable to fetch data as fast as it accumulates at the leader.
Why are Replications critical in Kafka?
Replication ensures that published messages are not lost and can still be consumed in the event of any machine error, program error, or routine software upgrade.
How can you reduce churn in the ISR? When does a broker leave the ISR?
The ISR is the set of message replicas that are completely synced up with the leader; in other words, the ISR contains all messages that are committed. The ISR should always include all replicas until there is a real failure. A replica is dropped from the ISR if it deviates too far from the leader.
What role does ZooKeeper play in a Kafka cluster?
Kafka is an open-source distributed system that is built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. ZooKeeper also stores periodically committed offsets, so that if any node fails, it can recover from the previously committed offset. ZooKeeper is additionally responsible for configuration management, leader detection, detecting when any node leaves or joins the cluster, synchronization, etc.
Can Kafka be used without ZooKeeper?
It is not possible to use Kafka without ZooKeeper, because it is not feasible to bypass ZooKeeper and connect directly to the Kafka server. If ZooKeeper is down for any reason, we will not be able to serve any client requests.
Why is Kafka technology significant to use?
Kafka, being a distributed publish-subscribe system, has the advantages below.
Fast: a single broker can serve thousands of clients, handling megabytes of reads and writes per second.
Scalable: data is partitioned and streamlined over a cluster of machines to handle large amounts of information.
Durable: messages are persisted and replicated within the cluster to prevent data loss.
Distributed by design: it provides fault-tolerance guarantees and robustness.
What is the maximum size of a message that the Kafka server can receive?
The default maximum size of a message that the Kafka server can receive is 1000000 bytes, roughly 1 MB.
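This limit is governed by broker and client settings; a hedged configuration sketch with illustrative values:

# server.properties: largest message (record batch) the broker will accept
message.max.bytes=1000000
# Clients should be sized consistently, e.g. on the producer:
# max.request.size=1000000
# and on the consumer:
# max.partition.fetch.bytes=1048576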
When does the queue full exception emerge inside the producer?
The QueueFullException typically occurs when the producer attempts to send messages at a pace the broker cannot handle. Since the producer does not block, users need to add sufficient brokers to collectively handle the increased load.
How can you get exactly-once messaging from Kafka during data production?
To get exactly-once messaging from Kafka you have to address two things: avoiding duplicates during data production and avoiding duplicates during data consumption.
Here are the two ways to get exactly-once semantics during data production:
Use a single writer per partition, and every time you get a network error, check the last message in that partition to see whether your last write succeeded.
Include a primary key (UUID or something similar) in the message and de-duplicate on the consumer (a minimal sketch of this approach follows).
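A rough sketch of the second approach (the topic, group, and payload names are hypothetical): the producer attaches a UUID as the record key, and the consumer skips keys it has already seen. In a real system the "seen" set would be a durable store rather than an in-memory set.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.UUID;

public class DedupDemo {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            // Producer side: attach a unique id (here the record key) so retries can be detected.
            producer.send(new ProducerRecord<>("events", UUID.randomUUID().toString(), "payload"));
        }

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "dedup-demo");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        Set<String> seen = new HashSet<>();           // stand-in for a durable de-duplication store
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                if (seen.add(r.key())) {              // first time this id is seen: process it
                    System.out.println(r.value());
                }                                     // duplicates are silently skipped
            }
        }
    }
}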
What is the main difference between Kafka and Flume?
Both Kafka and Flume are used for real-time processing, but Kafka is more scalable and provides stronger guarantees of message durability.
What is the traditional method of message transfer?
The traditional method of message transfer includes two models:
Queuing: in the queuing model, a pool of consumers may read messages from the server, and each message goes to one of them.
Publish-Subscribe: in this model, messages are broadcast to all consumers.
Kafka offers a single consumer abstraction that generalizes both of the above: the consumer group.
When not to use Apache Kafka?
Kafka does not number the messages individually; it has a notion of an “offset” inside the log which identifies the messages.
Consumers consume data from topics, but Kafka does not keep track of message consumption. Kafka does not know which consumer consumed which message from a topic; the consumer or consumer group has to keep track of its own consumption.
There are no random reads from Kafka. The consumer has to specify an offset for the topic, and Kafka starts serving the messages in order from the given offset.
Kafka does not offer the ability to delete individual messages. A message stays in the log until it expires (until the defined retention time).