What is Apache Kafka?
Apache Kafka is a distributed streaming system that can publish and subscribe to streams of records. Kafka can be used for a number of purposes: messaging, real-time website activity tracking, monitoring operational metrics of distributed applications, log aggregation from numerous servers, event sourcing where state changes in a database are logged and ordered, commit logs where distributed systems sync data, and restoring data from failed systems.
Why Apache Kafka?
Kafka is a messaging system that is highly scalable and highly durable. It is used for building real-time streaming data pipelines that reliably get data between systems or applications, and for building real-time streaming applications that transform or react to streams of data. The core reasons for choosing Kafka are its high reliability and its ability to handle data with very low latency.
What are the different components that are available in Kafka?
Topics: Kafka maintains feeds of messages in categories called topics. A topic is a user-defined category (or feed name) to which messages are published.
Producers: Producers are processes that publish messages to one or more Kafka topics. The producer is responsible for choosing which message to assign to which partition within a topic.
Consumers: Consumers are processes that subscribe to one or more topics and process the feeds of published messages from those topics.
Brokers: A Kafka cluster consists of one or more servers, each of which is called a broker. Producers send messages to the Kafka cluster, which in turn serves them to consumers.
What are the core APIs in Kafka?
Kafka has four core APIs:
Producer API: The Producer API allows an application to publish a stream of records to one or more Kafka topics.
Consumer API: The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
Streams API: The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams (a minimal sketch follows this list).
Connector API: The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
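As a hedged illustration of the Streams API (the topic names, application id, and transformation below are hypothetical), this minimal sketch reads records from an input topic, upper-cases each value, and writes the results to an output topic:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");   // consume the input stream
        input.mapValues(v -> v.toUpperCase()).to("output-topic");        // transform and produce

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                                  // run the topology
    }
}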
Explain the role of the Kafka Producer API?
The role of Kafka’s Producer API is to wrap the two producers, kafka.producer.SyncProducer and kafka.producer.async.AsyncProducer. The aim is to expose all of the producer functionality to the client through a single API.
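The classes named above belong to the legacy Scala client; in the current Java client the same idea survives as a single KafkaProducer whose send() call can be used either asynchronously (with a callback) or synchronously (by blocking on the returned Future). A minimal sketch, with a hypothetical topic name and broker address:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class ProducerModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "k", "v");

            // Asynchronous use: send() returns immediately, the callback fires later.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();
            });

            // Synchronous use: block on the returned Future until the broker acknowledges.
            RecordMetadata md = producer.send(record).get();
            System.out.println("written to partition " + md.partition() + " at offset " + md.offset());
        }
    }
}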
What is a Kafka Topic?
A Kafka topic consists of several partitions. A topic can be parallelized through these partitions by splitting its data across multiple brokers. For multiple consumers to read a topic in parallel, each partition should be placed on a separate machine. Multiple consumers reading from multiple partitions of a topic allows for phenomenally high message-processing throughput. Every message within a partition has an identifier called its offset. The offset is simply the immutable sequence in which the messages are ordered, and Kafka maintains this ordering. Consumers can read messages starting from any offset they choose, which allows them to join the cluster at any time. Each specific message in a Kafka cluster can therefore be uniquely identified by a tuple of its topic, partition, and offset within the partition.
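To make the (topic, partition, offset) tuple concrete, here is a minimal sketch using the Java consumer; the topic name, partition number, and starting offset are hypothetical. It assigns itself one partition and starts reading from a chosen offset:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "seek-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0);   // topic + partition
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L);                                    // start reading at offset 42
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                        r.topic(), r.partition(), r.offset(), r.value());
            }
        }
    }
}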
What is a broker?
An instance in a Kafka cluster is called a broker. In a Kafka cluster, if you connect to any one broker, you will be able to access the entire cluster. The broker instance that we connect to in order to access the cluster is known as a bootstrap server. Each broker is identified by a numeric ID in the cluster. Three brokers is a good number to start a Kafka cluster with, but there are clusters with hundreds of brokers.
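A rough sketch of the bootstrap-server idea (the broker address is hypothetical): an AdminClient given a single bootstrap broker can discover and list every broker in the cluster.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;
import java.util.Properties;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One (or a few) bootstrap brokers are enough to discover the whole cluster.
        props.put("bootstrap.servers", "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.println("broker id=" + node.id() + " at " + node.host() + ":" + node.port());
            }
        }
    }
}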
What are consumers or users?
Kafka provides a single consumer abstraction, the consumer group, that generalizes both queuing and publish-subscribe. Consumers label themselves with a consumer group name, and every message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can run in separate processes or on separate machines. The messaging model of the consumers is determined by how they are arranged into consumer groups.
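A small sketch of how the group.id setting selects between the two models (the topic and group names are hypothetical):

import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Queuing: run several copies of this process with the SAME group.id and the
        // topic's partitions are shared among them (each message handled once per group).
        // Publish-subscribe: give each application its OWN group.id and every group
        // receives a full copy of the stream.
        props.put("group.id", "billing-service");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"));
        consumer.poll(Duration.ofSeconds(1));
    }
}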
How are messages consumed by a consumer in Kafka?
Kafka transfers messages using the sendfile API. This enables the transfer of bytes from disk to the socket entirely within kernel space, saving the extra copies and the calls back and forth between kernel and user space.
How to start a Kafka server?
Given that Kafka uses ZooKeeper, we first have to start the ZooKeeper server.
One can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance:
bin/zookeeper-server-start.sh config/zookeeper.properties
Now the Kafka server can be started:
bin/kafka-server-start.sh config/server.properties
Elaborate Kafka architecture?
A Kafka cluster contains multiple brokers, since it is a distributed system. Each topic in the system is divided into a number of partitions, and each broker stores one or more of those partitions so that multiple producers and consumers can publish and retrieve messages at the same time.
What happens if the preferred replica is not in the ISR?
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
What is a partitioning key?
In general terms, a partitioning key consists of one or more fields that determine the partition where each record is stored. Within the producer, the main function of the partitioning key (the message key) is to validate and direct the message to its destination partition. Normally, a hash-based partitioner is used to derive the partition id when a key is provided.
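A minimal producer sketch (the topic name and key are hypothetical) showing a keyed send; with the default hash-based partitioner, every message carrying the same key lands in the same partition and therefore stays ordered:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key "user-42" is hashed by the default partitioner, so all
            // messages with this key go to the same partition of "user-events".
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged_in"));
            // A record can also name an explicit partition instead of relying on the key:
            // new ProducerRecord<>("user-events", 3, "user-42", "logged_in")
        }
    }
}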
What is an Offset?
The messages in the partitions are each given a sequential id number known as the offset, which is used to uniquely identify each message within its partition. With the aid of ZooKeeper, Kafka stores the offsets of messages consumed for a specific topic and partition by a given consumer group.
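A minimal sketch of how a consumer group records its consumed offsets (the topic and group names are hypothetical); here auto-commit is disabled and offsets are committed explicitly after processing:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payments");
        props.put("enable.auto.commit", "false");   // commit offsets manually
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r);                     // application-specific handling
                }
                consumer.commitSync();              // record the consumed offsets for this group
            }
        }
    }
    static void process(ConsumerRecord<String, String> r) { System.out.println(r.value()); }
}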
What is Replication in Kafka?
Kafka replicates the log for each topic’s partitions across a configurable number of servers. This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures. The purpose of adding replication in Kafka is for stronger durability and higher availability.
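The replication factor is chosen when a topic is created. As a rough illustration using the Java AdminClient (the topic name and counts are hypothetical), this sketch creates a topic whose partitions are each replicated on three brokers:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, each replicated on 3 brokers.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}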
What are Kafka logs?
An important concept in Apache Kafka is the “log”. This is not related to an application log or system log; it is a log of the data. It provides a loose structure for the data consumed by Kafka. The notion of a “log” here is an ordered, append-only sequence of data. The data can be anything, because to Kafka it is just an array of bytes.
What is ISR?
ISR refers to the in-sync replicas. A follower is considered in-sync only if it satisfies the following two conditions:
It must send fetch requests within a certain time, configurable via replica.lag.time.max.ms.
It must not lag too far behind the leader, configurable via replica.lag.max.messages (see the configuration sketch below).
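For illustration only, these follower-lag settings live in the broker configuration; the values below are illustrative, not recommendations:

# server.properties (illustrative values)
# A follower must issue fetch requests within this window to stay in the ISR:
replica.lag.time.max.ms=10000
# In older Kafka releases, the maximum message lag before a follower is dropped:
replica.lag.max.messages=4000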
What does it indicate if a replica stays out of the ISR for a long time?
If a replica remains out of the ISR for an extended time, it indicates that the follower is unable to fetch data as fast as it accumulates at the leader.
Why are Replications critical in Kafka?
Replication ensures that published messages are not lost and can still be consumed in the event of any machine error, program error, or routine software upgrade.
How can you reduce churn in the ISR? When does a broker leave the ISR?
The ISR is the set of message replicas that are completely synced up with the leader; in other words, the ISR contains all messages that are committed. The ISR should always include all replicas until there is a real failure. A replica is dropped from the ISR if it deviates too far from the leader.
What role does ZooKeeper play in a Kafka cluster?
Kafka is an open-source distributed system that is built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. ZooKeeper also stores periodically committed offsets, so that if any node fails, it can recover from the previously committed offset. ZooKeeper is additionally responsible for configuration management, leader detection, detecting when any node leaves or joins the cluster, synchronization, etc.
Can Kafka be used without ZooKeeper?
It is not possible to use Kafka without ZooKeeper, because it is not feasible to bypass ZooKeeper and connect directly to the Kafka server. If ZooKeeper is down for any reason, we will not be able to serve any client requests.
Why is Kafka technology significant to use?
Kafka, being a distributed publish-subscribe system, has the advantages below.
Fast: a single broker can serve thousands of clients, handling megabytes of reads and writes per second.
Scalable: data is partitioned and streamlined over a cluster of machines to handle large amounts of information.
Durable: messages are persisted and replicated within the cluster to prevent data loss.
Distributed by design: it provides fault-tolerance guarantees and robustness.
What is the maximum size of a message that the Kafka server can receive?
The default maximum size of a message that the Kafka server can receive is 1000000 bytes, roughly 1 MB.
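This limit is governed by broker and client settings; a hedged configuration sketch with illustrative values:

# server.properties: largest message (record batch) the broker will accept
message.max.bytes=1000000
# Clients should be sized consistently, e.g. on the producer:
# max.request.size=1000000
# and on the consumer:
# max.partition.fetch.bytes=1048576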
When does the queue full exception emerge inside the producer?
The QueueFullException typically occurs when the producer attempts to send messages at a pace the broker cannot handle. Since the producer does not block, users need to add sufficient brokers to collectively handle the increased load.
How can you get exactly-once messaging from Kafka during data production?
To get exactly-once messaging from Kafka you have to address two things: avoiding duplicates during data production and avoiding duplicates during data consumption.
Here are the two ways to get exactly-once semantics during data production:
Use a single writer per partition, and every time you get a network error, check the last message in that partition to see whether your last write succeeded.
Include a primary key (UUID or something similar) in the message and de-duplicate on the consumer (a minimal sketch of this approach follows).
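A rough sketch of the second approach (the topic, group, and payload names are hypothetical): the producer attaches a UUID as the record key, and the consumer skips keys it has already seen. In a real system the "seen" set would be a durable store rather than an in-memory set.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import java.util.UUID;

public class DedupDemo {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            // Producer side: attach a unique id (here the record key) so retries can be detected.
            producer.send(new ProducerRecord<>("events", UUID.randomUUID().toString(), "payload"));
        }

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "dedup-demo");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        Set<String> seen = new HashSet<>();           // stand-in for a durable de-duplication store
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                if (seen.add(r.key())) {              // first time this id is seen: process it
                    System.out.println(r.value());
                }                                     // duplicates are silently skipped
            }
        }
    }
}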
What is the main difference between Kafka and Flume?
Both Kafka and Flume are used for real-time processing, but Kafka is more scalable and provides stronger guarantees of message durability.
What is the traditional method of message transfer?
The traditional method of message transfer includes two models:
Queuing: in the queuing model, a pool of consumers may read messages from the server, and each message goes to one of them.
Publish-Subscribe: in this model, messages are broadcast to all consumers.
Kafka offers a single consumer abstraction that generalizes both of the above: the consumer group.
When not to use Apache Kafka?
Kafka does not number the messages individually; it has a notion of an “offset” inside the log which identifies the messages.
Consumers consume data from topics, but Kafka does not keep track of message consumption. Kafka does not know which consumer consumed which message from a topic; the consumer or consumer group has to keep track of its own consumption.
There are no random reads from Kafka. The consumer has to specify an offset for the topic, and Kafka starts serving the messages in order from the given offset.
Kafka does not offer the ability to delete individual messages. A message stays in the log until it expires (until the defined retention time).