What is Apache Storm?
Apache Storm is a free and open-source, distributed, fault-tolerant, real-time computation system. It is a distributed computation framework written predominantly in the Clojure programming language. Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates. Although Storm itself is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper. It is simple to use, and you can execute all kinds of manipulations on real-time data in parallel.
Why use Apache Storm?
Easy to operate: Deploying and operating Storm is quite easy.
Storm has many use cases: real-time analytics, online machine learning, distributed RPC, continuous computation, ETL, and more.
Storm is very fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queueing and database technologies you already use. A Storm
topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.
Reliable: It guarantees that each unit of data will be processed at least once or exactly once.
Scalable: It runs across a cluster of machines.
Apache Storm Use Cases: Twitter, Infochimps, Yahoo, Flipboard, Ooyala, Rocket Fuel, Wego, NaviSite, Klout, Taobao
Explain the major components of the Apache Storm system?
Nimbus: It is the job tracker of the cluster. It distributes code across the cluster, assigns tasks to machines, and monitors for failures.
ZooKeeper: It acts as a mediator for communication within the Storm cluster.
Supervisor: It interacts with Nimbus through ZooKeeper and starts or stops worker processes as per Nimbus's instructions.
What are the components of Apache Storm?
Tuple: It is the main data structure in Storm, a named list of ordered elements. By default, a Tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.
Stream: It is an unbounded sequence of tuples and one of the basic abstractions of the Storm architecture. A tuple, in turn, is the fundamental unit of data in the Storm cluster, containing a named list of values or elements.
Spouts: The source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. You can also write custom spouts to read data from other data sources. "ISpout" is the core interface for implementing spouts. Some of the specific interfaces and classes are IRichSpout, BaseRichSpout, KafkaSpout, etc.
Spouts can broadly be classified into the following:
Reliable: These spouts have the ability to replay tuples (a tuple is a unit of data in the data stream). This helps applications achieve 'at least once' message-processing semantics, because in case of failure tuples can be replayed and processed again. Spouts that fetch data from messaging frameworks are usually reliable, as these frameworks provide a mechanism to replay messages.
Unreliable: These spouts don’t have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed irrespective of whether it was processed successfully or not. This type of spouts follows ‘at most once message processing’ semantic.
Bolts: Bolts are logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform filtering, aggregation, joins, and interaction with data sources and databases. A bolt receives data and can emit to one or more other bolts. "IBolt" is the core interface for implementing bolts. Some of the common interfaces are IRichBolt, IBasicBolt, etc.
Can you explain the Spout Creation?
A spout is the component used for data generation. Basically, a spout implements the IRichSpout interface, which has the following important methods:
open: Provides the spout with an environment to execute. The executors run this method to initialize the spout.
nextTuple: Emits the generated data through the collector.
close: Called when a spout is going to shut down.
declareOutputFields: Declares the output schema of the tuple.
ack: Acknowledges that a specific tuple has been processed.
fail: Specifies that a specific tuple was not fully processed; a reliable spout can then replay it.
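As a sketch of this lifecycle, the class below mimics a reliable spout's open/nextTuple/ack/fail contract. All names here are local stand-ins, not the real org.apache.storm types, so the sketch runs without a Storm dependency:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in mirroring the IRichSpout lifecycle described above.
// In a real topology these methods come from org.apache.storm.spout.ISpout.
class SentenceSpout {
    private Deque<String> source;                              // pretend data source
    private final Map<Long, String> pending = new HashMap<>(); // in-flight tuples by message id
    private long nextId = 0;

    // open(): called once to initialize the spout with its environment
    void open(Deque<String> dataSource) {
        this.source = dataSource;
    }

    // nextTuple(): emit the next unit of data, tagged with a message id
    Long nextTuple() {
        String line = source.poll();
        if (line == null) return null;   // nothing to emit right now
        long id = nextId++;
        pending.put(id, line);           // remember it until acked
        return id;
    }

    // ack(): tuple fully processed, forget it
    void ack(long msgId) { pending.remove(msgId); }

    // fail(): tuple not fully processed, replay it (what makes the spout "reliable")
    void fail(long msgId) {
        String line = pending.remove(msgId);
        if (line != null) source.addFirst(line);
    }

    int pendingCount() { return pending.size(); }
}
```

An unreliable spout would simply omit the pending map: once emitted, a tuple could never be replayed.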
Can you define stream and stream grouping in Apache Storm?
In Apache Storm, a stream is an unbounded sequence of tuples, while a stream grouping defines how a stream should be partitioned among a bolt's tasks.
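One common grouping is the fields grouping, which routes every tuple with the same value of the grouping field to the same bolt task. A minimal sketch of that routing rule (the class below is a local illustration, not Storm's actual implementation):

```java
// Sketch of fields-grouping routing: tuples sharing the same value for the
// grouping field always land on the same bolt task, so per-key state
// (e.g. a running count per word) stays on one task.
class FieldsGrouping {
    private final int numTasks;

    FieldsGrouping(int numTasks) { this.numTasks = numTasks; }

    // Pick a task index deterministically from the grouping field's value.
    int chooseTask(Object fieldValue) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

A shuffle grouping, by contrast, would pick the task randomly to balance load with no key affinity.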
Can you explain the common configurations in Apache Storm?
There are a variety of configurations you can set per topology; a list of all of them can be found in the Storm documentation. Those prefixed with "TOPOLOGY" can be overridden on a topology-specific basis (the others are cluster configurations and cannot be overridden). Here are some common ones that are set for a topology:
Config.TOPOLOGY_WORKERS : This sets the number of worker processes to use to execute the topology. For example, if you set this to 25, there will be 25 Java processes across the cluster executing all the tasks. If you had a combined 150 parallelism across all components in the topology, each worker process will have 6 tasks running within it as threads.
Config.TOPOLOGY_ACKER_EXECUTORS : This sets the number of executors that will track tuple trees and detect when a spout tuple has been fully processed. By not setting this variable or setting it to null, Storm will set the number of acker executors equal to the number of workers configured for this topology. If this variable is set to 0, then Storm will immediately ack tuples as soon as they come off the spout, effectively disabling reliability.
Config.TOPOLOGY_MAX_SPOUT_PENDING : This sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means the tuple has not been acked or failed yet). It is highly recommended you set this config to prevent queue explosion.
Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS : This is the maximum amount of time a spout tuple has to be fully completed before it is considered failed. This value defaults to 30 seconds, which is sufficient for most topologies.
Config.TOPOLOGY_SERIALIZATIONS : You can register more serializers with Storm using this config so that you can use custom types within tuples.
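Under the hood, Storm's Config class is a HashMap of string keys to values, so the settings above can be sketched with a plain map. The key strings below correspond to the Config.TOPOLOGY_* constants discussed above; a real topology would use org.apache.storm.Config and its setter methods instead:

```java
import java.util.HashMap;
import java.util.Map;

// Local sketch of building a topology configuration without storm-core on
// the classpath. Each key string corresponds to a Config.TOPOLOGY_* constant.
class TopologyConfigSketch {
    static Map<String, Object> build() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.workers", 25);              // Config.TOPOLOGY_WORKERS
        conf.put("topology.acker.executors", 25);      // Config.TOPOLOGY_ACKER_EXECUTORS
        conf.put("topology.max.spout.pending", 5000);  // Config.TOPOLOGY_MAX_SPOUT_PENDING
        conf.put("topology.message.timeout.secs", 30); // Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
        return conf;
    }
}
```

The specific numbers here (25 workers, 5000 pending, 30 seconds) are illustrative values, not recommendations.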
Explain how the Storm UI can be used with a topology?
It is used in monitoring the topology. The Storm UI provides information about errors happening in tasks and fine-grained stats on the throughput and latency performance of each component of each running topology.
Define combiner aggregator in Apache Storm?
In Apache Storm, a combiner aggregator (CombinerAggregator) is used to combine a set of tuples into a single field.
Can we run Apache as root? If yes, what are the security risks?
Yes, running Apache as root is entirely possible. The root process binds to port 80 but never listens on it directly; requests are handled by child processes running as an unprivileged user, so no user can obtain root rights without admin permissions. The main risk is that any vulnerability in code running as root could compromise the whole system.
Explain how you can streamline log files using Apache storm?
To stream log files, configure your spout to read the log file and emit each line as it is read. The output can then be passed to a bolt for analysis.
When should you call the clean-up method?
The cleanup method comes into the picture when a bolt shuts down and the resources it was using need to be released.
Can you define TOPOLOGY_MESSAGE_TIMEOUT_SECS in Apache Storm?
It is the maximum amount of time allotted to the topology to fully process a message emitted by a spout. If the message is not acknowledged in the given time frame, Apache Storm will fail the message on the spout.
In which folder are Java Applications stored in Apache?
Java applications are not stored in Apache; Apache can only be connected to another Java webapp-hosting web server using the mod_jk connector.
Can you define Multiviews?
MultiViews search is enabled by the MultiViews option. It is the general name given to the Apache server's ability to provide language-specific document variants in response to a request. This is documented quite thoroughly in the content-negotiation description page. The server then chooses the best match to the client's requirements and returns that document.
Can you define ZeroMQ?
It is a library that "extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products". Storm relied on ZeroMQ primarily for task-to-task communication in running Storm topologies (newer Storm versions use Netty instead).
What are the distinct layers of Storm’s Codebase?
There are three distinct layers to Storm’s codebase.
First: Storm was designed from the very beginning to be compatible with multiple languages. Nimbus is a Thrift service and topologies are defined as Thrift structures. The usage of Thrift allows Storm to be used from any language.
Second: all of Storm’s interfaces are specified as Java interfaces. So even though there’s a lot of Clojure in Storm’s implementation, all usage must go through the Java API. This means that every feature of Storm is always available via Java.
Third: Storm’s implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality the great majority of the implementation logic is in Clojure.
Can you define mod_vhost_alias?
This module creates dynamically configured virtual hosts, by allowing the IP address and/or the Host: header of the HTTP request to be used as part of the pathname to determine what files to serve. This allows for easy use of a huge number of virtual hosts with similar configurations.
Can you define Struts?
Struts is an open-source framework for creating Java web applications.
Can you explain the difference between raw data and processed data?
Raw data is unstructured data that is typically very large in size. It does not serve a specific goal, and it is not possible to make predictions from it.
Processed data contains useful information directed toward a specific goal. It is smaller in size than the raw data, and it has a specific format.
What is data integrity and what are the methods available to reduce threats to it?
Data integrity means the availability, timeliness, reliability, and accuracy of data and information. One thing that can minimize threats to it is taking proper backups of the data in storage. In addition, using error-detection software can also help to a good extent in this regard.
Explain the difference between Apache Kafka and Apache Storm?
Apache Kafka: This is a distributed and robust messaging system that can handle huge amounts of data and allows passage of messages from one endpoint to another. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Apache Storm: This is a real-time message processing system, and you can edit or manipulate data in real time. Storm pulls the data from Kafka and applies the required manipulation. It makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use.
Does Apache act as a Proxy server?
Yes, it acts as a proxy also by using the mod_proxy module. This module implements a proxy, gateway, or cache for Apache. It implements proxying capability for AJP13 (Apache JServ Protocol version 1.3), FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and (since Apache 1.3.23) HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.
How do you check httpd.conf for consistency and any errors in it?
You can check the httpd configuration file using the following command:
httpd -S
This command will dump out a description of how Apache parsed the configuration file. Careful examination of the IP addresses and server names may help uncover configuration mistakes.
How to stop Apache?
To stop Apache you can use the command below:
/etc/init.d/httpd stop
Can you explain Distributed Messaging System?
This system is based on the concept of reliable message queuing. Messages are queued asynchronously between client applications and messaging systems. A distributed messaging system provides the benefits of reliability, scalability, and persistence.
Most of the messaging patterns follow the publish-subscribe model (simply Pub-Sub) where the senders of the messages are called publishers and those who want to receive the messages are called subscribers.
Once the message has been published by the sender, the subscribers can receive the selected messages with the help of a filtering option. Usually there are two types of filtering: one is topic-based filtering and the other is content-based filtering.
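The publish-subscribe model with topic-based filtering can be sketched in a few lines. The names below are purely illustrative, not from any real messaging library:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal topic-based publish-subscribe sketch: subscribers register for a
// topic (the filtering option), and only messages published to that topic
// are delivered to them.
class Subscriber {
    final List<String> inbox = new ArrayList<>();
}

class PubSubBroker {
    private final Map<String, List<Subscriber>> topics = new HashMap<>();

    // Topic-based filtering: a subscriber only sees its chosen topic.
    void subscribe(String topic, Subscriber s) {
        topics.computeIfAbsent(topic, k -> new ArrayList<>()).add(s);
    }

    // Deliver the message to every subscriber of the topic.
    void publish(String topic, String message) {
        for (Subscriber s : topics.getOrDefault(topic, List.of())) {
            s.inbox.add(message);
        }
    }
}
```

Content-based filtering would instead inspect each message's payload against a subscriber-supplied predicate rather than matching on the topic name.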
Can you explain combinerAggregator?
A CombinerAggregator is used to combine a set of tuples into a single field. It has the following signature:
public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}
Storm calls the init() method with each tuple, and then repeatedly calls the combine() method until the partition is processed. The values passed into the combine() method are partial aggregations, the result of combining the values returned by calls to init().
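A counting aggregator is the classic instance of this pattern. The sketch below defines local stand-ins for the interface and TridentTuple (so it compiles without storm-core), plus a driver that folds over a partition the way the text describes:

```java
import java.util.List;

// Local stand-ins so the sketch compiles without storm-core on the classpath.
interface Tuple {}  // stands in for TridentTuple

interface CombinerAggregator<T> {
    T init(Tuple tuple);       // partial aggregation for one tuple
    T combine(T val1, T val2); // merge two partial aggregations
    T zero();                  // identity value for empty partitions
}

// Counting tuples: each tuple contributes 1, combine() adds, zero() is 0.
class Count implements CombinerAggregator<Long> {
    public Long init(Tuple tuple) { return 1L; }
    public Long combine(Long val1, Long val2) { return val1 + val2; }
    public Long zero() { return 0L; }
}

// How the framework drives the aggregator over one partition of tuples.
class Partition {
    static <T> T aggregate(CombinerAggregator<T> agg, List<Tuple> tuples) {
        T acc = agg.zero();
        for (Tuple t : tuples) acc = agg.combine(acc, agg.init(t));
        return acc;
    }
}
```

Because combine() is associative here, partitions can be aggregated independently and their partial results merged with the same method.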
Can you define Trident?
It is an extension of Storm. Like Storm, Trident was also developed by Twitter. The main reason behind developing Trident is to provide a high-level abstraction on top of Storm along with stateful stream processing and low latency distributed querying.
What are the benefits of Apache Storm?
Apache Storm comes with the following benefits to provide your application with a robust real-time processing engine:
Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm’s usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
Guarantees no data loss: A real-time system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
Programming language agnostic: Robust and scalable real-time processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.