What is Apache Storm?
Apache Storm is a free and open-source, distributed, fault-tolerant, real-time computation system. It is a distributed computation framework written predominantly in the Clojure programming language. Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates. Although Storm itself is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper. It is simple to use, and you can execute all kinds of manipulations on real-time data in parallel.
Why use Apache Storm?
Easy to operate: Deploying and operating Storm is quite easy.
Storm has many use cases: real-time analytics, online machine learning, distributed RPC, continuous computation, ETL, and more.
Storm is very fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queueing and database technologies you already use. A Storm
topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.
Reliable: It guarantees that each unit of data will be processed at least once or exactly once.
Scalable: It runs across a cluster of machines.
Apache Storm Use Cases: Twitter, Infochimps, Yahoo, Flipboard, Ooyala, Rocket Fuel, Wego, NaviSite, Klout, Taobao
Explain the major components of the Apache Storm system?
Nimbus: It is the job tracker of the cluster. It distributes code across the cluster, assigns tasks to machines, and monitors for failures.
ZooKeeper: It acts as a mediator for communication within the Storm cluster.
Supervisor: It interacts with Nimbus through ZooKeeper and starts or stops worker processes as per Nimbus's instructions.
What are the components of Apache Storm?
Tuple: It is the main data structure in Storm, a named list of ordered elements. By default, a Tuple supports all data types. Generally, it is modelled as a set of comma-separated values and passed to a Storm cluster.
Stream: It is an unbounded sequence of tuples and one of the basic abstractions of the Storm architecture. A tuple, in turn, is the fundamental unit of data in the Storm cluster, containing a named list of values or elements.
Spouts: The source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. You can also write custom spouts to read data from other data sources. "ISpout" is the core interface for implementing spouts. Some of the specific interfaces and classes are IRichSpout, BaseRichSpout, KafkaSpout, etc.
Spouts can broadly be classified into the following:
Reliable: These spouts have the ability to replay tuples (a tuple is a unit of data in the data stream). This helps applications achieve 'at least once' message-processing semantics, because in case of failure tuples can be replayed and processed again. Spouts that fetch data from messaging frameworks are usually reliable, as these frameworks provide a mechanism to replay messages.
Unreliable: These spouts don’t have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed irrespective of whether it was processed successfully or not. This type of spouts follows ‘at most once message processing’ semantic.
Bolts: Bolts are logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform filtering, aggregation, joins, and interaction with data sources and databases. A bolt receives data and can emit to one or more other bolts. "IBolt" is the core interface for implementing bolts. Some of the common interfaces are IRichBolt, IBasicBolt, etc.
Can you explain the Spout Creation?
A spout is the component used for data generation. Basically, a spout implements the IRichSpout interface, which has the following important methods:
open: Provides the spout with an environment to execute. The executors run this method to initialize the spout.
nextTuple: Emits the generated data through the collector.
close: Called when a spout is going to shut down.
declareOutputFields: Declares the output schema of the tuple.
ack: Acknowledges that a specific tuple has been processed.
fail: Specifies that a specific tuple was not fully processed; a reliable spout can then replay it.
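As a sketch of this lifecycle, the class below mimics a reliable spout's open/nextTuple/ack/fail contract. All names here are local stand-ins, not the real org.apache.storm types, so the sketch runs without a Storm dependency:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in mirroring the IRichSpout lifecycle described above.
// In a real topology these methods come from org.apache.storm.spout.ISpout.
class SentenceSpout {
    private Deque<String> source;                              // pretend data source
    private final Map<Long, String> pending = new HashMap<>(); // in-flight tuples by message id
    private long nextId = 0;

    // open(): called once to initialize the spout with its environment
    void open(Deque<String> dataSource) {
        this.source = dataSource;
    }

    // nextTuple(): emit the next unit of data, tagged with a message id
    Long nextTuple() {
        String line = source.poll();
        if (line == null) return null;   // nothing to emit right now
        long id = nextId++;
        pending.put(id, line);           // remember it until acked
        return id;
    }

    // ack(): tuple fully processed, forget it
    void ack(long msgId) { pending.remove(msgId); }

    // fail(): tuple not fully processed, replay it (what makes the spout "reliable")
    void fail(long msgId) {
        String line = pending.remove(msgId);
        if (line != null) source.addFirst(line);
    }

    int pendingCount() { return pending.size(); }
}
```

An unreliable spout would simply omit the pending map: once emitted, a tuple could never be replayed.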
Can you define stream and stream grouping in Apache Storm?
In Apache Storm, a stream is an unbounded sequence of tuples, while a stream grouping defines how a stream should be partitioned among a bolt's tasks.
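One common grouping is the fields grouping, which routes every tuple with the same value of the grouping field to the same bolt task. A minimal sketch of that routing rule (the class below is a local illustration, not Storm's actual implementation):

```java
// Sketch of fields-grouping routing: tuples sharing the same value for the
// grouping field always land on the same bolt task, so per-key state
// (e.g. a running count per word) stays on one task.
class FieldsGrouping {
    private final int numTasks;

    FieldsGrouping(int numTasks) { this.numTasks = numTasks; }

    // Pick a task index deterministically from the grouping field's value.
    int chooseTask(Object fieldValue) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

A shuffle grouping, by contrast, would pick the task randomly to balance load with no key affinity.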
Can you explain the common configurations in Apache Storm?
There are a variety of configurations you can set per topology; a list of all of them can be found in the Storm documentation. Those prefixed with "TOPOLOGY" can be overridden on a topology-specific basis (the others are cluster configurations and cannot be overridden). Here are some common ones that are set for a topology:
Config.TOPOLOGY_WORKERS : This sets the number of worker processes to use to execute the topology. For example, if you set this to 25, there will be 25 Java processes across the cluster executing all the tasks. If you had a combined 150 parallelism across all components in the topology, each worker process will have 6 tasks running within it as threads.
Config.TOPOLOGY_ACKER_EXECUTORS : This sets the number of executors that will track tuple trees and detect when a spout tuple has been fully processed. By not setting this variable or setting it to null, Storm will set the number of acker executors equal to the number of workers configured for this topology. If this variable is set to 0, then Storm will immediately ack tuples as soon as they come off the spout, effectively disabling reliability.
Config.TOPOLOGY_MAX_SPOUT_PENDING : This sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means the tuple has not been acked or failed yet). It is highly recommended you set this config to prevent queue explosion.
Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS : This is the maximum amount of time a spout tuple has to be fully completed before it is considered failed. This value defaults to 30 seconds, which is sufficient for most topologies.
Config.TOPOLOGY_SERIALIZATIONS : You can register more serializers with Storm using this config so that you can use custom types within tuples.
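Under the hood, Storm's Config class is a HashMap of string keys to values, so the settings above can be sketched with a plain map. The key strings below correspond to the Config.TOPOLOGY_* constants discussed above; a real topology would use org.apache.storm.Config and its setter methods instead:

```java
import java.util.HashMap;
import java.util.Map;

// Local sketch of building a topology configuration without storm-core on
// the classpath. Each key string corresponds to a Config.TOPOLOGY_* constant.
class TopologyConfigSketch {
    static Map<String, Object> build() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.workers", 25);              // Config.TOPOLOGY_WORKERS
        conf.put("topology.acker.executors", 25);      // Config.TOPOLOGY_ACKER_EXECUTORS
        conf.put("topology.max.spout.pending", 5000);  // Config.TOPOLOGY_MAX_SPOUT_PENDING
        conf.put("topology.message.timeout.secs", 30); // Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
        return conf;
    }
}
```

The specific numbers here (25 workers, 5000 pending, 30 seconds) are illustrative values, not recommendations.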
Explain how the Storm UI can be used with a topology?
It is used in monitoring the topology. The Storm UI provides information about errors happening in tasks and fine-grained stats on the throughput and latency performance of each component of each running topology.
Define combiner aggregator in Apache Storm?
In Apache Storm, a combiner aggregator (CombinerAggregator) is used to combine a set of tuples into a single field.
Can we run Apache as root? If yes, what are the security risks?
Yes, running Apache as root is entirely possible. The root process binds to port 80 but never listens on it directly; requests are handled by child processes running as an unprivileged user, so no user can obtain root rights without admin permissions. The main risk is that any vulnerability in code running as root could compromise the whole system.
Explain how you can streamline log files using Apache storm?
To stream log files, configure your spout to read the log file and emit each line as it is read. The output can then be passed to a bolt for analysis.
When should you call the clean-up method?
The cleanup method comes into the picture when a bolt shuts down and the resources it was using need to be released.
Can you define TOPOLOGY_MESSAGE_TIMEOUT_SECS in Apache Storm?
It is the maximum amount of time allotted to the topology to fully process a message emitted by a spout. If the message is not acknowledged in the given time frame, Apache Storm will fail the message on the spout.
In which folder are Java Applications stored in Apache?
Java applications are not stored in Apache; Apache can only be connected to another Java webapp-hosting web server using the mod_jk connector.
Can you define Multiviews?
MultiViews search is enabled by the MultiViews option. It is the general name given to the Apache server's ability to provide language-specific document variants in response to a request. This is documented quite thoroughly in the content-negotiation description page. The server then chooses the best match to the client's requirements and returns that document.
Can you define ZeroMQ?
It is a library that "extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products". Storm relied on ZeroMQ primarily for task-to-task communication in running Storm topologies (newer Storm versions use Netty instead).
What are the distinct layers of Storm’s Codebase?
There are three distinct layers to Storm’s codebase.
First: Storm was designed from the very beginning to be compatible with multiple languages. Nimbus is a Thrift service and topologies are defined as Thrift structures. The usage of Thrift allows Storm to be used from any language.
Second: all of Storm’s interfaces are specified as Java interfaces. So even though there’s a lot of Clojure in Storm’s implementation, all usage must go through the Java API. This means that every feature of Storm is always available via Java.
Third: Storm’s implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality the great majority of the implementation logic is in Clojure.
Can you define mod_vhost_alias?
This module creates dynamically configured virtual hosts, by allowing the IP address and/or the Host: header of the HTTP request to be used as part of the pathname to determine what files to serve. This allows for easy use of a huge number of virtual hosts with similar configurations.
Can you define Struts?
Struts is an open-source framework for creating Java web applications.
Can you explain the difference between raw data and processed data?
Raw data is unstructured data that is typically very large in size. It does not serve a specific goal, and it is not possible to make predictions from it.
Processed data contains useful information directed toward a specific goal. It is smaller in size than the raw data, and it has a specific format.
What is data integrity and what are the methods available to reduce threats to it?
Data integrity means the availability, timeliness, reliability, and accuracy of data and information. One thing that can minimize threats to it is taking proper backups of the data in storage. In addition, using error-detection software can also help to a good extent in this regard.
Explain the difference between Apache Kafka and Apache Storm?
Apache Kafka: This is a distributed and robust messaging system that can handle huge amounts of data and allows passage of messages from one endpoint to another. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Apache Storm: This is a real-time message processing system, and you can edit or manipulate data in real time. Storm pulls the data from Kafka and applies the required manipulation. It makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use.
Does Apache act as a Proxy server?
Yes, it acts as a proxy also by using the mod_proxy module. This module implements a proxy, gateway, or cache for Apache. It implements proxying capability for AJP13 (Apache JServ Protocol version 1.3), FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and (since Apache 1.3.23) HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.
How do you check httpd.conf for consistency and any errors in it?
You can check the httpd configuration file using the following command:
httpd -S
This command will dump out a description of how Apache parsed the configuration file. Careful examination of the IP addresses and server names may help uncover configuration mistakes.
How to stop Apache?
To stop Apache you can use the command below:
/etc/init.d/httpd stop
Can you explain Distributed Messaging System?
This system is based on the concept of reliable message queuing. Messages are queued asynchronously between client applications and messaging systems. A distributed messaging system provides the benefits of reliability, scalability, and persistence.
Most of the messaging patterns follow the publish-subscribe model (simply Pub-Sub) where the senders of the messages are called publishers and those who want to receive the messages are called subscribers.
Once the message has been published by the sender, the subscribers can receive the selected messages with the help of a filtering option. Usually there are two types of filtering: one is topic-based filtering and the other is content-based filtering.
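The publish-subscribe model with topic-based filtering can be sketched in a few lines. The names below are purely illustrative, not from any real messaging library:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal topic-based publish-subscribe sketch: subscribers register for a
// topic (the filtering option), and only messages published to that topic
// are delivered to them.
class Subscriber {
    final List<String> inbox = new ArrayList<>();
}

class PubSubBroker {
    private final Map<String, List<Subscriber>> topics = new HashMap<>();

    // Topic-based filtering: a subscriber only sees its chosen topic.
    void subscribe(String topic, Subscriber s) {
        topics.computeIfAbsent(topic, k -> new ArrayList<>()).add(s);
    }

    // Deliver the message to every subscriber of the topic.
    void publish(String topic, String message) {
        for (Subscriber s : topics.getOrDefault(topic, List.of())) {
            s.inbox.add(message);
        }
    }
}
```

Content-based filtering would instead inspect each message's payload against a subscriber-supplied predicate rather than matching on the topic name.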
Can you explain combinerAggregator?
A CombinerAggregator is used to combine a set of tuples into a single field. It has the following signature:
public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}
Storm calls the init() method with each tuple, and then repeatedly calls the combine() method until the partition is processed. The values passed into the combine() method are partial aggregations, the result of combining the values returned by calls to init().
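A counting aggregator is the classic instance of this pattern. The sketch below defines local stand-ins for the interface and TridentTuple (so it compiles without storm-core), plus a driver that folds over a partition the way the text describes:

```java
import java.util.List;

// Local stand-ins so the sketch compiles without storm-core on the classpath.
interface Tuple {}  // stands in for TridentTuple

interface CombinerAggregator<T> {
    T init(Tuple tuple);       // partial aggregation for one tuple
    T combine(T val1, T val2); // merge two partial aggregations
    T zero();                  // identity value for empty partitions
}

// Counting tuples: each tuple contributes 1, combine() adds, zero() is 0.
class Count implements CombinerAggregator<Long> {
    public Long init(Tuple tuple) { return 1L; }
    public Long combine(Long val1, Long val2) { return val1 + val2; }
    public Long zero() { return 0L; }
}

// How the framework drives the aggregator over one partition of tuples.
class Partition {
    static <T> T aggregate(CombinerAggregator<T> agg, List<Tuple> tuples) {
        T acc = agg.zero();
        for (Tuple t : tuples) acc = agg.combine(acc, agg.init(t));
        return acc;
    }
}
```

Because combine() is associative here, partitions can be aggregated independently and their partial results merged with the same method.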
Can you define Trident?
It is an extension of Storm. Like Storm, Trident was also developed by Twitter. The main reason behind developing Trident is to provide a high-level abstraction on top of Storm along with stateful stream processing and low latency distributed querying.
What are the benefits of Apache Storm?
Apache Storm comes with the following benefits to provide your application with a robust real-time processing engine:
Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm’s usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
Guarantees no data loss: A real-time system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
Programming language agnostic: Robust and scalable real-time processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.