What is Apache Hadoop?
Hadoop is an open source software framework for distributed storage and distributed processing of large data sets. Being open source means it is freely available and its source code can be modified to suit our requirements. Apache Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure.
What are the Main Components of Hadoop?
Storage layer – HDFS
Batch processing engine – MapReduce
Resource Management Layer – YARN
HDFS – HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows a master-slave topology.
The components of HDFS are the NameNode (master) and the DataNodes (slaves).
MapReduce – For processing large data sets in parallel across a Hadoop cluster, the Hadoop MapReduce framework is used. Data analysis uses a two-step map and reduce process.
YARN – YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
The main components of YARN are the ResourceManager and the NodeManager.
Why do we need Hadoop?
Storage: Since the data is very large, storing such a huge amount of data is difficult.
Security: Since the data is huge in size, keeping it secure is another challenge.
Analytics: In Big Data, most of the time we are unaware of the kind of data we are dealing with, so analyzing that data is even more difficult.
Data Quality: In the case of Big Data, data is very messy, inconsistent and incomplete.
Discovery: Using a powerful algorithm to find patterns and insights is very difficult.
What are the four characteristics of Big Data?
Volume: Volume represents the amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
Velocity: Velocity refers to the rate at which data is growing, which is very fast; data generated yesterday is already considered old today. Nowadays, social media is a major contributor to the velocity of growing data.
Variety: Variety refers to the heterogeneity of data types. In other words, the data gathered comes in a variety of formats like videos, audio, CSV, etc. These various formats represent the variety of data.
Value: It is all well and good to have access to big data, but unless we can turn it into value, it is useless.
What are the modes in which Hadoop run?
Local (Standalone) Mode: By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process.
Pseudo-Distributed Mode: Like Standalone mode, Hadoop runs on a single node in Pseudo-Distributed mode, but each Hadoop daemon runs as a separate Java process.
Fully-Distributed Mode: In this mode, all daemons execute on separate nodes forming a multi-node cluster, so the master and slave daemons run on separate nodes.
Can you explain about the indexing process in HDFS?
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of the data chunk is stored.
How many daemon processes run on a Hadoop cluster?
Hadoop comprises five separate daemons, and each of these daemons runs in its own JVM.
The following three daemons run on master nodes:
NameNode: This daemon stores and maintains the metadata for HDFS. The NameNode is the master server in Hadoop; it manages the file system namespace and access to the files stored in the cluster.
Secondary NameNode: The Secondary NameNode is not a standby for the NameNode; instead, it performs periodic checkpointing and housekeeping tasks.
JobTracker: Each cluster has a single JobTracker that manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.
The following two daemons run on each slave node:
DataNode: Stores the actual HDFS data blocks. The DataNode manages the storage attached to a node; there can be many such nodes in a cluster, and each node storing data runs a DataNode daemon.
TaskTracker: Responsible for instantiating and monitoring individual map and reduce tasks; one TaskTracker per DataNode performs the actual work.
What happens to a NameNode that has no data?
A NameNode without data does not exist. If it is a NameNode, then it will have some sort of data in it, at minimum the file system metadata.
Can you explain Hadoop streaming?
The Hadoop distribution provides a generic application programming interface for writing map and reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer.
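For illustration, a streaming job might be launched as follows (the jar path and the mapper/reducer script names are placeholders and depend on the installation):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py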
Can you define a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).
Block Scanner – Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the data node.
Can you define a checkpoint?
The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
Can you explain commodity hardware?
Commodity hardware refers to inexpensive systems that do not offer high availability or premium quality. Such machines still need adequate RAM, because there are specific services that need to execute in memory. Hadoop can run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
Can you explain heartbeat in HDFS?
A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the heartbeat, it is assumed that there is some issue with the DataNode or TaskTracker.
What happens when a data node fails?
When a data node fails:
- Jobtracker and name node detect the failure
- On the failed node all tasks are re-scheduled
- Namenode replicates the user’s data to another node
Can you explain TextInputFormat?
In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, key: LongWritable, value: Text.
Can you define Sqoop in Hadoop?
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and data can be exported from HDFS files back to an RDBMS.
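For illustration, an import and an export might look like the following (the connection string, credentials, table names and HDFS paths are placeholders):
sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P --table customers --target-dir /user/hadoop/customers
sqoop export --connect jdbc:mysql://dbhost/sales --username dbuser -P --table customers_out --export-dir /user/hadoop/results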
What are the data components used by Hadoop?
Data components used by Hadoop are Apache Pig and Apache Hive.
Can you explain rack awareness?
Rack awareness is the way in which the NameNode determines how to place block replicas, based on the rack definitions.
How do ‘map’ and ‘reduce’ work?
The input is divided into splits, and each split is assigned to a map task running on a data node. These map tasks process the data assigned to them, produce intermediate key-value pairs, and pass them to the reducers. Each reducer collects the key-value pairs from all the mappers, combines the values for each key, and generates the final output.
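A minimal word-count sketch of this flow, using the org.apache.hadoop.mapreduce API (the class names here are illustrative, not part of Hadoop):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: the byte offset of the line arrives as the key and the line as the value;
  // it emits an intermediate (word, 1) pair for every token.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: receives each word together with all of its counts and sums them.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}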
Can you explain Combiner?
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
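Because a sum is associative and commutative, the reducer from the word-count sketch above can also serve as the combiner. Assuming those illustrative class names, the driver would wire the job up roughly as follows:

Job job = Job.getInstance(new Configuration(), "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(WordCount.TokenMapper.class);
job.setCombinerClass(WordCount.SumReducer.class);   // mini-reduce on the map side
job.setReducerClass(WordCount.SumReducer.class);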
How many input splits will be made by Hadoop framework?
Assuming the default 64 MB block size and three input files of 64 KB, 65 MB and 127 MB, Hadoop will make 5 splits as follows:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
It will restart the task on some other TaskTracker; only if the task fails more than four times (the default setting, which can be changed) will it kill the job.
What are problems with small files and HDFS?
HDFS is not good at handling a large number of small files, because every file, directory and block in HDFS is represented as an object in the NameNode’s memory, each of which occupies approximately 150 bytes. So 10 million files, each using a block, would use about 3 gigabytes of memory, and when we go to a billion files the memory requirement in the NameNode cannot be met.
What does ‘jps’ command do?
It gives the status of the daemons running the Hadoop cluster. The output lists the status of the NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker.
How to restart Namenode?
Step 1: Run stop-all.sh and then start-all.sh, OR
Step 2: Run sudo su - hdfs (press enter), then /etc/init.d/ha (press enter), and then /etc/init.d/hadoop-0.20-namenode start (press enter).
What does /etc /init.d do?
/etc/init.d specifies where daemons (services) are placed and lets you see the status of these daemons. It is Linux-specific and has nothing to do with Hadoop.
What is the purpose of Context Object?
The Context Object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.
What is the number of default partitioner in Hadoop?
In Hadoop, the default partitioner is the HashPartitioner.
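Its partitioning logic is essentially the following (simplified from org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):

public int getPartition(K key, V value, int numReduceTasks) {
  // the same key always maps to the same reduce partition
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}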
What is the use of RecordReader in Hadoop?
In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
What is the best way to copy files between HDFS clusters?
The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.
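For example (the cluster host names and paths are placeholders):
hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/target/path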
Can you explain Speculative Execution?
During speculative execution, Hadoop launches a certain number of duplicate tasks: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate of that task on another node; the copy that finishes first is accepted, and the slower copies are killed.
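Speculative execution can be switched on or off per task type through the following job properties (Hadoop 2.x property names):
mapreduce.map.speculative=true
mapreduce.reduce.speculative=true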
What is the difference between an Input Split and HDFS Block?
The logical division of data is known as Split while the physical division of data is known as HDFS Block.
How can native libraries be included in YARN jobs?
There are two ways to include native libraries in YARN jobs-
- By setting the -Djava.library.path on the command line but in this case, there are chances that the native libraries might not be loaded correctly and there is the possibility of errors.
- The better option is to set LD_LIBRARY_PATH in the .bashrc file, for example as shown below.
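The native library path shown here is only illustrative and depends on the installation:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop/lib/native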
Can you explain Apache HBase?
HBase is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and provides BigTable (Google) like capabilities to Hadoop. It is designed to provide a fault-tolerant way of storing a large collection of sparse data sets. HBase achieves high throughput and low latency by providing faster Read/Write Access to huge datasets.
Can you define SerDe in Hive?
Apache Hive, originally developed by Facebook, is a data warehouse system built on top of Hadoop and used for analyzing structured and semi-structured data. Hive abstracts the complexity of Hadoop MapReduce.
The “SerDe” interface allows you to instruct “Hive” how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses “SerDe” (and “FileFormat”) to read and write the table’s rows.
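For illustration, a table can be declared with an explicit SerDe as follows (the table and column names are placeholders; the JsonSerDe shown ships with the hive-hcatalog-core jar):

CREATE TABLE employees_json (name STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;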
Can you explain WAL in HBase?
Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.
Can you explain Apache Spark?
Apache Spark is a framework for real-time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing.
It is 100x faster than MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations.
Can you define UDF?
If some functionality is not available in the built-in operators, we can programmatically create User Defined Functions (UDFs) to provide it, using other languages like Java, Python, Ruby, etc., and embed them in the script file.
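For example, in Pig a simple UDF can be written in Java roughly as follows (the class name is illustrative) and then registered in the script with REGISTER:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative Pig UDF that upper-cases its first string argument
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}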
Can you explain SMB Join in Hive?
In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. Sort Merge Bucket (SMB) join in Hive is mainly used because there is no limit on file, partition or table size for the join, and it works best when the tables are large. In an SMB join the columns are bucketed and sorted on the join columns, and all tables must have the same number of buckets.
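An SMB join is typically enabled with Hive settings such as the following before running the join query:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;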
How can you connect an application, if you run Hive as a server?
When running Hive as a server, an application can connect in one of three ways:
ODBC Driver-This supports the ODBC protocol
JDBC Driver– This supports the JDBC protocol
Thrift Client– This client can be used to make calls to all hive commands using a different programming language like PHP, Python, Java, C++, and Ruby.
Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop MapReduce; it is a more powerful and efficient resource-management technology that supports MapReduce as well as other processing models. YARN ships with Hadoop 2.0, and MapReduce running on YARN is referred to as MapReduce 2.
Can you explain Record Reader?
A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. Each of the generated Key/value pairs will be sent one by one to their mapper.
Can you explain sequence file in Hadoop?
A sequence file is used to store binary key/value pairs. Sequence files support splitting even when the data inside the file is compressed, which is not possible with a regular compressed file. You can choose record-level compression, in which the value of each key/value pair is compressed, or block-level compression, in which multiple records are compressed together.
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
How do you overwrite replication factor?
There are a few ways to do this; look at the illustrations below.
Illustration
hadoop fs -setrep -R -w 5 hadoop-test
hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv
How do you do a file system check in HDFS?
FSCK command is used to do a file system check in HDFS. It is a very useful command to check the health of the file, block names and block locations.
Illustration
hdfs fsck /dir/hadoop-test -files -blocks -locations
Is Namenode also a commodity?
No. The NameNode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS, so the NameNode has to be a high-availability machine.
Can you define InputSplit in Hadoop?
When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper to process. Each such chunk is called an InputSplit.
What is the difference between an InputSplit and a Block?
A Block is a physical division of data and does not take into account the logical boundaries of records, meaning a record could start in one block and end in another. An InputSplit, on the other hand, respects the logical boundaries of records.
What is the difference between SORT BY and ORDER BY in Hive?
ORDER BY performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger datasets.
SORT BY orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted. You will not achieve a total ordering on the dataset; total ordering is traded for better performance.
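For example (the employees table and salary column are placeholders):
SELECT * FROM employees ORDER BY salary DESC; -- single reducer, total order
SELECT * FROM employees SORT BY salary DESC;  -- ordered within each reducer only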
In which directory Hadoop is installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in
cd /usr/lib/hadoop/
What are the port numbers of Namenode, jobtracker, and task tracker?
The port number for the NameNode is 50070, for the JobTracker it is 50030, and for the TaskTracker it is 50060.
What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the conf/ subdirectory of the Hadoop installation directory.
What is Cloudera and why is it used?
Cloudera is a distribution of Hadoop used for data processing. On the Cloudera QuickStart VM, a user named cloudera is created by default.
How can we check whether Namenode is working or not?
To check whether the NameNode is working or not, use the command
/etc/init.d/hadoop-namenode status, or simply run jps.
Which files are used by the startup and shutdown commands?
The slaves and masters files are used by the startup and the shutdown commands.
Can we create a Hadoop cluster from scratch?
Yes, we can do that also once we are familiar with the Hadoop environment.
How can you transfer data from Hive to HDFS?
By writing the query:
hive> insert overwrite directory '/' select * from emp;
You can write your query for the data you want to import from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.
What is Job Tracker role in Hadoop?
Job Tracker’s primary function is resource management (managing the task trackers), tracking resource availability and task lifecycle management (tracking the tasks progress and fault tolerance).
- It is a process that runs on a separate node, often not on a DataNode.
- Job Tracker communicates with the NameNode to identify data location.
- Finds the best Task Tracker Nodes to execute tasks on given nodes.
- Monitors individual Task Trackers and submits the overall job back to the client.
- It tracks the execution of MapReduce workloads local to the slave node.
What are the core methods of a Reducer?
The three core methods of a Reducer are:
setup(): this method is used for configuring various parameters like the input data size and the distributed cache.
protected void setup(Context context) throws IOException, InterruptedException
reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) throws IOException, InterruptedException
cleanup(): this method is called once at the end of the task to clean up temporary files.
protected void cleanup(Context context) throws IOException, InterruptedException
Can I access Hive Without Hadoop?
Yes, we can access Hive without Hadoop with the help of other data storage systems like Amazon S3, GPFS (IBM) and the MapR file system.
How Spark uses Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop (HDFS) for storage.
What is Spark SQL?
Spark SQL, which evolved from the earlier Shark project, is a module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports a different kind of RDD called SchemaRDD, composed of row objects and schema objects defining the data type of each column in the row. It is similar to a table in a relational database.
What are the additional benefits YARN brings in to Hadoop?
Effective utilization of resources, as multiple applications can run in YARN while sharing a common resource pool. YARN is backward compatible, so all existing MapReduce jobs can run on it without change. Using YARN, one can even run applications that are not based on the MapReduce model.
Can you explain Sqoop metastore?
The Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs created with the sqoop job command and stored in the metastore. The sqoop-site.xml file should be configured to connect to the metastore.
Which are the elements of Kafka?
The most important elements of Kafka:
Topic – a category or stream of messages of a similar kind
Producer – used to publish messages to a topic
Consumer – subscribes to a variety of topics and pulls data from the brokers
Brokers – the servers on which the published messages are stored
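A minimal Java producer sketch ties these elements together (the broker address, topic name, key and value are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092"); // address of a Kafka broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Publish one message to the "clickstream" topic; the brokers store it and
    // any consumer subscribed to that topic can pull it.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("clickstream", "user-42", "page_view"));
    }
  }
}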
Can you explain Apache Kafka?
Wikipedia defines Kafka as “an open-source message broker project developed by the Apache Software Foundation written in Scala, where the design is heavily influenced by transaction logs”. It is essentially a distributed publish-subscribe messaging system.
What is the role of the ZooKeeper?
Kafka uses Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group.
What are the key benefits of using Storm for Real-Time Processing?
Easy to operate: Operating Storm is quite easy.
Really fast: It can process on the order of a million tuples per second per node.
Fault tolerant: It detects faults automatically and restarts the functional attributes.
Reliable: It guarantees that each unit of data will be processed at least once or exactly once.
Scalable: It runs across a cluster of machines.
List out different stream grouping in Apache storm?
- Shuffle grouping
- Fields grouping
- Global grouping
- All grouping
- None grouping
- Direct grouping
- Local grouping
Which operating system(s) are supported for production Hadoop deployment?
The main supported operating system is Linux. However, with some additional software Hadoop can be deployed on Windows.
What is the best practice to deploy the secondary name node?
Deploy the secondary NameNode on a separate standalone machine so that it does not interfere with the operations of the primary NameNode. The secondary NameNode must have the same memory requirements as the main NameNode.
What are the side effects of not running a secondary name node?
The cluster performance will degrade over time because the edit log will grow bigger and bigger. If the secondary NameNode is not running at all, the edit log will grow significantly and slow the system down. In addition, the system will stay in safe mode for an extended time on restart, since the NameNode needs to merge the edit log with the current filesystem checkpoint image.
What daemons run on Master nodes?
NameNode, Secondary NameNode, and JobTracker
Hadoop comprises five separate daemons, and each of these daemons runs in its own JVM. NameNode, Secondary NameNode, and JobTracker run on master nodes, while DataNode and TaskTracker run on each slave node.
Can you explain BloomMapFile?
BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a quick membership test for keys, using dynamic Bloom filters.
What is the usage of foreach operation in Pig scripts?
The FOREACH operation in Apache Pig is used to apply a transformation to each element in a data bag, so that the corresponding action is performed and new data items are generated.
Syntax- FOREACH data_bagname GENERATE exp1, exp2
Explain about the different complex data types in Pig.
Apache Pig supports 3 complex data types-
Maps – key-value stores where the key and value are joined together using #.
Tuples – similar to a row in a table, where different items are separated by a comma; a tuple can have multiple attributes.
Bags – an unordered collection of tuples. A bag allows duplicate tuples.
What is the difference between PigLatin and HiveQL?
It is necessary to specify the schema in HiveQL, whereas it is optional in PigLatin.
HiveQL is a declarative language, whereas PigLatin is procedural.
HiveQL follows a flat relational data model, whereas PigLatin has nested relational data model.
Is the Pig Latin language case-sensitive or not?
Pig Latin is only partly case-sensitive. For example, keywords are case-insensitive: Load is equivalent to load.
However, A = load 'b' is not equivalent to a = load 'b', because alias names are case-sensitive.
UDF names are also case-sensitive: count is not equivalent to COUNT.
What are the use cases of Apache Pig?
Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache Pig is used for:
Research on large raw data sets like data processing for search platforms. For example, Yahoo uses Apache Pig to analyze data gathered from Yahoo search engines and Yahoo News Feeds.
Processing huge datasets like Weblogs, streaming online data, etc.
In customer behavior prediction models like e-commerce websites.
What does Apache Mahout do?
Mahout supports four main data science use cases:
Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations)
Clustering – takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other
Classification – learns from existing categorizations and then assigns unclassified items to the best category
Frequent item-set mining – analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together
Mention some machine learning algorithms exposed by Mahout?
Below is a current list of machine learning algorithms exposed by Mahout.
Collaborative Filtering
- Item-based Collaborative Filtering
- Matrix Factorization with Alternating Least Squares
- Matrix Factorization with Alternating Least Squares on Implicit Feedback
Classification
- Naive Bayes
- Complementary Naive Bayes
- Random Forest
Clustering
- Canopy Clustering
- k-Means Clustering
- Fuzzy k-Means
- Streaming k-Means
- Spectral Clustering
What is Apache Flume?
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. For example, Mozilla collects and analyzes its logs using Flume and Hive.
Flume is a framework for populating Hadoop with data. Agents are populated throughout one’s IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are-
MEMORY Channel – Events are read from the source into memory and passed to the sink.
JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
The MEMORY channel is the fastest of the three; however, it has the risk of data loss. The channel you choose depends completely on the nature of the big data application and the value of each event.
Why are we using Flume?
Most often, Hadoop developers use this tool to get data from social media sites. It was developed by Cloudera for aggregating and moving very large amounts of data. Its primary use is to gather log files from different sources and asynchronously persist them in the Hadoop cluster.
Which Scala library is used for functional programming?
Scalaz library has purely functional data structures that complement the standard Scala library. It has a pre-defined set of foundational type classes like Monad, Functor, etc.
What do you understand by Unit and ()in Scala?
Unit is a subtype of scala.AnyVal and is nothing but the Scala equivalent of Java's void, providing Scala with an abstraction of the Java platform. The empty tuple, i.e. () in Scala, is a term that represents the unit value.
What do you understand by a closure in Scala?
A closure is a function in Scala whose return value depends on the value of one or more variables declared outside the function.
List some use cases where classification machine learning algorithms can be used.
- Natural language processing (Best example for this is Spoken Language Understanding )
- Market Segmentation
- Text Categorization (Spam Filtering )
- Bioinformatics (Classifying proteins according to their function)
- Fraud Detection
- Face detection
What is data cleansing?
Data cleaning, also referred to as data cleansing, deals with identifying and removing errors and inconsistencies from data in order to enhance its quality.
List of some best tools that can be useful for data-analysis?
- Tableau
- RapidMiner
- OpenRefine
- KNIME
- Google Search Operators
- Solver
- NodeXL
- io
- Wolfram Alpha
- Google Fusion Tables
List out some common problems faced by data analyst?
Some of the common problems faced by data analysts are:
- Common misspelling
- Duplicate entries
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data
What are the tools used in Big Data?
- Hadoop
- Hive
- Pig
- Flume
- Mahout
- Sqoop
Which language is more suitable for text analytics? R or Python?
Python has a rich library called pandas, which allows analysts to use high-level data analysis tools as well as data structures, while R lacks this feature. Hence Python is more suitable for text analytics.
Can you explain logistic regression?
It is a statistical technique, or model, used to analyze a dataset and predict a binary outcome. The outcome has to be binary, i.e. either zero or one, or a yes or no.
Can you list few commonly used Hive services?
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hive server)
- Printing the contents of an RC file using the tool rcfilecat.
- Jar
- Metastore
Can you explain indexing?
One of the Hive query optimization methods is the Hive index. A Hive index is used to speed up access to a column or set of columns in a Hive database, because with an index the database system does not need to read all the rows in the table to find the data that has been selected.
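For illustration, an index could be created as follows in Hive 1.x/2.x (Hive indexes were removed in Hive 3.0; the index, table and column names are placeholders):

CREATE INDEX emp_dept_idx ON TABLE employees (dept)
AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX emp_dept_idx ON employees REBUILD;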
What are the components used in Hive query processor?
The components of a Hive query processor include
- Logical Plan Generation.
- Physical Plan Generation.
- Execution Engine.
- Operators.
- UDF’s and UDAF’s.
- Optimizer.
- Parser.
- Semantic Analyzer.
- Type Checking
If you run a select * query in Hive, Why does it not run MapReduce?
The hive.fetch.task.conversion property of Hive avoids the latency of MapReduce overhead: for simple queries involving SELECT, filters, LIMIT, etc., it skips launching a MapReduce job and fetches the results directly.
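For illustration (the table name is a placeholder):
set hive.fetch.task.conversion=more;
SELECT * FROM employees LIMIT 10; -- served by a fetch task, no MapReduce job is launched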
What is the purpose of exploding in Hive?
Explode in Hive is used to convert complex data types into the desired table format. The explode() UDTF emits one row for each element of an array (or each key-value pair of a map).
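For example, assuming a table with an array column named skills (the table and column names are placeholders), each array element becomes its own row:

SELECT name, skill
FROM employees
LATERAL VIEW explode(skills) exploded AS skill;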