What is Hadoop?
Hadoop is a distributed computing platform written in Java. Its core pieces are modeled on the Google File System (as HDFS) and on the MapReduce programming model.
Why use Hadoop?
- It is flexible, reliable and scalable
- It can process multi-petabyte datasets
- It tolerates node failures
- It can handle data that does not have a strict schema
What is InputSplit in Hadoop?
When a Hadoop job is run, the framework splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an InputSplit.
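As a minimal sketch (assuming the newer org.apache.hadoop.mapreduce API and a hypothetical input directory /data/input), the splits that the default TextInputFormat would create can be listed before any mappers run; by default one split is created per HDFS block of each input file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitInspector {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration());
            FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
            // The InputFormat decides how the input is split; the framework then
            // launches one map task per InputSplit.
            for (InputSplit split : new TextInputFormat().getSplits(job)) {
                System.out.println(split);
            }
        }
    }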
How many InputSplits are made by the Hadoop framework?
Assuming the default 64 MB block size and input files of 64 KB, 65 MB and 127 MB, Hadoop will make 5 splits:
- One split for the 64 KB file (it fits within a single block)
- Two splits for the 65 MB file (64 MB + 1 MB)
- Two splits for the 127 MB file (64 MB + 63 MB)
What is Hadoop MapReduce?
Hadoop MapReduce is the framework used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step process: a map phase followed by a reduce phase.
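A minimal driver sketch, assuming the org.apache.hadoop.mapreduce API and the hypothetical WordCountMapper / WordCountReducer classes sketched under "How does Hadoop MapReduce work?" below, shows how the two steps are wired together:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // step 1: map
            job.setReducerClass(WordCountReducer.class);  // step 2: reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }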
What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any executable or script. It is a generic API that allows programs written in any language to be used as the Hadoop mapper or reducer.
What kind of Hardware is best for Hadoop?
Hadoop can run on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The exact hardware depends on the workflow's needs.
How does Hadoop MapReduce work?
Taking word counting as an example: during the map phase, each map task counts the words in its portion of the input, while in the reduce phase the counts are aggregated per word across the entire collection. During the map phase, the input data is divided into splits that are analysed by map tasks running in parallel across the Hadoop cluster.
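The word-count example above can be sketched as follows (a minimal illustration using the org.apache.hadoop.mapreduce API; the class names are the hypothetical ones referenced in the earlier driver sketch):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in this task's input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: aggregate the per-word counts across the entire collection.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }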
What is NameNode in Hadoop?
NameNode in Hadoop is the node where Hadoop stores all the file location information for HDFS (the Hadoop Distributed File System). In other words, the NameNode is the centerpiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the machines of the cluster.
What is JobTracker in Hadoop?
In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
How many daemon processes run on a Hadoop cluster?
Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM.
The following 3 daemons run on master nodes:
NameNode – This daemon stores and maintains the metadata for HDFS. The NameNode is the master server in Hadoop and manages the file system namespace and access to the files stored in the cluster.
Secondary NameNode – The Secondary NameNode is not a redundant standby for the NameNode; instead it performs periodic checkpointing and housekeeping tasks.
JobTracker – Each cluster has a single JobTracker that manages MapReduce jobs and distributes individual tasks to the machines running a TaskTracker.
The following 2 daemons run on each slave node:
DataNode – Stores the actual HDFS data blocks. The DataNode manages the storage attached to its node; a cluster can contain many such nodes. Each node that stores data runs a DataNode daemon.
TaskTracker – Responsible for instantiating and monitoring individual map and reduce tasks; the TaskTracker on each slave node performs the actual work.
What are the functionalities of the JobTracker?
These are the main tasks of the JobTracker:
- To accept jobs from clients.
- To communicate with the NameNode to determine the location of the data.
- To locate TaskTracker nodes with available slots.
- To submit the work to the chosen TaskTracker nodes and monitor the progress of each task.
What is a NameNode?
The NameNode sits at the center of the Hadoop Distributed File System cluster. It manages the file system metadata and the DataNodes, but does not store the data itself.
What is a datanode?
Unlike the NameNode, a DataNode actually stores data within the Hadoop Distributed File System. DataNodes run in their own Java virtual machine process.
What is TaskTracker?
A TaskTracker is a node in the cluster that accepts tasks (map, reduce and shuffle operations) from a JobTracker.
How does the JobTracker assign tasks to the TaskTracker?
The TaskTracker periodically sends heartbeat messages to the JobTracker to assure it that it is alive. These messages also inform the JobTracker about the number of available slots, so the JobTracker knows where work can be scheduled.
What happens when a datanode fails?
When a datanode fails:
- The JobTracker and NameNode detect the failure
- All tasks that were running on the failed node are re-scheduled
- The NameNode replicates the user's data to another node
What are the most common input formats defined in Hadoop?
These are the most common input formats defined in Hadoop:
- TextInputFormat
- KeyValueTextInputFormat
- SequenceFileInputFormat
What are the actions followed by Hadoop when a job is submitted?
Hadoop performs the following actions:
- Client applications submit jobs to the JobTracker
- The JobTracker communicates with the NameNode to determine the data location
- The JobTracker locates TaskTracker nodes near the data or with available slots
- It submits the work to the chosen TaskTracker nodes
- When a task fails, the JobTracker is notified and decides how to proceed
- The JobTracker monitors the TaskTracker nodes
What is heartbeat in HDFS?
A heartbeat is a signal sent from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or the JobTracker does not receive the signal, it assumes there is some issue with the DataNode or the TaskTracker.
What is a Combiner?
The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
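In the word-count sketch above, the reduce function (summing) is associative and commutative, so the same hypothetical WordCountReducer class can simply be registered as the combiner in the driver:

    job.setCombinerClass(WordCountReducer.class);  // 'mini-reduce' run on each mapper's node
    job.setReducerClass(WordCountReducer.class);   // final aggregation across all map output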
What is Speculative Execution?
In Hadoop, speculative execution launches a certain number of duplicate tasks: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple terms, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate of that task on another node. The copy that finishes first is kept, and the remaining copies are killed.
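Speculative execution is enabled by default; as a sketch (using the Hadoop 1.x / 0.20 property names), it can be toggled separately for map and reduce tasks in the job configuration:

    // org.apache.hadoop.conf.Configuration
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // allow speculative map tasks
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // disable for reduce tasks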
What are the basic parameters of a Mapper?
The basic parameters of a Mapper are its input and output key/value types, typically:
- LongWritable and Text (input key and value)
- Text and IntWritable (output key and value)
What is the function of the MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps distribute the map output evenly over the reducers.
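A sketch of a custom partitioner (a hypothetical WordPartitioner that mirrors what the default HashPartitioner does) makes the guarantee concrete: every value for a given key is routed to the same reduce partition.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Same key -> same hash -> same reducer; masking keeps the result non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }
    // Registered on the job with: job.setPartitionerClass(WordPartitioner.class);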
What is difference between an Input Split and HDFS Block?
The logical division of the data is known as an input split, while the physical division of the data is known as an HDFS block.
What is WebDAV in Hadoop?
WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
What is sqoop in Hadoop?
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and data can be exported from HDFS back into an RDBMS.
What is SequenceFileInputFormat?
SequenceFileInputFormat is used for reading sequence files. Sequence files are a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
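A sketch of chaining two hypothetical jobs (jobOne and jobTwo) through sequence files, assuming the org.apache.hadoop.mapreduce.lib.input and lib.output format classes:

    // Both jobs would point at the same intermediate HDFS directory.
    jobOne.setOutputFormatClass(SequenceFileOutputFormat.class); // jobOne writes sequence files
    jobTwo.setInputFormatClass(SequenceFileInputFormat.class);   // jobTwo reads them back as input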
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
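A sketch using the older org.apache.hadoop.mapred API (which is what the JobConf-style "conf" refers to), with a hypothetical MyDriver class and a hypothetical MyMapper implementing that API's Mapper interface:

    JobConf conf = new JobConf(MyDriver.class); // org.apache.hadoop.mapred.JobConf; MyDriver is hypothetical
    conf.setMapperClass(MyMapper.class);        // MyMapper: the class that turns input records into key-value pairs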
What is the purpose of RecordReader in Hadoop?
The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
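A sketch (Hadoop 1.x style, with a hypothetical HDFS path) of registering a file in the distributed cache from the job configuration:

    // org.apache.hadoop.filecache.DistributedCache, java.net.URI
    Configuration conf = new Configuration();
    DistributedCache.addCacheFile(new URI("/cache/lookup.txt"), conf); // hypothetical file
    // The framework copies /cache/lookup.txt to every slave node before the
    // job's tasks start there; tasks can then read it as a local file.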
How did you debug your Hadoop code?
There can be several ways of doing this, but the most common ones are:
- By using counters (see the sketch below)
- By using the web interface provided by the Hadoop framework
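A minimal sketch of the counter approach, assuming a hypothetical "Debug"/"BadRecords" counter incremented from inside a mapper's map() method; counter totals are aggregated across all tasks and shown in the JobTracker web interface and the job's summary:

    // Inside map() or reduce(): count suspicious records rather than printing them.
    if (line.toString().isEmpty()) {                           // 'line' is the mapper's input value
        context.getCounter("Debug", "BadRecords").increment(1);
    }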
Which directory does Hadoop install to?
Hadoop is typically installed in /usr/lib/hadoop-0.20/ (the exact path depends on the distribution and version).
Where are Hadoop's configuration files located? List them.
Hadoop’s configuration files can be found inside the conf sub-directory.
- hdfs-site.xml
- core-site.xml
- mapred-site.xml