What is Hadoop MapReduce?
It is a framework, or programming model, used for processing large data sets across clusters of computers using distributed programming.
What are ‘maps’ and ‘reduces’?
‘Map’ and ‘Reduce’ are the two phases of solving a query over data stored in HDFS. The ‘map’ phase is responsible for reading data from the input location and, based on the input type, generating key-value pairs, that is, intermediate output on the local machine. The ‘reduce’ phase is responsible for processing the intermediate output received from the mappers and generating the final output.
How does Hadoop MapReduce work?
Consider word count as an example: during the map phase, the words in each document are counted, while in the reduce phase the counts are aggregated per word across the entire collection. Before the map phase runs, the input data is divided into splits, which are analyzed by map tasks running in parallel across the Hadoop framework.
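As a rough illustration of the map phase, here is a minimal word-count mapper sketch against the org.apache.hadoop.mapreduce API; the class name and whitespace-based tokenization are illustrative choices, not a canonical implementation:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: byte offset (LongWritable) and one line of text (Text).
// Intermediate output: word (Text) and count (IntWritable).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```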
What is shuffling in MapReduce?
The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle.
Explain the partitioning, shuffle and sort phases.
Partitioning Phase - The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for a given key irrespective of the mapper instance that generated it.
Shuffle Phase - Once the first map tasks are completed, the nodes continue to perform several other map tasks while also exchanging the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducers is referred to as shuffling.
Sort Phase - Hadoop MapReduce automatically sorts the set of intermediate keys on each node before they are given as input to the reducer.
What is Distributed Cache in the MapReduce framework?
Distributed Cache is an important feature provided by the MapReduce framework. It is used when you want to share files across all nodes in a Hadoop cluster. The files could be executable JAR files or simple properties files.
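A hedged sketch of how a cached file might be consumed, using the standard Job.addCacheFile / Context.getCacheFiles API; the file path, class name and output types are illustrative:

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Files registered in the driver with, for example,
        //   job.addCacheFile(new URI("/shared/lookup.properties"));
        // are copied to every node and listed here.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // Open cacheFiles[0] from the node-local copy and load it.
        }
    }
}
```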
What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input parameters and the last two represent the intermediate output parameters, as in the word-count mapper sketch above.
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable. The first two represent the intermediate output parameters and the last two represent the final output parameters.
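A matching word-count reducer sketch shows how those four type parameters line up; again, the class name is illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Intermediate input: word (Text) and counts (IntWritable).
// Final output: word (Text) and total count (IntWritable).
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);  // final (word, total) pair
    }
}
```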
How to write a custom partitioner for a Hadoop MapReduce job?
Steps to write a custom partitioner for a Hadoop MapReduce job:
- Create a new class that extends the pre-defined Partitioner class.
- Override the getPartition method of that class.
- Attach the custom partitioner to the job, either through a config file in the wrapper that runs Hadoop MapReduce, or programmatically using the job's setPartitionerClass method (see the sketch below).
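A minimal sketch of such a partitioner; the routing rule (first character of the key) is purely illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record by the first character of its key, so every
// occurrence of a key lands on the same reducer regardless of
// which mapper emitted it.
public class FirstLetterPartitioner
        extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        char first = k.isEmpty() ? '\0' : Character.toLowerCase(k.charAt(0));
        return first % numPartitions;  // char is non-negative, so this is valid
    }
}
```

In the driver, it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).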
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can accept standard input and produce standard output. It could be Perl, Python, or Ruby, and need not be Java. However, customization of MapReduce internals can only be done using Java and not any other programming language.
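A hedged sketch of a typical streaming invocation; the jar location and script names are illustrative and vary by installation:

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py
```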
What is a Combiner?
A ‘combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mappers on a particular node and sends its output onwards to the reducers. Combiners enhance the efficiency of MapReduce by reducing the quantum of data that must be sent across the network to the reducers.
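When the reduce function is commutative and associative, as in word count, the reducer class can often double as the combiner. A hedged driver-side fragment, reusing the class names from the earlier sketches:

```java
// Partial sums are computed on each mapper's node before the
// shuffle, so far less data crosses the network.
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
```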
Can we rename the output file?
Yes, we can rename the output file by implementing a multiple-output class such as MultipleOutputs, which writes to named files instead of the default part-r-NNNNN files.
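A hedged sketch using org.apache.hadoop.mapreduce.lib.output.MultipleOutputs; the base name "results" is illustrative. In the driver, the named output would first be declared with MultipleOutputs.addNamedOutput(job, "results", TextOutputFormat.class, Text.class, IntWritable.class):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RenamingReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Output lands in files named results-r-NNNNN instead of part-r-NNNNN.
        mos.write("results", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();
    }
}
```

Note that empty default part-r-NNNNN files may still be created unless LazyOutputFormat is also used in the driver.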
What does a MapReduce partitioner do?
A MapReduce partitioner makes sure that all the values for a single key go to the same reducer, thus allowing an even distribution of the map output over the reducers. It redirects mapper output to the reducers by determining which reducer is responsible for a particular key.
What does a split do?
Before data is transferred from its location on disk to the map method, there is a phase called the ‘split’. A split pulls a chunk of data from HDFS into the framework; it does not write anything, but only reads data from the block and passes it to the mapper. By default, splitting is taken care of by the framework: the split size equals the block size, and the input is divided into a bunch of splits, one per map task.
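Although the split size defaults to the block size, it can be bounded in the driver. A hedged fragment using FileInputFormat from org.apache.hadoop.mapreduce.lib.input; the sizes are illustrative:

```java
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Bound the split size: each mapper then receives between 64 MB
// and 128 MB of input, regardless of the underlying block size.
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
```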
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class for the job, and with it everything related to the map side of the job, such as reading the data and generating key-value pairs out of the mapper.
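A hedged driver sketch that wires the earlier classes together using the newer Job API, where job.setMapperClass plays the role of conf.setMapperClass from the old JobConf API; the paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire up the map and reduce classes defined earlier.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Declare the final output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```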
What do sorting and shuffling do?
Sorting and shuffling are responsible for producing each unique key together with the list of all its values. Bringing identical keys together in one place is known as sorting, and the process by which the sorted intermediate output of the mappers is transferred to the reducers is known as shuffling.
What is the input type/format in MapReduce by default?
By default, the input type in MapReduce is ‘text’ (TextInputFormat, which presents each line as a Text value keyed by its LongWritable byte offset).
Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and output types as ‘text’ (TextInputFormat and TextOutputFormat).
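If a different format is needed, it can be set explicitly in the driver. A hedged fragment; since these are the defaults, setting them here is redundant and shown only for illustration:

```java
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// These are the defaults, so setting them is optional; swap in
// other formats (e.g. SequenceFileInputFormat) as needed.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
```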