What is Apache Spark?
Apache Spark is an open-source data processing framework for performing big data analytics on a distributed computing cluster. It is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast, iterative access to datasets. It also offers code reusability, fault tolerance, real-time stream processing, and more.
What are the main Components of Spark?
- Apache Spark Core
- Spark SQL
- Spark Streaming
- MLlib (Machine Learning Library)
- GraphX
What are the features of Apache Spark?
- It supports a wide variety of operations beyond the Map and Reduce functions.
- Swift processing.
- It provides concise and consistent APIs in Scala, Java, and Python.
- It is written in Scala and runs on the JVM.
- Supported programming languages: Scala, Java, Python, and R.
- It leverages distributed cluster memory for computation, which increases speed and data-processing throughput.
- Spark is well suited for real-time decision making on big data.
- Spark runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS); it can also process data stored in HBase. It can run without Hadoop as well, either on Apache Mesos or alone in standalone mode.
- Spark code can be reused for batch processing, joining streams against historical data, or running ad-hoc queries on stream state.
- Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster when running on disk.
- Apache Spark can be integrated with various data sources such as SQL and NoSQL databases, S3, HDFS, and the local file system.
- It is a good fit for iterative tasks such as Machine Learning (ML) algorithms.
- Fault tolerance.
- An active, progressive, and expanding Spark community.
- Besides Map and Reduce operations, it supports SQL-like queries, streaming data, machine learning, and graph processing.
How does Apache Spark work?
Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases, and relational data stores such as Apache Hive. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.
The Spark Core engine uses the resilient distributed data set, or RDD, as its basic data type. The RDD is designed in such a way so as to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn’t have to define where specific files are sent or what computational resources are used to store or retrieve files.
Can you explain Spark RDD?
Spark RDD: At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is a programming abstraction representing an immutable collection of objects that can be split across a computing cluster. Operations on RDDs can also be split across the cluster and executed as a parallel batch process, leading to fast and scalable parallel processing.
RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
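A minimal sketch of this driver-program model in Scala; the application name, master setting, and input path are illustrative:
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // The driver program creates a SparkContext, which coordinates the executors
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
    // Build an RDD from a text file; its partitions are processed in parallel
    val lines = sc.textFile("hdfs:///path/to/input.txt")   // illustrative path
    val lengths = lines.map(_.length)                      // transformation (lazy)
    val total = lengths.reduce(_ + _)                      // action triggers execution
    println(s"Total characters: $total")
    sc.stop()
  }
}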
Can you explain Spark Core?
Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), a logical collection of data partitioned across machines. RDDs can be created in two ways: by referencing datasets in external storage systems, or by applying transformations (e.g. map, filter, reduce, join) to existing RDDs.
The RDD abstraction is exposed through a language-integrated API. This simplifies programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.
Can you explain Spark MLlib?
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. MLlib comes with distributed implementations of clustering and classification algorithms, such as k-means clustering and random forests, that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java- or Scala-based pipeline for production use.
At a high level, it provides tools such as:
ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.
Featurization: feature extraction, transformation, dimensionality reduction, and selection.
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines.
Persistence: saving and loading algorithms, models, and Pipelines.
Utilities: linear algebra, statistics, data handling, etc.
The benefits of MLlib’s design include:
- Simplicity and scalability
- Streamlined end-to-end pipelines
- Compatibility
Common use cases include security monitoring and fraud detection (risk assessment, network monitoring) and operational optimization (supply chain optimization, preventative maintenance). A minimal pipeline sketch follows below.
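A minimal Pipeline sketch, assuming an existing SparkSession named spark; the toy data and column names (f1, f2) are purely illustrative:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Toy training data: a binary label and two numeric features
val training = Seq((1.0, 0.1, 0.3), (0.0, 2.1, 1.7), (1.0, 0.4, 0.2), (0.0, 1.9, 2.2))
  .toDF("label", "f1", "f2")

// Featurization: assemble the raw columns into a single feature vector
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
// ML algorithm: a logistic regression classifier
val lr = new LogisticRegression().setMaxIter(10)

// The Pipeline chains featurization and the algorithm; fit() returns a reusable model
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.transform(training).select("label", "prediction").show()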
Can you explain Spark GraphX?
GraphX is Apache Spark’s API for graphs and graph-parallel computation. It comes with a selection of distributed algorithms for processing graph structures, including an implementation of Google’s PageRank. These algorithms use Spark Core’s RDD approach to modeling data; the separate GraphFrames package additionally lets you run graph operations on DataFrames, taking advantage of the Catalyst optimizer for graph queries.
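A minimal GraphX sketch, assuming an existing SparkContext sc; the vertices and edges are purely illustrative:
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Vertices (id, name) and directed edges (src, dst, attr) for a tiny follower graph
val vertices: RDD[(Long, String)] = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(2L, 1L, 1)))

val graph = Graph(vertices, edges)
// Run PageRank until convergence with the given tolerance
val ranks = graph.pageRank(0.0001).vertices
ranks.join(vertices).collect().foreach { case (_, (rank, name)) => println(s"$name -> $rank") }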
Can you explain Spark Streaming?
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.
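A minimal Spark Streaming sketch (the classic socket word count), assuming an existing SparkContext sc; the hostname and port are illustrative:
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batches of data arrive every 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))
// Ingest lines from a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)
// High-level functions (flatMap, map, reduceByKey) express the processing logic
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()   // push each batch's word counts to the console

ssc.start()
ssc.awaitTermination()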
Can you explain Spark SQL?
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.
Beyond SQL support, Spark SQL provides a standard interface for reading from and writing to other data stores, including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores (Apache Cassandra, MongoDB, Apache HBase, and many others) can be used by pulling in separate connectors from the Spark Packages ecosystem.
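A minimal sketch, assuming an existing SparkSession named spark; the file path and column names are illustrative:
// Read a JSON file into a DataFrame
val people = spark.read.json("/path/to/people.json")
people.createOrReplaceTempView("people")

// The same execution engine runs both the SQL and the DataFrame versions of this query
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
people.filter("age > 21").select("name", "age").show()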
What are DataFrames?
A DataFrame can be considered a distributed collection of data organized into named columns. It can be compared to a relational table, a CSV file, or a data frame in R or Python. DataFrame functionality is available as an API in Scala, Java, Python, and R.
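A minimal sketch, assuming an existing SparkSession named spark:
import spark.implicits._

// Build a DataFrame from a local collection, naming the columns explicitly
val df = Seq(("alice", 29), ("bob", 35)).toDF("name", "age")
df.printSchema()
df.show()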
What is RDD?
RDD is the acronym for Resilient Distributed Dataset: a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two ways RDDs are created:
Parallelized collections: created by distributing an existing collection from the driver program.
Hadoop datasets: created from files in HDFS or another Hadoop-supported storage system, with operations applied to each record. For example:
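(Both snippets assume an existing SparkContext sc; the HDFS path is illustrative.)
// Parallelized collection: distribute an existing collection from the driver program
val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))

// Hadoop dataset: one record per line of a file in HDFS
val fromHdfs = sc.textFile("hdfs:///path/to/records.txt")

println(fromCollection.sum())   // 15.0
println(fromHdfs.count())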
How RDD persist the data?
There are two methods to persist data: persist() and cache(). cache() stores the RDD in memory only (it is shorthand for persist with the default MEMORY_ONLY level), while persist() lets you choose among different storage levels, such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and more, depending on the task. For example:
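A minimal sketch, assuming an existing SparkContext sc; the path is illustrative:
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = sc.textFile("hdfs:///path/to/data.txt").cache()

// persist() lets you choose a storage level explicitly (a level can only be set once per RDD)
val spilled = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)

println(cached.count())
println(spilled.count())

cached.unpersist()    // release cached partitions when no longer needed
spilled.unpersist()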
What are the types of Cluster Managers in Spark?
The Spark framework supports three major types of Cluster Managers:
Standalone: a basic manager to set up a cluster.
Apache Mesos: a generalized, commonly used cluster manager that can also run Hadoop MapReduce and other applications.
YARN: responsible for resource management in Hadoop.
What is Transformation in spark?
Spark provides two types of operations on RDDs: transformations and actions. Transformations are lazy: they only record how the new RDD should be computed, and nothing is executed until an action is called. Each transformation returns a new RDD. Common Spark transformations include map, flatMap, filter, groupByKey, reduceByKey, cogroup, join, sortByKey, union, distinct, and sample. For example:
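A minimal sketch of this lazy behaviour, assuming an existing SparkContext sc:
val nums = sc.parallelize(1 to 10)

// Transformations are lazy: nothing runs yet, each call just returns a new RDD
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// The action triggers execution of the whole lineage
println(doubled.collect().mkString(","))   // 4,8,12,16,20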
Can you define Yarn?
Like Hadoop, YARN is one of the key components Spark can run on, providing a central resource management platform for delivering scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
What are common Spark Ecosystems?
- Spark SQL (Shark) for SQL developers,
- Spark Streaming for streaming data,
- MLLib for machine learning algorithms,
- GraphX for Graph computation,
- SparkR to run R on Spark engine,
- BlinkDB for enabling interactive queries over massive data.
GraphX, SparkR and BlinkDB are still at an incubation stage. (See the questions above for details.)
What is a Dataset?
A Dataset is a newer addition to the Spark libraries. It is an interface added in Spark 1.6 (experimental at the time) that tries to combine the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine. For example:
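A minimal sketch, assuming an existing SparkSession named spark:
case class Person(name: String, age: Int)
import spark.implicits._

// A Dataset is typed: the compiler knows each row is a Person
val ds = Seq(Person("alice", 29), Person("bob", 35)).toDS()
ds.filter(_.age > 30).show()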
What are Actions?
An action brings data from the RDD back to the local machine (the driver). Executing an action triggers execution of all previously created transformations. Examples of actions are:
reduce() – applies the passed function repeatedly to pairs of elements until only one value is left. The function should take two arguments and return one value.
take(n) – returns the first n elements of the RDD to the local node. For example:
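A minimal sketch, assuming an existing SparkContext sc:
val rdd = sc.parallelize(List(5, 3, 8, 1))

println(rdd.reduce(_ + _))           // 17: pairs of values are repeatedly combined into one
println(rdd.take(2).mkString(","))   // the first two elements are returned to the driver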
Explain Spark map() transformation?
The map() transformation takes a function as input and applies that function to each element in the RDD.
The output of the function becomes a new element (value) for each input element.
Example:
val rdd1 = sc.parallelize(List(10,20,30,40))
val rdd2 = rdd1.map(x => x*x)
println(rdd2.collect().mkString(","))
Explain the action count() in Spark RDD?
count() is an action on Apache Spark RDDs.
It returns the number of elements in the RDD.
Example:
val rdd1 = sc.parallelize(List(11,22,33,44))
println(rdd1.count())
Output: 4
Is the following approach correct? Is the sqrtOfSumOfSq a valid reducer?
import math

numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")

def toInt(s):
    return int(s)

nums = numsAsText.map(toInt)

def sqrtOfSumOfSq(x, y):
    return math.sqrt(x*x + y*y)

# reduce() repeatedly combines pairs of values with sqrtOfSumOfSq
total = nums.reduce(sqrtOfSumOfSq)
print(total)   # total already holds the square root of the sum of squares
A: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer: it is commutative and associative, because combining sqrt(x*x + y*y) with z gives sqrt(x*x + y*y + z*z) no matter how the elements are grouped.
Explain mapPartitions() and mapPartitionsWithIndex()
mapPartitions() and mapPartitionsWithIndex() are both transformations.
mapPartitions(): can be used as an alternative to map() and foreach(). It is called once per partition, while map() and foreach() are called for each element in the RDD.
Hence, initialization can be done on a per-partition basis rather than per element.
mapPartitionsWithIndex(): similar to mapPartitions(), but it also provides the partition index as a second parameter, which keeps track of the partition. For example:
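A minimal sketch, assuming an existing SparkContext sc:
val rdd = sc.parallelize(1 to 8, 2)   // 2 partitions

// mapPartitions(): the function runs once per partition, so per-partition setup is cheap
val sums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(sums.collect().mkString(","))   // one sum per partition: 10,26

// mapPartitionsWithIndex(): same idea, but the partition index is also provided
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => s"partition $idx -> $x"))
tagged.collect().foreach(println)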
What is Hive on Spark?
Hive contains significant support for Apache Spark; Hive’s execution engine can be configured to use Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
Explain how can Spark be connected to Apache Mesos?
To connect Spark with Mesos:
- Configure the Spark driver program to connect to Mesos.
- The Spark binary package should be in a location accessible by Mesos.
- Install Apache Spark in the same location as Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed, as sketched below.
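A hedged sketch of the corresponding spark-submit invocation; the Mesos master address, install path, class name, and jar are illustrative:
./bin/spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.mesos.executor.home=/opt/spark \
  --class com.example.MyApp myapp.jar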
Can you define a Parquet file?
Parquet is a columnar file format supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it one of the best big data analytics formats so far. For example:
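A minimal sketch, assuming an existing SparkSession named spark; the output path is illustrative:
import spark.implicits._

// Write a DataFrame as Parquet and read it back
val df = Seq(("alice", 29), ("bob", 35)).toDF("name", "age")
df.write.mode("overwrite").parquet("/tmp/people.parquet")

val readBack = spark.read.parquet("/tmp/people.parquet")
readBack.show()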
Can you define RDD Lineage?
Spark does not replicate data in memory; if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process of reconstructing lost data partitions: an RDD always remembers how it was built from other datasets.
Can you define PageRank?
PageRank is a graph algorithm that measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v’s importance by u. In simple terms, if a user on Instagram is followed by a huge number of other users, that user will rank highly on the platform.
Can you explain broadcast variables?
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
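A minimal sketch, assuming an existing SparkContext sc; the lookup table is illustrative:
// Broadcast a small lookup table once per executor instead of shipping it with every task
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val codes = sc.parallelize(Seq("IN", "US", "IN"))
val named = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
println(named.collect().mkString(","))   // India,United States,India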
Can you explain accumulators in Apache Spark?
Accumulators are variables that are only added through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
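A minimal sketch, assuming an existing SparkContext sc:
// A named numeric accumulator, visible in the web UI, used here as an error counter
val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = lines.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}
println(parsed.sum())        // the action runs the job
println(badRecords.value)    // 1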
What do you know about Transformations in Spark?
Transformations are functions applied to an RDD that result in another RDD. They do not execute until an action occurs. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting the elements of the current RDD that pass the function argument.
What is Spark Executor?
When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The final tasks from SparkContext are transferred to the executors for execution.
What do you know about SchemaRDD?
SchemaRDD is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
Can you explain worker node?
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
A worker node is basically a slave node: the master node assigns work, and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report their resources to the master; based on resource availability, the master schedules tasks.
Explain how can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
Using Broadcast Variable: Broadcast variable enhances the efficiency of joins between small and large RDDs.
Using Accumulators: Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
Explain how can Apache Spark be used alongside Hadoop?
The best part of Apache Spark is its compatibility with Hadoop, which makes for a very powerful combination of technologies. Using Spark and Hadoop together lets Spark’s processing engine take advantage of the best of Hadoop, namely HDFS for storage and YARN for resource management.
Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read-only variables cached in memory on every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help store a lookup table in memory, which improves retrieval efficiency compared to an RDD lookup().
Can you explain benefits of Spark over MapReduce?
Thanks to in-memory processing, Spark executes workloads around 10-100x faster than Hadoop MapReduce, which relies on persistent (disk) storage for every data processing task.
Unlike Hadoop, Spark provides built-in libraries to perform multiple kinds of work from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries, whereas Hadoop only supports batch processing.
Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset (iterative computation), while Hadoop has no built-in support for iterative computing.
(Source: wiki and Spark doc)