This tutorial introduces the Apache Spark framework and explains its architecture.
Apache Spark is an open-source cluster computing framework acclaimed for lightning-fast Big Data processing, offering speed, ease of use and advanced analytics. Originally conceived and developed in 2009 at the University of California, Berkeley's AMPLab, it became an open-source technology in 2010. Apache Spark offers an array of advantages compared to other Big Data processing technologies such as Storm and Hadoop MapReduce.
As a Big Data processing and warehousing technology, the Hadoop Distributed File System (HDFS) already has an established following. To gain the speed and performance of Apache Spark, there is no need to replace an existing HDFS infrastructure, because Spark runs on top of it with enhanced functionality. Spark applications can be deployed and run in an existing Hadoop v1 cluster (inside MapReduce), in a Hadoop v2 YARN cluster, or on Apache Mesos.
In that sense, Apache Spark can enhance the speed, power and capability of the Hadoop ecosystem. With Hadoop, performing a complicated task always required stringing together a series of MapReduce jobs and executing them one at a time; each job had to wait until the previous job in the sequence finished, and between jobs the intermediate results were written to and read back from disk. This sequential, disk-bound execution introduced high latency. Apache Spark emerged as an alternative to this high-latency processing model, one that can be incorporated into existing Hadoop infrastructure without replacing Hadoop. Essentially, it offers a faster, smoother, performance-driven replacement for MapReduce.
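The difference between the two execution styles can be illustrated with a small, purely illustrative Python sketch. It does not use Hadoop or Spark; the stage functions and the file round-trip are hypothetical stand-ins for MapReduce jobs writing intermediate output to disk versus Spark chaining stages in memory.

```python
import json
import os
import tempfile

def run_stage_via_disk(records, stage_fn):
    # MapReduce-style: each job writes its full output to disk,
    # and the next job must read it back before it can start.
    path = os.path.join(tempfile.mkdtemp(), "stage_output.json")
    with open(path, "w") as f:
        json.dump([stage_fn(r) for r in records], f)
    with open(path) as f:
        return json.load(f)

def run_stages_in_memory(records, stage_fns):
    # Spark-style: intermediate results stay in memory,
    # so stages chain without round-trips to disk.
    for fn in stage_fns:
        records = [fn(r) for r in records]
    return records

# Two toy "jobs" chained in sequence.
stages = [lambda x: x * 2, lambda x: x + 1]

data = [1, 2, 3]
out_disk = data
for fn in stages:
    out_disk = run_stage_via_disk(out_disk, fn)
out_mem = run_stages_in_memory(data, stages)
assert out_disk == out_mem == [3, 5, 7]
```

Both pipelines compute the same result; the disk-backed version simply pays a serialization and I/O cost at every stage boundary, which is the latency Spark's in-memory model avoids.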
Contrary to common belief, Spark is not merely an enhanced version of the Hadoop infrastructure. It does not depend on Hadoop at all, since it is equipped with its own cluster management, but it can very well be deployed and run on Hadoop.
Its high-speed in-memory cluster computing is what makes Spark so advanced as a data processing technology. Apache Spark can use Hadoop in two distinct ways: for storage and for processing. But since Spark is endowed with its own cluster computing capability, it typically uses Hadoop only for storage.
There are three main components in the Spark architecture: data storage, the API, and the resource management framework. Let us have a brief look at each of these components.
Data storage: For data storage, Spark uses the HDFS file system and works with any Hadoop-compatible data source, including HDFS, HBase and Cassandra.
API: The Spark API allows developers to build applications against a standard programming interface. Separate Spark APIs are provided for the Scala, Java, and Python programming languages.
Resource management framework: Spark can be deployed as a standalone server or on distributed computing frameworks such as Mesos and YARN.
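To give a feel for the style of API the middle layer exposes, here is a toy sketch in plain Python. It is not Spark's actual RDD implementation; the class and method names merely mimic the real API, and the point is that transformations are recorded lazily and only evaluated when a result is requested.

```python
class ToyRDD:
    """A toy, in-memory stand-in for Spark's RDD: transformations
    (map, filter) are recorded lazily; collect() triggers evaluation."""

    def __init__(self, data, ops=None):
        self.data = list(data)
        self.ops = ops or []

    def map(self, fn):
        # Return a new dataset with the transformation queued, not applied.
        return ToyRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):
        # Action: replay the queued transformations and return the result.
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
assert rdd.collect() == [0, 4, 16]
```

In real Spark the chained calls look the same, but the data is partitioned across a cluster and the queued transformations are shipped to the nodes holding each partition.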
In the field of Big Data analytics and data management, Apache Spark quickly rose to popularity and widespread adoption thanks to several advantageous attributes, such as processing speed, flexibility, and support for multiple languages and libraries. Let us have a look at the key features of Apache Spark.
Data processing speed: For an application running in a Hadoop cluster, Spark can offer up to 100 times faster processing in memory, and up to 10 times faster processing when the application runs on disk. Spark achieves this mainly by keeping intermediate processing data in memory, thereby reducing the number of read and write operations to disk.
Multiple language support: With built-in APIs for Java, Scala and Python, Spark enables you to write applications in different programming languages.
Support for multiple libraries: Spark ships with a rich set of libraries, including SQL and DataFrames, MLlib, GraphX, and Spark Streaming. These libraries can be combined to create powerful applications serving a variety of purposes.
Flexibility across platforms: Spark can run on multiple platforms, including Hadoop, Mesos, standalone mode, or the cloud. Moreover, Spark can access a variety of data sources, including HDFS, Cassandra, HBase, and S3.
Advanced analytics: Besides supporting the typical 'map' and 'reduce' operations of Hadoop MapReduce, Spark provides extensive support for SQL queries, streaming data, machine learning and graph algorithms.
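The classic 'map' and 'reduce' model mentioned above can be sketched in a few lines of plain Python. This is illustrative only; real Spark or MapReduce would distribute both phases across a cluster.

```python
from collections import Counter
from functools import reduce

lines = ["big data", "big spark", "spark spark"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts per word.
def merge(acc, pair):
    word, count = pair
    acc[word] += count
    return acc

counts = reduce(merge, pairs, Counter())
assert counts == {"big": 2, "data": 1, "spark": 3}
```

Spark's contribution is that this same two-phase word count is a two-line expression over an RDD or DataFrame, while SQL queries, streaming and machine learning reuse the same execution engine rather than requiring separate systems.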
Besides the core API, Spark offers additional libraries as part of the Spark ecosystem, providing enhanced capabilities in the areas of Big Data analytics and machine learning. Let us have a brief look at some of the important libraries in this ecosystem.
Spark Streaming: Spark Streaming is used for processing real-time streaming data.
Spark SQL: Spark SQL exposes Spark datasets over the JDBC API and allows running SQL-like queries on Spark data.
Spark MLlib: This is the scalable machine learning library in the Spark ecosystem, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction.
Spark GraphX: This Spark API is used for graphs and graph-parallel computation.
BlinkDB: This is an approximate query engine that enables interactive SQL queries over large volumes of data by trading a bounded amount of accuracy for response time.
Tachyon: This is a memory-centric distributed file system that offers reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce.
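To give a feel for the micro-batch idea behind Spark Streaming, here is a purely illustrative Python sketch. Spark's real streaming API is quite different; the batch structure and the running word count below are hypothetical stand-ins for a DStream with stateful aggregation.

```python
from collections import Counter

def process_stream(batches):
    # Spark Streaming-style micro-batches: each small batch of lines
    # is processed as it arrives, and a running state is updated.
    running = Counter()
    for batch in batches:
        batch_counts = Counter(
            word for line in batch for word in line.split()
        )
        running.update(batch_counts)
        # Emit a snapshot of the aggregated state after each batch.
        yield dict(running)

batches = [["spark streaming"], ["spark sql", "spark"]]
snapshots = list(process_stream(batches))
assert snapshots[-1] == {"spark": 3, "streaming": 1, "sql": 1}
```

The key point the sketch captures is that streaming computation is expressed as repeated small batch jobs over the same engine, which is why Spark Streaming can share code and libraries with batch Spark applications.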