In this tutorial we discuss the data structures available in the Apache Spark framework.
Before writing a Spark application, a developer should understand the data structures the framework provides so that they can choose the right one for the application's requirements.
Apache Spark is a distributed, large-scale parallel processing framework that runs over a Spark cluster; the nodes of the cluster perform computations on large data sets. The data structures explored in this tutorial are supported in Spark programming with Java, Scala, Python, and R, so programmers can use them when writing Spark applications in any of these languages. This fast in-memory parallel processing engine works very well when the correct data structures are used, so it is important for developers to understand them thoroughly and practice with plenty of examples.
The Apache Spark framework provides the following data structures:

- RDD (Resilient Distributed Dataset)
- DataFrame
- Dataset
- GraphFrame
We will now discuss each of these data structures in turn and look at its features.
RDD stands for Resilient Distributed Dataset, and it was introduced with the first version of the Spark framework. An RDD is an immutable data structure that distributes its data in partitions across the nodes of the cluster, and computation on the data is performed on the node where the data resides. This makes the architecture flexible and enables parallel processing of the data. RDDs expose interfaces for performing transformations and actions.
An RDD is immutable: once created, it cannot be modified. Instead, new RDDs are derived through coarse-grained transformations such as map, filter, and groupBy; performing an operation on an RDD produces a new RDD rather than changing the existing one.
Common use cases where an RDD is a good fit are:

- working with unstructured data, such as text, logs, or media streams, that does not fit a tabular schema
- applying low-level transformations and actions where you need fine-grained control over the computation
- manipulating data with functional programming constructs rather than SQL-style, column-based expressions
In future tutorials we will provide many examples of working with RDDs in PySpark.
Like RDDs, DataFrames are immutable: once created they cannot be modified, and new DataFrames are derived through operations such as select, filter, and groupBy. A DataFrame is a distributed collection of data organized into named columns, much like the rows and columns of an RDBMS table.
DataFrames run on the Spark SQL context and support SQL-like queries for querying data. They can be built from many different sources, including Hive tables, structured data files, external databases, and existing RDDs.
The DataFrame API was designed to meet the requirements of modern big data and data science applications. It is heavily influenced by the design principles of data frames in R and of pandas in Python.
A Dataset is also a distributed collection of data that organizes data into named columns, but it adds type safety on top: types are checked at compile time. The Dataset API was developed with the aim of adding compile-time type safety to DataFrames, and it is available in the Scala and Java APIs (in Python, DataFrames play this role).
Tungsten is a component of Spark SQL that provides efficient operations on Datasets because it works directly at the byte level. Since a Dataset is typed, Spark already knows the format of the data it contains; with this information, encoders are generated that let operations run quickly on data in the Tungsten binary format.
Data stored in the Tungsten format takes roughly 4 to 5 times less space and delivers better performance, including better memory utilization.
The GraphFrame data structure is used for storing and processing graph data. A GraphFrame stores its data in two distinct DataFrames:

- a vertices DataFrame, with one row per node of the graph (identified by an "id" column)
- an edges DataFrame, with one row per edge (identified by "src" and "dst" columns)
In this tutorial we have learned about the various data structures supported by the Spark framework.
Check more tutorials at: