Spark 2.2 Features

In this article we are exploring the features of Spark 2.2 and umderstamd Structured Streaming in Spark.

Spark 2.2 Features

Spark 2.2 Features - Exploring new features of Spark 2.2

In this article we are exploring the new features and improvement which comes with the Spark 2.2. In Spark 2.2 Structured Streaming is the most important feature which is now production ready. There are many new improvements and exciting features which is added in this version.

The most important feature of Spark 2.2 is the Structured Streaming which was introduced in 2.0 and after long development its matured. Spark Structured Streaming is now production ready and experimental tag has been removed in Spark 2.2. Spark Structured streaming feature allows developers to write application which makes decisions real-time. This API is used to make real-time end to end applications in fault tolerant way.

Gist of changes and improvements in Spark 2.2:

  • Structured Streaming - Spark Structured Streaming is now production ready and its experimental tag has been removed.
     
  • An improvement in SQL Functionality - A number of new functionality in Spark SQL is added.
     
  • R Support in Spark - Spark is now providing support for running R Jobs on the distributed Spark cluster. So, it brings new machine learning algorithm on the Spark Cluster.
     
  • New functionality in MLlib and GraphX - The MLlib and GraphX library has been updated and now it comes with new functionality.

Now let's features of Spark 2.2 one by one.

Spark 2.2 Structured Streaming

Good news for developers, experimental on Structured Streaming has been removed in the Spark 2.2 and now its production ready. Developers can use Structured Streaming in Spark to make end-to-end streaming application in Spark.

Structured Streaming API is based on the Spark SQL and this is used for stream processing in Spark. This is high level declarative API in Spark which is used for continuous incremental execution of SQL queries in Spark. This API is used for end-to-end streaming application development which reads from source, perform processing and finally sends the data to sink.

Apache Spark Structured Streaming API can read streaming data from Kafka stream with low latency as compared to earlier versions. This gives a performance boost in using the Kafka streaming data in Apache Spark.

Spark 2.2 Core/SQL Improvements

Here are the import changes and improvements in Spark 2.2 SQL:

  • The Support for creating hive table with DataFrameWriter and Catalog is added in Spark 2.2
     
  • Multi-line parsing support for CSV and JSON file is latest addition in Spark 2.2
     
  • In Spark 2.2 support for LATERAL VIEW OUTER explode() has been added.
     
  • The support for the SQL like command: ALTER TABLE table_name ADD COLUMNS added, which is used for adding new column to existing hive table.
     
  • Support for hint function in Dataset/DataFrame added

There are many other improvements in Spark Core/SQL module.

SparkR - R Support in Spark

Finally R support is here on Spark 2.2, now R programmers can use Spark Distributed processing cluster to run R machine learning program. Support for R over Spark Cluster gives a boost to R programming language for machine learning over Spark Cluster.

The SparkR consists of R packages which allow using Spark from R program. It is also used for distributed machine learning using the MLlib module.

New functionality in MLlib and GraphX

Apache Spark MLlib gets major boost with the introduction of new algorithm for performing tasks such as  PageRank on datasets, or running multiclass logistic regression analysis. Spark is also used extensively for machine learning from huge data set.

Here are the list of new features added to MLlib in Spark 2.2:

  • The alternating least squares (ALS) algorithm support is added in Spark 2.2 and this can be used for getting top-k recommendations for all users or items.
     
  • Two new methods are added for using the DataFrames API, these methods are Correlation and ChiSquareTest stats functions.
     
  • The FPGrowth algorithm is added which can be used for frequent pattern mining in large datasets over high performance Spark 2.2 cluster.
     
  • Generalized Linear Models(GLM) of Spark now supports the full Tweedie family.
     
  • The Imputer class of Spark MLlib now allows feature transformer to impute missing values in a dataset and this is accomplished by mean or median of nearest data.
     
  • The LinearSVC class is added which can be used for linear Support Vector Machine classification.
     
  • During Logistic regression training Spark 2.2 supports constraints on the coefficients.

GraphX is API for computation of graph data over distributed cluster. In Spark 2.0 GraphX library comes with the bug fixes and performance improvements.

Apache Spark 2.2 is released with many new features to boost the use of this framework in processing of large datasets very fast. In this article we explored the new features of Spark 2.2. Check Apache Spark tutorials home page for more tutorials on Apache Spark.