In this article we explore the new features and improvements that come with Spark 2.2. The most important change in Spark 2.2 is that Structured Streaming is now production ready, and many other exciting features and improvements have been added in this version.
The headline feature of Spark 2.2 is Structured Streaming, which was introduced in Spark 2.0 and has matured after a long development period. Structured Streaming is now production ready, and its experimental tag has been removed in Spark 2.2. The Structured Streaming API allows developers to write applications that make decisions in real time, and to build end-to-end, real-time applications in a fault-tolerant way.
Gist of changes and improvements in Spark 2.2:
Now let's look at the features of Spark 2.2 one by one.
Good news for developers: the experimental tag on Structured Streaming has been removed in Spark 2.2, and it is now production ready. Developers can use Structured Streaming to build end-to-end streaming applications on Spark.
The Structured Streaming API is built on Spark SQL and is used for stream processing in Spark. It is a high-level declarative API for the continuous, incremental execution of SQL queries. With it, developers build end-to-end streaming applications that read from a source, process the data, and finally write the results to a sink.
The Structured Streaming API can read streaming data from Kafka with lower latency than earlier versions, which gives a performance boost to applications that consume Kafka data in Apache Spark.
Here are the important changes and improvements in Spark 2.2 SQL:
There are many other improvements in the Spark Core and SQL modules.
Finally, R support has arrived in Spark 2.2: R programmers can now use a Spark distributed processing cluster to run R machine learning programs. Support for R over a Spark cluster gives a boost to the R language for machine learning at scale.
SparkR consists of R packages that allow Spark to be used from R programs. It also supports distributed machine learning through the MLlib module.
Apache Spark MLlib gets a major boost with the introduction of new algorithms for tasks such as running PageRank on datasets or performing multiclass logistic regression analysis. Spark is also used extensively for machine learning on huge datasets.
Here is the list of new features added to MLlib in Spark 2.2:
GraphX is the API for computation over graph data on a distributed cluster. In Spark 2.2 the GraphX library comes with bug fixes and performance improvements.
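To show what a graph computation such as PageRank actually calculates, here is a plain-Python sketch of the iteration that GraphX partitions and runs across a cluster. The graph, damping factor, and iteration count are made up for the example.

```python
def pagerank(edges, n, iters=20, d=0.85):
    """Toy PageRank on a directed graph with vertices 0..n-1.

    edges: list of (source, target) pairs.
    d: damping factor (probability of following a link).
    """
    out_deg = {}
    for s, _ in edges:
        out_deg[s] = out_deg.get(s, 0) + 1

    # Start with a uniform rank distribution.
    ranks = {v: 1.0 / n for v in range(n)}
    for _ in range(iters):
        # Each vertex spreads its rank evenly over its out-edges.
        contrib = {v: 0.0 for v in range(n)}
        for s, t in edges:
            contrib[t] += ranks[s] / out_deg[s]
        # Combine the damped contributions with the teleport term.
        ranks = {v: (1 - d) / n + d * c for v, c in contrib.items()}
    return ranks

# A 3-cycle 0 -> 1 -> 2 -> 0 is perfectly symmetric, so every vertex
# keeps rank 1/3.
r = pagerank([(0, 1), (1, 2), (2, 0)], 3)
```

GraphX distributes exactly this kind of iterative computation: the rank and contribution tables become RDDs partitioned across the cluster, with a shuffle per iteration.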
Apache Spark 2.2 is released with many new features that make processing large datasets fast and convenient. In this article we explored the new features of Spark 2.2. Check the Apache Spark tutorials home page for more tutorials on Apache Spark.