Apache Spark Tutorials - Apache Spark runs up to 100 times faster in memory than MapReduce
Apache Spark is the next innovation in Big Data cluster computing, running up to 100 times faster in memory than MapReduce. The framework is open source and was originally developed in the AMPLab at the University of California, Berkeley. The codebase was later donated to the Apache Software Foundation, which continues to maintain and develop it.
Spark is designed to let an application program load data into the memory of the cluster's machines and then query or process it very quickly. Because the data can be cached and queried repeatedly, results come back fast, which also makes the framework well suited to iterative machine learning algorithms.
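As a minimal sketch of this in-memory model, the following PySpark snippet caches a text file in cluster memory and then runs two queries over the cached data; the file name `events.txt` is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingExample").getOrCreate()

# Load the data and keep it cached in cluster memory.
# "events.txt" is a placeholder path for this sketch.
lines = spark.sparkContext.textFile("events.txt").cache()

# Repeated queries reuse the cached data instead of re-reading from disk.
total = lines.count()
errors = lines.filter(lambda line: "ERROR" in line).count()
print(f"{errors} errors out of {total} lines")

spark.stop()
```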
Spark cannot run on its own; it requires a cluster manager along with a distributed storage system.
The following cluster managers are supported by Spark (a configuration sketch follows this list):
- Standalone (native Spark cluster)
- Hadoop YARN
- Apache Mesos
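The cluster manager is selected through the master URL given when the application is configured. The sketch below shows the URL form for each manager; the host names and ports are placeholders, not real addresses:

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager. Host names and ports
# here are placeholders; substitute your own cluster addresses.
spark = (
    SparkSession.builder
    .appName("ClusterManagerExample")
    # Standalone (native Spark cluster):
    .master("spark://master-host:7077")
    # Hadoop YARN would be:   .master("yarn")
    # Apache Mesos would be:  .master("mesos://mesos-host:5050")
    # Local testing without a cluster: .master("local[*]")
    .getOrCreate()
)
```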
The following distributed storage systems are supported by Apache Spark (see the read sketch after this list):
- Hadoop Distributed File System (HDFS)
- Apache Cassandra
- OpenStack Swift
- Amazon S3
- Kudu
- Custom storage systems are also supported
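Whichever backend is used, Spark addresses it through a URI scheme on the path. The paths below are placeholders; reading from Amazon S3 additionally assumes the hadoop-aws connector and credentials are configured on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageExample").getOrCreate()

# The URI scheme selects the storage system; all paths here are placeholders.
hdfs_rdd = spark.sparkContext.textFile("hdfs://namenode:9000/data/input.txt")

# Amazon S3 (assumes the hadoop-aws connector and credentials are set up):
# s3_rdd = spark.sparkContext.textFile("s3a://my-bucket/data/input.txt")

print(hdfs_rdd.count())
spark.stop()
```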
Here are the Apache Spark Framework tutorials:
Apache Spark 2.4.0
- Installing Apache Spark on Ubuntu 18.04
- Introduction to Apache Spark Scala Shell
- Apache Spark data structures
- Resilient Distributed Dataset (RDD) in Apache Spark
- Starting with Apache Spark in Ubuntu 22.04
Spark Scala
PySpark
- Creating SparkSession - Learn to create a SparkSession in your PySpark program
- sc.parallelize pyspark - Using sc.parallelize in a PySpark program (a combined sketch of both topics follows this list)
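The two PySpark topics above fit in one short sketch: create a SparkSession, then use sc.parallelize to turn a local Python list into an RDD. The application name and the numbers are illustrative only:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point of a PySpark program.
spark = (
    SparkSession.builder
    .appName("ParallelizeExample")
    .master("local[*]")  # local mode for testing; use a cluster URL in production
    .getOrCreate()
)

# sc.parallelize distributes a local collection across the cluster as an RDD.
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

spark.stop()
```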
Apache Spark Articles/Tutorials