Read text file in PySpark

Read text file in PySpark - How to read a text file in PySpark?

PySpark is a powerful API that can read files into an RDD and perform various operations on them. This simple tutorial reads a text file into an RDD and then collects its contents for printing.

RDD stands for Resilient Distributed Dataset. In Spark, an RDD is partitioned across the nodes of the cluster and uses their RAM to store the data. Any computation on an RDD is executed on the worker nodes of the Spark cluster.

In this way an RDD can be used to process large amounts of data in memory across a distributed cluster, and the processed data can then be fetched back to the driver, the process that runs your program. This architecture makes Spark very powerful for distributed data processing.

The program will read the file as lines, collect them, and print them on the console. Spark is a powerful framework that uses the memory of a distributed cluster and processes data in parallel.
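The split between work done on the workers and results fetched to the driver can be sketched as follows. This is only an illustration: the function name shout_long_words and the min_len parameter are hypothetical, and rdd is assumed to be a PySpark RDD of strings.

```python
def shout_long_words(rdd, min_len=4):
    """Keep lines with at least min_len characters and upper-case them.

    rdd is assumed to be a PySpark RDD of strings (hypothetical helper,
    not part of the tutorial's program).
    """
    # filter() and map() are lazy transformations: they only describe the
    # work that will be done on each partition of the RDD.
    transformed = (
        rdd.filter(lambda line: len(line) >= min_len)
           .map(lambda line: line.upper())
    )
    # collect() is an action: it triggers the computation and returns the
    # results to the driver as a plain Python list.
    return transformed.collect()
```

Given a SparkContext sc like the one created later in this tutorial, calling shout_long_words(sc.parallelize(["one", "two", "three", "four"])) should return only the entries with four or more characters, in upper case.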

Read text file in PySpark

We will create a text file with the following text:

one
two
three
four
five
six
seven
eight
nine
ten

Create a new file in any directory on your computer and add the above text. In my example I have created the file test1.txt. We will write PySpark code to read the data into an RDD and print it on the console.
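If you would rather create the file from a script than from a text editor, here is a minimal sketch in plain Python. It assumes the file is written as test1.txt in the current working directory, which is a choice made only for illustration. The complete program in this tutorial reads /home/deepak/test1.txt, so adjust the path to match your own setup.

```python
from pathlib import Path

# The ten sample lines used in this tutorial
words = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]

# Assumption: write to the current working directory. Change this path so
# that it matches the one you later pass to sc.textFile().
path = Path("test1.txt")
path.write_text("\n".join(words) + "\n", encoding="utf-8")

print(f"Wrote {len(words)} lines to {path.resolve()}")
```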

So, the first thing is to import the following libraries in "readfile.py":

from pyspark import SparkContext
from pyspark import SparkConf

This will import the required Spark libraries.

Next, create a SparkContext with the following code:

# create Spark context with Spark configuration
conf = SparkConf().setAppName("read text file in pyspark")
sc = SparkContext(conf=conf)

As explained earlier, SparkContext (sc) is the entry point to the Spark cluster. We will use the sc object to read the file and then collect the data.

Here is the complete program code (readfile.py):

from pyspark import SparkContext
from pyspark import SparkConf

# create Spark context with Spark configuration
conf = SparkConf().setAppName("read text file in pyspark")
sc = SparkContext(conf=conf)

# Read file into RDD
lines = sc.textFile("/home/deepak/test1.txt")

# Call collect() to return all lines to the driver as a Python list
llist = lines.collect()

# Print the lines one by one
for line in llist:
    print(line)
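Note that collect() brings every line into the driver's memory. That is fine for a ten-line file, but for a large file you may only want a count and a small sample. The following is a minimal sketch of that idea; the helper name preview and its parameter n are hypothetical, while count() and take() are standard RDD actions.

```python
def preview(rdd, n=5):
    """Return (total number of lines, first n lines) for an RDD of lines.

    rdd is assumed to be a PySpark RDD such as the one returned by
    sc.textFile(); this helper is illustrative, not part of readfile.py.
    """
    # count() is an action that returns the number of elements as an int
    total = rdd.count()
    # take(n) is an action that returns only the first n elements to the
    # driver, instead of the whole dataset as collect() does
    first = rdd.take(n)
    return total, first
```

In readfile.py you could call total, first = preview(lines, 3) in place of lines.collect() and print those values instead.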

To run the program, use the spark-submit tool. The command is:

./spark-submit readfile.py

The above command will display the following output:

deepak@deepak-VirtualBox:~/spark/spark-2.3.0-bin-hadoop2.7/bin$ ./spark-submit readfile.py 
one
two
three
four
five
six
seven
eight
nine
ten
deepak@deepak-VirtualBox:~/spark/spark-2.3.0-bin-hadoop2.7/bin$

In this tutorial we have learned how to read a text file into an RDD and then print the data line by line.