Configure PyCharm CE to work with Apache Spark

This guide should help you set up PyCharm CE to work with Python 3 and Apache Spark (tested with version 2.1).

First, create a new Pure Python PyCharm project.

Now copy the content of the example word count script to your project. Your IDE should complain at the following line:

from pyspark.sql import SparkSession

because it doesn’t know where to find pyspark.sql, which is part of the Python Spark library.
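To see the same problem from the interpreter's point of view, the following is a minimal sketch that checks whether a top-level module can be located on the current path. The helper name is_importable is made up for this illustration; importlib.util.find_spec is part of the standard library.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the top-level module can be located on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Expected to report False when the Spark Python libraries
    # are not yet on the interpreter's path.
    print("pyspark importable:", is_importable("pyspark"))
```

If this reports False for the interpreter your project uses, the IDE warning is consistent with what Python itself sees.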

To tell PyCharm where the Python Spark libraries are, go to Preferences -> Project -> Project Structure and add the zip files under $SPARK_HOME/python/lib to the content roots. $SPARK_HOME is the location of your Apache Spark directory. If you haven’t downloaded Apache Spark yet, you can get it from the Apache Spark downloads page.
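If you are unsure which zip files are meant, the sketch below lists them and shows a rough programmatic equivalent of adding them to the path. It assumes SPARK_HOME is set in your environment; the function names spark_lib_zips and add_spark_to_path are hypothetical helpers, not part of Spark or PyCharm.

```python
import glob
import os
import sys


def spark_lib_zips(spark_home):
    """Return the sorted .zip archives under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_spark_to_path(spark_home):
    """Append each archive to sys.path so 'import pyspark' can resolve."""
    for archive in spark_lib_zips(spark_home):
        if archive not in sys.path:
            sys.path.append(archive)


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        add_spark_to_path(spark_home)
        print(spark_lib_zips(spark_home))
    else:
        print("SPARK_HOME is not set")
```

The printed paths are the files to add in the Project Structure dialog.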

Next, go to Run -> Edit Configurations, create a new configuration using the default Python configuration profile, and add the following environment variables:

SPARK_HOME=<your spark home dir>
PYTHONPATH=<your spark home dir>/python
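As a quick sanity check, a sketch like the following can be run under the new configuration to confirm that both variables arrived as intended. The helper check_spark_env is a made-up name, and the check assumes PYTHONPATH should contain exactly the path SPARK_HOME/python, as shown above.

```python
import os


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    # PYTHONPATH may hold several entries separated by os.pathsep.
    expected = os.path.join(spark_home, "python")
    entries = env.get("PYTHONPATH", "").split(os.pathsep)
    if expected not in entries:
        problems.append("PYTHONPATH does not contain " + expected)
    return problems


if __name__ == "__main__":
    for problem in check_spark_env(os.environ):
        print(problem)
```

No output means both variables look consistent with each other.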

Then specify the name of your main .py script and the location of the text file whose words you want to count.

Finally, run your new configuration. It should run a word count job using Apache Spark.

Have fun!