PySpark Hands-on Tutorial Using a Visual IDE

Anya Bida; Tutorial based on a lightning talk at PyBay 2022.

Learn by example

Let's see how a lead engineer would organize a project and write concise, readable code.
Toggle from code to visual design to see how each function fits into the bigger picture.

Now that I've sparked your interest, let's breakdown the steps so that everyone can build a PySpark pipeline:

  1. Sign into the Spark IDE using Databricks (1m) or directly via Prophecy (1m) 
  2. Create a new project and specify your spark cluster defaults (1m) 
  3. Create pipeline, connect pipeline to a Spark cluster, browse & load data (1.5m) 
  4. Explore transformations - toggle to view code (2m)
  5. Transform, join, aggregate in your own pipeline. (maybe try a join hint!) (5+ min)
  6. Commit PySpark code changes to your git repo (1m)
  7. Add a custom script (10+ min)
  8. Run the pipeline
  9. Repartition your data (to fix a failing job) as described below and re-run (3m)
  10. Congratulate yourself for building an actual pySpark pipeline!

We've seen how to use Prophecy's IDE for Spark. Now let's dig into a pro-tip to consider when moving Python workloads to PySpark.

Distributed computing

Traditional Python workloads run on a single machine (maybe a laptop or virtual machine). But here we're using PySpark, so we have to consider how the data is distributed across many machines. Figure 3 illustrates one PySpark job (left) which is crunching approximately the same amount of data on each machine core. On some occasions*, a PySpark job (right) can time out or fail because some machine cores are crunching much more data than the rest. This problem is called "skewed data" because the data is unevenly distributed across machine cores.

Fig 3. Skewed data can sometimes* cause PySpark jobs to time-out or fail. The job on the right is failing because some machine cores are crunching much more data than others. These heavy-lifting machine cores will take much longer to process tasks, eventually timing-out. *The spark-flowchart details many causes for Spark job slowness or failure, including skewed data. Image source: Micheal Berk

What do we do if our PySpark job is failing due to skewed data?

Skewed data can be resolved by repartitioning the dataframe (or RDD) for a more even distribution. Repartitioning causes a "shuffle." This costs time and resources, which is appropriate to resolve a failing job. In the IDE, we can choose a partition type and expression (or key). Let's run the job and check the Spark UI! Each task is included in a summary metrics table, displaying the amount of data read for that particular shuffle.

Fig 4. Hash repartition configured in the Spark IDE (left); corresponding distribution across partitions shown in the Spark UI (right). Is the data distributed evenly across the tasks?

Fig 5. Let's try a different expression (or key) for repartitioning (left); corresponding distribution across partitions shown in the Spark UI (right). Q: Does this configuration improve the data skew problem? A: No! The data is much more skewed compared to that in Fig 4.

The partition type, key, and number of partitions can be adjusted depending on the dataset and pipeline. I suggest to try a couple options and see what works best for your data. Usually partitioning on a date works if the data is relatively evenly distributed over multiple days. Partitioning on zipcode is usually not the best option because some zipcodes would have many more rows than others. Experiment with repartitioning until the job no longer fails due to skewed data. Also consider adjusting the partition strategy upon first reading the data into Spark. Get familiar with partitioning, as this topic comes up frequently when moving Python workloads to PySpark. Good thing tuning and troubleshooting can be accomplished via the Spark IDE.

This has been the PySpark Hands-On Tutorial! We got familiar with a new IDE, and learned by example how to code a variety of transformations using pySpark. Finally we walked through a pySpark pro-tip on repartitioning.

Try it for yourself at

PS What would you like to see in my next tutorial? Ping me or schedule a session with me or my team.