Lately I’ve been working with nested data in PySpark on a semi-regular basis. One scenario that comes up a lot is applying transformations to semi-structured or unstructured data to produce a tabular dataset for consumption by data scientists. When processing and transforming data like this, I’ve found it beneficial to use the RDD data structure, since it lets me apply custom transformations the same way I would with normal Python data structures, while still getting the benefit of Spark and the functionality provided by the RDD API.
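
To make that concrete, here’s a minimal sketch of the pattern: nested records go into an RDD, an ordinary Python function flattens each one, and the result comes back out as a tabular DataFrame. The sample records, the `flatten` function, and the column names are all hypothetical, just stand-ins for whatever your real data looks like.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nested-to-tabular").getOrCreate()

# Hypothetical nested records, e.g. parsed from JSON logs.
records = [
    {"user": {"id": 1, "name": "alice"},
     "events": [{"type": "click", "ts": 100}]},
    {"user": {"id": 2, "name": "bob"},
     "events": [{"type": "view", "ts": 200}, {"type": "click", "ts": 205}]},
]

rdd = spark.sparkContext.parallelize(records)

def flatten(record):
    """Emit one flat row per event, looping over the dicts as plain Python."""
    user = record["user"]
    for event in record.get("events", []):
        yield (user["id"], user["name"], event["type"], event["ts"])

# flatMap runs the ordinary Python generator over each record in parallel;
# toDF then turns the flat tuples into the tabular dataset.
df = rdd.flatMap(flatten).toDF(["user_id", "user_name", "event_type", "event_ts"])
df.show()
```

The appeal is that `flatten` is just Python: you can unit test it locally on a plain dict, and the same function runs unchanged across the cluster once it’s handed to `flatMap`.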