Dealing with NULL in PySpark transformations

Lately I’ve been dealing with nested data on a semi-regular basis with PySpark. One scenario that comes up a lot is applying transformations to semi-structured or unstructured data to generate a tabular dataset for consumption by data scientists. When processing and transforming data I’ve previously found it beneficial to use the RDD data structure, so that I can easily apply custom transformations the same way I would with normal Python data structures, but with the benefit of Spark and the functionality provided by the RDD API.
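As a rough illustration of that idea, here is a minimal sketch of a plain Python function that turns a nested record into a flat row while tolerating missing or null values. The record layout and the names get_nested and to_row are hypothetical, made up for this example. Because it is ordinary Python, the same function could be handed to an RDD with rdd.map(to_row).

```python
from typing import Any, Dict, Optional


def get_nested(record: Optional[Dict[str, Any]], *path: str) -> Any:
    """Walk `path` through nested dicts, returning None if any step is missing or null."""
    current: Any = record
    for key in path:
        if not isinstance(current, dict):
            # A parent was None (or not a dict), so there is nothing to descend into.
            return None
        current = current.get(key)
    return current


def to_row(record: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Flatten one nested record into a tabular row (hypothetical field names)."""
    return {
        "id": get_nested(record, "id"),
        "city": get_nested(record, "address", "city"),
    }


# Sample data: one complete record, one with a null parent, one with the parent missing.
records = [
    {"id": 1, "address": {"city": "Louisville"}},
    {"id": 2, "address": None},
    {"id": 3},
]
rows = [to_row(r) for r in records]
```

Records with a null or absent address still produce a row, with city set to None, instead of raising an error partway through the job.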

DerbyPy Intro to PySpark

This month at DerbyPy I provided a high-level introduction to PySpark. For this talk I went over the Spark execution model, talked about the difference between the PySpark DataFrame and RDD APIs, and provided some examples of how to use both. As part of this I put together a Jupyter notebook and some scripts that can be used via spark-submit, along with instructions on how to run PySpark locally.

DerbyPy Introduction to Python Modules and Packages

Most programming languages offer ways to organize your code into namespaces. These namespaces are logical containers that group different names and behaviors together and isolate them to that namespace. Organizing your code with namespaces makes it easier to structure your application without naming collisions, and it can make it easier for you and others to maintain your code by adding some additional organization to your project.
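A small sketch of what that isolation looks like in Python, using two standard library modules that each define a function named sqrt. The module name acts as the namespace, so the two functions never collide.

```python
import cmath
import math

# Both modules define `sqrt`, but each lives in its own namespace.
real_root = math.sqrt(4)       # real-valued square root from the math namespace
complex_root = cmath.sqrt(-4)  # complex-valued square root from the cmath namespace

print(real_root, complex_root)
```

Qualifying each call with its module name keeps it clear which behavior is being used, which is the same benefit your own modules and packages give the rest of a project.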

Recursive Search with Python

Recently I received some JSON-like data that I needed to transform into a tabular dataset. As part of that there was a specific key that could occur as a child of different keys at different depths in the structure. Not only could the key I needed appear at different locations and depths, but when it was located it was possible that it would have N sibling occurrences I needed to retrieve at the same location. Finally, for all of these there was a set of id and date keys at the top level of the structure that I was asked to include with each search key result.
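One way to approach a problem like this is a recursive generator that walks dicts and lists. The following is a minimal sketch, not necessarily the approach taken in the post. The sample record, the key name "note", and the helpers find_key and to_rows are all hypothetical.

```python
from typing import Any, Dict, Iterator, List


def find_key(node: Any, target: str) -> Iterator[Any]:
    """Yield every value stored under `target`, at any depth of dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield value
            # Keep descending: the key can also appear deeper inside this value.
            yield from find_key(value, target)
    elif isinstance(node, list):
        # Sibling occurrences at the same location show up as list items.
        for item in node:
            yield from find_key(item, target)


def to_rows(record: Dict[str, Any], target: str, top_level_keys: List[str]) -> List[Dict[str, Any]]:
    """Build one flat row per match, carrying the top-level keys along with it."""
    shared = {k: record.get(k) for k in top_level_keys}
    return [{**shared, target: match} for match in find_key(record, target)]


# Hypothetical input: "note" appears at two depths, with two siblings in a list.
record = {
    "id": 7,
    "date": "2018-01-01",
    "a": {"note": "x", "b": [{"note": "y"}, {"note": "z"}]},
}
rows = to_rows(record, "note", ["id", "date"])
```

Each match becomes its own row and repeats the top-level id and date values, which gives the tabular shape described above.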

Publishing with Pelican on Windows

To get things started I thought it might be a good idea to document using Pelican on Windows with GitHub and Gandi for blog publishing. I’ll start by configuring Pelican and GitHub. Once that’s working I’ll talk about configuring Gandi so you can use a custom domain. If you’re using a different domain provider you may need to use different settings, but GitHub has plenty of documentation around this that I’ll provide links for. Using Pelican on Windows isn’t that much different from using it on macOS or Linux, but you won’t find as many tutorials or be able to use the quickstart Makefile.