Mahangu Weerasinghe

Sri Lankan, Automattician, WordPress user since 0.70


Code

Udacity Data Engineering Nanodegree: Capstone Project

This ETL project was developed as the Capstone Project for the Udacity Data Engineering Nanodegree (DEND).

The aim of the project was to combine chess game data from two popular chess website APIs:

  1. The Lichess.org API
  2. The Chess.com API — via the chessdotcom Python module

— and thereby allow the creation of a custom bank of games drawn from the Internet's two most popular chess sites.

The project first pulls the game data via the APIs that both sites offer, then performs ETL operations on it using Apache Spark (PySpark), transforming more than a million rows of raw API responses into fact and dimension tables that surface useful information.
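
As a rough illustration of the extraction step, the sketch below pulls a handful of games from each API with plain requests calls; the username, month, and game limit are placeholders, and the project itself uses the chessdotcom module rather than raw HTTP for the Chess.com side.

    # Minimal extraction sketch against the two public APIs.
    # USERNAME, the month, and the "max" limit are placeholders.
    import requests

    USERNAME = "example_user"

    # Lichess exports a user's games as newline-delimited JSON (NDJSON).
    lichess_resp = requests.get(
        f"https://lichess.org/api/games/user/{USERNAME}",
        headers={"Accept": "application/x-ndjson"},
        params={"max": 100},
    )
    lichess_games = [line for line in lichess_resp.text.splitlines() if line]

    # Chess.com's published-data API serves one JSON archive per month.
    chesscom_resp = requests.get(
        f"https://api.chess.com/pub/player/{USERNAME}/games/2021/01",
        headers={"User-Agent": "chess-etl-sketch"},
    )
    chesscom_games = chesscom_resp.json().get("games", [])

    print(len(lichess_games), len(chesscom_games))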

At each stage, data is saved in Apache Parquet format, which has a number of advantages over file formats such as CSV, chief among them being columnar storage. Output can be written either locally or to Amazon S3.
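
A hedged sketch of the staging step follows; the paths below are placeholders, but the pattern of reading raw JSON with PySpark and writing Parquet either locally or to S3 is the one described above.

    # Staging sketch: read raw API responses and persist them as Parquet.
    # File paths and bucket names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("chess-etl").getOrCreate()

    raw_games = spark.read.json("data/raw/chesscom/*.json")

    # Write the staging table as Parquet; swap the local path for an s3a:// URI
    # (with AWS credentials configured for Hadoop) to target S3 instead.
    raw_games.write.mode("overwrite").parquet("data/staging/chesscom_games")
    # raw_games.write.mode("overwrite").parquet("s3a://example-bucket/staging/chesscom_games/")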

This project includes:

  • At least 2 data sources ✅ (Chess.com and Lichess data)
  • More than 1 million lines of data. ✅ (The combined Chess.com and Lichess staging tables have 1,047,618 rows.)
  • At least two data sources/formats (csv, api, json) ✅ (API responses are in JSON, but the Chess.com response contains a PGN blob that we need to parse separately, as sketched below.)
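
For example, the PGN blob embedded in a Chess.com game record can be parsed with the python-chess library; the sample PGN and the headers pulled out below are illustrative only.

    # Sketch of parsing a PGN blob from a Chess.com API response with python-chess.
    import io
    import chess.pgn

    pgn_text = (
        '[Event "Live Chess"]\n'
        '[White "example_white"]\n'
        '[Black "example_black"]\n'
        '[Result "1-0"]\n'
        '\n'
        '1. e4 e5 2. Nf3 Nc6 1-0\n'
    )

    game = chess.pgn.read_game(io.StringIO(pgn_text))
    print(game.headers["White"], game.headers["Black"], game.headers["Result"])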



Udacity Data Engineering Nanodegree: Apache Airflow Project

A part of the Udacity Data Engineering Nanodegree, this ETL project looks to collect and present user activity information for a fictional music streaming service called Sparkify. To do this, data is gathered from song metadata files and application log files in .json format, generated from the Million Song Dataset and from eventsim respectively and provided as part of the course.

Part of this ETL process carries over from an earlier iteration of the project, in which we loaded these .json files into an Amazon Redshift cluster, using Redshift's COPY command to do the data extraction for us. This was particularly useful because it took full advantage of Redshift's MPP (massively parallel processing) architecture to perform parallel ETL on the JSON files that had to be processed.

The difference in this new project is that we use Apache Airflow for workflow management, allowing us to automate the ETL process so that it can run at an interval of our choice (e.g. daily or hourly), automatically log all tasks, and retry them when they fail. These tasks are managed via the Airflow Scheduler, which makes use of Directed Acyclic Graphs (DAGs) for this purpose. The DAGs in turn utilise several custom Airflow Operators to perform these tasks, which gives the project a very useful level of modularity.
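
A minimal sketch of what such a DAG can look like is below; the schedule, retry settings, task names, and callables are illustrative placeholders rather than the project's actual custom operators.

    # Minimal Airflow DAG sketch: hourly schedule, automatic retries, and two
    # placeholder tasks standing in for the project's custom operators.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator  # Airflow 2.x import path

    default_args = {
        "owner": "sparkify",
        "retries": 3,                        # retry failed tasks automatically
        "retry_delay": timedelta(minutes=5),
        "depends_on_past": False,
    }

    with DAG(
        dag_id="sparkify_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",         # run at an interval of our choice
        default_args=default_args,
        catchup=False,
    ) as dag:
        stage_events = PythonOperator(
            task_id="stage_events_to_redshift",
            python_callable=lambda: print("COPY log data into staging tables"),
        )
        load_songplays = PythonOperator(
            task_id="load_songplays_fact_table",
            python_callable=lambda: print("INSERT into fact table from staging"),
        )

        stage_events >> load_songplays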

To run this project, you will need a working installation of Apache Airflow, plus connections to Amazon S3 and Amazon Redshift. The basic two-step ETL process is as follows:

  1. Use Redshift's COPY command to load the .json files from S3 into staging tables in Redshift
  2. Transform the data in Redshift and create fact and dimension tables
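
As a rough sketch of step 1, the COPY can be issued from Python with psycopg2; the bucket, table, IAM role, and cluster details below are placeholders, not the project's.

    # Step 1 sketch: load JSON logs from S3 into a Redshift staging table via COPY.
    # Bucket, role, and connection details are placeholders.
    import psycopg2

    COPY_STAGING_EVENTS = """
        COPY staging_events
        FROM 's3://example-bucket/log_data/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
        FORMAT AS JSON 's3://example-bucket/log_json_path.json';
    """

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
        dbname="sparkify",
        user="awsuser",
        password="example-password",
        port=5439,
    )
    with conn, conn.cursor() as cur:
        cur.execute(COPY_STAGING_EVENTS)  # Redshift parallelises the load across slices
    conn.close()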


Udacity Data Engineering Nanodegree: Apache Spark Project

A part of the Udacity Data Engineering Nanodegree, this ETL project looks to collect and present user activity information for a fictional music streaming service called Sparkify. To do this, data is gathered from song metadata files and application log files in .json format, generated from the Million Song Dataset and from eventsim respectively and provided as part of the course.

These log files are stored in two Amazon S3 directories, and are loaded into an Amazon EMR Spark cluster for processing. The etl.py script reads these files from S3, transforms them to create five different tables in Spark and writes them to partitioned parquet files in table directories on S3.
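
A hedged sketch of the kind of transformation etl.py performs is below; the S3 paths, columns, and partition keys are illustrative, not the project's exact schema. Reading a table back later is just spark.read.parquet() against the same prefix, which is what makes the reuse described below possible.

    # ETL sketch for one table: read raw JSON from S3, select columns, and write
    # out a partitioned Parquet table. Paths, columns, and partitions are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

    song_data = spark.read.json("s3a://example-input-bucket/song_data/*/*/*/*.json")

    songs_table = song_data.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Partitioning keeps later reads cheap and selective.
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet("s3a://example-output-bucket/songs/")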

Having the data stored as .parquet files means it can easily be loaded into Hadoop for analysis whenever required, without the .json files having to be reprocessed each time.


Other Projects

Upstream was an open source prototype tool for transferring system logs on Debian/Ubuntu. It is now inactive.

OCON-SL is an open source corpus and corpus server written in Python that I used for a while as a graduate student. The project is no longer active, but both the code and the corpus are released under the GPL.

Finally, I’ve also made minor code contributions to WordPress plugins like Jetpack and Sensei.

About Me

I’m Mahangu Weerasinghe, a Data Engineer at Automattic, the company behind WordPress.com, Jetpack, WooCommerce and Tumblr. Our team is responsible for maintaining our primary Hadoop cluster and providing support to datums across the company.
