A WordPress.org Data Engineering Starter Project

WordCamp Asia 2026 — Mahangu Weerasinghe


SLIDES

Graphic titled 'A WordPress.org Data Engineering Starter Project' featuring the Automattic logo, WordCamp Asia 2026 logo, and QR code. Includes the name Mahangu Weerasinghe and designation as Data Engineer at Automattic Inc.

View slides on Google Slides

Quick Start

Prerequisites: uv, git, make. Windows users: use WSL. See the GitHub repo for more details.

git clone https://github.com/mahangu/meltano-wordpress-org-data-starter-project.git
cd meltano-wordpress-org-data-starter-project
make quickstart

View on GitHub →


The Problem

WordPress.org API theme and plugin data is useful, but using it directly means writing a lot of code.
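To give a sense of that code, here is a minimal sketch of paging through the plugin directory by hand, using only the Python standard library. It assumes the public `query_plugins` action on the plugins info endpoint and a response with an `info.pages` count and a `plugins` list. The helper names (`build_url`, `fetch_json`, `iter_plugins`) are illustrative and are not part of the starter project.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public WordPress.org plugins info API.
API_URL = "https://api.wordpress.org/plugins/info/1.2/"


def build_url(page, per_page=100):
    """Build a query_plugins URL for one page of results."""
    params = {
        "action": "query_plugins",
        "request[page]": page,
        "request[per_page]": per_page,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch a URL and decode its JSON body (this makes a network call)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_plugins(fetch=fetch_json, max_pages=None):
    """Yield plugin dicts page by page until the last page or max_pages."""
    page = 1
    while True:
        data = fetch(build_url(page))
        yield from data.get("plugins", [])
        # Assumed response shape: {"info": {"pages": N}, "plugins": [...]}
        total_pages = data.get("info", {}).get("pages", page)
        if page >= total_pages or (max_pages is not None and page >= max_pages):
            break
        page += 1
```

Calling `list(iter_plugins(max_pages=1))` would fetch a single page. Even this sketch leaves out retries, rate limiting, schema handling, and storage, and those gaps are where the amount of code grows.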

The Solution: The Open Source Data Stack

Four open-source tools, all running locally:

  1. Meltano — Extract, Transform, and Load (ETL) data. Uses our custom extractor tap-wordpress-org and the target-duckdb loader.
  2. DuckDB — Store and query large datasets fast, locally.
  3. Jupyter — Run everything in a local notebook.
  4. Python — Powers it all.
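As a rough illustration of the DuckDB and Python pieces of the stack, the sketch below queries plugin rows by install count. The `plugins` table name and the `top_plugins` and `demo` helpers are assumptions made for this example; the tables that `target-duckdb` actually creates in the project may be named and structured differently. The `slug` and `active_installs` columns mirror fields from the WordPress.org plugin API.

```python
def top_plugins(con, table="plugins", limit=5):
    """Return (slug, active_installs) rows, largest install count first.

    `con` is any connection exposing execute(sql).fetchall(), which is
    the shape that duckdb.connect() returns.
    """
    # The table name is interpolated, so pass only trusted identifiers.
    sql = (
        f"SELECT slug, active_installs FROM {table} "
        f"ORDER BY active_installs DESC LIMIT {int(limit)}"
    )
    return con.execute(sql).fetchall()


def demo():
    """Run the query against a throwaway in-memory DuckDB database."""
    import duckdb  # third-party package: pip install duckdb

    # A real run would open the project's database file instead.
    con = duckdb.connect(":memory:")
    con.execute("CREATE TABLE plugins (slug VARCHAR, active_installs BIGINT)")
    # Made-up rows, for illustration only.
    con.executemany(
        "INSERT INTO plugins VALUES (?, ?)",
        [("alpha-plugin", 1000), ("beta-plugin", 50000), ("gamma-plugin", 200)],
    )
    for slug, installs in top_plugins(con, limit=2):
        print(slug, installs)
```

Calling `demo()` in a notebook cell exercises the query without touching any project data. Pointing `duckdb.connect()` at the database file the pipeline writes would run the same query on extracted data, once the table name is adjusted.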

Available Commands

make help             Show available targets
make quickstart       Install, create sample data, extract events, launch notebook
make extract-plugins  Extract WordPress plugins data
make extract-events   Extract WordPress events data
make extract-themes   Extract WordPress themes data
make extract-all      Extract all available data streams
make sample-data      Create sample data from WordPress.org API
make notebook         Start Jupyter notebook
make check-data       Check what data is in the database

Extra Credit

LLMs can help with all of these:


Links