Building a WordPress.org API Meltano Extractor with Claude Code


As a part of an internal AI tooling exercise at Automattic, I recently co-developed a Meltano extractor with Claude Code, Anthropic’s agentic AI development assistant. You can find tap-wordpress-org on GitHub and on the Meltano Hub.

The Challenge

WordPress.org hosts over 60,000 plugins and 10,000 themes, along with valuable statistics about WordPress usage, PHP versions, and MySQL deployments across millions of websites. While this data is publicly available through various API endpoints, there wasn’t a standardised way to extract it for data pipelines and analytics workflows. I wanted to develop a solution that could:

  • Extract data from multiple WordPress.org API endpoints
  • Handle incremental updates for frequently changing plugin data
  • Transform and normalize the data for analytics
  • Integrate seamlessly with modern data stacks

Enter Claude Code and Meltano

Claude Code proved to be a great companion for this project. While I have worked on Meltano Extractors before, with Claude Code this process was not only faster, but more enjoyable. It needed very little input from me to get going.

The Development Process

I essentially just pointed Claude Code at https://codex.wordpress.org/WordPress.org_API and said something like:

“let’s make a Meltano Extractor for these APIs using the new Meltano SDK”

It asked some follow-up questions and got going. 🚀 It created, tested, and committed much of the code on its own, and also helped set up and troubleshoot CI/CD.

What impressed me most was Claude’s ability to:

  • Generate the complete project structure with proper Meltano SDK patterns
  • Implement all 8 different streams (plugins, themes, events, patterns, and various stats)
  • Handle edge cases like HTML entity decoding and missing fields
  • Add features like configurable request delays and incremental syncing
  • Fix issues in real-time based on actual API responses

Key Features Implemented

The final tap-wordpress-org extractor includes:

  1. Eight Data Streams:
  • Plugins (with incremental sync support)
  • Themes
  • WordPress Events
  • Block Patterns
  • WordPress Version Statistics
  • PHP Version Statistics
  • MySQL Version Statistics
  • Locale Statistics
  2. Smart Data Handling:
  • Automatic HTML entity decoding (e.g., &amp; → &)
  • Graceful handling of missing or null fields
  • Configurable request delays to respect API rate limits
  3. Production-Ready Features:
  • Incremental replication for plugins based on last_updated timestamps
  • Full Singer protocol compliance
  • Comprehensive error handling
  • Type-safe schema definitions
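
To make the "Smart Data Handling" items concrete, here is a minimal sketch of entity decoding combined with tolerant field access. It is not taken from the tap's source: the function name clean_record, its parameters, and the sample plugin are hypothetical and exist only for illustration.

```python
import html
from typing import Any, Dict, Iterable


def clean_record(
    raw: Dict[str, Any],
    text_fields: Iterable[str],
    all_fields: Iterable[str],
) -> Dict[str, Any]:
    """Return a record with every expected field present and text fields decoded."""
    text = set(text_fields)
    cleaned: Dict[str, Any] = {}
    for field in all_fields:
        value = raw.get(field)  # a missing field becomes None instead of raising
        if field in text and isinstance(value, str):
            value = html.unescape(value)  # "&amp;" -> "&"
        cleaned[field] = value
    return cleaned


# Hypothetical API payload: "short_description" is absent on purpose.
record = clean_record(
    {"name": "Forms &amp; Surveys", "slug": "forms-surveys"},
    text_fields=["name", "short_description"],
    all_fields=["name", "slug", "short_description"],
)
print(record)
```

Filling absent fields with None keeps every record the same shape, which tends to make downstream schema validation simpler.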

Installing and Running the Extractor

Getting started with tap-wordpress-org is straightforward. Here’s how to install Meltano and use the extractor:

Prerequisites

# Check that Python 3.8 or higher is installed
python3 --version

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Meltano and the Tap

# Install Meltano
pip install meltano

# Install the tap directly from GitHub
pip install git+https://github.com/Automattic/tap-wordpress-org.git

Configuration

Create a config.json file to configure the extractor (see more about configuration in the Meltano docs):

{
  "stream_selection": ["plugins", "themes", "wordpress_stats"],
  "request_delay": 0.3,
  "start_date": "2025-01-01T00:00:00Z"
}
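
How the tap applies request_delay internally is not shown in this post. As an assumption, one simple way to honour such a setting is to pause between consecutive requests. The throttled helper and its sleep parameter below are hypothetical names, and the list of page numbers stands in for real API calls.

```python
import json
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items, pausing `delay` seconds between consecutive ones."""
    first = True
    for item in items:
        if not first and delay > 0:
            sleep(delay)  # no pause before the very first item
        first = False
        yield item


config = json.loads('{"request_delay": 0.3}')

# Record the pauses instead of actually sleeping, so the example runs instantly.
pauses = []
pages = list(throttled([1, 2, 3], config["request_delay"], sleep=pauses.append))
print(pages, pauses)
```

Passing the sleep function in as a parameter is a design choice for testability; in real use the default time.sleep would apply.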

Running the Extractor

To extract data and save it as JSONL (JSON Lines format):

# Create an output dir
mkdir -p output

# Run the tap and save output to a file
python -m tap_wordpress_org.tap --config config.json > output/wordpress_data.jsonl

Sample Data Output

The extractor produces clean, structured data ready for analysis. Here are some examples:

Plugin Data

{
  "type": "RECORD",
  "stream": "plugins",
  "record": {
    "name": "Hello Dolly",
    "slug": "hello-dolly",
    "author": "<a href=\"https://profiles.wordpress.org/matt/\">Matt Mullenweg</a>",
    "author_profile": "https://profiles.wordpress.org/matt/",
    "requires": "4.6",
    "tested": "6.8.1",
    "requires_php": false,
    "rating": 60,
    "num_ratings": 297,
    "active_installs": 700000,
    "downloaded": 0,
    "last_updated": "2025-05-07 4:50pm GMT",
    "added": "2008-07-06",
    "homepage": "http://wordpress.org/plugins/hello-dolly/",
    "short_description": "This is not just a plugin, it symbolizes the hope and enthusiasm...",
    "download_link": "https://downloads.wordpress.org/plugin/hello-dolly.1.7.3.zip",
    "tags": {}
  },
  "time_extracted": "2025-07-11T00:00:00.000000+00:00"
}
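
Note that last_updated in the record above is not an ISO 8601 timestamp, so comparing it against start_date needs a parsing step. The sketch below shows one way to do that with the standard library. It assumes the API always uses this exact "YYYY-MM-DD h:mmpm GMT" layout, and the helper names are hypothetical rather than the tap's own.

```python
from datetime import datetime, timezone


def parse_last_updated(value: str) -> datetime:
    """Parse a value like '2025-05-07 4:50pm GMT' into an aware UTC datetime."""
    # %I is the 12-hour clock and %p the am/pm marker (assumes an English/C locale).
    naive = datetime.strptime(value, "%Y-%m-%d %I:%M%p GMT")
    return naive.replace(tzinfo=timezone.utc)


def parse_start_date(value: str) -> datetime:
    """Parse an ISO 8601 value like '2025-01-01T00:00:00Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


updated = parse_last_updated("2025-05-07 4:50pm GMT")
start = parse_start_date("2025-01-01T00:00:00Z")
print(updated.isoformat())
print(updated > start)
```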

Theme Data

{
  "type": "RECORD",
  "stream": "themes",
  "record": {
    "name": "Twenty Twenty-Five",
    "slug": "twentytwentyfive",
    "version": "1.2",
    "preview_url": "https://wp-themes.com/twentytwentyfive/",
    "screenshot_url": "//ts.w.org/wp-content/themes/twentytwentyfive/screenshot.png?ver=1.2",
    "rating": 78,
    "num_ratings": 9,
    "homepage": "https://wordpress.org/themes/twentytwentyfive/",
    "requires": "6.7",
    "requires_php": "7.2"
  },
  "time_extracted": "2025-07-11T00:00:00.000000+00:00"
}

WordPress Statistics

{
  "type": "RECORD",
  "stream": "wordpress_stats",
  "record": {
    "version": "6.8",
    "count": 7500000,
    "percent": 45.5
  },
  "time_extracted": "2025-07-10T03:10:28.063321+00:00"
}
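
Because the output file holds Singer messages rather than bare records, a small amount of parsing is needed before analysis. The following sketch groups RECORD messages by stream. The function name records_by_stream is my own, and the sample lines are trimmed-down illustrations of the message types.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def records_by_stream(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Collect the payload of every RECORD message, keyed by stream name."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        if message.get("type") == "RECORD":  # ignore SCHEMA and STATE messages
            grouped[message["stream"]].append(message["record"])
    return dict(grouped)


sample = [
    '{"type": "SCHEMA", "stream": "wordpress_stats", "schema": {}, "key_properties": []}',
    '{"type": "RECORD", "stream": "wordpress_stats", "record": {"version": "6.8", "count": 7500000, "percent": 45.5}}',
    '{"type": "STATE", "value": {}}',
]
grouped = records_by_stream(sample)
print({stream: len(rows) for stream, rows in grouped.items()})
```

The same function accepts an open file handle, so passing it the result of opening output/wordpress_data.jsonl should work line by line without loading the whole file.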

Incremental Sync in Action

The extractor also supports incremental syncing for plugins. After an initial full sync, subsequent runs only fetch plugins updated since the last run:

# First run - gets all plugins updated after start_date
python -m tap_wordpress_org.tap --config config.json > run1.jsonl

# Second run - uses state from the previous run to get only new updates
# (state.json must hold the value of the last STATE message emitted by that run)
python -m tap_wordpress_org.tap --config config.json --state state.json > run2.jsonl
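
The commands above do not create state.json themselves. In the Singer protocol a tap writes STATE messages to its output, and the value of the last one is what the next run should receive. Here is a minimal sketch of pulling that value out. The last_state helper is a hypothetical name, and the bookmark layout in the sample is illustrative rather than copied from a real run.

```python
import json
from typing import Iterable, Optional


def last_state(lines: Iterable[str]) -> Optional[dict]:
    """Return the value of the final STATE message, or None if there is none."""
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "STATE":
            state = message["value"]  # later STATE messages supersede earlier ones
    return state


# Illustrative output from a first run; the bookmark structure is an assumption.
run1 = [
    '{"type": "STATE", "value": {"bookmarks": {"plugins": {"replication_key_value": "2025-05-01"}}}}',
    '{"type": "RECORD", "stream": "plugins", "record": {"slug": "hello-dolly"}}',
    '{"type": "STATE", "value": {"bookmarks": {"plugins": {"replication_key_value": "2025-05-07"}}}}',
]
state = last_state(run1)
print(json.dumps(state))
```

Running last_state over the lines of run1.jsonl and writing the result to state.json with json.dump would produce the file the second command expects. When the tap runs inside a Meltano project, Meltano manages this state for you.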

Lessons Learned

Working with Claude Code on this project taught me that:

  1. Agentic AI-assisted development can be powerful: Claude Code understood the requirements and, overall, generated good-quality code that would have taken me a few hours to write manually, even with the Meltano SDK.
  2. Iterative development is key: Rather than trying to get everything perfect upfront, Claude Code and I worked iteratively, testing against real API endpoints and refining the implementation. For example, we discovered only through actual testing that the WordPress.org API doesn’t support field filtering for themes. Claude’s agentic ability to test and make changes meant that it immediately adapted the code accordingly.
  3. Documentation matters: Claude Code created comprehensive todos and documentation (Markdown files) for itself as it went along. I believe this helped it maintain a clear context, which in turn probably reduced hallucinations and errors.

Looking Forward

The tap-wordpress-org extractor is now available on GitHub and is also listed on the Meltano Hub.

Whether you’re analyzing WordPress ecosystem trends, monitoring plugin security updates, or building competitive intelligence tools, tap-wordpress-org provides a foundation for extracting WordPress.org data in a robust, scalable way.

Get Started Today

Ready to analyze WordPress.org data? Get started with:

# Install the tap
pip install git+https://github.com/Automattic/tap-wordpress-org.git

# Create a config file
echo '{"stream_selection": ["plugins", "themes"], "request_delay": 0.3}' > config.json

# Run the extractor
python -m tap_wordpress_org.tap --config config.json > wordpress_data.jsonl

