No. 03 — Data Engineering

Eighty thousand listens.

Building the pipeline that should have existed first — a fully automated ETL system that turns raw Spotify data into a queryable warehouse, updated every day.

Data EngineeringMay 2025Solo

When I finished the Spotify listening behavior analysis, the most obvious gap was right in front of me: the data was already stale. A GDPR export is a one-off snapshot — the moment it arrives, it starts aging. Every insight about listening patterns was frozen in time, and there was no way to update it without requesting another export and waiting days.

So I built the thing that should have come first. A pipeline that ingests both the historical archive and the live Spotify API, stores everything in PostgreSQL with deduplication logic, and runs on a daily schedule via Apache Airflow — all on WSL2.

79,889Historical RecordsGDPR backfill
5Postgres Tables
7DAG TasksDaily @ midnight
100%Task Success RateOver 2 days monitored
I.The Architecture

The pipeline has two distinct ingestion paths that converge into a single PostgreSQL database. The first is a one-time backfill from the Spotify GDPR export — 79,889 historical plays spanning years of listening data, bulk-loaded via SQLAlchemy with conflict-safe upserts.

The second path is live and automated: three Spotify API endpoints pulled daily, each feeding its own table. The recently-played endpoint gives the last ~50 plays; top tracks and top artists snapshot long-term preferences.

Spotify API

/recently-played /top-tracks /top-artists

GDPR Export

StreamingHistory *.json files

Extract

Transform

Load

Airflow

@daily

PostgreSQL

listening_history

top_tracks

top_artists

daily_metrics

End-to-end data flow

II.The DAG

Orchestration runs through a single Airflow DAG on a @daily schedule with catchup=False. Seven tasks run in a partially parallel graph: database initialization and historical load happen first, then the Spotify client is initialized, and the three API fetch tasks run in parallel before the export load, metrics computation, and cleanup complete the chain.

XCom passes the Spotify client object between tasks. Each task has one retry with a five-minute delay, and failures send no email — the intent was always to monitor via the Airflow UI directly.

init_db_and_load_historicalinit_spotify_clientfetch_recently_playedfetch_top_tracksfetch_top_artistsload_streaming_exportcompute_daily_metricscleanup_old_data

Airflow DAG — task dependency graph

III.The Schema

The central table is listening_history, keyed on played_at. Every insert uses ON CONFLICT DO NOTHING on that key — so the historical backfill and the daily API pull can both write to the same table without creating duplicates, regardless of overlap.

The hour and weekday columns are denormalized at ingest time — redundant with played_at, but intentional. Temporal bucketing queries run without any function overhead on the timestamp, making heatmap and session analysis fast.

listening_historyPostgreSQL
ColumnTypeNote
played_atPKTIMESTAMPPrimary Key
dateDATE
hourSMALLINT0–23
weekdayVARCHAR(10)Mon–Sun
track_nameTEXT
artist_nameTEXT
album_nameTEXT
ms_playedINTEGERMilliseconds
platformTEXTDevice / OS
conn_countryVARCHAR(5)ISO 3166-1
IV.The Data

The shape of the ingestion is lopsided by design. The GDPR export contributes nearly everything by volume — it’s a complete history, not a sample. The daily API pulls are narrow by Spotify’s own limits: fifty recently-played tracks is roughly the last few hours of listening, not a day.

That gap — between the richness of the historical archive and the shallowness of the live API — is the core constraint of working with Spotify data.

Records ingested by source
GDPR Historical Backfill79,889
Daily API — Recently Played50
Top Tracks (long-term)50
Top Artists (long-term)20

Bar widths are log-scaled for visibility. GDPR backfill dwarfs the daily API pull by design.

Data sources & limits
/me/player/recently-played50 tracksDaily
/me/top/tracks50 tracksDaily
/me/top/artists20 artistsDaily
GDPR Export79,889 rowsOne-off
The GDPR export is a complete history. The API is a keyhole. Building the warehouse meant learning to use both.
V.The Hard Parts

WSL2 networking. PostgreSQL inside WSL2 listens on the Linux loopback by default. Getting it accessible from Windows — and from Airflow running in the same WSL2 environment — required configuring listen_addresses = '*' in postgresql.conf, updating pg_hba.conf, and port-forwarding through the Windows firewall.

OAuth in a headless environment. Spotipy’s default OAuth flow opens a browser for the authorization redirect — which doesn’t work inside a headless Airflow worker. The solution was a small FastAPI server to receive the callback, capture the authorization code, and store the token cache before the DAG ran.

Timezone-aware vs. naïve datetimes. The GDPR export timestamps are UTC strings; the API returns timezone-aware objects. Both needed to be normalized to timezone-naïve UTC before they could be used as primary keys in the same table — otherwise identical plays would appear as duplicates to the conflict handler.

VI.Limitations

The recently-played endpoint is the weakest link. Spotify caps it at fifty tracks — roughly four hours of listening — which means any gap in the daily run longer than that loses data permanently. A production deployment would need a more aggressive schedule, or a webhook-based ingest.

Audio features — valence, tempo, danceability — are no longer available through the API following Spotify’s deprecation in late 2024. All behavioral analysis is therefore temporal and contextual rather than acoustic.

The pipeline ran locally on WSL2. Migrating to a cloud environment would require re-solving the OAuth headless problem with a proper callback server, and replacing the local Postgres connection with a managed database.

Colophon

Built in Python with Spotipy for API access, SQLAlchemy for ORM and conflict-safe upserts, Pandas for transformation, and Apache Airflow for orchestration. Storage in PostgreSQL running on WSL2. OAuth callback handled via a lightweight FastAPI server.

The Index— fin —