How can streaming platforms use big data to predict churn?
Using the KKBox dataset (tens of millions of transactions and listening log records), I built a scalable churn prediction pipeline with PySpark to show how distributed computing enables machine learning on massive data.
Problem
Churn is a major threat to subscription services. The business question: can we predict which users are most at risk of leaving, using transactional and behavioral data at scale?
Approach
- Data preparation at scale
  - Ingested multi-million-row user logs with PySpark DataFrames.
  - Engineered user-level features with window functions: active days, listening streaks, and play-count momentum.
  - Joined the listening features with demographics and transaction histories into a unified training dataset.
- Exploratory analysis
  - Analyzed churn distributions and retention curves with Spark SQL, visualizing sampled data.
- Modeling
  - Built a baseline logistic regression.
  - Trained and evaluated an XGBoost classifier (via PySpark ML wrappers) using AUROC, AUPRC, and Lift@5%.
- Evaluation
  - Compared models; the best achieved AUROC ≈ 0.72 and Lift@5% ≈ 2.95, i.e., the top 5% of scored users contains roughly 3x more churners than a random sample.
Results
- Demonstrated a scalable churn-prediction pipeline that handles millions of rows efficiently with Spark.
- Showed that play-count momentum (the change in listening activity over time) is a strong churn signal.
- Found that logistic regression underperformed, while tree-based methods captured non-linear churn drivers.
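Lift@5% is simply the churn rate among the top 5% of scored users divided by the overall churn rate. A minimal, framework-free sketch of the metric, on synthetic scores rather than the KKBox results:

```python
# Sketch: Lift@k on synthetic labels and scores.
import numpy as np

def lift_at_k(y_true, y_score, k=0.05):
    """Churn rate in the top-k scored fraction, divided by the base rate."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]        # highest predicted risk first
    n_top = max(1, int(len(y_true) * k))
    top_rate = y_true[order[:n_top]].mean()  # churn rate in the top slice
    base_rate = y_true.mean()                # churn rate overall
    return top_rate / base_rate

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=10_000)        # ~10% churners
score = y * 0.5 + rng.random(10_000) * 0.5   # an informative (synthetic) score
lift = lift_at_k(y, score)
print(f"Lift@5%: {lift:.2f}")                # well above 1 for a useful model
```

A lift of 2.95 means a retention team targeting the top 5% of scored users reaches roughly 3x more actual churners than contacting a random 5%.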
Impact
- Illustrates how subscription platforms can operationalize churn prediction at scale, combining demographic, transactional, and behavioral data.
- Shows the role of PySpark in bridging data engineering and data science workflows for applied ML.
Skills & tools
- PySpark (DataFrames, SQL, window functions) for large-scale preprocessing
- MLlib, XGBoost for modeling
- Python (pandas, matplotlib) for sampled EDA/visualizations
- Churn modeling: classification metrics (AUROC, AUPRC, Lift@5%)
Check out the full code on GitHub
