How can streaming platforms use big data to predict churn?
Using the KKBox dataset (tens of millions of transactions and listening log records), I built a scalable churn prediction pipeline with PySpark to show how distributed computing enables machine learning on massive data.
Problem
Churn is a major threat to subscription services. The business question: can we predict which users are most at risk of leaving, using transactional and behavioral data at scale?
Approach
- Data preparation at scale
  - Ingested multi-million-row user logs with PySpark DataFrames.
  - Engineered user-level features with window functions: active days, listening streaks, and play-count momentum.
  - Joined the listening features with demographics and transaction histories into a unified training dataset.
- Exploratory analysis
  - Analyzed churn distributions and retention curves with Spark SQL, visualizing sampled data.
- Modeling
  - Built a baseline logistic regression.
  - Trained and evaluated an XGBoost classifier (via PySpark ML wrappers) using AUROC, AUPRC, and Lift@5%.
- Evaluation
  - Compared models; the best achieved AUROC ≈ 0.72 and Lift@5% ≈ 2.95, i.e., the top 5% of scored users contains roughly 3x more churners than a random sample.
Results
- Demonstrated a scalable churn-prediction pipeline that handles millions of rows efficiently with Spark.
- Showed that play-count momentum (the change in listening activity over time) is a strong churn signal.
- Found that logistic regression underperformed, while tree-based methods captured non-linear churn drivers.
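Lift@5% is simply the churn rate among the top 5% of scored users divided by the overall churn rate. A minimal, framework-free sketch of the metric, on synthetic scores rather than the KKBox results:

```python
# Sketch: Lift@k on synthetic labels and scores.
import numpy as np

def lift_at_k(y_true, y_score, k=0.05):
    """Churn rate in the top-k scored fraction, divided by the base rate."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]        # highest predicted risk first
    n_top = max(1, int(len(y_true) * k))
    top_rate = y_true[order[:n_top]].mean()  # churn rate in the top slice
    base_rate = y_true.mean()                # churn rate overall
    return top_rate / base_rate

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=10_000)        # ~10% churners
score = y * 0.5 + rng.random(10_000) * 0.5   # an informative (synthetic) score
lift = lift_at_k(y, score)
print(f"Lift@5%: {lift:.2f}")                # well above 1 for a useful model
```

A lift of 2.95 means a retention team targeting the top 5% of scored users reaches roughly 3x more actual churners than contacting a random 5%.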
Impact
- Illustrates how subscription platforms can operationalize churn prediction at scale, combining demographic, transactional, and behavioral data.
- Shows the role of PySpark in bridging data engineering and data science workflows for applied ML.
Skills & tools
- PySpark (DataFrames, SQL, window functions) for large-scale preprocessing
- MLlib, XGBoost for modeling
- Python (pandas, matplotlib) for sampled EDA/visualizations
- Churn modeling: classification metrics (AUROC, AUPRC, Lift@5%)
Check out the full code on GitHub
