Using a dataset of 40,000+ Spotify tracks (2000–2020), I explored what drives a song’s popularity.
Beyond simple correlations, this project applied classical statistics, causal inference, and Bayesian modeling to test whether audio features, especially danceability, actually cause higher popularity.
Problem
Song popularity is often explained by audio features (danceability, energy, tempo, etc.), but correlation doesn’t imply causation.
The question: Does high danceability truly increase the chance of a song becoming a hit, or is it just associated with hits?
Approach
- EDA – Explored feature distributions and hit prevalence (~1% of tracks).
- Hypothesis Testing – Compared high vs. low danceability tracks.
- Bootstrapping – Quantified uncertainty in mean popularity differences.
- Regression Models – Multivariate linear & logistic regression for prediction.
- Causal Inference – Propensity Score Matching (PSM) to estimate causal effect of danceability.
- Bayesian Modeling – Used PyMC + ArviZ to build posterior estimates and credible intervals.
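To make the causal-inference step concrete, here is a minimal sketch of propensity score matching on synthetic data. It is not the project's code: the data is simulated, `energy` stands in as a single hypothetical confounder, and the effect size, caliper, and learning rate are all assumptions chosen for illustration. The propensity model is a hand-rolled one-covariate logistic regression so the example needs only the standard library.

```python
import math
import random

rng = random.Random(42)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic stand-in data (NOT the Spotify dataset). "energy" is a hypothetical
# confounder that raises both the chance of high danceability and popularity.
TRUE_EFFECT = 1.5  # assumed effect baked into the simulation
n = 1500
energy = [rng.random() for _ in range(n)]
treated = [1 if rng.random() < sigmoid(3.0 * (e - 0.5)) else 0 for e in energy]
popularity = [30 + 20 * e + TRUE_EFFECT * t + rng.gauss(0, 5)
              for e, t in zip(energy, treated)]

def fit_propensity(x, t, lr=1.0, epochs=300):
    """One-covariate logistic regression for P(T=1 | x), fit by gradient descent."""
    b0, b1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, ti in zip(x, t):
            err = sigmoid(b0 + b1 * xi) - ti
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / m
        b1 -= lr * g1 / m
    return b0, b1

def mean(values):
    return sum(values) / len(values)

b0, b1 = fit_propensity(energy, treated)
score = [sigmoid(b0 + b1 * e) for e in energy]

treated_idx = [i for i, t in enumerate(treated) if t == 1]
control_idx = [i for i, t in enumerate(treated) if t == 0]

# 1-nearest-neighbour matching on the propensity score, with replacement,
# discarding pairs whose scores differ by more than the caliper.
CALIPER = 0.05
pair_diffs = []
for i in treated_idx:
    j = min(control_idx, key=lambda c: abs(score[c] - score[i]))
    if abs(score[j] - score[i]) <= CALIPER:
        pair_diffs.append(popularity[i] - popularity[j])

# Raw gap between groups (confounded) vs. the gap within matched pairs.
naive = (mean([popularity[i] for i in treated_idx])
         - mean([popularity[i] for i in control_idx]))
att = mean(pair_diffs)
n_matched = len(pair_diffs)

print(f"naive difference:       {naive:.2f}")
print(f"matched estimate (ATT): {att:.2f} from {n_matched} pairs")
```

Because the simulated confounder pushes both treatment and outcome upward, the naive difference overstates the effect, while the matched estimate should land much closer to the value built into the simulation. On real data the same logic only holds if the matched covariates capture the relevant confounders.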
Results
- Hits are rare (~1% of songs).
- Danceability: High-danceability tracks are 3x more likely to be hits.
- Bootstrapping: Confirmed a ~30-point popularity gap between hits and non-hits.
- Regression: Danceability was the strongest predictor among the audio features in the linear and logistic models.
- Causal Inference: PSM estimated a +1.28-point lift in popularity attributable to high danceability, assuming the matched covariates capture the relevant confounders.
- Bayesian Analysis: The 95% credible interval supports the conclusion that danceability increases hit probability.
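For readers unfamiliar with credible intervals, the sketch below shows how one can be computed for a difference in hit rates. It deliberately swaps PyMC for a simpler conjugate Beta-Binomial model sampled with the standard library, and the counts are illustrative placeholders rather than the project's data; they were picked only to resemble a roughly 1% hit rate.

```python
import random

rng = random.Random(0)

# Illustrative placeholder counts (NOT the project's data).
high_hits, high_total = 300, 20000   # hypothetical high-danceability group
low_hits, low_total = 100, 20000     # hypothetical low-danceability group

def posterior_draws(hits, total, draws=20000):
    """Sample the hit rate from Beta(1 + hits, 1 + misses): a uniform prior
    updated with binomial data."""
    return [rng.betavariate(1 + hits, 1 + total - hits) for _ in range(draws)]

high = posterior_draws(high_hits, high_total)
low = posterior_draws(low_hits, low_total)

# Posterior of the hit-rate ratio, summarised by an equal-tailed 95% interval.
ratio = sorted(h / l for h, l in zip(high, low))
ci_low = ratio[int(0.025 * len(ratio))]
ci_high = ratio[int(0.975 * len(ratio))]

# Posterior probability that the high-danceability hit rate is the larger one.
prob_higher = sum(h > l for h, l in zip(high, low)) / len(high)

print(f"95% credible interval for the hit-rate ratio: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"P(high-danceability rate > low-danceability rate) = {prob_higher:.3f}")
```

An interval for the ratio that sits entirely above 1 is the kind of evidence the Bayesian bullet refers to. A full PyMC model would additionally let other audio features enter as covariates, which this two-group comparison does not attempt.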
Impact
- Demonstrated that danceability is not just correlated with song success but causally linked to it.
- Built a reproducible statistical workflow blending classical inference, causal reasoning, and Bayesian methods.
- Provided a framework that music platforms or labels could use to identify potential hits early.
Skills & Tools
- Hypothesis testing & A/B test simulation
- Propensity Score Matching & DAG reasoning
- Bootstrapping & confidence intervals
- Logistic & linear regression
- Bayesian modeling (PyMC, ArviZ)
- Python (pandas, NumPy, matplotlib, seaborn)
- Data storytelling with reproducible visualizations
View the project on GitHub or read my article about what I learned from it.
