I started this project with a simple question:
What makes a song a hit on Spotify?
By the end, I had unintentionally walked through key lessons in descriptive analysis, confounding, matching, causal inference, and bootstrapping.
This is what I learned, not just about the data, but about the mindset statistical analysis requires in the real world.
Descriptive Patterns Aren’t Enough
Spotify provides rich audio features such as tempo, energy, loudness, valence, danceability, and a popularity score that loosely reflects how well a song performs.
Naturally, I began with some exploration:
- Plotted distributions
- Compared the averages between hit and non-hit songs
- Checked correlations between features and popularity
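The exploration steps above can be sketched roughly as follows. This is a minimal illustration on a synthetic stand-in for the Spotify feature table (the column names and the planted relationship between danceability and popularity are assumptions for demonstration, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical setup: a small synthetic stand-in for the Spotify feature table.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
    "tempo": rng.uniform(60, 200, n),
})
# Popularity loosely driven by danceability plus noise (for illustration only).
df["popularity"] = (50 * df["danceability"] + rng.normal(0, 10, n)).clip(0, 100)
df["is_hit"] = (df["popularity"] > df["popularity"].quantile(0.8)).astype(int)

# Compare feature averages between hit and non-hit songs.
group_means = df.groupby("is_hit")[["danceability", "energy", "tempo"]].mean()
print(group_means)

# Check correlations between each feature and popularity.
corrs = df[["danceability", "energy", "tempo"]].corrwith(df["popularity"])
print(corrs.sort_values(ascending=False))
```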

Danceability stood out. Popular songs tended to be more danceable. But here’s the first major realization:
Just because a feature is correlated with success does not mean it causes it.

Maybe danceable songs are more likely to be promoted. Or maybe danceability correlates with genre, which correlates with chart success.
That led to a deeper question:
Does high danceability actually cause a song to do well?
Thinking in Terms of Treatment and Control
To explore causality, I reframed the analysis. What if I treat high danceability as a “treatment”? Could I compare the hit probability between treated (high-danceability) and control (lower-danceability) songs while holding everything else constant?
Of course, I could not run a randomized experiment. But I could simulate one using observational data.

Techniques I Used (and What They Taught Me)
1. Propensity Score Matching (PSM)
This was my entry point into causal inference.
Using a logistic regression model, I predicted the probability that a song would have high danceability based on features like tempo, energy, loudness, duration, and valence.
Then I matched each high-danceability song to a similar low-danceability one with a close propensity score.
This created two balanced groups that were similar in all features except for danceability.
Matching does not make the data perfect, but it reduces confounding. It approximates the counterfactual question: "what would have happened if this song had a different danceability score?"
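Here is a sketch of that PSM workflow on synthetic data. The confounders (`energy`, `valence`), the planted treatment effect, and the greedy 1:1 matching rule are all assumptions chosen to keep the example self-contained; a real analysis would use the actual feature table and likely a caliper on match distance:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: synthetic songs where covariates influence both
# danceability (the "treatment") and the chance of becoming a hit.
rng = np.random.default_rng(1)
n = 1000
energy = rng.uniform(0, 1, n)
valence = rng.uniform(0, 1, n)
danceability = 0.4 * valence + 0.6 * rng.uniform(0, 1, n)
treated = (danceability > np.median(danceability)).astype(int)
hit = (rng.uniform(0, 1, n) < 0.1 + 0.2 * treated + 0.2 * energy).astype(int)
df = pd.DataFrame({"energy": energy, "valence": valence,
                   "treated": treated, "hit": hit})

# 1. Fit a propensity model: P(high danceability | covariates).
X = df[["energy", "valence"]]
ps_model = LogisticRegression().fit(X, df["treated"])
df["ps"] = ps_model.predict_proba(X)[:, 1]

# 2. Greedy 1:1 nearest-neighbor matching on the propensity score.
treated_df = df[df["treated"] == 1]
control_df = df[df["treated"] == 0].copy()
matched_hits = []
for _, row in treated_df.iterrows():
    if control_df.empty:
        break
    idx = (control_df["ps"] - row["ps"]).abs().idxmin()
    matched_hits.append((row["hit"], control_df.loc[idx, "hit"]))
    control_df = control_df.drop(idx)  # match without replacement

# 3. Effect on the treated: difference in hit rates across matched pairs.
pairs = np.array(matched_hits)
att = pairs[:, 0].mean() - pairs[:, 1].mean()
print(f"Matched-pair difference in hit rate: {att:.3f}")
```

Because the synthetic data plants a positive treatment effect, the matched-pair difference should come out clearly above zero.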
2. Stratified Comparison
To refine the analysis, I divided songs into strata based on their baseline hit probability.
Within each group, I compared hit rates between high- and low-danceability songs. This helped control for variation across the popularity spectrum so I was not comparing a global artist to an obscure lo-fi track.
Stratification helped ensure that I was comparing like with like.
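A stratified comparison can be sketched like this, again on synthetic data. The baseline hit probability here is simulated directly; in practice it would come from a covariates-only model. The five-quantile binning and the planted 0.15 treatment bump are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical setup: songs with a baseline hit probability plus a
# constant bump for high danceability (the "treatment").
rng = np.random.default_rng(2)
n = 2000
baseline = rng.uniform(0.05, 0.5, n)
treated = rng.integers(0, 2, n)
hit = (rng.uniform(0, 1, n) < baseline + 0.15 * treated).astype(int)
df = pd.DataFrame({"baseline": baseline, "treated": treated, "hit": hit})

# Divide songs into strata by baseline hit probability.
df["stratum"] = pd.qcut(df["baseline"], q=5, labels=False)

# Within each stratum, compare hit rates between treated and control,
# then average the per-stratum differences, weighted by stratum size.
per_stratum = df.groupby(["stratum", "treated"])["hit"].mean().unstack()
diffs = per_stratum[1] - per_stratum[0]
weights = df["stratum"].value_counts(normalize=True).sort_index()
stratified_effect = (diffs * weights).sum()
print(f"Stratified effect estimate: {stratified_effect:.3f}")
```

Comparing only within strata is what keeps a near-certain hit from being matched against an obscure track.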
3. Bootstrapping for Confidence
Once I estimated the treatment effect, I needed to understand how much that estimate might vary by chance. I used bootstrap resampling, drawing multiple samples with replacement and recomputing the treatment effect each time.
This gave me a confidence interval around the result.
Final estimate: songs with high danceability were about three times more likely to become hits, even after adjusting for other features.
What I Really Learned (Beyond the Numbers)
- Descriptive statistics are essential. They are not just for show. They help you understand the shape, structure, and oddities in your data before you go deeper.
- Causality is about structured comparison. Correlations give hints. Causal thinking asks: compared to what?
- Statistical analysis is iterative. Every answer raises new questions. Every assumption needs checking. Every conclusion needs context.
- Real-world data is noisy, biased, and imperfect. Still, with the right tools and framing, it is possible to draw meaningful insights.
Final Thoughts
This project started with curiosity and turned into a practical exercise in statistical reasoning.
It reminded me that statistics is not just about running tests or making graphs.
It is a way of thinking clearly about uncertainty, structure, and impact.
And that mindset is something I want to bring to every project I work on.
