Sparkify — Predict User Churn using Spark

Project Overview

In this article, we focus on Sparkify, a capstone project for the Udacity Data Scientist Nanodegree, in which we try to predict whether users of a fictitious audio streaming platform are likely to unsubscribe, based on their activity logs.

For this article, we used the following dataset and notebook.

Problem Statement


1. Data exploration:

  • load the data
  • deal with incomplete data, identify outliers
  • visualize and understand correlations

2. Feature engineering:

  • transform the data into a dataset of one sample record per user, aggregating features and performing statistics on them
  • decide what features to take or what potential new features to engineer
  • define the target response variable for training

3. Modeling and evaluation:

  • define objective in mathematical terms
  • create proper metrics to track performance
  • come up with suitable models and train them
  • evaluate various models according to the chosen metrics and decide on the most promising one
  • refine the model


Exploratory Data Analysis

  • ts: timestamp (in milliseconds; integer)
  • auth: represents whether user is logged in (string)
  • userId: user’s unique ID (number, stored as string)
  • firstName: user’s first name (string)
  • lastName: user’s last name (string)
  • gender: user’s gender (string)
  • registration: time when user registered (in milliseconds; integer)
  • level: user’s subscription level (string)
  • location: user’s city and state (string)
  • sessionId: unique session ID (integer)
  • itemInSession: unique ID within session (integer sequence)
  • page: page accessed by the user (string)
  • artist: song’s artist (string)
  • song: song’s title (string)
  • length: song’s length (in seconds; float)
  • status: HTTP return code (integer)
  • method: HTTP method (string)
  • userAgent: HTTP user agent (string)

Key insights:

  • the small dataset has 286500 rows and 18 columns.
  • there are no duplicated rows
  • there are 22 distinct types of user activity (values of page)
  • ts and registration contain timestamp values and can be converted to an easily readable date format
  • beginning of the observation period: 2018-10-01 00:01:57
  • end of the observation period: 2018-12-03 01:11:16
  • there are 225 unique users, of whom 52 churned. Of all log rows, 8346 have an empty userId.
  • even though there were somewhat more male than female users (54% to 46%), female users accounted for more of the activity (55% against 45% for male users)
  • paid subscription users generated 228162 log rows, free users 58338
  • almost 80% of all logged activities were switches to the next song
  • the two activity types “Cancel” and “Cancellation Confirmation” follow each other in every cancellation case, as two log rows with a small time difference
  • total activity per user varies a lot: the highest count is 9632 and the lowest 6, with a mean of 1267 and a 75th percentile of 1878
  • there are 2354 distinct session IDs in total, and no log row is missing its sessionId. A sessionId is unique only within a user, i.e. two different users may have sessions that share the same ID.
  • correlations between features vary in strength, but none of the chosen features is redundant
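The ts and registration columns are epoch timestamps in milliseconds, so they have to be divided by 1000 before being turned into dates. Here is a minimal plain-Python sketch of that conversion. The sample value and the use of UTC are assumptions, chosen so that the result matches the start of the observation period quoted above.

```python
from datetime import datetime, timezone


def ms_to_datetime(ts_ms: int) -> datetime:
    """Convert an epoch timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


# Illustrative value (an assumption, not read from the dataset here).
first_ts = 1538352117000
print(ms_to_datetime(first_ts).strftime("%Y-%m-%d %H:%M:%S"))  # 2018-10-01 00:01:57
```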

We can find out whether users churned by checking if they viewed the ‘Cancellation Confirmation’ page, which is displayed after a successful cancellation performed on the ‘Cancel’ page.
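As a rough illustration of this labeling rule, the sketch below marks a user as churned as soon as one of their log rows has page equal to “Cancellation Confirmation”. It is plain Python rather than the Spark SQL used in the notebook, and the log rows are hypothetical.

```python
from typing import Dict, Iterable, Set

CHURN_PAGE = "Cancellation Confirmation"


def churned_users(logs: Iterable[Dict[str, str]]) -> Set[str]:
    """Return the IDs of users who reached the cancellation confirmation page."""
    return {row["userId"] for row in logs if row["page"] == CHURN_PAGE}


# Hypothetical log rows, reduced to the two fields that matter here.
logs = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "10", "page": "Cancel"},
    {"userId": "10", "page": "Cancellation Confirmation"},
    {"userId": "20", "page": "NextSong"},
    {"userId": "20", "page": "Thumbs Up"},
]

churned = churned_users(logs)
# One binary label per user: 1 = churned, 0 = still subscribed.
labels = {uid: int(uid in churned) for uid in sorted({r["userId"] for r in logs})}
print(labels)  # {'10': 1, '20': 0}
```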

In order to predict whether users are likely to churn, we extract some of the seemingly most relevant features from the log:

  • Number of songs per session: nbSongsPerSession
  • Number of adverts per session: nbAdvertsPerSession
  • Percentage of thumbs up / thumbs down given per song: thumbsUpPerSong / thumbsDownPerSong
  • Percentage of songs added to playlist: addToPlaylistPerSong
  • Number of friends added per session: addFriendsPerSession
  • User gender (we chose to encode 0 for male, 1 for female): gender
  • Whether user recently (during the observed timeframe) upgraded or downgraded: upgraded / downgraded
  • Whether user connected with a Windows, Mac, or Linux device: windows / mac / Linux
  • Length of time during which user has been registered: time registered

Note that we chose to divide all values representing a quantity by the number of sessions or songs. For churned users, we only have data up to the point when they churn (on average, the middle of the observed period), so total quantities over the observed period are lower on average for churned users than for non-churned ones. Raw totals would therefore look like a good indicator of churn, but only on the training and test data: when we actually want to predict whether users will churn, we have data up to the same point in time for all users, and that signal is no longer available.
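The normalization can be sketched as follows, in plain Python rather than Spark SQL. The page names “NextSong”, “Roll Advert”, and “Thumbs Up” are assumptions about how these activities appear in the page column, and the sample rows are hypothetical; only three of the features are shown.

```python
from collections import defaultdict


def per_user_ratios(logs):
    """Aggregate raw log rows into rate features, one record per user."""
    sessions = defaultdict(set)                     # userId -> distinct sessionIds
    counts = defaultdict(lambda: defaultdict(int))  # userId -> page -> row count
    for row in logs:
        sessions[row["userId"]].add(row["sessionId"])
        counts[row["userId"]][row["page"]] += 1

    features = {}
    for uid, pages in counts.items():
        n_sessions = len(sessions[uid])
        n_songs = pages["NextSong"]
        features[uid] = {
            "nbSongsPerSession": n_songs / n_sessions,
            "nbAdvertsPerSession": pages["Roll Advert"] / n_sessions,
            # Guard against users who never played a song.
            "thumbsUpPerSong": pages["Thumbs Up"] / n_songs if n_songs else 0.0,
        }
    return features


# Hypothetical user "A": 2 sessions, 4 songs, 1 thumbs up, 1 advert.
sample = [
    {"userId": "A", "sessionId": 1, "page": "NextSong"},
    {"userId": "A", "sessionId": 1, "page": "NextSong"},
    {"userId": "A", "sessionId": 1, "page": "Thumbs Up"},
    {"userId": "A", "sessionId": 2, "page": "NextSong"},
    {"userId": "A", "sessionId": 2, "page": "Roll Advert"},
    {"userId": "A", "sessionId": 2, "page": "NextSong"},
]
print(per_user_ratios(sample)["A"])
```

Because every value is a rate, a user observed for one month and a user observed for two months with the same habits end up with comparable feature values.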

Data Visualization

Data Preprocessing

  • load JSON dataset
  • exclude entries with an empty user ID
  • using Spark SQL queries, extract the features discussed above for all users: churned, downgraded, upgraded, nbSongsPerSession, nbAdvertsPerSession, thumbsUpPerSong, thumbsDownPerSong, addToPlaylistPerSong, addFriendsPerSession, windows, mac, Linux, gender, level, time registered
  • save these features to a CSV file, so that they can be reloaded when needed without repeating the above steps
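The first two steps can be illustrated with a small plain-Python sketch. It assumes the log file stores one JSON object per line, which is an assumption about the file layout, and the three records are hypothetical.

```python
import json


def load_valid_rows(lines):
    """Parse JSON-lines log records and drop those with an empty userId."""
    rows = (json.loads(line) for line in lines if line.strip())
    return [row for row in rows if row.get("userId", "") != ""]


# Hypothetical records; real rows carry all 18 columns.
raw = [
    '{"userId": "10", "page": "NextSong"}',
    '{"userId": "", "page": "Home"}',
    '{"userId": "20", "page": "Logout"}',
]
rows = load_valid_rows(raw)
print(len(rows))  # 2
```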


Although LinearSVC took more training time, it achieved the highest F1 score, 0.702. LogisticRegression has a medium training time and F1 score, so we will try to tune the LogisticRegression model.
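For reference, the F1 score is the harmonic mean of precision and recall, which makes it more informative than accuracy when churned users are a minority. The sketch below computes it for the churn class from confusion-matrix counts. The counts are hypothetical, and the evaluator used in the notebook may average the score over both classes instead.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 churners caught, 2 false alarms, 4 churners missed.
print(round(f1_score(tp=8, fp=2, fn=4), 3))  # 0.727
```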

Hyperparameter Tuning

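from pyspark.ml.tuning import ParamGridBuilder

# Grid over the elastic-net mixing ratio and the regularization strength
# of the LogisticRegression estimator clf_LR (defined earlier).
paramGrid = (ParamGridBuilder()
             .addGrid(clf_LR.elasticNetParam, [0.1, 0.5, 1.0])
             .addGrid(clf_LR.regParam, [0.01, 0.05, 0.1])
             .build())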
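The grid above is a Cartesian product of the two value lists, so it yields 3 x 3 = 9 candidate settings, each of which has to be trained and scored. The plain-Python sketch below only enumerates the combinations to make that count explicit; it does not train anything.

```python
from itertools import product

elastic_net_values = [0.1, 0.5, 1.0]
reg_values = [0.01, 0.05, 0.1]

# Every (elasticNetParam, regParam) pair becomes one candidate model.
grid = [
    {"elasticNetParam": e, "regParam": r}
    for e, r in product(elastic_net_values, reg_values)
]
print(len(grid))  # 9
print(grid[0])    # {'elasticNetParam': 0.1, 'regParam': 0.01}
```

If the candidates are evaluated with k-fold cross-validation, the number of model fits is k times the grid size, which is worth keeping in mind before adding more values.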



  • We obtained an F1 score of 0.7 for churn prediction using a default classifier algorithm, and 0.73 after fine-tuning; these numbers are relatively good, though not great, maybe because we only have 225 distinct users in our subset.
  • This project allowed us to familiarize ourselves with Spark; one of the outstanding features of this framework is SQL support, which (even though some limitations exist) is very powerful and enables us to explore data and extract features in a pretty convenient way.
  • The next step would be to see how our results extend to the whole dataset (12 GB); since all dataset manipulation and machine learning steps were written using the Spark framework, we can expect to leverage its distributed cluster-computing capabilities to tackle the big data challenge.


  1. Spend more time on feature engineering and come up with more features. In particular, trends could be addressed: if a user’s behavior over the last 2 weeks deviates from that user’s average, that could indicate that the user is about to quit. Features such as song ratio or relative total activity could then gain importance, and their interpretability would improve.
  2. Build a fully automatic machine learning pipeline, including hyperparameter tuning, model testing, and refinement.
  3. Further optimize the Spark code, possibly dropping down to the low-level API and working directly with RDDs.
  4. Perform a more comprehensive grid search, or even add a random parameter search stage before the grid search.
  5. Identify shifts in data distribution over time, in online scenarios: the statistics of newer data (or of a bigger dataset, such as the full dataset) may deviate from those of the development set. Compensate for that with sampling methods.
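As an illustration of the trend idea in item 1, one possible feature is the ratio between a user’s daily activity over the last 2 weeks and their daily activity over their whole history. The sketch below is plain Python; the function name, the 14-day window, and the sample timestamps are all hypothetical, and this feature is not part of the model described above.

```python
DAY_MS = 24 * 60 * 60 * 1000


def recent_activity_ratio(timestamps_ms, now_ms, window_days=14):
    """Daily activity in the last `window_days` divided by the overall daily rate.

    Returns None when the user's history is not longer than the window.
    """
    if not timestamps_ms:
        return None
    total_days = (now_ms - min(timestamps_ms)) / DAY_MS
    if total_days <= window_days:
        return None
    window_start = now_ms - window_days * DAY_MS
    recent = sum(1 for t in timestamps_ms if t >= window_start)
    recent_rate = recent / window_days
    overall_rate = len(timestamps_ms) / total_days
    return recent_rate / overall_rate


# Hypothetical user: one event per day for 42 days, then one every other day.
events = [d * DAY_MS for d in range(42)] + [d * DAY_MS for d in range(42, 56, 2)]
print(round(recent_activity_ratio(events, now_ms=56 * DAY_MS), 3))  # 0.571
```

A value well below 1 means the user has recently been less active than usual, which is the kind of deviation the first item suggests looking for.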