Sparkify — Predict User Churn using Spark
Predicting customers’ behavior is of foremost importance for businesses, but also a very challenging task. Fortunately, if provided with the right input, data science can assist in anticipating users’ needs (ideally even before users are aware of them themselves 😊).
In this article, we’ll focus on Sparkify, a capstone project for Udacity’s Data Scientist Nanodegree, in which we try to estimate whether users of a fictitious audio streaming platform are likely to unsubscribe, based on their activity logs.
We have access to a JSON log of all actions performed by Sparkify users during a period of two months; our objective is to learn from this dataset what behaviors can allow us to predict whether users will “churn” (i.e. unsubscribe from the service).
To achieve this, we will extract the most relevant features from the log and train a machine learning classifier. In this article, we work with a small subset representing 1% of the full dataset, but we use the Spark framework and keep scalability in mind, so that the same code can be reused on the full 12GB dataset.
1. Data exploration:
- load the data
- deal with incomplete data, identify outliers
- visualize and understand correlations
2. Feature engineering:
- transform the data into a dataset of one sample record per user, aggregating features and performing statistics on them
- decide what features to take or what potential new features to engineer
- define the target response variable for training
3. Modeling and evaluation:
- define objective in mathematical terms
- create proper metrics to track performance
- come up with suitable models and train them
- evaluate various models according to the chosen metrics and decide on the most promising one
- refine the model
Since we are interested not only in recall (ensuring we identify as many users susceptible to churn as possible) but also in precision (ensuring the users we identify are actually likely to churn, since they may, for example, be offered special deals), we propose to use the F1-score to measure the performance of our machine learning classifier.
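As a reminder of how the metric combines the two quantities, here is a minimal plain-Python sketch of precision, recall and F1 for the positive ("churned") class. The function name and the toy labels are made up for illustration; Spark's own evaluators may average the score across classes differently.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Precision: among users flagged as churners, how many really churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: among users who really churned, how many we flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of the two.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 3 real churners, 3 users flagged, 2 of them correctly.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is why it suits a problem where churners are a minority.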
Exploratory Data Analysis
The subset we’ll use for this article is 128MB (about 1% of the full 12GB dataset); it consists of 286,500 records in JSON format, containing the following fields:
- ts: timestamp (in milliseconds; integer)
- auth: represents whether user is logged in (string)
- userId: user’s unique ID (number, stored as string)
- firstName: user’s first name (string)
- lastName: user’s last name (string)
- gender: user’s gender (string)
- registration: time when user registered (in milliseconds; integer)
- level: user’s subscription level (string)
- location: user’s city and state (string)
- sessionId: unique session ID (integer)
- itemInSession: unique ID within session (integer sequence)
- page: page accessed by the user (string)
- artist: song’s artist (string)
- song: song’s title (string)
- length: song’s length (in seconds; float)
- status: HTTP return code (integer)
- method: HTTP method (string)
- userAgent: HTTP user agent (string)
- the small dataset has 286500 rows and 18 columns.
- there are no duplicated rows
- there are 22 distinct types of user-service activity (values of the page field)
- ts and registration contain timestamp values and can be converted to easily readable date format
- beginning of the observation period: 2018-10-01 00:01:57
- end of the observation period: 2018-12-03 01:11:16
- there are 225 unique users, of whom 52 quit. Of all log entries, 8346 have an empty userId.
- even though there were somewhat more male than female users (54% to 46%), female users were more active in the app (male users generated 45% of the logs, female users 55%)
- total logs coming from paid subscription users were 228162, from free users 58338
- almost 80% of all logged activity consisted of switching to the next song
- the two activity types “Cancellation” and “Cancellation Confirmation” always appear together in a cancellation, as two consecutive log entries a short time apart
- total activity per user varies widely: the highest count is 9632 and the lowest is 6, with a mean of 1267 and a 75th percentile of 1878
- there are 2354 distinct sessions in total, and no log entry is missing a sessionId. A sessionId is unique only within a given user, i.e. two different users may have sessions that share the same ID
- correlations between features vary in strength, but none of the chosen features is redundant
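The ts and registration fields mentioned above are Unix timestamps in milliseconds. A minimal sketch of the conversion to a readable date follows; the helper name is made up, and UTC is an assumption, since the article does not state which time zone its dates are in. The sample value was chosen so that it maps to the start of the observation period listed above.

```python
from datetime import datetime, timezone


def ms_to_datetime(ts_ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware datetime (UTC assumed)."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


# Illustrative value corresponding to 2018-10-01 00:01:57 UTC.
sample_ts = 1538352117000
readable = ms_to_datetime(sample_ts).strftime("%Y-%m-%d %H:%M:%S")
```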
We can find out whether users churned by checking if they viewed the ‘Cancellation Confirmation’ page, which is displayed after a successful cancellation performed on the ‘Cancel’ page.
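The labelling rule above can be sketched in plain Python on a handful of log records. The function name and the sample events are hypothetical; on the real dataset the same logic would be expressed in Spark.

```python
CHURN_PAGE = "Cancellation Confirmation"


def label_users(events):
    """Map each non-empty userId to 1 if the user reached the churn page, else 0."""
    churned = {e["userId"] for e in events if e["page"] == CHURN_PAGE}
    # Entries with an empty userId cannot be attributed to a user.
    users = {e["userId"] for e in events if e["userId"]}
    return {user: int(user in churned) for user in users}


# Hypothetical log excerpt, keeping only the two fields the rule needs.
sample_events = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "10", "page": "Cancellation Confirmation"},
    {"userId": "20", "page": "NextSong"},
    {"userId": "", "page": "Home"},
]
labels = label_users(sample_events)
```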
In order to predict whether users are likely to churn, we extract some of the seemingly most relevant features from the log:
- Number of songs per session: nbSongsPerSession
- Number of adverts per session: nbAdvertsPerSession
- Percentage of thumbs up / thumbs down given per song: thumbsUpPerSong / thumbsDownPerSong
- Percentage of songs added to playlist: addToPlaylistPerSong
- Number of friends added per session: addFriendsPerSession
- User gender (we chose to encode 0 for male, 1 for female): gender
- Whether user recently (during the observed timeframe) upgraded or downgraded: upgraded/downgraded
- Whether user connected with a Windows, Mac, or Linux device: windows / mac / Linux
- Length of time during which user has been registered: time registered
Note that we chose to divide all values representing a quantity by the number of sessions or songs: this is because, for churned users, we only have data up to the point when they churn (which on average is at the middle of the observed period); this implies that total quantities for the observed period will be lower on average for churned users vs. non-churned ones. While this can be used as a good indicator of whether users have churned, we have to remember that this information is only available for training/testing, and not when we will want to predict whether users actually will churn, with data up to the same point in time for all users.
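A small plain-Python sketch illustrates the point: two users with the same listening habits, one of whom is observed for half as long, have different totals but the same per-session rate. The helper names, the toy events and the "NextSong" page value are assumptions for illustration.

```python
def songs_per_session(events):
    """Average number of songs played per session for one user's events."""
    sessions = {e["sessionId"] for e in events}
    songs = sum(1 for e in events if e["page"] == "NextSong")
    return songs / len(sessions) if sessions else 0.0


def make_events(n_sessions, songs_each):
    """Build toy events: n_sessions sessions with songs_each songs in each."""
    return [
        {"sessionId": session, "page": "NextSong"}
        for session in range(n_sessions)
        for _ in range(songs_each)
    ]


# The churned user is only observed for half the period (2 sessions instead of 4).
churned_user = make_events(n_sessions=2, songs_each=10)
retained_user = make_events(n_sessions=4, songs_each=10)

total_songs = (len(churned_user), len(retained_user))
rates = (songs_per_session(churned_user), songs_per_session(retained_user))
```

The totals differ only because of the shorter observation window, while the per-session rates are identical, so the normalized feature does not leak the churn label.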
As its name suggests, this section consists mostly of the data visualizations that were crucial during the exploration of the dataset.
Now that we have explored which features are the most relevant, we can write PySpark code allowing us to:
- load JSON dataset
- exclude entries with an empty user ID
- using Spark SQL queries, extract the features discussed above for all users: churned, downgraded, upgraded, nbSongsPerSession, nbAdvertsPerSession, thumbsUpPerSong, thumbsDownPerSong, addToPlaylistPerSong, addFriendsPerSession, windows, mac, Linux, gender, level, time registered
- save these features to a CSV file, so that they can be retrieved when needed without repeating the steps above
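The steps above could look roughly like the following sketch. It is not the project's actual code: the function names, the temporary view name events, the output layout and the page values 'NextSong' and 'Roll Advert' are assumptions, and the query only covers three of the features for brevity. The caller is expected to supply an existing SparkSession.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; the Spark session comes from the caller
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical query producing one row per user with a few of the features.
FEATURE_QUERY = """
SELECT
    userId,
    MAX(CASE WHEN page = 'Cancellation Confirmation' THEN 1 ELSE 0 END) AS churned,
    SUM(CASE WHEN page = 'NextSong' THEN 1 ELSE 0 END)
        / COUNT(DISTINCT sessionId) AS nbSongsPerSession,
    SUM(CASE WHEN page = 'Roll Advert' THEN 1 ELSE 0 END)
        / COUNT(DISTINCT sessionId) AS nbAdvertsPerSession
FROM events
GROUP BY userId
"""


def load_events(spark: "SparkSession", path: str) -> "DataFrame":
    """Load the JSON log and drop entries with an empty user ID."""
    events = spark.read.json(path)
    return events.filter(events.userId != "")


def extract_features(spark: "SparkSession", path: str, out_dir: str) -> "DataFrame":
    """Aggregate the log into per-user features and save them as CSV."""
    load_events(spark, path).createOrReplaceTempView("events")
    features = spark.sql(FEATURE_QUERY)
    features.write.mode("overwrite").csv(out_dir, header=True)
    return features
```

Keeping the aggregation in a single SQL query means the same code runs unchanged on the small subset and on a cluster, since Spark plans the GROUP BY as a distributed operation.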
We split the user set into training (75%) and test (25%) sets, and create a pipeline that assembles the features selected above, scales them, then runs a machine learning model. Our problem is a classification one, so we compared classification algorithms implemented in Spark, including LinearSVC and LogisticRegression.
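A minimal sketch of such a split and pipeline is given below, under several assumptions: the column names are adapted from the feature list in this article (timeRegistered and linux are guessed spellings), all feature columns have already been cast to numeric types, the label column is called churned, and the classifier settings are arbitrary. The Spark imports sit inside the functions so that the module can be loaded on a machine without Spark.

```python
# Assumed numeric feature columns, adapted from the feature list in this article.
FEATURE_COLUMNS = [
    "downgraded", "upgraded", "nbSongsPerSession", "nbAdvertsPerSession",
    "thumbsUpPerSong", "thumbsDownPerSong", "addToPlaylistPerSong",
    "addFriendsPerSession", "windows", "mac", "linux", "gender", "timeRegistered",
]
LABEL_COLUMN = "churned"  # assumed name of the target column


def build_pipeline(feature_columns=FEATURE_COLUMNS, label_column=LABEL_COLUMN):
    """Assemble features into a vector, scale them, then fit a classifier."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    assembler = VectorAssembler(inputCols=list(feature_columns), outputCol="rawFeatures")
    scaler = StandardScaler(inputCol="rawFeatures", outputCol="features",
                            withMean=True, withStd=True)
    classifier = LogisticRegression(featuresCol="features", labelCol=label_column,
                                    maxIter=10)
    return Pipeline(stages=[assembler, scaler, classifier])


def train_and_score(features_df, seed=42):
    """Split users 75/25, fit the pipeline on the training set, return test F1."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train, test = features_df.randomSplit([0.75, 0.25], seed=seed)
    model = build_pipeline().fit(train)
    evaluator = MulticlassClassificationEvaluator(
        labelCol=LABEL_COLUMN, predictionCol="prediction", metricName="f1")
    return evaluator.evaluate(model.transform(test))
```

Putting the scaler inside the pipeline ensures its statistics are computed on the training set only and then reused on the test set.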
Although LinearSVC took the longest to train, it achieved the highest F1-score (0.702). LogisticRegression has a moderate training time and F1-score, so we will try to tune the LogisticRegression model.
Let’s try to optimize our LogisticRegression model by adjusting its parameters: this is done by performing a grid search over different values of elasticNetParam and regParam.
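from pyspark.ml.tuning import ParamGridBuilder
# Grid over the elastic-net mixing ratio and the regularization strength
paramGrid = ParamGridBuilder() \
    .addGrid(clf_LR.elasticNetParam, [0.1, 0.5, 1.0]) \
    .addGrid(clf_LR.regParam, [0.01, 0.05, 0.1]) \
    .build()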
As we can see above, logistic regression reaches nearly the same F1-score as LinearSVC while taking 82% less training time, so it is the model we keep for the rest of this project. Tuning its parameters improved the F1-score from 0.692 to 0.728.
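To make the size of the search concrete: the grid described above is the Cartesian product of the candidate values for the two parameters. The plain-Python sketch below enumerates it without using the Spark API; the fold count is a hypothetical value, since the article does not state how the candidates were validated.

```python
from itertools import product

elastic_net_values = [0.1, 0.5, 1.0]
reg_values = [0.01, 0.05, 0.1]

# Every (elasticNetParam, regParam) pair is one candidate model.
grid = [
    {"elasticNetParam": elastic_net, "regParam": reg}
    for elastic_net, reg in product(elastic_net_values, reg_values)
]

num_folds = 3  # hypothetical k for k-fold cross-validation
total_fits = len(grid) * num_folds
```

The number of models to train grows multiplicatively with each parameter added to the grid, which matters once the full dataset is used.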
- This capstone project is a great exercise that lets us put into practice several data science skills (data analysis, cleansing, feature extraction, machine learning pipeline creation, model evaluation, and fine-tuning…) to solve a problem close to those regularly encountered by customer-facing businesses.
- We obtained an F1-score of 0.7 for churn prediction using a default classifier, and 0.73 after fine-tuning; these numbers are reasonably good, though not great, perhaps because we only have 225 distinct users in our subset.
- This project allowed us to familiarize ourselves with Spark; one of the outstanding features of this framework is SQL support, which (even though some limitations exist) is very powerful and enables us to explore data and extract features in a pretty convenient way.
- The next step would be to see how our results extend to the whole dataset (12GB); since all dataset manipulation and machine learning steps were written using the Spark framework, we can expect to leverage its distributed cluster-computing capabilities to tackle the big data challenge.
As always, there is space for further improvements. To name a few examples:
- Spend more time on feature engineering and come up with more features. In particular, aspects such as trends could be addressed: if a user’s behavior over the last 2 weeks deviates from that user’s average, this could indicate that the user is about to quit. Features like song ratio or relative total activity could also gain in importance this way, and their interpretability would improve.
- Build a fully automatic machine learning pipeline, including hyperparameter tuning, model testing, and refinement.
- Further optimize the Spark code, possibly by dropping down to the low-level API and working directly with RDDs.
- Perform more comprehensive grid search, or even add a random parameter search stage before grid search
- Identify shifts in data distribution over time, in online scenarios. This would mean that statistics of the newer data (or bigger dataset, like in the example of the full dataset) would deviate from the development set. Compensate for that with sampling methods.
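The trend idea from the first item in the list above could be sketched as follows in plain Python. This is only one possible definition: the function name, the 14-day window and the ratio itself are assumptions, not something the project implemented.

```python
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds per day, matching the ts field


def recent_activity_ratio(timestamps_ms, window_days=14):
    """Ratio of a user's event rate in the last window to their overall event rate.

    Values well below 1 suggest that activity is fading; 1.0 is returned when
    there is too little history to compare against.
    """
    if len(timestamps_ms) < 2:
        return 1.0
    first, last = min(timestamps_ms), max(timestamps_ms)
    span_ms = last - first
    window_ms = window_days * DAY_MS
    if span_ms <= window_ms:
        return 1.0
    recent = sum(1 for t in timestamps_ms if t > last - window_ms)
    recent_rate = recent / window_ms
    overall_rate = len(timestamps_ms) / span_ms
    return recent_rate / overall_rate


# Toy users: one event per day for 42 days, versus 28 active days then one late visit.
steady_user = [day * DAY_MS for day in range(42)]
fading_user = [day * DAY_MS for day in range(28)] + [41 * DAY_MS]

steady_ratio = recent_activity_ratio(steady_user)
fading_ratio = recent_activity_ratio(fading_user)
```

Because the ratio compares each user with their own history, it stays meaningful for both light and heavy users.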