Introduction

From audio cassettes and vinyl records to iPods and iTunes, media products have shifted from physical goods to bits, and the way people explore and perceive new media has changed with them.

The amount of data people consume every day makes it a necessity for digital companies to fine-tune their offerings for each user, creating a hyper-personalized environment that keeps users engaged.

Spotify, an online music streaming platform, puts user personalization at the forefront through recommended playlists. Rather than just listening to a full-length album, users can control their experience by curating playlists that reflect their emotions, activities, and memories. Spotify can enhance this personalized experience by assisting users in their music discovery. Discovering new tracks is made easier by offering recommendations relevant to the user.

What sets Spotify apart from other music streaming platforms like Amazon Music, Pandora, Apple Music, and YouTube Music is precisely the spot-on recommendations that it offers users for their playlists. Through this project, we aim to see how any platform can improve its user personalization and what that implies for the future of targeted marketing.

Objective

For this project, we will build a recommendation system that recommends 20 songs a user might add to their playlist, based on the first song the user added. Platforms can leverage this recommendation system to further refine the precision of their recommendations and ensure that customers stay loyal to their platform.

To achieve this, we will analyze the data to find trends in user listening habits. We will also look at who the top artists are, what the top genres are, and what sets them apart from the rest. These analyses will not only help us optimize our recommendation system but can also help all sides of the platform in the following ways:

Dataset

Our analysis was done on an existing dataset from a Spotify case challenge, the Million Playlist Dataset Challenge, hosted by AIcrowd and distributed in JSON format. The dataset consists of 1 million Spotify playlists sampled from over 4 billion public playlists on Spotify. It contains over 2 million unique tracks by nearly 300,000 artists and represents the largest public dataset of music playlists in the world. It includes public playlists created by US Spotify users between January 2010 and November 2017. Playlists are sampled with simple randomization, manually filtered for playlist quality and to remove offensive content, and have some dithering and fictitious tracks added to them. As such, the dataset is not representative of the true distribution of playlists on the Spotify platform and must not be interpreted as such in any research or analysis performed on it. Spotify has anonymized the data to protect user privacy.

However, for the purposes of this project, our analysis was done only on the first 1000 playlists and their respective songs. The dataset contains the following attributes:

As we wanted to dig deeper to find out what separates each track from the rest and to optimize the recommendation system, we decided to pull data for each individual track by making API calls to the Spotify Data Catalogue using REST principles. This dataset was acquired in JSON format and contains the following attributes:

Source: Spotify for Developers

Importing Libraries For Data Processing

Apart from standard libraries like pandas, numpy, datetime, matplotlib, seaborn, and plotly, we will be using the following additional resources for processing the data:

Data Processing

Before analyzing the data, we will prepare it so that it can answer our questions of interest. The data processing is divided into 4 stages:

We will explain each stage in detail as we go.

Data Extraction - Playlist Data

The Million Playlist Dataset consists of playlists and their respective tracks in JSON format. We will load this data into a data frame, using a unique identifier for each playlist as its index. Since the tracks are stored as lists within each playlist, we will extract them later so that we can display the track information separately before digging deeper into the track metadata.
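
The loading step can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: it assumes each slice is a JSON object with a top-level "playlists" list whose entries carry a "pid" field, and the sample data below is hand-made. The helper name load_playlists is hypothetical.

```python
import json


def load_playlists(raw_json: str) -> dict:
    """Parse one playlist slice and index the playlists by their 'pid'."""
    data = json.loads(raw_json)
    # Assumption: playlists live under a top-level "playlists" key.
    return {playlist["pid"]: playlist for playlist in data["playlists"]}


# Hand-made sample standing in for a real slice file (not real data).
sample = json.dumps({
    "playlists": [
        {"pid": 0, "name": "Road trip", "modified_at": 1493424000,
         "tracks": [{"track_uri": "spotify:track:aaa",
                     "track_name": "Song A",
                     "artist_name": "Artist X"}]},
        {"pid": 1, "name": "Focus", "modified_at": 1506556800, "tracks": []},
    ]
})

playlists = load_playlists(sample)
print(sorted(playlists))
```

A dictionary keyed by pid plays the same role here as a data frame indexed by playlist id; the tracks stay nested until they are extracted in a later step.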

Data Wrangling - Playlist Data

To get the best results, we will transform and map our data from its original form into a format that is more appropriate and valuable for our analysis. Specifically, we will convert each playlist's modification timestamp from UTC format into datetime format and extract only the year.
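
As a rough sketch of this conversion, the snippet below assumes the modification timestamp is stored as seconds since the Unix epoch (UTC). The function name modified_year is hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone


def modified_year(epoch_seconds: int) -> int:
    """Return the calendar year (UTC) of a Unix timestamp given in seconds."""
    # Assumption: the timestamp counts seconds, not milliseconds.
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).year


print(modified_year(1493424000))
```

Passing an explicit UTC timezone keeps the extracted year independent of the machine's local time zone, which matters for timestamps close to New Year.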

Data Cleaning - Playlist Data

After extracting and munging the data, we realize that it needs cleaning, as it includes data points that are outside the scope of our analysis. We will clean the data by dropping redundant data points and retaining those that will be useful for our analysis.

Extracting Songs from Playlists

To dig deeper into the tracks within the playlists, we will iterate through each playlist to get the track data, which is stored as dictionaries within the playlist. We will save this track data into a new dataframe.
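
One way to picture this step is a flattening loop that turns nested track lists into one row per track, tagged with the playlist id it came from. The sketch below is an assumption-laden illustration: the field names ("tracks", "track_uri") and the helper extract_tracks are placeholders, and the sample playlists are hand-made.

```python
def extract_tracks(playlists: dict) -> list:
    """Flatten {pid: playlist} into a list of track rows carrying their pid."""
    rows = []
    for pid, playlist in playlists.items():
        for track in playlist.get("tracks", []):
            row = dict(track)   # copy so the source playlist stays untouched
            row["pid"] = pid    # keep the link back to the playlist
            rows.append(row)
    return rows


# Hand-made sample playlists (not real data).
sample_playlists = {
    0: {"name": "Road trip",
        "tracks": [{"track_uri": "spotify:track:aaa"},
                   {"track_uri": "spotify:track:bbb"}]},
    1: {"name": "Focus", "tracks": [{"track_uri": "spotify:track:aaa"}]},
}

track_rows = extract_tracks(sample_playlists)
print(len(track_rows))
```

A list of flat dictionaries like this can be handed directly to a dataframe constructor, with the pid column serving as the link back to the playlist table.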

Data Extraction - Track Metadata

The data from the Million Playlist Dataset Challenge is not sufficient to answer our research questions. We need additional metrics that quantify each individual song. We will extract these metrics by making API calls to Spotify following REST principles.

To deal with the rate limit (a strategy Spotify uses to restrict network traffic), we will define a counter variable and load all the playlist data into a new variable. We will then iterate through all the playlists and their respective tracks, calling the API for each track, appending the metadata to a list, and incrementing the counter. Now, even if we exceed the rate limit, we will not have to rerun the code from scratch, because the counter stores the last completed iteration. We will also be able to update the access token before it expires.
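
The counter idea can be sketched without touching the real API. In the illustration below, RateLimited, fetch_all, and fake_fetch are hypothetical names: fake_fetch stands in for the actual API call and simulates hitting the rate limit on its third call. Nothing here reflects Spotify's real client or error types.

```python
class RateLimited(Exception):
    """Hypothetical signal that the API asked us to slow down."""


def fetch_all(track_ids, fetch_one, results, start=0):
    """Fetch metadata from index `start`; return the index to resume from."""
    counter = start
    try:
        while counter < len(track_ids):
            results.append(fetch_one(track_ids[counter]))
            counter += 1  # only advance after a successful call
    except RateLimited:
        pass  # stop early; `counter` marks the first track not yet fetched
    return counter


calls = {"n": 0}


def fake_fetch(track_id):
    """Stand-in for the API call; fails once, on the third request."""
    calls["n"] += 1
    if calls["n"] == 3:
        raise RateLimited()
    return {"id": track_id, "energy": 0.5}  # made-up metadata


ids = ["a", "b", "c", "d"]
results = []
resume_at = fetch_all(ids, fake_fetch, results)             # stops before "c"
resume_at = fetch_all(ids, fake_fetch, results, resume_at)  # picks up at "c"
print(resume_at, [r["id"] for r in results])
```

Because the counter only advances after a successful call, a second run continues from the first unfetched track instead of repeating earlier requests. In practice, the pause between the two runs is also where an access token could be refreshed.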

By doing this, we will be saving a lot of time extracting the metadata.

Data Merging

Now that we have the metadata for each track, we will be dumping it into a file so that we don’t have to make API calls every time we execute the code. Firstly, we will check if any file named track_data already exists to avoid overwriting it, and if not, we will create the file, dumping the track metadata. After that, we will load the track metadata into a dataframe and create a link between track data and playlist data by setting pid (playlist id) as the index.
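
The check-before-write logic might look like the sketch below. The file name track_data comes from the text; the .json extension, the temporary directory, and the helper names save_if_missing and load_records are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_if_missing(path, records):
    """Write records to `path` only if the file does not already exist."""
    if os.path.exists(path):
        return False  # keep the existing cache untouched
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return True


def load_records(path):
    """Read the cached records back from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Use a throwaway directory so the example leaves no files behind in the project.
cache_path = os.path.join(tempfile.mkdtemp(), "track_data.json")
first = save_if_missing(cache_path, [{"id": "a", "pid": 0}])
second = save_if_missing(cache_path, [{"id": "overwritten"}])
print(first, second, load_records(cache_path))
```

The second call returns False and leaves the first dump in place, which is the behavior that avoids repeating the API calls on every run.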

Data Analysis

Now that we have prepared our data, we are all set to start our analysis. To build our recommendation system, we will look for trends in users' listening habits and for factors that could quantify each song. These analyses are meant to give us a thorough knowledge of the music industry, which could further help all sides involved.

Understanding Track Attributes

We will start our analysis by understanding the attributes that quantify each individual track and separate it from the rest. Our aim is to see whether these attributes depend on each other, and whether any particular attribute is related to a track's popularity.

We believe that the best way to relate all the attributes is to compute the pairwise correlations between them and visualize the result as a heatmap.
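
The numbers behind such a heatmap are pairwise correlation coefficients. The sketch below computes Pearson correlations in plain Python; choosing Pearson is an assumption, since the text does not name the coefficient, and the feature values are made-up toy numbers rather than results from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a constant column (zero spread) would divide by zero here.
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Toy values chosen to be perfectly linear; not taken from the dataset.
features = {
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-12.0, -9.0, -6.0, -3.0],
    "acousticness": [0.9, 0.7, 0.5, 0.3],
}

matrix = correlation_matrix(features)
print(round(matrix["energy"]["loudness"], 3),
      round(matrix["energy"]["acousticness"], 3))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one; a heatmap simply colors this matrix cell by cell.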

Inference

From the above heatmap, we can observe the following:

What is surprising is that not a single attribute strongly correlates with popularity. Maybe one attribute alone isn't enough to affect the popularity of a song, and multiple attributes act together. We will dig deeper into this next.

Attributes Affecting Artist Popularity

From the above analysis, we found that no single track attribute has a major effect on the popularity of a track. So we will now look for common attributes, if any, that make an artist popular. We will compare the attributes of the 5 most popular and the 5 least popular artists in the form of a radar chart.

We will achieve this by following the steps below:

Inference

We can observe that people tend to love artists whose songs have high energy and danceability. This may hint that people love the party genre. However, the energy and danceability of all songs, regardless of artist popularity, are comparable, which could mean that these attributes are a common trait across the music industry. Liveness and speechiness do not appear to influence the audience's preferences. One major difference between the most popular and least popular artists is valence, which describes how positive or happy a song sounds. We can observe that people tend to love songs that make them feel positive and happy. These observations can act as a formula for new musicians trying to build their name in the music industry: they should consider making songs that are both positive and energetic.

Evolution of Sound over the Years

Now that we have a good understanding of the song attributes that people tend to prefer, we would like to see whether these attributes result in a specific sound (genre) by analyzing the popularity of different genres over the years. We will measure the popularity of a genre as the average popularity of its tracks, and then look at how that popularity evolves over the years. For this analysis, we will consider the 3 most popular and the 3 least popular genres.
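
The averaging described above amounts to a group-by on genre and year. The sketch below shows the idea in plain Python; the field names "genre", "year", and "popularity", the helper name, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def mean_popularity_by_genre_year(rows):
    """Average track popularity for every (genre, year) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        key = (row["genre"], row["year"])
        totals[key][0] += row["popularity"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up rows for illustration only.
rows = [
    {"genre": "pop", "year": 2013, "popularity": 60},
    {"genre": "pop", "year": 2013, "popularity": 70},
    {"genre": "pop", "year": 2014, "popularity": 80},
    {"genre": "indie folk", "year": 2013, "popularity": 40},
]

averages = mean_popularity_by_genre_year(rows)
print(averages[("pop", 2013)])
```

Plotting one line per genre, with year on the x-axis and these averages on the y-axis, gives the kind of line chart discussed in the inference.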

Inference

From the above line chart, we can observe that the sound hasn't changed much over the years. This also tells us that users remain true to the genres they listen to. Maybe users are not able to explore new genres and discover new music. Streaming platforms should make music discovery easier so that users can develop a more diverse taste in music.

Genres like Indie Rock, Stomp and Holler, and Indie Folk are relatively niche and have declined rapidly in popularity over the years. However, the Pop genre has seen an incredible amount of growth starting from the year 2013. This could be linked to the prominence of pop artists such as Taylor Swift, a Billboard top pop artist, around 2014.

Billboard Top 5 Artists

Taking into consideration the song attributes and artist popularity, we decided to look at how this fits into the classification and rankings of artists. Using Billboard's Top Artists of the decade (2010 - 2020), we filtered the dataframe of our tracks to analyze how these artists compare in popularity.

The Top Artists of the Decade were:

  1. Drake
  2. Taylor Swift
  3. Bruno Mars
  4. Rihanna
  5. Ed Sheeran

Using Seaborn's violinplot, we can visualize how the artists compare in track popularity.

Source: Billboard, Artist of the Decade

Inference

The top artist of the decade, according to Billboard, was Drake. At first glance, we can observe that Drake's tracks are less popular than those of some of the other top 5 artists. We can see that Bruno Mars and Taylor Swift had high track popularity and few unpopular tracks. Rihanna and Drake had more tracks that can be considered unpopular.

Based on these results, we can only assume that institutions such as Billboard use metrics other than song popularity. External metrics, such as album sales or the number of times a track has appeared on the charts, could be considered. In recent years, Drake has become more prominent in the music industry, which would imply that relevance is also considered. However, from a purely musical standpoint, we would say that the top artist was actually Bruno Mars.

Recommendation System

We now have a clear understanding of how to differentiate each song from the rest by quantifying its attributes, and we also know what sound users are looking for and how that sound changes every year. We have also analyzed the attributes of artists who are on top of their game and are preferred by users, based on the number of times users have added them to their playlists. These analyses give us enough knowledge to build a recommendation system to personalize the user experience.

To achieve this, we will first remove all duplicate tracks from the dataset and convert the genres from strings to arrays so that we can analyze them. We will then match the genre of the input track against tracks with a similar genre and compute the cosine similarity on track attributes, which will give us a list of similar-sounding tracks of the same genre.
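
The preparation steps can be sketched as follows. This is only an illustration under stated assumptions: genres are assumed to be stored as strings that look like Python lists (for example "['pop', 'dance pop']"), duplicates are identified by a "track_uri" field, and a candidate counts as "similar genre" when it shares at least one genre with the input track. All helper names and sample rows are hypothetical.

```python
import ast


def parse_genres(text):
    """Turn a string such as "['pop', 'dance pop']" into a Python list."""
    return list(ast.literal_eval(text))


def dedupe_tracks(rows):
    """Keep the first occurrence of each track, identified by its URI."""
    seen, unique = set(), []
    for row in rows:
        if row["track_uri"] not in seen:
            seen.add(row["track_uri"])
            unique.append(row)
    return unique


def same_genre_candidates(seed, rows):
    """Tracks other than the seed that share at least one genre with it."""
    seed_genres = set(parse_genres(seed["genres"]))
    return [row for row in rows
            if row["track_uri"] != seed["track_uri"]
            and seed_genres & set(parse_genres(row["genres"]))]


# Made-up rows for illustration only.
rows = [
    {"track_uri": "t1", "genres": "['pop', 'dance pop']"},
    {"track_uri": "t2", "genres": "['dance pop']"},
    {"track_uri": "t2", "genres": "['dance pop']"},
    {"track_uri": "t3", "genres": "['indie folk']"},
]

unique = dedupe_tracks(rows)
candidates = same_genre_candidates(unique[0], unique)
print([c["track_uri"] for c in candidates])
```

Using ast.literal_eval rather than eval restricts parsing to plain Python literals, which is the safer choice when the strings come from a file.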

Cosine similarity measures the similarity between two non-zero vectors. Our two non-zero vectors will be the songs' metadata (input song and target song). We find this similarity by calculating the cosine of the angle between the two vectors.

Mathematically,

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
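
The formula translates almost directly into code. The sketch below computes the cosine similarity of two attribute vectors and uses it to rank candidate tracks against an input track. The attribute values and track names are made up, and the helper names cosine_similarity and top_k are chosen only for illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(seed, candidates, k=20):
    """Names of the k candidates most similar to the seed vector."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine_similarity(seed, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up attribute vectors, e.g. (energy, danceability, valence).
seed = [0.8, 0.7, 0.6]
candidates = {
    "far": [0.1, 0.2, 0.9],
    "close": [0.79, 0.72, 0.58],
}

print(top_k(seed, candidates, k=1))
```

Because cosine similarity depends only on the direction of the vectors, attributes on very different scales (for example tempo versus energy) would usually need to be normalized first so that no single attribute dominates the score.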

We will also use Hamming similarity to make our recommendations more accurate, and we will handle edge cases.

Scope of Improvement

We can further optimize our recommendation system based on the user listening history and not just the playlist data. We can also optimize this recommendation system by using similar playlists of different users.

Conclusion

To wrap up, we analyzed how tracks can be quantified using different metrics. Through this, we found that these metrics are not individually related to the popularity of a song. However, our second analysis shows that popular artists tend to have specific attributes in their songs. These attributes are:

After analyzing the song attributes, we analyzed the sound (genre) preferred by users over the years. We can see that the popularity of each sound remains more or less the same over the years. We can infer that users prefer only some genres and want to stick to them. Another possible reason could be a lack of diverse in-house curated playlists from Spotify. The platform should not only focus on users' listening habits but also promote niche and new genres to diversify users' music taste.

We then analyzed the data for top Billboard artists to see whether the music industry relies completely on Spotify streams to judge the performance of an artist. From the analysis, we can infer that even though everything is going digital, popularity as defined by Spotify is not the only factor that determines an artist's popularity. Billboard may be considering other factors, such as tour ticket sales, physical distribution of music, and streams from other platforms, to calculate the popularity of an artist.

Having gained enough knowledge about the music industry and track attributes, we built our own recommendation system to personalize the user experience. This recommendation system can be used by any digital platform that wishes to provide a hyper-personalized environment to its users.