Playlist for Your Life

May 28, 2024

In the era of digital streaming, music recommendation systems have become an essential part of our listening experience. These systems leverage sophisticated algorithms and data analysis techniques to suggest songs that resonate with our tastes and preferences. This article explores the development of a music recommendation system using Python and machine learning, focusing on text preprocessing, tokenization, and similarity measurement.

The foundation of any recommendation system is the dataset. For this project, we used the “spotify_millsongdata.csv” dataset, which contains detailed information about various songs, including their lyrics, titles, and other metadata. The initial steps involve loading this dataset and performing some basic exploratory data analysis (EDA) to understand its structure and content.

import pandas as pd
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

music_data = pd.read_csv("spotify_millsongdata.csv")

display(music_data.head())
display(music_data.tail())
display(music_data.shape)
display(music_data.isna().sum())

This snippet reads the dataset and displays its first and last few records, along with its dimensions and missing values. EDA helps us identify any anomalies or missing data that might need handling before we proceed further.
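If these checks turn up missing lyrics or repeated tracks, a cleanup step along the following lines could come next. This is only a sketch on a tiny made-up frame: the column names `artist`, `song`, and `text` mirror those used elsewhere in this article, while the helper name `clean_songs` and the toy rows are hypothetical.

```python
import pandas as pd

def clean_songs(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without lyrics and repeated (artist, song) pairs."""
    cleaned = df.dropna(subset=["text"])
    cleaned = cleaned.drop_duplicates(subset=["artist", "song"])
    return cleaned.reset_index(drop=True)

# Toy data: one duplicated track and one track with missing lyrics.
toy = pd.DataFrame({
    "artist": ["A", "A", "B", "C"],
    "song": ["One", "One", "Two", "Three"],
    "text": ["la la", "la la", None, "na na"],
})
cleaned_toy = clean_songs(toy)
print(cleaned_toy)
```

Removing duplicates early also keeps later title lookups unambiguous, since the recommender finds a song by matching its title.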

Given the large size of the dataset, we sample 20,000 records for our analysis to ensure computational efficiency. Additionally, we drop any unnecessary columns, such as links to the songs, which do not contribute to our recommendation logic.

music_data = music_data.sample(20000).drop('link', axis=1).reset_index(drop=True)
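Sampling without a seed gives a different subset on every run, so results are hard to reproduce. A possible variant, assuming reproducibility matters for your use case, fixes `random_state` and caps the sample size at the number of available rows. The helper name `sample_songs` and the seed value are illustrative choices, not part of the original project.

```python
import pandas as pd

def sample_songs(df: pd.DataFrame, n: int = 20000, seed: int = 42) -> pd.DataFrame:
    """Take a reproducible sample of at most n rows and drop the 'link' column."""
    n = min(n, len(df))  # sampling more rows than exist would raise a ValueError
    sampled = df.sample(n=n, random_state=seed)
    # errors="ignore" keeps this safe if 'link' was already removed.
    return sampled.drop(columns=["link"], errors="ignore").reset_index(drop=True)

toy = pd.DataFrame({"song": list("abcdef"), "link": ["/x"] * 6})
first = sample_songs(toy, n=4)
second = sample_songs(toy, n=4)
print(first.equals(second))
```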

Next, we focus on text preprocessing. Lyrics, stored in the ‘text’ column, need to be cleaned and standardized. This involves converting all text to lowercase and removing unwanted characters like newline characters.

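# Lowercase, then replace punctuation and newlines with spaces (regex=True is required for pattern matching).
music_data['text'] = music_data['text'].str.lower().replace(r'[^\w\s]', ' ', regex=True).replace(r'\n', ' ', regex=True)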
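To see what this kind of cleanup does to a single lyric, the same idea can be tried with the standard `re` module. This is a sketch under the assumption that the goal is lowercase text with punctuation and line breaks replaced by single spaces; the function name `clean_lyrics` and the sample string are made up.

```python
import re

def clean_lyrics(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse all whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # anything that is not a word char or whitespace
    text = re.sub(r"\s+", " ", text)      # newlines, tabs, and repeated spaces become one space
    return text.strip()

cleaned = clean_lyrics("Hello, World!\r\nIt's  a new day...\n")
print(cleaned)  # hello world it s a new day
```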

Tokenization is the process of splitting text into individual words or tokens. Following tokenization, we apply stemming to reduce words to their root form. Stemming maps variants such as “loving” and “loved” to the same stem, which shrinks the vocabulary and lets songs that use different forms of a word still register as similar.

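# word_tokenize needs NLTK's Punkt tokenizer data; download it once if it is missing
# (newer NLTK releases may ask for 'punkt_tab' instead).
nltk.download('punkt')

stemmer = PorterStemmer()

def tokenization(txt):
    tokens = nltk.word_tokenize(txt)              # split lyrics into word tokens
    stemming = [stemmer.stem(w) for w in tokens]  # reduce each token to its stem
    return " ".join(stemming)

music_data['text'] = music_data['text'].apply(tokenization)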

Here, we define a function tokenization that tokenizes and stems each song’s lyrics, then apply this function to the entire ‘text’ column.

To measure the similarity between songs, we transform the cleaned lyrics into a numerical format using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer. This technique evaluates how important a word is to a document in a collection of documents, providing a more meaningful numerical representation of the lyrics.

tfidvector = TfidfVectorizer(analyzer='word',stop_words='english')
matrix = tfidvector.fit_transform(music_data['text'])
similarity = cosine_similarity(matrix)
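To make the numbers behind these calls less opaque, here is a small pure-Python sketch of TF-IDF weighting and cosine similarity. It follows the weighting that scikit-learn documents as the `TfidfVectorizer` default (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row), but it splits on whitespace and skips stop-word removal, so it will not reproduce the library's output exactly. The function names and the three toy "lyrics" are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace-split docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["love me tender love", "tender is the night", "rock around the clock"]
vocab, vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # the two docs share the word "tender"
sim_02 = cosine(vectors[0], vectors[2])  # the two docs share no words
print(round(sim_01, 3), sim_02)
```

Documents that share no terms get a similarity of exactly zero, which is why lyrics with overlapping vocabulary end up ranked as neighbours.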

The core of our recommendation system is a function that, given a song, finds and returns a list of similar songs based on the cosine similarity scores.

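def recommendation(song_name):
    # Row index of the first song whose title matches exactly (IndexError if there is none).
    idx = music_data[music_data['song'] == song_name].index[0]
    # Pair every song index with its similarity to the query, highest first.
    distances = sorted(enumerate(similarity[idx]), reverse=True, key=lambda x: x[1])

    # Skip position 0 (normally the query itself) and keep the next 20 titles.
    songs = []
    for m_id in distances[1:21]:
        songs.append(music_data.iloc[m_id[0]].song)

    return songs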

This function first identifies the index of the given song in the dataset. It then sorts every song by its similarity to the given song, skips the first entry (normally the song itself, whose similarity to itself is 1), and returns the next 20 titles.
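Skipping the first sorted entry assumes the query song always ranks first, which may not hold if another row has identical lyrics and ties with it. One alternative, sketched here on a hand-written similarity matrix, is to exclude the query by its index instead. The helper name `top_k_similar` and the toy titles and scores are hypothetical.

```python
def top_k_similar(similarity_row, query_index, k=20):
    """Indices of the k highest-scoring items, excluding the query itself."""
    candidates = [
        (i, score) for i, score in enumerate(similarity_row) if i != query_index
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

titles = ["Song A", "Song B", "Song C", "Song D"]
toy_similarity = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]
picks = [titles[i] for i in top_k_similar(toy_similarity[0], query_index=0, k=2)]
print(picks)  # ['Song C', 'Song B']
```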

To ensure our recommendation system can be used in the future without the need to recompute the similarity matrix, we save both the similarity matrix and the cleaned dataset using the pickle module.

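import pickle
# Context managers make sure both files are flushed and closed after writing.
with open('rec_spotify.pkl', 'wb') as f:
    pickle.dump(similarity, f)
with open('music_data.pkl', 'wb') as f:
    pickle.dump(music_data, f)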
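The save-and-restore round trip can be checked on a small object before committing to the full matrix. The sketch below uses a temporary directory and made-up helper names (`save_pickle`, `load_pickle`, `dense_matrix_bytes`). The size estimate assumes a dense matrix of 8-byte floats, which for 20,000 songs comes to roughly 3.2 GB, so the real pickle file can be large.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path, closing the file even if pickling fails."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Read a pickled object back; only use this on files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Approximate memory for an n_items x n_items dense matrix."""
    return n_items * n_items * bytes_per_value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_similarity.pkl")
    toy = [[1.0, 0.5], [0.5, 1.0]]
    save_pickle(toy, path)
    restored = load_pickle(path)

print(restored == toy)
print(dense_matrix_bytes(20000) / 1e9, "GB")
```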

With the similarity matrix and dataset saved, we can deploy the recommender so that users receive song suggestions based on a track they like. Here’s how you can deploy such a system using Streamlit and the Spotify API.

First, import necessary libraries and initialize the Spotify client with your credentials.

import pickle
import streamlit as st
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"

client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
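Hard-coded credentials are easy to leak when the script is shared or committed. One common alternative, shown here as a sketch, is to read them from environment variables. The names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are, to my knowledge, the ones Spotipy itself looks for; the helper `load_spotify_credentials` and the demo values are hypothetical.

```python
import os

def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before starting the app."
        )
    return client_id, client_secret

# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {"SPOTIPY_CLIENT_ID": "demo-id", "SPOTIPY_CLIENT_SECRET": "demo-secret"}
creds = load_spotify_credentials(demo_env)
print(creds[0])
```

The returned pair could then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` in place of the literal strings.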

Define functions to get album cover URLs and recommend songs.

def get_song_album_cover_url(song_name, artist_name):
    search_query = f"track:{song_name} artist:{artist_name}"
    results = sp.search(q=search_query, type="track")
    if results["tracks"]["items"]:
        return results["tracks"]["items"][0]["album"]["images"][0]["url"]
    return "https://i.postimg.cc/0QNxYz4V/social.png"

def recommend(song):
    index = music[music['song'] == song].index[0]
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
    recommended_music_names = [music.iloc[i[0]].song for i in distances[1:11]]
    recommended_music_posters = [get_song_album_cover_url(music.iloc[i[0]].song, music.iloc[i[0]].artist) for i in distances[1:11]]
    return recommended_music_names, recommended_music_posters
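The cover lookup indexes straight into `images[0]`, which assumes every matched album carries at least one image. A more defensive variant, sketched here on plain dictionaries shaped like the search response used in this article, might look as follows. The helper name `extract_cover_url` and the sample payloads are made up for illustration.

```python
FALLBACK_COVER = "https://i.postimg.cc/0QNxYz4V/social.png"

def extract_cover_url(results, fallback=FALLBACK_COVER):
    """Return the first available album image URL, or the fallback image."""
    items = (results or {}).get("tracks", {}).get("items") or []
    for item in items:
        images = (item.get("album") or {}).get("images") or []
        if images:
            return images[0]["url"]
    return fallback

# Hypothetical payloads: one with a cover, one whose album has no images.
with_cover = {"tracks": {"items": [
    {"album": {"images": [{"url": "https://example.com/cover.png"}]}}
]}}
no_images = {"tracks": {"items": [{"album": {"images": []}}]}}

print(extract_cover_url(with_cover))
print(extract_cover_url(no_images))
```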

Design the UI using Streamlit.

st.markdown('<div class="header"><img src="https://storage.googleapis.com/pr-newsroom-wp/1/2018/11/Spotify_Logo_CMYK_Green.png" width="200" alt="Spotify Logo"><h1 class="title">🎵 Playlist for Yours 🎵</h1></div>', unsafe_allow_html=True)
st.markdown("Bored of the same old songs? Pick a song you like and we'll take care of the rest.")

Load the saved similarity matrix and dataset, then build the song selector and the results view.

with open('music_data.pkl', 'rb') as f:
    music = pickle.load(f)
with open('rec_spotify.pkl', 'rb') as f:
    similarity = pickle.load(f)

music_list = music['song'].values
selected_song = st.selectbox("Type or select a song from the dropdown", music_list)

if st.button('Show Recommendation'):
    try:
        recommended_music_names, recommended_music_posters = recommend(selected_song)
        st.markdown('<div class="recommendations">', unsafe_allow_html=True)
        for name, poster in zip(recommended_music_names, recommended_music_posters):
            st.markdown(f'''
                <div class="recommendation">
                    <img src="{poster}" width="100%">
                    <p style='color:#1DB954;'>{name}</p>
                </div>
            ''', unsafe_allow_html=True)
        st.markdown('</div>', unsafe_allow_html=True)
    except Exception as e:
        st.error(f"An error occurred: {e}")
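Because the cards are injected with `unsafe_allow_html=True`, a song title containing characters such as `&` or `<` would go into the page unescaped. A small helper that escapes the values first is one way to guard against that. This is a sketch using the standard `html` module; the function name `recommendation_card` and the sample title are hypothetical.

```python
import html

def recommendation_card(name, poster_url):
    """Build the HTML for one recommendation with escaped text and URL."""
    safe_name = html.escape(name)
    safe_url = html.escape(poster_url, quote=True)
    return (
        '<div class="recommendation">'
        f'<img src="{safe_url}" width="100%">'
        f'<p style="color:#1DB954;">{safe_name}</p>'
        "</div>"
    )

card = recommendation_card("Rock & Roll <Live>", "https://example.com/cover.png")
print(card)
```

The resulting string could be passed to `st.markdown(card, unsafe_allow_html=True)` inside the loop.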


Written by Hans Bonnie
