This project involved building a personalized research paper recommendation system for researchers at a global food corporation, helping users discover new papers based on their preferences and past interactions with the platform.
I extensively researched different ways to build recommendation systems before deciding on the methodology.
Because this project was done for a client, I will walk through the whole process by building a book recommendation system on public data instead.
How do recommendation systems work?
Recommendation systems are algorithms designed to suggest items to users based on various data inputs. These systems are used widely in online platforms like Netflix, Amazon, YouTube, and many more.
There are multiple ways to build recommendation systems based on the available data and your use case.
Collaborative filtering uses user-item interaction data, such as ratings, to make suggestions based on the behavior of similar users. If many users who liked Item X also liked Item Y, then Item Y will be recommended to a user who liked Item X.
Content-based filtering methods suggest items similar to those a user already liked by analyzing the features of the items themselves (like genre, actors, and directors). For example, if a user likes sci-fi movies, the system will recommend more sci-fi movies.
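To make the distinction concrete, here is a minimal sketch with a made-up toy ratings matrix (all data here is illustrative, not from the project): item-item collaborative filtering compares items by how users rated them, while content-based filtering compares items by their own features.
import numpy as np

# Toy user-item rating matrix: rows = users, columns = items (0 = unrated)
ratings_toy = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 2],
    [1, 0, 5, 4],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Collaborative view: two items are similar if the same users rated them similarly
item_vectors = ratings_toy.T  # each item is described by its column of user ratings
print(cosine_sim(item_vectors[0], item_vectors[1]))  # ~0.96: items 0 and 1 are co-liked

# Content-based view: two items are similar if their feature vectors match
genres = np.array([
    [1, 0],  # item 0: sci-fi
    [1, 0],  # item 1: sci-fi
    [0, 1],  # item 2: romance
    [0, 1],  # item 3: romance
])
print(cosine_sim(genres[0], genres[1]))  # 1.0: same genre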
Check out my blog post to understand more about the inner workings of recommendation systems.
Data Collection and Preprocessing
Data Source:
Sample research papers and user rating data provided by the client.
Initial research and prototyping were done using the Book-Crossing Dataset, which is available on Kaggle. Book-Crossings is a book ratings dataset compiled by Cai-Nicolas Ziegler. It contains 1.1 million ratings of 270,000 books by 90,000 users. Ratings are either explicit, on a scale from 1 to 10, or implicit, recorded as 0.
The books file provides book details. It includes 271,360 records and 8 fields: ISBN, book title, book author, year of publication, publisher, and three cover image URLs.
We have three separate files: one for user details, one for book details, and one for ratings.
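For reference, on the Kaggle copy of the dataset these files are usually named BX-Users.csv, BX-Books.csv, and BX-Book-Ratings.csv, with ';' as the separator and Latin-1 encoding; the file names here are an assumption about the download, not the client's data:
import pandas as pd

# Load the three Book-Crossing files (names/encoding assume the standard Kaggle dump)
books = pd.read_csv('BX-Books.csv', sep=';', encoding='latin-1', on_bad_lines='skip', low_memory=False)
users = pd.read_csv('BX-Users.csv', sep=';', encoding='latin-1')
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding='latin-1')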
ratings.head()
ratings.columns
Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')
users.head()
users.columns
Index(['User-ID', 'Location', 'Age'], dtype='object')
books.head()
books.columns
Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'], dtype='object')
Preprocessing:
Preprocessed the data by handling missing values and removing duplicates.
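The snippets below refer to camelCase column names ('userID', 'bookRating', and so on), so a rename step like the following is assumed; the cleaning lines are a minimal sketch rather than the full client pipeline:
# Rename columns from the raw headers to the camelCase names used below
ratings.columns = ['userID', 'ISBN', 'bookRating']
users.columns = ['userID', 'Location', 'Age']
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication',
                 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

# Basic cleaning: drop exact duplicates and rows missing key fields
ratings = ratings.drop_duplicates()
books = books.dropna(subset=['bookTitle', 'bookAuthor'])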
# To ensure statistical significance, exclude users with fewer than 200 ratings and books with fewer than 100 ratings
counts1 = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
# Count by ISBN (not by rating value) so that we keep *books* with at least 100 ratings
counts2 = ratings['ISBN'].value_counts()
ratings = ratings[ratings['ISBN'].isin(counts2[counts2 >= 100].index)]
Combining ratings data and book details.
combine_book_rating = pd.merge(ratings, books, on='ISBN')
columns = ['imageUrlS', 'imageUrlM', 'imageUrlL']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
combine_book_rating.head()
# We then group by book titles and create a new column for the total rating count.
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['bookTitle'])
book_ratingCount = (combine_book_rating
    .groupby(by=['bookTitle'])['bookRating']
    .count()
    .reset_index()
    .rename(columns={'bookRating': 'totalRatingCount'})
    [['bookTitle', 'totalRatingCount']])
book_ratingCount.head()
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, on='bookTitle', how='left')
rating_with_totalRatingCount.head()
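Before fixing a popularity threshold, it helps to look at how the rating counts are distributed; a quick way to do that (nothing here is specific to the client data):
import numpy as np

# Inspect the distribution of rating counts to choose a sensible cutoff
print(book_ratingCount['totalRatingCount'].describe())
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))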
Filtering popular books by setting a threshold.
popularity_threshold = 50
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_book.head()
Adding the user data to the popular book rating data.
combined = rating_popular_book.merge(users, left_on = 'userID', right_on = 'userID', how = 'left')
Exploratory Data Analysis
Conducted exploratory data analysis to gain insights into user behavior patterns and distribution of book ratings.
import matplotlib.pyplot as plt

# Ratings Distribution plot
plt.rc("font", size=15)
ratings.bookRating.value_counts(sort=False).plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()
#The plot shows that ratings are very unevenly distributed, and the vast majority of ratings are 0.
Rating trend throughout the years.
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.scatter(combined["yearOfPublication"],combined["totalRatingCount"], label="Total Rating")
plt.title('Rating Trend through Year')
plt.xlabel('Year Of Publication')
plt.ylabel('Total Rating')
plt.legend(loc='best')
# Removing users with age under 10 or over 100
Ageindex_10 = combined[combined['Age'] < 10].index
Ageindex_100 = combined[combined['Age'] > 100].index
combined.drop(Ageindex_10, inplace=True)
combined.drop(Ageindex_100, inplace=True)
# Age Vs Rating
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.scatter(combined["Age"],combined["totalRatingCount"], label="bookRating")
plt.title('Age of User Vs Rating')
plt.xlabel('Age of user')
plt.ylabel('Rating')
plt.legend(loc='best')
Methodology and Modeling
Implemented multiple collaborative filtering and content-based filtering models, leveraging user interaction data to generate personalized recommendations.
For collaborative filtering, algorithms like Matrix Factorization, k-Nearest Neighbors (k-NN), or Singular Value Decomposition (SVD) are used. Content-based systems might use classifiers or other algorithms to learn item features.
- Matrix Factorization: Decomposes the user-item interaction matrix into lower-dimensional matrices representing users and items, capturing latent factors (a minimal sketch follows this list).
- k-Nearest Neighbors (k-NN): Finds the nearest neighbors of users or items based on their ratings or features and recommends items that similar users have liked.
- Deep Learning Models: Utilize neural networks to learn complex patterns in user-item interactions. Techniques like Autoencoders, Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs) can be employed.
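To illustrate the first idea, here is a minimal matrix factorization sketch on a toy matrix, trained with plain stochastic gradient descent; it is an illustration of the technique, not the model we shipped:
import numpy as np

# Toy user-item matrix; 0 means "unobserved", not a real rating
R = np.array([[5, 3, 0],
              [4, 0, 2],
              [0, 1, 5]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors
lr, reg = 0.01, 0.02

for _ in range(2000):
    for u, i in zip(*R.nonzero()):  # iterate over observed entries only
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P @ Q.T)  # reconstruction: the former zeros are now predicted scores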
After trying out many libraries and methods, we decided to move forward with the LightFM implementation and converted the data into LightFM's format.
# books_metadata_selected, user_metadata_selected, and interactions_selected are
# the cleaned book, user, and rating subsets prepared above
# dummify categorical features
books_metadata_selected_transformed = pd.get_dummies(
    books_metadata_selected, columns=['bookAuthor', 'yearOfPublication', 'publisher'])
user_metadata_selected_transformed = pd.get_dummies(
    user_metadata_selected, columns=['Age', 'Location'])
books_metadata_selected_transformed = books_metadata_selected_transformed.sort_values('ISBN').reset_index(drop=True)
user_metadata_selected_transformed = user_metadata_selected_transformed.sort_values('userID').reset_index(drop=True)
# convert to csr matrix
from scipy.sparse import csr_matrix
books_metadata_csr = csr_matrix(books_metadata_selected_transformed.drop('ISBN', axis=1).values)
user_metadata_csr = csr_matrix(user_metadata_selected_transformed.drop('userID', axis=1).values)
user_book_interaction = pd.pivot_table(interactions_selected, index='userID', columns='ISBN', values='bookRating')
# fill missing values with 0 (no recorded interaction)
user_book_interaction = user_book_interaction.fillna(0)
# map each external user ID to a consecutive integer index
user_id = list(user_book_interaction.index)
user_dict = {}
counter = 0
for i in user_id:
    user_dict[i] = counter
    counter += 1
# convert to csr matrix
user_book_interaction_csr = csr_matrix(user_book_interaction.values)
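As an aside, LightFM also ships a lightfm.data.Dataset helper that builds the same ID mappings and sparse interaction matrix for you; a rough equivalent of the manual steps above (assuming the same interactions_selected dataframe) would be:
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=interactions_selected['userID'].unique(),
            items=interactions_selected['ISBN'].unique())
# build_interactions takes (user, item, weight) triples and returns COO matrices
(interactions, weights) = dataset.build_interactions(
    (row.userID, row.ISBN, row.bookRating)
    for row in interactions_selected.itertuples())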
Trained the model.
from lightfm import LightFM

model = LightFM(loss='warp',
                random_state=2016,
                learning_rate=0.90,
                no_components=150,
                user_alpha=0.000005)
model = model.fit(user_book_interaction_csr,
                  user_features=user_metadata_csr,
                  item_features=books_metadata_csr,
                  epochs=100,
                  num_threads=16, verbose=False)
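Once trained, model.predict scores every book for a given user, so a small helper can surface the top-N titles; this is a sketch built on the user_dict and pivot table from above, and the example user ID is arbitrary:
import numpy as np

def recommend_top_n(model, external_user_id, n=10):
    user_index = user_dict[external_user_id]  # map external ID to LightFM's internal index
    n_items = user_book_interaction_csr.shape[1]
    scores = model.predict(user_index, np.arange(n_items),
                           user_features=user_metadata_csr,
                           item_features=books_metadata_csr)
    top_items = np.argsort(-scores)[:n]
    # Columns of the pivot table are ISBNs
    return [user_book_interaction.columns[i] for i in top_items]

print(recommend_top_n(model, external_user_id=276747, n=10))  # 276747 is just an example ID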
Results and Evaluation
Evaluated the performance of the recommendation system using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). After testing different models as a team, we chose a hybrid model that combines collaborative filtering and content-based filtering, implemented with the LightFM library.
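Since LightFM trained with WARP loss is a ranking model, its built-in ranking metrics (precision@k, AUC) complement MAE/RMSE nicely; a sketch of such an evaluation on a held-out split, using LightFM's own utilities (in practice the split happens before the final fit):
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k, auc_score

# Hold out 20% of interactions for testing
train, test = random_train_test_split(user_book_interaction_csr.tocoo(), test_percentage=0.2)
model.fit(train, user_features=user_metadata_csr, item_features=books_metadata_csr, epochs=100)

print('Precision@10:', precision_at_k(model, test, k=10,
                                      user_features=user_metadata_csr,
                                      item_features=books_metadata_csr).mean())
print('AUC:', auc_score(model, test,
                        user_features=user_metadata_csr,
                        item_features=books_metadata_csr).mean())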
Final Thoughts
By combining expertise in data preprocessing, algorithm implementation, and model evaluation, we delivered a robust recommendation system that enhances user engagement and promotes the discovery of new research papers.