This project involved building a personalized research paper recommendation system for researchers at a global food corporation, helping users discover new papers based on their preferences and past interactions with the platform.
I extensively researched different ways to build recommendation systems before deciding on the methodology.
Because this project was done for a client, I will walk through the whole process by building a book recommendation system on public data instead.
How do recommendation systems work?
Recommendation systems are algorithms designed to suggest items to users based on various data inputs. These systems are used widely in online platforms like Netflix, Amazon, YouTube, and many more.
There are multiple ways to build recommendation systems based on the available data and your use case.
Collaborative filtering uses user-item interaction data, such as ratings, to make suggestions based on the behavior of similar users. If many users who liked Item X also liked Item Y, then Item Y will be recommended to a user who liked Item X.
Content-based filtering methods suggest items similar to those a user already liked by analyzing the features of the items themselves (like genre, actors, and directors). For example, if a user likes sci-fi movies, the system will recommend more sci-fi movies.
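To make the distinction concrete, here is a minimal sketch with a made-up toy ratings matrix (all data here is illustrative, not from the project): item-item collaborative filtering compares items by how users rated them, while content-based filtering compares items by their own features.
import numpy as np

# Toy user-item rating matrix: rows = users, columns = items (0 = unrated)
ratings_toy = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 2],
    [1, 0, 5, 4],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Collaborative view: two items are similar if the same users rated them similarly
item_vectors = ratings_toy.T  # each item is described by its column of user ratings
print(cosine_sim(item_vectors[0], item_vectors[1]))  # ~0.96: items 0 and 1 are co-liked

# Content-based view: two items are similar if their feature vectors match
genres = np.array([
    [1, 0],  # item 0: sci-fi
    [1, 0],  # item 1: sci-fi
    [0, 1],  # item 2: romance
    [0, 1],  # item 3: romance
])
print(cosine_sim(genres[0], genres[1]))  # 1.0: same genre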
Check out my blog post to understand more about the inner workings of recommendation systems.
Data Collection and Preprocessing
Data Source:
Sample research papers and user rating data provided by the client.
Initial research and prototyping were done using the Book-Crossing Dataset, which is available on Kaggle. Book-Crossings is a book ratings dataset compiled by Cai-Nicolas Ziegler. It contains 1.1 million ratings of 270,000 books by 90,000 users. Ratings are either explicit, on a scale from 1 to 10, or implicit, recorded as 0.
The books file provides book details. It includes 271,360 records and 8 fields: ISBN, book title, book author, year of publication, publisher, and three cover image URLs.
We have three separate files: one for user details, one for book details, and one for ratings.
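For reference, on the Kaggle copy of the dataset these files are usually named BX-Users.csv, BX-Books.csv, and BX-Book-Ratings.csv, with ';' as the separator and Latin-1 encoding; the file names here are an assumption about the download, not the client's data:
import pandas as pd

# Load the three Book-Crossing files (names/encoding assume the standard Kaggle dump)
books = pd.read_csv('BX-Books.csv', sep=';', encoding='latin-1', on_bad_lines='skip', low_memory=False)
users = pd.read_csv('BX-Users.csv', sep=';', encoding='latin-1')
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', encoding='latin-1')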
ratings.head()
ratings.columns
Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')
users.head()
users.columns
Index(['User-ID', 'Location', 'Age'], dtype='object')
books.head()
books.columns
Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'], dtype='object')
Preprocessing:
Preprocessed the data by handling missing values and removing duplicates.
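The snippets below refer to camelCase column names ('userID', 'bookRating', and so on), so a rename step like the following is assumed; the cleaning lines are a minimal sketch rather than the full client pipeline:
# Rename columns from the raw headers to the camelCase names used below
ratings.columns = ['userID', 'ISBN', 'bookRating']
users.columns = ['userID', 'Location', 'Age']
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication',
                 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

# Basic cleaning: drop exact duplicates and rows missing key fields
ratings = ratings.drop_duplicates()
books = books.dropna(subset=['bookTitle', 'bookAuthor'])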
# To ensure statistical significance, exclude users with fewer than 200 ratings and books with fewer than 100 ratings
counts1 = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
# Count by ISBN (not by rating value) so that we keep *books* with at least 100 ratings
counts2 = ratings['ISBN'].value_counts()
ratings = ratings[ratings['ISBN'].isin(counts2[counts2 >= 100].index)]
Combining ratings data and book details.
combine_book_rating = pd.merge(ratings, books, on='ISBN')
columns = ['imageUrlS', 'imageUrlM', 'imageUrlL']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
combine_book_rating.head()
# We then group by book titles and create a new column for the total rating count.
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['bookTitle'])
book_ratingCount = (combine_book_rating
    .groupby(by=['bookTitle'])['bookRating']
    .count()
    .reset_index()
    .rename(columns={'bookRating': 'totalRatingCount'})
    [['bookTitle', 'totalRatingCount']])
book_ratingCount.head()
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, on='bookTitle', how='left')
rating_with_totalRatingCount.head()
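Before fixing a popularity threshold, it helps to look at how the rating counts are distributed; a quick way to do that (nothing here is specific to the client data):
import numpy as np

# Inspect the distribution of rating counts to choose a sensible cutoff
print(book_ratingCount['totalRatingCount'].describe())
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))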
Filtering popular books by setting a threshold.
popularity_threshold = 50
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_book.head()
Adding the user data to the popular book rating data.
combined = rating_popular_book.merge(users, left_on = 'userID', right_on = 'userID', how = 'left')
Exploratory Data Analysis
Conducted exploratory data analysis to gain insights into user behavior patterns and distribution of book ratings.
import matplotlib.pyplot as plt

# Ratings Distribution plot
plt.rc("font", size=15)
ratings.bookRating.value_counts(sort=False).plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()
#The plot shows that ratings are very unevenly distributed, and the vast majority of ratings are 0.
Rating trend throughout the years.
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.scatter(combined["yearOfPublication"],combined["totalRatingCount"], label="Total Rating")
plt.title('Rating Trend through Year')
plt.xlabel('Year Of Publication')
plt.ylabel('Total Rating')
plt.legend(loc='best')
# Removing users with age under 10 or over 100
Ageindex_10 = combined[combined['Age'] < 10].index
Ageindex_100 = combined[combined['Age'] > 100].index
combined.drop(Ageindex_10, inplace=True)
combined.drop(Ageindex_100, inplace=True)
# Age Vs Rating
plt.figure(figsize=(40, 7))
plt.subplot(1,2,1)
plt.scatter(combined["Age"],combined["totalRatingCount"], label="bookRating")
plt.title('Age of User Vs Rating')
plt.xlabel('Age of user')
plt.ylabel('Rating')
plt.legend(loc='best')
Methodology and Modeling
Implemented multiple collaborative filtering and content-based filtering models, leveraging user interaction data to generate personalized recommendations.
For collaborative filtering, algorithms like Matrix Factorization, k-Nearest Neighbors (k-NN), or Singular Value Decomposition (SVD) are used. Content-based systems might use classifiers or other algorithms to learn item features.
- Matrix Factorization: Decomposes the user-item interaction matrix into lower-dimensional matrices representing users and items, capturing latent factors (a minimal sketch follows this list).
- k-Nearest Neighbors (k-NN): Finds the nearest neighbors of users or items based on their ratings or features and recommends items that similar users have liked.
- Deep Learning Models: Utilize neural networks to learn complex patterns in user-item interactions. Techniques like Autoencoders, Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs) can be employed.
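To illustrate the first idea, here is a minimal matrix factorization sketch on a toy matrix, trained with plain stochastic gradient descent; it is an illustration of the technique, not the model we shipped:
import numpy as np

# Toy user-item matrix; 0 means "unobserved", not a real rating
R = np.array([[5, 3, 0],
              [4, 0, 2],
              [0, 1, 5]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors
lr, reg = 0.01, 0.02

for _ in range(2000):
    for u, i in zip(*R.nonzero()):  # iterate over observed entries only
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P @ Q.T)  # reconstruction: the former zeros are now predicted scores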
After trying out many libraries and methods, we decided to move forward with the LightFM implementation and converted the data into LightFM's format.
# books_metadata_selected, user_metadata_selected, and interactions_selected are
# the cleaned book, user, and rating subsets prepared above
# dummify categorical features
books_metadata_selected_transformed = pd.get_dummies(
    books_metadata_selected, columns=['bookAuthor', 'yearOfPublication', 'publisher'])
user_metadata_selected_transformed = pd.get_dummies(
    user_metadata_selected, columns=['Age', 'Location'])
books_metadata_selected_transformed = books_metadata_selected_transformed.sort_values('ISBN').reset_index(drop=True)
user_metadata_selected_transformed = user_metadata_selected_transformed.sort_values('userID').reset_index(drop=True)
# convert to csr matrix
from scipy.sparse import csr_matrix
books_metadata_csr = csr_matrix(books_metadata_selected_transformed.drop('ISBN', axis=1).values)
user_metadata_csr = csr_matrix(user_metadata_selected_transformed.drop('userID', axis=1).values)
user_book_interaction = pd.pivot_table(interactions_selected, index='userID', columns='ISBN', values='bookRating')
# fill missing values with 0 (no recorded interaction)
user_book_interaction = user_book_interaction.fillna(0)
# map each external user ID to a consecutive integer index
user_id = list(user_book_interaction.index)
user_dict = {}
counter = 0
for i in user_id:
    user_dict[i] = counter
    counter += 1
# convert to csr matrix
user_book_interaction_csr = csr_matrix(user_book_interaction.values)
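As an aside, LightFM also ships a lightfm.data.Dataset helper that builds the same ID mappings and sparse interaction matrix for you; a rough equivalent of the manual steps above (assuming the same interactions_selected dataframe) would be:
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=interactions_selected['userID'].unique(),
            items=interactions_selected['ISBN'].unique())
# build_interactions takes (user, item, weight) triples and returns COO matrices
(interactions, weights) = dataset.build_interactions(
    (row.userID, row.ISBN, row.bookRating)
    for row in interactions_selected.itertuples())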
Trained the model.
from lightfm import LightFM

model = LightFM(loss='warp',
                random_state=2016,
                learning_rate=0.90,
                no_components=150,
                user_alpha=0.000005)
model = model.fit(user_book_interaction_csr,
                  user_features=user_metadata_csr,
                  item_features=books_metadata_csr,
                  epochs=100,
                  num_threads=16, verbose=False)
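Once trained, model.predict scores every book for a given user, so a small helper can surface the top-N titles; this is a sketch built on the user_dict and pivot table from above, and the example user ID is arbitrary:
import numpy as np

def recommend_top_n(model, external_user_id, n=10):
    user_index = user_dict[external_user_id]  # map external ID to LightFM's internal index
    n_items = user_book_interaction_csr.shape[1]
    scores = model.predict(user_index, np.arange(n_items),
                           user_features=user_metadata_csr,
                           item_features=books_metadata_csr)
    top_items = np.argsort(-scores)[:n]
    # Columns of the pivot table are ISBNs
    return [user_book_interaction.columns[i] for i in top_items]

print(recommend_top_n(model, external_user_id=276747, n=10))  # 276747 is just an example ID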
Results and Evaluation
Evaluated the performance of the recommendation system using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). After testing different models as a team, we chose a hybrid model that combines collaborative filtering and content-based filtering, implemented with the LightFM library.
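Since LightFM trained with WARP loss is a ranking model, its built-in ranking metrics (precision@k, AUC) complement MAE/RMSE nicely; a sketch of such an evaluation on a held-out split, using LightFM's own utilities (in practice the split happens before the final fit):
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k, auc_score

# Hold out 20% of interactions for testing
train, test = random_train_test_split(user_book_interaction_csr.tocoo(), test_percentage=0.2)
model.fit(train, user_features=user_metadata_csr, item_features=books_metadata_csr, epochs=100)

print('Precision@10:', precision_at_k(model, test, k=10,
                                      user_features=user_metadata_csr,
                                      item_features=books_metadata_csr).mean())
print('AUC:', auc_score(model, test,
                        user_features=user_metadata_csr,
                        item_features=books_metadata_csr).mean())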
Final Thoughts
By combining expertise in data preprocessing, algorithm implementation, and model evaluation, we delivered a robust recommendation system that enhances user engagement and promotes the discovery of new research papers.