Article Recommendation System

Objective: -

The goal of this challenge is to build a recommendation system that recommends articles to readers.

Dataset: -

Many websites today use a recommendation system to suggest articles to their readers. For example, Quora, LinkedIn, and Medium all recommend articles this way. The dataset used here contains 34 articles, each with its text (Article) and its title (Title).

Step 1: Import all the required libraries

  • Pandas : pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

  • Sklearn : Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. The library is built on SciPy (Scientific Python), which must be installed before you can use scikit-learn.

  • Pickle : The Python pickle module is used for serializing and de-serializing a Python object structure. Pickling converts a Python object (list, dict, etc.) into a byte stream. The idea is that this byte stream contains all the information necessary to reconstruct the object in another Python script.

  • Seaborn : Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

  • Matplotlib : Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.

#Loading libraries   
import pandas as pd  
from sklearn import preprocessing  
import pickle  
import numpy as np  
from sklearn.decomposition import PCA  
from sklearn.feature_extraction import text  
from sklearn.metrics.pairwise import cosine_similarity  
import warnings  


Step 2: Read the dataset and its basic details

Goal:- In this step we are going to read the dataset, view it, and analyze its basic details: the total number of rows and columns, the column data types, and whether any new columns need to be created.

First, we read the dataset and take a look at it.

#loading the dataset  
try:  
    df = pd.read_csv("", encoding='latin1') #Path for the file  
    print('Data read done successfully...')  
except (FileNotFoundError, IOError):  
    print("Wrong file or file path")
Data read done successfully...
# To view the content of the dataset we can use the head() method, which returns the specified number of rows from the top.  
# The head() method returns the first 5 rows if a number is not specified.  
df.head()


After reading the data, we can check its size:

# count the total number of rows and columns.  
print ('The train data has {0} rows and {1} columns'.format(df.shape[0],df.shape[1]))
The train data has 34 rows and 2 columns
df.shape
(34, 2)

The df.shape attribute gives the shape of the dataset as (rows, columns).

df.info()
<class 'pandas.core.frame.DataFrame'>  
RangeIndex: 34 entries, 0 to 33  
Data columns (total 2 columns):  
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   Article  34 non-null     object  
 1   Title    34 non-null     object  
dtypes: object(2)  
memory usage: 672.0+ bytes

The df.info() method prints information about a DataFrame, including the index dtype, the column dtypes, the non-null counts, and the memory usage.

df.iloc[1]
Article    The performance of a machine learning algorith...  
Title             Assumptions of Machine Learning Algorithms  
Name: 1, dtype: object

df.iloc[] is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array. The iloc property gets or sets the value(s) at the specified positions.
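
The three ways of selecting with iloc mentioned above can be sketched on a small frame. This is only an illustration: the `toy` DataFrame and its contents are made up and stand in for the real dataset.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the article dataset.
toy = pd.DataFrame({
    "Article": ["text about clustering", "text about regression", "text about trees"],
    "Title": ["Clustering", "Regression", "Decision Trees"],
})

second_row = toy.iloc[1]                 # single integer -> one row as a Series
first_two = toy.iloc[0:2]                # slice -> DataFrame, end position excluded
masked = toy.iloc[[True, False, True]]   # boolean list -> rows where the mask is True

print(second_row["Title"])
print(list(first_two["Title"]))
print(list(masked["Title"]))
```

A single integer returns a Series whose name is the row label, which is why the output above ends with "Name: 1".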

Data Type Check for every column

Why is a data type check required?

A data type check helps us understand what types of variables our dataset contains and decide whether to keep each variable. If the dataset contains continuous data, then float and integer type variables will be beneficial, and if we have to classify values, then categorical variables will be beneficial.

objects_cols = ['object']  
objects_lst = list(df.select_dtypes(include=objects_cols).columns)
print("Total number of categorical columns:", len(objects_lst))  
print("Their names are as follows:", objects_lst)
Total number of categorical columns: 2  
Their names are as follows: ['Article', 'Title']
int64_cols = ['int64']  
int64_lst = list(df.select_dtypes(include=int64_cols).columns)
print("Total number of int64 columns:", len(int64_lst))  
print("Their names are as follows:", int64_lst)
Total number of int64 columns: 0  
Their names are as follows: []
float64_cols = ['float64']  
float64_lst = list(df.select_dtypes(include=float64_cols).columns)
print("Total number of float64 columns:", len(float64_lst))  
print("Their names are as follows:", float64_lst)
Total number of float64 columns: 0  
Their names are as follows: []

Step 2 Insights: -

  1. We have a total of 2 features: 2 are object type, 0 are float type, and 0 are int type.

Step 3: Data Preprocessing

Why is data preprocessing needed?

Preprocessing data is an important step for data analysis. The following are some benefits of preprocessing data:

  • It improves accuracy and reliability. Preprocessing data removes missing or inconsistent data values resulting from human or computer error, which can improve the accuracy and quality of a dataset, making it more reliable.

  • It makes data consistent. When collecting data, it’s possible to have data duplicates, and discarding them during preprocessing can ensure the data values for analysis are consistent, which helps produce accurate results.

  • It makes the data more readable for algorithms. Preprocessing enhances the data’s quality and makes it easier for machine learning algorithms to read, use, and interpret.

Null and NaN values

  1. Null Values


A null value in a relational database is used when the value in a column is unknown or missing. A null is neither an empty string (for character or datetime data types) nor a zero value (for numeric data types).

Article    0  
Title      0  
dtype: int64

The counts above are all zero, so there are no null values in our dataset.

  2. NaN Values


NaN, standing for Not a Number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic.

Article    0  
Title      0  
dtype: int64

Again the counts are all zero, so there are no NaN values in our dataset.

If a dataset does contain null/NaN values, there are several ways to fill them:
  • mean -> average value (for numerical)

  • mode -> most repeated value (for categorical)

Another way to handle null and NaN values is to drop the rows that contain them with df.dropna(inplace=True).
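
Since the article dataset has no missing values, the options above can only be sketched on made-up data. In the example below the `toy` frame, its column names, and its values are all hypothetical; it counts the gaps per column, fills a numerical column with the mean and a categorical column with the mode, and shows dropping rows as the alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column (not the article dataset).
toy = pd.DataFrame({
    "views": [120.0, np.nan, 80.0, 100.0],
    "topic": ["clustering", "regression", None, "clustering"],
})

# Count missing entries per column; isna() is an alias of isnull().
missing_per_column = toy.isnull().sum()

filled = toy.copy()
filled["views"] = filled["views"].fillna(filled["views"].mean())      # numerical -> mean
filled["topic"] = filled["topic"].fillna(filled["topic"].mode()[0])   # categorical -> mode

# Alternative: discard every row that has at least one gap.
dropped = toy.dropna()

print(missing_per_column.to_dict())
print(filled)
print(len(dropped))
```

Filling keeps all rows but introduces estimated values, while dropping keeps only observed values at the cost of losing rows. On a dataset as small as this one (34 rows), dropping would be the more costly choice.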

Step 4: Implementing Cosine Similarity and Creating a Function to Recommend Articles to the User

To create an article recommendation system, we need to focus on content rather than user interest. For example, if a user reads an article about clustering, the recommended articles should also be about clustering. So to recommend articles based on the content:

  • we need to understand the content of the article

  • match the content with all the other articles

  • and recommend the most suitable articles for the article that the reader is already reading

For this task, we can use cosine similarity. Cosine similarity measures how similar two vectors are by the cosine of the angle between them. Applied to vector representations of text (such as TF-IDF vectors), it gives a score for how similar two documents are, which makes it a common basis for content-based recommendation systems. So we can use cosine similarity to build an article recommendation system.
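
For two vectors A and B, cosine similarity is the dot product A·B divided by the product of their lengths ‖A‖‖B‖. A minimal sketch of that formula in plain Python follows. The `cosine` helper and the three word-count vectors are made up for illustration and are not part of the original notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors over the vocabulary
# ["clustering", "points", "regression"].
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]   # same direction as doc_a, just longer
doc_c = [0, 0, 3]   # shares no words with doc_a

print(cosine(doc_a, doc_b))   # close to 1: same mix of words
print(cosine(doc_a, doc_c))   # 0: no words in common
```

Because the score depends only on direction and not on length, a long article and a short one that use the same vocabulary in the same proportions still count as very similar.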

Tasks we are going to perform in this step:

  1. Implement the cosine similarity algorithm

  2. Make a function that recommends articles for a particular article

  3. Run the recommendation function for every article in the dataset

1. Implement the cosine similarity algorithm

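articles = df["Article"].tolist()  
# Build TF-IDF vectors, ignoring common English stop words  
# (the documents go to fit_transform, not to the `input` parameter)  
uni_tfidf = text.TfidfVectorizer(stop_words="english")  
uni_matrix = uni_tfidf.fit_transform(articles)  
# Pairwise cosine similarity between all articles (n_articles x n_articles)  
uni_sim = cosine_similarity(uni_matrix)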
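
To get a feel for what the similarity matrix contains, the same two calls can be run on a tiny corpus. The three sentences below are invented for illustration; only `TfidfVectorizer` and `cosine_similarity` come from the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical three-document corpus (not the article dataset).
docs = [
    "k-means clustering groups similar points",
    "dbscan clustering finds dense groups of points",
    "linear regression predicts a continuous target",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)   # sparse matrix, one row per document
sim = cosine_similarity(matrix)      # dense 3 x 3 array of pairwise scores

print(sim.round(2))
```

Each row holds one document's similarity to every document, itself included, so the diagonal is 1. The two clustering sentences share words and score above zero against each other, while the regression sentence shares none with them and scores 0.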

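2. Make a function that recommends articles for a particular article

def recommend_articles(x):  
    # x is one row of the similarity matrix; argsort() orders positions by ascending similarity,  
    # so [-5:-1] takes the four most similar articles and skips the last one (normally the article itself)  
    return ", ".join(df["Title"].iloc[x.argsort()[-5:-1]])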
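
The slicing in this function is easier to follow on a small made-up row. The values below are hypothetical similarity scores of one article against five articles, with the article itself at position 0.

```python
import numpy as np

# Hypothetical similarity row for article 0 against five articles (itself included).
row = np.array([1.0, 0.2, 0.7, 0.05, 0.4])

order = row.argsort()        # positions sorted from least to most similar
top_two = order[-3:-1]       # two best matches, skipping the last position (article 0 itself)
best_first = top_two[::-1]   # reverse so the closest match comes first

print(order.tolist())        # [3, 1, 4, 2, 0]
print(top_two.tolist())      # [4, 2]
print(best_first.tolist())   # [2, 4]
```

Note that the slice returns matches in ascending order of similarity, so the closest match is listed last unless the result is reversed. The slice also assumes the article itself sorts last, which holds as long as no other article has a similarity of 1.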

3. Run the recommendation function for every article in the dataset

df["Recommended Articles"] = [recommend_articles(x) for x in uni_sim]  


A new column, Recommended Articles, has been added to the dataset. It contains the titles of the recommended articles for each row. Now let’s look at the recommendations for one article:

# let's check the recommended articles for a particular article (by index)  
print(df["Recommended Articles"][22])
BIRCH Clustering in Machine Learning, Clustering Algorithms in Machine Learning, DBSCAN Clustering in Machine Learning, K-Means Clustering in Machine Learning

Index 22 contains an article on “agglomerative clustering”, and all the recommended articles are also about clustering, which suggests that the recommender returns relevant results.

Step 5: Save Model

Goal:- In this step we are going to save our model to a file in pickle format.

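import pickle  
# Save the DataFrame (titles plus the recommendations column)  
with open('article_recommender_model.pkl', 'wb') as f:  
    pickle.dump(df, f)  
# Save the cosine similarity matrix  
with open("Cosine_artciles.pkl", 'wb') as f:  
    pickle.dump(uni_sim, f)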
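
Loading a saved object back is the mirror image of saving it. The sketch below uses a made-up dictionary and a temporary file (both hypothetical) so that it runs on its own. The same `pickle.load` call would apply to the two files saved above.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the saved objects; any picklable object works the same way.
recommendations = {"Article A": ["Article B", "Article C"]}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_recommender.pkl")

with open(path, "wb") as f:
    pickle.dump(recommendations, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == recommendations)
```

The restored object is an equal copy, not the same object in memory. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.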


Starting from the problem statement, we have built a model that recommends articles to readers based on the content of the article they are reading.

