Objective: -
The goal of this challenge is to build a recommendation system that recommends articles to readers.
Dataset: -
Many websites today use recommendation systems to recommend articles to their readers; for example, Quora, LinkedIn, and Medium all suggest articles their readers may like. The dataset used here (articles.csv) contains a collection of articles with two columns: Article (the full text) and Title.
Step 1: Import all the required libraries
Pandas : In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
Sklearn : Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn.
Pickle : The Python pickle module is used for serializing and de-serializing a Python object structure. Pickling is a way to convert a Python object (list, dict, etc.) into a byte stream. The idea is that this stream contains all the information necessary to reconstruct the object in another Python script.
Seaborn : Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Matplotlib : Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
#Loading libraries
import pandas as pd  # data manipulation and analysis
from sklearn import preprocessing  # preprocessing utilities
import pickle  # serializing/de-serializing Python objects
import numpy as np  # numerical arrays
from sklearn.decomposition import PCA  # dimensionality reduction
from sklearn.feature_extraction import text  # TF-IDF text vectorization
from sklearn.metrics.pairwise import cosine_similarity  # pairwise cosine similarity
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages
Step 2: Read the dataset and basic details of the dataset
Goal:- In this step we are going to read the dataset, view it, and analyze basic details such as the total number of rows and columns, the column data types, and whether we need to create any new columns.
In this stage we are going to read our problem dataset and have a look at it.
#loading the dataset
try:
    df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/articles.csv", encoding='latin1')  # Path for the file
    print('Data read done successfully...')
except (FileNotFoundError, IOError):
    print("Wrong file or file path")
Data read done successfully...
# To view the content inside the dataset we can use the head() method, which returns a specified number of rows from the top.
# The head() method returns the first 5 rows if a number is not specified.
df.head()
After we read the data, we can look at the data using:
# count the total number of rows and columns.
print ('The dataset has {0} rows and {1} columns'.format(df.shape[0],df.shape[1]))
The dataset has 34 rows and 2 columns
df.shape
(34, 2)
The df.shape attribute shows the shape (rows, columns) of the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Article  34 non-null     object
 1   Title    34 non-null     object
dtypes: object(2)
memory usage: 672.0+ bytes
The df.info() method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
df.iloc[1]
Article The performance of a machine learning algorith...
Title Assumptions of Machine Learning Algorithms
Name: 1, dtype: object
df.iloc[] is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array. The iloc property gets, or sets, the value(s) at the specified positions.
Data Type Check for every column
Why is a data type check required?
A data type check helps us understand what types of variables our dataset contains and whether to keep a variable or not. If the dataset contains continuous data, then float and integer variables will be beneficial, and if we have to classify values then categorical variables will be beneficial.
objects_cols = ['object']
objects_lst = list(df.select_dtypes(include=objects_cols).columns)
print("Total number of categorical columns are ", len(objects_lst))
print("There names are as follows: ", objects_lst)
Total number of categorical columns are 2
There names are as follows: ['Article', 'Title']
int64_cols = ['int64']
int64_lst = list(df.select_dtypes(include=int64_cols).columns)
print("Total number of numerical columns are ", len(int64_lst))
print("There names are as follows: ", int64_lst)
Total number of numerical columns are 0
There names are as follows: []
float64_cols = ['float64']
float64_lst = list(df.select_dtypes(include=float64_cols).columns)
print("Total number of float64 columns are ", len(float64_lst))
print("There name are as follow: ", float64_lst)
Total number of float64 columns are 0
There name are as follow: []
Step 2 Insights: -
- We have 2 features in total; both are object type, and there are no float or integer columns.
Step 3: Data Preprocessing
Why do we need Data Preprocessing?
Preprocessing data is an important step for data analysis. The following are some benefits of preprocessing data:
It improves accuracy and reliability. Preprocessing data removes missing or inconsistent data values resulting from human or computer error, which can improve the accuracy and quality of a dataset, making it more reliable.
It makes data consistent. When collecting data, it’s possible to have data duplicates, and discarding them during preprocessing can ensure the data values for analysis are consistent, which helps produce accurate results.
It increases the data’s algorithm readability. Preprocessing enhances the data’s quality and makes it easier for machine learning algorithms to read, use, and interpret it.
Null and Nan values
- Null Values
A null value in a relational database is used when the value in a column is unknown or missing. A null is neither an empty string (for character or datetime data types) nor a zero value (for numeric data types).
df.isnull().sum()
Article 0
Title 0
dtype: int64
As we can see, there are no null values in our dataset.
- Nan Values
NaN, standing for Not a Number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic.
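A quick illustration (a minimal sketch): NaN is the one floating-point value that is not even equal to itself, which is why dedicated checks like isna() exist:
print(np.nan == np.nan)  # False: NaN is not equal even to itself
print(np.isnan(np.nan))  # True: use np.isnan() (or pandas isna()) to detect it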
df.isna().sum()
Article 0
Title 0
dtype: int64
As we can see, there are no NaN values in our dataset.
# We have many ways to fill Null/NaN values, for example:
mean -> average value (for numerical columns)
mode -> most repeated value (for categorical columns)
Another way to handle null and NaN values is to drop them with the method df.dropna(inplace=True), as sketched below.
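Since our dataset has no missing values, here is a minimal illustrative sketch on a small throwaway DataFrame (the columns num_col and cat_col are hypothetical, used only for this example):
# Illustrative only: our articles dataset has no missing values.
demo = pd.DataFrame({"num_col": [1.0, None, 3.0], "cat_col": ["a", None, "a"]})
demo["num_col"] = demo["num_col"].fillna(demo["num_col"].mean())    # mean for numerical
demo["cat_col"] = demo["cat_col"].fillna(demo["cat_col"].mode()[0])  # mode for categorical
# demo.dropna(inplace=True)  # alternatively, drop any rows that still contain NaNs
print(demo)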
Step 4: Implementing Cosine Similarity and Creating a Function to Recommend Articles to the User
To create an articles recommendation system, we need to focus on content rather than user interest. For example, if a user reads an article based on clustering, all recommended articles should also be based on clustering. So to recommend articles based on the content:
we need to understand the content of the article
match the content with all the other articles
and recommend the most suitable articles for the article that the reader is already reading
For this task, we can use the concept of cosine similarity from machine learning. Cosine similarity is a method for building content-based recommendation systems: it measures how similar two text documents are by comparing the angle between their vector representations. So we can use cosine similarity to build an article recommendation system.
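To make the idea concrete, here is a minimal sketch (not part of the pipeline below) that computes cosine similarity between two small word-count vectors with NumPy. The score is the dot product of the vectors divided by the product of their magnitudes, so documents with similar word distributions score close to 1:
# Minimal sketch: cosine similarity between two word-count vectors
a = np.array([1, 2, 0, 1])  # term counts for document A
b = np.array([1, 1, 0, 1])  # term counts for document B
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 3))  # ~0.943, so the two documents are very similar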
Tasks we are going to perform in this step:
Implement the cosine similarity algorithm
Make a function for recommending articles for a particular article
Run the recommendation function for every article in the dataset
1. Implement the cosine similarity algorithm
articles = df["Article"].tolist()
# turn each article into a TF-IDF vector, ignoring common English stop words
uni_tfidf = text.TfidfVectorizer(stop_words="english")
uni_matrix = uni_tfidf.fit_transform(articles)
# compute the cosine similarity between every pair of article vectors
uni_sim = cosine_similarity(uni_matrix)
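Here uni_sim is a square matrix with one row per article: entry [i][j] holds the cosine similarity between article i and article j, and every diagonal entry is 1 because each article is identical to itself. A quick sanity check (illustrative):
print(uni_sim.shape)  # (34, 34): one row and one column per article
print(uni_sim[0][0])  # 1.0, up to floating-point precision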
2. Make a function for recommending articles for a particular article
def recommend_articles(x):
    # x is one row of the similarity matrix; argsort() orders article indices
    # from least to most similar, and [-5:-1] keeps the four most similar
    # articles while skipping the very last index (the article itself)
    return ", ".join(df["Title"].loc[x.argsort()[-5:-1]])
3. Run the recommendation function for every article in the dataset
df["Recommended Articles"] = [recommend_articles(x) for x in uni_sim]
df.head()
As you can see from the output above, a new column has been added to the dataset that contains the titles of all the recommended articles. Now let’s see all the recommendations for an article:
# let's check the recommended articles for a particular article or index
print(df["Recommended Articles"][22])
BIRCH Clustering in Machine Learning, Clustering Algorithms in Machine Learning, DBSCAN Clustering in Machine Learning, K-Means Clustering in Machine Learning
Index 22 contains an article on agglomerative clustering, and all the recommended articles are also based on the concepts of clustering, so we can say that this recommender system can give great results in real time.
Step 5: Save Model
Goal:- In this step we are going to save our model in pickle file format.
import pickle
# save the dataframe (with its recommendations) and the similarity matrix
pickle.dump(df, open('article_recommender_model.pkl', 'wb'))
pickle.dump(uni_sim, open('Cosine_articles.pkl', 'wb'))
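To reuse these files later (for example, in a web app), a minimal sketch of loading them back, assuming the same file names as above:
# load the saved dataframe and similarity matrix back from disk
loaded_df = pickle.load(open('article_recommender_model.pkl', 'rb'))
loaded_sim = pickle.load(open('Cosine_articles.pkl', 'rb'))
print(loaded_df["Recommended Articles"][22])  # same recommendations as before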
Conclusion
After observing the problem statement, we have built an efficient model to overcome it. The above model helps in recommending articles to readers.
Check out the whole project code here (GitHub repo).
🚀 Unlock Your Dream Job with HiDevs Community!
🔍 Seeking the perfect job? HiDevs Community is your gateway to career success in the tech industry. Explore free expert courses, job-seeking support, and career transformation tips.
💼 We offer an upskill program in Gen AI, Data Science, Machine Learning, and assist startups in adopting Gen AI at minimal development costs.
🆓 Best of all, everything we offer is completely free! We are dedicated to helping society.
Book a free 1:1 mentorship session on any topic of your choice on topmate.
✨ We dedicate over 30 minutes to each applicant’s resume, LinkedIn profile, mock interview, and upskill program. If you’d like our guidance, check out our services here
💡 Join us now, and turbocharge your career!
Deepak Chawla LinkedIn
Vijendra Singh LinkedIn
Yajendra Prajapati LinkedIn
YouTube Channel
Instagram Page
HiDevs LinkedIn