Tuesday, January 16, 2024

Heart Disease Prediction – Application of Scikit-Learn and Machine Learning

Cardiovascular disease (CVD) has been the leading cause of death for the past several decades, not just in India but throughout the globe. The heart is vital to human life, and even a small malfunction can lead to fatigue or death. Predicting heart disease is therefore one of the most significant challenges facing the medical sector today: it involves numerous, intimidating factors, and early diagnosis demands a prediction model that is reliable, trustworthy, and reasonable so that treatment can begin quickly. Machine Learning and Deep Learning have been shown to support such decision-making and prediction using the vast amounts of data gathered from healthcare organizations, and these methods are increasingly used to automate the study of large and complicated medical datasets. Recently, researchers have employed various machine learning techniques to help healthcare professionals identify heart-related illness.

Before implementing the machine learning models, we need authenticated medical records, such as tabular data, to predict heart-related illnesses. Let's look at the data at hand, which is taken from the open-source platform known as Kaggle; this is the link: Heart Disease Dataset (kaggle.com). The data contains the following attributes.

Dataset Attributes
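Briefly, the 14 columns, paraphrasing the Kaggle dataset documentation, are:

- age: age in years
- sex: 1 = male, 0 = female
- cp: chest pain type
- trestbps: resting blood pressure (mm Hg)
- chol: serum cholesterol (mg/dl)
- fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: slope of the peak exercise ST segment
- ca: number of major vessels colored by fluoroscopy
- thal: thalassemia test result
- target: 1 = heart disease present, 0 = absent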

Exploratory Data Analysis

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an essential stage in the data analysis process. It entails analyzing and graphically representing data sets to understand their primary attributes, reveal patterns, spot anomalies, and ascertain connections between variables. The main objective of EDA is to gain a deep understanding of the data's underlying structure, formulate hypotheses, and guide the later stages of analysis or modeling.

Let's open the CSV (Comma-Separated Values) file containing the Heart Disease Detection dataset. Throughout this post, the code examples are written in Python.

1. Importing the Libraries Needed for Exploratory Data Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

2. Importing the dataset

df = pd.read_csv("/kaggle/input/heart-disease-dataset/heart.csv")
df.head() # viewing the first five rows of the dataset


Let's look at the shape of the dataset.
df.shape
From the shape we can observe that the dataset contains a total of 1025 rows and 14 columns (attributes). Let's look at the data types of the different attributes.
df.info()

Every attribute of the dataset holds integer values except `oldpeak`, which is of float type. Let's look at the summary statistics of the dataset, such as the mean, minimum, standard deviation, and more.
df.describe().T

3. Exploratory Data Analysis (EDA)

Let's look at the gender distribution of the dataset, where 1 = male and 0 = female.
df['sex'].value_counts()

The dataset contains a total of 713 male records and 312 female records. Let's create a bar chart for the same.
df['sex'].value_counts().plot(kind='bar', color=['orange', 'green'])

Let's look at the target values in the dataset, where 1 = positive (heart disease present) and 0 = negative.
df['target'].value_counts()

The dataset contains a total of 526 positive records and 499 negative records. Let's create a bar chart for the same.
df['target'].value_counts().plot(kind='bar', color=['red', 'green'])

Let's look at how many males and females in the records have heart disease.
pd.crosstab(df.target, df.sex)

fig = sns.countplot(x = 'target', data = df, hue = 'sex')
fig.set_xticklabels(labels=["Doesn't have heart disease", 'Has heart disease'], rotation=0)
plt.legend(['Female', 'Male'])
plt.title("Heart Disease Frequency for Sex");

4. Checking the Correlation Within Dataset Attributes

x = df.corr()
plt.figure(figsize = (15,8))
sns.heatmap(x,annot = True)
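Alongside the heatmap, the correlations with the target can also be listed numerically; a small optional check (not in the original notebook):

# Correlation of each attribute with the target, sorted weakest to strongest
print(x['target'].drop('target').sort_values())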

From the heatmap we can observe that no single attribute shows a dominant correlation with the target, so we will keep all the features for model building. Next, let's look for outliers among the dataset's attributes.
df.describe().T

The columns `trestbps`, `chol`, `thalach`, and `thal` show some outliers in the dataset.
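To visualize these outliers before removing them, here is a small sketch (an addition to the notebook; it assumes the same `df` and the libraries imported above):

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes, ['trestbps', 'chol', 'thalach', 'thal']):
    ax.boxplot(df[col])  # one boxplot per flagged column
    ax.set_title(col)
plt.show()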

5. Removing the Outliers from the Dataset

q1 = df['trestbps'].quantile(q = 0.25)
q3 = df['trestbps'].quantile(q = 0.75)
IQR = q3 - q1
IQR_lower_limit = int(q1 - (1.5*IQR))
IQR_upper_limit = int(q3 + (1.5*IQR))
print("Upper limit of IQR:", IQR_upper_limit)
print("Lower limit of IQR:", IQR_lower_limit)
# Keep rows below the upper limit; no trestbps values fall below the lower limit
cleaned_data = df[df["trestbps"] < IQR_upper_limit]

plt.boxplot(cleaned_data["trestbps"])
The boxplot now shows no remaining outliers in `trestbps`.
cleaned_data.shape

After removing the outliers, only 980 records remain in the dataset.
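Note that the snippet above filters `trestbps` only, and only against the upper limit. If you want to extend the same 1.5×IQR rule to the other continuous columns flagged earlier, here is a hedged sketch (`thal` is left out because it is categorical; `iqr_filter` and `cleaned_all` are hypothetical names, not from the notebook):

def iqr_filter(data, col):
    # Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for this column
    q1, q3 = data[col].quantile(0.25), data[col].quantile(0.75)
    iqr = q3 - q1
    return data[(data[col] >= q1 - 1.5 * iqr) & (data[col] <= q3 + 1.5 * iqr)]

cleaned_all = df.copy()
for col in ['trestbps', 'chol', 'thalach']:
    cleaned_all = iqr_filter(cleaned_all, col)
print(cleaned_all.shape)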

In this post we will employ different machine learning methods provided by Scikit-Learn to build models that may help predict cardiovascular disease in a timely manner, assisting the healthcare system in treating patients more effectively and promptly. For this scenario we will implement K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF) models.

6. Building Model and Model Evaluation

6.1. One-Hot Encoding

cat_values = []
conti_values = []
for col in df.columns:
    # Heuristic: treat columns with 10 or more unique values as continuous,
    # everything else as categorical
    if len(df[col].unique()) >= 10:
        conti_values.append(col)
    else:
        cat_values.append(col)
cat_values.remove('target')  # the target stays as a single label column
cleaned_data = pd.get_dummies(cleaned_data, columns=cat_values)
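As a quick sanity check (not in the original notebook), the encoding should have widened the frame, since every categorical column has been expanded into indicator columns:

print(cleaned_data.shape)  # expect more than the original 14 columns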

6.2. Training and Testing Data Split

X = cleaned_data.drop(columns = 'target')
y = cleaned_data['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
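One refinement the notebook skips: KNN is distance-based, so unscaled features (for example, `chol` in the hundreds versus binary dummy columns) can dominate the distance metric. Here is a minimal sketch using `StandardScaler` (an assumption on my part; the models below are trained on the unscaled split):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the fitted scaler to avoid leakage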

6.3. Importing Sklearn Libraries for Building Model and Model Evaluation

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.metrics import accuracy_score
from math import sqrt
from sklearn.metrics import cohen_kappa_score

6.4. Logistic Regression

log_clf = LogisticRegression(max_iter=1000, penalty='l2', solver='liblinear', C=0.05, random_state = 4)
log_clf.fit(X_train, y_train)
log_score = metrics.f1_score(y_test, log_clf.predict(X_test))
log_score

Model Evaluation for Logistic Regression
# Making predictions on the test set
y_preds = log_clf.predict(X_test)
print(classification_report(y_test, y_preds))
# Root mean squared error on the training predictions
train_preds = log_clf.predict(X_train)
mse = mean_squared_error(y_train, train_preds)
lrmse = sqrt(mse)
print("Training RMSE:", lrmse)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds)
print("False Positive Rate:", fpr)
print("True Positive Rate:", tpr)
print("Thresholds:", thresholds)
print("Cohen's Kappa Score:", cohen_kappa_score(y_test, y_preds))

6.5. K-Nearest Neighbors

knn_clf = KNeighborsClassifier(algorithm='auto', leaf_size=45, n_neighbors=2, p=1)  # p=1 uses the Manhattan distance
knn_clf.fit(X_train, y_train)
knn_score = metrics.f1_score(y_test, knn_clf.predict(X_test))
knn_score

Model Evaluation for K-Nearest Neighbors
# Making predictions on the test set
y_preds = knn_clf.predict(X_test)
print(classification_report(y_test, y_preds))
# Root mean squared error on the training predictions
train_preds = knn_clf.predict(X_train)
mse = mean_squared_error(y_train, train_preds)
knnrmse = sqrt(mse)
print("Training RMSE:", knnrmse)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds)
print("False Positive Rate:", fpr)
print("True Positive Rate:", tpr)
print("Thresholds:", thresholds)
print("Cohen's Kappa Score:", cohen_kappa_score(y_test, y_preds))

6.6. Random Forest

rand_clf = RandomForestClassifier(n_estimators=100, random_state = 35)
rand_clf.fit(X_train, y_train)
ranf_score = metrics.f1_score(y_test, rand_clf.predict(X_test))
ranf_score

Model Evaluation for Random Forest
# Making predictions on the test set
y_preds = rand_clf.predict(X_test)
print(classification_report(y_test, y_preds))
# Root mean squared error on the training predictions
train_preds = rand_clf.predict(X_train)
mse = mean_squared_error(y_train, train_preds)
rfrmse = sqrt(mse)
print("Training RMSE:", rfrmse)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds)
print("False Positive Rate:", fpr)
print("True Positive Rate:", tpr)
print("Thresholds:", thresholds)
print("Cohen's Kappa Score:", cohen_kappa_score(y_test, y_preds))

6.7. Model F1 Scores

# Collect each model's F1 score in a list of records
score = [{'Model': 'Logistic Regression', 'Score': log_score},
         {'Model': 'KNN', 'Score': knn_score},
         {'Model': 'Random Forest', 'Score': ranf_score}]
pd.DataFrame(score, columns=['Model', 'Score'])
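The scores above are F1 scores (computed with `metrics.f1_score`). Since `accuracy_score` was imported earlier but never used, plain test accuracy can be printed alongside for comparison; a minimal sketch:

# Hypothetical addition: test-set accuracy for each fitted model
for name, clf in [('Logistic Regression', log_clf),
                  ('KNN', knn_clf),
                  ('Random Forest', rand_clf)]:
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))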

The results demonstrate that the Random Forest model achieves the highest F1 score, 98.43%.

The complete Python notebook is available at Heart Disease Prediction KNN, LR and RF | Kaggle.

This post is a mere glimpse of the work done in the research paper Heart Disease Prediction Using Machine Learning Techniques | IEEE Conference Publication | IEEE Xplore. I hope you liked the content, and best of luck in your future endeavors!
