Diabetes is a chronic condition in which the body either does not produce enough insulin or cannot use it effectively. Insulin is the hormone that helps cells absorb glucose from the blood. Diabetes affects many people worldwide and is usually divided into Type 1 and Type 2, which have different characteristics. This article analyzes the PIMA Indian Diabetes dataset and builds a model to predict whether a given observation is at risk of developing diabetes, based on the independent factors. It covers the methods used to arrive at a suitable model, including exploratory data analysis (EDA) and model fitting.
Dataset
The dataset can be found on the Kaggle website. It originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases and can be used to predict whether a patient has diabetes based on certain diagnostic factors. I use Python 3.3 to implement the model. Before modelling, it is important to perform some basic analysis to get an overall idea of the dataset.
#Importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
#Importing the Dataset
diabetes = pd.read_csv("diabetes.csv")
dataset = diabetes
#EDA using Pandas Profiling
file = ProfileReport(dataset)
file.to_file(output_file='output.html')
Pandas Profiling is an efficient way to get both an overview of and in-depth information about the dataset and its variables. However, caution must be exercised with very large datasets, as Pandas Profiling is time-consuming. Since this dataset has only 768 observations and 9 columns, we can use it here. The output is saved as an HTML report in the working directory.
Overview of the Dataset
We can see basic information about the dataset, such as its size and the number of missing values. On the top right, we see 8 numerical columns and 1 Boolean column (our dependent variable). The lower panel gives the percentage of zeros in every column, which will be useful later. None of the independent variables is categorical.
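The share of zeros per column reported by the profiling tool can also be computed directly with pandas. The following is a minimal sketch that uses a small made-up frame in place of `diabetes.csv`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 0, 66],
    "Outcome": [1, 0, 1, 0],
})

# Percentage of zero entries in each independent variable.
features = toy.drop("Outcome", axis=1)
zero_pct = (features == 0).mean() * 100
print(zero_pct)
```

Applied to the article's `dataset`, the same two lines would give the zero percentages for all eight independent variables.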
Exploratory Data Analysis (EDA)
Having observed the basic characteristics of the dataset, we now move on to observe characteristics of the variables involved in the study. Again, Pandas Profiling comes to our rescue. The same HTML report gives information on the variables.
#Classifying the Blood Pressure based on class
ax = sns.violinplot(x="Outcome", y="BloodPressure", data=dataset, palette="muted", split=True)
#Replacing the zero-values for Blood Pressure
df1 = dataset.loc[dataset['Outcome'] == 1]
df2 = dataset.loc[dataset['Outcome'] == 0]
df1 = df1.replace({'BloodPressure':0}, np.median(df1['BloodPressure']))
df2 = df2.replace({'BloodPressure':0}, np.median(df2['BloodPressure']))
dataframe = [df1, df2]
dataset = pd.concat(dataframe)
After this step, the BloodPressure column no longer contains any zero values. Let's move to the next variable.
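The same per-class replacement can be written more compactly with `groupby`. The sketch below is a variant, not the code used above: it first marks zeros as missing so that they do not pull the class median down, then fills each gap with the median of its own class. The toy values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BloodPressure": [70, 0, 80, 60, 0, 64],
    "Outcome":       [1, 1, 1, 0, 0, 0],
})

# Treat zeros as missing so they are excluded from the median.
bp = toy["BloodPressure"].replace(0, np.nan)

# Median of the non-missing values within each Outcome class.
class_median = bp.groupby(toy["Outcome"]).transform("median")

# Fill each missing value with the median of its own class.
toy["BloodPressure"] = bp.fillna(class_median)
print(toy)
```

Whether zeros should count towards the median is a modelling choice; excluding them assumes that a blood pressure of zero is a recording gap rather than a real measurement.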
from scipy.stats import pearsonr
corr, _ = pearsonr(dataset['Age'], dataset['Pregnancies'])
print('Pearsons correlation: %.3f' % corr)
Pearsons correlation: 0.544
The correlation coefficient (r) between Age and Pregnancies is 0.544. As a rule of thumb, multicollinearity is a concern when r exceeds 0.70. Since 0.544 is below that threshold, no significant multicollinearity is observed between these two variables.
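The 0.70 rule of thumb can be applied to every pair of variables at once rather than one pair at a time. Here is a minimal sketch on synthetic data, where the column names `a`, `b` and `c` are hypothetical and `b` is deliberately built as a noisy copy of `a`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                  # independent of "a"
})

threshold = 0.70  # rule-of-thumb cut-off quoted in the text
corr = toy.corr().abs()

# Keep each pair once (upper triangle, diagonal excluded).
pairs = [
    (corr.index[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr))
    for j in range(i + 1, len(corr))
    if corr.iloc[i, j] > threshold
]
print(pairs)
```

Only the pair built to be collinear is flagged. On the article's data, `dataset.drop('Outcome', axis=1)` would take the place of `toy`.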
Treating Outliers and Non-Normality
Outliers are extreme values in the dataset. They matter most for algorithms that are sensitive to the scale and spread of the features, such as logistic regression and SVM; tree-based algorithms are much less affected by them. Since we will use both kinds of algorithm, we scale the data with Standard Scaler. Standard Scaler transforms each feature by subtracting its mean and dividing by its standard deviation, so that every feature ends up with mean 0 and standard deviation 1. Note that this puts the features on a common scale but does not remove outliers or change the shape of a feature's distribution.
#Splitting the data into dependent and independent variables
Y = dataset.Outcome
x = dataset.drop('Outcome', axis = 1)
columns = x.columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(x)
data_x = pd.DataFrame(X, columns = columns)
We have scaled our X values.
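To make the transformation concrete, the sketch below standardizes a tiny made-up array by hand, using z = (x - mean) / std per column, and checks that the result matches `StandardScaler`. The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, which StandardScaler uses).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

The value 60 in the second column is still the farthest from the mean after scaling: standardization changes the units of a feature, not the relative position of its extreme values.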
Splitting the dataset into Training and Test data
We now split the processed dataset into training and test data. The test set is taken to be 15% of the data (about 115 of the 768 observations), and the model is trained on the remaining observations (about 653).
#Splitting the data into training and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_x, Y, test_size = 0.15, random_state = 45)
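The sizes produced by a 15% split can be checked directly. The sketch below does so on placeholder data with 768 rows and also passes `stratify`, an option the article's code does not use, which keeps the class proportions similar in both parts. The labels are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 768
X = np.arange(n).reshape(-1, 1)
# Made-up labels with roughly one third positives, for illustration only.
y = np.array([1 if i % 3 == 0 else 0 for i in range(n)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, random_state=45, stratify=y
)

# Number of rows in each part, then the share of positives in each part.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Because 15% of 768 is not a whole number, the library has to round, so the printed sizes may differ by one from a hand calculation.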
Plot showing frequency of 0 and 1 in Y train
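#Oversampling the minority class in the training data with SMOTE
from imblearn.over_sampling import SMOTE
smt = SMOTE()
#fit_resample replaces fit_sample, which newer imblearn releases no longer provide
x_train, y_train = smt.fit_resample(x_train, y_train)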
np.bincount(y_train)
Out[74]: array([430, 430])
The training data is now balanced, with 430 observations in each class.
Our data is now ready for model fitting.
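SMOTE creates new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step with NumPy on made-up points. It is a simplified illustration of the idea, not the `imblearn` implementation, and the helper `synth_point` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [6.0, 6.0]])

def synth_point(points, i, k, rng):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new = synth_point(minority, i=0, k=2, rng=rng)
print(new)
```

The new point always lies on the line segment between two real minority samples, which is why oversampling is applied to the training data only: synthetic points in the test set would make the evaluation unreliable.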
Model Fitting: Logistic Regression
The first model we fit on the training data is the Logistic Regression.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))
Output: Accuracy of logistic regression classifier on test set: 0.73
We get a 73% accuracy score on the test data.
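from sklearn.metrics import f1_score, precision_score, recall_score
#Macro-averaged F1, precision and recall on the test set
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))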
0.723703419131771
0.7220530003045994
0.7263975155279503
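The `average="macro"` option computes each metric separately for class 0 and class 1 and then takes the unweighted mean, so both classes count equally regardless of their size. The sketch below works this out by hand on made-up labels; the helper `per_class` is hypothetical.

```python
# Made-up labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat  = [0, 0, 1, 0, 1, 0, 1, 0]

def per_class(y_true, y_hat, label):
    """Return (precision, recall) treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_hat))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_hat))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_hat))
    return tp / (tp + fp), tp / (tp + fn)

p0, r0 = per_class(y_true, y_hat, 0)
p1, r1 = per_class(y_true, y_hat, 1)

# Macro average: plain mean over the two classes.
macro_precision = (p0 + p1) / 2
macro_recall = (r0 + r1) / 2
print(macro_precision, macro_recall)
```

Here class 0 has precision and recall of 4/5 and class 1 has 2/3, so both macro averages come out a little above 0.73.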
Model Fitting: Support Vector Machine (Kernel: rbf)
The second model we fit on the training data is the Support Vector Machine (SVM). An SVM can use different kernels to classify the data. We start with the rbf (Gaussian) kernel.
from sklearn.svm import SVC
classifier_rbf = SVC(kernel = 'rbf')
classifier_rbf.fit(x_train, y_train)
y_pred = classifier_rbf.predict(x_test)
print('Accuracy of SVC (RBF) classifier on test set: {:.2f}'.format(classifier_rbf.score(x_test, y_test)))
Out[76]: Accuracy of SVC (RBF) classifier on test set: 0.75
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
0.7431080565101182
0.7410256410256411
0.7481366459627329
Accuracy improves with the rbf-kernel SVM: the model accuracy comes to 75%.
Model Fitting: Random Forest
We use a Random Forest Classifier with 300 trees (a value arrived at after tuning the model) to fit a model on the data.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=300, bootstrap = True, max_features = 'sqrt')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print('Accuracy of Random Forest on test set: {:.2f}'.format(model.score(x_test, y_test)))
Out[95]: Accuracy of Random Forest on test set: 0.88
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
0.8729264475743349
0.8762626262626263
0.8701863354037267
We get the highest accuracy for Random Forest, with the score reaching 88%.
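The tuning step that led to 300 trees is not shown in the article. One common way to run such a search is `GridSearchCV`; the sketch below does so on synthetic data from `make_classification`, with a made-up grid, purely as an illustration of the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the diabetes dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid, kept small so the search runs quickly.
param_grid = {"n_estimators": [50, 100], "max_features": ["sqrt", None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

The search scores each combination by cross-validation on the training data, so the test set stays untouched until the final evaluation.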
We thus select the Random Forest Classifier as the final model because of its high accuracy, precision and recall. One likely reason for its better performance is the presence of outliers. As mentioned before, Random Forest is not a distance-based algorithm and is therefore not much affected by outliers, whereas Logistic Regression and the Support Vector Machine showed lower performance.
Based on the feature importance:
Glucose is the most important factor in determining the onset of diabetes followed by BMI and Age.
Other factors such as Diabetes Pedigree Function, Pregnancies, Blood Pressure, Skin Thickness and Insulin also contribute to the prediction.
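A ranking like the one above can be read from a fitted forest's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data with hypothetical column names `f0` to `f4`; it does not reproduce the article's numbers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names f0..f4.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    random_state=0,
)
cols = [f"f{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=cols)
print(importances.sort_values(ascending=False))
```

With the objects defined earlier in the article, the equivalent call would be `pd.Series(model.feature_importances_, index=columns)`.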
As we can see, the results from feature importance make sense: glucose level is one of the first things monitored in high-risk patients. An increased BMI might also indicate a risk of developing Type II Diabetes. In addition, especially for Type II Diabetes, the risk of developing the condition rises with age (other factors being equal).
We now come to the end of the project. I did not go into the techniques in depth. However, there are some really good articles that helped me along the way.