PIMA Indian Diabetes Prediction using machine learning


Diabetes is a chronic condition in which the body either does not produce enough insulin or becomes resistant to it. Insulin is the hormone that lets cells absorb glucose from the blood. Diabetes affects many people worldwide and is normally divided into Type 1 and Type 2, which have different characteristics. This article analyzes the PIMA Indian Diabetes dataset and builds a model to predict whether a given observation is at risk of developing diabetes, given the independent factors. It covers the steps followed to arrive at a suitable model, including exploratory data analysis (EDA) and model fitting.

Dataset

The dataset can be found on the Kaggle website. It originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases and can be used to predict whether a patient has diabetes based on certain diagnostic factors. I use Python 3.3 to implement the model. Before modelling, it is important to perform some basic analysis to get an overall idea of the dataset.

The code for this project follows.

#Importing basic packages

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from pandas_profiling import ProfileReport

#Importing the Dataset

diabetes = pd.read_csv("diabetes.csv")

dataset = diabetes

#EDA using Pandas Profiling

file = ProfileReport(dataset)

file.to_file(output_file='output.html')

Pandas Profiling is an efficient way to get both an overall and an in-depth view of the dataset and the variables in it. However, caution must be exercised if the dataset is very large, as Pandas Profiling is time-consuming. Since the dataset has only 768 observations and 9 columns, we can use it here. The output is saved as an HTML report in the working directory.

Overview of the Dataset

We can see basic information about the dataset such as its size, missing values, etc. On the top right, we see 8 numerical columns and 1 Boolean column (which is our dependent variable). In the lower panel, the percentage of zeros is given for every column, which will be useful later. None of the independent variables is categorical.
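The share of zeros per column can also be computed directly with pandas, without generating the full report. The sketch below uses a tiny stand-in frame with made-up values (an assumption for illustration); on the real data the same expression would be applied to `dataset`.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; the real data would come
# from pd.read_csv("diabetes.csv").
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 0, 66],
    "Outcome": [1, 0, 1, 0],
})

# Share of zero entries in each column, as a percentage.
zero_pct = (sample == 0).mean() * 100
print(zero_pct.round(1))
```

For the dependent variable a zero is a valid class label, so only the zeros in the measurement columns point to missing readings.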

Exploratory Data Analysis (EDA)

Having observed the basic characteristics of the dataset, we now move on to observe characteristics of the variables involved in the study. Again, Pandas Profiling comes to our rescue. The same HTML report gives information on the variables.

#Classifying the Blood Pressure based on class

ax = sns.violinplot(x="Outcome", y="BloodPressure", data=dataset, palette="muted", split=True)

#Replacing the zero-values for Blood Pressure

df1 = dataset.loc[dataset['Outcome'] == 1]

df2 = dataset.loc[dataset['Outcome'] == 0]

df1 = df1.replace({'BloodPressure':0}, np.median(df1['BloodPressure']))

df2 = df2.replace({'BloodPressure':0}, np.median(df2['BloodPressure']))

dataframe = [df1, df2]

dataset = pd.concat(dataframe)

After this step, the BloodPressure column no longer contains any zero values. Next, we check how strongly Age and Pregnancies are correlated.
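The same per-class replacement can be written without splitting and re-joining the frame. The sketch below is a variant rather than the code used above: it treats zeros as missing before taking the median, so the zeros do not pull the median down, and it keeps the original row order. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up readings: one zero (missing) value in each Outcome class.
sample = pd.DataFrame({
    "BloodPressure": [70, 0, 80, 60, 0, 64],
    "Outcome":       [1, 1, 1, 0, 0, 0],
})

# Treat zeros as missing so they are excluded from the median.
bp = sample["BloodPressure"].replace(0, np.nan)

# Median of the non-missing readings within each Outcome class,
# broadcast back to the shape of the column.
class_median = bp.groupby(sample["Outcome"]).transform("median")

# Fill each missing reading with the median of its own class.
sample["BloodPressure"] = bp.fillna(class_median)
print(sample)
```

Either way, the replacement uses the Outcome label, so it can only be applied to rows whose outcome is already known.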

from scipy.stats import pearsonr

corr, _ = pearsonr(dataset['Age'], dataset['Pregnancies'])

print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.544

The correlation coefficient (r) between Age and Pregnancies is 0.544. As a rule of thumb, multi-collinearity is expected when r exceeds 0.70. Since 0.544 is below that threshold, no significant multi-collinearity is observed.
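The same rule of thumb can be applied to every pair of features at once. The sketch below runs on synthetic columns (`a`, `b` and `c` are hypothetical names, not columns of the diabetes data) and lists the pairs whose absolute correlation exceeds 0.70.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                  # independent of both
})

# Absolute Pearson correlation between every pair of columns.
corr = features.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the 0.70 rule-of-thumb threshold.
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.70]
print(high_pairs)
```

On the real data, `features` would be the frame of independent variables, and an empty result would support the conclusion that no pair is strongly collinear.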

Treating Outliers and Non-Normality

Outliers are extreme values in the dataset. They need attention when a distance-based algorithm (logistic regression, SVM, etc.) is applied, whereas tree-based algorithms are largely unaffected by them. Since we will use both kinds of algorithm, we scale our data with Standard Scaler, which transforms each feature by subtracting its mean and dividing by its standard deviation. After scaling, every feature has mean 0 and standard deviation 1, so no feature dominates purely because of its units. Note that scaling changes only the location and spread of a feature, not the shape of its distribution, so extreme values remain extreme relative to the rest.

#Splitting the data into dependent and independent variables

Y = dataset.Outcome

x = dataset.drop('Outcome', axis = 1)

columns = x.columns

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(x)

data_x = pd.DataFrame(X, columns = columns)

We have scaled our X values.
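What the scaler does to a single feature can be checked by hand. The sketch below standardizes a small made-up array with NumPy, using the population standard deviation (the convention StandardScaler follows), and confirms that the result has mean 0 and standard deviation 1 while the largest value stays the largest.

```python
import numpy as np

# Made-up feature with one large value on the right.
values = np.array([1.0, 2.0, 2.0, 3.0, 12.0])

# Standardize by hand: subtract the mean, divide by the population
# standard deviation (NumPy's default, ddof=0).
z = (values - values.mean()) / values.std()

print(round(float(z.mean()), 6), round(float(z.std()), 6))

# The ordering of the observations is unchanged by scaling.
print(int(z.argmax()) == int(values.argmax()))
```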

Splitting the dataset into Training and Test data

We now split our processed dataset into Training and Test data. The Test data size is taken to be 15% of the entire data. With 768 observations this gives 116 Test observations, since scikit-learn rounds 115.2 up, and the model is trained on the remaining 652.

#Splitting the data into training and test

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data_x, Y, test_size = 0.15, random_state = 45)
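Because 15% of 768 is not a whole number, the exact row counts depend on rounding. The short calculation below assumes that scikit-learn rounds a fractional test count up and gives the rest to training, which is how its current implementation behaves; checking `x_test.shape` after the split is the reliable way to confirm it.

```python
import math

n_samples = 768
test_size = 0.15

# Fractional test sizes are rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)

# The training set receives everything that is left.
n_train = n_samples - n_test

print(n_test, n_train)  # 116 652
```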

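[Figure: frequency of 0 and 1 in Y train]

The two classes are not equally frequent in the training data, so we oversample the minority class with SMOTE.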

from imblearn.over_sampling import SMOTE

smt = SMOTE()

x_train, y_train = smt.fit_resample(x_train, y_train)

np.bincount(y_train)

Out[74]: array([430, 430])

The training data is now balanced, with 430 observations in each class, and ready for model fitting.
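SMOTE balances the classes by creating new minority-class points rather than duplicating existing ones. The sketch below illustrates the core idea only and is not imbalanced-learn's implementation: each synthetic point lies on the segment between a minority point and one of its nearest minority neighbours. The function name `synthesize`, the sample points and the choice of `k` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in two scaled features.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def synthesize(points, n_new, k=2):
    """Create n_new points, each between a point and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        dists = np.linalg.norm(points - points[i], axis=1)
        # Position 0 of the sorted order is the point itself, so skip it.
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=3)
print(synthetic)
```

Since every synthetic point is an interpolation, it stays inside the region already covered by the minority class.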

Model Fitting: Logistic Regression

The first model we fit on the training data is the Logistic Regression.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(x_train, y_train)

y_pred = logreg.predict(x_test)

print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

Output: Accuracy of logistic regression classifier on test set: 0.73

We get a 73% accuracy score on the test data.

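from sklearn.metrics import f1_score, precision_score, recall_score

#Macro-averaged F1, precision and recall on the test set

print(f1_score(y_test, y_pred, average="macro"))

print(precision_score(y_test, y_pred, average="macro"))

print(recall_score(y_test, y_pred, average="macro"))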

0.723703419131771

0.7220530003045994

0.7263975155279503
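The three numbers above are macro averages: each metric is computed separately for class 0 and class 1 and the two values are averaged with equal weight. The sketch below shows the arithmetic from the four cells of a confusion matrix. The counts are hypothetical: they were chosen because they reproduce the logistic regression scores above, but the article does not report its confusion matrix, and `macro_scores` is an illustrative helper, not a library function.

```python
def macro_scores(tn, fp, fn, tp):
    """Return (precision, recall, f1), each averaged over the two classes with equal weight."""
    # Class 1 (positive) precision and recall.
    p1, r1 = tp / (tp + fp), tp / (tp + fn)
    # Class 0 (negative): the roles of the counts are mirrored.
    p0, r0 = tn / (tn + fn), tn / (tn + fp)
    # Per-class F1 is the harmonic mean of that class's precision and recall.
    f1_pos = 2 * p1 * r1 / (p1 + r1)
    f1_neg = 2 * p0 * r0 / (p0 + r0)
    return (p0 + p1) / 2, (r0 + r1) / 2, (f1_neg + f1_pos) / 2

# Hypothetical counts consistent with the scores printed above.
precision, recall, f1 = macro_scores(tn=53, fp=17, fn=14, tp=32)
accuracy = (53 + 32) / (53 + 17 + 14 + 32)

print(round(accuracy, 2), round(precision, 4), round(recall, 4), round(f1, 4))
```

Unlike accuracy, the macro averages give the smaller positive class the same weight as the larger negative class.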

Model Fitting: Support Vector Machine (Kernel: rbf)

The next model we fit on the training data is the Support Vector Machine (SVM). SVM can use several kernels to classify the data; we start with the rbf (Gaussian) kernel.

from sklearn.svm import SVC

classifier_rbf = SVC(kernel = 'rbf')

classifier_rbf.fit(x_train, y_train)

y_pred = classifier_rbf.predict(x_test)

print('Accuracy of SVC (RBF) classifier on test set: {:.2f}'.format(classifier_rbf.score(x_test, y_test)))

Out[76]: Accuracy of SVC (RBF) classifier on test set: 0.75

print(f1_score(y_test, y_pred, average="macro"))

print(precision_score(y_test, y_pred, average="macro"))

print(recall_score(y_test, y_pred, average="macro"))

0.7431080565101182

0.7410256410256411

0.7481366459627329

Accuracy improves with the rbf-kernel SVM: the model reaches 75% on the test set.
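The rbf (Gaussian) kernel scores the similarity of two observations as exp(-gamma * squared distance), so identical points score 1 and distant points score close to 0. The sketch below is a hand-written illustration of that formula; the points and the `gamma` value are made up, and `rbf_kernel` here is a local helper rather than the routine SVC uses internally.

```python
import math

import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) similarity: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)  # identical points
near = rbf_kernel([1.0, 2.0], [1.5, 2.0], gamma=0.5)  # distance 0.5
far = rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5)   # distance 5

print(same, near, far)
```

Because the kernel depends on distances between observations, features on very different scales would distort it, which is why the data is scaled before the SVM is fitted.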

Model Fitting: Random Forest

We use a Random Forest Classifier with 300 trees (a value arrived at after tuning the model) to fit a model on the data.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=300, bootstrap = True, max_features = 'sqrt')

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print('Accuracy of Random Forest on test set: {:.2f}'.format(model.score(x_test, y_test)))

Out[95]: Accuracy of Random Forest on test set: 0.88

print(f1_score(y_test, y_pred, average="macro"))

print(precision_score(y_test, y_pred, average="macro"))

print(recall_score(y_test, y_pred, average="macro"))

0.8729264475743349

0.8762626262626263

0.8701863354037267

We get the highest accuracy for Random Forest, with the score reaching 88%.
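The article does not show how the number of trees was tuned. One common approach is a cross-validated grid search over `n_estimators`, sketched below on synthetic data. The grid values, the fold count and the data are assumptions chosen to keep the example fast, not the settings used for the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Try a few forest sizes and score each with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50, 100]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Tuning on cross-validation folds of the training data keeps the test set untouched for the final accuracy estimate.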

Conclusion

We thus select the Random Forest Classifier as the right model because of its high accuracy, precision and recall scores. One likely reason for its improved performance is the presence of outliers. As mentioned before, Random Forest is not a distance-based algorithm and so is not much affected by outliers, whereas distance-based algorithms such as Logistic Regression and Support Vector Machine showed lower performance.

Based on the feature importance:

Glucose is the most important factor in determining the onset of diabetes followed by BMI and Age.

Other factors such as Diabetes Pedigree Function, Pregnancies, Blood Pressure, Skin Thickness and Insulin also contribute to the prediction.

As we can see, the results derived from feature importance make sense, since one of the first things monitored in high-risk patients is the glucose level. An increased BMI might also indicate a risk of developing Type II Diabetes. The risk of developing diabetes, especially Type II, also normally increases with a person's age (given other factors).
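A ranking like the one above can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical column names (`f0` to `f3`); on the real model the index would be the `columns` variable saved before scaling.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; with shuffle=False the first two columns
# are the informative ones and the last two are pure noise.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
names = ["f0", "f1", "f2", "f3"]  # hypothetical column names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1; sort them so the strongest feature comes first.
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

These importances describe how much each feature helped this particular forest split the data, so they indicate predictive usefulness rather than a causal effect.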

We now come to the end of the project. I did not go in depth into the techniques I used; however, there are some really good articles that helped me with them.
