Breast Cancer Detection Using Machine Learning

INTRODUCTION

In this project I will show you how to create your own machine learning program in Python to detect breast cancer from data. Breast cancer (BC) is a common cancer among women around the world, and early detection can greatly improve prognosis and survival chances by getting patients into clinical treatment sooner. It is exciting to think that data, Python, and machine learning could help save lives.

DATASET: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

PROBLEM STATEMENT & DISCUSSION

Breast cancer is one of the leading cancers in many countries, including India. The survival rate is high: with early diagnosis, 97% of women survive for more than 5 years. Even so, the death toll from this disease has increased drastically over the last few decades. The main obstacle to a cure is early recognition. Hence, alongside medical solutions, data science solutions need to be integrated to help reduce deaths from the disease.

PROGRAMMING:

Now import the packages/libraries to make it easier to write the program.

#import libraries 

import numpy as np 

import pandas as pd 

import matplotlib.pyplot as plt 

import seaborn as sns

Next I will load the data, and print the first 7 rows of data.

NOTE: Each row of data represents a patient that may or may not have cancer.

#Load the data 

#from google.colab import files  # Use to load data on Google Colab

#uploaded = files.upload()  # Use to load data on Google Colab

df = pd.read_csv('data.csv') 

df.head(7)

 

A sample of the first 7 rows of data

Explore the data and count the number of rows and columns in the data set. There are 569 rows, which means there are 569 patients in this data set, and 33 columns, which means there are 33 data points for each patient.

#Count the number of rows and columns in the data set

df.shape

 

Number of Rows: 569, Number of Columns: 33

Continue exploring the data and count the empty (NaN, NAN, na) values in each column. None of the columns contain empty values except the column named ‘Unnamed: 32’, which contains 569 empty values. That is the same as the number of rows in the data set, which tells me this column is completely empty and therefore useless.

#Count the empty (NaN, NAN, na) values in each column

df.isna().sum()

 

Count of all the empty values per column/feature

Remove the column ‘Unnamed: 32’ from the original data set since it adds no value.

#Drop the column with all missing values (na, NAN, NaN)

#NOTE: This drops the column Unnamed

df = df.dropna(axis=1)
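
One thing worth knowing about the call above: with axis=1 and the default settings, dropna removes every column that contains at least one missing value, not only columns that are entirely empty. That is fine here because ‘Unnamed: 32’ is the only column with gaps. The sketch below uses a small made-up frame (the column names and values are hypothetical) to show the difference between the default and how='all'.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'partial' has one gap, 'empty' has no values at all.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'partial': [0.5, np.nan, 0.7],
    'empty': [np.nan, np.nan, np.nan],
})

# Default behaviour: drop every column that contains at least one NaN.
any_dropped = toy.dropna(axis=1)

# how='all': drop only the columns in which every value is NaN.
all_dropped = toy.dropna(axis=1, how='all')

print(list(any_dropped.columns))  # ['id']
print(list(all_dropped.columns))  # ['id', 'partial']
```

On a data set where other columns had a few gaps, how='all' would be the safer choice for removing only the empty column.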

Get the new count of the number of rows and columns.

#Get the new count of the number of rows and cols

df.shape

 

Number of Rows: 569, Number of Columns: 32

Get a count of the number of patients with Malignant (M) cancerous and Benign (B) non-cancerous cells.

#Get a count of the number of 'M' & 'B' cells

df['diagnosis'].value_counts()

 

Number of Malignant (M) diagnoses: 212, Number of Benign (B) diagnoses: 357
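
These counts also give a useful reference point for the accuracy scores later on. The short sketch below only does arithmetic on the two counts reported above: it computes the share of each class and the accuracy a trivial model would get by always predicting the majority class. The variable names are my own.

```python
# Class counts reported above for the diagnosis column.
counts = {'M': 212, 'B': 357}
total = sum(counts.values())

# Share of each class in the data set.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class (B).
baseline_accuracy = max(counts.values()) / total

print(total)                        # 569
print(round(shares['M'], 3))        # 0.373
print(round(baseline_accuracy, 3))  # 0.627
```

Any model worth keeping should score well above this baseline of roughly 63%.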

Visualize the counts, by creating a count plot.

#Visualize this count 

sns.countplot(x='diagnosis', data=df)

 

Chart displaying Malignant (cancerous) & Benign(non-cancerous) diagnosis

Look at the data types to see which columns need to be transformed / encoded. I can see from the data types that all of the columns/features are numbers except for the column ‘diagnosis’, which holds categorical data stored with the pandas object dtype.

#Look at the data types 

df.dtypes

 

A list of the columns & their data types

Encode the categorical data. Change the values in the column ‘diagnosis’ from M and B to 1 and 0 respectively, then print the results.

#Encode the categorical data values (M -> 1, B -> 0)

from sklearn.preprocessing import LabelEncoder

labelencoder_Y = LabelEncoder()

df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])

print(df['diagnosis'].values)
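
LabelEncoder assigns integers in the sorted order of the labels, which is why B becomes 0 and M becomes 1. If you prefer to make that mapping explicit, a dictionary with pandas map is one alternative. The sketch below uses a small hypothetical series of labels rather than the real column.

```python
import pandas as pd

# Hypothetical sample of diagnosis labels.
diagnosis = pd.Series(['M', 'B', 'B', 'M'])

# Explicit mapping: malignant -> 1, benign -> 0.
encoded = diagnosis.map({'M': 1, 'B': 0})

print(encoded.tolist())  # [1, 0, 0, 1]
```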

 

Create a pair plot. A pair plot, also known as a scatter plot matrix, plots each numeric column against every other column, with one scatter plot per pair of columns.

sns.pairplot(df, hue="diagnosis")

 

Pair plot of all of the columns highlighting the diagnosis points in Orange (1) & Blue (0)

Print the new data set which now has only 32 columns. Print only the first 5 rows.

df.head(5)

 

5 rows of the new data set

Get the correlation of the columns.

#Get the correlation of the columns

df.corr()

 

Column correlation sample
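
By default, df.corr() computes the Pearson correlation coefficient for each pair of columns. To show what that number means, here is a minimal hand-written version. The function name and the sample values are my own, and the perimeter values are generated from the radius values, so the two are perfectly correlated.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: a radius and the perimeter it implies (2 * pi * r).
radius = [10.0, 12.0, 15.0, 20.0]
perimeter = [2 * 3.14159 * r for r in radius]

print(round(pearson(radius, perimeter), 6))  # 1.0
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.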

Visualize the correlation by creating a heat map.

plt.figure(figsize=(20,20))  

sns.heatmap(df.corr(), annot=True, fmt='.0%')

 

Heat map of correlations

Now I am done exploring and cleaning the data. I will set up my data for the model by first splitting the data set into a feature data set also known as the independent data set (X), and a target data set also known as the dependent data set (Y).

 

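#Features start at column 2. NOTE: the end index 31 is exclusive, so this selects 29 of the 30 feature columns; use df.iloc[:, 2:] to include the last one

X = df.iloc[:, 2:31].values 

#Column 1 is the diagnosis (the target)

Y = df.iloc[:, 1].values 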

Split the data again, but this time into 75% training and 25% testing data sets.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
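
To see what this split produces, the sketch below runs train_test_split on placeholder arrays with the same number of rows and the same class counts as the data set. The stratify argument is my addition and is not used in the call above; it keeps the share of malignant cases about the same in both splits. The array names and contents are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features with the same number of rows as the data set (569).
X_demo = np.arange(569 * 2).reshape(569, 2)
# 357 benign (0) and 212 malignant (1) labels, matching the counts found earlier.
y_demo = np.array([0] * 357 + [1] * 212)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))   # 426 143
print(round(y_te.mean(), 2))  # share of malignant labels in the test split
```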

 

 

#Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
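
The scaling step subtracts the mean and divides by the standard deviation of each feature, using statistics computed from the training data only, and then applies the same statistics to the test data. Here is a minimal sketch of the same calculation done by hand with NumPy on made-up numbers.

```python
import numpy as np

# Hypothetical one-feature training and test values.
train = np.array([[10.0], [12.0], [14.0]])
test = np.array([[12.0], [16.0]])

# The statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # the test data reuses the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```

This is why the code above calls fit_transform on X_train but only transform on X_test: fitting the scaler on the test data would leak information about it into the model.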

Create a function to hold many different models (e.g. Logistic Regression, Decision Tree Classifier, Random Forest Classifier) to make the classification. These are the models that will detect if a patient has cancer or not. Within this function I will also print the accuracy of each model on the training data.

def models(X_train,Y_train):

  

  #Using Logistic Regression 

  from sklearn.linear_model import LogisticRegression

  log = LogisticRegression(random_state = 0)

  log.fit(X_train, Y_train)

  

  #Using KNeighborsClassifier 

  from sklearn.neighbors import KNeighborsClassifier

  knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

  knn.fit(X_train, Y_train)

 

  #Using SVC linear

  from sklearn.svm import SVC

  svc_lin = SVC(kernel = 'linear', random_state = 0)

  svc_lin.fit(X_train, Y_train)

 

  #Using SVC rbf (SVC is already imported above)

  svc_rbf = SVC(kernel = 'rbf', random_state = 0)

  svc_rbf.fit(X_train, Y_train)

 

  #Using GaussianNB 

  from sklearn.naive_bayes import GaussianNB

  gauss = GaussianNB()

  gauss.fit(X_train, Y_train)

 

  #Using DecisionTreeClassifier 

  from sklearn.tree import DecisionTreeClassifier

  tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

  tree.fit(X_train, Y_train)

 

  #Using RandomForestClassifier method of ensemble class to use Random Forest Classification algorithm

  from sklearn.ensemble import RandomForestClassifier

  forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)

  forest.fit(X_train, Y_train)

  

  #print model accuracy on the training data.

  print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))

  print('[1]K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train))

  print('[2]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train))

  print('[3]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train))

  print('[4]Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train))

  print('[5]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))

  print('[6]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))

  

  return log, knn, svc_lin, svc_rbf, gauss, tree, forest

Train all of the models, and look at each model's accuracy score on the training data when classifying whether a patient has cancer or not.

model = models(X_train,Y_train)

 

The accuracy of each model on the training data
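
Training accuracy on its own can be misleading, because a flexible model such as a decision tree can memorise the training data. The sketch below illustrates this under one assumption: it uses scikit-learn's bundled copy of the Wisconsin data (load_breast_cancer) in place of data.csv. Note that the bundled copy encodes the target the other way round (0 = malignant, 1 = benign), and the variable names are my own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the Wisconsin data stands in for data.csv.
# Its target is encoded 0 = malignant, 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same settings as the decision tree in the function above.
tree_demo = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree_demo.fit(X_tr, y_tr)

train_acc = tree_demo.score(X_tr, y_tr)
test_acc = tree_demo.score(X_te, y_te)
print('train:', train_acc, 'test:', test_acc)
```

The gap between the two numbers is the reason the next step evaluates every model on the held-out test data.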

Show the confusion matrix and the accuracy of the models on the test data. The confusion matrix tells us how many patients each model misdiagnosed (patients with cancer who were classified as not having cancer, a.k.a. false negatives, and patients without cancer who were classified as having cancer, a.k.a. false positives) and how many diagnoses were correct (the true positives and true negatives).

False Positive (FP) = A test result which incorrectly indicates that a particular condition or attribute is present.

True Positive (TP) = A test result which correctly indicates that a condition is present. Sensitivity (also called the true positive rate, or probability of detection in some fields) measures the proportion of actual positives that are correctly identified as such: TP / (TP + FN).

True Negative (TN) = A test result which correctly indicates that a condition is absent. Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such: TN / (TN + FP).

False Negative (FN) = A test result that indicates that a condition does not hold, while in fact it does. For example, a test result that indicates a person does not have cancer when the person actually does have it.
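
The four counts combine into the usual summary metrics. The sketch below works through the arithmetic with made-up counts for a test set of 143 patients; the numbers are hypothetical and are not the results of any model in this project.

```python
# Hypothetical confusion-matrix counts for a test set of 143 patients.
TN, FP, FN, TP = 86, 4, 3, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)  # true positive rate (recall)
specificity = TN / (TN + FP)  # true negative rate
precision = TP / (TP + FP)    # share of positive predictions that are correct

print(round(accuracy, 3))     # 0.951
print(round(sensitivity, 3))  # 0.943
print(round(specificity, 3))  # 0.956
print(round(precision, 3))    # 0.926
```

For cancer screening, sensitivity deserves particular attention, because a false negative means a patient with cancer is told they do not have it.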

 

Confusion Matrix

 

from sklearn.metrics import confusion_matrix

for i in range(len(model)):

  cm = confusion_matrix(Y_test, model[i].predict(X_test))

  

  TN = cm[0][0]

  TP = cm[1][1]

  FN = cm[1][0]

  FP = cm[0][1]

  

  print(cm)

  print('Model[{}] Testing Accuracy = "{}!"'.format(i,  (TP + TN) / (TP + TN + FN + FP)))

  print()# Print a new line

 

The models' confusion matrices and accuracy on test data
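
For a binary problem, the four cells can also be unpacked in one step with ravel(), which returns them in the order TN, FP, FN, TP. Here is a small sketch with made-up labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = cancer, 0 = no cancer).
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# Rows are the actual classes, columns are the predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(tn, fp, fn, tp)  # 0 2 1 1
```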

There are other ways to get metrics on the models and see how well each one performed.

#Show other ways to get the classification accuracy & other metrics 

 

from sklearn.metrics import classification_report

from sklearn.metrics import accuracy_score

 

for i in range(len(model)):

  print('Model ',i)

  #Check precision, recall, f1-score

  print( classification_report(Y_test, model[i].predict(X_test)) )

  #Another way to get the models accuracy on the test data

  print( accuracy_score(Y_test, model[i].predict(X_test)))

  print()#Print a new line

 

Sample of the models from 1–6 performance metrics on test data

From the accuracy and metrics above, the model that performed best on the test data was the Random Forest Classifier, with an accuracy score of about 96.5%. So I will choose that model to detect cancer in patients. Below, I make the prediction/classification on the test data and show both the Random Forest Classifier's predictions and the actual values that show whether or not each patient has cancer.

I notice that the model misdiagnosed a few patients as having cancer when they didn’t, and it misdiagnosed some patients who did have cancer as not having cancer. Although this model is good, when dealing with the lives of others I want it to be better, with its accuracy as close to 100% as possible, or at least as good as, if not better than, doctors. So a little more tuning of each of the models is necessary.

#Print Prediction of Random Forest Classifier model

pred = model[6].predict(X_test)

print(pred)

 

#Print a space

print()

 

#Print the actual values

print(Y_test)
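
As a starting point for the tuning mentioned above, one common approach is a cross-validated grid search over a model's hyperparameters. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's bundled copy of the Wisconsin data (load_breast_cancer, where 0 = malignant and 1 = benign) instead of data.csv, and the parameter grid is a small hypothetical one, not a recommended setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the bundled copy of the Wisconsin data stands in for data.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Hypothetical, deliberately small grid of settings to try.
param_grid = {'n_estimators': [10, 50], 'max_depth': [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(criterion='entropy', random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the best model on the test data
```

The search only ever sees the training data, so the test set stays untouched until the final score.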

 

CONCLUSION & FUTURE SCOPE

In this Python project, we learned to build a breast cancer tumour predictor on the Wisconsin dataset and created graphs and results for it. It has been observed that a good dataset provides better accuracy. Selecting appropriate algorithms together with a good home dataset will lead to the development of prediction systems. These systems can assist in choosing proper treatment methods for a patient diagnosed with breast cancer. There are many treatments for a patient, depending on the breast cancer stage; data mining and machine learning can be a very good help in deciding the line of treatment to follow by extracting knowledge from such suitable databases.

 
