Breast Cancer Detection Using Machine Learning

INTRODUCTION

In this project I will show you how to write a Python machine learning program that detects breast cancer from data. Breast cancer (BC) is one of the most common cancers among women worldwide, and detecting it early can greatly improve prognosis and survival chances because patients can begin clinical treatment sooner. It is amazing to be able to help save lives using just data, Python, and machine learning!

DATASET: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

PROBLEM STATEMENT & DISCUSSION

Breast cancer is one of the leading cancers in many countries, including India. The survival rate is high: with early diagnosis, 97% of women survive for more than 5 years. Even so, the death toll from this disease has increased drastically over the last few decades. The main issue in curing it is early recognition. Hence, alongside medical solutions, data science solutions need to be integrated to help reduce deaths from the disease.

PROGRAMMING:

Now import the packages/libraries to make it easier to write the program.

#import libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

Next I will load the data, and print the first 7 rows of data.

NOTE: Each row of data represents a patient that may or may not have cancer.

#Load the data

#from google.colab import files # Use to load data on Google Colab

#uploaded = files.upload() # Use to load data on Google Colab

df = pd.read_csv('data.csv')

df.head(7)

A sample of the first 7 rows of data

Explore the data and count the number of rows and columns in the data set. There are 569 rows, which means there are 569 patients in this data set, and 33 columns, which means there are 33 data points recorded for each patient.

#Count the number of rows and columns in the data set

df.shape

Number of Rows: 569, Number of Columns: 33

Continue exploring the data and count the empty (NaN, NAN, na) values in each column. None of the columns contain any empty values except the column named ‘Unnamed: 32’, which contains 569 empty values. That is the same as the number of rows in the data set, so this column is completely empty and therefore useless.

#Count the empty (NaN, NAN, na) values in each column

df.isna().sum()

Count of all the empty values per column/feature

Remove the column ‘Unnamed: 32’ from the original data set since it adds no value.

#Drop the column with all missing values (na, NAN, NaN)

#NOTE: This drops the column Unnamed

df = df.dropna(axis=1)
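As a small illustration of counting and dropping empty columns, the sketch below builds a hypothetical three-row table (the values and the `toy` name are made up, not taken from the data set) with one completely empty column. It also uses `how="all"`, an option that restricts `dropna` to columns that are empty in every row. The plain `dropna(axis=1)` call removes any column containing at least one missing value, which gives the same result here because only 'Unnamed: 32' has gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one fully empty column, like 'Unnamed: 32'
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the columns in which every value is missing
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.shape)
```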

Get the new count of the number of rows and columns.

#Get the new count of the number of rows and cols

df.shape

Number of Rows: 569, Number of Columns: 32

Get a count of the number of patients with Malignant (M) cancerous and Benign (B) non-cancerous cells.

#Get a count of the number of 'M' & 'B' cells

df['diagnosis'].value_counts()

# of Cancerous Cells: 212 and # of Non-Cancerous Cells: 357
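These two counts also give a useful reference point for the accuracy figures later on. A minimal sketch in plain Python, using only the counts reported above (the variable names are illustrative): a trivial classifier that always answered "benign" would already be right for roughly 63% of the patients in the full data set, so a model has to beat that figure to be useful.

```python
# Diagnosis counts reported by value_counts() above
counts = {"B": 357, "M": 212}
total = sum(counts.values())

# Share of each class in the data set
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(shares.values())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```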

Visualize the counts, by creating a count plot.

#Visualize this count

sns.countplot(x='diagnosis', data=df)

Chart displaying Malignant (cancerous) & Benign(non-cancerous) diagnosis

Look at the data types to see which columns need to be transformed / encoded. I can see from the data types that all of the columns/features are numbers except for the column ‘diagnosis’, which is categorical data represented as an object in python.

#Look at the data types

df.dtypes

A list of the columns & their data types

Encode the categorical data. Change the values in the column ‘diagnosis’ from M and B to 1 and 0 respectively, then print the results.

#Encode the categorical data values ('B' -> 0, 'M' -> 1)

from sklearn.preprocessing import LabelEncoder

labelencoder_Y = LabelEncoder()

df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])

print(df['diagnosis'].values)
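To make the encoding step concrete, here is a rough plain-Python sketch of the mapping `LabelEncoder` produces: it sorts the distinct labels and numbers them from zero, which is why 'B' becomes 0 and 'M' becomes 1. The `encode_labels` helper and the sample list are hypothetical, written only for this illustration.

```python
def encode_labels(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

# Hypothetical sample of diagnosis values
sample = ["M", "B", "B", "M", "B"]
encoded, mapping = encode_labels(sample)
print(mapping)
print(encoded)
```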

Create a pair plot. A “pairs plot”, also known as a scatter plot matrix, draws a scatter plot for every pair of columns, so each variable’s values are plotted against every other variable’s values for the same rows.

sns.pairplot(df, hue="diagnosis")

Pair plot of all of the columns highlighting the diagnosis points in Orange (1) & Blue (0)

Print the new data set which now has only 32 columns. Print only the first 5 rows.

df.head(5)

5 rows of the new data set

Get the correlation of the columns.

#Get the correlation of the columns

df.corr()

Column correlation sample
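`df.corr()` computes the Pearson correlation coefficient by default. As a rough sketch of what each cell of that table means, the hypothetical `pearson` function below computes the coefficient for two short, made-up lists. Values near 1 mean the two columns rise together, and values near -1 mean one falls as the other rises.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: the second list is exactly twice the first
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2.0, 4.0, 6.0, 8.0]
print(pearson(radius, perimeter))

# Reversing one list flips the sign of the relationship
print(pearson(radius, perimeter[::-1]))
```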

Visualize the correlation by creating a heat map.

plt.figure(figsize=(20,20))

sns.heatmap(df.corr(), annot=True, fmt='.0%')

Heat map of correlations

Now I am done exploring and cleaning the data. I will set up my data for the model by first splitting the data set into a feature data set also known as the independent data set (X), and a target data set also known as the dependent data set (Y).

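#NOTE: 2:31 selects the columns at positions 2 to 30, which leaves out the last column (position 31); use 2:32 to include every feature column

X = df.iloc[:, 2:31].values

Y = df.iloc[:, 1].values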

Split the data again, but this time into 75% training and 25% testing data sets.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
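For a sense of the resulting sizes, the arithmetic below reproduces how scikit-learn turns a fractional `test_size` into row counts when no `train_size` is given: the test set is rounded up and the training set gets the remainder. With 569 rows that yields 143 test rows and 426 training rows. The variable names are illustrative.

```python
from math import ceil

n_samples = 569      # rows in the data set
test_size = 0.25     # fraction passed to train_test_split

n_test = ceil(test_size * n_samples)   # test set is rounded up
n_train = n_samples - n_test           # the rest is used for training

print(n_train, n_test)
```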

#Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)
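The scaling step can be sketched in plain Python for a single feature. `StandardScaler` subtracts the mean and divides by the (population) standard deviation, and the important detail is that both statistics come from the training data only and are then reused for the test data. The numbers and names below are made up for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]
test_values = [14.0, 20.0]

# "fit": learn the statistics from the training data only
mean = fmean(train_values)
std = pstdev(train_values)   # population standard deviation

# "transform": apply the same statistics to both sets
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the test data as well would leak information about the test set into the model, which is why only `transform` is called on `X_test`.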

Create a function to hold many different models (e.g. Logistic Regression, Decision Tree Classifier, Random Forest Classifier) to make the classification. These are the models that will detect if a patient has cancer or not. Within this function I will also print the accuracy of each model on the training data.

def models(X_train,Y_train):

    #Using Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, Y_train)

    #Using KNeighborsClassifier
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn.fit(X_train, Y_train)

    #Using SVC linear
    from sklearn.svm import SVC
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, Y_train)

    #Using SVC rbf
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, Y_train)

    #Using GaussianNB
    from sklearn.naive_bayes import GaussianNB
    gauss = GaussianNB()
    gauss.fit(X_train, Y_train)

    #Using DecisionTreeClassifier
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, Y_train)

    #Using RandomForestClassifier method of ensemble class to use Random Forest Classification algorithm
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    forest.fit(X_train, Y_train)

    #print model accuracy on the training data.
    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
    print('[1]K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train))
    print('[2]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train))
    print('[3]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train))
    print('[4]Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train))
    print('[5]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
    print('[6]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))

    return log, knn, svc_lin, svc_rbf, gauss, tree, forest

Call the function to train all of the models, and look at each model’s accuracy on the training data when classifying whether or not a patient has cancer.

model = models(X_train,Y_train)

The accuracy of each model on the training data

Show the confusion matrix and the accuracy of the models on the test data. The confusion matrix tells us how many patients each model misdiagnosed (the number of patients with cancer who were diagnosed as not having cancer, a.k.a. false negatives, and the number of patients without cancer who were diagnosed as having cancer, a.k.a. false positives) and the number of correct diagnoses, the true positives and true negatives.

False Positive (FP) = A test result which incorrectly indicates that a particular condition or attribute is present.

True Positive (TP) = A test result which correctly indicates that a condition is present. Sensitivity (also called the true positive rate, or probability of detection in some fields) measures the proportion of actual positives that are correctly identified as such.

True Negative (TN) = A test result which correctly indicates that a condition is absent. Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such.

False Negative (FN) = A test result that indicates that a condition does not hold, while in fact it does. For example, a test result that indicates a person does not have cancer when the person actually does have it.
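The four outcomes can be checked with a short plain-Python sketch. The hypothetical `confusion_counts` helper below tallies them from lists of actual and predicted labels (1 = malignant, 0 = benign), and sensitivity and specificity are then derived from the counts. The label lists are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for binary labels where 1 is positive."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Made-up labels for eight patients
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(tn, fp, fn, tp, accuracy, sensitivity, specificity)
```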

Confusion Matrix

from sklearn.metrics import confusion_matrix

for i in range(len(model)):

    cm = confusion_matrix(Y_test, model[i].predict(X_test))

    TN = cm[0][0]
    TP = cm[1][1]
    FN = cm[1][0]
    FP = cm[0][1]

    print(cm)
    print('Model[{}] Testing Accuracy = "{}!"'.format(i, (TP + TN) / (TP + TN + FN + FP)))
    print()# Print a new line

The models confusion matrix and accuracy on test data

Other ways to get metrics on the model to see how well each one performed.

#Show other ways to get the classification accuracy & other metrics

from sklearn.metrics import classification_report

from sklearn.metrics import accuracy_score

for i in range(len(model)):

    print('Model ',i)

    #Check precision, recall, f1-score
    print( classification_report(Y_test, model[i].predict(X_test)) )

    #Another way to get the models accuracy on the test data
    print( accuracy_score(Y_test, model[i].predict(X_test)))
    print()#Print a new line

Sample of the models from 1–6 performance metrics on test data
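The precision, recall and F1 columns of the report come straight from the confusion matrix counts. A minimal sketch with made-up counts (not results from this project):

```python
# Hypothetical confusion matrix counts for the positive (malignant) class
tp, fp, fn = 45, 5, 3

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a cancer screen, recall on the malignant class is usually the figure to watch, since a false negative means a missed cancer.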

From the accuracy and metrics above, the model that performed best on the test data was the Random Forest Classifier, with an accuracy score of about 96.5%. So I will choose that model to detect cancer cells in patients. Make the prediction/classification on the test data and show both the Random Forest Classifier’s predictions and the actual values that show whether or not each patient has cancer.

I notice that the model misdiagnosed a few patients as having cancer when they didn’t, and it misdiagnosed some patients who did have cancer as not having cancer. Although this model is good, when dealing with the lives of others I want it to be better and get its accuracy as close to 100% as possible, or at least as good as, if not better than, doctors. So a little more tuning of each of the models is necessary.

#Print Prediction of Random Forest Classifier model

pred = model[6].predict(X_test)

print(pred)

#Print a space

print()

#Print the actual values

print(Y_test)
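One common way to do the tuning mentioned above is a cross-validated grid search. The sketch below is only an illustration under stated assumptions, not the tuning used in this project. It loads scikit-learn's bundled copy of the Wisconsin diagnostic data through `load_breast_cancer` as a stand-in for `data.csv` (note that this copy encodes malignant as 0 and benign as 1, the reverse of the encoding used earlier), and the parameter grid is a small, arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data: scikit-learn's bundled copy of the Wisconsin data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small, illustrative grid; real tuning would explore more values
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}

# Try every combination with 5-fold cross-validation on the training data
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # best combination found
print(search.best_score_)             # its mean cross-validated accuracy
print(search.score(X_test, y_test))   # accuracy of the refit model on test data
```

Because the search only ever sees the training split, the test set stays untouched until the final score.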

CONCLUSION & FUTURE SCOPE

In this Python project, we learned to build a breast cancer tumour predictor on the Wisconsin data set and created graphs and results for it. We observed that a good data set provides better accuracy. Selecting appropriate algorithms together with a good data set will lead to the development of prediction systems. These systems can assist in choosing proper treatment methods for a patient diagnosed with breast cancer. There are many treatments for a patient depending on the breast cancer stage; data mining and machine learning can be a great help in deciding the line of treatment to follow by extracting knowledge from suitable databases.
