Tutorial: How to Detect Cancer Cells with Machine Learning?

The medical device industry has been advancing toward solving diagnostic and treatment problems with machine learning (ML), a data prediction technique. Investing effort to understand this multidisciplinary area for your own application can therefore help you maintain an edge over your competition. To that end, we’ll cover a relevant ML case study by first defining the necessary terms, going into some theory, and then implementing a Python coding example.

Conventional methods of data prediction use statistical techniques, such as regression, to classify data points or predict future values. The growth in computational power has made it practical to solve more sophisticated, computation-intensive algorithms. In this tutorial, we will study and implement a Support Vector Machine (SVM) to categorize whether medical tumor cells are cancerous based on their features; the principles of this tutorial apply to a wide range of classification problems.

DEFINITIONS

Let’s start with a few definitions. Two common categories of machine learning are supervised learning and unsupervised learning.

Supervised Learning – Occurs when the correct categories pertaining to the input data points are known. The Support Vector Machine (SVM), which we’ll study here, is an example.

Unsupervised Learning – Occurs when the output targets aren’t known for the given problem. Instead, we analyze commonalities within the data itself to group similar data points together.

Features (Inputs) – Specific inputs that map to an output class target. A cell’s mean area and mean smoothness are the two examples we’ll study here.

Target (Output) – “Correct” answers (determined classes) pertaining to the specific feature inputs.

Training Set – Subset of the data used to build the machine learning model. These data points are not used in the testing stage.

Test Set – Subset of the data used to determine the accuracy of the model. These data points are not used in the training stage.

Class – A category to which the input features pertain. In this example, malignant and benign are the two possible classes for tumor cells. Other applications may have more than two possible classes.

Inference – Using the trained ML model to deduce the class to which a test input pertains.

Margin – The distance between the closest points of different classes in the context of a Support Vector Machine. The support vectors are simply the points closest to the opposing class. During training, the support vectors are computed to determine the hyper-plane (in sufficiently high dimensions). Fortunately, after training, almost all data points can be discarded and only the support vectors retained, resulting in significant storage savings.

THEORY

Support Vector Machines (SVMs) are a type of supervised learning algorithm that attempts to find a dividing line or curve (or hyper-plane in higher dimensions) so that unknown data points can be assigned to the appropriate class. It’s best to illustrate with some diagrams.

Fig. 1: Data Points with Features 1 and 2 Plotted

Clearly, a line between the data points for the two classes (X’s and O’s) would serve as a reasonable divider. But what’s the equation of that line? And what does the divider look like in higher dimensions?

Fig. 2: Data Points with Features 1 and 2 Plotted after SVM Invocation

The goal is to find where to draw the thick red line shown in Fig. 2; specifically, we want to maximize the margin. The data points (the X’s and O’s above) closest to the thin red lines are called the support vectors.

The example above appears relatively simple and may not require the SVM technique at all. So, why is SVM useful? It becomes useful when the data points are not linearly separable, meaning they cannot be divided by a single flat decision surface. Because we are effectively solving for an equation that separates the data, transforming a low-dimensional, non-linearly separable problem into a higher-dimensional, linearly separable one simplifies the solution. The function used for this purpose is a kernel function, which transforms the input data into higher dimensions. Three common choices are listed below (the sketch after the list shows how each is selected in sklearn):

1). Linear (‘linear’)

2). Polynomial (‘poly’)

3). Radial Basis Function (‘rbf’)
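
Below is that minimal sketch; it assumes the same sklearn SVC interface used in the implementation section, with illustrative hyper-parameter values that would be tuned in practice:

    from sklearn import svm

    #Each classifier differs only in its kernel function; degree applies
    #to 'poly' and gamma to 'rbf', and both would be tuned in practice
    linearClf = svm.SVC(kernel='linear')
    polyClf = svm.SVC(kernel='poly', degree=3)
    rbfClf = svm.SVC(kernel='rbf', gamma='scale')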

Mathematically, we can write the SVM training problem, according to [1], in its standard soft-margin dual form (Eq. [1] below):
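
$$\max_{\boldsymbol{\lambda}} \;\sum_{i=1}^{n} \lambda_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j \, t_i t_j \, K(\mathbf{x}_i, \mathbf{x}_j), \qquad \text{subject to } 0 \le \lambda_i \le C \text{ and } \sum_{i=1}^{n} \lambda_i t_i = 0 \qquad \text{Eq. [1]}$$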

In Eq. [1] above, K is the kernel function, the x_i are the training inputs, the t_i are the targets, and the λ_i are the Lagrange multipliers we solve for; the kernel in the second term implicitly maps the inputs to higher dimensions, where the classes may become linearly separable. We’ll use the Sklearn [2] library in Python to solve this problem for us. Other packages, such as cvxopt [3], operate on a form similar to Eq. [1], which is exactly the form that arises from the Lagrange multiplier solution.

IMPLEMENTATION

1. Import Libraries

First, we import the sklearn, numpy, matplotlib, and math libraries into our Python program.

from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
import math 

2. Load Data

Second, we’ll load the breast cancer data set and calculate the number of data points we have; the set contains 569 samples.

    dataset = load_breast_cancer()
    sampleSize = dataset.data.shape[0] #sample size (569)
    trainSize = math.floor(0.9*sampleSize) #90% of the dataset is used for training
                                           #the remaining 10% is used for testing
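
One caveat: taking the first 90% of the rows as training data assumes the samples aren’t ordered by class. A common alternative, sketched below with sklearn’s train_test_split (not used in the rest of this tutorial), shuffles before splitting:

    from sklearn.model_selection import train_test_split

    #Shuffled 90/10 split; random_state makes the split reproducible
    trainData, testData, trainTargets, testTargets = train_test_split(
        dataset.data, dataset.target, test_size=0.1, random_state=0)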

3. Select Features

Next, we need to select a couple features to analyze.

    #Choose the fourth and fifth columns as features 1 and 2, respectively
    #Indices are off by one because of zero indexing
    feat1Index = 3
    feat2Index = feat1Index + 1
    feat1Name = dataset['feature_names'][feat1Index]
    feat2Name = dataset['feature_names'][feat2Index]
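
If you’d like to confirm which index corresponds to which measurement before choosing, a quick check (assuming the standard sklearn layout of this dataset) is to print the feature names alongside their indices:

    #List every available feature with its column index
    for i, name in enumerate(dataset['feature_names']):
        print(i, name) #e.g., 3 -> 'mean area', 4 -> 'mean smoothness'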

4. Structure Data for SVM Input

Additionally, we’ll keep three sets of variables housing our data to make the example clear. First, we’ll gather all of the data; then we’ll designate about 90% of it for training and reserve the rest for testing. For later analysis and plotting, we further split each set depending on whether the target is malignant or benign (XMal and XBen, respectively). The two helper functions, sliceData and separateFeaturesViaClasses, are defined first below, followed by the calls that build each set.

    def sliceData(dataset, start, end, feat1Index, feat2Index):
        #Slices the feature and target arrays based on the given indices
        f1 = dataset.data[start:end, feat1Index]
        f2 = dataset.data[start:end, feat2Index]
        y = dataset.target[start:end] #the targets (the "correct answers")
        return f1, f2, y

    def separateFeaturesViaClasses(f1, f2, y):
        #Creates and returns TWO (2) separate input feature matrices - each
        #pertaining to one of the target classes - as well as ONE (1) input
        #feature matrix pertaining to both target classes
        assert(len(f1) == len(f2) == len(y))
        #Create scatter plot inputs for each class
        X = [[f1[i], f2[i]] for i in range(len(f1))]
        XBen = np.array([X[i] for i in range(len(f1)) if y[i] == 1]) #Class 1 - Benign
        XMal = np.array([X[i] for i in range(len(f1)) if y[i] == 0]) #Class 0 - Malignant
        return X, XBen, XMal

    #All data
    [f1, f2, y] = sliceData(dataset, 0, sampleSize, feat1Index, feat2Index)
    X, XBen, XMal = separateFeaturesViaClasses(f1, f2, y)

    #Training data
    [f1Tr, f2Tr, yTr] = sliceData(dataset, 0, trainSize, feat1Index, feat2Index)
    XTr, XBenTr, XMalTr = separateFeaturesViaClasses(f1Tr, f2Tr, yTr)

    #Test data
    [f1Te, f2Te, yTe] = sliceData(dataset, trainSize, sampleSize, feat1Index, feat2Index)
    XTe, XBenTe, XMalTe = separateFeaturesViaClasses(f1Te, f2Te, yTe)

5. Invoke SVM Algorithm

To have Python solve Eq. [1] for us, we’ll need to provide our training data set and correct target labels.

    #Fit the input parameters to an SVM model, assuming a linear kernel
    #We only provide the training data so that some data is
    #left for testing
    clf = svm.SVC(kernel='linear')
    clf.fit(XTr, yTr)
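
This is also where the Margin definition from earlier pays off: after fitting, the boundary depends only on the support vectors. A quick way to confirm how few points are retained (a sketch using standard SVC attributes) is:

    print(clf.support_vectors_.shape) #(number of support vectors, 2)
    print(clf.n_support_)             #support vector count per class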

6. Analyze Results

We’ll define the accuracy as the ratio of correct test outputs to the total number of test attempts. We’ll see that we got 2 samples wrong out of the 57 test attempts.

    #Now we perform the inference step and analyze the accuracy results
    modelOutput = clf.predict(XTe)
    correctOutput = yTe #ground-truth labels for the test set
    result = modelOutput == correctOutput
    #Get the indices of the misclassified samples
    wrongIndices = [i for i in range(len(result)) if result[i] == False]
    xWrong = np.array(XTe)[wrongIndices]
    accuracy = sum(result)/len(result)
    accuracyStr = "Accuracy is: " + str(round(accuracy*100,2)) + "%"
    print(accuracyStr)
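
Accuracy alone hides which class the errors fall in; in cancer screening, a false negative (a malignant tumor classified as benign) is far costlier than a false positive. A confusion matrix, sketched here with sklearn’s metrics module, breaks the errors out per class:

    from sklearn.metrics import confusion_matrix

    #Rows are true classes, columns are predicted classes
    #(class 0 = malignant, class 1 = benign in this dataset)
    print(confusion_matrix(correctOutput, modelOutput))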

7. Plot Data

Lastly, we plot our data. We also draw the SVM decision boundary by extracting the line’s slope and intercept: since the learned boundary satisfies w0*x + w1*y + b = 0, it can be rewritten as y = -(w0/w1)*x - (b/w1), where the weights w come from clf.coef_ and the intercept b from clf.intercept_.

    #Calculate the SVM line for plotting
    w = clf.coef_[0]
    a = -w[0]/w[1] #slope of the decision boundary
    xx = np.linspace(650,700)
    intercept = -(clf.intercept_[0]/w[1]) #y-intercept of the decision boundary
    yy = a*xx + intercept
    plt.plot(xx,yy)
    
    #Plot the data points
    plt.scatter(XBenTr[:,0],XBenTr[:,1], label='Benign - Train Data',
                marker='o', color='blue')
    plt.scatter(XBenTe[:,0],XBenTe[:,1], label='Benign - Test Data', 
                marker='o', color='orange')
    plt.scatter(XMalTr[:,0],XMalTr[:,1], label='Malignant - Train Data', 
                marker='x', color='blue')
    plt.scatter(XMalTe[:,0],XMalTe[:,1], label='Malignant - Test Data',
                marker='x', color='orange')
    plt.scatter(xWrong[:,0],xWrong[:,1], label='Incorrect Test Outputs', 
                marker='+',color='red')
    plt.legend()
    plt.xlabel(feat1Name)
    plt.ylabel(feat2Name)
    plt.title("Support Vector Machine Example for Cancer Cell Classification")
    plt.text(400, 0.22, accuracyStr,bbox=dict(facecolor='red', alpha=0.5))
    plt.show()

Below, we have the plot from our work. We achieved about 96.5% accuracy.

Fig. 3: SVM Model Performance – 96.5% Accuracy

NEXT QUESTIONS

In production, we would optimize the accuracy further and consider the computational resources required for the training and inference stages. Here are some questions to consider:

  1. How does varying the kernel function affect performance?
  2. How would the code example be modified to accommodate higher dimensions, such as three features? Would that change improve accuracy?
  3. What features are optimal for the above problem?
  4. How do the training and inference times grow with the number of features? Does this agree with theoretical estimates?
  5. What are the optimal values of gamma and C, as defined in [2]? (The sketch below shows one way to search for them.)
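
As a starting point for questions 1 and 5, a cross-validated grid search over the kernel, C, and gamma is one common approach. The sketch below uses sklearn’s GridSearchCV with illustrative (not tuned) candidate values and the training variables from this tutorial:

    from sklearn.model_selection import GridSearchCV

    #Candidate values are illustrative; widen or refine the grid as needed
    paramGrid = {'kernel': ['linear', 'poly', 'rbf'],
                 'C': [0.1, 1, 10, 100],
                 'gamma': ['scale', 0.01, 0.1, 1]}
    search = GridSearchCV(svm.SVC(), paramGrid, cv=5)
    search.fit(XTr, yTr)
    print(search.best_params_, search.best_score_)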

REFERENCES

[1] – Stephen Marsland, Machine Learning: An Algorithmic Perspective, 2nd Edition.

[2] – https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

[3] – https://cvxopt.org/

Let us know when you’d like to discuss how the learning in this tutorial may be applicable to the technical problem you’re trying to solve. Our fresh view of your problem may give you a different, valuable perspective to consider.
