**DEFINITIONS**

**Supervised Learning** – Learning in which the correct categories for the input data points are known. The Support Vector Machine (SVM) is the example we’ll be studying here.

**Unsupervised Learning** – Learning in which the output targets aren’t known for the given problem. Instead, we analyze commonalities within the data itself to group similar data points together.

**Features (Inputs)** – Specific inputs that map to an output class target. A cell’s mean area and mean smoothness are two examples we’ll study here.

**Target (Output)** – “Correct” answers (determined classes) pertaining to the specific feature inputs.

**Training Set** – Subset of data used to build the machine learning model. These data points are not used in the testing stage.

**Test Set** – Subset of data used to determine the accuracy of the model. These data points are not used in the training stage.

**Class** – Categories to which the input features pertain. In this example, Malignant and Benign are the two possible classes for tumor cells. Other applications may have more than two possible classes.

**Inference** – Using the trained ML model to deduce the class to which a test input pertains.

**Margin** – Distance between the closest points of different classes in the context of the Support Vector Machine. The support vectors are simply the points closest to the opposing class. During training, the support vectors are computed to determine the separating line (a hyperplane in higher dimensions). Fortunately, after training, almost all data points can be discarded and only the support vectors are retained, which significantly reduces storage requirements.
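
To make the storage point in the Margin definition concrete, here is a minimal sketch that fits a linear SVM on synthetic data and compares the number of support vectors with the number of training points. The blob centers, sample count, and random seed are assumptions chosen for illustration; they are not values from this tutorial.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic two-class data (assumed centers and spread, for illustration only)
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# support_vectors_ holds only the training points that define the boundary
numSupport = clf.support_vectors_.shape[0]
print("Training points:", X.shape[0])
print("Support vectors:", numSupport)
```

On well-separated data like this, the support vectors are typically a small fraction of the training set.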

**THEORY**

Clearly, a line between the data points for the two classes (X’s and O’s) would serve as a reasonable divider for the data points. But, what’s the equation of that line? And what does it look like in higher dimensions?

The goal is to find where to draw the thick red line in Fig. 2 above so that the margin is maximized. The data points (X’s and O’s above) closest to the thin red lines are called the support vectors.
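
For a linear SVM with weight vector w, the distance between the two margin lines is 2/||w||, so maximizing the margin amounts to minimizing ||w||. The sketch below checks this on a four-point toy data set (an assumption made for illustration). The large `C` value is also an assumption, used to approximate a hard margin.

```python
import numpy as np
from sklearn import svm

# Toy data (assumed): class 0 sits on x = 0, class 1 sits on x = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A large C approximates a hard-margin SVM
clf = svm.SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w = clf.coef_[0]
marginWidth = 2.0 / np.linalg.norm(w)  # distance between the two margin lines
print("w =", w)
print("margin width =", round(marginWidth, 3))
```

The two classes are 2 units apart along the first axis, so the computed margin width should come out close to 2.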

2). Polynomial (‘poly’)

3). Radial basis function (‘rbf’)
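
The quoted names are the strings accepted by the `kernel` parameter of Sklearn’s `SVC` [2]. As a rough sketch of how the choice matters, the code below trains one model per kernel on synthetic ring-shaped data and prints each test accuracy. The data set, split ratio, and seeds are assumptions chosen for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): one class forms a ring around the other,
# so no straight line can separate them
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
XTr, XTe, yTr, yTe = train_test_split(X, y, test_size=0.1, random_state=0)

scores = {}
for kernelName in ['linear', 'poly', 'rbf']:
    clf = svm.SVC(kernel=kernelName)
    clf.fit(XTr, yTr)
    scores[kernelName] = clf.score(XTe, yTe)  # fraction of correct test outputs

for kernelName, score in scores.items():
    print(kernelName, round(score, 3))
```

On data like this, a linear boundary cannot enclose the inner class, which is why a non-linear kernel tends to score higher.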

Mathematically, we can write the SVM training equation, according to [1]:

In Eq. [1] above, *K* is the kernel function, *x* is a matrix containing the inputs we’d like to train on, **t** represents the targets, and the second term is added to help make the data linearly separable in higher dimensions. We’ll use the Sklearn [2] library in Python to solve this equation for us. Other packages, such as cvxopt [3], would use a form similar to Eq. [1], whose form is the same as that of the Lagrange multiplier solutions.
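
As a rough illustration of what a kernel function computes, the sketch below builds the matrix of pairwise kernel values K(x_i, x_j) by hand for a linear kernel and for an RBF kernel, then compares the latter with Sklearn’s `rbf_kernel` helper. The three sample points and the `gamma` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three sample input points with two features each (assumed values)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 0.5  # assumed RBF width parameter

# Linear kernel: K[i, j] is the dot product of points i and j
KLinear = X @ X.T

# RBF kernel: K[i, j] = exp(-gamma * squared distance between points i and j)
diff = X[:, None, :] - X[None, :, :]
sqDist = np.sum(diff ** 2, axis=2)
KRbf = np.exp(-gamma * sqDist)

print(KLinear)
print(np.allclose(KRbf, rbf_kernel(X, gamma=gamma)))
```

The solver only ever sees the data through these pairwise values, which is what lets the same training procedure work with different kernels.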

**IMPLEMENTATION**

#### 1. Import Libraries

First, we import the sklearn, numpy, matplotlib, and math libraries into our Python program.

    from sklearn import svm
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    import math

#### 2. Load Data

Next, we load the breast cancer data set and calculate the number of data points we have. The data set contains 569 samples.

    dataset = load_breast_cancer()
    sampleSize = dataset.data.shape[0]  # sample size
    # 90% of the dataset is used for training;
    # thus, the remaining 10% is used for testing
    trainSize = math.floor(0.9 * sampleSize)
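
A quick sanity check of what was loaded can be useful before moving on. The sketch below prints the data shape, the class names, and the resulting train and test sizes. The name `testSize` is introduced here for illustration and is not used elsewhere in the tutorial.

```python
import math
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
sampleSize = dataset.data.shape[0]
trainSize = math.floor(0.9 * sampleSize)
testSize = sampleSize - trainSize  # hypothetical helper name

print("Samples x features:", dataset.data.shape)
print("Class names:", list(dataset.target_names))
print("Train size:", trainSize, "Test size:", testSize)
```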

#### 3. Select Features

Next, we need to select a couple of features to analyze.

    # Choose the fourth and fifth columns as features 1 and 2, respectively
    # Off by one because of zero indexing
    feat1Index = 3
    feat2Index = feat1Index + 1
    feat1Name = dataset['feature_names'][feat1Index]
    feat2Name = dataset['feature_names'][feat2Index]

#### 4. Structure Data for SVM Input

Additionally, we’ll have three sets of variables housing our data to make the example clear. First, we’ll get all of the data, then we’ll designate about 90% of our data for training, and the rest will be reserved for testing. For analysis and plotting purposes later, we further split the data depending on whether the target is malignant or benign (XMal and XBen, respectively).

    # The helper functions are defined before they are called so that
    # the script runs top to bottom

    def sliceData(dataset, start, end, feat1Index, feat2Index):
        # Slices the feature and output arrays based on indices
        f1 = dataset.data[start:end, feat1Index]
        f2 = dataset.data[start:end, feat2Index]
        y = dataset.target[start:end]  # same as the outcome ("Correct Answers")
        return f1, f2, y

    def separateFeaturesViaClasses(f1, f2, y):
        # Creates and returns TWO (2) separate input feature matrices, each
        # pertaining to one of the target classes, as well as
        # ONE (1) input feature matrix pertaining to both target classes
        assert len(f1) == len(f2) == len(y)
        # Create scatter plot inputs for each class
        X = [[f1[i], f2[i]] for i in range(len(f1))]
        XBen = np.array([X[i] for i in range(len(f1)) if y[i] == 1])  # target 1: Benign
        XMal = np.array([X[i] for i in range(len(f1)) if y[i] == 0])  # target 0: Malignant
        return X, XBen, XMal

    # All data
    [f1, f2, y] = sliceData(dataset, 0, sampleSize, feat1Index, feat2Index)
    X, XBen, XMal = separateFeaturesViaClasses(f1, f2, y)

    # Train data
    [f1Tr, f2Tr, yTr] = sliceData(dataset, 0, trainSize, feat1Index, feat2Index)
    XTr, XBenTr, XMalTr = separateFeaturesViaClasses(f1Tr, f2Tr, yTr)

    # Test data
    [f1Te, f2Te, yTe] = sliceData(dataset, trainSize, sampleSize, feat1Index, feat2Index)
    XTe, XBenTe, XMalTe = separateFeaturesViaClasses(f1Te, f2Te, yTe)
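
The same per-class split can also be written with NumPy boolean masks. This is an alternative sketch, not the tutorial’s own helper functions, and it covers only the “all data” case.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
feat1Index, feat2Index = 3, 4  # mean area and mean smoothness

# Two-column feature matrix and the matching targets
X = dataset.data[:, [feat1Index, feat2Index]]
y = dataset.target

# Boolean masks select the rows belonging to each class
XBen = X[y == 1]  # target 1 is benign
XMal = X[y == 0]  # target 0 is malignant

print(X.shape, XBen.shape, XMal.shape)
```

Slicing `X` and `y` with `[:trainSize]` and `[trainSize:]` before applying the masks would give the train and test variants.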

#### 5. Invoke SVM Algorithm

To have Python solve Eq. [1] for us, we’ll need to provide our training data set and correct target labels.

    # Fit the input parameters to an SVM model. Assume a linear kernel.
    # We only want to provide the training data so we'll have some
    # left for testing
    clf = svm.SVC(kernel='linear')
    clf.fit(XTr, yTr)

#### 6. Analyze Results

We define accuracy as the number of correct test outputs divided by the total number of test samples. We’ll see that we got 2 samples wrong out of the 57 test samples (the 10% of the 569 samples not used for training).

    # Now we perform the inference step and analyze the accuracy results
    modelOutput = clf.predict(XTe)
    correctOutput = y[trainSize:]
    result = modelOutput == correctOutput
    # Get indices of misclassified samples
    wrongIndices = [i for i in range(len(result)) if not result[i]]
    xWrong = np.array(XTe)[wrongIndices]
    accuracy = sum(result) / len(result)
    accuracyStr = "Accuracy is: " + str(round(accuracy * 100, 2)) + "%"
    print(accuracyStr)
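
The accuracy arithmetic can be checked in isolation. With 569 samples and a 90% training split, 57 samples remain for testing, so 2 errors correspond to 55/57 correct. The sketch below uses hypothetical label arrays (the 40/17 class split and the positions of the two errors are made up for illustration) together with Sklearn’s `accuracy_score` and `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 57 test samples, 2 of them misclassified
yTrue = np.array([1] * 40 + [0] * 17)
yPred = yTrue.copy()
yPred[0] = 0    # a benign sample predicted as malignant
yPred[-1] = 1   # a malignant sample predicted as benign

accuracy = accuracy_score(yTrue, yPred)
cm = confusion_matrix(yTrue, yPred)  # rows: true class, columns: predicted class

print("Accuracy is: " + str(round(accuracy * 100, 2)) + "%")
print(cm)
```

The confusion matrix also shows which kind of error occurred, which matters here because missing a malignant tumor is usually worse than flagging a benign one.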

#### 7. Plot Data

Lastly, we plot our data. We also draw the SVM decision line using the slope and intercept extracted from the trained model.

    # Calculate the SVM decision line for plotting
    w = clf.coef_[0]
    a = -w[0] / w[1]  # slope
    xx = np.linspace(650, 700)
    # y-intercept, computed from the public intercept_ attribute
    TERM = -clf.intercept_[0] / w[1]
    yy = a * xx + TERM
    plt.plot(xx, yy)

    # Plot data points
    plt.scatter(XBenTr[:, 0], XBenTr[:, 1], label='Benign - Train Data', marker='o', color='blue')
    plt.scatter(XBenTe[:, 0], XBenTe[:, 1], label='Benign - Test Data', marker='o', color='orange')
    plt.scatter(XMalTr[:, 0], XMalTr[:, 1], label='Malignant - Train Data', marker='x', color='blue')
    plt.scatter(XMalTe[:, 0], XMalTe[:, 1], label='Malignant - Test Data', marker='x', color='orange')
    plt.scatter(xWrong[:, 0], xWrong[:, 1], label='Incorrect Test Outputs', marker='+', color='red')
    plt.legend()
    plt.xlabel(feat1Name)
    plt.ylabel(feat2Name)
    plt.title("Support Vector Machine Example for Cancer Cell Classification")
    plt.text(400, 0.22, accuracyStr, bbox=dict(facecolor='red', alpha=0.5))
    plt.show()
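
The slope and intercept of the plotted line follow from setting the linear decision function to zero. In the notation assumed here, x_1 and x_2 are the two features, w_1 and w_2 are the learned weights, and b is the bias term of the decision function:

```latex
w_1 x_1 + w_2 x_2 + b = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}
```

So the slope is -w_1/w_2 and the intercept on the x_2 axis is -b/w_2, provided w_2 is nonzero.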

The resulting plot is shown below. We achieved an accuracy of about 96.5%.

**NEXT QUESTIONS**

In production, we would optimize our accuracy further and consider the computation resources for the training and inference stages. Here are some questions to consider.

- How does varying the kernel function affect performance?
- How would the code example be modified to accommodate higher dimensions, such as three features? Would that change improve accuracy?
- What features are optimal for the above problem?
- How do the training and inference times grow with the number of features? Does this agree with theoretical estimates?
- What are the optimal values of gamma and C, as defined in [2]?
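
One possible starting point for the gamma and C question is a cross-validated grid search. The sketch below is an assumption-laden example rather than a tuned recipe: the candidate values, the RBF kernel, the 5-fold setting, and the added feature-scaling step are all choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

dataset = load_breast_cancer()
X = dataset.data[:, [3, 4]]  # mean area and mean smoothness
y = dataset.target

# Scaling is an added step (not in the tutorial) so that both features
# have comparable ranges before the RBF kernel is applied
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Assumed candidate values; widen or refine the grid as needed
paramGrid = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': [0.01, 0.1, 1],
}

search = GridSearchCV(pipeline, paramGrid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Adding `'svc__kernel'` to the grid would extend the same search to the kernel question in the list above.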

**REFERENCES**

[1] – Stephen Marsland. *Machine Learning: An Algorithmic Perspective.* 2nd Edition.

[2] – https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

[3] – https://cvxopt.org/
