Support Vector Classification using scikit-learn


This tutorial is an introduction to Support Vector Classification using scikit-learn and pandas. We will use scikit-learn for the support vector machine algorithm and pandas to handle the data as a dataframe. I prefer pandas here because it lets me work with dataframes, which I am comfortable with from R. Strictly speaking, pandas is not necessary for this tutorial, since scikit-learn works directly with NumPy arrays, but knowing pandas helps when handling data stored in CSV files.

You can download the Python notebook here.

We will start by importing the iris dataset from sklearn and importing pandas for data handling.

from sklearn import datasets
import pandas as pd

We will now load the dataset into a pandas dataframe.

iris = pd.DataFrame(datasets.load_iris().data)

We will explore the data to understand the columns in it

iris.head(10)
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2
5  5.4  3.9  1.7  0.4
6  4.6  3.4  1.4  0.3
7  5.0  3.4  1.5  0.2
8  4.4  2.9  1.4  0.2
9  4.9  3.1  1.5  0.1
iris.tail()
       0    1    2    3
145  6.7  3.0  5.2  2.3
146  6.3  2.5  5.0  1.9
147  6.5  3.0  5.2  2.0
148  6.2  3.4  5.4  2.3
149  5.9  3.0  5.1  1.8
print(iris)
       0    1    2    3
0    5.1  3.5  1.4  0.2
1    4.9  3.0  1.4  0.2
2    4.7  3.2  1.3  0.2
3    4.6  3.1  1.5  0.2
4    5.0  3.6  1.4  0.2
5    5.4  3.9  1.7  0.4
6    4.6  3.4  1.4  0.3
7    5.0  3.4  1.5  0.2
8    4.4  2.9  1.4  0.2
9    4.9  3.1  1.5  0.1
10   5.4  3.7  1.5  0.2
11   4.8  3.4  1.6  0.2
12   4.8  3.0  1.4  0.1
13   4.3  3.0  1.1  0.1
14   5.8  4.0  1.2  0.2
15   5.7  4.4  1.5  0.4
16   5.4  3.9  1.3  0.4
17   5.1  3.5  1.4  0.3
18   5.7  3.8  1.7  0.3
19   5.1  3.8  1.5  0.3
20   5.4  3.4  1.7  0.2
21   5.1  3.7  1.5  0.4
22   4.6  3.6  1.0  0.2
23   5.1  3.3  1.7  0.5
24   4.8  3.4  1.9  0.2
25   5.0  3.0  1.6  0.2
26   5.0  3.4  1.6  0.4
27   5.2  3.5  1.5  0.2
28   5.2  3.4  1.4  0.2
29   4.7  3.2  1.6  0.2
..   ...  ...  ...  ...
120  6.9  3.2  5.7  2.3
121  5.6  2.8  4.9  2.0
122  7.7  2.8  6.7  2.0
123  6.3  2.7  4.9  1.8
124  6.7  3.3  5.7  2.1
125  7.2  3.2  6.0  1.8
126  6.2  2.8  4.8  1.8
127  6.1  3.0  4.9  1.8
128  6.4  2.8  5.6  2.1
129  7.2  3.0  5.8  1.6
130  7.4  2.8  6.1  1.9
131  7.9  3.8  6.4  2.0
132  6.4  2.8  5.6  2.2
133  6.3  2.8  5.1  1.5
134  6.1  2.6  5.6  1.4
135  7.7  3.0  6.1  2.3
136  6.3  3.4  5.6  2.4
137  6.4  3.1  5.5  1.8
138  6.0  3.0  4.8  1.8
139  6.9  3.1  5.4  2.1
140  6.7  3.1  5.6  2.4
141  6.9  3.1  5.1  2.3
142  5.8  2.7  5.1  1.9
143  6.8  3.2  5.9  2.3
144  6.7  3.3  5.7  2.5
145  6.7  3.0  5.2  2.3
146  6.3  2.5  5.0  1.9
147  6.5  3.0  5.2  2.0
148  6.2  3.4  5.4  2.3
149  5.9  3.0  5.1  1.8

[150 rows x 4 columns]

Since this is a classification problem, we will now import SVC (Support Vector Classification) from sklearn.

from sklearn.svm import SVC

We will create a classifier object clf and assign an SVC instance to it.

clf = SVC()
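Calling SVC() with no arguments uses the defaults (RBF kernel, C=1.0). The constructor also accepts hyperparameters such as kernel, C, and gamma; a small sketch with illustrative values (these particular numbers are examples, not tuned recommendations):

```python
from sklearn.svm import SVC

# SVC() uses the defaults: an RBF kernel with C=1.0.
# Illustrative alternatives (example values, not recommendations):
clf_linear = SVC(kernel="linear", C=1.0)        # linear decision boundary
clf_rbf = SVC(kernel="rbf", C=10.0, gamma=0.1)  # RBF kernel with explicit gamma

print(clf_linear.kernel, clf_rbf.C)
```

Larger C penalizes misclassified training points more heavily, while gamma controls how far the influence of a single training example reaches with the RBF kernel.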

We will now extract the target variable from the iris dataset, which sklearn returns as a numpy array.

target = datasets.load_iris().target
target_names = datasets.load_iris().target_names[target]
print(target)
print(target_names)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica' 'virginica' 'virginica' 'virginica'
 'virginica' 'virginica' 'virginica']
print(type(target),type(target_names))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>

We can fit the SVM/SVC model with either of these arrays. If you want the output as numeric codes corresponding to the labels, use target; if you want the species labels themselves, use target_names.

clf.fit(iris, target_names)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

We will now try to predict the first 3 data points from the dataset using our classifier

list(clf.predict(iris[:3]))
['setosa', 'setosa', 'setosa']

The steps below are an alternative using the numeric values of the class.

clf.fit(iris, target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
list(clf.predict(iris[:3]))
[0, 0, 0]
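When fitting on the numeric labels, the predictions can be mapped back to species names afterwards by indexing target_names with the predicted integers. A small self-contained sketch of this round trip:

```python
from sklearn import datasets
from sklearn.svm import SVC

data = datasets.load_iris()
clf = SVC()
clf.fit(data.data, data.target)  # fit on the numeric labels 0/1/2

preds = clf.predict(data.data[:3])  # numeric predictions for the first 3 rows
names = data.target_names[preds]    # index into ['setosa' 'versicolor' 'virginica']
print(list(names))
```

This gives you the readable labels while keeping the model itself trained on the compact numeric encoding.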

Now we will split the dataframe into train and test sets and try classifying.

data = datasets.load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
df['target']=data['target']
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
df.describe()
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)      target
count         150.000000       150.000000        150.000000        150.000000  150.000000
mean            5.843333         3.054000          3.758667          1.198667    1.000000
std             0.828066         0.433594          1.764420          0.763161    0.819232
min             4.300000         2.000000          1.000000          0.100000    0.000000
25%             5.100000         2.800000          1.600000          0.300000    0.000000
50%             5.800000         3.000000          4.350000          1.300000    1.000000
75%             6.400000         3.300000          5.100000          1.800000    2.000000
max             7.900000         4.400000          6.900000          2.500000    2.000000

We will now split this into train and test using sklearn.model_selection

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

After splitting we get two dataframes of 120 and 30 rows respectively.

train.count()
sepal length (cm)    120
sepal width (cm)     120
petal length (cm)    120
petal width (cm)     120
target               120
dtype: int64
test.count()
sepal length (cm)    30
sepal width (cm)     30
petal length (cm)    30
petal width (cm)     30
target               30
dtype: int64
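Because train_test_split shuffles the rows randomly, your counts will match these but the rows themselves will differ on each run. Passing random_state makes the split reproducible, and stratify keeps the class proportions balanced across train and test; a sketch (42 is an arbitrary example seed):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd

data = datasets.load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']

# random_state fixes the shuffle; stratify preserves the 50/50/50 class balance
train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df['target'])
print(len(train), len(test))  # 120 30
```

With stratify, each of the three species contributes exactly 10 rows to the 30-row test set instead of a random mix.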

We now define our Y and X variables

Y = train.target
X = train.iloc[:, 0:4]
X.head()
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
62                 6.0               2.2                4.0               1.0
140                6.7               3.1                5.6               2.4
95                 5.7               3.0                4.2               1.2
39                 5.1               3.4                1.5               0.2
67                 5.8               2.7                4.1               1.0

Using the same process as earlier, we will fit the classifier

clf.fit(X,Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

and try predicting the test set

test_X = test.iloc[:, 0:4]
pred = clf.predict(test_X)

We can now create a confusion matrix to understand how well the classifier performs on the test set.

from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(test.target, pred)
print(cnf_matrix)
[[ 9  0  0]
 [ 0  8  0]
 [ 0  1 12]]
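The diagonal of the confusion matrix holds the correct predictions, so the overall accuracy is the diagonal sum divided by the total. A sketch using the matrix printed above as an example (your numbers will differ, since the split is random):

```python
import numpy as np

# The confusion matrix printed above (an example; a new split gives different counts)
cnf_matrix = np.array([[9, 0, 0],
                       [0, 8, 0],
                       [0, 1, 12]])

# Correct predictions sit on the diagonal: 9 + 8 + 12 = 29 out of 30
accuracy = np.trace(cnf_matrix) / cnf_matrix.sum()
print(accuracy)  # 29/30, roughly 0.9667
```

Equivalently, sklearn.metrics.accuracy_score(test.target, pred) computes the same number directly from the label arrays.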

 
