This notebook uses a very simple problem to illustrate how two classifiers, KNeighborsClassifier and a support vector machine (SVM), can separate two classes.

In [2]:

```
%pylab inline
from matplotlib import rcParams
rcParams["figure.figsize"] = (10,6)
from sklearn.neighbors import KNeighborsClassifier
```

Populating the interactive namespace from numpy and matplotlib

The data is made of two variables, X1 and X2. There are two classes, defined by:

- X2 > 0.5
- X2 <= 0.5

The two variables are distributed as follows:

- X1 is uniformly distributed between 0 and 1
- X2 is uniformly distributed between 0 and 1

In [210]:

```
# create a data set
N = 50
X1_training = np.random.uniform(size=N)
X2_training = np.random.uniform(size=N)
training = np.vstack([X1_training, X2_training]).transpose()
# This defines the true label of each training point
training_labels = X2_training > 0.5
# Create a test set
X1_test = np.random.uniform(size=5000)
X2_test = np.random.uniform(size=5000)
test = np.vstack([X1_test, X2_test]).transpose()
# This defines the true label of each test point
test_labels = X2_test > 0.5
```

In [211]:

```
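def plot_data(data, mask, score=None):
    # Red points: class where mask is True; blue points: the other class
    plot(data[mask, 0], data[mask, 1], "or")
    plot(data[~mask, 0], data[~mask, 1], "ob")
    # Compare against None so that a score of 0.0 is still displayed
    if score is not None:
        title("Accuracy {}".format(score))
    xlabel("X1", fontsize=20)
    ylabel("X2", fontsize=20)
    # True class boundary at X2 = 0.5
    axhline(0.5, lw=4, alpha=0.5)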
```

In [212]:

```
plot_data(training, training_labels)
```

Neighbors-based classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
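The majority vote described above can be sketched directly in NumPy. This is only an illustration of the idea, not how scikit-learn implements it; the helper `knn_predict` and the hand-made `toy_*` arrays are hypothetical names introduced here, and the sketch assumes boolean labels and an odd k so that the vote cannot tie.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Label each query point by majority vote among its k nearest training points."""
    # Euclidean distance from every query point to every training point
    diffs = query[:, np.newaxis, :] - train_X[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points, one row per query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Fraction of neighbors labeled True; more than half means class True
    return train_y[nearest].mean(axis=1) > 0.5

# Hand-made toy data following the notebook's rule: the label is X2 > 0.5
toy_X = np.array([[0.1, 0.9], [0.5, 0.8], [0.9, 0.7],
                  [0.2, 0.1], [0.6, 0.2], [0.8, 0.3]])
toy_y = toy_X[:, 1] > 0.5
toy_query = np.array([[0.5, 0.9], [0.5, 0.1]])
toy_pred = knn_predict(toy_X, toy_y, toy_query, k=3)
print(toy_pred)  # [ True False]
```

The first query point sits among the three upper points and the second among the three lower ones, so each vote is unanimous.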

scikit-learn implements two different nearest neighbors classifiers:

- KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
- RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.

The k-neighbors classification in KNeighborsClassifier is the more commonly used of the two techniques.
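As a rough side-by-side of the two classifiers, the sketch below fits both on freshly generated data that follows the same rule as the notebook (label is X2 > 0.5). The sample size of 200, the radius of 0.2, and the two query points are assumptions chosen for illustration. Note that RadiusNeighborsClassifier raises an error at prediction time if a query point has no training point within the radius, unless its `outlier_label` parameter is set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Seeded data so the sketch is reproducible (assumed sizes, not the notebook's)
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))
y = X[:, 1] > 0.5

# Vote among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
# Vote among all training points closer than 0.2 to the query point
rnc = RadiusNeighborsClassifier(radius=0.2).fit(X, y)

# One point well inside each class region
query = np.array([[0.5, 0.9], [0.5, 0.1]])
knn_pred = knn.predict(query)
rnc_pred = rnc.predict(query)
print("k-neighbors:     ", knn_pred)
print("radius-neighbors:", rnc_pred)
```

With a radius of 0.2, every neighbor of the first query point has X2 of at least 0.7 and every neighbor of the second has X2 of at most 0.3, so the radius-based vote cannot mix the two classes for these points.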

The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.
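One way to get a feel for this dependence is to fit the classifier for several values of k and compare test accuracy. The sketch below regenerates data with the same rule as the notebook under a fixed seed; the list of k values is an arbitrary assumption. Because this toy data has no label noise, the noise-suppression benefit of a larger k is not really visible here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same generating rule as the notebook, with a fixed seed (an assumption)
rng = np.random.RandomState(0)
X_train = rng.uniform(size=(50, 2))
y_train = X_train[:, 1] > 0.5
X_test = rng.uniform(size=(5000, 2))
y_test = X_test[:, 1] > 0.5

# Test accuracy for a few neighborhood sizes (k must not exceed 50 here)
scores_by_k = {}
for k in (1, 5, 15, 45):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores_by_k[k] = clf.score(X_test, y_test)
    print("k = {:2d}  test accuracy = {:.3f}".format(k, scores_by_k[k]))
```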

In [213]:

```
# build the model from the training data
model = KNeighborsClassifier()
model.fit(training, training_labels)
```

Out[213]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')

In [214]:

```
score = model.score(training, training_labels)
score
```

Out[214]:

0.93999999999999995

In [215]:

```
model.score(test, test_labels)
```

Out[215]:

0.9698

In [216]:

```
subplot(1,2,1)
plot_data(training, training_labels, score)
# score on the test set.
score2 = model.score(test, test_labels)
subplot(1,2,2)
plot_data(test, model.predict(test), score2)
axhline(0.5, lw=3, color="k")
# Overlay the model's decision regions on the training panel
subplot(1,2,1)
xx, yy = np.meshgrid(linspace(0,1,20), linspace(0,1,20))
# Predict the class at every grid point
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
contourf(xx, yy, Z, alpha=0.4)
```

Out[216]:

<matplotlib.contour.QuadContourSet at 0x7f06ee496a90>

Support vector machines (SVMs) are used for classification, regression and outlier detection.

The advantages of support vector machines are:

- Effective in high dimensional spaces even in cases where number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different Kernel functions can be specified for the decision function.
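The "subset of training points" can be inspected on a fitted model. The sketch below is a minimal illustration on data generated with the notebook's rule under a fixed seed (an assumption); it reads the `support_vectors_`, `coef_` and `intercept_` attributes of a linear-kernel SVC. `coef_` is only available for the linear kernel.

```python
import numpy as np
from sklearn import svm

# Same generating rule as the notebook, with a fixed seed (an assumption)
rng = np.random.RandomState(0)
X = rng.uniform(size=(50, 2))
y = X[:, 1] > 0.5

clf = svm.SVC(kernel="linear").fit(X, y)

# Only these training points enter the decision function
n_sv = clf.support_vectors_.shape[0]
print("{} of {} training points are support vectors".format(n_sv, len(X)))

# For a linear kernel the decision function is w . x + b
w, b = clf.coef_[0], clf.intercept_[0]
print("w = {}, b = {:.3f}".format(w, b))
```

Since the classes are split by X2 alone, one would expect the second component of `w` to dominate, although the exact values depend on the sample.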

The disadvantages of support vector machines include:

- If the number of features is much greater than the number of samples, the method is likely to perform poorly.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the "Scores and probabilities" section of the scikit-learn SVM documentation).
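To illustrate the last point, probability estimates have to be requested when the model is constructed, through the `probability=True` parameter of SVC; fitting then takes longer because of the internal cross-validation. The sketch below uses seeded data generated with the notebook's rule and two arbitrary query points, both of which are assumptions.

```python
import numpy as np
from sklearn import svm

# Same generating rule as the notebook, with a fixed seed (an assumption)
rng = np.random.RandomState(0)
X = rng.uniform(size=(50, 2))
y = X[:, 1] > 0.5

# probability=True enables predict_proba at the cost of a slower fit
clf = svm.SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

# Columns follow clf.classes_, i.e. [False, True] for boolean labels
proba = clf.predict_proba(np.array([[0.5, 0.9], [0.5, 0.1]]))
print(clf.classes_)
print(proba)
```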

In [217]:

```
from sklearn import svm
```

In [218]:

```
# degree is only used by the "poly" kernel; it is ignored when kernel="linear"
model = svm.SVC(kernel="linear", degree=1)
```

In [219]:

```
model.fit(training, training_labels)
```

Out[219]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

In [220]:

```
score = model.score(training, training_labels)
score
```

Out[220]:

1.0

In [221]:

```
pred = model.predict(test)
score2 = model.score(test, test_labels)
score2
```

Out[221]:

0.98980000000000001

In [222]:

```
subplot(1,2,1)
plot_data(training, training_labels, score)
score2 = model.score(test, test_labels)
subplot(1,2,2)
# Title the test panel with the test score (score2), not the training score
plot_data(test, model.predict(test), score2)
axhline(0.5, lw=3, color="k")
# Overlay the model's decision regions on the training panel
subplot(1,2,1)
xx, yy = np.meshgrid(linspace(0,1,20), linspace(0,1,20))
# Predict the class at every grid point
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
contourf(xx, yy, Z, alpha=0.4)
```

Out[222]:

<matplotlib.contour.QuadContourSet at 0x7f06fc220908>

- Change the kernel from linear to rbf or to poly
- Change N to 40 or 400 and see the impact on scores and classification
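A possible starting point for these experiments is a loop over training-set sizes and kernels that records the test accuracy of each combination. The helper `make_data`, the fixed seed, and the use of default SVC parameters are assumptions made for this sketch; the numbers it prints will differ from the notebook's outputs because the data is regenerated.

```python
import numpy as np
from sklearn import svm

def make_data(n, rng):
    """Draw n points uniformly in the unit square, labeled by X2 > 0.5."""
    X = rng.uniform(size=(n, 2))
    return X, X[:, 1] > 0.5

rng = np.random.RandomState(0)
X_test, y_test = make_data(5000, rng)

# Test accuracy for each (training size, kernel) pair
results = {}
for n in (40, 400):
    X_train, y_train = make_data(n, rng)
    for kernel in ("linear", "rbf", "poly"):
        clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
        results[(n, kernel)] = clf.score(X_test, y_test)
        print("N = {:3d}  kernel = {:6s}  test accuracy = {:.3f}".format(
            n, kernel, results[(n, kernel)]))
```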