k-Nearest Neighbors (kNN) is an easy-to-grasp, and quite effective, algorithm which:

- finds a group of k objects in the training set that are closest to the test object, and
- bases the assignment of a label on the predominance of a particular class in this neighborhood.

There are three key elements of this approach:

- a set of labeled objects, e.g., a set of stored records (data),
- a distance or similarity metric to compute distance between objects,
- and the value of k, the number of nearest neighbors.

To classify an unlabeled object/item:

- the distance of this object to the labeled objects is computed,
- its k-nearest neighbors are identified,
- and the class labels of these nearest neighbors are then used to determine the class label of the object.

The figure below provides a high-level summary of the nearest-neighbor classification algorithm.

Distances are calculated using the Euclidean distance, where the distance between two vectors, xA and xB, each with two elements, is given by:

d(xA, xB) = √((xA1 − xB1)² + (xA2 − xB2)²)
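With NumPy, the formula translates almost directly into code; here is a minimal sketch (the two vector values are illustrative, not taken from the tutorial's data set):

```python
import numpy as np

# Two hypothetical 2-element vectors
xA = np.array([0.2, 1.3])
xB = np.array([1.0, 0.2])

# Euclidean distance: square the element-wise differences,
# sum them, then take the square root
dist = np.sqrt(((xA - xB) ** 2).sum())
```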

k-Nearest Neighbors Pros vs. Cons:

- Pros – High accuracy, insensitive to outliers, no assumptions about data
- Cons – Computationally expensive, requires a lot of memory
- Works with – Numeric values, nominal values

Before we start coding, here’s what we assume we have (details in Peter Harrington’s exceptional “Machine Learning in Action”):

- We have the training data (an existing set of example data)
- We have labels for all of this data
- We know what class each piece of the data should fall into
- When we’re given a new piece of data without a label, we compare that new piece of data to the existing data (every piece of existing data)
- We then take the most similar pieces of data (the nearest neighbors) and look at their labels
- We look at the top k most similar pieces of data from our known dataset; this is where the k comes from
- Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new class we assign to the data we were asked to classify
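The final majority-vote step from the list above can be sketched with Python’s collections.Counter (the neighbor labels below are hypothetical):

```python
from collections import Counter

# Labels of the k = 3 nearest neighbors (hypothetical values)
neighbor_labels = ['top-left', 'bottom-right', 'bottom-right']

# most_common(1) returns [(label, count)] for the winning label;
# the majority label becomes the predicted class
winner = Counter(neighbor_labels).most_common(1)[0][0]
```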

I’ll be using the Python (v3.3.5) programming language for the code examples:

- Create a text file named “test_data.txt” that will contain our test data set: x,y coordinate pairs of points in 2D space, plus a class label in the third column (columns separated by tabs or spaces)
0.2 1.3 top-left
0.1 1.1 top-left
0.9 0.1 bottom-right
1.0 0.2 bottom-right
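If you prefer, the same file can be generated from Python; a short sketch (tab-separated columns):

```python
# Write the sample data set to test_data.txt, one point per line:
# x-coordinate, y-coordinate, label, separated by tabs
rows = [
    (0.2, 1.3, 'top-left'),
    (0.1, 1.1, 'top-left'),
    (0.9, 0.1, 'bottom-right'),
    (1.0, 0.2, 'bottom-right'),
]
with open('test_data.txt', 'w') as f:
    for x, y, label in rows:
        f.write('{}\t{}\t{}\n'.format(x, y, label))
```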

- Create a new python file called knn_example.py
- Import NumPy (a package for scientific computing) and the Operator module (used for sorting tuples). NumPy needs to be installed first (e.g., with pip install numpy).
from numpy import *
import operator

- Add a function to prepare the data set (load it from the file and transform into data(matrix) and labels(vector) components)
def prepareData(fname):
    # read all lines from the file once
    file = open(fname)
    lines = file.readlines()
    file.close()
    # matrix with one row per point and two coordinate columns
    data = zeros((len(lines), 2))
    labels = []
    for i, line in enumerate(lines):
        parts = line.strip().split()  # split on whitespace (tabs or spaces)
        data[i, :] = parts[0:2]       # first two columns: x, y coordinates
        labels.append(parts[-1])      # last column: class label
    return data, labels
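To sanity-check the parsing step, here is a standalone sketch that writes a tiny file and parses it with the same logic (the file name and values are illustrative):

```python
import numpy as np

# Write a tiny tab-separated sample file (illustrative name)
with open('sample_data.txt', 'w') as f:
    f.write('0.2\t1.3\ttop-left\n')
    f.write('0.9\t0.1\tbottom-right\n')

# Parse it: first two columns -> float matrix, last column -> label list
lines = open('sample_data.txt').read().splitlines()
data = np.zeros((len(lines), 2))
labels = []
for i, line in enumerate(lines):
    parts = line.strip().split()
    data[i, :] = parts[0:2]
    labels.append(parts[-1])
```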

- Add a function to classify the data
def classify(x, data, labels, k):
    size = data.shape[0]
    # Euclidean distance from x to every point in the data set:
    # repeat x into a size-by-2 matrix, subtract, square, sum, square root
    diff = tile(x, (size, 1)) - data
    sqDiff = diff ** 2
    sqDist = sqDiff.sum(axis=1)
    dist = sqDist ** 0.5
    # indices of the data points sorted by increasing distance
    sortedDist = dist.argsort()
    # count the labels of the k nearest neighbors
    count = {}
    for i in range(k):
        label = labels[sortedDist[i]]
        count[label] = count.get(label, 0) + 1
    # sort label counts in descending order; the majority label comes first
    sCount = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
    return sCount[0][0]

- Save the file
- Execute (from Python CLI or IDLE):
>>> import knn_example
>>> group,labels = knn_example.prepareData('test_data.txt')
>>> knn_example.classify([1,0], group, labels, 2)

What you should see after executing the above is the following:

'bottom-right'

Voilà, you just created your first classifier, which successfully classified a point with x,y coordinates (1, 0) by assigning it the ‘bottom-right’ label.

Resources: