k-Nearest Neighbors (kNN) algorithm – Machine Learning

k-Nearest Neighbors (kNN) is an easy to grasp algorithm (and quite effective one), which:

ﬁnds a group of k objects in the training set that are closest to the test object, and
bases the assignment of a label on the predominance of a particular class in this neighborhood.

There are three key elements of this approach:

a set of labeled objects, e.g., a set of stored records (data),
a distance or similarity metric to compute distance between objects,
and the value of k, the number of nearest neighbors.

To classify an unlabeled object/item:

the distance of this object to the labeled objects is computed,
its k-nearest neighbors are identiﬁed,
and the class labels of these nearest neighbors are then used to determine the class label of the object.

Figure below provides a high-level summary of the nearest-neighbor classiﬁcation algorithm.

Distances are calculated using the Euclidian distance, where the distance between two vectors, xA and xB, with two elements, is given by:

k-Nearest Neighbors Pros vs. Cons:

Pros – High accuracy, insensitive to outliers, no assumptions about data
Cons – Computationally expensive, requires a lot of memory
Works with – Numeric values, nominal values

Prior to starting coding, here’s what we assume we have: (details in Peter Harrington’s exceptional “Machine Learning in Action”)

We have the training data (an existing set of example data)
We have labels for all of this data
We know what class each piece of the data should fall into
When we’re given a new piece of data without a label, we compare that new piece of data to the existing data (every piece of existing data)
We then take the most similar pieces of data (the nearest neighbors) and look at their labels
We look at the top k most similar pieces of data from our known dataset; this is where the k comes from
Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new class we assign to the data we were asked to classify

I’ll be using Python (v3.3.5) programming language for code examples:

Create a text file named “test_data.txt” that will contain our test data set (x,y coordinates/pairs of points in a 2D space together with labels (3rd column))
```
    0.2    1.3    top-left
    0.1    1.1    top-left
    0.9    0.1    bottom-right
    1.0    0.2    bottom-right
```
Create a new python file called knn_example.py
Import NumPy (package for scientific computing) and the Operator module (sorting tuples). NumPy requires previous installation from here.
```
    from numpy import *
    import operator
```

Add a function to prepare the data set (load it from the file and transform into data(matrix) and labels(vector) components)

    def prepareData(fname):
        file = open(fname)
        lines = len(file.readlines())
        data = zeros((lines,2))
        labels = []
        file = open(fname)
        i = 0
        for line in file.readlines():
            line = line.strip()
            list = line.split('\t')
            data[i,:] = list[0:2]
            labels.append(list[-1])
            i += 1
        return data,labels

Add a function to classify the data

    def classify(x, data, labels, k):
        size = data.shape[0]
        diff = tile(x, (size,1)) - data
        sqDiff = diff**2
        sqDist = sqDiff.sum(axis=1)
        dist = sqDist**0.5
        sortedDist = dist.argsort()
        count={}
        for i in range(k):
            label = labels[sortedDist[i]]
            count[label] = count.get(label,0) + 1
        sCount = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
        return sCount[0][0]

Save the file

Execute (from Python CLI or IDLE):

    >>> import knn_example
    >>> group,labels = knn_example.prepareData('test_data.txt')
    >>> knn_example.classify([1,0], group, labels, 2)

What you should see after executing the above, is following:

    'bottom-right'

Voila, you just created your first classifier (which successfully classified a point with x,y coordinates of 1,0 by assigning it the ‘bottom-right’ label)

Resources:

Top 10 algorithms in data mining; SURVEY PAPER (http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf)
Machine Learning in Action; Peter Harrington (http://www.manning.com/pharrington/)
NumPy scientific computing with Python (http://www.numpy.org/)

Tagged: Machine Learning

mariuszprzydatek.com

k-Nearest Neighbors (kNN) algorithm – Machine Learning

Leave a comment Cancel reply

Welcome to my Blog on Software Engineering

Recent Posts

Subscribe to RSS

Follow Blog via Email

Archives

Categories

Recent reads

Mariusz Przydatek

mariuszprzydatek.com

k-Nearest Neighbors (kNN) algorithm – Machine Learning

Share this:

Related

Leave a comment Cancel reply

Welcome to my Blog on Software Engineering

Recent Posts

Subscribe to RSS

Follow Blog via Email

Archives

Categories

Recent reads

Mariusz Przydatek