k-Nearest Neighbors (kNN) algorithm – Machine Learning

k-Nearest Neighbors (kNN) is an easy to grasp algorithm (and quite effective one), which:

  • finds a group of k objects in the training set that are closest to the test object, and
  • bases the assignment of a label on the predominance of a particular class in this neighborhood.

 

There are three key elements of this approach:

  • a set of labeled objects, e.g., a set of stored records (data),
  • a distance or similarity metric to compute distance between objects,
  • and the value of k, the number of nearest neighbors.

 

To classify an unlabeled object/item:

  • the distance of this object to the labeled objects is computed,
  • its k-nearest neighbors are identified,
  • and the class labels of these nearest neighbors are then used to determine the class label of the object.

 

Figure below provides a high-level summary of the nearest-neighbor classification algorithm.

 

knn_example1

 

knn_example2

 

Distances are calculated using the Euclidian distance, where the distance between two vectors, xA and xB, with two elements, is given by:

knn_example_euclidian

 

k-Nearest Neighbors Pros vs. Cons:

  • Pros – High accuracy, insensitive to outliers, no assumptions about data
  • Cons – Computationally expensive, requires a lot of memory
  • Works with – Numeric values, nominal values

 

 

Prior to starting coding, here’s what we assume we have: (details in Peter Harrington’s exceptional “Machine Learning in Action”)

  • We have the training data (an existing set of example data)
  • We have labels for all of this data
  • We know what class each piece of the data should fall into
  • When we’re given a new piece of data without a label, we compare that new piece of data to the existing data (every piece of existing data)
  • We then take the most similar pieces of data (the nearest neighbors) and look at their labels
  • We look at the top k most similar pieces of data from our known dataset; this is where the k comes from
  • Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new class we assign to the data we were asked to classify

 

 

I’ll be using Python (v3.3.5) programming language for code examples:

  1. Create a text file named “test_data.txt” that will contain our test data set (x,y coordinates/pairs of points in a 2D space together with labels (3rd column))
        0.2    1.3    top-left
        0.1    1.1    top-left
        0.9    0.1    bottom-right
        1.0    0.2    bottom-right
    
  2. Create a new python file called knn_example.py
  3. Import NumPy (package for scientific computing) and the Operator module (sorting tuples). NumPy requires previous installation from here.
        from numpy import *
        import operator
    
  4. Add a function to prepare the data set (load it from the file and transform into data(matrix) and labels(vector) components)
        def prepareData(fname):
            file = open(fname)
            lines = len(file.readlines())
            data = zeros((lines,2))
            labels = []
            file = open(fname)
            i = 0
            for line in file.readlines():
                line = line.strip()
                list = line.split('\t')
                data[i,:] = list[0:2]
                labels.append(list[-1])
                i += 1
            return data,labels
    
  5. Add a function to classify the data
        def classify(x, data, labels, k):
            size = data.shape[0]
            diff = tile(x, (size,1)) - data
            sqDiff = diff**2
            sqDist = sqDiff.sum(axis=1)
            dist = sqDist**0.5
            sortedDist = dist.argsort()
            count={}
            for i in range(k):
                label = labels[sortedDist[i]]
                count[label] = count.get(label,0) + 1
            sCount = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
            return sCount[0][0]
    
  6. Save the file
  7. Execute (from Python CLI or IDLE):
        >>> import knn_example
        >>> group,labels = knn_example.prepareData('test_data.txt')
        >>> knn_example.classify([1,0], group, labels, 2)
    

 

What you should see after executing the above, is following:

    'bottom-right'

 

 

Voila, you just created your first classifier (which successfully classified a point with x,y coordinates of 1,0 by assigning it the ‘bottom-right’ label)

 

 

Resources:

 

Advertisements

Tagged:

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: