KNN CONCEPT
In [21]:
## Code to display image
from IPython.display import Image
Image(filename='/Users/sudhirwadhwa/Desktop/tbd/INTELFINALBUNDLE/Day2_KNN_Algo/KNNALGO.png') 
Out[21]:

The Algorithm: an example of k-NN classification.

The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.

Choice of k is critical – a small value of k means that noise has a higher influence on the result, while a large value makes prediction computationally expensive and defeats the basic philosophy behind KNN (that nearby points are likely to share a class or density).

A simple approach to selecting k is to set k = n^(1/2); the usual rule of thumb is the square root of the number (n) of training samples.
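As a sketch of that rule of thumb, the snippet below computes k = sqrt(n) and rounds it to an odd number (a common extra convention to avoid ties in two-class voting; the n value here is just an illustrative assumption):

```python
import math

n_samples = 150  # assumed training-set size, for illustration
k = int(math.sqrt(n_samples))
if k % 2 == 0:
    k += 1  # prefer an odd k to avoid ties when voting between two classes
print(k)  # -> 13
```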

Source : Wikipedia and dtio research

KNN is a simple concept: define some distance metric between the items in your dataset, and find the K closest items. You can then use those items to predict some property of a test item, by having them somehow "vote" on it.
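That "find the K closest items and let them vote" idea can be sketched from scratch in a few lines. This is a minimal illustration, not the sklearn implementation we use later; the distance metric is pluggable (squared Euclidean here for simplicity), and the toy points and labels are made up:

```python
from collections import Counter

def knn_predict(train, labels, query, k=3,
                dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    """Vote among the k training points closest to `query`."""
    # Rank training indices by distance to the query point
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    # Majority vote among the k nearest neighbors
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = ["blue", "blue", "red", "red", "red"]
print(knn_predict(train, labels, (7, 8), k=3))  # -> red
```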

In this example, we will use Cosine Distance.

Math Refresher: calculating cosine distance. How do we compute cosine distance and similarity?

scipy.spatial.distance.cosine computes the distance, not the similarity. Subtract the value from 1 to get the similarity.

In [22]:
from scipy import spatial

m1 = [1, 45, 17, 12]
m2 = [7, 1, 11, 112]
m3 = [1, 1, 1, 1]
m4 = [99, 99, 99, 99]   # points in the same direction as m3, so distance should be 0


## Calculate cosine distance; similarity = 1 - distance
cdist = spatial.distance.cosine(m1, m2)
cdist1 = spatial.distance.cosine(m4, m3)

simi = 1 - cdist
simi1 = 1 - cdist1
print ("Distance", cdist,  " and Similarity = ", simi)
print ("Distance", cdist1,  " and Similarity = ", simi1)
Distance 0.716897654023  and Similarity =  0.283102345977
Distance 0.0  and Similarity =  1.0
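The scipy result can be cross-checked against the definition, cosine distance = 1 - (a·b)/(||a|| ||b||). A plain-Python sketch:

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

m1 = [1, 45, 17, 12]
m2 = [7, 1, 11, 112]
print(round(cosine_distance(m1, m2), 6))  # -> 0.716898, matching the scipy output above
```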
KNN on a sample dataset - the Iris Dataset

We will use the KNeighborsClassifier() class:

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)

Understand the DataSet - 150 entries
==============================
sepal_length  sepal_width  petal_length  petal_width  species
5.1  3.5  1.4  0.2  setosa
4.9  3.0  1.4  0.2  setosa
4.7  3.2  1.3  0.2  setosa
4.6  3.1  1.5  0.2  setosa
5.0  3.6  1.4  0.2  setosa
5.4  3.9  1.7  0.4  setosa
4.6  3.4  1.4  0.3  setosa
5.0  3.4  1.5  0.2  setosa
4.4  2.9  1.4  0.2  setosa
4.9  3.1  1.5  0.1  setosa
5.4  3.7  1.5  0.2  setosa
4.8  3.4  1.6  0.2  setosa
4.8  3.0  1.4  0.1  setosa
4.3  3.0  1.1  0.1  setosa
5.8  4.0  1.2  0.2  setosa
5.7  4.4  1.5  0.4  setosa
5.4  3.9  1.3  0.4  setosa
5.1  3.5  1.4  0.3  setosa
5.7  3.8  1.7  0.3  setosa
5.1  3.8  1.5  0.3  setosa
5.4  3.4  1.7  0.2  setosa
5.1  3.7  1.5  0.4  setosa
4.6  3.6  1.0  0.2  setosa
5.1  3.3  1.7  0.5  setosa
4.8  3.4  1.9  0.2  setosa
5.0  3.0  1.6  0.2  setosa
5.0  3.4  1.6  0.4  setosa
5.2  3.5  1.5  0.2  setosa
5.2  3.4  1.4  0.2  setosa
4.7  3.2  1.6  0.2  setosa
4.8  3.1  1.6  0.2  setosa
5.4  3.4  1.5  0.4  setosa
5.2  4.1  1.5  0.1  setosa
5.5  4.2  1.4  0.2  setosa
4.9  3.1  1.5  0.1  setosa
5.0  3.2  1.2  0.2  setosa
5.5  3.5  1.3  0.2  setosa
4.9  3.1  1.5  0.1  setosa
4.4  3.0  1.3  0.2  setosa
5.1  3.4  1.5  0.2  setosa
5.0  3.5  1.3  0.3  setosa
4.5  2.3  1.3  0.3  setosa
4.4  3.2  1.3  0.2  setosa
5.0  3.5  1.6  0.6  setosa
5.1  3.8  1.9  0.4  setosa
4.8  3.0  1.4  0.3  setosa
5.1  3.8  1.6  0.2  setosa
4.6  3.2  1.4  0.2  setosa
5.3  3.7  1.5  0.2  setosa
5.0  3.3  1.4  0.2  setosa
7.0  3.2  4.7  1.4  versicolor
6.4  3.2  4.5  1.5  versicolor
6.9  3.1  4.9  1.5  versicolor
5.5  2.3  4.0  1.3  versicolor
6.5  2.8  4.6  1.5  versicolor
5.7  2.8  4.5  1.3  versicolor
6.3  3.3  4.7  1.6  versicolor
4.9  2.4  3.3  1.0  versicolor
6.6  2.9  4.6  1.3  versicolor
5.2  2.7  3.9  1.4  versicolor
5.0  2.0  3.5  1.0  versicolor
5.9  3.0  4.2  1.5  versicolor
6.0  2.2  4.0  1.0  versicolor
6.1  2.9  4.7  1.4  versicolor
5.6  2.9  3.6  1.3  versicolor
6.7  3.1  4.4  1.4  versicolor
5.6  3.0  4.5  1.5  versicolor
5.8  2.7  4.1  1.0  versicolor
6.2  2.2  4.5  1.5  versicolor
5.6  2.5  3.9  1.1  versicolor
5.9  3.2  4.8  1.8  versicolor
6.1  2.8  4.0  1.3  versicolor
6.3  2.5  4.9  1.5  versicolor
6.1  2.8  4.7  1.2  versicolor
6.4  2.9  4.3  1.3  versicolor
6.6  3.0  4.4  1.4  versicolor
6.8  2.8  4.8  1.4  versicolor
6.7  3.0  5.0  1.7  versicolor
6.0  2.9  4.5  1.5  versicolor
5.7  2.6  3.5  1.0  versicolor
5.5  2.4  3.8  1.1  versicolor
5.5  2.4  3.7  1.0  versicolor
5.8  2.7  3.9  1.2  versicolor
6.0  2.7  5.1  1.6  versicolor
5.4  3.0  4.5  1.5  versicolor
6.0  3.4  4.5  1.6  versicolor
6.7  3.1  4.7  1.5  versicolor
6.3  2.3  4.4  1.3  versicolor
5.6  3.0  4.1  1.3  versicolor
5.5  2.5  4.0  1.3  versicolor
5.5  2.6  4.4  1.2  versicolor
6.1  3.0  4.6  1.4  versicolor
5.8  2.6  4.0  1.2  versicolor
5.0  2.3  3.3  1.0  versicolor
5.6  2.7  4.2  1.3  versicolor
5.7  3.0  4.2  1.2  versicolor
5.7  2.9  4.2  1.3  versicolor
6.2  2.9  4.3  1.3  versicolor
5.1  2.5  3.0  1.1  versicolor
5.7  2.8  4.1  1.3  versicolor
6.3  3.3  6.0  2.5  virginica
5.8  2.7  5.1  1.9  virginica
7.1  3.0  5.9  2.1  virginica
6.3  2.9  5.6  1.8  virginica
6.5  3.0  5.8  2.2  virginica
7.6  3.0  6.6  2.1  virginica
4.9  2.5  4.5  1.7  virginica
7.3  2.9  6.3  1.8  virginica
6.7  2.5  5.8  1.8  virginica
7.2  3.6  6.1  2.5  virginica
6.5  3.2  5.1  2.0  virginica
6.4  2.7  5.3  1.9  virginica
6.8  3.0  5.5  2.1  virginica
5.7  2.5  5.0  2.0  virginica
5.8  2.8  5.1  2.4  virginica
6.4  3.2  5.3  2.3  virginica
6.5  3.0  5.5  1.8  virginica
7.7  3.8  6.7  2.2  virginica
7.7  2.6  6.9  2.3  virginica
6.0  2.2  5.0  1.5  virginica
6.9  3.2  5.7  2.3  virginica
5.6  2.8  4.9  2.0  virginica
7.7  2.8  6.7  2.0  virginica
6.3  2.7  4.9  1.8  virginica
6.7  3.3  5.7  2.1  virginica
7.2  3.2  6.0  1.8  virginica
6.2  2.8  4.8  1.8  virginica
6.1  3.0  4.9  1.8  virginica
6.4  2.8  5.6  2.1  virginica
7.2  3.0  5.8  1.6  virginica
7.4  2.8  6.1  1.9  virginica
7.9  3.8  6.4  2.0  virginica
6.4  2.8  5.6  2.2  virginica
6.3  2.8  5.1  1.5  virginica
6.1  2.6  5.6  1.4  virginica
7.7  3.0  6.1  2.3  virginica
6.3  3.4  5.6  2.4  virginica
6.4  3.1  5.5  1.8  virginica
6.0  3.0  4.8  1.8  virginica
6.9  3.1  5.4  2.1  virginica
6.7  3.1  5.6  2.4  virginica
6.9  3.1  5.1  2.3  virginica
5.8  2.7  5.1  1.9  virginica
6.8  3.2  5.9  2.3  virginica
6.7  3.3  5.7  2.5  virginica
6.7  3.0  5.2  2.3  virginica
6.3  2.5  5.0  1.9  virginica
6.5  3.0  5.2  2.0  virginica
6.2  3.4  5.4  2.3  virginica
5.9  3.0  5.1  1.8  virginica
In [23]:
# train_test_split (X,y, test_size = .5)
# 50% split
In [24]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target



# train_test_split lives in sklearn.model_selection
# (sklearn.cross_validation is deprecated and has been removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
In [25]:
X_test.shape
Out[25]:
(75, 4)
In [26]:
X_train.shape
Out[26]:
(75, 4)
In [27]:
X_test
Out[27]:
array([[ 7.3,  2.9,  6.3,  1.8],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 7.4,  2.8,  6.1,  1.9],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 5.2,  2.7,  3.9,  1.4],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.2,  2.8,  4.8,  1.8],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 5.4,  3. ,  4.5,  1.5],
       [ 5.7,  2.6,  3.5,  1. ],
       [ 4.6,  3.2,  1.4,  0.2],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 5.8,  2.8,  5.1,  2.4],
       [ 5.6,  2.7,  4.2,  1.3],
       [ 6.4,  3.1,  5.5,  1.8],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 7.7,  2.6,  6.9,  2.3],
       [ 6.6,  2.9,  4.6,  1.3],
       [ 6.8,  2.8,  4.8,  1.4],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 4.8,  3.1,  1.6,  0.2],
       [ 5.1,  3.5,  1.4,  0.2],
       [ 6.2,  2.2,  4.5,  1.5],
       [ 4.5,  2.3,  1.3,  0.3],
       [ 6.7,  3.3,  5.7,  2.5],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.6,  3. ,  4.1,  1.3],
       [ 5.1,  3.4,  1.5,  0.2],
       [ 7.2,  3.2,  6. ,  1.8],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 5.7,  3. ,  4.2,  1.2],
       [ 5.6,  2.5,  3.9,  1.1],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 6.1,  2.8,  4. ,  1.3],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 6.7,  3.1,  5.6,  2.4],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 6.7,  3.1,  4.7,  1.5],
       [ 5.2,  4.1,  1.5,  0.1],
       [ 6.4,  3.2,  5.3,  2.3],
       [ 5.5,  2.3,  4. ,  1.3],
       [ 6.8,  3.2,  5.9,  2.3],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 6. ,  2.2,  5. ,  1.5],
       [ 6.4,  2.7,  5.3,  1.9],
       [ 6.9,  3.1,  5.4,  2.1],
       [ 7.7,  3. ,  6.1,  2.3],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 5.9,  3. ,  4.2,  1.5],
       [ 6.8,  3. ,  5.5,  2.1],
       [ 5.7,  2.8,  4.1,  1.3],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 6.7,  3. ,  5.2,  2.3],
       [ 6.9,  3.1,  5.1,  2.3],
       [ 6.4,  2.8,  5.6,  2.2],
       [ 5. ,  3.5,  1.6,  0.6],
       [ 7.9,  3.8,  6.4,  2. ],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 7.2,  3. ,  5.8,  1.6],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.4,  3.4,  1.5,  0.4],
       [ 6.5,  3. ,  5.2,  2. ],
       [ 6.4,  2.8,  5.6,  2.1],
       [ 6.5,  3. ,  5.8,  2.2]])
In [28]:
y_test.shape
Out[28]:
(75,)
In [29]:
# This is a 50 percent split:
# 50 percent for test and 50 percent for train.
# Understand the correct output BELOW -
# we will compare this with the predicted output.
y_test 
Out[29]:
array([2, 1, 0, 0, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 1, 1, 0, 1, 2, 1, 2, 0, 2,
       1, 1, 1, 0, 0, 1, 0, 2, 0, 1, 0, 2, 1, 1, 1, 2, 1, 1, 1, 0, 0, 0, 2,
       0, 1, 0, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 0, 2, 2, 2, 0, 2, 0, 2,
       0, 0, 0, 2, 2, 2])
In [30]:
import pickle
from sklearn.neighbors import KNeighborsClassifier
dataTiles_classifier = KNeighborsClassifier()

clf= dataTiles_classifier.fit( X_train, y_train)
In [31]:
# optional step
# Save the model in pickle 
pickle.dump( clf, open( "dtiomodel.p", "wb" ) )
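A model saved this way can be restored later with pickle.load and used exactly like the original. The sketch below is self-contained (it trains its own classifier on the full Iris data rather than reusing the split above) and round-trips it through the same "dtiomodel.p" file name:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
clf = KNeighborsClassifier().fit(iris.data, iris.target)

# Save the fitted model, then load it back from disk
with open("dtiomodel.p", "wb") as f:
    pickle.dump(clf, f)
with open("dtiomodel.p", "rb") as f:
    restored = pickle.load(f)

# The restored model makes identical predictions
assert (restored.predict(iris.data) == clf.predict(iris.data)).all()
```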
In [32]:
dataTiles_predictions = clf.predict(X_test)
dataTiles_predictions
Out[32]:
array([2, 1, 0, 0, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 1, 1, 0, 1, 2, 1, 2, 0, 2,
       1, 1, 2, 0, 0, 1, 0, 2, 0, 1, 0, 2, 1, 1, 1, 2, 2, 2, 1, 0, 0, 0, 2,
       0, 1, 0, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 0, 2, 2, 2, 0, 2, 0, 2,
       0, 0, 0, 2, 2, 2])
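Before computing an overall accuracy, it can help to see exactly which test samples the model got wrong. The self-contained sketch below redoes the 50/50 split with a fixed random_state (an assumption for reproducibility; the unseeded split above will give different indices) and lists the mismatch positions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# Indices where the predicted label disagrees with the true label
mismatches = np.flatnonzero(pred != y_test)
print(len(mismatches), "of", len(y_test), "test samples misclassified")
```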
In [36]:
# Give sepal_length, sepal_width, petal_length, petal_width
# and ask for the predicted species

new_pred = clf.predict([[7.1,3.5,1.4,4.2]])
new_pred 
Out[36]:
array([1])
In [37]:
# Give sepal_length, sepal_width, petal_length, petal_width
# and ask for the predicted species

new_pred1 = clf.predict([[6.5,  3. ,  5.8,  2.2]])
new_pred1
Out[37]:
array([2])
In [34]:
from sklearn.metrics import accuracy_score
print (accuracy_score(y_test, dataTiles_predictions))
0.96
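accuracy_score is simply the fraction of predictions that match the true labels; a small sketch with made-up label arrays verifies this equivalence:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy true/predicted labels (one mismatch at index 3)
y_true = np.array([2, 1, 0, 0, 2, 1])
y_pred = np.array([2, 1, 0, 1, 2, 1])

# accuracy_score == mean of elementwise matches (5 of 6 correct here)
assert accuracy_score(y_true, y_pred) == np.mean(y_true == y_pred)
print(round(accuracy_score(y_true, y_pred), 4))  # -> 0.8333
```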