NaiveBayes_Algo

Naive Bayes

To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the illustration below. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.

In [7]:
from IPython.display import Image
Image(filename='NB3.png')
Out[7]:

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED.

In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen. Thus, we can write:

Prior Probability of GREEN: number of GREEN objects / total number of objects

Prior Probability of RED: number of RED objects / total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior Probability for GREEN: 40 / 60

Prior Probability for RED: 20 / 60
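
As a quick check of the arithmetic, the priors can be computed directly from the object counts. This is only a small sketch; the names `counts` and `priors` are chosen here for illustration.

```python
from fractions import Fraction

# Object counts taken from the example above.
counts = {"GREEN": 40, "RED": 20}
total = sum(counts.values())  # 60 objects in all

# Prior probability of a class = objects in that class / all objects.
priors = {label: Fraction(n, total) for label, n in counts.items()}

for label, p in priors.items():
    print(f"Prior probability for {label}: {p} ({float(p):.3f})")
```

The fractions reduce to 2/3 for GREEN and 1/3 for RED, which matches the "twice as likely" statement above.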

Having formulated our prior probability, we are now ready to classify a new object X (the WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects there are in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels.

In [8]:
from IPython.display import Image
Image(filename='NB4.png')
Out[8]:
In [9]:
from IPython.display import Image
Image(filename='NB5.png')
Out[9]:

From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

In [10]:
from IPython.display import Image
Image(filename='NB6.png')
Out[10]:
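
The figure is an image, so its formulas are not reproduced in the text. As a sketch, assume the likelihood of X given a class is estimated as the number of that class's objects inside the circle divided by the total number of objects in that class. The dictionary names below are illustrative only.

```python
from fractions import Fraction

class_totals = {"GREEN": 40, "RED": 20}  # objects per class in the whole sample
in_circle = {"GREEN": 1, "RED": 3}       # objects per class inside the circle around X

# Assumed estimate: likelihood of X given class = in-circle count / class total.
likelihoods = {
    label: Fraction(in_circle[label], class_totals[label])
    for label in class_totals
}

for label, value in likelihoods.items():
    print(f"Likelihood of X given {label}: {value}")
```

Under this assumption the likelihood of X given GREEN is 1/40 and the likelihood of X given RED is 3/20, so RED is favored by the neighborhood of X.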

Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN objects as RED), the likelihood indicates otherwise: that the class membership of X is RED (given that there are more RED objects than GREEN in the vicinity of X). In Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).

In [11]:
from IPython.display import Image
Image(filename='NB7.png')
Out[11]:

Finally, we classify X as RED since its class membership achieves the largest posterior probability.
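
The combination of prior and likelihood can be sketched numerically. The snippet assumes the likelihoods are 1/40 for GREEN and 3/20 for RED (one of 40 GREEN objects and three of 20 RED objects fall inside the circle); the variable names are illustrative.

```python
from fractions import Fraction

priors = {"GREEN": Fraction(40, 60), "RED": Fraction(20, 60)}
# Assumed likelihoods: in-circle count divided by class total.
likelihoods = {"GREEN": Fraction(1, 40), "RED": Fraction(3, 20)}

# Bayes' rule: posterior is proportional to prior * likelihood.
unnormalized = {c: priors[c] * likelihoods[c] for c in priors}

# Dividing by the sum turns the scores into probabilities that add up to 1.
evidence = sum(unnormalized.values())
posteriors = {c: unnormalized[c] / evidence for c in priors}

predicted = max(posteriors, key=posteriors.get)
print(unnormalized)
print(posteriors)
print("Predicted class:", predicted)
```

The unnormalized scores are 1/60 for GREEN and 1/20 for RED, so RED wins even though its prior is smaller. Normalizing does not change the ranking; it only rescales the scores.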

Source: based on StatSoft's Electronic Statistics Textbook, offered as a public service since 1995

Email Spam Classifier - Spam or ham?

We will use sklearn.naive_bayes to train a spam classifier!

In [12]:
import os
import io
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for every file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            # Headers end at the first blank line; everything after it is the body.
            inBody = False
            lines = []
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Each line keeps its own '\n', so joining on '\n' doubles the line breaks.
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    # Build one row per email, indexed by file path and labelled with the given class.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so concatenate the two frames instead.
data = concat([
    dataFrameFromDirectory('/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/', 'spam'),
    dataFrameFromDirectory('/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/', 'ham'),
])

Let's have a look at that DataFrame:

In [13]:
data.head()
Out[13]:
class message
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00001.7848dde101aa985090474a91ec93fcf0 spam <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00002.d94f1b97e48ed3b553b3508d116e6a09 spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00003.2ee33bc6eacdb11f38d052c44819ba6c spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00004.eac8de8d759b7e74154f142194282724 spam ##############################################...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00005.57696a39d7d84318ce497886896bf90d spam I thought you might like these:\n\n1) Slim Dow...
In [14]:
data
Out[14]:
class message
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00001.7848dde101aa985090474a91ec93fcf0 spam <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00002.d94f1b97e48ed3b553b3508d116e6a09 spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00003.2ee33bc6eacdb11f38d052c44819ba6c spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00004.eac8de8d759b7e74154f142194282724 spam ##############################################...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00005.57696a39d7d84318ce497886896bf90d spam I thought you might like these:\n\n1) Slim Dow...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00006.5ab5620d3d7c6c0db76234556a16f6c1 spam A POWERHOUSE GIFTING PROGRAM You Don't Want To...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00007.d8521faf753ff9ee989122f6816f87d7 spam Help wanted. We are a 14 year old fortune 500...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00008.dfd941deb10f5eed78b1594b131c9266 spam <html>\n\n<head>\n\n<title>ReliaQuote - Save U...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00009.027bf6e0b0c4ab34db3ce0ea4bf2edab spam TIRED OF THE BULL OUT THERE?\n\nWant To Stop L...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00010.445affef4c70feec58f9198cfbc22997 spam Dear ricardo1 ,\n\n\n\n<html>\n\n<body>\n\n<ce...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00011.61816b9ad167657773a427d890d0468e spam Cellular Phone Accessories All At Below Wholes...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00012.381e4f512915109ba1e0853a7a8407b2 spam <table width="600" border="20" align="center" ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00013.d3f0b591a65f116ea5d9d4ad919f83aa spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00014.7d38c46424f24fc8012ac15a95a2ac14 spam <HTML><HEAD><TITLE>FREE Motorola Cell Phone wi...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00015.048434ab64c86cf890eda1326a5643f5 spam <HTML><HEAD><TITLE>Lowest Rate Services</TITLE...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00016.67fb281761ca1051a22ec3f21917e7c0 spam \n\n\n\nWant to watch Sporting Events?--Movies...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00017.1a938ecddd047b93cbd7ed92c241e6d1 spam Help wanted. We are a 14 year old fortune 500...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00018.5b2765c42b7648d41c93b9b27140b23a spam DEAR FRIEND,I AM MRS. SESE-SEKO WIDOW OF LATE...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00019.bbc97ad616ffd06e93ce0f821ca8c381 spam Lowest rates available for term life insurance...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00020.29725cf331fc21e18a1809e7d8b27332 spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00021.effe1449462a9d7ad7af0f1c94b1a237 spam CENTRAL BANK OF NIGERIA\n\nFOREIGN REMITTANCE ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00022.8203cdf03888f656dc0381701148f73d spam --===_SecAtt_000_1fheucnqggtggp\n\nContent-Typ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00023.b6d27c684f5fc803cfa1060adb2d0805 spam ------=_NextPart_000_00B2_83B03D1E.C6530E24\n\...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00024.6b5437b14d403176c3f046c871b5b52f spam This is a multi-part message in MIME format.\n...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00025.619ab8051359048795e3cd09e82ad1a0 spam <HTML><HEAD>\n\n<META http-equiv=3DContent-Typ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00026.da18dbed27ae933172f7a70f860c6ad0 spam DEAR FRIEND,I AM MRS. SESE-SEKO WIDOW OF LATE...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00027.d1d0f97e096fe08fc80a4939355759e7 spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00028.ace98eff213f4e6314b5571aece625e1 spam <HTML><HEAD><TITLE>MILFhunter</TITLE>\n\n<META...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00029.de865ad8d5ad0df985ae2f72388befba spam <html>\n\n<head>\n\n</head>\n\n<body>\n\n\n\n<...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/00030.0c9cdd9d4025bd55dac02719ec8d29dc spam <html>\n\n\n\n<head>\n\n<meta http-equiv=3D"Co...
... ... ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02471.18281d43dc0775e915267c2ea5170f1f ham This is possible, however using SA as a block ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02472.5c879dd55c3d4171e1787e8529bbd7e1 ham \n\n--- Martin Adamson <martin@srv0.ems.ed.ac....
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02473.207afa13ad7d745dfd1344f84531ac16 ham ----- Original Message -----\n\nFrom: "Tim Cha...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02474.c76ffef81a2529389e6c3bbb172184d7 ham \n\n> Mr Tim Chapman, freelance gentleman of l...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02475.9277ee243e3f51fa53ed6be55798d360 ham Smith, Graham - Computing Technician wrote:\n\...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02476.de1d459426662492dd1235046b504c3d ham Geege wrote a strange story:\n\n>I know a guy ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02477.07b2069e9827cfd6f97d07eea2913d57 ham \n\n[Paul Moore]\n\n> but let's walk before...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02478.40723f38488bddaf5a24ef2a91679c75 ham On Mon, Nov 25, 2002 at 06:54:49PM +0000, Phil...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02479.14365bcad3a60fcf24c5c1813f6291fb ham \n\nI don't know how one can expect better and...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02480.72714df60c9be29d6f7985c777cbfc13 ham No, you need to learn how declarations work in...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02481.176b368fe4b90682f33647d65a8b97a3 ham \n\n Richie> As I understand it, post-1.8x ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02482.35c166ec6a85e108ad693ea43329762f ham \n\n Paul> I suspect the best answer is to ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02483.ab1bee02c10ddecc0e86c39eaebc2996 ham The Times\n\n\n\n \n\n December 04, 2002 \n\n ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02484.32a0bca2600788be144b93cae341efbf ham I have to say I was surprised about Jacko dang...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02485.ba9aebbdbec0d9fecec595eeebe5db87 ham Now then I recently read a novel about exactly...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02486.bdf90e871b673fd14f47f3fe36622742 ham What the hell is it with these mini remote con...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02487.c2e725d509201dc30debb7bd94d07f5e ham here, for your enjoyment, is a little somethin...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02488.68fed64ff8169f1505b74080bb7b6158 ham Sean O'Donnell wrote:\n\n> Doesnt answer your ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02489.85c20a6f9d75714d9f44398baeddd416 ham Joe McNally writes:\n\n\n\n> What the hell is ...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02490.7be0f683db6994ddd8445cdcc2eb5042 ham http://news.bbc.co.uk/1/hi/world/europe/254182...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02491.c26245be2a5096fa86647d594561c511 ham Hi all.\n\nDoes anyone know how to set up dual...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02492.6aede44f654a1bbc60c95c7dd770e624 ham Carlos Luna wrote:\n\n\n\n>Hi all.\n\n>Does an...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02493.f9f2870094430b7db8b0c1052b302cf1 ham Hi all\n\n\n\n\n\nI have a prob when trying to...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02494.a14f2d3a9bef3f59aa419b03aee8f871 ham Tim Chapman writes:\n\n\n\n> http://news.bbc.c...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02495.5064946e77b3046873da91fc47656465 ham > I had the same problem when installing Win o...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02496.aae0c81581895acfe65323f344340856 ham Man killed 'trying to surf' on Tube train \n\n...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02497.60497db0a06c2132ec2374b2898084d3 ham Hi Gianni,\n\n\n\nA very good resource for thi...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02498.09835f512f156da210efb99fcc523e21 ham Gianni Ponzi wrote:\n\n> I have a prob when tr...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02499.b4af165650f138b10f9941f6cc5bce3c ham Neale Pickett <neale@woozle.org> writes:\n\n\n...
/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/02500.05b3496ce7bca306bed0805425ec8621 ham \n\nHi,\n\n\n\nI think you need to give us a l...

3000 rows × 2 columns

Now we will use a CountVectorizer to turn each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [15]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
In [16]:
counts
Out[16]:
<3000x62964 sparse matrix of type '<type 'numpy.int64'>'
	with 429785 stored elements in Compressed Sparse Row format>
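
The matrix above has one row per message and one column per distinct token. To make that concrete, here is a pure-Python sketch of the same bag-of-words idea on three toy sentences. It is not scikit-learn's implementation; it only mimics CountVectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters. The names `tokenize`, `vocabulary`, and `to_counts` are hypothetical helpers introduced for this illustration.

```python
import re
from collections import Counter

# Mimics CountVectorizer's defaults: lowercase, tokens of 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

docs = ["Free money now", "Are you free for golf", "Money money money"]

# Each distinct token gets a column index, assigned in alphabetical order.
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_counts(text):
    """Return a dense count vector; tokens outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

The real matrix is stored in sparse form because almost every entry is zero: a single email uses only a tiny fraction of the 62964 tokens.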
In [17]:
classifierModel = MultinomialNB()

# The 'class' column (spam or ham) is the target to predict.
targets = data['class'].values

# Fit the model on the word counts.
classifierModel.fit(counts, targets)
Out[17]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
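
The parameters shown above mean that class priors are learned from the data (fit_prior=True) and that word counts are smoothed by adding alpha=1.0 to each count. The following pure-Python sketch shows what such a model computes on a four-message toy corpus. It is a simplified illustration, not scikit-learn's code; `train` and `predict` are hypothetical helper names, and words are split on whitespace for brevity.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    # Prior: fraction of training messages in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}

    # Likelihood: (count of word in class + alpha) / (words in class + alpha * |vocab|).
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict(text, log_prior, log_like):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        known = [w for w in text.split() if w in log_like[c]]
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in known)
    return max(scores, key=scores.get)

docs = ["free money now", "free prize money",
        "lunch meeting tomorrow", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_like = train(docs, labels)
print(predict("free money", log_prior, log_like))
print(predict("meeting tomorrow", log_prior, log_like))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that never appeared in one class from forcing that class's score to zero.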

Try these example emails.

In [18]:
examples = ['Free Viagra now!!!', 
            "A quick brown fox is not ready",
            "Could you bring me the black coffee as well?",
            "Hi Bob, how about a game of golf tomorrow, are you FREE?", 
            "I am FREE now, you can come", 
            "FREE FREE FREE Sex", 
            "CENTRAL BANK OF NIGERIA has 100 Million for you",
            "I am not available today, meet sunday?"]


# Use transform (not fit_transform) so the examples are encoded with the
# vocabulary learned from the training emails.
example_counts = vectorizer.transform(examples)

predictions = classifierModel.predict(example_counts)
predictions
Out[18]:
array(['spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham'], 
      dtype='|S4')