# Naive Bayes

To demonstrate the concept of Naive Bayes classification, consider the example displayed in the illustration below. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., to decide to which class label they belong, based on the currently existing objects.

```
from IPython.display import Image
Image(filename='NB3.png')
```

Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which has not been observed yet) is twice as likely to belong to GREEN as to RED.

In Bayesian analysis, this belief is known as the **prior probability**. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen. Thus, we can write:

Prior Probability of GREEN: number of GREEN objects / total number of objects

Prior Probability of RED: number of RED objects / total number of objects

Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior Probability for GREEN: 40 / 60

Prior Probability for RED: 20 / 60
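These priors are trivial arithmetic; a quick sketch using the counts from the illustration:

```python
# Counts of objects from the illustration: 40 GREEN, 20 RED
n_green, n_red = 40, 20
total = n_green + n_red

prior_green = n_green / total  # 40/60, about 0.667
prior_red = n_red / total      # 20/60, about 0.333

print(prior_green, prior_red)
```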

Having formulated our prior probabilities, we are now ready to classify a new object (the WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels.

```
from IPython.display import Image
Image(filename='NB4.png')
```

```
from IPython.display import Image
Image(filename='NB5.png')
```

From the illustration above, it is clear that the likelihood of X given GREEN is smaller than the likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

```
from IPython.display import Image
Image(filename='NB6.png')
```

Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN objects as RED), the likelihood indicates otherwise: the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after the Rev. Thomas Bayes, 1702-1761).

```
from IPython.display import Image
Image(filename='NB7.png')
```

Finally, we classify X as RED since its class membership achieves the largest posterior probability.
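The arithmetic behind that conclusion can be checked directly, using the priors and the circle counts from the illustrations above (unnormalized posteriors are enough for comparison):

```python
# Priors from the 60 objects (40 GREEN, 20 RED)
prior_green, prior_red = 40 / 60, 20 / 60

# Likelihoods: the circle around X contains 1 of the 40 GREEN
# objects and 3 of the 20 RED objects
like_green, like_red = 1 / 40, 3 / 20

# Unnormalized posteriors via Bayes' rule
post_green = prior_green * like_green  # 1/60, about 0.017
post_red = prior_red * like_red        # 1/20 = 0.05

print('RED' if post_red > post_green else 'GREEN')  # RED
```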

Source: based on StatSoft's Electronic Statistics Textbook (a public service since 1995).

# Email Spam Classifier - Spam or ham?

We will use `sklearn.naive_bayes` to train a spam classifier!

```
import os
import io
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (path, body) for each email,
    # skipping the headers (everything before the first blank line)
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            inBody = False
            lines = []
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)
            yield filepath, message

def dataFrameFromDirectory(path, classification):
    # Build a DataFrame of message bodies labeled with the given class
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0; concat does the same job
data = concat([
    dataFrameFromDirectory('/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/spam/', 'spam'),
    dataFrameFromDirectory('/Users/sudhirwadhwa/Desktop/INTELFINALBUNDLE/Day2_NaiveBayes_Algo/emails/ham/', 'ham'),
])
```

Let's have a look at that DataFrame:

```
data.head()
```

```
data
```

Now we will use a `CountVectorizer` to split up each message into its list of words, and throw the resulting counts into a `MultinomialNB` classifier. Call `fit()` and we've got a trained spam filter ready to go! It's just that easy.

```
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
```

```
counts
```

```
classifierModel = MultinomialNB()
# The 'class' column (spam/ham) is the target we want to predict
targets = data['class'].values
# Train on the word counts and the labels
classifierModel.fit(counts, targets)
```
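Since the email directories above live on the author's machine, here is a self-contained sketch of the same fit/predict pipeline on a tiny made-up corpus (the messages and labels are hypothetical, chosen only to illustrate the mechanics):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature corpus, standing in for the email directories
messages = ["win free money now", "free cash prize waiting",
            "lunch meeting tomorrow", "see you at the game"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(messages), labels)

# "free" and "money" only occur in the spam messages
print(clf.predict(vec.transform(["free money"])))  # ['spam']
```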

Try these example emails.

```
examples = ['Free Viagra now!!!',
            "A quick brown fox is not ready",
            "Could you bring me the black coffee as well?",
            "Hi Bob, how about a game of golf tomorrow, are you FREE?",
            "I am FREE now, you can come",
            "FREE FREE FREE Sex",
            "CENTRAL BANK OF NIGERIA has 100 Million for you",
            "I am not available today, meet sunday?"]
example_counts = vectorizer.transform(examples)
predictions = classifierModel.predict(example_counts)
predictions
```