A Naive Bayes SMS spam classifier written in Go.

Overview

Ham (SMS spam classifier)

Summary

The purpose of this project is to demonstrate a simple probabilistic SMS spam classifier in Go. This supervised learning task is accomplished with a Naive Bayes classifier, a simple algorithm that assumes conditional independence between words, counts word occurrences per class, turns those counts into probability ratios, and multiplies them together.

Implementation

Problem and Approach

Problem: Determine which class (C) a new message (M) belongs to. There are two classes, ‘ham’ and ‘spam’.

Approach:

  • Each word position in a message is an attribute
  • The value each attribute takes on is the word in that position

Smoothing: To keep a word that was never observed in a class during training from zeroing out that class's probability, a smoothing value of 1 is added to each word's frequency (Laplace smoothing). The denominator is adjusted to match, adding the size of the vocabulary to the number of entries in the class's word-frequency map.

// Probability converts one class's smoothed word counts into per-word probabilities.
func (wf WordFrequency) Probability(v Vocabulary) Probability {
	p := make(map[string]float64)
	for _, vocabWord := range v {
		p[vocabWord] = float64(wf[vocabWord]+1) / float64(len(wf)+len(v))
	}

	return p
}

Common structures and patterns: Lookups of any kind take advantage of Go's map type; this is used for word frequencies and probabilities. Floats are stored as 64-bit values to ensure sufficient precision and are truncated only when printed to the terminal.
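
For reference, a minimal sketch of how these lookup types might be declared (assumed shapes; the project's actual declarations may differ):

type Vocabulary []string // every distinct word seen during training

// Contains reports whether word appears in the vocabulary
// (a helper assumed by the classification snippets below).
func (v Vocabulary) Contains(word string) bool {
	for _, w := range v {
		if w == word {
			return true
		}
	}
	return false
}

type WordFrequency map[string]int   // word -> occurrence count within one class

type Probability map[string]float64 // word -> smoothed P(word | class)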

Classification Formula

This formula selects the class with the larger (maximum) probability for the SMS message. Each class's score is the prior probability that a message belongs to that class multiplied by the probability of each word appearing in a message of that class. The general form of this formula is below:

classification(M) = argmax over c ∈ {ham, spam} of P(c) · ∏ᵢ P(wᵢ | c), where wᵢ is the word in position i of the message
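
As a rough sketch of that decision rule (the function and parameter names below are illustrative assumptions, not the project's actual API), classification reduces to computing a score for each class and keeping the larger one:

// classify is an illustrative sketch of the argmax decision above.
// The parameter names are assumptions, not the project's actual API.
func classify(words []string, hamPrior, spamPrior float64,
	hamProb, spamProb map[string]float64) string {
	hamScore, spamScore := hamPrior, spamPrior
	for _, w := range words {
		if _, ok := hamProb[w]; !ok {
			continue // word never seen during training; skip it
		}
		hamScore *= hamProb[w]   // P(w | ham)
		spamScore *= spamProb[w] // P(w | spam)
	}
	if spamScore > hamScore {
		return "spam"
	}
	return "ham"
}

In practice the product is computed in log space, as described under Underflow Prevention below.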

Organization

This project is organized into an experiment package, an analysis package, the main package, and tests. These areas of the application have the following purposes:

  • Experiment: holding the shuffled SMS messages, organized separately by set (train or test) and class (spam or ham).
  • Analysis: holding calculations and statistics, the model trained for each class, and finally the test results.
  • main: connecting options to experiment code, making copies of the original experiment for preprocessor comparisons, and printing the final output in a user-friendly format.
  • tests: parser tests ensure that ingested data is understood correctly.

Underflow Prevention

Underflow can occur when many small numbers are multiplied together. Each individual operation rounds accurately, but once an intermediate product falls below the smallest representable floating-point value it rounds to zero, and further multiplication wipes out every other factor. This is especially common in Naive Bayes analysis of text because a large number of small probabilities are multiplied together.

The code below implements the relationship log(a * b) = log(a) + log(b) in order to prevent underflow.

for _, word := range words {
    // skip word if it isn't in the vocabulary
    if !a.TrainingSet.Vocabulary.Contains(word) {
        continue
    }

    // ham (add logs to prevent underflow of float with lots of multiplying)
    hamScore = hamScore + math.Log10(a.TrainingSet.Ham.WordProbabilities[word])

    // spam (add logs to prevent underflow of float with lots of multiplying)
    spamScore = spamScore + math.Log10(a.TrainingSet.Spam.WordProbabilities[word])
}
// unlog
hamScore = math.Pow(10, hamScore)
spamScore = math.Pow(10, spamScore)
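
As a standalone illustration (not part of the project) of why the log form matters: multiplying a few hundred small probabilities underflows to zero, while summing their base-10 logs stays easily representable.

package main

import (
	"fmt"
	"math"
)

func main() {
	direct, logSum := 1.0, 0.0
	for i := 0; i < 300; i++ {
		direct *= 1e-3             // 1e-900 is far below the smallest float64; underflows to 0
		logSum += math.Log10(1e-3) // stays around -900, easily representable
	}
	fmt.Println(direct, logSum) // prints 0 and approximately -900
}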

Experiments

Preprocessing: Remove Punctuation

Removing punctuation has almost no effect on the results. This might be due to the two classes tending to have similar punctuation. Also, many text messages don't use any punctuation.
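
A minimal sketch of what this preprocessor could look like (illustrative only; the project's implementation may differ):

import (
	"strings"
	"unicode"
)

// removePunctuation strips punctuation runes from a message. This is an
// illustrative sketch, not necessarily the project's exact implementation.
func removePunctuation(message string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) {
			return -1 // a negative rune drops the character
		}
		return r
	}, message)
}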

Preprocessing: Porter's Stemmer

The stemmer removes variations on words, combining them into one stem word. The idea behind this is that the concepts conveyed probably belong together in the analysis. However, there is a slight dip in accuracy (about a percentage point) when using the stemmer. This was one of the most surprising results to me. My theory is that the large quantity of slang and misspelling in SMS messages causes unexpected behavior in the stemmer. I would test this preprocessor again on other types of text.
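
A minimal sketch of a stemming preprocessor built on the go-porterstemmer library listed under Third Party Libraries below (the wiring shown here is an assumption):

import (
	"strings"

	porterstemmer "github.com/reiver/go-porterstemmer"
)

// stemMessage reduces each word to its Porter stem, e.g. "running" becomes "run".
// Illustrative sketch; the project's wiring may differ.
func stemMessage(message string) string {
	words := strings.Fields(message)
	for i, w := range words {
		words[i] = porterstemmer.StemString(w)
	}
	return strings.Join(words, " ")
}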

Preprocessing: Combination

I tested the combination of the stemmer and punctuation removal, assuming that each would help and that together they would do even better. Since the stemmer provided no accuracy uplift on its own, the combination was not helpful either.

Preprocessing: Remove 100 Most Common

This preprocessor removes the 100 most common English words from all messages. While accuracy did get a small lift from this adjustment, the difference was very small, and the improvement was mostly to spam identification, whereas more ham was incorrectly identified as spam. For this reason, I would not use this in a production environment.
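
A minimal sketch of such a filter (the word list below is a tiny hypothetical sample, not the project's actual 100-word list):

import "strings"

// commonWords is a tiny hypothetical sample; the project removes the 100
// most common English words.
var commonWords = map[string]bool{
	"the": true, "be": true, "to": true, "of": true, "and": true,
}

// removeCommonWords drops common words from a message. Illustrative sketch only.
func removeCommonWords(message string) string {
	var kept []string
	for _, w := range strings.Fields(message) {
		if !commonWords[strings.ToLower(w)] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}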

Train vs Test Set Ratio

Changing the relative sizes of the training and test sets (stored in a parser constant, const ratioToTrain = .75) did not have a large effect on the accuracy of the classifier. Even a drastic change from .75 to .25 only cost about one percentage point of accuracy. I assume this is because even the smaller training set is sufficiently representative to train a valid model. Since there did not seem to be a good reason to make this setting configurable, I left it as a constant instead of making it a command-line flag.
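
For illustration, splitting the shuffled messages by that constant could look something like this (a sketch; the project's parser may differ):

const ratioToTrain = .75

// split divides already-shuffled messages into training and test sets.
// Illustrative sketch; the project's parser may do this differently.
func split(messages []string) (train, test []string) {
	cut := int(float64(len(messages)) * ratioToTrain)
	return messages[:cut], messages[cut:]
}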

Third Party Libraries

github.com/fatih/color

This color library adds ANSI-standard color codes to terminal output, allowing easier visual separation of information in the command output.
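
A minimal usage sketch of the library (not the project's actual output code):

import "github.com/fatih/color"

// printAccuracy demonstrates the color library; illustrative only.
func printAccuracy(label string, accuracy float64) {
	color.New(color.FgGreen, color.Bold).Printf("%s: ", label)
	color.White("%.2f%% accurate", accuracy)
}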

github.com/reiver/go-porterstemmer

This project implements the Porter suffix-stripping (stemming) algorithm in Go. The stemmer's complexity exceeds that of this classifier, so it was a great candidate for a well-tested, specialized external library.

Analysis

Performance

Performance in the training phase is not a primary focus of this type of project; ideally, one can take as much time as needed to train a model. Applying the model matters more, and once a model exists it does not need to be recalculated. This application takes about 0.86 seconds to run through all 4180 messages with five different preprocessors on a computer built in 2013.

Output

Analysis: Default Analysis (no preprocessing)
Vocabulary has 13125 words

Training Set:
        564 of 4180 messages were spam (13.49%)

Test Set:
        Correct Ham: 1204
        Correct Spam: 136
        Incorrect Ham (actually was spam): 47
        Incorrect Spam (actually was ham): 7
        Percentage Correct Ham: 99.42%
        Percentage Correct Spam: 74.32%
        Overall Accuracy: 96.13%

Analysis: No Punctuation Analysis
Vocabulary has 9877 words

Training Set:
        564 of 4180 messages were spam (13.49%)

Test Set:
        Correct Ham: 1207
        Correct Spam: 137
        Incorrect Ham (actually was spam): 46
        Incorrect Spam (actually was ham): 4
        Percentage Correct Ham: 99.67%
        Percentage Correct Spam: 74.86%
        Overall Accuracy: 96.41%

Analysis: Stemmer Analysis
Vocabulary has 6972 words

Training Set:
        564 of 4180 messages were spam (13.49%)

Test Set:
        Correct Ham: 1208
        Correct Spam: 120
        Incorrect Ham (actually was spam): 63
        Incorrect Spam (actually was ham): 3
        Percentage Correct Ham: 99.75%
        Percentage Correct Spam: 65.57%
        Overall Accuracy: 95.27%

Analysis: Stemmer and No Punctuation Analysis
Vocabulary has 6953 words

Training Set:
        564 of 4180 messages were spam (13.49%)

Test Set:
        Correct Ham: 1208
        Correct Spam: 120
        Incorrect Ham (actually was spam): 63
        Incorrect Spam (actually was ham): 3
        Percentage Correct Ham: 99.75%
        Percentage Correct Spam: 65.57%
        Overall Accuracy: 95.27%

Analysis: Remove 100 Most Common English Words
Vocabulary has 6867 words

Training Set:
        564 of 4180 messages were spam (13.49%)

Test Set:
        Correct Ham: 1198
        Correct Spam: 153
        Incorrect Ham (actually was spam): 30
        Incorrect Spam (actually was ham): 13
        Percentage Correct Ham: 98.93%
        Percentage Correct Spam: 83.61%
        Overall Accuracy: 96.92%


Done.

Effectiveness and Accuracy

This application is fairly accurate at predicting the ham or spam class of an SMS message. If one guessed ham for every message, the accuracy of that static approach would be just over 86% due to the overall frequency of ham. This application, however, makes an informed prediction with roughly 95-97% accuracy depending on the preprocessor. Additionally, when mistakes are made, they are only rarely false positives for spam, so the user will not lose large quantities of valid messages. The user will still need to manually go through some spam, but the majority is caught.

License, Limitations, and Usage

I don't recommend using this code in production. I have licensed this code with an MIT license, so reuse is permissible.
