Incorporating Three-Way Email Spam Filtering With Game-Theoretic Rough Sets
Abstract
Email spam filtering is commonly viewed as a binary classification problem, that is,
classifying incoming email messages as spam or non-spam. This view has two
main limitations. First, binary classification forces a definite decision on every
message, even when that decision is hard to make. Second, binary classification tends to
produce a high misclassification rate: decreasing the rate at which legitimate email is
misclassified as spam may increase the rate at which spam is misclassified as
legitimate email. Three-way email spam filtering can
reduce misclassification by dividing incoming email messages into three folders:
a spam folder containing junk messages, an inbox (legitimate
folder) containing messages to be read, and a suspected folder containing
messages for which no definite decision can be made from the available information.
Three-way email spam filtering results from three-way classification.
Rough set theory is used in data analysis and provides a way to make decisions
with incomplete or insufficient
knowledge. Three-way classification builds on rough set theory:
it is constructed from three kinds of decisions, namely acceptance,
rejection, and deferment. Decisions are induced from the acceptance and rejection
regions, while no decision is made in the deferment region until additional
information is collected. Classical (Pawlak) rough sets are too strict to tolerate classification errors in the
positive (acceptance) and negative (rejection) regions. Probabilistic rough sets relax
this requirement and allow more objects to be included in the
positive and negative regions. However, probabilistic rough sets do not by themselves
determine the boundaries of the positive, negative, and boundary regions. In other words,
the key question is how to determine the thresholds.
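The role of the thresholds can be illustrated with a minimal sketch. It assumes that each email already has an estimated probability of being spam and that a pair of thresholds, here called alpha and beta, is given; the function name, the folder labels, and the example values are hypothetical and are not taken from the thesis.

```python
def three_way_classify(prob_spam: float, alpha: float, beta: float) -> str:
    """Assign an email to a folder from its estimated spam probability.

    alpha is the acceptance threshold and beta the rejection threshold
    (assumed to satisfy 0 <= beta <= alpha <= 1).
    """
    if not 0.0 <= beta <= alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta <= alpha <= 1")
    if prob_spam >= alpha:
        return "spam"        # positive (acceptance) region
    if prob_spam <= beta:
        return "legitimate"  # negative (rejection) region
    return "suspected"       # boundary (deferment) region


# Illustrative probabilities only, not experimental data.
for p in (0.95, 0.60, 0.10):
    print(p, three_way_classify(p, alpha=0.8, beta=0.2))
```

With alpha = 1 and beta = 0 the sketch behaves like the strict case described above: only emails with probability exactly 1 or 0 receive a definite decision, and every other email is deferred.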
Game-theoretic rough sets (GTRS) determine three-way classifications from a
tradeoff perspective when multiple criteria are involved in evaluating the three-way
classification model. GTRS formulate games to obtain a tradeoff between the
criteria. Balanced thresholds for three-way classification can be induced from the
equilibrium of the games.
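As a rough illustration of what an equilibrium means in this setting, the sketch below searches a two-player payoff table for pure-strategy Nash equilibria, that is, strategy pairs from which neither player can improve by deviating alone. The strategy names and the payoff numbers are hypothetical toy values chosen for illustration; they are not the games or results of the thesis.

```python
from typing import Dict, List, Tuple

# Maps (strategy of player 1, strategy of player 2) to (payoff 1, payoff 2).
Payoffs = Dict[Tuple[str, str], Tuple[float, float]]


def pure_nash_equilibria(payoffs: Payoffs) -> List[Tuple[str, str]]:
    """Return strategy pairs where neither player gains by deviating alone."""
    rows = sorted({a for a, _ in payoffs})
    cols = sorted({b for _, b in payoffs})
    equilibria = []
    for a in rows:
        for b in cols:
            u1, u2 = payoffs[(a, b)]
            # Player 1 cannot do better by switching rows against column b.
            best_for_p1 = all(payoffs[(a2, b)][0] <= u1 for a2 in rows)
            # Player 2 cannot do better by switching columns against row a.
            best_for_p2 = all(payoffs[(a, b2)][1] <= u2 for b2 in cols)
            if best_for_p1 and best_for_p2:
                equilibria.append((a, b))
    return equilibria


# Toy table: player 1 stands for one criterion and player 2 for another.
# The numbers are made up for illustration.
toy_game: Payoffs = {
    ("keep_alpha", "keep_beta"): (0.95, 0.60),
    ("keep_alpha", "raise_beta"): (0.90, 0.75),
    ("lower_alpha", "keep_beta"): (0.88, 0.78),
    ("lower_alpha", "raise_beta"): (0.82, 0.90),
}

print(pure_nash_equilibria(toy_game))  # [('keep_alpha', 'raise_beta')]
```

In this toy table the only equilibrium is the pair in which the first player keeps its threshold and the second raises its threshold; the thresholds associated with such a pair would play the role of the balanced thresholds.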
In this thesis, GTRS are applied to three-way email spam filtering to obtain a
suitable tradeoff between accuracy and coverage. Competitive games are formulated
between accuracy and coverage to obtain the probabilistic thresholds of the three-way
classification. Experimental results on the Spambase dataset show that three-way
email spam filtering based on GTRS improves the coverage level compared
with Pawlak rough sets and the 0.5-probabilistic model.
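The tension between the two criteria can be sketched as follows. The definitions used here are assumptions based on a common formulation, not quoted from the thesis: accuracy is the fraction of decided emails that are decided correctly, and coverage is the fraction of all emails that receive a definite decision. The function name and the toy data are hypothetical.

```python
from typing import List, Tuple


def accuracy_and_coverage(
    emails: List[Tuple[float, bool]], alpha: float, beta: float
) -> Tuple[float, float]:
    """Compute (accuracy, coverage) for thresholds alpha >= beta.

    Each email is a pair (estimated spam probability, is actually spam).
    """
    # Emails in the positive or negative region get a definite decision.
    decided = [(p, y) for p, y in emails if p >= alpha or p <= beta]
    if not decided:
        return 0.0, 0.0
    # A decision is correct when "predicted spam" matches the true label.
    correct = sum(1 for p, y in decided if (p >= alpha) == y)
    return correct / len(decided), len(decided) / len(emails)


# Toy data for illustration only, not drawn from Spambase.
toy_emails = [(0.95, True), (0.85, True), (0.70, False),
              (0.55, True), (0.40, False), (0.10, False)]

print(accuracy_and_coverage(toy_emails, alpha=0.9, beta=0.2))
print(accuracy_and_coverage(toy_emails, alpha=0.6, beta=0.5))
```

On this toy data the wide deferment region (alpha = 0.9, beta = 0.2) decides only two of the six emails, both correctly, whereas the narrow one (alpha = 0.6, beta = 0.5) decides five emails but gets one of them wrong. This is the kind of tradeoff the games are meant to balance.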
It is hoped that this thesis will contribute to the understanding of GTRS and
three-way email spam filtering.