A Cost-Sensitive Approach to Ternary Classification
Bayesian inference and rough set theory provide two approaches to data analysis. The two theories are closely connected, as both use probabilities to express uncertainty and knowledge about data, and several proposals have been made to apply Bayesian approaches to rough sets. This thesis draws on results from two probabilistic rough set models, namely decision-theoretic rough set models (DTRSM) and confirmation-theoretic rough set models (CTRSM), to propose a new Bayesian rough set model (BRSM) for cost-sensitive ternary classification. I argue that although the two classes of models share many similarities, in that both use Bayes’ theorem and a pair of thresholds to produce three regions, their semantic interpretations, and hence their intended applications, are different. By integrating the two, I propose a unified model of Bayesian rough sets and apply it to ternary classification. In developing the Bayesian rough set model, I focus on three fundamental issues: the interpretation and calculation of the pair of thresholds, the estimation of probabilities, and the interpretation of the three regions used by rough set theory. Email spam filtering is used as a real-world application to show the usefulness of the proposed model. Instead of treating email spam filtering as a binary classification problem, I argue that a three-way decision approach gives users a more meaningful way to handle their incoming emails with caution. A three-way spam filtering system produces three email folders instead of two: a suspected folder is added so that users can further examine suspicious emails, thereby reducing the misclassification rate.
In contrast to other ternary email spam filtering methods, my approach focuses on issues that have received less attention in previous work: the computation of the thresholds required to define the three email categories, and the interpretation of the cost-sensitive characteristics of spam filtering. Instead of having users supply the thresholds based on their intuitive tolerance for errors, I systematically calculate the thresholds based on the decision-theoretic rough set model. The cost of making a decision is interpreted as the loss function of Bayesian decision theory, and the final decision is the one for which the overall cost is minimum. Experimental results on several benchmark datasets show that the new approach reduces the rate at which legitimate emails are misclassified as spam and demonstrates better performance from a cost perspective. Finally, I propose and investigate two extensions of the basic model: one concerns multi-class classification and the other multi-stage ternary classification. These two extensions make the model more applicable to real-world problems.
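
As an illustration only, the threshold calculation and minimum-cost decision described above can be sketched as follows. The sketch assumes the standard decision-theoretic rough set formulation, in which a pair of thresholds (alpha, beta) is derived from six loss values. The class and function names, the choice of spam as the positive class, and the loss values themselves are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Losses:
    """Hypothetical loss table. Each field is the cost of filing an email
    into a folder (spam / suspected / legitimate) given its true state."""
    spam_if_spam: float        # lambda_PP: correct spam decision
    suspected_if_spam: float   # lambda_BP: spam deferred to the suspected folder
    legit_if_spam: float       # lambda_NP: spam delivered as legitimate
    spam_if_legit: float       # lambda_PN: legitimate email filed as spam
    suspected_if_legit: float  # lambda_BN: legitimate email deferred
    legit_if_legit: float      # lambda_NN: correct legitimate decision


def thresholds(l: Losses) -> tuple[float, float]:
    """Return (alpha, beta) using the standard DTRS threshold formulas."""
    alpha = (l.spam_if_legit - l.suspected_if_legit) / (
        (l.spam_if_legit - l.suspected_if_legit)
        + (l.suspected_if_spam - l.spam_if_spam)
    )
    beta = (l.suspected_if_legit - l.legit_if_legit) / (
        (l.suspected_if_legit - l.legit_if_legit)
        + (l.legit_if_spam - l.suspected_if_spam)
    )
    return alpha, beta


def expected_costs(p_spam: float, l: Losses) -> dict[str, float]:
    """Expected cost of each folder given the estimated probability of spam."""
    p_legit = 1.0 - p_spam
    return {
        "spam": l.spam_if_spam * p_spam + l.spam_if_legit * p_legit,
        "suspected": l.suspected_if_spam * p_spam + l.suspected_if_legit * p_legit,
        "legitimate": l.legit_if_spam * p_spam + l.legit_if_legit * p_legit,
    }


def decide(p_spam: float, l: Losses) -> str:
    """Choose the folder with the minimum expected cost."""
    costs = expected_costs(p_spam, l)
    return min(costs, key=costs.get)


# Assumed losses: filing a legitimate email as spam is by far the costliest error.
example_losses = Losses(
    spam_if_spam=0.0, suspected_if_spam=1.0, legit_if_spam=2.0,
    spam_if_legit=10.0, suspected_if_legit=1.0, legit_if_legit=0.0,
)
alpha, beta = thresholds(example_losses)
for p in (0.95, 0.70, 0.20):
    print(p, decide(p, example_losses))
```

With these assumed losses, an email is filed as spam only when its estimated spam probability is at least alpha, delivered as legitimate when the probability is at most beta, and placed in the suspected folder otherwise. Raising the cost of filing a legitimate email as spam pushes alpha upward, which is one way to read the cost-sensitive behaviour described above.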