Cost-Sensitive and Semi-Supervised Learning for Fraud
Abstract
Given the magnitude of e-auction transactions, it becomes challenging to safeguard
consumers from dishonest sellers, such as shill bidders. Shill Bidding (SB) is a
predominant auction fraud that is driven by modern-day technologies and clever
scammers. The difficulty of identifying the behavior of sophisticated fraudsters
and the scarcity of training data hinder research on SB detection. The first
part of this thesis addresses these two problems. We first
define two new SB patterns and implement other existing SB patterns. Next, we
develop a reliable SB dataset. This development requires crawling commercial
auctions and bidder history, preprocessing the raw data, and detecting outliers.
Because labeling the multi-dimensional SB dataset is difficult, the second part
of the thesis investigates Semi-Supervised Classification (SSC), which requires
labeling only a few SB samples. To that end, we combine two data clustering
methods and define an anomaly detection approach based on the SB scores of
bidders in combination with the Three Sigma Rule.
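
As a rough illustration of the Three Sigma Rule mentioned above, the sketch
below flags bidders whose score lies more than three standard deviations above
the mean. It is not the thesis's procedure: the function name, the example
scores, and the choice to flag only the upper tail are assumptions made for
illustration.

```python
from statistics import mean, pstdev


def three_sigma_outliers(scores):
    """Return the indices of scores above mean + 3 * standard deviation.

    Assumption: a higher SB score means more suspicious behavior, so only
    the upper tail is flagged.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    upper = mu + 3 * sigma
    return [i for i, s in enumerate(scores) if s > upper]


# Hypothetical SB scores: 19 ordinary bidders and one extreme bidder.
scores = [0.10] * 19 + [0.95]
flagged = three_sigma_outliers(scores)  # index of the extreme bidder
```

With very small samples a single extreme value inflates the standard deviation
enough to hide itself, so the rule is only meaningful on reasonably large
groups of bidders.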
Our experimental analysis of several SSC models demonstrates that combining
unlabeled SB data with a few labeled data improves the predictive performance
of the supervised SB models. The SSC models accurately differentiate between
normal and shill bidders. Additionally, the learning curves of the models show
that the smaller the labeled SB data, the more effective the models are.
Nevertheless, SSC models treat the misclassification errors of all classes
alike: identifying a fraudster as a normal bidder is considered no riskier than
classifying a normal bidder as a fraudster. The third part of the thesis
addresses this problem with MetaCost learning.
Moreover, we propose an ensemble approach that combines cost-sensitive and
semi-supervised classification to handle imbalanced data without modifying the
original SB training dataset and to minimize the misclassification errors of
the fraud class. We develop several ensemble SB models that reduce the incorrect
predictions of the fraud class.
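
To make the cost-sensitive idea concrete, the sketch below shows the
minimum-expected-cost decision rule that MetaCost uses when relabeling training
examples. It covers only that rule, not MetaCost's bagging step or the ensemble
models of the thesis. The cost values, class indices, and function name are
hypothetical.

```python
def min_expected_cost_label(probs, cost):
    """Pick the class with the lowest expected misclassification cost.

    probs[j]   : estimated probability that the true class is j
    cost[i][j] : cost of predicting class i when the true class is j
    """
    expected = [
        sum(cost[i][j] * probs[j] for j in range(len(probs)))
        for i in range(len(cost))
    ]
    return min(range(len(expected)), key=expected.__getitem__)


NORMAL, SHILL = 0, 1

# Hypothetical cost matrix: missing a shill bidder (predict NORMAL when the
# truth is SHILL) is assumed to cost ten times as much as a false alarm.
COST = [
    [0, 10],  # predict NORMAL
    [1, 0],   # predict SHILL
]

# A bidder judged 80% likely to be normal is still labeled SHILL, because
# the expected cost of predicting NORMAL (2.0) exceeds that of SHILL (0.8).
label = min_expected_cost_label([0.8, 0.2], COST)
```

With equal costs for both error types the same bidder would be labeled normal,
which is the behavior the SSC models exhibit on their own.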