Machine Learning for Shill Bidding Classification Models
Abstract
As online auctions become more prevalent worldwide, they are increasingly targeted
by various types of cyber-crimes. In-auction fraud, such as shill bidding (SB), is
considered the most challenging to detect. SB has been recognized as the predominant
form of online auction fraud. It is difficult to identify due to its similarity to
normal bidding behavior. The complexity of finding and defining SB patterns makes
it resistant to discovery. Also, the unavailability of SB datasets that are based on
actual e-auctions makes the development of SB detection and classification models
challenging. Therefore, the prerequisite task that is necessary to perform, in order
to achieve our goals in this work, is to scrape a large number of eBay auctions of a
popular product, which we did successfully. After preprocessing the raw data, which
is a very difficult and time-consuming operation, we build a high-quality SB dataset
based on reliable SB strategies. One of our goals is to share the SB dataset with other
researchers, to provide them with an opportunity to test their prediction models based
on real fraud data.
Labeling multi-dimensional data is an essential yet difficult task in machine learning,
since the classification quality relies mainly on the quality of the data labels.
In the generated SB dataset, a record defines the behaviour of each bidder in each
auction, yet the records are not classified. To implement robust binary classification
models, the training instances must be efficiently categorized. So, another aim
of this study is to create a labelled SB dataset for SB detection systems based on
classification techniques. The capabilities of hierarchical clustering algorithms, such
as CURE, have been proven to be outstanding for isolating similar instances in a
group. Thus, we introduce a systematic labelling procedure based on CURE to
distinguish suspicious bidders' behaviour from the behaviour of normal bidders. The
experimental outcomes are very satisfactory, which indicates that clustering provides
excellent results. However, the labeled SB dataset is imbalanced. Class imbalance is
a serious issue that has been comprehensively studied, yet the experimental results
obtained and researchers' views differ on how to handle this issue. Some researchers
prefer data level techniques, while others prefer the algorithmic level. To overcome
the problem of imbalanced SB datasets, we investigate the most common techniques
used at the data level and the algorithmic level, which are over- and under-sampling
and cost-sensitive learning (CSL), respectively.
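
As a rough illustration of the two levels mentioned above, the following minimal sketch shows random over-sampling and under-sampling (data level) and inverse-frequency class costs (algorithmic level). It is not the procedure used in this work: the function names, the random duplication strategy, and the inverse-frequency cost formula are all assumptions chosen for brevity.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def inverse_frequency_costs(labels):
    """Assumed cost scheme: errors on rarer classes are weighted more heavily."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}
```

With nine normal records and one suspicious record, over-sampling yields nine of each class, under-sampling yields one of each, and the assumed cost scheme penalizes a misclassified suspicious record nine times as heavily as a misclassified normal one.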
An auction system can be viewed as a data stream application since thousands of
auctions and bids occur daily. Instance-incremental learning is known to be useful in
this type of application, since the model normalization is modified based on the arrival
of new instances for prediction. In this research, the feasibility of instance-incremental
classification is investigated, where the selected lazy classifiers are Locally Weighted
Learning, K* (instance-based), and K-Nearest Neighbours. Additionally, we consider
the Hoeffding Tree classifier, since it is also based on instance-incremental learning.
We developed several instance-incremental SB classifiers using data sampling and
CSL. According to the experimental results, incremental classification returns a high
performance for both over- and under-sampled SB datasets. However, over-sampling
slightly outperforms under-sampling for both normal and suspicious classes across
all four classifiers and quality metrics. Moreover, the predictions of instance-based
algorithms combined with CSL also return a high accuracy. We note that data
sampling is slightly superior to CSL for the suspicious class, in general.
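
To make the idea of instance-incremental classification concrete, here is a minimal sketch of a K-Nearest Neighbours learner that is updated one instance at a time and evaluated in a test-then-train fashion. It is an illustration only, not the classifiers built in this work: the class `IncrementalKNN`, its methods, and the `prequential_accuracy` helper are hypothetical names, and Euclidean distance with majority voting is an assumed configuration.

```python
import math
from collections import Counter


class IncrementalKNN:
    """Lazy learner: stores each labelled instance as it arrives."""

    def __init__(self, k=3):
        self.k = k
        self.memory = []  # list of (feature tuple, label) pairs

    def learn_one(self, x, y):
        # Updating the model is just appending the new instance.
        self.memory.append((tuple(x), y))

    def predict_one(self, x):
        if not self.memory:
            return None  # nothing learned yet
        x = tuple(x)
        nearest = sorted(self.memory, key=lambda item: math.dist(item[0], x))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


def prequential_accuracy(model, stream):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = tested = 0
    for x, y in stream:
        pred = model.predict_one(x)
        if pred is not None:
            tested += 1
            correct += pred == y
        model.learn_one(x, y)
    return correct / tested if tested else 0.0
```

In this sketch each record would be the feature vector of one bidder in one auction, so the model can keep learning as new auctions complete without being retrained from scratch.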