Machine Learning for Shill Bidding Classification Models

Date
2019-09
Authors
Alzahrani, Ahmad Atea
Publisher
Faculty of Graduate Studies and Research, University of Regina
Abstract

As online auctions become more prevalent worldwide, they are increasingly targeted by various types of cyber-crimes. In-auction fraud, such as shill bidding (SB), is considered the most challenging to detect. SB has been recognized as the predominant form of online auction fraud. It is difficult to identify due to its similarity to normal bidding behavior. The complexity of finding and defining SB patterns makes it resistant to discovery. Also, the unavailability of SB datasets that are based on actual e-auctions makes the development of SB detection and classification models challenging. Therefore, the prerequisite task that is necessary to perform, in order to achieve our goals in this work, is to scrape a large number of eBay auctions of a popular product, which we did successfully. After preprocessing the raw data, which is a very difficult and time-consuming operation, we build a high-quality SB dataset based on reliable SB strategies. One of our goals is to share the SB dataset with other researchers, to provide them with an opportunity to test their prediction models based on real fraud data. Labeling multi-dimensional data is an essential yet difficult task in machine learning, since the classification quality relies mainly on the quality of the data labels. In the generated SB dataset, a record defines the behaviour of each bidder in each auction, yet the records are not classified. To implement robust binary classification models, the training instances must be efficiently categorized. So, another aim of this study is to create a labelled SB dataset for SB detection systems based on classification techniques. The capabilities of hierarchical clustering algorithms, such as CURE, have been proven to be outstanding for isolating similar instances in a group. Thus, we introduce a systematic labelling procedure based on CURE to distinguish suspicious bidders' behaviour from the behaviour of normal bidders.
The experimental outcomes are very satisfactory, which indicates that clustering provides excellent results. However, the labeled SB dataset is imbalanced. Class imbalance is a serious issue that has been comprehensively studied, yet the experimental results obtained and researchers' views differ on how to handle this issue. Some researchers prefer data-level techniques, while others prefer the algorithmic level. To overcome the problem of imbalanced SB datasets, we investigate the most common techniques used at the data level and the algorithmic level, which are over- and under-sampling and cost-sensitive learning (CSL), respectively. An auction system can be viewed as a data stream application since thousands of auctions and bids occur daily. Instance-incremental learning is known to be useful in this type of application, since the model normalization is modified based on the arrival of new instances for prediction. In this research, the feasibility of instance-incremental classification is investigated, where the selected lazy classifiers are Locally Weighted Learning, K* instance-based, and K-Nearest Neighbours. Additionally, we consider the Hoeffding Tree classifier, since it is also based on instance-incremental learning. We developed several instance-incremental SB classifiers using data sampling and CSL. According to the experimental results, incremental classification returns a high performance for both over- and under-sampled SB datasets. However, over-sampling slightly outperforms under-sampling for both normal and suspicious classes across all four classifiers and quality metrics. Moreover, the predictions on instance-based algorithms combined with CSL also return a high accuracy. We note that data sampling is slightly superior to CSL for the suspicious class, in general.
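
The data-level techniques named in the abstract, over- and under-sampling, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes plain random sampling, and the function names, the two-feature toy records, and the 8-to-2 class ratio are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(records, labels):
    """Collect records into one list per class label."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    return groups


def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen records until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(records, labels)
    target = max(len(members) for members in groups.values())
    pairs = []
    for label, members in groups.items():
        extras = [rng.choice(members) for _ in range(target - len(members))]
        pairs.extend((record, label) for record in members + extras)
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]


def random_undersample(records, labels, seed=0):
    """Randomly drop records until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(records, labels)
    target = min(len(members) for members in groups.values())
    pairs = []
    for label, members in groups.items():
        kept = rng.sample(members, target)  # sampling without replacement
        pairs.extend((record, label) for record in kept)
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]


# Hypothetical toy data: 8 "normal" and 2 "suspicious" bidder records.
records = [(0.1 * i, 0.05 * i) for i in range(10)]
labels = ["normal"] * 8 + ["suspicious"] * 2

over_x, over_y = random_oversample(records, labels)
under_x, under_y = random_undersample(records, labels)
print(Counter(over_y))   # class counts after over-sampling
print(Counter(under_y))  # class counts after under-sampling
```

Over-sampling keeps every original record at the cost of duplicates, whereas under-sampling discards majority-class records; cost-sensitive learning instead leaves the data untouched and penalizes misclassification of the minority class more heavily.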

Description
A Thesis Submitted to the Faculty of Graduate Studies and Research In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science, University of Regina. xv, 128 p.