Discovering Group Differences from Qualitative and Quantitative Attributes Using Contrast Set Mining with Discretization and Measures of Interestingness
Identifying differences between groups is a fundamental problem in many disciplines. Groups are defined by a selected property that distinguishes one group from the other. For example, students may be grouped by gender (male and female students) or by year of admission (students admitted from 2001 to 2010). The group differences sought are novel, meaning that they are not obvious or intuitive; potentially useful, meaning that they can aid in decision-making; and understandable, meaning that they are presented in a format easily understood by human beings. Contrast set mining has been developed as a data mining task which aims to efficiently identify differences between groups from observational multivariate data. Here we study two closely related steps in the contrast set mining process: the mining step, and the interpretation and evaluation step. In the mining step, the task to be performed is the discovery of valid contrast sets. A valid contrast set is a conjunction of attribute-value pairs whose distribution differs meaningfully across groups. We introduce a novel type of contrast set, called the λ-contrast set. A λ-contrast set has a ratio of maximum support to minimum support greater than a user-defined threshold. We introduce a novel discretization algorithm, called Discretize, for creating intervals for quantitative attributes in the dataset. We demonstrate how we build our search space of all possible contrast sets from the attributes and attribute values in the dataset. We then present the COSINE algorithm for traversing the search space of possible contrast sets. We show how the COSINE algorithm generates all valid contrast sets, maximal valid contrast sets, and valid λ-contrast sets. We then show how removing the attributes and attribute-interval pairs that are not highly correlated produces a smaller search space. We then present the GENCCS algorithm for traversing the search space of possible correlated contrast sets.
We show how the GENCCS algorithm generates all correlated valid contrast sets, correlated maximal valid contrast sets, and correlated valid λ-contrast sets. In the interpretation and evaluation step, we investigate the problem of ranking the discovered contrast sets by interestingness, using fourteen interestingness measures. Twelve of these measures have previously been used in various areas of the data mining community, such as association rule mining, emerging patterns, and subgroup discovery. Their use for ranking contrast sets is novel. The objective of this work is to gain insight into the behaviour that can be expected in practice from the COSINE and GENCCS algorithms and from the interestingness measures. From our analysis, we found that COSINE is an effective technique for the efficient generation of all valid contrast sets, maximal valid contrast sets, and valid λ-contrast sets, and that it discovered more interesting contrast sets than those obtained by two existing contrast set mining techniques, STUCCO and CIGAR. We also found that GENCCS is an effective technique for the efficient generation of all correlated valid contrast sets, correlated maximal valid contrast sets, and correlated valid λ-contrast sets.
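The λ-contrast set condition stated in the abstract (ratio of maximum support to minimum support greater than a user-defined threshold) can be illustrated with a minimal Python sketch. This is not the COSINE or GENCCS algorithm. It only checks the condition for one candidate contrast set. It assumes that the support of a contrast set in a group is the fraction of that group's records matching every attribute-value pair, and that a zero minimum support paired with a non-zero maximum counts as an infinite ratio. The function names, the `major` attribute, and the toy records are hypothetical.

```python
from collections import defaultdict
import math


def group_supports(rows, group_attr, contrast_set):
    """Return {group: fraction of that group's rows matching every pair in contrast_set}.

    Assumption: support is measured per group as a relative frequency.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for row in rows:
        group = row[group_attr]
        totals[group] += 1
        if all(row.get(attr) == value for attr, value in contrast_set.items()):
            matches[group] += 1
    return {group: matches[group] / totals[group] for group in totals}


def is_lambda_contrast_set(supports, lam):
    """Return True if max support / min support is greater than lam.

    Assumption: if the minimum support is zero and the maximum is not,
    the ratio is treated as infinite. If every support is zero, the
    candidate is rejected.
    """
    highest = max(supports.values())
    lowest = min(supports.values())
    if highest == 0:
        return False
    ratio = math.inf if lowest == 0 else highest / lowest
    return ratio > lam


# Hypothetical toy data: students grouped by gender.
rows = [
    {"gender": "female", "major": "CS"},
    {"gender": "female", "major": "CS"},
    {"gender": "female", "major": "Math"},
    {"gender": "female", "major": "CS"},
    {"gender": "male", "major": "CS"},
    {"gender": "male", "major": "Math"},
    {"gender": "male", "major": "Math"},
    {"gender": "male", "major": "Math"},
]

supports = group_supports(rows, "gender", {"major": "CS"})
print(supports)
print(is_lambda_contrast_set(supports, lam=2.0))
```

In this toy data the candidate `major = CS` has support 0.75 among female students and 0.25 among male students, so the ratio is 3 and the candidate passes a threshold of λ = 2 but not a threshold of λ = 3.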