Abstract: The quality of the data being analyzed is a critical factor in the accuracy of data mining algorithms. Two important aspects of data quality are feature relevance and feature redundancy. Including irrelevant or redundant features in a data mining model results in poor predictions and high computational overhead. This paper presents an efficient method that considers both the relevance of the features and the pairwise feature correlation in order to improve the prediction accuracy of our data mining algorithm. We introduce a new feature correlation metric Q_Y(X_i, X_j) and a feature subset merit measure e(S) to quantify the relevance of, and the correlation among, features with respect to a desired data mining task (e.g., detection of abnormal behavior in a network service due to network attacks). Our approach takes into consideration not only the dependency among the features, but also their dependency with respect to a given data mining task. Our analysis shows that the correlation among features depends on the decision task and, thus, that the features display different behaviors as the decision task changes. We applied our data mining approach to network security and validated it using the DARPA KDD99 benchmark data set. Our results show that, using the new decision-dependent correlation metric, we can efficiently detect rare network attacks such as User to Root (U2R) and Remote to Local (R2L) attacks. The best reported detection rates for U2R and R2L on the KDD99 data sets were 13.2 percent and 8.4 percent, respectively, with a 0.5 percent false alarm rate. For U2R attacks, our approach achieves a 92.5 percent detection rate with a false alarm rate of 0.7587 percent. For R2L attacks, our approach achieves a 92.47 percent detection rate with a false alarm rate of 8.35 percent.
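
The abstract does not give the definitions of the correlation metric or of e(S), so the following is only a minimal sketch of the general idea that pairwise feature correlation can depend on the decision task. It assumes conditional mutual information I(X_i; X_j | Y) as a stand-in for a decision-dependent correlation; this is not the paper's metric, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(xi, xj, y):
    """I(Xi; Xj | Y) = H(Xi, Y) + H(Xj, Y) - H(Xi, Xj, Y) - H(Y).

    Used here as an assumed stand-in for a decision-dependent correlation.
    """
    return entropy(xi, y) + entropy(xj, y) - entropy(xi, xj, y) - entropy(y)


# Hypothetical toy data: two independent binary features,
# with each value combination observed exactly once.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]

# Two different decision tasks defined over the same features.
task_a = [a ^ b for a, b in zip(x1, x2)]  # label depends on both features
task_b = list(x1)                         # label depends on x1 only

unconditional = mutual_information(x1, x2)
given_task_a = conditional_mutual_information(x1, x2, task_a)
given_task_b = conditional_mutual_information(x1, x2, task_b)

print(f"I(X1; X2)          = {unconditional:.3f} bits")
print(f"I(X1; X2 | task A) = {given_task_a:.3f} bits")
print(f"I(X1; X2 | task B) = {given_task_b:.3f} bits")
```

In this toy case the two features are uncorrelated when considered alone, become fully dependent once task A is fixed, and stay independent under task B, which illustrates why a correlation measure that ignores the decision task can misjudge redundancy.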

Feature Selection Based on a New Dependency Measure

Feature Selection Based on Dependency Margin

Feature Selection with Conditional Mutual Information Considering Feature Interaction

Relative Synergy Coefficient: A Novel Way to Detect Variable Interaction in Large Dataset

U^2F^2S^2 : Uncovering Feature-level Similarities for Unsupervised Feature Selection

A novel method of constrained feature selection by the measurement of pairwise constraints uncertainty

Invariant optimal feature selection: A distance discriminant and feature ranking based solution

Dependence Guided Unsupervised Feature Selection

Feature Selection based on the Local Lift Dependence Scale

Feature Selection Based on Data Clustering

Entropy based measure and its algorithms for scalable feature selection

A fusion of centrality and correlation for feature selection

Feature selection using relative dependency complement mutual information in fitting fuzzy rough set model

An Optimal Feature Subset Selection Method Based On Distance Discriminant And Distribution Overlapping

Feature Selection for Monotonic Classification Via Maximizing Monotonic Dependency

A feature selection algorithm based on redundancy analysis and interaction weight

A New Feature Selection Algorithm Based on Mutual Information with Pairwise Constraints

Feature Selection Based on Wasserstein Distance

A new dependency and correlation analysis for features

A Feature Selection Framework Based on Supervised Data Clustering

A New Method for Redundancy Analysis in Feature Selection