A Novel Feature Selection and Extraction Technique for Classification

Kratarth Goel, Raunaq Vohra, Ainesh Bakshi
DOI: https://doi.org/10.1109/SMC.2014.6974562
2014-12-26
Abstract: This paper presents a versatile technique for the purpose of feature selection and extraction: Class Dependent Features (CDFs). We use CDFs to improve the accuracy of classification and at the same time control computational expense by tackling the curse of dimensionality. In order to demonstrate the generality of this technique, it is applied to handwritten digit recognition and text categorization.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses **the efficiency and accuracy of feature selection and extraction in classification tasks**, especially on high-dimensional datasets. The authors propose a new technique, **Class Dependent Features (CDFs)**, which aims to improve classification accuracy while controlling computational cost by tackling the "curse of dimensionality".

### Main problems

1. **Processing of high-dimensional data**: As the dimensionality of the data increases, computational complexity rises sharply and classifier performance declines.
2. **Limitations of existing methods**: Traditional methods such as TF-IDF may inadvertently reduce the weights of high-frequency words that are very important for a particular category, which hurts classification.
3. **Generality and efficiency**: The method should be efficient and easy to implement, run quickly on multiple devices, and apply to different types of tasks, such as handwritten digit recognition and text classification.

### Solutions

The CDFs method addresses these problems through the following steps:

- **Feature selection**: Select features according to their relevance to the class labels, so that the extracted features are meaningful for the entire class, not just a single data point.
- **Feature extraction**: Use Kullback-Leibler (KL) divergence to extract class-dependent features and further refine the feature representation.
- **Classification task decomposition**: Decompose the overall learning problem into multiple binary classification tasks, each trained with a Support Vector Machine (SVM).

### Experimental verification

To demonstrate the effectiveness and generality of the method, the authors apply it to two different tasks:

- **Handwritten digit recognition**: Use the MNIST and USPS datasets.
- **Text classification**: Use the WebKB and Reuters-21578 datasets.

The experimental results show that the CDFs method achieves excellent performance on these tasks, significantly outperforming other methods on text classification in particular.

### Formula summary

1. **Feature selection formulas**:

\[ a_{ci}=\sum_{k=1}^{M}p_k(i) \]

\[ q_{ci}=\frac{a_{ci}}{M} \]

\[ R_{xy}=\left\{\frac{q_{xi}}{q_{yi}}\mid\forall q_{xi}\in T(P_x)\text{ and }\forall q_{yi}\in T(P_y)\right\} \]

\[ \mu_{xy}=\frac{\sum_{i=1}^{N}\left(\frac{q_{xi}}{q_{yi}}\right)}{N} \]

\[ \tau=b\cdot\mu_{xy},\quad\tau'=b'\cdot\mu_{yx} \]

2. **Feature extraction formulas**:

\[ F_{xy}(k)=D_{KL}(p'_k\|T(P_x)) \]

\[ L_{xy}(k)= \begin{cases} 1&\text{if }p'_k\in P'_x\\ -1&\text{if }p'_k\in P'_y \end{cases} \]

Through these formulas, the authors select class-dependent features and apply them to classification tasks, thereby improving the accuracy and efficiency of classification.
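
The formula summary above can be traced on a toy example. The following is a minimal Python sketch of one possible reading, not the authors' implementation. Several details are assumptions, because the summary does not state them: a feature is kept when its ratio is strictly above the threshold, the features picked for the two classes are merged by union, the reduced vectors are renormalized into distributions before the KL divergence is taken, and \(F_{xy}(k)\) is treated as the scalar KL sum. The helper names (`class_template`, `select_features`, `restrict`, `kl_divergence`), the constant `EPS`, and the toy data are all hypothetical.

```python
import math

EPS = 1e-9  # assumption: smoothing constant to avoid division by zero and log(0)


def class_template(points):
    """T(P_c): q_ci = a_ci / M, with a_ci the sum of feature i over the M points."""
    m = len(points)
    return [sum(p[i] for p in points) / m for i in range(len(points[0]))]


def select_features(q_x, q_y, b=1.0):
    """Return indices i whose ratio q_xi / q_yi exceeds tau = b * mu_xy.

    Assumption: keeping ratios strictly above tau is this sketch's reading.
    """
    ratios = [(qx + EPS) / (qy + EPS) for qx, qy in zip(q_x, q_y)]
    mu_xy = sum(ratios) / len(ratios)  # mean ratio over all N features
    tau = b * mu_xy
    return [i for i, r in enumerate(ratios) if r > tau]


def restrict(vector, indices):
    """Keep the selected features and renormalize to a distribution (assumption)."""
    kept = [vector[i] + EPS for i in indices]
    total = sum(kept)
    return [v / total for v in kept]


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy data (hypothetical): two points per class, four features each.
P_x = [[8, 1, 2, 2], [6, 1, 2, 2]]
P_y = [[1, 8, 2, 2], [1, 6, 2, 2]]

T_x, T_y = class_template(P_x), class_template(P_y)
selected_x = select_features(T_x, T_y, b=1.0)  # threshold tau  = b  * mu_xy
selected_y = select_features(T_y, T_x, b=1.0)  # threshold tau' = b' * mu_yx
selected = sorted(set(selected_x) | set(selected_y))  # assumption: union

template = restrict(T_x, selected)
# (F_xy(k), L_xy(k)) pairs: KL feature against T(P_x) plus a +1 / -1 label.
training_pairs = (
    [(kl_divergence(restrict(p, selected), template), 1) for p in P_x]
    + [(kl_divergence(restrict(p, selected), template), -1) for p in P_y]
)
for feature, label in training_pairs:
    print(f"F_xy = {feature:.4f}, L_xy = {label:+d}")
```

On this toy data, points from class x sit much closer to the class-x template than points from class y, so the KL feature separates the two labels. In the described pipeline, pairs like these would be passed to an SVM for each binary task; that step is left out here to keep the sketch dependency-free.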