Comparative Analysis of Filter Methods for Gene Selection
DOI:
https://doi.org/10.35778/jazu.i56.a648
Keywords:
gene expression, feature selection, T-Test filtering, Information Gain, Wilcoxon, Chi2, Pearson correlation, Gini index, breast cancer microarray
Abstract
Gene expression data pose significant challenges because of their high dimensionality, so effective gene selection methods are needed for accurate analysis and biomarker discovery. In this paper, we conducted a comprehensive comparative study of nine filter-based gene selection techniques: Information Gain, Mutual Information, Correlation-based Feature Selection (CFS), Relief-F, T-Test, Wilcoxon, Chi2, Pearson correlation, and Gini index. A breast cancer microarray dataset was used to evaluate these methods on classification accuracy, computational efficiency, and the stability of the selected gene subsets. Most methods achieve high predictive accuracy and perfect stability but differ in computational cost. The study aims to provide practical guidance for choosing a filtering method that balances performance and efficiency in gene expression analysis.
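To make the evaluation criteria concrete, the following is a minimal sketch of one filter method (a T-Test ranking) and one way to quantify the stability of the selected gene subsets. It is not the authors' implementation: the function names are hypothetical, the data are synthetic rather than the breast cancer microarray dataset, and the use of Welch's t-statistic and of mean pairwise Jaccard similarity over bootstrap resamples as the stability measure are assumptions, since the abstract does not state which variants were used.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (assumed variant of the T-Test filter)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_k_genes(X, y, k):
    """Score each gene (column of X) by |t| between classes 0 and 1; return the top-k indices."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(welch_t(pos, neg)))
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two gene-index sets."""
    return len(a & b) / len(a | b)


def selection_stability(X, y, k, n_resamples=10, seed=0):
    """Mean pairwise Jaccard similarity of top-k subsets over stratified bootstrap resamples."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    subsets = []
    for _ in range(n_resamples):
        # Resample within each class so both classes keep their original size.
        idx = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        subsets.append(top_k_genes([X[i] for i in idx], [y[i] for i in idx], k))
    pairs = [(i, j) for i in range(n_resamples) for j in range(i + 1, n_resamples)]
    return mean(jaccard(subsets[i], subsets[j]) for i, j in pairs)


# Synthetic stand-in data: 40 samples, 50 genes, genes 0-2 shifted in class 1.
data_rng = random.Random(42)
y = [0] * 20 + [1] * 20
X = [[data_rng.gauss(4.0 * label if g < 3 else 0.0, 1.0) for g in range(50)] for label in y]

selected = top_k_genes(X, y, k=3)
stability = selection_stability(X, y, k=3)
print(sorted(selected), round(stability, 3))
```

A stability of 1.0 under this measure means every resample selected the same gene subset, which is one possible reading of "perfect stability"; the other eight filters would differ only in how the per-gene score is computed.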

