Feature Selection and Machine Learning Method for Classification of Lung Cancer Types


  • Byungju Shin
  • Bohyun Wang
  • Joon S. Lim


Microarray technology and computational methods have enabled researchers to obtain a significant amount of gene expression data for lung cancer, allowing them to select genes that are specific to particular types of lung cancer. In this paper, we propose a relational matrix that uses microarray expression data to find genes that are highly correlated with the types of lung cancer. We then train the weighted neuro-fuzzy algorithm on the genes discovered from the relational matrix to classify the types of lung cancer accurately. In addition, some of the discovered genes were examined in their relevant pathways, and p-values were obtained to assess the validity of those genes in the given pathways. The relational matrix is constructed by counting the meaningful relationships identified by observing how gene expression values change between different types of lung cancer. The weighted neuro-fuzzy algorithm uses a bounded sum function that combines three functions during learning and classification. We obtained 405 type-dependent genes using the proposed relational matrix and used them to classify 203 samples into five types of lung cancer. We obtained a classification accuracy of 99.5% over all samples; a leave-one-out cross-validation test yielded an accuracy of 87.19%. Moreover, we obtained valid p-values from 12 pathways in KEGG.
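
The leave-one-out cross-validation procedure mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier is only a stand-in for the weighted neuro-fuzzy algorithm, and the names `nearest_centroid_predict` and `loocv_accuracy`, as well as the toy data, are assumptions introduced for illustration.

```python
from collections import defaultdict


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (assumption): return the label whose class
    centroid is closest to x in squared Euclidean distance."""
    sums = {}
    counts = defaultdict(int)
    for features, label in zip(train_X, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1

    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        centroid = [s / counts[label] for s in total]
        dist = sum((c - v) ** 2 for c, v in zip(centroid, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, fit on the rest, and return the
    fraction of held-out samples that are predicted correctly."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical toy data: two well-separated classes, two "genes" per sample.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.2], [4.8, 5.1], [5.1, 4.9]]
y = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(X, y)
print(f"LOOCV accuracy: {accuracy:.4f}")
```

Because every sample is scored by a model that never saw it, this estimate is typically lower than the accuracy measured on the samples used for training, which is in line with the gap between the two figures reported above. With 203 samples, an accuracy of 87.19% would correspond to 177 correctly classified held-out samples (177/203 ≈ 0.8719).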