High dimensional data classification and feature selection using support vector machines

High dimensional data classification and feature selection using support vector machines. Ghaddar, B. & Naoum-Sawaya, J. European Journal of Operational Research, 2017.

© 2017 Elsevier B.V. In many big-data systems, large amounts of information are recorded and stored for analytics purposes. Often, however, this vast amount of information does not offer additional benefits for optimal decision making, but may rather be complicating and too costly for collection, storage, and processing. For instance, tumor classification using high-throughput microarray data is challenging due to the presence of a large number of noisy features that do not contribute to the reduction of classification errors. For such problems, the general aim is to find a limited number of genes that highly differentiate among the classes. Thus, in this paper, we address a specific class of machine learning, namely the problem of feature selection within support vector machine classification that deals with finding an accurate binary classifier that uses a minimal number of features. We introduce a new approach based on iteratively adjusting a bound on the ℓ1-norm of the classifier vector in order to force the number of selected features to converge towards the desired maximum limit. We analyze two real-life classification problems with high dimensional features. The first case is the medical diagnosis of tumors based on microarray data where we present a generic approach for cancer classification based on gene expression. The second case deals with sentiment classification of on-line reviews from Amazon, Yelp, and IMDb. The results show that the proposed classification and feature selection approach is simple, computationally tractable, and achieves low error rates, which are key for the construction of advanced decision-support systems.
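The abstract describes the idea only at a high level: bound the ℓ1-norm of the classifier vector and adjust that bound iteratively until the number of selected features reaches the desired limit. The following is a minimal sketch of that idea and not the authors' algorithm. It assumes a linear soft-margin classifier trained by projected subgradient descent on the hinge loss, and it assumes bisection as the rule for adjusting the bound. All names (`project_l1_ball`, `train_bounded_svm`, `select_features`, `t_max`) and the toy data are hypothetical.

```python
import math


def project_l1_ball(w, t):
    """Euclidean projection of w onto the l1 ball {v : ||v||_1 <= t}."""
    if t <= 0:
        return [0.0] * len(w)
    if sum(abs(x) for x in w) <= t:
        return list(w)
    # Sort-based search for the soft-threshold level theta.
    theta, running = 0.0, 0.0
    for j, u in enumerate(sorted((abs(x) for x in w), reverse=True), start=1):
        running += u
        candidate = (running - t) / j
        if u - candidate > 0:
            theta = candidate
    # Soft-thresholding sets small weights exactly to zero.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in w]


def train_bounded_svm(X, y, t, epochs=200, step=0.1):
    """Projected subgradient descent on the mean hinge loss with ||w||_1 <= t."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for epoch in range(epochs):
        lr = step / math.sqrt(epoch + 1)  # diminishing step size
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # margin violated: hinge loss is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = project_l1_ball([wj - lr * gj for wj, gj in zip(w, grad_w)], t)
        b -= lr * grad_b
    return w, b


def select_features(X, y, max_features, t_max=4.0, rounds=12, tol=1e-9):
    """Bisect on the bound t (an assumed update rule) and return the loosest
    bound found whose classifier uses at most max_features nonzero weights."""
    t_lo, t_hi, t, best = 0.0, t_max, t_max, None
    for _ in range(rounds):
        w, b = train_bounded_svm(X, y, t)
        selected = [j for j, wj in enumerate(w) if abs(wj) > tol]
        if len(selected) <= max_features:
            best = {"t": t, "w": w, "b": b, "selected": selected}
            t_lo = t  # few enough features: try a looser bound
        else:
            t_hi = t  # too many features: tighten the bound
        t = (t_lo + t_hi) / 2
    return best


# Hypothetical toy data: feature 0 carries the label, features 1-4 are small noise.
y = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, -0.15]
X = [[2.0 * label] + [noise[(i + j) % 8] for j in range(4)]
     for i, label in enumerate(y)]

result = select_features(X, y, max_features=2)
print(result["selected"], round(result["t"], 3))
```

The number of nonzero weights is not guaranteed to change monotonically with the bound, so the bisection here is a heuristic. An exact treatment would solve the ℓ1-constrained problem with a linear programming solver at each step.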

@article{
 title = {High dimensional data classification and feature selection using support vector machines},
 type = {article},
 year = {2017},
 keywords = {Analytics,Classification,Feature selection,Machine learning,Support vector machines},
 id = {9fab25d5-e46d-3048-a326-215afae29486},
 created = {2017-12-02T16:14:47.598Z},
 file_attached = {false},
 profile_id = {7cfba06b-2407-3c8f-b729-5e0599e7b0fc},
 last_modified = {2017-12-02T16:14:47.598Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {false},
 hidden = {false},
 private_publication = {true},
 abstract = {© 2017 Elsevier B.V. In many big-data systems, large amounts of information are recorded and stored for analytics purposes. Often, however, this vast amount of information does not offer additional benefits for optimal decision making, but may rather be complicating and too costly for collection, storage, and processing. For instance, tumor classification using high-throughput microarray data is challenging due to the presence of a large number of noisy features that do not contribute to the reduction of classification errors. For such problems, the general aim is to find a limited number of genes that highly differentiate among the classes. Thus, in this paper, we address a specific class of machine learning, namely the problem of feature selection within support vector machine classification that deals with finding an accurate binary classifier that uses a minimal number of features. We introduce a new approach based on iteratively adjusting a bound on the $\ell_1$-norm of the classifier vector in order to force the number of selected features to converge towards the desired maximum limit. We analyze two real-life classification problems with high dimensional features. The first case is the medical diagnosis of tumors based on microarray data where we present a generic approach for cancer classification based on gene expression. The second case deals with sentiment classification of on-line reviews from Amazon, Yelp, and IMDb. The results show that the proposed classification and feature selection approach is simple, computationally tractable, and achieves low error rates, which are key for the construction of advanced decision-support systems.},
 bibtype = {article},
 author = {Ghaddar, B. and Naoum-Sawaya, J.},
 journal = {European Journal of Operational Research}
}

