Use of Machine Learning Algorithms In Classification Of DNA Gene Sequences: An Empiric Research on The Prometer And Splice Junction Data Set

Use of Machine Learning Algorithms In Classification Of DNA Gene Sequences: An Empiric Research on The Prometer And Splice Junction Data Set. Özhan, E. & Uzun, E. In 2. International Icontech Symposium on Innovative Surveys in Positive Sciences, pages 144-157, 2020.


Artificial intelligence technologies can provide effective solutions for understanding increasingly complex multi-dimensional data and extracting meaningful information from it. As the number of attributes and the complexity and size of the problem increase, discovering meaningful relationships in the data and interpreting it become increasingly difficult. Algorithms developed with machine learning methods, a sub-branch of artificial intelligence, can help overcome this difficulty. Classification of DNA gene sequences is one of the difficult problems in molecular biology. This study investigates the applicability of machine learning algorithms to two data sets that pose the DNA sequence analysis problem, and the effects of new approaches on this problem. For this purpose, previously untested machine learning algorithms and tools were used and new findings were obtained. In particular, the Auto-Weka tool, which was developed to determine the algorithms that offer optimum performance together with their parameters, was found to be effective. The findings show that the performance rates previously obtained on these data sets can be improved with new approaches. Specifically, the Random Forest and SMO (Sequential Minimal Optimization) algorithms, given appropriate parameter settings, significantly increase classification performance compared to previous studies. In addition to improving the performance of classification algorithms, this study also considers feature reduction, which had not been tried in previous research, and shows that the number of features can be reduced significantly to increase the performance rate.
With the GreedyStepwise search method, the CfsSubsetEval algorithm detected the most important gene sequences affecting classification in the Promoter data while significantly reducing the number of input parameters. With the BestFirst search method, the same algorithm reduced the number of attributes in the Splice Junction data set to an optimal level. Hardware resources such as computation and storage can thus be used more efficiently when processing DNA gene sequences, and results can be obtained more quickly.

@inproceedings{Ozhan2020,
 title = {Use of Machine Learning Algorithms In Classification Of DNA Gene Sequences: An Empiric Research on The Prometer And Splice Junction Data Set},
 type = {inproceedings},
 year = {2020},
 pages = {144--157},
 city = {Budapest, Hungary},
 id = {842f6f9f-2202-3f24-8503-0d30c0e88894},
 created = {2020-11-24T07:34:22.014Z},
 file_attached = {true},
 profile_id = {37fa15c3-e5d0-3212-8e18-e4c72814fd47},
 last_modified = {2021-02-21T14:00:57.454Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {true},
 hidden = {false},
 citation_key = {Ozhan2020},
 private_publication = {false},
 abstract = {Artificial intelligence technologies can provide effective solutions for understanding increasingly complex multi-dimensional data and extracting meaningful information from it. As the number of attributes and the complexity and size of the problem increase, discovering meaningful relationships in the data and interpreting it become increasingly difficult. Algorithms developed with machine learning methods, a sub-branch of artificial intelligence, can help overcome this difficulty. Classification of DNA gene sequences is one of the difficult problems in molecular biology. This study investigates the applicability of machine learning algorithms to two data sets that pose the DNA sequence analysis problem, and the effects of new approaches on this problem. For this purpose, previously untested machine learning algorithms and tools were used and new findings were obtained. In particular, the Auto-Weka tool, which was developed to determine the algorithms that offer optimum performance together with their parameters, was found to be effective. The findings show that the performance rates previously obtained on these data sets can be improved with new approaches. Specifically, the Random Forest and SMO (Sequential Minimal Optimization) algorithms, given appropriate parameter settings, significantly increase classification performance compared to previous studies. In addition to improving the performance of classification algorithms, this study also considers feature reduction, which had not been tried in previous research, and shows that the number of features can be reduced significantly to increase the performance rate.
With the GreedyStepwise search method, the CfsSubsetEval algorithm detected the most important gene sequences affecting classification in the Promoter data while significantly reducing the number of input parameters. With the BestFirst search method, the same algorithm reduced the number of attributes in the Splice Junction data set to an optimal level. Hardware resources such as computation and storage can thus be used more efficiently when processing DNA gene sequences, and results can be obtained more quickly.},
 bibtype = {inproceedings},
 author = {Özhan, Erkan and Uzun, Erdinç},
 booktitle = {2. International Icontech Symposium on Innovative Surveys in Positive Sciences},
 keywords = {DNA Gene Sequencing,Data Classification,Feature Reduction,Machine Learning}
}
