Evaluating Classifiers in SE Research: The ECSER Pipeline and Two Replication Studies. Dell’Anna, D., Aydemir, F. B., & Dalpiaz, F. Empirical Software Engineering, 28(3), 2023. DOI: 10.1007/s10664-022-10243-1

[Context] Automated classifiers, often based on machine learning (ML), are increasingly used in software engineering (SE) for labelling previously unseen SE data. Researchers have proposed automated classifiers that predict if a code chunk is a clone, if a requirement is functional or non-functional, if the outcome of a test case is non-deterministic, etc. [Objective] The lack of guidelines for applying and reporting classification techniques in SE research leads to studies in which important research steps may be skipped, key findings might not be identified and shared, and readers may find reported results (e.g., precision or recall above 90%) that are not a credible representation of the performance in operational contexts. The goal of this paper is to advance ML4SE research by proposing rigorous ways of conducting and reporting research. We introduce the ECSER (Evaluating Classifiers in Software Engineering Research) pipeline, which includes a series of steps for conducting and evaluating automated classification research in SE. Then, we conduct two replication studies in which we apply ECSER to recent research in requirements engineering and in software testing. [Conclusions] In addition to demonstrating the applicability of the pipeline, the replication studies demonstrate ECSER's usefulness: not only do we confirm and strengthen some findings identified by the original authors, but we also discover additional ones. Some of these findings contradict the original ones.
@article{DBLP:journals/ese/DellAnnaAD23,
  author         = {Davide Dell’Anna and Fatma Başak Aydemir and Fabiano Dalpiaz},
  title          = {Evaluating Classifiers in SE Research: The ECSER Pipeline and Two Replication Studies},
  journal        = {Empirical Software Engineering},
  volume         = {28},
  number         = {3},
  year           = {2023},
  url_Link       = {https://doi.org/10.1007/s10664-022-10243-1},
  url_Paper      = {https://link.springer.com/content/pdf/10.1007/s10664-022-10243-1.pdf},
  url_Slides     = {https://nlp4re.github.io/2023/assets/paper-templates/nlp4re-23-keynote-expanded.pdf},
  url_Supplement = {https://doi.org/10.5281/zenodo.6266675},
  doi            = {10.1007/s10664-022-10243-1},
  keywords       = {Automated classification, Machine Learning, Software Engineering, Replication Study, Requirements Engineering, Software Testing, NLP4RE},
  abstract       = {[Context] Automated classifiers, often based on machine learning (ML), are increasingly used in
                    software engineering (SE) for labelling previously unseen SE data. Researchers have proposed
                    automated classifiers that predict if a code chunk is a clone, if a requirement is functional or
                    non-functional, if the outcome of a test case is non-deterministic, etc. [Objective] The lack of
                    guidelines for applying and reporting classification techniques in SE research leads to studies in
                    which important research steps may be skipped, key findings might not be identified and shared,
                    and readers may find reported results (e.g., precision or recall above 90%) that are not a credible
                    representation of the performance in operational contexts. The goal of this paper is to advance
                    ML4SE research by proposing rigorous ways of conducting and reporting research. We introduce the
                    ECSER (Evaluating Classifiers in Software Engineering Research) pipeline, which includes a series
                    of steps for conducting and evaluating automated classification research in SE. Then, we conduct
                    two replication studies in which we apply ECSER to recent research in requirements engineering and
                    in software testing. [Conclusions] In addition to demonstrating the applicability of the pipeline,
                    the replication studies demonstrate ECSER's usefulness: not only do we confirm and strengthen some
                    findings identified by the original authors, but we also discover additional ones. Some of these
                    findings contradict the original ones.}
}