Improving Source Code Quality by Improving Identifier Quality and Avoiding Linguistic Antipatterns. Arnaoudova, V. Ph.D. Thesis, Polytechnique Montréal, August 2014. 173 pages.
Program comprehension is a key activity during software development and maintenance. Although frequently performed (even more often than actually writing code), program comprehension is a challenging activity. The difficulty of understanding a program increases with its size and complexity; as a result, the comprehension of complex programs is, in the best-case scenario, more time consuming than that of simple ones, but it can also lead to introducing faults in the program. Hence, structural properties such as size and complexity are often used to identify complex and fault-prone programs. However, from early theories studying developers' behavior while understanding a program, we know that the textual information contained in identifiers and comments (i.e., the source code lexicon) is among the factors that affect the psychological complexity of a program, i.e., the factors that make a program difficult for humans to understand and maintain. In this dissertation we provide evidence that metrics evaluating the quality of the source code lexicon are an asset for software fault explanation and prediction. Moreover, the quality of identifiers and comments considered in isolation may not be sufficient to reveal flaws: in his theory of the program understanding process, for example, Brooks warns that comments and code may be contradictory. Consequently, we address the problem of contradictory, and more generally inconsistent, lexicon by defining a catalog of Linguistic Antipatterns (LAs), i.e., poor practices in the choice of identifiers resulting in inconsistencies among the name, implementation, and documentation of a programming entity. Then, we empirically evaluate the relevance of LAs (i.e., how important they are) to industrial and open-source developers. Overall, results indicate that the majority of developers perceive LAs as poor practices that must therefore be avoided. We also distill a subset of canonical LAs that developers found particularly unacceptable or for which they undertook an action. In fact, we discovered that 10% of the examples containing LAs were removed by developers after we pointed them out. Developers' explanations and the large proportion of as-yet unresolved LAs suggest that other factors may affect the decision to remove LAs, which is often done through renaming. We conduct a survey with developers and show that renaming is not a straightforward activity and that several factors prevent developers from renaming. These results suggest that it would be more beneficial to highlight LAs and other lexicon bad smells as developers write source code (e.g., using our LAPD Checkstyle plugin, which detects LAs) so that the improvement can be made on the fly without impacting other program entities.
@PHDTHESIS{Arnaoudova14-PhD,
   AUTHOR       = {Venera Arnaoudova},
   SCHOOL       = {Polytechnique Montr{\'e}al},
   TITLE        = {Improving Source Code Quality by Improving Identifier 
      Quality and Avoiding Linguistic Antipatterns},
   YEAR         = {2014},
   OPTADDRESS   = {},
   MONTH        = {August},
   NOTE         = {173 pages.},
   OPTTYPE      = {},
   KEYWORDS     = {Linguistic smells},
   URL          = {http://www.ptidej.net/publications/documents/Thesis+of+Venera+Arnaoudova.doc.pdf},
   PDF          = {http://www.ptidej.net/publications/documents/Thesis+of+Venera+Arnaoudova.ppt.pdf},
   ABSTRACT     = {Program comprehension is a key activity during software 
      development and maintenance. Although frequently performed---even 
      more often than actually writing code---program comprehension is a 
      challenging activity. The difficulty of understanding a program
      increases with its size and complexity; as a result, the
      comprehension of complex programs is, in the best-case scenario,
      more time consuming than that of simple ones, but it can also lead
      to introducing faults in the program. Hence, structural properties
      such as size and complexity are often used to identify complex and
      fault-prone programs. However, from early theories studying developers'
      behavior while understanding a program, we know that the textual 
      information contained in identifiers and comments---\ie the source 
      code lexicon---is part of the factors that affect the psychological 
      complexity of a program, \ie factors that make a program difficult to 
      understand and maintain by humans. In this dissertation we provide 
      evidence that metrics evaluating the quality of source code lexicon 
      are an asset for software fault explanation and prediction. Moreover, 
      the quality of identifiers and comments considered in isolation may 
      not be sufficient to reveal flaws---in his theory about the program 
      understanding process for example, Brooks warns that it may happen 
      that comments and code are contradictory. Consequently, we address 
      the problem of contradictory, and more generally of inconsistent, 
      lexicon by defining a catalog of Linguistic Antipatterns (LAs), \ie 
      poor practices in the choice of identifiers resulting in 
      inconsistencies among the name, implementation, and documentation of 
      a programming entity. Then, we empirically evaluate the relevance of 
      LAs---\ie how important they are---to industrial and open-source 
      developers. Overall, results indicate that the majority of
      developers perceive LAs as poor practices that must therefore be
      avoided. We also distill a subset of \textit{canonical LAs} that
      developers found particularly unacceptable or for which they 
      undertook an action. In fact, we discovered that 10\% of the examples 
      containing LAs were removed by developers after we pointed them out. 
      Developers' explanations and the large proportion of as-yet
      unresolved LAs suggest that other factors may affect the decision
      to remove LAs, which is often done through renaming. We conduct a
      survey with developers and show that renaming is not a
      straightforward activity and that several factors prevent
      developers from renaming. These results suggest that it
      would be more beneficial to highlight LAs and other lexicon bad 
      smells as developers write source code---\eg using our LAPD 
      Checkstyle plugin detecting LAs---so that the improvement can be done 
      on-the-fly without impacting other program entities.}
} 