Improving Source Code Quality by Improving Identifier Quality and Avoiding Linguistic Antipatterns. Arnaoudova, V. Ph.D. Thesis, Polytechnique Montréal, August, 2014. 173 pages.
Program comprehension is a key activity during software development and maintenance. Although it is performed frequently (even more often than actually writing code), program comprehension is a challenging activity. The difficulty of understanding a program increases with its size and complexity; as a result, comprehending complex programs is, in the best-case scenario, more time consuming than comprehending simple ones, but it can also lead to the introduction of faults in the program. Hence, structural properties such as size and complexity are often used to identify complex and fault-prone programs. However, from early theories studying developers' behavior while understanding a program, we know that the textual information contained in identifiers and comments, i.e., the source code lexicon, is among the factors that affect the psychological complexity of a program, i.e., the factors that make a program difficult for humans to understand and maintain. In this dissertation, we provide evidence that metrics evaluating the quality of the source code lexicon are an asset for software fault explanation and prediction. Moreover, the quality of identifiers and comments considered in isolation may not be sufficient to reveal flaws: in his theory about the program understanding process, for example, Brooks warns that comments and code may be contradictory. Consequently, we address the problem of contradictory, and more generally inconsistent, lexicon by defining a catalog of Linguistic Antipatterns (LAs), i.e., poor practices in the choice of identifiers resulting in inconsistencies among the name, implementation, and documentation of a programming entity. Then, we empirically evaluate the relevance of LAs, i.e., how important they are, to industrial and open-source developers. Overall, results indicate that the majority of developers perceive LAs as poor practices that must therefore be avoided.
We also distill a subset of canonical LAs that developers found particularly unacceptable or for which they undertook an action. In fact, we discovered that 10% of the examples containing LAs were removed by developers after we pointed them out. Developers' explanations and the large proportion of yet unresolved LAs suggest that there may be other factors that influence the decision to remove LAs, which is often done through renaming. We conduct a survey with developers and show that renaming is not a straightforward activity and that several factors prevent developers from renaming. These results suggest that it would be more beneficial to highlight LAs and other lexicon bad smells as developers write source code (e.g., using our LAPD Checkstyle plugin, which detects LAs) so that the improvement can be made on the fly without impacting other program entities.
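
To make the idea of an inconsistency between a name and its implementation concrete, the following is a minimal sketch of a toy checker. It is an assumption-laden illustration, not the thesis's catalog and not the LAPD Checkstyle plugin: the sample functions, the `find_name_inconsistencies` helper, and its two rules (an `is_` name annotated with a non-Boolean return type, and a `get_` name annotated as returning nothing) are all hypothetical and chosen only for brevity.

```python
import ast

# Hypothetical sample code: the names promise one thing,
# the declared return types say another.
SAMPLE = '''
def is_valid(order) -> int:
    return len(order.items)

def get_total(order) -> None:
    order.total = sum(order.items)

def is_empty(order) -> bool:
    return not order.items
'''


def find_name_inconsistencies(source: str) -> list[tuple[str, str]]:
    """Return (function name, complaint) pairs for two toy naming rules."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Render the return annotation as text, or None when it is absent.
        annotation = ast.unparse(node.returns) if node.returns else None
        # Toy rule 1: an "is_" name suggests a Boolean answer.
        if node.name.startswith("is_") and annotation not in (None, "bool"):
            findings.append((node.name, f"'is_' name but returns {annotation}"))
        # Toy rule 2: a "get_" name suggests that something is returned.
        if node.name.startswith("get_") and annotation == "None":
            findings.append((node.name, "'get_' name but returns nothing"))
    return findings


findings = find_name_inconsistencies(SAMPLE)
for name, complaint in findings:
    print(f"{name}: {complaint}")
```

The sketch inspects only names and return annotations. A realistic detector would also need to look at method bodies and comments, since the abstract describes inconsistencies among name, implementation, and documentation.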
