Authorship attribution via network motifs identification. Marinho, V. Q., Hirst, G., & Amancio, D. R. In Proceedings, 5th Brazilian Conference on Intelligent Systems (BRACIS), pages ???--???, Recife, Brazil, October, 2016.
Concepts and methods of complex networks can be used to analyse texts at their different complexity levels. Examples of natural language processing (NLP) tasks studied via topological analysis of networks are keyword identification, automatic extractive summarization and authorship attribution. Even though a myriad of network measurements have been applied to study the authorship attribution problem, the use of motifs for text analysis has been restricted to a few works. The goal of this paper is to apply the concept of motifs, recurrent interconnection patterns, in the authorship attribution task. The absolute frequencies of all thirteen directed motifs with three nodes were extracted from the co-occurrence networks and used as classification features. The effectiveness of these features was verified with four machine learning methods. The results show that motifs are able to distinguish the writing style of different authors. In our best scenario, 57.5% of the books were correctly classified. The chance baseline for this problem is 12.5%. In addition, we have found that function words play an important role in these recurrent patterns. Taken together, our findings suggest that motifs should be further explored in other related linguistic tasks.
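The feature extraction step described in the abstract, counting the thirteen directed three-node motifs in a word co-occurrence network, can be illustrated with a minimal sketch. This is not the authors' implementation. The tokenizer, the choice to link only immediately adjacent words, the counting of induced subgraphs, and all function names (`cooccurrence_edges`, `canonical_code`, `motif_counts`) are assumptions made for illustration, and the brute-force loop over node triples is only practical for short texts.

```python
import re
from itertools import combinations, permutations

# Bit positions for the six possible directed edges among three nodes.
PAIR_ORDER = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]


def cooccurrence_edges(text):
    """Directed edges from each word to the next one (assumed window of two words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}


def canonical_code(triple, edges):
    """Smallest 6-bit edge pattern over all orderings of the three nodes.

    Two triples get the same code exactly when their subgraphs are isomorphic.
    """
    best = 63
    for perm in permutations(triple):
        code = 0
        for bit, (i, j) in enumerate(PAIR_ORDER):
            if (perm[i], perm[j]) in edges:
                code |= 1 << bit
        best = min(best, code)
    return best


def is_connected(code):
    """Three nodes are weakly connected when at least two node pairs are linked."""
    linked_pairs = sum(1 for shift in (0, 2, 4) if (code >> shift) & 0b11)
    return linked_pairs >= 2


def all_motifs():
    """Enumerate every connected three-node pattern once to fix the feature order."""
    codes = set()
    for bits in range(64):
        edges = {PAIR_ORDER[b] for b in range(6) if (bits >> b) & 1}
        code = canonical_code((0, 1, 2), edges)
        if is_connected(code):
            codes.add(code)
    return sorted(codes)


MOTIFS = all_motifs()  # thirteen connected directed patterns


def motif_counts(text):
    """Absolute frequency of each motif, as a 13-dimensional feature vector."""
    edges = cooccurrence_edges(text)
    nodes = sorted({node for edge in edges for node in edge})
    counts = dict.fromkeys(MOTIFS, 0)
    for triple in combinations(nodes, 3):
        code = canonical_code(triple, edges)
        if code in counts:  # skip triples that are not connected
            counts[code] += 1
    return [counts[m] for m in MOTIFS]
```

Each text would yield one such vector, which could then be passed to any standard classifier. The abstract does not name the four machine learning methods used, so that step is left out here.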
@inproceedings{Marinho2016BRACIS,
   author = {Vanessa Queiroz Marinho and Graeme Hirst and Diego Raphael Amancio},
   title = {Authorship attribution via network motifs identification},
   address = {Recife, Brazil},
   booktitle = {Proceedings, 5th Brazilian Conference on Intelligent Systems (BRACIS)},
   pages = {???--???},
   year = {2016},
   month = {October},
   download = {http://ftp.cs.toronto.edu/pub/gh/Marinho-etal-BRACIS-2016.pdf},
   abstract = {Concepts and methods of complex networks can be used to
                  analyse texts at their different complexity
                  levels. Examples of natural language processing
                  (NLP) tasks studied via topological analysis of
                  networks are keyword identification, automatic
                  extractive summarization and authorship
                  attribution. Even though a myriad of network
                  measurements have been applied to study the 
                  authorship attribution problem, the use of motifs
                  for text analysis has been restricted to a few 
                  works. The goal of this paper is to apply the 
                  concept of motifs, recurrent interconnection
                  patterns, in the authorship attribution task. The 
                  absolute frequencies of all thirteen directed motifs
                  with three nodes were extracted from the 
                  co-occurrence networks and used as classification
                  features. The effectiveness of these features was 
                  verified with four machine learning methods. The 
                  results show that motifs are able to distinguish the 
                  writing style of different authors. In our best
                  scenario, 57.5\% of the books were correctly
                  classified. The chance baseline for this problem is
                  12.5\%. In addition, we have found that function
                  words play an important role in these recurrent
                  patterns. Taken together, our findings suggest that
                  motifs should be further explored in other related
                  linguistic tasks.}
}
