An Efficient Similarity-based Approach for Comparing XML Documents

An Efficient Similarity-based Approach for Comparing XML Documents. Oliveira, A., Tessarolli, G., Ghiotto, G., Pinto, B., Campello, F., Marques, M., Oliveira, C., Rodrigues, I., Kalinowski, M., Souza, U., Murta, L., & Braganholo, V. Information Systems, 78:40-57, 2018.


XML documents are widely used to interchange information among heterogeneous systems, ranging from office applications to scientific experiments. Regardless of the domain, XML documents may evolve over time, so identifying and understanding the changes they undergo becomes crucial. Some syntactic diff approaches have been proposed to address this problem. They are mainly designed to compare revisions of XML documents using explicit IDs to match elements. However, elements in different revisions may not share IDs because of tool incompatibility or divergent or missing schemas. In this paper, we present Phoenix, a similarity-based approach for comparing revisions of XML documents that does not rely on explicit IDs. Phoenix uses dynamic programming and optimization algorithms to compare different features (e.g., element name, attributes, content, and sub-elements) of XML documents and calculate the degree of similarity between them. We compared Phoenix with X-Diff and XyDiff, two state-of-the-art XML diff algorithms. XyDiff was the fastest approach, but failed to provide precise matching results. X-Diff presented higher efficacy in 30 of the 56 scenarios, but was slow. Phoenix ran in a fraction of the time required by X-Diff and achieved the best efficacy in 26 of the 56 tested scenarios. In our evaluations, Phoenix was by far the most efficient approach to match elements across revisions of the same XML document.
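To make the idea of feature-based element similarity concrete, here is a minimal Python sketch. It is not Phoenix's implementation: the equal feature weights, the function names, the Jaccard measure for attributes, and the greedy pairing of sub-elements are all assumptions chosen for illustration. The abstract mentions dynamic programming and optimization algorithms without naming them, so the sketch uses a longest-common-subsequence computation for text content and swaps in a simple greedy pairing where an optimal assignment algorithm could be used instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights: the abstract lists these four features but not how
# they are combined, so equal weighting is an assumption.
WEIGHTS = {"name": 0.25, "attributes": 0.25, "content": 0.25, "children": 0.25}


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def text_similarity(a, b):
    """LCS-based similarity of two text values, in [0, 1]."""
    a, b = (a or "").strip(), (b or "").strip()
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def attribute_similarity(e1, e2):
    """Jaccard similarity over (attribute name, value) pairs."""
    items1, items2 = set(e1.attrib.items()), set(e2.attrib.items())
    if not items1 and not items2:
        return 1.0
    return len(items1 & items2) / len(items1 | items2)


def children_similarity(e1, e2):
    """Pair up sub-elements greedily, most similar pairs first.

    Greedy pairing is a stand-in; it is not guaranteed to be optimal.
    """
    kids1, kids2 = list(e1), list(e2)
    if not kids1 and not kids2:
        return 1.0
    if not kids1 or not kids2:
        return 0.0
    scored = sorted(
        ((similarity(c1, c2), i, j)
         for i, c1 in enumerate(kids1)
         for j, c2 in enumerate(kids2)),
        reverse=True,
    )
    used1, used2, total = set(), set(), 0.0
    for score, i, j in scored:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += score
    # Unpaired sub-elements lower the score via the larger child count.
    return total / max(len(kids1), len(kids2))


def similarity(e1, e2):
    """Weighted similarity of two XML elements, in [0, 1]."""
    name = 1.0 if e1.tag == e2.tag else 0.0
    return (
        WEIGHTS["name"] * name
        + WEIGHTS["attributes"] * attribute_similarity(e1, e2)
        + WEIGHTS["content"] * text_similarity(e1.text, e2.text)
        + WEIGHTS["children"] * children_similarity(e1, e2)
    )


if __name__ == "__main__":
    # Two revisions of the same record; the second one has lost its ID.
    old = ET.fromstring('<person id="7"><name>Ana Souza</name></person>')
    new = ET.fromstring('<person><name>Ana M. Souza</name></person>')
    print(round(similarity(old, new), 3))
```

Under these assumptions, two revisions of the same element can still score highly after an ID attribute disappears, because the name, content, and sub-elements continue to contribute to the score.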

@article{OliveiraTGPCMORKSMB18,
	author = {Alessandreia Oliveira and Gabriel Tessarolli and Gleiph Ghiotto and Bruno Pinto and Fernando Campello and Matheus Marques and Carlos Oliveira and Igor Rodrigues and Marcos Kalinowski and Uéverton Souza and Leonardo Murta and Vanessa Braganholo},
	title = {An Efficient Similarity-based Approach for Comparing XML Documents},
	journal = {Information Systems},
	volume = {78},
	year = {2018},
	abstract = {XML documents are widely used to interchange information among heterogeneous systems, ranging from office applications to scientific experiments. Independently of the domain, XML documents may evolve over time, so identifying and understanding the changes they undergo becomes crucial. Some syntactic-based diff approaches have been proposed to address this problem. They are mainly designed to compare revisions of XML documents using explicit IDs to match elements. However, elements in different revisions may not share IDs due to tool incompatibility or even divergent or missing schemas. In this paper, we present Phoenix, a similarity-based approach for comparing revisions of XML documents that does not rely on explicit IDs. Phoenix uses dynamic programming and optimization algorithms to compare different features (e.g., element name, attributes, content, and sub-elements) of XML documents and calculate the similarity degree between them. We compared Phoenix with X-Diff and XyDiff, two state-of-the-art XML diff algorithms. XyDiff was the fastest approach, but failed in providing precise matching results. X-Diff presented higher efficacy in 30 of the 56 scenarios, but was slow. Phoenix performed in a fraction of the running time required by X-Diff, and achieved the best results in terms of efficacy in 26 of 56 tested scenarios. In our evaluations Phoenix was by far the most efficient approach to match elements across revisions of the same XML document.},
	issn = {0306-4379},
	pages = {40--57},
	doi = {10.1016/j.is.2018.07.001},
	urlAuthor_version = {http://www.inf.puc-rio.br/~kalinowski/publications/OliveiraTGPCMORKSMB18.pdf},
}
