Comparison of Python Libraries used for Web Data Extraction

Comparison of Python Libraries used for Web Data Extraction. Uzun, E., Yerlikaya, T., & Kırat, O. In 7th International Scientific Conference “TechSys 2018” – Engineering, Technologies and Systems, Technical University of Sofia, Plovdiv Branch May 17-19, pages 108-113, 2018.


There are several Python libraries for extracting useful data from web pages. In this study, we compare three well-known extraction approaches: BeautifulSoup, lxml, and regex. The experimental results indicate that regex achieves the best results, with an average of 0.071 ms. However, it is difficult to generate correct extraction rules for regex when the number of inner elements is not known; in the experiments, only 43.5% of the extraction rules are suitable for this task. In such cases, the DOM-based libraries BeautifulSoup and lxml are used for the extraction process. Of the two, lxml yields the best results, with an average of 9.074 ms.
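
The difficulty the abstract notes for regex rules when the number of inner elements is unknown can be illustrated with a minimal sketch. This is not the authors' benchmark code: the HTML snippet, the function names, and the use of the standard-library `xml.etree.ElementTree` as a stand-in for the DOM-based libraries (BeautifulSoup, lxml) are all assumptions made for illustration, and the sketch only works on well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet: the target <div> contains a nested <div>.
HTML = (
    '<html><body>'
    '<div class="content">Intro <div class="note">nested</div> tail</div>'
    '</body></html>'
)


def extract_with_regex(html: str) -> str:
    """Non-greedy rule: it stops at the first </div>, so nesting truncates the match."""
    match = re.search(r'<div class="content">(.*?)</div>', html, re.DOTALL)
    return match.group(1) if match else ""


def extract_with_tree(html: str) -> str:
    """Tree-based rule: locate the element, then collect all text beneath it."""
    root = ET.fromstring(html)  # assumption: input parses as well-formed XML
    node = root.find(".//div[@class='content']")
    return "".join(node.itertext()) if node is not None else ""


regex_result = extract_with_regex(HTML)  # cut off inside the nested <div>
tree_result = extract_with_tree(HTML)    # full text of the target element

print(repr(regex_result))
print(repr(tree_result))
```

The regex rule would need to know how many inner `<div>` elements to skip, whereas the tree-based lookup follows the parsed structure, at the cost of building that structure first.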

@inproceedings{Uzun2018_Plovdiv,
 title = {Comparison of Python Libraries used for Web Data Extraction},
 type = {inproceedings},
 year = {2018},
 keywords = {DOM,Performance evaluation,Python,Web content extraction},
 pages = {108-113},
 websites = {https://erdincuzun.com/wp-content/uploads/download/plovdiv_2018_01.pdf},
 id = {6f3fb081-c2f4-39ad-a727-96d5414f849d},
 created = {2018-07-03T11:58:15.189Z},
 file_attached = {false},
 profile_id = {37fa15c3-e5d0-3212-8e18-e4c72814fd47},
 last_modified = {2020-01-16T20:29:39.349Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {true},
 hidden = {false},
 citation_key = {Uzun2018_Plovdiv},
 private_publication = {false},
 abstract = {There are several Python libraries for extracting useful data from web pages. In this study, we compare three well-known extraction approaches: BeautifulSoup, lxml, and regex. The experimental results indicate that regex achieves the best results, with an average of 0.071 ms. However, it is difficult to generate correct extraction rules for regex when the number of inner elements is not known; in the experiments, only 43.5% of the extraction rules are suitable for this task. In such cases, the DOM-based libraries BeautifulSoup and lxml are used for the extraction process. Of the two, lxml yields the best results, with an average of 9.074 ms.},
 bibtype = {inproceedings},
 author = {Uzun, Erdinç and Yerlikaya, Tarık and Kırat, Oğuz},
 booktitle = {7th International Scientific Conference “TechSys 2018” – Engineering, Technologies and Systems, Technical University of Sofia, Plovdiv Branch May 17-19}
}

