Automatically Discovering Relevant Images From Web Pages

Automatically Discovering Relevant Images From Web Pages. Uzun, E., Onzhan, E., Volkan Agun, H., Yerlikaya, T., & Nusret Bulus, H. IEEE Access, 8:1-1, 2020.

Web pages contain irrelevant images alongside relevant ones. Classifying these images is error-prone because of the many design variations among web pages. Using multiple web pages provides additional features that improve the performance of relevant image extraction. Traditional studies use features extracted from a single web page. In this study, however, we enhance the performance of relevant image extraction by employing features extracted from different web pages, consisting of standard news, galleries, video pages, and link pages. The dataset obtained from these web pages contains 100 different web pages for each of 200 online news websites from 58 different countries. The most straightforward approach to discovering relevant images extracts the largest image on the web page. This approach achieves an F-measure of 0.451 as a baseline. We then apply several machine learning methods to the features in this dataset to find the most suitable one. The best F-measure is 0.822, obtained with the AdaBoost classifier. Some of these features have been used in previous web data extraction studies. To the best of our knowledge, 15 new features are proposed for the first time in this study for discovering relevant images. We compare the performance of the AdaBoost classifier on different feature sets. The proposed features improve the F-measure by 35 percent. The cache feature alone, the most prominent of these features, accounts for 7 percent of this improvement.
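The largest-image baseline and the F-measure used to score it can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the `Image` record, the sample pages, the reading of "largest" as pixel area, and the choice to pool true positives across pages before computing the score. The abstract does not specify any of these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    src: str
    width: int
    height: int
    relevant: bool  # hypothetical ground-truth label


def largest_image(images):
    """Baseline: return the image with the largest pixel area (None if empty)."""
    return max(images, key=lambda im: im.width * im.height, default=None)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pages: on the first one a banner ad is the largest image.
pages = {
    "news-1": [
        Image("logo.png", 120, 40, False),
        Image("banner-ad.jpg", 970, 250, False),
        Image("story-photo.jpg", 640, 360, True),
        Image("gallery-2.jpg", 600, 340, True),
    ],
    "news-2": [
        Image("icon.png", 32, 32, False),
        Image("lead-photo.jpg", 800, 450, True),
    ],
}

predicted, actual = set(), set()
for page_id, images in pages.items():
    pick = largest_image(images)
    if pick is not None:
        predicted.add((page_id, pick.src))
    actual.update((page_id, im.src) for im in images if im.relevant)

score = f_measure(predicted, actual)
print(f"baseline F-measure on the sample pages: {score:.3f}")
```

On these made-up pages the baseline picks the banner ad on the first page and the lead photo on the second, so one of its two predictions is correct and one of three relevant images is found. This is the kind of failure that size alone cannot avoid and that additional features are meant to address.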

@article{
 title = {Automatically Discovering Relevant Images From Web Pages},
 type = {article},
 year = {2020},
 keywords = {Crawlers,Feature extraction,Layout,Machine learning,Predictive models,Task analysis,Web pages},
 pages = {1-1},
 volume = {8},
 websites = {https://ieeexplore.ieee.org/document/9262879/},
 id = {47c76129-a0ac-3f23-b341-6c5257551ddd},
 created = {2020-11-28T08:33:06.663Z},
 file_attached = {false},
 profile_id = {37fa15c3-e5d0-3212-8e18-e4c72814fd47},
 last_modified = {2020-12-09T12:13:13.645Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {true},
 hidden = {false},
 private_publication = {false},
 abstract = {Web pages contain irrelevant images alongside relevant ones. Classifying these images is error-prone because of the many design variations among web pages. Using multiple web pages provides additional features that improve the performance of relevant image extraction. Traditional studies use features extracted from a single web page. In this study, however, we enhance the performance of relevant image extraction by employing features extracted from different web pages, consisting of standard news, galleries, video pages, and link pages. The dataset obtained from these web pages contains 100 different web pages for each of 200 online news websites from 58 different countries. The most straightforward approach to discovering relevant images extracts the largest image on the web page. This approach achieves an F-measure of 0.451 as a baseline. We then apply several machine learning methods to the features in this dataset to find the most suitable one. The best F-measure is 0.822, obtained with the AdaBoost classifier. Some of these features have been used in previous web data extraction studies. To the best of our knowledge, 15 new features are proposed for the first time in this study for discovering relevant images. We compare the performance of the AdaBoost classifier on different feature sets. The proposed features improve the F-measure by 35 percent. The cache feature alone, the most prominent of these features, accounts for 7 percent of this improvement.},
 bibtype = {article},
 author = {Uzun, Erdinc and Onzhan, Erkan and Volkan Agun, H. and Yerlikaya, Tarik and Nusret Bulus, H.},
 journal = {IEEE Access}
}
