AutoBib: automatic extraction of bibliographic information on the web

AutoBib: automatic extraction of bibliographic information on the web. Geng, J. & Yang, J. In Proceedings International Database Engineering and Applications Symposium 2004 IDEAS 04, pages 193-204, 2004. IEEE.


The Web has greatly facilitated access to information. However, information presented in HTML is mainly intended to be browsed by humans, and the problem of automatically extracting such information remains an important and challenging task. In this work, we focus on building a system called AutoBib to automate extraction of bibliographic information on the Web. We use a combination of bootstrapping, statistical, and heuristic methods to achieve a high degree of automation. To set up extraction from a new site, we only need to provide a few lines of code specifying how to download pages containing bibliographic information. We do not need to be concerned with each site's presentation format, and the system can cope with changes in the presentation format without human intervention. AutoBib bootstraps itself with a small seed database of structured bibliographic records. For each bibliographic Web site, we identify segments within its pages that represent bibliographic records, using state-of-the-art record-boundary discovery techniques. Next, we find matches for some of these raw records in the seed database using a set of heuristics. These matches serve as a training set for a parser based on the Hidden Markov Model (HMM), which is then used to parse the rest of the raw records into structured records. We have found an effective HMM structure with special states that correspond to delimiters and HTML tags in raw records. Experiments demonstrate that for our application, this HMM structure achieves high success rates without the complexity of previously proposed structures.
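The train-then-parse step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the state set (`author`, `title`, `year`, and a single `delim` state), the coarse `symbol` classes, the add-one smoothing, and the two hand-labelled records in `SEED_MATCHES` are all assumptions made for illustration. The sketch estimates HMM parameters by counting over labelled records, then labels a new raw record with Viterbi decoding.

```python
import math
from collections import defaultdict

# Hypothetical training set: raw records already matched against a seed
# database, so every token carries a state label. Illustrative data only.
SEED_MATCHES = [
    [("Geng", "author"), (",", "delim"), ("Yang", "author"), (".", "delim"),
     ("automatic", "title"), ("extraction", "title"), (".", "delim"), ("2004", "year")],
    [("Smith", "author"), (".", "delim"), ("parsing", "title"), ("web", "title"),
     ("pages", "title"), (".", "delim"), ("2001", "year")],
]

def symbol(token):
    """Map a token to a coarse observation symbol (an assumed feature scheme)."""
    if token.isdigit():
        return "NUM"
    if not token.isalnum():
        return token  # delimiters are kept verbatim
    return "CAP" if token[0].isupper() else "LOW"

def train_hmm(tagged_records, smoothing=1.0):
    """Estimate smoothed log-probabilities by counting over labelled records."""
    states = sorted({s for rec in tagged_records for _, s in rec})
    vocab = {symbol(t) for rec in tagged_records for t, _ in rec} | {"<unk>"}
    start = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for rec in tagged_records:
        prev = None
        for token, state in rec:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][symbol(token)] += 1
            prev = state

    def log_prob(counts, key, size):
        total = sum(counts.values())
        return math.log((counts.get(key, 0.0) + smoothing) / (total + smoothing * size))

    return {
        "states": states,
        "vocab": vocab,
        "start": {s: log_prob(start, s, len(states)) for s in states},
        "trans": {a: {b: log_prob(trans[a], b, len(states)) for b in states} for a in states},
        "emit": {s: {w: log_prob(emit[s], w, len(vocab)) for w in vocab} for s in states},
    }

def viterbi(model, tokens):
    """Return the most likely state sequence for a tokenised raw record."""
    if not tokens:
        return []
    states = model["states"]
    obs = [symbol(t) if symbol(t) in model["vocab"] else "<unk>" for t in tokens]
    score = [{s: model["start"][s] + model["emit"][s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            best = max(states, key=lambda p: prev[p] + model["trans"][p][s])
            cur[s] = prev[best] + model["trans"][best][s] + model["emit"][s][o]
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

model = train_hmm(SEED_MATCHES)
labels = viterbi(model, ["Lee", ".", "hidden", "markov", "models", ".", "1999"])
```

Giving delimiters their own state lets punctuation act as a boundary signal between fields, which is the role the abstract attributes to its special delimiter and HTML-tag states. A real system would need a richer symbol scheme and far more training records than this toy.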

@inProceedings{Geng2004,
 title = {AutoBib: automatic extraction of bibliographic information on the web},
 type = {inProceedings},
 year = {2004},
 pages = {193--204},
 websites = {http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1319792},
 publisher = {IEEE},
 id = {53abdb16-2d5f-3bae-b594-6e0a4691b104},
 created = {2012-04-01T16:32:49.000Z},
 file_attached = {false},
 profile_id = {5284e6aa-156c-3ce5-bc0e-b80cf09f3ef6},
 group_id = {066b42c8-f712-3fc3-abb2-225c158d2704},
 last_modified = {2017-03-14T14:36:19.698Z},
 read = {false},
 starred = {false},
 authored = {false},
 confirmed = {true},
 hidden = {false},
 citation_key = {Geng2004},
 private_publication = {false},
 abstract = {The Web has greatly facilitated access to information. However, information presented in HTML is mainly intended to be browsed by humans, and the problem of automatically extracting such information remains an important and challenging task. In this work, we focus on building a system called AutoBib to automate extraction of bibliographic information on the Web. We use a combination of bootstrapping, statistical, and heuristic methods to achieve a high degree of automation. To set up extraction from a new site, we only need to provide a few lines of code specifying how to download pages containing bibliographic information. We do not need to be concerned with each site's presentation format, and the system can cope with changes in the presentation format without human intervention. AutoBib bootstraps itself with a small seed database of structured bibliographic records. For each bibliographic Web site, we identify segments within its pages that represent bibliographic records, using state-of-the-art record-boundary discovery techniques. Next, we find matches for some of these raw records in the seed database using a set of heuristics. These matches serve as a training set for a parser based on the Hidden Markov Model (HMM), which is then used to parse the rest of the raw records into structured records. We have found an effective HMM structure with special states that correspond to delimiters and HTML tags in raw records. Experiments demonstrate that for our application, this HMM structure achieves high success rates without the complexity of previously proposed structures.},
 bibtype = {inProceedings},
 author = {Geng, Junfei and Yang, Jun},
 booktitle = {Proceedings International Database Engineering and Applications Symposium 2004 IDEAS 04}
}

