Using grammatical inference to automate information extraction from the Web. Hong, T., W. & Clark, K., L. Principles of Data Mining and Knowledge Discovery, 2168:216–227, Springer, 2001.
Using grammatical inference to automate information extraction from the Web [pdf]Website  abstract   bibtex   
The World-Wide Web contains a wealth of semistructured information sources that often give partial/overlapping views on the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, since they are typically formatted in diverse ways for human viewing, extracting their data for integration is a difficult challenge. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method for automatically generating extraction wrappers using grammatical inference that can recognize general structures and does not rely on manually-labelled examples. Domain-speci.c knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.

Downloads: 0