Artificial Intelligence, 165(1):91–134, Elsevier, 2005.
The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
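To make the List Extraction idea concrete, the following is a minimal sketch of wrapper learning, not KnowItAll's actual algorithm. It assumes a "wrapper" is simply the pair of markup strings that immediately surround every known seed instance in a page; the function names `seed_context`, `learn_wrapper`, and `extract`, as well as the sample page, are hypothetical and chosen only for illustration.

```python
import re


def seed_context(html, seed):
    """Return the markup immediately left and right of the first occurrence of seed."""
    i = html.find(seed)
    if i < 0:
        return None
    start = html.rfind("<", 0, i)            # start of the nearest tag to the left
    end = html.find(">", i + len(seed))      # end of the nearest tag to the right
    if start < 0 or end < 0:
        return None
    return html[start:i], html[i + len(seed):end + 1]


def learn_wrapper(html, seeds):
    """Learn a (prefix, suffix) wrapper if at least two seeds share the same context."""
    contexts = [c for c in (seed_context(html, s) for s in seeds) if c is not None]
    if len(contexts) < 2 or len(set(contexts)) != 1:
        return None                          # too few seeds found, or contexts disagree
    return contexts[0]


def extract(html, wrapper):
    """Apply the wrapper to pull out every item that sits in the same context."""
    prefix, suffix = wrapper
    pattern = re.escape(prefix) + r"([^<>]+?)" + re.escape(suffix)
    return re.findall(pattern, html)


# Hypothetical page: two seeds are already known instances of the class "city".
page = "<ul><li>Paris</li><li>Berlin</li><li>Lima</li><li>Oslo</li></ul>"
wrapper = learn_wrapper(page, ["Paris", "Berlin"])
cities = extract(page, wrapper)
```

In this sketch the seeds play the role of instances already found by the domain-independent methods, which is why no hand-labeled examples are needed: the wrapper generalizes from them to the remaining list elements ("Lima" and "Oslo" here).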