Web based sentence collector. Uzun, E., Kılıçaslan, Y., & Uçar, E. In 9th international scientific conference in computer and communication systems and technologies, Smolian, Bulgaria, pages 235-241, 2007.
The World Wide Web can be used as a source of machine-readable text for corpora. Search engines, programs that search documents for specified keywords and return a list of matching documents, are the main tools for collecting such texts. However, the usefulness of the results they return is limited by the sheer amount of noise on the Web. This study describes a Web Based Sentence Collector (WBSC) that uses search engines to retrieve Turkish documents and filters out any detected noise that degrades the grammaticality of the sentences.
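
The filtering step mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the WBSC's actual rules, so everything below is an assumption made for illustration: the function names (`split_sentences`, `looks_like_sentence`, `collect_sentences`), the thresholds, and the heuristics themselves (minimum word count, initial capital, terminal punctuation, share of alphabetic characters). The input is assumed to be plain text already extracted from a retrieved page.

```python
import re

# Hypothetical thresholds; the abstract does not state the WBSC's real criteria.
MIN_WORDS = 3
MIN_ALPHA_RATIO = 0.8


def split_sentences(text):
    """Naively split after '.', '!' or '?' followed by whitespace, and at newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def looks_like_sentence(s):
    """Heuristic check that a fragment resembles a well-formed sentence."""
    if len(s.split()) < MIN_WORDS:
        return False  # too short: likely a button label or menu item
    if not s[0].isupper() or s[-1] not in ".!?":
        return False  # no initial capital or no terminal punctuation
    # Share of characters that are letters or spaces; low values suggest
    # prices, tables, or other non-sentence residue.
    clean = sum(ch.isalpha() or ch.isspace() for ch in s)
    return clean / len(s) >= MIN_ALPHA_RATIO


def collect_sentences(text):
    """Return only the fragments that pass the heuristic filter."""
    return [s for s in split_sentences(text) if looks_like_sentence(s)]
```

For example, given a page whose text contains a navigation line, a one-word button label, and a price line alongside two ordinary Turkish sentences, `collect_sentences` would keep only the two sentences. A real collector would also need language identification and a sentence splitter that handles abbreviations, which this sketch leaves out.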
