The rhetorical parsing, summarization, and generation of natural language texts. Marcu, D. Ph.D. Thesis, Department of Computer Science, University of Toronto, January, 1998. Published as technical report CSRI-371.

This thesis is an inquiry into the nature of the high-level, rhetorical structure of unrestricted natural language texts, computational means to enable its derivation, and two applications (in automatic summarization and natural language generation) that follow from the ability to build such structures automatically.

The thesis proposes a first-order formalization of the high-level, rhetorical structure of text. The formalization assumes that text can be sequenced into elementary units; that discourse relations hold between textual units of various sizes; that some textual units are more important to the writer's purpose than others; and that trees are a good approximation of the abstract structure of text. The formalization also introduces a linguistically motivated compositionality criterion, which is shown to hold for the text structures that are valid.

The thesis proposes, analyzes theoretically, and compares empirically four algorithms for determining the valid text structures of a sequence of units among which some rhetorical relations hold. Two algorithms apply model-theoretic techniques; the other two apply proof-theoretic techniques.

The formalization and the algorithms mentioned so far correspond to the theoretical facet of the thesis. An exploratory corpus analysis of cue phrases provides the means for applying the formalization to unrestricted natural language texts. A set of empirically motivated algorithms was designed in order to determine the elementary textual units of a text, to hypothesize rhetorical relations that hold among these units, and, eventually, to derive the discourse structure of that text. The process that finds the discourse structure of unrestricted natural language texts is called rhetorical parsing.

The thesis explores two possible applications of the text theory that it proposes. The first application concerns a discourse-based summarization system, which is shown to significantly outperform both a baseline algorithm and a commercial system. An empirical psycholinguistic experiment not only provides an objective evaluation of the summarization system, but also confirms the adequacy of using the text theory proposed here in order to determine the most important units in a text. The second application concerns a set of text planning algorithms that can be used by natural language generation systems in order to construct text plans in the cases in which the high-level communicative goal is to map an entire knowledge pool into text.

@PhDThesis{	  marcu8,
  author	= {Daniel Marcu},
  title		= {The rhetorical parsing, summarization, and generation of
		  natural language texts},
  school	= {Department of Computer Science, University of Toronto},
  month		= {January},
  year		= {1998},
  note		= {Published as technical report CSRI-371},
  abstract	= {<P>This thesis is an inquiry into the nature of the
		  high-level, rhetorical structure of unrestricted natural
		  language texts, computational means to enable its
		  derivation, and two applications (in automatic
		  summarization and natural language generation) that follow
		  from the ability to build such structures
		  automatically.</p> <P>The thesis proposes a first-order
		  formalization of the high-level, rhetorical structure of
		  text. The formalization assumes that text can be sequenced
		  into elementary units; that discourse relations hold
		  between textual units of various sizes; that some textual
		  units are more important to the writer's purpose than
		  others; and that trees are a good approximation of the
		  abstract structure of text. The formalization also
		  introduces a linguistically motivated compositionality
		  criterion, which is shown to hold for the text structures
		  that are valid.</p> <P>The thesis proposes, analyzes
		  theoretically, and compares empirically four algorithms for
		  determining the valid text structures of a sequence of
		  units among which some rhetorical relations hold. Two
		  algorithms apply model-theoretic techniques; the other two
		  apply proof-theoretic techniques.</p> <P>The formalization
		  and the algorithms mentioned so far correspond to the
		  theoretical facet of the thesis. An exploratory corpus
		  analysis of cue phrases provides the means for applying the
		  formalization to unrestricted natural language texts. A set
		  of empirically motivated algorithms was designed in order
		  to determine the elementary textual units of a text, to
		  hypothesize rhetorical relations that hold among these
		  units, and, eventually, to derive the discourse structure of
		  that text. The process that finds the discourse structure
		  of unrestricted natural language texts is called rhetorical
		  parsing.</p> <P>The thesis explores two possible
		  applications of the text theory that it proposes. The first
		  application concerns a discourse-based summarization
		  system, which is shown to significantly outperform both a
		  baseline algorithm and a commercial system. An empirical
		  psycholinguistic experiment not only provides an objective
		  evaluation of the summarization system, but also confirms
		  the adequacy of using the text theory proposed here in
		  order to determine the most important units in a text. The
		  second application concerns a set of text planning
		  algorithms that can be used by natural language generation
		  systems in order to construct text plans in the cases in
		  which the high-level communicative goal is to map an entire
		  knowledge pool into text.</p>},
  download	= {http://ftp.cs.toronto.edu/pub/gh/Marcu-PhDthesis.pdf}
}
