Human Evaluation on Statistical Machine Translation. Stephens, E. C. MA Thesis, San Diego State University, October 2014.
Statistical Machine Translation has become the dominant paradigm of Machine Translation, and the automatic evaluation of system output has become important as a result. The focus of this thesis is the automatic evaluation used in the development of these systems. The metric examined in this study is BLEU, which is widely considered the most reliable tool in the computational linguistics community. By analyzing a subset of data from the 2009 Workshop on Statistical Machine Translation and comparing the evaluations given by BLEU and by the human evaluator on two systems, Google and Uedin, we will see that for the most part the metric and the human evaluator agree. However, BLEU's evaluation is not very accurate; therefore, this metric should not be considered the most reliable means we have of evaluating statistical machine translation systems. In fact, my results show that BLEU undervalues good translations and overvalues bad ones. Thus, we have to keep in mind that a lot of work still needs to be done to standardize this metric, and the research presented here should be considered work in progress. It is a possible starting point for future research that will carry out a more thorough analysis of statistical machine translation and automatic evaluation.
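As an illustration of the comparison described in the abstract, the following minimal Python sketch shows how corpus-level and sentence-level BLEU scores might be computed for a system's outputs against reference translations, so that they can be set against human judgments. It assumes the sacrebleu package and uses made-up sentences; it is not taken from the thesis, and all names and data in it are hypothetical.

# Minimal sketch (not from the thesis): scoring a system's outputs
# against reference translations with BLEU, as one side of a
# metric-vs-human-judgment comparison. Assumes the sacrebleu package;
# the sentences below are invented for illustration.
import sacrebleu

# Hypothetical outputs from one system (e.g., "google" or "uedin")
# and the corresponding reference translations.
hypotheses = [
    "The parliament approved the new budget on Tuesday.",
    "He said that the talks will continue next week.",
]
references = [
    "Parliament approved the new budget on Tuesday.",
    "He said the talks would continue next week.",
]

# Corpus-level BLEU: sacrebleu expects a list of reference streams,
# each aligned one-to-one with the hypotheses.
corpus_score = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"Corpus BLEU: {corpus_score.score:.2f}")

# Sentence-level BLEU, which could be paired with per-sentence human
# rankings to see where the metric and the evaluator agree or disagree.
for hyp, ref in zip(hypotheses, references):
    sent_score = sacrebleu.sentence_bleu(hyp, [ref])
    print(f"{sent_score.score:6.2f}  {hyp}")
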
@mastersthesis{stephens_human_2014,
  type = {{MA} {Thesis}},
  title = {Human {Evaluation} on {Statistical} {Machine} {Translation}},
  url = {http://scholarworks.calstate.edu/handle/10211.3/127990},
  abstract = {Statistical Machine Translation has become the dominant paradigm of Machine Translation, and the automatic evaluation of system output has become important as a result. The focus of this thesis is the automatic evaluation used in the development of these systems. The metric examined in this study is BLEU, which is widely considered the most reliable tool in the computational linguistics community. By analyzing a subset of data from the 2009 Workshop on Statistical Machine Translation and comparing the evaluations given by BLEU and by the human evaluator on two systems, Google and Uedin, we will see that for the most part the metric and the human evaluator agree. However, BLEU's evaluation is not very accurate; therefore, this metric should not be considered the most reliable means we have of evaluating statistical machine translation systems. In fact, my results show that BLEU undervalues good translations and overvalues bad ones. Thus, we have to keep in mind that a lot of work still needs to be done to standardize this metric, and the research presented here should be considered work in progress. It is a possible starting point for future research that will carry out a more thorough analysis of statistical machine translation and automatic evaluation.},
  language = {English},
  urldate = {2014-10-18},
  school = {San Diego State University},
  author = {Stephens, Elisabeth Candy},
  month = {October},
  year = {2014},
  keywords = {P9.2}
}
