Babych, B, Hartley, A and Atwell, ES (2003) Statistical modelling of MT output corpora for information extraction. In: Archer, D, Rayson, P, Wilson, A and McEnery, T, (eds.) Proceedings of the Corpus Linguistics 2003 conference. Corpus Linguistics 2003 conference, 28-31 Mar 2003, Lancaster University, UK. UCREL, Lancaster University, 62-70.
Abstract
The output of state-of-the-art machine translation (MT) systems could be useful for certain NLP tasks, such as Information Extraction (IE). However, some unresolved problems in MT technology could seriously limit the usability of such systems. For example, robust and accurate word sense disambiguation, which is essential for the performance of IE systems, has not yet been achieved by commercial MT applications. In this paper we aim to develop an evaluation measure for MT systems that could predict their usability for some IE tasks, such as scenario template filling or automatic acquisition of templates from texts. We focus on words that are statistically significant for a text within a corpus; such words are already used for some IE tasks, such as automatic template creation (Collier, 1998). Their general importance for IE was also substantiated by our material, where they often include named entities and other important candidates for filling IE templates. We suggest MT evaluation metrics based on comparing the distribution of statistically significant words in corpora of MT output with their distribution in corpora of human reference translations. We show that these distributions differ substantially between human translations and MT output, which could seriously distort IE performance. We compare different MT systems with respect to the proposed evaluation measures and examine how these measures relate to other MT evaluation metrics. We also show that the suggested statistical model could highlight specific problems in MT output that are related to conveying factual information. Dealing with such problems systematically could considerably improve the performance of MT systems and their usability for IE tasks.
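The abstract does not state how statistical significance is scored or how the two distributions are compared, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a TF-IDF-style score, an arbitrary threshold, and a precision/recall comparison of the resulting word sets; the function names `significant_words` and `overlap_scores` are hypothetical.

```python
import math
from collections import Counter


def significant_words(text_tokens, corpus_docs, threshold=1.0):
    """Return words whose score in this text exceeds the threshold.

    Assumption: a smoothed TF-IDF-style score stands in for the paper's
    (unspecified here) significance measure; the threshold is arbitrary.
    """
    n_docs = len(corpus_docs)
    # Document frequency: in how many corpus texts each word occurs.
    doc_freq = Counter()
    for doc in corpus_docs:
        doc_freq.update(set(doc))
    term_freq = Counter(text_tokens)
    selected = set()
    for word, count in term_freq.items():
        idf = math.log((n_docs + 1) / (doc_freq[word] + 1))
        if count * idf > threshold:
            selected.add(word)
    return selected


def overlap_scores(mt_words, ref_words):
    """Compare significant-word sets from MT output and a human reference.

    Assumption: set precision, recall, and F-score are used as the comparison.
    """
    common = mt_words & ref_words
    precision = len(common) / len(mt_words) if mt_words else 0.0
    recall = len(common) / len(ref_words) if ref_words else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy, made-up data: each text is a list of lowercase tokens.
reference_corpus = [
    ["the", "minister", "visited", "lancaster", "lancaster"],
    ["the", "bank", "raised", "rates"],
    ["the", "team", "won", "the", "match"],
]
ref_significant = significant_words(reference_corpus[0], reference_corpus)
```

In this sketch, a word that is significant in the human reference translation but missing from the significant words of the MT output lowers recall, which is one simple way to flag content that an IE system might fail to extract from the MT text.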
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Babych, B, Hartley, A and Atwell, ES |
Editors: | Archer, D, Rayson, P, Wilson, A and McEnery, T |
Copyright, Publisher and Additional Information: | Babych, B, Hartley, A and Atwell, ES (c) 2003, University of Leeds. Reproduced with permission from the copyright holders. |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 08 Jan 2015 10:32 |
Last Modified: | 19 Dec 2022 13:29 |
Published Version: | http://ucrel.lancs.ac.uk/publications/CL2003/ |
Status: | Published |
Publisher: | UCREL, Lancaster University |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:82250 |