Layer or representation space: what makes BERT-based evaluation metrics robust?

This is the latest version of this eprint.

Vu, D.N.L., Moosavi, N.S. orcid.org/0000-0002-8332-307X and Eger, S. (2022) Layer or representation space: what makes BERT-based evaluation metrics robust? In: Proceedings of the 29th International Conference on Computational Linguistics. 29th International Conference on Computational Linguistics (COLING 2022), 12-17 Oct 2022, Gyeongju, Republic of Korea. Vol. 29 (1). International Committee on Computational Linguistics, pp. 3401-3411. ISSN: 2951-2093.

Abstract

The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from similar domains to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Vu, D.N.L.; Moosavi, N.S. (https://orcid.org/0000-0002-8332-307X); Eger, S.
Copyright, Publisher and Additional Information:	© 2022 ACL. Licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Dates:	Published (online): October 2022; Published: October 2022
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Date Deposited:	07 Jun 2023 15:28
Last Modified:	07 Jun 2023 17:35
Published Version:	https://aclanthology.org/2022.coling-1.300/
Status:	Published
Publisher:	International Committee on Computational Linguistics
Refereed:	Yes
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:200109

Available Versions of this Item

- Layer or representation space: what makes BERT-based evaluation metrics robust? (deposited 07 Jun 2023 15:21)
- Layer or representation space: what makes BERT-based evaluation metrics robust? (deposited 07 Jun 2023 15:28) [Currently Displayed]

Download

Published Version

Filename: 2022.coling-1.300.pdf

Licence: CC-BY 4.0
